Converting HTML to Markdown using Pandoc

Markdown is a great plain text format for a lot of applications and is often used to convert to HTML (for example on my WordPress blog here). I recently had a case where I needed to convert from HTML and found that Pandoc makes it really easy.

What’s Pandoc

Pandoc is an open-source utility for converting between a number of common (and rare) document types, for example plain text, HTML, Markdown, MS Word, LaTeX, wiki, and so on. The output formats list is really extensive, and people can write their own “filters” to handle other formats as well, or to customize the existing ones to their exact needs.

My Use Case

My particular use case here is converting about a dozen really old blog posts from this website. I wrote these back in the early days when I managed this site in CityDesk and later migrated to MovableType. The Google Search Console alerted me to some crawler errors which turned out to be caused by raw PHP file content being served instead of real HTML.

My plan for cleaning this up: 1. Convert HTML original articles into Markdown format 2. Do some manual cleanup editing and double-check links are still valid 3. Drop the Markdown into the appropriate Posts within WordPress 4. Modify my existing .htaccess files to do permanent (301) redirects for any of the old URLs that search engines may still have

Examples

Simple HTML Example

With Pandoc installed, you can try a simple test pulling down the installation instructions page:

curl --silent https://pandoc.org/installing.html | pandoc --from html --to markdown_strict -o installing.md

If we take a look at an HTML snippet:

<h2 id="compiling-from-source">Compiling from source</h2>
<p>If for some reason a binary package is not available for your platform, or if you want to hack on pandoc or use a non-released version, you can install from source.</p>
<h3 id="getting-the-pandoc-source-code">Getting the pandoc source code</h3>
<p>Source tarballs can be found at <a href="https://hackage.haskell.org/package/pandoc" class="uri">https://hackage.haskell.org/package/pandoc</a>. For example, to fetch the source for version 1.17.0.3:</p>
<pre><code>wget https://hackage.haskell.org/package/pandoc-1.17.0.3/pandoc-1.17.0.3.tar.gz
tar xvzf pandoc-1.17.0.3.tar.gz
cd pandoc-1.17.0.3</code></pre>

We can see the resulting Markdown looks like this:

## Compiling from source

If for some reason a binary package is not available for your platform, or if you want to hack on pandoc or use a non-released version, you can install from source.

### Getting the pandoc source code

Source tarballs can be found at <a href="https://hackage.haskell.org/package/pandoc" class="uri">https://hackage.haskell.org/package/pandoc</a>. For example, to fetch the source for version 1.17.0.3:

    wget https://hackage.haskell.org/package/pandoc-1.17.0.3/pandoc-1.17.0.3.tar.gz
    tar xvzf pandoc-1.17.0.3.tar.gz
    cd pandoc-1.17.0.3

My Blog Post Conversions

For my dozen old HTML articles, the straight conversion ended up being a bit noisy, especially with the template boilerplate around the content which was no longer needed. To clean those up I used a little bit of Sed to clean it up before conversion:

#!/bin/bash
echo "converting $1"
cat $1 | sed '1,/<div class="asset-header">/d' | sed '/<div class="asset-footer">/,/<\/html>/d' | pandoc --wrap=none --from html --to markdown_strict > $1.md

After that, I just needed to do some minor editing cleanups on the Markdown files before bringing them in to WordPress. Success!

Further Reading

There are a few good online converters you can try; keep in mind some of these are limited in the number of characters they can handle:

To learn more and go deeper on Pandoc, they’ve got an excellent user’s guide.

And finally a big recommendation for Dillinger, a great online tool for editing Markdown text with live HTML rendering.