Category Archives: Software

Converting HTML to Markdown using Pandoc

Markdown is a great plain-text format for many applications and is often converted to HTML (for example on my WordPress blog here). There are also good use cases for the opposite direction: converting HTML into Markdown. I recently had such a case, converting some older blog posts from raw HTML into Markdown, and found that Pandoc made it really easy.

What’s Pandoc?

Pandoc is an open-source utility for converting between a number of common (and rare) document types, for example plain text, HTML, Markdown, MS Word, LaTeX, wiki, and so on. The output formats list is really extensive, and people can write their own “filters” to handle other formats as well, or to customize the existing ones to their exact needs. The project tagline sums it up nicely:

If you need to convert files from one markup format into another, pandoc is your swiss-army knife.

The Pandoc website lists all of the supported file types it can convert between

My Use Case

My particular use case was to convert about a dozen really old blog posts from this website. I wrote these back in the early days when I managed this site in CityDesk and later migrated to MovableType. The Google Search Console alerted me to some crawler errors which turned out to be caused by raw PHP file content being served instead of real HTML.

My approach for cleaning this up was as follows:

  1. Convert the original HTML articles into Markdown format
  2. Do some manual cleanup editing and double-check that links are still valid
  3. Drop the Markdown into the appropriate Posts within WordPress
  4. Modify my existing .htaccess files to do permanent (301) redirects for all of the old URLs
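
For step 4, each old URL needs its own redirect rule. Here is a minimal sketch, assuming Apache's mod_alias `Redirect` directive is enabled on the server. The helper name `make_redirects`, the input file `redirects.txt`, and the example paths are placeholders for illustration, not the real ones from this site. It turns a list of "old-path new-url" pairs into .htaccess lines:

```shell
#!/bin/sh
# Print one Apache "Redirect 301" line per "old-path new-url" pair on stdin.
# Assumes mod_alias is enabled; all names and paths here are placeholders.
make_redirects() {
  while read -r old new; do
    # Skip blank lines and comment lines
    case "$old" in ''|'#'*) continue ;; esac
    printf 'Redirect 301 %s %s\n' "$old" "$new"
  done
}

# Example usage: append the generated rules to .htaccess
# make_redirects < redirects.txt >> .htaccess
```

Keeping the pairs in a separate text file makes it easy to review the mapping before touching the live .htaccess.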

Examples

Simple HTML Example

With Pandoc installed, you can try a simple test pulling down the installation instructions page:

curl --silent https://pandoc.org/installing.html | pandoc --from html --to markdown_strict -o installing.md

To see the result, consider this HTML snippet from installing.html:

<h2 id="compiling-from-source">Compiling from source</h2>
<p>If for some reason a binary package is not available for your platform, or if you want to hack on pandoc or use a non-released version, you can install from source.</p>
<h3 id="getting-the-pandoc-source-code">Getting the pandoc source code</h3>
<p>Source tarballs can be found at <a href="https://hackage.haskell.org/package/pandoc" class="uri">https://hackage.haskell.org/package/pandoc</a>. For example, to fetch the source for version 1.17.0.3:</p>
<pre><code>wget https://hackage.haskell.org/package/pandoc-1.17.0.3/pandoc-1.17.0.3.tar.gz
tar xvzf pandoc-1.17.0.3.tar.gz
cd pandoc-1.17.0.3</code></pre>

We can see the resulting Markdown turned out very well:

## Compiling from source

If for some reason a binary package is not available for your platform, or if you want to hack on pandoc or use a non-released version, you can install from source.

### Getting the pandoc source code

Source tarballs can be found at <a href="https://hackage.haskell.org/package/pandoc" class="uri">https://hackage.haskell.org/package/pandoc</a>. For example, to fetch the source for version 1.17.0.3:

    wget https://hackage.haskell.org/package/pandoc-1.17.0.3/pandoc-1.17.0.3.tar.gz
    tar xvzf pandoc-1.17.0.3.tar.gz
    cd pandoc-1.17.0.3

My Blog Post Conversions

For my dozen old HTML articles, the straight conversion ended up being a bit noisy, mostly because of old CMS template boilerplate around the content that was no longer needed. I used a little bit of Sed to strip that out before conversion:

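#!/bin/bash
# Usage: pass the HTML file to convert as the first argument
echo "converting $1"
# Drop everything up to the post header and from the footer onward, then convert
sed '1,/<div class="asset-header">/d' "$1" | sed '/<div class="asset-footer">/,/<\/html>/d' | pandoc --wrap=none --from html --to markdown_strict > "$1.md"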

(The Sed commands above clean up the HTML source in two passes: the first removes everything from the top of the file through <div class="asset-header">, which is where the blog post starts; the second removes everything from <div class="asset-footer"> to the end of the file.)
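
To see the two passes in isolation, here is a small sketch that runs the same two Sed expressions on a made-up page, without Pandoc. The function name `strip_template` and the sample HTML are invented for illustration:

```shell
#!/bin/sh
# Keep only the lines between the header marker and the footer marker.
# The two sed expressions are the same ones used in the script above;
# strip_template is just an illustrative name.
strip_template() {
  sed '1,/<div class="asset-header">/d' |
    sed '/<div class="asset-footer">/,/<\/html>/d'
}

# A tiny made-up page to try it on
sample='<html>
<body>template junk
<div class="asset-header">
<p>The actual post</p>
<div class="asset-footer">
<p>more junk</p>
</body>
</html>'

printf '%s\n' "$sample" | strip_template
# prints: <p>The actual post</p>
```

Note that the marker lines themselves are deleted along with the surrounding template, so only the post body reaches Pandoc.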

After that, I just needed to do some minor editing cleanups on the Markdown files before bringing them into WordPress. Success!
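
With a dozen files to process, a small loop saves retyping the command. This is a sketch only: `convert_all` and `./convert.sh` are hypothetical names, and it assumes the conversion script above has been saved and made executable.

```shell
#!/bin/sh
# Run a converter on every .html file in a directory.
# convert_all and ./convert.sh are hypothetical names: save the script
# above under whatever name you like and pass it as the second argument.
convert_all() {
  dir=$1
  converter=${2:-./convert.sh}
  for f in "$dir"/*.html; do
    # If nothing matched, the glob stays literal; skip it
    [ -e "$f" ] || continue
    "$converter" "$f"
  done
}

# Example usage:
# convert_all old-posts ./convert.sh
```

Each input file ends up with a matching .md file next to it, since the script writes its output to "$1.md".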

Further Reading

There are a few good online converters you can try; keep in mind some of these are limited in the number of characters they can handle:

To learn more and go deeper on Pandoc, I recommend going through their excellent user’s guide.

And finally, a big recommendation for Dillinger, a great online tool for editing Markdown text with live HTML rendering. I use it for writing these blog articles as well, before moving them into WordPress.

DataStax Installer with Vagrant

I’ve continued to make improvements to my “Cassandra on Vagrant” project (Using Vagrant for Local Cassandra Development) which shows how to install open-source Cassandra or DataStax Enterprise in a variety of different ways. Using Vagrant is very helpful for local development and testing. Virtual images can be created very quickly and can be erased when done, keeping your primary development system clean.

Recently I added an example which uses the DataStax Enterprise (DSE) standalone installer which first appeared in DSE 4.5. The standalone installer normally runs in a graphical UI mode, but can also be run in an unattended mode which I’m using here.

To play with the examples, grab a copy of the Vagrant projects from GitHub: bcantoni/vagrant-cassandra. Once you have Vagrant and VirtualBox set up, check out example 5. DSE Installer and go through the setup.

On my Mac laptop, creating a 3-node DSE cluster takes less than 5 minutes. (The speed is greatly improved because we only need to download the installer once.) The installer has several options for running in unattended mode, so the installation can be customized as needed.

See the code and more details at bcantoni/vagrant-cassandra.

Tech Advent Calendars – 2014

Update: For the latest, check out Tech Advent Calendars – 2016

It’s that time of the year again – Advent calendars for many tech communities. As in past years (2011, 2012, 2013), I’ve gathered a few here that should be interesting:

* Perf Planet Advent (performance) – Feed
* 24ways Advent (web design/development) – Feed
* Perl Advent (Perl language) – Feed
* Java Advent (Java language) – Feed
* UXMas (UX for everyone) – Feed
* SysAdvent (Sysadmin) – Feed

I have a combined RSS feed (created with Yahoo! Pipes) that picks up all of these advent calendars: http://feeds.feedburner.com/TechAdventCalendars (Yahoo Pipe source).

Quick Guide to Vagrant on Amazon EC2

Here’s a really quick guide to using Vagrant to create virtual machines on Amazon Web Services EC2. I’ve gotten a lot of use out of Vagrant for local development, but sometimes it’s helpful to build out VMs in the cloud. (In particular, if your local machine isn’t very powerful.)

These steps assume you already have Vagrant installed and have an Amazon Web Services account (and know how to use both).

Installation

First you’ll need to install the Vagrant AWS plugin:

vagrant plugin install vagrant-aws
vagrant box add dummy https://github.com/mitchellh/vagrant-aws/raw/master/dummy.box

Next, log in to your AWS console to gather a few things:

  • AWS access key
  • AWS secret key
  • SSH keypair name
  • SSH private key file (.pem extension)
  • Make sure the default security group enables SSH (port 22) access from anywhere

I like to set these up as environment variables to keep them out of the Vagrantfile. On Mac or Linux systems you can add this to your ~/.profile file:

export AWS_KEY='your-key'
export AWS_SECRET='your-secret'
export AWS_KEYNAME='your-keyname'
export AWS_KEYPATH='your-keypath'
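
Since the Vagrantfile below reads these values from the environment, a quick check before running Vagrant can save a confusing failure. This is a minimal sketch; `missing_aws_vars` is an illustrative helper and not part of Vagrant or the vagrant-aws plugin.

```shell
#!/bin/sh
# Print the names of any of the four AWS_* variables that are unset or empty.
# missing_aws_vars is an illustrative helper, not part of Vagrant or vagrant-aws.
missing_aws_vars() {
  missing=""
  [ -n "${AWS_KEY:-}" ]     || missing="$missing AWS_KEY"
  [ -n "${AWS_SECRET:-}" ]  || missing="$missing AWS_SECRET"
  [ -n "${AWS_KEYNAME:-}" ] || missing="$missing AWS_KEYNAME"
  [ -n "${AWS_KEYPATH:-}" ] || missing="$missing AWS_KEYPATH"
  # Trim the leading space before printing
  printf '%s\n' "${missing# }"
}

# Example usage: stop early if anything is missing
# m=$(missing_aws_vars); [ -z "$m" ] || { echo "missing: $m" >&2; exit 1; }
```

The check only confirms the variables are non-empty; it does not verify that the keys are valid or that the key file exists.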

Vagrantfile

Now we can configure our Vagrantfile with the specifics needed for AWS. Refer to the vagrant-aws documentation to understand all the options. In the example below we have all the AWS-related settings in the x.vm.provider :aws block:

VAGRANTFILE_API_VERSION = "2"
Vagrant.configure(VAGRANTFILE_API_VERSION) do |config|
  config.vm.define :delta do |x|
    x.vm.box = "hashicorp/precise64"
    x.vm.hostname = "delta"

    x.vm.provider :virtualbox do |v|
      v.name = "delta"
    end

    x.vm.provider :aws do |aws, override|
      aws.access_key_id = ENV['AWS_KEY']
      aws.secret_access_key = ENV['AWS_SECRET']
      aws.keypair_name = ENV['AWS_KEYNAME']
      aws.ami = "ami-a7fdfee2"
      aws.region = "us-west-1"
      aws.instance_type = "m3.medium"

      override.vm.box = "dummy"
      override.ssh.username = "ubuntu"
      override.ssh.private_key_path = ENV['AWS_KEYPATH']
    end
  end
end

See this GitHub gist for a longer example file.

Now you can bring up the VM by specifying the AWS plugin as the provider:

vagrant up --provider=aws

After about a minute, the VM should be up and running and available for SSH:

$ vagrant up --provider=aws
Bringing machine 'delta' up with 'aws' provider...
==> delta: Launching an instance with the following settings...
==> delta:  -- Type: m3.medium
==> delta:  -- AMI: ami-a7fdfee2
==> delta:  -- Region: us-west-1
==> delta:  -- Keypair: briancantoni
==> delta:  -- Block Device Mapping: []
==> delta:  -- Terminate On Shutdown: false
==> delta:  -- Monitoring: false
==> delta:  -- EBS optimized: false
==> delta:  -- Assigning a public IP address in a VPC: false
==> delta: Waiting for instance to become "ready"...
==> delta: Waiting for SSH to become available...
==> delta: Machine is booted and ready for use!
==> delta: Rsyncing folder: /Users/briancantoni/dev/vagrant/aws/ => /vagrant

$ vagrant ssh
Welcome to Ubuntu 14.04 LTS (GNU/Linux 3.13.0-29-generic x86_64)

ubuntu@ip-172-31-30-167:~$

Notes

  • You need to configure a specific AMI for Vagrant to use. I find the Ubuntu Amazon EC2 AMI Finder very helpful for matching the version and region I want to use.
  • A common tripping point is the default security group not allowing SSH (port 22) from any IP address. Also make sure to add any other ports depending on your application (e.g., port 80 for HTTP).
  • Once you have the basics working, make sure to read through the vagrant-aws project to understand all the options available.
  • Make sure to vagrant destroy your VMs when done, and check the AWS Console to make sure they were terminated correctly (to avoid unexpected charges).

Using Vagrant for Local Cassandra Development


Ever since joining DataStax this year, I’ve spent a lot of time learning and using both Cassandra and the DataStax Enterprise version of it. To really get into it, I wanted to be able to quickly build up and tear down local clusters, without affecting my primary development system (Mac PowerBook).

Vagrant’s tagline says it well:

Create and configure lightweight, reproducible, and portable development environments.

To help those that want to learn and develop with Cassandra, I’ve created a set of sample “getting started” templates and shared them on GitHub: bcantoni/vagrant-cassandra

Take a look at the screencasts linked below, then check out the GitHub project for the detailed instructions.