Converting HTML to Text or Markdown

I’m working on a documentation project where I might need to convert some existing HTML pages back into text or Markdown format for the new system. Rather than manually editing the HTML source, I’m testing a couple of different ways to script the conversion. In the examples below, I’m using the documentation page for our GoToMeeting API method Get Meetings.

Lynx

Lynx is an open-source text web browser that is usually present on Linux machines and can be installed for Mac and Windows. I’ve used it in the past to see how web pages will appear to search engines or for accessibility testing. In both cases, you can quickly tell whether your text is sufficiently communicating your content.

To save a web page as text, Lynx has the command-line option “-dump”:

$ lynx -dump http://www.whatismyip.com/ > example.txt

In my test case I couldn’t convince Lynx to fetch an SSL page, so I downloaded it with curl and piped it into Lynx:

$ curl --silent https://developer.citrixonline.com/api/gotomeeting-rest-api/apimethod/get-meetings | lynx -dump -stdin > lynx.txt

Here's a sample section of the output:

URL
https://api.citrixonline.com/G2M/rest/meetings
Method
GET
Response Type
JSON
Parameters
scheduled A string "true" to get all future meetings.
history A string "true" to get past meetings within date range.
startDate If history=true, required start of date range, in ISO8601 UTC
format.
endDate If history=true, required end of date range, in ISO8601 UTC
format.
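
The curl-to-Lynx pipeline above can be wrapped in a small shell function so the same options are reused for every page. This is a minimal sketch rather than a tested script: the function name page_to_text is my own, and I'm assuming curl's --fail and --location options and Lynx's -nolist option are wanted here.

```shell
# page_to_text URL OUTFILE
# Fetch URL with curl and render it as plain text with Lynx.
# (Hypothetical helper name; the flags are standard curl and Lynx options.)
page_to_text() {
    # --fail: no body on HTTP errors such as 404; --location: follow redirects
    # -nolist: leave out the numbered list of links that -dump normally appends
    curl --silent --fail --location "$1" |
        lynx -dump -nolist -stdin > "$2"
    # The pipeline reports Lynx's exit status, so also require non-empty output.
    [ -s "$2" ]
}
```

With the function loaded in the shell, the earlier command becomes:

$ page_to_text https://developer.citrixonline.com/api/gotomeeting-rest-api/apimethod/get-meetings lynx.txt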

Pandoc

Pandoc is an open-source "universal document converter" that understands (and can convert between) about two dozen different formats. It's well suited for writing a document in one primary format, then converting it to other formats for different publishing options.

The option we'll use here is Pandoc's ability to convert from HTML to Markdown, for example:

$ pandoc -s -r html http://www.whatismyip.com/ -o pandoc.md

For my page, I use the same trick as above because Pandoc couldn't fetch the SSL page directly either:

$ curl --silent https://developer.citrixonline.com/api/gotomeeting-rest-api/apimethod/get-meetings | pandoc -s -r html -o pandoc.md

And here's the sample output of the same section as above:

### URL
https://api.citrixonline.com/G2M/rest/meetings
### Method
GET
### Response Type
JSON
### Parameters
**scheduled** A string "true" to get all future meetings.
**history** A string "true" to get past meetings within date range.
**startDate** If history=true, required start of date range, in ISO8601
UTC format.
**endDate** If history=true, required end of date range, in ISO8601 UTC
format.

Conclusion

Both of these options do a pretty decent job of converting HTML into text or Markdown format. Pandoc seems slightly better in terms of getting to Markdown format, but I would need to run some more samples to see how much manual editing would be needed after.
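
To run more samples, the curl and Pandoc pipeline can be looped over a list of pages. This is a rough sketch under a few assumptions: the list is a plain-text file with one URL per line, the function names and the urls.txt file are hypothetical, and each output file is named after the last path segment of its URL.

```shell
# name_for_url URL
# Derive an output name from the last path segment of URL.
name_for_url() {
    u=${1%/}                      # drop a trailing slash, if any
    printf '%s\n' "${u##*/}"      # keep everything after the last slash
}

# convert_pages URLFILE OUTDIR
# Convert every page listed in URLFILE to Markdown inside OUTDIR.
convert_pages() {
    mkdir -p "$2" || return 1
    while IFS= read -r url; do
        # Skip blank lines and lines starting with "#".
        case $url in ''|'#'*) continue ;; esac
        out="$2/$(name_for_url "$url").md"
        # Keep curl away from the loop's stdin, which is the URL list.
        curl --silent "$url" < /dev/null | pandoc -s -r html -o "$out"
        printf '%s -> %s\n' "$url" "$out"
    done < "$1"
}
```

For example:

$ convert_pages urls.txt markdown

Two URLs that end in the same path segment would overwrite each other, so a larger run would need a better naming scheme.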

I'm also going to play a bit more with Aaron Swartz's html2text. In my quick test, it appeared to have a problem with malformed HTML, so I need to do some further testing with it.

Tech Advent Calendars

As in years past, several tech communities are running Advent calendars during the month of December:

I have a combined RSS feed (created with Yahoo! Pipes) that picks up all of these advent calendars: http://feeds.feedburner.com/TechAdventCalendars. (Yahoo Pipe source).

Movable Type 5.2 on Mac OS X

It’s been a while since my notes about local development copies of Movable Type and WordPress on XAMPP – a local LAMP stack that runs on Windows. I’m still using both blogging platforms and wanted to update the steps now that I’m doing most of my work on a Mac laptop.

The steps below walk through the installation of a local copy of Movable Type 5.2 on Mac OS X.

These instructions assume you already have MySQL, PHP, Perl, and Apache installed. I believe most of these are present in stock OS X installations, but in my case I’ve upgraded everything manually to more recent versions:

  • Mac OS X 10.6.8
  • PHP 5.3.15
  • Perl 5.10.0
  • Apache 2.2.22
  • MySQL 5.5.17

As an alternative, you could use MAMP, which is a packaged, pre-configured bundle of Apache, MySQL, and PHP. You may still need to upgrade some Perl modules, however, as Perl is not included in MAMP.

Important Note: In the instructions below I’m using a lot of settings which should not be used in a production environment (like having a MySQL root user with no password). Since this is a local dev environment, we can take a few shortcuts, but don’t do the same on a live site.

Apache

First we’ll set up Apache for our local hosting of Movable Type.

I’m going to host everything under my home directory: mkdir -p /Users/brian/mt/cgi-bin; chmod 777 /Users/brian/mt. We’ll use this path in the examples below, so change it accordingly for your system.

Next, edit the Apache configuration, which lives in /etc/apache2. Open httpd.conf and make the following changes:

  • Find and comment out the line that starts: ScriptAliasMatch ^/cgi-bin/... (if present)
  • Add a line somewhere to include our custom config: Include /private/etc/apache2/extra/httpd-mt.conf
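
Both edits can also be scripted. The following is only a sketch: patch_httpd_conf is a hypothetical helper, it comments out every active line that begins with ScriptAliasMatch (not just the /cgi-bin/ one), and it keeps a .bak copy of the file as it was before the run. Try it on a copy of httpd.conf before touching the real file.

```shell
# patch_httpd_conf FILE
# 1. comment out active ScriptAliasMatch lines
# 2. append the Include line for httpd-mt.conf unless it is already present
patch_httpd_conf() {
    include='Include /private/etc/apache2/extra/httpd-mt.conf'
    # Keep a backup; a second run overwrites it with the already patched file.
    cp "$1" "$1.bak" || return 1
    # Prefix matching lines with "#"; lines that are already commented do not match.
    sed 's|^[[:space:]]*ScriptAliasMatch[[:space:]]|#&|' "$1.bak" > "$1" || return 1
    # -x: whole-line match, -F: fixed string, -q: quiet
    grep -qxF "$include" "$1" || printf '%s\n' "$include" >> "$1"
}
```

One cautious way to use it is to patch a copy, review the difference, and copy the result back:

$ cp /etc/apache2/httpd.conf /tmp/httpd.conf
$ patch_httpd_conf /tmp/httpd.conf
$ diff /etc/apache2/httpd.conf /tmp/httpd.conf
$ sudo cp /tmp/httpd.conf /etc/apache2/httpd.conf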

Under the /etc/apache2/extra location, create a new file httpd-mt.conf with the directory and virtual host settings:

# CGI scripts and static files for the local Movable Type install
<Directory "/Users/brian/mt">
    AddHandler cgi-script .cgi
    AllowOverride None
    Options FollowSymLinks ExecCGI Indexes
    Order allow,deny
    Allow from all
</Directory>

<VirtualHost *:80>
    ServerAdmin brian@example.com
    DocumentRoot "/Users/brian/mt"
    ServerName localhost.mt
    ErrorLog "/private/var/log/apache2/mt-error_log"
    CustomLog "/private/var/log/apache2/mt-access_log" common
    # Lets DBD::mysql find the MySQL client library when run as CGI
    SetEnv DYLD_LIBRARY_PATH /usr/local/mysql/lib/
</VirtualHost>

Next, restart Apache: sudo apachectl -k restart.

Add our virtual host “localhost.mt” to the /etc/hosts file:

127.0.0.1 localhost.mt
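
If you rebuild the environment often, the hosts entry can be added without creating duplicates. A small sketch, assuming a hypothetical helper named missing_host_line: it prints the entry only when no active line in the file already maps the name, so the result can be appended through sudo.

```shell
# missing_host_line FILE IP NAME
# Print "IP NAME" if no non-comment line in FILE already lists NAME.
missing_host_line() {
    awk -v ip="$2" -v name="$3" '
        { sub(/#.*/, "") }    # strip comments; awk re-splits the fields
        { for (i = 2; i <= NF; i++) if ($i == name) found = 1 }
        END { if (!found) print ip, name }
    ' "$1"
}
```

The entry is then appended only when it is missing:

$ missing_host_line /etc/hosts 127.0.0.1 localhost.mt | sudo tee -a /etc/hosts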

To confirm that Apache is set up correctly, visit http://localhost.mt in a browser. Check /var/log/apache2/ to see content in both of the MT log files and confirm there are no errors.

MySQL

Next we’ll set up the MySQL user (mt), password (password) and empty database (mt) for Movable Type.

$ mysql -u root
mysql> create user 'mt'@'localhost' identified by 'password';
Query OK, 0 rows affected (0.07 sec)
mysql> grant all on mt.* to 'mt'@'localhost';
Query OK, 0 rows affected (0.04 sec)
mysql> create database mt default character set = utf8;
mysql> show grants for 'mt'@'localhost';
+-----------------------------------------------------------------------------------------------------------+
| Grants for mt@localhost                                                                                   |
+-----------------------------------------------------------------------------------------------------------+
| GRANT USAGE ON *.* TO 'mt'@'localhost' IDENTIFIED BY PASSWORD '*2470C0C06DEE42FD1618BB99005ADCA2EC9D1E19' |
| GRANT ALL PRIVILEGES ON `mt`.* TO 'mt'@'localhost'                                                        |
+-----------------------------------------------------------------------------------------------------------+
mysql> exit;
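
The same setup can be run non-interactively by piping the statements into the mysql client. This is a sketch only: mt_setup_sql is a hypothetical helper that prints the statements from the session above for a given database, user, and password. It creates the database first and adds "if not exists", which is my own change, and it does no quoting or escaping, so it is only suitable for simple values like the ones used here.

```shell
# mt_setup_sql DATABASE USER PASSWORD
# Print the SQL that creates the database, the user, and the grant.
mt_setup_sql() {
    cat <<SQL
create database if not exists $1 default character set = utf8;
create user '$2'@'localhost' identified by '$3';
grant all on $1.* to '$2'@'localhost';
show grants for '$2'@'localhost';
SQL
}
```

For the values used in this post:

$ mt_setup_sql mt mt password | mysql -u root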

Movable Type Download

In the instructions below we’ll download the latest open source Movable Type. (Similar steps could be followed for any of the commercial versions of Movable Type.)

The steps below follow the Movable Type “Install via SSH” instructions, if you want to follow along. Working from /Users/brian/mt, the first step is to download and expand the .zip file, then move the files into place and set up the directories:

$ wget http://www.movabletype.org/downloads/stable/MTOS-5.2.zip
$ unzip MTOS-5.2.zip
$ mv MTOS-5.2/* cgi-bin
$ cd cgi-bin
$ chmod 777 mt-static/support
$ chmod 777 themes
$ cd ..
$ ln -s cgi-bin/mt-static mt-static

Note: If your perl is in a location other than /usr/bin/perl, you will need to edit all of the .cgi files to change the first line.
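
Editing every .cgi file by hand is tedious, so here is one way to script it. This is a sketch under assumptions: fix_shebang is a hypothetical helper, it only touches files whose first line is #!/usr/bin/perl (optionally followed by switches such as -w), and /usr/local/bin/perl in the usage line is just an example path.

```shell
# fix_shebang DIR PERL
# Point the first line of every DIR/*.cgi file at the given perl binary.
fix_shebang() {
    old='#!/usr/bin/perl'
    new='#!'$2
    for f in "$1"/*.cgi; do
        [ -f "$f" ] || continue
        # Rewrite line 1 only; the second expression keeps any switches.
        sed -e "1s|^$old\$|$new|" -e "1s|^$old |$new |" "$f" > "$f.tmp" || return 1
        # Write back into the original file so its permissions are kept.
        cat "$f.tmp" > "$f" && rm -f "$f.tmp"
    done
}
```

Run it from the directory that contains cgi-bin:

$ fix_shebang cgi-bin /usr/local/bin/perl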

Movable Type System Check

Now we run the Movable Type system checker script: http://localhost.mt/cgi-bin/mt-check.cgi

This script will check all the required Perl modules. Make sure your Perl configuration has all of the required modules and the database module you intend to use. (In this example I’m using MySQL.) Use Perl CPAN to install additional modules as required.
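
Individual modules can also be checked from the command line. A minimal sketch: missing_perl_modules is a hypothetical helper that tries to load each module named on its command line and prints the ones that fail. The modules in the usage line are the ones the MySQL driver depends on in this setup; mt-check.cgi remains the authoritative list.

```shell
# missing_perl_modules MODULE...
# Print each module that perl cannot load; return non-zero if any are missing.
missing_perl_modules() {
    missing=0
    for mod in "$@"; do
        # -M loads the module, -e 1 runs a trivial program
        if ! perl -M"$mod" -e 1 2>/dev/null; then
            printf '%s\n' "$mod"
            missing=1
        fi
    done
    return "$missing"
}
```

For example:

$ missing_perl_modules DBI DBD::mysql

Keep in mind that this uses your shell's environment, which may differ from the one Apache gives to CGI scripts.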

In my case the Perl modules were pretty straightforward except for DBD::mysql. I finally found the solution (including the SetEnv DYLD_LIBRARY_PATH line in the Apache config above) in this Stack Overflow answer about Perl CGI not working with MySQL.

Movable Type Configuration

Once we have all the required Perl modules in place, create a config file for Movable Type. Copy mt-config.cgi-original to mt-config.cgi and set the appropriate values for the web paths and database settings:

CGIPath        http://localhost.mt/cgi-bin/
StaticWebPath  http://localhost.mt/mt-static
ObjectDriver   DBI::mysql
Database       mt
DBUser         mt
DBPassword     password
DBHost         localhost
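
A typo in a directive name is easy to miss, so a quick check of the file can help. This is a sketch with a hypothetical helper, missing_mt_directives, that is not part of Movable Type. It only looks for the seven directives shown above, spelled exactly as above, as the first word of a non-comment line that also has a value, and it does not validate the values themselves.

```shell
# missing_mt_directives FILE
# Print every expected directive that FILE does not set; non-zero if any are missing.
missing_mt_directives() {
    awk '
        BEGIN {
            n = split("CGIPath StaticWebPath ObjectDriver Database DBUser DBPassword DBHost", want, " ")
        }
        /^[ \t]*#/ { next }           # skip comment lines
        NF >= 2 { seen[$1] = 1 }      # directive name followed by a value
        END {
            for (i = 1; i <= n; i++)
                if (!(want[i] in seen)) { print want[i]; bad = 1 }
            exit bad + 0
        }
    ' "$1"
}
```

No output means all seven directives were found:

$ missing_mt_directives cgi-bin/mt-config.cgi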

With the configuration in place we can log in to Movable Type for the first time: http://localhost.mt/cgi-bin/mt.cgi

The first step will create the admin login account and password. I suggest using your real email address here to test any of the system messages.

Next is the “Create your first Website” dialog. Make sure to set the website URL and root correctly using the paths above. For example, above we have the Movable Type files in /Users/brian/mt/cgi-bin, and the website root will be /Users/brian/mt.

Finally you should be able to log in to the Movable Type interface. Create a blog and publish it, then confirm you can view it with the local path provided.

Happy hacking!

Understanding Email Security and Privacy

The Electronic Frontier Foundation (EFF) has a good summary of the laws governing email privacy protections in the context of the scandal that led to the resignation of General David Petraeus, the Director of the Central Intelligence Agency. At the heart of the protections afforded to email is the Electronic Communications Privacy Act (ECPA) of 1986. The EFF article explains how ECPA applies in this case, but the law is seriously outdated and doesn’t seem to offer much protection at all. Emails older than 180 days are considered “abandoned”? Read and unread emails are handled differently? The EFF concluded:

Sound confusing? It is. ECPA is hopelessly out of date, and fails to provide the protections we need in a modern era. Your email privacy should be simple: it should receive the same protection the Fourth Amendment provides for your home.

This is clearly a high-profile case, so there may be some hope for the government clarifying these laws to catch up with today’s reliance on electronic communication. The EFF is part of a new campaign calling for reform: Vanishing Rights – Tell Congress Don’t Let Our Right To Privacy Expire. I think it’s a worthwhile effort, but I’m not optimistic for any privacy improvements soon. I don’t see the government voluntarily increasing privacy, especially in this case which could be considered a “success”.

To learn more about protecting your data and communications, I highly recommend reading the EFF guide to Surveillance Self-Defense. Many people may assume they have nothing to worry about because they aren’t expecting to be investigated by the government, but the guide points out a lot of data that’s available to private parties as well through subpoenas. You may think twice about keeping all your email with Google, Yahoo, AOL, Microsoft, etc.

Or, you could really go old-school like Janet Napolitano (U.S. Secretary of Homeland Security) and not use email at all.