Curl Cheat Sheet

This is a quick introduction and cheat sheet for Curl – a very handy command-line tool for downloading pretty much anything from a URL.

The Curl website describes it as:

… a command line tool for transferring data with URL syntax, supporting DICT, FILE, FTP, FTPS, GOPHER, HTTP, HTTPS, IMAP, IMAPS, LDAP, LDAPS, POP3, POP3S, RTMP, RTSP, SCP, SFTP, SMTP, SMTPS, TELNET and TFTP. curl supports SSL certificates, HTTP POST, HTTP PUT, FTP uploading, HTTP form based upload, proxies, cookies, user+password authentication (Basic, Digest, NTLM, Negotiate, kerberos…), file transfer resume, proxy tunneling and a busload of other useful tricks.

There are many things that Curl can do, and there is a voluminous man page that lists all of the details.

Here I want to boil down all those options into the most common and useful ones for web or web services developers (using the HTTP/HTTPS protocols). If you don’t already have Curl installed on your system (try running curl from a command prompt), see Getting Curl below.

Basic Usage

The basic form of all Curl commands is:

curl [options...] <url>

For example:

$ curl https://www.google.com/humans.txt
Google is built by a large team of engineers, designers, researchers, robots, and others in many different sites across the globe. It is updated continuously, and built with more tools and technologies than we can shake a stick at. If you'd like to help us out, see

Common Options

Options are the real power of Curl. Here we’ll cover the most common ones that I’ve used for typical web and web services development. (You can get the full set of options on your system with curl --help or curl --manual.)

-A / --user-agent AGENT
Set the HTTP User-Agent string if you don’t want the default “curl” string
--compressed
Add the HTTP header to request compressed content, if the server can provide it
-d / --data DATA
Set data to be sent with a POST request
-D / --dump-header FILE
Save the response headers to a separate file
-H / --header HEADER
Set a custom HTTP header
-i / --include
Include the response headers in the output
-k / --insecure
Skip SSL certificate verification
-o / --output FILE
Write output to a file rather than stdout
-s / --silent
Run silently (i.e., don’t show progress meter)
--trace-ascii FILE
Write request and response headers and data to a local file
-x / --proxy HOST:PORT
Route the request through the given proxy
-X / --request METHOD
Set a custom HTTP method (GET, PUT, POST, DELETE)
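To see several of these options working together, here’s a sketch that fetches a page from a throwaway local web server (assuming python3 is available; the port, paths, and agent string are arbitrary placeholders):

```shell
# Serve a tiny page from a temp directory with Python's built-in HTTP server
mkdir -p /tmp/curl-demo && echo '<html>hello</html>' > /tmp/curl-demo/index.html
( cd /tmp/curl-demo && python3 -m http.server 8123 >/dev/null 2>&1 & echo $! > /tmp/curl-demo/server.pid )
sleep 1

# -s: no progress meter; -A: custom user agent;
# -D: save response headers to a file; -o: save the body to a file
curl -s -A "my-test-agent/1.0" \
     -D /tmp/curl-demo/headers.txt \
     -o /tmp/curl-demo/page.html \
     http://localhost:8123/

cat /tmp/curl-demo/page.html

# Clean up the background server
kill "$(cat /tmp/curl-demo/server.pid)"
```

Here -D captures the status line and headers in headers.txt while -o captures the body, so you can inspect each part separately.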


Fetch and Save Web Page

$ curl --silent -o boston.html http://www.boston.com/
$ head boston.html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<meta http-equiv="Refresh" content="900;url=?refresh=true">
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
<title>Boston.com - Boston, MA news, breaking news, sports, video</title>

Check Size Without Downloading

$ curl --head <url>
HTTP/1.1 200 OK
x-amz-request-id: 31D80700E3C2811E
Date: Wed, 11 Jan 2012 04:01:35 GMT
Last-Modified: Sat, 07 Jan 2012 02:49:49 GMT
ETag: "1d08609ab5434eea651e95af332ddb3a"
Accept-Ranges: bytes
Content-Type: audio/mpeg
Content-Length: 24474192
Server: AmazonS3
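You can try the same idea offline, since Curl also supports the file:// protocol; a quick sketch (the file name and contents are just placeholders):

```shell
# Create a 10-byte file, then ask curl for its metadata only (no body)
printf '0123456789' > /tmp/size-demo.bin
curl --silent --head file:///tmp/size-demo.bin
```

The Content-Length line reports the file size, just as it does for a remote HTTP resource.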

Getting Curl

If you’re running Mac OS X, Linux, FreeBSD, or a similar system, you’ve probably already got Curl installed. (Try curl --version to double-check your version.)

If you’re running Windows, you’ll need to download it yourself. Start at the Curl downloads page and find the Win32 section. I suggest the “Win32 – Generic binary, with SSL” option. You will also need the Windows OpenSSL libraries; I suggest using the “Win32 OpenSSL v1.x Light” installer. Make sure to put both Curl and OpenSSL libraries in the same location, and add that location to your path.

Using Splunk to Analyze Apache Logs

Splunk is an enterprise-grade software tool for collecting and analyzing log files and other data. Actually, Splunk uses the broader term “machine data”:

At Splunk we talk a lot about machine data, and by that we mean all of the data generated by the applications, servers, network devices, security devices and other systems that run your business.

Certainly, log files do fall under that umbrella, and are probably the easiest way to understand Splunk’s capabilities. The company offers a free license which has some limitations compared to a paid enterprise license, the most significant limitation being a maximum of 500 MB/day of indexed data. (For more details, see the differences between free and enterprise licenses.) To learn Splunk, or to use it for personal or smaller sites, the limitations are manageable and the free product is a great option.

In this example I’ve uploaded logs from a couple of my websites and let Splunk index them. I also explain the process I used to identify a rogue user agent which I later blocked.

To get started with Splunk, visit the download page and get the appropriate version for your platform. Follow the installation manual (from the excellent documentation site) to get the software installed and configured.

There are several ways to get data into Splunk; in this case I told it to monitor a local directory for files and manually told it the host name to expect. Then I copied about six months of compressed Apache logs into that target directory. You can repeat this for each site, using a separate directory and separate hostname.
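For reference, this directory-monitoring setup can also be expressed directly in Splunk’s inputs.conf; a minimal sketch, where the path, host, and sourcetype values are placeholders for your own:

```
# Watch a directory of Apache logs for one site
[monitor:///var/log/apache/site1]
host = site1.example.com
sourcetype = access_combined
```

Repeating a stanza like this per site, each with its own path and host value, mirrors the per-site setup described above.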

Splunk will quickly index your data, which you’ll see in the Search application. I suggest going through their quick overview to help learn what’s going on. Click on a hostname to start viewing data that Splunk has indexed. Because Splunk automatically recognizes the Apache log file format, it already knows how to pull out the common fields, which you can use for searching and filtering, as shown in this screenshot:

Splunk Fields Example

In my case after poking around a bit, I noticed a pretty high amount of traffic fetching my RSS feed file (/rss.xml). The screenshot below shows the number of daily requests, normally hovering around 400 but peaking at about 2,000 per day (click for larger image):

RSS File Accesses Over Time

By clicking on the useragent field, I found that an agent named “NewsOnFeedsBot” was accounting for over 60% of the total requests (click for larger image):

User Agent Breakout Chart

Once I filtered on just the NewsOnFeedsBot useragent, some more details emerged:

  • The HTTP status code for every request was 200, meaning the bot downloaded the full 36 KB file each time (whereas a well-behaved bot would use If-Modified-Since or other conditional-request techniques)
  • All requests were coming from a single IP address
  • The bot was continuously fetching the RSS file, several times a minute
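The filtering above can be reproduced with a few searches in Splunk’s search language; a sketch, assuming the field names (uri, useragent, clientip, status) that Splunk extracts from the access_combined sourcetype:

```
# Daily request counts for the feed file
uri="/rss.xml" | timechart span=1d count

# Break requests out by user agent
uri="/rss.xml" | top limit=10 useragent

# Drill into the suspect bot: per-IP counts and status codes
uri="/rss.xml" useragent="NewsOnFeedsBot" | stats count by clientip, status
```

Each of these is just a filter followed by a reporting command, which is the pattern behind most of the charts shown in this post.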

Blocking this poorly behaved bot was just a matter of checking for the user-agent string and returning a 403 Forbidden response. After I made the change, the bot made a handful of further requests, received the 403s, and then stopped. Apparently it has at least some logic telling it to give up on a file that keeps returning an error.
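The block itself can be a few lines of Apache configuration; here’s a sketch for Apache 2.4 using mod_setenvif (only the agent string comes from the logs above; the path and variable name are illustrative):

```
# Tag requests from the misbehaving bot, then refuse them with a 403
SetEnvIfNoCase User-Agent "NewsOnFeedsBot" bad_bot
<Location "/rss.xml">
    <RequireAll>
        Require all granted
        Require not env bad_bot
    </RequireAll>
</Location>
```

You can then verify the rule with something like curl -A "NewsOnFeedsBot" -i against the feed URL and confirm that a 403 comes back, while a request with a normal user agent still gets a 200.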

It’s been about a month since I blocked this bot, so I wanted to see an overview of the results. Splunk has a nice built-in charting capability which I used to stack the most popular useragents (again, just for the rss.xml file) and show their portion of visits over the past few months. You can see in the picture below that NewsOnFeedsBot was by far the biggest contributor over the summer, but now it’s gone (click for larger image):

User Agents Over Time