Category Archives: Web

Stack Overflow: Kind of Addicting

Over this past Christmas break I spent some more time on Stack Overflow, answering questions in a few areas where I felt I could contribute. As I answered and contributed more, I saw how the reputation and badges system can really draw you in. Not as a motivator per se, but it’s fun to get “kudos” when someone finds your answers or edits helpful.

Some conclusions so far:

  • Some questions get answered very quickly. If you watch the most recent questions, you’ll see posts with only a handful of page views but several answers already; there must be a lot of people watching the newest questions and trying to contribute.
  • Answering older questions is worthwhile if the originator has a decent accept rate. Corollary: it’s not worth bothering with really old questions asked by User1234 with only 1 reputation point.
  • Editing tag (wiki) descriptions is a good way to contribute for lesser-known tags.
  • Upvoting good-quality questions and answers is a good way to help maintain the overall quality level of the site.



Check out my sweet Stack Overflow flair badge!

Simple Webservice Echo Test

While troubleshooting some PHP cURL issues, I found and used http://respondto.it/ (and later http://requestb.in/), which lets you create a dummy webservice endpoint that reveals the full request made to it by your code.

An even simpler approach is a webservice that returns data about the request directly to the calling application. I just created such an echo webservice on my scooterlabs.com domain.
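For example, here’s a minimal PHP sketch (plain curl extension, nothing fancy) that POSTs a field to the echo endpoint and prints back what the service saw:

<?php
// Minimal sketch: POST a field to the echo endpoint and inspect
// exactly what PHP curl sent (method, headers, parameters).
$ch = curl_init('http://scooterlabs.com/echo.json');
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);   // return the response as a string
curl_setopt($ch, CURLOPT_POSTFIELDS, array('foo' => 'bar'));
$response = curl_exec($ch);
curl_close($ch);

// The service echoes the request details back as JSON
print_r(json_decode($response, true));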

Update 2012-03-23: Added XML response example as well (scooterlabs.com/echo.xml).

Plain text example

$ curl http://scooterlabs.com/echo?foo=bar
Array
(
    [method] => GET
    [headers] => Array
        (
            [User-Agent] => curl/7.19.7 (universal-apple-darwin10.0) libcurl/7.19.7 OpenSSL/0.9.8r zlib/1.2.3
            [Host] => scooterlabs.com
            [Accept] => */*
        )
    [request] => Array
        (
            [foo] => bar
        )
    [client_ip] => 68.125.160.82
    [time_utc] => 2012-01-08T21:33:28+0000
    [info] => Echo service from Scooterlabs (http://www.scooterlabs.com)
)

JSON example

$ curl --silent http://scooterlabs.com/echo.json?foo=bar | json_xs
{
   "info" : "Echo service from Scooterlabs (http://www.scooterlabs.com)",
   "request" : {
      "foo" : "bar"
   },
   "headers" : {
      "User-Agent" : "curl/7.21.3 (i386-portbld-freebsd7.3) libcurl/7.21.3 OpenSSL/1.0.0e zlib/1.2.3 libidn/1.22",
      "Accept" : "*/*",
      "Host" : "scooterlabs.com"
   },
   "client_ip" : "66.39.158.129",
   "time_utc" : "2012-01-08T22:07:54+0000",
   "method" : "GET"
}

XML example

$ curl --silent http://scooterlabs.com/echo.xml?foo=bar | xml_pp
<?xml version="1.0"?>
<echo>
  <method>GET</method>
  <headers>
    <User-Agent>curl/7.19.7 (universal-apple-darwin10.0) libcurl/7.19.7 OpenSSL/0.9.8r zlib/1.2.3</User-Agent>
    <Host>scooterlabs.com</Host>
    <Accept>*/*</Accept>
  </headers>
  <request>
    <foo>bar</foo>
  </request>
  <client_ip>68.122.10.221</client_ip>
  <time_utc>2012-03-24T17:05:49+0000</time_utc>
  <info>Echo service from Scooterlabs (http://www.scooterlabs.com)</info>
</echo>

Source

Source code is up on GitHub: https://github.com/bcantoni/echotest. If anyone has comments or feedback, let me know here or on GitHub.

Tech Advent Calendars

Several awesome tech and programming communities create advent calendars each year with a different article or demo for each day of December. Here are the ones I’m following.

Update: See my updated 2012 Advent Calendar list. It uses the same RSS feed as before.

I’ve also created a Yahoo Pipe to combine all of these RSS feeds into one: 2011 Tech Advent Feed.
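If you’d rather merge the feeds in code, here’s a rough PHP sketch using the SimplePie library (assuming simplepie.inc is available locally; the feed URLs are the ones listed below):

<?php
// Rough sketch: merge the four advent calendar feeds with SimplePie
require_once 'simplepie.inc';

$feed = new SimplePie();
$feed->enable_cache(false);   // skip the cache directory for this quick demo
$feed->set_feed_url(array(
    'http://calendar.perfplanet.com/feed/',
    'http://feeds.feedburner.com/24ways',
    'http://perladvent.org/2011/atom.xml',
    'http://feeds.feedburner.com/phpadvent',
));
$feed->init();

// Items from all four feeds, interleaved by date
foreach ($feed->get_items() as $item) {
    echo $item->get_date('Y-m-d'), '  ', $item->get_title(), "\n";
}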

Performance

website: http://calendar.perfplanet.com/2011/
feed: http://calendar.perfplanet.com/feed/

Performance Advent Calendar 2011

24ways

website: http://24ways.org/
feed: http://feeds.feedburner.com/24ways

24ways Advent Calendar 2011

Perl

website: http://perladvent.org/2011/
feed: http://perladvent.org/2011/atom.xml

Perl Advent Calendar 2011

PHP

website: http://phpadvent.org/2011
feed: http://feeds.feedburner.com/phpadvent

PHP Advent Calendar 2011

Using Splunk to Analyze Apache Logs

Splunk is an enterprise-grade software tool for collecting and analyzing log files and other data. Actually, Splunk uses the broader term “machine data”:

At Splunk we talk a lot about machine data, and by that we mean all of the data generated by the applications, servers, network devices, security devices and other systems that run your business.

Certainly, log files do fall under that umbrella, and are probably the easiest way to understand Splunk’s capabilities. The company offers a free license which has some limitations compared to a paid enterprise license, the most significant limitation being a maximum of 500 MB/day of indexed data. (For more details, see the differences between free and enterprise licenses.) To learn Splunk, or to use it for personal or smaller sites, the limitations are manageable and the free product is a great option.

In this example I uploaded logs from a couple of my websites and let Splunk index them. I’ll also explain the process I used to identify a rogue user agent, which I later blocked.

To get started with Splunk, visit the download page and get the appropriate version for your platform. Follow the installation manual (from the excellent documentation site) to get the software installed and configured.

There are several ways to get data into Splunk; in this case I told it to monitor a local directory for files and manually set the host name to expect. Then I copied about six months of compressed Apache logs into that target directory. You can repeat this for each site, using a separate directory and a separate hostname.
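If you prefer editing config files to clicking through the web UI, the equivalent inputs.conf stanza looks roughly like this (the path and host values here are placeholders, not my actual setup):

# inputs.conf sketch: watch a directory of compressed Apache logs for one site
[monitor:///var/log/apache-archive/example.com]
host = example.com
sourcetype = access_combined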

Splunk will quickly index your data, which you’ll see in the Search application. I suggest going through their quick overview to help learn what’s going on. Click on a hostname to start viewing the data Splunk has indexed. Because Splunk automatically recognizes the Apache log file format, it already knows how to pull out the common fields, which you can use for searching and filtering, as shown in this screenshot:

Splunk Fields Example

In my case, after poking around a bit, I noticed a fairly high volume of traffic fetching my RSS feed file (/rss.xml). The screenshot below shows the number of daily requests, normally hovering around 400 but peaking at about 2,000 per day (click for larger image):

RSS File Accesses Over Time
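The chart comes straight from a search; something along these lines produces the daily request counts (hostname here is a placeholder):

host="example.com" rss.xml | timechart span=1d count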

By clicking on the useragent field, I found that an agent named “NewsOnFeedsBot” accounted for over 60% of the total requests (click for larger image):

User Agent Breakout Chart
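Clicking the useragent field is roughly equivalent to running a top search yourself:

host="example.com" rss.xml | top limit=10 useragent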

Once I filtered on just the NewsOnFeedsBot useragent, some more details emerged:

  • The HTTP status code for every request was 200, meaning the bot fetched the full 36 KB file each time, whereas a well-behaved bot would use If-Modified-Since or similar techniques (see the example after this list).
  • All requests came from a single IP address.
  • The bot was fetching the RSS file essentially continuously, several times a minute.
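For comparison, a well-behaved conditional request looks like this with curl; if the file hasn’t changed, the server can answer 304 Not Modified with an empty body (the URL and date here are placeholders):

$ curl --silent --output /dev/null --write-out "%{http_code}\n" \
    --header "If-Modified-Since: Sat, 01 Oct 2011 00:00:00 GMT" \
    http://example.com/rss.xml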

Blocking this poorly behaved bot was just a matter of checking for the useragent string and returning a 403 Forbidden response. After I made the change, the bot made a handful of further requests, received the 403, and then stopped. At least it had some logic telling it to give up on fetching this file.
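For Apache, a minimal mod_rewrite sketch of that kind of block looks like this (I’m not claiming this is my exact rule, but it’s the general shape):

# Return 403 Forbidden to the misbehaving bot, matched on useragent
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} NewsOnFeedsBot [NC]
RewriteRule .* - [F]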

It’s been about a month since I blocked this bot, so I wanted to see an overview of the results. Splunk has a nice built-in charting capability which I used to stack the most popular useragents (again, just for the rss.xml file) and show their portion of visits over the past few months. You can see in the picture below that NewsOnFeedsBot was by far the biggest contributor over the summer, but now it’s gone (click for larger image):

User Agents Over Time
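The stacked chart is the same timechart idea with a split-by clause, roughly:

host="example.com" rss.xml | timechart span=1w count by useragent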

“Read it Later” Apps

For a side project I’m working on, I want to support several different “read it later” applications. Looking at apps that have both mobile support and APIs, the most popular options appear to be Instapaper, Read It Later, and Readability.

All of these accomplish a similar task: bookmarking a web page for later and reformatting it for easier reading. Mobile support is usually included, either for reading articles bookmarked earlier or for marking new ones to read on a desktop at a later time.

Here’s a quick summary of each service:

Instapaper

  • Free service with an optional subscription for $1/month
  • Desktop web browsing
  • Mobile: iPhone/iPad mobile app ($4.99), 3rd-party compatible apps for other mobile platforms
  • API: Simple API (username/password) or Full API (xAuth flavor of OAuth); see the example below
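The Simple API really is simple; as I read the docs it’s a single add endpoint, so a quick test from the command line looks roughly like this (the credentials and URL here are placeholders):

$ curl --data "username=you@example.com" --data "password=secret" \
    --data "url=http://www.scooterlabs.com/" \
    https://www.instapaper.com/api/add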

Read It Later

  • Desktop: Firefox extension, bookmarklets for others
  • Mobile: Android (pro $0.99), iPhone (free, or pro $2.99)
  • API: Yes, username/password

Readability

  • Subscription service at $5/month minimum (70% of which goes to authors & publishers); you can pay more
  • Desktop web browsing: Yes, also Firefox extension
  • Read Now in the browser is free; Read Later and mobile are for subscribers only
  • Mobile: Web apps (Android, Blackberry), iPhone/iPad: web now, integration with Instapaper app coming soon
  • API: OAuth

I’ve just started playing with each of these apps and their APIs, and I hope to post more feedback on each.