SDForum Technorati Talk

Tonight I attended the SDForum Web Services SIG meeting whose
topic was “Semantic XHTML — Can your website be your API?”.
The presenters were Kevin Marks and
Tantek Çelik from Technorati.
Following are my rough notes from this interesting presentation.

Update 2004-10-05: Slides from this talk now posted on Tantek’s site

Semantic XHTML

Can your website be your API?

SDForum Web Services SIG, 2004-09-28

Some SDForum general topics:
* Monthly Web Services Working Group will probably be formed in a couple months
* Forming a new Web Client SIG, topics to include RSS, Atom, SOAP, REST, etc.;
looking for a host
* New PayPal Hacks book coming out

Background on Technorati

Tracking 4 million blogs now (was 3 million in June). About 4 million posts per
week. New Politics site tracks and summarizes about 10,000 political blogs.
Link analysis is the key attribute of their processing. For international
blogs, they use UTF-8 internally and can convert from the majority of
encodings as needed. Not as much content searching yet for international
blogs, but that is less critical because they rely on links rather than content.

Presentation

HTML started structured, became presentational during browser wars. Explosive
growth because of error tolerance. Table abuse & font tagitis & spacer GIF
layouts caused two backlashes:

  • Backlash for structure — XML; draconian error checking, freedom to make
    own schemas, appeals to programmers
  • Backlash for layout — CSS; move presentation away from structure, content
    independence, appeals to designers, http://www.csszengarden.com

Where does XML fail?

  • schema explosion (everyone makes their own)
  • tag/attribute battling
  • abstraction ratholes – BTO ontology
  • not human readable (partly by design)
  • doesn’t work on “the Web” today

Where does CSS fail?

  • folk coding (design rather than engineering community)
  • variable implementations
  • visual designers thinking about presentation as structure
  • structure hacks to fix presentation

Can we re-integrate these strands?

  • XHTML is XML (XHTML = HTML made into XML)
  • parseable, modular
  • XHTML supports CSS
  • everyone already has a viewer
  • everyone can make queries

Example – Politics Site. Sample problem:

  • wanted a chart of the top 3 links on a page
  • dynamically generated using some complex app logic to choose
    the link title based on transient data
  • solution: use the site output page as input, easily parsable
    to extract desired information
  • this web page wasn’t originally designed with that in mind,
    but due to its structure was reusable
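
The "page as input" idea above can be sketched roughly as follows. This is
only an illustration: the fragment, the `top-links` id, and the `top_links`
function are hypothetical, since the notes do not show the actual Politics
page markup. It relies on the page being well-formed XHTML, so a stock XML
parser can read it.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"

# Hypothetical page fragment; the real site's markup is not shown in the notes.
PAGE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <ol id="top-links">
      <li><a href="http://example.com/a">Story A</a></li>
      <li><a href="http://example.com/b">Story B</a></li>
      <li><a href="http://example.com/c">Story C</a></li>
      <li><a href="http://example.com/d">Story D</a></li>
    </ol>
  </body>
</html>"""


def top_links(xhtml_text, limit=3):
    """Return (title, href) pairs for the first `limit` links in the list."""
    root = ET.fromstring(xhtml_text)
    links = []
    for ol in root.iter(XHTML_NS + "ol"):
        # Assumed convention: the ranked list carries id="top-links".
        if ol.get("id") == "top-links":
            for a in ol.iter(XHTML_NS + "a"):
                links.append((a.text, a.get("href")))
    return links[:limit]


top_three = top_links(PAGE)
```

The chart code would then consume `top_three` without knowing anything about
the application logic that chose the link titles.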

XHTML building blocks

  • most applications reuse a lot of common concepts
  • strings
  • lists, correspond to program arrays (<ol> and <ul>)
  • tables, can be used for 2D array
  • links with a ‘rel’ attribute explicitly define the relationship;
    the attribute is extensible and multivalued
  • definition lists, key/value pairs or hashtables
  • citations and quotes; cite a person or source by name,
    popular use in weblogs
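
A minimal sketch of how two of these building blocks might map onto program
data structures, assuming a well-formed fragment without namespaces. The
fragment and the helper names `list_to_array` and `dl_to_dict` are made up
for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: an ordered list and a definition list.
FRAGMENT = """<div>
  <ol><li>first</li><li>second</li></ol>
  <dl><dt>name</dt><dd>Technorati</dd><dt>topic</dt><dd>weblogs</dd></dl>
</div>"""


def list_to_array(list_el):
    """Map an <ol> or <ul> element to a Python list of item strings."""
    return [li.text for li in list_el.findall("li")]


def dl_to_dict(dl_el):
    """Map a <dl> element to a dict, pairing each <dt> with the next <dd>."""
    result = {}
    key = None
    for child in dl_el:
        if child.tag == "dt":
            key = child.text
        elif child.tag == "dd" and key is not None:
            result[key] = child.text
    return result


root = ET.fromstring(FRAGMENT)
items = list_to_array(root.find("ol"))
pairs = dl_to_dict(root.find("dl"))
```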

Existing examples

  • XFN – XHTML Friends Network; just add ‘rel’ to your
    blogroll links; the profile is defined using a dictionary:
    http://gmpg.org/xfn/1
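
As a rough sketch of what a consumer of XFN could do, the snippet below
reads a blogroll and collects the ‘rel’ values per link. The blogroll
markup, the example.org URLs, and the `relationships` function are
assumptions for illustration; `friend`, `met`, and `colleague` are values
from the XFN profile.

```python
import xml.etree.ElementTree as ET

# Hypothetical blogroll; only the rel values come from the XFN profile.
BLOGROLL = """<ul>
  <li><a href="http://example.org/alice" rel="friend met">Alice</a></li>
  <li><a href="http://example.org/bob" rel="colleague">Bob</a></li>
  <li><a href="http://example.org/news">News site</a></li>
</ul>"""


def relationships(xhtml_text):
    """Return {href: [rel values]} for every link that carries a rel."""
    root = ET.fromstring(xhtml_text)
    found = {}
    for a in root.iter("a"):
        rel = a.get("rel")
        if rel:
            # rel is multivalued: a space-separated list of tokens.
            found[a.get("href")] = rel.split()
    return found


rels = relationships(BLOGROLL)
```

Links without a ‘rel’ attribute are simply skipped, so an ordinary blogroll
keeps working unchanged.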

Future examples

  • attention.xml; what you are reading, how often you read
    it, etc., with the goal of an application that can
    help synchronize what you’re reading and highlight
    things that you are interested in
  • XSPF – playlists (XML Shareable Playlist Format)

New types – Methodology

  • map existing data structures into XHTML equivalents
  • enable new stylable building blocks
  • readily exchange data as mapping is 1:1

New type – People

  • RFC 2426 vCard <-> hCard
  • create an XHTML representation of this
  • embed within a webpage, share to and from the web
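
A sketch of the vCard to hCard idea, assuming the card is marked up with
class names that mirror vCard property names (`vcard`, `fn`, `url`, `org`).
The sample markup and the `parse_card` function are illustrative, not taken
from the talk.

```python
import xml.etree.ElementTree as ET

# Hypothetical card embedded in a page; class names mirror vCard properties.
CARD = """<div class="vcard">
  <a class="url fn" href="http://example.org/">Jane Example</a>
  <span class="org">Example Corp</span>
</div>"""


def parse_card(xhtml_text):
    """Collect vCard-style properties from class-annotated elements."""
    root = ET.fromstring(xhtml_text)
    card = {}
    for el in root.iter():
        # class is multivalued, so one element can carry several properties.
        classes = (el.get("class") or "").split()
        if "fn" in classes:
            card["FN"] = el.text
        if "url" in classes:
            card["URL"] = el.get("href")
        if "org" in classes:
            card["ORG"] = el.text
    return card


card = parse_card(CARD)
```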

New type – Events

  • RFC 2445 iCalendar <-> hCalendar
  • describe events
  • display them and enable parsing
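
Going the other direction, a sketch of turning an event marked up in XHTML
back into iCalendar lines. It assumes class names that mirror iCalendar
properties (`vevent`, `summary`, `dtstart`) and a machine-readable date in
the `title` attribute of an `<abbr>`. The markup and the `to_icalendar`
function are assumptions for illustration, and only two properties are
handled.

```python
import xml.etree.ElementTree as ET

# Hypothetical event markup; the date is the one given for this talk.
EVENT = """<div class="vevent">
  <span class="summary">Semantic XHTML</span>
  <abbr class="dtstart" title="20040928">September 28, 2004</abbr>
</div>"""


def to_icalendar(xhtml_text):
    """Emit the lines of a VEVENT component from class-annotated markup."""
    root = ET.fromstring(xhtml_text)
    lines = ["BEGIN:VEVENT"]
    for el in root.iter():
        classes = (el.get("class") or "").split()
        if "summary" in classes:
            lines.append("SUMMARY:" + el.text)
        if "dtstart" in classes:
            # The human-readable date is the element text; the
            # machine-readable value lives in the title attribute.
            lines.append("DTSTART;VALUE=DATE:" + el.get("title"))
    lines.append("END:VEVENT")
    return lines


vevent = to_icalendar(EVENT)
```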