
Syndicating things

I recently had a bash at augmenting the Zope product RDFGrabber, making it work more easily with Zope page templates. I'm quite chuffed with the results.

Introduction

RDFGrabber is a Zope product for fetching, parsing, and storing RDF metadata in a Zope server, which can then be used for a whole bunch of interesting tasks:

  • syndicating and aggregating news from multiple sites;
  • pulling in information about events from multiple sources and presenting them in a single calendar (see the RDFCalendar product);
  • presenting a unified view of disparate information sources.

The last point is the one I'm most interested in. As a brief example, say you're trying to buy the latest bit of techie kit and do a bit of googling to find reviews, prices and specifications. As always, no single site offers exactly the view you want: a list of dishwashers rated by their eco-friendliness, price, width and star-rating according to Which-Online.

In an ideal world, the various purveyors of syntactically dubious websites would add metadata to their pages, describing the fact that "this thing here, at this URL, is a dishwasher, and it costs £200". Coupled with a review saying: "the Whirlpool D100 is crap," it should be a simple matter to syndicate, aggregate and massage these facts together with all the others into a nice, simple table, telling you not to buy the Whirlpool.

We Don't Need No Lingua Franca

Indeed, no one will want to agree on a single, unified language for describing these things. My idea of crapness is no doubt different from everyone else's. The point of the SemanticWeb and RDF is to give us a unified meta-language, the one thing everyone seems able to agree on, and one rich enough to express the sort of things I outlined above: the mind-map. More formally, a graph, whose nodes are "things" and whose edges are "properties" relating those things. The washing machine example might look something like this from a SemanticWeb perspective:

Which is nice.
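To make the idea concrete, the dishwasher graph can be sketched as a handful of triples in plain python. The shop: and rev: vocabularies and the URLs here are entirely made up for illustration:

```python
# A toy triple store: each statement is a (subject, predicate, object) tuple.
# The shop: and rev: vocabularies and URLs are hypothetical, for illustration.
triples = [
    ("http://shop.example/d100", "rdf:type", "shop:Dishwasher"),
    ("http://shop.example/d100", "shop:name", "Whirlpool D100"),
    ("http://shop.example/d100", "shop:price", "200"),
    ("http://reviews.example/d100", "rev:about", "http://shop.example/d100"),
    ("http://reviews.example/d100", "rev:verdict", "crap"),
]

# Aggregating the facts is then just a matter of joining on shared nodes:
def properties_of(subject):
    """Collect every property asserted about a given subject."""
    return {p: o for s, p, o in triples if s == subject}

print(properties_of("http://shop.example/d100")["shop:price"])  # 200
```

Anyone's review site and anyone's shop can contribute triples about the same URL, and the join above still works; that's the whole trick.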

Package Management

One of the things I've been pursuing is ways to make it easier for me to administer linux boxes with different requirements for packages, ensuring that they stay up-to-date and that I can build packages, publish them, and install them as simply as possible. I've been using apt-get for this, and it works rather well.

A couple of the things I want to be able to do are:

  • Generate a list of the most recent RPMs, their titles and version numbers, along with a tooltip showing the most recent change.
  • Create a report of the packages which are out of date with respect to the latest version reported on their creator's web-page.

A requirement on these stories is that the repository of packages I've wrapped up is on a server in my loft, for ease of local testing, while the reports are to be served up on a remote server running Zope.

Add to that the fact that no one yet uses any kind of semantic mark-up that would make it easier to figure out the latest releases from all the different web sites hosting all the different sources.

So I need a generic, universal model which can cope with the different representations in use, a universal way of pointing at things and making assertions about them, and a decent way of moving information in that model around. It's easier to say "RDF" than to repeat that sentence too often.

Package Metadata

Usefully, RPMs have a decent amount of metadata stored in them, and python bindings abound for making use of that metadata. I wrote a quick python script, to be used from Zope as an external method, which pulls out the interesting metadata from all packages under a particular directory.
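The real script uses the rpm python bindings to read each package's header. As a rough, self-contained sketch of the idea that needs nothing beyond the standard library, this version just recovers name, version and release from the name-version-release.arch.rpm filename convention (the paths and package names are made up):

```python
import os
import re

# Stand-in for the real external method: the actual script reads headers via
# the rpm python bindings, but the name-version-release.arch.rpm filename
# convention lets us sketch the shape of the output without them.
RPM_RE = re.compile(
    r"^(?P<name>.+)-(?P<version>[^-]+)-(?P<release>[^-]+)\.(?P<arch>[^.]+)\.rpm$")

def package_metadata(directory):
    """Walk a directory tree, yielding a metadata dict per .rpm found."""
    for root, _dirs, files in os.walk(directory):
        for filename in files:
            match = RPM_RE.match(filename)
            if match:
                info = match.groupdict()
                info["path"] = os.path.join(root, filename)
                yield info

# e.g. package_metadata('/var/ftp/pub/RPMS') might yield dicts like
# {'name': 'zwiki', 'version': '0.27', 'release': '1', 'arch': 'noarch', ...}
```

From Zope's point of view, an external method wrapping something like this just returns a list of dictionaries for a page template to iterate over.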

Once the information is available to Zope, it's a simple matter to create RDF using a page template. The (currently notional) schema I'm using for RPMs is fairly flat. Typical metadata for a package might look like the following:
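Something along these lines, using the notional rpm: schema (the package URL and property values here are invented for illustration):

```xml
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rpm="http://schema.floop.org.uk/rpm#">
  <rpm:Package rdf:about="http://apt.example.org/RPMS/zope-2.6.2-1.i386.rpm">
    <rpm:name>zope</rpm:name>
    <rpm:version>2.6.2</rpm:version>
    <rpm:url rdf:resource="http://www.zope.org/"/>
  </rpm:Package>
</rdf:RDF>
```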

The interesting bit here is the link off to the original source for the package, rpm:url. This gives us a unique ID for talking about Zope, and allows us to start making statements about Zope releases.

Release Metadata

It would be wonderful if everyone were already using the semantic web and making available information about what is on various web pages. In the meantime, since RDF allows "anyone to say anything about anything", we just have to do it ourselves.

The first step is a bit of screen-scraping. For this task, I've been mostly using KebasData, a Zope product which is really easy to use, as long as you can speak regular expressions. For example, in order to make some semantic markup describing ZWiki releases, first browse to http://zwiki.org/ReleaseNotes and look at the HTML source. I used <h2>Zwiki ([^\s]*) ([^\s]*)</h2> as a regular expression, in order to pull out the version number and date.
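Outside of KebasData, the same expression can be exercised directly with python's re module. The HTML sample here is an invented stand-in for the real ReleaseNotes page:

```python
import re

# A stand-in for the sort of HTML http://zwiki.org/ReleaseNotes serves;
# the real page is not fetched here.
html = """
<h2>Zwiki 0.27 2003/09/25</h2>
<p>Some release notes...</p>
<h2>Zwiki 0.26 2003/08/20</h2>
"""

# The same expression given to KebasData, applied with re.findall; each
# match yields a (version, date) pair.
releases = re.findall(r"<h2>Zwiki ([^\s]*) ([^\s]*)</h2>", html)
print(releases)  # [('0.27', '2003/09/25'), ('0.26', '2003/08/20')]
```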

Again, I used a page template to present all this collected release information as RDF, iterating over various kebas-data instances. I can then start filling out the gaps in the picture above, e.g.:

Wanderings

To pull all these bits of the picture together, I've been using RDFGrabber: you point it at a list of URLs, and it fetches and parses the results, according to the RDF specification, into triples. These triples are used to build representations of the graphs above, with each triple detailing a (node, edge, node) relationship.

The current RDFGrabber interface for wandering over these graphs is rather cumbersome, however. What I really wanted was a nice, template-friendly way to walk along the links, e.g.:

  • Iterate over all things of type rpm:Package.
  • For each rpm package, show the rpm:name and rpm:version.
  • Wander through the graph, finding first the rpm:url, then sw:releases, and treat the releases as a list.
  • Sort the list by the dc:date property to get the most recent release.
  • With this release, find the version number and compare it with the original.

The first bit can be achieved by adding a convenience method on an RDFGrabber instance which, given a type URL, returns a list of individuals, i.e. those nodes whose rdf:type is the given type.
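The convenience method itself is a one-liner over the triple store. This is a sketch rather than the actual RDFGrabber code, assuming for illustration that the grabber holds its triples as plain (subject, predicate, object) tuples:

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

class TripleStore:
    """Minimal sketch of the convenience method; the real grabber's
    internal storage differs, but the idea is the same."""

    def __init__(self, triples):
        self.triples = triples

    def individuals(self, type_url):
        """Return the subjects whose rdf:type is the given type URL."""
        return [s for s, p, o in self.triples
                if p == RDF_TYPE and o == type_url]

# Invented example data:
store = TripleStore([
    ("http://apt.example.org/RPMS/zwiki-0.27-1.noarch.rpm", RDF_TYPE,
     "http://schema.floop.org.uk/rpm#Package"),
    ("http://apt.example.org/RPMS/zwiki-0.27-1.noarch.rpm",
     "http://schema.floop.org.uk/rpm#name", "zwiki"),
])
print(store.individuals("http://schema.floop.org.uk/rpm#Package"))
```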

So now we can write a page template, iterating over all RPM packages, with something like the following:

    <dl tal:define="rpmPackage string:http://schema.floop.org.uk/rpm#Package"
        tal:repeat="rpm python:here.packages.individuals(rpmPackage)">

The next bit was slightly harder, but by overriding Python's __getattr__() method and adding some namespace mappings to the RDFGrabber object, it can be achieved, and the template code to use it looks something like the following:

      <dt>
        <a tal:attributes="href rpm"
           tal:content="rpm/rpm_name" />
      </dt>

In the above, the repeat variable rpm represents a resource in RDF speak, which is really just a URL, so can be used as the href value in an anchor element. I added a namespace mapping from rpm to http://schema.floop.org.uk/rpm# to the RDFGrabber object, so that when I try to get the rpm_name attribute of the rpm resource, really I'm asking "what's on the end of the http://schema.floop.org.uk/rpm#name link?"
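The mechanism can be sketched in isolation like so. This isn't the actual RDFGrabber code; it's a minimal illustration assuming a dictionary-backed graph and a hard-coded namespace table:

```python
class Resource(str):
    """A resource is just a URL, plus attribute-style access into the
    graph. Sketch only; the real implementation lives in RDFGrabber."""

    # Prefix mappings like the one added to the RDFGrabber object.
    namespaces = {"rpm": "http://schema.floop.org.uk/rpm#"}
    # (subject, predicate) -> object; filled in from parsed triples.
    graph = {}

    def __getattr__(self, name):
        # rpm_name -> http://schema.floop.org.uk/rpm#name
        prefix, _, local = name.partition("_")
        predicate = self.namespaces[prefix] + local
        return Resource(self.graph[(str(self), predicate)])

# Invented data: one package with an rpm:name property.
Resource.graph = {
    ("http://apt.example.org/RPMS/zwiki-0.27-1.noarch.rpm",
     "http://schema.floop.org.uk/rpm#name"): "zwiki",
}
rpm = Resource("http://apt.example.org/RPMS/zwiki-0.27-1.noarch.rpm")
print(rpm.rpm_name)  # zwiki
```

Because the result is itself a Resource, attribute access chains, which is exactly what makes the path-style template expressions work.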

In RDF, the other end of a link can be either another resource, or a literal. In the case of a resource, we can continue wandering along the graph using exactly the same syntax. So to find the collection of releases by following along the path, the expression is rpm/rpm_url/sw_releases (after first adding a namespace mapping for sw).

When I wrote the RDF template for marking up the releases, I used an rdf:Bag to model the list of individual releases, hence the blank node in the figure above. To make writing templates easier, I made sure that if a resource represents one of the RDF collection types (Bag, Seq or Alt), then a list of the things it contains is returned directly. So now, later in the template, I can write something along the lines of:

    <tr tal:define="releases rpm/rpm_url/sw_releases;
                    sort_on python:[('dc_date', 'cmp', 'desc')];
                    sorted_releases python:sequence.sort(releases, sort_on)">

This sorts the list of releases of a given RPM by the dc:date attribute (I used the Dublin Core date attribute partly because it sorts nicely). Now I can get back the latest version and pull out the required properties:

      <th align="right">Latest version</th>
      <td tal:content="python:sorted_releases[0].sw_version"/></tr>
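In plain python terms, the sort-and-take-the-latest step boils down to something like the following, with releases standing in for the list pulled out of the graph (the dates and versions are invented):

```python
# Stand-ins for the release resources pulled out of the rdf:Bag; the
# attribute names mirror the dc_date / sw_version used in the template.
class Release:
    def __init__(self, dc_date, sw_version):
        self.dc_date = dc_date
        self.sw_version = sw_version

releases = [
    Release("2003-08-20", "0.26"),
    Release("2003-09-25", "0.27"),
]

# Descending sort on the Dublin Core date: ISO-style dates order
# correctly as plain strings, which is part of why dc:date sorts nicely.
latest = sorted(releases, key=lambda r: r.dc_date, reverse=True)[0]
print(latest.sw_version)  # 0.27
```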

Conclusions

I think the final template examples are quite natural; the method of walking along the RDF graph using paths like rpm/rpm_url/sw_releases is powerful yet simple. Representing containers as lists seems natural, and seems to fit nicely with sequence methods like sort().

Since I just wanted to test out some ideas, I've made my additions directly to the RDFGrabber product. The changes are available from my subversion repository, and an RPM is available from my apt repository.

So why bother? Can't this all be done more easily using some SQL or custom objects? To a certain extent, yes, because the tools are more readily available. I reckon that once we start wanting to point outside of our own little worlds and deal with information from multiple sources in a transparent, simple, unified way, then the Semantic Web is the way to go. I'm going to carry on writing tools to play with some of these ideas.

Currently, the method of walking along a path given by a list of properties doesn't allow for the fact that sometimes you might want to reverse direction and go backwards along a link in the graph. I'm not sure what the natural method for this should be, but I wonder whether declaring inverse properties might be the nicer way to go.

Another nice-to-have would be a way of using the owl:sameAs property for declaring that two differently named things really represent the same thing.

As for extending RDFGrabber, some things to do are:

Better caching
A local triple-store is simply a cache of the information available at some given URLs. It would be nice if the grabber could use standard HTTP negotiation to determine when to update its cache.
Better parsing
There are a few cases where the current parser doesn't do the right thing.
Open world
Some information sources will be large, perhaps computed on the fly, and maybe unbounded. A simple interface for wandering over remote RDF graphs would be nice. An example is the RPM metadata, where each package has a list of the files it contains. Since I'm currently not interested in this information, there's no need to generate RDF for it, fetch it, parse it and store it.
Mutable vs immutable stores
When an RDF source is unavailable, should the local cache be used instead? When an RDF source returns only new information (e.g. the last 10 news items), should the old information be kept in the cache? How does an information source invalidate the cache, or explicitly retract information? This fits in somewhat with the caching aspect above.

Comments?
