Making the Internet Movie Database really SPARQL

Having used the IMDB for yonks now (since back when it was the Cardiff Internet Movie Database, in fact), I've long wanted to use it as a linked open data source. Here are some instructions for setting up your own private IMDB SPARQL endpoint.

It's not a well-known fact that the underlying data for the IMDB is available to download for personal/non-commercial use. The data comes as a set of compressed text files, carefully formatted with all the information and relationships you're used to from the IMDB website, only with certain juicy bits (ratings, IDs, comments) missing.

Not only that, but some industrious individuals have created scripts and tools for converting the data into usable formats, including, in our case, a relational database. All we need to do to make our own personal IMDB SPARQL interface is to use something like D2R Server or Virtuoso and write a mapping from the relational representation into RDF.

Getting the Data

First things first, we need to get hold of a local copy of the compressed text files. Luckily, most mirrors nowadays support the rsync protocol, which we can use to fetch the data (currently around 600 MB) and keep it up to date without using up huge amounts of bandwidth. The following command, put in a weekly cron job, should do the trick:

rsync -d ftp.funet.fi::ftp/pub/mirrors/ftp.imdb.com/pub/ /usr/local/imdb

This will copy the compressed text files and other bits and pieces to the /usr/local/imdb directory on your computer. If you run it again, it should just copy over any changes. Because the files are compressed, though, a small change to the underlying text can alter much of the compressed file, so the bandwidth saving may be smaller than you'd hope.
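
As a rough sketch of the weekly cron job, the rsync call can be wrapped in a small script. The script name and path, the schedule in the comment, and the variable and function names are all assumptions of mine. I've also added the -t flag, which isn't in the command above: rsync relies on preserved modification times to skip files that haven't changed.

```shell
#!/bin/sh
# imdb-sync.sh: hypothetical wrapper around the rsync command above.
# Example crontab entry (assumed path and schedule: Sundays at 04:00):
#   0 4 * * 0 /usr/local/bin/imdb-sync.sh run

MIRROR="ftp.funet.fi::ftp/pub/mirrors/ftp.imdb.com/pub/"  # mirror used above
DEST="/usr/local/imdb"                                    # local copy

sync_imdb() {
    # Create the target directory if needed, then fetch new or changed files.
    # -d: copy the directory's contents without recursing
    # -t: keep modification times (my addition) so unchanged files are skipped
    mkdir -p "$DEST" && rsync -dt "$MIRROR" "$DEST"
}

# Only transfer when called with "run", so sourcing the file is harmless.
if [ "${1:-}" = "run" ]; then
    sync_imdb
fi
```

Running the script by hand with the "run" argument before adding it to your crontab is a cheap way to check that the mirror is reachable.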

Into a Relational Database

Once the data is available locally, a rather good Python package called IMDbPY can parse the text files and write the data into a relational database.
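
IMDbPY ships a script called imdbpy2sql.py for this step. Here's a minimal sketch of calling it. It assumes IMDbPY is installed with the script on your PATH and that an empty MySQL database already exists. The database name, user and password are placeholders of mine.

```shell
#!/bin/sh
# Sketch: load the plain text data files into a relational database.
# The database name, user, and password below are placeholders.

DATA_DIR="/usr/local/imdb"                   # where rsync put the text files
DB_URI="mysql://imdb:secret@localhost/imdb"  # assumed, pre-created database

load_imdb() {
    # -d: directory holding the compressed text files
    # -u: URI of the database to write into
    imdbpy2sql.py -d "$DATA_DIR" -u "$DB_URI"
}

# Only import when called with "run", so sourcing the file is harmless.
if [ "${1:-}" = "run" ]; then
    load_imdb
fi
```

The import can take a good while on the full data set, so it's worth running it by hand once before automating anything.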
