Finding useful RDF data on the web

At rdfdata.org or elsewhere.

[rdfdata.org logo]

My perennial rant that the world has too many ontologies and not enough useful data for those ontologies to describe goes back several years. At one point in 2004 I thought I’d look around the web for RDF data and compile a central list, and because the domain name rdfdata.org wasn’t taken, I grabbed it.

I wasn’t interested in individual RSS 1.0 files, although I did create entries for RSS collections. I also wasn’t interested in individual FOAF files, which were small and played a disproportionate role in discussions of the semantic web’s potential value at the time, but I did collect a list of larger FOAF resources. I created an RSS feed (1.0, of course) to announce new additions and added a wiki to the site for people to offer suggestions, but the wiki usually got hijacked.

After a few months, it became increasingly difficult to find new entries to post. A typical 40-minute search using automated Google API scripts (more on these below) might turn up nothing new besides a file that some student submitted with a school project, so I decided to give up. For my final entry on April Fools’ Day, 2005, I posted a link to an RDF file of information on available Elvis impersonators that I had created myself by scraping a booking agency’s website. (When I submitted the “Database of Elvis Impersonators” to BoingBoing, Xeni Jardin actually wrote it up and credited me.) If you live in the U.S., just try to resist going to the booking agency’s website and doing a query for your city and state. (Canadians have a page to query their city and province for Elvis impersonators, but all the pages seem to list the same five or six Americans who are apparently willing to travel pretty far to do their act.)

I created a single RDF file listing all the RDF sources, which may be useful to anyone looking for sample data. Of course, many of its references are now out of date.

If you’re interested in the geekier details of how I tried to automate my searches for RDF files, read on.

The scripts

I based the script that did the actual queries on googly.pl from O’Reilly’s “Google Hacks” book, which is available on the web. (Although mine is also a perl script, I renamed it findBigRDF.txt for downloading purposes.) It’s pretty heavily commented, so it should be self-explanatory. Google allows up to 1000 queries a day with a given API key, so I set the script to do fewer than that. After each query it loops through the results and ignores certain ones that I knew came up a lot. Its current state is the result of a lot of evolution as I had various ideas about finding RDF data files.
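For readers who would rather see the filtering idea than read the perl, here is a minimal python sketch of the "ignore results that come up a lot" step. It is not the original script: the function names and the hosts in IGNORED_HOSTS are made up for illustration, and it assumes the search results have already been reduced to a plain list of URLs.

```python
from urllib.parse import urlparse

# Hypothetical ignore list; the hosts the original script skipped are not shown here.
IGNORED_HOSTS = {"example.org", "example.com"}


def keep_result(url, ignored_hosts=IGNORED_HOSTS):
    """Return True unless the URL's host is on the ignore list."""
    host = urlparse(url).netloc.lower()
    return host not in ignored_hosts


def filter_results(urls, ignored_hosts=IGNORED_HOSTS):
    """Drop results from ignored hosts, keeping the original order."""
    return [u for u in urls if keep_result(u, ignored_hosts)]
```

Matching on the host rather than the full URL means one entry in the list suppresses every page from a site that keeps showing up.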

Each day I would think of some query terms, write a batch file with lines like this, and run it:

perl findbigRDF.pl dc:creator   > findbigrdf.out
perl findbigRDF.pl dublin   >> findbigrdf.out
perl findbigRDF.pl subject   >> findbigrdf.out
perl findbigRDF.pl bioinformatics   >> findbigrdf.out
perl findbigRDF.pl gene   >> findbigrdf.out
perl findbigRDF.pl chromosome   >> findbigrdf.out
perl findbigRDF.pl commons   >> findbigrdf.out
perl findbigRDF.pl URL   >> findbigrdf.out
perl findbigRDF.pl topic   >> findbigrdf.out
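Writing those lines by hand gets tedious, so a batch file like this could also be generated from a list of terms. The following is only a sketch of that idea; batch_lines is a hypothetical helper, not something I actually used. The first line truncates the output file with > and the rest append with >>, just as above.

```python
def batch_lines(terms, script="findbigRDF.pl", out="findbigrdf.out"):
    """Build one 'perl <script> <term>' line per query term.

    The first line overwrites the output file; later lines append to it.
    """
    lines = []
    for i, term in enumerate(terms):
        redirect = ">" if i == 0 else ">>"
        lines.append(f"perl {script} {term} {redirect} {out}")
    return lines


if __name__ == "__main__":
    # Print a batch file for a few of the terms shown above.
    for line in batch_lines(["dc:creator", "dublin", "subject"]):
        print(line)
```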

Then I’d sort the results in findbigrdf.out and run a second script to compare them against the URLs that were already in my collection or in a notGoodURLs.xml file that I had also accumulated. (This one was in python; the script above is perl only because I found a perl Google API script to use as a model faster than I could find a python equivalent.) With luck, something interesting popped out at the end, but as I said, the results became skimpier and skimpier over time.
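The comparison step boils down to a set difference. Here is a minimal sketch of it, with some loud assumptions: the layout of notGoodURLs.xml isn't described in this post, so the code assumes each rejected URL sits in a url element, and the function names are invented for the example.

```python
import xml.etree.ElementTree as ET


def load_rejected(xml_text):
    """Collect rejected URLs, assuming each one is the text of a <url> element."""
    root = ET.fromstring(xml_text)
    return {el.text.strip() for el in root.iter("url") if el.text}


def new_candidates(found, known, rejected):
    """Return the sorted, de-duplicated URLs that are neither known nor rejected."""
    seen = set(known) | set(rejected)
    cleaned = {u.strip() for u in found if u.strip()}
    return sorted(cleaned - seen)
```

Feeding it the lines of findbigrdf.out as found, the URLs already listed on the site as known, and the parsed notGoodURLs.xml as rejected would leave only the URLs worth looking at by hand.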

I have no intention of adding any new entries to rdfdata.org, but web server logs show that the site is still surprisingly popular, so I wanted to write this up to give people some leads on existing RDF out there and some tools for finding more. And to everyone who made suggestions about resources to list on the website, I just wanted to say thank you, thank you very much.

[Elvis impersonator]