SPARQL full-text Wikipedia searching and Wikidata subclass inferencing

Wikipedia querying techniques inspired by a recent paper.

Milhaud and Bacharach

I found all kinds of interesting things in the article “Getting the Most out of Wikidata: Semantic Technology Usage in Wikipedia’s Knowledge Graph”(pdf) by Stanislav Malyshev of the Wikimedia Foundation and four co-authors from the Technical University of Dresden. I wanted to highlight two particular things that I will find useful in the future and then I’ll list a few more.

Before I cover them, I wanted to mention that I’ve really grown to appreciate the little diamond icon in the upper-left of the Wikidata query form. As I refine queries on that form, the queries typically get messier and messier, so the ability to clean it all up with one click is very convenient.

Full text searching of Wikipedia with SPARQL

The paper’s “Custom SPARQL Extensions” section describes several extensions, including the MediaWiki Web API. The Wikidata Query Service/User Manual/MWAPI page describes how you can call the MediaWiki API search functions by using special property functions (that is, properties that instruct the query engine to execute certain special functions).

This API is definitely one of those topics where reviewing the examples will get you started more quickly than trying to read the actual documention. Their first SPARQL query search example, Find all entities with labels “cheese” and get their types, searches Wikipedia for entries that have “cheese” in one of their labels such as the page title or alternative names.

The key difference in the Find articles in Wikipedia example that follows the first cheese example is that its fifth line uses the property function mwapi:srsearch as a predicate instead of mwapi:search, telling the query to search the contents of all of the English (note the “.en” on the fourth line) Wikipedia pages. You can try that example yourself to do a full-text search for “cheese”. I did a similar search for Darius Milhaud Burt Bacharach because I’ve recently been fascinated by the connections between Milhaud, a French composer who rose to prominence in the 1920s as a member of Les Six, and Bacharach, one of the greatest pop songwriters of the 1960s. (Listening to some Milhaud once, it struck me as odd that his use of horns would remind me of some Bacharach songs and arrangements until I found out that the author of “The Look of Love”, “Walk on By”, and “I Say a Little Prayer” studied with Milhaud in the 1940s at McGill University.) This query certainly doesn’t need the “LIMIT 20” at the end like the full-text search for “cheese” does, because these two guys don’t get mentioned on the same page as often as cheese gets mentioned, but it is an interesting set of pages.

Subclass inferencing with Wikidata

I’m still surprised at how many people use RDF without adding any schema information, or worse, without using schema information that’s already there. Wikidata provides plenty for us, and while the Blazegraph instance used as the back end to its SPARQL engine does not have its RDFS inferencing capabilities turned on–understandably, because queries that take advantage of this ask more of a processor and could therefore hamper scalability–a nice property path trick does let us ask for all the instances of a particular class and of its subclasses. This wasn’t even mentioned in the “Getting the Most out of Wikidata” paper, but a mention of how Wikidata uses owl:objectProperty inspired me to dig more into the use of the data modeling, and I came up with this.

The following (try it here) shows that Wikidata currently has data about 125 instances of home computer models:

SELECT (count(*) as ?instances) WHERE  {
  ?instance wdt:P31 wd:Q473708     # Instance has a type of "home computers"
}

This next query (try it here) shows that there are 28 instances of classes that are a direct subclass of “home computers”:

SELECT (COUNT(*) AS ?instances) WHERE {
  ?instance wdt:P31 ?class.
  ?class wdt:P279 wd:Q473708.     # wdt:P279: subclass of 
}

Merely adding the property path asterisk operator to wdt:P31 tells the query engine to find instances of the home computer class and also instances of any class in the subclass tree below it (try it here) and it finds 154 of them:

SELECT (COUNT(*) AS ?instances) WHERE {
  ?instance wdt:P31 ?class.
  ?class wdt:P279* wd:Q473708.
}

As with regular expressions, the asterisk means “0 or more steps away,” so that instances of wd:Q473708 would be counted along with instances of classes from its subclass tree. Using a plus sign instead would have meant “1 or more instances away” so that query would not have found instances of wd:Q473708.

The ability to use class relationships to identify potentially useful data is just one example of how schema metadata adds value to data. And, we get more than just these additional instances; we get additional class names that tell us more about these instances. For example, we can find that the Thomson MO5-CnAM 43737 computer is an instance of the class Thomson M05, which is a subclass of MOTO Gamme, which is a subclass of home computer.

And more

Some other nice things I learned about in the paper:

  • The use of wikibase:around and wikibase:box for additional kinds of geographic queries in addition to the ability to search within a city’s limits as I described in July.

  • A list of additional endpoints that you can use in federated queries sent to Wikidata.

  • Support for Blazegraph’s graph traversal features.

  • Multiple live Grafana dashboards about Wikidata usage such as data about agents and formats requested.

If you’re interested in SPARQL, Wikidata, or especially the combination, you’ll learn some fascinating things from this paper.