Populating a Schema.org dataset from Wikidata
Rock and Roll!
As the Schema.org vocabulary gets applied to more and more data and the data in Wikidata grows and grows, it’s only natural to think about the possibilities of creating Schema.org datasets that are populated from Wikidata.
From the Wikidata side, the Wikidata:Schema.org page provides an excellent discussion of the relationship between the two efforts. To summarize some key points: Schema.org is structurally much simpler than Wikidata to ease adoption, but because Schema.org provides no entity identifiers (for example, identifiers for specific people and places) “Schema.org is considering to encourage the use of Wikidata as a common entity base for the target of the schema:sameAs relation (not to be confused with owl:sameAs).”
From the Schema.org side, https://github.com/schemaorg/schemaorg/issues/280 has some discussion about the mapping of Schema.org to the Wikidata model. It’s mostly about modeling the relationships between common classes and properties—important tasks if you want to automate large-scale conversion between the two models. The schemaOrg-Wikidata-Map page, “for issue-280’s working group subsidy and reference”, has some good ideas for creating those mappings.
In a recent Twitter thread about Wikidata, Dan Brickley asked me if I was “interested in cooking up clever queries to help slurp out subsets”. Yes! The query below pulls out almost 21,000 Wikidata triples of album and musician data for bands with a genre of rock and roll (or, in Wikidata terms, bands with a wdt:P136 of wd:Q11399). Wikidata currently has this kind of data for about 530 bands.
As with any mapping from one data model to another, some properties let you simply substitute a new name for an old name but others require judgment calls and some model traversal to get at what you want. I wanted to point out a domain-specific data model traversal issue I came across and a more general Wikidata one that will be an issue for people working with any data domain, not just rock and roll bands.
The domain-specific issues are important because, while there are dreams of a generalized mapping between Wikidata and Schema.org, these two schemas both cover so much territory that a single general mapping is just not feasible. Here is my small example: while the Kinks studio album “Face to Face” is an instance of “album” in Wikidata (wd:Q675825 wdt:P31 wd:Q482994), the Rolling Stones studio album “Beggars Banquet” is an instance of studio album (wd:Q339065 wdt:P31 wd:Q208569), which is a subclass of album (wd:Q208569 wdt:P279 wd:Q482994), as are live album (wd:Q209939) and compilation album (wd:Q222910). Because of this, my query that pulls out Wikidata triples to convert to Schema.org must look for instances of album and instances of subclasses of album. If the SPARQL engine could do inferencing, I could ask for instances of album, because an instance of a subclass is an instance of its superclass, but this SPARQL engine won’t do inferencing. Schema.org actually does have a schema:MusicAlbumProductionType class whose instances such as schema:StudioAlbum, schema:LiveAlbum, and schema:CompilationAlbum could store this distinction between various types of albums, but this doesn’t change the fact that Wikidata lists the studio album “Beggars Banquet” as an instance of “studio album” but the studio album “Face to Face” as an instance of the studio album superclass “album”. (Coming soon: how to correct the Wikidata data!)
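The instance-or-subclass check described above can be sketched outside of SPARQL. The following Python snippet is only an illustration, not part of the query: it hard-codes the handful of triples mentioned in this post in plain dictionaries, and the names INSTANCE_OF, SUBCLASS_OF, superclasses, and is_instance_of are invented for the example. It walks the subclass links the way an inferencing engine would.

```python
# A minimal sketch of "instance of album, or of any subclass of album".
# The dictionaries hold only the triples mentioned in the text; the
# variable and function names are invented for this illustration.

ALBUM = "Q482994"

# wdt:P31 (instance of)
INSTANCE_OF = {
    "Q675825": {"Q482994"},   # "Face to Face" -> album
    "Q339065": {"Q208569"},   # "Beggars Banquet" -> studio album
}

# wdt:P279 (subclass of)
SUBCLASS_OF = {
    "Q208569": {"Q482994"},   # studio album -> album
    "Q209939": {"Q482994"},   # live album -> album
    "Q222910": {"Q482994"},   # compilation album -> album
}


def superclasses(cls):
    """Return cls plus every class reachable by following subclass-of links."""
    seen = set()
    stack = [cls]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(SUBCLASS_OF.get(current, ()))
    return seen


def is_instance_of(item, target):
    """True if item is an instance of target or of one of its subclasses."""
    return any(target in superclasses(cls)
               for cls in INSTANCE_OF.get(item, ()))


for qid in ("Q675825", "Q339065"):
    print(qid, is_instance_of(qid, ALBUM))
```

Both albums come back as instances of album, even though only one of them says so directly. SPARQL 1.1 property paths can express the same traversal, for example wdt:P31/wdt:P279*; how that performs on this particular query is not something this sketch tests.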
Wikidata’s SPARQL engine has enough to do without doing inferencing; my query asks for a lot, and getting it to run in under 60 seconds to avoid a timeout took some rearrangement of triple patterns here and there to make it more efficient. I was surprised that I got away with including an OPTIONAL graph pattern and still kept everything under 60 seconds.
The use of UNION also helped retrieve the albums despite their different relationships to the data model. You’ll see that I UNIONed a third expression in there, which brings me to a key aspect of the Wikidata data model that queries must deal with: statements. Instead of having a triple saying that the work is an album, certain albums have triples saying that there are statements claiming that they are albums. (I’m not 100% sure about my wording describing the role of statements here and I’m open to correction.) This gives the query a bit more indirection to follow. Because Wikidata may have multiple statements about a topic, a query can request the highest ranked of these: we want the one that is an instance of wikibase:BestRank.
Whether you’re modeling rock and roll bands or commodity prices, the structure of these statements and availability of classes such as wikibase:BestRank will play a role in your programmatic access to Wikidata data. Removing the levels of indirection added by these statements will be typical of any mapping of Wikidata data to simpler models such as Schema.org. My query for band data also references Wikidata statements in order to request information about each album’s release date and each member’s role within the band: for example, that Keith Richards has the role “lead guitarist” with the Rolling Stones. (I would not rank this statement’s claim very highly; when Richards was paired with Brian Jones originally and with Ron Wood since 1976, the lack of clear lead and rhythm guitar roles was always an important part of the band’s sound, and when paired with Mick Taylor, Taylor was the lead guitarist.) Wikidata had minimal data about rock and roll band member roles, so I gingerly put the request in the OPTIONAL graph pattern mentioned above.
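To make the idea of removing that indirection concrete, here is a small Python sketch of flattening ranked statements into simple property/value pairs. It assumes the usual Wikidata ranking rule (preferred statements win if any exist, otherwise normal ones, and deprecated ones are never used). The Statement class, the function names, and all of the sample values are hypothetical and exist only for this illustration.

```python
# A minimal sketch of flattening Wikidata-style statements into simple
# property/value pairs, keeping only the best-ranked statements.
# The Statement class and the sample data are hypothetical.

from dataclasses import dataclass


@dataclass
class Statement:
    prop: str               # e.g. "P577" (publication date)
    value: str
    rank: str = "normal"    # "preferred", "normal", or "deprecated"


def best_rank(statements):
    """Keep preferred statements if there are any, otherwise normal ones.

    Deprecated statements are never returned.
    """
    preferred = [s for s in statements if s.rank == "preferred"]
    if preferred:
        return preferred
    return [s for s in statements if s.rank == "normal"]


def flatten(statements):
    """Group by property, then reduce each group to its best-ranked values."""
    by_prop = {}
    for s in statements:
        by_prop.setdefault(s.prop, []).append(s)
    return {prop: [s.value for s in best_rank(group)]
            for prop, group in by_prop.items()}


# Made-up statements about one album: two release dates with different
# ranks, plus one normal and one deprecated performer claim.
sample = [
    Statement("P577", "2000-02-01", rank="preferred"),
    Statement("P577", "2000-01-01", rank="normal"),
    Statement("P175", "example-band", rank="normal"),
    Statement("P175", "wrong-band", rank="deprecated"),
]

print(flatten(sample))
```

The flattened result keeps one release date and one performer, which is roughly the shape that a simpler model such as Schema.org expects.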
Here is the query. Note the use of comments to explain the meaning of each cryptic Wikidata prefixed name for easier readability.
# rockAndRollBandData.rq: retrieve personnel and album data about
# bands with a genre of rock and roll from Wikidata and output triples
# that use the schema.org model.
# From the command line (but executed on a single line):
#   curl --data-urlencode "query@rockAndRollBandData.rq"
#        -H "Accept: text/turtle"
#        https://query.wikidata.org/bigdata/namespace/wdq/sparql
PREFIX schema: <http://schema.org/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
# The next five prefixes are predefined by the Wikidata endpoint;
# declaring them lets the query parse elsewhere as well.
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX p: <http://www.wikidata.org/prop/>
PREFIX ps: <http://www.wikidata.org/prop/statement/>
PREFIX pq: <http://www.wikidata.org/prop/qualifier/>
PREFIX wikibase: <http://wikiba.se/ontology#>
CONSTRUCT {
  ?band a schema:MusicGroup ;
        schema:name ?bandName ;
        schema:musicGroupMember ?member ;
        schema:albums ?album .
  ?album a schema:MusicAlbum ;
         schema:name ?albumTitle ;
         schema:datePublished ?releaseDate .
  ?member schema:name ?memberName ;
          schema:roleName ?roleName .
}
WHERE {
  ?band wdt:P136 wd:Q11399 ;       # band has genre of rock and roll
        rdfs:label ?bandName ;
        wdt:P527 ?member .         # band has-part ?member
  FILTER ( lang(?bandName) = "en" )
  ?member rdfs:label ?memberName .
  FILTER ( lang(?memberName) = "en" )
  OPTIONAL {                       # Member's role.
    ?member p:P361 ?roleStatement .               # part-of role statement.
    ?roleStatement rdf:type wikibase:BestRank ;   # The best role statement!
                   pq:P2868 ?role .               # subject-has-role ?role.
    ?role rdfs:label ?roleName .
    FILTER ( lang(?roleName) = "en" )
  }
  { ?album wdt:P31 wd:Q482994 . }  # instance of album (wd:Q482994)
  UNION
  { ?album wdt:P31 ?albumSubclass .           # or a subclass of that such as
    # live or compilation album. The direct wdt: form is used here because
    # p:P279 points at statement nodes and never directly at wd:Q482994.
    ?albumSubclass wdt:P279 wd:Q482994 .
  }
  UNION
  { ?album wdt:P31 ?albumSubclass .
    ?albumSubclass p:P279 ?albumClassStatement .  # subclass of
    ?albumClassStatement ps:P279 wd:Q482994 ;
                         rdf:type wikibase:BestRank .
  }
  ?album wdt:P175 ?band ;                 # has performer
         rdfs:label ?albumTitle ;
         p:P577 ?releaseDateStatement .   # publication date
  FILTER ( lang(?albumTitle) = "en" )
  ?releaseDateStatement ps:P577 ?releaseDate ;      # release date as ISO 8601
                        rdf:type wikibase:BestRank . # Only the best!
}
I would provide a link to the results, but you can run it yourself with the curl command shown in the query’s header if you store the query in a file called rockAndRollBandData.rq.
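For anyone who would rather script the request than use curl, the following is a rough Python counterpart using only the standard library. It is a sketch under a few assumptions: the function names are invented, the User-Agent string is a placeholder that you would replace with something identifying your own tool, and the query file is expected to be in the current directory. Building the request is kept separate from sending it so that the request can be inspected without network access.

```python
# A rough Python counterpart to the curl command in the query's header.
# Function names and the User-Agent value are placeholders for this sketch.

import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/bigdata/namespace/wdq/sparql"


def build_request(query_text):
    """Return a POST request carrying the query as a form-encoded parameter."""
    body = urllib.parse.urlencode({"query": query_text}).encode("utf-8")
    headers = {
        "Accept": "text/turtle",
        "User-Agent": "rockAndRollBandData-example/0.1 (placeholder)",
    }
    return urllib.request.Request(ENDPOINT, data=body, headers=headers)


def run_query(path="rockAndRollBandData.rq"):
    """Read the query file, send it, and return the Turtle response text."""
    with open(path, encoding="utf-8") as f:
        request = build_request(f.read())
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")


# Usage (needs network access and the query file):
#   turtle = run_query()
#   print(turtle[:500])
```

Passing a data argument makes urllib send a POST with a form-encoded body, which is the same kind of request that curl's --data-urlencode option produces.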
Once I had the Schema.org version, it was fun to query it with queries that were much simpler than what Wikidata would have required. For example, the following asks the extracted data who has been a member of more than one band and what those bands were:
PREFIX schema: <http://schema.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?member ?group1 ?group2 WHERE {
  ?groupURI1 a schema:MusicGroup ;
             schema:name ?group1 ;
             schema:musicGroupMember ?memberURI .
  ?groupURI2 a schema:MusicGroup ;
             schema:name ?group2 ;
             schema:musicGroupMember ?memberURI .
  ?memberURI schema:name ?member .
  FILTER(?groupURI1 != ?groupURI2)
}
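The self-join in that query can also be sketched in plain Python, which may help when checking results by hand. The band and member names below are placeholders rather than real data, and shared_members is a name invented for this example.

```python
# A minimal sketch of the same self-join in plain Python over a few
# (band, member) pairs. All names below are placeholders, not real data.

from collections import defaultdict
from itertools import combinations

memberships = [
    ("Band A", "Member 1"),
    ("Band A", "Member 2"),
    ("Band B", "Member 2"),
    ("Band C", "Member 2"),
    ("Band C", "Member 3"),
]


def shared_members(pairs):
    """Return sorted (member, band1, band2) rows, one per unordered band pair."""
    bands_by_member = defaultdict(set)
    for band, member in pairs:
        bands_by_member[member].add(band)
    rows = []
    for member, bands in bands_by_member.items():
        for band1, band2 in combinations(sorted(bands), 2):
            rows.append((member, band1, band2))
    return sorted(rows)


for row in shared_members(memberships):
    print(row)
```

One difference is worth noting: because the SPARQL FILTER only excludes identical bands, the query lists every pair twice, once in each order, while this sketch reports each unordered pair once.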
It would make an interesting class project to retrieve a larger, more complex set of data from Wikidata and then map it to a model such as Schema.org. The coordination of the participants’ activity (and triples) would be good work experience for everyone involved, and the project could result in something valuable to a particular domain’s community. This could include the development of procedures for the updating of their locally stored version as Wikidata evolves, as well as for updates to the source Wikidata data itself when there are gaps for that domain. (Again, coming soon: more on that latter issue!)
If you are doing something like this on your own or with a group, let me know. I’d love to hear about it.