Queries to explore a dataset
Even a schemaless one.
I recently worked on a project where we had a huge amount of RDF and no clue what was in there apart from what we saw by looking at random triples. I developed a few SPARQL queries to give us a better idea of the dataset’s content and structure, and these queries are generic enough that I thought they could be useful to other people.
I’ve written about other exploratory queries before. In Exploring a SPARQL Endpoint I wrote about queries that look for common vocabularies in use at a particular endpoint, and about how getting a few clues led me to additional related queries. That blog post also mentioned the “Exploring the Data” section of my book Learning SPARQL, which has other generally useful queries.
You can see those listed in the book’s table of contents; they often assume that some sort of schema or ontology is in use. A great thing about SPARQL and RDF, though, is that with no knowledge of a schema or any other clues about a dataset’s contents, simple queries can still let you explore that dataset to see what’s there. Today’s exploratory queries were not included among those that I described above.
Example output for each query uses the Beatles Musicians dataset that I described at SPARQL queries of Beatles recording sessions.
How many triples does this dataset have in all?
SELECT (COUNT(*) AS ?tripleCount) WHERE {
?s ?p ?o
}
Definitely a hall of fame, classic query. Here is the result for the Beatles musician data after performing the query with the Jena arq command line query engine:
---------------
| tripleCount |
===============
| 4089        |
---------------
Show all the types being used
Never mind whether any types were declared; which types are actually used? List them, but don’t repeat any.
SELECT DISTINCT ?type WHERE {
?s a ?type
}
The result with the Beatles musician data:
----------------------------------------------------
| type                                             |
====================================================
| <http://learningsparql.com/ns/schema/Song>       |
| <http://learningsparql.com/ns/schema/Musician>   |
| <http://learningsparql.com/ns/schema/Instrument> |
----------------------------------------------------
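The query above only finds resources that have an rdf:type triple. With schemaless data it can also be worth checking how many subjects carry no type at all. The following is a minimal sketch of one way to ask that, assuming a SPARQL 1.1 engine that supports FILTER NOT EXISTS; the variable names ?untypedCount and ?anyType are chosen here for illustration only.

```sparql
# Count the distinct subjects that have no rdf:type triple.
# ?untypedCount and ?anyType are illustrative names, not part of the dataset.
SELECT (COUNT(DISTINCT ?s) AS ?untypedCount) WHERE {
  ?s ?p ?o .
  FILTER NOT EXISTS { ?s a ?anyType }
}
```

A result of zero would suggest that the type-based queries in the rest of this post cover every subject in the dataset; a large number would suggest that they only tell part of the story.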
Count instances per type
Of the types that the previous query found being used, how many instances of each are there? This is useful when you are prioritizing what you’re going to do with the data.
SELECT ?type (COUNT (?s) AS ?instanceCount)
WHERE {
?s a ?type .
}
GROUP BY ?type
The result:
--------------------------------------------------------------------
| type                                             | instanceCount |
====================================================================
| <http://learningsparql.com/ns/schema/Instrument> | 180           |
| <http://learningsparql.com/ns/schema/Song>       | 293           |
| <http://learningsparql.com/ns/schema/Musician>   | 238           |
--------------------------------------------------------------------
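With only three types the order of the rows hardly matters, but on a dataset with hundreds of types a sorted result makes prioritizing easier. A possible variation, assuming a SPARQL 1.1 engine, adds an ORDER BY clause so that the most heavily used types come first:

```sparql
# Same instance count as above, sorted so the largest counts come first.
SELECT ?type (COUNT(?s) AS ?instanceCount)
WHERE {
  ?s a ?type .
}
GROUP BY ?type
ORDER BY DESC(?instanceCount)
```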
Count the properties that each type uses
Of the types that were found above, how many different properties does each use?
SELECT DISTINCT ?type (COUNT(DISTINCT ?p) AS ?c)
WHERE {
?s a ?type .
?s ?p ?o .
}
GROUP BY ?type
Number of properties used in the Beatles data, by type:
----------------------------------------------------------
| type                                             | c   |
==========================================================
| <http://learningsparql.com/ns/schema/Instrument> | 2   |
| <http://learningsparql.com/ns/schema/Song>       | 182 |
| <http://learningsparql.com/ns/schema/Musician>   | 2   |
----------------------------------------------------------
The next query will show us why the Song class uses so many properties.
List properties per type
What are these properties that each type uses? This is also useful for prioritization. Note the similarities with and differences from the previous query.
SELECT DISTINCT ?type ?property
WHERE {
?s a ?type .
?s ?property ?o .
}
ORDER BY ?type ?property
The following is an excerpt from the middle of this query’s result, with <http://learningsparql.com/ns/schema/Song> reduced to s:Song to make it all fit better here. This sample shows that all the different instruments, with all their different spellings, were properties of each song. (Read more about how that worked in my SPARQL queries of Beatles recording sessions blog post.)
| s:Song | <http://learningsparql.com/ns/instrument/guiro>
| s:Song | <http://learningsparql.com/ns/instrument/guitar>
| s:Song | <http://learningsparql.com/ns/instrument/handbell>
| s:Song | <http://learningsparql.com/ns/instrument/handclaps>
| s:Song | <http://learningsparql.com/ns/instrument/harmonica>
| s:Song | <http://learningsparql.com/ns/instrument/harmonium>
| s:Song | <http://learningsparql.com/ns/instrument/harmonyvocals>
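When a type uses as many properties as Song does, it can help to know how often each property appears as well as which ones exist. The following sketch adds a count to the query above, assuming a SPARQL 1.1 engine; ?useCount is a name chosen here for illustration.

```sparql
# For each type, list its properties with the number of triples that use them.
# ?useCount is an illustrative name.
SELECT ?type ?property (COUNT(*) AS ?useCount)
WHERE {
  ?s a ?type .
  ?s ?property ?o .
}
GROUP BY ?type ?property
ORDER BY ?type DESC(?useCount)
```

Within each type, the most frequently used properties are listed first, which can make rarely used spelling variants stand out at the bottom of the list.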
Have a query create a schema for this schemaless data
Consider that:
- The dataset has no schema but we found types being used
- We found properties associated with these types
- Schemas are themselves datasets of triples
- SPARQL lets you create triples
This all adds up to the ability to create a schema where there isn’t any. In fact, we can do it with a slight variation on the last query:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
CONSTRUCT {
?type a rdfs:Class .
?property a rdf:Property .
}
WHERE {
?s a ?type .
?s ?property ?o .
}
Note how the WHERE clause of this query is identical to the one from the preceding SELECT query. Here is an excerpt of what it created with the Beatles session data:
s:Instrument rdf:type rdfs:Class .
s:Song rdf:type rdfs:Class .
s:Musician rdf:type rdfs:Class .
i:recorder rdf:type rdf:Property .
i:celesta rdf:type rdf:Property .
i:tabla rdf:type rdf:Property .
i:tenorsaxophone rdf:type rdf:Property .
rdfs:label rdf:type rdf:Property .
i:harmonica rdf:type rdf:Property .
We could go a little further by having the schema use the rdfs:domain and rdfs:range properties to associate the declared properties with the classes that the query found them with:
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
CONSTRUCT {
?type a rdfs:Class .
?property a rdf:Property .
?property rdfs:domain ?type .
?property rdfs:range ?otype .
}
WHERE {
?s a ?type .
?s ?property ?o .
OPTIONAL { ?o a ?otype }
}
Along with the schema triples you see above, this new version adds triples like these:
i:banjo rdf:type rdf:Property ;
rdfs:domain s:Song ;
rdfs:range s:Musician .
It also gives the rdfs:label property rdfs:domain values of s:Instrument, s:Musician, and s:Song, which isn’t quite right; as the RDFS spec tells us, “[t]he rdfs:domain of rdfs:label is rdfs:Resource”. The spec also tells us that “the resources denoted by subjects of triples with predicate P are instances of all the classes stated by the rdfs:domain properties”, which in the case of my example means that every instance with an rdfs:label property is an instrument, a musician, and a song.
We clearly don’t want to say that, but if you are creating a schema for a dataset that lacks one, CONSTRUCT queries like this can give you a big head start. Just run one or the other with the dataset and then edit the schema that it creates as you see fit.
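One way to reduce that editing is to assert rdfs:domain only for properties that the data uses with a single type. The following is a sketch of that idea, not a general solution. It assumes a SPARQL 1.1 engine that supports FILTER NOT EXISTS, and the ?s2, ?o2, and ?type2 variable names are invented for this example. It emits only rdfs:domain triples, so its output would be combined with the class and property declarations from the first CONSTRUCT query.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# Assert a domain only when no subject of a different type uses the property.
# ?s2, ?o2, and ?type2 are illustrative names.
CONSTRUCT {
  ?property rdfs:domain ?type .
}
WHERE {
  ?s a ?type .
  ?s ?property ?o .
  FILTER NOT EXISTS {
    ?s2 ?property ?o2 .
    ?s2 a ?type2 .
    FILTER (?type2 != ?type)
  }
}
```

Because the Beatles data uses rdfs:label with all three types, this version would assert no domain for it. The result still deserves a review, since a property seen with only one type in the data you happen to have is not necessarily restricted to that type.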
Comments? Reply to my tweet announcing this blog entry.