Using SPARQL to do quick and dirty joins of CSV data
Or data with other delimiters.
I recently needed to join two datasets at work, cross-referencing one property in a spreadsheet with another in a JSON file. I used a combination of jq, perl, sort, uniq, and… I won’t go into details.
I wondered later if it would have been easier if I had used tarql (which I’ve blogged about before) to convert it all to RDF and then done the join with a SPARQL query. It turned out to be quite easy. A single SPARQL conversion query to run with tarql, after changing one line per dataset that I applied it to, was all I needed to create the RDF that let me do all the joins I wanted with additional simple queries. This will be even easier in the future as I re-use the conversion query with other datasets that I want to join.
To demonstrate this I will show how I did this to join three CSV files: a list of student names and IDs, a list of course names and IDs, and a list of student and course IDs that shows who took which courses.
I didn’t include field name headers in the data files because tarql would use them as property names, and I wanted to make my scripts more generic by letting tarql use its default generic names of a, b, c, and so on through the alphabet for dataset property names.
Here are the data files, starting with students.txt:
s1001,Craig Ellis
s1002,Jane Jones
s1003,Richard Mutt
s1004,Cindy Marshall
courses.txt:
c2001,Linear Algebra I
c2002,Impressionists and Post-Impressionists
c2003,Intro to Theravada Buddhism
c2004,Democracy in the Gilded Age
studentCourse.txt:
s1002,c2001
s1002,c2004
s1003,c2001
s1003,c2004
s1004,c2001
s1004,c2003
I had two goals that would require joins: to list the student names next to the names of the courses they took, without showing any IDs, and then to list the course names with the number of students enrolled in each. The first step was to convert the delimited files to RDF with tarql; I could then write short queries to fulfill the two goals.
I used the query below to convert the students.txt file to RDF. The ?u a t:student triple pattern in the CONSTRUCT clause creates a triple saying “this row of data represents an instance of this class” so that the join queries will know which data represents what kinds of things. Modifying this script to handle other data types merely requires changing the object of this one triple pattern. For example, the query that converts courses.txt to Turtle has t:course in that triple pattern instead of t:student.
# constructAllStudents.rq
PREFIX t: <http://learningsparql.com/ns/tarql/>
CONSTRUCT {
?u a t:student .
?u t:a ?a .
?u t:b ?b .
?u t:c ?c .
?u t:d ?d .
?u t:e ?e .
?u t:f ?f .
# As many as you want. Can be more than the number of input columns.
}
WHERE {
BIND (UUID() AS ?u)
}
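To give a sense of what this produces, the generated students.ttl looks roughly like the following. (The urn:uuid values shown here are placeholders; tarql’s UUID() function generates fresh ones on every run, so yours will differ.)

```turtle
@prefix t: <http://learningsparql.com/ns/tarql/> .

<urn:uuid:9a3e1c70-0000-0000-0000-000000000001>
        a    t:student ;
        t:a  "s1001" ;
        t:b  "Craig Ellis" .

<urn:uuid:9a3e1c70-0000-0000-0000-000000000002>
        a    t:student ;
        t:a  "s1002" ;
        t:b  "Jane Jones" .
```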
This command tells tarql to use that query to create the Turtle file for that dataset:
tarql -H constructAllStudents.rq students.txt > students.ttl
I used similar command lines, with the slight variations described above on that CONSTRUCT query, to create courses.ttl and studentCourse.ttl.
Here is the SPARQL query that uses the data from those three Turtle files to join the student names with the course names. Your join query will look a little different from mine, but not too different, because the use of properties such as t:a, t:b, and t:c that correspond to tarql variables like ?a and ?b (instead of more specific names from a data file header line) lets me make the query more generic.
# joinThem.rq
PREFIX t: <http://learningsparql.com/ns/tarql/>
SELECT ?studentName ?courseName WHERE {
?student a t:student ;
t:a ?studentID ;
t:b ?studentName .
?course a t:course ;
t:a ?courseID ;
t:b ?courseName .
?class a t:studentCourse ;
t:a ?studentID ;
t:b ?courseID .
}
The arq SPARQL processor’s ability to accept more than one --data argument let me use a single command to run this join query with the three Turtle files as input:
arq --query joinThem.rq --data courses.ttl --data students.ttl --data studentCourse.ttl
Here is the result:
----------------------------------------------------
| studentName | courseName |
====================================================
| "Richard Mutt" | "Linear Algebra I" |
| "Richard Mutt" | "Democracy in the Gilded Age" |
| "Jane Jones" | "Linear Algebra I" |
| "Jane Jones" | "Democracy in the Gilded Age" |
| "Cindy Marshall" | "Linear Algebra I" |
| "Cindy Marshall" | "Intro to Theravada Buddhism" |
----------------------------------------------------
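For comparison, here is a rough sketch of the same three-way join in plain Python, which is essentially what my original jq/perl/sort/uniq pipeline was doing the hard way. The row data is inlined here for illustration; in practice it would come from csv.reader.

```python
# The same three-way join as joinThem.rq, sketched in plain Python.
# ID -> name lookups play the role of the t:a/t:b triple patterns.
students = {"s1001": "Craig Ellis", "s1002": "Jane Jones",
            "s1003": "Richard Mutt", "s1004": "Cindy Marshall"}
courses = {"c2001": "Linear Algebra I",
           "c2002": "Impressionists and Post-Impressionists",
           "c2003": "Intro to Theravada Buddhism",
           "c2004": "Democracy in the Gilded Age"}
# Each row of studentCourse.txt joins a student ID to a course ID,
# just as the t:studentCourse triple patterns share ?studentID and
# ?courseID with the other two patterns in the SPARQL query.
enrollments = [("s1002", "c2001"), ("s1002", "c2004"),
               ("s1003", "c2001"), ("s1003", "c2004"),
               ("s1004", "c2001"), ("s1004", "c2003")]

joined = [(students[sid], courses[cid]) for sid, cid in enrollments]
for name, course in joined:
    print(name, "|", course)
```

The point of the SPARQL version is that none of these lookup structures need to be coded by hand; the shared variables in the triple patterns do the joining.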
This data enables joins for other purposes as well. This next query joins the data and then shows course names with the number of students enrolled in each one:
# coursePopularity.rq
PREFIX t: <http://learningsparql.com/ns/tarql/>
SELECT ?courseName (COUNT(*) as ?students)
WHERE {
?student a t:student ;
t:a ?studentID ;
t:b ?studentName .
?course a t:course ;
t:a ?courseID ;
t:b ?courseName .
?class a t:studentCourse ;
t:a ?studentID ;
t:b ?courseID .
}
GROUP BY ?courseName
ORDER BY DESC(?students)
The command to run this just substitutes the new query for the previous one on the command line used earlier:
arq --query coursePopularity.rq --data courses.ttl --data students.ttl --data studentCourse.ttl
And here is the result:
--------------------------------------------
| courseName | students |
============================================
| "Linear Algebra I" | 3 |
| "Democracy in the Gilded Age" | 2 |
| "Intro to Theravada Buddhism" | 1 |
--------------------------------------------
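The GROUP BY/COUNT aggregation in coursePopularity.rq can be sketched in plain Python with a Counter, again using inlined illustration data rather than the actual files:

```python
from collections import Counter

# Rows from studentCourse.txt and the ID -> name lookup from courses.txt.
enrollments = [("s1002", "c2001"), ("s1002", "c2004"),
               ("s1003", "c2001"), ("s1003", "c2004"),
               ("s1004", "c2001"), ("s1004", "c2003")]
courses = {"c2001": "Linear Algebra I",
           "c2003": "Intro to Theravada Buddhism",
           "c2004": "Democracy in the Gilded Age"}

# COUNT(*) ... GROUP BY ?courseName, done with a Counter.
counts = Counter(courses[cid] for _sid, cid in enrollments)
for name, n in counts.most_common():  # ORDER BY DESC(?students)
    print(name, n)
```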
I want to reiterate that the trick described here is strictly for quick-and-dirty joins. Calling the first property in all the datasets a, the second one b, the third c, and so on is just a convenience to reduce the amount of query editing needed. If I were going to convert data like this for long-term use, I would use more descriptive names for each property (maybe even take advantage of property names in header rows) and add some more modeling triples that define classes, properties, and their relationships.
I’m starting to think of tarql and arq as members of the Linux command toolbox that includes venerable old tools like sort and uniq as well as more recent tools such as jq and xmllint that I’m seeing in more standard Linux distributions. (There’s actually one called join that can do simple joins, but with no more than two files.) I can mix and match different combinations of these tools to perform many different tasks with many different kinds of data, with no need to crank up some server or memory-intensive GUI tool.
Cropped photo of “JOIN” sign by Marcel van Schooten via flickr (CC BY 2.0)