Using SPARQL to do quick and dirty joins of CSV data
Or data with other delimiters.
I recently needed to join two datasets at work, cross-referencing one property in a spreadsheet with another in a JSON file. I used a combination of jq, perl, sort, uniq, and… I won’t go into details.
I wondered later if it would have been easier if I had used tarql (which I’ve blogged about before) to convert it all to RDF and then done the join with a SPARQL query. It turned out to be quite easy. A single SPARQL conversion query to run with tarql, after changing one line per dataset that I applied it to, was all I needed to create the RDF that let me do all the joins I wanted with additional simple queries. This will be even easier in the future as I re-use the conversion query with other datasets that I want to join.
To demonstrate this I will show how I did this to join three CSV files: a list of student names and IDs, a list of course names and IDs, and a list of student and course IDs that shows who took which courses.
I didn’t include field name headers in the data files because tarql would use them as property names, and I wanted to make my scripts more generic by letting tarql use its default generic names of a, b, c, and so on through the alphabet for dataset property names.
Here are the data files, starting with students.txt:
s1001,Craig Ellis
s1002,Jane Jones
s1003,Richard Mutt
s1004,Cindy Marshall
courses.txt:
c2001,Linear Algebra I
c2002,Impressionists and Post-Impressionists
c2003,Intro to Theravada Buddhism
c2004,Democracy in the Gilded Age
studentCourse.txt:
s1002,c2001
s1002,c2004
s1003,c2001
s1003,c2004
s1004,c2001
s1004,c2003
I had two goals that would require joins: to list the student names next to the names of the courses they took, without showing any IDs, and then to list the course names with the number of students enrolled in each. The first step was to convert the delimited files to RDF with tarql; I could then write short queries to fulfill the two goals.
I used the query below to convert the students.txt file to RDF. The ?u a t:student triple pattern in the CONSTRUCT clause creates a triple saying “this row of data represents an instance of this class” so that the join queries will know which data represents what kinds of things. Modifying this script to handle other data types merely requires changing the object of this one triple pattern. For example, the query that converts courses.txt to Turtle has t:course in that triple pattern instead of t:student.
# constructAllStudents.rq
PREFIX t: <http://learningsparql.com/ns/tarql/>
CONSTRUCT {
?u a t:student .
?u t:a ?a .
?u t:b ?b .
?u t:c ?c .
?u t:d ?d .
?u t:e ?e .
?u t:f ?f .
# As many as you want. Can be more than the number of input columns.
}
WHERE {
BIND (UUID() AS ?u)
}
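To give a sense of what this produces, the generated students.ttl looks roughly like the following. (The urn:uuid values shown here are placeholders; tarql’s UUID() function generates fresh ones on every run, so yours will differ.)

```turtle
@prefix t: <http://learningsparql.com/ns/tarql/> .

<urn:uuid:9a3e1c70-0000-0000-0000-000000000001>
        a    t:student ;
        t:a  "s1001" ;
        t:b  "Craig Ellis" .

<urn:uuid:9a3e1c70-0000-0000-0000-000000000002>
        a    t:student ;
        t:a  "s1002" ;
        t:b  "Jane Jones" .
```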
This command tells tarql to use that query to create the Turtle file for that dataset:
tarql -H constructAllStudents.rq students.txt > students.ttl
I used similar command lines, with the slight variations described above on that CONSTRUCT query, to create courses.ttl and studentCourse.ttl.
Here is the SPARQL query that uses the data from those three Turtle files to join the student names with the course names. Your join query will look a little different from mine, but not too different, because the use of properties such as t:a, t:b, and t:c that correspond to tarql variables like ?a and ?b (instead of more specific names from a data file header line) lets me make the query more generic.
# joinThem.rq
PREFIX t: <http://learningsparql.com/ns/tarql/>
SELECT ?studentName ?courseName WHERE {
?student a t:student ;
t:a ?studentID ;
t:b ?studentName .
?course a t:course ;
t:a ?courseID ;
t:b ?courseName .
?class a t:studentCourse ;
t:a ?studentID ;
t:b ?courseID .
}
The arq SPARQL processor’s ability to accept more than one --data argument let me use a single command to run this join query with the three Turtle files as input:
arq --query joinThem.rq --data courses.ttl --data students.ttl --data studentCourse.ttl
Here is the result:
----------------------------------------------------
| studentName | courseName |
====================================================
| "Richard Mutt" | "Linear Algebra I" |
| "Richard Mutt" | "Democracy in the Gilded Age" |
| "Jane Jones" | "Linear Algebra I" |
| "Jane Jones" | "Democracy in the Gilded Age" |
| "Cindy Marshall" | "Linear Algebra I" |
| "Cindy Marshall" | "Intro to Theravada Buddhism" |
----------------------------------------------------
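For comparison, here is a rough sketch of the same three-way join in plain Python, which is essentially what my original jq/perl/sort/uniq pipeline was doing the hard way. The row data is inlined here for illustration; in practice it would come from csv.reader.

```python
# The same three-way join as joinThem.rq, sketched in plain Python.
# ID -> name lookups play the role of the t:a/t:b triple patterns.
students = {"s1001": "Craig Ellis", "s1002": "Jane Jones",
            "s1003": "Richard Mutt", "s1004": "Cindy Marshall"}
courses = {"c2001": "Linear Algebra I",
           "c2002": "Impressionists and Post-Impressionists",
           "c2003": "Intro to Theravada Buddhism",
           "c2004": "Democracy in the Gilded Age"}
# Each row of studentCourse.txt joins a student ID to a course ID,
# just as the t:studentCourse triple patterns share ?studentID and
# ?courseID with the other two patterns in the SPARQL query.
enrollments = [("s1002", "c2001"), ("s1002", "c2004"),
               ("s1003", "c2001"), ("s1003", "c2004"),
               ("s1004", "c2001"), ("s1004", "c2003")]

joined = [(students[sid], courses[cid]) for sid, cid in enrollments]
for name, course in joined:
    print(name, "|", course)
```

The point of the SPARQL version is that none of these lookup structures need to be coded by hand; the shared variables in the triple patterns do the joining.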
This data enables joins for other purposes as well. This next query joins the data and then shows course names with the number of students enrolled in each one:
# coursePopularity.rq
PREFIX t: <http://learningsparql.com/ns/tarql/>
SELECT ?courseName (COUNT(*) as ?students)
WHERE {
?student a t:student ;
t:a ?studentID ;
t:b ?studentName .
?course a t:course ;
t:a ?courseID ;
t:b ?courseName .
?class a t:studentCourse ;
t:a ?studentID ;
t:b ?courseID .
}
GROUP BY ?courseName
ORDER BY DESC(?students)
The command to run this just substitutes the new query for the previous one on the command line used earlier:
arq --query coursePopularity.rq --data courses.ttl --data students.ttl --data studentCourse.ttl
And here is the result:
--------------------------------------------
| courseName | students |
============================================
| "Linear Algebra I" | 3 |
| "Democracy in the Gilded Age" | 2 |
| "Intro to Theravada Buddhism" | 1 |
--------------------------------------------
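The GROUP BY/COUNT aggregation in coursePopularity.rq can be sketched in plain Python with a Counter, again using inlined illustration data rather than the actual files:

```python
from collections import Counter

# Rows from studentCourse.txt and the ID -> name lookup from courses.txt.
enrollments = [("s1002", "c2001"), ("s1002", "c2004"),
               ("s1003", "c2001"), ("s1003", "c2004"),
               ("s1004", "c2001"), ("s1004", "c2003")]
courses = {"c2001": "Linear Algebra I",
           "c2003": "Intro to Theravada Buddhism",
           "c2004": "Democracy in the Gilded Age"}

# COUNT(*) ... GROUP BY ?courseName, done with a Counter.
counts = Counter(courses[cid] for _sid, cid in enrollments)
for name, n in counts.most_common():  # ORDER BY DESC(?students)
    print(name, n)
```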
I want to reiterate that the trick described here is strictly for quick-and-dirty joins. Calling the first property in all the datasets a, the second one b, the third c, and so on is just a convenience to reduce the amount of query editing needed. If I were going to convert data like this for long-term use, I would use more descriptive names for each property (maybe even take advantage of property names in header rows) and add some more modeling triples that define classes, properties, and their relationships.
I’m starting to think of tarql and arq as members of the Linux command toolbox that includes venerable old tools like sort and uniq as well as more recent tools such as jq and xmllint that I’m seeing in more standard Linux distributions. (There’s actually one called join that can do simple joins, but with no more than two files.) I can mix and match different combinations of these tools to perform many different tasks with many different kinds of data, with no need to crank up some server or memory-intensive GUI tool.
Cropped photo of “JOIN” sign by Marcel van Schooten via flickr (CC BY 2.0)