Converting CSV to RDF with Tarql
Quick and easy and, if you like, streaming.
I have seen several tools for converting spreadsheets to RDF over the years. They typically try to cover so many different cases that learning how to use them has taken more effort than just writing a short Perl script that uses the split() function, so that’s what I usually ended up doing. (Several years ago I did come up with another way that was more of a cute trick with Turtle syntax.)
A year or two ago I learned about Tarql, which lets you query delimited files as if they were RDF triples, and I definitely liked it. It seemed so simple, though, that I didn’t think it was worth a whole blog post. Recently, however, I was chatting with Dave McComb of Semantic Arts and learned that this simple utility often plays a large role in the work they do for their clients, so I played some more with Tarql. I also interviewed Boris Pelakh of Semantic Arts about what kinds of tasks they use Tarql for in their customer work.
I downloaded Tarql from https://github.com/tarql/tarql/releases, unzipped it, found a shell script and a batch file in a bin subdirectory of the unzipped version, and was ready to run it.
I’ll just jump in with a simple example before discussing the various possibilities. Here is a file I called test1.csv:
name,quantity,description,available
widget,3,for framing the blivets,false
blivet,2,needed for widgets,true
"like, wow",4,testing the CSV parsing,true
Here is a sample query to run against it:
# test1.rq
SELECT ?name ?quantity ?available
WHERE {}
From the command line I tell Tarql to run the test1.rq query with test1.csv as input:
tarql test1.rq test1.csv
Here is the result:
--------------------------------------
| name        | quantity | available |
======================================
| "widget"    | "3"      | "false"   |
| "blivet"    | "2"      | "true"    |
| "like, wow" | "4"      | "true"    |
--------------------------------------
The first thing I like here is that the comma in “like, wow” doesn’t cause the problems that I had when using the Perl split() function, which split lines at every comma, even the quoted ones. (Perl has a library to get around that, but finding it and installing it was too much trouble for such a simple task.)
If the query above had specified the dataset with the FROM keyword, like this,
# test2.rq
SELECT ?name ?quantity ?available
FROM <file:test1.csv>
WHERE {}
then I wouldn’t have to mention the data source on the command line,
tarql test2.rq
and I would get the same result.
To really turn the data into triples, we can use a CONSTRUCT query. The following does this with the same data and, because Tarql treats everything as a string, it casts the quantity values to integers and the available values to Booleans:
PREFIX ex: <http://www.learningsparql.com/ns/example/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
CONSTRUCT {
  ?u ex:name ?name ;
     ex:quantity ?q ;
     ex:available ?a .
}
FROM <file:test1.csv>
WHERE {
  BIND (UUID() AS ?u)
  BIND (xsd:integer(?quantity) AS ?q)
  BIND (xsd:boolean(?available) AS ?a)
}
Here is the result:
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix ex: <http://www.learningsparql.com/ns/example/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
<urn:uuid:8a6ad6dc-1b2d-4900-a63f-d25286379a0a>
  ex:name "widget" ;
  ex:quantity 3 ;
  ex:available false .

<urn:uuid:66ddf7f2-8c37-4ecb-86cf-056234aad317>
  ex:name "blivet" ;
  ex:quantity 2 ;
  ex:available true .

<urn:uuid:c8db5512-3772-4193-a172-525181a712de>
  ex:name "like, wow" ;
  ex:quantity 4 ;
  ex:available true .
The Tarql documentation shows a lot more options and its examples page shows several cool things. And, of course, you have the full power of SPARQL to manipulate the data that you’re pulling from tables; one example is my use of the UUID() function in the CONSTRUCT query above. Another nice example is a federated query with Tarql that John Walker put together.
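For instance, here is a minimal sketch (my own variation, not something from the Tarql documentation) that uses the same test1.csv but filters out the unavailable items and mints readable IRIs from the name column instead of calling UUID():
PREFIX ex: <http://www.learningsparql.com/ns/example/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
CONSTRUCT {
  ?u ex:name ?name ;
     ex:quantity ?q .
}
FROM <file:test1.csv>
WHERE {
  # build an IRI from the name column instead of minting a random UUID
  BIND (IRI(CONCAT(STR(ex:), ENCODE_FOR_URI(?name))) AS ?u)
  BIND (xsd:integer(?quantity) AS ?q)
  # CSV values arrive as strings, so compare with the string "true"
  FILTER (?available = "true")
}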
The original version of Tarql is among many contributions that Richard Cyganiak has made to RDF-related software over the years. As he told me in an email,
I started the project in 2013 when I was still at NUI Galway (formerly DERI), with large contributions from my then-colleague Fadi Maali, and Emir Munoz from Fujitsu Labs. We were working with open data from a number of government data catalogs at the time, and this data often came as CSV files. Tarql started out as a quick hack to help with ingesting that data into our RDF-based tools. The hack proved quite successful. But to this day, Tarql is really just a thin wrapper around Apache Jena’s ARQ query engine. All the hard work happens there.
One important point in the design is that it can stream. That is, only a small part of the input and output needs to be kept in memory at any given time. That makes it work well on large CSV files. Again, Jena made it possible by providing building blocks that support streaming operation.
It’s a testament to the design of SPARQL, really. The syntax is so nice and concise, and the underlying model so flexible, that it can be adapted to quite different tasks.
Because Richard pointed out that it can stream, I wanted to show this alternative to my first command line above, which does the same thing but takes its input from stdin:
tarql --stdin test1.rq < test1.csv
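That opens up shell pipelines. For example, assuming a large gzipped export named big.csv.gz and a CONSTRUCT query in construct.rq (both hypothetical names of mine), something like this should convert the file without ever holding all of it in memory:
gzip -dc big.csv.gz | tarql --stdin construct.rq > output.ttl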
I asked Semantic Arts’ Boris Pelakh a few things about the role that Tarql plays in the work that Semantic Arts does for their customers and it turns out that it’s a pretty big role.
Boris, to start, tell me a little about what Semantic Arts does and where Tarql fits in.
Semantic Arts provides consulting services, helping companies transform their data models to a semantic graph paradigm, achieve data harmonization, and improve comprehension and efficiency. We use Tarql to transform tabular data (either spreadsheets or SQL exports) into RDF for further processing. It is an essential part of our ETL process.
Where do the tables come from that you’re feeding to Tarql? From customer data or from tables that Semantic Arts staff develop as part of their research into the company?
It is primarily customer data—bulk CSV, XLS, or SQL exports. In almost all our engagements, the customers already have a large volume of data, either in relational databases or some sort of data warehouse using something like Hadoop or S3. We have used Tarql to transform transaction data, asset inventories, dataset metadata, and so forth.
What do you do with the Tarql output?
It is generally loaded into our local AllegroGraph store during the development process or the client’s chosen triple store during production. We then help our clients build semantic applications on top of that triple store. I have also set up ETL pipelines where Tarql runs in EC2 instances and uses S3 to load the generated RDF into Neptune for a scalable solution.
I believe Semantic Arts helps customers come up with some overall business process schemas or related artifacts; having this data in a triplestore like AllegroGraph probably helps a lot with that.
Absolutely. In several engagements we were able to run graph analytics on the imported data for insights as well as running validation, either via SHACL or SPARQL, to help improve data quality.
What kinds of roles does that RDF play in the deliverables that you eventually provide to the customer?
In our view, RDF provides all the best features of both relational and property graph databases, and is an ideal foundation for an enterprise data system. We help our customers migrate their siloed data into a unified, semantic model defined by an enterprise ontology that we help build. Then, we develop a semantic application stack (APIs and UIs) that takes advantage of the newly enriched data.
So triples in AllegroGraph provide the raw material for what eventually ends up as the enterprise ontology.
While we use AllegroGraph internally, we do not mandate a specific triple store to our customers, instead working with their preferred infrastructure. We have worked with Stardog, AWS Neptune, and MarkLogic, among others. But yes, the instance data created via Tarql, along with classes and properties defined in Protege, provided a unified enterprise ontology for the customer to use.
Tarql provides a lot of command line switches. Are there interesting ones that you feel many people miss out on?
The --dedup option added in Tarql 1.2 (I believe) helps reduce the size of the generated RDF by avoiding the generation of duplicate triples. The deduplication window size is a careful compromise between the memory footprint of the transform and the output size, and we tune it per pipeline. The support for apf:strSplit, which allows for the generation of multiple RDF result sets from a single input line, has also been helpful in the past, though that is internal to the query and not a command line option.
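As an aside: apf:strSplit is one of Jena ARQ’s property functions. Here is a minimal sketch of the kind of thing Boris describes, assuming a hypothetical tagged.csv whose name and tags columns hold a name and a semicolon-delimited list of tags:
PREFIX ex: <http://www.learningsparql.com/ns/example/>
PREFIX apf: <http://jena.apache.org/ARQ/property#>
CONSTRUCT {
  ?item ex:name ?name ;
        ex:tag ?tag .
}
FROM <file:tagged.csv>
WHERE {
  BIND (UUID() AS ?item)
  # binds ?tag once for each semicolon-delimited value in ?tags,
  # so a single input row can generate several ex:tag triples
  ?tag apf:strSplit (?tags ";") .
}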
Tarql has some “magic” predicates?
Yes. For example, tarql:expandPrefix is very useful when minting URIs in a CONSTRUCT query. It avoids hard-coding namespaces into the transformation, lending flexibility and ease of maintenance. Also, the magic ?ROWNUM variable that Tarql adds to the bindings is nice for generating unique IRIs when the data set does not have unique keys.
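To make the ?ROWNUM point concrete, here is a minimal sketch of mine against the test1.csv file from earlier; ?ROWNUM never appears in the input, but Tarql binds it to each row’s number:
PREFIX ex: <http://www.learningsparql.com/ns/example/>
CONSTRUCT {
  ?item ex:name ?name .
}
FROM <file:test1.csv>
WHERE {
  # ?ROWNUM gives each input row a distinct number, so each row
  # gets a distinct IRI even without a unique key column
  BIND (IRI(CONCAT(STR(ex:), "row", STR(?ROWNUM))) AS ?item)
}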
Boris is also working on a Python implementation of Tarql called pyTARQL that looks like it could be useful for a lot of developers.