Hidden gems included with Jena’s command line utilities
Lots of ways to manipulate your RDF from the open-source multiplatform tool kit
- rdfdiff
- shacl
- qparse and uparse
- rsparql
- rupdate
- rdfparse
- Working with Fuseki datasets from the command line
- riot
On page 5 of my book Learning SPARQL I described how the open source RDF processing framework Apache Jena includes command line utilities called arq
and sparql
that let you run SPARQL queries with a simple command line like this:
arq --data mydata.ttl --query myquery.rq
At the time, the arq
one supported some SPARQL extensions that the sparql
one didn’t. I don’t even remember what they were and tended to use arq
just because the name is shorter. I have since learned that support for those extensions has been added to sparql, so there are now no particular differences between the two.
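The sparql command accepts the same basic arguments, so this does the same thing:
sparql --data mydata.ttl --query myquery.rq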
Jena (which recently celebrated release 4.0.0) includes Linux and Windows versions of many other utilities in addition to arq
and sparql
. I’ve mentioned several here when I used one or another to accomplish a particular task, and I thought it would be nice to summarize some of the ones that I have and have not mentioned before. I may be repeating some earlier explanations, but it should be handy to have them in one place.
You’ll find Linux utilities such as arq
and shacl
in Jena’s bin
directory and corresponding Windows utilities such as arq.bat
and shacl.bat
in its bat
directory.
Remember that, like arq
and sparql
, many of these support additional command line parameters beyond the ones I show here. Use --help
with each to find out more. I’ve tried to demonstrate what I found most useful about each one.
You can find more background about some of these utilities on the Jena documentation pages ARQ - Command Line Applications (which covers more than just arq
) and the “Command line tools” section of the Reading and Writing RDF in Apache Jena page.
And thanks to Andy Seaborne for reviewing a draft of this!
rdfdiff
Use the rdfdiff
utility to compare two dataset files. It’s like the venerable UNIX command diff
, except that it looks for different triples instead of lines. The order of the input triples doesn’t matter to rdfdiff
, and it can compare data files in different serializations. For example, here is a little RDF/XML file:
<!-- joereceiving.rdf -->
<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:d="http://whatever/" >
  <rdf:Description rdf:about="http://whatever/emp3">
    <d:dept>receiving</d:dept>
    <d:name>joe</d:name>
    <d:insurance rdf:resource="http://www.uhc.com"/>
  </rdf:Description>
</rdf:RDF>
Here is a Turtle file with roughly the same information:
# joereceiving.ttl
@prefix w: <http://whatever/> .
w:emp3 w:name "Joseph" ;
w:dept "receiving" ;
w:insurance <http://www.uhc.com> .
I ran this command to compare the two, also including the names of their formats:
rdfdiff joereceiving.rdf joereceiving.ttl RDF/XML TURTLE
I got this output:
< [http://whatever/emp3, http://whatever/name, "joe"]
> [http://whatever/emp3, http://whatever/name, "Joseph"]
Like the text file comparison utility diff
, the report uses <
as a prefix to show you what was in the first file but not the second and >
to show you what was in the second but not the first.
As with many other Jena utilities, you can use the URL of a remote file instead of the name of a local file for either or both of the first two arguments.
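For example, this (with a made-up URL) would compare a remote RDF/XML copy of the data against the local Turtle file:
rdfdiff http://example.org/joereceiving.rdf joereceiving.ttl RDF/XML TURTLE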
shacl
In Validating RDF data with SHACL I described how to use an open source tool developed by TopQuadrant to validate RDF data against constraints on that data that are described using the W3C SHACL standard. Jena includes a shacl
utility to do the same kind of validation, and when running this with the employees.ttl
file that that blog entry links to, all of my examples described there work with Jena shacl
as well.
Because the employees.ttl
file had class definitions, instance data, and SHACL shapes all defined within that one file, I passed that filename as both the --data
and --shapes
parameter when I ran this command line tool:
shacl validate --data employees.ttl --shapes employees.ttl
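I haven’t reproduced the shapes from employees.ttl here, but a property shape along these lines would drive the kinds of checks listed below (the hr: namespace URI and the allowed grade range are my own placeholders, not the values from that file):
# A rough sketch of an employees.ttl-style shape, with assumed details.
@prefix sh:  <http://www.w3.org/ns/shacl#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix hr:  <http://example.org/hr#> .   # placeholder namespace

hr:EmployeeShape
    a sh:NodeShape ;
    sh:targetClass hr:Employee ;
    sh:property [
        sh:path hr:jobGrade ;
        sh:minCount 1 ;            # the value must be present
        sh:datatype xsd:integer ;  # and must be an integer
        sh:minInclusive 1 ;        # and must fall within an allowed range
        sh:maxInclusive 10
    ] .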
It found all of my test constraint violations:
- After I uncommented the data’s e2 example, shacl reported that it was missing the required hr:jobGrade value.
- After I uncommented the e3 example, it reported that its hr:jobGrade value was not an integer.
- After I uncommented the e4 example, it reported that its hr:jobGrade value fell out of the allowed range.
As the SHACL specification requires, the validation reports produced by shacl
were themselves sets of triples, whether it found violations or not. This makes it easier to fit the tool into an RDF processing pipeline.
Adding -v
for “verbose” after shacl validate
in that command line adds additional information to the output.
The utility’s print
option outputs the shapes in the file. It can do this as regular RDF, compact SHACL syntax (surprisingly useful if you have a lot of shapes), or the default: a simple text representation.
shacl print --out=RDF employees.ttl # out=RDF, compact, or text
qparse and uparse
The qparse
utility parses a query and can do various things with it as described by its --help
option. I recently learned that it can pretty-print queries: if the spacing and indentation of a query that you’re trying to understand is a mess, qparse can clean it up, capitalizing keywords and even adding line numbers.
Here is a sloppily formatted little query:
# namedept.rq
prefix w: <http://whatever/> Select
* WHERE { ?s w:name ?name . optiONAL { ?s w:dept ?dept } }
I run this command,
qparse --query namedept.rq
and I get this output:
PREFIX w: <http://whatever/>
SELECT *
WHERE
  { ?s w:name ?name
    OPTIONAL
      { ?s w:dept ?dept }
  }
Adding --num
to the command line would add line numbers to the output.
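For example, this prints the cleaned-up query with numbered lines:
qparse --num --query namedept.rq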
The uparse
utility can do the same thing for update queries. The following pretty-prints the file updatetest.ru
:
uparse --file=updatetest.ru
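I haven’t shown the contents of updatetest.ru; a hypothetical example of the kind of single-triple INSERT request that I use with it here and with the update tools described below would be:
# updatetest.ru (hypothetical contents): insert one triple.
PREFIX w: <http://whatever/>
INSERT DATA { w:emp4 w:name "Dana" }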
Further documentation about both commands is available in the Jena documentation.
rsparql
This sends a query stored in a local file to a SPARQL endpoint specified with a URL. I would typically use curl
for this, but after reviewing the --help
options for rsparql
I see that it makes it easier to specify that you want the results in text, XML, JSON, CSV, or TSV. When sending a SPARQL query with curl
, you can’t assume that the endpoint supports all of these result formats, and you probably have to look up their MIME types, because I certainly haven’t memorized them.
The following sends the SPARQL query in the 5triples.rq
file to the Wikidata endpoint and then outputs the results at the command line:
rsparql --query 5triples.rq --service=https://query.wikidata.org/sparql
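The 5triples.rq file isn’t shown here; a query as simple as this one would do that job:
# 5triples.rq: ask for any five triples.
SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5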
rupdate
This sends a local update query to a SPARQL endpoint specified with a URL. It will have to be one where you have update permission, which may well be a locally running copy of Fuseki. The following executes the update request stored in updatetest.ru
on the test1 dataset in the locally running copy of Fuseki (assuming that fuseki-server
was started up with the --update
parameter, as described below):
rupdate --service=http://localhost:3030/test1 --update=updatetest.ru
rdfparse
This parses an RDF/XML document. People don’t use RDF/XML much anymore, and with good reason, but if you find any RDF/XML this is a simple way to convert it. The riot
utility, described below, is even better, but I especially like the -R
switch available with rdfparse
; this tells it to search through an arbitrary XML document and extract any triples stored within embedded rdf:RDF
elements. That can be great for processing some RDF that was embedded into XML before JSON-LD or even RDFa were around. Here’s a nice arbitrary XML document that I called xproduct1.xml
:
<myDoc>
  <header><whatev/></header>
  <rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:d="http://whatever/" >
    <rdf:Description rdf:about="http://whatever/emp1">
      <d:dept>shipping</d:dept>
      <d:name>jane</d:name>
    </rdf:Description>
  </rdf:RDF>
  <arbitraryElement/>
  <rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:d="http://whatever/" >
    <rdf:Description rdf:about="http://whatever/emp3">
      <d:dept>receiving</d:dept>
      <d:name>joe</d:name>
    </rdf:Description>
  </rdf:RDF>
</myDoc>
I run the following command,
rdfparse -R xproduct1.xml
and it produces this nice ntriples output:
<http://whatever/emp1> <http://whatever/dept> "shipping" .
<http://whatever/emp1> <http://whatever/name> "jane" .
<http://whatever/emp3> <http://whatever/dept> "receiving" .
<http://whatever/emp3> <http://whatever/name> "joe" .
Working with Fuseki datasets from the command line
Jena includes several utilities that let you work with datasets created using Jena’s Fuseki SPARQL server. Their ability to load and update data can be very helpful in an automated system that uses Fuseki as its backend data store.
To create some of this data to test with, I used the following command to start up Fuseki in a mode that would allow updates to data that it was storing:
fuseki-server --update
When you go to Fuseki’s GUI interface at http://localhost:3030 and tell it that you want to create a new dataset, you have to choose between three types of dataset: in-memory ones that will not persist from session to session, “Persistent” ones that use the older TDB format, and “Persistent (TDB2)” ones that use the more advanced TDB2 format. For my examples below I just created TDB2 datasets. TDB versions of the commands are also included with Jena, but if you’re creating a new dataset, you may as well use TDB2.
Most of these utilities expect you to specify a path to an assembler file that tells them which Fuseki dataset to operate on. I never tried making my way through the Jena Assembler howto documentation, but I recently noticed that Fuseki creates assembler files for us, so I don’t have to worry about their structure and syntax. When I used Fuseki’s GUI to create a TDB2 dataset called test1, Fuseki created the assembler file apache-jena-fuseki/run/configuration/test1.ttl
, so I knew where to point the command line utilities.
These command line tools won’t work with Fuseki datasets while Fuseki is running, because Fuseki locks the files. My examples below assume that I have created the test1 dataset described above, used the web-based interface to upload data to it (although, as we’ll see, this can be done with command line tools as well), and then shut down the Fuseki server.
Additional information about these commands is available at TDB2 - Command Line Tools.
Dumping dataset contents
The following command showed me the contents of that TDB2 dataset at the command line:
tdb2.tdbdump --tdb ../../apache-jena-fuseki/run/configuration/test1.ttl
Querying a Fuseki dataset
With a SPARQL query stored in myquery.rq
, this command queries the test1 dataset and outputs the results at the command line:
tdb2.tdbquery --tdb ../../apache-jena-fuseki/run/configuration/test1.ttl --query myquery.rq
Setting of the output format is similar to doing it with arq
. Run tdb2.tdbquery --help
to find out more.
Updating a Fuseki dataset
With the file updatetest.ru
storing a SPARQL INSERT update request that inserts a single triple, the following command didn’t show anything at the command line,
tdb2.tdbupdate --tdb ../../apache-jena-fuseki/run/configuration/test1.ttl --update updatetest.ru
but when I restarted the Fuseki server and used the web-based interface to query dataset test1 for all of its triples, I saw the triple inserted by the updatetest.ru
query in there with the triples that had been in there before.
Loading a data file into a Fuseki dataset
The following loaded the triples in the file furniture.ttl
into the test1 dataset (which I confirmed the same way I did with my previous example) and displayed some status messages:
tdb2.tdbloader --tdb ../../apache-jena-fuseki/run/configuration/test1.ttl furniture.ttl
It’s best to make sure that there are no parsing problems with a file before you load it. A quick way to do that is with the --validate
parameter of the riot
command:
riot --validate furniture.ttl
Other command line utilities for Fuseki datasets
The following commands all work on the dataset whose assembler file you point to with the --tdb
parameter:
- tdb2.tdbstats outputs a LISPy set of parenthesized expressions telling you about the dataset.
- tdb2.tdbbackup creates a gzipped copy of the dataset’s triples.
- I tried tdb2.tdbcompact and got a status message of “Compacted in 0.570s”; someday I’ll try this with a larger dataset to really investigate the effect.
riot
Jena includes many command line utilities that I won’t describe here because riot
(“RDF I/O Technology”) combines them all into one utility that I have been using more and more lately. I mentioned in Pulling Turtle RDF triples from the Google Knowledge Graph how it can accept triples via standard input, which was great for the use case that I described there of converting Google Knowledge Graph JSON-LD to Turtle triples on the fly.
We’ve already seen another nice use of riot
above: validating a file of triples before loading it into a dataset stored on a server.
Converting serializations
To simply convert an RDF file from one serialization to another, use the riot
--output
parameter to name the new serialization:
riot --output=JSONLD emps.ttl
The Jena utilities nquads
, ntriples
, rdfxml
, trig
, and turtle
are all specialized versions of riot
that produce the named serializations with no need for an --output
parameter.
Counting triples
When I want to know how many triples are in a Turtle file, here’s what I usually do:
- Look around my hard disk for a query file that uses COUNT to count all the triples.
- Give up looking.
- Look up the COUNT syntax in my book “Learning SPARQL”.
- Write another query file for counting all the triples.
Now I can just use riot
with this simple command line:
riot --count furniture.ttl
It also works with quads.
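For the record, the counting query that I kept looking for and rewriting amounts to a single line of SPARQL:
# Count all the triples in the data being queried.
SELECT (COUNT(*) AS ?tripleCount) WHERE { ?s ?p ?o }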
Concatenating
Jena includes an rdfcat
utility that outputs the concatenated contents of any data files listed on its command line. First, it outputs a header that says “DEPRECATED: Please use ‘riot’ instead”. Providing multiple data file names as arguments when running riot
(I think I just made another pun on the name) will by default output an ntriples version of their concatenated triples, with status messages showing where each one starts. Adding --quiet
suppresses the status messages, and --output
lets you specify a different output serialization.
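For example, this (with made-up filenames) concatenates two data files into a single Turtle file with no status messages:
riot --quiet --output=Turtle data1.ttl data2.nt > combined.ttl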
Inferencing
Jena includes an infer
utility that does inferencing from an RDFS model, but I no longer bother with it because riot
can do this as well.
The following little RDFS model shows that two properties from each of the Oracle and Microsoft sample relational databases are subproperties of similar schema.org properties:
# empmodel.ttl
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <http://schema.org/> .
@prefix oraclehr: <http://snee.com/vocab/schema/OracleHR#> .
@prefix nw: <http://snee.com/vocab/schema/SQLServerNorthwind#> .
oraclehr:employees_first_name rdfs:subPropertyOf schema:givenName .
oraclehr:employees_last_name rdfs:subPropertyOf schema:familyName .
nw:employees_FirstName rdfs:subPropertyOf schema:givenName .
nw:employees_LastName rdfs:subPropertyOf schema:familyName .
Here is some data using the Oracle and Microsoft properties:
# emps.ttl
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix schema: <http://schema.org/> .
@prefix oraclehr: <http://snee.com/vocab/schema/OracleHR#> .
@prefix nw: <http://snee.com/vocab/schema/SQLServerNorthwind#> .
oraclehr:employees_100 oraclehr:employees_last_name "King" ;
oraclehr:employees_first_name "Steven" .
nw:employees_2 nw:employees_LastName "Fuller" ;
nw:employees_FirstName "Andrew" .
This command tells riot
to do inferencing on emps.ttl
using the RDFS modeling in empmodel.ttl
:
riot --rdfs empmodel.ttl emps.ttl
And here is the ntriples result, with line breaks added for readability:
<http://snee.com/vocab/schema/OracleHR#employees_100>
<http://snee.com/vocab/schema/OracleHR#employees_last_name> "King" .
<http://snee.com/vocab/schema/OracleHR#employees_100>
<http://schema.org/familyName> "King" .
<http://snee.com/vocab/schema/OracleHR#employees_100>
<http://snee.com/vocab/schema/OracleHR#employees_first_name> "Steven" .
<http://snee.com/vocab/schema/OracleHR#employees_100>
<http://schema.org/givenName> "Steven" .
<http://snee.com/vocab/schema/SQLServerNorthwind#employees_2>
<http://snee.com/vocab/schema/SQLServerNorthwind#employees_LastName> "Fuller" .
<http://snee.com/vocab/schema/SQLServerNorthwind#employees_2>
<http://schema.org/familyName> "Fuller" .
<http://snee.com/vocab/schema/SQLServerNorthwind#employees_2>
<http://snee.com/vocab/schema/SQLServerNorthwind#employees_FirstName> "Andrew" .
<http://snee.com/vocab/schema/SQLServerNorthwind#employees_2>
<http://schema.org/givenName> "Andrew" .
The new triples show that these employees have schema.org properties in addition to the original OracleHR and Northwind properties. This ability makes this kind of inferencing great for data integration, as I described in Driving Hadoop data integration with standards-based models instead of code. (In that post I used the Python library rdflib to do the same kind of inferencing, but that’s the beauty of standards: having a choice of tools to implement the same expected behavior.)