SPARQLing anything

MS Office files, XML, Markdown, plain text, and more.

SPARQL Anything is an open source tool that lets you use SPARQL to query data in a long list of popular formats: XML, JSON, CSV, HTML, Excel, Text, Binary, EXIF, File System, Zip/Tar, Markdown, YAML, Bibtex, DOCx, and PPTx. It has a lot of great documentation and features, but I’ll start here with an example of it in action.

As you’ll see on its GitHub page, there is a command line interface and a server version. I downloaded the jar file from its releases page with the goal of sending a SPARQL query to this spreadsheet, which I called xlsxtest.xlsx:

[Image: sample spreadsheet]

(I created this spreadsheet with OpenOffice, but saved it as an MS Office Excel file and it worked just fine.)

I then put the following SPARQL query in the file sa1.rq. Note how the IRI after its SERVICE keyword includes the name of the spreadsheet file above:

CONSTRUCT { ?s ?p ?o }
WHERE {
    SERVICE <x-sparql-anything:xlsxtest.xlsx> {
        ?s ?p ?o
    }
}

I called the jar file with sa1.rq as the query file (run it with no parameters to see a wide choice of other parameters) and redirected the output to a Turtle file:

java -jar ~/temp/sparql-anything-0.9.0.jar -q sa1.rq > xlsxtest.ttl

Here is the Turtle file:

[ a       <http://sparql.xyz/facade-x/ns/root> ;
  <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
          [ <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
                    "Given-name" ;
            <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
                    "Family-name" ;
            <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
                    "Hire-date" ;
            <http://www.w3.org/1999/02/22-rdf-syntax-ns#_4>
                    "random int"
          ] ;
  <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
          [ <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
                    "Grace" ;
            <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
                    "Lee" ;
            <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
                    "45150.0"^^<http://www.w3.org/2001/XMLSchema#double> ;
            <http://www.w3.org/1999/02/22-rdf-syntax-ns#_4>
                    "3.0"^^<http://www.w3.org/2001/XMLSchema#double>
          ] ;
  <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
          [ <http://www.w3.org/1999/02/22-rdf-syntax-ns#_1>
                    "Johnson" ;
            <http://www.w3.org/1999/02/22-rdf-syntax-ns#_2>
                    "Frank" ;
            <http://www.w3.org/1999/02/22-rdf-syntax-ns#_3>
                    "44887.0"^^<http://www.w3.org/2001/XMLSchema#double> ;
            <http://www.w3.org/1999/02/22-rdf-syntax-ns#_4>
                    "54.0"^^<http://www.w3.org/2001/XMLSchema#double>
          ]
] .

It’s nice to see that it didn’t turn all the values into strings; it recognized the numeric values and typed them accordingly. (That includes the Hire-date cells: values such as 45150.0 look like the spreadsheet’s internal day-count serial numbers rather than formatted dates.) The structure of this output, which uses a lot of blank nodes, conforms to a model developed by the SPARQL Anything developers called Facade-X. Before digging very far into that, I played around with SPARQL queries against the above triples and came up with this:

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT ?rowID ?cellID ?value WHERE {
   ?root a <http://sparql.xyz/facade-x/ns/root> ;
         ?rowID ?rowContents .
   ?rowContents ?cellID ?value .
}
ORDER BY ?rowID ?cellID

Running that query on the Turtle created by SPARQL Anything gave me this:

-------------------------------------------
| rowID  | cellID | value                 |
===========================================
| rdf:_1 | rdf:_1 | "Given-name"          |
| rdf:_1 | rdf:_2 | "Family-name"         |
| rdf:_1 | rdf:_3 | "Hire-date"           |
| rdf:_1 | rdf:_4 | "random int"          |
| rdf:_2 | rdf:_1 | "Grace"               |
| rdf:_2 | rdf:_2 | "Lee"                 |
| rdf:_2 | rdf:_3 | "45150.0"^^xsd:double |
| rdf:_2 | rdf:_4 | "3.0"^^xsd:double     |
| rdf:_3 | rdf:_1 | "Johnson"             |
| rdf:_3 | rdf:_2 | "Frank"               |
| rdf:_3 | rdf:_3 | "44887.0"^^xsd:double |
| rdf:_3 | rdf:_4 | "54.0"^^xsd:double    |
-------------------------------------------

It’s a pretty nice representation of the original spreadsheet.
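
Because the first row of the spreadsheet holds the column headers, a variation on that query can pair each data cell with its header instead of its position. The following is only a sketch, written against the Turtle output shown above; the fx: prefix is shorthand introduced here for the Facade-X namespace that appears in that output, and the variable names are illustrative.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX fx:  <http://sparql.xyz/facade-x/ns/>

SELECT ?rowID ?header ?value WHERE {
   ?root a fx:root ;
         rdf:_1 ?headerRow ;      # the first row holds the column names
         ?rowID ?dataRow .
   FILTER(?rowID != rdf:_1)       # leave the header row out of the data rows
   ?headerRow ?cellID ?header .   # header text for a given cell position...
   ?dataRow   ?cellID ?value .    # ...joined to the value at the same position
}
ORDER BY ?rowID ?cellID
```

Run against the Turtle above, this should return eight rows, one for each data cell, such as rdf:_2 with "Given-name" and "Grace". The rdf:type triple on the root also matches the ?rowID pattern, but it drops out because its object has no cells of its own.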

The SPARQL Anything server looks cool, but I liked how I could do the above with a downloaded jar file and no configuration or setup. I just downloaded and ran it.

At 5:06 of the 15:34 video “Streamlining Knowledge Graph Construction with a façade: the SPARQL Anything project”, Enrico Daga, one of the key SPARQL Anything developers, gives some good background on the philosophy of Facade-X. To summarize, it models things as lists of lists. Blank nodes, as we saw above, play a large role. A paper published by Enrico and his colleagues in the ACM Transactions on Internet Technology (also available on one of the SPARQL Anything websites) describes Facade-X in more detail.

For fun, I used SPARQL Anything to send SPARQL queries to some other formats as well, like PPTx and Markdown. My query for most of these was simply CONSTRUCT {?s ?p ?o} WHERE {?s ?p ?o} because I just wanted to convert the various formats to RDF and see what that looked like. More sophisticated CONSTRUCT or SELECT queries could pull out information modeled for specific applications.
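
As a rough illustration of what a more application-specific CONSTRUCT query might look like, the sketch below remodels the spreadsheet rows from the Turtle output shown earlier as people. The choice of schema.org terms is an assumption made for this example, not something from the original experiment, and the fx: prefix is shorthand for the Facade-X namespace in that output.

```sparql
PREFIX rdf:    <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX fx:     <http://sparql.xyz/facade-x/ns/>
PREFIX schema: <http://schema.org/>

CONSTRUCT {
   ?row a schema:Person ;
        schema:givenName  ?given ;
        schema:familyName ?family .
}
WHERE {
   ?root a fx:root ;
         ?rowID ?row .
   FILTER(?rowID != rdf:_1)   # skip the header row
   ?row rdf:_1 ?given ;       # first column: Given-name
        rdf:_2 ?family .      # second column: Family-name
}
```

Each row stays a blank node, so the result is two anonymous schema:Person resources. A real pipeline would probably mint IRIs for them, but how to do that depends on the application.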

I can picture SPARQL Anything being useful in many, many projects. For example, about half a dozen of its possible input formats are typically used for unstructured natural language text. Turning that into triples, where the object of each triple stores a document or paragraph of natural language text, is a great way to hand these documents off to RDF-based text analysis tools such as the spaCy entity recognition library that I wrote about in my post “Entity recognition from within a SPARQL query”. (Ontotext’s GraphDB Free triplestore supports other entity recognition libraries that would benefit from these conversions as well.)
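
A minimal sketch of pulling the text values out of such a document so they can be handed to another tool might look like the following. The file name notes.md is hypothetical, and the query makes no assumptions about how SPARQL Anything structures a Markdown file beyond the text ending up as string literals.

```sparql
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?text
WHERE {
   SERVICE <x-sparql-anything:notes.md> {   # notes.md is a hypothetical file name
      ?s ?p ?text .
   }
   # keep only plain string values, dropping IRIs, blank nodes, and typed numbers
   FILTER(isLiteral(?text) && datatype(?text) = xsd:string)
}
```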

I’m sure there are many other potential applications as an increasing number of projects seek to pull information from commonly used file formats to add to Knowledge Graphs. SPARQL Anything is an excellent contribution to anyone’s toolbox of potential RDF workflow pipeline steps.

