SPARQL queries of git repository data
If we're going to think of git data as a graph...
Justin Dowdy recently created an open source project to convert the metadata in a git repository to RDF, and I’ve been having some fun with it. Before getting into the details, as a brief demo I’ll start with a sample SPARQL query that I did to list all of the 2019 commits in my misc github repo:
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX x: <http://www.w3.org/2001/XMLSchema#>
PREFIX gist: <https://ontologies.semanticarts.com/gist/>
SELECT ?title ?dateTime WHERE {
  ?commit a wd:Q20058545 ;   # it's an instance of the commit class
          dcterms:subject ?subject ;
          gist:atDateTime ?dateTime .
  ?subject dcterms:title ?title .
  FILTER (?dateTime >= "2019-01-01T00:00:00"^^x:dateTime &&
          ?dateTime < "2020-01-01T00:00:00"^^x:dateTime)
}
It produced this result:
title                                      dateTime
-----                                      --------
adding sqlite rdf files                    2019-07-13T16:19:39-04:00
added tableList.scr                        2019-07-13T16:21:39-04:00
adding readme                              2019-07-28T12:00:55-04:00
added files to go with 2019-10 blog entry  2019-10-20T16:46:07-04:00
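The year appears only in the two FILTER bounds, so the same query can be generated for any year. Here is a minimal Python sketch of that idea. The function name commits_in_year_query is hypothetical, and the query text is simply the one shown above with the bounds turned into placeholders.

```python
# Template for the commit-listing query; only the two year bounds vary.
# Doubled braces ({{ and }}) are literal braces for str.format().
QUERY_TEMPLATE = """\
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX x: <http://www.w3.org/2001/XMLSchema#>
PREFIX gist: <https://ontologies.semanticarts.com/gist/>
SELECT ?title ?dateTime WHERE {{
  ?commit a wd:Q20058545 ;
          dcterms:subject ?subject ;
          gist:atDateTime ?dateTime .
  ?subject dcterms:title ?title .
  FILTER (?dateTime >= "{start}-01-01T00:00:00"^^x:dateTime &&
          ?dateTime < "{end}-01-01T00:00:00"^^x:dateTime)
}}
"""


def commits_in_year_query(year):
    """Return the query text for commits dated within the given year.

    The range is half-open: from January 1 of `year` (inclusive) up to
    January 1 of the following year (exclusive).
    """
    if not isinstance(year, int) or not 1 <= year <= 9998:
        raise ValueError("year must be an integer between 1 and 9998")
    return QUERY_TEMPLATE.format(start=f"{year:04d}", end=f"{year + 1:04d}")
```

Calling commits_in_year_query(2019) reproduces the FILTER bounds used above. Like the original query, the generated bounds carry no time zone offset.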
Justin’s software that makes this all possible is at https://github.com/justin2004/git_to_rdf.
Once I installed that software and created a /home/bob/temp/rdf directory, the following variation on the command line from Justin’s github page read my local copy of the misc repo and put 35,353 triples about it in two files in /mnt/temp/rdf:
/home/bob/git/git_to_rdf/git_to_rdf.sh \
--repository /mnt/git/misc --output /mnt/temp/rdf
(Referencing /home/bob/temp/rdf as /mnt/temp/rdf is a Docker thing that I don’t completely understand myself. Justin said that he is working to simplify that.) I loaded the new triples into Jena Fuseki and tried a few of my Queries to explore a dataset that I typically use, which is how I found out that it had 35K triples.
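For anyone who would rather get that triple count from a script than from the Fuseki web interface, here is a rough Python sketch that uses only the standard library. It assumes Fuseki is running locally on its default port 3030 and that the triples were loaded into a dataset called misc; that dataset name and the function names are my own placeholders, so adjust them to match your setup.

```python
import json
import urllib.parse
import urllib.request

# Assumptions: a local Fuseki on the default port, and a dataset named
# "misc" (hypothetical name) that exposes the usual /sparql query service.
ENDPOINT = "http://localhost:3030/misc/sparql"

# Counts every triple in the default graph.
COUNT_QUERY = "SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }"


def build_request(endpoint, query):
    """Build a SPARQL Protocol POST request with a URL-encoded query."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )


def count_triples(endpoint=ENDPOINT):
    """Send the count query and return the number of triples as an int."""
    with urllib.request.urlopen(build_request(endpoint, COUNT_QUERY)) as resp:
        results = json.load(resp)
    # SPARQL JSON results: one row, with the count bound to ?triples.
    return int(results["results"]["bindings"][0]["triples"]["value"])
```

With the server running, print(count_triples()) should report the number of triples in the dataset. Splitting out build_request makes it possible to check the request without a live server.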
To really understand the possibilities, read Justin’s blog entry Git Repositories as RDF Graphs. I especially like how it explained that he didn’t necessarily have to make “thoughtful” RDF (well-modeled RDF that takes advantage of standard vocabularies) and why and how he did so. His blog entry also includes a nice diagram of his data model, generated with RDFox, that you’ll want to keep handy while you develop any queries for your own git repo data converted to RDF.
Several of his sample queries will be especially useful for querying git repos that have commits from multiple people. He demonstrates these with RDF generated from the repo for the cURL utility that I have written about here many times. My misc repo that I used to generate RDF only has commits from me, so these sample queries were less useful to me, but they still provided a good model for how to get at certain kinds of repo information.
To build on what he wrote there I wanted to create at least one more query that was different from his examples, so I created this one to find the commits that used blocks of text with the word “music” in them:
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX gist: <https://ontologies.semanticarts.com/gist/>
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT DISTINCT ?commitTitle ?commitTime ?filename ?textLine WHERE {
  ?commit a wd:Q20058545 ;   # it's a commit
          gist:hasPart ?part ;
          dcterms:subject ?commitSubject ;
          gist:atDateTime ?commitTime .
  ?commitSubject dcterms:title ?commitTitle .
  ?part gist:produces ?contiguousLines .
  ?contiguousLines gist:occursIn ?file ;
                   <http://example.com/containedTextContainer> ?textContainer .
  ?file gist:name ?filename .
  ?textContainer ?line ?textLine .
  FILTER(contains(?textLine,"music"))
}
And here is the result:
This combination of the world’s most popular version control system and the ability to manipulate metadata about what it contains could provide the basis for a Content Management System in the broader original sense of the term: something to manage the storage and workflow of multiple kinds of content for multiple kinds of publication media. (In recent years the term’s meaning has narrowed to mean “platform to help automate web publishing”.)
That’s just one of the possibilities. Read Justin’s blog entry and see what ideas it gives you!