(This may look like a long blog entry, but it’s mostly sample schemas, data, and shapes. It should be a quick read.)

If RDF technology uses triples to express everything (except queries), why not automate the creation of SHACL constraints from RDF schema declarations?

In a blog entry titled “What else can I do with RDFS?” I described how the triple { vcard:given-name rdfs:domain emp:Person } lets us infer that a resource with a vcard:given-name value is an instance of class emp:Person. I then wrote “Sometimes we forget that RDFS and OWL were invented to enable this kind of inferencing across data found on the web. They were not invented to help us define data structures, but as I’ve shown, RDFS is handy to at least document them.”
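To make that inference concrete, here is a minimal sketch of the rdfs:domain rule written as a SPARQL CONSTRUCT query. It assumes that the schema triples and the instance data sit in the same graph, and the variable names are only illustrative. A real RDFS reasoner applies more rules than this one.

```sparql
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

# For every property that has a declared rdfs:domain, infer that any
# resource using that property is an instance of the domain class.
CONSTRUCT { ?resource rdf:type ?class }
WHERE {
  ?property rdfs:domain ?class .
  ?resource ?property ?value .
}
```

With the triple above in the schema, a resource that has a vcard:given-name value would come out of this query typed as an emp:Person.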

In the data processing world, the purpose of schemas is usually to describe the structure of some data so that a person or process working with that data knows what to expect. If a standard automated process flags parts of the data that don’t comply with the schema, that’s a Good Thing—it means that the person working with the data doesn’t need to write error-checking code to do that.

As I described above, this was not the reason for RDF schemas, but they’ve still been a handy way to describe the structure of a given dataset. Using these schemas for error checking is not an incorrect use of them; section 4 of the RDF Schema specification tells us “Different applications will use this information in different ways. For example, data checking tools might use this to help discover errors in some data set, an interactive editor might suggest appropriate values, and a reasoning application might use it to infer additional information from instance data.”

Some people thought that OWL would make it easier to describe these constraints, but when it came to actually enforcing them, OWL just made things more complicated. So, the W3C eventually published the Shapes Constraint Language standard, or SHACL. This makes it relatively easy to specify typical constraints such as “an instance of Employee must have a family name and given name value” and “an instance of Employee must have another Employee instance as its emp:reportsTo value”.

If I want to write out a list of classes and properties that are in a given dataset, though, it’s still much simpler with RDFS. Then I had an idea: if RDF technology uses triples to express everything (except queries), why not automate the creation of SHACL constraints from RDF schema declarations? It turned out to be surprisingly easy.

Here is a sample schema excerpt for a community orchestra. It declares classes for musicians and instruments and describes two properties:

  • the m:Musician class’s m:plays property, whose value is an instance of m:Instrument
  • the same class’s m:joined property, which shows the date that the musician joined the group

@prefix m:    <http://learningsparql.com/ns/music#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

m:Musician a rdfs:Class . 
m:Instrument a rdfs:Class .

m:plays a rdf:Property ;
        rdfs:domain m:Musician ;
        rdfs:range m:Instrument . 

m:joined a rdf:Property ; 
         rdfs:domain m:Musician ;
         rdfs:range xsd:date . 

My goal was to write a SPARQL CONSTRUCT query that created SHACL shapes from the schema above to flag the following errors when they come up in instance data:

  • an m:plays triple whose value was not an instance of m:Instrument
  • an m:joined triple whose value was not a proper ISO 8601 date
  • a musician with more than one m:joined value

The query that creates these shapes should not be about this specific data but work more generally with other object and datatype property values. Having this work with both object properties and datatype properties was very important for handling a wide variety of data structures.

A query to do this was briefer than I thought it would be:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
PREFIX sh:   <http://www.w3.org/ns/shacl#>

CONSTRUCT {
  ?class a sh:NodeShape ;
  sh:targetClass ?class ;
  sh:property [
     a sh:PropertyShape ; 
     sh:path ?property ;
     ?rangePredicate ?propertyRange ;
     sh:minCount 1 ;
     sh:maxCount 1 
   ]
}
WHERE {
  ?class a rdfs:Class .
  ?property rdfs:domain ?class ;
            rdfs:range ?propertyRange .
  BIND(IF(contains(xsd:string(?propertyRange),
          "http://www.w3.org/2001/XMLSchema#"),
          sh:datatype, sh:class) AS ?rangePredicate) .
}

I won’t describe the details of the SHACL syntax that it creates, because you can look that up yourself. The only somewhat tricky part of the query was identifying whether a declared property was a datatype property or an object property. The IF() function call that does this checks whether the property’s rdfs:range value is in the XML Schema namespace; if it is, the property is treated as a datatype property, and otherwise it is assumed to be an object property. If you have more complex data, you can nest IF() function calls to cover more complex cases.
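As a rough sketch of what nesting IF() calls might look like, the variation below adds one extra case: a range of rdf:langString, which is a datatype even though its IRI is outside the XML Schema namespace. The choice of that case is an assumption for illustration, not something the schemas in this entry need. The sketch also swaps in STRSTARTS() and STR(), standard SPARQL 1.1 functions, for the namespace test.

```sparql
PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX sh:   <http://www.w3.org/ns/shacl#>

CONSTRUCT {
  ?class a sh:NodeShape ;
         sh:targetClass ?class ;
         sh:property [
           a sh:PropertyShape ;
           sh:path ?property ;
           ?rangePredicate ?propertyRange ;
           sh:minCount 1 ;
           sh:maxCount 1
         ]
}
WHERE {
  ?class a rdfs:Class .
  ?property rdfs:domain ?class ;
            rdfs:range ?propertyRange .
  # Outer IF: ranges in the XML Schema namespace are datatypes.
  # Inner IF: rdf:langString is also a datatype; anything else
  # is assumed to be a class.
  BIND(
    IF(STRSTARTS(STR(?propertyRange), "http://www.w3.org/2001/XMLSchema#"),
       sh:datatype,
       IF(?propertyRange = rdf:langString, sh:datatype, sh:class))
    AS ?rangePredicate)
}
```

Further cases can be added by nesting another IF() in the final "else" position.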

The query adds sh:minCount and sh:maxCount values of 1 for all properties so that each property is required and can have only one value. An orchestra member may actually play more than one instrument, so the SHACL shapes that this query outputs can be easily edited to account for that. For me, the real value of the query above is to automate the creation of the shapes and their relationships, leaving me to do easy things like adjusting the count values by hand.
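If the generated shapes are loaded into a store that supports SPARQL Update, that kind of adjustment could also be scripted instead of made by hand. The following is only a sketch under that assumption: it removes the sh:maxCount of 1 from the m:plays property shape so that a musician may play several instruments, and leaves the sh:minCount in place.

```sparql
PREFIX m:  <http://learningsparql.com/ns/music#>
PREFIX sh: <http://www.w3.org/ns/shacl#>

# Find the property shape for m:plays on the m:Musician node shape
# and delete only its maximum-count constraint.
DELETE { ?propertyShape sh:maxCount 1 }
WHERE {
  m:Musician sh:property ?propertyShape .
  ?propertyShape sh:path m:plays ;
                 sh:maxCount 1 .
}
```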

Let’s see it in action. Here are the SHACL shapes that the CONSTRUCT query created from the musician schema above:

@prefix m:    <http://learningsparql.com/ns/music#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

m:Musician  rdf:type    sh:NodeShape;
        sh:property     [ rdf:type     sh:PropertyShape;
                          sh:datatype  xsd:date;
                          sh:maxCount  1;
                          sh:minCount  1;
                          sh:path      m:joined
                        ];
        sh:property     [ rdf:type     sh:PropertyShape;
                          sh:class     m:Instrument;
                          sh:maxCount  1;
                          sh:minCount  1;
                          sh:path      m:plays
                        ];
        sh:targetClass  m:Musician .

Do these shapes do what they’re supposed to do? In the following sample data, the musician kim has two different m:joined values. Musician pat has only one, but it’s not a proper ISO 8601 date. Also, pat has an m:plays value of m:kim, which is not an instance of the m:Instrument class.

@prefix m: <http://learningsparql.com/ns/music#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix sh: <http://www.w3.org/ns/shacl#> .

### instance data ###

m:guitar a m:Instrument .

m:piano a m:Instrument . 

m:kim a m:Musician ;
   m:joined "2024-10-12"^^xsd:date ;
   m:joined "2024-10-13"^^xsd:date ;
   m:plays m:guitar . 
   
m:pat a m:Musician ;
   m:joined "2023-03-13" ;
   m:plays m:kim .

Validating this instance data against the shapes created by the CONSTRUCT query (using the SHACL Playground, GraphDB, and Jena’s SHACL validator), I got this result:

@prefix m:    <http://learningsparql.com/ns/music#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

[ rdf:type     sh:ValidationReport;
  sh:conforms  false;
  sh:result    [ rdf:type                      sh:ValidationResult;
                 sh:focusNode                  m:pat;
                 sh:resultMessage              "DatatypeConstraint[xsd:date]: Expected xsd:date : Actual xsd:string : Node \"2023-03-13\"";
                 sh:resultPath                 m:joined;
                 sh:resultSeverity             sh:Violation;
                 sh:sourceConstraintComponent  sh:DatatypeConstraintComponent;
                 sh:sourceShape                _:b0;
                 sh:value                      "2023-03-13"
               ];
  sh:result    [ rdf:type                      sh:ValidationResult;
                 sh:focusNode                  m:pat;
                 sh:resultMessage              "ClassConstraint[<http://learningsparql.com/ns/music#Instrument>]: Expected class :<http://learningsparql.com/ns/music#Instrument> for <http://learningsparql.com/ns/music#kim>";
                 sh:resultPath                 m:plays;
                 sh:resultSeverity             sh:Violation;
                 sh:sourceConstraintComponent  sh:ClassConstraintComponent;
                 sh:sourceShape                [] ;
                 sh:value                      m:kim
               ];
  sh:result    [ rdf:type                      sh:ValidationResult;
                 sh:focusNode                  m:kim;
                 sh:resultMessage              "maxCount[1]: Invalid cardinality: expected max 1: Got count = 2";
                 sh:resultPath                 m:joined;
                 sh:resultSeverity             sh:Violation;
                 sh:sourceConstraintComponent  sh:MaxCountConstraintComponent;
                 sh:sourceShape                _:b0
               ]
] .

It looks like the shapes created by the CONSTRUCT query did their job. (Isn’t it great that, along with RDFS schemas and SHACL shapes, the validation output is also expressed in triples? This means that you can make it part of a pipeline that combines additional steps into a complex workflow.)
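As one hedged illustration of such a pipeline step, a SELECT query can flatten a validation report into rows for a later process to read. The sketch below assumes the report has been loaded into a graph of its own; the variable names are arbitrary. The sh:value pattern is OPTIONAL because some results, such as the maxCount violation above, have no sh:value.

```sparql
PREFIX sh: <http://www.w3.org/ns/shacl#>

# List each validation result with the node, property, and
# constraint component involved, plus the offending value if any.
SELECT ?focusNode ?path ?component ?value
WHERE {
  ?report a sh:ValidationReport ;
          sh:result ?result .
  ?result sh:focusNode ?focusNode ;
          sh:resultPath ?path ;
          sh:sourceConstraintComponent ?component .
  OPTIONAL { ?result sh:value ?value }
}
ORDER BY ?focusNode ?path
```

Run against the musician report, this should return one row for each of the three sh:result entries.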

I also tried it with this next schema, where the hr:Employee class’s hr:reportsTo property should have a value that is another hr:Employee instance, and the hr:jobGrade value must be an integer:

@prefix hr:   <http://learningsparql.com/ns/humanResources#> .
@prefix d:    <http://learningsparql.com/ns/data#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix sh:  <http://www.w3.org/ns/shacl#> .

hr:Employee a rdfs:Class .

hr:reportsTo a rdf:Property ;
rdfs:domain hr:Employee ;
rdfs:range hr:Employee . 

hr:name
   rdf:type rdf:Property ;
   rdfs:domain hr:Employee .

hr:hireDate
   rdf:type rdf:Property ;
   rdfs:domain hr:Employee ;
   rdfs:range xsd:date .

hr:jobGrade
   rdf:type rdf:Property ;
   rdfs:domain hr:Employee ;
   rdfs:range xsd:integer .

The CONSTRUCT query above created these shapes from that:

@prefix d:    <http://learningsparql.com/ns/data#> .
@prefix hr:   <http://learningsparql.com/ns/humanResources#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

hr:Employee  rdf:type   sh:NodeShape;
        sh:property     [ rdf:type     sh:PropertyShape;
                          sh:datatype  xsd:date;
                          sh:maxCount  1;
                          sh:minCount  1;
                          sh:path      hr:hireDate
                        ];
        sh:property     [ rdf:type     sh:PropertyShape;
                          sh:class     hr:Employee;
                          sh:maxCount  1;
                          sh:minCount  1;
                          sh:path      hr:reportsTo
                        ];
        sh:property     [ rdf:type     sh:PropertyShape;
                          sh:datatype  xsd:integer;
                          sh:maxCount  1;
                          sh:minCount  1;
                          sh:path      hr:jobGrade
                        ];
        sh:targetClass  hr:Employee .

My sample test instance data for that has an employee e3 who reports to d:d1, a resource not mentioned elsewhere in the data as an instance of hr:Employee or anything else. Employee e3 also has a non-integer job grade.

@prefix d:    <http://learningsparql.com/ns/data#> .
@prefix hr:   <http://learningsparql.com/ns/humanResources#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .
@prefix sh:   <http://www.w3.org/ns/shacl#> .

d:e1
   a hr:Employee;
   hr:name "Barry Wom" ;
   hr:hireDate "2017-06-03"^^xsd:date ;
   hr:reportsTo d:e3 ; 
   hr:jobGrade 5 .

d:e3
   a hr:Employee;
   hr:name "Stig O'Hara" ;
   hr:hireDate "2017-03-14"^^xsd:date ;
   hr:jobGrade 3.14 ;
   hr:reportsTo d:d1 .

When this sample data is validated against the employee shapes created by the SPARQL query, the validator finds both problems:

@prefix d:    <http://learningsparql.com/ns/data#> .
@prefix hr:   <http://learningsparql.com/ns/humanResources#> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix sh:   <http://www.w3.org/ns/shacl#> .
@prefix xsd:  <http://www.w3.org/2001/XMLSchema#> .

[ rdf:type     sh:ValidationReport;
  sh:conforms  false;
  sh:result    [ rdf:type                      sh:ValidationResult;
                 sh:focusNode                  d:e3;
                 sh:resultMessage              "ClassConstraint[<http://learningsparql.com/ns/humanResources#Employee>]: Expected class :<http://learningsparql.com/ns/humanResources#Employee> for <http://learningsparql.com/ns/data#d1>";
                 sh:resultPath                 hr:reportsTo;
                 sh:resultSeverity             sh:Violation;
                 sh:sourceConstraintComponent  sh:ClassConstraintComponent;
                 sh:sourceShape                [] ;
                 sh:value                      d:d1
               ];
  sh:result    [ rdf:type                      sh:ValidationResult;
                 sh:focusNode                  d:e3;
                 sh:resultMessage              "DatatypeConstraint[xsd:integer]: Expected xsd:integer : Actual xsd:decimal : Node 3.14";
                 sh:resultPath                 hr:jobGrade;
                 sh:resultSeverity             sh:Violation;
                 sh:sourceConstraintComponent  sh:DatatypeConstraintComponent;
                 sh:sourceShape                [] ;
                 sh:value                      3.14
               ]
] .
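One detail of the employee example is worth noting: hr:name is declared with an rdfs:domain but no rdfs:range, so the CONSTRUCT query creates no property shape for it, because its WHERE clause requires both. A possible variation, sketched here as an assumption and not as part of the original query, makes the range OPTIONAL. A CONSTRUCT template leaves out any triple that has an unbound variable, so a property with no range still gets a shape with its path and count constraints.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX sh:   <http://www.w3.org/ns/shacl#>

CONSTRUCT {
  ?class a sh:NodeShape ;
         sh:targetClass ?class ;
         sh:property [
           a sh:PropertyShape ;
           sh:path ?property ;
           # This triple is omitted when the property has no range.
           ?rangePredicate ?propertyRange ;
           sh:minCount 1 ;
           sh:maxCount 1
         ]
}
WHERE {
  ?class a rdfs:Class .
  ?property rdfs:domain ?class .
  OPTIONAL {
    ?property rdfs:range ?propertyRange .
    BIND(IF(STRSTARTS(STR(?propertyRange), "http://www.w3.org/2001/XMLSchema#"),
            sh:datatype, sh:class) AS ?rangePredicate)
  }
}
```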

Any SHACL fan is going to think of other things that the CONSTRUCT query can deduce from a regular RDFS schema in order to add more useful triples to the SHACL shapes created from that schema. Let me know what you come up with!

