Filtering foreign literals out of SPARQL query results

And only the foreign literals.

January 26, 2025

It’s easy enough for a SPARQL query to specify that you only want literal values that are tagged with a particular spoken language such as English or French. Recently I had a more complex condition to express, one that has come up fairly often: how do I retrieve all the data for a particular resource except the literals tagged in a foreign language? I want all the triples with object property values, and I want all the ones with literal values, regardless of type, unless they are tagged in a language other than English. (Obviously, you can substitute another language tag as the only one whose values you want to see.)

This came up when I was playing with YAGO, but it has also happened when I was working with Wikidata and DBpedia. These are such international data collections that many of the string literal values are available in many languages, which is great, but when I retrieve all the data for a given resource, I see lots and lots of string values that I don’t need.

For example, try a Wikidata query about data for bebop bassist Tommy Potter. Of the 156 triples that get returned, 25 are rdfs:label values for his name tagged for different languages (but usually showing “Tommy Potter”), 3 are skos:altLabel values for his name tagged with different languages, and 27 triples are schema:description values with the English one being “American jazz double bassist (1918–1988)” and the rest being variations on that in other languages.
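To see how the literal values break down by language tag, a query along the following lines could help. This is only a sketch: the ?tag and ?count variable names are my own choices for illustration, and the numbers that come back depend on the current state of Wikidata.

```sparql
# Count the literal objects for Tommy Potter, grouped by language tag.
# Literals with no language tag are grouped under the empty string.
SELECT ?tag (COUNT(*) AS ?count) WHERE {
  <http://www.wikidata.org/entity/Q1369941> ?p ?o .
  FILTER( isLiteral(?o) )
  BIND( lang(?o) AS ?tag )
}
GROUP BY ?tag
ORDER BY DESC(?count)
```

The isLiteral() check is there because lang() raises an error for IRIs, which would otherwise leave ?tag unbound for those rows.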

If I ask for triples whose objects are tagged as English language values, like this,

SELECT * WHERE {
  <http://www.wikidata.org/entity/Q1369941> ?p ?o 
  FILTER( lang(?o) = "en")
}

I’ll only get three search results: the English version of each of the three properties mentioned above. I’ll miss out on literal values that aren’t tagged as English, whether they are strings or other data types such as the one for Potter’s birthday. I’ll also miss out on triples that have a URI as an object.

To come up with a good FILTER, I switched from querying for Tommy Potter data to querying the following small test data set that I created:

@prefix ls: <http://www.learningsparql/ns/test> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> . 
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

ls:someEntity rdfs:label "some entity" ;
              rdfs:label "Some Entity"@en ;
              rdfs:label "alguna entidad"@es ;
              ls:created "2025-01-23"^^xsd:date ;
              ls:amount  4 ;
              ls:rating  3.14 ;
              ls:needs   ls:someOtherEntity .

I wanted a query that would retrieve all of this data except for the “alguna entidad” triple. I wanted the one with the “en” language tag, the string with no language tag, the three typed literals, and the triple that has a URI as an object.

At first I was treating this like an overly complex logic puzzle, wondering how I could get literals that were (not (not English)). I finally realized how easy it would be to just have a boolean OR ask for:

The triples where the objects are URIs.
Literals that are tagged as being in English.
Literals that have no language tag. This would get the first “some entity” triple in my sample data, but perhaps more importantly, it would get the ls:created, ls:amount, and ls:rating values.

The following does this.

SELECT * WHERE {
   ?s ?p ?o .
   FILTER( ISIRI(?o) || (lang(?o) = "en")  ||  (!(langMatches(lang(?o),"*"))) )
}

The first two filter conditions are basic SPARQL: if a triple’s object is an IRI, we want it; if the triple’s object has a language tag of “en” for English, we want it.

The third filter condition uses the langMatches() function. I had forgotten about this one but was reminded by the section “Checking, Adding and Removing Spoken Language Tags” of my book Learning SPARQL. Without the ! to do a boolean NOT, the langMatches() expression in this query with “*” as an argument would return True for any value of ?o that has any language tag; with the boolean NOT it returns True for any value that has no language tag. So, it does the job described by the third bullet above.
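To watch what langMatches() does with each literal before wrapping it in a boolean NOT, a diagnostic query like this one can be run against the sample data. It is a sketch for exploration only; the ?tag and ?hasTag variable names are assumptions of mine and not part of the original query.

```sparql
# For every literal object, show its language tag and whether
# langMatches() with "*" considers it to have one.
SELECT ?o ?tag ?hasTag WHERE {
  ?s ?p ?o .
  FILTER( isLiteral(?o) )                   # lang() is an error for IRIs, so skip them
  BIND( lang(?o) AS ?tag )                  # "" when the literal has no language tag
  BIND( langMatches(?tag, "*") AS ?hasTag ) # true only for a non-empty tag
}
```

With the sample data, ?hasTag should come back true for the two literals tagged @en and @es and false for the untagged string and the three typed literals.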

For my “some entity” sample data this query returned everything but the “alguna entidad”@es triple, as I had hoped. For the query of Tommy Potter data, you can see for yourself that it returns 104 rows instead of 156, with no literal values tagged with a language other than English. The results include only one row for each of the rdfs:label, schema:description, and skos:altLabel values. (Changing the “en” in the Tommy Potter query to “de” for German and “es” for Spanish got the expected results.)
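If you switch languages often, one option is to pull the tag out into a VALUES clause so that it only needs changing in one place. This is just a sketch of that idea; the ?wanted variable name is my own invention for illustration.

```sparql
# Same filter as above, with the wanted language tag set in one place.
SELECT ?p ?o WHERE {
  VALUES ?wanted { "de" }
  <http://www.wikidata.org/entity/Q1369941> ?p ?o .
  FILTER( isIRI(?o) || (lang(?o) = ?wanted) || (!(langMatches(lang(?o), "*"))) )
}
```

Selecting ?p and ?o explicitly instead of using SELECT * keeps the ?wanted helper variable out of the result columns.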

If anyone can suggest a more efficient version of that boolean FILTER condition I’d love to see it, but meanwhile I’m sure I’ll be pasting it into a lot more queries in the future when I explore large international datasets.

2025-02-01 update: I have learned that langMatches(lang(?o),"") does the same thing as (!(langMatches(lang(?o),"*"))), so that simplifies the expression further. Thanks Mohammad Hossein Rimaz and Jan Martin Keil!
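With that substitution, the whole query might look like the following. Treat it as a sketch: it assumes that your SPARQL engine handles an empty language range the way the update describes, so it is worth trying on a small test data set first.

```sparql
SELECT * WHERE {
   ?s ?p ?o .
   # Keep IRIs, English-tagged literals, and literals with no language tag.
   FILTER( isIRI(?o) || (lang(?o) = "en") || langMatches(lang(?o), "") )
}
```

Because lang() returns an empty string for a literal that has no language tag, writing the third condition as lang(?o) = "" is another way to express the same test.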

