Using regular expressions to manipulate data in a SPARQL query
A pure, standards-compliant SPARQL query.
I have often lamented that SPARQL’s REGEX function only returns a boolean value. It’s handy in FILTER tests because it lets you use regular expressions to create more complex conditions about the results that you do or don’t want returned by a query, but instead of just returning true or false, I wished that it would let me grab the pieces of a string that match the regular expression pattern and recombine them into new values, like I can with the regular expression support of most programming languages.
I only recently noticed that SPARQL’s REPLACE function, which comes right after REGEX in the SPARQL query specification, supports regular expressions, so I can do this regex string manipulation in SPARQL after all.
One of those other languages is JavaScript. In Calling your own JavaScript functions from SPARQL queries I showed how once you write a JavaScript function that does some regex string manipulation, you can then call that function from a SPARQL query being executed with Jena ARQ. (Soon I’ll be showing how to do that with GraphDB on the Ontotext blog.) The demo in my earlier blog entry used a regular expression in a JavaScript function to normalize some U.S. phone numbers.
The SPARQL query below demonstrates why I didn’t need to call those JavaScript functions. Using SPARQL’s REPLACE function and the same input data as that demo, I can normalize the same phone numbers using nothing but pure W3C-compliant SPARQL.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX v: <http://www.w3.org/2006/vcard/ns#>
SELECT ?name ?phoneNum ?fixedPhone
WHERE {
  ?s v:given-name ?name ;
     v:homeTel ?phoneNum .
  BIND (replace(?phoneNum,".*(\\d\\d\\d).*(\\d\\d\\d).*(\\d\\d\\d\\d).*",
                "$1-$2-$3") AS ?fixedPhone)
}
The regular expression in the replace() function call’s second argument looks for two three-digit sequences and then a four-digit sequence, ignoring everything before, after, or in between. The parentheses capture those three sequences, and the third argument, "$1-$2-$3", returns the captured strings separated by hyphens.
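For trying out a pattern like this without loading any data, a minimal sketch along the following lines may help. It assumes a SPARQL 1.1 processor; the inline VALUES list and the \d{3} quantifier shorthand are my substitutions for the data file and the repeated \d codes, not part of the original query.

```sparql
SELECT ?phoneNum ?fixedPhone
WHERE {
  # Inline test strings copied from the sample data, so no data file is needed.
  VALUES ?phoneNum { "1 (203) 446-5478" " (729)556-5135 " "9232765135" }
  # \\d{3} is shorthand for \\d\\d\\d; $1, $2, and $3 refer to the
  # three parenthesized groups in the pattern.
  BIND (REPLACE(?phoneNum, ".*(\\d{3}).*(\\d{3}).*(\\d{4}).*", "$1-$2-$3")
        AS ?fixedPhone)
}
```

Each of the three test strings should come back in the same 203-446-5478 style as the original query produces.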
Here is the sample data from that earlier blog entry; note the different punctuation and spacing used with the four phone numbers:
@prefix v: <http://www.w3.org/2006/vcard/ns#> .
@prefix d: <http://learningsparql.com/ns/data#> .
d:i9771 v:given-name "Cindy" ;
        v:homeTel "1 (203) 446-5478" .
d:i0432 v:given-name "Richard" ;
        v:homeTel " (729)556-5135 " .
d:i8301 v:given-name "Craig" ;
        v:homeTel "9232765135" .
d:i8309 v:given-name "Leigh" ;
        v:homeTel "843-5544" .
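The result after running the query above with this data shows the phone numbers from the data and the results of the replace() calls. Leigh’s seven-digit number has no area code, so it doesn’t match the pattern and replace() returns it unchanged: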
name | phoneNum | fixedPhone |
---|---|---|
Craig | 9232765135 | 923-276-5135 |
Leigh | 843-5544 | 843-5544 |
Richard | (729)556-5135 | 729-556-5135 |
Cindy | 1 (203) 446-5478 | 203-446-5478 |
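Because replace() hands back its input untouched when the pattern finds nothing, a non-matching value such as 843-5544 is easy to overlook in the results. One possible way to flag such values, sketched here with the standard IF and REGEX functions, is to test the pattern first. The "UNMATCHED" marker is a hypothetical placeholder of mine, not something from the original query.

```sparql
SELECT ?phoneNum ?fixedPhone
WHERE {
  VALUES ?phoneNum { "9232765135" "843-5544" }
  # Only rewrite values that match the ten-digit pattern;
  # mark everything else with a placeholder string.
  BIND (IF(REGEX(?phoneNum, ".*(\\d{3}).*(\\d{3}).*(\\d{4}).*"),
           REPLACE(?phoneNum, ".*(\\d{3}).*(\\d{3}).*(\\d{4}).*", "$1-$2-$3"),
           "UNMATCHED") AS ?fixedPhone)
}
```

With these two test values, the first should be normalized to 923-276-5135 and the second should come back as UNMATCHED.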
As the SPARQL query spec tells us, this function corresponds to the XPath fn:replace function. That leads to more documentation, which points to a separate Regular expression syntax section that lists available flags such as i for case-insensitive matching and m for multiline matching.
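Flags go in an optional fourth argument to REPLACE. As a small sketch, the query below strips a "tel:" label regardless of its case; the labeled test strings are made up for illustration and do not appear in the sample data.

```sparql
SELECT ?raw ?cleaned
WHERE {
  # Hypothetical labeled values, differing only in the case of the label.
  VALUES ?raw { "Tel: 203-446-5478" "TEL: 729-556-5135" }
  # The "i" flag lets the lowercase pattern text match "Tel:" and "TEL:".
  BIND (REPLACE(?raw, "^tel:\\s*", "", "i") AS ?cleaned)
}
```

Both rows should end up with just the phone number in ?cleaned.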
Those links ultimately lead to an escape character table in the XML Schema Part 2 specification. This table lists the typical regular expression codes: for example, \s matches white space characters and \d matches a numeric digit. Note that because the \d codes in the SPARQL query above are in a quoted string, the backslash itself needed escaping; that’s why you see two backslashes before each d in my query’s regular expression.
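The same doubling applies to every backslash code, not just \d. As an illustrative sketch (the test string with extra spaces is my own), this query uses \s to collapse runs of white space into single spaces:

```sparql
SELECT ?raw ?collapsed
WHERE {
  VALUES ?raw { "1   (203)   446-5478" }
  # "\\s+" in the query text reaches the regex engine as \s+,
  # which matches one or more white space characters.
  BIND (REPLACE(?raw, "\\s+", " ") AS ?collapsed)
}
```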
The REPLACE function’s ability to find substrings and delete or rearrange them in RDF literal data should be very handy for data cleanup and enhancement. I’m sorry I didn’t notice it before!
Comments? Reply to my tweet (or even better, my Mastodon message) announcing this blog entry.
Excerpt from xkcd comic by Randall Munroe, CC BY-NC 2.5 DEED.