SPARQL queries of the Billboard Hot 100 PART 2
Current and historical data! REVISE THAT
constructforwd.rq works, so if I can’t get the INSERT version to work just import from constructed triples.
REMEMBER THE / AT THE END OF THE schema.org PREFIX DECLARATION
Right how these are just notes that I pasted here so that it would get backed up*
April blog entry should get up to the running of the Python script, loading those triples, and good queries below. But, this could have been done with an RDBMS. We’ll see how to make it a real knowledge graph next week.
For that, constructforwd.rq works. Comment at top shows curl command that makes all the triples linking to wikidata. With those added to the repo, try a query that shows the youngest person to have a hit in a given week, and think of some other things. Who played the guitar, who played blues, who was influenced by Robert Johnson, who is in the Rock and Roll Hall of Fame (last few ideas all came from https://www.wikidata.org/wiki/Q11036)
From the README:
Files in ~/git/billboard-hot-100-rdf/
This works, so I can proceed, although I’d rather use annotation syntax:
PREFIX h1: http://rdfdata.org/hot100# PREFIX schema: http://schema.org PREFIX dc: http://purl.org/dc/elements/1.1/ SELECT * WHERE { ?recording a schema:Recording ; dc:title “Cruel Summer” ; schema:byArtist “Taylor Swift” .
<< ?recording h1:charted ?chartDate >> h1:position ?chartPosition .
}
Calculating the values we didn’t convert from JSON
Chart position last week
PREFIX h1: http://rdfdata.org/hot100# PREFIX schema: http://schema.org PREFIX dc: http://purl.org/dc/elements/1.1/ PREFIX xsd: http://www.w3.org/2001/XMLSchema#
SELECT ?dateLastWeek ?positionLastWeek WHERE {
2. Find out the week before the latest chart
appearance and the position from that week.
BIND (?latestPosition - “P7D”^^xsd:duration AS ?dateLastWeek) « ?recording h1:charted ?dateLastWeek » h1:position ?positionLastWeek . { # 1. Find the date of the latest chart appearance. SELECT ?recording (MAX(?chartDate) AS ?latestPosition) WHERE { ?recording a schema:Recording ; dc:title “Snooze” ; schema:byArtist/rdfs:label “SZA@en” . ?recording h1:charted ?chartDate . } GROUP BY ?recording } }
Who had hits in the most decades?
PREFIX schema: http://schema.org/ PREFIX h1: http://rdfdata.org/hot100#
SELECT ?artist (COUNT(DISTINCT ?decade) AS ?decades) WHERE { ?recording a schema:Recording ; schema:byArtist ?artist ; h1:charted ?chartDate . BIND (SUBSTR(str(?chartDate),1,3) AS ?decade) } GROUP BY ?artist ORDER BY DESC(?decades)
what were the decades in which Little Richard had hits?
After I straighten out SERVICE call part: revise everything up here to make a URI for the artist from the string and give schema:byArtist a range of that.
All this could be done with a relational database, so for a “knowledge graph” angle see if there is a way to add links to each artist’s Wikidata page.
passing artist name to wikidata to get artist URL
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?artistURI WHERE {
BIND ("The Beatles"@en AS ?artistName)
SERVICE <https://query.wikidata.org/sparql> {
# Check if artist a musician (a human with a value for instrument; singers and
# rappers listed with "instrument" of "voice") or a musical group.
{
?artistURI rdfs:label ?artistName;
wdt:P31 wd:Q5 ; # instance of human
wdt:P1303 ?instrument .
}
UNION
{
?artistURI rdfs:label ?artistName ;
wdt:P31 wd:Q215380 . # instance of musical group
}
} # end of SERVICE
}
** json2rdf.py
#!/usr/bin/env python3
EXPLAIN MORE ABOUT IT HERE
run as
./json2rdf.py ../all.json
import json import sys import urllib.parse
if (len(sys.argv) < 2): print(“Enter an input filename as an argument.”) exit()
inputFile = sys.argv[1]
jsonBlock = ""
with open(inputFile) as fp: for line in fp: jsonBlock += line
jsonData = json.loads(jsonBlock)
print(’@prefix h1: http://rdfdata.org/hot100# .’) print(’@prefix schema: http://schema.org/ .’) print(’@prefix dc: http://purl.org/dc/elements/1.1/ .’) print(’@prefix xsd: http://www.w3.org/2001/XMLSchema# .’) print()
for week in jsonData:
chartDate = week["date"]
for recording in week["data"]:
artist =recording["artist"]
artist = artist.replace('"','\\"')
song = recording["song"]
song = song.replace('"','\\"')
# ID of song is artist + song because two different songs can have
# the same title, e.g. Taylor Swift's and Banarama's "Cruel Summer".
artistSong = artist + song
# Lose characters that screw up URI.
for c in ' &/.\"\'':
artistSong = artistSong.replace(c,'')
print("h1:" + urllib.parse.quote(artistSong) + " a schema:Recording;")
print(' schema:byArtist ' + '"' + artist + '"@en;')
print(' dc:title ' + '"' + song + '";')
print(' h1:charted ' + '"' + chartDate + '"^^xsd:date {| ')
print(' h1:position ' + str(recording["this_week"]))
print('|}.')
adding in links to wikidata
Not as Python, but as an INSERT update query with a SERVICE call to Wikidata. First do a CONSTRUCT and a COUNT to see what % of the Hot 100 this works for. With a URI like http://www.wikidata.org/entity/Q194220 representing the artist, what triple do I add? https://schema.org/sameAs mentions Wikidata entries as possibilities. Make it clear that unlike owl sameAs, this is a link to follow and not for inferencing. But why not do that? Admit that there will be mistakes.
The following works but only for 1000 artists. Maybe there is some GraphDB limit I can reset. See the two commented lines near the top of it.
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX schema: <http://schema.org/>
# Can I do anything with the query so that I don't need the DISTINCT keyword?
# There are 10,773 artist names. Can I do this for all of them? Reset some GraphDB limit?
SELECT DISTINCT ?artistName ?artistURI WHERE {
#BIND ("Ricky Nelson"@en AS ?artistName)
?s schema:byArtist ?artistName .
SERVICE <https://query.wikidata.org/sparql> {
# Check if artist a musician (a human with a value for instrument; singers and
# rappers listed with "instrument" of "voice") or a musical group.
{
?artistURI rdfs:label ?artistName;
wdt:P31 wd:Q5 ; # instance of human
wdt:P1303 ?instrument .
}
UNION
{
?artistURI rdfs:label ?artistName ;
wdt:P31 wd:Q215380 . # instance of musical group
}
} # end of SERVICE
}
- more readme notes
To point to a Wikidata resource, I used the LoC’s http://www.loc.gov/mads/rdf/v1#hasCloseExternalAuthority (see https://id.loc.gov/ontologies/madsrdf/v1.html and https://id.loc.gov/authorities/names/n81127048.html because they use it for the same thing.
Turn https://github.com/mhollingshead/billboard-hot-100 into SPARQL and query it. Maybe turn it into JSON-LD as an experiment in using that. Have a cron job on snee pull it down and put an RDF version on bobdc.com where I can pull it down. Or make it a local cron job that just pulls from there and loads the RDF into a repo. Model: “Cruel Summer” charted 2024-02-17; rdf-star about that triple has all the other data. But, “Cruel Summer” isn’t a good ID because there has been another hit with the same title: https://en.wikipedia.org/wiki/Cruel_Summer_(Bananarama_song). So for ID http://whatever/Taylor-Swift-Cruel-Summer ? rdf:type of https://schema.org/MusicRecording. byArtist is a property. Actually, the ID would be better as an MD5() of “Taylor SwiftCruel Summer”; the 33 chars it returns is the shortest of the checksum functions. If I leave out last_week, peak_position, and weeks_on_chart can I should be able to calculate those, so try just storing :charted value of the date and the this_week value.
GraphDB imported all.ttl in 16 seconds.
This works, so I can proceed, although I’d rather use annotation syntax:
PREFIX h1: http://rdfdata.org/hot100# PREFIX schema: http://schema.org/ PREFIX dc: http://purl.org/dc/elements/1.1/ SELECT * WHERE { ?recording a schema:Recording ; dc:title “Cruel Summer” ; schema:byArtist “Taylor Swift” .
<< ?recording h1:charted ?chartDate >> h1:position ?chartPosition .
}
Calculating the values we didn’t convert from JSON
Weeks on chart
PREFIX h1: http://rdfdata.org/hot100# PREFIX schema: http://schema.org/ PREFIX dc: http://purl.org/dc/elements/1.1/
SELECT (COUNT(?chartPosition) AS ?weeksOnChart) WHERE { ?recording a schema:Recording ; dc:title “Cruel Summer” ; schema:byArtist “Taylor Swift” .
<< ?recording h1:charted ?chartDate >> h1:position ?chartPosition .
}
Highest chart position
Change SELECT line on the last one to
SELECT (MIN(?chartPosition) AS ?highestPosition) WHERE {
Chart position last week
PREFIX h1: http://rdfdata.org/hot100# PREFIX schema: http://schema.org/ PREFIX dc: http://purl.org/dc/elements/1.1/ PREFIX xsd: http://www.w3.org/2001/XMLSchema#
SELECT ?dateLastWeek ?positionLastWeek WHERE {
2. Find out the week before the latest chart
apperance and the position from that week.
BIND (?latestPosition - “P7D”^^xsd:duration AS ?dateLastWeek) « ?recording h1:charted ?dateLastWeek » h1:position ?positionLastWeek . { # 1. Find the date of the latest chart appearance. SELECT ?recording (MAX(?chartDate) AS ?latestPosition) WHERE { ?recording a schema:Recording ; dc:title “Snooze” ; schema:byArtist “SZA” . ?recording h1:charted ?chartDate . } GROUP BY ?recording } }
All this could be done with a relational database, so for a “knowledge graph” angle see if there is a way to add links to each artist’s Wikidata page.
Checking on their Wikidata pages – This works for Keith: curl https://query.wikidata.org/bigdata/namespace/wdq/sparql?query=DESCRIBE%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2FQ189599%3E But it’s asking with his Wikidata ID. I want to ask with the name if there is a page for that person.
Share this post