Converting sqlite browser cookies to Turtle and querying them with SPARQL
Because you have more SQLite data than you realized.
There is a reasonable chance that you’ve never heard of SQLite and are unaware that this database management program and many database files in its format may be stored on all of your computing devices. Firefox and Chrome in particular use it to keep track of your cookies and, as I’ve recently learned, many other things. Of course I want to query all that data with SPARQL, so I wrote some short simple scripts to convert these tables of data to Turtle.
From a Linux, Windows, or MacOS command prompt (or from the prompt that the excellent termux app adds to Android phones), type sqlite3
to get to the SQLite prompt. If you enter sqlite3 someFileName
it opens that file if it’s an SQLite database or creates one with that name if it doesn’t exist. From the SQLite prompt, the .quit
command quits SQLite, .tables
lists tables, and .help
tells you about the other commands. Other than that, at the prompt you can enter the typical SQL commands to create tables, insert data into them, as well as to query, update, and delete data. (I did a blog entry titled My SQL quick reference several years ago and have since contributed an updated version of it to the excellent Learn X in Y Minutes site.)
A search of my hard disk found dozens and dozens of files whose names ended with .sqlite
. I believe it’s an older convention to end SQLite database filenames with .db
, and I had some Chrome and Firefox files with that as well. The ~/.config/google-chrome/Default
directory had many files that didn’t have .sqlite
or .db
extensions but still turned out to be SQLite files.
SQLite can execute a series of commands stored in a script. For example, my tableList.scr
file has just these two lines,
.tables
.quit
and from my operating system command line I can quickly list the tables in the cookies.sqlite
database file that I found in a Firefox directory with this command line:
sqlite3 cookies.sqlite < tableList.scr
The result shows that cookies.sqlite
has just one table: moz_cookies
.
Once I know what tables are an SQLite file, my sqliteToTSV.sh
script pulls the data from a named table within that file and saves it as a Turtle file so that I can query it with SPARQL. (You can find all of the scripts and queries that I wrote for this in github.) If you pass the database filename and table name to this shell script like this,
sqliteToTSV.sh cookies.sqlite moz_cookies
it first creates an SQLite script that executes an SQL SELECT command to save everything in that table to a tab-separated value file. It then runs a Perl script that converts the TSV file to Turtle. (This shell script should work fine as a Windows batch file with minimal modifications.)
The Perl script would be especially short if I hadn’t found escaped JSON data in some SQLite column values and binary data in others, so I had the script check for those and just output stub labels instead of trying to do anything useful with them. (Note that for all the SQLite files that I played with, I actually played with copies in a new directory, not the originals created by applications such as the browsers.)
The remaining ASCII data still offers plenty of interesting things to look at. My files in the git repo include an ffCookiesHosts.rq
SPARQL query that counts how many Firefox cookies come from each base domain and outputs a list sorted by the counts in descending order. Here are the first few lines of output:
-------------------------------------
| baseDomain | cookies |
=====================================
| "google.com" | 57 |
| "pubmatic.com" | 41 |
| "tremorhub.com" | 32 |
| "verizon.com" | 29 |
| "cnn.com" | 22 |
| "verizonwireless.com" | 21 |
| "nfl.com" | 19 |
I’m not a big NFL guy, but I do remember that when having some Internet trouble the technician and I were using that site to check connectivity. The big surprises for me were the high scores of two names that I didn’t recognize: pubmatic.com and tremorhub.com. The tremorhub.com domain name redirects to telaria.com, some company that manages “premium” video advertising, which sounds just like the kind of company that would dump cookies on your hard disk without telling you. The pubmatic.com site is about “monetization” of “content”, so they too look like a cookie-dumping ad tech firm.
The googleCookiesHosts.rq
query in the git repo performs a similar query on data from the cookies
table in the ~/.config/google-chrome/Default/Cookies
SQLite database. Its output listed rubiconproject.com as a leading cookie depositor along with sites that I actually visit often; they’re another ad tech firm that I haven’t heard of but has clearly been dumping plenty of cookies onto my hard disk.
I started looking into this so that I could do SPARQL queries about these deposited cookies and it was interesting to see how many other kinds of SQLite files I had. That same google-chrome/Default
directory has a History
SQLite database file with 11 tables, including keyword_search_terms
and visits
. (Not all the files in that directory are SQLite files, so the lack of file extensions to indicate which ones are SQLite files is a bit annoying.) After conversion to Turtle, the keyword_search_terms
data had triples like this, showing that it had stored my search terms in both the entered case and in lower case:
[
m:keyword_id "2" ;
m:url_id "18899" ;
m:lower_term "coca y sus exploradores lo añoro" ;
m:term "Coca y Sus Exploradores Lo Añoro" ] .
Thunderbird, Skype, and even iPython have also deposited SQLite files on my Ubuntu laptop’s hard disk.
If I was writing a script to use in an application that used that keyword_search_terms
data, then instead of representing the values with blank nodes, I’d probably give the triples above a subject that built on that m:url_id
value. When converting SQL or CSV or other tabular data to Turtle before I’ve usually generated URIs to be the subjects; I finally realized that doing it as blank nodes with square brackets, like the example above, is a nice clean way to represent a row from tabular data and a little less trouble.
One note about date formats: the Google cookies (and maybe more SQLite files) store dates in a strange format that I could not work out how to convert to proper ISO 8601 format in my Perl script. I found an explanation on Stack Overflow of how to convert the date formats as part of an SQL SELECT statement that retrieves the data. You could even use the same logic in SQLite to convert the dates within (a copy of!) the database file itself so that my generic Turtle extraction script would pull out more readable dates. For example, the following example of an SQL UPDATE command does this to the lastAccessed
column of a moz_cookies
table:
UPDATE moz_cookies SET lastAccessed = datetime(lastAccessed / 1000000, 'unixepoch');
Overall, it’s cool to see how much data is spread around our hard disks using SQLite so that, after some simple scripting, we can explore it with SPARQL.
Share this post