Your own free, publicly available SPARQL endpoint
Free as in tier.
There are a few tutorials out there about how to start up your own free-tier Amazon Web Services (AWS) Elastic Compute Cloud (EC2) instance and then run your own publicly available web server. I’ve planned for a while to try this with a Jena Fuseki triplestore and SPARQL endpoint, but I postponed it because I thought it might be complicated. It turned out to be pretty easy.
EC2 Exercise 1.1: Host a Static Webpage by Kerry Sheldon is an example of one of the tutorials described above, and it was a good starting point for putting up an Apache web server. Because AWS now has a “new launch experience” I couldn’t follow her 2018 instructions exactly, but my first few instructions below are based on hers.
Tell AWS you want to launch an instance
If you don’t have an AWS account, create one. Then log in and pick EC2 on the AWS Console and “Launch Instance” from the orange “Launch Instance” button’s dropdown menu.
Configure and launch the instance
The older version of this “experience” was more of a wizard leading you through various small screens to fill out. The current version has one big screen where you fill out these details:
-
Add something to the “Name” field like “Fuseki server”.
-
Pick from the “Application and OS Images” selection. This includes a field where you can search from many choices or, under that, you can pick one of the Quick Start choices. I clicked the blue Amazon Linux AWS Quick Start category and then, under that, picked the first choice: “Amazon Linux 2 AMI (HVM) - Kernel 5.10, SSD Volume Type Free tier eligible”. Scrolling down that list you can see more machine-learning-oriented images with additional features such as GPUs and PyTorch. This is one of those places where you have to be careful to pick something that will cost you little or nothing, and it’s up to you to keep track of that. (After all my experiments with this project so far, as I write the first draft of this blog entry the AWS billing management screen says that I currently owe them 20 cents.) I went with the first free tier choice mentioned above.
-
Under that is the “Instance type”. I selected the first choice there, “t2.micro” which is also Free tier eligible. Again, it’s up to you to make the choice that will cost you little or nothing, and some of the choices can be expensive.
-
Under that, create or select a Key Pair: a public and private key combination that will let you log in to your new instance from your local machine. If you are an AWS user and have an existing one you can pick it from the dropdown list there. If you don’t have one, click “Create new key pair”, give it a name such as fuseki-key-pair, leave the other settings at their default, and click the orange “Create key pair” button. It will create one with a name like fuseki-key-pair.pem that your browser downloads. Save that (a typical destination would be the .ssh subdirectory of your home directory) and remember where you saved it for later.
-
Moving down the instance configuration page, the next box to fill out is “Network settings”. “Allow SSH traffic from Anywhere” is checked as a default, meaning that anyone can use the ssh utility to attempt shell access to your instance from anywhere on the Internet. (Shell access will also need the key file that you downloaded in the previous step, so that’s a somewhat decent level of security. As with the potential costs, it’s up to you to research other configurations if that’s what you need.) Add checks to the “Allow HTTPS traffic” and “Allow HTTP traffic” checkboxes so that browsers and other tools can send HTTP requests to your web server or Fuseki SPARQL endpoint.
-
Scroll around to see the other things you can set, leave them at their default for this exercise, and click the orange “Launch instance” button. After a few seconds you should see a screen that says “Success” with an orange “View all instances” button in the lower right. Click that to display the Instances list.
Review your running instance and start a terminal session with it
Sometimes, when doing this, I didn’t see my new instance right away. If this happens to you, wait a minute, reload your browser, and you should eventually see it. The instances list will show that the “Instance state” of your new instance is already “Running”.
Click the checkbox to the left of your instance on the instances list. From the “Instance state” dropdown at the top you will see that this is the place to Stop, Start, and Terminate the instance, along with a few other options.
The tabs below the instance list let you do further configuration of the checked instance. The Security tab shows “Inbound rules” that allow inbound traffic on port 22 for SSH, 80 for HTTP, and 443 for HTTPS.
Fuseki uses port 3030 as a default, so add a rule for that: on the Security tab under “Security groups” click the Security group name of sg-long-hex-number and then under Inbound rules click “Edit Inbound Rules”. Click “Add rule” to create a new one with a “Port range” of 3030. Set the sixth column to 0.0.0.0/0 like the others by picking “Anywhere-IPv4” from the fifth column’s dropdown. Leave the Type value at “Custom TCP” and click “Save rules” at the bottom.
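The same inbound rule can probably be added from a command line instead of the console. The following is only a sketch: it assumes that you have the AWS CLI installed and configured with credentials, which nothing else in this walkthrough requires, and open_fuseki_port is a made-up helper name.

```shell
# Sketch only: assumes the AWS CLI is installed and configured with
# credentials; the walkthrough itself uses the web console instead.
open_fuseki_port() {
  # $1: security group ID (sg-...), $2: TCP port, defaulting to Fuseki's 3030
  aws ec2 authorize-security-group-ingress \
      --group-id "$1" \
      --protocol tcp \
      --port "${2:-3030}" \
      --cidr 0.0.0.0/0
}

# Example call, with a placeholder group ID:
# open_fuseki_port sg-long-hex-number
```

The 0.0.0.0/0 value matches the “Anywhere-IPv4” choice in the console, so the same caution about exposing the port to everyone applies.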
Now your instance is all set up. Pick “Instances” under “Instances” (yes, a bit confusing) on the left to return to your Instances list, go back to the Details tab to the left of your new instance’s Security tab, and copy the Public IPv4 address into your clipboard. I will use 12.345.678.90 in my examples below, so substitute yours for that. There are ways to map these IP addresses to registered domain names, but for this exercise, that address will be your server’s name when you use ssh or a web browser to do anything with it.
Before you log in to your new machine you will need to reset the permissions on the pem file that you downloaded earlier to something acceptable to your ssh utility, because the default permissions after downloading are too permissive. Enter the following, adjusting the path as necessary for the file you downloaded:
chmod 400 ~/.ssh/fuseki-key-pair.pem
(Windows users will have some other command to use instead of chmod, and also may be using PuTTY instead of ssh. I’m not sure of the exact Windows syntax to do these tasks, but they shouldn’t be difficult to find out.)
In a shell window on your local computer, enter the following command, substituting the IPv4 address that you copied above and pointing the -i parameter to the file that you downloaded earlier:
ssh -i ~/.ssh/fuseki-key-pair.pem ec2-user@12.345.678.90
A prompt will ask if you are sure you want to continue, so answer yes, and then you’ll be logged in to your new instance as it waits for you to tell it what to do:
       __|  __|_  )
       _|  (     /   Amazon Linux 2 AMI
      ___|\___|___|
https://aws.amazon.com/amazon-linux-2/
[ec2-user@ip-987-65-4-321 ~]$
Download and unzip the Jena software
You will need the software for the Fuseki server itself and also the Jena tools that let you load data into that server and work with that data. (I described some of those tools in the Working with Fuseki datasets from the command line section of my blog post Hidden gems included with Jena’s command line utilities.)
After visiting the Jena download page to find the URLs of these distribution files I executed these commands at the EC2 prompt to retrieve the files to the current directory:
wget https://dlcdn.apache.org/jena/binaries/apache-jena-fuseki-4.6.1.zip
wget https://dlcdn.apache.org/jena/binaries/apache-jena-4.6.1.zip
(If this posting that you are reading is more than a few months old you’ll want to check the download page yourself to get more recent versions of these files.)
Unzip the two files you downloaded. (For demo purposes, you can just do it in the ec2-user home directory where you landed after logging in. For a more serious production system you would want to create some directories to organize all this better.)
Install Java
Jena is a Java-based tool, and the default version of this EC2 instance doesn’t have Java, so you have to add it. I found the x64 RPM Package URL on https://www.oracle.com/java/technologies/downloads/. The next two commands pull that package into the EC2 instance and then install it there:
wget https://download.oracle.com/java/19/latest/jdk-19_linux-x64_bin.rpm
sudo yum localinstall jdk-19_linux-x64_bin.rpm
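Before moving on, it may be worth confirming that the install put a usable java on the path. Here is a minimal sketch; java_major is a hypothetical helper, and the threshold of 11 reflects the minimum Java version that Jena 4.x releases need.

```shell
# Print the major version found in the first line of `java -version` output.
java_major() {
  # $1: a line such as: java version "19.0.1" 2022-10-18
  _v=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p')
  case "$_v" in
    1.*) _v=${_v#1.}; printf '%s\n' "${_v%%.*}" ;;  # legacy scheme: 1.8.0_352 -> 8
    *)   printf '%s\n' "${_v%%[.-]*}" ;;             # current scheme: 19.0.1 -> 19
  esac
}

# Report the installed version, if any. Jena 4.x needs Java 11 or later.
if command -v java >/dev/null 2>&1; then
  major=$(java_major "$(java -version 2>&1 | head -n 1)")
  if [ "${major:-0}" -ge 11 ]; then
    echo "Java $major found: new enough for Jena 4.x"
  else
    echo "Java ${major:-unknown} found: Jena 4.x needs 11 or later"
  fi
else
  echo "java is not on the PATH"
fi
```

With the JDK 19 package installed as described above, this should report Java 19.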
Try the Fuseki server
Change into the directory with the Fuseki binary (created by unzipping above) and see if it responds to a simple command:
cd apache-jena-fuseki-4.6.1/
./fuseki-server --help
If you see the help information, that means that you installed Fuseki and Java correctly.
Now let’s start it up for real:
./fuseki-server
Give it a few seconds until the status messages stop scrolling and then send a browser to port 3030 of the Public IPv4 address you saved earlier. Your URL will be something like http://12.345.678.90:3030/.
You should see the main Apache Jena Fuseki management screen, with the message “No datasets created - add one”. Don’t bother to click on “add one”, because this server doesn’t have permission to write to your new instance’s disk storage, even if you had started fuseki-server with its --update switch. We will load data using the Jena tools.
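If the browser shows nothing, a command-line probe from your local machine can help tell a network problem (such as a missing port 3030 rule) from a server problem. This is a small sketch; http_status is a hypothetical helper name, and the address is the placeholder used throughout this post.

```shell
# Hypothetical helper: print only the HTTP status code that a URL returns.
http_status() {
  # $1: URL to probe; gives up after 10 seconds
  curl --silent --output /dev/null --max-time 10 \
       --write-out '%{http_code}\n' "$1"
}

# Example call, using the placeholder address from this post:
# http_status http://12.345.678.90:3030/
```

While Fuseki is running and the port is open, this should print 200; curl prints 000 when it cannot connect at all.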
Create an empty dataset for the triples that you will load
In the shell window where you started up Fuseki, press ^C to shut it down, because the command line tools that you’re about to use don’t work with a server that is up and running. Change back to your home directory and, as a sample data set to load, get the data file I created for SPARQL queries of Beatles recording sessions. With this data loaded in Fuseki, people will be able to query its endpoint about who played what instruments on which Beatles recordings:
cd
wget https://bobdc.com/miscfiles/BeatlesMusicians.ttl
To tell Fuseki the named dataset on the Fuseki server where you want to load your data, you need to identify the assembler file for that dataset. Your new Fuseki instance has no datasets or assembler files, so how can we create them?
As I explained in the introduction to Working with Fuseki datasets from the command line, instead of learning the syntax of these files I found that I could just create one with the web interface to a Fuseki server running on my local machine, as long as I started it up with the --update switch so that the web interface would have write permission. For that one, I called the dataset that I created dataset2, and Fuseki put the assembler file into ~/apache-jena-fuseki/run/configuration/dataset2.ttl on my local machine. I put a copy of that dataset2.ttl file on my blog’s server so that I could wget it to my EC2 instance. (I could have also sftp’d it from my local machine to the EC2 instance, but this way it’s available to others who want to try the same thing.)
From your EC2 shell’s home directory, execute the following to change into the directory where assembler files get stored, get a copy of the assembler file mentioned above, and rename it for the Beatles data:
cd apache-jena-fuseki-4.6.1/run/configuration
wget https://bobdc.com/miscfiles/dataset2.ttl
mv dataset2.ttl beatlesSessions.ttl
Next, you need to edit it for your new dataset. The vi and nano editors are included with this Amazon Linux 2 image, but I need my emacs, so I installed it:
sudo yum install emacs
Open up beatlesSessions.ttl with your editor. Near the bottom you will see some triples that look like this:
:tdb_dataset_readwrite
rdf:type tdb2:DatasetTDB2 ;
tdb2:location "/home/bob/bin/apache-jena-fuseki/run/databases/dataset2" .
(Isn’t it nice that the configuration file for this triplestore stores everything as triples?) Change that tdb2:location value to “/home/ec2-user/apache-jena-fuseki-4.6.1/run/databases/dataset2/beatlesSessions”, then replace every “dataset2” in the file with “BeatlesSessions” (including the one in the pathname that you just entered), save the file, and quit out of your editor.
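If you would rather script those two edits than make them in an editor, something like the following sed sketch should do it. It follows the steps above literally, so the resulting location ends in run/databases/BeatlesSessions/beatlesSessions. The function name retarget_assembler is made up for this example, and it assumes the file still contains the original /home/bob path shown above.

```shell
# Rewrite an assembler file instead of hand-editing it: first point
# tdb2:location at the EC2 path, then rename dataset2 everywhere.
retarget_assembler() {
  # $1: path to the assembler file; the result is written to stdout
  sed -e 's|"/home/bob/bin/apache-jena-fuseki/run/databases/dataset2"|"/home/ec2-user/apache-jena-fuseki-4.6.1/run/databases/dataset2/beatlesSessions"|' \
      -e 's/dataset2/BeatlesSessions/g' "$1"
}

# Example call, from run/configuration (writes a new file, then swaps it in):
# retarget_assembler beatlesSessions.ttl > tmp.ttl && mv tmp.ttl beatlesSessions.ttl
```

Writing to a temporary file first keeps the original intact if the sed command fails.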
Now that you’ve created this empty dataset for the server, let’s make sure that Fuseki recognizes it before we load any data. Change into the Fuseki directory and start up the Fuseki server again:
cd ~/apache-jena-fuseki-4.6.1/
./fuseki-server
After the startup status messages stop scrolling, send your browser to the same IP address you did before. You should see /BeatlesSessions listed as an available dataset. If you like, you can click the “query” action and run the default query, which asks for ten triples. (Click the dark gray triangle to the right of the query to actually execute it.) It won’t get any data, but it shouldn’t show an error, either, so you know that the query engine works with this dataset.
Load some triples into the new dataset
At the shell window, press ^C to end the server session and go back to the command prompt. The following two commands return to the home directory and, before loading data with Jena’s tdbloader tool, use the riot tool to verify that the data file we’re about to load doesn’t have any syntax problems, because data load time is not a good time to find out about such problems:
cd
./apache-jena-4.6.1/bin/riot --validate BeatlesMusicians.ttl
You shouldn’t see any error messages.
Next, load that data into your new dataset by pointing the tdb2.tdbloader command line tool at the data file and at the dataset’s assembler file (this is a single long command that I split up to show here, but pasting it as shown worked for me):
./apache-jena-4.6.1/bin/tdb2.tdbloader --tdb \
./apache-jena-fuseki-4.6.1/run/configuration/beatlesSessions.ttl \
BeatlesMusicians.ttl
(Read more about riot, tdbloader, and their companion utilities at Working with Fuseki datasets from the command line. These will let you edit and perform other maintenance on the data loaded in Fuseki.)
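The validate and load steps can be chained so that the load only runs when validation passes. This is a sketch rather than a tested recipe: it assumes that riot returns a non-zero exit status when it finds syntax errors, that both zip files were unzipped in the home directory, and validate_then_load and JENA_BIN are names invented for the example.

```shell
# Assumed layout: the Jena tools unzipped in the home directory.
JENA_BIN="${JENA_BIN:-$HOME/apache-jena-4.6.1/bin}"

# Load a data file only if riot finds no syntax problems in it first.
validate_then_load() {
  # $1: assembler file for the target dataset, $2: RDF data file
  if "$JENA_BIN/riot" --validate "$2"; then
    "$JENA_BIN/tdb2.tdbloader" --tdb "$1" "$2"
  else
    echo "Not loading $2: riot reported problems" >&2
    return 1
  fi
}

# Example call, run from the home directory with Fuseki stopped:
# validate_then_load \
#   apache-jena-fuseki-4.6.1/run/configuration/beatlesSessions.ttl \
#   BeatlesMusicians.ttl
```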
Query the data
Start up the server again:
cd ~/apache-jena-fuseki-4.6.1/
./fuseki-server
Run that default query again, and this time you should see ten triples about the Beatles’ recording sessions.
Let’s try a more interesting query. Paul was known as the bass player but sometimes added guitar solos. On which songs? Paste the following into that query screen and run it to find out:
PREFIX s: <http://learningsparql.com/ns/schema/>
PREFIX i: <http://learningsparql.com/ns/instrument/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX m: <http://learningsparql.com/ns/musician/>
SELECT ?title WHERE {
?song a s:Song ;
i:leadguitar m:PaulMcCartney .
?song rdfs:label ?title .
}
You will see a surprising number of songs where he played lead guitar. (Optional step: check out his amazing solo on “Good Morning Good Morning”. Be sure to wait for the last lick, after John sings “it’s time for tea and meet the wife”.)
Remember, what you see in your browser is not a SPARQL endpoint, but the HTML interface to one. There’s an important difference. To really test this as a SPARQL endpoint, paste the query above into a file on your local machine (or any machine with web access) called paulquery.rq and then enter the following at the machine’s command prompt, substituting the IPv4 address that you copied above into the URL:
curl --data-urlencode "query@paulquery.rq" \
http://12.345.678.90:3030/BeatlesSessions/sparql
It should display a JSON version of the query results. (You can learn how to customize this behavior in my blog posting Curling SPARQL.)
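As one example of that kind of customization, an HTTP Accept header asks the endpoint for a different result format. The sketch below wraps the curl call in a hypothetical run_query helper that defaults to CSV; text/csv is one of the standard SPARQL 1.1 result media types, but I am assuming here that your endpoint honors it.

```shell
# Hypothetical helper: POST a query file to a SPARQL endpoint and ask for a
# specific result format through the HTTP Accept header.
run_query() {
  # $1: endpoint URL, $2: query file, $3: media type (defaults to CSV)
  curl --silent --show-error \
       --header "Accept: ${3:-text/csv}" \
       --data-urlencode "query@$2" \
       "$1"
}

# Example calls, using the placeholder address from this post:
# run_query http://12.345.678.90:3030/BeatlesSessions/sparql paulquery.rq
# run_query http://12.345.678.90:3030/BeatlesSessions/sparql paulquery.rq \
#   application/sparql-results+xml
```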
Your own SPARQL web server
If it works with curl, it will work with all kinds of other tools, letting those applications take advantage of the data you provide over your new SPARQL endpoint. A few more points:
-
Be careful in the options you pick when setting this up, because some can get expensive. I copied this from one of the setup pages: “Free tier: In your first year includes 750 hours of t2.micro (or t3.micro in the Regions in which t2.micro is unavailable) instance usage on free tier AMIs per month, 30 GiB of EBS storage, 2 million IOs, 1 GB of snapshots, and 100 GB of bandwidth to the internet.” The Amazon EC2 T2 Instances page says that a t2.micro instance costs $0.0116 per hour, which works out to about $1.95 per week. Of course, if you want to scale way up and host a ton of data on a faster instance, the more expensive options are available.
-
That being said, forgetting about it for a year and then owing AWS a hundred bucks would be no fun. Remember to stop your instance when the time is right and to check the billing management screen every now and then.
-
The EC2 Exercise 1.1: Host a Static Webpage article mentioned above explains how to add a regular Apache web server to your EC2 instance so that you can host static web pages from your new EC2 instance.
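The cost figures in the list above come from simple multiplication of the hourly rate by hours of uptime, which is easy to redo for other rates or periods. A minimal sketch, with instance_cost as a made-up helper name and the $0.0116 rate taken from the page quoted above:

```shell
# Cost of running an instance for a number of hours at a given hourly rate.
instance_cost() {
  # $1: hourly rate in dollars, $2: hours of uptime
  awk -v rate="$1" -v hours="$2" 'BEGIN { printf "%.2f\n", rate * hours }'
}

instance_cost 0.0116 $((24 * 7))     # one week of continuous uptime
instance_cost 0.0116 $((24 * 365))   # one year of continuous uptime
```

These print 1.95 and 101.62, which is where the “hundred bucks” for a forgotten year comes from once any free tier allowance no longer applies.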
The most important thing is that you can use some robust open source software to create a SPARQL endpoint that costs practically nothing and is available to everyone on the Internet. That provides some big opportunities for standards-based data publishing.
Comments? Reply to my tweet announcing this blog entry.