Difference between revisions of "Lab: SPARQL"

From Info216
(72 intermediate revisions by 6 users not shown)
Line 1: Line 1:
=Lab 4: SPARQL / Blazegraph=
 
 
 
==Topics==
* Setting up the Blazegraph graph database. Previously we have only stored our triples in memory, which is not persistent.  
+
* Setting up the Blazegraph graph database.
* SPARQL queries and updates. We use SPARQL to retrieve or update triples in our databases/graphs of triples.
+
* SPARQL queries and updates.
  
==Tasks==
+
==Useful materials==
 +
Blazegraph homepage:
 +
* [https://blazegraph.com/ Welcome to Blazegraph]
 +
* [https://github.com/blazegraph/database/wiki Blazegraph wiki]
  
==Installing the Blazegraph database on your own computer==
+
SPARQL reference:
Download Blazegraph (blazegraph.jar) from here: [https://blazegraph.com/ https://blazegraph.com/]
+
* [https://www.w3.org/TR/sparql11-query/ SPARQL Query Documentation]
I recommend placing blazegraph.jar in the same folder of your python project for the labs.
+
* [http://www.w3.org/TR/sparql11-update/ SPARQL Update Documentation]
Navigate to the folder of blazegraph.jar in your commandline/terminal using cd. (cd C:\Users\Martin\PycharmProjects\info216_labs for me as an example). Now run this command:
+
* [https://en.wikibooks.org/wiki/SPARQL/Expressions_and_Functions SPARQL Expressions and Functions]
<syntaxhighlight>
 
java -server -Xmx4g -jar blazegraph.jar
 
</syntaxhighlight>
 
You might have to install java 8 64-bit JDK if you have problems running blazegraph. You can do it from  this link:
 
"https://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html"
 
If you get an "Address already in use" error, try to run this command instead: "java -server -Xmx4g -Djetty.port=19999 -jar blazegraph.jar". This changes the port of the blazegraph server.
 
If you have trouble installing Blazegraph you can use this link for now: "i2s.uib.no:8888/bigdata/#splash".
 
This is the same Blazegraph interface, but it is stored in the cloud and can only be used on the UiB network.
 
  
 +
==Tasks==
 +
===Running Blazegraph===
 +
You can either run Blazegraph locally on your own machine (best) or online on a shared server at UiB (also ok).
  
If it works it should now display an url like: "http://10.0.0.13:9999/blazegraph/". Open this in a browser.  
+
'''Installing the Blazegraph database on your own computer:'''
You can now run SPARQL queries and updates and load RDF graphs from your file into Blazegraph.
+
* Download the [https://github.com/blazegraph/database/releases/tag/BLAZEGRAPH_2_1_6_RC Blazegraph 2.1.6 Release Candidate (the file ''blazegraph.jar'')]. You can place ''blazegraph.jar'' in your INFO216 exercises folder.
In the update tab, load RDF data (select type below) and then paste the contents of your turtle/.txt file to add them all at once to the database. If you have not serialized your graph from lab 2 yet, you can use the triples on the bottom of the page instead. Just copy and paste them into the Update section.
+
* Go to the folder where you saved ''blazegraph.jar'' in your command/terminal window using ''cd'' (for example, ''cd C:\Users\marti\info216'').
 +
* Start Blazegraph:
 +
java -server -Xmx4g -jar blazegraph.jar
 +
** You might have to [https://www.oracle.com/technetwork/java/javase/downloads/ install a 64-bit Java Development Kit (JDK)] if you have problems running Blazegraph.
 +
** If you get an "Address already in use" error, this is likely because Blazegraph has been terminated improperly. Either restart the command/terminal window or try to change the port of the Blazegraph server with this command:
 +
java -server -Xmx4g -Djetty.port=19999 -jar blazegraph.jar
 +
* When everything works, Blazegraph will print out something like:
 +
Welcome to the Blazegraph(tm) Database.
 +
 +
Go to http://10.112.161.87:9999/blazegraph/ to get started.
 +
* Open the URI on the previous line in a web browser to access Blazegraph's web interface (the address will most likely be different from this example).
  
 +
'''Running Blazegraph online:'''
 +
If you have trouble installing Blazegraph, you can use [http://sandbox.i2s.uib.no/bigdata/ a shared online server] for now. It provides the same Blazegraph interface, but runs in the cloud and can only be used from inside the UiB network. (If you are outside the UiB campus, you can connect through the [https://hjelp.uib.no/tas/public/ssp/content/detail/service?unid=a566dafec92a4d35bba974f0733f3663 UiB VPN] first.) Note that there is no authentication or authorisation: ''all the data you upload to the cloud server will be visible to - and can be changed by - anyone inside the UiB network.''
  
Write the following SPARQL queries:
+
'''Using Blazegraph:'''
 +
* ''Creating a namespace:'' In the Blazegraph interface, you may go to the ''UPDATE'' tab and create a new namespace using default values and the ''Create namespace'' button.
 +
** You '''must''' do this if you use the shared online server to keep your own graph(s) separate.
 +
** You can also do this on your own (local) server to keep your graphs separate.
 +
** If you do not create a namespace, the default will be '''kb'''.
 +
** Note that Blazegraph namespaces have nothing to do with namespaces in rdflib or in Turtle or other RDF serialisations.
 +
* ''Uploading data:'' In the Blazegraph interface, go to the ''UPDATE'' tab and use the ''Browse...'' and ''Update'' buttons to load the file into Blazegraph.
 +
** You can use the data in the Turtle file [[File:russia_investigation_kg.txt]]. Make sure you save it with the correct extension, as ''russia_investigation_kg.ttl'' (not ''.txt'').
 +
** You can also use the Turtle file you saved after exercises 1 and 2.
 +
* ''Querying and updating:'' In the Blazegraph interface, go to the ''QUERY'' and ''UPDATE'' tabs to enter queries and updates.
  
* SELECT all triples in your graph.
+
===SPARQL tasks===
* SELECT all the interests of Cade.
 
* SELECT the city and country of where Emma lives.
 
* SELECT only people who are older than 26.
 
* SELECT Everyone who graduated with a Bachelor Degree.
 
  
Use SPARQL Update's DELETE DATA to delete the fact that Cade is interested in Photography. Run your SPARQL query again to check that the graph has changed.
+
'''Task:'''
 +
Using the data in ''russia_investigation_kg.ttl'', write the following SPARQL SELECT queries.
 +
([[Russian investigation KG | This page explains]] the Russian investigation KG a bit more.)
 +
* List all triples in your graph.  
 +
* List the first 100 triples in your graph.
 +
* Count the number of triples in your graph.
 +
* Count the number of indictments in your graph.
 +
* List everyone who pleaded guilty, along with the name of the investigation.
 +
* List everyone who was convicted but had their conviction overturned, along with the president who overturned it.
 +
* For each investigation, list the number of indictments made.
 +
* For each investigation with multiple indictments, list the number of indictments made.
 +
* For each investigation with multiple indictments, list the number of indictments made, sorted with the most indictments first.
 +
* For each president, list the numbers of convictions and of pardons made after conviction.
  
Use INSERT DATA to add information about Sergio Pastor, who lives at 4 Carrer del Serpis, 46021 Valencia, Spain. He has an M.Sc. in computer science from the University of Valencia from 2008. His areas of expertise include big data, semantic technologies and machine learning.
+
'''Task:'''
 +
Write the following SPARQL updates:
 +
* The ''muellerkg:name'' property is misnamed, because the object in those triples is always a resource. Rename it to something like ''muellerkg:person''.
 +
* Update the graph so all the investigated person and president nodes (such as ''muellerkg:G._Gordon_Liddy'' and ''muellerkg:Richard_Nixon'') become the subjects in ''foaf:name'' triples with the corresponding strings (''G. Gordon Liddy'' and ''Richard Nixon'') as the literals. (''Tip:'' Use ''STR(muellerkg:)'' inside a REPLACE in a BIND statement to remove the URI path.)
  
Write a SPARQL DELETE/INSERT update to change the name of "University of Valencia" to "Universidad de Valencia" wherever it occurs.
+
'''Task:'''
 +
Load the RDF graph you created in exercises 1 and 2. (Maybe you want to create a new namespace in Blazegraph first.) Use INSERT DATA updates to add these triples to your graph:
 +
* George Papadopoulos was adviser to the Trump campaign.
 +
** He pleaded guilty to lying to the FBI.
 +
** He was sentenced to prison.
 +
* Roger Stone is a Republican.
 +
** He was adviser to Trump.
 +
** He was an official in the Trump campaign.
 +
** He interacted with Wikileaks.
 +
** He made a testimony for the House Intelligence Committee.
 +
** He was cleared of all charges.
  
Write a SPARQL DESCRIBE query to get basic information about Sergio.
+
'''Task:'''
 +
Use DELETE DATA and then INSERT DATA updates to correct the claim that Roger Stone was cleared of all charges. In fact:
 +
* He was indicted for making false statements, witness tampering, and obstruction of justice.
  
Write a SPARQL CONSTRUCT query that returns the total address of people in one literal.
+
'''Task:'''
 +
* Use a DESCRIBE query to show the updated information about Roger Stone.
 +
* Use a CONSTRUCT query to create a new RDF graph with triples only about Roger Stone (in other words, having Roger Stone as the subject).
  
 
==If you have more time==
Redo all the above steps, this time writing a Python/RDFlib program. This will be the topic of lab 6.
+
'''Task:'''
You can look at the python example page to see how to connect to your blazegraph database in Python and perform some basic queries.
+
Install ''curl'' on your computer if you do not have it.
 +
 
 +
'''Windows 10/11:''' You most likely already have it. Check by typing ''curl --help'' in your command prompt. If you do not have it, follow the guide on https://stackoverflow.com/questions/9507353/how-do-i-install-and-use-curl-on-windows.
  
 +
'''Mac:''' If you do not have it, type the following in your terminal: ''sudo port install curl'' (this assumes MacPorts is installed).
  
==Useful Links==
+
'''Linux:''' If you do not have it, type the following in your terminal: ''sudo apt install curl''.
[https://wiki.uib.no/info216/index.php/File:S03-SPARQL-13.pdf Lecture Notes]
 
  
==Triples that you can base your queries on: (turtle format)==
+
Use the command below to download all the triples in your Blazegraph namespace. (You must replace ''NAMESPACE'' with the name of your Blazegraph namespace and ''FILENAME'' with the name of the Turtle file you want to save to.)
<syntaxhighlight>
+
curl -X POST http://sandbox.i2s.uib.no/bigdata/namespace/NAMESPACE/sparql \
@prefix ex: <http://example.org/> .
+
      --data-urlencode 'query=CONSTRUCT {?s?p?o} WHERE {?s?p?o}' \
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
+
      -H 'Accept:application/x-turtle' > FILENAME.ttl
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
+
(On Windows, you have to use double quotes and write everything on a single line.) This command works for the shared online server. If you run Blazegraph on your own machine, you must use a local address like ''http://10.112.161.87:9999/blazegraph/'' instead of the cloud address ''http://sandbox.i2s.uib.no/bigdata/''.
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
 
@prefix xml: <http://www.w3.org/XML/1998/namespace> .
 
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
 
  
ex:Cade a foaf:Person ;
+
'''Task:'''
    ex:address [ a ex:Address ;
+
Go back to the ''russia_investigation_kg.ttl'' dataset (maybe you need to change to an old Blazegraph namespace). The ''muellerkg:name'' property used as predicate is already covered by a standard term from an established vocabulary in the LOD cloud: ''foaf:name'', where ''foaf:'' is ''http://xmlns.com/foaf/0.1/''.
            ex:city ex:Berkeley ;
+
* If you have not done so already: write a SPARQL DELETE/INSERT update to change every ''muellerkg:name'' predicate in your graph to ''foaf:name''. (It is easy to destroy your RDF graph when you do this, so it is good you saved a copy in the previous task.)
            ex:country ex:USA ;
+
* Otherwise: find another resource to rename everywhere. For example, you can change your local URI for a public person to a standard [https://wikidata.org Wikidata] URI.
            ex:postalCode "94709"^^xsd:string ;
 
            ex:state ex:California ;
 
            ex:street "1516_Henry_Street"^^xsd:string ] ;
 
    ex:age 27 ;
 
    ex:characteristic ex:Kind ;
 
    ex:degree [ ex:degreeField ex:Biology ;
 
            ex:degreeLevel "Bachelor"^^xsd:string ;
 
            ex:degreeSource ex:University_of_California ;
 
            ex:year "2011-01-01"^^xsd:gYear ] ;
 
    ex:interest ex:Bird,
 
        ex:Ecology,
 
        ex:Environmentalism,
 
        ex:Photography,
 
        ex:Travelling ;
 
    ex:married ex:Mary ;
 
    ex:meeting ex:Meeting1 ;
 
    ex:visit ex:Canada,
 
        ex:France,
 
        ex:Germany ;
 
    foaf:knows ex:Emma ;
 
    foaf:name "Cade_Tracey"^^xsd:string .
 
  
ex:Mary a ex:Student,
+
'''Task:''' Write a DELETE/INSERT statement to change one of the prefixes in your graph, renaming all the resources that use that prefix.
        foaf:Person ;
 
    ex:age 26 ;
 
    ex:characteristic ex:Kind ;
 
    ex:interest ex:Biology,
 
        ex:Chocolate,
 
        ex:Hiking .
 
  
ex:Emma a foaf:Person ;
+
'''Task:''' Write an INSERT statement to add at least one significant date to the Mueller investigation, with literal type xsd:date. Write a DELETE/INSERT statement to change the date to a string, and a new DELETE/INSERT statement to change it back to xsd:date.
    ex:address [ a ex:Address ;
 
            ex:city ex:Valencia ;
 
            ex:country ex:Spain ;
 
            ex:postalCode "46020"^^xsd:string ;
 
            ex:street "Carrer_de_la Guardia_Civil_20"^^xsd:string ] ;
 
    ex:age 26 ;
 
    ex:degree [ ex:degreeField ex:Chemistry ;
 
            ex:degreeLevel "Master" ;
 
            ex:degreeSource ex:University_of_Valencia ;
 
            ex:year "2015-01-01"^^xsd:gYear ] ;
 
    ex:expertise ex:Air_Pollution,
 
        ex:Toxic_Waste,
 
        ex:Waste_Management ;
 
    ex:interest ex:Bike_Riding,
 
        ex:Music,
 
        ex:Travelling ;
 
    ex:meeting ex:Meeting1 ;
 
    ex:visit ( ex:Portugal ex:Italy ex:France ex:Germany ex:Denmark ex:Sweden ) ;
 
    foaf:name "Emma_Dominguez"^^xsd:string .
 
  
ex:Meeting1 a ex:Meeting ;
+
'''Task:''' Try to program some of the queries/updates in a Python program (this will be the topic of later labs). You have two options:
    ex:date "August, 2014"^^xsd:string ;
 
    ex:involved ex:Cade,
 
        ex:Emma ;
 
    ex:location ex:Paris .
 
  
ex:Paris a ex:City ;
+
''Using rdflib:''
    ex:capitalOf ex:France ;
+
Read the Turtle file into an rdflib Graph and use the ''query()'' method.
    ex:locatedIn ex:France .
+
g = Graph()
 +
g.parse(..., format='ttl')
 +
r = g.query(...your_query_string...)
 +
The hard part is picking the results out of the object ''r''...
  
ex:France ex:capital ex:Paris .
+
''Using SPARQLWrapper:''
 +
You can use SPARQLWrapper (another Python API) to connect to your running Blazegraph endpoint. See the Python example page for how to do this.
  
 +
'''Task:''' If you want to explore more, try out the Wikidata Query Service (WDQS):
 +
* [https://query.wikidata.org/ Wikidata Query Service]
  
</syntaxhighlight>
+
WDQS tutorials:
 +
* [https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial Wikidata SPARQL tutorial]
 +
* [https://wdqs-tutorial.toolforge.org/ Interactive WDQS tutorial]

Latest revision as of 22:59, 1 February 2023

Topics

  • Setting up the Blazegraph graph database.
  • SPARQL queries and updates.

Useful materials

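Blazegraph homepage:

  • Welcome to Blazegraph: https://blazegraph.com/
  • Blazegraph wiki: https://github.com/blazegraph/database/wiki

SPARQL reference:

  • SPARQL Query Documentation: https://www.w3.org/TR/sparql11-query/
  • SPARQL Update Documentation: http://www.w3.org/TR/sparql11-update/
  • SPARQL Expressions and Functions: https://en.wikibooks.org/wiki/SPARQL/Expressions_and_Functions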

Tasks

Running Blazegraph

You can either run Blazegraph locally on your own machine (best) or online on a shared server at UiB (also ok).

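Installing the Blazegraph database on your own computer:

  • Download the Blazegraph 2.1.6 Release Candidate (the file blazegraph.jar) from https://github.com/blazegraph/database/releases/tag/BLAZEGRAPH_2_1_6_RC and place blazegraph.jar in your INFO216 exercises folder.
  • Go to the folder where you saved blazegraph.jar in your command/terminal window using cd (for example, cd C:\Users\marti\info216).
  • Start Blazegraph:
java -server -Xmx4g -jar blazegraph.jar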
    • You might have to install a 64-bit Java Development Kit (JDK) if you have problems running Blazegraph.
    • If you get an "Address already in use" error, this is likely because Blazegraph has been terminated improperly. Either restart the command/terminal window or try to change the port of the Blazegraph server with this command:
java -server -Xmx4g -Djetty.port=19999 -jar blazegraph.jar 
  • When everything works, Blazegraph will print out something like:
Welcome to the Blazegraph(tm) Database.

Go to http://10.112.161.87:9999/blazegraph/ to get started.
  • Open the URI on the previous line in a web browser to access Blazegraph's web interface (the address will most likely be different from this example).

Running Blazegraph online: If you have trouble installing Blazegraph, you can use a shared online server for now. It provides the same Blazegraph interface, but runs in the cloud and can only be used from inside the UiB network. (If you are outside the UiB campus, you can connect through the UiB VPN first.) Note that there is no authentication or authorisation: all the data you upload to the cloud server will be visible to - and can be changed by - anyone inside the UiB network.

Using Blazegraph:

  • Creating a namespace: In the Blazegraph interface, you may go to the UPDATE tab and create a new namespace using default values and the Create namespace button.
    • You must do this if you use the shared online server to keep your own graph(s) separate.
    • You can also do this on your own (local) server to keep your graphs separate.
    • If you do not create a namespace, the default will be kb.
    • Note that Blazegraph namespaces have nothing to do with namespaces in rdflib or in Turtle or other RDF serialisations.
  • Uploading data: In the Blazegraph interface, go to the UPDATE tab and use the Browse... and Update buttons to load the file into Blazegraph.
    • You can use the data in the Turtle file File:Russia investigation kg.txt. Make sure you save it with the correct extension, as russia_investigation_kg.ttl (not .txt).
    • You can also use the Turtle file you saved after exercises 1 and 2.
  • Querying and updating: In the Blazegraph interface, go to the QUERY and UPDATE tabs to enter queries and updates.

SPARQL tasks

Task: Using the data in russia_investigation_kg.ttl, write the following SPARQL SELECT queries. (The Russian investigation KG page explains the knowledge graph a bit more.)

  • List all triples in your graph.
  • List the first 100 triples in your graph.
  • Count the number of triples in your graph.
  • Count the number of indictments in your graph.
  • List everyone who pleaded guilty, along with the name of the investigation.
  • List everyone who was convicted but had their conviction overturned, along with the president who overturned it.
  • For each investigation, list the number of indictments made.
  • For each investigation with multiple indictments, list the number of indictments made.
  • For each investigation with multiple indictments, list the number of indictments made, sorted with the most indictments first.
  • For each president, list the numbers of convictions and of pardons made after conviction.
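The first three queries do not depend on the vocabulary of the knowledge graph, and the grouped counts all share one pattern (GROUP BY, optionally HAVING and ORDER BY). The sketch below builds such query strings in Python. The predicate IRI passed to count_by_query is a placeholder chosen for illustration, not a term confirmed to exist in russia_investigation_kg.ttl, so look up the real predicate in your data first.

```python
# Queries that work on any graph, whatever vocabulary it uses.
ALL_TRIPLES = "SELECT ?s ?p ?o WHERE { ?s ?p ?o }"
FIRST_100 = ALL_TRIPLES + " LIMIT 100"
COUNT_TRIPLES = "SELECT (COUNT(*) AS ?count) WHERE { ?s ?p ?o }"


def count_by_query(group_predicate, min_count=None, most_first=False):
    """Build a query that counts subjects per value of group_predicate.

    group_predicate is a full IRI; the caller must supply one that
    actually occurs in the data (the one used below is hypothetical).
    """
    lines = [
        "SELECT ?investigation (COUNT(?s) AS ?count)",
        f"WHERE {{ ?s <{group_predicate}> ?investigation }}",
        "GROUP BY ?investigation",
    ]
    if min_count is not None:
        # HAVING filters on the aggregate, after grouping.
        lines.append(f"HAVING (COUNT(?s) >= {min_count})")
    if most_first:
        lines.append("ORDER BY DESC(?count)")
    return "\n".join(lines)


# Hypothetical predicate IRI, for illustration only.
query = count_by_query("http://example.org/investigation",
                       min_count=2, most_first=True)
```

Paste the printed string into the QUERY tab, or pass it to whichever Python API you use to reach the endpoint.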

Task: Write the following SPARQL updates:

  • The muellerkg:name property is misnamed, because the object in those triples is always a resource. Rename it to something like muellerkg:person.
  • Update the graph so all the investigated person and president nodes (such as muellerkg:G._Gordon_Liddy and muellerkg:Richard_Nixon) become the subjects in foaf:name triples with the corresponding strings (G. Gordon Liddy and Richard Nixon) as the literals. (Tip: Use STR(muellerkg:) inside a REPLACE in a BIND statement to remove the URI path.)
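Both updates follow the DELETE/INSERT ... WHERE pattern. As a rough sketch, the Python below builds a rename update for any pair of predicate IRIs and mirrors, in plain Python, the string manipulation that the tip describes (strip the namespace, then turn underscores into spaces). The namespace IRI is a placeholder; use the one declared for muellerkg: in your Turtle file.

```python
def rename_predicate_update(old_iri, new_iri):
    """Build a SPARQL update that moves every triple from old_iri to new_iri."""
    return (
        f"DELETE {{ ?s <{old_iri}> ?o }}\n"
        f"INSERT {{ ?s <{new_iri}> ?o }}\n"
        f"WHERE  {{ ?s <{old_iri}> ?o }}"
    )


def label_from_iri(iri, namespace):
    """Strip the namespace and turn underscores into spaces.

    Roughly what this SPARQL expression does:
    REPLACE(REPLACE(STR(?node), STR(muellerkg:), ""), "_", " ")
    """
    if not iri.startswith(namespace):
        raise ValueError(f"{iri} is not in namespace {namespace}")
    return iri[len(namespace):].replace("_", " ")


# Placeholder namespace, not taken from the lab data.
MUELLERKG = "http://example.org/muellerkg#"

update = rename_predicate_update(MUELLERKG + "name", MUELLERKG + "person")
label = label_from_iri(MUELLERKG + "G._Gordon_Liddy", MUELLERKG)
```

Running the rename first on a copy of the graph (for example in a separate Blazegraph namespace) makes it easy to recover if the WHERE pattern matches more than intended.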

Task: Load the RDF graph you created in exercises 1 and 2. (Maybe you want to create a new namespace in Blazegraph first.) Use INSERT DATA updates to add these triples to your graph:

  • George Papadopoulos was adviser to the Trump campaign.
    • He pleaded guilty to lying to the FBI.
    • He was sentenced to prison.
  • Roger Stone is a Republican.
    • He was adviser to Trump.
    • He was an official in the Trump campaign.
    • He interacted with Wikileaks.
    • He made a testimony for the House Intelligence Committee.
    • He was cleared of all charges.

Task: Use DELETE DATA and then INSERT DATA updates to correct the claim that Roger Stone was cleared of all charges. In fact:

  • He was indicted for making false statements, witness tampering, and obstruction of justice.
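INSERT DATA and DELETE DATA accept only concrete triples, with no variables and no WHERE clause. A minimal sketch of how the correction could be assembled as one update request is shown below; every ex: term in it is invented for illustration, so substitute the resources and properties from your own graph.

```python
PREFIX = "PREFIX ex: <http://example.org/>\n"  # hypothetical namespace


def data_update(operation, triples):
    """Build an INSERT DATA or DELETE DATA update from (s, p, o) term strings."""
    if operation not in ("INSERT", "DELETE"):
        raise ValueError("operation must be 'INSERT' or 'DELETE'")
    body = "\n".join(f"  {s} {p} {o} ." for s, p, o in triples)
    return f"{operation} DATA {{\n{body}\n}}"


def correction_update(wrong, right):
    """Delete the wrong triples, then insert the right ones, in one request."""
    # ';' separates the operations of a single SPARQL update request.
    return (PREFIX + data_update("DELETE", wrong)
            + " ;\n" + data_update("INSERT", right))


# All ex: terms below are invented for illustration.
update = correction_update(
    wrong=[("ex:Roger_Stone", "ex:outcome", "ex:Cleared")],
    right=[("ex:Roger_Stone", "ex:indictedFor", "ex:Witness_Tampering")],
)
```

The resulting string can be pasted into the UPDATE tab; running the two operations as separate updates works just as well.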

Task:

  • Use a DESCRIBE query to show the updated information about Roger Stone.
  • Use a CONSTRUCT query to create a new RDF graph with triples only about Roger Stone (in other words, having Roger Stone as the subject).

If you have more time

Task: Install curl on your computer if you do not have it.

Windows 10/11: You most likely already have it. Check by typing curl --help in your command prompt. If you do not have it, follow the guide on https://stackoverflow.com/questions/9507353/how-do-i-install-and-use-curl-on-windows.

Mac: If you do not have it, type the following in your terminal: sudo port install curl (this assumes MacPorts is installed).

Linux: If you do not have it, type the following in your terminal: sudo apt install curl

Use the command below to download all the triples in your Blazegraph namespace. (You must replace NAMESPACE with the name of your Blazegraph namespace and FILENAME with the name of the Turtle file you want to save to.)

curl -X POST http://sandbox.i2s.uib.no/bigdata/namespace/NAMESPACE/sparql \
     --data-urlencode 'query=CONSTRUCT {?s?p?o} WHERE {?s?p?o}' \
     -H 'Accept:application/x-turtle' > FILENAME.ttl

(On Windows, you have to use double quotes and write everything on a single line.) This command works for the shared online server. If you run Blazegraph on your own machine, you must use a local address like http://10.112.161.87:9999/blazegraph/ instead of the cloud address http://sandbox.i2s.uib.no/bigdata/.
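If curl is not available, the same request can be sent from Python's standard library. The sketch below is an approximate equivalent of the curl command, not a replacement endorsed by the lab; the function names are made up for this example.

```python
import urllib.parse
import urllib.request

DUMP_QUERY = "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o }"


def build_dump_request(base_url, namespace):
    """Prepare the POST request that the curl command sends."""
    endpoint = f"{base_url.rstrip('/')}/namespace/{namespace}/sparql"
    # Same form encoding that curl's --data-urlencode produces.
    body = urllib.parse.urlencode({"query": DUMP_QUERY}).encode("ascii")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Accept": "application/x-turtle"},
        method="POST",
    )


def dump_namespace(base_url, namespace, filename):
    """Send the request and save the Turtle response to filename."""
    request = build_dump_request(base_url, namespace)
    with urllib.request.urlopen(request) as response, \
            open(filename, "wb") as out:
        out.write(response.read())
```

For the shared server, the call would be dump_namespace("http://sandbox.i2s.uib.no/bigdata/", "NAMESPACE", "FILENAME.ttl"), with NAMESPACE and FILENAME replaced as described above; for a local server, pass the address Blazegraph printed at startup.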

Task: Go back to the russia_investigation_kg.ttl dataset (maybe you need to change to an old Blazegraph namespace). The muellerkg:name property used as predicate is already covered by a standard term from an established vocabulary in the LOD cloud: foaf:name, where foaf: is http://xmlns.com/foaf/0.1/.

  • If you have not done so already: write a SPARQL DELETE/INSERT update to change every muellerkg:name predicate in your graph to foaf:name. (It is easy to destroy your RDF graph when you do this, so it is good you saved a copy in the previous task.)
  • Otherwise: find another resource to rename everywhere. For example, you can change your local URI for a public person to a standard Wikidata URI.

Task: Write a DELETE/INSERT statement to change one of the prefixes in your graph, renaming all the resources that use that prefix.

Task: Write an INSERT statement to add at least one significant date to the Mueller investigation, with literal type xsd:date. Write a DELETE/INSERT statement to change the date to a string, and a new DELETE/INSERT statement to change it back to xsd:date.

Task: Try to program some of the queries/updates in a Python program (this will be the topic of later labs). You have two options:

Using rdflib: Read the Turtle file into an rdflib Graph and use the query() method.

from rdflib import Graph

g = Graph()
g.parse(..., format='ttl')  # path or URL of your Turtle file
r = g.query(...your_query_string...)

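The hard part is picking the results out of the object r: for a SELECT query you can iterate over r row by row, and each row holds the values bound to the variables in your SELECT clause.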

Using SPARQLWrapper: You can use SPARQLWrapper (another Python API) to connect to your running Blazegraph endpoint. See the Python example page for how to do this.
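Whichever client you use, a SPARQL endpoint that is asked for application/sparql-results+json returns SELECT results in the W3C SPARQL Query Results JSON Format. The sketch below unpacks such a document using only the standard library; it is not SPARQLWrapper code, and the sample response is made up for illustration.

```python
import json


def rows_from_sparql_json(text):
    """Turn a SPARQL JSON result document into a list of plain dicts.

    Each dict maps a variable name to the string value bound to it;
    variables that are unbound in a row map to None.
    """
    document = json.loads(text)
    variables = document["head"]["vars"]
    rows = []
    for binding in document["results"]["bindings"]:
        rows.append({
            var: binding[var]["value"] if var in binding else None
            for var in variables
        })
    return rows


# Made-up response for a query like: SELECT ?person ?name WHERE { ... }
sample = """
{
  "head": {"vars": ["person", "name"]},
  "results": {"bindings": [
    {"person": {"type": "uri", "value": "http://example.org/Roger_Stone"},
     "name": {"type": "literal", "value": "Roger Stone"}},
    {"person": {"type": "uri", "value": "http://example.org/Unnamed"}}
  ]}
}
"""
rows = rows_from_sparql_json(sample)
```

Each binding also carries a "type" field (and, for typed literals, a "datatype"), which this sketch ignores for brevity.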

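Task: If you want to explore more, try out the Wikidata Query Service (WDQS):

  • Wikidata Query Service: https://query.wikidata.org/

WDQS tutorials:

  • Wikidata SPARQL tutorial: https://www.wikidata.org/wiki/Wikidata:SPARQL_tutorial
  • Interactive WDQS tutorial: https://wdqs-tutorial.toolforge.org/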