= Lab 6: Semantic Lifting - CSV =
== Topic ==
* Reading non-semantic data tables into semantic knowledge graphs
* Specifically, reading data in CSV format via Pandas dataframes into RDF graphs
* ''If you have more time'', reading data in CSV format via Ontotext Refine into RDF graphs


== Useful materials ==
The textbook (Allemang, Hendler & Gandon):
* chapter 3 on RDF (from section 3.1 on ''Distributing Data across the Web'')


Pandas:
* [https://towardsdatascience.com/pandas-dataframe-playing-with-csv-files-944225d19ff Article about working with pandas.DataFrames and CSV]
* [https://pandas.pydata.org/docs/user_guide/io.html#parsing-options User Guide for Pandas]
* class: DataFrame (methods: read_csv, set_index, apply, iterrows, astype)


rdflib:
* classes/interfaces from earlier (such as Graph, Namespace, URIRef, Literal, perhaps BNode)
* also vocabulary classes like RDF (e.g., type), RDFS (e.g., label) and XSD (for various datatypes)


''If you have more time'', Ontotext Refine:
* [https://www.ontotext.com/products/ontotext-refine/ Ontotext Refine Product page]
* [https://platform.ontotext.com/ontorefine/ Ontotext Refine Documentation]


== Tasks ==
We will be working with the same dataset as in the SPARQL exercise: [https://projects.fivethirtyeight.com/russia-investigation/ FiveThirtyEight's Russia Investigation]. It contains data about special investigations conducted by the United States from the Watergate-investigation until May 2017. [[Russian investigation KG | This page explains]] the Russia Investigation dataset a bit more.


'''Task:'''
In the SPARQL exercise, you downloaded the data as a Turtle file ([[File:russia_investigation_kg.txt]], which you renamed to ''.ttl''). This time you will [https://github.com/fivethirtyeight/data/tree/master/russia-investigation download the data as a CSV file from GitHub].

'''Task:'''
Install Pandas in your virtual environment, for example
 pip install pandas
Write a Python program that imports the ''pandas'' API and uses Pandas' ''read_csv'' function to load the ''russia-investigation.csv'' dataset into a Pandas ''dataframe''.
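
A minimal sketch, assuming the downloaded file is saved as ''russia-investigation.csv'' in your working directory:

<syntaxhighlight>
import pandas as pd

# Load the CSV file into a dataframe; adjust the path to wherever you saved it.
df = pd.read_csv("russia-investigation.csv")

print(df.shape)  # (number of rows, number of columns)
</syntaxhighlight>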


'''Task:'''
''(Pandas basics)'' Inspect the Pandas dataframe. If you have called your dataframe ''df'', you can check out the expressions below. Use the [https://pandas.pydata.org/docs/user_guide/io.html#parsing-options documentation] to understand what each of them does.
 df.shape
 df.index  # ...and list(df.index)
 df.columns
 df['name']
 df.name
 df.loc[3]
 df.loc[3]['president']
(Pandas offers many ways of picking out rows, columns, and values. These are just examples to get started.)


'''Task:'''
''(Pandas basics)'' Pandas' apply method offers a compact way to process all the rows in a dataframe. This line lists all the rows in your dataframe as a Python dict():
 df.apply(lambda row: print(dict(row)), axis=1)
What happens if you drop the ''axis'' argument, or set ''axis=0''?


'''Task:'''
Instead of the ''lambda'' function, you can use a named function. Write a function that prints out only the ''name'' and ''indictment-days'' in a row, and use it to print out the ''name'' and ''indictment-days'' for all rows in the dataframe.
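
A possible sketch (the function name is our own choice; the column names come from the dataset):

<syntaxhighlight>
def print_name_and_days(row):
    # Each row is passed in as a pandas Series indexed by column name.
    print(row["name"], row["indictment-days"])

df.apply(print_name_and_days, axis=1)
</syntaxhighlight>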


''Alternative to df.apply():''
Pandas offers several ways to iterate through data. You can also use the ''itertuples'' method in a simple ''for''-loop to iterate through rows.
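
For example (note that ''itertuples'' replaces column names that are not valid Python identifiers, such as ''indictment-days'', with positional names like ''_6''):

<syntaxhighlight>
for row in df.itertuples():
    # row is a named tuple; row.Index holds the dataframe index.
    print(row.Index, row.name)
</syntaxhighlight>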


'''Task:'''
Modify your function so it adds ''name'' and ''indictment-days'' triples to a global rdflib ''Graph'' for each row in the dataframe. The subject in each triple could be the numeric index of the row.


You can use standard terms from RDF, RDFS, XSD, and other vocabularies when you see fit. Otherwise, just use an example-prefix.

Things may be easier if you copy ''df.index'' into an ordinary column of the dataframe:
 df['id'] = df.index
You can use this index, along with a prefix, as the subject in your triples.
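
A sketch of how this could look; the ''http://example.org/'' namespace and the property names are example choices, not prescribed ones:

<syntaxhighlight>
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import XSD

EX = Namespace("http://example.org/")
g = Graph()
g.bind("ex", EX)

def add_row_triples(row):
    # Use the copied index column as the local name of the subject.
    subject = URIRef(EX + "investigation_" + str(row["id"]))
    g.add((subject, EX["name"], Literal(row["name"])))
    g.add((subject, EX["indictment_days"],
           Literal(row["indictment-days"], datatype=XSD.integer)))

df.apply(add_row_triples, axis=1)
</syntaxhighlight>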


'''Task:'''
Continue to extend your function to convert the non-semantic CSV dataset into a semantic RDF one. Here is an example of how the data for one investigation could look in the end:

 muellerkg:investigation_0 a muellerkg:Indictment ;
     muellerkg:american true ;
     muellerkg:cp_date "1973-01-30"^^xsd:date ;
     muellerkg:cp_days -109 ;
     muellerkg:indictment_days -246 ;
     muellerkg:investigation muellerkg:watergate ;
     muellerkg:investigation_days 1492 ;
     muellerkg:investigation_end "1977-06-19"^^xsd:date ;
     muellerkg:investigation_start "1973-05-19"^^xsd:date ;
     muellerkg:name muellerkg:James_W._McCord ;
     muellerkg:outcome muellerkg:conviction ;
     muellerkg:overturned false ;
     muellerkg:pardoned false ;
     muellerkg:president muellerkg:Richard_Nixon .
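
Continuing the sketch above, a few more columns could be handled along these lines (the ''muellerkg'' namespace URI is an assumption; use whatever URI you prefer):

<syntaxhighlight>
from rdflib import Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD

MKG = Namespace("http://example.org/muellerkg/")

def convert_row(row):
    # g is the Graph from the previous sketch.
    inv = URIRef(MKG + "investigation_" + str(row["id"]))
    g.add((inv, RDF.type, MKG["Indictment"]))
    # Booleans and dates become typed literals; people become resources.
    g.add((inv, MKG["american"], Literal(bool(row["american"]))))
    g.add((inv, MKG["investigation_start"],
           Literal(row["investigation-start"], datatype=XSD.date)))
    g.add((inv, MKG["president"],
           URIRef(MKG + str(row["president"]).replace(" ", "_"))))

df.apply(convert_row, axis=1)
</syntaxhighlight>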


== If you have more time ==

'''Task:'''
If you have not done so already, you should include checks to ensure that you do not add empty columns to your graph.
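
One possible guard, sketched with Pandas' ''isna'' function (the helper name is our own):

<syntaxhighlight>
import pandas as pd
from rdflib import Literal

def add_if_present(subject, predicate, value, datatype=None):
    # Missing CSV cells show up as NaN/None; skip them so they never
    # end up as empty literals in the graph.
    if pd.isna(value):
        return
    g.add((subject, predicate, Literal(value, datatype=datatype)))
</syntaxhighlight>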


'''Task:'''
In the SPARQL exercise and earlier in this lab, you used data downloaded as a Turtle file ([[File:russia_investigation_kg.txt]], which you renamed to ''.ttl''). This time you will [https://github.com/fivethirtyeight/data/tree/master/russia-investigation download the data as a CSV file from GitHub].


'''Task:'''
Install Ontotext Refine on your computer, and create a repository in GraphDB named ''refine''.


* Windows
** Download the Refine ''.msi'' installer file.
** Double-click the application file and follow the on-screen installer prompts.
** You will be asked for an installation location.
** Locate the application in the Windows Start menu or on the desktop and start it.
** The Refine application opens at http://localhost:7333/.


* MacOS
** Download the Refine ''.dmg'' file.
** Double-click it to get a virtual disk on your desktop.
** Copy the program from the virtual disk to the Applications folder on your hard disk.
** Locate the application on the desktop and start it.
** The Refine application opens at http://localhost:7333/.


* Linux
** Download the Refine ''.deb'' or ''.rpm'' file.
** Install the package:
** Debian and derivatives: sudo dpkg -i <package-name>.deb
** Redhat and derivatives: sudo rpm -i <package-name>.rpm
** CentOS and derivatives: sudo yum install <package-name>
** Alternatively, double-click the package.
** Locate the application on the desktop and start it.
** The Refine application opens at http://localhost:7333/.


Note: Alternatively, you can change the ''Repository ID'' in Ontotext Refine under the Setup tab to either a preexisting repository's name, or to one that you would prefer. Ontotext Refine will not make changes to this repository.


'''Task:''' ''Data loading''

Load the CSV file into Ontotext Refine, and convert the columns' types into the appropriate ones (dates to dates, ints to ints).


'''Task:''' ''Setting up RDF mappings''

Use the Visual RDF Mapper tool to map columns to triples. Try to recreate the structure we had in the KG up until this point. Make sure to set types for literals!


'''Task:''' ''Extending the RDF mappings''

In addition to the previous structure, we would also like to store the names of presidents as properties. Modify your mapping so each president has a foaf:surname and a foaf:lastname property, with the appropriate values.


'''Task:''' ''Testing your KG''

Load your freshly created triples into GraphDB, and try to run the queries you have created during the previous labs.
If they do not work as they did before, try and update your mappings so they do!
