Revision as of 23:47, 12 March 2020

Lab 9: Semantic Lifting - CSV

Topics

Today's topic involves lifting data in CSV format into RDF. The goal is for you to learn an example of how we can convert non-semantic data into RDF.

CSV stands for Comma-Separated Values, meaning that the values in each row are separated by commas.

Fortunately, CSV is already structured in a way that makes the creation of triples relatively easy.

We will also use pandas DataFrames, which will hold our CSV data in Python code. We will also do some basic data manipulation to improve our output data.
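As a rough illustration of why CSV maps easily onto triples, the sketch below reads a small CSV sample from an in-memory string (an assumption made here so the example needs no file) and pairs each name with each header and cell value. The tuples are plain Python values, not RDF terms yet.

```python
import io

import pandas as pd

# A shortened, hypothetical sample in the same shape as the lab data.
csv_text = '''"Name","Gender","Country"
"Regina Catherine Hall","F","Great Britain"
"Achille Blaise","M","France"
'''

df = pd.read_csv(io.StringIO(csv_text))

# Each row gives a subject (the name); each remaining column gives
# a predicate (the header) and an object (the cell value).
triples = []
for _, row in df.iterrows():
    for column in df.columns:
        if column != "Name":
            triples.append((row["Name"], column, row[column]))

for triple in triples:
    print(triple)
```

Turning these tuples into real RDF triples still requires building URIs for the subject, predicate, and object.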

Relevant Libraries/Functions

import pandas

pandas.read_csv

dataframe.iterrows(), dataframe.fillna(), dataframe.replace()

string.split(), string.title(), string.replace()

RDF concepts we have used earlier.

Tasks

Below are four lines of CSV data (preceded by a header line) that could have been saved from a spreadsheet. Copy them into a file in your project folder (e.g. task1.csv) and write a program with a loop that reads each line from that file and adds it to your graph as triples:

"Name","Gender","Country","Town","Expertises","Interests"
"Regina Catherine Hall","F","Great Britain","Manchester","Ecology, zoology","Football, music, travelling"
"Achille Blaise","M","France","Nancy","","Chess, computer games"
"Nyarai Awotwi Ihejirika","F","Kenya","Nairobi","Computers, semantic networks",""
"Xun He Zhang","M","China","Chengdu","Internet, mathematics, logistics","Dancing, music, trombone"

To get started, you can use the code further down.

When solving the task take note of the following:

The subjects of the triples will be the names of the people. The header (first line) contains the column names, which should act as the predicates of the triples.

Some columns, like Expertises, have multiple values for one person. You should create a separate triple for each of these expertises/interests.

Spaces should be replaced with underscores to form a valid URI, e.g. Regina Catherine should be Regina_Catherine.

Any case with missing data should not form a triple.

For consistency, make sure all resources start with a capital letter.
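The notes above can be sketched with plain string methods. This is one possible approach under stated assumptions, not the intended solution: the helper names to_local_name and split_values are hypothetical, and the sketch assumes that multiple values in a cell are always separated by commas.

```python
def to_local_name(value):
    """Turn a cell value into a URI-friendly local name.

    Strips surrounding whitespace, capitalizes each word, and
    replaces the remaining spaces with underscores.
    """
    return value.strip().title().replace(" ", "_")


def split_values(cell):
    """Split a multi-valued cell into cleaned local names.

    Empty parts are dropped, so an empty cell yields an empty list
    and therefore no triples.
    """
    return [to_local_name(part) for part in cell.split(",") if part.strip()]


name = to_local_name("Regina Catherine Hall")
expertises = split_values("Computers, semantic networks")
no_interests = split_values("")

print(name)
print(expertises)
print(no_interests)
```

Note that title() also lowercases every letter after the first in each word, which may not be what you want for names with internal capitals.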

If You Have More Time

Extend/improve the graph with concepts you have learned about so far, e.g. RDF.type or RDFS domain and range.

Additionally, see if you can find fitting existing terms for the relevant predicates and classes on DBpedia, Schema.org, Wikidata or elsewhere. Then replace the old ones with those.

Code to Get Started (You could also use your own approach if you want to)

from rdflib import Graph, Literal, Namespace, URIRef

import pandas as pd

# Load the CSV data as a pandas DataFrame.
csv_data = pd.read_csv("task1.csv")

g = Graph()
ex = Namespace("http://example.org/")
g.bind("ex", ex)


# You should probably deal with replacing of characters or missing data here:



# Iterate through each row in order to create triples. First I select the subjects of the triples, which will be the names.

for index, row in csv_data.iterrows():
    # row['Name'] selects the name value of the current row.
    subject = row['Name']

    # Continue the loop here:


# Clean printing of end-results.
# rdflib 6+ returns a str from serialize(); on older versions, append .decode().
print(g.serialize(format="turtle"))

Hints

Replacing characters with Dataframe:

csv_data = csv_data.replace(to_replace ="banana",
                 value ="apple", regex=True)
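The regex=True argument matters here. The sketch below, using a small hypothetical DataFrame, contrasts the two behaviours: without it, replace only matches cells whose whole value equals to_replace; with it, matching substrings inside a cell are replaced.

```python
import pandas as pd

# Hypothetical one-row DataFrame for illustration.
df = pd.DataFrame({"Name": ["Regina Catherine Hall"], "Town": ["Manchester"]})

# Without regex=True only cells that are exactly " " would change.
whole_cell = df.replace(to_replace=" ", value="_")

# With regex=True every space inside every string cell is replaced.
substring = df.replace(to_replace=" ", value="_", regex=True)

print(whole_cell.loc[0, "Name"])
print(substring.loc[0, "Name"])
```

Because the pattern is treated as a regular expression, characters with a special meaning in regular expressions (such as "." or "+") would need escaping.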

Fill missing/empty data of the DataFrame with the parameter value.

csv_data = csv_data.fillna("missing")
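To see what fillna changes, the sketch below reads a two-row sample from an in-memory string (an assumption, so that no file is needed). By default, pandas.read_csv loads empty cells as NaN, which fillna then replaces with the marker value.

```python
import io

import pandas as pd

# Hypothetical sample: the second person has no expertises.
csv_text = '''"Name","Expertises"
"Regina Catherine Hall","Ecology, zoology"
"Achille Blaise",""
'''

df = pd.read_csv(io.StringIO(csv_text))

# The empty cell is read as NaN by default.
was_missing = bool(pd.isna(df.loc[1, "Expertises"]))

# Replace every NaN with a recognisable marker value.
filled = df.fillna("missing")

print(was_missing)
print(filled.loc[1, "Expertises"])
```

An alternative to marking is to test each cell with pandas.isna inside the loop and skip it, so that no triple is created in the first place.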

Capitalize the first letter of each word.

name = "cade".title()

After creating the graph, you can easily remove all triples that contain unknown data if you marked the missing values as above.

g.remove((None, None, URIRef("http://example.org/missing")))

Useful Reading

Useful Resource for working with Dataframes and CSV
