Lab: Semantic Lifting - CSV

From Info216
Revision as of 23:53, 12 March 2020 by Say004

Lab 9: Semantic Lifting - CSV


Today's topic is lifting data in CSV format into RDF. The goal is for you to work through an example of how non-semantic data can be converted into RDF.

CSV stands for Comma-Separated Values, meaning that the values on each line are separated by commas. Each line is one record, and each position on the line corresponds to one column.

Fortunately, CSV is already structured in a way that makes the creation of triples relatively easy.
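To make the row-to-triple idea concrete, here is a minimal sketch that uses only Python's standard csv module. The two-column sample and its values are made up for illustration and are not part of the lab data; the result is a list of plain tuples, not yet an RDF graph.

```python
import csv
import io

# Hypothetical sample: a header line followed by one data line.
sample = 'Name,Country\n"Ada Example","Norway"\n'

reader = csv.reader(io.StringIO(sample))
header = next(reader)  # the first line holds the column names
triples = []
for row in reader:
    subject = row[0]  # the first column identifies the row
    # Every remaining column becomes one (subject, predicate, object) tuple.
    for predicate, value in zip(header[1:], row[1:]):
        triples.append((subject, predicate, value))

print(triples)
```

The row supplies the subject, the column name supplies the predicate, and the cell supplies the object.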

Relevant Libraries

  • Pandas


Below are four lines of CSV that could have been saved from a spreadsheet. Copy them into a file in your project folder and write a program with a loop that reads each line from that file (except the initial header line) and adds it to your graph as triples:

"Regina Catherine Hall","F","Great Britain","Manchester","Ecology, zoology","Football, music, travelling"
"Achille Blaise","M","France","Nancy","","Chess, computer games"
"Nyarai Awotwi Ihejirika","F","Kenya","Nairobi","Computers, semantic networks","Hiking, botany"
"Xun He Zhang","M","China","Chengdu","Internet, mathematics, logistics","Dancing, music, trombone"
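The four lines above contain data only, while the task text and the starter code refer to a header line. One possible way to handle this, sketched below, is to let pandas assign column names when reading. The column names are assumptions: only "Name" (starter code) and "expertise" (task notes) appear in this lab, and the rest are guesses based on the data. The lines are held in a string here so that the sketch runs on its own.

```python
import io

import pandas as pd

# ASSUMED column names; only "Name" and "expertise" are mentioned in the lab text.
COLUMNS = ["Name", "Gender", "Country", "Town", "Expertise", "Interests"]

# The lab's four lines, kept in a string so no file is needed for this sketch.
raw = '''"Regina Catherine Hall","F","Great Britain","Manchester","Ecology, zoology","Football, music, travelling"
"Achille Blaise","M","France","Nancy","","Chess, computer games"
"Nyarai Awotwi Ihejirika","F","Kenya","Nairobi","Computers, semantic networks","Hiking, botany"
"Xun He Zhang","M","China","Chengdu","Internet, mathematics, logistics","Dancing, music, trombone"
'''

# header=None says the data has no header row; names= supplies the column names.
# keep_default_na=False keeps the empty expertise field as "" instead of NaN.
csv_data = pd.read_csv(
    io.StringIO(raw), header=None, names=COLUMNS, keep_default_na=False
)

for index, row in csv_data.iterrows():
    print(row["Name"], "|", row["Expertise"])
```

To read from a file instead, pass its path in place of io.StringIO(raw). Alternatively, add a header line to the file by hand and drop the header= and names= arguments.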

When solving the task take note of the following:

  • The subject of each triple will be the person's name. The header (first line) names the columns of data, and these column names should act as the predicates of the triples.
  • Some columns, like expertise, hold multiple values for one person. Create a separate triple for each of these values.
  • Spaces should be replaced with underscores to form a valid URI. E.g. Regina Catherine should be Regina_Catherine.
  • Missing data should not produce a triple.
  • For consistency, make sure all resources start with a capital letter.
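The notes above can be sketched as two small string helpers, independent of any RDF library. The function names to_resource_name and split_values are hypothetical, and capitalizing only the first character is one reading of "start with a capital letter"; treat this as a starting point rather than the expected solution.

```python
def to_resource_name(text):
    """Turn free text into a URI-friendly local name."""
    text = text.strip()
    # Capitalize only the first character; the rest keeps its original casing.
    text = text[:1].upper() + text[1:]
    return text.replace(" ", "_")


def split_values(cell):
    """Split a multi-valued cell on commas and drop empty entries."""
    if not isinstance(cell, str):
        return []  # e.g. a NaN that pandas produced for a missing value
    return [to_resource_name(part) for part in cell.split(",") if part.strip()]


print(to_resource_name("Regina Catherine Hall"))
print(split_values("Computers, semantic networks"))
print(split_values(""))
```

An empty cell yields an empty list, so a loop over the result adds no triple for missing data.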

If You Have More Time

Extend or improve the graph with concepts you have learned about so far, e.g. RDF.type, or RDFS domain and range. Additionally, see if you can find existing terms on DBpedia or Wikidata for the predicates we used here.

Code to Get Started (Optional)

from rdflib import Graph, Literal, Namespace, URIRef

import pandas as pd

csv_data = pd.read_csv("task1.csv")

g = Graph()
ex = Namespace("http://example.org/")
g.bind("ex", ex)

# iterate through each row. First I select the subjects of the triples which will be the names.
for index, row in csv_data.iterrows():
    subject = row['Name'].replace(" ", "_")

    # Continue code here: