Revision as of 14:28, 14 March 2020

Lab 10: Semantic Lifting - HTML

Link to Live-stream

Topics

Today's topic is lifting data in HTML format into RDF. The goal is for you to work through an example of how non-semantic data can be converted into RDF.

HTML is the markup language used to describe the structure and content of web pages.
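
As a rough illustration of what lifting HTML into RDF can look like, the sketch below reads a small HTML table and turns each filled cell into a subject-predicate-object triple. It is only one possible approach and not the lab's reference solution. The table contents, the `http://example.org/` namespace, and the names `TableParser`, `to_uri` and `lift` are all assumptions made up for this example. It uses only Python's standard `html.parser` module so that it runs without extra installs.

```python
from html.parser import HTMLParser

EX = "http://example.org/"  # hypothetical namespace, chosen for illustration

# Made-up sample page: one header row, then one row per person.
HTML_DOC = """
<table>
  <tr><th>Name</th><th>Country</th><th>Town</th></tr>
  <tr><td>Regina Catherine Hall</td><td>Great Britain</td><td>Manchester</td></tr>
  <tr><td>Achille Blaise</td><td>France</td><td></td></tr>
</table>
"""


class TableParser(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished rows, each a list of cell strings
        self._row = None    # cells of the row currently being read
        self._cell = None   # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def to_uri(text):
    """Build a URI string in the example namespace, with spaces as underscores."""
    return EX + text.strip().replace(" ", "_")


def lift(html):
    """Return (subject, predicate, object) URI strings for each non-empty cell."""
    parser = TableParser()
    parser.feed(html)
    header, *body = parser.rows
    triples = []
    for row in body:
        subject = to_uri(row[0])  # the first column names the resource
        for column, value in zip(header[1:], row[1:]):
            if value:  # an empty cell produces no triple
                triples.append((subject, to_uri(column), to_uri(value)))
    return triples


triples = lift(HTML_DOC)
for s, p, o in triples:
    print(f"<{s}> <{p}> <{o}> .")
```

In an rdflib-based solution, each tuple could instead be wrapped in `URIRef` and passed to `Graph.add`. Real web pages are usually messier than this table, so a dedicated HTML parsing library may be more convenient than the standard-library parser used here.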

Relevant Libraries/Functions

Tasks

If You have more Time

Code to Get Started (You could also use your own approach if you want to)

Hints

Replacing characters with Dataframe:

Useful Reading

@@ Line 1: / Line 1: @@
-=Lab 9: Semantic Lifting - CSV=
+=Lab 10: Semantic Lifting - HTML=
 ==Link to Live-stream==
 <syntaxhighlight>
-https://teams.microsoft.com/dl/launcher/launcher.html?url=%2f_%23%2fl%2fmeetup-join%2f19%3ameeting_MGI1ZjcxNTUtODBjNy00ZjkxLWJlNGUtOTQ2Y2M3NjEwYzkx%40thread.v2%2f0%3fcontext%3d%257b%2522Tid%2522%253a%2522648a24bc-a98d-4025-9c60-48c19a142069%2522%252c%2522Oid%2522%253a%252252d6ac23-7c70-43f5-bc41-95254a3ac7f1%2522%252c%2522IsBroadcastMeeting%2522%253atrue%257d%26anon%3dtrue&type=meetup-join&deeplinkId=3b3b5d40-c010-45ca-a620-7108967fe3e3&directDl=true&msLaunch=true&enableMobilePage=true&suppressPrompt=true
 </syntaxhighlight>
 ==Topics==
-Today's topic involves lifting data in CSV format into RDF.
+Today's topic involves lifting data in HTML format into RDF.
 The goal is for you to work through an example of how non-semantic data can be converted into RDF.
-CSV stands for Comma-Separated Values, meaning that each data value is separated by a comma.
+HTML is the markup language used to describe the structure and content of web pages.
-Fortunately, CSV is already structured in a way that makes the creation of triples relatively easy.
-We will also use pandas DataFrames to hold our CSV data in Python code, and do some basic data manipulation to improve our output data.
 ==Relevant Libraries/Functions==
-import pandas
-pandas.read_csv
-dataframe.iterrows(), dataframe.fillna(), dataframe.replace()
-string.split(), string.title(), string.replace()
-RDF concepts we have used earlier.
 ==Tasks==
-Below is a header line followed by four lines of CSV data that could have been saved from a spreadsheet. Copy them into a file in your project folder (e.g. task1.csv) and '''write a program with a loop that reads each line from that file and adds it to your graph as triples''':
- "Name","Gender","Country","Town","Expertises","Interests"
- "Regina Catherine Hall","F","Great Britain","Manchester","Ecology, zoology","Football, music, travelling"
- "Achille Blaise","M","France","Nancy","","Chess, computer games"
- "Nyarai Awotwi Ihejirika","F","Kenya","Nairobi","Computers, semantic networks",""
- "Xun He Zhang","M","China","Chengdu","Internet, mathematics, logistics","Dancing, music, trombone"
-To get started you can use the code further down.
-When solving the task, take note of the following:
-* The subject of the triples will be the names of the people. The header (first line) contains the column names, which should act as the predicates of the triples.
-* Some columns, like Expertises, have multiple values for one person. You should create a separate triple for each of these expertises/interests.
-* Spaces should be replaced with underscores to form a valid URI. E.g. Regina Catherine should be Regina_Catherine.
-* Any case with missing data should not form a triple.
-* For consistency, make sure all resources start with a capital letter.
 ==If You have more Time==
-* Extend/improve the graph with concepts you have learned about so far, e.g. RDF.type, or RDFS domain and range.
-* Additionally, see if you can find fitting existing terms for the relevant predicates and classes on DBpedia, Schema.org, Wikidata or elsewhere. Then replace the old ones with those.
 ==Code to Get Started (You could also use your own approach if you want to) ==
-<syntaxhighlight>
-from rdflib import Graph, Literal, Namespace, URIRef
-import pandas as pd
-# Load the CSV data as a pandas DataFrame.
-csv_data = pd.read_csv("task1.csv")
-g = Graph()
-ex = Namespace("http://example.org/")
-g.bind("ex", ex)
-# You should probably deal with replacing of characters or missing data here:
-# Iterate through each row in order to create triples. First I select the subjects of the triples, which will be the names.
-for index, row in csv_data.iterrows():
-    # row['Name'] selects the name value of the current row.
-    subject = row['Name']
-    # Continue the loop here:
-# Clean printing of end-results.
-print(g.serialize(format="turtle").decode())
-</syntaxhighlight>
@@ Line 97: / Line 31: @@
 | Replacing characters with Dataframe:
-<syntaxhighlight>
-csv_data = csv_data.replace(to_replace ="banana",
-                 value ="apple", regex=True)
-</syntaxhighlight>
-Fill missing/empty data of the DataFrame with the parameter value.
-<syntaxhighlight>
-csv_data = csv_data.fillna("missing")
-</syntaxhighlight>
-Make the first letter of each word capital.
-<syntaxhighlight>
-name = "cade".title()
-</syntaxhighlight>
-After creating the graph, you can easily remove all triples that contain unknown data if you marked it as above.
-<syntaxhighlight>
-g.remove((None, None, URIRef("http://example.org/missing")))
-</syntaxhighlight>
 |}
 ==Useful Reading==
-* [https://towardsdatascience.com/pandas-dataframe-playing-with-csv-files-944225d19ff Useful Resource for working with Dataframes and CSV]