Lab: Semantic Lifting - HTML

From Info216

=Lab 10: Semantic Lifting - HTML=
 
==Link to Discord server==

https://discord.gg/t5dgPrK
  
 
==Topics==

Today's topic involves lifting data in HTML format into RDF.
HTML stands for HyperText Markup Language and is used to describe the structure and content of websites.

HTML has a tree structure, consisting of a root element, children and parent elements, attributes and so on. XML has a similar tree structure, and the tasks below lift XML data.

The goal is for you to learn an example of how we can convert unsemantic data into RDF.
  
 
==Relevant Libraries/Functions==

import requests

from bs4 import BeautifulSoup

import xml.etree.ElementTree as ET

* ET.parse('xmlfile.xml')

All parts of the XML tree are considered '''Elements'''.

* ElementTree.getroot()
* Element.findall("path_in_tree")
* Element.find("name_of_tag")
* Element.text
* Element.get("name_of_attribute") (or the Element.attrib dictionary)
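As a small sketch of how these ElementTree functions fit together (the RSS-like fragment below is made up purely for illustration):

<syntaxhighlight lang="python">
import xml.etree.ElementTree as ET

# A made-up RSS-like fragment, just to illustrate the functions above
xml_data = """
<rss>
    <channel>
        <item>
            <title>Example headline</title>
            <guid>https://www.bbc.co.uk/news/world-12345678</guid>
        </item>
    </channel>
</rss>"""

root = ET.fromstring(xml_data)              # for a file, use ET.parse('file.xml').getroot()
for item in root.findall('channel/item'):   # path relative to the root element
    title = item.find('title').text         # text content of a child tag
    guid = item.find('guid').text
</syntaxhighlight>

With a real feed the structure is the same, only with many <item> elements instead of one.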
 
  
  
==Tasks==
 
'''Task 1'''
  
'''Lift the XML data from http://feeds.bbci.co.uk/news/rss.xml about news articles by BBC_News into RDF triples.'''

You can look at the actual XML structure of the data by pressing Ctrl + U when you have opened the link in a browser.

The actual data about the news articles is stored under the <item></item> tags.

For instance, a triple should be something of the form: news_paper_id - hasTitle - titleValue

Do this by parsing the XML using ElementTree (see import above).

I recommend starting with the code at the bottom of the page and continuing from it. This code retrieves the XML using an HTTP request and saves it to an XML file, so that you can view and parse it easily.

You can use this regex (string matcher) to get only the IDs from the full URL that is in the <guid> data.

<syntaxhighlight>
news_id = re.findall(r'\d+$', news_id)[0]
</syntaxhighlight>
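To make the shape of the output concrete, here is a small sketch of one such triple. The guid and title values are hard-coded stand-ins for what item.find(...) would return from the real feed, and ex.hasTitle / the news_ naming are illustrative choices, not prescribed vocabulary:

<syntaxhighlight lang="python">
import re
from rdflib import Graph, Literal, Namespace

ex = Namespace("http://example.org/")
g = Graph()
g.bind("ex", ex)

# Stand-in values for what one <item> element might contain
guid = "https://www.bbc.co.uk/news/world-12345678"
title = "Example headline"

news_id = re.findall(r'\d+$', guid)[0]      # keep only the trailing digits
article = ex['news_' + news_id]             # e.g. http://example.org/news_12345678
g.add((article, ex.hasTitle, Literal(title)))
</syntaxhighlight>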
 
  
  
 
'''Task 2'''
  
Parse through the fictional XML data below and add the correct journalists as the writers of the news articles from earlier.

This means that e.g. if the news article is written on a Tuesday, Thomas Smith is the one who wrote it.

One way to do this is by checking if any of the days in the "whenWriting" attribute is contained in the news article's "pubDate".

<syntaxhighlight>
<data>
    <news_publisher name="BBC News">
        <journalist whenWriting="Mon, Tue, Wed">
            <firstname>Thomas</firstname>
            <lastname>Smith</lastname>
        </journalist>
        <journalist whenWriting="Thu, Fri">
            <firstname>Joseph</firstname>
            <lastname>Olson</lastname>
        </journalist>
        <journalist whenWriting="Sat, Sun">
            <firstname>Sophia</firstname>
            <lastname>Cruise</lastname>
        </journalist>
    </news_publisher>
</data>
</syntaxhighlight>
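One possible way to do the day matching described above. The pub_date value is a made-up example in the usual RSS pubDate format; with the real feed it would come from the article's <pubDate> element:

<syntaxhighlight lang="python">
import xml.etree.ElementTree as ET

journalists = """
<data>
    <news_publisher name="BBC News">
        <journalist whenWriting="Mon, Tue, Wed">
            <firstname>Thomas</firstname>
            <lastname>Smith</lastname>
        </journalist>
        <journalist whenWriting="Thu, Fri">
            <firstname>Joseph</firstname>
            <lastname>Olson</lastname>
        </journalist>
    </news_publisher>
</data>"""

pub_date = "Tue, 24 Mar 2020 12:00:00 GMT"   # made-up example of a feed pubDate
day = pub_date[:3]                           # RSS pubDates start with the weekday

root = ET.fromstring(journalists)
writer = None
for journalist in root.findall('news_publisher/journalist'):
    if day in journalist.get('whenWriting'):
        writer = journalist.find('firstname').text + ' ' + journalist.find('lastname').text
</syntaxhighlight>

Once you have the writer's name, you can add a triple such as article - writtenBy - journalist to the graph from Task 1.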
 
  
  
 
==If You have more Time==

Extend the graph using the PROV vocabulary to describe Agents and Entities.

For instance, we want to say that the news articles originate from BBC, and that the journalists act on behalf of BBC.
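A minimal sketch of what such PROV statements could look like. The resource names (ex.BBC_News, ex.Thomas_Smith, ex.news_12345678) are illustrative stand-ins for the resources you created in the tasks, not prescribed identifiers:

<syntaxhighlight lang="python">
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

ex = Namespace("http://example.org/")
prov = Namespace("http://www.w3.org/ns/prov#")
g = Graph()
g.bind("ex", ex)
g.bind("prov", prov)

# Agents: the publisher and a journalist acting on its behalf
g.add((ex.BBC_News, RDF.type, prov.Agent))
g.add((ex.Thomas_Smith, RDF.type, prov.Agent))
g.add((ex.Thomas_Smith, prov.actedOnBehalfOf, ex.BBC_News))

# Entity: a news article attributed to the publisher
g.add((ex.news_12345678, RDF.type, prov.Entity))
g.add((ex.news_12345678, prov.wasAttributedTo, ex.BBC_News))
</syntaxhighlight>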
 
  
  
 
==Code to Get Started==

<syntaxhighlight>
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD
import xml.etree.ElementTree as ET
import requests
import re

g = Graph()
ex = Namespace("http://example.org/")
prov = Namespace("http://www.w3.org/ns/prov#")
g.bind("ex", ex)
g.bind("prov", prov)

# URL of the XML data
url = 'http://feeds.bbci.co.uk/news/rss.xml'

# Retrieve the XML data from the web URL
resp = requests.get(url)

# Save the XML data to a .xml file
with open('news.xml', 'wb') as f:
    f.write(resp.content)
</syntaxhighlight>
 
  
  
  
 
==Useful Reading==
* [https://www.geeksforgeeks.org/xml-parsing-python/ XML-parsing-python by geeksforgeeks.org]
 
* [https://www.w3schools.com/xml/xml_whatis.asp XML information by w3schools.com]
 
* [https://www.w3.org/TR/prov-o/#description PROV vocabulary]
 

Revision as of 07:51, 27 March 2020
