=Lab 10: Semantic Lifting - XML=

==Link to Discord server==
==Topics==
Today's topic involves lifting data in XML format into RDF.

XML stands for Extensible Markup Language and is commonly used for data storage and transfer, especially on the web.

XML has a tree structure similar to HTML, consisting of a root element, child and parent elements, attributes and so on.

The goal is for you to learn an example of how we can convert unsemantic data into RDF.
==Relevant Libraries/Functions==

import requests

import xml.etree.ElementTree as ET

* ET.parse('xmlfile.xml')

All parts of the XML tree are considered '''Elements'''.
* Element.getroot()
* Element.findall("path_in_tree")
* Element.find("name_of_tag")
* Element.text
* Element.attrib["name_of_attribute"]
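As a minimal sketch of how these functions fit together (the file name, tags, and values below are made up for illustration):

```python
import xml.etree.ElementTree as ET

# A small, made-up XML document, written to a file so ET.parse can read it
xml_data = """<catalog>
    <book id="b1">
        <title>Semantic Web Primer</title>
    </book>
    <book id="b2">
        <title>Learning RDF</title>
    </book>
</catalog>"""

with open('example.xml', 'w') as f:
    f.write(xml_data)

tree = ET.parse('example.xml')  # ET.parse returns an ElementTree
root = tree.getroot()           # getroot() gives the root element: <catalog>

for book in root.findall('book'):    # all <book> children of the root
    book_id = book.attrib['id']      # attrib is a dict of the attributes
    title = book.find('title').text  # .text is the text inside <title>
    print(book_id, title)            # prints "b1 Semantic Web Primer", then "b2 Learning RDF"
```

Note that getroot() is called on the ElementTree returned by ET.parse(), and that attrib is a plain dictionary, so attributes are read with square brackets or Element.get().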
'''Task 1'''

'''Lift the XML data from http://feeds.bbci.co.uk/news/rss.xml about news articles by BBC_News into RDF triples.'''

You can look at the actual XML structure of the data by pressing Ctrl+U when you have opened the link in your browser.
The actual data about the news articles is stored within the <item></item> tags.

For instance, a triple could be of the form: news_paper_id - hasTitle - titleValue

Do this by parsing the XML using ElementTree (see import above).

I recommend starting with the code at the bottom of the page and building on it. This code retrieves the XML with an HTTP request and saves it to an XML file, so that you can view and parse it easily.

You can use this regex (string matcher) to get only the IDs from the full URL in the <guid> data.
<syntaxhighlight>
news_id = re.findall(r'\d+$', news_id)[0]
</syntaxhighlight>
'''Task 2'''

Parse the fictional XML data below and add the correct journalists as the writers of the news articles from earlier.

This means that, for example, if a news article was written on a Tuesday, Thomas Smith is the one who wrote it.

One way to do this is to check whether any of the days in the "whenWriting" attribute is contained in the news article's "pubDate".
<syntaxhighlight>
<data>
    <news_publisher name="BBC News">
        <journalist whenWriting="Mon, Tue, Wed">
            <firstname>Thomas</firstname>
            <lastname>Smith</lastname>
        </journalist>
        <journalist whenWriting="Thu, Fri">
            <firstname>Joseph</firstname>
            <lastname>Olson</lastname>
        </journalist>
        <journalist whenWriting="Sat, Sun">
            <firstname>Sophia</firstname>
            <lastname>Cruise</lastname>
        </journalist>
    </news_publisher>
</data>
</syntaxhighlight>
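A possible sketch of the day-matching logic, using a made-up example pubDate; connecting the matched journalist to an actual news article in the graph is the task itself:

```python
import xml.etree.ElementTree as ET

# The fictional journalist data from Task 2
journalist_xml = """<data>
    <news_publisher name="BBC News">
        <journalist whenWriting="Mon, Tue, Wed">
            <firstname>Thomas</firstname>
            <lastname>Smith</lastname>
        </journalist>
        <journalist whenWriting="Thu, Fri">
            <firstname>Joseph</firstname>
            <lastname>Olson</lastname>
        </journalist>
        <journalist whenWriting="Sat, Sun">
            <firstname>Sophia</firstname>
            <lastname>Cruise</lastname>
        </journalist>
    </news_publisher>
</data>"""

pub_date = "Tue, 01 Mar 2022 10:00:00 GMT"  # made-up example pubDate

root = ET.fromstring(journalist_xml)
writer = None
for journalist in root.findall('./news_publisher/journalist'):
    days = journalist.attrib['whenWriting'].split(', ')
    if pub_date[:3] in days:  # an RSS pubDate starts with the weekday abbreviation
        writer = journalist.find('firstname').text + ' ' + journalist.find('lastname').text

print(writer)  # prints "Thomas Smith"
```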
==If You have more Time==
Extend the graph using the PROV vocabulary to describe Agents and Entities.

For instance, we want to say that the news articles originate from BBC,
and that the journalists act on behalf of BBC.
==Code to Get Started==
<syntaxhighlight>
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, XSD
import xml.etree.ElementTree as ET
import requests
import re

g = Graph()
ex = Namespace("http://example.org/")
prov = Namespace("http://www.w3.org/ns/prov#")
g.bind("ex", ex)
g.bind("prov", prov)

# URL of the XML data
url = 'http://feeds.bbci.co.uk/news/rss.xml'

# Retrieve the XML data from the web URL
resp = requests.get(url)

# Save the XML data to a .xml file
with open('news.xml', 'wb') as f:
    f.write(resp.content)
</syntaxhighlight>
==Useful Reading==
* [https://www.geeksforgeeks.org/xml-parsing-python/ XML-parsing-python by geeksforgeeks.org]
* [https://www.w3schools.com/xml/xml_whatis.asp XML information by w3schools.com]
* [https://www.w3.org/TR/prov-o/#description PROV vocabulary]