=Lab 11: Semantic Lifting - HTML=

From Info216
 
==Topics==
Today's topic involves lifting data in HTML format into RDF.
HTML stands for HyperText Markup Language and is used to describe the structure and content of websites.

HTML has a tree structure, consisting of a root element, child and parent elements, attributes, and so on.
The goal is for you to learn an example of how we can convert unsemantic data into RDF.

To parse the HTML, we will use the Python library BeautifulSoup.
 
==Relevant Libraries/Functions==

*from bs4 import BeautifulSoup as bs
*import requests
*import re

*beautifulsoup.find()
*beautifulsoup.find_all()
*string.replace(), string.split()
*re.findall()
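To see how these functions behave before touching the real page, here is a small self-contained example. The HTML snippet is invented, but it mimics the class names used on the page:

```python
from bs4 import BeautifulSoup as bs
import re

# A stand-alone snippet in the same shape as the page's markup,
# so the example runs without a network connection:
doc = """
<html><body>
  <h1 class="entity-name">Knowledge Graph</h1>
  <div class="flex-container"><div class="timeline-paper-title">Paper A</div></div>
  <div class="flex-container"><div class="timeline-paper-title">Paper B</div></div>
</body></html>
"""
html = bs(doc, features="html.parser")

# find() returns the first matching tag; .text gives its text content.
topic = html.find('h1', attrs={'class': 'entity-name'}).text
print(topic)  # Knowledge Graph

# find_all() returns a list of every matching tag.
papers = html.find_all('div', attrs={'class': 'flex-container'})
print(len(papers))  # 2

# re.findall() with the pattern \d+$ grabs the digits at the end of a string.
topic_id = re.findall(r'\d+$', "https://www.semanticscholar.org/topic/Knowledge-Graph/159858")[0]
print(topic_id)  # 159858
```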
==Tasks==

'''Task 1'''

'''pip install beautifulsoup4'''

'''Lift the HTML information about research articles found on this link into triples: "https://www.semanticscholar.org/topic/Knowledge-Graph/159858" '''

The papers will be represented with their Corpus ID (the subject of the triples).
For example, a paper has a title, a year, authors, and so on.
  
For parsing of the HTML, we will use BeautifulSoup.

I recommend right-clicking on the web page itself and clicking 'Inspect' in order to get a readable version of the HTML.

Now you can hover over the HTML tags on the right side to easily find information like the ID of the paper.

For example, we can see that the main topic of the page, "Knowledge Graph", is under an 'h1' tag with the attribute class "entity-name".

Knowing this, we can use BeautifulSoup to find it in Python code, e.g.:
 
<syntaxhighlight>
topic = html.find('h1', attrs={'class': 'entity-name'}).text
</syntaxhighlight>

Similarly, to find multiple values at once, we use find_all instead. E.g., here I am selecting all the papers, which I can then iterate through:
  
<syntaxhighlight>
papers = html.find_all('div', attrs={'class': 'flex-container'})
for paper in papers:
    # e.g. selecting the title.
    title = paper.find('div', attrs={'class': 'timeline-paper-title'})
    print(title.text)
</syntaxhighlight>
  
 +
You can use this regex to extract the numeric ID from the Corpus ID, or from the topic ID (which is in the URL):

<syntaxhighlight>
id = re.findall(r'\d+$', id)[0]
</syntaxhighlight>
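For instance, applied to the two kinds of strings you will meet on the page (the Corpus ID value below is invented for illustration):

```python
import re

# The topic ID is the trailing digits of the page URL:
topic_url = "https://www.semanticscholar.org/topic/Knowledge-Graph/159858"
topic_id = re.findall(r'\d+$', topic_url)[0]
print(topic_id)  # 159858

# The same pattern strips the prefix from a "Corpus ID: ..." label
# (this particular ID value is made up):
corpus_label = "Corpus ID: 123456789"
corpus_id = re.findall(r'\d+$', corpus_label)[0]
print(corpus_id)  # 123456789
```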
==Task 2==
Create triples for the Topic of the page ("Knowledge Graph").

For example, a topic has related topics (on the top-right of the page). It also has "known as" values and a description.

This is a good opportunity to use the SKOS vocabulary to describe Concepts.
  
  
 
==If You have more Time==
If you look at the web page, you can see that there are buttons for expanding the description, related topics, and more.

This is a problem, as BeautifulSoup won't find this additional information until these buttons are pressed.

Use the Python library '''selenium''' to simulate a user pressing the 'expand' buttons to get all the triples you should get.
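A sketch of how that could look. The XPath locator and webdriver.Chrome() are assumptions; check the real button markup with 'Inspect', and note that selenium also needs a matching browser driver installed:

```python
def get_expanded_html(url):
    """Open the page in a real browser, click every 'expand'-style button,
    and return the fully rendered HTML for BeautifulSoup to parse.

    A sketch only: the XPath below is an assumption about the button text,
    not taken from the actual page source."""
    # Imports are kept inside the function so the sketch can be read
    # (and the function defined) without selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from bs4 import BeautifulSoup as bs

    driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
    driver.get(url)
    # Click every button whose text contains 'Expand' (assumed label):
    for button in driver.find_elements(By.XPATH, "//button[contains(., 'Expand')]"):
        button.click()
    # Parse the now fully expanded page source with BeautifulSoup:
    html = bs(driver.page_source, features="html.parser")
    driver.quit()
    return html
```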
  
  
==Code to Get Started (Make sure you understand it)==

<syntaxhighlight>
from bs4 import BeautifulSoup as bs
from rdflib import Graph, Literal, URIRef, Namespace
from rdflib.namespace import RDF, OWL, SKOS, RDFS, XSD
import requests
import re

g = Graph()
ex = Namespace("http://example.org/")
g.bind("ex", ex)

# Download the HTML from the URL and parse it with BeautifulSoup.
url = "https://www.semanticscholar.org/topic/Knowledge-Graph/159858"
page = requests.get(url)
html = bs(page.content, features="html.parser")
# print(html.prettify())

# Find the html that surrounds all the papers.
papers = html.find_all('div', attrs={'class': 'flex-container'})

# Iterate through each paper to make triples:
for paper in papers:
    # e.g. selecting the title.
    title = paper.find('div', attrs={'class': 'timeline-paper-title'}).text
    print(title)
</syntaxhighlight>
 
 
 
  
  
 
==Useful Reading==
* [https://www.dataquest.io/blog/web-scraping-tutorial-python/ Dataquest.io - Web-scraping with Python]
 

Latest revision as of 09:57, 6 April 2021