From Info216

Revision as of 08:26, 27 March 2020

Lab 10: Semantic Lifting - HTML

Link to Discord server

https://discord.gg/t5dgPrK

Topics

Today's topic involves lifting data in HTML format into RDF. HTML stands for HyperText Markup Language and is used to describe the structure and content of websites. HTML has a tree structure, consisting of a root element, child and parent elements, attributes and so on. The goal is for you to learn an example of how we can convert non-semantic data into RDF.
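To make the tree structure concrete, here is a minimal sketch using BeautifulSoup on a small invented HTML snippet (the snippet mirrors the class names used later in this lab but is not the real page):

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document to illustrate the tree structure.
doc = """
<html>
  <body>
    <h1 class="entity-name">Knowledge Graph</h1>
    <div class="flex-container">
      <div class="timeline-paper-title">A paper title</div>
    </div>
  </body>
</html>
"""

html = BeautifulSoup(doc, features="html.parser")

h1 = html.body.h1
print(h1.text)         # the element's text content
print(h1["class"])     # its attributes, e.g. ['entity-name']
print(h1.parent.name)  # the parent element: 'body'
```

Navigating with `.parent`, `.children` and attribute lookups like this is how BeautifulSoup exposes the HTML tree.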


Relevant Libraries/Functions

from bs4 import BeautifulSoup



Tasks

Task 1

pip install beautifulsoup4

Lift the HTML information about research articles found on this link into triples: "https://www.semanticscholar.org/topic/Knowledge-Graph/159858"

Each paper will be represented by its Corpus ID (the subject of the triples). For example, a paper has a title, a year, authors and so on.

To parse the HTML, we will use BeautifulSoup.

I recommend right-clicking on the web page itself and clicking 'Inspect' to get a readable version of the HTML.

You can then hover over the HTML tags in the inspector panel to easily find information such as the ID of a paper.

For example, we can see that the main topic of the page, "Knowledge Graph", is inside an <h1> tag with the class attribute "entity-name". Knowing this, we can use BeautifulSoup to find it in Python, e.g.:
topic = html.body.find('h1', attrs={'class': 'entity-name'}).text
Similarly, to find multiple values at once, we use find_all instead. For example, here I am selecting all the papers, which I can then iterate through:
papers = html.body.find_all('div', attrs={'class': 'flex-container'})
for paper in papers:
    # e.g. selecting the title.
    title = paper.find('div', attrs={'class': 'timeline-paper-title'})
    print(title.text)
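If you want to check your selectors without depending on the live page, you can run the same find_all pattern against an inline snippet first. The snippet below is invented, mirroring the class names used above:

```python
from bs4 import BeautifulSoup

# Invented HTML reusing the class names from the real page.
snippet = """
<body>
  <div class="flex-container">
    <div class="timeline-paper-title">First paper</div>
  </div>
  <div class="flex-container">
    <div class="timeline-paper-title">Second paper</div>
  </div>
</body>
"""

html = BeautifulSoup(snippet, features="html.parser")
papers = html.body.find_all('div', attrs={'class': 'flex-container'})
titles = [p.find('div', attrs={'class': 'timeline-paper-title'}).text
          for p in papers]
print(titles)
```

Once the selectors behave as expected here, the same calls should work on the downloaded page.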


Task 2


If You Have More Time

Code to Get Started

from bs4 import BeautifulSoup as bs
from rdflib import Graph, Literal, URIRef, Namespace
from rdflib.namespace import RDF, OWL, SKOS
import requests
from selenium import webdriver  # optional: a browser driver, useful if the page needs JavaScript to render

g = Graph()
ex = Namespace("http://example.org/")
g.bind("ex", ex)

# Download html from URL and parse it with BeautifulSoup.
url = "https://www.semanticscholar.org/topic/Knowledge-Graph/159858"
page = requests.get(url)
html = bs(page.content, features="html.parser")
# print(html.prettify())

# This is the topic of the webpage: "Knowledge graph".
topic = html.body.find('h1', attrs={'class': 'entity-name'}).text
print(topic)

# Find the html that surrounds all the papers
papers = html.body.find_all('div', attrs={'class': 'flex-container'})

# Iterate through each paper to make triples:
for paper in papers:
    # e.g. selecting the title.
    title = paper.find('div', attrs={'class': 'timeline-paper-title'})
    print(title.text)



Useful Reading