(COMS10012 / COMSM0085)
Welcome back – two weeks to go:
Week 10: Encryption
Previously: writing and serving websites, presenting data in a website.
Now: collecting webpages, extracting data from them.
The web is a massive data resource – why would we only access it manually?
Approximately half of web traffic is not human.
Robots visit websites for a variety of reasons:
Depending on the usage, a crawler is also called a ‘spider’ – because it traverses the web:
A crawler is an HTTP client, like a browser, but automated. Two approaches:
Remember wget <url>?
wget follows the Unix philosophy: ‘do one thing well’.
So far you’ve only used the bare minimum of wget’s capabilities.
wget https://news.ycombinator.com/news
vs
wget -p https://news.ycombinator.com/news
The -p (--page-requisites) flag also downloads the resources required to display the page correctly: images, stylesheets, and so on.
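The alternative to an off-the-shelf tool like wget is writing your own client. A minimal sketch using only Python’s standard library (reusing the Hacker News URL from above; some sites may refuse the default urllib user-agent):

from urllib.request import urlopen
# Fetch a page under program control, much as a browser would
url = "https://news.ycombinator.com/news"
with urlopen(url) as response:
    html = response.read().decode('utf-8')
print(html[:200])  # first 200 characters of the page source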
robots.txt: a standard that websites use to tell HTTP or FTP crawlers which parts of a site can be accessed.
List of rules for which parts of the site a crawler can access:
User-Agent: foobot
Allow: /example/page/
Disallow: /example/page/disallowed.gif
User-Agent: bazbot
Disallow: /example/page.html
Crawlers are meant to check for their own user-agent string in the file (always placed in the webroot) and follow the rules.
More of an ‘honour system’ to enable good bots to respect the wishes of website owners.
wget is a good bot.
You should also write good bots!
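As a sketch of how your own bot can follow the same honour system, Python’s standard library includes a robots.txt parser (the site and the ‘foobot’ user-agent are the made-up examples from above):

from urllib.robotparser import RobotFileParser
# robots.txt always sits in the webroot of the site
rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()
# Ask whether our user-agent may fetch a particular URL
if rp.can_fetch("foobot", "https://example.com/example/page/disallowed.gif"):
    print("allowed to fetch this")
else:
    print("robots.txt asks us not to fetch this")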
elinks danluu.com
In the labs you’ll practise recursive downloading (wget -r) and true web mirroring (wget --mirror) using wget.
Key use of this: make a personal offline backup of a website you like.
What else can we do with copies of websites?
You have a generally unrestricted right to access public web content.
This doesn’t mean you can do anything you want with it.
The means of accessing content can also be important – aggressively downloading a large site places a burden on its servers, and servers sometimes respond by blacklisting offending clients.
Generally ‘polite’ to introduce small delays between requests, even if not asked for.
API endpoints designed for automated access may impose different rate limits.
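A sketch of what ‘polite’ crawling can look like in code (the URLs are placeholders and the one-second pause is an arbitrary choice):

import time
from urllib.request import urlopen
# Hypothetical list of pages on the same site
urls = [
    "https://example.com/page1.html",
    "https://example.com/page2.html",
]
for url in urls:
    with urlopen(url) as response:
        data = response.read()
    print(url, len(data), "bytes")
    time.sleep(1)  # small delay between requests, out of politeness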
Webpages often present structured information which we would like to work with.
However, page structures can be complex, and are very site-specific.
Need a system for accessing page content programmatically.
JavaScript has methods for this within the browser, but often we want our code to run on its own.
Beautiful Soup: a Python library for extracting data from HTML files.
The current version is Beautiful Soup 4 (imported as bs4).
from bs4 import BeautifulSoup
# Read a saved copy of the page (e.g. downloaded with wget)
filename = "news.html"
with open(filename, 'r') as handle:
    text = handle.read()
# Parse the HTML into a navigable tree, naming a parser explicitly
soup = BeautifulSoup(text, 'html.parser')
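Once the page is parsed, the soup object can be searched and navigated; for example (the tag and attribute names here are standard HTML, nothing site-specific):

print(soup.title.string)       # text of the page's <title> element
first_link = soup.find('a')    # first <a> tag in the document
print(first_link.get('href'))  # value of its href attribute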
We don’t make Python itself a focus for this unit.
d = {}                 # an empty dictionary
d['this'] = 'that'     # add key/value pairs one at a time
d['other'] = 'value'
d = {'a': 1, 'b': 2}   # or build one with literal syntax
l = ['a']              # a list with a single element
l.append('b')          # append items to the end
l.append('c')
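The reason for the refresher: extracted page data usually ends up in exactly these structures. A small sketch reusing the soup object from earlier, collecting every link into a list of dictionaries:

links = []
for a in soup.find_all('a'):
    links.append({'text': a.get_text(), 'href': a.get('href')})
print(len(links), "links found")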
This week’s focus: wget, and understanding how to control its behaviour.