BeautifulSoup
Previously in this unit we have taught you how to construct webpages, including how to construct them so that they present structured information (e.g., the contents of a database). The presumption has generally been that you are designing webpages for (usually visual) consumption by a human using a standard web browser, or at most writing Javascript functions that will help a human use a webpage in a particular way.
However, there are situations where you may need to write code that interacts with webpages in an automated fashion -- perhaps to extract some data from webpages that isn't available in a more machine-friendly format. This scraping of webpages is a valuable (if sometimes maliciously applied) skill that can be useful to many kinds of software engineers and data scientists. This exercise talks you through some applications of one of the most common frameworks, a Python library called BeautifulSoup.
Setup
First, you want to install the BeautifulSoup library, which you can achieve with sudo apt install python3-bs4. Python does have its own package manager, pip, but it would get in the way of our system-wide package management to use it here.
You can test that Python is working by just typing python3 at the command line -- it should launch the interactive interpreter, which will prompt you for Python code with >>>. Type from bs4 import BeautifulSoup. If this completes without an error then you've successfully imported the library.
Interacting with Soup
Next, enter something like the following:
file = "cattax/index.html"
soup = BeautifulSoup(open(file, 'r'))
(You may need to change the 'file' line to reflect where the cattax/index.html file really is relative to where you launched python3. Also, if you close the interpreter at any point, remember you'll need to re-run the import line above to get the library.)
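Depending on your version of the library, you may also see a warning that no parser was explicitly specified. You can avoid this by naming one yourself -- a minimal sketch, assuming the same cattax/index.html path as above:
from bs4 import BeautifulSoup
# Name the parser explicitly to avoid the "no parser was explicitly specified" warning.
file = "cattax/index.html"   # adjust to wherever cattax really is
soup = BeautifulSoup(open(file, 'r'), 'html.parser')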
You now have a 'soup' object, which is a Python object that has various methods for interacting with HTML (and XML) documents and accessing their contents. To start with, just type soup in your interpreter. Python will print out the basic textual representation of the object, which is just the source of the index.html page. Next, let's attempt one of the most common use cases for scraping.
soup.get_text()
You should see a string containing the textual content of the webpage. If you call print on the result, you should see that the text has been laid out in a fair approximation of how the visible text would appear on a webpage.
text = soup.get_text()
print(text)
Getting the visible text out of a page is a common requirement if, e.g., you were going to use the webpage as input to an NLP system. You can also access non-visible portions of the page -- soup.title will give you the title element of the webpage, and soup.title.text will get you the textual content within the title element. Note the difference: soup.title is a BeautifulSoup element (of type Tag), and has methods associated with being a tag; soup.title.text is just a string, and only has methods that apply to strings.
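You can see this difference for yourself by checking the types -- a quick sketch, continuing with the soup object from above:
title_tag = soup.title         # a bs4 Tag, so tag methods like .find_all() are available
title_text = soup.title.text   # a plain Python string
print(type(title_tag))         # <class 'bs4.element.Tag'>
print(type(title_text))        # <class 'str'>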
As HTML documents are structured, you can interact with them in a structured manner. soup.head will get you a soup object reflecting the 'head' part of the HTML structure, and soup.head.findChildren() will return a list containing all of the 'children' elements inside the head. By knowing the structure of the document you can thus navigate to certain elements programmatically. This doesn't just apply to tags, either: you can also access the values of attributes. soup.head.meta['charset'] would access the charset attribute of the meta tag in the head of the document.
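Putting these together, here is a small sketch of navigating the head programmatically (this assumes, as above, that the page's head contains a meta tag with a charset attribute):
head = soup.head
for child in head.findChildren():   # every element nested inside <head>
    print(child.name)               # e.g. 'meta', 'title', ...
print(head.meta['charset'])         # the value of the charset attribute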
Exercises
- Take a look at some other examples from the BeautifulSoup documentation, in particular regarding the use of the .find_all() method (a generic sketch of a .find_all() call also appears after this list).
- Use your interpreter to access a list of all the <strong> elements in the webpage, and figure out how to print out just the text contained within them.
- How would you use .find_all() to find all <div> elements with a particular class value (e.g., 'container')? Would your method work if the div had multiple classes?
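As a starting point for these exercises, the general shape of a .find_all() call looks like the sketch below, using <a> elements rather than the ones asked about above:
links = soup.find_all('a')   # a list of every <a> Tag in the page
for link in links:
    print(link.text)         # the visible text of each link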
A Scraping Script
Download this Python script to the directory that contains your cattax folder (not into cattax itself). On the command line, you should be able to run this script with python3 scrape.py. You'll see that it prints out a series of lines related to all the files in cattax.
Open scrape.py in an editor and inspect what it is doing. The script imports two libraries: one is BeautifulSoup and the other is os, which allows the script to use certain operating system functions. os.listdir is then used to list the contents of our cattax directory and iterate over them. We filter the filenames by checking which of them end with the string 'html', and if they do then we open the file and parse it into a BeautifulSoup object. Then we print out text from certain elements, separated by a ':'. Make sure you understand how this works, asking a TA or lecturer for help if you aren't sure.
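If it helps to see the overall shape, the loop in scrape.py is roughly of the following form (a sketch only -- the exact elements the real script prints may differ):
import os
from bs4 import BeautifulSoup
for filename in os.listdir('cattax'):      # every entry in the cattax directory
    if filename.endswith('html'):          # only parse the HTML files
        soup = BeautifulSoup(open(os.path.join('cattax', filename), 'r'), 'html.parser')
        # The real script prints text from certain page elements here; this line is only an illustration.
        print(filename + ':' + soup.title.text)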
Exercises
- Modify scrape.py so that it also prints out the contents of the 'info' paragraph in each page (this can be a second print statement). Run the script again to test that it works.
- Currently the script prints something for every page. Modify it so it only prints something out for the leaf nodes -- those pages that don't have a 'container' element of their own.
- Printing things out can be useful, but often we want to store values we scrape for later programmatic work. Rather than printing out information, create and update a Python dict for all leaf nodes where the dictionary keys are the titles of the pages and the values are the corresponding content of the 'info' box. Run your script with python3 -i scrape.py and it will execute your script and then place you in an interactive session immediately following your script's execution. You can then check that the dictionary's contents are what you expect by interacting with the dict object in your interpreter (a generic illustration of this workflow appears after these exercises).
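As a generic illustration of the python3 -i workflow (the file name and dict below are hypothetical, not the exercise solution), suppose a script toy.py contains only:
info = {'siamese': 'A short-haired breed of domestic cat.'}   # hypothetical example dict
Running python3 -i toy.py executes the script and then leaves you at the >>> prompt, where typing info['siamese'] will show the stored value.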
When to scrape
As we discuss in this week's lectures, scraping webpages can come with some risks, depending upon how you go about it and what you do with the results. Guidelines you should follow include:
- Always respect the robots.txt presented by a site. Resources denied to crawlers may be protected for a legal or ethical reason.
- Always look for an API first. If a site will provide you with a JSON endpoint to query for the structured data you want, it is both easier and more polite to use that API instead of scraping the same data from webpages designed for human consumption.
- Be very careful about any scraping behind an authenticated log-in, such as from a logged-in social media account. Think about the privacy expectations people have about the content you are collecting, both in legal and ethical terms. If you post publicly something that was intended only for a limited audience, you could be betraying a confidence, and might even face legal repercussions.
- Generally beware that being able to access and read information from the web does not mean you are permitted to republish it, even in a modified form.
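Regarding the first of these guidelines, Python's standard library can read a robots.txt file for you. A minimal sketch (the URL here is only a placeholder):
from urllib import robotparser
rp = robotparser.RobotFileParser()
rp.set_url('https://example.com/robots.txt')
rp.read()                                                        # fetch and parse robots.txt
print(rp.can_fetch('*', 'https://example.com/some/page.html'))   # True if crawling this URL is allowed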
Further reading
You may have noticed that we downloaded our webpages with wget in the previous exercise, and then dealt with their content using Python in this one. If you were writing a Python tool that was meant to do something specific with the content of a webpage, then you would often want to avoid using a second tool, and make web requests directly from your program.
The Python library that is most often recommended for requesting web resources is the Requests library, and getting familiar with it would be useful both if you want to access web pages programmatically and if you need to interact with a website's API.
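For example, fetching a page and handing it straight to BeautifulSoup might look like the sketch below (the URL is a placeholder, and Requests has to be installed separately, e.g. with sudo apt install python3-requests):
import requests
from bs4 import BeautifulSoup
response = requests.get('https://example.com/')        # fetch the page over HTTP
soup = BeautifulSoup(response.text, 'html.parser')     # parse the returned HTML
print(soup.title.text)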
We've also been dealing with the content of static websites, without the kind of dynamic content you learned about in the Javascript exercises. Scraping dynamic sites is trickier, as your scraper needs a browser-like context in which to execute Javascript, and often needs to simulate interacting with the page. Systems like Selenium are designed for cases like this.
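As a very rough sketch of that idea (assuming Selenium and a matching browser driver are installed, which is not part of this exercise):
from selenium import webdriver
from bs4 import BeautifulSoup
driver = webdriver.Firefox()                               # a real browser, so the page's Javascript runs
driver.get('https://example.com/')                         # placeholder URL
soup = BeautifulSoup(driver.page_source, 'html.parser')    # the rendered HTML after scripts have run
driver.quit()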