Crawling the Web
So far in this unit we've had you use wget for only one of its simplest
use-cases: downloading a single file from the web. Consistent with the Unix
philosophy that tools should 'do one thing well', wget does that one thing,
retrieving content over the web, very thoroughly, and is capable of a lot more
than this.
To demonstrate this, we're going to have you first run a server to deploy a
website locally, and then test out various wget options by connecting to that
server via localhost. This of course doesn't mean that wget can only do these
things via localhost -- it's designed to work with real websites -- but we
decided that getting ~200 students to test out web-crawling on a particular live
website was probably not a great idea, so we're having each of you run your own.
Set up the server
As with the HTTP exercises, it would be best to either carry out these steps directly on a lab machine or on your own machine (optionally from the Debian Linux VM with ports forwarded). If you are using a lab machine via SSH instead then you'll need to open a second SSH session to the same lab machine in another terminal, to act as a client for the next parts of the exercise. However, we'll periodically invite you to check things in your browser, which is a lot simpler if you're not trying to use a remote machine.
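If you do end up working over SSH but would still like to use your local browser, one approach (a sketch only; substitute your own username and the lab machine's actual hostname) is to forward the server's port when you connect:
ssh -L 8080:localhost:8080 yourusername@lab-machine
With that tunnel open, visiting localhost:8080 in your local browser will reach port 8080 on the lab machine.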
First, download the webpages to be served by the webserver.
If you like you can even do this using wget:
wget https://cs-uob.github.io/COMSM0085/exercises/part2/resources/cattax.tar.gz
Then extract the contents of the tarball using tar -xzf cattax.tar.gz in the
folder you downloaded it to. This will create a folder cattax which contains some
webpages and resources.
Next, use the darkhttpd server from a previous week's exercises to
serve the content of the cattax folder on localhost:8080. (Refer to the
HTTP week's exercise instructions if you have forgotten how to do this).
You can check that this is working in a browser (unless you are connecting via
SSH) by navigating to localhost:8080/index.html -- you should see a webpage
talking about Felidae. You'll need to leave this server running -- the
simplest way forward would be to open another terminal for the steps below.
(Alternatively: use your shell expertise to figure out how to run the server in
the background without its output interfering with the rest of what you're going
to be doing in this terminal).
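One possible sketch, assuming the darkhttpd binary and the cattax folder are both in your current directory (adjust the paths to match your own setup):
./darkhttpd cattax --port 8080 > server.log 2>&1 &
This redirects the server's output to server.log and puts the job in the background; you can bring it back with fg, or stop it with kill %1 when you're finished.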
Single-page download
To keep your filesystem tidy, we're going to work within a 'client' folder.
We'll be repeatedly downloading files and sometimes deleting them, and you'll
neither want lots of duplicated webpages littering your filesystem nor want to
run rm * in a directory that possibly contains files you don't want deleted.
Make sure you are in the parent directory that contains cattax (i.e., you can
type ls and see the directory cattax in the output) and not inside
cattax itself. Then create a directory and move into it:
mkdir client
cd client
Now we'll start with the simple use of wget you have already become familiar
with:
wget localhost:8080/index.html
This downloads the same index.html as is being served from cattax by
darkhttpd. However, if you open this downloaded file in your browser, you'll
see that there's a sense in which something is missing -- wget has only
downloaded the specific HTML file you requested, and not any of the resources
that the page itself references, like the CSS file -- so the version you open in
your browser from your client directory won't look the same as the version
being served via localhost. This can be desirable default behaviour (we only
asked it to get that page, after all), but if we wanted to download a copy of a
webpage and later read that webpage's copy with the styles and images it was
originally created to contain, we'd need to get wget to also download these
resources.
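If you want to see exactly which resources a page refers to, you can inspect the HTML yourself; for example, this rough pattern pulls the href and src attributes out of the file you just downloaded:
grep -oE '(href|src)="[^"]*"' index.html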
One way to do this would be to manually identify each of the required resources
and download them one-by-one. But this is tedious, repetitive work -- highly
suited to automation -- and moreover, wget can save us the effort. Try the
following, and read the output.
wget -p localhost:8080/index.html
Notice that this time wget downloaded multiple files. It also created a
directory named localhost:8080 to store all the files in. This is helpful
organisation if you're ever using wget to download pages from multiple
different websites -- it stores them under directories named after the domain
you requested them from.
If you read the output carefully you'll notice that as well as the index.html
we requested directly, wget has also downloaded the catstyle.css file
referenced in that page, and another file called robots.txt that you didn't
ask for and which isn't mentioned in index.html. This 'robots' file is part of
a standard for responsible web crawling, and tells crawling tools which parts of
a website they are or aren't allowed to visit. When you use wget to crawl a
webpage or website it will check the site's robots.txt to understand which
resources it may not be allowed to download. You can read more about how these
files are written
here.
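As a generic illustration (not necessarily the file served by our cattax site), a robots.txt might look like this:
User-agent: *
Disallow: /private/
which tells every crawler (User-agent: *) that it should not fetch anything under the /private/ path.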
Open the index.html file from the new localhost:8080 folder that was
created, and you should see that it looks just like the version you got in your
browser by navigating to the URI localhost:8080/index.html. (There are some
cases where this wouldn't be true for a webpage -- wget is sometimes denied
access to resources that are needed to display a page exactly as it appears
in your browser).
Crawling a site
The version of the webpage you downloaded using the previous command still has
one major flaw: the links on the page don't work. Or, rather, the links to the
'Felinae' and 'Pantherinae' pages are broken, because those links are made
relative to the webpage, and the corresponding files don't exist in the client's
folder.  The link to Wikipedia in the page footer does still work, because the
href attribute of that link is set to a full URI.
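To make the distinction concrete, the two kinds of link look roughly like this in the HTML (illustrative markup only; the actual filenames and URLs in cattax may differ). The first is relative, and is resolved against wherever the copy of the page happens to live; the second is a full URI and works no matter where the page is opened from:
<a href="felinae.html">Felinae</a>
<a href="https://en.wikipedia.org/wiki/Felidae">Felidae on Wikipedia</a>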
What do we do if we want to download more than one webpage from a site? Wget
supports something called 'recursive downloading'. Simply put, when used in this
manner it will follow all links internal to a site and download the resources
displayed at those links, storing a copy locally and creating a directory
structure if necessary. One version of this recursion is to use the -r (or
--recursive) option, which downloads all linked pages up to a certain maximum
depth. Try this out:
wget -r -l 1 localhost:8080/index.html
This downloads recursively with the 'level' (the maximum depth of recursion) set to 1. You should see that both the requested index.html and
the two pages linked from that resource have been downloaded, along with
robots.txt. Notice as well that the Wikipedia page has not been downloaded --
it's not hosted at localhost:8080, so wget ignores it, and the link will
work from the downloaded page anyway.  Our two newly-downloaded pages, however,
will contain dead links to other pages, because we limited the depth of
recursion to just one hop.  If we increase this:
wget -r -l 2 localhost:8080/index.html
You'll see that a lot more files get downloaded. These are only very small,
simple web-pages, without many links (contrast, for example, any given Wikipedia
page). Very short recursion depths can capture an awful lot of a domain, and if
you ever tell wget to crawl links without caring about which domain they
belong to, this becomes explosively worse (-l 2 in such a case for our
index.html would involve
downloading everything linked from the Wikipedia page referenced in the footer
-- several hundred resources).  In the case of our cattax website, however,
there are still a few pages that are more than 2 steps away from the index page.
Let's start afresh:
rm -r localhost:8080
wget -m localhost:8080/index.html
The -m flag is designed to provide some sensible defaults for 'mirroring' an
entire website (something you might do if you wanted to keep a copy of it for
offline browsing, or for providing a public backup of a valuable resource). It
sets the recursion level to infinite and checks timestamps before downloading
files, as well as adjusting a few other settings. For many cases
where you might want to download an entire website, this is the flag you would use --
perhaps also with a polite -w 1, which sets a 1-second delay between requests,
to avoid over-burdening the server if the website is large.
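Putting that together, a polite mirror of our local site might look something like this (at the time of writing, the wget manual describes -m as equivalent to -r -N -l inf --no-remove-listing):
wget -m -w 1 localhost:8080/index.html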
Further Exercises
- Read man wget to understand what the -i, --force-html and --spider options do. Download a copy of this webpage (the one you are currently reading) and use wget to test all the links on the page. Are there any broken links?
- Tell wget to use a different user agent string in a request to your server running on localhost. Check what the request looks like to your server.
- How would wget -r -l 1 http://example.com differ from wget -p http://example.com? (Hint: think about external resources).
- Look for 'Recursive Accept/Reject options' in the wget manpage. How would you get wget to crawl pages from multiple different domains?
- Look up what -nc does. What is clobbering, and why would or wouldn't you want to do it?