Collecting PDFs

As part of my orientation at the Internet Archive, I was given a list of activities as the “Welcome Treasure Hunt”, intended to help new staff learn about the organisation by completing hands-on tasks. One of them was to create a test Archive-It collection.

Archive-It is the web archiving service offered by the Internet Archive that allows organisations to preserve digital content from the web. It comes with a web application for selecting, describing and harvesting websites of interest to partners.

I used the “PDF only” feature in Archive-It and tested this on two websites:

  1. A central place where UK government departments publish policies, announcements, statistics and consultations. The website has a dedicated section for publications, currently containing over 80,000 documents. I used that section as the seed (starting point) for the crawl, as I know that is where most of the official publications can be found. I set it up as a one-off crawl with a time limit of 3 days. The crawl stopped when the time limit was reached and brought back 19,728 PDFs.
  2. The Barnet Council website, a north London local council website, which has no specific section for official publications, although a search for “PDF” brings up over 20,000 results spread across different locations on the site. For this crawl, I used the top level of the site as the seed, which is essentially a site-wide crawl. I looked at the site and noticed a few directories where PDFs seem to reside, and I could have used regular expressions to limit the crawl to those directories. Without knowing for sure how the site is structured, however, it was much easier to crawl the entire site and save only the PDFs. I limited the crawl to one day, which brought back 2,647 PDFs hosted on different parts of the Barnet Council’s website.
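
To illustrate the trade-off between scoping a crawl by directory and crawling site-wide, here is a minimal sketch in Python. It is not Archive-It's own scoping configuration, only an illustration of the idea; the host `www.example.gov.uk`, the directory names and the sample URLs are all hypothetical placeholders rather than paths from the sites I crawled.

```python
import re
from urllib.parse import urlsplit

# Hypothetical directories where PDFs seem to live (placeholders only).
SCOPE = re.compile(r"^https?://www\.example\.gov\.uk/(downloads|documents)/")


def in_scope(url: str) -> bool:
    """Return True if the URL falls under one of the chosen directories."""
    return SCOPE.match(url) is not None


def looks_like_pdf(url: str) -> bool:
    """Return True if the URL path ends in .pdf (case and query string ignored)."""
    return urlsplit(url).path.lower().endswith(".pdf")


# Made-up sample URLs, for illustration only.
urls = [
    "https://www.example.gov.uk/downloads/budget-2015.pdf",
    "https://www.example.gov.uk/news/press-release.html",
    "https://www.example.gov.uk/planning/local-plan.PDF?version=2",
]

# Directory-scoped approach: only PDFs under the chosen directories.
scoped = [u for u in urls if in_scope(u) and looks_like_pdf(u)]

# Site-wide approach: visit everything, keep only the PDFs.
site_wide = [u for u in urls if looks_like_pdf(u)]
```

In this made-up sample the scoped filter misses the PDF under `/planning/`, which is the risk of guessing at directories without knowing how a site is structured. The sketch also relies on the file extension alone; a real crawler would more likely check the content type the server returns.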

The underlying use case is collecting “official publications”, which are “materials issued for public use by federal, national, provincial or municipal governments and intergovernmental organisations. They include parliamentary papers, debates and proceedings, and publications of government departments, agencies and research institutes on any subject.” [1] Libraries around the world have a long tradition of collecting and providing access to these public documents, which are increasingly published in digital formats such as PDFs.

Bearing the use case in mind, I thought a suitable tool should at least meet the following basic requirements:

  1. Select documents to crawl
  2. Define and configure crawl (e.g. frequency and limits)
  3. Collect documents of interest
  4. Quality control
  5. Describe the documents
  6. Access the documents
  7. Export documents and metadata (e.g. for integration in local access tools)

Archive-It meets all the requirements above and the user interface is fairly straightforward to use. I find the PDF-only feature particularly useful because it supports a specific use case and allows me to collect and curate the documents I care about. I can choose to crawl an entire site, in case I don’t know where the PDFs sit on the site, and save the PDFs – only these would count towards my personal data quota.

The reports on my crawls were comprehensive, including information on hosts discovered and crawled, URLs missed (with the possibility of running a patch crawl), file types and so on. After 24 hours, I could also access the archived URLs in Wayback and, if I chose to make my collection openly available, on the public Archive-It site.

I could add metadata to each of the PDF documents using Dublin Core, and add any custom metadata field. I could also download the PDFs as WARCs, and the metadata in XML/JSON.
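
As a rough illustration of what describing a PDF with Dublin Core plus a custom field might look like once serialised as JSON, here is a minimal Python sketch. The record layout (`dublin_core` and `custom` keys), the `build_record` helper and the sample values are my own assumptions for illustration; this is not Archive-It's actual export format.

```python
import json
from typing import Optional

# The 15 elements of the Dublin Core Metadata Element Set.
DC_ELEMENTS = {
    "contributor", "coverage", "creator", "date", "description",
    "format", "identifier", "language", "publisher", "relation",
    "rights", "source", "subject", "title", "type",
}


def build_record(dublin_core: dict, custom: Optional[dict] = None) -> dict:
    """Combine Dublin Core fields and custom fields into one record.

    Hypothetical helper: rejects keys that are not Dublin Core elements,
    so that custom fields are kept separate from the standard ones.
    """
    unknown = set(dublin_core) - DC_ELEMENTS
    if unknown:
        raise ValueError(f"not Dublin Core elements: {sorted(unknown)}")
    return {"dublin_core": dict(dublin_core), "custom": dict(custom or {})}


# Made-up sample values, for illustration only.
record = build_record(
    {"title": "Example annual report", "format": "application/pdf", "type": "Text"},
    custom={"department": "Example Department"},
)

# Serialise the record as JSON, e.g. for loading into a local access tool.
exported = json.dumps(record, indent=2, sort_keys=True)
```

Keeping the custom fields in their own block is a design choice of this sketch: it makes it easy to tell standard elements from local additions when the metadata is loaded into another tool.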

Another useful feature is that I could give the crawler a user name and password so that it can go behind a login to collect PDFs. However, I had no means of testing this, as I no longer have access to websites that keep official publications behind logins.


