Meeting the Challenges of Preserving the UK Web

[This article was written for and accepted by DPASSH2015, 1st Annual Conference on Digital Preservation for the Arts, Humanities, and Social Sciences, 25-26 June, 2015, Dublin, Ireland.]

ABSTRACT

Collecting and providing continued access to the UK’s digital heritage is a core purpose for the British Library. An important element of this is the World Wide Web. The British Library started web archiving in 2004, building from scratch the capability of eventually preserving the entire UK web domain. This is required by the non-print Legal Deposit Regulations which came into force in April 2013, charging the Legal Deposit Libraries with capturing, among a wide range of digital publications, the contents of every site carrying the .uk suffix (and more), preserving the material and making it accessible in the Legal Deposit Libraries’ reading rooms.

The paper provides an overview of the key challenges related to archiving the UK web, and the approaches the British Library has taken to meet these challenges. Specific attention will be given to issues such as the “right to be forgotten” and the treatment for social networks. The paper will also discuss the access and scholarly use of web archives, using the Big UK Domain Data for Arts and Humanities project as an example.

Keywords

Web Archiving, Non-print Legal Deposit, Digital Preservation, Big data, Scholarly use, Digital Humanities.

1.    INTRODUCTION

Web archiving was initiated by the Internet Archive in the mid-1990s, followed by memory institutions including national libraries and archives around the world. Web archiving has now become a mainstream digital heritage activity. Many countries have expanded their existing mandatory deposit schemes to include digital publications and passed regulations to enable systematic collection of the national web domain. A recent survey identified 68 web archiving initiatives and estimated that 534 billion files (measuring 17PB) had been archived since 1996. [2]

Non-Print Legal Deposit (NPLD) Regulations became effective in the UK in April 2013, applying to all digitally published and on line work. NPLD is a joint responsibility of publishers and Legal Deposit Libraries (LDLs). An important requirement is that access to NPLD content is restricted to premises controlled by the LDLs. The British Library started archiving UK websites in 2004, based on consent from site owners. This resulted in the Open UK Web Archive[1], a curated collection currently consisting of over 70,000 point-in-time snapshots of nearly 16,000 selected websites, archived by the British Library and partners.[2]

The British Library leads the implementation of NPLD for the UK web. While the many existing web archiving challenges described in detail by Hockx-Yu[7] remain valid, the significant increase of scale, from archiving hundreds of websites to millions, has brought about new and additional challenges. The key ones are discussed in this paper.

2.    IMPLEMENTING NON-PRINT LEGAL DEPOSIT

NPLD of UK websites is mainly implemented through periodic crawling of the openly available UK web domain, following an automated harvesting process where web crawlers request resources from web servers hosting in-scope content.

2.1    Collecting Strategy

With over 10 million registered domain names, .uk is one of the largest Top Level Domains (TLDs) in the world. A strategy to archive such a large web space requires striking a balance between comprehensive snapshots of the entire domain and adequate coverage of changes of important resources. Figure 1 outlines our current strategy, which is a mixed model allowing an annual crawl of the UK web in its entirety, augmented by prioritisation of the parts which are deemed important and receive greater curatorial attention.

Figure 1. UK Web Archive collecting strategy

  • The domain crawl is intended to capture the UK domain as comprehensively as possible, providing the “big picture”.
  • The key sites represent UK organisations and individuals of general and enduring interest in a particular sector of the life of the UK and its constituent nations.
  • News websites contain news published frequently on the web by journalistic organisations.
  • The events-based collections are intended to capture political, cultural, social and economic events of national interest, as reflected on the web.

The key sites, news sites and events collections are maintained by curators across the LDLs and governed by a sub-group overseeing web archiving for NPLD. These are typically captured more than once a year.

2.2    UK Territoriality

The Regulations define an on line work as in scope if:

  a) it is made available to the public from a website with a domain name which relates to the United Kingdom or to a place within the United Kingdom; or
  b) it is made available to the public by a person and any of that person’s activities relating to the creation or the publication of the work take place within the United Kingdom. [13]

Part a) is interpreted as including all .uk websites, plus websites in future geographic top level domains that relate to the UK such as .scotland, .wales or .london. This part of the UK territoriality criteria can be implemented using automated methods, by assembling various lists or directories, or through discovery crawls, which identify linked resources from an initial list and extract additional URLs recursively.
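A discovery crawl of the kind described above can be illustrated with a short sketch. This is not the crawler used for the domain crawl; it is a minimal Python example which assumes a hypothetical `fetch` function returning a page's HTML (or None) and which follows only `<a href>` links. The names `LinkExtractor`, `discover` and `max_pages` are invented for the illustration, and the traversal uses a breadth-first queue rather than literal recursion.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collect the href values of <a> tags in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def discover(seeds, fetch, max_pages=1000):
    """Return the set of URLs reachable from the seeds.

    fetch is a stand-in for an HTTP client (an assumption of this
    sketch): it takes a URL and returns HTML text, or None on failure.
    """
    seen = set(seeds)
    queue = deque(seeds)
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        html = fetch(url)
        fetched += 1
        if html is None:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any #fragment.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen
```

For testing, `fetch` can simply be the `get` method of a dictionary mapping URLs to HTML strings, so the sketch runs without network access.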

Part b) concerns websites using non-.uk domains. It is a statement about the location of the publisher or the publishing process, without defining explicitly what counts as taking place “within the United Kingdom”. We use a mixture of automated and manual means to discover content relevant to this category. Manual checks include UK postal addresses, private communication, WHOIS records and professional judgment. A crawl-time process has also been developed to check non-.uk URLs against an external Geo-IP database and add UK-hosted content to the fetch-chain. This helped us identify over 2 million non-.uk hosts during our 2014 domain crawl.
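The two-part territoriality test can be sketched roughly as follows. This is an illustration under stated assumptions, not the Library's crawler configuration: the suffix list, the `manual_allowlist` of curator-approved hosts and the `lookup_country` callback standing in for an external Geo-IP database are all hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical list of suffixes treated as relating to the UK (part a).
UK_SUFFIXES = (".uk", ".scotland", ".wales", ".london")


def host_of(url):
    """Return the lower-cased host name of a URL, or '' if it has none."""
    return (urlsplit(url).hostname or "").lower()


def in_scope(url, lookup_country, manual_allowlist=frozenset()):
    """Apply a simplified version of the two-part scope test.

    lookup_country stands in for a Geo-IP lookup (an assumption of this
    sketch): it takes a host name and returns a country code such as
    'GB', or None when the location is unknown.
    """
    host = host_of(url)
    if not host:
        return False
    # Part a): the domain name relates to the UK.
    if host.endswith(UK_SUFFIXES):
        return True
    # Part b), manual route: hosts approved after curatorial checks.
    if host in manual_allowlist:
        return True
    # Part b), automated route: the host appears to be UK-hosted.
    return lookup_country(host) == "GB"
```

A dictionary's `get` method is enough to stand in for the Geo-IP lookup when trying this out, for example `{"example.com": "GB"}.get`.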

At a more detailed level, crawler configurations also determine the scope and boundary of a national web archive collection. Some key parameters of our current implementation are as follows:

  • Data volume limitation[3]

A default per-host data cap of 512MB or 20 hops[4] is applied to our domain crawls with the exception of a few selected hosts. As soon as one of the pre-configured caps has been reached, the crawl of a given host will terminate automatically.

  • robots.txt policy

We obey robots.txt and META exclusions, except for the home pages and content required to render a page (e.g. JavaScript, CSS).

  • Embedded resources

Resources which are essential to the coherent interpretation of a web page (e.g. JavaScript, CSS) are considered in-scope and collected, regardless of where these are hosted.
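The three parameters above can be expressed as a short sketch. It is illustrative only and does not reproduce our crawler configuration: the class and function names are invented, the hop cap is assumed to mean that pages more than 20 hops from the seed are not fetched, and META exclusions are left out for brevity. The robots.txt check uses Python's standard `urllib.robotparser` module.

```python
from dataclasses import dataclass
from urllib.robotparser import RobotFileParser

MB = 1024 * 1024


@dataclass
class HostBudget:
    """Track how much of one host's crawl allowance has been used."""

    max_bytes: int = 512 * MB
    max_hops: int = 20
    bytes_fetched: int = 0

    def allows(self, hops_from_seed):
        """True while neither the data cap nor the hop cap is reached."""
        return (self.bytes_fetched < self.max_bytes
                and hops_from_seed <= self.max_hops)

    def record(self, size_in_bytes):
        """Add the size of a fetched resource to the running total."""
        self.bytes_fetched += size_in_bytes


def may_fetch(url, robots, is_home_page=False, is_render_resource=False,
              user_agent="example-crawler"):
    """Obey robots.txt except for home pages and rendering resources.

    is_render_resource marks content needed to render a page (such as
    JavaScript or CSS), which is collected wherever it is hosted.
    """
    if is_home_page or is_render_resource:
        return True
    return robots.can_fetch(user_agent, url)
```

In use, a crawler would keep one `HostBudget` per host, call `allows` before each request and `record` after each response, and build `robots` by passing the lines of the host's robots.txt file to `RobotFileParser.parse`.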

3.    “RIGHT TO BE FORGOTTEN” [6]

The “right to be forgotten” relates to the ruling of the European Court of Justice (ECJ) against Google, which was asked to remove from its index, and stop providing access to, a 16-year-old newspaper article concerning an individual’s proceedings over social security debts. [10]

“Right to be forgotten” reflects the principle of an individual being able to remove traces of past events in life from the Internet or other records. When considering this, it is important not to lose sight of the purpose of NPLD. By keeping a historical record of the UK web for heritage purposes, NPLD ensures the “right to be remembered”. Websites archived for NPLD are only accessible within the LDLs’ reading rooms and the content of the archive is not available to search engines. This significantly reduces the potential damage and impact to individuals and the libraries’ exposure to take-down requests.

There is at present no formal and general “right to be forgotten” in UK law by which a person may demand withdrawal of the lawfully archived copy of lawfully published material, just because they do not wish it to be available any longer. We apply the Data Protection Act 1998 for withdrawing material containing sensitive personal data from the NPLD collection. A notice and takedown policy is in place allowing withdrawal of public access or removal of deposited material under specific circumstances.[12] “Evidence of damage and distress to individuals” is a key criterion used to review complaints.

4.    ARCHIVING SOCIAL MEDIA [5]

A sampling approach was taken to archiving social media content prior to NPLD. The Open UK Web Archive contains a limited number of pages from Twitter, Facebook and YouTube. These are typically parts of “special collections”, groups of websites about a particular theme or event, usually archived for a fixed period of time. An example is the special collection on the UK General Election 2010, which includes Twitter pages belonging to the Prospective Parliamentary Candidates (PPCs). The decision not to archive social media systematically related to the selective nature of the archive itself, the difficulty in obtaining permissions and resource constraints: even the exemplar content required highly skilled technical staff to develop customised solutions outside the standard workflow.

Social media content is not treated differently from other web resources in the context of NPLD. Regardless of the platform used for publication, non-print work is collected for Legal Deposit if it fulfils the territoriality criteria.

Determining the territoriality of social media is, however, not straightforward. The major social network platforms, YouTube, Twitter and Facebook, all use .com domain names so do not meet the first part of the territoriality criteria.[5] The second part applies to UK-based individuals or organisations, so does not warrant archiving twitter.com or facebook.com in their entirety. Until scalable solutions are developed, the identification of UK content in social media relies on manual checking and curators’ professional judgment. In-scope content will continue to be archived together with the rest of the UK web for non-print Legal Deposit.

5.    ACCESS AND SCHOLARLY USE [4]

Despite the time-consuming and non-scalable administrative process, the advantage of permission-based archiving is online access. NPLD enabled systematic collection at scale but limits access to those physically present at the LDLs’ premises. Access restriction is common for web archives developed with a Legal Deposit mandate.

There seems to be a choice between comprehensiveness of the archive and online access. While similar access restrictions were applied to printed Legal Deposit collections, there is an expectation from users to be able to access web archives 24/7. The misalignment between legal requirements and user expectations is a difficult problem.

Another issue is the single-site based access method, which over-focuses on the actual HTML text and ignores contextual or para-textual information. The division of effort in archiving just the national web (or a subset of it) also breaks down a global system and introduces arbitrary boundaries which are irrelevant to research questions.

Access and use of the web archives has been one of our strategic focuses from the outset. Many activities and projects took place to involve general users and researchers in collection building, requirements and tools development. We explored web archives both in granularity and totality, providing a rich set of functions to enhance access to individual websites, and developing analytics and visualisations to explore patterns, trends and relationships based on the entire web archive collection. We also developed the Mementos Service, allowing resource discovery across multiple web archives in the world.[6]

The Big UK Domain Data for the Arts and Humanities project was funded by the Arts and Humanities Research Council to grasp the opportunities offered by big data. Using a historical UK web domain dataset[9] collected by the Internet Archive and acquired by the Joint Information Systems Committee (JISC), the project aims to develop a methodological framework for the study of the archived web, a monograph on the history of the UK web and an access tool based on requirements extracted from 11 research proposals across a range of disciplines.

The British Library’s role was to work with the researchers and co-develop a prototype user interface to the JISC/IA dataset. An iterative model was followed where researchers used the prototype interface to conduct specific research and provide feedback which guided the next cycle of development.

The Shine interface[7] provides access to a 30TB underlying archive containing 2.02 billion URLs, collected between 1996 and 2010. It supports query building, corpus formation and handling. The full-text search has proximity options, and can exclude specified text strings. The search results are presented back to users without much manipulation or ranking, but can be filtered using multiple facets, e.g. content type, public suffix, domain, crawl year. Single resources or whole hosts can also be excluded from result sets. Queries can be saved and results exported as CSV or similar. The interface also allows “trends” search, visually presenting occurrences of search terms across a timeline and allowing access to random samples at given data points which support the trends.

Figure 2. Shine Trends Search
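The idea behind the trends search, together with facet filtering and random sampling at a data point, can be illustrated with a small in-memory sketch. It does not describe how Shine is implemented. The record layout (dictionaries with `crawl_year`, `text` and optional facet keys such as `content_type`) and the function names `trend` and `sample` are assumptions made for the example, and matching is a plain case-insensitive substring test.

```python
import random
from collections import Counter


def trend(records, term, **facets):
    """Count, per crawl year, the records whose text mentions the term.

    facets are optional exact-match filters on record fields, for
    example content_type='text/html'.
    """
    term = term.lower()
    counts = Counter()
    for record in records:
        if any(record.get(key) != value for key, value in facets.items()):
            continue
        if term in record["text"].lower():
            counts[record["crawl_year"]] += 1
    return dict(sorted(counts.items()))


def sample(records, term, year, k=3, seed=None):
    """Return up to k random matching records for one year of the trend."""
    term = term.lower()
    hits = [record for record in records
            if record["crawl_year"] == year
            and term in record["text"].lower()]
    rng = random.Random(seed)
    return rng.sample(hits, min(k, len(hits)))
```

Calling `trend(records, "millennium", content_type="text/html")` would give the yearly counts to plot on a timeline, and `sample(records, "millennium", 2000)` would return a few of the resources behind the point for 2000.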

Researchers on the project concluded that web archives have great potential but also limitations. The main challenge is methodological rather than content-related: how to make sense of a vast amount of unstructured data and how to create relevant research corpora in a consistent manner. Traditional “relevance searching” seems to raise unreasonable expectations. A number of researchers had to adapt their original research questions due to the overwhelming amount of search results. Gareth Millward, a historian who proposed to research disability organisations on the web, detailed some of the challenges and frustrations in an interview with the Washington Post. [11] Andy Jackson, who led the technical development of Shine, responded to the interview and explained the difference between a historical search engine and one like Google, which makes many assumptions and uses these to rank search results.[8]

The project demonstrated a learning process for both researchers and web archive providers. Understanding each other’s assumptions or expectations reveals issues but is also a step towards building better web archives. It is not about a choice between granularity and totality; rather, it is the ability to move between the two that is desired: “a visualisation tool that allows a single data point, to be both visualised at scale in the context of a billion other data points, and drilled down to its smallest compass”. [3]

6.    CONCLUSION

After more than ten years of archiving the UK web, we still face many challenges. Two years into NPLD, we have already collected over 100TB of web data and the archive continues to grow. While new content types and technologies are added to the web, our purpose-built crawler struggles with dynamic content, streaming media and social media. Only a relatively small group of researchers have discovered the value of web archives and started to use this new type of scholarly source. They too need to understand and resolve many conceptual and methodological issues.

The most valuable lesson from interaction with scholars is that much contextual information we regard as operational or private is relevant and can impact the interpretation of web archives. This includes a wide range of technical decisions, curatorial choices and contextual data. Crawl logs and configurations, responses from web servers, and websites we intended to include in a special collection but for which we failed to obtain rights-holders’ permission are all relevant. Explanation of and access to such information should become baseline knowledge and an integral part of the web archive.

The immediate next step is for us to redevelop the Open UK Web Archive and fold the key learning into a next-generation web archive with a different focus. We hope to provide as much information as possible and open a window to our rich web archive collection regardless of the access conditions, including the millions of websites which have disappeared from the live web. We hope to link out to more web archives, so that the historical UK web can be studied in the global context.

7.    REFERENCES

[1]          Big UK Domain Data for the Arts and Humanities: http://buddah.projects.history.ac.uk/. Accessed: 2015-05-20.

[2]          Costa, M. and Gomes, D. 2015. Web Archive Information Retrieval: http://www.netpreserve.org/sites/default/files/attachments/2015_IIPC-GA_Slides_11_Gomes.pptx. Accessed: 2015-05-20.

[3]          Hitchcock, T. 2014. Big Data, Small Data and Meaning. Historyonics: http://historyonics.blogspot.co.uk/2014/11/big-data-small-data-and-meaning_9.html. Accessed: 2015-05-20.

[4]          Hockx-Yu, H. 2014. Access and Scholarly Use of Web Archives. Alexandria: The Journal of National and International Library and Information Issues. 25, Numbers 1-2 (Aug. 2014), 113–127.

[5]          Hockx-Yu, H. 2014. Archiving Social Media in the Context of Non-print Legal Deposit. (Lyon, France, Jul. 2014). http://library.ifla.org/id/eprint/999. Accessed: 2015-05-20.

[6]          Hockx-Yu, H. 2014. A Right to be Remembered. UK Web Archive Blog: http://britishlibrary.typepad.co.uk/webarchive/2014/07/a-right-to-be-remembered.html. Accessed: 2015-05-20.

[7]          Hockx-Yu, H. 2011. The Past Issue of the Web. (Koblenz, Germany, Jun. 2011). http://www.websci11.org/fileadmin/websci/Papers/PastIssueWeb.pdf. Accessed: 2015-05-20.

[8]          Jackson, A. 2015. Building a “Historical Search Engine” is no easy thing. UK Web Archive Blog: http://britishlibrary.typepad.co.uk/webarchive/2015/02/building-a-historical-search-engine-is-no-easy-thing.html. Accessed: 2015-05-20.

[9]          JISC UK Web Domain Dataset (1996-2013): http://data.webarchive.org.uk/opendata/ukwa.ds.2/. Accessed: 2015-05-20.

[10]        JUDGMENT OF THE COURT (Grand Chamber): 2014. http://curia.europa.eu/juris/document/document_print.jsf?doclang=EN&docid=152065. Accessed: 2015-05-20.

[11]        Millward, G. 2015. I tried to use the Internet to do historical research. It was nearly impossible: http://www.washingtonpost.com/posteverything/wp/2015/02/17/i-tried-to-use-the-internet-to-do-historical-research-it-was-nearly-impossible/. Accessed: 2015-05-20.

[12]        The British Library Notice and Take Down Policy: http://www.bl.uk/aboutus/legaldeposit/complaints/noticetakedown. Accessed: 2015-05-20.

[13]        The Legal Deposit Libraries (Non-Print Works) Regulations 2013: 2013. http://www.legislation.gov.uk/uksi/2013/777/pdfs/uksi_20201307_en.pdf. Accessed: 2015-05-20.

NOTES

[1] Open UK Web Archive, http://www.webarchive.org.uk.

[2] Another key collection is the UK Government Web Archive provided by the National Archives, containing government records on the web, http://www.nationalarchives.gov.uk/webarchive.

[3] This is a common way to manage large scale crawls, which otherwise could require significant machine resources or time to complete.

[4] Each page below the level of the seed, i.e. the starting point of a crawl, is considered a hop.

[5] YouTube is out of scope as the Regulations do not apply to works consisting solely or predominantly of film or recorded sound (or both).

[6] UK Web Archive Mementos Service, http://www.webarchive.org.uk/ukwa/info/mementos.

[7] UK Web Archive Shine application, http://www.webarchive.org.uk/shine.
