The Web has revolutionized our access to information. Documents and publications that were once difficult to find are now readily available to anyone with an Internet connection. Federal, state and local government agencies and non-profit organizations now have an inexpensive means for distributing information to the public. When important historical events such as Hurricane Katrina or 9/11 take place, we can see the popular reaction unfold via blogs and personal web sites, and have an unprecedented view into popular culture. All of these materials will serve as valuable resources for researchers for years to come.
But ready access to these publications cannot be taken for granted. Web pages and documents are as easy to change or remove as they are to publish. When sites are redesigned, when new administrations take office, when policies or organizations change, we witness the wholesale disappearance of information. State and local web publications are particularly at risk. In many cases, these documents are no longer available in print, and libraries are challenged to continue their historic role as cultural memory institutions in the digital environment.
As scholars increasingly rely on web citations, verifying their sources becomes difficult or impossible. Studies of web citations show that up to half of the citations in scholarly journal articles can cease to function within four years. Even if a web citation still returns a page, there is no guarantee that the page shows the same content the author cited. Furthermore, web content faces the same risks as other digital publications as file formats evolve and change.
In 2005, the National Digital Information Infrastructure and Preservation Program awarded a grant to the California Digital Library and its partners at New York University Libraries and the University of North Texas Library to provide librarians and archivists with the tools to capture, curate and preserve web publications. One result of that grant is the Web Archiving Service, which produced the archives available here. Curators at University of California Libraries, Stanford University Libraries and New York University Libraries, along with a growing number of other institutions, have used these tools to save web publications for researchers.
These archives will provide lasting access to the publications of the State of California at the state and local levels, as well as access to a rich array of topics of value to researchers. Searching the archives not only provides a snapshot of each website in time, but also allows researchers to explore those resources in ways that are not possible on the live web.
The future holds interesting possibilities for web archives as new tools become available to allow large-scale data analysis on captured web content.
The California Digital Library plays an active role in the development of the web archiving standards and tools that make web archiving possible. The Web Archiving Service, used to create and deliver these archives, was developed at the California Digital Library, and relies on a number of open source tools developed by the Internet Archive with the support of the International Internet Preservation Consortium.
Further information and video demonstrations of the curatorial tools are available for those interested in using the service.
The CDL staff involved in Web Archiving Service Development are:
Additional input and review were provided by CDL's curatorial partners at:
Support for the development of this service was provided by the National Digital Information Infrastructure and Preservation Program and the University of California.
59 public archives
22 partners
5,550 web sites
701,283,897 documents
36.6 TB of data
The archives were built with the Web Archiving Service from the California Digital Library