The Digital Noah’s Ark

The net is taking root as the dominant paradigm for the digital repository, but can the Internet really replace archives?

Mundaneum's Universal Bibliographical System, Mons (Wallonia).

While archives and libraries took care of housing and conserving society’s knowledge in the twentieth century, the Internet now looks set to step in as the natural successor for this role. Digital content of all types has several advantages over its analogue equivalents: more information can be stored in less space, it is easier to copy, indexing and querying are more efficient, etc. As part of this trend, the net is taking root as the dominant paradigm for the digital repository, the place where we can find whatever we seek. But can the Internet really replace archives?

Digital technologies have boosted content creation, and its production and dissemination have expanded exponentially as a result of the Internet. The net breaks down geographical barriers and allows us to access any document from any place. As such, copies lose relevance in favour of hypertext: there is no longer any need to copy information when we can link directly to the original. Everything is in “the Cloud”, a perfect metaphor for the massive digital depository that seems to have no specific form or location. And like real-life clouds, it is also changeable and ephemeral.

The Internet is a work in progress, in perpetual beta. It evolves rapidly, and in spite of its youth, its memory is poor. As users, we have adapted perfectly to this whirlwind of change. We easily discover and adapt to new services and just as easily forget the former applications, which almost immediately become memories of what seems to us like a distant past. With neither trauma nor regret, the new buries the old on a daily basis. Not many initiatives survive, and the few that do constantly evolve and are reinvented. This dynamic affects our models of information production, and we have grown accustomed to constantly using and discarding it. But what trace will we leave behind for the future? How can we escape the data deluge?

The Internet is a huge container, but it is a significant departure from the ideals of order and preservation that govern archives. For this reason, ever since the net began, different initiatives have set out to conserve its content, which would otherwise seem doomed to end up in the enormous digital black hole. But archiving the Internet is by no means easy, and it raises new dilemmas that do not have a simple or single solution.

Technical issues

When it comes to archiving content, digital technologies have one major advantage – how easily we can copy content. Compared to other formats, copying digital documents is far quicker and involves virtually no loss of quality. Nonetheless, the net introduces the problem of how to access originals and how to conserve them.

Internet access protocols only let us read part of a website’s original code from our own computers: the part written in client-side languages. The rest of the code is executed on remote servers and reaches us only as its final output. This means that in order to make full and accurate copies of most websites we need the collaboration of their owners so that we can access the files. Otherwise, all we can do is capture the websites using programmes that crawl the site to collect and save the URLs of its pages. But this system has serious shortcomings when it comes to saving all the contents of a website, because the crawling software often fails to find all of the pages. This doesn’t just lead to a loss of files; it also makes it necessary to check the code and correct possible errors.
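To make the shortcoming concrete, here is a minimal sketch of the kind of crawling programme described above, written in Python with the standard library only. It is an illustration under stated assumptions, not the method used by any particular archive: the `crawl` function, its `fetch` parameter (any callable that returns a page's HTML for a URL, or None), and the three-page example.org site are all hypothetical. The crawler can only follow links it finds in the pages it receives, so a page that nothing links to is never captured.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in an HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, fetch, max_pages=100):
    """Breadth-first crawl of a single site.

    `fetch` is a hypothetical callable (an assumption of this sketch): it
    takes a URL and returns the page's HTML as a string, or None if the page
    cannot be retrieved. Returns a dict mapping each captured URL to its HTML.
    """
    host = urlparse(start_url).netloc
    queue = deque([start_url])
    seen = {start_url}
    captured = {}

    while queue and len(captured) < max_pages:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        captured[url] = html

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop any "#fragment" part.
            absolute, _ = urldefrag(urljoin(url, href))
            # Stay on the same site and never queue a URL twice.
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)

    return captured


# A made-up site held in memory; "/hidden" exists but is never linked.
SITE = {
    "http://example.org/": '<a href="/about">About</a> <a href="http://other.org/">Out</a>',
    "http://example.org/about": '<a href="/">Home</a>',
    "http://example.org/hidden": "<p>Reached only through a search form.</p>",
}

pages = crawl("http://example.org/", SITE.get)
print(sorted(pages))
```

In this toy run the crawler captures the home page and the "about" page but not the "hidden" one, which is exactly the kind of silent loss that makes a later manual check necessary. A real crawler would also fetch pages over the network and save images, stylesheets and scripts; those steps are left out here.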

Once we do have all the files, we must grapple with a great diversity of languages and types of documents that make preservation a complex matter. The digital world is perpetually evolving, so new machines eventually become unable to read old files. In addition, there is the short lifespan of digital media (the average lifespan of a hard drive is estimated to be around 5 years), so files and media have to be constantly updated. And in spite of all these maintenance tasks, it is sometimes necessary to preserve old hardware and software in order to enable access to obsolete documents.
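As a rough illustration of what moving files to fresh media can involve, the sketch below copies a folder to a new location and checks that every copy is bit-for-bit identical to its original. The use of SHA-256 checksums for this check is an assumption of the example, not something the institutions discussed here are known to prescribe, and the `migrate` and `sha256_of` functions are hypothetical names.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def migrate(source_dir, target_dir):
    """Copy every file to the new location and verify each copy.

    Returns the relative paths of files whose copy does not match
    the original (an empty list means the migration was lossless).
    """
    source_dir, target_dir = Path(source_dir), Path(target_dir)
    mismatches = []
    for source in sorted(source_dir.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(source_dir)
        target = target_dir / relative
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)  # copies content and timestamps
        if sha256_of(source) != sha256_of(target):
            mismatches.append(str(relative))
    return mismatches


# Example with two temporary folders standing in for the old and new media.
old_medium = Path(tempfile.mkdtemp())
new_medium = Path(tempfile.mkdtemp())
(old_medium / "site").mkdir()
(old_medium / "site" / "index.html").write_bytes(b"<p>hello</p>")

failed = migrate(old_medium, new_medium)
print(failed)
```

A check like this only confirms that the bits survived the move. It does nothing about the other half of the problem, namely whether future machines will still be able to read the file format.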

Legal issues

Aside from the technical characteristics of the Internet that complicate access to and conservation of online content, there are also legal problems when it comes to storing it. The terms that apply to the legal deposit of analogue documents and publications are not applicable to the net, and this means that archives and libraries have to negotiate the copyright for each website individually before they can copy it. To complicate things further, the legislation regarding these matters is national, so different legal frameworks will apply depending on the website. It would be virtually impossible to manage these rights at the international level, because doing so would entail identifying and contacting all the authors of all the content on the net.

Legal problems also arise when it comes to dealing with content that may be defamatory, illegal, or considered obscene. In these cases, the material has to be reviewed and a decision made about whether it can be conserved and/or made available for consultation. Finally, the archiving process also has to take into account the presence of personal details on the net, and to abide by the relevant data protection laws.

One possible solution that is being considered by several institutions as a way of dealing with these legal issues is the creation of opaque archives that limit access and consultation, and as such reduce the legal risks (such as those relating to the economic rights of original works, or problems with defamatory or illicit content). But reducing public access to an online archive in the medium term substantially reduces its advantages and benefits.

Ethical issues

Notwithstanding the legal issues discussed above, the net is still in a very primitive stage, and there are many regulatory gaps in relation to it. The Internet is a young medium, and we are still unable to imagine the repercussions that our actions may have in the future. And as it is a mass medium, these risks increase exponentially.

One of the big questions that need to be answered is whether the Internet is actually a publishing medium, and whether its users see it as such. The net is an ephemeral medium, and it is used in many different ways. It is a multi-faceted platform that addresses both the public and the private spheres, without any clearly defined boundaries. As such, we need to think about the extent to which we can archive content that was not published with the intention of being conserved, or that its authors may want to delete at some point in the future.

These dilemmas become more pronounced when we think about archiving social networks, which are probably the most personal side of the net. Although the information in these networks is generally quite irrelevant at an individual level (given its mainly personal nature), it may be of interest when taken as a whole, helping us to understand global events such as the 15-M movement or the Arab revolutions, and archiving projects have already started working with these platforms. For example, the United States Library of Congress has started to archive Twitter in its Twitter record project, and is considering the legal and ethical terms of this archive.


Without question, the biggest problem we face in archiving the Internet is its magnitude. The enormous amount of content and its staggering growth make it an almost chimerical task that leaves us no option but to select and prioritise the material that is deemed to be most valuable for future research.

One of the main questions that we have to resolve in regard to an Internet archive is how to define what an online publication is. The diversity of the ever-increasing formats on the net makes this a difficult task, but it is of prime importance if we are to create a meaningful and genuinely useful archive.

In geographic terms, given that the Internet is a global, interconnected network, it is difficult to define borders and almost impossible to understand it from a solely national perspective. But as we have noted above, an international archiving project would be forced to deal with the political and legal frameworks of each country, and this would add further complexity to the project. One possible solution may be to create a network of national archives that can work together in order to enable global consultation.

Archiving the Net

As Yuk Hui wrote in the Archivist Manifesto, “archives are reservoirs of discourses that make possible an archaeology of knowledge.” It follows that an Internet archive should not simply store information from the net, but also order it so that we can access it in a simple, logical way. It is important to save information, but without order it will simply be an ocean of bits that we will get lost in whenever we try to navigate through it. At the same time, one of the reasons for archiving is to preserve information that future generations will be able to study with enough hindsight to understand our history as a society. This means keeping a medium- and long-term perspective in mind when selecting content.

In short, it would appear that the characteristics of the net force us to choose what we want to save. But at the same time, the fact that we are dealing with information from the present means that we lack the perspective to make this choice with enough certainty. Given this dilemma, we can consider two possible types of archive: a stricter and more defined archive that sets out certain boundaries and contours but then saves as much as possible within them in order to provide in-depth information; and a more photographic one that covers large areas and offers snapshots that are more superficial but allow us to make global connections.
