Mass digitization

Mass digitization is a term used to describe "large-scale digitization projects of varying scopes." Such projects include efforts to digitize physical books, on a mass scale, to make knowledge openly and publicly accessible and are made possible by selecting cultural objects, prepping them, scanning them, and constructing necessary digital infrastructures including digital libraries. These projects are often piloted by cultural institutions and private bodies, however, individuals may attempt to conduct a mass digitization effort as well. Mass digitization efforts occur quite often; millions of files (books, photos, color swatches, etc.) are uploaded to large-scale public or private online archives every single day. This practice of taking the physical to the digital on a mass realm changes the way we interact with knowledge. The history of mass digitization can be traced as early as the mid-1800s with the advent of microfilm, and technical infrastructures such as the internet, data farms, and computer data storage make these efforts technologically possible. This seemingly simple process of digitization of physical knowledge, or even products, has vast implications that can be explored.

Fictional Considerations
Perhaps one of the most notable considerations of mass digitization, in a fictional sense, is the speculations on the Library of Babel by Jorge Luis Borges. In this account, Borges describes a vision of a library in which every possible permutation of books were available. Although Borges describes the preservation and archival practices of all knowledge in a physical space (a library), Borges' fictional vision has already taken place in a digital sense. Endless copies of online books are freely available to the public by means of internet archives or library databases. An account like this was actually quite common, and expertly conveys the idea that "the dream and practice of mass digitization cultural works have been around for decades."

Non-fictional considerations
Some of the earliest digitization programs started before the age of the internet, and include the adaption of technologies such as microfilm in the 19th century. The technical affordances of microfilm allowed it to be a significant medium in the efforts to preserve and extend library materials, as well as its feature of "graphically dramatizing questions of scale." Microfilm was also known as microphotography, developed in 1839, and its capabilities demonstrate (perhaps for the first time) the ability to store mass amounts of information, in this case photos, on a physically small space. When discussing the affordances of microfilm, it was noted by an observer that, "the whole archives of the nation might be packed away in a snuffbox." Such notes expertly demonstrate how the technical infrastructure of microfilm could be leveraged to archive and preserve on a mass scale. Paul Otlet, a Belgian author often considered one of the founders of information science, "outlined the benefits of microfilm as a stable and long-term remediation format that could be used to extend the reach of literature" in his 1906 work "Sur une forme nouvelle du livre : le livre microphotographique". His claim was proven right, with the Library of Congress and other bodies using microfilm to "digitize" cultural objects such as manuscripts, books, images, and newspapers in the early 20th century.

Microfilm
Microfilm represents a shift in the infrastructure of data storage: an immense amount of pictures could be stored in a physically small space, and then expanded for viewing with the help of the microfilm machine. Microfilm, in combination with the microfilm viewer, were leveraged to allow objects to be digitized, preserved, and viewed on a mass scale. It is interesting to note that students needed the help of staff before using the machine; accessing digital materials now is a swift, easy process that one can conduct independently. More information on microfilm can be found under the "Non-fictional considerations" tab of this page.

Server Farms
Another large shift in the infrastructure of data storage was the advent server farms. Websites rely on server farms for “scalability, reliability, and low-latency access to Internet content”. According to Burns, these technologies are essential when building a high-performance infrastructure for content delivery. Moving from microfilm to complex server farms with their own schemas demonstrates the infrastructural demands mass digitization requires over time. Here, mass digitization is both facilitated and exists in this place. Without server farms, data would not be able to be stored or accessed on the necessary scale for mass digitization projects. However, it is important to note that server farms do not act alone in storing data. Other web based infrastructures aid greatly in the storage of data, such as hard drives on a personal computer. Encryption tools and services also work to protect and secure data in sensitive, or internal use, mass digitization projects.

Databases
Databases are often seen as the "home" of a variety of mass digitization efforts. Databases, such as Google Books, allow one to view an entire collection of digitized objects. In the case of Google Books, the database allows a user to search, research, and preview an estimated 40 million titles, corresponding to roughly 30% of the estimated number of all books ever published that the Google team has scanned and uploaded However, faults do exist within such databases; the hands of a scanner can accidentally be scanned and posted, as opposed to the page of a book itself. Errors such as these in public, and often permanent, databases call into question the efficiency of human efforts in mass digitization projects.

Other databases allow researchers from all over the world to upload or view data for scientific inquiry. In this case, raw data from scientific experiments - anonymized for participant privacy - is uploaded and stored on a mass scale. A prime example of such databases for research purposes include the Child Language Data Exchange System (CHILDES) Database. This database houses raw data for language acquisition, and includes videos, audio, transcripts, and de-identified participant information. Databases that store published research articles also exist, and include sites such as PubMed, ScienceDirect, JSTOR, and EBSCO.

Databases, in conjunction with server farms and other web based infrastructures, allow for crucial collaboration in the scientific realm. Here, mass digitization has expanded from the digitization of physical objects (such as books) to the digitization of interactions for scientific inquiry.