User:Gregmce/sandbox

'''Holley, R. (2009). How good can it get?: Analysing and improving OCR accuracy in large scale historic newspaper digitisation programs. ''D-Lib Magazine, 15''(3/4). doi:10.1045/march2009-holley'''

Holley's article examines the use of OCR technology within the National Library of Australia's Newspaper Digitisation Program (ANDP). The paper outlines how OCR technology has advanced over the years and how it can be applied to a newspaper preservation effort. Topics include the factors that affect OCR accuracy in historic newspapers, how to measure accuracy rates when using OCR (and what constitutes a good level of OCR accuracy), and how to improve OCR accuracy. Holley then details ANDP's efforts to create a search system for historical newspapers using OCR and the methods described in the article, along with the success rates achieved.

'''Jefferson, R., Taylor, L., & Santamaria-Wheeler, L. (2012). Digital dreams: The potential in a pile of old Jewish newspapers. ''Journal of Electronic Resources Librarianship, 24''(3), 177-188. doi:10.1080/1941126X.2012.706109'''

Jefferson, Taylor, and Santamaria-Wheeler detail a program at the University of Florida (UF)'s Isser and Rae Price Library of Judaica to digitize a special collection of international Jewish newspaper anniversary editions. After explaining the scope of the collection, the authors describe the software and techniques used for digitization, as well as how items were selected for conversion to a digital format. Selection decisions were based on acquiring a varied range of content (languages, places of publication), each newspaper's history and reputation, and copyright status. Technical decisions included using the Metadata Encoding and Transmission Standard (METS) format, adopting the open-source SobekCM software (in conjunction with the UF Digital Collections system), and adding further metadata after OCR scanning.

'''Kanungo, T., & Allen, R. B. (2007). Full-text access to historical newspapers. ''Star, 45''(1). Retrieved from http://search.proquest.com.proxy.lib.wayne.edu/docview/23940188?accountid=14925'''

Kanungo and Allen's paper walks the reader through the National Endowment for the Humanities' system for transforming archived 19th-century newspapers into digital copies available to all. The authors explain why traditional OCR software packages fail when applied to newspapers, and then how their system uses zone segmentation to break a newspaper broadsheet down into regions that can be converted to digital form. Once the pages are digitized, project members catalog the information using multiple layers of metadata that capture both the content and the original format and layout of the newspaper. Finally, the authors describe the interfaces needed to access the material, both for those inspecting and correcting the OCR output and for researchers and students using the content of the collection.

'''King, E. (2005). Digitisation of newspapers at the British Library. ''Serials Librarian, 49''(1), 165-181. doi:10.1300/J123v49n01_07'''

Since the 1990s, the British Library has been digitizing its archived newspaper holdings. King explains the problems of applying OCR software to newspaper broadsheets, as well as the difficulty of indexing the material in a form usable by researchers. Three case studies from within the British Library show the progression of technology and approaches in newspaper digitization, with increasing success over time. Multiple XML layers, OCR capabilities, and copyright status are all highlighted as steps along the way. International case studies of similar newspaper digitization projects are also spotlighted.

'''Klijn, E. (2008). The current state-of-art in newspaper digitization: A market perspective. ''D-Lib Magazine, 14''(1), 5. doi:10.1045/january2008-klijn'''

In preparation for digitizing more than 8 million pages of Dutch newspapers spanning several centuries, the Koninklijke Bibliotheek (the National Library of the Netherlands)'s Databank of Digital Daily Newspapers surveyed similar efforts around the world. Klijn details the findings, with sections on digital imaging technology, OCR, zoning and segmentation, metadata extraction, searchability, and web delivery systems. The article serves both as a snapshot of newspaper digitization technology in 2008 and as a basic primer for others considering a similar project.

'''MacQueen, D. S. (2004). Developing methods for very-large-scale searches in ProQuest Historical Newspapers Collection and InfoTrac The Times Digital Archive: The case of two million versus two millions. ''Journal of English Linguistics, 32''(2), 124-143. doi:10.1177/0075424204265944'''

MacQueen's article highlights the size of digital newspaper archives, focusing on ProQuest's historical newspaper archives and InfoTrac's The Times Digital Archive, and walks through the difficulties of finding information in sources so large. MacQueen shows how to use the tools built into these archives to turn a seemingly impossible search into an almost manageable task.

'''Murray, R. L. (2005, June). Toward a metadata standard for digitized historical newspapers. In ''Proceedings of the 5th ACM/IEEE-CS Joint Conference on Digital Libraries'' (pp. 330-331). ACM.'''

In this case study, Murray details metadata development within the National Digital Newspaper Program (NDNP), an initiative to expand access to historical newspapers. Murray explains the need for structural metadata when digitizing newspapers and the complex links that can exist between each digitized piece of information. The solution adopted pairs the Metadata Encoding and Transmission Standard (METS) with the Metadata Object Description Schema (MODS); Murray then discusses the basic structure of the information, which involves a title document, an issue document, a page object, and a reel document. The end result is an easier exchange of information between newspaper digitization projects.
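The four-level hierarchy Murray describes (title, issue, page, reel) can be sketched as nested data structures. The Python below is an illustrative assumption only: the class names, fields, and sample values are hypothetical and do not reflect the NDNP's actual METS/MODS element names.

```python
# Illustrative sketch of a title/issue/page/reel hierarchy for digitized
# newspapers. All names and fields here are hypothetical, not NDNP's schema.
from dataclasses import dataclass, field

@dataclass
class Page:
    number: int
    image_file: str   # e.g. a scanned page image
    ocr_file: str     # OCR output for the page

@dataclass
class Issue:
    date: str
    pages: list = field(default_factory=list)

@dataclass
class Title:
    lccn: str         # Library of Congress Control Number for the title
    name: str
    issues: list = field(default_factory=list)

@dataclass
class Reel:
    reel_number: str  # the microfilm reel the page images came from
    issues: list = field(default_factory=list)

# Build one title with one issue of two pages (sample values are invented).
title = Title(lccn="sn00000000", name="Example Gazette")
issue = Issue(date="1901-05-04",
              pages=[Page(1, "0001.tif", "0001.xml"),
                     Page(2, "0002.tif", "0002.xml")])
title.issues.append(issue)
```

Modeling the links explicitly like this is what makes exchange between projects easier: two systems that agree on the hierarchy can map their records onto each other level by level.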

'''Popik, B. (2004). Digital historical newspapers: A review of the powerful new research tools. ''Journal of English Linguistics, 32''(2), 114-123. doi:10.1177/0075424204265818'''

Annotation placeholder

'''Reakes, P., & Ochoa, M. (2009). Non-commercial digital newspaper libraries: Considering usability. ''Internet Reference Services Quarterly, 14''(3-4), 92-113. doi:10.1080/10875300903336357'''

Reakes and Ochoa survey the challenges of creating digital newspaper libraries through case studies of usability testing done on the University of Florida's Florida Digital Newspaper Library (FDNL) and the Chronicling America/National Digital Newspaper Program (NDNP). Issues examined include the intuitiveness of search pages, ease of navigation, whether result pages provide the information users need, and the lack of proper metadata in some archives. Direct comparisons between FDNL and Chronicling America are presented throughout to highlight the differences in approach between the projects.

'''Tanner, S., Munoz, T., & Ros, P. H. (2009). Measuring mass text digitization quality and usefulness. ''D-Lib Magazine, 15''(7/8). doi:10.1045/july2009-Munoz'''

Using a case study of the British Library's 19th Century Newspapers Database, the authors measure OCR accuracy not only by individual character accuracy but also by word and significant-word accuracy. The article gives a history of OCR and how it works, then explains why, when searching articles, the correct identification of some words matters more than that of others. The output of OCR scans is examined, showing how the resulting information can support a stronger methodology for converting newspapers to a digital format.
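The character- and word-level accuracy measures described above can be illustrated with a short sketch. This is a generic edit-distance approach under stated assumptions, not the authors' exact methodology; the function names, the optional significant-word filter, and the sample sentences are all hypothetical.

```python
# Illustrative sketch of OCR accuracy measurement at character and word
# level, using Levenshtein edit distance. Names and samples are invented.

def levenshtein(a, b):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def char_accuracy(truth, ocr):
    """Fraction of characters correct relative to the ground-truth text."""
    return 1 - levenshtein(truth, ocr) / max(len(truth), 1)

def word_accuracy(truth, ocr, significant=None):
    """Word-level accuracy; optionally restricted to 'significant' words."""
    t_words, o_words = truth.split(), ocr.split()
    if significant is not None:
        t_words = [w for w in t_words if w.lower() in significant]
        o_words = [w for w in o_words if w.lower() in significant]
    return 1 - levenshtein(t_words, o_words) / max(len(t_words), 1)

truth = "the colonial government announced new tariffs"
ocr   = "the colonial govermnent annouced new tarifs"
print(word_accuracy(truth, ocr))  # → 0.5 (3 of 6 words misrecognized)
print(char_accuracy(truth, ocr))  # high, since most characters survive
```

The gap between the two numbers is the article's point: a page can score well on raw character accuracy while the words that actually carry search value come out wrong, which is why word and significant-word accuracy are the more useful measures for a searchable archive.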