User:Peshkira/Hoppla

Hoppla (‘Home and Office Painless Persistent Long-term Archiving’) is a software project at the Vienna University of Technology resulting in an archiving software solution that combines back-up and fully automated migration services for data collections in small office environments.

Introduction
The system allows user-friendly handling of services and outsources digital preservation expertise. The growing volume of digital collections held by small institutions, SOHOs (Small Office/Home Office) and private users with limited know-how of and awareness about digital preservation drives the need for new approaches to fully automated archiving systems.

Small institutions are often hardly aware of changes in their technological environment. This can have serious effects on their long-term ability to access and use their highly valuable digital assets. In some countries, the law requires that business transactions remain available and auditable for up to seven years. Moreover, essential assets such as construction plans, manuals, production process protocols or customer correspondence need to be at hand for even longer periods of time, in case of maintenance issues, lawsuits or for business value. To avoid the physical loss of data, companies implement various backup solutions and strategies. Although the bitstream preservation problem is not entirely solved, the industry has many years of practical experience with it: data is continually migrated to current storage media types, and duplicate copies are kept to preserve bitstreams over the years.

A much more pressing problem is logical preservation. The interpretation of a bitstream depends on the environment of hardware platforms, operating systems, software applications and file formats. Even small changes in this environment can cause problems in opening important files. There is no guarantee that a construction plan for part of an aircraft, stored in an application-specific format, can be rendered again in five, ten or twenty years. Logical preservation requires continuous activity to keep digital assets accessible and usable.

Digital preservation is mainly driven by memory institutions like libraries, museums and archives, which have a focus on preserving scientific and cultural heritage, as well as dedicated resources to care for their digital assets. Enterprises whose core business is not data curation will have an increasing demand for knowledge and expertise in logical preservation solutions to keep their data accessible. Existing long-term preservation tools and services are developed for professional environments and designed for highly qualified staff. To operate in these more distant domains, automated systems and convenient ways to outsource digital preservation expertise are required.

The system builds on a service model similar to current firewall and antivirus software packages, providing user-friendly handling of services and an automated update service, and hiding the technical complexity of the software. Logical preservation is performed using established best-practice preservation strategies and supports multiple migration pathways for object formats. Multiple backups across different storage media avoid physical loss of data caused by the deterioration of media.

System Architecture
The modular architecture of Hoppla is influenced by the Open Archival Information System (OAIS) reference model. The following subsections give a brief overview of the six core modules of the system: acquisition, ingest, data management, preservation management, access and storage management. On the client side, two registries hold preservation tools and rules; both are updated by an external update web service. The tool and service registry contains tools for object characterization, preservation action and preservation validation, while the rule registry specifies which preservation strategy to apply to which types of objects. The format registry on the server side contains representation information about formats, e.g. format specifications. This information can optionally be stored on the client side as well; for home users, who typically do not keep it available locally, it can be requested from the server on demand.
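In spirit, an entry in the client-side rule registry could look like the following minimal sketch. The field names, PRONOM-style format identifiers and tool names here are illustrative assumptions, not Hoppla's actual rule format:

```python
# Hypothetical sketch of a preservation rule registry entry; the real
# Hoppla rule schema is not documented in this article.
from dataclasses import dataclass

@dataclass
class PreservationRule:
    source_format: str    # format the rule applies to, e.g. a PRONOM-style ID
    target_format: str    # recommended migration target format
    tool: str             # name of the registered migration tool
    min_confidence: float # how strongly the rule is recommended

def applicable_rules(rules, fmt):
    """Return all rules whose source format matches the object's format."""
    return [r for r in rules if r.source_format == fmt]

rules = [
    PreservationRule("fmt/40", "fmt/95", "OpenOffice-converter", 0.9),
    PreservationRule("fmt/44", "fmt/13", "ImageMagick", 0.8),
]
```

A lookup such as `applicable_rules(rules, "fmt/40")` would then tell the client which migration pathway to execute for an object of that format.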

Acquisition
The acquisition component crawls different types of sources and maps the source structure to an Element/Folder structure, which is later passed on for further processing. Being plug-in based, this component can gather data not only from local hard drives but also from remote hosts via SSH, or even from e-mail accounts accessed via IMAP or POP. By supporting all these acquisition methods, Hoppla ensures that the data stays safe.

The sources, however, might contain a lot of data that is neither endangered nor important, which would only slow down the backup and preservation process and waste physical space on the machine. The acquisition component handles this as well, allowing basic filtering and the exclusion of sub-directories.
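A local-file-system crawl with sub-directory exclusion, as described above, can be sketched with nothing but the standard library. Hoppla's actual plug-in API is not documented here, so the function signature is an assumption:

```python
# Minimal sketch of acquisition-style crawling with basic filtering:
# skip excluded sub-directories and file extensions while walking a tree.
import os

def crawl(root, excluded_dirs=(), excluded_exts=()):
    """Walk `root` and yield file paths, honouring the exclusion lists."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from descending into them.
        dirnames[:] = [d for d in dirnames if d not in excluded_dirs]
        for name in filenames:
            if not name.lower().endswith(tuple(excluded_exts)):
                yield os.path.join(dirpath, name)
```

The in-place pruning of `dirnames` is the idiomatic way to make `os.walk` skip whole sub-trees rather than filtering their contents afterwards.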

Ingest
The ingest component is responsible for more advanced filtering, based not only on file extensions but also on date, size or even metadata gathered by identification and characterization tools such as JHOVE or PRONOM's DROID. This component gathers changed or new elements and creates new versions, which are passed on for further processing. It also acts as a basic line of defence, since elements are scanned for viruses.
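The detection of changed or new elements can be illustrated with content hashes: an element gets a new version only when its digest differs from the stored one. This is a hypothetical sketch, not Hoppla's actual versioning logic:

```python
# Sketch of change detection during ingest: compare content hashes
# against previously stored ones and version only what changed.
import hashlib

def ingest(stored_digests, incoming):
    """stored_digests maps path -> hex digest; incoming maps path -> bytes.
    Returns the paths that need a new version and updates the digest store."""
    changed = []
    for path, data in incoming.items():
        digest = hashlib.sha256(data).hexdigest()
        if stored_digests.get(path) != digest:
            changed.append(path)
            stored_digests[path] = digest
    return changed
```

On a first run everything is "new"; on later runs only genuinely modified files trigger a new version, which keeps repeated backup runs cheap.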

Preservation
This component handles all logical preservation aspects of Hoppla's workflow. It provides the know-how that SOHO users often lack: knowledge about file formats, their strengths and weaknesses, and how to improve the chances of future access to the data. It communicates with an external web service, which provides the preservation rules most suitable to the user based on collection, user and even system profiles, which are generated and sent transparently by the client application. The rules returned to the client contain all the information needed for Hoppla to execute the format transformations, provided that the external tools are already installed on the local system. Since this raises privacy questions, it is worth noting that no files or partial file contents are sent to the web service, only metadata such as format and size.
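The metadata-only nature of the profile sent to the web service can be made concrete with a small sketch. The payload structure and field names are assumptions for illustration, not Hoppla's actual protocol:

```python
# Illustrative collection profile: per-format counts and total sizes,
# with no file contents -- only the kind of metadata the article mentions.
import json

def collection_profile(objects):
    """objects: list of dicts with 'format' and 'size' keys.
    Returns a JSON string summarizing the collection per format."""
    profile = {}
    for obj in objects:
        entry = profile.setdefault(obj["format"], {"count": 0, "bytes": 0})
        entry["count"] += 1
        entry["bytes"] += obj["size"]
    return json.dumps({"formats": profile}, sort_keys=True)
```

Whatever the real wire format is, the point stands: the server can recommend migration rules from aggregate format statistics alone, without ever seeing the files.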

Data Management
The Data Management module handles two important tasks. Firstly, it matches the newly acquired collection against the old one. During this process, all new versions and elements are added to the collection (and to the database) to ensure integrity for the next workflow run. Secondly, it generates and writes all relevant metadata into every folder of the Element/Folder tree. These metadata files will play an important role in future versions of Hoppla, since they provide a way to rebuild the tree if the database is compromised. Furthermore, they will enable compatibility with other applications (e.g. the Fedora repository).
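The matching step can be thought of as classifying incoming elements as new, changed or unchanged relative to the stored collection. The identifiers and hashes below are assumed for illustration and do not reflect Hoppla's actual data model:

```python
# Sketch of collection matching: compare incoming elements against the
# stored collection and classify each one.
def match(stored, incoming):
    """Both arguments map element id -> content hash.
    Returns three lists: (new, changed, unchanged) element ids."""
    new, changed, unchanged = [], [], []
    for eid, h in incoming.items():
        if eid not in stored:
            new.append(eid)
        elif stored[eid] != h:
            changed.append(eid)
        else:
            unchanged.append(eid)
    return new, changed, unchanged
```

Only the first two categories then need database writes and further processing, which is what keeps the workflow incremental.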

Storage Management
This component is responsible for bit-preservation of the data. Following the LOCKSS principle (Lots of Copies Keep Stuff Safe), it handles the copying of files and their migration to different storage media. Like the acquisition component, it handles different types of media, such as local and external hard drives, CDs and DVDs, and even remote machines.

Another important aspect of this component is the restore functionality. Since data on the storage media is stored in exactly the same structure as on the original source, it is easy to restore the original source if it gets corrupted. Furthermore, the component supports refreshing and storage medium migration, allowing the user to choose and change storage media flexibly.
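The LOCKSS idea behind this component can be reduced to a toy illustration: write the same bytes to several independent targets and restore from any surviving copy. Real media handling (drives, discs, remote hosts) is omitted; the dict-based "targets" are a stand-in:

```python
# Toy LOCKSS-style replication: several independent copies, restore
# from the first intact one.
def replicate(data, targets):
    """Write the same bytes to every storage target (here: plain dicts)."""
    for t in targets:
        t["copy"] = bytes(data)

def restore(targets):
    """Return the first intact copy found among the targets."""
    for t in targets:
        if t.get("copy") is not None:
            return t["copy"]
    raise RuntimeError("no intact copy left")
```

As long as at least one medium survives, the original data can be brought back; refreshing then simply means re-replicating to a fresh target.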

Access
This module is tightly coupled to the storage management component and offers access to and retrieval of preserved data. It supports different search criteria, based not only on filenames but also on specific metadata gathered by the application.
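Retrieval by filename and metadata, as described above, can be sketched as a simple filter over preserved elements. The element structure (`name` plus a `meta` dict) is an assumption made for this example:

```python
# Minimal sketch of metadata-based retrieval: filter preserved elements
# by filename substring and arbitrary metadata field values.
def search(elements, name=None, **metadata):
    """elements: list of dicts with 'name' and 'meta' keys.
    Returns the elements matching every given criterion."""
    hits = []
    for e in elements:
        if name is not None and name not in e["name"]:
            continue
        if all(e["meta"].get(k) == v for k, v in metadata.items()):
            hits.append(e)
    return hits
```

A call like `search(elements, format="fmt/95")` would then find all preserved objects of one format, regardless of their filenames.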