User:Lisa Gatzke/sandbox

= NCSA Brown Dog =

This article is about Brown Dog, a research project of the National Center for Supercomputing Applications (NCSA) funded by the National Science Foundation (NSF).

Brown Dog is part of the DataNet Partners program, which NSF has funded since 2008. DataNet was conceived to address the increasingly digital and data-intensive nature of science and engineering research and education. More specifically, Brown Dog belongs to a follow-on effort called Data Infrastructure Building Blocks (DIBBs), which focuses on building software to support the DataNet efforts.

== Unstructured, Uncurated, Long Tail Data ==
Much of the data generated in the sciences, social sciences and humanities is small in scale, unstructured and uncurated, and thus not easily shared. In the scientific world this is sometimes referred to as “long tail” data, a term borrowed from statistics that refers to the tail of the distribution of project sizes: the vast majority of projects are small and lack the resources to properly manage the data they produce. Long tail data, both past and present, has the potential to inform future research in many areas of study; however, much of it has become largely inaccessible because of obsolete software and file formats. This inaccessibility, together with the reality of digital obsolescence, puts the integrity of scientific research increasingly at risk, because results that depend on unreadable data can no longer be reproduced.

== Brown Dog Approach ==
Brown Dog describes itself as the “super mutt” of software (thus the name “Brown Dog”), serving as a low-level data infrastructure to interface with digital data content across the web. Its approach is to use every possible source of automatable help (i.e., software) already in existence, in a robust and provenance-preserving manner, to create a service that can deal with as much of this data as possible. The project sees the broader impact of its work in its potential to serve not just the scientific community but also the general public as a sort of “DNS for data”, with the goal of making all data and all file formats as accessible as webpages are today.

== Brown Dog Technology ==
Brown Dog addresses problems involving the use of uncurated and unstructured data collections through the development of two services: the Data Access Proxy (DAP), which aids in the conversion of file formats, and the Data Tilling Service (DTS), which automatically extracts metadata from file contents. Once these services are developed, members of the general public will be able to download browser plugins and other tools from the Brown Dog tool catalog.

=== Data Tilling Service - DTS ===
The DTS will allow users to search collections of data using an existing file to discover other similar files in the collection. Once the machine and browser settings are configured, a search field will be appended to the browser, into which the user can drop example files. Doing this triggers the DTS to search the contents of all the files under a given URL for files similar to the one provided by the user. For example, while browsing an online image collection, a user could drop an image of three people into the search field, and the DTS would return all images in the collection that also contain three people. If the DTS encounters a file format it is unable to parse, it will use the DAP to make the file accessible. The DTS will also perform general indexing of the data, and will extract metadata and append it to files and collections, giving users some sense of the type of data they are encountering.
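
The search-by-example behaviour described above can be illustrated with a minimal Python sketch. It assumes that metadata has already been extracted from each file into a simple dictionary. The `people` field, the example URLs, the Jaccard similarity measure and the function names are all hypothetical choices made for illustration; none of them is taken from Brown Dog itself.

```python
def jaccard(a, b):
    """Jaccard similarity of two metadata dicts, compared as (key, value) pairs.

    Values must be hashable (strings, numbers, tuples).
    """
    pairs_a, pairs_b = set(a.items()), set(b.items())
    union = pairs_a | pairs_b
    return len(pairs_a & pairs_b) / len(union) if union else 1.0


def find_similar(query_metadata, index, threshold=0.5):
    """Return (url, score) pairs for indexed files scoring at least `threshold`.

    Results are ordered best match first, with ties broken by URL.
    """
    scored = [(url, jaccard(query_metadata, meta)) for url, meta in index.items()]
    hits = [(url, score) for url, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: (-hit[1], hit[0]))


# Hypothetical index: metadata assumed to have been extracted beforehand.
INDEX = {
    "http://example.org/img/a.jpg": {"type": "image", "people": 3},
    "http://example.org/img/b.jpg": {"type": "image", "people": 1},
    "http://example.org/img/c.jpg": {"type": "image", "people": 3},
}

# Metadata assumed to have been extracted from the file the user dropped in.
query = {"type": "image", "people": 3}
matches = find_similar(query, INDEX)
```

Here `matches` contains the two images whose extracted metadata also records three people. A real service would need far richer extractors and similarity measures than this exact-match comparison.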

=== Data Access Proxy - DAP ===
Brown Dog’s DAP will allow users to seamlessly access data files that would otherwise be unreadable on their client devices. Much like the address of an internet gateway or Domain Name System (DNS) server, the DAP configuration would be entered into a user’s machine settings and forgotten thereafter. From then on, with modifications in the form of plugins to most browsers, data requests over HTTP would first be examined to determine whether the native file format is readable on the client device. If it is not, the DAP would be called in the background to convert the file into the best possible format readable by the client machine. Alternatively, the user would have the option of specifying the desired format instead of letting the DAP choose it automatically.
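
One way to picture the conversion step is as a search over the formats that existing tools can convert between. The Python sketch below is an illustration under stated assumptions, not a description of the DAP's actual implementation: the conversion table, the format names and the function `plan_conversion` are hypothetical, and "best possible format" is approximated here as the readable format reachable in the fewest conversion steps.

```python
from collections import deque

# Hypothetical conversion table: each key is a source format and each value
# lists the formats some existing tool could convert it to.
CONVERSIONS = {
    "wpd": ["doc", "rtf"],
    "doc": ["pdf", "odt"],
    "rtf": ["txt"],
    "odt": ["pdf"],
}


def plan_conversion(source_format, readable_formats, conversions=CONVERSIONS):
    """Return the shortest chain of formats from source_format to a readable one.

    Returns [source_format] when the client can already read the file, or
    None when no readable format can be reached through the table.
    """
    readable = set(readable_formats)
    queue = deque([[source_format]])
    seen = {source_format}
    while queue:  # breadth-first search, so shorter chains are found first
        path = queue.popleft()
        if path[-1] in readable:
            return path
        for next_format in conversions.get(path[-1], []):
            if next_format not in seen:
                seen.add(next_format)
                queue.append(path + [next_format])
    return None


# A client that can only read PDF requests a WordPerfect file.
plan = plan_conversion("wpd", {"pdf"})
```

In this example `plan` is the chain `wpd`, `doc`, `pdf`. Minimising the number of steps is only a stand-in for conversion quality; a real service would presumably also weigh how much information each conversion loses.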

= Brown Dog Use Cases =

Brown Dog technologies will be developed in the context of three use cases proposed by groups within the EarthCube research communities. Developers and researchers from some of these communities will work together on these use cases, which span geoscience, engineering, biology and social science.

== Long Tail Vegetation Data in Ecology and Global Change Biology ==
led by Michael Dietze, Boston University

Data on the abundance, species composition, and size structure of vegetation is critically important for a wide array of sub-disciplines in ecology, conservation, natural resource management, and global change biology. However, addressing many of the pressing questions in these disciplines will require that terrestrial biosphere and hydrologic models be able to assimilate the large amount of long-tail data that exists but is largely inaccessible. The Brown Dog team, in cooperation with researchers from Dietze's lab, will facilitate the capture of a large body of smaller research-oriented vegetation data sets collected over many decades, as well as historical vegetation data embedded in Public Land Survey data dating back to 1785. This data will be used to set initial conditions for models, to make sense of other large data sets, and to calibrate and validate models.

== Designing Green Infrastructure Considering Storm Water and Human Requirements ==
led by Barbara Minsker, UIUC; William Sullivan, UIUC; Arthur Schmidt, UIUC

This case study involves developing novel green infrastructure design criteria and models that integrate requirements for storm water management with those for ecosystem and human health and wellbeing. Data accessibility and availability are a major challenge in addressing the scientific and social problems associated with the design of green spaces. This study will focus on identified areas of the Green Healthy Neighborhood Planning region within the City of Chicago where existing local sewer performance is most deficient and where changes in impervious area through green infrastructure would benefit underserved neighborhoods. Brown Dog will be used to extract long-tail experimental data on human landscape preferences and health impacts. This data will be used to develop a human health impacts model, which will then be linked with a terrestrial biosphere model and a storm water model using Brown Dog technology.

== Development and Application for Critical Zone Studies ==
led by Praveen Kumar, UIUC

The Critical Zone (CZ) is the “skin” of the earth, extending from the treetops to the bedrock. It is created by life processes working at scales from microbes to biomes, and it supports all terrestrial living systems. Its upper part is the biomantle. This is where terrestrial biota live, reproduce, use and expend energy, and where their wastes and remains accumulate and decompose. It encompasses the soil, which acts as a geomembrane through which water and solutes, energy, gases, solids, and organisms interact with the atmosphere, biosphere, hydrosphere, and lithosphere. A variety of drivers affect this biodynamic zone, ranging from climate and deforestation to agriculture, grazing and human development. Understanding and predicting these effects is central to managing and sustaining vital ecosystem services such as soil fertility, water purification, and production of food resources, and, at larger scales, global carbon cycling and carbon sequestration.

The CZ provides a unifying framework for integrating terrestrial surface and near-surface environments, and reflects an intricate web of biological and chemical processes and human impacts occurring at vastly different temporal and spatial scales. CZ data are correspondingly varied, which creates significant challenges for interdisciplinary studies of the CZ: integrating the variety and number of data products and models has been a barrier. On the other hand, CZ data provides an excellent opportunity for defining, testing and implementing Brown Dog technologies. In this context, “unstructured” data is viewed broadly as comprising heterogeneous data whose formats reflect temporal and disciplinary legacies; data from emerging low-cost, open-hardware-based sensors and embedded sensor networks that lack well-defined metadata and sensor characteristics; and data available as maps, images and text.

= NSF Award =

The CIF21 DIBBs: Brown Dog award was made in the winter of 2013, with a start date of October 1, 2013. The estimated expiration date is September 30, 2018.

The award amount was $10,519,716, the largest of all the DIBBs awards. The principal investigator is Kenton McHenry of the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign (UIUC). The co-PIs are Jong Lee, NCSA/UIUC; Barbara Minsker, Civil and Environmental Engineering, UIUC; Praveen Kumar, Civil and Environmental Engineering, UIUC; and Michael Dietze, Department of Earth and Environment, Boston University.