Data virtualization

Data virtualization is any approach to data management that allows an application to retrieve and manipulate data without requiring technical details about the data, such as how it is formatted or where it is physically located.

Unlike the traditional extract, transform, load (ETL) process, the data remains in place and real-time access to the source systems is provided, reducing the risk of data errors and avoiding the workload of moving data that may never be used.

Unlike data federation, it does not attempt to impose a single data model on the heterogeneous data. The technology also supports writing transaction data updates back to the source systems.

To resolve differences in source and consumer formats and semantics, various abstraction and transformation techniques are used.

As a concept and as software, data virtualization is a subset of data integration; it is commonly used within business intelligence, service-oriented architecture data services, cloud computing, enterprise search, and master data management.

Examples

 * The Phone House, the trading name for the European operations of the UK-based mobile phone retail chain Carphone Warehouse, implemented Denodo's data virtualization technology between its Spanish subsidiary's transactional systems and the Web-based systems of mobile operators.


 * Novartis implemented a data virtualization tool from Composite Software to enable its researchers to quickly combine data from internal and external sources into a searchable virtual data store.


 * Linked Data can use a single hyperlink-based Data Source Name (DSN) to provide a connection to a virtual database layer that is internally connected to a variety of back-end data sources using ODBC, JDBC, OLE DB, ADO.NET, SOA-style services, and/or REST patterns.


 * Database virtualization may use a single ODBC-based DSN to provide a connection to a similar virtual database layer.

Functionality
Data virtualization is used to transform available data from many different sources into the form needed for reporting and analytics. Because data virtualization requires fewer databases and processes, integrating it into a business intelligence architecture leads to more agile systems. The system works as an intermediate layer that interacts with applications, translating locations, API requirements, and programming languages. It does so by encapsulating data sources so that technical details are hidden and the application can integrate with a simpler interface.
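
A minimal sketch of this encapsulation in Python (the class, file, and query names are illustrative, not from any product): each back-end source is wrapped behind one common access interface, so the consuming application never needs to know the storage format or physical location.

```python
import csv
import sqlite3


class VirtualSource:
    """Common logical access point; storage details stay hidden."""

    def rows(self):
        raise NotImplementedError


class CsvSource(VirtualSource):
    """Wraps a flat CSV file behind the common interface."""

    def __init__(self, path):
        self.path = path

    def rows(self):
        with open(self.path, newline="") as f:
            yield from csv.DictReader(f)


class SqliteSource(VirtualSource):
    """Wraps a relational table behind the same interface."""

    def __init__(self, path, query):
        self.path, self.query = path, query

    def rows(self):
        con = sqlite3.connect(self.path)
        con.row_factory = sqlite3.Row
        try:
            for row in con.execute(self.query):
                yield dict(row)
        finally:
            con.close()


def report(source):
    """The consumer sees only the interface, never the source type."""
    for row in source.rows():
        print(row)
```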

Data virtualization software is an enabling technology that provides some or all of the following capabilities:


 * Abstraction – Abstraction is a method of identifying the important aspects of something while ignoring its details. When abstraction is applied to a database, the rows, columns and tables not relevant to the consumer are hidden.
 * Virtualized Data Access – Connect to different data sources and make them accessible from a common logical data access point.
 * Transformation – Transform source data for consumer use: reformat it, improve its quality, and so on.
 * Data Federation – Data federation is an aspect of data virtualization in which data located in multiple data stores is integrated into a single, unified view for data consumers. Data federation always requires the data to be virtualized because of the different types of servers, files and languages being integrated (see the sketch after this list).
 * Data Delivery – Publish result sets as views and/or data services executed by client application or users when requested.
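
As a rough illustration of the federation capability above (the file, table, and column names are invented for the example), the following Python sketch merges a CSV file and a SQLite table into one unified result set, so the consumer queries a single view rather than two separate stores.

```python
import csv
import sqlite3


def federated_customer_orders(csv_path, db_path):
    """Return one unified view over two physically separate stores."""
    # Store 1: customer master data in a flat CSV file.
    with open(csv_path, newline="") as f:
        customers = {row["customer_id"]: row for row in csv.DictReader(f)}

    # Store 2: order transactions in a relational database.
    con = sqlite3.connect(db_path)
    con.row_factory = sqlite3.Row
    unified = [
        {**customers.get(str(order["customer_id"]), {}), **dict(order)}
        for order in con.execute(
            "SELECT customer_id, order_id, total FROM orders"
        )
    ]
    con.close()
    return unified  # a single view, as if both stores were one
```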

Data virtualization software may include functions for development, operation, and/or management.
 * Metadata Simplification – When data virtualization is used, metadata specifications have to be implemented only once, and need not be replicated to multiple data consumers. When integration logic is handled by a data virtualization layer, this results in more consistent application behavior and more consistent results.
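
As a toy illustration of the "implement once" point (all field names here are invented), a single source-to-published mapping can serve every consumer, so each one sees the same view.

```python
# One shared metadata specification: source column -> published name.
CUSTOMER_METADATA = {
    "cust_nm": "customer_name",
    "cust_dob": "date_of_birth",
}


def publish(row, metadata=CUSTOMER_METADATA):
    """Apply the shared mapping so every consumer sees identical fields."""
    return {published: row[source] for source, published in metadata.items()}


raw = {"cust_nm": "Ada Lovelace", "cust_dob": "1815-12-10"}
reporting_view = publish(raw)   # BI reporting consumer
analytics_view = publish(raw)   # analytics consumer
assert reporting_view == analytics_view  # consistent results for both
```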

Benefits

 * Reduce risk of data errors
 * Reduce systems workload by avoiding unnecessary data movement
 * Increase speed of access to data on a real-time basis
 * Significantly reduce development and support time
 * Increase governance and reduce risk through the use of policies
 * Reduce data storage required


 * Database language and API conversion from the database server to any desired type while remaining transparent to the consumer.
 * Makes data consumers no longer reliant on a particular data store technology; the data is converted to the consumer's desired technology.
 * Simplified table structure through the use of metadata specifications. These are defined once, and utilized to reduce application development and maintenance.
 * Reduces the risk of data errors, and has a limited ability to correct data formats and prevent incorrect data values from being replicated.
 * Unified data access for multiple data storage formats, including SQL, NoSQL, XML, Excel files, HTML-based websites, etc.
 * Near-real-time processing of many disparate sources to support business needs.

Drawbacks

 * May impact operational systems' response time, particularly if under-scaled to cope with unanticipated user queries or not tuned early on
 * Does not impose a single data model on the heterogeneous data, meaning the user has to interpret the data unless it is combined with data federation and a business understanding of the data
 * Requires a defined governance approach to avoid budgeting issues with the shared services
 * Change management "is a huge overhead, as any changes need to be accepted by all applications and users sharing the same virtualization kit"


 * Not suitable for time series analysis or historic snapshots of data – data warehousing is better for this.
 * Lacks productivity improvements in data center operations.


 * Unproven at extreme levels of data variety and volume.
 * Delivering integrated data, cleansed and transformed into a business-suitable format, requires complex procedural rules.
 * The quality of the data is only as good as its condition at the source.

Reference Books

 * Data Virtualization: Going Beyond Traditional Data Integration to Achieve Business Agility, Judith R. Davis and Robert Eve
 * Data Integration Blueprint and Modeling: Techniques for a Scalable and Sustainable Architecture, Anthony Giordano

History
The history of data virtualization started with IBM's System R project in 1974. From this research, IBM developed what is known as the Structured Query Language (SQL). System R eventually led to the development of IBM's commercial DB2 servers. The first distributed databases built on this technology supported data federation, which data virtualization inherits.

When the Extensible Markup Language (XML) became popular, a language was needed to decompose XML documents and flatten their hierarchical structure. Extensible Stylesheet Language Transformations (XSLT) was standardized in 2000, and it performs the required transformation of XML documents into a structure that can be used by data virtualization. Another language designed for the database side of XML documents is XQuery, which can join XML documents, extract elements, and join relational data with XML data.
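
To show the kind of flattening such a language performs, here is a sketch using Python's standard library rather than XSLT or XQuery (the document structure and field names are invented).

```python
import xml.etree.ElementTree as ET

# A hypothetical hierarchical XML document from a source system.
doc = """
<customers>
  <customer id="1"><name>Ada</name><order total="9.99"/></customer>
  <customer id="2"><name>Grace</name><order total="4.50"/></customer>
</customers>
"""

# Flatten the hierarchy into tabular rows a relational consumer can use.
rows = [
    {
        "customer_id": customer.get("id"),
        "name": customer.findtext("name"),
        "order_total": order.get("total"),
    }
    for customer in ET.fromstring(doc).iter("customer")
    for order in customer.iter("order")
]
print(rows)
```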

Enterprise Information Integration (EII), a term first coined by Metamatrix (now known as Red Hat JBoss Data Virtualization), and data federation have been used by some vendors to describe a core element of data virtualization: the capability to create relational JOINs in a federated VIEW.
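
A concrete toy version of that capability, with SQLite's ATTACH standing in for a federation engine (the file, table, and column names are invented): two separate database files are exposed behind a single VIEW containing a relational JOIN.

```python
import sqlite3

# Two separate database files stand in for two federated data stores.
for path, ddl in [
    ("customers.db",
     "CREATE TABLE IF NOT EXISTS customers(id INTEGER, name TEXT)"),
    ("orders.db",
     "CREATE TABLE IF NOT EXISTS orders(customer_id INTEGER, total REAL)"),
]:
    with sqlite3.connect(path) as con:
        con.execute(ddl)

# One connection attaches both stores and publishes a federated VIEW.
con = sqlite3.connect("customers.db")
con.execute("ATTACH DATABASE 'orders.db' AS remote")
con.execute("""
    CREATE TEMP VIEW customer_orders AS
    SELECT c.name, o.total
    FROM customers AS c
    JOIN remote.orders AS o ON o.customer_id = c.id
""")
print(con.execute("SELECT * FROM customer_orders").fetchall())
con.close()
```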