
Introduction - Cloud Computing and Databases
Cloud computing has been one of the most widely adopted technologies in recent years, and databases have moved to the cloud along with it. Cloud computing is a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources that can be rapidly provisioned and released with minimal management effort or service provider interaction. The cloud model is composed of five essential characteristics, three service models, and four deployment models:

Essential Characteristics:

On-demand self-service. A consumer can unilaterally provision computing capabilities, such as server time and network storage, as needed automatically without requiring human interaction with each service provider.

Broad network access. Capabilities are available over the network and accessed through standard mechanisms that promote use by heterogeneous thin or thick client platforms (e.g., mobile phones, tablets, laptops, and workstations).

Resource pooling. The provider’s computing resources are pooled to serve multiple consumers using a multi-tenant model, with different physical and virtual resources dynamically assigned and reassigned according to consumer demand. There is a sense of location independence in that the customer generally has no control or knowledge over the exact location of the provided resources but may be able to specify location at a higher level of abstraction (e.g., country, state, or datacenter). Examples of resources include storage, processing, memory, and network bandwidth.

Rapid elasticity. Capabilities can be elastically provisioned and released, in some cases automatically, to scale rapidly outward and inward commensurate with demand. To the consumer, the capabilities available for provisioning often appear to be unlimited and can be appropriated in any quantity at any time.

Measured service. Cloud systems automatically control and optimize resource use by leveraging a metering capability at some level of abstraction appropriate to the type of service (e.g., storage, processing, bandwidth, and active user accounts). Resource usage can be monitored, controlled, and reported, providing transparency for both the provider and consumer of the utilized service.

Service Models:

Software as a Service (SaaS). The capability provided to the consumer is to use the provider’s applications running on a cloud infrastructure. The applications are accessible from various client devices through either a thin client interface, such as a web browser (e.g., web-based email), or a program interface. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user specific application configuration settings.

Platform as a Service (PaaS). The capability provided to the consumer is to deploy onto the cloud infrastructure consumer-created or acquired applications created using programming languages, libraries, services, and tools supported by the provider. The consumer does not manage or control the underlying cloud infrastructure including network, servers, operating systems, or storage, but has control over the deployed applications and possibly configuration settings for the application-hosting environment.

Infrastructure as a Service (IaaS). The capability provided to the consumer is to provision processing, storage, networks, and other fundamental computing resources where the consumer is able to deploy and run arbitrary software, which can include operating systems and applications. The consumer does not manage or control the underlying cloud infrastructure but has control over operating systems, storage, and deployed applications; and possibly limited control of select networking components (e.g., host firewalls).

Deployment Models:

Private cloud. The cloud infrastructure is provisioned for exclusive use by a single organization comprising multiple consumers (e.g., business units). It may be owned, managed, and operated by the organization, a third party, or some combination of them, and it may exist on or off premises.

Community cloud. The cloud infrastructure is provisioned for exclusive use by a specific community of consumers from organizations that have shared concerns (e.g., mission, security requirements, policy, and compliance considerations). It may be owned, managed, and operated by one or more of the organizations in the community, a third party, or some combination of them, and it may exist on or off premises.

Public cloud. The cloud infrastructure is provisioned for open use by the general public. It may be owned, managed, and operated by a business, academic, or government organization, or some combination of them. It exists on the premises of the cloud provider.

Hybrid cloud. The cloud infrastructure is a composition of two or more distinct cloud infrastructures (private, community, or public) that remain unique entities, but are bound together by standardized or proprietary technology that enables data and application portability (e.g., cloud bursting for load balancing between clouds).

A cloud database holds its data in data centers at different locations, and this distribution is what distinguishes it from a traditional relational database management system.

In the sections below, we give an overview of cloud databases, explore the architecture, data storage, and access of Amazon's Elastic MapReduce (EMR) service, and analyze the Hadoop software.

Overview
A cloud database is a service accessed through a cloud platform. It differs from a traditional database in that it is hosted by a cloud platform provider rather than on hardware the company owns or operates. Companies using these services can access their data in an established structure. Amazon is a top cloud database provider, and customers can create cloud-hosted environments via Amazon Web Services (AWS). AWS is easy to use, low cost, secure, and flexible. Companies using AWS enjoy the freedom to take risks at a lower cost, as the service provider eliminates the traditional overhead associated with computing. This freedom allows companies to innovate without fear of failure and significant losses; companies have reported cost savings of between 25% and 50%. Other major advantages include access from any location and flexible infrastructure deployments that can be tailored to a specific use case.

AWS Elastic MapReduce (EMR) is a service that offers flexibility, no upfront hardware costs, and quick solutions. Companies like Razorfish have been able to implement their services within just six weeks. Sysco is another company taking advantage of the rapid prototyping options within AWS EMR and Hadoop.

Hadoop is a software framework that runs applications on clusters of commodity hardware. Although Hadoop is still a growing product, it already has a very strong ecosystem that companies value for its ability to integrate data. Hadoop emphasizes integration with the technology already in place; this advantage results in faster results and fewer management headaches.

Spark is another option; it processes data in memory (RAM) and works well for real-time data applications such as Zillow's. There are cases where companies will find an advantage in using Hadoop and Spark at the same time.

Architecture, Data Storage & Access
While cloud databases range in their architecture, data storage, and accessibility, most provide a web-based management console. This allows end users to access the database through a web app that connects database owners to database operators, and the database can be based on SQL or NoSQL. Here we examine a generic cloud database as a concept and then focus more closely on Amazon EMR and the underlying Hadoop technology.

Architecture
Amazon EMR uses clusters as its basic building block. Each cluster is a collection of elastic cloud instances, and each instance within the cluster is a node. Nodes are divided into master and slave nodes, depending on the use case and the software installed on them. A master node manages the cluster by running software components that coordinate the distribution of data and tasks among the other nodes for processing. A slave node runs software components that execute tasks and store data in the Hadoop Distributed File System (HDFS) on the cluster. Amazon EMR automatically installs and configures applications in the Hadoop ecosystem when a cluster is set up. Hadoop and similar software ecosystems enable massively parallel computing that can be spread across many servers with little effect on performance. Hadoop ecosystem software is built with the expectation that any given machine in the cluster could fail at any time: if a server running a task fails, Hadoop reruns that task on another machine until it completes.
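The retry-on-failure behavior described above can be sketched with a toy scheduler. This is only an illustration of the idea, not EMR's actual scheduling logic; the node and task names are hypothetical:

```python
def schedule(tasks, nodes, failed_nodes):
    """Assign each task to a surviving node, mimicking Hadoop's
    behavior of rerunning work from failed machines elsewhere."""
    healthy = [n for n in nodes if n not in failed_nodes]
    if not healthy:
        raise RuntimeError("no healthy nodes left in the cluster")
    # Round-robin the tasks across whichever nodes are still alive.
    return {task: healthy[i % len(healthy)] for i, task in enumerate(tasks)}

# "node-2" has failed; its work is transparently placed on other nodes.
placement = schedule(["map-0", "map-1", "map-2"],
                     nodes=["node-1", "node-2", "node-3"],
                     failed_nodes={"node-2"})
print(placement)  # every task lands on a surviving node
```

The key point is that the caller never sees the failure: the cluster simply reassigns the task and the job continues.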

Data Storage
Basic data storage for cloud databases is operated by data center owners like CyrusOne and Equinix, which lease their data centers to large cloud operators like AWS and Google Cloud, which in turn connect end users to their data. Within these data centers, companies choose between shared-nothing and shared-disk storage options.

A shared-nothing system uses independent nodes in an interconnected network. Because each node maintains its own memory and storage, data consistency is harder to achieve, since each node must update its storage independently; on the other hand, the system scales almost without limit, because the operating firm can add new nodes to the network to increase storage space.
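The routing idea behind a shared-nothing design can be illustrated with simple hash partitioning. This is a minimal sketch (the node names are hypothetical, and real systems typically use consistent hashing to limit data movement when nodes are added):

```python
import zlib

def owner(key, nodes):
    """Route a key to a node by hashing it: each node independently
    owns a disjoint slice of the data, as in a shared-nothing design."""
    return nodes[zlib.crc32(key.encode()) % len(nodes)]

nodes = ["node-a", "node-b", "node-c"]
print(owner("customer:42", nodes))

# Scaling out is just appending a node; some keys rehash to the newcomer.
nodes.append("node-d")
print(owner("customer:42", nodes))
```

Because no node needs to coordinate with the others to answer for its own slice of keys, capacity grows by adding nodes, which is exactly the scalability advantage described above.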

A shared-disk system gives every node access to a common pool of disk storage while each node keeps its own memory and processors. The nodes communicate with one another to cover a database with a very high workload, enabling dynamic load management. The shared-disk model also makes data consolidation and consistency easier, because all nodes see the same interconnected storage and data redundancy is limited. However, a shared-disk system is not as scalable and is more prone to catastrophic data failures because it has fewer redundancy points.

Like most other shared-nothing systems, Hadoop stores data in distributed, scalable files through the Hadoop Distributed File System (HDFS), creating redundancy by keeping multiple copies of each piece of data to prevent catastrophic loss. HDFS also caches intermediate results during queries, making future query responses more efficient. Amazon EMR extends Hadoop with the ability to access data stored in S3 as if it were data stored in HDFS. Most often, S3 is used to store input and output data, while intermediate results are stored in HDFS.
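The redundancy mechanism can be sketched as follows. This is a simplified model, not HDFS's actual placement policy (which is rack-aware); it assumes hypothetical node names and HDFS's well-known default of three replicas per block:

```python
import zlib

def place_replicas(block_id, nodes, replication=3):
    """Pick `replication` distinct nodes to hold copies of a block,
    in the spirit of HDFS's default 3-way replication."""
    start = zlib.crc32(block_id.encode()) % len(nodes)
    return [nodes[(start + i) % len(nodes)] for i in range(replication)]

def readable(block_id, nodes, failed):
    """A block survives as long as any one replica sits on a live node."""
    return any(n not in failed for n in place_replicas(block_id, nodes))

nodes = ["dn-1", "dn-2", "dn-3", "dn-4"]
replicas = place_replicas("block-0001", nodes)
print(replicas)  # three distinct nodes hold copies of the block
print(readable("block-0001", nodes, failed={replicas[0]}))  # still readable
```

Losing any single disk or machine therefore leaves at least two copies intact, which is why the loss is not catastrophic.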

Access
We access Amazon EMR clusters through the master node, which we use to manipulate and interact with the rest of the cluster. We establish a remote connection with the master node using the Secure Shell (SSH) protocol, after which the terminal on our local computer behaves as if it were running on the remote machine. Once we have established this connection, we use Linux commands to run applications, browse directories, and read log files.
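A typical session might look like the following. The hostname and key-file name are placeholders, not a real cluster; the default login user on an EMR master node is `hadoop`:

```shell
# Connect to the master node (hostname and key file are placeholders).
ssh -i ~/my-key-pair.pem hadoop@ec2-203-0-113-25.compute-1.amazonaws.com

# Once connected, ordinary Linux commands work on the cluster:
ls /var/log/hadoop/    # browse Hadoop log directories
hdfs dfs -ls /         # list the root of HDFS
```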

Performance Optimization Approach
Apache Hadoop specializes in the optimized storage and processing of very large datasets (up to petabytes of data). To do this, Hadoop leverages clusters of multiple machines (rather than a more traditional single-server architecture), which allows the database to process large volumes of data in parallel. The technology within Hadoop that supports this parallel processing is called MapReduce. MapReduce chunks the data into small pieces that can be processed by any node in the cluster at the same time. When set up within the cloud, Hadoop, and in particular the MapReduce framework, can optimize performance. For this presentation, we chose to focus on Amazon EMR (Elastic MapReduce), which is Amazon's managed service version of the MapReduce technology within Hadoop. The scalability of Amazon EMR, able to spin up tens or even thousands of nodes on demand or automatically, is the cornerstone of the optimization. Amazon EMR allows users to decide whether to add capacity by resizing an existing cluster or by adding new ones. Not only is the initial setup available in minutes, saving organizations the time and money of traditional architecture design, but it already includes optimized configurations tuned to be extremely efficient when processing large datasets. Amazon EMR combines the high-speed processing power of the Hadoop database technology with the scalability of the cloud, creating a perfect union for large data analytic needs.
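The MapReduce pattern described above can be demonstrated with the classic word-count example. This is a single-process sketch of the programming model only; in a real cluster, each chunk would be mapped on a different node in parallel:

```python
from collections import defaultdict
from itertools import chain

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    """Reduce: sum the counts for each word across all chunks."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# Each chunk could live on, and be mapped by, a different node.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]
counts = reduce_phase(chain.from_iterable(map_phase(c) for c in chunks))
print(counts["the"])  # 3
```

Because the map phase touches each chunk independently, adding nodes lets the cluster process proportionally more chunks at once, which is the source of the scalability discussed above.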

Comparison to RDBMS' approach to Optimization
When compared to a traditional RDBMS's approach to optimization, Amazon EMR and the underlying Hadoop technology offer much better scalability. RDBMS technology does not scale horizontally very well and tends to have poor availability when distributed. Amazon EMR has great strength in scaling: it can dynamically grow horizontally and continue to operate with top performance when leveraging a distributed architecture. Another optimization strength of Amazon EMR is the size of the data it can process; a traditional RDBMS would likely have great difficulty processing petabytes of data, whereas the design of Amazon EMR supports optimized performance even with sizable data. Conversely, RDBMS technology has the upper hand on query performance; however, Amazon EMR supports adding services like Presto (an open-source SQL engine) that can address the need for fast query performance.

Security
Since Amazon EMR, which hosts the Hadoop module, is offered as a platform as a service, many common infrastructure security concerns are handled by Amazon. This contrasts with a fully self-hosted RDBMS, where network traffic must be authenticated and actions performed within the database must be governed by security controls. Amazon EMR removes the handling of network traffic from those subscribed to the service. That said, users within organizations are encouraged to set up tiered permissions to determine which parties can access and manipulate data in specific ways. These permissions are handled within Amazon EMR's service catalog in the AWS Management Console. One of the first permissions available for configuration is the choice to create an EC2 key pair, which allows an end user to SSH into the master node in order to manage the child (slave) nodes and task nodes it controls. Subsequent users can be created and given tiered access in a similar fashion to how it is done within a standard Amazon EC2 instance.

Concurrency
Concurrency within Hadoop is handled through a queue of batches. Under Hadoop's default scheduler, each batch is taken off the queue in a FIFO (first-in, first-out) fashion, divided into pieces, and sent to the nodes in the cluster. Processing happens on disk, and when it has finished, the output is returned. At this point, the next batch can be started. This continues until the queue is empty.
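The queueing behavior can be sketched as follows. This is a toy single-process model of the batch queue only (here each "batch" is just a list of numbers and each "node" sums its piece); real Hadoop jobs run arbitrary map and reduce code:

```python
from collections import deque

def process_batches(batches, workers=3):
    """Drain a FIFO queue of batches: each batch is split into pieces,
    the pieces are handed to the cluster's nodes, and the next batch
    starts only once the current one has finished."""
    queue = deque(batches)              # first in, first out
    finished = []
    while queue:
        batch = queue.popleft()         # take the oldest waiting batch
        pieces = [batch[i::workers] for i in range(workers)]  # split work
        outputs = [sum(p) for p in pieces]  # each "node" processes a piece
        finished.append(sum(outputs))       # combine the partial results
    return finished

print(process_batches([[1, 2, 3, 4], [10, 20]]))  # [10, 30]
```

Note that batches complete strictly in arrival order, which is the property the paragraph above describes.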

Hadoop uses distributed nodes on which data is replicated. This makes the process markedly more fault tolerant. An error, such as a crash in one of the nodes, does not sabotage the entire process, because execution occurs in a distributed fashion. The storage of data within Hadoop, such as in a data lake, is also distributed, meaning that the destruction of the copy of the data on one disk or one machine does not taint the entire data set.

= Strengths/Weaknesses vs. Relational Databases =

Strengths

- Cloud Databases offer a faster response if the database needs to be updated in real time

- Greater flexibility in usage as the user can increase or decrease their usage based on their needs

- Can help save money as the company does not need to invest money in setting up their own data centers and hiring extra staff to manage it.

- The database can be accessed from a variety of devices without needing to install special software.

Weaknesses

- If a company has high traffic, a relational database might be less costly if the cloud database provider charges for data transfers

- Without full control over the server, there is no control over the software installed on devices and so security concerns arise (particularly with sensitive data)

- Relational databases are a fully mature and proven technology, whereas cloud databases are still relatively new

- Cloud Databases are fundamentally reliant on internet access whereas some traditional Relational Databases can support offline capabilities.