User:Waury/sandbox

Stratosphere is an open-source software framework for processing large data sets on clusters. It is developed by the Hasso Plattner Institute, HU Berlin, TU Berlin and individual contributors. It is licensed under the Apache License 2.0.

Front ends
Stratosphere provides Java and Scala APIs for writing programs. The internals of Stratosphere are implemented in Java.

Operators
Similar to MapReduce and Apache Hadoop, the Stratosphere runtime provides second-order functions called operators, which publications refer to as PACTs (PArallelization ContracTs). Each operator processes records by applying a user-defined function (UDF).
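The relationship between a second-order function and its UDF can be sketched in a few lines. The example below is plain Python, not Stratosphere's Java or Scala API; the names map_operator, reduce_operator, split_words and sum_counts are hypothetical and chosen only for illustration.

```python
from itertools import groupby
from operator import itemgetter


def map_operator(udf, records):
    """Second-order function: call the UDF on each record independently."""
    for record in records:
        yield from udf(record)


def reduce_operator(udf, records, key=itemgetter(0)):
    """Second-order function: group records by key, call the UDF once per group."""
    for k, group in groupby(sorted(records, key=key), key=key):
        yield udf(k, list(group))


# First-order user-defined functions for a word count.
def split_words(line):
    return [(word, 1) for word in line.split()]


def sum_counts(word, pairs):
    return (word, sum(count for _, count in pairs))


lines = ["to be or not to be"]
counts = list(reduce_operator(sum_counts, map_operator(split_words, lines)))
```

The operators decide how records are partitioned and handed to the UDFs, while the UDFs contain only the per-record or per-group logic. That separation is what allows a runtime to parallelize the calls.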

The runtime contains different primitives to execute these operators. Currently, there are two join algorithms implemented in Stratosphere: Sort-Merge join and Hybrid Hash Join. The REDUCE operator uses an external sort.
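The two join families can be illustrated with a minimal in-memory sketch in plain Python. It is not Stratosphere's implementation: hash_join is a simple hash join that omits the disk-spilling partitions that make a hash join "hybrid", and sort_merge_join sorts in memory rather than externally. Function names and sample data are hypothetical.

```python
from collections import defaultdict


def hash_join(left, right):
    """Build a hash table on the left input, then probe it with the right."""
    table = defaultdict(list)
    for key, value in left:
        table[key].append(value)
    return [(key, lv, rv) for key, rv in right for lv in table.get(key, [])]


def sort_merge_join(left, right):
    """Sort both inputs by key, then merge them with two cursors."""
    left = sorted(left, key=lambda r: r[0])
    right = sorted(right, key=lambda r: r[0])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the end of the run of equal keys on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # Emit the cross product of the two runs.
            out.extend((lk, lv, rv)
                       for _, lv in left[i:i_end]
                       for _, rv in right[j:j_end])
            i, j = i_end, j_end
    return out


users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(2, "book"), (1, "pen"), (2, "lamp"), (4, "desk")]
hash_result = hash_join(users, orders)
merge_result = sort_merge_join(users, orders)
```

Both functions return the same set of joined records. They differ in emission order and in cost profile: the hash join needs memory for the build side, while the sort-merge join pays for sorting but benefits when inputs are already sorted.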

Iterations
In addition to the five second-order functions (Map, Reduce, Cross, Match and CoGroup), Stratosphere provides two types of iterations: bulk and incremental.
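The difference between the two iteration types can be sketched as follows, assuming the usual reading that a bulk iteration recomputes the whole partial solution in every pass while an incremental iteration only revisits elements affected by the previous pass. This is plain Python, not the Stratosphere API; all names are hypothetical, and connected components by minimum-label propagation serves as the example.

```python
def bulk_iteration(step, state, max_iterations):
    """Recompute the whole partial solution in every pass."""
    for _ in range(max_iterations):
        new_state = step(state)
        if new_state == state:  # fixed point reached
            break
        state = new_state
    return state


def incremental_iteration(step, solution, workset, max_iterations):
    """Apply only the changes produced from the current workset."""
    for _ in range(max_iterations):
        if not workset:  # nothing changed in the last pass
            break
        delta, workset = step(solution, workset)
        solution.update(delta)
    return solution


# Undirected graph as adjacency lists: components {1, 2, 3} and {4, 5}.
edges = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}


def bulk_step(labels):
    # Every vertex takes the minimum label among itself and its neighbours.
    return {v: min([labels[v]] + [labels[n] for n in edges[v]]) for v in labels}


def incremental_step(labels, changed):
    # Only neighbours of vertices that changed last pass are reconsidered.
    delta = {}
    for v in changed:
        for n in edges[v]:
            current = delta.get(n, labels[n])
            if labels[v] < current:
                delta[n] = labels[v]
    return delta, set(delta)


initial = {v: v for v in edges}
bulk_result = bulk_iteration(bulk_step, dict(initial), 10)
incremental_result = incremental_iteration(
    incremental_step, dict(initial), set(edges), 10)
```

Both variants converge to the same labels. The incremental variant touches fewer vertices in later passes because its workset shrinks as parts of the graph stabilize.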

Stratosphere Compiler
In contrast to other frameworks, the pipeline is not fixed to a Map step followed by an optional Reduce step. A Stratosphere job can be an arbitrary directed acyclic graph (DAG) composed of multiple operators, bulk iterations, incremental iterations, data sources and data sinks. This enables optimizations such as choosing a join strategy at runtime based on metadata, available system resources and compiler hints provided by the user.
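A job shaped as a DAG, together with a strategy choice driven by metadata and hints, might be sketched like this. The code is plain Python (3.9 or later, for graphlib) and does not reflect Stratosphere's actual compiler or cost model; the job layout, the function names and the rule inside choose_join_strategy are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

sink_output = []

# Hypothetical job: each node maps to (function, names of its input nodes).
job = {
    "users": (lambda: [(1, "ann"), (2, "bob")], []),
    "orders": (lambda: [(1, "pen"), (2, "book"), (2, "lamp")], []),
    "join": (lambda u, o: [(k, name, item)
                           for k, name in u
                           for k2, item in o if k == k2],
             ["users", "orders"]),
    "filter": (lambda rows: [r for r in rows if r[1] == "bob"], ["join"]),
    "sink": (lambda rows: sink_output.extend(rows), ["filter"]),
}


def run(job):
    """Execute every node after all of its inputs have produced a result."""
    order = TopologicalSorter({name: deps for name, (_, deps) in job.items()})
    results = {}
    for name in order.static_order():
        func, deps = job[name]
        results[name] = func(*(results[d] for d in deps))
    return results


def choose_join_strategy(left_bytes, right_bytes, memory_bytes, hint=None):
    """Toy rule (an assumption): honour a user hint, otherwise use a hash
    join when the smaller input fits in memory, else a sort-merge join."""
    if hint is not None:
        return hint
    if min(left_bytes, right_bytes) <= memory_bytes:
        return "hybrid-hash"
    return "sort-merge"


run(job)
```

Because the job is data rather than a fixed Map-then-Reduce sequence, a planner can inspect the graph and substitute a different physical implementation for a node such as "join" before execution.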

The result of the optimized Stratosphere program is a job graph that can be executed on the cluster.

Nephele execution engine
Nephele is the low-level parallel execution engine of Stratosphere. It handles cluster resource allocation as well as in-memory and network communication.

File systems
Stratosphere supports HDFS, HBase, Avro, Amazon S3, JDBC and local file systems as data sources and data sinks for Stratosphere jobs.