User:Luc taesch/sandbox

Chaos Monkey is a tool invented in 2011 by Netflix to test the resilience of its IT infrastructure. Its purpose is to simulate failures in a real environment and to check that the computer system continues to work.

Concept
Historically, in software design, non-functional requirements were included in the general functional specifications. These requirements covered the software's ability to tolerate failures and to remain resilient so as to ensure optimal quality of service. Often, because of pressure to deliver quickly or a lack of knowledge of the field, development teams skipped these topics.

In 2011, Netflix engineers Yury Izrailevsky, today Director of Cloud & Infrastructure, and Ariel Tseitlin, today Director of Cloud Solutions, had the idea of changing the paradigm by running, in the production environment (the real environment used by Netflix customers), a tool that deliberately causes breakdowns. They proposed moving from a model where teams build software hoping that there will be no breakdowns to a model where they can be sure that a failure will occur, because it is provoked. Taking resilience into account in software design is then no longer an option but an obligation:

"At Netflix, our culture of freedom and responsibility led us not to force engineers to design their code in a specific way. Instead, we discovered that we could align our teams around the notion of infrastructure resilience by isolating the problems created by server neutralization and pushing them to the extreme. We have created Chaos Monkey, a program that randomly chooses a server and disables it during its usual hours of activity. Some will find that crazy, but we could not depend on the random occurrence of an event to test our behavior in the face of the very consequences of this event. Knowing that this would happen frequently has created a strong alignment among engineers to build redundancy and process automation to survive such incidents, without impacting the millions of Netflix users. Chaos Monkey is one of our most effective tools to improve the quality of our services."

"Imagine a monkey entering a 'data center', these 'farms' of servers that host all the critical functions of our online activities. The monkey randomly rips out cables, destroys devices and knocks over everything within its reach. The challenge for IT managers is to design the information system they are responsible for so that it can work despite these monkeys, of which no one ever knows when they will arrive or what they will destroy."

Netflix released the source code for this tool in 2012.
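The behaviour described in the quote (pick a server at random and disable it during normal working hours) can be illustrated with a minimal sketch. This is not Netflix's implementation: the `Instance` class, the `unleash_monkey` function and the 9:00 to 15:00 weekday window are assumptions made for illustration, and the "fleet" is a plain in-memory list rather than real cloud servers.

```python
import random
from datetime import datetime, time


class Instance:
    """Hypothetical stand-in for a server; a real tool would call a cloud API."""

    def __init__(self, name):
        self.name = name
        self.running = True


def in_business_hours(now, start=time(9, 0), end=time(15, 0)):
    """True on weekdays inside the (assumed) window when engineers are at work."""
    return now.weekday() < 5 and start <= now.time() < end


def unleash_monkey(fleet, now, rng=random):
    """Disable one randomly chosen running instance; return it, or None."""
    candidates = [inst for inst in fleet if inst.running]
    if not candidates or not in_business_hours(now):
        return None
    victim = rng.choice(candidates)
    victim.running = False  # simulated failure only
    return victim


fleet = [Instance("web-1"), Instance("web-2"), Instance("web-3")]
victim = unleash_monkey(fleet, datetime(2024, 1, 3, 10, 0))  # a Wednesday morning
```

Restricting the failures to working hours reflects the point made in the quote: the failure is provoked when the teams are present to observe and correct its consequences.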

Different variants of the Simian Army
The Simian Army is a suite of tools developed by Netflix to test the reliability, security, or resiliency of its Amazon Web Services infrastructure.

 * Chaos Monkey: the first tool developed by Netflix. It randomly selects instances in the production environment and deliberately puts them out of service.
 * Chaos Gorilla: at the very top of the Simian Army hierarchy, Chaos Gorilla takes down a full Amazon Availability Zone.
 * Latency Monkey: by introducing delays at the communication layer, this tool tests the system's tolerance to degraded performance of an external component it depends on, up to the simulation of a complete outage (an infinite delay), without having to ask the partner concerned to cut its service.
 * Doctor Monkey: detects all instances that show health risks (CPU overload, for example) and separates them from the system for root cause analysis or even shutdown.
 * Janitor Monkey: disables any unused instances to avoid wasting resources.
 * Conformity Monkey: disables any nonconforming instances so that the system can recreate them properly.
 * Security Monkey: derived from the Conformity Monkey, this tool disables all instances that have vulnerabilities.
 * 10-18 Monkey: detects localization and language problems (l10n-i18n) on instances.
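The Latency Monkey idea (adding delay to calls to a dependency, up to an infinite delay that stands for a complete outage) can be sketched as follows. This is a simulation under stated assumptions, not the Netflix tool: no real waiting takes place, the injected delay is simply compared with the caller's timeout, and the names `call_with_injected_latency`, `fetch_recommendations` and `DependencyTimeout` are hypothetical.

```python
import math


class DependencyTimeout(Exception):
    """Raised when the (simulated) dependency answers later than the timeout."""


def call_with_injected_latency(dependency, injected_delay, timeout):
    """Simulate a call that takes `injected_delay` seconds before answering."""
    if injected_delay >= timeout:
        raise DependencyTimeout(f"no answer within {timeout}s")
    return dependency()


def fetch_recommendations(injected_delay=0.0, timeout=0.5):
    """Caller designed for failure: falls back to a default when the call times out."""
    try:
        return call_with_injected_latency(
            lambda: ["personalised"], injected_delay, timeout
        )
    except DependencyTimeout:
        return ["default"]  # degraded but still working


normal = fetch_recommendations(injected_delay=0.1)
outage = fetch_recommendations(injected_delay=math.inf)  # infinite delay = outage
```

Using `math.inf` as the delay models the complete outage without the partner service having to be switched off, which is the point made in the description of the tool.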

Chaos Monkey and DevOps
As part of the DevOps movement, special attention is paid to the safe operation of computer systems, so as to provide a sufficient level of confidence despite frequent releases. By contributing to the DevOps toolchain, Chaos Monkey meets the need for continuous testing.

Tools of this kind belong to the "design for failure" pattern: a computer application must be able to withstand the failure of any underlying software or hardware component.

Chaos Engineering
Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system's ability to withstand turbulent conditions in production. A community has formed around the principles defined on the site http://principlesofchaos.org/, initiated by Netflix.
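The experimental approach can be illustrated with a small sketch: check that the system is in its normal ("steady") state, inject a failure, check again, then undo the failure. The function `run_experiment`, the toy `replicas` dictionary and the rule "at least two healthy replicas" are assumptions chosen for illustration; they are not taken from any particular tool.

```python
def run_experiment(steady_state, inject_failure, rollback):
    """Return 'aborted', 'held' or 'deviated' for one chaos experiment."""
    if not steady_state():
        return "aborted"  # never experiment on a system that is already unhealthy
    inject_failure()
    try:
        return "held" if steady_state() else "deviated"
    finally:
        rollback()  # always restore the system, whatever the outcome


# Toy system: three replicas of a service, each either healthy or not.
replicas = {"a": True, "b": True, "c": True}


def at_least_two_healthy():
    return sum(replicas.values()) >= 2


def kill_one():
    replicas["a"] = False


def restore():
    for name in replicas:
        replicas[name] = True


result = run_experiment(at_least_two_healthy, kill_one, restore)
```

A result of "held" increases confidence that the system tolerates the injected failure; "deviated" points to a weakness to fix before it appears on its own in production.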

Facebook Storm
To prepare for the loss of a data center, Facebook regularly tests the resistance of its infrastructure to extreme events. Known as Project Storm, the program simulates massive data center failures.

Days of Chaos
Inspired by AWS GameDays, volunteer application teams at Voyages-sncf.com took part in a Day of Chaos to test the resilience of their applications. Every 30 minutes, operators simulated failures in pre-production. Teams earned points for detecting, diagnosing and resolving them. This type of gamified event helps introduce development teams to the concept of resilience.

Presented at the 2017 Devops REX conference (devops REX, "Days of Chaos: the development of the devops culture at Voyages-Sn ...", Slideshare, 2017-10-03), the concept is described on the site http://days-of-chaos.com, which aims to collect feedback from other experiments.

Chaos Toolkit
The Chaos Toolkit was born from the desire to simplify access to the discipline of Chaos Engineering and to demonstrate that the experimental approach can be applied at different levels: infrastructure, platform, and application. The Chaos Toolkit is an open-source tool, licensed under Apache 2, published in October 2017.
