User:Erikhy/Software Operations

Software Operations is a Software Engineering discipline that encompasses Software Lifecycle Processes to build, launch, maintain and eventually sunset software systems that run in the "real world". Unlike software development, which is largely focused on the creation of new software, software operations is concerned with deploying "live" systems and keeping them healthy, scalable, and maintainable in an ever-changing and often dangerous global environment.

Overview
The web-enabled world is composed of myriad systems joined together into the global ecosystem that is the internet. Web sites and their databases, networks, hosting providers, Internet Service Providers (ISPs), edge caching providers and millions of users interact with each other in a fluid topology that changes on a moment-to-moment basis. Like a living body, such a system can run smoothly at times, or can have localized or widespread problems, can develop what amount to behaviors and personalities, can get ill, and can be repaired. Pieces of this ecosystem can remain static while others evolve rapidly.

Software Operations is a collection of disciplines which seek to understand this environment and make systems within it run effectively, from the time of their planning, creation and launch, through years of normal operations during which the system must adapt to changes around it, to the final decommissioning of the system as it reaches its end-of-life.

In small companies, software operations may consist of keeping a few servers healthy, ensuring email is handled properly, working with the Internet Service Provider to maintain connectivity to the internet, and perhaps managing the company's website.

In larger companies, software operations may additionally encompass planning for new systems and their service level agreements (SLAs), source code control systems, automated build and packaging systems, automated quality assurance (QA) systems, deployment systems, security, maintenance, performance, metrics and monitoring, troubleshooting, and vendor interactions.

Software operations is distinguished from its many component disciplines in that it takes a holistic view of all components of the systems, from end-users down to the bits of data in a database on a server. The emphasis is on the interconnectedness of the parts and their resulting behaviors, rather than on any one part.

A Software Operations Engineer (SOE) is a person whose job focuses on some set of these needs. The typical SOE has been a software developer, has done some networking, has maintained servers and systems, has done some QA work, and has faced the challenges of keeping live systems operational and healthy.

Software Operations Goals
Software Operations can be summed up as "Keep the system working, keep the costs down, make it play nicely with others, and do all the technical stuff the code developers don't do."

Stability and Reliability: In a global environment where 24x7 operations are considered normal, the stability of systems becomes paramount. Downtime in small systems may result in annoyed users; downtime in large systems such as Amazon.com can cost tens of thousands of dollars per minute.

Maintainability: Like getting your car's oil changed, software systems must undergo routine maintenance. Typically this includes building, packaging and deploying new versions of the system's software, updating commercial off-the-shelf (COTS) software with vendor patches and service packs, performing network maintenance, and troubleshooting when things go wrong.

Performance: How fast is the system running as a whole? How efficient is each component? How efficient is the communication between components? Performance work includes measurement and tuning of the software and its configuration, servers and their disk, memory and CPU resources, caching, and identification of traffic and other bottlenecks.

Scalability: SOEs will typically be responsible for scaling a system up or down as demand on it increases or decreases. Methods of scaling include adding or removing servers, tweaking the resources on servers such as available memory and disk space, and improving communications between system components. Scaling can also encompass the business processes that humans use to interact with the system, including end users, developers, maintainers, and other stakeholders. Both technical and human factors must be taken into account.

Adaptation: Business needs change over time, and systems must often be rearchitected or adapted to these new factors. Requirements creep is a common event in companies and systems of all sizes as the world around the systems changes.

Cost: Nearly every system in the world is constrained by cost. The largest costs tend to be salaries for maintainers of the system, followed by air conditioning costs in data centers, hardware and power costs, and vendor licensing and service costs.

Components Typical to Software Operations
Software operations manages a layered set of components, each with its own operational requirements. The outermost layer is typically the real-world user, while the innermost layer is typically the data stored somewhere in a database.

User: The user layer comprises the human beings that interact with the software. Types of users include legitimate clients using the software for its intended purpose, maintainers such as SOEs who deploy and patch the system, quality assurance personnel who do functional, load and other types of testing, security personnel who perform penetration and other intrusion testing, and illegitimate users, both internal and external, who attempt to compromise the system for data theft or other types of attacks. SOEs seek to understand the goals, behaviors and knowledge levels of each user type to enable legitimate users and disable illegitimate users.

Client: The client software layer generally consists of applications installed on the user's computer that interact with the system. Web browsers, Microsoft Office applications like Outlook, ad-hoc programs such as Perl or PHP scripts, and various command-line tools can all be clients of the system.

External Analytics: External analytics give the ability to monitor what clients are doing with the system. Typically, when a client loads a web page, analytics JavaScript will run, interrogate the browser for data about the request, the client, the browser and the operating system, package these data into a request, such as for a 1 pixel image, and send the request to the analytics service. The analytics service unpackages and records the data, from which graphs and reports about traffic, geo-location, popular pages, click-throughs and many other facets of user behavior can be generated. Popular analytics providers include Omniture and Google Analytics.
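
The data-packaging step described above can be sketched in Python; the host name, pixel path and parameter names here are illustrative only, not any particular vendor's API:

```python
from urllib.parse import urlencode

def build_beacon_url(service_host, page, user_agent, screen):
    """Package client data into a 1-pixel image request, roughly as an
    analytics script does (all names here are hypothetical)."""
    params = urlencode({
        "page": page,        # page the client loaded
        "ua": user_agent,    # browser / operating system string
        "res": screen,       # screen resolution
    })
    # The "image" is just a carrier; the service logs the query string.
    return f"https://{service_host}/pixel.gif?{params}"

url = build_beacon_url("analytics.example.com", "/home",
                       "Mozilla/5.0", "1920x1080")
```

The analytics service unpacks the query string on receipt and aggregates it into the traffic and behavior reports described above.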

External Health Monitoring: Externally-hosted health monitoring companies, such as Keynote and WebSitePulse, provide SOEs with an external-user view of the health of their system. Each will run ping checks against IPs and URL checks against web sites to determine latency and availability. When problems are detected, an alert is generated and sent to a distribution email list, usually consisting of the company's SOEs and network engineers. Reports of latency and availability are generated and periodically emailed as well.
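
A single probe of this kind can be approximated in a few lines of Python; this is a minimal sketch of the check, not any vendor's actual implementation:

```python
import time
import urllib.request

def probe(url, timeout=5.0):
    """One availability/latency check, like an external monitor's
    URL check. Returns (available, latency_in_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 400
    except OSError:
        ok = False  # connection refused, DNS failure, timeout, ...
    return ok, time.monotonic() - start

def availability_percent(results):
    """Availability over a series of probe outcomes (True = up)."""
    return 100.0 * sum(results) / len(results)
```

The periodic reports these services email are, in essence, aggregations like `availability_percent` over many such probes from multiple geographic locations.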

External caching: This optional layer is often used by high-traffic web sites to dynamically scale to traffic loads and to provide geographically distributed "internet edge servers" that are physically close to the client to minimize response time.

External storage: External storage is optionally used to store large amounts of data cheaply offsite. A vendor, such as Amazon's S3 service, is used to offload the costs of growing, maintaining and powering corporate data centers.

External compute: External computing can optionally be used both for hosting of web sites and for hosting of other applications. Web site hosting providers are a common solution for small companies who do not want to maintain their own data centers. Application hosting services, such as Amazon's Elastic Compute Cloud (EC2) and Akamai's services, allow a choice of operating system and application images to be rapidly loaded on multiple computers and rapidly and dynamically scaled up or down according to load.

External Network: The external network layer moves data between the client and the server systems, and includes Internet Service Provider links to the internet, DNS service providers that provide name-to-IP resolution, and often metro links between distributed components of the server-side system.

Boundary network: The boundary network, sometimes referred to as the "DMZ", consists of the ISP links arriving at the company, the external firewall which selectively opens ports and handles network address translation (NAT) between external and internal IPs, optional load balancers that distribute incoming traffic to servers, DNS servers, routers and network switches. The boundary network provides the first layer of security insulating internal systems from external abuse. In low- or medium-security environments, Secure Sockets Layer (SSL) connections, such as HTTPS, will be terminated at the load balancer. Web servers may be placed in the boundary network, while their supporting databases live on the internal, more highly protected network.

Internal network: The internal network consists of firewalls, routers, switches, DNS servers, and application and database servers which are proprietary to the operation of the company and which must be protected from external access and abuse.

Proxy Server: Proxy servers accept incoming connections and distribute them to application or web servers, optionally performing generic operations such as request validation, request throttling based on client IP, translation to a neutral format, or other light-weight filtering operations.
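
The per-IP request throttling mentioned above can be sketched as a sliding-window counter; this is a minimal illustration, not production-grade rate limiting:

```python
import time
from collections import defaultdict, deque

class IPThrottle:
    """Sliding-window request throttle keyed by client IP, the kind of
    light-weight filtering a proxy layer might apply."""
    def __init__(self, max_requests, window_seconds):
        self.max_requests = max_requests
        self.window = window_seconds
        self.hits = defaultdict(deque)  # ip -> timestamps of recent hits

    def allow(self, ip, now=None):
        """Return True if this request may pass, False to throttle."""
        now = time.monotonic() if now is None else now
        q = self.hits[ip]
        while q and now - q[0] > self.window:  # discard expired hits
            q.popleft()
        if len(q) >= self.max_requests:
            return False                       # over the limit
        q.append(now)
        return True
```

A real proxy would typically also bound memory by expiring idle IPs and would apply the check before any heavier request validation.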

Web servers: Web servers accept incoming connections from client applications such as web browsers, parse the request, access and aggregate the data needed to respond, format the data into a result, and return the result to the caller. They are responsible for rendering web pages and images (HTTP), transferring files (FTP), and for securely accessing their data stores.

Application Servers: In a multi-tier architecture, separate application servers may be used to compartmentalize business data. For example, catalog data, customer reviews, and payment systems may all contribute to the data appearing on a web page, but each may have its own application server farm to access and manipulate those data.

Database Servers: Database servers host the data used to support the application. Common databases are Oracle (large systems) and MySQL (free). The database server runs the database software. It may or may not physically store the data on its drives.

SAN: Storage Area Network devices are often used to hold large amounts of data on redundant arrays of large disk drives. These data may include both flat files and databases.

Software Operations Tasks
Architecture of systems is an often overlooked SOE role. The SOE can offer operational questions and perspectives that can help the resulting system architecture, design and implementation be tolerant of scaling, availability, performance, security and other real-world factors.

Source Code Control is often owned by the SOE, and includes systems, storage and source code control software, and, equally importantly, standards of use, such as check-out and check-in policies, branching policies for diverging code streams such as hot fixes, and roll-back and roll-forward policies for failed versions.

Configuration refers to all the settings and data surrounding the software necessary for that software to run properly. Network devices, servers, applications and databases all have configuration settings unique to the system. SOEs generally own the source code control of the configuration data, automation of configuration deployment, consistency of configuration across application instances and server farms, and QA of the actual configuration data set.
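
Checking configuration consistency across a server farm can be as simple as comparing fingerprints of each host's settings; a minimal sketch (host names and settings are invented for illustration):

```python
import hashlib
import json

def config_fingerprint(config):
    """Stable hash of a configuration dict, so copies on different
    servers can be compared cheaply."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def find_drift(farm_configs):
    """Given {hostname: config_dict}, report hosts whose configuration
    differs from the majority fingerprint."""
    prints = {h: config_fingerprint(c) for h, c in farm_configs.items()}
    majority = max(set(prints.values()), key=list(prints.values()).count)
    return sorted(h for h, p in prints.items() if p != majority)
```

An SOE might run such a drift check after every deployment, alarming on any host whose configuration no longer matches its peers.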

Building refers to the process of getting the correct source code from the source code control system, compiling and linking it with all the required resources, placing the result on a build server, and monitoring the process for errors. While developers own the code, SOEs generally own the build process, servers and data stores.

Packaging refers to collecting all the build products and bundling them for distribution in some manner. A package could be as simple as a file to be deployed or an MSI to be installed, or perhaps a more complex library of executables and documentation that gets burned to DVD and eventually boxed and shipped, or put on a server for download.
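
The simplest form of packaging, bundling build products into a single archive, looks roughly like this; the file and archive names are invented for the sketch:

```python
import os
import tarfile
import tempfile

def package_build(product_paths, out_path):
    """Bundle build products into a gzipped tarball, a minimal stand-in
    for a real packaging step (MSI, DVD image, download bundle)."""
    with tarfile.open(out_path, "w:gz") as tar:
        for path in product_paths:
            tar.add(path, arcname=os.path.basename(path))
    return out_path

# usage sketch with a scratch build product
work = tempfile.mkdtemp()
binary = os.path.join(work, "app.bin")
with open(binary, "w") as f:
    f.write("compiled bits")
archive = package_build([binary], os.path.join(work, "release-1.0.tar.gz"))
```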

Testing is usually the province of a Quality Assurance department. SOEs, however, generally own the automation and integration of testing into the build, package and deployment processes.

Deployment refers to moving the build products onto the servers in a safe manner, replacing the existing running software with the new versions. Small deployments can be done by hand; large deployments that are complex or target hundreds or thousands of servers are usually automated. SOEs own the deployment process and automation.
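
The "safe manner" above usually means deploying in small batches with health checks between them; a sketch of the control flow, where `deploy_fn` and `health_fn` are hypothetical hooks supplied by the caller:

```python
def rolling_deploy(servers, deploy_fn, health_fn, batch_size=1):
    """Deploy to servers in batches, verifying health before proceeding;
    abort and report on the first unhealthy host."""
    done = []
    for i in range(0, len(servers), batch_size):
        batch = servers[i:i + batch_size]
        for host in batch:
            deploy_fn(host)            # push the new build to this host
        for host in batch:
            if not health_fn(host):    # stop the rollout on failure
                return {"deployed": done, "failed": host}
        done.extend(batch)
    return {"deployed": done, "failed": None}
```

Real deployment automation adds roll-back of the failed batch, draining of traffic from hosts being updated, and logging, but the batch-then-verify loop is the core idea.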

Backups of databases, file systems and configurations of applications and network devices are a vital SOE task that mitigates the effects of both minor failures, like a blown disk drive, and major disasters such as a data center fire, flood or lightning strike. Backups should be archived well off site so that they are safe from a regional disaster, like a hurricane, that may affect the company's data center.
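
A minimal local backup step might look like the following sketch; the directory names are invented, and shipping the archive offsite would be a separate step:

```python
import os
import tarfile
import tempfile
from datetime import datetime, timezone

def make_backup(source_dir, backup_dir):
    """Create a timestamped gzip archive of source_dir so multiple
    generations of backups can coexist and be aged out."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    name = f"{os.path.basename(source_dir)}-{stamp}.tar.gz"
    out = os.path.join(backup_dir, name)
    with tarfile.open(out, "w:gz") as tar:
        tar.add(source_dir, arcname=os.path.basename(source_dir))
    return out

# usage sketch with a scratch directory standing in for real data
work = tempfile.mkdtemp()
src = os.path.join(work, "appdata")
os.makedirs(src)
with open(os.path.join(src, "db.dump"), "w") as f:
    f.write("rows")
archive = make_backup(src, work)
```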

Service Level Agreements (SLAs) define the operational parameters of all the components of the system, such as availability and latency, who will respond to problems and how fast they must respond, how long various problems will be tolerated before a fix must be deployed, and what monetary or other penalties may be assigned if an SLA is breached. SOEs take the role of defining and refining SLAs, monitoring system performance against the SLAs, configuring proper alarming, and responding to problems.
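
Availability SLAs translate directly into a downtime budget; the arithmetic for a monthly budget:

```python
def downtime_budget_minutes(sla_percent, days=30):
    """Minutes of downtime a given availability SLA allows per period."""
    return (1 - sla_percent / 100.0) * days * 24 * 60

# a "three nines" monthly SLA allows about 43 minutes of downtime
budget = downtime_budget_minutes(99.9)
```

This is the number an SOE monitors burn-down against: once incidents have consumed the budget, the SLA is breached and any agreed penalties apply.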

Performance Metrics help the SOE gain understanding of how the system is behaving, and generally involve monitoring some or all of the components of the system against Key Performance Indicators (KPIs). Typical KPIs include latency, availability, CPU utilization, disk space usage, I/O usage, network utilization, and many other possible metrics. SOEs generally define what KPIs are valuable, set up monitoring, respond to fluctuations or degradations, troubleshoot to find a root cause, and institute appropriate actions to restore performance to nominal levels.
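
The "respond to fluctuations" step usually begins with comparing the latest KPI samples against alarm thresholds; a sketch, with KPI names and limits invented for illustration:

```python
def check_kpis(samples, thresholds):
    """Compare current KPI samples against alarm thresholds and return
    the breached KPIs with their observed values."""
    breaches = {}
    for kpi, limit in thresholds.items():
        value = samples.get(kpi)
        if value is not None and value > limit:
            breaches[kpi] = value  # over the limit: raise an alarm
    return breaches

alerts = check_kpis(
    {"latency_ms": 850, "cpu_pct": 42, "disk_pct": 91},
    {"latency_ms": 500, "cpu_pct": 80, "disk_pct": 90},
)
```

Production monitoring systems add smoothing, repeated-breach requirements, and alert routing on top of this basic comparison to avoid paging on momentary spikes.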

Business Metrics are those Key Performance Indicators (KPIs) that reflect how the system is enhancing the business. A web site owner may be interested in how traffic to the site trends over time, where visitors are located, what pages they visit, and so on. E-commerce businesses may be interested in revenue and where customers are falling out of the checkout pipeline. SOEs are often involved in implementing KPI monitoring tools that can collect such data, analyse them, and generate reports valuable to the business owner.

Security encompasses the infrastructure supporting the web site or other software, the software itself, the data received, manipulated and transmitted by the software, and, perhaps most importantly, the image of the company and the trust users place in it. When systems deal with money or personal data, a compromise of that data can be devastating to a company's reputation far beyond the incidental cost of purchases or monies involved. Software operations takes a holistic approach to security across the end-to-end system, from user to database and back, attempting to
 * secure data from theft or corruption using encryption.
 * secure networks from intrusion, DNS exploits, SSL exploits, man-in-the-middle attacks and other dangers.
 * secure servers by filtering incoming requests and outgoing responses, and by controlling server access.
 * secure systemic availability by throttling incoming requests.
 * secure applications from intrusions such as buffer overruns, SQL injection, cross site scripting (XSS), error handling exploitation, information leakage, privilege compromises, and cryptographic compromises.
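
Of the application-level attacks listed above, SQL injection has a standard defense: parameterized queries, where the database driver escapes user input rather than the application splicing it into SQL text. A minimal illustration using Python's built-in sqlite3 (the table and data are invented):

```python
import sqlite3

# in-memory table standing in for an application database
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (name TEXT, role TEXT)")
db.execute("INSERT INTO users VALUES ('alice', 'admin')")

def find_user(name):
    """Parameterized query: the driver binds `name` as data, so input
    like "' OR '1'='1" is matched literally rather than altering the
    SQL statement itself."""
    cur = db.execute("SELECT name, role FROM users WHERE name = ?", (name,))
    return cur.fetchall()
```

The same binding mechanism exists in every mainstream database API; string-concatenated SQL is the vulnerability the SOE looks for in code and configuration reviews.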

Maintenance can consist of deploying new software versions, applying security and other kinds of patches, life-cycling hardware and software, migrating systems to new locations, and ensuring systems have sufficient resources such as disk space, memory, CPU and bandwidth. SOEs generally are responsible for scheduling, performing and testing maintenance activities.

Troubleshooting is generally the least favorite task of SOEs, and consists of isolating failures, determining the root causes, developing, deploying and QAing a solution, documenting the fix in a knowledge-base, and often communicating with and/or fending off irate users. In general, most companies have at least one on-call SOE, and often many, carrying pagers or cell phones, ready to respond to outages and degradations.

Administration includes routine tasks such as credentialing new users, removing old users, restoring data from backups, and performing other application-specific tasks.

Vendor Tech Support is also typically a responsibility of the SOE, who acts as a local subject matter expert and liaison for external services that may be contracted, such as monitoring, analytics, external hosting, caching or compute services.