Resilient control systems

A resilient control system is one that maintains state awareness and an accepted level of operational normalcy in response to disturbances, including threats of an unexpected and malicious nature.

Computerized or digital control systems are used to reliably automate many industrial operations such as power plants or automobiles. The complexity of these systems and how the designers integrate them, the roles and responsibilities of the humans that interact with the systems, and the cyber security of these highly networked systems have led to a new paradigm in research philosophy for next-generation control systems. Resilient control systems consider all of these elements, along with the disciplines that contribute to a more effective design, such as cognitive psychology, computer science, and control engineering, to develop interdisciplinary solutions. These solutions consider things such as how to tailor the control system operating displays to best enable the user to make an accurate and reproducible response, how to design in cybersecurity protections such that the system defends itself from attack by changing its behaviors, and how to better integrate widely distributed computer control systems to prevent cascading failures that result in disruptions to critical industrial operations.

In the context of cyber-physical systems, resilient control systems are an aspect that focuses on the interdependencies unique to a control system, as compared to information technology computer systems and networks, because of the importance of control systems in operating critical industrial operations.

Introduction
Originally intended to provide a more efficient mechanism for controlling industrial operations, the development of digital control systems allowed for flexibility in integrating distributed sensors and operating logic while maintaining a centralized interface for human monitoring and interaction. This ease of adding sensors and logic through software, which was once done with relays and isolated analog instruments, has led to wide acceptance and integration of these systems in all industries. However, these digital control systems have often been integrated in phases to cover different aspects of an industrial operation and connected over a network, leading to a complex interconnected and interdependent system. While the control theory applied is often nothing more than a digital version of its analog counterpart, the dependence of digital control systems upon communications networks has precipitated the need for cybersecurity due to potential effects on the confidentiality, integrity and availability of the information. To achieve resilience in the next generation of control systems, therefore, addressing the complex control system interdependencies, including human systems interaction and cybersecurity, will be a recognized challenge.



From a philosophical standpoint, advancing the area of resilient control systems requires a definition, metrics, and consideration of the challenges and the associated disciplinary fusion needed to address them. From these will fall the value proposition for investment and adoption. Each of these topics will be discussed in what follows, but for perspective consider Fig. 1.

Defining resilience
Research in resilience engineering over the last decade has focused on two areas, organizational and information technology. Organizational resilience considers the ability of an organization to adapt and survive in the face of threats, including the prevention or mitigation of unsafe, hazardous or compromising conditions that threaten its very existence. Information technology resilience has been considered from a number of standpoints. Networking resilience has been considered as quality of service. Computing has considered such issues as dependability and performance in the face of unanticipated changes. However, based upon the application of control dynamics to industrial processes, functionality and determinism are primary considerations that are not captured by the traditional objectives of information technology.

Considering the paradigm of control systems, one definition has been suggested that "Resilient control systems are those that tolerate fluctuations via their structure, design parameters, control structure and control parameters". However, this definition is taken from the perspective of control theory application to a control system. The malicious actor and cyber security are not directly considered, which might suggest the definition, "an effective reconstitution of control under attack from intelligent adversaries," which was proposed. However, this definition focuses only on resilience in response to a malicious actor. To consider the cyber-physical aspects of a control system, a definition of resilience needs to consider both benign and malicious human interaction, in addition to the complex interdependencies of the control system application.

The term “recovery” has been used in the context of resilience, paralleling the ability of a rubber ball to stay intact when a force is exerted on it and to recover its original dimensions after the force is removed. Considering the rubber ball in terms of a system, resilience could then be defined as its ability to maintain a desired level of performance or normalcy without irrecoverable consequences. While resilience in this context is based upon the yield strength of the ball, control systems require an interaction with the environment, namely the sensors, valves and pumps that make up the industrial operation. To be reactive to this environment, a control system requires an awareness of its state to make corrective changes to the industrial process and maintain normalcy. With this in mind, in consideration of the discussed cyber-physical aspects of human systems integration and cyber security, as well as other definitions for resilience at a broader critical infrastructure level, the following can be deduced as a definition of a resilient control system:

"A resilient control system is one that maintains state awareness and an accepted level of operational normalcy in response to disturbances, including threats of an unexpected and malicious nature"



Considering the flow of a digital control system as a basis, a resilient control system framework can be designed. Referring to the left side of Fig. 2, a resilient control system holistically considers the measures of performance or normalcy for the state space. At the center, an understanding of performance and priority provides the basis for an appropriate response by a combination of human and automation, embedded within a multi-agent, semi-autonomous framework. Finally, to the right, information must be tailored to the consumer to address the need and position a desirable response. Several examples or scenarios of how resilience differs and provides benefit to control system design are available in the literature.

Areas of resilience
Some primary tenets of resilience, as contrasted to traditional reliability, have presented themselves in considering an integrated approach to resilient control systems. These cyber-physical tenets complement the fundamental concept of dependable or reliable computing by characterizing resilience in regard to control system concerns, including design considerations that provide a level of understanding and assurance in the safe and secure operation of an industrial facility. These tenets are discussed individually below to summarize some of the challenges to address in order to achieve resilience.

Human systems
The benign human has an ability to quickly understand novel solutions and to adapt to unexpected conditions. This behavior can provide additional resilience to a control system, but reproducibly predicting human behavior is a continuing challenge. Historic human preferences can be captured and applied to Bayesian inference and Bayesian belief networks, but ideally a solution would consider direct understanding of human state using sensors such as an EEG. Considering control system design and interaction, the goal would be to tailor the amount of automation necessary to achieve some level of optimal resilience for this mixed-initiative response. The human would be presented with the actionable information that provides the basis for a targeted, reproducible response.
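As a minimal sketch of how historic operator preferences might feed Bayesian inference, the following assumes a hypothetical binary decision (the operator either accepts or overrides an automated recommendation) modeled with a Beta prior. The function names and the example history are illustrative assumptions and are not taken from any cited method.

```python
def update_beta(alpha, beta, accepted):
    """Conjugate Beta-Bernoulli update from one observed operator decision."""
    return (alpha + 1, beta) if accepted else (alpha, beta + 1)


def posterior_mean(alpha, beta):
    """Expected probability that the operator accepts the recommendation."""
    return alpha / (alpha + beta)


# Start from a uniform prior, Beta(1, 1): no preference assumed.
alpha, beta = 1, 1

# Hypothetical history: True = accepted the recommendation, False = overrode it.
history = [True, True, False, True]
for accepted in history:
    alpha, beta = update_beta(alpha, beta, accepted)

p_accept = posterior_mean(alpha, beta)
print(f"Beta({alpha}, {beta}), estimated acceptance probability {p_accept:.3f}")
```

An estimate of this kind could be one input when deciding how much automation to apply, although a full Bayesian belief network would relate many such variables rather than one.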

Cyber security
In contrast to the challenges of prediction and integration of the benign human with control systems, the abilities of the malicious actor (or hacker) to undermine desired control system behavior also create a significant challenge to control system resilience. Application of dynamic probabilistic risk analysis used in human reliability can provide some basis for the benign actor. However, the decidedly malicious intentions of an adversarial individual, organization or nation make modeling of the human difficult, since both objectives and motives vary. In defining a control system response to such intentions, it should be noted that the malicious actor relies on some level of recognized behavior to gain an advantage and provide a pathway to undermining the system. Whether observed separately in preparation for a cyber attack, or on the system itself, these behaviors can provide opportunity for a successful attack without detection. Therefore, in considering resilient control system architecture, atypical designs that embed actively and passively implemented randomization of attributes would be suggested to reduce this advantage.

Complex networks and networked control systems
While much of the current critical infrastructure is controlled by a web of interconnected control systems, in architectures termed either distributed control systems (DCS) or supervisory control and data acquisition (SCADA), the application of control is moving toward a more decentralized state. In moving to a smart grid, the complex interconnected nature of individual homes, commercial facilities and diverse power generation and storage creates both an opportunity and a challenge in ensuring that the resulting system is more resilient to threats. The ability to operate these systems to achieve a global optimum for multiple considerations, such as overall efficiency, stability and security, will require mechanisms to holistically design complex networked control systems. Multi-agent methods suggest a mechanism to tie a global objective to distributed assets, allowing for management and coordination of assets for optimal benefit, with semi-autonomous but constrained controllers that can react rapidly to maintain resilience under rapidly changing conditions.

Base metrics for resilient control systems
Establishing a metric that can capture the resilience attributes can be complex, at least if considered based upon differences between the interactions or interdependencies. Evaluating the control, cyber and cognitive disturbances, especially if considered from a disciplinary standpoint, leads to measures that have already been established. However, if the metric were instead based upon a normalizing dynamic attribute, such as a performance characteristic that can be impacted by degradation, an alternative is suggested. Specifically, applications of base metrics to resilience characteristics are given as follows for each type of disturbance:


 * Physical disturbances:
   * Time latency affecting stability
   * Data integrity affecting stability
 * Cyber disturbances:
   * Time latency
   * Data confidentiality, integrity and availability
 * Cognitive disturbances:
   * Time latency in response
   * Data digression from desired response

Such performance characteristics exist with both time and data integrity. Time, both in terms of delay of mission and communications latency, and data, in terms of corruption or modification, are normalizing factors. In general, the idea is to base the metric on “what is expected” and not necessarily on the actual initiator of the degradation. Considering time as a metrics basis, resilient and un-resilient systems can be observed in Fig. 3.
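The "time latency affecting stability" item in the list above can be illustrated with a small simulation. The sketch below is an illustrative assumption rather than a model from the source: a discrete-time integrator plant under proportional feedback, where the controller only sees a measurement that is `delay` samples old. The plant, the gain of 0.8 and the delay values are all hypothetical.

```python
def simulate(gain, delay, steps=200, x0=1.0):
    """Integrator plant x[k+1] = x[k] + u[k], with u[k] = -gain * x[k - delay].

    `delay` is the measurement latency in samples (hypothetical model).
    """
    x = [x0] * (delay + 1)  # pre-fill history so a delayed sample always exists
    for _ in range(steps):
        delayed_measurement = x[-1 - delay]
        x.append(x[-1] - gain * delayed_measurement)
    return x


def tail_peak(x, n=10):
    """Largest magnitude over the last n samples, a crude settled-or-diverged check."""
    return max(abs(v) for v in x[-n:])


for delay in (0, 1, 2):
    peak = tail_peak(simulate(gain=0.8, delay=delay))
    print(f"delay={delay} samples, tail peak={peak:.3e}")
```

With this gain the loop settles toward zero for delays of 0 and 1 sample but diverges for a delay of 2 samples, which shows how latency alone, whatever its initiator, can move the same control loop from stable to unstable.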



Dependent upon the abscissa metrics chosen, Fig. 3 reflects a generalization of the resiliency of a system. On the abscissa of Fig. 3, it can be recognized that cyber and cognitive influences can affect both the data and the time, which underscores the relative importance of recognizing these forms of degradation in resilient control designs. For cybersecurity, a single cyberattack can degrade a control system in multiple ways. Additionally, control impacts can be characterized as indicated. While these terms are fundamental and may seem of little value to those correlating impact in terms like cost, the development of use cases provides a means by which this relevance can be codified. For example, given the impact to system dynamics or data, the performance of the control loop can be directly ascertained and can show the approach to instability and operational impact. Several common terms are represented on this graphic, including robustness, agility, adaptive capacity, adaptive insufficiency, resiliency and brittleness. Each of these terms is explained below:
 * Agility: The derivative of the disturbance curve. This average defines the ability of the system to resist degradation on the downward slope, but also to recover on the upward slope. Primarily considered a time-based term that indicates impact to mission. Considers both short-term system actions and longer-term human responder actions.
 * Adaptive Capacity: The ability of the system to adapt or transform from impact and maintain minimum normalcy. Considered a value between 0 and 1, where 1 is fully operational and 0 is the resilience threshold.
 * Adaptive Insufficiency: The inability of the system to adapt or transform from impact, indicating an unacceptable performance loss due to the disturbance. Considered a value between 0 and -1, where 0 is the resilience threshold and -1 is total loss of operation.
 * Brittleness: The area under the disturbance curve as intersected by the resilience threshold. This indicates the impact from the loss of operational normalcy.
 * Phases of Resilient Control System Preparation and Disturbance Response:
   * Recon: Maintaining proactive state awareness of system conditions and degradation
   * Resist: System response to recognized conditions, both to mitigate and counter
   * Respond: System degradation has been stopped and system performance is returning
   * Restore: Longer-term performance restoration, which includes equipment replacement
 * Resiliency: The converse of brittleness, which for a resilient system is “zero” loss of minimum normalcy.
 * Robustness: A positive or negative number associated with the area between the disturbance curve and the resilience threshold, indicating either the capacity or insufficiency, respectively.
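The terms above can be made concrete with a small numerical sketch. It assumes a sampled disturbance curve normalized so that 1 is fully operational, 0 is the resilience threshold and -1 is total loss of operation, as in the definitions. The specific formulas (trapezoidal areas, the lowest point of the curve as the capacity or insufficiency value, and the steepest slopes as agility indicators) are one possible reading of those definitions and are assumptions for illustration only.

```python
def trapezoid_area(y, t):
    """Trapezoidal integral of samples y over strictly increasing times t."""
    return sum((t[i + 1] - t[i]) * (y[i] + y[i + 1]) / 2 for i in range(len(t) - 1))


def resilience_metrics(t, perf):
    """Summarize a normalized disturbance curve (threshold at 0)."""
    slopes = [(perf[i + 1] - perf[i]) / (t[i + 1] - t[i]) for i in range(len(t) - 1)]
    lowest = min(perf)
    below_threshold = [min(p, 0.0) for p in perf]  # keep only the part under the threshold
    return {
        "max_degradation_rate": min(slopes),          # agility on the downward slope
        "max_recovery_rate": max(slopes),             # agility on the upward slope
        "adaptive_capacity": max(lowest, 0.0),        # margin kept above the threshold
        "adaptive_insufficiency": min(lowest, 0.0),   # depth reached below the threshold
        "brittleness": -trapezoid_area(below_threshold, t),  # area under the threshold
        "robustness": trapezoid_area(perf, t),        # signed area relative to the threshold
    }


# Hypothetical disturbance: performance drops below the threshold, then recovers.
t = [0.0, 1.0, 2.0, 3.0, 4.0]
perf = [1.0, 0.0, -0.5, 0.0, 1.0]
metrics = resilience_metrics(t, perf)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Clipping at the sample points is only approximate when the curve crosses the threshold between samples, so a finer time grid or interpolation of the crossing would be needed for anything beyond illustration.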

Resilience manifold for design and operation


The very nature of control systems implies a starting point for the development of resilience metrics. That is, the control of a physical process is based upon quantifiable performance and measures, including first principles and stochastic measures. The ability to provide this measurement, which is the basis for correlating operational performance and adaptation, then also becomes the starting point for correlation of the data and time variations that can come from the cognitive, cyber-physical sources. Effective understanding is based upon developing a manifold of adaptive capacity that correlates the design (and operational) buffer. For a power system, this manifold is based upon the real and reactive power assets, the controllable assets having the latitude to maneuver, and the impact of disturbances over time. For a modern distribution system (MDS), these assets can be aggregated from the individual contributions as shown in Fig. 4. For this figure, these assets include: a) a battery, b) an alternate tie line source, c) an asymmetric P/Q-conjectured source, d) a distribution static synchronous compensator (DSTATCOM), and e) a low-latency, four-quadrant source with no energy limit.
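As a rough sketch of how individual contributions might be aggregated, the following treats each asset as a rectangular range of real power (P) and reactive power (Q) and limits sustained real power output by stored energy over a chosen time horizon. The rectangular limits, the asset ratings and the `Asset` class are simplifying assumptions for illustration; real devices have more complex capability curves, and this is not the construction used for Fig. 4.

```python
from dataclasses import dataclass


@dataclass
class Asset:
    name: str
    p_min_kw: float
    p_max_kw: float
    q_min_kvar: float
    q_max_kvar: float
    energy_kwh: float = float("inf")  # inf means no energy limit


def sustained_p_max(asset, hours):
    """Real power the asset can hold for `hours`, limited by stored energy."""
    if hours <= 0:
        return asset.p_max_kw
    return min(asset.p_max_kw, asset.energy_kwh / hours)


def aggregate(assets, hours):
    """Sum the per-asset P/Q ranges into one aggregate range for a time horizon."""
    return {
        "p_min_kw": sum(a.p_min_kw for a in assets),
        "p_max_kw": sum(sustained_p_max(a, hours) for a in assets),
        "q_min_kvar": sum(a.q_min_kvar for a in assets),
        "q_max_kvar": sum(a.q_max_kvar for a in assets),
    }


# Hypothetical ratings for two of the asset types named in the text.
assets = [
    Asset("battery", -50.0, 50.0, -30.0, 30.0, energy_kwh=25.0),
    Asset("dstatcom", 0.0, 0.0, -40.0, 40.0),
]
quarter_hour = aggregate(assets, hours=0.25)
one_hour = aggregate(assets, hours=1.0)
print(quarter_hour)
print(one_hour)
```

The shrinking real power range at the longer horizon reflects the time dependence described above: the latitude to maneuver depends on how long a disturbance must be ridden through.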

Examples of resilient control system developments
1) When considering the current digital control system designs, the cyber security of these systems is dependent upon what are considered border protections, i.e., firewalls, passwords, etc. If a malicious actor compromised the digital control system for an industrial operation by a man-in-the-middle attack, data could be corrupted within the control system. The industrial facility operator would have no way of knowing the data has been compromised until someone such as a security engineer recognized that the attack was occurring. As operators are trained to provide a prompt, appropriate response to stabilize the industrial facility, there is a likelihood that the corrupt data would lead the operator to react to the apparent situation and cause a plant upset. In a resilient control system, as per Fig. 2, cyber and physical data are fused to recognize anomalous situations and warn the operator.
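A minimal sketch of this kind of fusion follows, assuming a hypothetical tank whose level can be predicted from measured flows by a mass balance, and a hypothetical count of network alerts supplied by some monitoring tool. The thresholds, messages and function names are illustrative assumptions and do not describe a specific product or published algorithm.

```python
def physical_residual(level_prev, inflow, outflow, area, dt, level_reported):
    """Gap between the reported level and a mass-balance prediction (hypothetical tank)."""
    predicted = level_prev + (inflow - outflow) * dt / area
    return abs(level_reported - predicted)


def fuse(residual, residual_limit, cyber_alerts):
    """Combine a physical consistency check with a cyber indicator into one advisory."""
    physical_anomaly = residual > residual_limit
    cyber_anomaly = cyber_alerts > 0
    if physical_anomaly and cyber_anomaly:
        return "warn: reported data may have been manipulated"
    if physical_anomaly:
        return "warn: possible sensor or process fault"
    if cyber_anomaly:
        return "watch: network anomaly, physical data consistent"
    return "normal"


# Reported level disagrees with the mass balance while network alerts are active.
residual = physical_residual(
    level_prev=2.0, inflow=3.0, outflow=1.0, area=4.0, dt=1.0, level_reported=3.5
)
print(fuse(residual, residual_limit=0.2, cyber_alerts=2))
```

The point of the sketch is that neither signal alone distinguishes a failing sensor from corrupted data, whereas the combination gives the operator a more specific warning.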

2) As our society becomes more automated for a variety of drivers, including energy efficiency, the need to implement ever more effective control algorithms naturally follows. However, advanced control algorithms are dependent upon data from multiple sensors to predict the behaviors of the industrial operation and make corrective responses. This type of system can become very brittle, insofar as any unrecognized degradation in a sensor can lead to incorrect responses by the control algorithm and potentially a worsened condition relative to the desired operation for the industrial facility. Therefore, implementation of advanced control algorithms in a resilient control system also requires the implementation of diagnostic and prognostic architectures to recognize sensor degradation, as well as failures of the industrial process equipment associated with the control algorithms.
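One simple diagnostic of this kind, shown here only as an illustrative sketch, compares redundant sensors against their median and flags any that deviate by more than a tolerance. The sensor names, readings and tolerance are hypothetical, and practical diagnostic and prognostic architectures would typically also track trends over time.

```python
from statistics import median


def flag_degraded(readings, tolerance):
    """Return the names of sensors that deviate from the median by more than tolerance."""
    reference = median(readings.values())
    return sorted(
        name for name, value in readings.items() if abs(value - reference) > tolerance
    )


# Three hypothetical redundant temperature sensors on the same process line.
readings = {"T1": 100.2, "T2": 99.9, "T3": 112.0}
suspect = flag_degraded(readings, tolerance=2.0)
print(suspect)
```

A control algorithm could then exclude the flagged sensor from its inputs, or alert the operator, instead of responding to the degraded value.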

Resilient control system solutions and the need for interdisciplinary education
In our world of advancing automation, our dependence upon these advancing technologies will require educated skill sets from multiple disciplines. The challenges may appear simply rooted in better design of control systems for greater safety and efficiency. However, the evolution of the technologies in the current design of automation has created a complex environment in which a cyber-attack, human error (whether in design or operation), or a damaging storm can wreak havoc on the basic infrastructure. The next generation of systems will need to consider the broader picture to ensure a path forward where failures do not lead to ever greater catastrophic events. One critical resource is students, who are expected to develop the skills necessary to advance these designs and who require both a perspective on the challenges and an understanding of the contributions of others to fulfill the need. Addressing this need, a semester course in resilient control systems was established over a decade ago at Idaho and other universities as a catalogue or special topics focus for undergraduate and graduate students. The lessons in this course were codified in a text that provides the basis for the interdisciplinary studies. In addition, other courses have been developed to provide the perspectives and relevant examples to overview the critical infrastructure issues and provide opportunity to create resilient solutions at such universities as George Mason University and Northeastern.

Through the development of technologies designed to set the stage for next generation automation, it has become evident that effective teams comprise several disciplines. However, developing a level of effectiveness can be time-consuming, and when done in a professional environment can expend a lot of energy and time that provides little obvious benefit to the desired outcome. It is clear that the earlier these STEM disciplines can be successfully integrated, the more effective they are at recognizing each other's contributions and working together to achieve a common set of goals in the professional world. Team competition at venues such as Resilience Week will be a natural outcome of developing such an environment, allowing interdisciplinary participation and providing an exciting challenge to motivate students to pursue a STEM education.

Standardizing resilience and resilient control system principles
Standards and policy that define resilience nomenclature and metrics are needed to establish a value proposition for investment, which includes government, academia and industry. The IEEE Industrial Electronics Society has taken the lead in forming a technical committee toward this end. The purpose of this committee will be to establish metrics and standards associated with codifying promising technologies that promote resilience in automation. This effort is distinct from the more supply-chain-focused community work on resilience and security, such as the efforts of ISO and NIST.