Nonlinear system identification

System identification is a method of identifying or measuring the mathematical model of a system from measurements of the system inputs and outputs. The applications of system identification include any system where the inputs and outputs can be measured and include industrial processes, control systems, economic data, biology and the life sciences, medicine, social systems and many more.

A nonlinear system is defined as any system that is not linear, that is any system that does not satisfy the superposition principle. This negative definition tends to obscure that there are very many different types of nonlinear systems. Historically, system identification for nonlinear systems has developed by focusing on specific classes of system and can be broadly categorized into five basic approaches, each defined by a model class:
 * 1) Volterra series models,
 * 2) Block-structured models,
 * 3) Neural network models,
 * 4) NARMAX models, and
 * 5) State-space models.

There are four steps to be followed for system identification: data gathering, model postulate, parameter identification, and model validation. Data gathering is considered as the first and essential part in identification terminology, used as the input for the model which is prepared later. It consists of selecting an appropriate data set, pre-processing and processing. It involves the implementation of the known algorithms together with the transcription of flight tapes, data storage and data management, calibration, processing, analysis, and presentation. Moreover, model validation is necessary to gain confidence in, or reject, a particular model. In particular, the parameter estimation and the model validation are integral parts of the system identification. Validation refers to the process of confirming the conceptual model and demonstrating an adequate correspondence between the computational results of the model and the actual data.

Volterra series methods
The early work was dominated by methods based on the Volterra series, which in the discrete time case can be expressed as


 * $$\begin{align}

y(k)& = h_0+\sum\limits_{m_1=1}^M h_1(m_1)u(k-m_1) + \sum\limits_{m_1=1}^M \sum\limits_{m_2=1}^M h_2(m_1,m_2)u(k-m_1)u(k-m_2) \\ & {}\quad{}+\sum\limits_{m_1=1}^M \sum\limits_{m_2=1}^M \sum\limits_{m_3=1}^M h_3(m_1,m_2,m_3)u(k-m_1)u(k-m_2)u(k-m_3) + \cdots \end{align}$$

where u(k), y(k); k = 1, 2, 3, ... are the measured input and output respectively and $$h_\ell(m_1,\ldots ,m_\ell)$$ is the lth-order Volterra kernel, or lth-order nonlinear impulse response. The Volterra series is an extension of the linear convolution integral. Most of the earlier identification algorithms assumed that just the first two, linear and quadratic, Volterra kernels are present and used special inputs such as Gaussian white noise and correlation methods to identify the two Volterra kernels. In most of these methods the input has to be Gaussian and white which is a severe restriction for many real processes. These results were later extended to include the first three Volterra kernels, to allow different inputs, and other related developments including the Wiener series. A very important body of work was developed by Wiener, Lee, Bose and colleagues at MIT from the 1940s to the 1960s including the famous Lee and Schetzen method. While these methods are still actively studied today there are several basic restrictions. These include the necessity of knowing the number of Volterra series terms a priori, the use of special inputs, and the large number of estimates that have to be identified. For example, for a system where the first order Volterra kernel  is described by say 30 samples, 30x30 points will be required for the second order kernel, 30x30x30 for the third order   and so on and hence the amount of data required to provide good estimates becomes excessively large. These numbers can be reduced by exploiting certain symmetries but the requirements are still excessive irrespective of what algorithm is used for the identification.

Block-structured systems
Because of the problems of identifying Volterra models other model forms were investigated as a basis for system identification for nonlinear systems. Various forms of block structured nonlinear models have been introduced or re-introduced. The Hammerstein model consists of a static single valued nonlinear element followed by a linear dynamic element. The Wiener model is the reverse of this combination so that the linear element occurs before the static nonlinear characteristic. The Wiener-Hammerstein model consists of a static nonlinear element sandwiched between two dynamic linear elements, and several other model forms are available. The Hammerstein-Wiener model consists of a linear dynamic block sandwiched between two static nonlinear blocks. The Urysohn model is different from other block models, it does not consists of sequence linear and nonlinear blocks, but describes both dynamic and static nonlinearities in the expression of the kernel of an operator. All these models can be represented by a Volterra series but in this case the Volterra kernels take on a special form in each case. Identification consists of correlation based and parameter estimation methods. The correlation methods exploit certain properties of these systems, which means that if specific inputs are used, often white Gaussian noise, the individual elements can be identified one at a time. This results in manageable data requirements and the individual blocks can sometimes be related to components in the system under study.

More recent results are based on parameter estimation and neural network based solutions. Many results have been introduced and these systems continue to be studied in depth. One problem is that these methods are only applicable to a very special form of model in each case and usually this model form has to be known prior to identification.

Neural networks
Artificial neural networks try loosely to imitate the network of neurons in the brain where computation takes place through a large number of simple processing elements. A typical neural network consists of a number of simple processing units interconnected to form a complex network. Layers of such units are arranged so that data is entered at the input layer and passes through either one or several intermediate layers before reaching the output layer. In supervised learning the network is trained by operating on the difference between the actual output and the desired output of the network, the prediction error, to change the connection strengths between the nodes. By iterating, the weights are modified until the output error reaches an acceptable level. This process is called machine learning because the network adjusts the weights so that the output pattern is reproduced. Neural networks have been extensively studied and there are many excellent textbooks devoted to this topic in general, and more focused textbooks which emphasise control and systems applications,. There are two main problem types that can be studied using neural networks: static problems, and dynamic problems. Static problems include pattern recognition, classification, and approximation. Dynamic problems involve lagged variables and are more appropriate for system identification and related applications. Depending on the architecture of the network the training problem can be either nonlinear-in-the-parameters which involves optimisation or linear-in-the-parameters which can be solved using classical approaches. The training algorithms can be categorised into supervised, unsupervised, or reinforcement learning. Neural networks have excellent approximation properties but these are usually based on standard function approximation results using for example the Weierstrass Theorem that applies equally well to polynomials, rational functions, and other well-known models. Neural networks have been applied extensively to system identification problems which involve nonlinear and dynamic relationships. However, classical neural networks are purely gross static approximating machines. There is no dynamics within the network. Hence when fitting dynamic models all the dynamics arise by allocating lagged inputs and outputs to the input layer of the network. The training procedure then produces the best static approximation that relates the lagged variables assigned to the input nodes to the output. There are more complex network architectures, including recurrent networks, that produce dynamics by introducing increasing orders of lagged variables to the input nodes. But in these cases it is very easy to over specify the lags and this can lead to over fitting and poor generalisation properties. Neural networks have several advantages; they are conceptually simple, easy to train and to use, have excellent approximation properties, the concept of local and parallel processing is important and this provides integrity and fault tolerant behaviour. The biggest criticism of the classical neural network models is that the models produced are completely opaque and usually cannot be written down or analysed. It is therefore very difficult to know what is causing what, to analyse the model, or to compute dynamic characteristics from the model. Some of these points will not be relevant to all applications but they are for dynamic modelling.

NARMAX methods
The nonlinear autoregressive moving average model with exogenous inputs (NARMAX model) can represent a wide class of nonlinear systems, and is defined as


 * $$\begin{align}

y(k) & =F[y(k-1),y(k-2),\ldots ,y(k-n_y),u(k-d),u(k-d-1),\ldots ,u(k-d-n_u), \\ & {}\quad e(k-1),e(k-2),\ldots ,e(k-n_e)]+e(k) \end{align}$$ where y(k), u(k) and e(k) are the system output, input, and noise sequences respectively; $$n_y$$, $$n_u$$, and $$n_e$$ are the maximum lags for the system output, input and noise; F[•] is some nonlinear function, d is a time delay typically set to d = 1.The model is essentially an expansion of past inputs, outputs and noise terms. Because the noise is modelled explicitly, unbiased estimates of the system model can be obtained in the presence of unobserved highly correlated and nonlinear noise. The Volterra, the block structured models and many neural network architectures can all be considered as subsets of the NARMAX model. Since NARMAX was introduced, by proving what class of nonlinear systems can be represented by this model, many results and algorithms have been derived based around this description. Most of the early work was based on polynomial expansions of the NARMAX model. These are still the most popular methods today but other more complex forms based on wavelets and other expansions have been introduced to represent severely nonlinear and highly complex nonlinear systems. A significant proportion of nonlinear systems can be represented by a NARMAX model including systems with exotic behaviours such as chaos, bifurcations, and subharmonics. While NARMAX started as the name of a model it has now developed into a philosophy of nonlinear system identification,. The NARMAX approach consists of several steps:


 * Structure detection: which terms are in the model
 * Parameter estimation: determine the model coefficients
 * Model validation: is the model unbiased and correct
 * Prediction: what is the output at some future time
 * Analysis: what are the dynamical properties of the system

Structure detection forms the most fundamental part of NARMAX. For example, a NARMAX model which consists of one lagged input and one lagged output term, three lagged noise terms, expanded as a cubic polynomial would consist of eighty two possible candidate terms. This number of candidate terms arises because the expansion by definition includes all possible combinations within the cubic expansion. Naively proceeding to estimate a model which includes all these terms and then pruning will cause numerical and computational problems and should always be avoided. However, only a few terms are often important in the model. Structure detection, which aims to select terms one at a time, is therefore critically important. These objectives can easily be achieved by using the Orthogonal Least Squares algorithm and its derivatives to select the NARMAX model terms one at a time. These ideas can also be adapted for pattern recognition and feature selection and provide an alternative to principal component analysis but with the advantage that the features are revealed as basis functions that are easily related back to the original problem.

NARMAX methods are designed to do more than find the best approximating model. System identification can be divided into two aims. The first involves approximation where the key aim is to develop a model that approximates the data set such that good predictions can be made. There are many applications where this approach is appropriate, for example in time series prediction of the weather, stock prices, speech, target tracking, pattern classification etc. In such applications the form of the model is not that important. The objective is to find an approximation scheme which produces the minimum prediction errors. A second objective of system identification, which includes the first objective as a subset, involves much more than just finding a model to achieve the best mean squared errors. This second aim is why the NARMAX philosophy was developed and is linked to the idea of finding the simplest model structure. The aim here is to develop models that reproduce the dynamic characteristics of the underlying system, to find the simplest possible model, and if possible to relate this to components and behaviours of the system under study. The core aim of this second approach to identification is therefore to identify and reveal the rule that represents the system. These objectives are relevant to model simulation and control systems design, but increasingly to applications in medicine, neuro science, and the life sciences. Here the aim is to identify models, often nonlinear, that can be used to understand the basic mechanisms of how these systems operate and behave so that we can manipulate and utilise these. NARMAX methods have also been developed in the frequency and spatio-temporal domains.

Stochastic nonlinear models
In a general situation, it might be the case that some exogenous uncertain disturbance passes through the nonlinear dynamics and influence the outputs. A model class that is general enough to capture this situation is the class of stochastic nonlinear state-space models. A state-space model is usually obtained using first principle laws, such as mechanical, electrical, or thermodynamic physical laws, and the parameters to be identified usually have some physical meaning or significance.

A discrete-time state-space model may be defined by the difference equations:



\begin{aligned} x_{t+1} &= f(x_t, u_t, w_t; \theta),\\ y_t &= g(x_t, u_t, v_t; \theta), \quad t =1,2, \dots \end{aligned} $$

in which $$t$$ is a positive integer referring to time. The functions $$f$$ and $$g$$ are general nonlinear functions. The first equation is known as the state equation and the second is known as the output equation. All the signals are modeled using stochastic processes. The process $$x_t$$ is known as the state process, $$w_t$$ and $$v_t$$ are usually assumed independent and mutually independent such that $$w_t \sim p(w; \theta),\; v_t \sim p(v; \theta)$$. The parameter $$\theta$$ is usually a finite-dimensional (real) parameter to be estimated (using experimental data). Observe that the state process does not have to be a physical signal, and it is normally unobserved (not measured). The data set is given as a set of input-output pairs $$(y_t, u_t)$$ for $$t = 1, \dots, N$$ for some finite positive integer value $$N$$.

Unfortunately, due to the nonlinear transformation of unobserved random variables, the likelihood function of the outputs is analytically intractable; it is given in terms of a multidimensional marginalization integral. Consequently, commonly used parameter estimation methods such as the Maximum Likelihood Method or the Prediction Error Method based on the optimal one-step ahead predictor are analytically intractable. Recently, algorithms based on sequential Monte Carlo methods have been used to approximate the conditional mean of the outputs or, in conjunction with the Expectation-Maximization algorithm, to approximate the maximum likelihood estimator. These methods, albeit asymptotically optimal, are computationally demanding and their use is limited to specific cases where the fundamental limitations of the employed particle filters can be avoided. An alternative solution is to apply the prediction error method using a sub-optimal predictor. The resulting estimator can be shown to be strongly consistent and asymptotically normal and can be evaluated using relatively simple algorithms.