User:EyalBrill/Event Detection System

Water Quality Event Detection System
This page describes the basics of industrial water quality event detection systems. It gives the theoretical background and examples from actual systems. The examples are drawn from event detection in the water industry, but they are generic enough to be applied in other industries.

Abstract
A water quality event detection system is software, based on machine learning, that aims to detect pending water quality problems by detecting abnormalities in the data. The idea behind this methodology is that an algorithm can recognize the patterns of problematic events and alert when a similar condition occurs.

This methodology can be practiced using two main approaches: supervised learning and unsupervised learning. Supervised learning can be implemented when samples of actual events exist. An expert first classifies all the historical cases into "bad" cases and "good" cases. The learning algorithm then builds statistical relations between the measurements and the outcome, learns which input combinations lead to bad results, and alerts whenever fresh incoming data has such characteristics.

However, in many cases, especially in water quality, information about true events does not exist. What does exist are True Negative cases, i.e. cases in which water quality was fine and no alarm was raised, and False Positive cases, i.e. cases in which an alarm was mistakenly declared. Less frequently, information on False Negative cases exists, i.e. cases in which water quality was subpar and the detection system failed to detect it. In such situations, unsupervised algorithms should be used. An unsupervised algorithm learns from the "normal data" what abnormal cases look like statistically, and the system alerts whenever a data combination with those abnormal characteristics occurs.

This wiki focuses on the above issues. It briefly describes the difference between supervised and unsupervised learning mechanisms. It describes the nature of water quality measurements and the challenge these measurements create when learning mechanisms are used to detect water quality events. Next, it describes two major algorithms (Decision Tree and clustering) which are used for detecting water quality events. It explains both the mechanism each algorithm uses and how the detection results can be evaluated. Finally, a public data set with simulated water quality contamination events is analyzed in order to examine and evaluate the differences between the algorithms. The wiki concludes with general suggestions on how to test the fitness of an algorithm for the task of water quality event detection.
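As a rough illustration of the unsupervised idea, the sketch below learns two clusters from "normal" records only and alerts when a new record lies far from every cluster centre. It then counts the detection outcomes named above, plus True Positives. Everything in it is an assumption made for illustration: the two unnamed parameters, the synthetic data, the plain k-means routine, the alert rule (mean plus four standard deviations of the training distances) and the labelled test records. It is not the algorithm of any particular system.

```python
import math
import random


def nearest(point, centroids):
    """Return (index, distance) of the centroid closest to point."""
    distances = [math.dist(point, c) for c in centroids]
    best = min(range(len(centroids)), key=distances.__getitem__)
    return best, distances[best]


def kmeans(points, k, iterations=20):
    """Plain k-means with a deterministic farthest-first initialisation."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: nearest(p, centroids)[1]))
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)[0]].append(p)
        # Move each centroid to the mean of its group (keep it if the group is empty).
        centroids = [
            tuple(sum(axis) / len(g) for axis in zip(*g)) if g else c
            for g, c in zip(groups, centroids)
        ]
    return centroids


def sample(center, n, spread=0.05):
    """Synthetic records scattered around a centre (hypothetical data)."""
    return [tuple(random.gauss(c, spread) for c in center) for _ in range(n)]


random.seed(0)
# "Normal" history with two operating regimes, e.g. two water sources (assumed).
normal_history = sample((1.0, 7.0), 200) + sample((0.6, 7.8), 200)

centroids = kmeans(normal_history, k=2)
train_dist = [nearest(p, centroids)[1] for p in normal_history]
mean_d = sum(train_dist) / len(train_dist)
std_d = math.sqrt(sum((d - mean_d) ** 2 for d in train_dist) / (len(train_dist) - 1))
threshold = mean_d + 4 * std_d  # assumed alert rule, not a standard value


def is_alarm(point):
    """Alert when the record is far from every cluster learned from normal data."""
    return nearest(point, centroids)[1] > threshold


# Hypothetical labelled records: (measurement, was it a real event?)
labelled = [
    ((1.02, 6.97), False),
    ((0.58, 7.83), False),
    ((1.00, 7.30), False),  # unusual but harmless reading
    ((0.80, 7.40), True),
    ((1.05, 7.10), True),   # real event that stays close to normal data
]

counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
for point, is_event in labelled:
    alarm = is_alarm(point)
    key = ("T" if alarm == is_event else "F") + ("P" if alarm else "N")
    counts[key] += 1
print(counts)
```

The two parameters here happen to have similar spreads. With real measurements on different scales, the data would normally be standardized before distances are computed.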

The challenge of water quality event detection systems
Water quality measurements are characterized by noise. This noise has several causes. First, water may come from several sources, such as upstream water or wells. Whenever the owner of the distribution system changes the origin of the water, the quality measurements fluctuate. Second, quality instruments may be subject to electrical disturbances which create bad signals. Third, changes in physical characteristics of the water such as pressure, flow and level may be reflected in the water quality measurements as well. Last but not least, the changes between day and night and between winter and summer also result in changes in water quality. A report published by the EPA (2005) showed two major issues. First, not every change in water quality is necessarily an indication of water contamination. Second, in some cases intentional contamination may result in a situation in which each of the water quality measurements is within its allowed limits while the water is not safe for drinking. This indicates that simple low and high limits for water quality parameters may not be sufficient to detect a water quality event. Two main methods are available for detecting water quality events. The sections that follow describe these methods.
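The point about per-parameter limits can be made concrete with a small sketch. It assumes two hypothetical parameters that normally move together, synthetic history, and made-up low and high limits. A record can then sit inside both limits while breaking the usual relation between the parameters, which a joint measure such as the Mahalanobis distance picks up. The Mahalanobis distance is used here only as one possible joint check. It is not taken from the EPA report.

```python
import math
import random

random.seed(1)

# Synthetic "normal" history of two hypothetical, strongly correlated parameters.
history = []
for _ in range(500):
    a = random.gauss(1.0, 0.1)
    b = 2.0 * a + random.gauss(0.0, 0.02)
    history.append((a, b))

# Assumed low/high limits for each parameter (illustrative values only).
LIMITS = ((0.7, 1.3), (1.4, 2.6))


def within_limits(point):
    """Classic check: every parameter inside its own low/high limit."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(point, LIMITS))


def fit(points):
    """Sample mean and covariance terms (sxx, sxy, syy) of 2-D data."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis(point, mean, cov):
    """Distance from the mean in units that account for the correlation."""
    dx, dy = point[0] - mean[0], point[1] - mean[1]
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    d2 = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)


mean, cov = fit(history)

typical = (1.05, 2.10)  # follows the usual relation between the parameters
suspect = (1.25, 1.55)  # each value is within limits, but the pair is unusual

for name, point in (("typical", typical), ("suspect", suspect)):
    print(name, within_limits(point), round(mahalanobis(point, mean, cov), 1))
```

Both records pass the limit check, but only the second one has a large joint distance. A detector would compare that distance to a threshold chosen from the normal data.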

Data set
The data set for the analysis contains 3000 records. These records are samples taken at 2-minute intervals from a public data set made available by the EPA during the "EPA EDS challenge". The data set consists of 4 measurements.

Table 1: Variables

Detecting water quality events using a supervised method
Supervised machine learning is a group of algorithms within the machine learning domain. The main idea in these algorithms is that a human expert classifies a set of historical records by tagging them as Good cases and Bad cases. The algorithm then learns the relation between inputs and output and creates a model, which is a mathematical connection between the two. Examples of supervised learning algorithms are linear regression and decision trees. This section describes how abnormal water quality is detected using logistic regression.
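A minimal sketch of the supervised idea with logistic regression follows. It assumes two hypothetical standardized measurements, synthetic records tagged by an "expert" as Good (0) or Bad (1), and a model fitted by plain gradient descent. The data, the learning rate, the number of epochs and the 0.5 alert threshold are all illustrative assumptions, not values taken from the data set described above.

```python
import math
import random


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def event_probability(record, weights, bias):
    """Model output: estimated probability that the record is a Bad case."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, record)))


def train_logistic(records, labels, learning_rate=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log-loss."""
    n = len(records)
    weights = [0.0] * len(records[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for record, label in zip(records, labels):
            error = event_probability(record, weights, bias) - label
            for j, value in enumerate(record):
                grad_w[j] += error * value
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


random.seed(2)
# Hypothetical history tagged by an expert: 0 = Good case, 1 = Bad case.
good = [(random.gauss(0.0, 0.5), random.gauss(0.0, 0.5)) for _ in range(100)]
bad = [(random.gauss(2.0, 0.5), random.gauss(-2.0, 0.5)) for _ in range(100)]
records = good + bad
labels = [0] * len(good) + [1] * len(bad)

weights, bias = train_logistic(records, labels)

ALERT_THRESHOLD = 0.5  # assumed cut-off for raising an alarm


def is_alarm(record):
    return event_probability(record, weights, bias) > ALERT_THRESHOLD


correct = sum(is_alarm(r) == bool(y) for r, y in zip(records, labels))
print("training accuracy:", correct / len(records))
print("alarm for (2.0, -2.0):", is_alarm((2.0, -2.0)))
print("alarm for (0.0, 0.0):", is_alarm((0.0, 0.0)))
```

In practice the model would be checked on records that were not used for training, since accuracy on the training history alone tends to be optimistic.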