One of the reference papers that we keep in our library is “Outlier Detection for Hydropower Generation Plant”, written for the 14th IEEE International Conference on Automation Science and Engineering (CASE), Munich, Germany, August 20-24, 2018.

The paper presents a hydropower generation plant as a complex system composed of numerous physical components. To monitor the health of the different components, it is necessary to detect anomalous behaviour in time. Establishing a performance guideline, along with identifying the critical variables that cause anomalous behaviour, can help maintenance personnel detect a potential shift in the process in a timely manner. Before any guideline for future control can be established, a mechanism is needed to differentiate anomalous observations from normal ones. The authors employed three different approaches to detect anomalous observations and compared their performance on a historical data set received from a hydropower plant. The outliers detected were verified by domain experts. Using a decision tree and a feature selection process, some critical variables potentially linked to the presence of the outliers were identified. A one-class classifier was then developed on the outlier-cleaned dataset; it defines the normal working condition, so that violations of the normal condition can flag anomalous observations in future operations.

We would like to quote part of the paper on our blog; after the excerpt, we sketch in Python what these detectors look like in practice.

“A hydropower generation plant can be divided into many functional areas, such as generators, turbines, and bearings, each of which can in turn be subdivided into components. Data are generated in real time from hundreds of sensors across these functional areas and instrumented equipment. Anomalies can come from various sources and can cause a wide range of problems. For instance, an anomaly can be overheating of bearing oil and metal components, vibrations from bearings, or low generation of active or reactive power. It is vital to identify an anomaly as soon as it appears, but doing so becomes extremely challenging because the data generated are of high dimensionality (i.e., too many variables).

In the literature, the anomaly detection problem is known as an unsupervised learning problem, because one does not have a training dataset with observations labelled as normal observations or outliers. Consequently, supervised learning cannot be used to learn a rule for classifying future observations. Intuitively speaking, outliers or anomalies are points or clusters of points which lie away from neighbouring points and clusters and seem inconsistent with other observations. A perfect definition of an outlier, however, does not exist. All the outlier detection methods developed thus far are based on some assumptions, and no single unsupervised outlier detection method can perfectly classify all the different types of outliers in a dataset [1].

Existing outlier detection methods can be grouped into four major schools of thought, depending on their criteria for identifying outliers. In order of their inception into the body of knowledge, the four schools are: distance- and density-based methods, subspace-based methods, angle-based methods, and ensemble-based methods. Each of these domains has its respective strengths and weaknesses.

In the distance-based methods, a point is considered an outlier if it lies far away from most of the points [2]. Instead of considering distances from all the points, it is more logical to consider the deviation from neighbouring points, which leads to methods based on the concept of the k-nearest neighbour (k-NN) [3]–[5]. A major downside of these distance-based methods is that, if the dataset has multiple clusters of varying density, they cannot successfully separate local outliers (i.e., outliers only with respect to a single cluster) from normal data points. Distance-based methods tend to work effectively when the dataset has clusters of similar density or no clusters at all. In the density-based methods [6]–[9], a point is considered an outlier if the density around it is considerably lower than the density around its neighbours. These methods can therefore handle data with a clustering tendency and can identify local outliers.

Both the distance- and density-based methods need the pairwise distances to be calculated. When the data dimension is too high, the proportional difference between the farthest-point distance and the closest-point distance vanishes, and the distances between any pair of data records become much less differentiable [1]. Consequently, the distance- and density-based methods do not work effectively in high-dimensional data spaces. In such spaces, we have to consider relevant subspaces rather than the entire feature set. For a particular observation, relevance means the subspace in which it differs from other observations.

As such, the search for outliers must be accompanied by the search for relevant subspaces. The disadvantages of the subspace-based methods include the lack of an appropriate way to compare outliers identified in different subspaces, and the large number of subspaces that need to be explored. Angle-based methods are similar to the distance-based methods, but they were introduced with the consideration that angles are a more stable measure in high dimensions than distances [10]. One major limitation of this approach is the high computational time required to calculate the angles.

Ensemble-based techniques were introduced more recently, motivated in part by the frustration that no single outlier detection technique has been able to identify all the different types of outliers, and in part by the success of ensembles in supervised learning, such as bagging or boosting [11]. Researchers feel the need to combine techniques of different types to improve outlier detection accuracy. To build an ensemble, one can either apply different techniques to the dataset in randomly selected subspaces, or apply one suitable technique to the dataset in randomly selected subspaces for a number of iterations, and then combine the results over the different techniques/iterations for each observation. For ensemble-based techniques, how to combine the scores from different outlier methods is still an issue that eludes the data mining community [1].

In this paper we chose three outlier detection methods from distinct schools of thought, namely, the Local Outlier Factor (LOF) as a density-based method, Feature Bagging for Outlier Detection (FBOD) as an ensemble method, and the Subspace Outlier Degree (SOD) as a subspace method. We employed these methods to identify the outliers in a dataset received from a hydropower generation plant. The comparative performances of these methods are analysed and some commonality among the results is found. We discuss our findings concerning which variables contribute most to the selected outliers and for what range of values. We have also trained a one-class support vector machine (SVM) classifier on the outlier-removed hydropower plant dataset. The one-class SVM defines the boundary of normalcy and can thus be used to check future observations for their likelihood of being outliers.

The rest of the paper unfolds as follows: Section II analyzes the dataset received, describes how it was cleaned, and summarizes the research question at the end. Section III describes the outlier detection methods that we selected to apply to our dataset. Section IV presents the results from applying the selected methods to the hydropower dataset. Analysis of the results follows in Section V. Finally, we conclude the paper in Section VI.”
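To make the three selected methods more concrete, here is a minimal sketch of how LOF, FBOD and SOD can be run side by side on a plant-style sensor matrix. The paper does not publish its code or data, so this uses the open-source PyOD library and a synthetic matrix `X` purely as stand-ins; the parameter values (`n_neighbors=20`, `contamination=0.05`, and so on) are illustrative assumptions, not the authors' settings.

```python
# A minimal sketch (not the authors' code): the three detectors discussed
# in the paper, run via the PyOD library on a synthetic sensor matrix.
import numpy as np
from pyod.models.lof import LOF                         # density-based
from pyod.models.feature_bagging import FeatureBagging  # ensemble (FBOD)
from pyod.models.sod import SOD                         # subspace-based

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 15))   # stand-in for 15 sensor channels
X[:10] += 6.0                    # inject a few gross outliers for illustration

detectors = {
    "LOF": LOF(n_neighbors=20, contamination=0.05),
    "FBOD": FeatureBagging(base_estimator=LOF(n_neighbors=20),
                           n_estimators=10, contamination=0.05),
    "SOD": SOD(n_neighbors=20, ref_set=10, contamination=0.05),
}

flags = {}
for name, det in detectors.items():
    det.fit(X)
    flags[name] = det.labels_    # 0 = inlier, 1 = outlier
    print(name, "flagged", int(det.labels_.sum()), "points")

# In the spirit of the paper's comparison across schools of thought:
# the points that all three detectors agree on.
consensus = flags["LOF"] & flags["FBOD"] & flags["SOD"]
print("flagged by all three:", int(consensus.sum()))
```

In the paper, this cross-method agreement is what gets handed to the domain experts for verification; here the consensus mask simply stands in for that step.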
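The final step the authors describe, training a one-class SVM on the outlier-cleaned data so that future observations can be checked against a learned boundary of normal operation, could look roughly as follows, continuing from the sketch above. Again this is a hedged sketch: scikit-learn's `OneClassSVM` is our choice of implementation, and the kernel and `nu` settings are assumptions rather than the paper's configuration.

```python
# A rough sketch of the paper's final step: fit a one-class SVM on the
# outlier-removed data, then screen new observations against it.
from sklearn.preprocessing import StandardScaler
from sklearn.svm import OneClassSVM

X_clean = X[consensus == 0]      # keep only observations no detector flagged

scaler = StandardScaler().fit(X_clean)
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05)  # nu ~ assumed
ocsvm.fit(scaler.transform(X_clean))                       # outlier fraction

# Screening a future observation: +1 means "within the learned normal
# working condition", -1 means a likely anomaly to route to maintenance.
x_new = X[:1]
print(ocsvm.predict(scaler.transform(x_new)))
```

The design choice mirrors the paper's logic: once the training data are cleaned of outliers, violations of the fitted boundary can flag anomalous observations in future operations.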

To go deeper into the subject, please refer to:

Imtiaz Ahmed, Aldo Dagnino, Alessandro Bongiovi and Yu Ding (2018). Outlier Detection for Hydropower Generation Plant. 14th IEEE International Conference on Automation Science and Engineering (CASE), Munich, Germany, August 20-24, 2018.