After more than a year in the making, my book “Pengenalan Machine Learning dengan Python (Introduction to Machine Learning using Python)” is out! This book is aimed at Indonesian readers who are interested to start learning about this exciting field of study.
It is a 200-page tutorial on Python, giving readers introduction on what machine learning is, basic skills in Pandas, and samples of several machine learning algorithms with focus on Scikit-learn and some other Pythonic libraries. There is a chapter on Computer Vision as well.
Get your copy in the local bookstore or from popular e-commerce sites (such as this).
This post continues my previous post on outlier detection as a statistical method to identify Data Quality (DQ) issues. If you haven’t read my previous article, click here. This time around, a different approach is taken by employing machine learning (ML) technique.
As I explained in my previous post, a sudden change in the number of outliers in a dataset is a strong indicator that we may have a data quality issue. Outliers can be legitimate, but more often than not, they are anomalies which should be removed from the dataset. Finding anomalies is a crucial step before allowing the dataset to be used for further processing.
To demonstrate how the anomaly detection works, I took a sample dataset containing the Daily Sales Revenue statistics of a fictitious company, as shown below:
Our intention is to detect the day in which revenue numbers in any of the Sales Areas deviate from what we consider as normal. This is important because sudden change in revenue is generally not expected. However, eyeballing the data row-by-row to search for anomalies can be a daunting, error-prone task. We need a machine to do that for us. To achieve this, we can use an ML algorithm called Isolation Forest.
Isolation Forest is a classifier which traces its root from the popular Decision Tree algorithm. In short, Isolation Forest algorithm tries to build a tree and directly isolate outliers. This is precisely what we need to detect data quality issues (while it is an interesting topic, I may not be the right person to explain the mathematical background of this algorithm in detail. If you wish to understand the theory behind Isolation Forest, please visit https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf).
Isolation Forest allows us to detect multivariateoutliers, meaning we can find unusual values in more than one variable. For the first run, we will use two variables only, “NQ” and “Total” from the table. “NQ” is particularly interesting because it indicates revenue transactions which cannot be categorized into the correct Sales Area. A sudden jump in “NQ” is a clear sign that something has gone wrong. Let’s visualize the data in a scatter plot (with the values have been normalized):
By looking at the scatter plot above, outliers can be identified as dots which are distant away from the majority in the mid-lower part of the chart. There are both outliers in “NQ” and “Total” variables. Human eyes can easily see these, but computers need to be trained to achieve the same. Using Python’s Scikit-learn library, Isolation Forest can be implemented with few lines of code like the example below:
This algorithm needs a random number to start the work, and we can experiment with this number to see different result. One thing to note, the algorithm takes one parameter called contamination, which in the above example, is set to 0.1, meaning we expect outliers will be less than 10% of the population (this is because the dataset is small; in a larger dataset, 2% will be a more reasonable limit). The estimator parameter can be set to above 100 to gain some accuracy at the price of more processing time.
When I ran the program on a dataset of approximately 300 records, the model produced outliers count = 16. Let’s superimpose the outliers in the previous scatter plot, marking the 16 outliers with red color:
As we can see, most obvious outliers have successfully been detected, but not all. The detection performance can be improved by experimenting with the parameters and by adding more historical data (the above example splits data into training and test datasets with 80-20 ratio). To complete the detection, we should also perform another run with all 5 variables in the table (AREA1, AREA2, etc.), at the same time. This might result in a smaller set of outliers like the example below:
Since we already have the anomalies, they can then be visualized in a Tableau dashboard with different color, alerting us to immediately perform some investigation on them.
Other ML Algorithm
Another interesting alternative for outlier detection is clustering algorithm called DBSCAN. The good thing about DBSCAN is that it does the job almost autonomously, it does not require us to supply the number of clusters to be created unlike other algorithms such as K-Means. The implementation in Scikit-learn is also quite straightforward. The snippet of the code looks like below:
There are two important parametes to be supplied; eps and min_samples. It takes some experiments to find the right values. Running it on similar Daily Sales statistics table (with few months of historical data) results in two clusters, with the anomalies are flagged as CLUSTER = -1 as seen below:
Visualized in a scatter plot, anomalies are marked red:
Most anomalies can be detected pretty much effectively with this technique. However, DBSCAN tends to run slowly and seems to take a lot of computing resource. In many cases it simply fails to produce any useful output when the dataset becomes too large.
Doing It at Scale
The above programs cover datasets with size of only few hundred rows. What if we want to detect outliers in a much larger dataset, say, 1 million records? In real-world application, this will likely be the usual case. That’s the where challenge begins!
To search for outliers in a large dataset, we cannot use the standard Python because it would simply break after reaching a certain point. Python and Scikit-learn are not actually designed for big data. We would ideally have to employ Spark MLLib for data at this scale. However, at the time of writing, Spark MLLib does not support Isolation Forest or DBSCAN. Alas, we are stuck with Scikit-learn.
To make an attempt to see how far Scikit-learn could go, I created a sample of 1 million, 2 million, 10 million, 15 million and a gigantic 150 million records. In a hope to exploit all available computing resources, I tried creating Spark UDF (user-defined function) to make Isolation Forest as if it is a Spark function. Also, I used sparkContext.broadcast() function to distribute the data and model to all nodes, minimizing communication cost. Here is the code :
Was it successful? Not really. The above program worked for up to 10 million records, but it crashed when faced with 15 million records. A far cry from 150 million records that I wanted! Running it on a sampled table with 10 million rows and 2 variables took 5 hours, resulting in around 200 outliers found.
As you can see, putting the code into production-level data is the real challenge. A true implementation using Spark remains something to wish for. After some Googling, I found a library that the creator claimed to use Isolation Forest in Spark (https://github.com/titicaca/spark-iforest). I would really like to try this, however I still don’t know how (and don’t have time) to make it work. Would anyone be interested in helping me? Drop me an email!
In my previous post on Data Quality Framework, I discussed about the concept of DQ (data quality) Test Point. The idea is to detect data quality issues by probing the data. In these DQ Test Points, DQ metrics are extracted, calculated and then compared against a baseline.
To bolster the DQ metrics, a method called outlier detection is added to the DQ Test Points. This is a pretty simple check to see abnormalities in the data by looking for outliers. Outliers are, as you must have guessed, values which are very different from most of the population. For example, if most of Telkomsel subscribers make 5-6 voice calls per day, then those who place 100 voice calls per day are clear outliers. Outliers can be legitimate as in the previous example, but they can also be invalid or unwanted data. For example a person whose age is 250 years is certainly an invalid data, something that we want to erase from our datasets.
Most outliers are hidden inside the dataset. There are almost always outliers, but when we see a lot of them in our tables, or a sudden change in the number of outliers, we can see it as a sign of a potential data quality issue. It is important to remove outliers because they generally have negative effect on analysis and training a predictive model.
To identify outliers, one of the most common methods is to calculate the distribution of the data using Inter-quartile Range (IQR). In the simplest words, IQR is the area of the data which represents the “middle” values where half (50%) of the data belongs. Anything that falls beyond a certain distance from the IQR will be marked as outliers. According to “Tukey’s Rule” (named after John Tukey, an American mathematician), the distance is 1.5 x IQR (see diagram below).
To perform the computation, first we will have to sort all rows in the dataset from the lowest to the highest value, then divide the data into four equal parts (hence the name “quartile”). The boundary for the first 25% of the data is called Q1 and the last 25% is called Q3. To get the IQR, subtract Q3 from Q1. After that, find the “min” and “max” limits by calculating 1.5 x IQR from Q1 and Q3, respectively. Once you have these “min” and “max” numbers, you can start counting the outliers which values are less than “min” and more than “max”. There you get your outliers.
Finding IQR like above is a computationally heavy process, especially when involving large number of rows. Luckily, our friend Spark has provided us with the tool for this. Using approxQuantile() function which allows us to compute IQR without going through the whole dataset, and this can be done quite easily. You may want to see the snippet of the PySpark code as shown below:
To give an example of how it works, this script was run on one of the tables in the ABT schema, with around 5 million records. Outliers are gathered from all numerical columns. The script produced the following statistics:
As we can see, the percentage of outliers is generally small. Anything under 2% will be considered normal. However, in the above example, there are some particular columns which have large percentage of outliers, even up to 15%. This is something worth investigating as this can lead to potential quality problems.
At the moment the IT Data Quality team is working on putting the outliers count in our BI DQ Dashboard (see below) alongside with other DQ metrics. By comparing the outliers statistics with historical data, we may be able to detect issues before they present a problem in the consumers side.
The method mentioned above is far from perfect. It only captures outliers in one dimension, which is called univariate outliers. To get outliers which lie in two or more dimensions, called multivariate outliers, different methods employing ML techniques must be prepared. Stay tuned for more posts on this matter. Our journey to improve data quality in BI keeps going on!