Big Data – Page 2 – Dios Kurniawan's Blog

June 9, 2020

Data Quality Test Using Machine Learning

June 2020 | Dios Kurniawan

This post continues my previous post on outlier detection as a statistical method to identify Data Quality (DQ) issues. If you haven’t read it, see “Detecting Data Quality Issues by Identifying Outliers” (March 2020). This time around, I take a different approach by employing a machine learning (ML) technique.

As I explained in my previous post, a sudden change in the number of outliers in a dataset is a strong indicator that we may have a data quality issue. Outliers can be legitimate, but more often than not, they are anomalies which should be removed from the dataset. Finding anomalies is a crucial step before allowing the dataset to be used for further processing.

To demonstrate how the anomaly detection works, I took a sample dataset containing the Daily Sales Revenue statistics of a fictitious company, as shown below:

Our intention is to detect the days on which revenue numbers in any of the Sales Areas deviate from what we consider normal. This is important because a sudden change in revenue is generally not expected. However, eyeballing the data row by row to search for anomalies can be a daunting, error-prone task. We need a machine to do that for us. To achieve this, we can use an ML algorithm called Isolation Forest.

Isolation Forest is an unsupervised anomaly detection algorithm which traces its roots to the popular Decision Tree algorithm. In short, it builds many random trees and isolates each record by splitting the data at random; outliers tend to be isolated after fewer splits than normal records. This is precisely what we need to detect data quality issues. (While it is an interesting topic, I may not be the right person to explain the mathematical background of this algorithm in detail. If you wish to understand the theory behind Isolation Forest, please visit https://cs.nju.edu.cn/zhouzh/zhouzh.files/publication/icdm08b.pdf.)
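To make the isolation idea concrete, here is a minimal pure-Python sketch. It is not Scikit-learn's implementation: it skips subsampling and the path-length normalization described in the paper, and the toy data and function names are made up for illustration. It only shows that a point far from the crowd is separated after fewer random splits.

```python
import random

def isolation_depth(point, data, rng, depth=0, max_depth=12):
    """Count the random splits needed to separate `point` from `data`."""
    if len(data) <= 1 or depth >= max_depth:
        return depth
    # Pick a random variable and a random split value within its range.
    dim = rng.randrange(len(point))
    lo = min(row[dim] for row in data)
    hi = max(row[dim] for row in data)
    if lo == hi:
        return depth
    split = rng.uniform(lo, hi)
    # Keep only the rows on the same side of the split as `point`.
    same_side = [row for row in data
                 if (row[dim] < split) == (point[dim] < split)]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)

def average_depth(point, data, n_trees=200, seed=88):
    """Average the isolation depth over many random trees."""
    rng = random.Random(seed)
    total = sum(isolation_depth(point, data, rng) for _ in range(n_trees))
    return total / n_trees

# Toy data (made up): a tight cluster, its center, and one distant point.
gen = random.Random(1)
cluster = [(gen.gauss(0, 1), gen.gauss(0, 1)) for _ in range(200)]
normal = (0.0, 0.0)
outlier = (8.0, 8.0)
data = cluster + [normal, outlier]

depth_normal = average_depth(normal, data)
depth_outlier = average_depth(outlier, data)
print("normal:", depth_normal, "outlier:", depth_outlier)
```

The distant point ends up with a much smaller average depth than the point in the middle of the cluster, which is the signal the algorithm turns into an anomaly score.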

Isolation Forest allows us to detect multivariate outliers, meaning we can find unusual values across more than one variable. For the first run, we will use only two variables from the table, “NQ” and “Total”. “NQ” is particularly interesting because it indicates revenue transactions which cannot be categorized into the correct Sales Area. A sudden jump in “NQ” is a clear sign that something has gone wrong. Let’s visualize the data in a scatter plot (with the values normalized):

Looking at the scatter plot above, outliers can be identified as the dots which lie far away from the majority in the mid-lower part of the chart. There are outliers in both the “NQ” and “Total” variables. Human eyes can easily see these, but computers need to be trained to do the same. Using Python’s Scikit-learn library, Isolation Forest can be implemented with a few lines of code like the example below:

import numpy as np
import pandas as pd
from sklearn import model_selection as ms
from sklearn.ensemble import IsolationForest

# X and y are assumed to hold the normalized "Total" and "NQ" columns.
rng = np.random.RandomState(88)
X_train, X_test, y_train, y_test = ms.train_test_split(X, y, test_size=0.2, random_state=rng)
dfTrain = pd.DataFrame(list(zip(X_train, y_train)), columns=['total_','nq_'])
dfTest = pd.DataFrame(list(zip(X_test, y_test)), columns=['total_','nq_'])

# Fit on the training split, then label both splits: -1 = outlier, 1 = inlier.
clf = IsolationForest(n_estimators=100, max_samples='auto', contamination=0.1, random_state=rng)
clf.fit(dfTrain)
y_pred_train = clf.predict(dfTrain)
y_pred_test = clf.predict(dfTest)

# Attach the predictions (column 0) to the test rows and keep the outliers.
dfPredTest = dfTest.join(pd.DataFrame(y_pred_test), how='left')
dfOutlier = dfPredTest.loc[dfPredTest[0] < 0]
print("Outlier Count=", len(dfOutlier))

This algorithm is randomized, so it takes a random seed (random_state), and we can experiment with this number to see different results. One thing to note: the algorithm takes a parameter called contamination, which in the above example is set to 0.1, meaning we expect roughly 10% of the population to be outliers (this is because the dataset is small; in a larger dataset, 2% would be a more reasonable limit). The n_estimators parameter can be set above 100 to gain some accuracy at the price of more processing time.
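The effect of contamination is easy to see on synthetic data. The sketch below uses 300 random rows as a stand-in (not the sales data) and counts how many training rows are flagged at 2% and at 10%. The variable names are illustrative only.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(88)
X = rng.normal(size=(300, 2))  # synthetic stand-in, not the sales data

flagged_counts = {}
for contamination in (0.02, 0.1):
    clf = IsolationForest(n_estimators=100, contamination=contamination,
                          random_state=88)
    labels = clf.fit_predict(X)  # -1 = outlier, 1 = inlier
    flagged_counts[contamination] = int((labels == -1).sum())

print(flagged_counts)
```

Because contamination sets the score threshold, the share of flagged training rows tracks the chosen value closely, whether or not the data really contains that many anomalies. This is why the value deserves some thought.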

When I ran the program on a dataset of approximately 300 records, the model flagged 16 outliers. Let’s superimpose them on the previous scatter plot, marking the 16 outliers in red:

As we can see, most of the obvious outliers have been detected, but not all. The detection performance can be improved by experimenting with the parameters and by adding more historical data (the above example splits the data into training and test datasets with an 80-20 ratio). To complete the detection, we should also perform another run with all 5 variables in the table (AREA1, AREA2, etc.) at the same time. This might result in a smaller set of outliers, like the example below:

Now that we have the anomalies, they can be visualized in a Tableau dashboard in a different color, alerting us to investigate them immediately.

Anomalies are marked red in the DQ dashboard

Other ML Algorithms

Another interesting alternative for outlier detection is a clustering algorithm called DBSCAN. The good thing about DBSCAN is that it does the job almost autonomously: unlike other algorithms such as K-Means, it does not require us to supply the number of clusters to be created. The implementation in Scikit-learn is also quite straightforward. A snippet of the code looks like this:

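from sklearn.cluster import DBSCAN
# eps: neighborhood radius; min_samples: points needed to form a dense region
db_default = DBSCAN(eps=0.5, min_samples=3).fit(dfRecharge)
labels = db_default.labels_  # cluster id per row; -1 marks noise (anomalies)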

There are two important parameters to be supplied: eps and min_samples. It takes some experimentation to find the right values. Running it on a similar Daily Sales statistics table (with a few months of historical data) results in two clusters, with the anomalies flagged as CLUSTER = -1, as seen below:
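As a small illustration of how eps and min_samples interact, here is a sketch on nine made-up points: two dense groups and one isolated point. It reuses the parameter values from the snippet above, but the data is invented for the example.

```python
from sklearn.cluster import DBSCAN

# Made-up 2-D points: two dense groups and one isolated point.
points = [
    [1.0, 1.0], [1.1, 1.0], [1.0, 1.2], [0.9, 1.1],
    [5.0, 5.0], [5.1, 5.1], [4.9, 5.0], [5.0, 4.8],
    [9.0, 0.5],
]

# A point is "core" if at least min_samples points (itself included)
# lie within distance eps of it; points reachable from no core are noise.
labels = DBSCAN(eps=0.5, min_samples=3).fit(points).labels_

anomalies = [p for p, label in zip(points, labels) if label == -1]
print(labels.tolist())
print(anomalies)
```

A smaller eps or a larger min_samples makes the algorithm stricter, so more points end up labeled -1. A larger eps eventually merges everything into one cluster and hides the anomalies.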

Visualized in a scatter plot, anomalies are marked red:

Most anomalies can be detected quite effectively with this technique. However, in my experience DBSCAN tends to run slowly and seems to take a lot of computing resources. In many cases it simply failed to produce any useful output when the dataset became too large.

Doing It at Scale

The above programs cover datasets of only a few hundred rows. What if we want to detect outliers in a much larger dataset, say, 1 million records? In real-world applications, this will likely be the usual case. That’s where the challenge begins!

To search for outliers in a large dataset, we cannot rely on standard single-machine Python because it would simply break after reaching a certain point. Python and Scikit-learn are not really designed for big data. We would ideally employ Spark MLlib for data at this scale. However, at the time of writing, Spark MLlib does not support Isolation Forest or DBSCAN. Alas, we are stuck with Scikit-learn.

To see how far Scikit-learn could go, I created samples of 1 million, 2 million, 10 million, 15 million and a gigantic 150 million records. In the hope of exploiting all available computing resources, I wrapped Isolation Forest in a Spark UDF (user-defined function) so that it could be called as if it were a Spark function. I also used the sparkContext.broadcast() function to distribute the scaler and the model to all nodes, minimizing communication cost. Here is the code:

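from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType
from sklearn import preprocessing as pp
from sklearn.ensemble import IsolationForest

def get_outliers(a, b):
  # Score one row on the executor using the broadcast scaler and model.
  result = 0  # 0 = could not be scored; -1 = outlier, 1 = inlier
  x_pred = [(a, b)]
  try:
    x_pred = b_scaler.value.transform(x_pred)
    result = b_model1.value.predict(x_pred)[0]
  except Exception as e:
    print('Error in {0} : {1}'.format(x_pred, e))
  return int(result)
udf_get_outliers = F.udf(get_outliers, IntegerType())

# Fit the scaler and the model on the driver (X_train is assumed to be an in-memory sample).
scaler = pp.StandardScaler(copy=True, with_mean=True, with_std=True)
model1 = IsolationForest(n_estimators=150, max_samples='auto', contamination=0.02, random_state=42)
X_train = scaler.fit_transform(X_train)
model1.fit(X_train)

# Ship both objects to every node once, then apply the UDF row by row.
b_scaler = spark.sparkContext.broadcast(scaler)
b_model1 = spark.sparkContext.broadcast(model1)
df1 = df1.withColumn('prediction', udf_get_outliers('rev_mtd_', 'rev_m2_'))
nOutliers = df1.where(F.col('prediction') < 0).count()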

Was it successful? Not really. The above program worked for up to 10 million records, but it crashed when faced with 15 million records. A far cry from 150 million records that I wanted! Running it on a sampled table with 10 million rows and 2 variables took 5 hours, resulting in around 200 outliers found.

As you can see, running the code on production-scale data is the real challenge. A true implementation using Spark remains something to wish for. After some Googling, I found a library whose creator says it implements Isolation Forest on Spark (https://github.com/titicaca/spark-iforest). I would really like to try this; however, I still don’t know how (and don’t have the time) to make it work. Would anyone be interested in helping me? Drop me an email!

March 10, 2020

Detecting Data Quality Issues by Identifying Outliers

March 2020 | Dios Kurniawan

In my previous post on Data Quality Framework, I discussed the concept of the DQ (data quality) Test Point. The idea is to detect data quality issues by probing the data. In these DQ Test Points, DQ metrics are extracted, calculated and then compared against a baseline.

To bolster the DQ metrics, a method called outlier detection is added to the DQ Test Points. This is a pretty simple check for abnormalities in the data by looking for outliers. Outliers are, as you must have guessed, values which are very different from most of the population. For example, if most Telkomsel subscribers make 5-6 voice calls per day, then those who place 100 voice calls per day are clear outliers. Outliers can be legitimate, as in the previous example, but they can also be invalid or unwanted data. For example, a person whose age is 250 years is certainly invalid data, something that we want to erase from our datasets.

Most outliers are hidden inside the dataset. There are almost always outliers, but when we see a lot of them in our tables, or a sudden change in the number of outliers, we can see it as a sign of a potential data quality issue. It is important to remove outliers because they generally have a negative effect on analysis and on the training of predictive models.

To identify outliers, one of the most common methods is to look at the distribution of the data using the interquartile range (IQR). In the simplest words, the IQR is the span of the data which represents the “middle” values, where half (50%) of the data belongs. Anything that falls beyond a certain distance below or above this middle range is marked as an outlier. According to “Tukey’s Rule” (named after John Tukey, an American mathematician), that distance is 1.5 x IQR (see diagram below).

To perform the computation, first we have to sort all rows in the dataset from the lowest to the highest value, then divide the data into four equal parts (hence the name “quartile”). The boundary below which the first 25% of the data falls is called Q1, and the boundary above which the last 25% falls is called Q3. To get the IQR, subtract Q1 from Q3. After that, find the “min” and “max” limits: “min” is Q1 minus 1.5 x IQR and “max” is Q3 plus 1.5 x IQR. Once you have these “min” and “max” numbers, you can start counting the values which are less than “min” or more than “max”. There you get your outliers.
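Tukey’s rule fits in a few lines of standard-library Python. The sketch below uses made-up daily call counts and the “inclusive” quartile method of statistics.quantiles; other quartile conventions give slightly different limits.

```python
import statistics

def tukey_fences(values, k=1.5):
    """Return the (min, max) limits: Q1 - k*IQR and Q3 + k*IQR."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Made-up example: voice calls per day for eleven subscribers.
calls = [4, 5, 5, 5, 6, 6, 6, 7, 5, 6, 100]

lower, upper = tukey_fences(calls)
outliers = [v for v in calls if v < lower or v > upper]
print(lower, upper, outliers)
```

Here Q1 is 5 and Q3 is 6, so the IQR is 1 and the limits are 3.5 and 7.5. Only the subscriber with 100 calls falls outside them.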

Finding the IQR like above is a computationally heavy process, especially when a large number of rows is involved. Luckily, our friend Spark has provided us with a tool for this: the approxQuantile() function, which computes approximate quantiles within a chosen relative error, so the IQR can be obtained without an exact sort of the whole dataset. This can be done quite easily. You may want to see the snippet of the PySpark code as shown below:
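A minimal sketch of how such a check could look in PySpark is given here. This is not the original script: the function name, the 0.01 relative error and the example table and column names are assumptions for illustration. It relies only on DataFrame.approxQuantile(), filter() and count().

```python
def count_outliers(df, column, relative_error=0.01, k=1.5):
    """Count rows of a Spark DataFrame outside Tukey's limits for one column."""
    # approxQuantile(col, probabilities, relativeError) returns approximate
    # quantiles; a relativeError of 0 would request exact (costlier) values.
    q1, q3 = df.approxQuantile(column, [0.25, 0.75], relative_error)
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    # A SQL expression string keeps the sketch free of extra imports.
    condition = f"`{column}` < {lower} OR `{column}` > {upper}"
    return df.filter(condition).count()

# Hypothetical usage, assuming an existing SparkSession named `spark`:
# n_outliers = count_outliers(spark.table("some_table"), "some_column")
```

Calling the function once per numerical column and dividing by the row count gives the outlier percentage per column.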

To give an example of how it works, this script was run on one of the tables in the ABT schema, with around 5 million records. Outliers are gathered from all numerical columns. The script produced the following statistics:

As we can see, the percentage of outliers is generally small. Anything under 2% will be considered normal. However, in the above example, there are some particular columns which have large percentage of outliers, even up to 15%. This is something worth investigating as this can lead to potential quality problems.
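The 2% rule of thumb is simple to automate. The sketch below uses made-up counts for two hypothetical columns of a 5-million-row table and keeps only the columns whose outlier percentage exceeds the threshold.

```python
# Hypothetical per-column counts: (outlier rows, total rows).
column_stats = {
    "col_a": (12_000, 5_000_000),
    "col_b": (750_000, 5_000_000),
}
THRESHOLD_PCT = 2.0  # anything under 2% is treated as normal

def outlier_pct(n_outliers, n_rows):
    """Percentage of rows flagged as outliers."""
    return 100.0 * n_outliers / n_rows

suspicious = {
    name: outlier_pct(*counts)
    for name, counts in column_stats.items()
    if outlier_pct(*counts) > THRESHOLD_PCT
}
print(suspicious)
```

In this made-up case only col_b, at 15%, would be put forward for investigation.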

What’s Next?

At the moment the IT Data Quality team is working on putting the outlier counts in our BI DQ Dashboard (see below) alongside other DQ metrics. By comparing the outlier statistics with historical data, we may be able to detect issues before they present a problem on the consumers’ side.

The method mentioned above is far from perfect. It only captures outliers in one dimension, which is called univariate outliers. To get outliers which lie in two or more dimensions, called multivariate outliers, different methods employing ML techniques must be prepared. Stay tuned for more posts on this matter. Our journey to improve data quality in BI keeps going on!

February 8, 2020

Personal Data Protection

February 2020 | Dios Kurniawan

Indonesian lawmakers will soon ratify the new law on personal data protection, called Undang-undang Perlindungan Data Pribadi (UU PDP). At the heart of this new law is a stringent safeguard measure on privacy rights. Personal data protection is part of human rights, and the law carries heavy criminal penalties for fraud and misuse of personal data. There is a criminal sanction of 7 years in prison or a fine of 70 billion Rupiah (!) for anyone involved in unlawful use of personal data.

If you run a business and you regularly collect customer’s data, remember that the personal data belongs to your customers. The ownership remains with your customers even though the data sits on your system’s hard drive. In order for you to process the data, the customers must give explicit consent. Without it, you are breaking the law.

So what constitutes personal data? According to the latest draft of the law, there are two types of personal data:

General Personal Data, that is, data which could easily identify who someone is. This includes full name, gender, nationality, passport number and NIK (national ID number). Birth date, home address, work address, photographs also belong to this category.
Specific Personal Data, that is, sensitive data which could harm someone if it falls into the wrong hands. This category of personal data requires a different kind of protection. This includes medical information, biometrics, genetics, political views, financial data, religion and family information. Credit card numbers, bank account information, mother’s name and geolocation data are also categorized as specific personal data.

What about phone numbers and e-mail addresses? There can be multiple interpretations of the law with regard to this, but it is safe to consider that phone numbers and e-mail addresses are not categorized as personal data. They fall into the category of pseudonym data, which is a type of data that requires additional data before it can be used to identify someone. Although this is not explicitly mandated by law, phone numbers and e-mail addresses should also be protected in order to protect the customers.

For those of us who work in the field of big data analytics, what are the do’s and don’ts when it comes to handling customers’ data? Just remember that by law, we are personally responsible for our actions. Here are a few guidelines for you:

Treat customer data with respect. The data is not yours.
Do not touch specific/sensitive personal data unless you have a very strong reason to do so. Avoid making analysis based on this category of data.
Protect personal data with encryption at the physical level.
Each time you need to deliver data, a report, or to grant table access to your users, check to make sure there is no personal data within. If you are transferring data to an external party, be certain you have the legal clearance before doing so. Ask for written evidence.
Never reveal customer’s personal data to friends, relatives or family members. Never.
Destroy customer’s personal data when they are no longer our customers (that is, they have churned).

In the age of big data, those who own data hold the power. The famous quote “along with great power comes great responsibility” perfectly describes the situation.