Data Quality Test Using Machine Learning

June 2020 | Dios Kurniawan

This post continues my previous post on outlier detection as a statistical method for identifying Data Quality (DQ) issues. If you haven’t read my previous article, click here. This time around, I take a different approach by employing a machine learning (ML) technique.

As I explained in my previous post, a sudden change in the number of outliers in a dataset is a strong indicator that we may have a data quality issue. Outliers can be legitimate, but more often than not, they are anomalies which should be removed from the dataset. Finding anomalies is a crucial step before allowing the dataset to be used for further processing.

To demonstrate how the anomaly detection works, I took a sample dataset containing the Daily Sales Revenue statistics of a fictitious company, as shown below:

Our intention is to detect the days on which revenue numbers in any of the Sales Areas deviate from what we consider normal. This is important because a sudden change in revenue is generally not expected. However, eyeballing the data row by row to search for anomalies can be a daunting, error-prone task. We need a machine to do that for us. To achieve this, we can use an ML algorithm called Isolation Forest.

Isolation Forest is an unsupervised anomaly detection algorithm which traces its roots to the popular Decision Tree algorithm. In short, it builds many random trees and tries to isolate each data point; points that can be separated from the rest with only a few splits are treated as outliers. This is precisely what we need to detect data quality issues (while it is an interesting topic, I may not be the right person to explain the mathematical background of this algorithm in detail. If you wish to understand the theory behind Isolation Forest, please visit

Isolation Forest allows us to detect multivariate outliers, meaning we can find unusual values across more than one variable at once. For the first run, we will use only two variables from the table, “NQ” and “Total”. “NQ” is particularly interesting because it holds revenue transactions which cannot be categorized into the correct Sales Area. A sudden jump in “NQ” is a clear sign that something has gone wrong. Let’s visualize the data in a scatter plot (with the values normalized):

In the scatter plot above, outliers can be identified as the dots that lie far from the majority in the mid-lower part of the chart. There are outliers in both the “NQ” and “Total” variables. Human eyes can easily see these, but computers need to be trained to achieve the same. Using Python’s Scikit-learn library, Isolation Forest can be implemented with a few lines of code like the example below:

import numpy as np
import pandas as pd
from sklearn import model_selection as ms
from sklearn.ensemble import IsolationForest

# X and y are assumed to hold the normalized "Total" and "NQ" values, respectively
rng = np.random.RandomState(88)
X_train, X_test, y_train, y_test = ms.train_test_split(X, y, test_size=0.2, random_state=rng)
dfTrain = pd.DataFrame(list(zip(X_train, y_train)), columns=['total_','nq_'])
dfTest = pd.DataFrame(list(zip(X_test, y_test)), columns=['total_','nq_'])
clf = IsolationForest(n_estimators=100, max_samples='auto', contamination=0.1, random_state=rng)
clf.fit(dfTrain)  # the model must be fitted before predict() can be called
y_pred_train = clf.predict(dfTrain)
y_pred_test = clf.predict(dfTest)  # 1 = inlier, -1 = outlier
dfPredTest = pd.DataFrame(y_pred_test)
dfPredTest = dfTest.join(dfPredTest, how='left')
dfOutlier = dfPredTest.loc[dfPredTest[0] < 0]
print("Outlier Count=", len(dfOutlier))

This algorithm relies on a random seed, and we can experiment with different seeds to see how the result changes. Note that the algorithm takes a parameter called contamination, which in the above example is set to 0.1, meaning we expect outliers to make up no more than about 10% of the population (this is because the dataset is small; in a larger dataset, 2% would be a more reasonable limit). The n_estimators parameter, which sets the number of trees, can be raised above 100 to gain some accuracy at the price of more processing time.
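To make the effect of contamination more concrete, here is a minimal sketch on synthetic data (the numbers are made up and are not the Daily Sales Revenue table; the names X_demo and count_flagged are purely illustrative). It fits the same kind of model twice and counts how many rows each setting flags:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for the normalized (Total, NQ) pairs: 290 ordinary
# points plus 10 clearly shifted ones. These values are invented.
rng = np.random.RandomState(88)
normal = rng.normal(loc=0.0, scale=1.0, size=(290, 2))
shifted = rng.normal(loc=8.0, scale=0.5, size=(10, 2))
X_demo = np.vstack([normal, shifted])

def count_flagged(contamination):
    """Fit an Isolation Forest and count the rows it labels -1 (outlier)."""
    clf = IsolationForest(n_estimators=100, contamination=contamination,
                          random_state=88)
    labels = clf.fit_predict(X_demo)  # 1 = inlier, -1 = outlier
    return int((labels == -1).sum())

flagged_10 = count_flagged(0.10)
flagged_02 = count_flagged(0.02)
print("contamination=0.10 ->", flagged_10, "rows flagged")
print("contamination=0.02 ->", flagged_02, "rows flagged")
```

On the data used for fitting, the share of flagged rows follows the contamination value closely, regardless of how many rows are truly anomalous. Setting it too high will therefore flag perfectly normal records, which is why it is worth tuning.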

When I ran the program on a dataset of approximately 300 records, the model reported an outlier count of 16. Let’s superimpose the outliers on the previous scatter plot, marking the 16 outliers in red:

As we can see, most of the obvious outliers have been detected, but not all. The detection performance can be improved by experimenting with the parameters and by adding more historical data (the above example splits the data into training and test datasets with an 80-20 ratio). To complete the detection, we should also perform another run with all 5 variables in the table (AREA1, AREA2, etc.) at the same time. This might result in a smaller set of outliers like the example below:

Now that we have the anomalies, they can be visualized in a Tableau dashboard in a different color, alerting us to investigate them immediately.

Anomalies are marked red in the DQ dashboard

Other ML Algorithms

Another interesting alternative for outlier detection is a clustering algorithm called DBSCAN. The good thing about DBSCAN is that it does the job almost autonomously: unlike other algorithms such as K-Means, it does not require us to supply the number of clusters to be created. The implementation in Scikit-learn is also quite straightforward. A snippet of the code looks like this:

from sklearn.cluster import DBSCAN
db_default = DBSCAN(eps = 0.5, min_samples = 3).fit(dfRecharge)
labels = db_default.labels_

There are two important parameters to be supplied: eps and min_samples. It takes some experimentation to find the right values. Running it on a similar Daily Sales statistics table (with a few months of historical data) results in two clusters, with the anomalies flagged as CLUSTER = -1 as seen below:

Visualized in a scatter plot, anomalies are marked red:

Most anomalies can be detected quite effectively with this technique. However, DBSCAN tends to run slowly and seems to take a lot of computing resources. In many cases it simply fails to produce any useful output when the dataset becomes too large.
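To illustrate how DBSCAN labels anomalies, here is a small self-contained sketch on synthetic data (two invented clusters plus three hand-placed far-away points; this is not the Daily Sales table, and X_demo is an illustrative name). In Scikit-learn, eps is the neighborhood radius and min_samples is the minimum number of neighbors needed to form a dense region; points that belong to no dense region receive the label -1:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Two dense synthetic clusters plus three isolated points (invented values).
rng = np.random.RandomState(42)
cluster_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(100, 2))
cluster_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(100, 2))
anomalies = np.array([[2.5, 12.0], [-6.0, 6.0], [12.0, -3.0]])

# Normalize first so that eps means the same thing on both axes.
X_demo = StandardScaler().fit_transform(
    np.vstack([cluster_a, cluster_b, anomalies]))

labels = DBSCAN(eps=0.5, min_samples=3).fit(X_demo).labels_

# Label -1 is DBSCAN's "noise" label, i.e. our anomalies.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
print("clusters found:", n_clusters)
print("points flagged as noise:", n_noise)
```

Because eps is a distance, it only makes sense relative to the scale of the data, which is one reason why finding good values takes some trial and error.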

Doing It at Scale

The above programs cover datasets of only a few hundred rows. What if we want to detect outliers in a much larger dataset, say, 1 million records? In real-world applications, this will likely be the usual case. That’s where the challenge begins!

To search for outliers in a large dataset, we cannot rely on standard Python alone because it would simply break after reaching a certain point. Python and Scikit-learn run on a single machine and are not really designed for big data. We would ideally employ Spark MLlib for data at this scale. However, at the time of writing, Spark MLlib does not support Isolation Forest or DBSCAN. Alas, we are stuck with Scikit-learn.

To see how far Scikit-learn could go, I created samples of 1 million, 2 million, 10 million, 15 million and a gigantic 150 million records. In the hope of exploiting all available computing resources, I wrapped Isolation Forest in a Spark UDF (user-defined function) so that it could be called as if it were a Spark function. I also used the sparkContext.broadcast() function to distribute the scaler and the model to all nodes, minimizing communication cost. Here is the code:

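import pyspark.sql.functions as F
from pyspark.sql.types import IntegerType
from sklearn import preprocessing as pp
from sklearn.ensemble import IsolationForest

# Assumes an existing SparkSession (spark), a Spark DataFrame (df1) and a training array (X_train)
def get_outliers(a, b):
    result = 0
    x_pred = [(a, b)]
    try:
        x_pred = b_scaler.value.transform(x_pred)
        result = b_model1.value.predict(x_pred)[0]  # 1 = inlier, -1 = outlier
    except Exception as e:
        print('Error in {0} : {1}'.format(x_pred, e))
    return int(result)
udf_get_outliers = F.udf(get_outliers, IntegerType())

scaler = pp.StandardScaler(copy=True, with_mean=True, with_std=True)
model1 = IsolationForest(n_estimators=150, max_samples='auto', contamination=0.02, random_state=42)
X_train = scaler.fit_transform(X_train)
# The original "y_train =" line was cut off; fitting the model here is an assumed reconstruction
model1.fit(X_train)
b_scaler = spark.sparkContext.broadcast(scaler)
b_model1 = spark.sparkContext.broadcast(model1)  # broadcast the fitted model
df1 = df1.withColumn('prediction', udf_get_outliers('rev_mtd_', 'rev_m2_'))
nOutliers = df1.where(F.col('prediction') < 0).count()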

Was it successful? Not really. The above program worked for up to 10 million records, but it crashed when faced with 15 million records. A far cry from the 150 million records that I wanted! Running it on a sampled table with 10 million rows and 2 variables took 5 hours and found around 200 outliers.
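One likely contributor to the long run time is that the UDF calls predict() once per row, so the overhead of each Scikit-learn call is paid millions of times. As a rough sketch only (this is not the program I ran above, it uses invented data, and count_outliers_in_batches is a hypothetical helper), the same scaler and model can instead score rows in large batches on a single machine:

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)
# Made-up stand-in for two revenue columns (not the real table).
X_all = rng.lognormal(mean=10.0, sigma=0.5, size=(100_000, 2))

# Fit the scaler and the model on a random sample only.
sample_idx = rng.choice(len(X_all), size=20_000, replace=False)
scaler = StandardScaler().fit(X_all[sample_idx])
model = IsolationForest(n_estimators=150, max_samples='auto',
                        contamination=0.02, random_state=42)
model.fit(scaler.transform(X_all[sample_idx]))

def count_outliers_in_batches(X, batch_size=25_000):
    """Score X in fixed-size chunks; each predict() call handles a whole chunk."""
    n_outliers = 0
    for start in range(0, len(X), batch_size):
        batch = scaler.transform(X[start:start + batch_size])
        n_outliers += int((model.predict(batch) == -1).sum())
    return n_outliers

n_outliers = count_outliers_in_batches(X_all)
print("Outlier count =", n_outliers)
```

This keeps memory bounded by the batch size, but it still runs on one machine, so it does not replace a truly distributed implementation.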

As you can see, running the code on production-scale data is the real challenge. A true implementation using Spark remains something to wish for. After some Googling, I found a library whose creator claims that it implements Isolation Forest in Spark. I would really like to try it, but I still don’t know how (and don’t have the time) to make it work. Would anyone be interested in helping me? Drop me an email!