Detecting Data Quality Issues by Identifying Outliers

March 2020 | Dios Kurniawan

In my previous post on Data Quality Framework, I discussed about the concept of DQ (data quality) Test Point. The idea is to detect data quality issues by probing the data. In these DQ Test Points, DQ metrics are extracted, calculated and then compared against a baseline. 

To bolster the DQ metrics, a method called outlier detection is added to the DQ Test Points. This is a pretty simple check to see abnormalities in the data by looking for outliers. Outliers are, as you must have guessed, values which are very different from most of the population. For example, if most of Telkomsel subscribers make 5-6 voice calls per day, then those who place 100 voice calls per day are clear outliers. Outliers can be legitimate as in the previous example, but they can also be invalid or unwanted data. For example a person whose age is 250 years is certainly an invalid data, something that we want to erase from our datasets.

Most outliers are hidden inside the dataset. There are almost always outliers, but when we see a lot of them in our tables, or a sudden change in the number of outliers, we can see it as a sign of a potential data quality issue. It is important to remove outliers because they generally have negative effect on analysis and training a predictive model.

To identify outliers, one of the most common methods is to calculate the distribution of the data using Inter-quartile Range (IQR). In the simplest words, IQR is the area of the data which represents the “middle” values where half (50%) of the data belongs. Anything that falls beyond a certain distance from the IQR will be marked as outliers.  According to “Tukey’s Rule” (named after John Tukey, an American mathematician), the distance is 1.5 x IQR (see diagram below).

To perform the computation, first we will have to sort all rows in the dataset from the lowest to the highest value, then divide the data into four equal parts (hence the name “quartile”). The boundary for the first 25% of the data is called Q1 and the last 25% is called Q3. To get the IQR, subtract Q3 from Q1. After that, find the “min” and “max” limits by calculating 1.5 x IQR from Q1 and Q3, respectively. Once you have these “min” and “max” numbers, you can start counting the outliers which values are less than “min” and more than “max”. There you get your outliers.

Finding IQR like above is a computationally heavy process, especially when involving large number of rows. Luckily, our friend Spark has provided us with the tool for this. Using approxQuantile() function which allows us to compute IQR without going through the whole dataset, and this can be done quite easily. You may want to see the snippet of the PySpark code as shown below:

To give an example of how it works, this script was run on one of the tables in the ABT schema, with around 5 million records. Outliers are gathered from all numerical columns. The script produced the following statistics:

As we can see, the percentage of outliers is generally small. Anything under 2% will be considered normal. However, in the above example, there are some particular columns which have large percentage of outliers, even up to 15%. This is something worth investigating as this can lead to potential quality problems. 

What’s Next?

At the moment the IT Data Quality team is working on putting the outliers count in our BI DQ Dashboard (see below) alongside with other DQ metrics. By comparing the outliers statistics with historical data, we may be able to detect issues before they present a problem in the consumers side.

The method mentioned above is far from perfect. It only captures outliers in one dimension, which is called univariate outliers. To get outliers which lie in two or more dimensions, called multivariate outliers, different methods employing ML techniques must be prepared. Stay tuned for more posts on this matter. Our journey to improve data quality in BI keeps going on!

Personal Data Protection

February 2020 | Dios Kurniawan

Indonesian lawmakers will soon ratify the new law on personal data protection, called Undang-undang Perlindungan Data Pribadi (UU PDP). At the heart of this new law is a stringent safeguard measure on privacy rights. Personal data protection is part of human rights, and the law carries heavy criminal penalties for fraud and misuse of personal data. There is a criminal sanction of 7 years in prison or a fine of 70 billion Rupiah (!) for anyone involved in unlawful use of personal data.

If you run a business and you regularly collect customer’s data, remember that the personal data belongs to your customers. The ownership remains with your customers even though the data sits on your system’s hard drive. In order for you to process the data, the customers must give explicit consent. Without it, you are breaking the law. 

So what constitutes a personal data? According the latest draft of the law, there are two types of personal data:

  1. General Personal Data, that is, data which could easily identify who someone is. This includes full name, gender, nationality, passport number and NIK (national ID number). Birth date, home address, work address, photographs also belong to this category.
  2. Specific Personal Data, that is, sensitive data which could harm someone if it falls into the wrong hands. This category of personal data requires different kind of protection. This includes medical information, biometrics, genetics, political views, financial data, religion and family information. Credit card number, bank account information, mother’s name, geolocation data are also categorized as specific personal data.

What about phone numbers and e-mail addresses? There can be multiple interpretation of the law with regard to this, but it is safe to consider that phone numbers and e-mails are not categorized as personal data. They fall into the category of pseudonym data, which is a type of data that requires additional data before it can be used to identify someone. Although not explicitly mandated by law, to protect the customers, phone numbers and email addresses must also be protected.

For us who work in the field of big data analytics, what are the do’s and don’ts when it comes to handling customer’s data? Just remember that by law, we are personally responsible for our actions. Here is a few guideline for you:

  1. Treat customer data with respect. The data is not yours.
  2. Do not touch specific/sensitive personal data unless you have a very strong reason to do so. Avoid making analysis based on this category of data.
  3. Protect personal data with encryption in the physical level.
  4. Each time you need to deliver data, a report, or to grant table access to your users, check to make sure there is no personal data within. If you are transferring data to an external party, be certain you have the legal clearance before doing so. Ask for written evidence. 
  5. Never reveal customer’s personal data to friends, relatives or family members. Never.
  6. Destroy customer’s personal data when they are no longer our customers (that is, they have churned).

In the age of big data, those who own data hold the power. The famous quote “along with great power comes great responsibility” perfectly describes the situation.

My Book is Out for Preview

October 2019 | Dios Kurniawan

After more than a year in the making, I have finally sent the manuscript of my next upcoming book to the publisher. The book is titled “Pengenalan Machine Learning dengan Python (Introduction to Machine Learning using Python)”, which is aimed at Indonesian readers who are interested to start learning about this exciting field of study, but may have problems reading English literature.

Temporary Book Cover (final version being created)

It is a 200-page tutorial on Python, giving readers introduction on what machine learning is, basic skills in Pandas, and samples of several machine learning algorithms with focus on Scikit-learn and some other Pythonic libraries. There is a chapter on Computer Vision as well.

Hopefully my book will be published by early 2020. I am aiming at both printed and e-book versions.

While being in the editing and review process, I am giving you a sneak peek of what my book will look like. You may drop me an email and I will send you a copy of my manuscript. Your comments are very welcome!