Detecting Data Quality Issues by Identifying Outliers

March 2020 | Dios Kurniawan

In my previous post on Data Quality Framework, I discussed about the concept of DQ (data quality) Test Point. The idea is to detect data quality issues by probing the data. In these DQ Test Points, DQ metrics are extracted, calculated and then compared against a baseline. 

To bolster the DQ metrics, a method called outlier detection is added to the DQ Test Points. This is a pretty simple check to see abnormalities in the data by looking for outliers. Outliers are, as you must have guessed, values which are very different from most of the population. For example, if most of Telkomsel subscribers make 5-6 voice calls per day, then those who place 100 voice calls per day are clear outliers. Outliers can be legitimate as in the previous example, but they can also be invalid or unwanted data. For example a person whose age is 250 years is certainly an invalid data, something that we want to erase from our datasets.

Most outliers are hidden inside the dataset. There are almost always outliers, but when we see a lot of them in our tables, or a sudden change in the number of outliers, we can see it as a sign of a potential data quality issue. It is important to remove outliers because they generally have negative effect on analysis and training a predictive model.

To identify outliers, one of the most common methods is to calculate the distribution of the data using Inter-quartile Range (IQR). In the simplest words, IQR is the area of the data which represents the “middle” values where half (50%) of the data belongs. Anything that falls beyond a certain distance from the IQR will be marked as outliers.  According to “Tukey’s Rule” (named after John Tukey, an American mathematician), the distance is 1.5 x IQR (see diagram below).

To perform the computation, first we will have to sort all rows in the dataset from the lowest to the highest value, then divide the data into four equal parts (hence the name “quartile”). The boundary for the first 25% of the data is called Q1 and the last 25% is called Q3. To get the IQR, subtract Q3 from Q1. After that, find the “min” and “max” limits by calculating 1.5 x IQR from Q1 and Q3, respectively. Once you have these “min” and “max” numbers, you can start counting the outliers which values are less than “min” and more than “max”. There you get your outliers.

Finding IQR like above is a computationally heavy process, especially when involving large number of rows. Luckily, our friend Spark has provided us with the tool for this. Using approxQuantile() function which allows us to compute IQR without going through the whole dataset, and this can be done quite easily. You may want to see the snippet of the PySpark code as shown below:

To give an example of how it works, this script was run on one of the tables in the ABT schema, with around 5 million records. Outliers are gathered from all numerical columns. The script produced the following statistics:

As we can see, the percentage of outliers is generally small. Anything under 2% will be considered normal. However, in the above example, there are some particular columns which have large percentage of outliers, even up to 15%. This is something worth investigating as this can lead to potential quality problems. 

What’s Next?

At the moment the IT Data Quality team is working on putting the outliers count in our BI DQ Dashboard (see below) alongside with other DQ metrics. By comparing the outliers statistics with historical data, we may be able to detect issues before they present a problem in the consumers side.

The method mentioned above is far from perfect. It only captures outliers in one dimension, which is called univariate outliers. To get outliers which lie in two or more dimensions, called multivariate outliers, different methods employing ML techniques must be prepared. Stay tuned for more posts on this matter. Our journey to improve data quality in BI keeps going on!

Personal Data Protection

February 2020 | Dios Kurniawan

Indonesian lawmakers will soon ratify the new law on personal data protection, called Undang-undang Perlindungan Data Pribadi (UU PDP). At the heart of this new law is a stringent safeguard measure on privacy rights. Personal data protection is part of human rights, and the law carries heavy criminal penalties for fraud and misuse of personal data. There is a criminal sanction of 7 years in prison or a fine of 70 billion Rupiah (!) for anyone involved in unlawful use of personal data.

If you run a business and you regularly collect customer’s data, remember that the personal data belongs to your customers. The ownership remains with your customers even though the data sits on your system’s hard drive. In order for you to process the data, the customers must give explicit consent. Without it, you are breaking the law. 

So what constitutes a personal data? According the latest draft of the law, there are two types of personal data:

  1. General Personal Data, that is, data which could easily identify who someone is. This includes full name, gender, nationality, passport number and NIK (national ID number). Birth date, home address, work address, photographs also belong to this category.
  2. Specific Personal Data, that is, sensitive data which could harm someone if it falls into the wrong hands. This category of personal data requires different kind of protection. This includes medical information, biometrics, genetics, political views, financial data, religion and family information. Credit card number, bank account information, mother’s name, geolocation data are also categorized as specific personal data.

What about phone numbers and e-mail addresses? There can be multiple interpretation of the law with regard to this, but it is safe to consider that phone numbers and e-mails are not categorized as personal data. They fall into the category of pseudonym data, which is a type of data that requires additional data before it can be used to identify someone. Although not explicitly mandated by law, to protect the customers, phone numbers and email addresses must also be protected.

For us who work in the field of big data analytics, what are the do’s and don’ts when it comes to handling customer’s data? Just remember that by law, we are personally responsible for our actions. Here is a few guideline for you:

  1. Treat customer data with respect. The data is not yours.
  2. Do not touch specific/sensitive personal data unless you have a very strong reason to do so. Avoid making analysis based on this category of data.
  3. Protect personal data with encryption in the physical level.
  4. Each time you need to deliver data, a report, or to grant table access to your users, check to make sure there is no personal data within. If you are transferring data to an external party, be certain you have the legal clearance before doing so. Ask for written evidence. 
  5. Never reveal customer’s personal data to friends, relatives or family members. Never.
  6. Destroy customer’s personal data when they are no longer our customers (that is, they have churned).

In the age of big data, those who own data hold the power. The famous quote “along with great power comes great responsibility” perfectly describes the situation.

Tips: Installing Standalone PySpark in MacOS

August 2019 | Dios Kurniawan

I wrote this post to share with you how to install PySpark in MacOS. I was reformatting my laptop last week and I found it difficult to reinstall PySpark because the Apache Spark documentation did not mention much about MacOS. If you Google it, you would find there are quite many different ways of doing this, but I could assure you that what I wrote here would be the most straightforward way to install PySpark in a standalone setup in a MacOS.

Standalone setup is ideal if you need to write in PySpark but do not own or do not have access to a Hadoop / Spark cluster. If you only need to write PySpark programs locally in your laptop without actually running them in a cluster of machines, for example at home or in a coffee shop, then this is the way to go.

To install PySpark, follow these 7 easy steps below. This assumes that you are starting with a clean machine. Be warned, you will need a fast internet connection because the size of the software to download will be quite large.

Steps:

  1. Download and install Java (JDK) SE 8 if you haven’t done so. Beware: do not use other version. If you use newer version, you will have to downgrade.
  2. Download and install Anaconda (http://anaconda.com). Pick Python 3.x instead of Python 2.x. Test your installation by creating and running a simple Python program before you proceed to the next step.
  3. Download and install Homebrew (https://brew.sh). Homebrew is package manager for MacOS, we will need this to install Spark. Once Homebrew is installed, open a new Terminal and run this command to get core Apache Spark package:
brew install apache-spark

4. Once finished, go to your Spark directory /usr/local/Cellar/apache-spark as shown below (change the numbers with the actual version you have installed in your computer, in my case it is “2.4.5”) and then find and edit the bash_profile file in that directory:

cd /usr/local/Cellar/apache-spark/2.4.5
nano ~/.bash_profile

Add these new lines at the very bottom of the file, then save and close:

export SPARK_PATH=/usr/local/Cellar/apache-spark/2.4.5
export PYSPARK_DRIVER_PYTHON="jupyter"
export PYSPARK_DRIVER_PYTHON_OPTS="notebook"
export PYSPARK_PYTHON=python3 alias snotebook='$SPARK_PATH/bin/pyspark --master local[2]'

5. Run the file bash_profile you just edited by executing this:

source ~/.bash_profile

6. Next, download and install Findspark and PySpark using Conda:

conda install -c conda-forge findspark
conda install pyspark

7. Test your installation by starting a new Python 3 program as below (you can also use Jupyter Notebook):

import findspark
findspark.init()
from pyspark import SparkContext
sc = SparkContext(appName="dios")

If the program returns no error, that’s it, you’ve got your Pyspark environment ready!