Pandas for Data Transformation

December 2018 | Dios Kurniawan

Market Basket Analysis (MBA, as many call it) is an analytical method widely used in retail business to gather insights on what products are usually purchased together by consumers. This time around, I was given a problem of analyzing transaction data from a client in Food & Beverages business, finding purchase patterns so the management can later examine which meals and drinks to bundle into ‘paket hemat’.

To start with, I have to extract the transaction data from the Point of Sale (POS) system, which sits in a SQL database, into a CSV file. The data, as one might suspect, is in raw format and requires preprocessing. The data is not much, less than 20,000 rows so I immediately thought that I would simply use Python and run the process in my laptop.

Below is an example of the original transaction data format (I changed the product names to obscure its real values for publication in this blog). The data has gone through some cleansing to eliminate nulls and inconsistent values.

I was looking for a quick way to transpose transactional data above into 1-hot encoded format which spans to the right. Sure, I could do that in SQL but that would require me to write a long query and re-extract the data from POS again. I did not want to do that. Pandas came to the rescue:

import pandas as pd
penjualan1 = pd.read_csv('D:\', parse_dates=['TRX_TS'], index_col=['TRX_ID'])
pivot1=penjualan1.pivot_table(index='TRX_ID', columns='PRODUCT_NAME', values='PRICE').fillna(0) 
pivot1[pivot1 > 0] = 1

It results in exactly the format I need:

Voila! Data is transformed within few seconds. With such a short program, only two lines of code, my data is ready for further analysis. Pandas is just great.