Getting Value from Big Data Projects

August 2017 | Dios Kurniawan

Many organizations are now engaging in big data initiatives, and many projects have started with large investments and a lot of ‘buzz’. Large companies have poured money into big data platforms such as Hadoop, procured expensive software and hired data scientists. But after a few years, people start asking: how much progress has been made, and is the investment paying off?

I do not have the statistics, but I believe only a few meet their initial expectations.

The main challenge for most big data initiatives is to extract insights from the data. Many organizations still have difficulty discovering the valuable data they hold and making it available for analysis. Hadoop, with its perceived low cost of storing data, tempts people to simply dump whatever data they have into it. In most people’s minds, Hadoop has become about providing a cheap storage environment, which blurs the real goals and objectives of having big data in the first place. The data is in Hadoop, but it is not really searchable and is not stored in a way that allows consumers to get value from it easily.

In most organizations, Hadoop remains an ‘IT toy’, meaning the business still depends largely on IT technical people to get at the data. There is a ‘disconnect’ between the IT department and the business units. The business has spurred experiments on big data use cases using many machine learning techniques, but the challenge is how to operationalise these experiments into something that generates valuable business insights in a sustainable manner, which in most cases is still technically difficult.

To ensure success, what many organizations need to build on top of their big data platform is essentially automation.

It is a comprehensive set of tools that lets business users, and eventually the consumers, get the data they need at the right time. Automation in big data is something like an e-commerce site: it allows users to search for the products they want, get help in doing so, and have the merchandise delivered to their doorstep. An online shop-style engine, sitting on top of the big data platform, connects the platform to the consumers, much like the recommendation engines usually found in online shops. Self-service Business Intelligence and self-service analytics are examples. Without this, it is difficult to reap the benefits of a big data investment, and very little will materialize into something meaningful.

Hadoop, the most popular big data platform, is not delivering what it promises, either. Pouring data into HDFS (Hadoop Distributed File System) is quite easy, but getting the data out is still a challenge. Many SQL-on-Hadoop tools exist, but none has gained wide enough adoption to really compete with true SQL engines such as Oracle and Teradata.

My experience also tells me that data governance has become very challenging: data is duplicated here and there, and much of it sits there without ever being used. Many IT organizations have spent months, even years, and millions of dollars integrating big data technology such as Hadoop, and in most cases the effort has been successful. Successful, that is, in the sense that a ‘data lake’ with cheap storage and abundant processing power can be provided. The missing link is an integrated, end-to-end set of systems and procedures that enables a completely operational environment and brings valuable insights to the consumers.

Big Data Ethics

July 2017 | Dios Kurniawan

So you run a business, and you have a lot of customers. Thousands, or even millions of them. You hold all the customer data and their transaction history in your system, most likely in your data warehouse. You understand that you have tons of information at your fingertips and you start thinking about taking advantage of the data for making money.

Purchase behavior, mobility patterns and demographic information are the bread and butter of Big Data (photo: Dios K)

You install Hadoop machines and start pouring data into them to analyze your customers’ demographics and behavior. You examine what they purchase. You analyze their spending habits, and even put the places your customers go under the microscope.

Before you know it, you have crossed the boundary between protecting your customers’ sensitive personal data and exploiting it.

There is no denying that today is the era of big data. You have got data in your hands, but the ethical question becomes: do you have the right to do anything you like with the data?

The answer is yes, but only if you protect the personally identifiable information (PII). It is the right of each customer to keep his or her private information private. In some countries, this is a legal requirement (Indonesia is a bit relaxed on this matter). The question is, do you really respect your customers’ rights?

Therefore, before you perform data mining on your data, do not forget to anonymize your customers’ data. Replace real names and phone numbers with hash codes and delete the actual values. Obscure sensitive information right inside the database tables.
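
As a rough illustration of the hashing step described above, here is a minimal Python sketch. The field names, the sample record and the secret key are all made up for the example. It uses a keyed hash (HMAC-SHA256) rather than a plain hash, on the assumption that values such as phone numbers are too easy to guess and re-hash otherwise. Strictly speaking this is pseudonymization: the same customer always maps to the same code, so records can still be joined for analysis.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would be stored outside the database.
SECRET_KEY = b"replace-with-a-long-random-secret"

# Hypothetical names of the columns that hold PII.
PII_FIELDS = ("name", "phone")


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 hash (HMAC) of a PII value as a hex string."""
    normalized = value.strip().lower()
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def anonymize_record(record: dict) -> dict:
    """Copy a record, replacing PII fields with hash codes and keeping the rest."""
    return {
        field: pseudonymize(val) if field in PII_FIELDS else val
        for field, val in record.items()
    }


# Fictional sample record, for illustration only.
customer = {"name": "Jane Doe", "phone": "+62 811 0000 000",
            "city": "Jakarta", "spend": 125000}
safe = anonymize_record(customer)
```

After this step, only `safe` would be loaded into the analysis environment, and the original `customer` values would be deleted or kept under separate, restricted access.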

Digital Dark Age is Real

February 2017 | Dios Kurniawan

A few days ago I realized that all my family video tapes, recorded in MiniDV format between 2002 and 2011, had become unreadable. Not because the tapes are defective, but because my MiniDV camcorder refuses to turn on, most likely because it is 15 years old. It is terrifying to suddenly lose years of memories just because I do not have the hardware to play them back. Sure, I can always buy a new camcorder, but MiniDV is an obsolete video format and few manufacturers still produce the hardware today.

Luckily, I had already transferred all the videos to DVD, but the original DV footage, of higher quality than what is stored in DVD’s MPEG-2 format, remains on those tapes, so I am left with a pile of video tapes that are as good as useless.

My video album in MiniDV tapes

This made me realize that the threat of a “Digital Dark Age” is real. The Digital Dark Age refers to a possible dystopian situation in which future generations cannot read the historical records that we store on digital media. It can be compared to the original “Dark Ages” in the Middle Ages, after the fall of the Roman Empire, when most historical records of civilization were lost.

MiniDV is a relatively young digital format, introduced in the mid-1990s, but with the rise of flash storage, tape has slowly faded away from consumer electronics. Imagine the collection of memorable moments, photos, videos and documents you have amassed on tape over the last 20 years becoming unreadable because you did not migrate it to newer technology in time. The Digital Dark Age is looming over our lives.

Another example that the Digital Dark Age is upon us: in 1997, I published a book (see my book here). A few thousand copies were printed and they sold pretty well. Now, 20 years later, I still have a physical copy of my book, but I no longer have the digital copy because the computer I used to write it in 1997 is long gone.

Many digital formats have become obsolete and finally gone extinct. Remember the floppy disk? It used to be the most popular medium for storing computer files in the 1980s. Hardly anybody uses it anymore, but there must be tons of files still sitting on floppy disks that have never been migrated to newer storage media.

The same is true for CD-ROMs, DVDs, hard drives, USB flash disks, and so on. Nobody in this world can guarantee that in 50 to 100 years’ time, someone will still own a device that can read them and extract the information.

My PATA hard disk, CD-ROM and Floppy Disks (remember them?)

Even if the files on legacy digital media can be recovered, there is still a good chance that we no longer have the software to read them in their original format. Those who grew up in the 1980s and early 1990s most likely used PCs to write documents with old word processing software that is no longer in common use. Remember WordStar and WordPerfect? Can we still open those files properly today?

The JPEG format may be the de facto standard for storing digital pictures today, but who can guarantee that the algorithm for decompressing JPEG images will still be known to future generations 100 or 200 years from now?

Cloud storage is also a vulnerability. We are accustomed to storing our photos in Google Photos, Dropbox or Apple iCloud and assume they will be safe there. Are we 100% sure that Google and Apple will still be in business 100 years from now?

Large organizations now rely on Big Data technology to store and process large amounts of data. They put files in the Hadoop Distributed File System (HDFS) using many different compression formats. How can we ensure that the data will still be readable in 20 years?

If we do not take control of the way we store digital data, we risk our grandchildren never being able to read our records. History would be lost forever. I recommend that from now on, all of us make physical copies of our most important photos and documents to preserve them.