August 2017 | Dios Kurniawan
Many organizations are now engaging in big data initiatives, and many projects have started with large investments and a lot of ‘buzz’. Large companies have poured money into big data platforms such as Hadoop, procured expensive software, and hired data scientists. But after a few years, people start asking: how much progress has been made, and are they getting value from their investment?
I do not have the statistics, but I believe only a few meet their initial expectations.
The main challenge for most big data initiatives is extracting insights from the data. Many organizations still have difficulty discovering the valuable data they hold and making it available for analysis. Hadoop, with its perceived low cost of storing data, draws people to simply dump whatever data they have into it. In most people’s minds, Hadoop has become a way to provide cheap storage, which blurs the real goals and objectives of having big data in the first place. The data is in Hadoop, but it is not really searchable and is not stored in a way that lets consumers get value from it easily.
In most organizations, Hadoop remains an ‘IT toy’, meaning the business still depends largely on IT technical staff to get the data. There is a ‘disconnect’ between the IT department and the business units. The business has spurred experiments on big data use cases using many machine learning techniques, but the challenge is how to operationalise these experiments into something that generates valuable business insights in a sustainable manner, which in most cases is still technically difficult.
To ensure success, what many organizations need to build on top of big data is essentially automation.
This automation is a comprehensive set of tools that lets business users, and eventually the consumers, get the data they need at the right time. Automation in big data is something like an e-commerce site: it allows users to search for the products they want, get help in doing so, and have the merchandise delivered to their doorstep. An online shop-style engine, sitting on top of the big data platform, connects the platform to the consumers, much like the recommendation engines usually found in online shops. Examples include self-service business intelligence and self-service analytics. Without this, it is difficult to reap the benefits of a big data investment, and very little would materialize into something meaningful.
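To make the online-shop analogy a little more concrete, here is a minimal sketch in Python of the "search for the product you want" part: a tiny in-memory catalog that lets a business user find datasets by keyword. Everything in it is an assumption made for illustration (the `DatasetEntry` and `DataCatalog` names, the dataset names, and the storage paths are all hypothetical); a real deployment would more likely sit on top of a proper metadata store than on a Python list.

```python
from dataclasses import dataclass


@dataclass
class DatasetEntry:
    """One catalog record describing a dataset on the big data platform."""
    name: str
    location: str        # hypothetical storage path, for illustration only
    description: str
    tags: tuple = ()     # free-form keywords supplied by the data owner


class DataCatalog:
    """A toy, in-memory catalog with keyword search (illustrative only)."""

    def __init__(self):
        self._entries = []

    def register(self, entry):
        """Add a dataset description to the catalog."""
        self._entries.append(entry)

    def search(self, query):
        """Return entries matching any query term, best matches first."""
        terms = query.lower().split()
        scored = []
        for entry in self._entries:
            # Search across the name, description and tags together.
            haystack = " ".join([entry.name, entry.description, *entry.tags]).lower()
            score = sum(1 for term in terms if term in haystack)
            if score > 0:
                scored.append((score, entry))
        # Higher score first; Python's sort is stable, so ties keep
        # their registration order.
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [entry for _, entry in scored]


# Hypothetical datasets, registered by whoever loads the data.
catalog = DataCatalog()
catalog.register(DatasetEntry(
    name="customer_churn_monthly",
    location="/data/marketing/churn",
    description="Monthly churn scores per customer",
    tags=("customer", "churn"),
))
catalog.register(DatasetEntry(
    name="network_traffic_raw",
    location="/data/network/raw",
    description="Raw network traffic logs",
    tags=("network", "logs"),
))

for hit in catalog.search("customer churn"):
    print(hit.name, "->", hit.location)
```

The point of the sketch is only that the data must be described when it is loaded, so that someone outside IT can later find it by asking in business terms.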
Hadoop, the most popular big data platform, is not delivering what it promises, either. Pouring data into HDFS (Hadoop Distributed File System) is quite easy, but getting the data out is still a challenge. Many SQL-on-Hadoop tools exist, but none has gained wide enough adoption to actually compete with true SQL engines such as Oracle and Teradata.
My experience also tells me that data governance has become very challenging, in the sense that data is duplicated here and there, and much of it sits there without actually being used. Many IT organizations have spent months, even years, and millions of dollars integrating big data technology such as Hadoop, and in most cases the effort has been successful. It is successful in the sense that the ‘data lake’, with its cheap storage and abundant processing power, can be provided, but the missing link is an integrated, end-to-end set of systems and procedures that enables a fully operational environment and brings valuable insights to the consumers.