December 2014 | Dios Kurniawan
The relational database has been dominating the way we store our data in the data warehouse for the last 30 years; whatever the data sources you have in your organization, it must be stored neatly in perfect structure, that is, in tables with rows and columns.
Relational databases need schema to be defined in advance before loading the data, you can either choose normalized data model, star schema or other similar models to structure your data. The pitfall is changes afterwards –even the slightest ones- will require significant effort in altering the tables. But things change. In the era of big data technology, relational database may soon be less relevant particularly in data warehousing implementations. Big Data technologies such as Hadoop let us store and analyze massive data of any type without the need to follow a predefined schema structure. And at much lower cost.
Since Dr Codd invented relational database concept in 1970’s, it has grown hugely important in the computing industry that it is even taught as a compulsory course to all computer science students. At the heart of relational concept, the third normal form (3NF) model was largely designed to solve the problem of disk space usage, among other things. The 3NF model promises efficient use of disk space by eliminating redundancy in the data stored on disks. Disk storage was expensive in the 1970s era, and any effort to save storage space such as 3NF would be highly rewarding at that time.
But that was then. Today, disk storage is abundant and cheap. The cost of storing 1TB of data in a Hadoop cluster is now less than $500 (in 1980, a 5MB hard drive cost $1500). It makes much less sense today to design a data warehouse using 3NF because conserving disk usage has now become less of a pressing need. For applications which in nature serve transactional processing, 3NF may still be best fit but for data warehousing and the world of analysis (query, reporting, data mining etc.), there is no absolute need to use 3NF anymore.
As an alternative to 3NF, for years, the concept of star schemawhich was introduced by Dr Ralph Kimball has been regarded as the more acceptable standard method to store information in a data warehouse. Data is stored in fact and dimension tables, also in relational databases. This makes analysis easier for business users as data is organized by subject areas. Similar to 3NF, star schema must be defined for a particular analysis purpose – changes in business definitions would lead to cumbersome task of database modifications. Also similar to 3NF, star schema requires users to use a lot of joins to execute complex data queries.
Today, in the era of big data technology and data science, the preference has shifted to a “flat” data model. This means data is stored as is, or is stored by integrating multiple information into a single, flat table, eliminating the need for table joins. It emphasizes on denormalization, a completely different route from relational model. This is the method usually preferred by data scientists and can easily be implemented in Hadoop. They will create flattened data model and will create huge tables with long records. Yes there will be redundancies and inefficiencies, but disk storage is cheap anyway. Using flat model might as well consume a lot of computing resources, however providing abundant processing power at lower cost is what Hadoop is all about.
The emergence of “schema on read” approach further exaggerates the demise of our dependency on relational model in data warehousing. It allows much flexible way on how the data can be stored and consumed. Simply store the data in Hadoop and start exploring the information inside it. We are no longer stuck in a predefined, rigid schema. But one would ask, what about data integrity? For decades, the ACID (atomicity, consistency, isolation and durability) properties have been the strong points, the bread-and-butter of relational database. Back in 1970-1990s, enterprise data was so “mission-critical”, very important and should never get corrupted.
Relational database system was designed for data consistency and integrity, not allowing a single record to be lost. But today, in the land which is flooded with petabytes of data, it is not economically feasible -and even is not necessary – to keep and to scrutinize every bit of data in our data warehouse. When you have billions of records, losing few thousands records would be quite acceptable and would not make the result of your analysis go significantly erroneous; insight and discoveries can still be obtained. Big Data platforms focus on extracting value from the data straight away, and data scientists are willing to sacrifice consistency for speed and flexibility.
There has been a lot of buzz of Hadoop these days and indisputably Hadoop has changed the landscape of data warehousing industry forever. For the first time, now we have the choice of NOT using relational database for our data warehousing needs. Does it mean the end of relational database in data warehousing? Well, not really. At least not now.
Hadoop indeed promises a lot of good things, yet I would not say that it is the silver bullet to all your data warehousing requirements. There are reports and analysis that are still better served by relational database, such as the ever-important corporate financial reports. The relational database technology is very mature, very well understood and very widely used. Relational database has its own place in the computing world and will still find its way into the data warehousing applications, however Hadoop will certainly dethrone its dominance.