The Future of IBM Netezza

January 2015 | Dios Kurniawan

Last week was pretty busy for IBM support here. The Netezza TwinFin12 machine, purchased in 2013, suddenly refused to work normally. It was still running and its lights were still blinking green, but it was very slow to respond to any user query – if it responded at all. Business operations were affected. After lengthy troubleshooting, the cause was identified as a known bug in the firmware, and as of this writing the machine has been restored to normal operation.

IBM Netezza – now rebranded “IBM PureData System” – is a data warehouse appliance, just like Teradata (which we love and hate): a database specifically designed for data warehousing. At its heart, Netezza is powered by a set of blade servers, each running Intel CPUs alongside a specialized FPGA (field-programmable gate array), a piece of electronic circuitry whose role is to assist the main CPUs.

The FPGA is the secret sauce of the Netezza system; it minimizes data movement by performing much of the filtering and compression/decompression work at the storage level. These FPGAs are housed in S-Blade enclosures, twelve of which are installed in each rack. The blades are controlled by a pair of host servers, which coordinate the execution of user queries. Storage is provided by disk array enclosures totalling 96TB of raw capacity per rack. Netezza has a well-designed architecture and offers a solid data warehousing solution.

How does Netezza compare with other DWH appliances, and what are its real strong points? For most buyers, it is the total cost of ownership. It is sold at significantly lower prices than competitors such as Teradata, most probably because of its use of off-the-shelf components. DBA activities are also simpler on Netezza, which makes hiring a dedicated DBA unnecessary in most situations, again lowering costs. Because of its smart approach of filtering data close to the disks, some say Netezza outperforms Teradata in many cases as well. Netezza has sold well, and today there are hundreds of installations around the world.

The question now is: in today's Big Data era, does Netezza still have a future in the data warehousing business? Hadoop and its genuine cost-saving advantages have been driving people away from buying “enterprise data warehouse” products. Vendor lock-in is the pitfall of the appliance approach: hardware and software must be sourced from the same vendor, and each time users want to expand their data warehouses they have to go back to that vendor. To make things worse, older models often cannot co-exist with the new ones, rendering the old hardware useless.

Hadoop, on the other hand, allows users to purchase commodity servers from any manufacturer, and old and new machines can always co-exist. Hadoop runs open-source software on cheap commodity servers, saving money while offering a more flexible solution. Hey, when company owners hear you say ‘saving money’, they start listening. That is why Hadoop is grabbing so much attention nowadays.

Proponents will say that IBM is now expanding its PureData product line to include Hadoop integration, which makes it more technically capable. The economics, however, will eventually prevail: businesses will simply choose Hadoop platforms for their huge price/TB advantage over traditional DWH appliances. Even large companies that still require a traditional data warehouse appliance and are willing to accept vendor lock-in will likely play it safe and buy from lower-risk vendors such as Teradata and Oracle, which offer longer-established, more mature products. For the record, Netezza is only 10-12 years old, while Oracle and Teradata have been in the market for 30+ years.

Positioned in this less advantageous place, the IBM Netezza/PureData System will be much less attractive to businesses in the days to come. I predict that IBM will have a hard time selling Netezza and that sales revenue will hardly be enough to finance further R&D. Netezza will sooner or later lose the market and eventually go the way of the Dodo.

Do We Still Need Relational Database in Big Data Era?

December 2014 | Dios Kurniawan

The relational database has dominated the way we store data in the data warehouse for the last 30 years: whatever data sources you have in your organization, the data must be stored neatly in a perfect structure, that is, in tables with rows and columns.

Relational databases need a schema to be defined in advance, before any data is loaded; you can choose a normalized data model, a star schema or a similar model to structure your data. The pitfall is that changes afterwards – even the slightest ones – require significant effort to alter the tables. But things change. In the era of big data technology, the relational database may soon be less relevant, particularly in data warehousing implementations. Big Data technologies such as Hadoop let us store and analyze massive data of any type without following a predefined schema structure – and at a much lower cost.
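
To illustrate the point, here is a minimal sketch in Python, using SQLite as a stand-in for any relational engine; the table and column names are made up for the example:

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# The schema must be declared before a single row can be loaded.
cur.execute("""
    CREATE TABLE sales (
        sale_id     INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        amount      REAL    NOT NULL,
        sale_date   TEXT    NOT NULL
    )
""")
cur.execute("INSERT INTO sales VALUES (1, 101, 250.0, '2014-12-01')")

# A new business requirement (say, tracking the sales channel) forces a
# schema change; on a large warehouse table this can be a costly operation.
cur.execute("ALTER TABLE sales ADD COLUMN channel TEXT")
conn.commit()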

Since Dr Codd invented the relational database concept in the 1970s, it has grown so important in the computing industry that it is taught as a compulsory course to all computer science students. At the heart of the relational concept, the third normal form (3NF) model was largely designed to solve the problem of disk space usage, among other things. The 3NF model promises efficient use of disk space by eliminating redundancy in the data stored on disk. Disk storage was expensive in the 1970s, and any effort to save space, such as 3NF, was highly rewarding at the time.
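
A tiny, made-up illustration of what 3NF buys you, shown here with plain Python structures: the same orders stored with and without the repeated customer attributes.

# Denormalized: the customer's name and city are repeated on every order.
orders_denormalized = [
    {"order_id": 1, "customer": "Acme", "city": "Jakarta", "amount": 100},
    {"order_id": 2, "customer": "Acme", "city": "Jakarta", "amount": 250},
    {"order_id": 3, "customer": "Beta", "city": "Bandung", "amount": 75},
]

# Normalized (3NF-style): customer attributes are stored once and
# referenced by key, so the repeated strings disappear from disk.
customers = {
    101: {"customer": "Acme", "city": "Jakarta"},
    102: {"customer": "Beta", "city": "Bandung"},
}
orders = [
    {"order_id": 1, "customer_id": 101, "amount": 100},
    {"order_id": 2, "customer_id": 101, "amount": 250},
    {"order_id": 3, "customer_id": 102, "amount": 75},
]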

But that was then. Today, disk storage is abundant and cheap. The cost of storing 1TB of data in a Hadoop cluster is now less than $500 (in 1980, a 5MB hard drive cost $1,500). It makes much less sense today to design a data warehouse around 3NF because conserving disk space is no longer a pressing need. For applications that are transactional by nature, 3NF may still be the best fit, but for data warehousing and the world of analysis (query, reporting, data mining and so on), there is no absolute need for 3NF anymore.
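
A rough back-of-the-envelope comparison of the two figures quoted above, just to make the gap concrete:

# About $500 per TB on a Hadoop cluster today, versus $1,500 for a 5MB drive in 1980.
hadoop_cost_per_gb = 500 / 1024        # roughly $0.49 per GB
drive_1980_cost_per_gb = 1500 / (5 / 1024)   # roughly $307,200 per GB

print(f"Hadoop storage : ${hadoop_cost_per_gb:,.2f} per GB")
print(f"1980 hard drive: ${drive_1980_cost_per_gb:,.0f} per GB")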

As an alternative to 3NF, the star schema concept introduced by Dr Ralph Kimball has for years been regarded as the more widely accepted method of storing information in a data warehouse. Data is stored in fact and dimension tables, also in relational databases. This makes analysis easier for business users because data is organized by subject area. Similar to 3NF, a star schema must be defined for a particular analysis purpose – changes in business definitions lead to the cumbersome task of modifying the database. Also similar to 3NF, a star schema requires users to write a lot of joins to run complex queries.
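
Here is a minimal star-schema sketch, again in Python with SQLite; the fact and dimension tables and their names are invented purely for illustration:

import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
    CREATE TABLE dim_customer (
        customer_key  INTEGER PRIMARY KEY,
        customer_name TEXT,
        region        TEXT
    );
    CREATE TABLE fact_sales (
        sale_id      INTEGER PRIMARY KEY,
        customer_key INTEGER REFERENCES dim_customer(customer_key),
        amount       REAL
    );
    INSERT INTO dim_customer VALUES (1, 'Acme', 'West'), (2, 'Beta', 'East');
    INSERT INTO fact_sales VALUES (10, 1, 100.0), (11, 1, 250.0), (12, 2, 75.0);
""")

# Even a simple "sales by region" question requires joining fact to dimension.
for region, total in cur.execute("""
    SELECT d.region, SUM(f.amount)
    FROM fact_sales f
    JOIN dim_customer d ON d.customer_key = f.customer_key
    GROUP BY d.region
"""):
    print(region, total)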

Today, in the era of big data technology and data science, the preference has shifted to a “flat” data model. Data is stored as-is, or multiple pieces of information are integrated into a single, flat table, eliminating the need for table joins. The approach emphasizes denormalization, a completely different route from the relational model. This is the method usually preferred by data scientists, and it can easily be implemented in Hadoop: they create flattened data models with huge tables and long records. Yes, there will be redundancy and inefficiency, but disk storage is cheap anyway. A flat model may also consume a lot of computing resources, but providing abundant processing power at lower cost is exactly what Hadoop is about.
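
A small sketch of the flat approach with made-up data: the dimension attributes are folded into every record once, and the analysis afterwards is a simple scan of a single wide table with no joins.

customers = {1: {"customer_name": "Acme", "region": "West"},
             2: {"customer_name": "Beta", "region": "East"}}
sales = [{"sale_id": 10, "customer_key": 1, "amount": 100.0},
         {"sale_id": 11, "customer_key": 1, "amount": 250.0},
         {"sale_id": 12, "customer_key": 2, "amount": 75.0}]

# One wide, denormalized table: redundant strings, but no joins later.
flat_sales = [{**sale, **customers[sale["customer_key"]]} for sale in sales]

# "Sales by region" is now just a scan over the flat table.
totals = {}
for row in flat_sales:
    totals[row["region"]] = totals.get(row["region"], 0) + row["amount"]
print(totals)   # {'West': 350.0, 'East': 75.0}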

The emergence of the “schema on read” approach further accelerates the decline of our dependence on the relational model in data warehousing. It allows a much more flexible way for data to be stored and consumed: simply store the data in Hadoop and start exploring the information inside it. We are no longer stuck with a predefined, rigid schema. But one might ask: what about data integrity? For decades, the ACID properties (atomicity, consistency, isolation and durability) have been the strong point, the bread and butter, of the relational database. Back in the 1970s-1990s, enterprise data was “mission-critical”: very important, and it should never get corrupted.
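
A small schema-on-read sketch with invented records: raw JSON lines are stored as-is, and structure is imposed only at the moment the data is read.

import json

raw_lines = [
    '{"user": "andi", "event": "login", "ts": "2014-12-01T10:00:00"}',
    '{"user": "budi", "event": "purchase", "amount": 250000}',
    '{"user": "andi", "event": "logout"}',
]

# No table was defined up front; each record may carry different fields,
# and the "schema" is whatever the reading code chooses to look at.
for line in raw_lines:
    record = json.loads(line)
    print(record.get("user"), record.get("event"), record.get("amount", "-"))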

The relational database system was designed for data consistency and integrity, never allowing a single record to be lost. But today, in a land flooded with petabytes of data, it is not economically feasible – and often not even necessary – to keep and scrutinize every bit of data in our data warehouse. When you have billions of records, losing a few thousand is quite acceptable and will not make the result of your analysis significantly erroneous; insights and discoveries can still be obtained. Big Data platforms focus on extracting value from the data straight away, and data scientists are willing to sacrifice consistency for speed and flexibility.

There has been a lot of buzz around Hadoop these days, and indisputably Hadoop has changed the landscape of the data warehousing industry forever. For the first time, we have the choice of NOT using a relational database for our data warehousing needs. Does that mean the end of the relational database in data warehousing? Well, not really. At least not now.

Hadoop indeed promises a lot of good things, yet I would not say it is the silver bullet for all your data warehousing requirements. There are reports and analyses that are still better served by a relational database, such as the ever-important corporate financial reports. Relational database technology is very mature, very well understood and very widely used. The relational database has its own place in the computing world and will still find its way into data warehousing applications; however, Hadoop will certainly dethrone its dominance.