Digital Dark Age is Real

February 2017 | Dios Kurniawan

A few days ago I realized that all my family video tapes, stored in MiniDV format from 2002 to 2011, were simply unreadable. Not because the tapes are defective, but because my MiniDV camcorder refuses to turn on, most likely because it is more than ten years old. It is terrifying to see that I have suddenly lost years of memories just because I no longer have the hardware to play them back. Sure, I could always buy a new camcorder, but MiniDV is an obsolete video format and few manufacturers still produce the hardware today. Luckily I had transferred all the videos to DVD discs, but the original DV video, which is of higher quality than the MPEG-2 stored on DVD, remains on those tapes, so I am left with a pile of video tapes that are as good as useless.

My video album in MiniDV tapes

This made me realize that the threat of a “Digital Dark Age” is real. The Digital Dark Age refers to a possible dystopian situation in which future generations cannot read the historical records we store in digital media. It can be compared to the first “Dark Age” in the Middle Ages, after the fall of the Roman Empire, when most records of civilization’s history were lost.

MiniDV is a relatively young digital format, introduced in the late 1990s, but with the rise of flash storage, tape technology has slowly faded from consumer electronics. Imagine that the collection of memorable moments, photos, videos and documents you have amassed on tape over the last 20 years would become unreadable if you do not quickly migrate it to newer technology. The Digital Dark Age is looming over our lives.

Another example that the Digital Dark Age is upon us: in 1997, I published a book (see my book here); a few thousand copies were printed and they sold pretty well. Now, 20 years later, I still have a physical copy of my book, but I no longer have the digital copy, because the computer I used in 1997 is gone forever.

Many digital formats have become obsolete and finally gone extinct. Remember the floppy disk? It used to be the most popular medium for storing computer files in the 1980s. Nobody uses it anymore, but there must be tons of files still stored on floppy disks that have never been migrated to newer storage media.

The same is true for CD-ROMs, DVDs, hard drives, USB flash disks and so on. Not one person in this world can guarantee that in 50-100 years’ time someone will still own a device that can read them and extract the information.

My PATA hard disk, CD-ROM and Floppy Disks (remember them?)

Even if the files on legacy digital media can be recovered, there is still a good chance that we no longer possess the software to read them in their original format. Those of us who were raised in the 1980s and early 1990s most likely wrote documents on PCs using old word processing software that no longer exists. Remember WordStar and WordPerfect? Can we still open those files properly today?

JPEG may be the de facto standard for storing digital pictures today, but who can guarantee that the algorithm to decompress JPEG images will still be known by future generations 100 or 200 years from now?
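
One way to hedge against that risk, at least for your own photo archive, is to decode the images while JPEG decoders are still everywhere and keep a losslessly stored copy alongside the originals. Below is a minimal sketch in Python, assuming the Pillow library is installed; the photos and archive folder names are placeholders used only for illustration.

# A minimal sketch, assuming the Pillow library; the "photos" and "archive"
# folders are hypothetical and only used for illustration.
from pathlib import Path
from PIL import Image

SRC = Path("photos")    # folder holding the original JPEG files
DST = Path("archive")   # folder for the decoded, losslessly stored copies
DST.mkdir(exist_ok=True)

for jpeg_file in SRC.glob("*.jpg"):
    with Image.open(jpeg_file) as img:
        # Decode now, while JPEG decoders are still ubiquitous; PNG keeps
        # the decoded pixels with lossless compression.
        img.save(DST / (jpeg_file.stem + ".png"), format="PNG")

This is only a hedge, of course: the copies now depend on PNG instead, but at least the pictures survive in two independent formats.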

Cloud storage is also a vulnerability. We have grown accustomed to storing our photos in Google Photos, Dropbox or Apple iCloud and assume they will be safe there. Are we 100% sure that Google and Apple will still be in business 100 years from now?

Large organizations now rely on Big Data technology to store and process data in large amounts. They put files into the Hadoop Distributed File System in many different compression formats. How can we ensure that in 20 years the data will still be readable?
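
One defensive habit is to periodically rewrite such data into a widely documented, self-describing format. The sketch below assumes PySpark and uses hypothetical HDFS paths; it reads a compressed text dataset and rewrites it as Parquet, a format that carries its own schema with the files.

# A minimal sketch, assuming PySpark; the HDFS paths are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("format-migration").getOrCreate()

# Spark picks the decompression codec from the file extension (e.g. .gz).
legacy = spark.read.csv("hdfs:///data/legacy/events.csv.gz",
                        header=True, inferSchema=True)

# Parquet embeds the schema and is supported by many engines, which makes
# the data easier to read back years from now.
legacy.write.mode("overwrite").parquet("hdfs:///data/curated/events.parquet")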

If we do not take control of the way we store digital data, we risk the possibility that our grandchildren will never be able to read our records. History would be lost forever. I recommend that, from now on, all of us make physical copies of our most important photos and documents to preserve them.

The Future of IBM Netezza

January 2015 | Dios Kurniawan

Last week was pretty busy for IBM support here. The Netezza TwinFin12 machine, purchased in 2013, suddenly refused to work normally. It was still running, lights still blinking green, but somehow it was very slow to respond to any user query, if it responded at all. Business operations were affected. After lengthy troubleshooting, the cause was determined to be a known bug in the firmware, and as of the time of writing, the machine has been restored to normal operation.

IBM Netezza, now rebranded as “IBM PureData System”, is a data warehouse appliance, just like Teradata (which we love and hate). It is a database specifically designed for data warehousing. At its heart, Netezza is powered by a set of blade servers, each running Intel CPUs and a specialized FPGA (field-programmable gate array), an electronic circuit whose role is to offload work from the main CPUs.

The FPGA is the secret sauce of the Netezza system; it minimizes data movement by performing much of the filtering and compression/decompression work at the storage level. These FPGAs are housed in S-Blade enclosures (12 of them are installed in each rack). The blades are controlled by a pair of host servers, which coordinate the execution of user queries. Storage is provided by disk array enclosures totalling 96TB of raw capacity per rack. Netezza is a well-designed architecture and offers a solid data warehousing solution.
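
To get a feel for why filtering close to the disks matters, here is a toy Python illustration (not Netezza code, and no FPGA involved): it contrasts shipping every row to the host before filtering with applying the predicate where the data lives, so that only matching rows are moved.

# A toy illustration of "filter at the host" versus "filter close to storage".
# This is not Netezza code; it only mimics the idea of FPGA-side filtering.
rows = [{"id": i, "amount": i % 100} for i in range(100_000)]

def scan_storage():
    """Pretend this streams rows straight off the disk enclosures."""
    yield from rows

# Naive approach: move every row to the host, then filter there.
moved_to_host = list(scan_storage())              # 100,000 rows transferred
result_host = [r for r in moved_to_host if r["amount"] > 95]

# Netezza-style approach: apply the predicate at the storage layer, so only
# matching rows (about 4% here) ever travel to the host CPUs.
result_storage = [r for r in scan_storage() if r["amount"] > 95]

assert result_host == result_storage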

How does Netezza compare with other DWH appliances, and what are its actual strong points? For most buyers, it is the total cost of ownership. Netezza is sold at significantly lower prices than competitors such as Teradata, most probably because it uses many off-the-shelf components. DBA activities are also simpler on Netezza, which makes hiring a dedicated DBA unnecessary in most situations, again lowering costs. Because of its smart approach of filtering data close to the disks, some say Netezza also outperforms Teradata in many cases. Netezza has sold well, and today there are hundreds of installations around the world.

The question now is: in the Big Data era, does Netezza still have a future in the data warehousing business? Hadoop and its genuine cost-saving advantages have been driving people away from buying “enterprise data warehouse” products. Vendor lock-in is the pitfall of the appliance approach. Hardware and software must be sourced from the same vendor, and each time users want to expand their data warehouse they have to go back to that vendor. To make things worse, older models often cannot co-exist with the new ones, rendering the old hardware useless.

Hadoop, on the other hand, allows users to purchase commodity servers from any manufacturer, and old and new hardware can always co-exist. Hadoop runs open-source software on cheap commodity servers, which saves money and at the same time offers a more flexible solution. Hey, when company owners hear you say “saving money”, they start listening. That is why Hadoop is grabbing so much attention nowadays.

Proponents would say that IBM is now expanding its PureData product line to include Hadoop integration, which makes it more technically capable. However, the economics will eventually prevail: businesses will simply select Hadoop platforms for their huge price-per-TB advantage over traditional DWH appliances. Even large companies that still require a traditional data warehouse appliance and are willing to accept vendor lock-in will likely play it safe; they will buy from the lowest-risk vendors, such as Teradata and Oracle, which offer longer-established, more mature products. FYI, Netezza is only 10-12 years old, while Oracle and Teradata have been in the market for 30+ years.

Positioned at this disadvantage, IBM Netezza/PureData System will become much less attractive to businesses in the days to come. My prediction is that IBM will have a hard time selling Netezza, and sales revenue will hardly be enough to finance further R&D. Netezza will sooner or later lose the market and eventually go the way of the dodo.

Do We Still Need Relational Database in Big Data Era?

December 2014 | Dios Kurniawan

The relational database has dominated the way we store data in the data warehouse for the last 30 years; whatever data sources you have in your organization, the data must be stored neatly in a perfect structure, that is, in tables with rows and columns.

Relational databases require a schema to be defined before the data is loaded; you can choose a normalized data model, a star schema or a similar model to structure your data. The pitfall is that changes afterwards, even the slightest ones, require significant effort to alter the tables. But things change. In the era of big data, the relational database may soon become less relevant, particularly in data warehousing. Big Data technologies such as Hadoop let us store and analyze massive amounts of data of any type without following a predefined schema, and at much lower cost.
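
As a tiny illustration of that schema-first workflow, here is a sketch using Python’s built-in sqlite3 module, with made-up table and column names: the table must be declared, with all of its columns, before a single row can be loaded, and a later change means an explicit ALTER TABLE plus a backfill.

# A minimal sketch of "schema on write", using Python's built-in sqlite3.
# The table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema has to be declared before any data can be loaded.
conn.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        city        TEXT
    )
""")
conn.execute("INSERT INTO customer VALUES (1, 'Andi', 'Jakarta')")

# A later business change (say, tracking customer segments) means altering
# the table and backfilling existing rows; on a warehouse table with billions
# of rows, that is the "significant effort" mentioned above.
conn.execute("ALTER TABLE customer ADD COLUMN segment TEXT")
conn.execute("UPDATE customer SET segment = 'retail' WHERE segment IS NULL")
conn.commit()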

Since Dr Codd invented the relational database concept in the 1970s, it has grown so important in the computing industry that it is even taught as a compulsory course to all computer science students. At the heart of the relational concept, the third normal form (3NF) model was largely designed to solve, among other things, the problem of disk space usage. The 3NF model promises efficient use of disk space by eliminating redundancy in the data stored on disk. Disk storage was expensive in the 1970s, and any effort to save space, such as 3NF, was highly rewarding at the time.

But that was then. Today, disk storage is abundant and cheap. The cost of storing 1TB of data in a Hadoop cluster is now less than $500; in 1980, a 5MB hard drive cost $1,500, which works out to roughly $300 per megabyte, or on the order of $300 million per terabyte. It makes much less sense today to design a data warehouse around 3NF, because conserving disk space is no longer a pressing need. For applications that are transactional in nature, 3NF may still be the best fit, but for data warehousing and the world of analysis (query, reporting, data mining and so on), there is no absolute need for 3NF anymore.

As an alternative to 3NF, the star schema introduced by Dr Ralph Kimball has for years been regarded as the more widely accepted method of storing information in a data warehouse. Data is stored in fact and dimension tables, still in a relational database. This makes analysis easier for business users because data is organized by subject area. Similar to 3NF, a star schema must be defined for a particular analysis purpose; changes in business definitions lead to the cumbersome task of modifying the database. Also similar to 3NF, a star schema requires users to write a lot of joins to run complex queries.
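
Here is a sketch of what that looks like in practice, using PySpark with made-up fact and dimension tables: even a simple revenue-by-city-and-year question needs the fact table joined back to its dimensions at query time.

# A minimal star-schema sketch in PySpark; tables and columns are hypothetical.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("star-schema").getOrCreate()

# Fact table: one row per sale, holding only keys and measures.
fact_sales = spark.createDataFrame(
    [(1, 101, 201, 50000.0), (2, 102, 201, 75000.0)],
    ["sale_id", "customer_key", "date_key", "amount"])

# Dimension tables: descriptive attributes, organized by subject area.
dim_customer = spark.createDataFrame(
    [(101, "Andi", "Jakarta"), (102, "Budi", "Bandung")],
    ["customer_key", "name", "city"])
dim_date = spark.createDataFrame([(201, 2014, 12)], ["date_key", "year", "month"])

# Even a simple question needs joins back to the dimensions at query time.
revenue = (fact_sales
           .join(dim_customer, "customer_key")
           .join(dim_date, "date_key")
           .groupBy("city", "year")
           .agg(F.sum("amount").alias("revenue")))
revenue.show()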

Today, in the era of big data and data science, the preference has shifted to a “flat” data model. Data is stored as is, or multiple pieces of information are integrated into a single, flat table, eliminating the need for joins. The emphasis is on denormalization, a completely different route from the relational model. This is the method usually preferred by data scientists, and it can easily be implemented in Hadoop: they build flattened data models and create huge tables with long records. Yes, there will be redundancy and inefficiency, but disk storage is cheap anyway. A flat model may also consume a lot of computing resources, but providing abundant processing power at lower cost is exactly what Hadoop is about.
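
Continuing the hypothetical star-schema tables above, the flat-model approach pays the join cost once, at load time, and persists one wide, denormalized table that can afterwards be queried without any joins. The HDFS paths below are, again, placeholders.

# A minimal sketch of the "flat" model in PySpark; the inputs are the
# hypothetical star-schema tables above, re-read here from made-up HDFS paths.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("flat-model").getOrCreate()

fact_sales = spark.read.parquet("hdfs:///dwh/fact_sales")
dim_customer = spark.read.parquet("hdfs:///dwh/dim_customer")
dim_date = spark.read.parquet("hdfs:///dwh/dim_date")

# Pay the join cost once and persist a single wide table. City and year are
# repeated on every row; the redundancy is accepted because cluster disk
# space is cheap.
flat_sales = (fact_sales
              .join(dim_customer, "customer_key")
              .join(dim_date, "date_key"))
flat_sales.write.mode("overwrite").parquet("hdfs:///dwh/flat_sales")

# Later analysis reads one table, with no joins at all.
spark.read.parquet("hdfs:///dwh/flat_sales").groupBy("city", "year").sum("amount").show()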

The emergence of the “schema on read” approach further hastens the decline of our dependence on the relational model in data warehousing. It allows a much more flexible way for data to be stored and consumed: simply put the data in Hadoop and start exploring the information inside it. We are no longer stuck with a predefined, rigid schema.
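
A sketch of schema on read, again in PySpark with a placeholder HDFS path: the raw files are simply dropped into Hadoop as they arrive, and a schema is derived only at the moment somebody reads them.

# A minimal "schema on read" sketch in PySpark; the path and the event_type
# field are hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# No CREATE TABLE and no predefined columns: the raw JSON files were simply
# copied into HDFS as they arrived.
events = spark.read.json("hdfs:///raw/clickstream/2014/12/")

# The schema is inferred at read time and can evolve as the files evolve.
events.printSchema()
events.filter(events.event_type == "purchase").count()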

But one might ask: what about data integrity? For decades, the ACID properties (atomicity, consistency, isolation and durability) have been the strong point, the bread and butter, of the relational database. Back in the 1970s through the 1990s, enterprise data was “mission-critical”, far too important ever to be corrupted. Relational database systems were designed for data consistency and integrity, never allowing a single record to be lost. But today, in a land flooded with petabytes of data, it is not economically feasible, and often not even necessary, to keep and scrutinize every bit of data in our data warehouse. When you have billions of records, losing a few thousand is quite acceptable and will not make the result of your analysis significantly erroneous; insights and discoveries can still be obtained. Big Data platforms focus on extracting value from the data straight away, and data scientists are willing to sacrifice consistency for speed and flexibility.

There has been a lot of buzz around Hadoop these days, and indisputably Hadoop has changed the landscape of the data warehousing industry forever. For the first time, we have the choice of NOT using a relational database for our data warehousing needs. Does this mean the end of the relational database in data warehousing? Well, not really. At least not yet.

Hadoop indeed promises a lot of good things, yet I would not say it is the silver bullet for all your data warehousing requirements. There are reports and analyses that are still better served by a relational database, such as the ever-important corporate financial reports. Relational database technology is very mature, very well understood and very widely used. The relational database has its own place in the computing world and will still find its way into data warehousing applications; however, Hadoop will certainly dethrone its dominance.