Benoit Dageville

Oct 24, 2014

In an industry where change is the standard currency, it’s remarkable how little progress data warehousing has made in the past decade. During that time there have been major changes–from the emergence of the Cloud to the current Big Data explosion. And yet databases–and data warehouses–are stuck in the last century.

If you were to build a database for data warehousing from scratch today, what would it look like? Here are the key principles it would need to address:

  • First of all, users–not data–should be the focus. Users should only have to put their data in and run queries to get value out; the system would do the rest and make this happen really fast.
  • It should be able to store all the data you want. It should provide unlimited storage capacity at such a low cost that no one would ever have to think again about throwing out data.
  • It should be designed and optimized from the ground up to store and efficiently process any data in any shape, from purely relational structures like CSV to semi-structured formats such as JSON, Avro, and XML.
  • It should deliver quick and easy access to all the relevant data inside and outside your organization.
  • It should be truly elastic–able to grow, shrink, and evolve its storage and compute resources, as well as its capacity to support concurrent users, within minutes to adapt to any processing demand, even going all the way back to zero when no queries are running. That elasticity is critical to enabling you to scale up or down on the fly so that you can run diverse workloads concurrently without having them compete for resources.
  • Finally, the dream warehouse would always be available: no downtime, no data loss, fully accessible from anywhere, fully secure. All that with nothing to do on the administrator’s or user’s part: it would just happen.

Given those requirements, how do we get there? Traditional databases aren’t the answer: they’re simply too far behind the times to catch up. They’re too inflexible to handle new types of data and use cases, incredibly complex to manage, neither efficient nor elastic, and just too expensive to handle the data explosion. Evolving or modifying current technology won’t work–revolutionary change is needed.

Many people hoped that Hadoop would be that revolution. By using “free” software and commodity hardware, it allowed easy and relatively cost-efficient storage as well as processing of vast amounts of data. But “free” comes with huge costs. Hadoop systems are often orders of magnitude less efficient than traditional warehouse systems. The interfaces are geared towards data specialists, leaving millions of users behind. And Hadoop is not a product; it’s an ecosystem, meaning it is both very complex and very expensive. And while it is more flexible, it is still restricted by the hardware that you use.

The Cloud Is the Only Logical Solution

The cloud is the only computing platform that can produce the ‘ideal data warehouse.’ The cloud is more than just a different way to get hardware resources. It makes virtually infinite storage and compute resources available on demand, and it frees users from all software and infrastructure management tasks. That provides the essential foundation needed to build truly elastic software and provide it as a service. But to fully leverage the cloud’s amazing capabilities, software needs to be reinvented and built from the ground up.