Posted by Jon Bock
Mar 31, 2016

Spending time this week at the Strata Hadoop conference in San Jose, it’s hard to take more than a few steps in any direction without seeing or hearing people talk about a noSQL platform like Hadoop or Spark. Over the last several years, those discussions have often led to debates on whether “no SQL” or “not only SQL” or “SQL on noSQL” was the next wave that would replace SQL systems. Sometimes those debates have become pretty passionate—I can’t say whether chairs have been thrown or fights have ensued, but there have definitely been some strong opinions. Those debates and discussions often morphed into complicated projects to try to build a “Hadoop data warehouse” or a “Spark data warehouse”.

However, I’m seeing signs that the debate is over. Instead of a debate on SQL vs. noSQL or on data warehousing versus a noSQL data lake, people are moving past expansive visions of what could possibly be built to a more grounded reality. The previous debates may have been fun to watch, but it’s refreshing to see a focus on the actual problems people need to solve rather than abstract debates on pure technology.

What’s driving that change?

Ultimately, it’s being driven by practical realities. To realize the full potential of noSQL platforms, they need to be integrated into an organization’s broader data strategy and infrastructure, not just left as a siloed project accessible to or bottlenecked on a small set of technology experts who understand MapReduce, Scala, distributed parallel programming, and Hadoop operations.

Integrating those systems into the data infrastructure is a much broader question than whether the query language is SQL or not. The requirements, many of them familiar from data warehousing projects, include:

  • How do you make data available to current tools and users (many of which speak SQL, by the way)? A minimal sketch of this follows the list.
  • What do you do to create and manage metadata?
  • How do you plan for and deploy the capacity and horsepower necessary to meet demand?
  • How do you optimize performance and scale?
  • How do you handle security?
  • What do you do to ensure and monitor availability of data and access to data?
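
To make that first question concrete, here is a minimal sketch of one common way to give SQL tools and users access to data sitting in a Hadoop-style data lake: register the files with Spark SQL and query them with plain SQL. The path, table name, and columns are hypothetical, and this covers only SQL access, not metadata, security, or operations.

```scala
// Minimal sketch: exposing data-lake files to SQL users via Spark SQL.
// The path, view name, and columns below are placeholders for illustration.
import org.apache.spark.sql.SparkSession

object SqlAccessSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sql-access-sketch")
      .getOrCreate()

    // Read raw event data that already lives in the data lake.
    val events = spark.read.parquet("hdfs:///data/tracking_events")

    // Register it so anything that speaks SQL (for example a BI tool
    // connected through the Spark Thrift server) can query it.
    events.createOrReplaceTempView("tracking_events")

    spark.sql(
      """SELECT event_date, campaign_id, COUNT(*) AS impressions
        |FROM tracking_events
        |GROUP BY event_date, campaign_id""".stripMargin
    ).show()

    spark.stop()
  }
}
```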

As people started going down the path of trying to build a Hadoop data warehouse or a Spark data warehouse, the overwhelming complexity of that approach started to become apparent. Stitching together all the different pieces needed to satisfy critical requirements takes a lot of duct tape, baling wire, and elbow grease. Just figuring out how to support robust SQL access to the data is a non-trivial task, let alone building the security, availability, and operational frameworks and processes around it.

The reality that sets in is that it takes a lot of time, effort, and distraction just to rebuild a wheel that has already been built, in particular by Snowflake. Unless your core business expertise is in building large-scale, enterprise-capable, distributed database platforms, trying to build that yourself isn't the right choice. Instead, taking advantage of a data warehouse as a service created by leading experts in doing just that allows you to focus on where your core expertise and differentiation reside. If your core expertise is in understanding your data and how to analyze it, using Hadoop or Spark for the specialized algorithms and machine learning, and combining them with a data warehouse service that handles reporting, analytics, and more, gets you a faster, simpler path to data insights across your organization.

Case in Point: Spark + an Elastic Data Warehouse

At a session yesterday here at Strata, there was a great example of this approach. Celtra, which makes software to help companies create compelling digital advertising content, spoke about how their data pipeline has evolved over time.


Like many growing companies, they were outgrowing their initial data pipeline implementation and needed to change. They had started using Spark to transform the tracking event data that fed the dashboards, ad hoc queries, and applications served out of a MySQL database. However, they needed to evolve their approach to simplify their development cycles and enable faster experimentation. After investigating a number of possible routes, Celtra realized that they needed something like a data warehouse to support many of their needs.

That led Celtra to Snowflake. By combining Spark with Snowflake's Elastic Data Warehouse, they now have a data pipeline that still gives them the ability to do complex custom processing of data in Spark but also supports the needs of reporting and analytics users across the company. They got the best of both worlds, SQL and noSQL, by using them together.
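
The talk didn't walk through code, but a rough sketch of the pattern described here, Spark for custom transformation and a warehouse for everyone who speaks SQL, might look something like the following. It assumes the Snowflake Spark connector is on the classpath, and every name, path, and credential below is a placeholder for illustration rather than anything from Celtra's actual pipeline.

```scala
// Hypothetical sketch of the "Spark for transformation, warehouse for SQL" split.
// All bucket names, columns, tables, and credentials are placeholders.
import org.apache.spark.sql.{SaveMode, SparkSession}

object PipelineSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("events-to-warehouse-sketch")
      .getOrCreate()
    import spark.implicits._

    // 1. Complex, custom transformation stays in Spark.
    val sessions = spark.read.json("s3a://example-bucket/raw_tracking_events/")
      .filter($"event_type".isNotNull)
      .groupBy($"session_id", $"creative_id")
      .count()

    // 2. The cleaned result lands in the warehouse, where reporting and
    //    ad hoc SQL users can reach it with their existing tools.
    val sfOptions = Map(
      "sfURL"       -> "<account>.snowflakecomputing.com",
      "sfUser"      -> "<user>",
      "sfPassword"  -> "<password>",
      "sfDatabase"  -> "ANALYTICS",
      "sfSchema"    -> "PUBLIC",
      "sfWarehouse" -> "REPORTING_WH"
    )

    sessions.write
      .format("net.snowflake.spark.snowflake")
      .options(sfOptions)
      .option("dbtable", "SESSION_COUNTS")
      .mode(SaveMode.Overwrite)
      .save()

    spark.stop()
  }
}
```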

Stay tuned for an upcoming webcast where Celtra will share more about what they did and why it was the right solution for their needs.