Feb 1, 2016
It’s an accepted fact that data users spend far too much of their time waiting for access to the data they need. The rapid growth in the diversity, number, and volume of non-relational data sources has made that problem even worse. It was bad enough for the ETL process to be a significant bottleneck in the data pipeline when data was mostly relational and well-defined, but it’s an order of magnitude more painful when diverse data is streaming in from multiple sources.
One of the big challenges in the data pipeline has been handling non-relational data efficiently. This is data that doesn’t arrive in neatly organized and consistent rows and columns. Probably the most common form is what is often called semi-structured data: data in formats such as JSON, Avro, or XML that does have some form of structure, but that can have a flexible schema, hierarchies, and nesting. “Machine data,” “log data,” and “application data” are all terms that often refer to data of this type. This data doesn’t fit cleanly into conventional databases as is; it has to be transformed into something that can be put into a database for fast querying. In many cases, people have turned to NoSQL systems as a place to land and transform that data. However, adding another system can add complexity and delay to the data pipeline.
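To make the shape problem concrete, here is a minimal Python sketch of the kind of transformation semi-structured data usually needs before it fits into rows and columns. The two sample events and all of their field names are invented for illustration, and the dotted column naming is just one possible convention; this is not DoubleDown’s or Snowflake’s actual format or method.

```python
import json

# Two hypothetical game events; every field name here is invented for illustration.
RAW_EVENTS = [
    '{"user": 1, "event": "spin", "bet": {"amount": 50, "currency": "chips"}}',
    '{"user": 2, "event": "login", "device": {"os": "iOS", "screen": {"w": 750, "h": 1334}}}',
]


def flatten(record, prefix=""):
    """Collapse nested objects into a single level of dotted column names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, carrying the parent key as a prefix.
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


def to_table(raw_lines):
    """Parse JSON lines and pad every row to the union of all columns seen."""
    rows = [flatten(json.loads(line)) for line in raw_lines]
    columns = sorted({col for row in rows for col in row})
    # Fields a record does not have become None (a NULL in database terms).
    return columns, [[row.get(col) for col in columns] for row in rows]


columns, table = to_table(RAW_EVENTS)
print(columns)
for row in table:
    print(row)
```

Even with only two records, the resulting table is mostly empty cells, and every new field or nesting level changes the set of columns. That is the work a separate landing-and-transformation system ends up doing in a conventional pipeline.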
DoubleDown, an online gaming studio, is an example of a company that had gone the route of putting a NoSQL system (MongoDB, in their case) into their data pipeline to prepare data for loading into their data warehouse. That approach wasn’t meeting their requirements: it was fragile, took a lot of care and feeding, and ultimately made it hard for them to get data to their downstream data users in a timely way.
The need for a better solution is what led them to Snowflake. Because Snowflake can directly load semi-structured data, including the JSON data generated by DoubleDown’s games, without needing to transform it first, DoubleDown was able to push their game data directly into Snowflake. Snowflake’s fast SQL engine then allowed them to transform and package that data for direct use by analysts, as well as to push it into their existing data warehouse. This made a huge difference in the quality and performance of their data pipeline:
- They can get fresh data to analysts 50x faster: in 15 minutes rather than 11-24 hours
- They have significantly improved reliability, eliminating almost all of the failures that occurred frequently in their previous pipeline
- Analysts now have access to the full granularity of data instead of being limited to periodic aggregates
- The icing on the cake: DoubleDown reduced the cost of their data pipeline by 80%