Enabling Diverse Analytics on MongoDB Data

The recent MongoDB World conference in New York was a testament to the rapid growth and adoption of MongoDB among a wide array of organizations and use cases–connected home IoT, streaming video, online gaming, community forum platforms, and more.

What's common across many of these use cases is rapidly growing data: log data, event data, monitoring data, and the like. MongoDB is a great fit because it scales to support quick access to individual data elements even at high concurrency levels. It handles the many small and modest-sized reads typical of applications that use it for online data access (think lookups and simple aggregations behind web applications, for example).

What about analytics?

Once you have a lot of data in MongoDB, the next question is, of course: what do you do when you want to run analytics on that data?

You can get started with the capabilities of the MongoDB aggregation framework (counts, averages, totals, and the like). There's also the option to use native MapReduce processing for custom logic, and you can even use the MongoDB BI Connector to provide basic support for SQL-based BI and visualization tools. That can work for modest analytics needs, when it's imperative to run basic analytic operations on the data inside the operational system.

However, that isn’t enough to meet most analytics needs.

  • Doing analytics inside MongoDB starts to break down as the analytics become more demanding. As data and workloads scale, heavy-duty analytics consume more and more resources, making it harder and harder to maintain performance for operational access.
  • There are a lot of places where SQL is the right tool for the job, not least because it's the native language of a huge array of BI and analytics tools. However, the inefficiency of translating the SQL generated by those tools into MongoDB queries slows down analytics, limiting what you can do natively in MongoDB.
  • When you need to combine the type of data that typically resides in MongoDB with traditional relational data, you start needing not only full SQL support but also the ability to support dimensional models and handle heavier-duty JOIN operations.

Those are just a few of the reasons that people start looking for solutions that can help them get more value out of the data that they have in MongoDB. What’s needed: a solution that doesn’t require lots of complex ETL, that can keep up with the large amount of data being generated and stored, and that can do all of that at a reasonable cost.

Snowflake for MongoDB analytics

We're seeing more and more customers turn to Snowflake for exactly that. Snowflake is a great complement to MongoDB for a number of reasons.

Native support for JSON, the data format at the heart of MongoDB's document-oriented model. Because Snowflake can load JSON data natively, without requiring transformation to a fixed relational schema, it's easy to get data from MongoDB into Snowflake. There's no need to build an ETL pipeline to transform that data and no need to worry about anything breaking as the data structure evolves.
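
As a rough illustration of what that looks like in practice, here is a minimal sketch in Snowflake SQL. The table, column, and stage names (mongo_events, doc, events_stage) are hypothetical, and it assumes MongoDB documents have already been exported as JSON files and staged.

    -- A single VARIANT column holds each MongoDB document as-is; no fixed schema required.
    CREATE TABLE mongo_events (doc VARIANT);

    -- A named stage pointing at the exported JSON files (e.g., produced by mongoexport).
    CREATE STAGE events_stage FILE_FORMAT = (TYPE = 'JSON');

    -- Load the staged documents; new or changed fields in the JSON simply come along for the ride.
    COPY INTO mongo_events FROM @events_stage;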

Unique architecture built for online elasticity and scale. When you’ve got lots of data arriving at highly variable but potentially rapid rates, you need a system that can easily keep up. Snowflake’s multi-cluster, shared data architecture makes it possible to load data at any time without competing for resources with analytics. That allows you to do micro-batch loading at any time to keep up with fast-arriving streams of data so that analysts have rapid access to recent data. Need to load a lot of data quickly? With Snowflake you can accelerate loading by simply scaling up a virtual warehouse—no downtime, read-only mode, or data redistribution required.
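
To make that concrete, here is a hedged sketch of the pattern in Snowflake SQL, with hypothetical warehouse names (load_wh for loading, bi_wh for analysts). Because the two warehouses are separate compute clusters over the same data, micro-batch loads never compete with analyst queries.

    -- A small warehouse dedicated to micro-batch loading.
    CREATE WAREHOUSE load_wh WAREHOUSE_SIZE = 'SMALL' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;

    -- A separate warehouse for BI and analytics queries against the same data.
    CREATE WAREHOUSE bi_wh WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;

    -- Need to work through a backlog quickly? Resize the loading warehouse online, then shrink it back.
    ALTER WAREHOUSE load_wh SET WAREHOUSE_SIZE = 'XLARGE';
    -- ...run the large load...
    ALTER WAREHOUSE load_wh SET WAREHOUSE_SIZE = 'SMALL';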

Native SQL support, even for JSON. Snowflake comes with a robust SQL engine at its core, so your BI and analytics tools hum along the way they were designed to interact with data. Even better, Snowflake's SQL isn't limited to relational data. That JSON data you have in MongoDB, with its variable schema, hierarchies, and nesting? All of it is accessible from SQL via Snowflake's extensions, and you can create relational views on top of that data to make it friendly to SQL-based tools.
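
Continuing the hypothetical mongo_events table from the loading sketch above, queries can reach directly into the JSON with path notation and casts, and a view can present the result to BI tools as an ordinary table. The field names here (user.id, event_type, ts, items) are illustrative, not a real schema.

    -- Path notation plus casts pull typed columns straight out of the JSON documents.
    SELECT
      doc:user.id::string    AS user_id,
      doc:event_type::string AS event_type,
      doc:ts::timestamp_ntz  AS event_time
    FROM mongo_events
    WHERE doc:event_type::string = 'purchase';

    -- A relational view makes the same data look like a plain table to SQL-based tools.
    CREATE VIEW purchases AS
    SELECT
      doc:user.id::string      AS user_id,
      doc:amount::number(10,2) AS amount,
      doc:ts::timestamp_ntz    AS event_time
    FROM mongo_events
    WHERE doc:event_type::string = 'purchase';

    -- Nested arrays can be exploded with FLATTEN, e.g. one row per element of doc:items.
    SELECT e.doc:ts::timestamp_ntz AS event_time, i.value:sku::string AS sku
    FROM mongo_events e, LATERAL FLATTEN(input => e.doc:items) i;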

Putting the pieces together

Combining MongoDB with Snowflake allows you to bring the benefits of MongoDB in supporting live applications together with the analytics flexibility and power of the Snowflake data warehouse. MongoDB supplies live applications with fast response and simple analytics, while data is copied into Snowflake in small batches every few minutes so that it can be combined with other data for more demanding analytics supporting BI and analytics teams.


As an example, we wrote earlier about how DoubleDown Interactive was struggling with their prior data pipeline, trying to figure out how to make their online game event data available to analysts using BI and dashboarding tools. DoubleDown was able to dramatically simplify their data pipeline and get data to BI analysts faster by eliminating the transformations that they were previously trying to do in MongoDB and moving the data directly into Snowflake.

It’s all part of how people have been reinventing the solutions supporting their rapidly growing and evolving needs for analytics, a reinvention that Snowflake is helping to drive. You can read more in our case study about DoubleDown.

Mobile customer experience matters

Customers are not only interacting with your app; they are also using multiple social networks from their mobile devices to interact with your organization, which makes the mobile experience essential. Customer behavior has multiple sources, and understanding that behavior requires a data platform that can easily incorporate and process all of those sources along with the web traffic generated by your app. Integrating these disparate sources can be difficult given key requirements such as:

  • Ingesting different types of data
  • Accommodating data changes
  • Scaling in sync with the user base to ensure that the supply of analytics matches demand

Regardless, organizations with a successful mobile strategy have found a way. They use data analytics solutions to differentiate themselves and personalize their customers' mobile experience. They use these solutions to better understand how customers interact with them through mobile, to get a stronger grip on the customer experience, and to fix unforeseen problems.

One such organization is Chime. Chime is revolutionizing banking for the mobile generation by designing their mobile presence around the customer. Their app is aimed at helping people lead healthier financial lives and automate their savings. “Chime is designed for the millennial generation who expect services to be personalized and mobile-first,” says Ethan Erchinger, Chime's Director of Technical Operations.

See how Chime is able to easily ingest and analyze data from 14 different sources, including Facebook, Google, and applications that emit JSON data, to effectively analyze customer experience and feedback within their mobile platform. For example, Chime collects and analyzes feedback from app users to personalize their experience based on geographic location, so that users get helpful hints about saving money in their area. Read Chime's story here.

Making Data Warehousing Easy

Legacy Problems

Organizations with legacy on-premises data warehouses spend a lot of time and money managing their environments and keeping up with business demands. Because of the size of the investment, organizations often run their data warehouses close to full utilization. While this may meet current needs, the inherent lack of scalability can mean compromised performance or missed SLAs when more workloads, data sources, and users need to be added. Then begins the journey of adding more capacity. Organizations often need to acquire specialized resources or become reliant on legacy vendors to manage and maintain the environment. All of this means that if there is a spike in demand, these organizations either cannot accommodate the growth without impacting performance or must absorb the cost of a larger footprint that sits unused until that brief spike arrives.

Dealing with Growth

With performance concerns come the typical headaches of any data warehouse environment. These include, but are not limited to, growing the environment, finding qualified resources for performance tuning, optimizing queries, and dealing with concurrency and user growth. Meanwhile, businesses face stiffer competition, and end users are clamoring for faster answers to their business questions. In the past, data warehousing was limited to a set of users typically in marketing or finance. Now even field sales reps want access to up-to-date data, creating more load on the data warehouse. And the more data you have, the more important security becomes and the more performance costs. So organizations are not only keeping the lights on but also spending more to get performance and to secure the environment. In short, data warehouses have become more difficult to maintain and run!

An example of an organization facing this scalability challenge is CapSpecialty, a leading provider of specialty insurance for small to mid-sized businesses. CapSpecialty used a legacy data warehouse to support the analytics their actuarial users needed to understand how to price and package products in various geographies. With increased demand for access to this data, the legacy environment required a significant upgrade. Performance impacts meant users had to start their queries before leaving the office for the weekend, hoping they would be complete when they returned to work on Monday. The legacy environment also limited their ability to report on important KPIs critical to running the business in a timely manner. As with any financial organization, the environment also needed to be very secure to store the crown jewels: customers' risk profiles and related financial data. Unfortunately, upgrading the environment to meet this increased demand was going to cost them $500K just for licensing, and that would only give them a 2X increase in performance. That does not even include the costs of deployment, management, and hosting for the new environment.

Making Data Warehousing Easy

The need for a scalable, more cost-effective solution led them to Snowflake. After evaluating a number of data warehouse options, CapSpecialty decided to implement the Snowflake cloud-based Elastic Data Warehouse. Besides offering an attractive cost structure, Snowflake's true cloud solution delivered ease of migration and scalability. With Snowflake, CapSpecialty was up and running in less than a week. In addition to achieving a 200x increase in query performance, they leveraged existing infrastructure and were set up to scale for future growth. Snowflake also provided end-to-end, enterprise-level security to protect their sensitive financial data in the cloud.

CapSpecialty underwriters are now able to analyze 10 years’ worth of governed data in 15 minutes. The stage has also been set for CapSpecialty executives to view dashboards that display real-time profitability and KPIs. Using Snowflake, CapSpecialty can also bring semi-structured data to the environment, and serve the analytics to their field agents to effectively market their products in various geographies.

To learn the details of how Snowflake made data warehousing easy for CapSpecialty, we encourage you to read more in the case study. You can also attend our webinar on April 27th, 2016, at 10:00 AM PST / 1:00 PM EST to find out how Snowflake and MicroStrategy enable CapSpecialty analysts to understand data in real time.

To SQL or to noSQL: Is that the Right Question?

Spending time this week at the Strata Hadoop conference in San Jose, it’s hard to take more than a few steps in any direction without seeing or hearing people talk about a noSQL platform like Hadoop or Spark. Over the last several years, those discussions have often led to debates on whether “no SQL” or “not only SQL” or “SQL on noSQL” was the next wave that would replace SQL systems. Sometimes those debates have become pretty passionate—I can’t say whether chairs have been thrown or fights have ensued, but there have definitely been some strong opinions. Those debates and discussions often morphed into complicated projects to try to build a “Hadoop data warehouse” or a “Spark data warehouse”.

However, I’m seeing signs that the debate is over. Instead of a debate on SQL vs. noSQL or on data warehousing versus a noSQL data lake, people are moving past expansive visions of what could possibly be built to a more grounded reality. The previous debates may have been fun to watch, but it’s refreshing to see a focus on the actual problems people need to solve rather than abstract debates on pure technology.

What’s driving that change?

Ultimately, it’s being driven by practical realities. To realize the full potential of noSQL platforms, they need to be integrated into an organization’s broader data strategy and infrastructure, not just left as a siloed project accessible to or bottlenecked on a small set of technology experts who understand MapReduce, Scala, distributed parallel programming, and Hadoop operations.

Integrating those systems into the data infrastructure is a much broader question than whether or not the query language is SQL. It raises a set of requirements, often familiar from data warehousing projects, including:

  • How do you make data available to current tools and users (many of which speak SQL, by the way)?
  • What do you do to create and manage metadata?
  • How do you plan for and deploy the capacity and horsepower necessary to meet demand?
  • How do you optimize performance and scale?
  • How do you handle security?
  • What do you do to ensure and monitor the availability of data and access to data?

As people started going down the path of trying to build a Hadoop data warehouse or a Spark data warehouse, the overwhelming complexity of that approach started to become apparent. Stitching together all the different pieces needed to satisfy critical requirements takes a lot of duct tape, baling wire, and elbow grease. Just figuring out how to support robust SQL access to the data is a non-trivial task, let alone building security, availability, and operational frameworks and processes around it.

The reality that sets in is that it's a lot of time, effort, and distraction spent rebuilding a wheel that's already been built, in particular by Snowflake. Unless your core business expertise is building large-scale, enterprise-capable, distributed database platforms, building that yourself isn't the right choice. Instead, taking advantage of a data warehouse as a service created by experts who do exactly that lets you focus on where your own expertise and differentiation reside. If your core expertise is understanding your data and how to analyze it, using Hadoop or Spark for specialized algorithms and machine learning while combining it with a data warehouse service that handles reporting and analytics gets you a faster, simpler path to data insights across your organization.

Case in Point: Spark + an Elastic Data Warehouse

At a session yesterday here at Strata, there was a great example of this approach. Celtra, who makes software to help companies create compelling digital advertising content, spoke about the evolution of their data pipeline over time.


Like many growing companies, they were outgrowing their initial data pipeline implementation and needed to change. They had started using Spark to transform the tracking event data that fed dashboards, ad hoc queries, and applications out of a MySQL database. However, they needed to evolve their approach to simplify their development cycles and enable faster experimentation. After investigating a number of possible routes, Celtra realized that they needed something like a data warehouse to support many of their needs.

That led Celtra to Snowflake. By combining Spark with Snowflake's Elastic Data Warehouse, they now have a data pipeline that still gives them the ability to do complex custom processing of data in Spark but also supports the needs of reporting and analytics users across the company. They got the best of both worlds, SQL and noSQL, by using them together.

Stay tuned for an upcoming webcast where Celtra will share more about what they did and why it was the right solution for their needs.

DoubleDown and Solving Data Pipeline Dilemmas

It’s an accepted fact that for data users, far too much of their time is spent waiting to get access to the data they need. The rapid growth in diversity, number, and volume of non-relational data sources has made that problem even worse. It was bad enough to have the ETL process be a significant bottleneck in the data pipeline when data was mostly relational and well-defined, but it’s an order of magnitude more painful when it’s diverse data streaming in from multiple sources.

One of the big challenges in the data pipeline has been handling non-relational data efficiently. This is data that doesn’t arrive in neatly organized and consistent rows and columns. Probably the most common form is what is often called semi-structured data–data in formats such as JSON, Avro, or XML that does have some form of structure, but that can have a flexible schema, hierarchies, and nesting. “Machine data,” “log data,” “application data” are all terms that often refer to data of this type. This data doesn’t fit cleanly into conventional databases as is, requiring transformation in order to turn it into something that can be put into a database for fast querying. In many cases, people have been turning to noSQL systems as a place to land and transform that data. However, adding in an additional system can add complexity and delay to the data pipeline.

DoubleDown, an online gaming studio, is an example of a company that had gone the route of putting a noSQL system (MongoDB in their case) into their data pipeline to prepare data for loading into their data warehouse. That approach wasn't meeting their requirements: it was fragile, took a lot of care and feeding, and ultimately made it hard to get data to their downstream data users in a timely way.

The need for a better solution is what led them to Snowflake. Because Snowflake can directly load semi-structured data, including the JSON data generated by DoubleDown's games, without needing to transform it first, DoubleDown was able to push their game data directly into Snowflake. Snowflake's fast SQL engine then allowed them to transform and package that data for direct use by analysts as well as to push into their existing data warehouse (a sketch of that kind of load-then-transform step follows the list below). This made a huge difference in the quality and performance of their data pipeline:

  • They can get fresh data to analysts 50x faster–in 15 minutes rather than 11-24 hours
  • They have significantly improved reliability, eliminating almost all of the failures that occurred frequently in their previous pipeline
  • Analysts now have access to the full granularity of data instead of being limited to periodic aggregates
  • The icing on the cake: DoubleDown reduced the cost of their data pipeline by 80%
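
For illustration only, and not DoubleDown's actual schema or pipeline, here is what a micro-batch load-then-transform step can look like in Snowflake SQL. Raw JSON events land in a hypothetical game_events_raw table with a single VARIANT column, and a SQL step then appends typed rows to an analyst-facing game_events table.

    -- Land the latest micro-batch of raw JSON event files (names are hypothetical).
    COPY INTO game_events_raw FROM @game_events_stage FILE_FORMAT = (TYPE = 'JSON');

    -- Package the events for analysts; watermarking/deduplication of batches is omitted for brevity.
    INSERT INTO game_events (player_id, game, event, event_time, raw_doc)
    SELECT
      doc:player_id::string,
      doc:game::string,
      doc:event::string,
      doc:ts::timestamp_ntz,
      doc   -- keep the full document so analysts retain access to full granularity
    FROM game_events_raw;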

Rethinking their data pipeline opened the door for DoubleDown to do things not possible before, from giving analysts immediate access to data from new product releases to enabling the business to make decisions based on data far faster. I’d encourage you to read more in the case study.

KIXEYE and Gaming Analytics on Snowflake

KIXEYE is a Snowflake customer that demonstrates how organizations are rethinking what technology they need to process and analyze non-traditional data. As mentioned in our recent press release, KIXEYE is an online gaming company that is using Snowflake to help them analyze game event data in support of ongoing experimentation with new features, functionality, and platforms.

Existing Systems Don’t Meet Current Needs

Gaming analytics is a great example of how data and analytics have changed in ways that don’t fit traditional data warehousing solutions, but that aren’t easy to solve with big data platforms like Hadoop either. The reasons that a traditional data warehouse wasn’t going to meet KIXEYE’s needs were similar to what we’re seeing at other gaming companies:

  • Their game event data (which is the largest share of their data) is created as JSON. Traditional relational data warehouses don’t handle JSON well if at all–either you transform the data before loading, which adds delays and makes your data pipeline fragile, or you load the data into an unoptimized data type and pay a performance penalty every time you access it.
  • They needed to allow access to data at multiple stages of refinement. Their data scientists want access to raw data as quickly as possible, while other analysts want more refined data that can be accessed with visualization and BI tools. Since a traditional data warehouse is a repository for refined data only, it can't support the full range of these needs by itself (see the sketch after this list).
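
As a rough sketch of that two-tier pattern, using purely illustrative names rather than KIXEYE's actual schema, raw events and refined views can live side by side in the same Snowflake database:

    -- Tier 1: raw events loaded as-is, queryable by data scientists within minutes of arrival.
    CREATE TABLE raw.game_events (
      doc       VARIANT,
      loaded_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP()
    );

    -- Tier 2: a refined view in a separate schema for BI and visualization tools.
    CREATE VIEW analytics.daily_player_events AS
    SELECT
      doc:player_id::string                    AS player_id,
      doc:platform::string                     AS platform,
      DATE_TRUNC('day', doc:ts::timestamp_ntz) AS event_day,
      COUNT(*)                                 AS events
    FROM raw.game_events
    GROUP BY 1, 2, 3;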

KIXEYE wasn't unique in initially using Hadoop to store, process, and analyze their semi-structured data. Although Hadoop is great for many things (e.g., machine learning algorithms, unstructured data storage), what KIXEYE and a lot of other companies have realized is that it wasn't designed for fast SQL analytics. Trying to use it for that adds a lot of latency and complexity, not to mention the challenge of finding skilled people to keep the system up and running.

One System for Storing and Analyzing JSON Data

That's why gaming analytics is an area where Snowflake is getting a lot of interest. For these companies, the fact that Snowflake can load JSON data as is, without transformation or flattening and without needing to define a fixed schema, yet still deliver great query performance from a SQL engine, is a huge win. Josh McDonald from KIXEYE said it best: “I can't say enough about how fantastic the native JSON support is. I've never actually seen anything that worked until now. My analysts are really happy about this.” Not needing staff focused on implementing and maintaining the system makes Snowflake even more compelling.

I encourage you to take a look at our case study on KIXEYE to learn more about how KIXEYE is using Snowflake. Our recent webinar with DoubleDown Interactive is another great example of a gaming company taking advantage of Snowflake to help them get easier access to data and faster analysis of that data.

At the end of the day, gaming companies like KIXEYE want to focus on developing the best possible games while optimizing revenue. Implementing and maintaining complex data infrastructure shouldn’t get in the way of doing that.