Nov 8, 2015
When I worked at VMware, one of the things customers loved was the ability to easily create exact copies of application deployments. That capability solved one of the big headaches that they faced—how to ensure that their development, QA, staging, and production environments were identical. People were tired not only of the effort required to get the hardware and software in place for each of those environments, but also of the fact that any differences between the environments potentially meant problems—problems reproducing failures, ensuring that tests reflected what would happen in production, and smoothly moving updates into production.
Everybody Wants Their Own Sandbox
There’s a similar problem when it comes to data warehousing and analytics. Engineers need a copy of the data warehouse and its data to develop applications, QA needs a copy for testing those applications, and the operations team needs a copy for staging. Then of course there’s the production environment, where different groups of BI users, data scientists, and analysts each need their own copy of the data warehouse and its data in order to avoid overloading the data warehouse.
To date that has meant deploying a new data warehouse or data mart for each need, configuring it appropriately, getting data loaded into it, and then keeping that data up to date. In short, the cost and work required for a single data warehouse or data mart is multiplied several times over, which adds up quickly. So quickly that having all of those environments is often treated as a luxury, and people feel they have to cut corners.
That’s no longer necessary with Snowflake. In Snowflake, it’s easy to create exact copies without duplicating infrastructure, without copying data, and without worrying about whether your different deployments are out of sync with each other.
How does it work? Let’s start with the data. One feature of the Snowflake Elastic Data Warehouse is the ability to create what we call “zero-copy clones”—copy-on-write clones that do not require any storage space to create (more on that in a future post). That feature makes it trivial to create an exact copy of any database (or schema or table for that matter), in an instant. Here’s all that it takes:
CREATE DATABASE mydb_clone CLONE mydb;
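The same statement shape applies to the smaller objects mentioned above. The following is a minimal sketch, assuming a database mydb with a schema named public and a table named orders that has an order_date column; those object names are hypothetical and are not from the original post.

```sql
-- Clone a single schema (schema names here are hypothetical).
CREATE SCHEMA mydb.public_dev CLONE mydb.public;

-- Clone a single table (table names here are hypothetical).
CREATE TABLE mydb.public.orders_test CLONE mydb.public.orders;

-- Because clones are copy-on-write, changing the clone should leave
-- the source table untouched. order_date is an assumed column.
DELETE FROM mydb.public.orders_test WHERE order_date < '2015-01-01';
```

After the DELETE, mydb.public.orders would still hold all of its rows; only the data changed in the clone takes up new storage.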
It’s just as simple when it comes to duplicating the processing environment. Because Snowflake is a SaaS offering, the software is always at the same version—we take care of ensuring that (and do updates online). That means that it’s trivial to deploy a “virtual warehouse” (i.e. a compute cluster) that’s an exact duplicate of another virtual warehouse. There’s also no need to worry about whether the configuration settings are identical—they always are because there aren’t any (another key point about Snowflake).
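As a rough illustration, duplicating a virtual warehouse can be as short as creating a second one of the same size. This sketch uses the CREATE WAREHOUSE statement as it appears in current Snowflake documentation, which may differ from what was available when this post was written; the warehouse names and the size are assumptions chosen for the example.

```sql
-- Hypothetical production warehouse.
CREATE WAREHOUSE IF NOT EXISTS prod_wh WITH WAREHOUSE_SIZE = 'MEDIUM';

-- A second, independent warehouse of the same size for development.
CREATE WAREHOUSE IF NOT EXISTS dev_wh WITH WAREHOUSE_SIZE = 'MEDIUM';
```

Size is the one thing that has to be matched here; the software version is the same for both by construction.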
The last thing needed to make this possible is Snowflake’s multi-cluster, shared data architecture. The ability to have multiple independent compute clusters (“virtual warehouses”) access the same data without competing for resources is what makes it possible to avoid needing to copy data to a different system.
Without that, you’d need to set up separate data marts for each environment in order to move the workloads onto independent processing resources. That approach would force you to create more copies of the data so that each environment had access to it. Instead, with Snowflake all of your different environments (development, test, production, sandboxes, etc.) can run at the same time, without requiring data copying or data movement.
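Put together, two environments running side by side might look like the sketch below. It shows two separate sessions, each on its own virtual warehouse. The warehouse names (prod_wh, dev_wh) and the orders table are hypothetical; mydb and mydb_clone are the databases from the clone statement earlier in this post.

```sql
-- Session 1: production users query the original database
-- on their own compute cluster.
USE WAREHOUSE prod_wh;
USE DATABASE mydb;
SELECT COUNT(*) FROM public.orders;   -- orders is a hypothetical table

-- Session 2: developers work against the zero-copy clone
-- on a separate compute cluster, at the same time.
USE WAREHOUSE dev_wh;
USE DATABASE mydb_clone;
SELECT COUNT(*) FROM public.orders;
```

Neither session takes compute resources from the other, and no data was copied to set up the second one.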
Making Your Data Warehouse Do More
Snowflake streamlines and simplifies many of the things that make traditional data warehousing so complex and cumbersome. By simplifying and reducing the cost of having multiple environments, Snowflake allows you to easily give developers their own sandboxes, do more and better testing, and move to production smoothly. That’s something from which all of your data projects can benefit.