Saving Time & Space: Simplifying DevOps with Fast Cloning

As I have written before, I think our fast cloning feature is pretty cool. Not only is it very fast (it took me barely a minute to clone a 2TB database!), it also allows you to have multiple copies of your data without the additional storage cost usually associated with replicating data.

That is all well and good, but out in the real world, folks want to see lots of examples of how to make good use of this feature. A common question from customers is how to utilize the Snowflake fast clone feature to support DevOps. Specifically, can it be used to support a promotion process from development to integration/testing to production?

The answer is yes! Following are a few examples of how that might work.

Getting Started: Cloning Your Production Database

Starting out, let’s assume you have one database in Snowflake (i.e. production by default) and it has a schema named PUBLIC. In this database you have two tables (A and B) and they are being populated daily by a load process.

Cloning Figure 1
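
For reference, the starting state might have been created with something like the following sketch (the column definitions here are hypothetical; only the database, schema, and table names come from the example):

CREATE DATABASE Prod;
USE Prod.public;
CREATE TABLE A (id INTEGER, a_value VARCHAR(100), load_date DATE);  // hypothetical columns
CREATE TABLE B (id INTEGER, b_value VARCHAR(100), load_date DATE);  // hypothetical columns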

Now you find you need to make changes and additions, and you realize you should not do those in production, so you need a development database. How do you do that in Snowflake?

You could do it the old-fashioned way and just create a new database, then copy the existing tables and data over from production (using a “CREATE TABLE AS SELECT …”, or CTAS, operation). If the tables are large, that would take some time and cost extra because it also requires more space. However, using Snowflake’s fast clone feature, you can do it much faster and without incurring the cost of extra space.

Here is how to do it:

CREATE DATABASE Dev CLONE Prod;

That’s it, one command! It creates the new database, schema, and tables and gives you a logical copy of the production data to work with almost instantly. 
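
If you want to reassure yourself that the clone is complete, a quick sanity check might look like this sketch (nothing here is required; it simply confirms the objects and row counts came across):

SHOW TABLES IN DATABASE Dev;
SELECT COUNT(*) FROM Dev.public.A;
SELECT COUNT(*) FROM Dev.public.B;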

While you are at it, you can also create the integration testing database (Int) the same way:

CREATE DATABASE Int CLONE Prod;

Now you are set to continue your data warehouse development efforts with a proper environment in place.

Scenario 2: Promoting New Development

Based on what we did above, your current Development database has tables A and B, cloned from Prod. As you begin your development iteration, you can add more tables. In this example, we added two new tables C and D, which are under development.
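
Creating those new development tables is ordinary DDL in the Dev database; a minimal sketch (with hypothetical column definitions) might look like this:

USE Dev.public;
CREATE TABLE C (id INTEGER, c_value VARCHAR(100));  // hypothetical columns
CREATE TABLE D (id INTEGER, d_value VARCHAR(100));  // hypothetical columns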

Initial State:

Prod database: Tables A, B

Int database: Tables A, B (previously cloned from Prod)

Dev database:

  • Tables A, B (previously cloned from Prod)
  • Tables C, D (created as part of new development)

Cloning Figure 2

Mini Release:

Now suppose you want to do a mini-release, wherein you:

  • Promote only Table C to Int for testing
  • Leave Table D as is in Dev (work-in-progress, not ready to deploy)

To do this mini-release you need to execute the following steps in a script, using the command-line client, or in the Snowflake worksheet:

  1. Deploy Table C to Int by simple clone:
        • If Table C already contains production data that you want to carry over (copy the data):
          USE Int.public;
          CREATE TABLE C CLONE Dev.public.C;  // Yes, that simple!
          • Load incremental new data into Table C
          • Validate results
        • If Table C has only test data and you do not want to copy the data:
          USE Int.public;
          CREATE TABLE C LIKE Dev.public.C;  // copy the table structure only
          • Load new data into Table C
          • Validate results
  2. If the validation of Table C succeeds in Int, then deploy it to Prod:
          USE Prod.public;
          CREATE TABLE C LIKE Int.public.C;
        • Load production data into Table C

Now you have the new Table C live in your Prod database.
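
Putting the steps together (taking the clone path into Int), the whole mini-release could be run as one short script; this is just a sketch of the commands already shown above, with the loads represented by comments:

USE Int.public;
CREATE TABLE C CLONE Dev.public.C;   // promote Table C to Int
// load incremental new data into Table C and validate results
USE Prod.public;
CREATE TABLE C LIKE Int.public.C;    // structure only in Prod
// load production data into Table C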

Cloning Figure 3

But you are not quite done yet. You need to complete the cycle.

Refresh Dev:

We now want to refresh Dev to be in sync with Prod. To do that we need to create a production clone of Table C and update Tables A and B with refreshed data. This will also serve to reduce your storage by replacing the original version of Table C in Dev with a cloned table. Likewise, you should sync Int as well.

If there are just a few tables in Prod (as in our example) then:

USE Dev.public;
CREATE OR REPLACE TABLE A CLONE Prod.public.A;
CREATE OR REPLACE TABLE B CLONE Prod.public.B;
CREATE OR REPLACE TABLE C CLONE Prod.public.C;
USE Int.public;
CREATE OR REPLACE DATABASE Int CLONE Prod;

Cloning Figure 4

Note: CREATE OR REPLACE TABLE will drop and rebuild the table. This will release any space used in Dev (or Int) by those tables subsequent to the last clone operation.

Hint: If there are many tables in Prod, and only a few new tables in Dev, then this approach will be more efficient:

CREATE DATABASE Dev2 CLONE Prod;
USE Dev2.public;
CREATE TABLE D CLONE Dev.public.D; // preserving the WIP table in Dev
DROP DATABASE Dev;
ALTER DATABASE Dev2 RENAME TO Dev;

This process removes all previous clones and allows you to preserve Table D as it is currently in development. When you are done with development on Table D, you follow the same process as we did for Table C to promote Table D to Int for testing, then up to Prod for go-live. After the promotion, be sure to replace Table D in Dev and Int with a fresh clone from Prod in order to release the space.

Scenario 3: Promoting a Change to an Existing Table

Now let’s suppose you have three tables already in production and you need to alter one of those tables (e.g., add a new column) and then promote the revised table to production. This is an important scenario, especially when rows are constantly being added to the production table. In this case you do not want to reload the modified table but rather change it in place and then continue with incremental loads.

Initial State:

Prod database: Tables A, B, C

Int database: Tables A, B, C (cloned from Prod)

Dev database: Tables A, B, C (cloned from Prod)

Cloning Figure 5

Mini Release:

In this scenario, you want to do a mini-release, wherein you will:

  • Alter Table C by adding one new column
  • Adjust the load process to include that column
  • Promote Table C to Int for testing

To do this mini-release you need to execute the following steps in a script or by hand in the Snowflake worksheet:  

1. Change the table and the code (in Dev), then refresh the table:
USE Dev.public;
ALTER TABLE C ADD COLUMN NEW_COL1 VARCHAR(200);
      • Modify the load process to include NEW_COL1
      • Write a one-time update script to populate NEW_COL1 in existing rows (see the sketch after these steps)
        • Pull data from Production source if possible
        • Review and validate results
      • Execute modified load routine to load incremental new rows
        • Review and validate results

2. Deploy Table C to Int by simple clone:

USE Int.public;
CREATE OR REPLACE TABLE C CLONE Dev.public.C;
      • Promote revised load process to Int environment
      • Promote Update script to Int environment
      • Execute Update script
        • Review and validate results
      • Execute load process to load incremental new data into Table C
        • Review and validate results

3. If promotion to Int succeeds, then deploy to Prod:

USE Prod.public;
ALTER TABLE C ADD COLUMN NEW_COL1 VARCHAR(200);
      • Promote revised load process to Prod environment
      • Promote Update script to Prod environment
      • Execute Update script
        • Review and validate results
      • Execute load process to load incremental new data into Table C
        • Review and validate results
      • Start writing queries and reports that use the new column
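
Here is a sketch of what that one-time update script for NEW_COL1 might look like; the staging table and join key are hypothetical and would depend on your actual source. The same script would then be promoted and re-run in Int and Prod as described in steps 2 and 3:

USE Dev.public;
UPDATE C
SET NEW_COL1 = s.new_col1
FROM source_stage s        // hypothetical staging table holding the production source data
WHERE C.id = s.id;         // hypothetical join key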

Refresh Dev:

Once the promotion to Production is complete, you can refresh Dev and Int again using the same process as mentioned above.

USE Dev.public;
CREATE OR REPLACE TABLE C CLONE Prod.public.C;
USE Int.public;
CREATE OR REPLACE TABLE C CLONE Prod.public.C;

Now you have a production clone of Table C in Dev & Int with the new column added and updated with fresh production data.

Conclusion

These are just a few examples of how you might use Snowflake’s fast cloning to support an agile, lightweight DevOps process that does not require a lot of DBA management and does not take up extra space! Hopefully this gives you some ideas for how you can use fast cloning in your development environment.

Thanks to Saqib Mustafa (@cloudsommelier) for his help on this post!

As always, keep an eye on this blog site, our Snowflake-related Twitter feeds (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for more Cool Things About Snowflake and for updates on all the action and activities here at Snowflake Computing.

 

Snowflake Fast Clone

Hopefully you had a chance to read our previous posts: Query result sets available in history, Ability to connect with JDBC, and Undrop a table, schema, or a database. As promised in the original Top 10 post, we continue the series with a deeper dive into another of the Top 10 Cool Features from Snowflake:

#7 Fast Clone

Even cooler than UNDROP is our fast clone feature.

Have you ever had to wait hours, days, maybe weeks to spin up a copy of your production data warehouse so you could have a test or a development environment? And had to pay extra for the test or development environment to be able to hold all the replicated data? Or maybe you have some new data scientists who just want to play around with the data but really need their own copy?

With the simple Snowflake CLONE command, our customers can create multiple copies of the data tables, schemas, and databases, without replicating the data itself. This gives our customers the ability to almost instantly make the data available to use for multiple user groups, without the additional cost (or time) of actually replicating the data.

Magic??

Almost –

Magic with Meta Data.

Based on our unique solution, Snowflake users are able to clone their tables, schemas, or databases without creating additional copies. Snowflake stores data in files that are immutable and encrypted as part of our architecture. Our cloud services layer, with its metadata repository, records information about the files being stored, their locations, and a reference to a particular version of the data. This metadata is also kept encrypted. In addition, when any data changes, the metadata repository is automatically updated to point to the changed data. All of this is performed in the background by the software, without any involvement from the user, and the metadata repository retains the record for all versions of the data set.

How Cloning works

Because of the information in the metadata store, the user can quickly create a clone of a table. All the user has to do is submit the CLONE command, which can be submitted at any time.

CREATE OR REPLACE TABLE MyTable_V2 CLONE MyTable

As a result of the CLONE command, the system simply creates a new entry in the metadata store to keep track of the new clone.

Time Traveling Clones! Oh My!

In addition to simple cloning of objects, you can blend cloning with Snowflake time travel to clone tables, schemas, or even databases at a point in time in the past AT or BEFORE a specific timestamp. Here is an example:

CREATE SCHEMA mytestschema_clone_restore CLONE testschema BEFORE (TIMESTAMP => TO_TIMESTAMP(40*365*86400));

This command, once executed, will create a clone of the entire schema (tables, views, etc.) as it existed before the specified timestamp.

Hmm…time traveling clones…sounds like an episode of the X-Files (also cool).

Because Snowflake maintains the history of queries performed and identifies each one by a unique ID, we can also create a clone of an object as it existed AT or BEFORE a specific statement, using that query ID as the reference point. This would allow you to, for example, run a revised set of scripts against an older data set and then compare the results to the current data set.
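
Here is a sketch of that variation, keeping the placeholder <query_id> for an actual ID from your query history (the target table name is made up for illustration):

CREATE TABLE MyTable_before_load CLONE MyTable BEFORE (STATEMENT => '<query_id>');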

In addition, once created, cloned objects are independent of each other. Despite being independent, there are no additional storage requirements, and thus no additional charge (unless you add or modify records), since the clones share the underlying files. Snowflake thus allows its customers to clone at multiple levels: table, schema (including file formats, views, sequences, and stages), and database, and at points in time. And because they are independent, updates to one are not visible in the others.
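
To illustrate that independence with the MyTable/MyTable_V2 pair from above (the column names here are purely hypothetical):

UPDATE MyTable_V2 SET status = 'TEST' WHERE id = 1;   // change the clone only
SELECT status FROM MyTable WHERE id = 1;              // the source still shows the original value
SELECT status FROM MyTable_V2 WHERE id = 1;           // the clone shows 'TEST'

Only the changed records take up new storage; everything else continues to be shared with the source.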

Very helpful for experimentation and data exploration!

Cloning Example

As an example of what it looks like in the Snowflake UI, here is a snapshot of one of my demo databases with Twitter data. It is about 2 TB of data.

Twitter db

Now here is me launching the Clone via our Web UI at 3:02:49 PM:

Start Clone

And here is the result showing the cloned db created at 3:03:55 PM. Barely a minute to create a clone of a 2TB database with 10 tables!

Clone Completed

And I am not a DBA let alone a “cloud” DBA. This is all part of Snowflake being a Data Warehouse as a Service (DWaaS). Another reason why I think it is way cool!

Why is this hard to do in traditional systems?

In traditional RDBMS systems, if you want to clone an existing data warehouse and give a different set of users access to the environment, you typically have to create a whole new, separate deployment of that environment. In Snowflake, as I have just shown, you can do this with ease with just a few commands (one of the many benefits resulting from having written Snowflake from scratch for the cloud). Moreover, while most traditional data warehouse systems allow you to create snapshots of the data, doing so generally consumes more storage (which of course costs more $$).

All these reasons are why Fast Cloning made my Top 10 list of really cool features in the Snowflake Elastic Data Warehouse.

As always, keep an eye on this blog site, our Snowflake Twitter feeds (@SnowflakeDB), (@kentgraziano),  (@cloudsommelier), and #ElasticDW for more Top 10 Cool Things About Snowflake and for updates on all the action and activities here at Snowflake Computing.

Special thanks to Ashish Motivala from Snowflake Engineering for helping us with some of the more technical details on how this works so well.

Kent Graziano and Saqib Mustafa

Top 10 Cool Things I Like About Snowflake

I have now been with Snowflake Computing for a little over two months (my how time flies). In that time, I have run the demo, spoken at several trade shows, and written a few blog posts. I have learned a ton about the product and what it means to be an Elastic Data Warehouse in the Cloud.

So for this post I am going to do a quick rundown of some of the coolest features I have learned about so far. 

#10 Persistent results sets available via History

Once you execute a query, the result set will persist for 24 hours (so you can go back and check your work). It may seem minor to some, but it sure is convenient to be able to pull up the results from a previous query without having to execute the query a second time. Saves on time and processing. Read more

#9 Ability to connect with JDBC

Again, it seems like a no-brainer, but it is very important. I had no clear concept of how I would connect to a data warehouse in the cloud, so this was good news. After getting my favorite data modeling tool, Oracle SQL Developer Data Modeler (SDDM), installed on my new Mac, I was able to configure it to connect to my Snowflake demo schema using JDBC and reverse engineer the design.

So why is this cool? It means that whatever BI or ETL tool you use today, if it can talk over JDBC, you can connect it to Snowflake. Read more

#8 UNDROP

With UNDROP in Snowflake you can recover a table instantaneously with a single command:

UNDROP TABLE <tablename>

No need to reload last night’s backup to do the restore. No need to wait while all that data is pulled back in. It just happens!

Now that is a huge time (and life) saver. Read more

#7 Fast Clone

Even cooler than UNDROP is the fast clone feature.

The Snowflake CLONE command can create a clone of a table, a schema, or an entire database almost instantly. It took me barely a minute to create a clone of a 2TB database without using additional storage! And I am not a DBA, let alone a “cloud” DBA.

This means you can create multiple copies of production data without incurring additional storage costs. No need to have separate test/dev data sets.

Hence why I think it is way cool! Read more

#6 JSON Support with SQL

During the first demo of Snowflake I attended (before I even applied for a job here), this one got my attention.

Using the knowledge and skills I already had with SQL, I could quickly learn to query JSON data, and join it to traditional tabular data in relational tables.

Wow – this looked like a great stepping stone into the world of “Big Data” without having to learn complex technologies like Hadoop, MapReduce, or Hive! Read more

Yes, I call that a very cool feature. Add to that the fact that the JSON documents are stored in a table and automatically optimized in the background for MPP and columnar access, and you get the ability to combine semi-structured and structured data in one location. For further details, check out my detailed two-part blog here and here.
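
Here is a sketch of what that looks like; the table, the JSON attributes, and the relational table being joined are all hypothetical:

CREATE TABLE raw_tweets (v VARIANT);   // JSON documents loaded into a VARIANT column

SELECT t.v:user.screen_name::STRING AS screen_name,
       t.v:text::STRING             AS tweet_text,
       a.region
FROM raw_tweets t
JOIN accounts a                        // ordinary relational table
  ON a.screen_name = t.v:user.screen_name::STRING;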

#5 ANSI compliant SQL with Analytic Functions

Another key feature of Snowflake, required for it to be called a relational data warehouse, is of course the ability to write standard SQL. Even more important for data warehousing is access to sophisticated analytic and windowing functions (e.g., LEAD, LAG, RANK, STDDEV).

Well, Snowflake definitely has these. In fact, we support everything you would expect, including aggregation functions, nested virtual tables, subqueries, ORDER BY, and GROUP BY. This means it is fairly simple for your team to migrate your existing data warehouse to Snowflake. Read more
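
For example, a typical windowing query might look like this sketch (the monthly_sales table and its columns are hypothetical):

SELECT region,
       sales_month,
       revenue,
       LAG(revenue) OVER (PARTITION BY region ORDER BY sales_month) AS prior_month_revenue,
       RANK() OVER (PARTITION BY sales_month ORDER BY revenue DESC) AS revenue_rank
FROM monthly_sales;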

#4 Separation of Storage and Compute

The innovative, patent-pending, Multi-Cluster, Shared Data Architecture in Snowflake is beyond cool. The architecture consists of three layers: storage, compute, and cloud services. Each layer is decoupled from the others and independently scalable. This enables customers to scale resources as they are required, rather than pre-allocating resources for peak consumption. In my 30+ years working in IT, I have not seen anything like it. It is truly one of the advantages that comes from engineering the product, from the ground up, to take full advantage of the elasticity of the cloud. Read more

#3 Support for Multiple Workloads

With this unique architecture, Snowflake can easily support multiple disparate workloads. Because of the separation of compute and storage, you can easily spin up separate Virtual Warehouses of different sizes to run your ELT processes, support BI report users, data scientists, and data miners. And it makes total sense to be able to keep disparate workloads separate, to avoid resource contention, rather than just saying we support “mixed” workloads.
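
As a sketch of what that separation looks like in practice (the warehouse names and sizes are illustrative, not prescriptive):

CREATE WAREHOUSE etl_wh WITH WAREHOUSE_SIZE = 'LARGE' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;        // ELT loads
CREATE WAREHOUSE bi_wh WITH WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;        // BI report users
CREATE WAREHOUSE science_wh WITH WAREHOUSE_SIZE = 'XLARGE' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;   // data scientists

Each workload then simply points at its own warehouse with USE WAREHOUSE, so loads, reports, and experiments never compete for the same compute.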

And even better – no special skills or secret configuration settings are required to make this work. It is the way Snowflake is built by design. Nice! Read more

#2 Automatic Encryption of Data

Security is a major concern for moving to the cloud. With Snowflake, your data is automatically encrypted by default. No setup, no configuration, no add-on costs for high security features.

It is just part of the service! To me that is a huge win. Read more

#1 Automatic Query Optimization. No Tuning!

As a long time data architect, and not a DBA, this is my favorite part of Snowflake. I do not have to worry about my query performance at all. It is all handled “auto-magically” via meta data and an optimization engine in our cloud services layer. I just model, load, and query the data.

So, no indexes, no need to figure out partitions and partition keys, no need to pre-shard any data for distribution, and no need to remember to update statistics.

This feature, to me, is one of the most important when it comes to making Snowflake a zero management Data Warehouse as a Service offering. Read more

Well, that is the short list of my top 10 favorite features in Snowflake. Keep a look out for future posts in the coming weeks, to provide details on these and other key features of the Snowflake Elastic Data Warehouse.

If you want to learn more about Snowflake, sign up for one of our frequent webinars, or just drop me a line at kent.graziano@snowflake.net and I will hook you up!

P.S. Keep an eye on my Twitter feed (@kentgraziano) and the Snowflake feed (@SnowflakeDB) for updates on all the action and activities here at Snowflake Computing. Watch for #ElasticDW.