Are all databases in the cloud, built for the cloud?

“Crap on-premises, move it to the cloud; it is still crap, just in the cloud”

David Linthicum, speaking in NYC at a recent CloudAnalyticsCityTour.com

Traveling around the country for the CloudAnalyticsCityTour.com, we see a fundamental shift in the world of analytics, driven by the cloud. As more data sources come online (IoT, mobile, applications), cloud economies of scale allow companies to bring in larger data sets and analyze them effectively. Increasingly, all major legacy data warehouse and big data platforms are moving to the cloud to make sure they can accommodate customer workloads. However, listening to data management and analytics users in NYC and Boston, there is a genuine sense of frustration that existing platforms, whether on-premises or moved to the cloud, are not serving their purposes. Even in the cloud, these legacy platforms still require customers to either sign up, typically in advance, for excess capacity to cover peak performance or concurrency requirements, or risk sacrificing performance and business value to offset costs. Plus, none of these platforms offers the simplicity of a true, modern, built-for-the-cloud SaaS solution. So you end up needing the same expensive resources you had on-premises; you just connect to them in the cloud. Hence the quote: “Crap on-premises, move it to the cloud; it is still crap, just in the cloud.”

Additionally, the picture has only gotten more crowded, and confusing, with the introduction of open source platforms. As David Menninger (@dmenninger) puts it: “Hadoop is like a free puppy”. Whether in the cloud or on-premises, these open source “free puppies” need all sorts of care and feeding, and they end up consuming even more resources than traditional alternatives. According to a recent study by Ventana Research (link, registration required), only 1 in 6 organizations have Hadoop skills, while 6 out of 10 have SQL skills. As Michal Klos, Engineering Director, Data Platform at Localytics, hinted in a recent talk at the Boston edition of the CloudAnalyticsCityTour.com: his team spent a year building their own data warehouse on Presto and eventually gave that puppy away because of all the additional effort it would have taken to get any real business value.

Then what is the alternative?

The true alternative has to be a modern cloud data warehouse that is simple to manage, offers performance and is able to meet your concurrency needs, on demand. It has to be built for the cloud, to make sure you can bring in all your diverse data and effectively scale when you need to scale, rather than requiring you to overbook capacity upfront. Around the country, companies are begging for a solution that helps them better analyze data and speed their data pipeline, rather than being stuck managing the entire environment in yet another data center. Plus, it has to fit inside the ecosystem of modern ETL/ELT and BI partners that help with moving the data from source to analysts.

Want to hear how your peers are benefiting from the shift to cloud analytics? See how data-driven companies, like Snag-a-job, are able to analyze 1.5 million impressions a day, alongside 18 months of historical records, to understand where they can improve their product for their customers. Come join us for the CloudAnalyticsCityTour.com when it visits DC (October 4th) and Philly (October 6th) to hear thought leaders from AWS, Looker, Informatica, Tableau, and Snowflake, along with the likes of David Menninger, discuss how legacy platforms and free puppies keep giving us headaches. Or join us at the TDWI San Diego conference, where IAC Publishing Labs (Ask.com, About.com, Investopedia.com) will talk about how they won the TDWI Best Practice award for Emerging Technologies and Methods by using Snowflake, the data warehouse built for the cloud, to turn the BI group from a cost center into a valued part of the organization. You can also visit our booth in the TDWI exhibitors hall.

As always, keep an eye on this blog site and our Snowflake-related Twitter feeds (@SnowflakeDB, @drkoalz, and @kentgraziano) for more “Cool Things About Snowflake”, and for updates on all the latest action and activities here at Snowflake Computing.

Saving Time & Space: Simplifying DevOps with Fast Cloning

As I have written before, I think our fast cloning feature is pretty cool. Not only is it very fast (as in it took me barely a minute to clone a 2TB database!), it allows you to have multiple copies of your data without the additional cost of storage usually associated with replicating data.  

That is all well and good, but out in the real world, folks want to see lots of examples on how to make good use of this feature. A common question from customers is how to utilize the Snowflake fast clone feature to support DevOps. Specifically, can it be used to support a promotion process from development to integration/testing to production?

The answer is yes! Following are a few examples of how that might work.

Getting Started: Cloning Your Production Database

Starting out, let’s assume you have one database in Snowflake (i.e. production by default) and it has a schema named PUBLIC. In this database you have two tables (A and B) and they are being populated daily by a load process.

Cloning Figure 1
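If you want to follow along, a hypothetical version of that starting state might look like this. Only the database, schema, and table names come from the example; the table definitions themselves are assumptions for illustration.

CREATE DATABASE Prod;  // a new database gets a PUBLIC schema by default
CREATE TABLE Prod.public.A (id INTEGER, loaded_at TIMESTAMP_NTZ);
CREATE TABLE Prod.public.B (id INTEGER, loaded_at TIMESTAMP_NTZ);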

Now you find you need to make changes and additions and realize you should not do those in production so now you need a development database. How do you do that in Snowflake?

You could do it the old-fashioned way: create a new database, then copy the existing tables and data over from production (using a “CREATE TABLE AS SELECT …”, or CTAS, operation). If the tables are large, that would take some time and cost extra, since it also requires more storage. However, using Snowflake’s fast clone feature you can do it way faster and not incur the cost of extra space.

Here is how to do it:

CREATE DATABASE Dev CLONE Prod;

That’s it, one command! It creates the new database, schema, and tables and gives you a logical copy of the production data to work with almost instantly. 
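For comparison, the old-fashioned CTAS approach mentioned above would have looked something like this (a sketch only; it physically copies the data, so it takes time and consumes additional storage):

CREATE DATABASE Dev;
CREATE TABLE Dev.public.A AS SELECT * FROM Prod.public.A;
CREATE TABLE Dev.public.B AS SELECT * FROM Prod.public.B;

With the clone, you skip all of that and consume no extra storage until you actually change the data.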

While you are at it, you can create the integration testing database (Int) too:

CREATE DATABASE Int CLONE Prod;

Now you are set to continue your data warehouse development efforts with a proper environment in place.

Scenario 2: Promoting New Development

Based on what we did above, your current Development database has tables A and B, cloned from Prod. As you begin your development iteration, you can add more tables. In this example, we added two new tables C and D, which are under development.

Initial State:

Prod database: Tables A, B

Int database: Tables A, B (previously cloned from Prod)

Dev database:

  • Tables A, B (previously cloned from Prod)
  • Tables C, D (created as part of new development)

Cloning Figure 2

Mini Release:

Now suppose you want to do a mini-release, wherein you:

  • Promote only Table C to Int for testing
  • Leave Table D as is in Dev (work-in-progress, not ready to deploy)

To do this mini-release you need to execute the following steps in a script, using the command-line client, or in the Snowflake worksheet:

  1. Deploy Table C to Int by simple clone:
        • If Table C already contains production data (copy the data):
          USE Int.public;
          CREATE TABLE C CLONE Dev.public.C;  // Yes, that simple!
          • Load incremental new data into Table C
          • Validate results
        • If Table C has only test data and you do not want to copy the data:
          USE Int.public;
          CREATE TABLE C LIKE Dev.public.C;  // copy the table structure only
          • Load new data into Table C
          • Validate results
  2. If the validation of Table C succeeds in Int, then deploy it to Prod:
          USE Prod.public;
          CREATE TABLE C LIKE Int.public.C;
          • Load production data into Table C

Now you have the new Table C live in your Prod database.

Cloning Figure 3
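For reference, the “copy data” path of that mini-release, consolidated into a single script, would look roughly like this (load and validation steps shown as comments):

USE Int.public;
CREATE TABLE C CLONE Dev.public.C;   // promote structure and data to Int
// load incremental new data into Int.public.C and validate results

USE Prod.public;
CREATE TABLE C LIKE Int.public.C;    // structure only in Prod
// load production data into Prod.public.C and validate results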

But you are not quite done yet. You need to complete the cycle.

Refresh Dev:

We now want to refresh Dev to be in sync with Prod. To do that we need to create a production clone of Table C and update Tables A and B with refreshed data. This will also serve to reduce your storage by replacing the original version of Table C in Dev with a cloned table. Likewise, you should sync Int as well.

If there are just a few tables in Prod (as in our example) then:

USE Dev.public;
CREATE OR REPLACE TABLE A CLONE Prod.public.A;
CREATE OR REPLACE TABLE B CLONE Prod.public.B;
CREATE OR REPLACE TABLE C CLONE Prod.public.C;
CREATE OR REPLACE DATABASE Int CLONE Prod;   // refresh Int in one shot

Cloning Figure 4

Note: CREATE OR REPLACE TABLE will drop and rebuild the table. This will release any space used in Dev (or Int) by those tables subsequent to the last clone operation.

Hint: If there are many tables in Prod, and only a few new tables in Dev, then this approach will be more efficient:

CREATE DATABASE Dev2 CLONE Prod;
USE Dev2.public;
CREATE TABLE D CLONE Dev.public.D; // preserving the WIP table in Dev
DROP DATABASE Dev;
ALTER DATABASE Dev2 RENAME TO Dev;

This process removes all previous clones and allows you to preserve Table D as it is currently in development. When you are done with development on Table D, you follow the same process we used for Table C to promote Table D to Int for testing and then up to Prod for go-live. After the promotion, be sure to replace Table D in Dev and Int with a fresh clone from Prod in order to release the space.
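When Table D is ready, a minimal sketch of that same promotion (assuming Table D contains only test data in Dev, so we take the structure-only path) might look like this:

USE Int.public;
CREATE TABLE D LIKE Dev.public.D;    // copy structure only, then load and validate in Int
USE Prod.public;
CREATE TABLE D LIKE Int.public.D;    // deploy to Prod and run the production load
USE Dev.public;
CREATE OR REPLACE TABLE D CLONE Prod.public.D;  // re-clone to release the space in Dev
USE Int.public;
CREATE OR REPLACE TABLE D CLONE Prod.public.D;  // and in Int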

Scenario 3: Promoting a Change to an Existing Table

Now let’s suppose you have three tables already in production and you need to alter one of those tables (e.g., add a new column) and then promote the revised table to production. This is an important scenario, especially when rows are constantly being added to the production table. In this scenario you do not want to reload the modified table, but rather change it in place and then continue with incremental loads.

Initial State:

Prod database: Tables A, B, C

Int database: Tables A, B, C (cloned from Prod)

Dev database: Tables A, B, C (cloned from Prod)

Cloning Figure 5

Mini Release:

In this scenario, you want to do a mini-release, wherein you will:

  • Alter Table C by adding one new column
  • Adjust the load process to include that column
  • Promote Table C to Int for testing

To do this mini-release you need to execute the following steps in a script or by hand in the Snowflake worksheet:  

    1. Change the table and the code (in Dev), then refresh the table
USE Dev.public;
ALTER TABLE C ADD COLUMN NEW_COL1 VARCHAR(200);
      • Modify load to include NEW_COL1
      • Write a one-time update script to populate NEW_COL1 in existing rows (a sketch of such a script follows these steps)
        • Pull data from Production source if possible
        • Review and validate results
      • Execute modified load routine to load incremental new rows
        • Review and validate results

2. Deploy Table C to Int by simple clone:

USE Int.public;
CREATE OR REPLACE TABLE C CLONE Dev.public.C;
      • Promote revised load process to Int environment
      • Promote Update script to Int environment
      • Execute Update script
        • Review and validate results
      • Execute load process to load incremental new data into Table C
        • Review and validate results

3. If promotion to Int succeeds, then deploy to Prod:

USE Prod.public;
ALTER TABLE C ADD COLUMN NEW_COL1 VARCHAR(200);
      • Promote revised load process to Prod environment
      • Promote Update script to Prod environment
      • Execute Update script
        • Review and validate results
      • Execute load process to load incremental new data into Table C
        • Review and validate results
      • Start writing queries and reports that use the new column
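Here is what the one-time update script mentioned in the steps above might look like. This is purely a hypothetical sketch: the source table (SOURCE_EXTRACT), join key (ID), and source column (NEW_VALUE) are assumptions for illustration.

UPDATE C
SET NEW_COL1 = s.NEW_VALUE          // backfill the new column
FROM SOURCE_EXTRACT s               // hypothetical staging table holding the source values
WHERE C.ID = s.ID
  AND C.NEW_COL1 IS NULL;           // only touch rows not yet populated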

Refresh Dev:

Once the promotion to Production is complete, you can refresh Dev and Int again using the same process as mentioned above.

USE Dev.public;
CREATE OR REPLACE TABLE C CLONE Prod.public.C;
USE Int.public;
CREATE OR REPLACE TABLE C CLONE Prod.public.C;

Now you have a production clone of Table C in Dev & Int with the new column added and updated with fresh production data.

Want to see cloning in action? Check out this video:

Conclusion

These are just a few examples of how you might use Snowflake’s fast cloning to support an agile, lightweight DevOps process that does not require a lot of DBA management and does not take up extra space! Hopefully this gives you some ideas for how you can use fast cloning in your own development environment.

Thanks to Saqib Mustafa (@cloudsommelier) for his help on this post!

As always, keep an eye on this blog site, our Snowflake-related Twitter feeds (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for more Cool Things About Snowflake and for updates on all the action and activities here at Snowflake Computing.

 

Support for Multiple Workloads

Hopefully you had a chance to read our previous top 10 posts. As promised, we continue the series with a deeper dive into another of the Top 10 Cool Features from Snowflake:

#3 Support for Multiple Workloads

With our unique Multi-cluster, shared data architecture, Snowflake can easily support multiple and disparate workloads. Workload contention is a common issue in traditional data warehouses, so it makes total sense to keep disparate workloads truly separate, and thereby avoid resource contention, rather than just saying we support “mixed” workloads.

In legacy data warehouse environments, we often found ourselves constrained by what we could run and when we could run it, for fear of resource contention, especially on the CPUs. In many cases it was impossible to refresh the data during the day because the highly parallelized batch ETL process, while tuned for maximum throughput, usually hogged all the CPUs while it ran. That meant virtually no reporting queries could get resources, so they would just hang. Likewise, a complex report with calculations and massive aggregations would cause normally fast, simple reports to languish. And there was no way you could let business users into the system to do exploratory queries, as those might also cause everything else to hang.

Because of the separation of compute and storage native to Snowflake’s architecture, as previously highlighted, you can easily spin up a set of compute nodes (we call them Virtual Warehouses) to run your ELT processes, and another set to support your BI report users, and a third set to support data scientists and data miners. In fact you can spin up (or down!) as many virtual warehouses as you need to execute all the workloads you have.

Virtual Warehouse
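As a rough illustration, setting up dedicated compute per workload is just a few statements (the warehouse names, sizes, and settings below are assumptions, not recommendations):

CREATE WAREHOUSE elt_wh WITH WAREHOUSE_SIZE = 'LARGE' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;
CREATE WAREHOUSE bi_wh WITH WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;
CREATE WAREHOUSE ds_wh WITH WAREHOUSE_SIZE = 'XLARGE' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;

// each process or user session then simply picks its own compute
USE WAREHOUSE bi_wh;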

So not only do all the virtual warehouses share the same data (ensuring consistent results), they are able to do so without being affected by operations launched in other virtual warehouses, because each uses completely separate resources. Hence there is no more resource contention!

With the Snowflake Elastic Data Warehouse, there is no more need to run the data loads at night just to avoid slowing down the reports. No more worry that one runaway query will impact the loads or other users. You can now run loads (e.g., real-time, micro-batch, etc.) at any time and thus provide your analysts and users with current data on a more frequent basis.

And even better – no special skills or secret configuration settings are required to make this work. It is the way Snowflake’s Data Warehouse as a Service (DWaaS) is built by design.

Nice!

For a quick look at how this works, check out this video.

Thanks to Saqib Mustafa for his help and suggestions on this post.

As always, keep an eye on this blog site, our Snowflake Twitter feeds (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for more Top 10 Cool Things About Snowflake and for updates on all the action and activities here at Snowflake Computing.

 

Top 10 Cool Things I Like About Snowflake

I have now been with Snowflake Computing for a little over two months (my how time flies). In that time, I have run the demo, spoken at several trade shows, and written a few blog posts. I have learned a ton about the product and what it means to be an Elastic Data Warehouse in the Cloud.

So for this post I am going to do a quick rundown of some of the coolest features I have learned about so far. 

#10 Persistent results sets available via History

Once you execute a query, the result set will persist for 24 hours (so you can go back and check your work). It may seem minor to some, but it sure is convenient to be able to pull up the results from a previous query without having to execute the query a second time. Saves on time and processing. Read more
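The History page is not the only way to get back to a persisted result; you can also query the prior result set directly from SQL using Snowflake's RESULT_SCAN and LAST_QUERY_ID functions. A minimal sketch:

SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));   // re-use the previous query's persisted result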

#9 Ability to connect with JDBC

Again, this seems like a no-brainer, but it is very important. I had no clear concept of how I would connect to a data warehouse in the cloud, so this was good news. After getting my favorite data modeling tool, Oracle SQL Developer Data Modeler (SDDM), installed on my new Mac, I was able to configure it to connect to my Snowflake demo schema using JDBC and reverse engineer the design.

So why is this cool? It means that whatever BI or ETL tool you use today, if it can talk over JDBC, you can connect it to Snowflake. Read more

#8 UNDROP

With UNDROP in Snowflake you can recover a table instantaneously with a single command:

UNDROP TABLE <tablename>

No need to reload last night’s backup to do the restore. No need to wait while all that data is pulled back in. It just happens!
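A quick sketch of what that looks like in practice (the table name is just an example):

DROP TABLE important_table;      // oops!
UNDROP TABLE important_table;    // instantly back, no restore from backup needed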

Now that is a huge time (and life) saver. Read more

#7 Fast Clone

Even cooler than UNDROP is the fast clone feature.

The Snowflake CLONE command can create a clone of a table, a schema, or an entire database almost instantly. It took me barely a minute to create a clone of a 2TB database without using additional storage! And I am not a DBA, let alone a “cloud” DBA.

This means you can create multiple copies of production data without incurring additional storage costs. No need to have separate test/dev data sets.

Hence why I think it is way cool! Read more

#6 JSON Support with SQL

During the first demo of Snowflake I attended (before I even applied for a job here), this one got my attention.

Using the knowledge and skills I already had with SQL, I could quickly learn to query JSON data, and join it to traditional tabular data in relational tables.

Wow – this looked like a great stepping stone into the world of “Big Data” without having to learn complex technologies like Hadoop, MapReduce, or Hive! Read more

Yes, I call that a very cool feature. Add to that the fact that the JSON documents are stored in a table and automatically optimized in the background for MPP and columnar access, and you get the ability to combine semi-structured and structured data in one location. For further details check out my detailed 2 part blog here and here.

#5 ANSI compliant SQL with Analytic Functions

Another key feature of Snowflake, required for it to be called a relational data warehouse, is of course the ability to write standard SQL. Even more important for data warehousing is access to sophisticated analytic and windowing functions (e.g., lead, lag, rank, stddev, etc.).

Well, Snowflake definitely has these. In fact, we support everything you would expect, including aggregation functions, nested virtual tables, subqueries, order by, and group by. This means it is fairly simple for your team to migrate from your existing data warehouse technology to Snowflake. Read more
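For example, a typical windowing query (the sales table and its columns are assumptions for illustration):

select
   region,
   sale_date,
   amount,
   rank() over (partition by region order by amount desc) as amount_rank,
   lag(amount) over (partition by region order by sale_date) as prior_amount
from sales;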

#4 Separation of Storage and Compute

The innovative, patent-pending, Multi-Cluster, Shared Data Architecture in Snowflake is beyond cool. The architecture consists of three layers: storage, compute, and cloud services. Each layer is decoupled from the others and independently scalable. This enables customers to scale resources as they are required, rather than pre-allocating resources for peak consumption. In my 30+ years working in IT, I have not seen anything like it. It is truly one of the advantages that comes from engineering the product, from the ground up, to take full advantage of the elasticity of the cloud. Read more

#3 Support for Multiple Workloads

With this unique architecture, Snowflake can easily support multiple disparate workloads. Because of the separation of compute and storage, you can easily spin up separate Virtual Warehouses of different sizes to run your ELT processes, support BI report users, data scientists, and data miners. And it makes total sense to be able to keep disparate workloads separate, to avoid resource contention, rather than just saying we support “mixed” workloads.

And even better – no special skills or secret configuration settings are required to make this work. It is the way Snowflake is built by design. Nice! Read more

#2 Automatic Encryption of Data

Security is a major concern for moving to the cloud. With Snowflake, your data is automatically encrypted by default. No setup, no configuration, no add-on costs for high security features.

It is just part of the service! To me that is a huge win. Read more

#1 Automatic Query Optimization. No Tuning!

As a long time data architect, and not a DBA, this is my favorite part of Snowflake. I do not have to worry about my query performance at all. It is all handled “auto-magically” via metadata and an optimization engine in our cloud services layer. I just model, load, and query the data.

So, no indexes, no need to figure out partitions and partition keys, no need to pre-shard any data for distribution, and no need to remember to update statistics.

This feature, to me, is one of the most important when it comes to making Snowflake a zero management Data Warehouse as a Service offering. Read more

Well, that is the short list of my top 10 favorite features in Snowflake. Keep a look out for future posts in the coming weeks, to provide details on these and other key features of the Snowflake Elastic Data Warehouse.

Now check out this short intro video to Snowflake!

If you want to learn more about Snowflake, sign up for one of our frequent webinars, or just drop me a line at kent.graziano@snowflake.net and I will hook you up!

P.S. Keep an eye on my Twitter feed (@kentgraziano) and the Snowflake feed (@SnowflakeDB) for updates on all the action and activities here at Snowflake Computing. Watch for #BuiltForTheCloud and #DWaaS.

Part II: Making Schema-on-Read a Reality

In the first article of this series, I discussed the Snowflake data type VARIANT, showed a simple example of how to load a VARIANT column in a table with a JSON document, and then how easy it is to query data directly from that data type. In this post I will show you how to access an array of data within the JSON document and how we handle nested arrays. Then finally I will give you an example of doing an aggregation using data in the JSON structure and how simple it is to filter your query results by referring to values within an array.
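If you want to follow along, here is a minimal recreation of the Part I setup. The table name (json_demo) and VARIANT column (v) match the examples below; the actual JSON values, including fullName, are assumptions pieced together from the snippets shown in this post.

CREATE OR REPLACE TABLE json_demo (v VARIANT);

INSERT INTO json_demo
SELECT PARSE_JSON('{
  "fullName": "John Doe",
  "children": [
     { "name": "Jayden", "gender": "Male", "age": "10" },
     { "name": "Emma", "gender": "Female", "age": "8" },
     { "name": "Madelyn", "gender": "Female", "age": "6" }
  ],
  "citiesLived": [
     { "cityName": "London", "yearsLived": [ "1989", "1993", "1998", "2002" ] },
     { "cityName": "San Francisco", "yearsLived": [ "1990", "1993", "1998", "2008" ] },
     { "cityName": "Portland", "yearsLived": [ "1993", "1998", "2003", "2005" ] },
     { "cityName": "Austin", "yearsLived": [ "1973", "1998", "2001", "2005" ] }
  ]
}');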

Handling Arrays of Data

One of the features of JSON is the ability to specify and embed an array of data within the document. In my example, one such array is children:

    "children": [
         { "name": "Jayden", "gender": "Male", "age": "10" },
         { "name": "Emma", "gender": "Female", "age": "8" },
         { "name": "Madelyn", "gender": "Female", "age": "6" }
     ]

You will notice there are effectively 3 rows in the array and each row has 3 sub-columns – name, gender, and age. Each of those rows constitutes the value of that array entry which includes all the sub-column labels and data (remember that for later). So how do you know how many rows there are if you do not have access to the raw data?

Like this:

select array_size(v:children) from json_demo;

The function array_size figures it out for us. To pull the data for each “row” in the array, we use the dot notation from before, but now with the added specification for the row number of the array (in the [] brackets):

select v:children[0].name from json_demo
union all
select v:children[1].name from json_demo
union all
select v:children[2].name from json_demo;

Children's names

So this is interesting, but I really do not want to write UNION ALL SQL to traverse the entire array (in which case I would also need to know how many values are in the array, right?).

We solve that problem with another new extended SQL function called FLATTEN. FLATTEN takes an array and returns a row for each element in the array. With that you can select all the data in the array as though they were in table rows (so no need to figure out how many entries there are).

Instead of doing the set of UNION ALLs, we add the FLATTEN into the FROM clause and give it a table alias:

select f.value:name
from json_demo, table(flatten(v:children)) f;

This syntax allows us to create an inline virtual table in the FROM clause.

In the SELECT, we can then reference it like a table. Notice the notation f.value:name.

f = the alias for the virtual table from the children array

value = the contents of the element returned by the FLATTEN function

name = the label of the specific sub-column we want to extract from the value

The results, in this case, are the same as the SELECT with the UNIONs but the output column header reflects the different syntax (since I have not yet added any column aliases).

Flatten Children 1

Now, if another element is added to the array (i.e., a 4th child), the SQL will not have to be changed. FLATTEN allows us to determine the structure and content of the array on-the-fly! This makes the SQL resilient to changes in the JSON document.

With this in hand, we can of course get all the array sub-columns and format them just like a relational table:

select
   f.value:name::string as child_name,
   f.value:gender::string as child_gender,
   f.value:age::string as child_age
from json_demo, table(flatten(v:children)) f;

Flatten Format Children

Putting this all together, I can write a query to get the parent’s name and all the children like this:

select
   v:fullName::string as parent_name,
   f.value:name::string as child_name,
   f.value:gender::string  as child_gender,
   f.value:age::string as child_age
from json_demo, table(flatten(v:children)) f;

Which results in this output:

Parent and Children

If I just want a quick count of children by parent, I do not need FLATTEN but can instead refer back to array_size:

select
   v:fullName::string as Parent_Name,
   array_size(v:children) as Number_of_Children
from json_demo;

Count Children

Handling Multiple Arrays

You may recall there are multiple arrays in our sample JSON string. I can pull from several arrays at once with no problem:

select
   v:fullName::string as Parent_Name,
   array_size(v:citiesLived) as Cities_lived_in,
   array_size(v:children) as Number_of_Children
from json_demo;

Parent, Children, City

What about an Array within an Array?

Snowflake can handle that too. From our sample data we can see yearsLived is an array nested inside the array described by citiesLived:

"citiesLived": [
 { "cityName": "London",
   "yearsLived": [ "1989", "1993", "1998", "2002" ]
 },
 { "cityName": "San Francisco",
   "yearsLived": [ "1990", "1993", "1998", "2008" ]
 },
 { "cityName": "Portland",
   "yearsLived<": [ "1993", "1998", "2003", "2005" ]
 },
 { "cityName": "Austin",
   "yearsLived": [ "1973", "1998", "2001", "2005" ]
 }
 ]

To pull that data out, we add a second FLATTEN clause that transforms the yearsLived array within the FLATTENed citiesLived array.

select
  tf.value:cityName::string as city_name,
  yl.value::string as year_lived
from json_demo,
     table(flatten(v:citiesLived)) tf,
     table(flatten(tf.value:yearsLived)) yl;

In this case the 2nd FLATTEN (alias “yl”) is transforming (really pivoting) the yearsLived array for each value returned from the 1st FLATTEN of the citiesLived array (“tf”).

The results output shows Year Lived by City:

Year Lived by City

Like my earlier example, I can then augment this result by adding in the name too (so I know who lived where):

select
  v:fullName::string as parent_name,
  tf.value:cityName::string as city_name,
  yl.value::string as year_lived
from json_demo,
     table(flatten(v:citiesLived)) tf,
     table(flatten(tf.value:yearsLived)) yl;

Name and YearLived

Aggregations?

Yup, we can even aggregate data within semi-structured data. (We would not be much of a data warehouse service if we couldn’t, right?)

So, just like ANSI SQL, we can do a count(*) and a group by:

select
   tf.value:cityName::string as city_name,
   count(*) as years_lived
from json_demo,
     table(flatten(v:citiesLived)) tf,
     table(flatten(tf.value:yearsLived)) yl
group by 1;

And the results:

Count Years

Filtering?

Of course! Just add a WHERE clause.

select
  tf.value:cityName::string as city_name,
  count(*) as years_lived
from json_demo,
     table(flatten(v:citiesLived)) tf,
     table(flatten(tf.value:yearsLived)) yl
where city_name = 'Portland'
group by 1;

Where City

To simplify things, notice I used the column alias city_name in the predicate, but you can use the full sub-column specification tf.value:cityName as well.
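For completeness, here is the same filter written with the full sub-column specification instead of the alias:

select
  tf.value:cityName::string as city_name,
  count(*) as years_lived
from json_demo,
     table(flatten(v:citiesLived)) tf,
     table(flatten(tf.value:yearsLived)) yl
where tf.value:cityName::string = 'Portland'
group by 1;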

Schema-on-Read is a Reality

I could go on, but by now I think you can see we have made it very easy to load and extract information from semi-structured data using Snowflake. We added a brand new data type, VARIANT, that lives in a relational table structure in a relational database, without the need to analyze the structure ahead of time, design appropriate database tables, and then shred the data into that predefined schema. Then I showed you some easy-to-learn extensions to ANSI-standard SQL for accessing that data in a very flexible and resilient manner.

With these features, Snowflake gives you the real ability to quickly and easily load semi-structured data into a relational data warehouse and make it available for immediate analysis.

To see this feature in action, check out this short video.

Be sure to keep an eye on this blog or follow us on Twitter (@snowflakedb and @kentgraziano) for all the news and happenings here at Snowflake.

P.S. If you don’t already have a Snowflake account, you can sign up for a self-service account here and get a jumpstart with $400 in free credits!