What makes Snowflake a data warehouse?

One of the most common questions I get when speaking to people about Snowflake is: “Why do you call it a data warehouse and not a database?” This is a very reasonable question given some of the characteristics of Snowflake.

At Snowflake, we describe ourselves, in part, as a full relational database management system (RDBMS) built for the cloud. We are ACID compliant and we support standard SQL. Sounds like a database to me, too. Let’s take a closer look just to be sure.

What is a database?

A database is a collection of information organized to be easily accessed, managed and updated. While there are many types of databases available today, the most common is an RDBMS. But when most folks say “database”, they usually mean a traditional RDBMS that handles Online Transaction Processing (OLTP).

So, what are some of the defining characteristics of an OLTP database?

  • Designed for rapid storage and retrieval of small sets of current data records in support of transactions and interactions within an enterprise.
  • Data is organized in tables and columns, allowing users access via structured query language (SQL).
  • Handles quick, real-time activity such as entering a customer name, recording a sale and recording all accounting activity of that sale.
  • Works well for basic operational reporting of a limited number of records. Analytic reporting is relegated to simple, static reports often driven by IT.

What is a data warehouse?

Some of the defining characteristics of a data warehouse are:

  • A database designed to store and process large volumes of current and historical data collected from multiple sources inside and outside the enterprise for deep analysis.
  • Organizes data into tables and columns, and allows users access via SQL.
  • Optimized for loading, integrating and analyzing very large amounts of data.
  • Designed to support descriptive, diagnostic, predictive and prescriptive analytic workloads.

Snowflake definitely includes the overlapping characteristics of both a database and a data warehouse: ACID compliance, support for standard SQL, etc. But Snowflake also embodies all of the defining characteristics of a data warehouse.

One of the key differentiators of Snowflake from other solutions is that it is specifically designed for data warehousing and high-speed analytic processing. Rather than a generalized SQL database that has been “tuned” or adapted to handle these types of workloads, Snowflake was built from the ground up for the cloud to optimize loading, processing, and query performance for very large volumes of data. Therefore, hands down, Snowflake is a data warehouse.

So, why do we still need a specialized data warehouse engine?

As OLTP databases have been able to scale higher and innovations like in-memory databases have emerged, some organizations have questioned whether they still need a separate technology or specialized system for reporting and analytics. The answer, again, requires us to look at the basics: What benefits emerge from storing and analyzing data in a separate system?

  1. It eases the burden of reporting from transactional systems by removing the contention for limited and expensive resources.
  2. It produces more business-friendly data results by allowing the data to be restructured to a more suitable format.
  3. It provides access to a wider array of reports more quickly because all the resources in the data warehouse are dedicated to reporting and analysis.
  4. It integrates valuable data from across the enterprise for richer insight, something that can’t (and shouldn’t) be done in an OLTP system.

For more information on how you can up your data warehousing game with a modern, built-for-the-cloud approach, check out some of our free resources such as our ebook The Data Warehouse: The Engine That Drives Analytics. We would love to help you on your journey to the cloud so keep an eye on this blog or follow us on Twitter (@snowflakedb and @kentgraziano) to keep up with all the news and happenings here at Snowflake.

Automatic Query Optimization. No Tuning!

Hopefully you had a chance to read all our previous top 10 posts. We now conclude the series with a final deep dive into my #1 Cool Feature from Snowflake:

#1 Automatic Query Optimization. No Tuning!

As a long time data architect, and not a DBA, this is my favorite part of Snowflake. I do not have to worry about query performance at all. It is all handled “auto-magically” via a dynamic optimization engine in our cloud services layer. I just model, load, and query the data.

Cloud Services

Snowflake is an MPP, columnar store and is therefore designed for high-speed analytic queries by definition. Our cloud services layer does all the query planning and optimization based on data profiles that are collected automatically as the data is loaded. It automatically collects and maintains the required statistics to determine how to most effectively distribute the data and queries across the available compute nodes.

So, no indexes, no need to figure out partitions and partition keys, no need to pre-shard any data for even distribution, and no need to remember to update statistics.
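
To make that concrete, here is a minimal sketch of a load-and-query cycle with nothing to tune (the table, stage, and column names are hypothetical):

-- create a table; note there are no indexes, distribution keys, or partition clauses to define
CREATE OR REPLACE TABLE sales (
  sale_id     NUMBER,
  customer_id NUMBER,
  sale_date   DATE,
  amount      NUMBER(12,2)
);

-- load data from a stage; data profiles and statistics are gathered automatically during the load
COPY INTO sales FROM @my_stage/sales/ FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

-- query immediately; the cloud services layer plans and optimizes the query for you
SELECT sale_date, SUM(amount) AS daily_total
FROM sales
GROUP BY sale_date
ORDER BY sale_date;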

Another key point to remember is that not only does this dynamic optimization auto-tune queries that work against traditional structured data, it also works against semi-structured data (like JSON, Avro, or XML). When semi-structured data is loaded into our VARIANT data type, as mentioned in an earlier blog post, the data is profiled and columnarized behind the scenes. With this approach, querying the data (using our JSON or XML extensions) is optimized just like for structured relational data, without having to spend the time and effort to split it out into relational tables and columns. Because of this unique approach to handling semi-structured data, Snowflake not only supports joining structured and semi-structured data with SQL, it optimizes those joins too.
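
As a quick illustration (the table names and JSON attributes below are invented for this sketch), querying a VARIANT column and joining it to a relational table looks just like ordinary SQL:

-- raw_events.payload is a VARIANT column holding JSON documents
SELECT
  e.payload:device.type::STRING AS device_type,
  e.payload:event_ts::TIMESTAMP AS event_ts,
  c.customer_name
FROM raw_events e
JOIN customers c
  ON c.customer_id = e.payload:customer_id::NUMBER
WHERE e.payload:event_type::STRING = 'purchase';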

Let go Luke – Trust the Force!

As you can imagine, there are some folks out there who do not like change and want us to add more knobs so they can manually tune their queries.

Why? Well, they say that they know their workload better than Snowflake and therefore can optimize it better than Snowflake software. Others say that they’ve already done all the work to optimize another platform for their workload and don’t want to “waste” all that time and effort.

Since I come from a traditional data warehouse background, and know many expert data warehouse architects who are world class at tuning specific databases for specific types of queries, I definitely understand and can sympathize with this point of view.

But the world of data has changed! The way we build, deploy, and use data (of all kinds) is continually evolving, and we need to evolve our approach to be successful.

Because of that, there are a few things to consider with regard to tuning:

  1. Your workloads will change.
  2. The sheer number of different workloads is going to grow faster than your bandwidth to manually tune (so someone has to prioritize who wins).
  3. What happens if your expert query tuner decides to sell real estate instead?

Relying on yourself, and other expert DBA tuning staff, is ultimately going to result in a performance tuning bottleneck, which will in turn lead to increased contention for resources in the data warehouse. No matter how good you are, in the end you will not be able to keep up. So why not use a data warehouse service to handle that work for you so you can focus on getting the business people access to all this data in a form they can use to make better business decisions? In short, people are the hardest thing to scale in any data analytics solution. Thus any solution that allows you to scale without bottlenecks on people is a winner.

And if a query is still not fast enough, with Snowflake you always have the option to resize your virtual warehouse to add compute resources, as we described in the post about elasticity. Without the benefit of a dynamic optimization engine, you could not change the compute configuration on the fly and see immediate improvements. In many other environments, if you changed the number of compute nodes you would have to redistribute the data, or do some other manual tuning, to get the improved performance. That would not be elastic at all.
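
Resizing is itself a one-line operation; a sketch, assuming a virtual warehouse named my_wh:

-- scale the warehouse up for a heavy query, then size it back down when done
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'LARGE';
-- ... run the expensive query ...
ALTER WAREHOUSE my_wh SET WAREHOUSE_SIZE = 'SMALL';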

For all these reasons, our patent-pending dynamic optimization is, to me, one of the most important features when it comes to making the Snowflake Elastic Data Warehouse a true SaaS offering and enabling agile, high-speed analytics at any scale. As part of our vision, we developed an offering that helps customers concentrate on understanding and deriving value from their data rather than spending time refining, managing, and optimizing the environment. Business users and data scientists can easily analyze their data by connecting natively to Snowflake through various tools, and not have to worry about tuning and query optimization.

As I said before, as an architect and data warehouse designer, I really like this model!

End of this trail…more to come

Well, that is the last of my Top 10 list of cool features in Snowflake.  I want to thank Saqib Mustafa for all his help collaborating on this series of posts.

There are of course many other cool features in Snowflake, and more to come in the future. So keep your eyes on this blog for future posts with more details on the Snowflake Elastic Data Warehouse, the cloud-native data warehouse as a service.

As always, keep an eye on our Snowflake-related Twitter feeds (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for continuing updates on all the action and activities here at Snowflake Computing.

 

Snowflake Fast Clone

Hopefully you had a chance to read our previous posts: Query Result Sets available in history, Ability to connect with JDBC, and Undrop a table, schema, or database. As promised in the original Top 10 post, we continue the series with a deeper dive into another of the Top 10 Cool Features from Snowflake:

#7 Fast Clone

Even cooler than UNDROP is our fast clone feature.

Have you ever had to wait hours, days, maybe weeks to spin up a copy of your production data warehouse so you could have a test or development environment? And had to pay extra for that test or development environment to be able to hold all the replicated data? Or maybe you have some new data scientists who just want to play around with the data, but they really need their own copy?

With the simple Snowflake CLONE command, our customers can create multiple copies of the data tables, schemas, and databases, without replicating the data itself. This gives our customers the ability to almost instantly make the data available to use for multiple user groups, without the additional cost (or time) of actually replicating the data.

Magic??

Almost –

Magic with metadata.

Based on our unique solution, Snowflake users are able to clone their tables, schemas, or databases without creating additional copies. Snowflake stores data in files that are immutable and encrypted as part of our architecture. Our cloud services layer, with its metadata repository, records information about the files being stored, their locations, and a reference to a certain version of the data. This metadata is also kept encrypted. In addition, when any data changes, the metadata repository is automatically updated to point to the changed data. All of this is performed in the background by the software, without any involvement from the user, and the metadata repository still retains records for all versions of the data set.

How Cloning works

Because of the information in the metadata store, a user can quickly create a clone of a table. All the user has to do is submit the CLONE command, which can be issued at any time:

CREATE OR REPLACE TABLE MyTable_V2 CLONE MyTable

As a result of the CLONE command, the system simply creates a new entry in the metadata store to keep track of the new clone.
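
The same pattern works above the table level as well. For example (the database and schema names here are made up), a development copy of an entire production database is a single statement:

-- clone a whole database for dev/test; only new metadata entries are written, no data is copied
CREATE DATABASE dev_db CLONE prod_db;

-- or clone just one schema
CREATE SCHEMA analytics_sandbox CLONE prod_db.analytics;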

Time Traveling Clones! Oh My!

In addition to simple cloning of objects, you can blend cloning with Snowflake time travel to clone tables, schemas, or even databases at a point in time in the past AT or BEFORE a specific timestamp. Here is an example:

CREATE SCHEMA mytestschema_clone_restore CLONE testschema BEFORE (TIMESTAMP => TO_TIMESTAMP(40*365*86400));

Once executed, this command will create a clone of the entire schema (tables, views, etc.) as it existed before the specified timestamp.

Hmm…time traveling clones…sounds like an episode of the X-Files (also cool).

Because Snowflake maintains a history of the queries performed and identifies each one by a unique ID, we can also create a clone AT or BEFORE the point in time when a specific statement, referenced by its query ID, was executed. This would allow you to, for example, run a revised set of scripts against an older data set and then compare the results to the current data set.
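
A sketch of what that looks like (the table name and query ID below are placeholders; you would copy the real ID from the History page):

-- clone the table as it existed just before the statement with this query ID was executed
CREATE TABLE mytable_before_fix CLONE mytable
  BEFORE (STATEMENT => '8e5d0ca9-005e-44e6-b858-a8f5b37c5726');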

In addition, once created, cloned objects are independent of each other. Despite being independent, they impose no additional storage requirements and thus no additional charge (unless you add or modify records), since the clones share the underlying files. Snowflake therefore allows its customers to clone at multiple levels (tables, schemas with their file formats, views, sequences, and stages, and entire databases), and to do so at points in time. And because the clones are independent, updates to one are not visible in the others.

Very helpful for experimentation and data exploration!

Cloning Example

As an example of what it looks like in the Snowflake UI, here is a snapshot of one of my demo databases with Twitter data. It is about 2 TB of data.

Twitter db

Now here is me launching the Clone via our Web UI at 3:02:49 PM:

Start Clone

And here is the result showing the cloned db created at 3:03:55 PM. Barely a minute to create a clone of a 2TB database with 10 tables!

Clone Completed

And I am not a DBA let alone a “cloud” DBA. This is all part of Snowflake being a Data Warehouse as a Service (DWaaS). Another reason why I think it is way cool!

Why is this hard to do in traditional systems?

In traditional RDBMS systems, if you wanted to clone an existing data warehouse and give a different set of users access to the environment, you typically had to create a whole new, separate deployment of the environment. In Snowflake, as I have just shown, you can do this with ease with just a few commands (one of the many benefits of having written Snowflake from scratch for the cloud). Moreover, while most traditional data warehouse systems allow you to create snapshots of the data, this generally results in consumption of more storage (which of course costs more $$).

All these reasons are why Fast Cloning made my Top 10 list of really cool features in the Snowflake Elastic Data Warehouse.

As always, keep an eye on this blog site, our Snowflake Twitter feeds (@SnowflakeDB), (@kentgraziano),  (@cloudsommelier), and #ElasticDW for more Top 10 Cool Things About Snowflake and for updates on all the action and activities here at Snowflake Computing.

Special thanks to Ashish Motivala from Snowflake Engineering for helping us with some of the more technical details on how this works so well.

Kent Graziano and Saqib Mustafa

Connect to Snowflake with JDBC

Thank you for reading our previous post #10 Query Result Sets available to users via history.  

As promised in my original post, here is a deeper dive into another one of the top 10 cool features from Snowflake:

#9 Ability to connect with JDBC

This seems like a no-brainer, but it is very important. If you’re interested in connecting any custom or packaged Java-based applications to Snowflake, JDBC is what you need. JDBC technology lets you access information in SQL databases using standard SQL queries.

So why is this cool? Because all of the modern applications written in Java can take advantage of our elastic cloud based data warehouse through a JDBC connection.

And we have plenty of customers doing that today with industry leading tools.

JDBC Apps

You can easily connect various ETL, BI and visualization tools to Snowflake using the JDBC driver, just like they connect with many legacy databases.

Simple to Use

You can download and install the Snowflake JDBC driver through our user interface. To do that, log in to your Snowflake account, go to the online help, and select the JDBC driver under the Downloads menu. Click on the link provided for details on setup and configuration.

Once you have the driver installed, you have several ways you can take advantage of it. If you have, or are building, a custom Java application, you can connect that app directly to Snowflake quite easily. This example shows how simple it is to connect and query data in Snowflake with a Java program, using the JDBC driver for Snowflake.

/*
 * Copyright (c) 2012, 2013 Snowflake Computing Inc. All rights reserved.
 */
package com.snowflake.client;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.ResultSetMetaData;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Properties;
public class SnowflakeDriverExample
{
  public static void main(String[] args) throws Exception
  {
    // get connection
    System.out.println("Create JDBC connection");
    Connection connection = getConnection();
    System.out.println("Done creating JDBC connectionn");
    // create statement
    System.out.println("Create JDBC statement");
    Statement statement = connection.createStatement();
    System.out.println("Done creating JDBC statementn");
    // create a table
    System.out.println("Create demo table");
    statement.executeUpdate("create or replace table demo(C1 STRING)");
    statement.close();
    System.out.println("Done creating demo tablen");
    // insert a row
    System.out.println("Insert 'hello world'");
    statement.executeUpdate("insert into demo values ('hello world')");
    statement.close();
    System.out.println("Done inserting 'hello world'n");
    // query the data
    System.out.println("Query demo");
    ResultSet resultSet = statement.executeQuery("SELECT * FROM demo");
    System.out.println("Metadata:");
    System.out.println("================================");
    // fetch metadata
    ResultSetMetaData resultSetMetaData = resultSet.getMetaData();
    System.out.println("Number of columns=" +
                       resultSetMetaData.getColumnCount());
    for (int colIdx = 0; colIdx < resultSetMetaData.getColumnCount();
                         colIdx++)
    {
      System.out.println("Column " + colIdx + ": type=" +
                         resultSetMetaData.getColumnTypeName(colIdx+1));
    }
    // fetch data
    System.out.println("nData:");
    System.out.println("================================");
    int rowIdx = 0;
    while(resultSet.next())
    {
      System.out.println("row " + rowIdx + ", column 0: " +
                         resultSet.getString(1));
      rowIdx++; // advance the counter so each printed row is numbered correctly
    }
    // release resources
    resultSet.close();
    statement.close();
    connection.close();
  }
   private static Connection getConnection()
          throws SQLException
  {
    try
    {
      Class.forName("com.snowflake.client.jdbc.SnowflakeDriver");
    }
    catch (ClassNotFoundException ex)
    {
     System.err.println("Driver not found");
    }
    // build connection properties
    Properties properties = new Properties();
    properties.put("user", "");     // replace "" with your username
    properties.put("password", ""); // replace "" with your password
    properties.put("account", "");  // replace "" with your account name
    properties.put("db", "");       // replace "" with target database name
    properties.put("schema", "");   // replace "" with target schema name
    //properties.put("tracing", "on");

    // create a new connection
    String connectStr = System.getenv("SF_JDBC_CONNECT_STRING");
    // use the default connection string if it is not set in environment
    if(connectStr == null)
    {
     connectStr = "jdbc:snowflake://accountName.snowflakecomputing.com"; // replace accountName with your account name
    }
    return DriverManager.getConnection(connectStr, properties);
  }
}

If, on the other hand, you have a packaged application to connect, you can do that just as easily. For example, you can connect to Java based applications like Oracle SQL Developer Data Modeler (SDDM), which can be used to reverse engineer the design of whatever tables and views you have been granted access to in Snowflake. For further details on setting up SDDM to talk to Snowflake, see this detailed blog. (NB: The specifics for configuring each tool will vary but this blog will give you an example of what to look for)

 

Continue to keep an eye on this blog site, our Snowflake Twitter feed (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for more Top 10 Cool Things About Snowflake as well as all the other happenings and news at Snowflake Computing.

Kent Graziano and Saqib Mustafa

Snowflake Query Result Sets Available to Users via History

As promised in the previous post from Kent Graziano, here is a deeper dive into one of the top 10 cool features from Snowflake:

#10 Result sets available via History

There are a lot of times when you want to make a small change to a large query and see the effect of that change quickly, without waiting for the original query to run again. This is hard in most systems, because comparing against the previous result means rerunning the previous query, using up resources and time. Our solution allows users to view the result sets of queries that were executed previously, via history. One benefit is that if a user has already executed a complex query that took some time to run, they don’t have to run it again to access its results; they can just go back to the history and access the result set. This is also beneficial when working on a development project using the data warehouse: developers can use the result set history to compare the effects of changes to the query or to the data set, without running the previous queries again.

How to access the Result Set History

Once you execute a query, the result set will persist in the history of the user for 24 hours. This even includes queries executed through interfaces outside the Snowflake UI, like BI tools, JDBC, ODBC, etc.

To get to these results, you can go to the History page in the Snowflake web UI:

History

Find the query you want to recall, then click on the hyperlinked Query ID. This will let you see the details about the query, plus the data it pulled.

Persistent Result Set

Easy!

How does it work?

Every time a query is run in Snowflake, it is assigned an ID. This ID, along with the resulting data set, is typically stored in the same place as the metadata repository. All data is encrypted at rest and in transit, and storing the result happens in parallel with sending it back to the user. For a large result set, the results may be stored in S3; because of the elasticity of S3 storage, this approach allows Snowflake to retain a result set of any size. For a query that ran successfully, the result is kept for 24 hours, after which results are cleaned up on a FIFO basis.
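
If you prefer SQL to the web UI, current versions of Snowflake also expose a prior result set directly through the RESULT_SCAN table function; here is a sketch (the query ID is a placeholder):

-- re-use the cached results of the immediately preceding query in this session
SELECT * FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()));

-- or reference any query from the last 24 hours by its ID
SELECT * FROM TABLE(RESULT_SCAN('8e5d0ca9-005e-44e6-b858-a8f5b37c5726'));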

What does this mean for the user?

It may seem like a minor addition to some, but it is very convenient to pull up a result set from a previous query without having to execute the query a second time. This saves time and processing (and therefore saves $$$), another way in which Snowflake provides significant value over traditional data warehouse solutions by slashing costs and eliminating complexity.

P.S. Keep an eye on this blog site, our Snowflake Twitter feed (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for more Top 10 Cool Things About Snowflake.

– Kent Graziano and Saqib Mustafa

 

Top 10 Cool Things I Like About Snowflake

I have now been with Snowflake Computing for a little over two months (my how time flies). In that time, I have run the demo, spoken at several trade shows, and written a few blogs posts. I have learned a ton about the product and what it means to be an Elastic Data Warehouse in the Cloud.

So for this post I am going to do a quick rundown of some of the coolest features I have learned about so far. 

#10 Persistent results sets available via History

Once you execute a query, the result set will persist for 24 hours (so you can go back and check your work). It may seem minor to some, but it sure is convenient to be able to pull up the results from a previous query without having to execute the query a second time. Saves on time and processing. Read more

#9 Ability to connect with JDBC

Again, this seems like a no-brainer, but it is very important. I had no clear concept of how I would connect to a data warehouse in the cloud, so this was good news. After getting my favorite data modeling tool, Oracle SQL Developer Data Modeler (SDDM), installed on my new Mac, I was able to configure it to connect to my Snowflake demo schema using JDBC and reverse engineer the design.

So why is this cool? It means that whatever BI or ETL tool you use today, if it can talk over JDBC, you can connect it to Snowflake. Read more

#8 UNDROP

With UNDROP in Snowflake you can recover a table instantaneously with a single command:

UNDROP TABLE <tablename>

No need to reload last night’s backup to do the restore. No need to wait while all that data is pulled back in. It just happens!
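
For example (the table name is made up), recovering from an accidental drop is just two statements:

DROP TABLE orders;     -- oops, that was the wrong table
UNDROP TABLE orders;   -- restored instantly from the retained metadata and underlying files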

Now that is a huge time (and life) saver. Read more

#7 Fast Clone

Even cooler than UNDROP is the fast clone feature.

The Snowflake CLONE command can create a clone of a table, a schema, or an entire database almost instantly. It took me barely a minute to create a clone of a 2TB database without using additional storage! And I am not a DBA, let alone a “cloud” DBA.

This means you can create multiple copies of production data without incurring additional storage costs. No need to have separate test/dev data sets.

Hence why I think it is way cool! Read more

#6 JSON Support with SQL

During the first demo of Snowflake I attended (before I even applied for a job here), this one got my attention.

Using the knowledge and skills I already had with SQL, I could quickly learn to query JSON data, and join it to traditional tabular data in relational tables.

Wow – this looked like a great stepping stone into the world of “Big Data” without having to learn complex technologies like Hadoop, MapReduce, or Hive! Read more

Yes, I call that a very cool feature, and the fact that the JSON documents are stored in a table and automatically optimized in the background for MPP, columnar access makes it even better. This gives you the ability to combine semi-structured and structured data in one location. For further details, check out my detailed two-part blog here and here.
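
To give a flavor of the SQL involved (the table and JSON attribute names are invented for this sketch), path notation and LATERAL FLATTEN let you treat nested JSON like rows and columns:

-- explode the items array inside each JSON order document into relational rows
SELECT
  o.doc:order_id::NUMBER AS order_id,
  f.value:sku::STRING    AS sku,
  f.value:qty::NUMBER    AS qty
FROM json_orders o,
     LATERAL FLATTEN(input => o.doc:items) f;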

#5 ANSI compliant SQL with Analytic Functions

Another key feature of Snowflake, required for it to be called a relational data warehouse, is of course the ability to write standard SQL. Even more important for data warehousing is access to sophisticated analytic and windowing functions (e.g., lead, lag, rank, stddev, etc.).

Well, Snowflake definitely has these. In fact, we support everything you would expect, including aggregation functions, nested virtual tables, subqueries, ORDER BY, and GROUP BY. This means it is fairly simple for your team to migrate from your existing data warehouse technology to Snowflake. Read more
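
For instance (the table and column names are illustrative), the usual windowing constructs work exactly as you would expect:

-- rank each customer's orders by amount and compare each order to the prior one
SELECT
  customer_id,
  order_date,
  amount,
  RANK() OVER (PARTITION BY customer_id ORDER BY amount DESC)     AS amount_rank,
  LAG(amount) OVER (PARTITION BY customer_id ORDER BY order_date) AS prev_amount
FROM orders;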

#4 Separation of Storage and Compute

The innovative, patent-pending Multi-Cluster, Shared Data Architecture in Snowflake is beyond cool. The architecture consists of three layers: storage, compute, and cloud services. Each layer is decoupled from the others and is independently scalable. This enables customers to scale resources as they are required, rather than pre-allocating resources for peak consumption. In my 30+ years working in IT, I have not seen anything like it. It is truly one of the advantages that comes from engineering the product, from the ground up, to take full advantage of the elasticity of the cloud. Read more

#3 Support for Multiple Workloads

With this unique architecture, Snowflake can easily support multiple disparate workloads. Because of the separation of compute and storage, you can easily spin up separate Virtual Warehouses of different sizes to run your ELT processes, support BI report users, data scientists, and data miners. And it makes total sense to be able to keep disparate workloads separate, to avoid resource contention, rather than just saying we support “mixed” workloads.
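
A sketch of how that might look (the warehouse names, sizes, and settings are just examples):

-- a larger warehouse dedicated to nightly ELT jobs
CREATE WAREHOUSE elt_wh WITH WAREHOUSE_SIZE = 'LARGE' AUTO_SUSPEND = 300 AUTO_RESUME = TRUE;

-- a smaller, always-responsive warehouse for BI dashboard users
CREATE WAREHOUSE bi_wh WITH WAREHOUSE_SIZE = 'MEDIUM' AUTO_SUSPEND = 60 AUTO_RESUME = TRUE;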

And even better – no special skills or secret configuration settings are required to make this work. It is the way Snowflake is built by design. Nice! Read more

#2 Automatic Encryption of Data

Security is a major concern for moving to the cloud. With Snowflake, your data is automatically encrypted by default. No setup, no configuration, no add-on costs for high security features.

It is just part of the service! To me that is a huge win. Read more

#1 Automatic Query Optimization. No Tuning!

As a long time data architect, and not a DBA, this is my favorite part of Snowflake. I do not have to worry about my query performance at all. It is all handled “auto-magically” via meta data and an optimization engine in our cloud services layer. I just model, load, and query the data.

So, no indexes, no need to figure out partitions and partition keys, no need to pre-shard any data for distribution, and no need to remember to update statistics.

This feature, to me, is one of the most important when it comes to making Snowflake a zero management Data Warehouse as a Service offering. Read more

Well, that is the short list of my top 10 favorite features in Snowflake. Keep a look out for future posts in the coming weeks, to provide details on these and other key features of the Snowflake Elastic Data Warehouse.

If you want to learn more about Snowflake, sign up for one of our frequent webinars, or just drop me a line at kent.graziano@snowflake.net and I will hook you up!

P.S. Keep an eye on my Twitter feed (@kentgraziano) and the Snowflake feed (@SnowflakeDB) for updates on all the action and activities here at Snowflake Computing. Watch for #BuiltForTheCloud and #DWaaS.