SnowSQL Part II: Interactivity

In our previous blog, we introduced you to SnowSQL – Snowflake’s new modern command line tool that we built from scratch. Today we are going to talk about how SnowSQL lets you interactively query your data.

Making SnowSQL Yours

While SnowSQL works out of the box after installation, you can customize it to take full advantage of its features.

For example, by defining connections in the config file, I can preset my environment (account, warehouse, database, etc…) and don’t have to worry about connection strings or exposed passwords.

In order to keep track of which environment I am in, I customized my prompt to show only the information I need. (At this stage I am not concerned with the warehouse.)

Default prompt:

[user]#[warehouse]@[database].[schema]>


My prompt:

[screenshot: the customized prompt]

By the same token, I set output_format to fancy_grid, making it much easier to read the output on my screen.

[screenshots: default psql-style output vs. fancy_grid output]

Although you could do without a customizable prompt (as with most other SQL clients), it is nice to be able to tailor things to your liking. This, combined with progress bars for the PUT and GET commands and a few other customization options, makes for a slick user experience, unlike other paleolithic SQL command line tools.
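
Pulling these customizations together, here is a sketch of what such a config file might look like, with a named connection, a custom prompt, and fancy_grid output (the connection values are placeholders, and the exact option names should be checked against the SnowSQL documentation):

[connections.my_conn]
accountname = myaccount
username = myuser
password = ********
dbname = analytics
schemaname = public
warehousename = my_wh

[options]
prompt_format = [user]@[database].[schema]>
output_format = fancy_grid

With a connection defined this way, something like snowsql -c my_conn starts a session with the environment already preset.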

Using SnowSQL

SnowSQL also includes very powerful line editing commands to simplify the manipulation of queries. All the usual shortcuts are there: CTRL-R to search the history and arrow keys to navigate up/down in the history. By using the !edit command, I can modify large queries in my favorite editor (sorry emacs users, it's vim for me), rather than on the command line.

[screenshot: editing a query with !edit]

Since some queries can take a long time to run, I run them concurrently to avoid disrupting my workflow. To run a query in the background, just add > after the semicolon.

I can then check all the queries that are running or finished by using the !queries command. With variable_substitution turned on, !queries will bind a variable to each query ID.

I can abort any running query with !abort &, and to get the result of a query, I use !result &. If the query is still running, it will wait until the result comes back. If the query has finished, it simply fetches the result, so I can always go back to previous queries.
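
Putting this together, a hypothetical session might look like the sketch below: the first statement runs in the background, !queries lists running and finished queries, and !result fetches the result via the bound variable (the table name and the &1 variable form are my assumptions, not taken from the original post; !abort works the same way):

select count(*) from very_big_table;>
!queries
!result &1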


We want SnowSQL to be your tool of choice. If you come across any features that you like or miss, please let us know. We would like to acknowledge our software engineers Shige Takeda (@smtakeda) and Baptiste Vauthey (@thabaptiser) for their contributions. As always, keep an eye on this blog site, and our Snowflake Twitter feed (@SnowflakeDB) for updates on all the action and activities here at Snowflake Computing.

In the meantime, stay tuned for blog #3, where we will demonstrate SnowSQL’s scripting capabilities.

Happy Tweet, Sad Tweet – Building a Naive Bayes Classifier in SQL

We love to talk about how Snowflake is built and what makes Snowflake stand out from the data warehousing crowd, but we also love to show how you can put Snowflake to use for interesting big data challenges.  In this blog post we show how to build a machine learning classifier on top of Snowflake using SQL and Snowflake’s JSON extensions.  The classifier will learn to separate happy emotions from sad emotions in short text snippets.  As a data source we use the publicly available Twitter data stream that can be downloaded as raw JSON.

Twitter is a never-ending source of news, links, comments, and personal chatter. Some of this communication is highly emotional: it’s happy, it’s sad, it’s neither, or both at the same time. By loading Twitter data into Snowflake, we can build a machine learning classifier that detects such emotions automatically. The machine learning classifier can then be used to detect emotions in any tweet-sized text snippet, such as “The sun is shining, the weather is sweet again” (happy), “I must’ve called a thousand times” (sad), and “Make America Great Again” (happy).

Or, in SQL:

select * from result order by id;
----+------------------------------------------------+-------+
 ID |                      TEXT                      | LABEL |
----+------------------------------------------------+-------+
 1  | The sun is shining, the weather is sweet again | happy |
 2  | I must've called a thousand times              | sad   |
 3  | Make America Great Again                       | happy |
----+------------------------------------------------+-------+

In this blog post we will build a Naive Bayes machine learning classifier using SQL. We do this in three steps:

  1. We create a training data set by labeling tweets as happy or sad using emojis.
  2. We compute a machine learning model using a Naive Bayes classifier.
  3. We validate our model using a test data set and detect emotions in other text snippets.

Let’s quickly dive into what Twitter data looks like, what it takes to build a machine learning model, and what a Naive Bayes classifier is anyway.

Twitter Data

Twitter allows you to download a one percent sample of all public tweets for free. The format of this data is JSON. Each tweet is a separate JSON object that looks like this:

{
  "created_at": "Tue Feb 02 21:09:01 +0000 2016",
  "entities": {
    "hashtags": [ ... ],
    "media": [],
    "symbols": [],
    "trends": [],
    "urls": [],
    "user_mentions": [ ... ]
  },
  "id": 694751783172739072,
  "lang": "en",
  "retweeted_status": { ... },
  "text": "We are delighted to inform you that your submission 900: 
           The Snowflake Elastic Data Warehouse has been accepted. 
           #SIGMOD @SnowflakeDB",
  "user": { "screen_name": "hemasail" ... }
  ...
}

This tweet was sent by user @hemasail—which is me, coincidentally—on February 2, 2016. Every JSON object representing a tweet has mandatory fields and optional fields. Mandatory fields include the tweet’s id, the user that sent the tweet, and the language the tweet was written in (which is apparently detected by Twitter’s language classifier). Optional fields include “retweeted_status,” containing information about the retweeted tweet (if any); and “entities,” containing information about hashtags, media, URLs, etc. that were found in the tweet. There is much more information in a tweet’s JSON that is not shown in the example above. It is always interesting to see the full amount of refined information from “140 characters or less.”

Twitter data formatted in JSON is a great fit for Snowflake. Snowflake supports JSON natively: loading, querying, and unloading of JSON data is supported without conversion. Snowflake also supports user-defined functions in JavaScript that can operate on JSON data directly.
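
As a small taste of that, here is a sketch of what querying one of these tweets might look like once it has been loaded into a table with a VARIANT column named tweet (the table name here is just a placeholder; the actual tables appear later in this post):

select tweet:id::number                as tweet_id,
       tweet:lang::string              as lang,
       tweet:user:screen_name::string  as screen_name,
       tweet:text::string              as tweet_text
from tweets
where tweet:lang = 'en';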

Machine Learning

Machine Learning is a method in computer science to automatically learn characteristics of existing data and apply the findings to new data. In this blog we use supervised machine learning. Supervised machine learning is a branch of machine learning where the existing data set needs to be labeled. That is, each item in the data set—each tweet for example—is assigned a label: this one is a happy tweet, this one is a sad tweet. There is also unsupervised machine learning where the data set does not need to be labeled, and the machine learning algorithm tries to find labels itself. However, we will use a labeled data set.

There are many ways to generate labels. Usually, labels are generated by humans. In our case, that would mean that we sit down and label lots of tweets as happy or sad. This would take a lot of time. We could also outsource label generation to services such as Amazon Mechanical Turk, but this would take a lot of money. So we won’t do either. Instead, we will assign labels to tweets using emojis. If a tweet contains a happy emoji, we will assign a label “happy” to this tweet. If a tweet contains a sad emoji, we will assign a label “sad” to this tweet. If there are no emojis or both types of emojis in a tweet, we will assign labels “none” or “both,” respectively.

Given a labeled data set, we will split it into a training data set and a test data set. The training data set will be used to train the machine learning classifier. The test data set will be used to test how good our classifier is. We will split our data set into 80% training data and 20% test data.

Naive Bayes Classifier

A Naive Bayes classifier is a simple type of machine learning model based on probabilities. In our case, a Naive Bayes classifier uses word probabilities to classify a tweet as happy or sad. Simply said, a Naive Bayes classifier counts all words and computes probabilities of how often they appear in one category or another. For example, the word “great” appears more often in happy tweets than in sad tweets. The word “must’ve” appears more often in sad tweets than in happy tweets. By combining all of these word probabilities, we get an overall probability that a text belongs to one category or another. For example, by combining all word probabilities in “Make America Great Again,” we see that it is more happy than sad. At least, our training data from Twitter tells us so.
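
For reference, the scoring rule that the SQL later in this post implements can be sketched as follows (the 0.1 terms correspond to the smoothing constants used in the queries below, and |V| is the vocabulary size):

\hat{c} = \arg\max_{j} \left[ \log_{10}\frac{docs_j}{docs} + \sum_{w \in \text{tweet}} \log_{10}\frac{count(w,j) + 0.1}{words_j + 0.1\,|V|} \right]

Here docs_j is the number of training tweets with label j, words_j is the total number of words in those tweets, and count(w,j) is how often word w appears in tweets labeled j. The label with the highest score wins.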

A much better explanation of the Naive Bayes classifier can be found in the slides from Stanford here.

Building a Naive Bayes Classifier in SQL

We create a training data set by first labeling tweets as happy or sad. Labels will be assigned using emojis. If a tweet has a happy emoji, we assign the label “happy.” If a tweet has a sad emoji, we assign the label “sad.” If the tweet does not contain an emoji, or if it contains both types of emojis, we assign the labels “none” or “both” respectively.

To compute labels, we will use the following user-defined function (UDF) in JavaScript. The function takes a string as input and outputs a label based on the type of emojis found. It tests the text against a happy-emoji pattern and a sad-emoji pattern. We call this function “happyLabel”:

create or replace function happyLabel(TEXT string)
returns string
language javascript
as '
 var happyRegex = /:-?\\)/;
 var sadRegex = /:-?\\(/;
 var happyEmoji = happyRegex.test(TEXT);
 var sadEmoji = sadRegex.test(TEXT);
 if (happyEmoji && sadEmoji) return "both";
 if (happyEmoji) return "happy";
 if (sadEmoji) return "sad";
 return "none";
';
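
A quick sanity check of the UDF (the example strings are made up; the expected labels follow directly from the regexes above):

select happyLabel('today was great :-)') as l1,  -- 'happy'
       happyLabel('missed the bus :-(') as l2,   -- 'sad'
       happyLabel('meeting at noon') as l3;      -- 'none'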

We will use another JavaScript UDF that cleans a tweet and splits it into words. This function takes a string as input and outputs a variant. A variant is a Snowflake-specific data type that can contain semi-structured data like JSON objects and JSON arrays, as well as native data such as strings and numbers. This function outputs a variant containing a JSON array. Before splitting a tweet into words, the function removes any HTML entities such as &amp; and &lt;, any mention of other Twitter names, and any punctuation. It outputs a list of lower-case words.

create or replace function splitText(TEXT string)
returns variant
language javascript
as '
 var words = TEXT
   .replace(/&\\w+;/g, " ")         // remove HTML entities
   .replace(/@[^\\s]+/g, "")        // remove mentions
   .replace(/[^#\'\\w\\s]|_/g, " ") // remove punctuation
   .toLowerCase()
   .trim()
   .split(/\\s+/g);                 // split on whitespace
 return words;
';
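
And a similar sanity check for splitText (again, the input string is made up; the output follows from the regexes above):

select splitText('Hello @world &amp; friends!') as words;
-- expected output: ["hello","friends"]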

Now we will write SQL to generate the labeled data set. To generate the labels, we will call the JavaScript UDFs defined above. We will balance the labeled data set such that there are the same number of happy tweets and sad tweets. From this balanced data set, we will use 80 percent of the tweets for training the classifier. We will compute all probabilities and counts necessary for the Naive Bayes classifier. For a complete summary of all formulas needed to implement the classifier, see the slides from Stanford University. For any questions regarding JSON processing in Snowflake, refer to the Snowflake documentation.

-- label tweets and only select happy and sad tweets
create or replace table labeledTweets as
select tweet,
      happyLabel(tweet:text) as label,
      splitText(tweet:text) as words,
      uniform(0::float, 1::float, random()) as rand
from twitter.public.tweets_1p
where created_at > '2016-05-01'::date      -- tweets from May 2016
 and tweet:lang = 'en'                     -- english language only
 and array_size(tweet:entities:urls) = 0   -- no tweets with links
 and tweet:entities:media is null          -- no tweets with media (images etc.)
 and tweet:retweeted_status is null        -- no retweets, only originals
 and (label = 'happy' or label = 'sad');

-- balance tweets to have same number of happy and sad tweets
create or replace table balancedLabeledTweets as
select *
from (select *,
       rank() over (partition by label order by tweet:id) as rnk
     from labeledTweets) x
where rnk <= (select min(cnt)
             from (select label,count(*) as cnt
                   from labeledTweets
                   group by label));
-- training set, pick random 80 percent of tweets
create or replace table training as
select * from balancedLabeledTweets where rand < 0.8;

-- training helper table: tweet id, word, label
create or replace table training_helper as
select tweet:id as id, value as word, label
from training,
    lateral flatten(words);
 
-- number of total docs
create or replace table docs as
select count(*) as docs from training;

-- number of docs per class j
create or replace table docsj as
select label,count(*) as docsj
from training
group by label;

-- number of distinct words = |Vocabulary|
create or replace table voc as
select count(distinct word) as voc
from training_helper;

-- number of words per class j
create or replace table wordsj as
select label,count(*) as wordsj
from training_helper
group by label;

-- count per word n and class j
create or replace table wordsnj as
select a.label,a.word,ifnull(wordsnj,0) as wordsnj
from
(select label,word
from (select distinct label from training_helper) c
 join
    (select distinct word from training_helper) w) a
left outer join
 (select label,word,count(*) as wordsnj
 from training_helper
 group by label,word) r
on a.label = r.label and a.word = r.word;

To test our Naive Bayes classifier, we will use the remaining 20 percent of the labeled tweets. For every tweet in the test set, we will compute the actual label that the classifier assigns to the tweet. The classifier does not look at the emojis, of course. That would be cheating. We will then compare the actual labels with the expected labels. This will give us a percentage of correctly classified tweets.

-- test set
create or replace table test as
select * from balancedLabeledTweets where rand >= 0.8;

-- test helper table: tweet id, word, label
create or replace table test_helper as
select tweet:id as id,value as word,label
from test,
    lateral flatten(words);

-- classification probabilities
create or replace table probs as
select id,label,max(pc)+sum(pw) as p,expected
from (
select id,t.word,n.label,wordsnj,wordsj,docsj,docs,voc,
 log(10,docsj::real/docs) as pc,
 log(10,(wordsnj::real+0.1)/(wordsj+0.1*voc)) as pw,
 t.label as expected
from test_helper t
 inner join wordsnj n on t.word = n.word
 inner join wordsj j on n.label = j.label
 inner join docsj on docsj.label = n.label
 cross join docs
 cross join voc) x
group by id,label,expected;

-- classification result
create or replace table testResult as
select p1.id,t.tweet:text::string as text,p1.expected,p1.label as actual
from probs p1
 inner join (
   select id,max(p) as maxp
   from probs
   group by id) p2
 on p1.id = p2.id and p1.p = p2.maxp
 inner join test t on p1.id = t.tweet:id;

-- correctly classified tweets: "win probability"
select sum(win),count(*),sum(win)::real/count(*) as winprob
from (
select id,expected,actual,
 iff(expected = actual,1,0) as win
from testResult);

The output of the last query is the following:
---------+-----------+-----------------+
SUM(WIN) | COUNT(*)  |     WINPROB     |
---------+-----------+-----------------+
43926    | 56298     |   0.7802408611  |
---------+-----------+-----------------+

That means our Naive Bayes classifier classified 78% of all test tweets correctly. This is not too bad given that we did not do much data cleansing, spam detection, and word stemming. The baseline winning probability is 50% because the number of happy tweets is the same as the number of sad tweets in our data set. Thus, our classifier gives us a significant boost.

We can now use the trained classifier to label any text snippets. For example, we can label text snippets such as “The sun is shining, the weather is sweet again,” “I must’ve called a thousand times,” and “Make America Great Again.” To do this, we create a new table with all text snippets that we want to classify. To compute the classification, we split each text into words and use the word probabilities from our training set to ultimately assign a label to each text snippet. The results are stored in a table called “result”.

-- create new table with any text snippets
create or replace table query(id int, text varchar(500));
insert into query values (1, 'The sun is shining, the weather is sweet again');
insert into query values (2, 'I must\'ve called a thousand times');
insert into query values (3, 'Make America Great Again');
 
-- split texts into words
create or replace table query_helper as
select id,value as word
from query,
    lateral flatten(splitText(text));

-- compute probabilities using data from training set
create or replace table probs as
select id,label,max(pc)+sum(pw) as p
from (
select id,t.word,n.label,wordsnj,wordsj,docsj,docs,voc,
 log(10,docsj::real/docs) as pc,
 log(10,(wordsnj::real+0.1)/(wordsj+0.1*voc)) as pw
from query_helper t
 inner join wordsnj n on t.word = n.word
 inner join wordsj j on n.label = j.label
 inner join docsj on docsj.label = n.label
 cross join docs
 cross join voc) x
group by id,label;

-- assign labels to text snippets
create or replace table result as
select p1.id as id, text, p1.label as label
from probs p1
 inner join (
   select id,max(p) as maxp
   from probs
   group by id) p2
 on p1.id = p2.id and p1.p = p2.maxp
 inner join query q on p1.id = q.id;

And, ta-dah, here are the results of our classification:

select * from result order by id;
---+------------------------------------------------+-------+
ID |                      TEXT                      | LABEL |
---+------------------------------------------------+-------+
1  | The sun is shining, the weather is sweet again | happy |
2  | I must’ve called a thousand times              | sad   |
3  | Make America Great Again                       | happy |
---+------------------------------------------------+-------+

Summary

In this blog post, we built a complete Naive Bayes classifier using SQL. The classifier learned to distinguish happy tweets from sad tweets. Our classifier has a success rate of 78%, which is a significant improvement over the baseline of 50%. We used the classifier on other text snippets to show its applicability beyond tweets. Of course, this is only a demo and further data cleansing, spam detection, word stemming, and other natural language processing tricks are necessary to increase the success rate of the classifier. 

As always, keep an eye on this blog site, our Snowflake Twitter feed (@SnowflakeDB), and my personal Twitter feed (@hemasail) for updates on all the action and activities here at Snowflake Computing.

Empowering Data Driven Businesses via Scalable Delivery

The accessibility of data has changed the way that businesses are run. Data has become essential to decision makers at all levels, and getting access to relevant data is now a top priority within many organizations. This change has created huge demands on those who manage the data itself. Data infrastructures often cannot handle the demands of the large numbers of business users, and the result is often widespread frustration.

The concept of an Enterprise Data Warehouse offers a framework to provide a cleansed and organized view of all critical business data within an organization. However, this concept requires a mechanism to distribute the data efficiently to a wide variety of audiences. Executives most often require a highly summarized version of the data for quick consumption. Analysts want the ability to ‘slice and dice’ the data to find interesting insights. Managers need a filtered view that allows them to track their team’s data in the context of the organization as a whole. Strategists and economists need highly specialized data that can require highly complex algorithms and tools to generate. If one system is to serve the data for all of these audiences, along with numerous others, it must be able to scale to provide access to all of them.

Traditional Data Warehouse architecture has not been able to adequately respond to this demand for data. Many organizations have been forced to move to hybrid architectures in response. OLAP tools and datamarts became commonplace as mechanisms to distribute data to business users. Statisticians and later data scientists often had to create and maintain entirely separate systems to handle data volumes and workloads that would overwhelm the data warehouse. Data is copied repeatedly into a variety of systems and formats in order to make it available to wider audiences. A vice president at a large retail company openly referred to their data warehouse as a data pump that was entirely consumed by the loading of data on one side and the copying of that data out to all sorts of systems on the other side. This also creates challenges in ensuring there is a single, consistent view of the data inside the organization and in keeping all the siloed copies synchronized.

During my time working in a business intelligence role for a number of the largest organizations in the world, I encountered all of these problems. While supporting the Mobile Shopping Team at a major online retailer, I had access to a wide variety of powerful tools for delivering data to my coworkers that relied on that data for making key decisions about the business. However coordinating all of these systems so that they produced consistent results across the board was a huge headache. On top of that, many of the resources were shared with other internal teams, which meant that we were regularly competing for access to a limited pool of resources. There were a number of situations where I could not get data that was requested by my VP in a timely manner because other teams had consumed all available resources in the warehouse environment.

The architects behind Snowflake were all too familiar with these problems, and they came to the conclusion that the situation demanded an entirely new architecture. The rise of cloud computing provided the raw materials: vast, scalable storage completely abstracted from the idea of server hardware or storage arrays, plus elastic compute. More importantly, this storage is self-replicating, automatically creating additional copies of items in storage if they become ‘hot’. This storage, combined with the ability to leverage nearly unlimited computing power, is at the heart of the multi-cluster shared data architecture that is central to Snowflake’s Elastic Data Warehouse.


At Snowflake, data is stored once, but can be accessed by as many different compute engines as are required to provide responses to any number of requests. Data workers no longer need to focus on building out frameworks to deal with concurrency issues. They can keep all of their data in one system and provide access without having to consider how different groups of users might impact one another. The focus can then be placed where it belongs, on deriving value from the data itself. Thus you can load data and at the same time have multiple groups, each with their own computing resources, query the data. No more waiting for some other groups to free up resources. Your business can now deliver analytics to the various users: executives, analysts, data scientists, and managers, without the environment getting in the way of performance.

As always, keep an eye on our Snowflake-related Twitter feeds (@SnowflakeDB and @ToddBeauchene) for continuing updates on all the action and activities here at Snowflake Computing.

Automatic Encryption of Data

Hopefully you had a chance to read our previous top 10 posts. As promised, we continue the series with a deeper dive into another of the Top 10 Cool Features from Snowflake:

#2 Automatic Encryption of Data

One of the biggest worries people have about moving to the cloud is security. One key piece of providing enterprise class security is the ability to encrypt the data in your data warehouse environment. With Snowflake, your data is automatically encrypted by default.

No setup, no configuration, no add-on costs for high security features. Data is encrypted during its entire lifecycle. From loading data to storing data at rest, we apply end-to-end encryption, such that only the customer can read the data, and no one else. It is just part of the Snowflake service! That is a huge win for anyone who has ever tried to set up database security of any kind. In addition, this gives Snowflake a significant advantage compared to environments like Hadoop, where encryption and security are almost entirely left up to the customer to implement and maintain.

So what level of encryption?

For all data within Snowflake, we use strong AES 256-bit keys. Your data is encrypted as you load it. That is the default, and you cannot turn it off. In addition, our Snowflake security framework includes additional security best practices such as the use of a hierarchical key model and regular key rotation. All of this is automatic and transparent to our customers. In this way we provide our customers best-in-class data security as a service. For even more details on our approach to end-to-end encryption and how we secure your data, check out our blogs on end-to-end encryption and encryption key management.

Remember the #10 Top feature – persistent result sets? Well those query results are also encrypted with 256-bit encryption keys (all the data, all the time).

So do you have your data at rest encrypted in your data warehouse today? Are your loading tools, and staging environments also encrypted?

If not, then put your data into the Snowflake Elastic Data Warehouse and rest easy knowing your data is safe and secure. Snowflake brings enterprise grade security to data warehousing in the cloud, with end-to-end encryption as a major part of our offering.

As a co-writer of this series, I would like to thank Kent Graziano, who put a lot of effort into bringing the thoughts behind this series to the audience. Without his persistence and vision, this would not have been possible. As always, keep an eye on this blog site, our Snowflake Twitter feed (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for more Top 10 Cool Things About Snowflake and for updates on all the action and activities here at Snowflake Computing.

Kent Graziano and Saqib Mustafa

Support for Multiple Workloads

Hopefully you had a chance to read our previous top 10 posts. As promised, we continue the series with a deeper dive into another of the Top 10 Cool Features from Snowflake:

#3 Support for Multiple Workloads

With our unique Multi-cluster, shared data architecture, Snowflake can easily support multiple and disparate workloads. Resource contention is a common issue in traditional data warehouses, so it makes total sense to keep disparate workloads truly separate and avoid contention altogether, rather than just saying we support “mixed” workloads.

In legacy data warehouse environments, we often found ourselves constrained by what we could run and when we could run it, for fear of resource contention, especially with the CPUs. In many cases it was impossible to refresh the data during the day because the highly parallelized batch ETL process, while tuned for maximum throughput, usually hogged all the CPUs while it ran. That meant virtually no reporting queries could get resources, so they would just hang. Likewise, a complex report with calculations and massive aggregations would cause normally fast, simple reports to languish. And there was no way you could let any business users into the system to do exploratory queries, as those might also cause everything else to hang.

Because of the separation of compute and storage native to Snowflake’s architecture, as previously highlighted, you can easily spin up a set of compute nodes (we call them Virtual Warehouses) to run your ELT processes, and another set to support your BI report users, and a third set to support data scientists and data miners. In fact you can spin up (or down!) as many virtual warehouses as you need to execute all the workloads you have.
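
For illustration, here is a minimal sketch of what standing up separate warehouses per workload might look like (the warehouse names, sizes, and suspend settings are hypothetical):

-- one warehouse per workload, all reading the same shared data
create warehouse elt_wh with warehouse_size = 'LARGE'  auto_suspend = 300 auto_resume = true;
create warehouse bi_wh  with warehouse_size = 'MEDIUM' auto_suspend = 300 auto_resume = true;
create warehouse ds_wh  with warehouse_size = 'XLARGE' auto_suspend = 300 auto_resume = true;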

[diagram: multiple virtual warehouses accessing the same shared data]

So not only do all the virtual warehouses share the same data (ensuring consistent results), they can do so without being affected by operations launched in other virtual warehouses, because each uses completely separate resources. Hence there is no more resource contention!

With the Snowflake Elastic Data Warehouse, there is no more need to run the data loads at night just to avoid slowing down the reports. No more worry that one runaway query will impact the loads or other users. You can now run loads (e.g., real-time, micro-batch, etc.) at any time and thus provide your analysts and users current data on a more frequent basis.

And even better – no special skills or secret configuration settings are required to make this work. It is the way Snowflake’s Data Warehouse as a Service (DWaaS) is built by design.

Nice!

For a quick look at how this works, check out this video.

Thanks to Saqib Mustafa for his help and suggestions on this post.

As always, keep an eye on this blog site, our Snowflake Twitter feeds (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for more Top 10 Cool Things About Snowflake and for updates on all the action and activities here at Snowflake Computing.

 

Automatic Concurrency Scaling in Snowflake – Another Way the Cloud Changes the Game

The challenge with concurrency

Today we take a major step forward by extending our elastic architecture to solve another major pain point in existing on-premises and cloud data warehousing solutions: how to run massively concurrent workloads at scale in a single system.

Have you had the following experiences when building mission-critical applications that incorporate data analytics:

  • My application can only support a certain level of user concurrency due to the underlying data warehouse, which only allows 32-50 concurrent user queries.
  • To build my application, I need to acquire multiple data warehouse instances in order to isolate numerous workloads and users from each other. This adds to costs and complexity.
  • We have built our own scheduling policies around the data warehouse. We use query queues to control and prioritize incoming queries issued by our numerous users.
  • During peak times, users are getting frustrated because their requests are getting queued or fail entirely.

At Snowflake, we separate compute from storage through the unique concept of virtual data warehouses. That concept makes it possible to instantly resize virtual warehouses or pause them entirely, and it is why Snowflake is the only cloud data warehousing solution that allows concurrent workloads to run without impacting each other.

However, we saw the need to go a step further to offer a service that adapts to changing workloads and addresses concurrency at the same time:

  • Imagine you didn’t have any concurrency limitations on your mission-critical business application.
  • Imagine you didn’t need users to adjust their workloads to accommodate data warehouse bottlenecks.
  • Imagine the data warehouse itself could detect increasing workloads and add additional compute resources as needed or shut-down/pause compute resources when workload activities subside again.
  • Imagine your application could scale out-of-the-box with one single (virtual) data warehouse without the need to provision additional data warehouses.
  • Imagine a world without any scheduling scripts and queued queries – a world in which you can leverage a smart data warehousing service that ensures all your users get their questions answered within the application’s SLA.

With Snowflake, you can do all of this for real, not just in your imagination, with our new multi-cluster data warehouse feature.

Multi-cluster data warehouses

A virtual warehouse represents a number of physical nodes a user can provision to perform data warehousing tasks, e.g. running analytical queries. While a user can instantly resize a warehouse by choosing a different size (e.g. from small to 3X large), until now a virtual data warehouse in Snowflake always consisted of one physical cluster.

With the recent introduction of multi-cluster warehouses, Snowflake supports allocating, either statically or dynamically, more resources for a warehouse by specifying additional clusters for the warehouse.


Figure 1: How automatic scaling works

 

The figure above shows a multi-cluster DW that consists of three compute clusters. All compute clusters in the warehouse are of the same size. The user can choose from two different modes for the warehouse:

  • Maximized: When the warehouse is started, Snowflake always starts all the clusters to ensure maximum resources are available while the warehouse is running.
  • Auto-scale: Snowflake starts and stops clusters as needed to dynamically manage the workload on the warehouse.

As always, in Snowflake a user can either leverage the user interface or use SQL to specify the minimum/maximum number of clusters per multi-cluster DW:

Create Warehouse UI wizard


Figure 2: Create Warehouse in UI Wizard

 

Create Warehouse SQL script


Figure 3: Create Warehouse via SQL

 

Similar to regular virtual warehouses, a user can instantly resize all clusters of a multi-cluster warehouse by choosing a different size (e.g. XS, S, M, L, …) either through the UI or programmatically via the corresponding SQL DDL statements. In auto-scale mode, Snowflake automatically adds or resumes additional clusters (up to the maximum number defined by the user) as soon as the workload increases. If the load subsides again, Snowflake shuts down or pauses the additional clusters. No user interaction is required – this all takes place transparently to the end user.
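
Since Figures 2 and 3 are shown only as screenshots, here is a hedged sketch of what the equivalent SQL might look like (the warehouse name, size, and cluster counts are illustrative):

-- create a multi-cluster warehouse that auto-scales between 1 and 4 clusters
create warehouse reporting_wh with
  warehouse_size    = 'MEDIUM'
  min_cluster_count = 1
  max_cluster_count = 4;

-- resize all clusters of the warehouse later if needed
alter warehouse reporting_wh set warehouse_size = 'LARGE';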

Internally, the query scheduler takes multiple factors into account when making these decisions. The two main factors in this context are:

  1. The memory capacity of the cluster, i.e. whether clusters have reached their maximum memory capacity
  2. The degree of concurrency in a particular cluster, i.e. whether there are many queries executing concurrently on the cluster

As we learn more from our customers’ use cases, we will extend this feature further and share interesting use cases where multi-cluster data warehouses make a difference. Please stay tuned as we continue reinventing modern data warehousing and analytics by leveraging the core principles of cloud computing.

We would like to thank our co-authors, Florian Funke and Benoit Dageville, for their significant contributions as main engineer and architect, respectively, to making multi-cluster warehouses a reality. As always, keep an eye on the blog and our Snowflake Twitter feed (@SnowflakeDB) for updates on Snowflake Computing.

 

 

Elasticity & Separation of Compute and Storage

Hopefully you had a chance to read our previous top 10 posts. As promised, we continue the series with a deeper dive into another of the Top 10 Cool Features from Snowflake:

#4 Elasticity & Separation of Storage and Compute

The innovative, patent-pending Multi-Cluster, Shared Data Architecture in Snowflake takes full advantage of the elasticity of the cloud. Having built its data warehouse from the ground up, Snowflake created an environment that really allows customers to scale their data warehouse as they want. This unique environment gives users the ability to add more compute power on demand, even in the middle of a process.

How it works

  1. Data is stored, once, on elastic disk storage in the cloud (for example, it is currently stored in AWS S3)
  2. Compute clusters (for example using AWS EC2 nodes) are allocated dynamically as required to do work (queries or loads)
  3. The compute clusters are connected to the data storage. The access to the data is based on the approved security credentials of the user that is running the query

The important point is that storage and compute are not tightly coupled together. This means you can grow and shrink each independently of the other.

More Power at the Push of a Button

Say you are running a series of complex SQL queries on a Small (2 node) warehouse in Snowflake. Normally you have until the close of business to run the reports and prepare the data. But today, at 8 AM, your management requests the results of that analysis for a board meeting at noon. So now you have half the time to get the reports to run. Yikes!

In a traditional data warehouse you would pretty much have only one option: kick everyone off the system and hope things run faster.

If, however, you are running the reports on the Snowflake Elastic Data Warehouse, all you have to do is increase the size of the warehouse to a Medium, which doubles the horsepower of the data warehouse (i.e. it doubles the number of nodes). And you can perform this change with a few clicks in our easy-to-use web-based interface, even if the process has already started.

[screenshots: resizing the warehouse from Small to Medium in the web interface]

Once you bump up the size of the warehouse, the next statement that starts will immediately use the additional resources. Notice that the first statement running with the Medium warehouse runs in about half the time of the prior statement, with about double the data!

[screenshot: query history showing run times before and after the resize]

You don’t need to interrupt the process, start over, or restart the service. It just takes effect automatically, at the push of a button. And once the process is done, you can reduce the warehouse back to a Small or even just turn it off.
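
If you prefer SQL to the web interface, the same flow can be sketched as follows (the warehouse name is hypothetical):

-- double the horsepower for the board-meeting crunch
alter warehouse reports_wh set warehouse_size = 'MEDIUM';

-- when the reports are done, scale back down or suspend the warehouse entirely
alter warehouse reports_wh set warehouse_size = 'SMALL';
alter warehouse reports_wh suspend;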

And in this way, the unique cloud-based architecture of Snowflake allows you to truly take advantage of the elasticity of the cloud for your data warehouse.

Alternatives

In many other architectures, whether cloud or on-premises appliances, you have a certain amount of disk space associated with the compute nodes. If you need more storage, you must add more nodes. If you need more compute power, likewise you add more nodes. Plus, other architectures may not allow you to scale down. Also, in most of these cases, when adding more nodes, you must shut down the environment or make it read-only for a time while you redistribute the data manually across the new nodes.

Thus, in legacy data warehouses, elasticity is limited. This forces you to plan for, acquire, and manage the resources for your expected future peak capacity and peak data demand. Otherwise you risk having your queries or your hardware fail (because you run out of disk capacity or CPU power, or both). In addition, trying to scale beyond existing capacity may lead to disruptions to the service (as you rebuild/extend the hardware), or some workloads being rejected in favor of others. And if you over-provision and buy too much capacity, there may not be a way to scale back after the fact.

The Snowflake Advantage

For Snowflake, this is not an issue. You can grow and shrink the environment dynamically. The data storage grows and shrinks as you add or remove data, while the compute nodes can be ramped up or down, or turned off, as you require. You are not forced to pay for capacity up front, or kick other workloads off, or plan downtimes when ramping up your data warehouse capacity. That is the promise of cloud. This is why Snowflake is the real elastic Data Warehouse as a Service.

Very cool indeed.

As always, keep an eye on this blog site, our Snowflake Twitter feed (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for more Top 10 Cool Things About Snowflake and for updates on all the action and activities here at Snowflake Computing.

Kent Graziano and Saqib Mustafa

End-to-End Encryption in the Snowflake Data Warehouse

By Martin Hentschel and Peter Povinec.

Protecting customer data is one of the highest priorities for Snowflake. The Snowflake data warehouse encrypts all customer data by default, using the latest security standards, at no additional cost. Snowflake provides best-in-class key management, which is entirely transparent to customers. This makes Snowflake one of the easiest to use and most secure data warehouses on the market.

In previous blog posts, we explained critical components of Snowflake’s security architecture, including:

  • How Snowflake manages encryption keys, including the key hierarchy, automatic key rotation, and automatic re-encryption of data (“rekeying”)
  • How Snowflake uses Amazon CloudHSM to store and use Snowflake’s master keys with the highest protection
  • How Amazon CloudHSM is configured to run in high-availability mode.

In this blog post, we explain Snowflake’s ability to support end-to-end encryption, including:

  • How customers can upload their files to Amazon S3 using client-side encryption
  • How to use Snowflake to import and export client-side encrypted data.

End-to-End Encryption

End-to-end encryption is a form of communication where only the end users, and nobody else, can read the data. For the Snowflake data warehouse service, it means that only the customer and the runtime components of the Snowflake service can read the data. No third parties, including Amazon AWS and any ISPs, can see data in the clear. This makes end-to-end encryption the most secure way to communicate with the Snowflake data warehouse service.

End-to-end encryption is important because it minimizes the attack surface. In the case of a security breach of any third party (for example of Amazon S3), the data is protected because it is always encrypted, regardless of whether the breach is due to the exposure of access credentials indirectly or the exposure of data files directly, whether by an insider or by an external attacker, and whether inadvertent or intentional. The encryption keys are only in the custody of the customer and Snowflake. Nobody else can see the data in the clear – encryption works!


Figure 1: End-to-end Encryption in Snowflake

Figure 1 illustrates end-to-end encryption in the Snowflake data warehouse. There are three actors involved: the customer in its corporate network, a staging area, and the Snowflake data warehouse running in a secure virtual private cloud (VPC). Staging areas are either provided by the customer (option A) or by Snowflake (option B). Customer-provided staging areas are buckets or directories on Amazon S3 that the customer owns and manages. Snowflake-provided stages are built into Snowflake and are available to every customer in their account. In both cases, Snowflake supports end-to-end encryption.

The flow of end-to-end encryption in Snowflake is the following (illustrated in Figure 1):

  1. The customer uploads data to the staging area. If the customer uses their own staging area (option A), the customer may choose to encrypt the data files using client-side encryption. If the customer uses Snowflake’s staging area (option B), data files are automatically encrypted by default.
  2. The customer copies the data from the staging area into the Snowflake data warehouse. Within the Snowflake data warehouse, the data is transformed into Snowflake’s proprietary file format and stored on Amazon S3 (“data at rest”). In Snowflake, all data at rest is always encrypted.
  3. Results can be copied back into the staging area. Results are (optionally) encrypted using client-side encryption in the case of customer-provided staging areas or automatically encrypted in the case of Snowflake-provided staging areas.
  4. The customer downloads data from the staging area and decrypts the data on the client side.

In all of these steps, all data files are encrypted. Only the customer and runtime components of Snowflake can read the data. Snowflake’s runtime components decrypt the data in memory for query processing. No third-party service can see data in the clear.

Customer-provided staging areas are an attractive option for customers that already have data stored on Amazon S3, which they want to copy into Snowflake. If customers want extra security, they may use client-side encryption to protect their data. However, client-side encryption of customer-provided stages is optional.

Client-Side Encryption on Amazon S3

Client-side encryption, in general, is the most secure form of managing data on Amazon S3. With client-side encryption, the data is encrypted on the client before it is uploaded. That means Amazon S3 only stores the encrypted version of the data and never sees data in the clear.


Figure 2: Uploading data to Amazon S3 using client-side encryption

Client-side encryption follows a specific protocol defined by Amazon AWS. The AWS SDK and third-party tools such as s3cmd or S3 Browser implement this protocol. Amazon S3’s client-side encryption protocol works as follows (Figure 2):

  1. The customer creates a secret master key, which remains with the customer.
  2. Before uploading a file to Amazon S3, a random encryption key is created and used to encrypt the file. The random encryption key, in turn, is encrypted with the customer’s master key.
  3. Both the encrypted file and the encrypted random key are uploaded to Amazon S3. The encrypted random key is stored with the file’s metadata.

When downloading data, the encrypted file and the encrypted random key are both downloaded. First, the encrypted random key is decrypted using the customer’s master key. Second, the encrypted file is decrypted using the now decrypted random key. All encryption and decryption happens on the client side. Never does Amazon S3 or any other third party (for example an ISP) see the data in the clear. Customers may upload client-side encrypted data using any clients or tools that support client-side encryption (AWS SDK, s3cmd, etc.).

Ingesting Client-Side Encrypted Data into Snowflake

Snowflake supports reading and writing to staging areas using Amazon S3’s client-side encryption protocol. In particular, Snowflake supports client-side encryption using a client-side master key.

Figure 3: Ingesting client-side encrypted data into Snowflake

Ingesting client-side encrypted data from a customer-provided staging area into Snowflake (Figure 3) is just as easy as ingesting any other data into Snowflake. To ingest client-side encrypted data, the customer first creates a stage object with an additional master key parameter and then copies data from the stage into their database tables.

As an example, the following SQL snippet creates a stage object in Snowflake that supports client-side encryption:

-- create encrypted stage
create stage encrypted_customer_stage
url='s3://customer-bucket/data/'
credentials=(AWS_KEY_ID='ABCDEFGH' AWS_SECRET_KEY='12345678')
encryption=(MASTER_KEY='aBcDeFgHiJkL7890=');

The master key specified in this SQL command is the Base64-encoded string of the customer’s secret master key. As with all other credentials, this master key is transmitted to Snowflake over TLS (HTTPS) and stored in a secure, encrypted way in Snowflake’s metadata storage. Only the customer and query-processing components of Snowflake know the master key and are therefore able to decrypt data stored in the staging area.

As a side note, stage objects can be granted to other users within a Snowflake account without revealing S3 access credentials and client-side encryption keys to those users. This makes a stage object an interesting security feature in itself that is a Snowflake advantage over alternatives.
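
For example, a minimal sketch of sharing the stage without sharing its secrets (the role name is hypothetical):

-- let another role load from the stage without ever seeing the S3 credentials or the master key
grant usage on stage encrypted_customer_stage to role analyst_role;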

After the customer creates the stage object in Snowflake, the customer may copy data into their database tables. As an example, the following SQL command creates a database table “users” in Snowflake and copies data from the encrypted stage into the “users” table:

-- create table and ingest data from stage
create table users (id bigint, name varchar(500), purchases int);
copy into users from @encrypted_customer_stage/users;

The data is now ready to be analyzed using the Snowflake data warehouse. Of course, data can be offloaded into the staging area as well. As a last example, the following SQL command first creates a table “most_purchases” as the result of a query that finds the top 10 users with the most purchases, and then offloads the table into the staging area:

-- find top 10 users by purchases, unload into stage
create table most_purchases as select * from users order by purchases desc limit 10;
copy into @encrypted_customer_stage/most_purchases from most_purchases;

Snowflake encrypts the data files copied into the customer’s staging area using the master key stored in the stage object. Of course, Snowflake adheres to the client-side encryption protocol of Amazon S3. Therefore, the customer may download the encrypted data files using any clients or tools that support client-side encryption.

Summary

All customer data in the Snowflake warehouse is encrypted in transit and at rest. By supporting encryption for all types of staging areas as well, whether customer-owned or Snowflake-owned, Snowflake supports full end-to-end encryption in all cases. With end-to-end encryption, only the customer and the runtime components of the Snowflake service can read the data. No third parties in the middle, for example Amazon AWS or any ISPs, see data in the clear. Therefore, end-to-end encryption secures data communicated with the Snowflake data warehouse service.

Protecting customer data at all levels is one of the pillars of the Snowflake data warehouse service. As such, Snowflake takes great care in securing the import and export of data into the Snowflake data warehouse. When importing and exporting data, customers may choose to use customer-provided staging areas or Snowflake-provided staging areas. With customer-provided staging areas, customers may choose to upload data files using client-side encryption. With Snowflake-provided staging, data files are always encrypted by default. Thus, Snowflake supports end-to-end encryption in both cases.

Making Data Warehousing Easy

Legacy Problems

Organizations with legacy on-premise data warehouses spend a lot of time and money managing their environments and keeping up with business demands. Because of the size of the investments, organizations often run their data warehouses close to full utilization. While this may meet their current needs, the inherent lack of scalability could mean compromises on performance or failure to meet SLAs when more workloads, data sources and users need to be added.  Then the journey begins to add more capacity.  Organizations often need to acquire specialized resources, or become reliant on legacy vendors to manage and maintain the environment. All of this means that if there is a spike in demand for the environment, these organizations cannot accommodate the growth without impacting performance or must absorb additional costs for a greater footprint that waits unused until needed for that brief spike in the future.

Dealing with Growth

With performance concerns come the typical headaches of any data warehouse environment. These include, but are not limited to, growing the environment, finding qualified resources for performance tuning, optimizing queries, and dealing with concurrency and user growth. On the other hand, businesses are facing stiffer competition, and end users are clamoring for faster answers to their business questions. In the past, data warehousing was limited to a set of users typically in marketing or finance. Now even field sales reps want access to up-to-date data, creating more load on the data warehouse. Plus, the more data you have, the more important security becomes and the more the cost of performance increases. So now organizations are not only keeping the lights on, but also increasing spending to get performance and to secure the environment. In short, data warehouses have become more difficult to maintain and run!

An example of an organization facing this challenge of scalability is CapSpecialty, a leading provider of specialty insurance for small to mid-sized businesses. CapSpecialty used a legacy data warehouse to support analytics needed by their actuarial users to understand how to price and package products in various geographies. With the increased demand for access to this data, this legacy environment required a significant upgrade. Performance impacts led to users having to start their queries before leaving the office for the weekend, hoping they would be completed when they returned to work on Monday. The legacy environment also limited their ability to report on important KPIs that were critical to running the business in a timely manner. As with any financial organization, the environment also needed to be very secure in order to protect the crown jewels: customers' risk profiles and related financial data. Unfortunately, upgrading their environment to meet this increased demand was going to cost them $500K just for licensing, and that would only give them a 2X increase in performance. This does not even include the costs for deployment, management and hosting of the new environment.

Making Data Warehousing Easy

The need for a scalable, more cost-effective solution led them to Snowflake. After evaluating a number of data warehouse options, CapSpecialty decided to implement the Snowflake cloud-based Elastic Data Warehouse. Besides offering an attractive cost structure, Snowflake's true cloud solution delivered ease of migration and scalability. With Snowflake, CapSpecialty was up and running in less than a week. In addition to achieving a 200x increase in query performance, they leveraged existing infrastructure and were set up to scale for future growth. Snowflake also provided end-to-end, enterprise-level security to protect their sensitive financial data in the cloud.

CapSpecialty underwriters are now able to analyze 10 years’ worth of governed data in 15 minutes. The stage has also been set for CapSpecialty executives to view dashboards that display real-time profitability and KPIs. Using Snowflake, CapSpecialty can also bring semi-structured data to the environment, and serve the analytics to their field agents to effectively market their products in various geographies.

To learn the details of how Snowflake made data warehousing easy for CapSpecialty, we encourage you to read more in the case study. You can also attend our webinar on April 27th, 2016, 10:00 AM PST/1:00 PM EST, to find out how Snowflake and MicroStrategy enable CapSpecialty analysts to understand data in real time.

JSON Support with Snowflake

Hopefully you had a chance to read our previous top 10 posts. As promised, we continue the series with a deeper dive into another of the Top 10 Cool Features from Snowflake. 

#6 JSON support with Snowflake

One of the things that got people at #Strata Hadoop excited this week was our support for JSON and other semi-structured data types. For traditional data warehouse users, the world of big data can be challenging. We are used to using SQL to query data, used to having a well-defined data model, and used to knowing what the source and target schemas look like. We needed a simpler way to handle the flexible schemas that come with semi-structured data like JSON documents. As much as they try, legacy data warehouse systems do not provide extensive support for JSON data, and big data systems require learning extensive new programming skills.

When our founders started out from scratch to build a data warehouse for the cloud, they wanted a solution that could combine all your data in one place without the need to resort to multiple platforms or programming paradigms. As a result, combining structured and semi-structured data in one place and making it available through standard ANSI SQL is a strong feature of the Snowflake service and one extensively used by our customers.

Snowflake was built with features to simplify access to JSON data and provide the ability to combine it with structured data! Using Snowflake, you can learn to query JSON data using SQL, and join it to traditional tabular data in relational tables easily. Our innovative approach allows the user to store the JSON documents in a relational table using a new data type (VARIANT) that is optimized automatically in the background for MPP and columnar access.

This is a great way to eliminate the gap between the big data world and the relational world and simplify access for users. Most legacy databases (with their legacy code bases) cannot do this efficiently. Some legacy data warehouse providers cannot support JSON data at all, and you may have to acquire and manage a separate big data system. Others may require some sort of pre-processing of the data, such as conversion into simplified CSV-type data. This may make it easier to ingest the data, but it requires time and resources, and it limits the ability to accommodate the changes that come with a flexible-schema data type into the relational data model. In addition, the JSON data may be stored in a text field rather than an optimized data type, which has its cost in both query speed and data storage.

Snowflake makes semi-structured data available inside the data warehouse service seamlessly. The data can be ingested directly into a table in Snowflake and can then be queried easily. And any changes to the schema of the inbound JSON are accommodated automatically without impact to existing queries.

Example Code

In this scenario, we are going to use Snowflake SQL extensions for querying semi-structured data, and our innovative data type (VARIANT) to join data to other purely relational tables. We are going to combine Twitter data (JSON data) with  product data in relational tables.

The main table which stores the Twitter JSON data, twitter.data.tweets, has two columns: tweet and created_at.  The column Tweet is defined as a VARIANT type and holds the JSON from a Twitter feed, while created_at is a relational column with a data type of TIMESTAMP_NTZ (NTZ = no time zone).
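
A sketch of the DDL behind such a table might look like this (the original post shows the table only as a screenshot, so the exact definition is assumed):

create table twitter.data.tweets (
  tweet      variant,        -- the raw JSON of the tweet
  created_at timestamp_ntz   -- timestamp without time zone
);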


Here is an example showing a fairly simple SQL query with the JSON extensions. In this query we are joining some Twitter data to product data in relational tables in order to get a count of Tweets that contain hashtags related to a particular product:

select extract('day',created_at) janday,count(*) cnt
  from
    twitter.data.tweets t,

     -- unnest a tweet on the hashtags of each entities
     lateral flatten (input=> t.tweet, path=> 'entities.hashtags')tags,
     (select distinct ph_hashtag
        from
          sales.public.producthashtags,
          sales.public.product
        where p_name ='Blue Sky'
        and   p_productkey = ph_productkey) p

     where tags.value:text::string = p.ph_hashtag
     and   created_at >= '2014-01-01 00:00:00'
     and   created_at <  '2014-02-01 00:00:00'

    group by 1
    order by 1

The following section of the code pivots the elements in the JSON string into a set of rows so we can do traditional joins:

     -- unnest a tweet on the hashtags of each entities
     lateral flatten (input=> t.tweet, path=> 'entities.hashtags')tags,

Specifically, it is pulling out the nested array of hashtags within the entities element. Then the predicate is where we join those hashtag values in the Tweet string to the hashtag column within the Product table (aliased as “p”):

     where tags.value:text::string = p.ph_hashtag

In this case “tags” is the virtual table alias created by the FLATTEN function, and the keyword “value” indicates we want the content of that row. The rest of the specification indicates it is text data that we want cast as a STRING so it matches the data type of the column p.ph_hashtag.

Then the last part of the predicate is a normal filter for a date range using the date column in the TWEETS table:

     and   created_at >= '2014-01-01 00:00:00'
     and   created_at <  '2014-02-01 00:00:00'

So there you have it, using SQL to combine semi-structured data with traditional structured data in a relational data warehouse in the cloud. No big data system required.

Not bad!

But that is just a flavor of how you can utilize Snowflake to easily get value from your JSON data. There’s way more than we can cover in a simple blog post (like building relational views on the JSON for example).
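
As a taste of that last idea, here is a hedged sketch of a relational view over the raw JSON (the column choices are illustrative, not taken from the original post):

create or replace view twitter.data.tweets_flat as
select tweet:id::number               as tweet_id,
       tweet:user:screen_name::string as screen_name,
       tweet:lang::string             as lang,
       tweet:text::string             as tweet_text,
       created_at
from twitter.data.tweets;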

Want to find out more? Ask us for a demo or check out the presentation by Grega Kaspret (@gregakespret) from Celtra Mobile at Strata Hadoop World (San Jose) this week, talking about simplifying a JSON data pipeline using Snowflake. And follow our Twitter feeds: (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for more Top 10 Cool Things About Snowflake and updates on all the action at Snowflake Computing.

Kent Graziano and Saqib Mustafa