Rethink What You Know About Creating a Data Lake for JSON

Over the last 10 years, the prevailing notion has been that to quickly and cost-effectively gain insights from a variety of data sources, you need a Hadoop platform. Those sources could be weblogs, clickstreams, events, IoT feeds and other machine-generated JSON or semi-structured data. The proposition of Hadoop-based data processing is a single repository (a data lake) with the flexibility, capacity and performance to store and analyze an array of data types.

It shouldn’t be complicated

In reality, analyzing data with a Hadoop-based platform is not simple. Hadoop platforms start you with an HDFS file system, or equivalent. You then must piece together at least a half-dozen software packages just to provide basic enterprise-level functionality: provisioning, security, system management, data protection, database management and the interface needed to explore and query data.

Despite the efforts of open-source communities to provide tools that bring Hadoop platforms up to the highest enterprise-class level, there is a constant need for highly skilled resources just to keep Hadoop up and running, let alone enable users to do more than explore data. This all adds up to unnecessary complexity.

A much simpler proposition

Snowflake, which is built for the cloud and delivered as a service, provides you with a different option for handling JSON and semi-structured data. Just point your data pipelines to Snowflake, land the data in our elastic storage repository and you have instant access to a bottomless data lake. You also have access to a full-fledged data warehouse. With Snowflake, you can easily load JSON and query the data with robust, relational SQL. You can mix JSON with traditional structured data and data from other sources, all within the same database. Moreover, you can support endless concurrent analytic workloads and work groups against the JSON data. Whether it is one workload or 1,000, Snowflake handles it all with ease.
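As a rough sketch of what that looks like in practice (the table, stage and column names here are hypothetical, not from any specific deployment):

create or replace table raw_events (v variant);

-- load JSON files from a stage you have already created
copy into raw_events
  from @json_stage
  file_format = (type = 'JSON');

-- query JSON attributes directly with SQL
select v:device:type::string as device_type,
       count(*)              as event_count
from raw_events
group by 1;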

As a combined data lake and data warehouse platform, Snowflake allows you to do much more. Read more about it in our new eBook, Beyond Hadoop: Modern Cloud Data Warehousing.

Try Snowflake for free. Sign up and receive $400 of free usage credits. You can create a sandbox or launch a production implementation from the same Snowflake account.

Rethink what you’ve been told

Hadoop is not a prerequisite for gaining insights from JSON or other machine data.

When you need to store, warehouse and analyze JSON and other machine data, rethink what you’ve been told. Snowflake makes it easy to develop insights or uncover relationships that can drive business forward. You can support all of your structured and semi-structured data warehousing and analytic workload needs with a single, ACID-compliant tool built for the cloud. Unlike the special skills often needed to operate a Hadoop platform, Snowflake is a fully relational SQL environment that uses the familiar semantics and commands known to millions of SQL users and programmers, and thousands of SQL tools.

Be sure to keep an eye on this blog or follow us on Twitter (@snowflakedb and @miclnixon1) for all the news and happenings here at Snowflake.


How to analyze JSON with SQL

One of the key reasons I joined Snowflake is its built-in support to load and query semi-structured data such as JSON. In most conventional data warehouse and Big Data environments, you have to first load this type of data onto a Hadoop or NoSQL platform. Then, you must parse it with a framework such as MapReduce so you can load the data into tables in a relational database. Then, and only then, are you ready to analyze the data with SQL queries or a BI/analytics tool. But why take extra steps when there is a faster, easier way to get the job done?

With Snowflake, you can load your semi-structured data directly into a relational table, query the data with a SQL statement and then join it to other structured data – all without fretting about future changes to the schema of that data. Snowflake keeps track of the self-describing schema so you don’t have to. No ETL or shredding required.

This means you can leverage your existing knowledge and skills in SQL to jump into the world of big data. Even with this feature, there is still a tiny bit to learn. However, it’s easy with Snowflake’s extensions to SQL. To get you started, we have produced a handy ebook that takes you step-by-step through loading some JSON into Snowflake, then querying that data with SQL. The book is called How to analyze JSON with SQL: Schema-on-read made easy.

Get your free ebook now and start your journey to analyzing big data in the cloud!

Be sure to keep an eye on this blog or follow us on Twitter (@snowflakedb and @kentgraziano) for all the news and happenings here at Snowflake.

P.S. If you don’t already have a Snowflake account, you can sign up for a self-service account here and get a jumpstart with $400 in free credits!

 

New Snowflake features released in Q1’17

We recently celebrated an important milestone in reaching 500+ customers since Snowflake became generally available in June 2015. As companies of all sizes increasingly adopt Snowflake, we wanted to look back and provide an overview of the major new Snowflake features we released during Q1 of this year, and highlight the value these features provide for our customers.

Expanding global reach and simplifying on-boarding experience

Giving our customers freedom of choice, along with a simple, secure, and guided “Getting Started” experience, was a major focus of the last quarter.

  • We added a new region outside of the US; customers now have the option to analyze and store their data in Snowflake accounts deployed in EU-Frankfurt. Choosing the appropriate region is integrated into our self-service portal when new customers sign up.
  • In addition, we added our high-value product editions, Enterprise and Enterprise for Sensitive Data (ESD), to our self-service offerings across all available regions. For example, with Enterprise, customers can quickly implement auto-scale mode for multi-cluster warehouses to support varying, high-concurrency workloads (see the sketch after this list). And customers requiring HIPAA compliance can choose ESD.
  • Exploring other venues for enabling enterprises to get started quickly with Snowflake, we partnered with the AWS Marketplace team to include our on-demand Snowflake offerings, including the EU-Frankfurt option, in their newly-launched SaaS subscriptions.
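As a sketch of the auto-scale configuration mentioned above (the warehouse name and settings are illustrative, not from the original announcement):

create or replace warehouse reporting_wh
  warehouse_size    = 'MEDIUM'
  min_cluster_count = 1
  max_cluster_count = 3
  scaling_policy    = 'STANDARD'
  auto_suspend      = 300
  auto_resume       = true;

With auto-scale mode, Snowflake adds clusters as concurrency grows and shrinks back to the minimum when demand drops, so you pay only for what the workload actually needs.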

Improving out-of-the-box performance & SQL coverage

We are committed to building the fastest cloud DW for your concurrent workloads with the SQL you love.

  • One key performance improvement introduced this quarter was the reduction of compilation times for JSON data. Internal TPC-DS tests demonstrate a reduction of 30% to 60% for most of the TPC-DS queries (single stream on a single, 100TB JSON table). In parallel, we worked on improving query compile time in general, providing up to a 50% improvement in performance for short queries.
  • Another new key capability is the support for bulk data inserts on a table concurrently with other DML operations (e.g. DELETE, UPDATE, MERGE). By introducing more fine-grained locking at the micro-partition level, we are able to allow concurrent DML statements on the same table.
  • To improve our data clustering feature (currently in preview), we added support for specifying expressions on table columns in clustering keys. This enables more fine-grained control over the data in the columns used for clustering.
  • Also, we reduced the startup time for virtual warehouses (up to XL in size) to a few seconds, ensuring almost instantaneous provisioning for most virtual warehouses.
  • We extended our SQL by adding support for the ANSI SQL TABLESAMPLE clause. This is useful when a user wants to limit a query operation performed on a table to only a random subset of rows from the table.
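For example (the table name here is hypothetical), a query can sample a percentage of rows or a fixed number of rows:

-- roughly 10% of rows, each row included with probability 0.10
select * from lineitem tablesample bernoulli (10);

-- or a fixed-size random sample
select * from lineitem sample (1000 rows);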

Staying Ahead with Enterprise-ready Security

From day one, security has always been core to Snowflake’s design.

  • We expanded Snowflake’s federated authentication and single sign-on capability by integrating with many of the most popular SAML 2.0-compliant identity providers. In addition to Okta, Snowflake now supports ADFS/AD, Azure AD, Centrify, and OneLogin, to name just a few.
  • To advance Snowflake’s built-in auditing, we introduced new Information Schema table functions (LOGIN_HISTORY and LOGIN_HISTORY_BY_USER) that users can query to retrieve the short-term history of all successful and failed login requests in the previous 7 days. If required, users can maintain a long-term history by copying the output from these functions into regular SQL tables.
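A sketch of how such an audit query might look (the column list is abbreviated):

-- failed login attempts over the last 24 hours
select event_timestamp, user_name, client_ip, error_message
from table(information_schema.login_history(
       time_range_start => dateadd('hours', -24, current_timestamp()),
       time_range_end   => current_timestamp()))
where is_success = 'NO'
order by event_timestamp;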

Improving our ecosystem

Enabling developers and builders to create applications with their favorite tools and languages remains a high priority for us.

  • With respect to enterprise-class ETL, we successfully collaborated with Talend in building a native Snowflake connector based on Talend’s new and modern connector SDK. The connector, currently in preview, has already been deployed by a number of joint customers with great initial feedback on performance and ease-of-use.
  • To tighten the integration of our Snowflake service with platforms suited for machine learning and advanced data transformations, we released a new version of our Snowflake Connector for Spark, drastically improving performance by pushing more query operations, including JOINs and various aggregation functions, down to Snowflake. Our internal 10 TB TPC-DS performance benchmark tests demonstrate that running TPC-DS queries using this new v2 Spark connector is up to 70% faster compared to executing SQL in Spark with Parquet or CSV (see this Blog post for details).
  • We continue to improve our drivers for our developer community. Listening to feedback from our large Python developer community, we worked on a new version of Snowflake’s native Python client driver, resulting in up to 40% performance improvements when fetching result sets from Snowflake. And, after we open-sourced our JDBC driver last quarter, we have now made the entire source code available on our official GitHub repository.
  • And, last, but not least, to enhance our parallel data loading via the COPY command, ETL developers can now dynamically add file metadata information, such as the actual file name and row number, which might not be part of the initial payload.
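A minimal sketch of that pattern, assuming a hypothetical stage and target table:

copy into events (src_file, src_row, payload)
from (
  select metadata$filename,         -- name of the staged file
         metadata$file_row_number,  -- row number within that file
         $1                         -- the JSON payload itself
  from @json_stage
)
file_format = (type = 'JSON');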

Increasing transparency and usability

These features are designed to strike the right balance between offering a service that is easy to operate and exposing actionable insights into the running service.

  • One major addition to our service is Query Profile, now generally available and fully integrated into Snowflake’s web interface. Query Profile is a graphical tool you can use to detect performance bottlenecks and areas for improving query performance.
  • Various UI enhancements were implemented: Snowflake’s History page now supports additional filtering by the actual SQL text and query identifier. We also added UI support for creating a Parquet file format in preparation for loading Parquet data into variant-type table columns in Snowflake.
  • A new Information Schema table function (TABLE_STORAGE_METRICS) exposes information about the data storage for individual tables. In particular, a user can now better understand how tables are impacted by Continuous Data Protection, particularly Time Travel and Fail-safe retention periods, as well as which tables contain cloned data.
  • We also recently introduced smarter virtual warehouse billing through Warehouse Billing Continuation (see this Blog post for details). If a warehouse is suspended and resumed within 60 minutes of the last charge, we do not charge again for the servers in the warehouse. WBC eliminates additional credit charges, and we hope it will reduce the need for our customers to strictly monitor and control when warehouses are suspended and resized.

Scaling and investing in service robustness

These service enhancements aren’t customer visible, but are crucial for scaling to meet the demands of our rapidly growing base of customers.

  • As part of rolling out the new EU (Frankfurt) region, we increased automation of our internal deployment procedures to (a) further improve engineering efficiency while (b) laying the foundation for rapidly adding new regions based on customer feedback.
  • We further streamlined and strengthened our various internal testing and pre-release activities, allowing us to ship new features to our customers on a weekly basis – all in a fully transparent fashion with no downtime or impact to users.

Conclusion and Acknowledgements

This summary list of features delivered in Q1 highlights the high velocity and broad range of features the Snowflake Engineering Team has successfully delivered in a short period of time. We are committed to putting our customers first and maintaining this steady pace of shipping enterprise-ready features each quarter. Stay tuned for another feature-rich Q2.

For more information, please feel free to reach out to us at info@snowflake.net. We would love to help you on your journey to the cloud. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

Happy Tweet, Sad Tweet – Building a Naive Bayes Classifier in SQL

We love to talk about how Snowflake is built and what makes Snowflake stand out from the data warehousing crowd, but we also love to show how you can put Snowflake to use for interesting big data challenges.  In this blog post we show how to build a machine learning classifier on top of Snowflake using SQL and Snowflake’s JSON extensions.  The classifier will learn to separate happy emotions from sad emotions in short text snippets.  As a data source we use the publicly available Twitter data stream that can be downloaded as raw JSON.

Twitter is a never-ending source of news, links, comments, and personal chatter. Some of this communication is highly emotional: it’s happy, it’s sad, it’s neither, or both at the same time. By loading Twitter data into Snowflake, we can build a machine learning classifier that detects such emotions automatically. The machine learning classifier can then be used to detect emotions in any tweet-sized text snippet, such as “The sun is shining, the weather is sweet again” (happy), “I must’ve called a thousand times” (sad), and “Make America Great Again” (happy).

Or, in SQL:

select * from result order by id;
----+------------------------------------------------+-------+
 ID |                      TEXT                      | LABEL |
----+------------------------------------------------+-------+
 1  | The sun is shining, the weather is sweet again | happy |
 2  | I must've called a thousand times              | sad   |
 3  | Make America Great Again                       | happy |
----+------------------------------------------------+-------+

In this blog post we will build a Naive Bayes machine learning classifier using SQL. We do this in three steps:

  1. We create a training data set by labeling tweets as happy or sad using emojis.
  2. We compute a machine learning model using a Naive Bayes classifier.
  3. We validate our model using a test data set and detect emotions in other text snippets.

Let’s quickly dive into what Twitter data looks like, what it takes to build a machine learning model, and what a Naive Bayes classifier is anyway.

Twitter Data

Twitter allows you to download a one percent sample of all public tweets for free. The format of this data is JSON. Each tweet is a separate JSON object that looks like this:

{
  "created_at": "Tue Feb 02 21:09:01 +0000 2016",
  "entities": {
    "hashtags": [ ... ],
    "media": [],
    "symbols": [],
    "trends": [],
    "urls": [],
    "user_mentions": [ ... ]
  },
  "id": 694751783172739072,
  "lang": "en",
  "retweeted_status": { ... },
  "text": "We are delighted to inform you that your submission 900: 
           The Snowflake Elastic Data Warehouse has been accepted. 
           #SIGMOD @SnowflakeDB",
  "user": { "screen_name": "hemasail" ... }
  ...
}

This tweet was sent by user @hemasail—which is me, coincidentally—on February 2, 2016. Every JSON object representing a tweet has mandatory fields and optional fields. Mandatory fields include the tweet’s id, the user that sent the tweet, and the language the tweet was written in (which is apparently detected by Twitter’s language classifier). Optional fields include “retweeted_status,” containing information about the retweeted tweet (if any); and “entities,” containing information about hashtags, media, URLs, etc. that were found in the tweet. There is much more information in a tweet’s JSON that is not shown in the example above. It is always interesting to see the full amount of refined information from “140 characters or less.”

Twitter data formatted in JSON is a great fit for Snowflake. Snowflake supports JSON natively: loading, querying, and unloading of JSON data is supported without conversion. Snowflake also supports user-defined functions in JavaScript that can operate on JSON data directly.

Machine Learning

Machine Learning is a method in computer science to automatically learn characteristics of existing data and apply the findings to new data. In this blog we use supervised machine learning. Supervised machine learning is a branch of machine learning where the existing data set needs to be labeled. That is, each item in the data set—each tweet for example—is assigned a label: this one is a happy tweet, this one is a sad tweet. There is also unsupervised machine learning where the data set does not need to be labeled, and the machine learning algorithm tries to find labels itself. However, we will use a labeled data set.

There are many ways to generate labels. Usually, labels are generated by humans. In our case, that would mean that we sit down and label lots of tweets as happy or sad. This would take a lot of time. We could also outsource label generation to services such as Amazon Mechanical Turk, but this would take a lot of money. So we won’t do either. Instead, we will assign labels to tweets using emojis. If a tweet contains a happy emoji, we will assign a label “happy” to this tweet. If a tweet contains a sad emoji, we will assign a label “sad” to this tweet. If there are no emojis or both types of emojis in a tweet, we will assign labels “none” or “both,” respectively.

Given a labeled data set, we will split it into a training data set and a test data set. The training data set will be used to train the machine learning classifier. The test data set will be used to test how good our classifier is. We will split our data set into 80% training data and 20% test data.

Naive Bayes Classifier

A Naive Bayes classifier is a simple type of machine learning model based on probabilities. In our case, a Naive Bayes classifier uses word probabilities to classify a tweet as happy or sad. Simply put, a Naive Bayes classifier counts all words and computes probabilities of how often they appear in one category or another. For example, the word “great” appears more often in happy tweets than in sad tweets. The word “must’ve” appears more often in sad tweets than in happy tweets. By combining all of these word probabilities (multiplying them, or equivalently summing their logarithms), we get an overall score for how likely a text belongs to one category or another. For example, combining the word probabilities in “Make America Great Again” shows that it is more happy than sad. At least, our training data from Twitter tells us so.
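In slightly more formal terms, the classifier scores a text for each label c as shown below. This is only a sketch, but it mirrors the pc and pw terms (and the 0.1 smoothing constant) used in the SQL later in this post:

\mathrm{score}(c) = \log\frac{\mathrm{docs}_c}{\mathrm{docs}}
  + \sum_{w \in \mathrm{text}} \log\frac{\mathrm{count}(w,c) + 0.1}{\mathrm{words}_c + 0.1\,|V|}

The predicted label is simply the c with the highest score.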

A much better explanation of the Naive Bayes classifier can be found in the slides from Stanford here.

Building a Naive Bayes Classifier in SQL

We create a training data set by first labeling tweets as happy or sad. Labels will be assigned using emojis. If a tweet has a happy emoji, we assign the label “happy.” If a tweet has a sad emoji, we assign the label “sad.” If the tweet does not contain an emoji, or if it contains both types of emojis, we assign the labels “none” or “both” respectively.

To compute labels, we will use the following user-defined function (UDF) in JavaScript. The function takes a string as input and outputs a label based on the type of emojis found. For this task, it splits the string into words and compares each word with a happy emoji or sad emoji. We call this function “happyLabel”:

create or replace function happyLabel(TEXT string)
returns string
language javascript
as '
 var happyRegex = /:-?\\)/;
 var sadRegex = /:-?\\(/;
 var happyEmoji = happyRegex.test(TEXT);
 var sadEmoji = sadRegex.test(TEXT);
 if (happyEmoji && sadEmoji) return "both";
 if (happyEmoji) return "happy";
 if (sadEmoji) return "sad";
 return "none";
';

We will use another JavaScript UDF that cleans a tweet and splits it into words. This function takes a string as input and outputs a variant. A variant is a Snowflake-specific data type that can contain semi-structured data like JSON objects and JSON arrays, as well as native data such as strings and numbers. This function outputs a variant containing a JSON array. Before splitting a tweet into words, the function removes any HTML entities such as & and <, any mention of other Twitter names, and any punctuation. It outputs a list of lower-case words.

create or replace function splitText(TEXT string)
returns variant
language javascript
as '
 var words = TEXT
   .replace(/&\\w+;/g, " ")         // remove HTML entities
   .replace(/@[^\\s]+/g, "")        // remove mentions
   .replace(/[^#\'\\w\\s]|_/g, " ") // remove punctuation
   .toLowerCase()
   .trim()
   .split(/\\s+/g);                 // split on whitespace
 return words;
';

Now we will write SQL to generate the labeled data set. To generate the labels, we will call the JavaScript UDFs defined above. We will balance the labeled data set such that there are the same number of happy tweets and sad tweets. From this balanced data set, we will use 80 percent of the tweets for training the classifier. We will compute all probabilities and counts necessary for the Naive Bayes classifier. For a complete summary of all formulas needed to implement the classifier, see the slides from Stanford University. For any questions regarding JSON processing in Snowflake, refer to the Snowflake documentation.

-- label tweets and only select happy and sad tweets
create or replace table labeledTweets as
select tweet,
      happyLabel(tweet:text) as label,
      splitText(tweet:text) as words,
      uniform(0::float, 1::float, random()) as rand
from twitter.public.tweets_1p
where created_at > '2016-05-01'::date      -- tweets from May 2016
 and tweet:lang = 'en'                     -- english language only
 and array_size(tweet:entities:urls) = 0   -- no tweets with links
 and tweet:entities:media is null          -- no tweets with media (images etc.)
 and tweet:retweeted_status is null        -- no retweets, only originals
 and (label = 'happy' or label = 'sad');

-- balance tweets to have same number of happy and sad tweets
create or replace table balancedLabeledTweets as
select *
from (select *,
       rank() over (partition by label order by tweet:id) as rnk
     from labeledTweets) x
where rnk <= (select min(cnt)
             from (select label,count(*) as cnt
                   from labeledTweets
                   group by label));
-- training set, pick random 80 percent of tweets
create or replace table training as
select * from balancedLabeledTweets where rand < 0.8;

-- training helper table: tweet id, word, label
create or replace table training_helper as
select tweet:id as id, value as word, label
from training,
    lateral flatten(words);
 
-- number of total docs
create or replace table docs as
select count(*) as docs from training;

-- number of docs per class j
create or replace table docsj as
select label,count(*) as docsj
from training
group by label;

-- number of distinct words = |Vocabulary|
create or replace table voc as
select count(distinct word) as voc
from training_helper;

-- number of words per class j
create or replace table wordsj as
select label,count(*) as wordsj
from training_helper
group by label;

-- count per word n and class j
create or replace table wordsnj as
select a.label,a.word,ifnull(wordsnj,0) as wordsnj
from
(select label,word
from (select distinct label from training_helper) c
 join
    (select distinct word from training_helper) w) a
left outer join
 (select label,word,count(*) as wordsnj
 from training_helper
 group by label,word) r
on a.label = r.label and a.word = r.word;

To test our Naive Bayes classifier, we will use the remaining 20 percent of the labeled tweets. For every tweet in the test set, we will compute the actual label that the classifier assigns to the tweet. The classifier does not look at the emojis, of course. That would be cheating. We will then compare the actual labels with the expected labels. This will give us a percentage of correctly classified tweets.

-- test set
create or replace table test as
select * from balancedLabeledTweets where rand >= 0.8;

-- test helper table: tweet id, word, label
create or replace table test_helper as
select tweet:id as id,value as word,label
from test,
    lateral flatten(words);

-- classification probabilities
create or replace table probs as
select id,label,max(pc)+sum(pw) as p,expected
from (
select id,t.word,n.label,wordsnj,wordsj,docsj,docs,voc,
 log(10,docsj::real/docs) as pc,
 log(10,(wordsnj::real+0.1)/(wordsj+0.1*voc)) as pw,
 t.label as expected
from test_helper t
 inner join wordsnj n on t.word = n.word
 inner join wordsj j on n.label = j.label
 inner join docsj on docsj.label = n.label
 cross join docs
 cross join voc) x
group by id,label,expected;

-- classification result
create or replace table testResult as
select p1.id,t.tweet:text::string as text,p1.expected,p1.label as actual
from probs p1
 inner join (
   select id,max(p) as maxp
   from probs
   group by id) p2
 on p1.id = p2.id and p1.p = p2.maxp
 inner join test t on p1.id = t.tweet:id;

-- correctly classified tweets: "win probability"
select sum(win),count(*),sum(win)::real/count(*) as winprob
from (
select id,expected,actual,
 iff(expected = actual,1,0) as win
from testResult);

The output of the last query is the following:
---------+-----------+-----------------+
SUM(WIN) | COUNT(*)  |     WINPROB     |
---------+-----------+-----------------+
43926    | 56298     |   0.7802408611  |
---------+-----------+-----------------+

That means our Naive Bayes classifier classified 78% of all test tweets correctly. This is not too bad given that we did not do much data cleansing, spam detection, and word stemming. The baseline winning probability is 50% because the number of happy tweets is the same as the number of sad tweets in our data set. Thus, our classifier gives us a significant boost.

We can now use the trained classifier to label any text snippets. For example, we can label text snippets such as “The sun is shining, the weather is sweet again,” “I must’ve called a thousand times,” and “Make America Great Again.” To do this, we create a new table with all text snippets that we want to classify. To compute the classification, we split each text into words and use the word probabilities from our training set to ultimately assign a label to each text snippet. The results are stored in a table called “result”.

-- create new table with any text snippets
create or replace table query(id int, text varchar(500));
insert into query values (1, 'The sun is shining, the weather is sweet again');
insert into query values (2, 'I must\'ve called a thousand times');
insert into query values (3, 'Make America Great Again');
 
-- split texts into words
create or replace table query_helper as
select id,value as word
from query,
    lateral flatten(splitText(text));

-- compute probabilities using data from training set
create or replace table probs as
select id,label,max(pc)+sum(pw) as p
from (
select id,t.word,n.label,wordsnj,wordsj,docsj,docs,voc,
 log(10,docsj::real/docs) as pc,
 log(10,(wordsnj::real+0.1)/(wordsj+0.1*voc)) as pw
from query_helper t
 inner join wordsnj n on t.word = n.word
 inner join wordsj j on n.label = j.label
 inner join docsj on docsj.label = n.label
 cross join docs
 cross join voc) x
group by id,label;

-- assign labels to text snippets
create or replace table result as
select p1.id as id, text, p1.label as label
from probs p1
 inner join (
   select id,max(p) as maxp
   from probs
   group by id) p2
 on p1.id = p2.id and p1.p = p2.maxp
 inner join query q on p1.id = q.id;

And, ta-dah, here are the results of our classification:

select * from result order by id;
---+------------------------------------------------+-------+
ID |                      TEXT                      | LABEL |
---+------------------------------------------------+-------+
1  | The sun is shining, the weather is sweet again | happy |
2  | I must’ve called a thousand times              | sad   |
3  | Make America Great Again                       | happy |
---+------------------------------------------------+-------+

Summary

In this blog post, we built a complete Naive Bayes classifier using SQL. The classifier learned to distinguish happy tweets from sad tweets. Our classifier has a success rate of 78%, which is a significant improvement over the baseline of 50%. We used the classifier on other text snippets to show its applicability beyond tweets. Of course, this is only a demo and further data cleansing, spam detection, word stemming, and other natural language processing tricks are necessary to increase the success rate of the classifier. 

As always, keep an eye on this blog site, our Snowflake Twitter feed (@SnowflakeDB), and my personal Twitter feed (@hemasail) for updates on all the action and activities here at Snowflake Computing.

JSON Support with Snowflake

Hopefully you had a chance to read our previous top 10 posts. As promised, we continue the series with a deeper dive into another of the Top 10 Cool Features from Snowflake. 

#6 JSON support with Snowflake

One of the things that got people at #Strata Hadoop excited this week was our support for JSON and other semi-structured data types. For traditional data warehouse users, the world of big data can be challenging. We are used to using SQL to query data, having a well-defined data model, and knowing what the source and target schemas look like. We needed a simpler way to handle the flexible schemas that come with semi-structured data like JSON documents. As much as they try, legacy data warehouse systems do not provide extensive support for JSON data, and big data systems require learning extensive new programming skills.

When our founders started from scratch to build a data warehouse for the cloud, they wanted a solution that could combine all your data in one place without the need to resort to multiple platforms or programming paradigms. As a result, combining structured and semi-structured data in one place and making it available through standard ANSI SQL is a core strength of the Snowflake service, and one extensively used by our customers.

Snowflake was built with features to simplify access to JSON data and provide the ability to combine it with structured data! Using Snowflake, you can learn to query JSON data using SQL, and join it to traditional tabular data in relational tables easily. Our innovative approach allows the user to store the JSON documents in a relational table using a new data type (VARIANT) that is optimized automatically in the background for MPP and columnar access.

This is a great way to bridge the gap between the big data world and the relational world, and it simplifies access for users. Most legacy databases (with their legacy code bases) cannot do this efficiently. Some legacy data warehouse providers cannot support JSON data at all, so you may have to acquire and manage a separate big data system. Others may require pre-processing of the data, such as conversion into simplified CSV-type data. That may make the data easier to ingest, but it requires time and resources, and it limits the ability to easily accommodate the changes that come with a flexible-schema data type in the relational data model. In addition, the JSON data may be stored in a text field rather than an optimized data type, which has its cost in query execution speed and data storage.

Snowflake makes semi-structured data available inside the data warehouse service seamlessly. The data can be ingested directly into a table in Snowflake and can then be queried easily. And any changes to the schema of the inbound JSON are accommodated automatically without impact to existing queries.

Example Code

In this scenario, we are going to use Snowflake SQL extensions for querying semi-structured data, and our innovative data type (VARIANT) to join data to other purely relational tables. We are going to combine Twitter data (JSON data) with  product data in relational tables.

The main table which stores the Twitter JSON data, twitter.data.tweets, has two columns: tweet and created_at.  The column Tweet is defined as a VARIANT type and holds the JSON from a Twitter feed, while created_at is a relational column with a data type of TIMESTAMP_NTZ (NTZ = no time zone).

Tweet table
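For reference, the DDL for such a table is a one-liner (a sketch showing only the two columns described above):

create or replace table twitter.data.tweets (
  tweet      variant,        -- full JSON document for the tweet
  created_at timestamp_ntz   -- relational timestamp column
);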

Here is an example showing a fairly simple SQL query with the JSON extensions. In this query we are joining some Twitter data to product data in relational tables in order to get a count of Tweets that contain hashtags related to a particular product:

select extract('day',created_at) janday,count(*) cnt
  from
    twitter.data.tweets t,

     -- unnest each tweet on the hashtags within its entities
     lateral flatten (input => t.tweet, path => 'entities.hashtags') tags,
     (select distinct ph_hashtag
        from
          sales.public.producthashtags,
          sales.public.product
        where p_name ='Blue Sky'
        and   p_productkey = ph_productkey) p

     where tags.value:text::string = p.ph_hashtag
     and   created_at >= '2014-01-01 00:00:00'
     and   created_at <  '2014-02-01 00:00:00'

    group by 1
    order by 1

The following section of the code pivots the elements in the JSON string into a set of rows so we can do traditional joins:

     -- unnest each tweet on the hashtags within its entities
     lateral flatten (input => t.tweet, path => 'entities.hashtags') tags,

Specifically, it is pulling out the nested array of hashtags within the entities element. The predicate is then where we join those hashtag values in the tweet string to the hashtag column within the Product table (aliased as “p”):

     where tags.value:text::string = p.ph_hashtag

In this case, “tags” is the virtual table alias created by the FLATTEN function, and the keyword “value” indicates we want the content of that row. The rest of the specification indicates it is text data that we want cast as a STRING so it matches the data type of the column p.ph_hashtag.

Then the last part of the predicate is a normal filter for a date range using the date column in the TWEETS table:

     and   created_at >= '2014-01-01 00:00:00'
     and   created_at <  '2014-02-01 00:00:00'

So there you have it, using SQL to combine semi-structured data with traditional structured data in a relational data warehouse in the cloud. No big data system required.

Not bad!

But that is just a flavor of how you can utilize Snowflake to easily get value from your JSON data. There’s way more than we can cover in a simple blog post (like building relational views on the JSON for example).
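As a quick taste of that idea, here is a sketch of a relational view over the hashtags in the tweets table (the view name is hypothetical):

create or replace view twitter.data.tweet_hashtags as
select t.created_at,
       t.tweet:user:screen_name::string as screen_name,
       tags.value:text::string          as hashtag
from twitter.data.tweets t,
     lateral flatten(input => t.tweet:entities:hashtags) tags;

BI tools can then query tweet_hashtags like any ordinary table, with no knowledge of the underlying JSON.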

Want to find out more? Ask us for a demo or check out the presentation by Grega Kaspret (@gregakespret) from Celtra Mobile at Strata Hadoop World (San Jose) this week, talking about simplifying a JSON data pipeline using Snowflake. And follow our Twitter feeds (@SnowflakeDB, @kentgraziano, and @cloudsommelier) for more Top 10 Cool Things About Snowflake and updates on all the action at Snowflake Computing.

Kent Graziano and Saqib Mustafa

Top 10 Cool Things I Like About Snowflake

I have now been with Snowflake Computing for a little over two months (my how time flies). In that time, I have run the demo, spoken at several trade shows, and written a few blog posts. I have learned a ton about the product and what it means to be an Elastic Data Warehouse in the Cloud.

So for this post I am going to do a quick rundown of some of the coolest features I have learned about so far. 

#10 Persistent results sets available via History

Once you execute a query, the result set will persist for 24 hours (so you can go back and check your work). It may seem minor to some, but it sure is convenient to be able to pull up the results from a previous query without having to execute the query a second time. Saves on time and processing. Read more
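Beyond the History page, a persisted result can also be reused programmatically; for example, RESULT_SCAN lets you query the output of a previous statement as if it were a table (a sketch, with a hypothetical orders table):

select count(*) from orders where o_orderdate >= '2016-01-01';

-- reuse the cached result of the most recent query without re-running it
select * from table(result_scan(last_query_id()));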

#9 Ability to connect with JDBC

Again, this seems like a no-brainer, but it is very important. I had no real concept of how I would connect to a data warehouse in the cloud, so this was good news. After getting my favorite data modeling tool, Oracle SQL Developer Data Modeler (SDDM), installed on my new Mac, I was able to configure it to connect to my Snowflake demo schema using JDBC and reverse engineer the design.

So why is this cool? It means that whatever BI or ETL tool you use today, if it can talk over JDBC, you can connect it to Snowflake. Read more

#8 UNDROP

With UNDROP in Snowflake you can recover a table instantaneously with a single command:

UNDROP TABLE <tablename>

No need to reload last night’s backup to do the restore. No need to wait while all that data is pulled back in. It just happens!

Now that is a huge time (and life) saver. Read more

#7 Fast Clone

Even cooler than UNDROP is the fast clone feature.

The Snowflake CLONE command can create a clone of a table, a schema, or an entire database almost instantly. It took me barely a minute to create a clone of a 2TB database without using additional storage! And I am not a DBA, let alone a “cloud” DBA.

This means you can create multiple copies of production data without incurring additional storage costs. No need to have separate test/dev data sets.
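The syntax is a single statement at the table, schema, or database level (object names here are just examples):

create table orders_dev  clone orders;
create schema stage_dev  clone stage;
create database sales_qa clone sales;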

Hence why I think it is way cool! Read more

#6 JSON Support with SQL

During the first demo of Snowflake I attended (before I even applied for a job here), this one got my attention.

Using the knowledge and skills I already had with SQL, I could quickly learn to query JSON data, and join it to traditional tabular data in relational tables.

Wow – this looked like a great stepping stone into the world of “Big Data” without having to learn complex technologies like Hadoop, MapReduce, or Hive! Read more

Yes, I call that a very cool feature. The JSON documents are stored in a table and optimized automatically in the background for MPP and columnar access, which gives you the ability to combine semi-structured and structured data in one location. For further details, check out my detailed two-part blog here and here.

#5 ANSI compliant SQL with Analytic Functions

Another key feature required of any relational data warehouse is, of course, the ability to write standard SQL. Even more important for data warehousing is access to sophisticated analytic and windowing functions (e.g., lead, lag, rank, stddev).

Well, Snowflake definitely has these. In fact, we support everything you would expect, including aggregation functions, nested virtual tables, subqueries, order by, and group by. This means it is fairly simple for your team to migrate your existing data warehouse technologies to Snowflake. Read more
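A small sketch of the kind of windowed SQL that just works (table and column names are hypothetical):

select o_custkey,
       o_orderdate,
       o_totalprice,
       rank() over (partition by o_custkey order by o_totalprice desc) as spend_rank,
       lag(o_totalprice) over (partition by o_custkey order by o_orderdate) as prev_order_price
from orders;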

#4 Separation of Storage and Compute

The innovative, patent-pending, Multi-Cluster, Shared Data Architecture in Snowflake is beyond cool. The architecture consists of three layers: storage, compute, and cloud services. Each layer is decoupled from the others and independently scalable. This enables customers to scale resources as they are required, rather than pre-allocating resources for peak consumption. In my 30+ years working in IT, I have not seen anything like it. It is truly one of the advantages that comes from engineering the product, from the ground up, to take full advantage of the elasticity of the cloud. Read more

#3 Support for Multiple Workloads

With this unique architecture, Snowflake can easily support multiple disparate workloads. Because of the separation of compute and storage, you can easily spin up separate Virtual Warehouses of different sizes to run your ELT processes, support BI report users, data scientists, and data miners. And it makes total sense to be able to keep disparate workloads separate, to avoid resource contention, rather than just saying we support “mixed” workloads.

And even better – no special skills or secret configuration settings are required to make this work. It is the way Snowflake is built by design. Nice! Read more
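In practice that can be as simple as creating one warehouse per workload and pointing each group of users or jobs at its own (names and sizes below are purely illustrative):

create warehouse elt_wh warehouse_size = 'XLARGE' auto_suspend = 300 auto_resume = true;
create warehouse bi_wh  warehouse_size = 'MEDIUM' auto_suspend = 300 auto_resume = true;
create warehouse ds_wh  warehouse_size = 'LARGE'  auto_suspend = 300 auto_resume = true;

-- each session simply selects the warehouse for its workload
use warehouse bi_wh;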

#2 Automatic Encryption of Data

Security is a major concern for moving to the cloud. With Snowflake, your data is automatically encrypted by default. No setup, no configuration, no add-on costs for high security features.

It is just part of the service! To me that is a huge win. Read more

#1 Automatic Query Optimization. No Tuning!

As a long-time data architect, and not a DBA, this is my favorite part of Snowflake. I do not have to worry about my query performance at all. It is all handled “auto-magically” via metadata and an optimization engine in our cloud services layer. I just model, load, and query the data.

So, no indexes, no need to figure out partitions and partition keys, no need to pre-shard any data for distribution, and no need to remember to update statistics.

This feature, to me, is one of the most important when it comes to making Snowflake a zero management Data Warehouse as a Service offering. Read more

Well, that is the short list of my top 10 favorite features in Snowflake. Keep a look out for future posts in the coming weeks, to provide details on these and other key features of the Snowflake Elastic Data Warehouse.

Now check out this short intro video to Snowflake!

If you want to learn more about Snowflake, sign up for one of our frequent webinars, or just drop me a line at kent.graziano@snowflake.net and I will hook you up!

P.S. Keep an eye on my Twitter feed (@kentgraziano) and the Snowflake feed (@SnowflakeDB) for updates on all the action and activities here at Snowflake Computing. Watch for #BuiltForTheCloud and #DWaaS.

Customizing Oracle SQL Developer Data Modeler (SDDM) to Support Snowflake VARIANT

On a recent customer call, the data architects were asking if Snowflake provided a data model diagramming tool to design and generate data warehouse tables or to view a data model of an existing Snowflake data warehouse. Or if we knew of any that would work with Snowflake.

Well, we do not provide one of our own – our service is the Snowflake Elastic Data Warehouse (#ElasticDW).

The good news is that there are data modeling tools in the broader ecosystem that you can of course use (since we are ANSI SQL compliant).

You may have seen several posts (on my personal blog) where I used Oracle SQL Developer Data Modeler (aka SDDM) to reverse engineer and re-engineer a Snowflake database. (As a reminder, SDDM is a free-to-use enterprise class data modeling tool that you can download from Oracle here.)

If you have read my previous posts on using JSON within Snowflake, you also know that we have a new data type called VARIANT for storing semi-structured data like JSON, AVRO, and XML.

In this post I will bring it together and show you the steps to customize SDDM to allow you to model and generate table DDL that contain columns that use the VARIANT data type.

Creating a User Defined Data Type

One of the features of Snowflake is that you can run most Oracle DDL syntax (among others) without edits to create your tables, views, and constraints in a Snowflake database (so as to make it easier for you to migrate your current data warehouse to our cloud-based data warehouse service). So that means you can create your data warehouse model in SDDM, generate the DDL using Oracle syntax, then run it without any edits.

Of course Oracle does not have a VARIANT data type, so SDDM does not have it either. Nicely though you can pretty easily define your own custom data type to use in models and output in create table DDL statements.

I knew the feature was there but had a hard time getting it to do what I wanted, so I posted a question on the OTN Forum for Data Modeler and, as usual, Philip (the lead architect for Data Modeler) answered and laid out the steps for me. As he said, and you will see, it is not totally straightforward and is a bit of a chicken-and-egg scenario.

Define a User Defined Native Type

First you have to go to Tools -> Types Administration and go to the tab for user defined native types. Then pick the RDBMS type you want to associate the new type with. I used Oracle Database 12c (remember there is no native Snowflake option). Press the green plus (+) to create a new entry. Under native type give it a name (in my case VARIANT), then under Logical type select one of the standard types. I used CLOB for now (this really is a temporary setting that I will change shortly).

Define Custom Type 1

Notice that there are other options for “Has size” and  “Has precision and scale” that you might use if the type you want to define will need those properties, but for VARIANT you do not define size or precision so I left those unchecked.

Add a new Logical Type

This was the step that was not obvious at all to me. It turns out when you are defining the data type on a column you are usually selecting a Logical Data Type (unless you are picking a Domain). To make this work we need to add a new Logical Type to map our VARIANT data type to.

So now we need to switch to the first tab in Types Administration dialog. It is the one labeled “Logical types to native types.”

On this tab click the Add button on the bottom left. Now you can type in a new logical type name. I named mine VARIANT to prevent any confusion. The key piece now is to map it to the Native Type that was defined in the previous step. To do that you pick from the drop down on the database you set up (in this case Oracle 12c). In that list is the VARIANT type. Once that is selected you need to be sure to hit APPLY and then SAVE.

Define New Logical Type

Note that if you wanted to define a new type for multiple databases and versions, you would need to repeat this process for all others too.

Now change the original Native Type

Again not that obvious a step, but go back to the “User defined native types” tab (the 3rd tab) and click on the Logical type drop down. Scroll all the way to the bottom and you should see your newly defined logical type. Pick it then press the SAVE button again. Now you can CLOSE the dialog (and save your design to be sure!)

Refine Custom Type 2

Now you have the custom data type defined so it can be used!

Apply the User Defined Type

At this point you can either define a new column using your custom data type or edit an existing one to change the data type.

In my case, I reversed engineered a table in Snowflake that had a VARIANT column in it (before I made these changes to the data types).

VARIANT in Snowflake UI

Since SDDM did not know what VARIANT was, it guessed and assigned it to the Source Type of VARCHAR.

Import of VARIANT

Note that the Data Type property is set to the Logical radio button – hence the need to set up a custom Logical Type too.

With the new type defined, I can now simply edit the column and pick VARIANT from the drop down list. It does appear that custom types go to the bottom of the list.

Assign VARIANT to Column

Don’t forget to press APPLY or OK to save the reassignment.

Generate the DDL

With the assignment done now do a right mouse click on the table in the diagram and pick DDL Preview. You should see the DDL includes the columns using the newly defined data type.

VARIANT DDL

Test It!

In my case I just copied the DDL from the preview window and pasted it into the Snowflake Web UI, then executed it.

Success! It worked!

I can now use Oracle SQL Developer Data Modeler to design tables for a Snowflake database that include columns using our innovative VARIANT data type.

And now you know how to do it too!

 

Part II: Making Schema-on-Read a Reality

In the first article of this series, I discussed the Snowflake data type VARIANT, showed a simple example of how to load a VARIANT column in a table with a JSON document, and then how easy it is to query data directly from that data type. In this post I will show you how to access an array of data within the JSON document and how we handle nested arrays. Then finally I will give you an example of doing an aggregation using data in the JSON structure and how simple it is to filter your query results by referring to values within an array.

Handling Arrays of Data

One of the features in JSON is the ability to specify and embed an array of data within the document. In my example, one such array is children:

    "children": [
         { "name": "Jayden", "gender": "Male", "age": "10" },
         { "name": "Emma", "gender": "Female", "age": "8" },
         { "name": "Madelyn", "gender": "Female", "age": "6" }
     ]

You will notice there are effectively 3 rows in the array and each row has 3 sub-columns – name, gender, and age. Each of those rows constitutes the value of that array entry which includes all the sub-column labels and data (remember that for later). So how do you know how many rows there are if you do not have access to the raw data?

Like this:

select array_size(v:children) from json_demo;

The function array_size figures it out for us. To pull the data for each “row” in the array, we use the dot notation from before, but now with the added specification for the row number of the array (in the [] brackets):

select v:children[0].name from json_demo
union all
select v:children[1].name from json_demo
union all
select v:children[2].name from json_demo;

Childrens names

This is interesting, but I really do not want to write UNION ALL SQL to traverse the entire array (in which case I would need to know how many values are in the array, right?).

We solve that problem with another new extended SQL function called FLATTEN. FLATTEN takes an array and returns a row for each element in the array. With that you can select all the data in the array as though they were in table rows (so no need to figure out how many entries there are).

Instead of doing the set of UNION ALLs, we add the FLATTEN into the FROM clause and give it a table alias:

select f.value:name
from json_demo, table(flatten(v:children)) f;

This syntax allows us to create an inline virtual table in the FROM clause.

In the SELECT, we can then reference it like a table. Notice the notation f.value:name.

f = the alias for the virtual table from the children array

value = the contents of the element returned by the FLATTEN function

name = the label of the specific sub-column we want to extract from the value

The results, in this case, are the same as the SELECT with the UNIONs but the output column header reflects the different syntax (since I have not yet added any column aliases).

Flatten Children 1

Now, if another element is added to the array (i.e., a 4th child), the SQL will not have to be changed. FLATTEN allows us to determine the structure and content of the array on-the-fly! This makes the SQL resilient to changes in the JSON document.

With this in hand, we can of course get all the array sub-columns and format them just like a relational table:

select
   f.value:name::string as child_name,
   f.value:gender::string as child_gender,
   f.value:age::string as child_age
from json_demo, table(flatten(v:children)) f;

Flatten Format Children

Putting this all together, I can write a query to get the parent’s name and all the children like this:

select
   v:fullName::string as parent_name,
   f.value:name::string as child_name,
   f.value:gender::string  as child_gender,
   f.value:age::string as child_age
from json_demo, table(flatten(v:children)) f;

Which results in this output:

Parent and Children

If I just want a quick count of children by parent, I do not need FLATTEN but refer back to the array_size:

select
   v:fullName::string as Parent_Name,
   array_size(v:children) as Number_of_Children
from json_demo;

Count Children

Handling Multiple Arrays

You may recall there are multiple arrays in our sample JSON string. I can pull from several arrays at once with no problem:

select
   v:fullName::string as Parent_Name,
   array_size(v:citiesLived) as Cities_lived_in,
   array_size(v:children) as Number_of_Children
from json_demo;

Parent, Children. City

What about an Array within an Array?

Snowflake can handle that too. From our sample data we can see yearsLived is an array nested inside the array described by citiesLived:

"citiesLived": [
 { "cityName": "London",
   "yearsLived": [ "1989", "1993", "1998", "2002" ]
 },
 { "cityName": "San Francisco",
   "yearsLived": [ "1990", "1993", "1998", "2008" ]
 },
 { "cityName": "Portland",
   "yearsLived<": [ "1993", "1998", "2003", "2005" ]
 },
 { "cityName": "Austin",
   "yearsLived": [ "1973", "1998", "2001", "2005" ]
 }
 ]

To pull that data out, we add a second FLATTEN clause that transforms the yearsLived array within the FLATTENed citiesLived array.

select
  tf.value:cityName::string as city_name,
  yl.value::string as year_lived
from json_demo,
     table(flatten(v:citiesLived)) tf,
     table(flatten(tf.value:yearsLived)) yl;

In this case the 2nd FLATTEN (alias “yl”) is transforming (really pivoting) the yearsLived array for each value returned from the 1st FLATTEN of the citiesLived array (“tf”).

The results output shows Year Lived by City:

YearLive by City

Like my earlier example, I can then augment this result by adding in the name too (so I know who lived where):

select
  v:fullName::string as parent_name,
  tf.value:cityName::string as city_name,
  yl.value::string as year_lived
from json_demo,
     table(flatten(v:citiesLived)) tf,
     table(flatten(tf.value:yearsLived)) yl;

Name and YearLived

Aggregations?

Yup, we can even aggregate data within semi-structured data. (We would not be much of a data warehouse service if we couldn’t, right?)

So, just like ANSI SQL, we can do a count(*) and a group by:

select
   tf.value:cityName::string as city_name,
   count(*) as years_lived
from json_demo,
     table(flatten(v:citiesLived)) tf,
     table(flatten(tf.value:yearsLived)) yl
group by 1;

And the results:

Count Years

Filtering?

Of course! Just add a WHERE clause.

select
  tf.value:cityName::string as city_name,
  count(*) as years_lived
from json_demo,
     table(flatten(v:citiesLived)) tf,
     table(flatten(tf.value:yearsLived)) yl
where city_name = 'Portland'
group by 1;

Where City

To simplify things, notice I used the column alias city_name in the predicate but you can also use the full sub-column specification tf.value:cityName as well.

Schema-on-Read is a Reality

I could go on, but by now I think you can see we have made it very easy to load and extract information from semi-structured data using Snowflake. We added a brand new data type, VARIANT, that lives in a relational table structure in a relational database without the need to analyze the structure ahead of time, design appropriate database tables, and then shred the data into that predefined schema. Then I showed you some easy to learn extensions to ANSI-standard SQL for accessing that data in a very flexible and resilient manner.

With these features, Snowflake gives you the real ability to quickly and easily load semi-structured data into a relational data warehouse and make it available for immediate analysis.

To see this feature in action, check out this short video.

Be sure to keep an eye on this blog or follow us on Twitter (@snowflakedb and @kentgraziano) for all the news and happenings here at Snowflake.

P.S. If you don’t already have a Snowflake account, you can sign up for a self-service account here and get a jumpstart with $400 in free credits!

Part I: Making Schema-on-Read a Reality

(Note: This is the first in a two-part series discussing how we handle semi-structured data in Snowflake.)

Schema? I don’t need no stinking schema!

Over the last several years, I have heard the phrase schema-on-read used to explain the benefit of loading semi-structured data into a Big Data platform like Hadoop. The idea is that you could delay data modeling and schema design until long after the data was loaded (so as not to slow down getting your data while waiting for those darn data modelers).

Every time I heard it, I thought (and sometimes said) – “but that implies there is a knowable schema.”  So really you are just delaying the inevitable need to understand the structure in order to derive some business value from that data. Pay me now or pay me later.

Why delay the pain?

So even though folks can quickly load this type of data into Hadoop or NoSQL, there is still more work ahead to actually pull the data apart so it can be analyzed. The person writing the query often has the burden of figuring out the schema and writing code to extract it. Additionally, there may be a query performance penalty in this process (over that of querying columns in a relational database).

Not so with the Snowflake Elastic Data Warehouse (#ElasticDW)! With Snowflake, you can load your semi-structured data directly into a relational table, query it with SQL, and join it to other structured data, all without fretting about future changes to the “schema” of that data. Snowflake actually keeps track of the self-describing schema so you don’t have to. No ETL or fancy shredding required.

One of the key differentiators that really attracted me to Snowflake is our built-in support to load and query semi-structured data such as JSON, XML, and Avro. In most conventional data warehouse and Big Data environments today, you have to first load this type of data to a Hadoop or NoSQL platform, then shred it (using, for example, MapReduce) just to load it into columns in a relational database (that is, if you want to run SQL queries or a BI/analytics tool against that data).

How did we do it?

Simple – we invented a new data type called VARIANT that allows us to load semi-structured data (i.e., flexible schema) as-is into a column in a relational table.

Read that again – we load the data directly into a relational table.

Okay, so that means no Hadoop or NoSQL needed in your data warehouse architecture just to hold semi-structured data. Just an RDBMS (in the cloud) that uses SQL that your staff already knows how to write.

But that is only half the equation. Once the data is in, how do you get it out?

Our brilliant founders and excellent engineering team (#DataSuperStars) have created extensions to SQL to reference the internal schema of the data (it is self-describing, after all), so you can query the components and join them to columns in other tables as if the data had been shredded into a standard relational table. Except there is no coding or shredding required to prep the data. Cool.

That also means that as the data source evolves and changes over time (e.g., new attributes, nesting, or arrays are added), there is no re-coding of ETL (or even ELT) code required to adapt. The VARIANT data type does not care if the schema varies.
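To make the join point concrete, here is a minimal sketch. It assumes a hypothetical structured table named employee_master (with employee_id and full_name columns) sitting alongside the json_demo table built in the walkthrough below; it is for illustration only, not part of the original example:

-- employee_master is a hypothetical structured table, assumed for illustration
select e.employee_id,
       j.v:fullName::string as full_name,
       j.v:age::int as age
from json_demo j
join employee_master e
  on e.full_name = j.v:fullName::string;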

What does it really look like?

Enough of the theory – let’s walk through an example of how this all works.

1 – Create a table

I have a Snowflake account, database and virtual warehouse set up already so just like I would in any other db, I simply issue a create table DDL statement:

create or replace table json_demo (v variant);

Now I have a table with one column (“v”) with a declared data type of VARIANT.

2 – Load some data

Now I load a sample JSON document using an INSERT and our PARSE_JSON function. We are not simply loading it as text; rather, we store it as an object in the VARIANT data type, while at the same time converting it to an optimized columnar format (for when we query it later):

insert into json_demo
 select
 parse_json(
 '{
     "fullName": "Johnny Appleseed",
     "age": 42,
     "gender": "Male",
     "phoneNumber": {
                     "areaCode": "415",
                     "subscriberNumber": "5551234"
                    },
     "children": [
                  { "name": "Jayden", "gender": "Male", "age": "10" },
                  { "name": "Emma", "gender": "Female", "age": "8" },
                  { "name": "Madelyn", "gender": "Female", "age": "6" }
                 ],
     "citiesLived": [
                    { "cityName": "London",
                      "yearsLived": [ "1989", "1993", "1998", "2002" ]
                    },
                    { "cityName": "San Francisco",
                      "yearsLived": [ "1990", "1993", "1998", "2008" ]
                    },
                    { "cityName": "Portland",
                      "yearsLived": [ "1993", "1998", "2003", "2005" ]
                    },
                    { "cityName": "Austin",
                      "yearsLived": [ "1973", "1998", "2001", "2005" ]
                    }
                   ]
  }');
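As an aside, an INSERT with a literal string is convenient for a demo like this. For real data volumes you would more typically bulk-load JSON files from a stage with COPY INTO; a minimal sketch, assuming a hypothetical stage named @json_stage holding the files:

-- @json_stage is a hypothetical stage, assumed for illustration
copy into json_demo
  from @json_stage
  file_format = (type = 'json');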

3 – Start pulling data out

So let’s start with just getting the name:

select v:fullName from json_demo;

Get fullname

Where:

v = the column name in the json_demo table

fullName = attribute in the JSON schema

v:fullName = notation to indicate which attribute in column “v” we want to select. 

So, similar to the table.column notation all SQL people are familiar with, in Snowflake we added the ability to effectively specify a column within the column (i.e., a sub-column), which is dynamically derived from the schema definition embedded in the JSON string.

4 – Casting the Data

Usually we don’t want to see the double quotes around the data in the report output (unless we were going to create an extract file of some sort), so we can cast it to a string and give it a nicer column alias (just as we would with a normal column):

select v:fullName::string as full_name
from json_demo;

Format FullName

Next let’s look at a bit more of the data using the same syntax from above:

select
   v:fullName::string as full_name,
   v:age::int as age,
   v:gender::string as gender
from json_demo;

Demographics

Again, simple SQL and the output looks like the results from any table you might have built in your traditional data warehouse.

Safe to say at this point, with what I have already shown you, you could look at a table in Snowflake with a VARIANT column and quickly start “shredding” the JSON with SQL.

How long did that take to learn?

This is why I love Snowflake! I can now query semi-structured data and I did not have to learn a new programming language or framework or whatever over in Big Data land – yet I have the same capabilities as if I did.

Much lower learning curve for sure.

Let’s get a little more complex

Yes, those examples are very simple, so let’s dive deeper. Notice in the original string there is some nesting of the data:

{
    "fullName": "Johnny Appleseed",
    "age": 42,
    "gender": "Male",
    "phoneNumber": {
                    "areaCode": "415",
                    "subscriberNumber": "5551234"
                   },
...

How do we pull that apart? With a very familiar table.column dot notation:

select
   v:phoneNumber.areaCode::string as area_code,
   v:phoneNumber.subscriberNumber::string as subscriber_number
from json_demo;

So just as fullName, age, and gender are sub-columns, so too is phoneNumber, and areaCode and subscriberNumber are in turn sub-columns of that sub-column. Not only can we pull apart nested objects like this, you can probably infer how easily we adapt if the schema changes and another sub-column is added.

What happens if the structure changes?

Imagine in a subsequent load the provider changed the specification to this:

{
    "fullName": "Johnny Appleseed",
    "age": 42,
    "gender": "Male",
    "phoneNumber": {
                    "areaCode": "415",
                    "subscriberNumber": "5551234",
                    "extensionNumber": "24"
                   },
...

They added a new attribute (extensionNumber)! What happens to the load?

Nothing – it keeps working because we ingest the string into the VARIANT column in the table.

What about the ETL code?

What ETL code? There is no ETL, so there is nothing to break.

What about existing reports?

They keep working too. The previous query will work fine. If you want to see the new column, then the SQL needs to be refactored to account for the change:

select
   v:phoneNumber.areaCode::string as area_code,
   v:phoneNumber.subscriberNumber::string as subscriber_number,
   v:phoneNumber.extensionNumber::string as extension_number
from json_demo;

In addition, if the reverse happens and an attribute is dropped, the query will not fail. Instead, it simply returns a NULL value for that attribute. In this way we insulate all the code you write from these types of dynamic changes.
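And if you would rather see a default than a NULL when an attribute is missing, a standard function such as COALESCE does the trick (a small illustrative variation):

-- returns 'n/a' for documents that have no extensionNumber attribute
select
   coalesce(v:phoneNumber.extensionNumber::string, 'n/a') as extension_number
from json_demo;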

Next time

This post has looked at the basics of how Snowflake handles semi-structured data, using JSON as a specific example. In Part 2, I will show you how we handle more complex schema structures like arrays and nested arrays within the JSON document as well as give you a brief example of an aggregation and filtering against the contents of a VARIANT data type.