How to analyze JSON with SQL

One of the key reasons I joined Snowflake is its built-in support for loading and querying semi-structured data such as JSON. In most conventional data warehouse and Big Data environments, you first have to load this type of data into a Hadoop or NoSQL platform. Then you must parse it with a framework such as MapReduce so you can load the data into tables in a relational database. Then, and only then, are you ready to analyze the data with SQL queries or a BI/Analytics tool. But why take extra steps when there is a faster, easier way to get the job done?

With Snowflake, you can load your semi-structured data directly into a relational table, query it with a SQL statement, and then join it to other structured data – all without fretting about future changes to the schema of that data. Snowflake keeps track of the self-describing schema so you don’t have to. No ETL or shredding required.
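As a quick illustration (a minimal sketch; the table and JSON path names here are hypothetical), semi-structured data lands in a single VARIANT column and can then be queried with simple path notation:

-- Create a table with a VARIANT column to hold raw JSON documents
create or replace table raw_events (v variant);

-- After loading JSON (for example via COPY INTO), query nested attributes
-- directly with path notation and cast them to standard SQL types
select v:user.name::string as user_name,
       count(*)            as event_count
from raw_events
group by 1;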

This means you can leverage your existing knowledge and skills in SQL to jump into the world of big data. Even with this feature, there is still a tiny bit to learn. However, it’s easy with Snowflake’s extensions to SQL. To get you started, we have produced a handy ebook that takes you step-by-step through loading some JSON into Snowflake, then querying that data with SQL. The book is called How to analyze JSON with SQL: Schema-on-read made easy.

Get your free ebook now and start your journey to analyzing big data in the cloud!

Be sure to keep an eye on this blog or follow us on Twitter (@snowflakedb and @kentgraziano) for all the news and happenings here at Snowflake.

P.S. If you don’t already have a Snowflake account, you can sign up for a self-service account here and get a jumpstart with $400 in free credits!

 

What makes Snowflake a data warehouse?

One of the most common questions I get when speaking to people about Snowflake is: “Why do you call it a data warehouse and not a database?” This is a very reasonable question given some of the characteristics of Snowflake.

At Snowflake, we say in part that we are a full relational database management system (RDBMS) built for the cloud. We are ACID compliant and we support standard SQL. That sounds like a database to me, too. Let’s take a closer look just to be sure.

What is a database?

A database is a collection of information organized to be easily accessed, managed and updated. While there are many types of databases available today, the most common is an RDBMS. But when most folks say “database”, they usually mean a traditional RDBMS that handles Online Transaction Processing (OLTP).

So, what are some of the defining characteristics of an OLTP database?

  • Designed for rapid storage and retrieval of small sets of current data records in support of transactions and interactions within an enterprise.
  • Data is organized in tables and columns, allowing users access via structured query language (SQL).
  • Handles quick, real-time activity such as entering a customer name, recording a sale and recording all accounting activity of that sale.
  • Works well for basic operational reporting of a limited number of records. Analytic reporting is relegated to simple, static reports often driven by IT.

What is a data warehouse?

Some of the defining characteristics of a data warehouse are:

  • A database designed to store and process large volumes of current and historical data collected from multiple sources inside and outside the enterprise for deep analysis.
  • Organizes data into tables and columns, and allows users access via SQL.
  • Optimized for loading, integrating and analyzing very large amounts of data.
  • Designed to support descriptive, diagnostic, predictive and prescriptive analytic workloads.

Snowflake definitely exhibits the overlapping characteristics of both a database and a data warehouse – ACID compliance, support for standard SQL, etc. But Snowflake also embodies all of the defining characteristics of a data warehouse.

One of the key differentiators of Snowflake from other solutions is that it is specifically designed for data warehousing and high-speed analytic processing. Rather than a generalized SQL database that has been “tuned” or adapted to handle these types of workloads, Snowflake was built from the ground up for the cloud to optimize loading, processing, and query performance for very large volumes of data. Therefore, hands down, Snowflake is a data warehouse.

So, why do we still need a specialized data warehouse engine?

As OLTP databases have been able to scale higher and innovations like in-memory databases have emerged, some organizations have questioned whether they still need a separate technology or specialized system for reporting and analytics. The answer, again, requires us to look at the basics: What benefits emerge from storing and analyzing data in a separate system?

  1. It eases the burden of reporting from transactional systems by removing the contention for limited and expensive resources.
  2. It produces more business-friendly data results by allowing the data to be restructured to a more suitable format.
  3. It provides access to a wider array of reports more quickly because all the resources in the data warehouse are dedicated to reporting and analysis.
  4. It integrates valuable data from across the enterprise for richer insight – something that can’t (and shouldn’t) be done in an OLTP system.

For more information on how you can up your data warehousing game with a modern, built-for-the-cloud approach, check out some of our free resources such as our ebook The Data Warehouse: The Engine That Drives Analytics. We would love to help you on your journey to the cloud so keep an eye on this blog or follow us on Twitter (@snowflakedb and @kentgraziano) to keep up with all the news and happenings here at Snowflake.

New Snowflake features released in Q1’17

We recently celebrated an important milestone in reaching 500+ customers since Snowflake became generally available in June 2015. As companies of all sizes increasingly adopt Snowflake, we wanted to look back and provide an overview of the major new Snowflake features we released during Q1 of this year, and highlight the value these features provide for our customers.

Expanding global reach and simplifying on-boarding experience

Giving our customers freedom of choice, along with a simple, secure, and guided “Getting Started” experience, was a major focus of the last quarter.

  • We added a new region outside of the US; customers now have the option to analyze and store their data in Snowflake accounts deployed in EU-Frankfurt. Choosing the appropriate region is integrated into our self-service portal when new customers sign up.
  • In addition, we added our high-value product editions, Enterprise and Enterprise for Sensitive Data (ESD), to our self-service offerings across all available regions. For example, with Enterprise, customers can quickly implement auto-scale mode for multi-cluster warehouses to support varying, high concurrency workloads. And customers requiring HIPAA compliance can choose ESD.
  • Exploring other venues for enabling enterprises to get started quickly with Snowflake, we partnered with the AWS Marketplace team to include our on-demand Snowflake offerings, including the EU-Frankfurt option, in their newly-launched SaaS subscriptions.

Improving out-of-the-box performance & SQL coverage

We are committed to building the fastest cloud DW for your concurrent workloads with the SQL you love.

  • One key performance improvement introduced this quarter was the reduction of compilation times for JSON data. Internal TPC-DS tests demonstrate a reduction of 30-60% for most of the TPC-DS queries (single stream on a single, 100TB JSON table). In parallel, we worked on improving query compile time in general, providing up to a 50% improvement in performance for short queries.
  • Another new key capability is the support for bulk data inserts on a table concurrently with other DML operations (e.g. DELETE, UPDATE, MERGE). By introducing more fine-grained locking at the micro-partition level, we are able to allow concurrent DML statements on the same table.
  • To improve our data clustering feature (currently in preview), we added support for specifying expressions on table columns in clustering keys. This enables more fine-grained control over the data in the columns used for clustering.
  • Also, we reduced the startup time for virtual warehouses (up to XL in size) to a few seconds, ensuring almost instantaneous provisioning for most virtual warehouses.
  • We extended our SQL support by adding the ANSI SQL TABLESAMPLE clause. This is useful when you want to limit a query to a random subset of a table’s rows, as shown in the sketch below.
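For instance, here is a minimal sketch of TABLESAMPLE (the table name is hypothetical; Snowflake also accepts the shorter SAMPLE keyword):

-- Return an approximately 10% random (Bernoulli) sample of the rows
select *
from lineitem tablesample bernoulli (10);

-- Sampling composes with ordinary SQL, e.g. a quick profile over a 1% sample
select l_returnflag, count(*) as cnt
from lineitem tablesample bernoulli (1)
group by 1;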

Staying Ahead with Enterprise-ready Security

From day one, security has always been core to Snowflake’s design.

  • We expanded Snowflake’s federated authentication and single sign-on capability by integrating with many of the most popular SAML 2.0-compliant identity providers. In addition to Okta, Snowflake now supports ADFS/AD, Azure AD, Centrify, and OneLogin, to name just a few.
  • To advance Snowflake’s built-in auditing, we introduced new Information Schema table functions (LOGIN_HISTORY and LOGIN_HISTORY_BY_USER) that users can query to retrieve the short-term history of all successful and failed login requests from the previous 7 days, as sketched below. If required, users can maintain a long-term history by copying the output from these functions into regular SQL tables.
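As a sketch of how that looks (the exact column list may differ, so check the documentation), here is a query pulling the last hour of login activity:

-- Retrieve login attempts from the last hour for the current account
select event_timestamp, user_name, client_ip, is_success, error_message
from table(information_schema.login_history(
         time_range_start => dateadd('hours', -1, current_timestamp()),
         time_range_end   => current_timestamp()))
order by event_timestamp;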

Improving our ecosystem

Enabling developers and builders to create applications with their favorite tools and languages remains a high priority for us.

  • With respect to enterprise-class ETL, we successfully collaborated with Talend in building a native Snowflake connector based on Talend’s new and modern connector SDK. The connector, currently in preview, has already been deployed by a number of joint customers with great initial feedback on performance and ease-of-use.
  • To tighten the integration of our Snowflake service with platforms suited for machine learning and advanced data transformations, we released a new version of our Snowflake Connector for Spark, drastically improving performance by pushing more query operations, including JOINs and various aggregation functions, down to Snowflake. Our internal 10 TB TPC-DS performance benchmark tests demonstrate that running TPC-DS queries using this new v2 Spark connector is up to 70% faster compared to executing SQL in Spark with Parquet or CSV (see this Blog post for details).
  • We continue to improve our drivers for our developer community. Listening to feedback from our large Python developer community, we worked on a new version of Snowflake’s native Python client driver, resulting in up to 40% performance improvements when fetching result sets from Snowflake. And, after we open-sourced our JDBC driver last quarter, we have now made the entire source code available on our official GitHub repository.
  • And, last but not least, to enhance our parallel data loading via the COPY command, ETL developers can now dynamically add file metadata information, such as the actual file name and row number, which might not be part of the initial payload (see the sketch below).
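Here is a hedged sketch of that pattern (stage, table, and column names are hypothetical), selecting the file-level metadata columns alongside the data inside the COPY:

-- Load the data plus the source file name and row number for each record
copy into my_table (src_file, src_row, col1, col2)
from (select metadata$filename,
             metadata$file_row_number,
             t.$1,
             t.$2
      from @my_stage t)
file_format = (type = 'CSV');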

Increasing transparency and usability

These features are designed to strike the right balance between offering a service that is easy to operate and exposing actionable insights into the running service.

  • One major addition to our service is Query Profile, now generally available and fully integrated into Snowflake’s web interface. Query Profile is a graphical tool you can use to detect performance bottlenecks and areas for improving query performance.
  • Various UI enhancements were implemented: Snowflake’s History page now supports additional filtering by the actual SQL text and query identifier. We also added UI support for creating a Parquet file format in preparation for loading Parquet data into variant-type table columns in Snowflake.
  • A new Information Schema table function (TABLE_STORAGE_METRICS) exposes information about the data storage for individual tables. In particular, a user can now better understand how tables are impacted by Continuous Data Protection, particularly Time Travel and Fail-safe retention periods, as well as which tables contain cloned data.
  • We also recently introduced smarter virtual warehouse billing through Warehouse Billing Continuation (see this Blog post for details). If a warehouse is suspended and resumed within 60 minutes of the last charge, we do not charge again for the servers in the warehouse. WBC eliminates additional credit charges, and we hope it will reduce the need for our customers to strictly monitor and control when warehouses are suspended and resized.

Scaling and investing in service robustness

These service enhancements aren’t customer-visible, but they are crucial for scaling to meet the demands of our rapidly growing customer base.

  • As part of rolling out the new EU (Frankfurt) region, we increased automation of our internal deployment procedures to (a) further improve engineering efficiency while (b) laying the foundation for rapidly adding new regions based on customer feedback.
  • We further streamlined and strengthened our various internal testing and pre-release activities, allowing us to ship new features to our customers on a weekly basis – all in a fully transparent fashion with no downtime or impact to users.

Conclusion and Acknowledgements

This summary list of features delivered in Q1 highlights the high velocity and broad range of features the Snowflake Engineering Team has successfully delivered in a short period of time. We are committed to putting our customers first and maintaining this steady pace of shipping enterprise-ready features each quarter. Stay tuned for another feature-rich Q2.

For more information, please feel free to reach out to us at info@snowflake.net. We would love to help you on your journey to the cloud. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

RI (Referential Integrity) Constraints: 3 Reasons to Include Them in Your Data Warehouse

Over the years, I have had numerous conversations about the value of having referential integrity (RI) constraints, such as primary and foreign keys, in a relational data warehouse or data mart.

Many DBAs object that RI constraints slow the load process. This is a valid point if you are talking about enforced constraints that are checked in real time during the load. But this is not an issue if you define the constraints as disabled.

Which then leads to this common question:

Is there any reason to maintain a permanently disabled FK in the data model?  If it is not going to be enabled, then from my perspective, it doesn’t make any sense to define the FK.  Instead, the relationship can be described in the comment of the child column.

So, why would I want RI constraints in my data warehouse?

Mostly it has to do with good design and development best practices. Here is my rationale for why you should consider including RI constraints in your data warehouse design.

#1 – Design Metadata

RI constraints are valuable metadata/documentation. If somebody reverse engineers the database (say with ERWin or Oracle Data Modeler), the PKs and FKs show up in the diagram (much better than having to read a column comment to discover a relationship). This is quite valuable for new people on your project to orient themselves to the existing schema design and understand how the various tables in your data warehouse are related to each other.

RI in a diagram
A picture is worth a thousand words

#2 – BI Metadata

If you want to use any sort of reporting or BI tool against the database (it is a data warehouse, after all), you will be glad to know that most modern business intelligence and visualization tools can import the foreign key definitions along with the tables and build the proper join conditions for you. This is much better than having someone guess what the join should be and then manually add it to the metadata layer in the reporting tool. It also ensures that different developers don’t interpret the joins differently.

Examples of tools that can read the Snowflake data dictionary include Looker, Tableau, COGNOS, MicroStrategy, and many others. Some of these tools actually use the FK definitions for join culling to provide better query performance.

#3 – QA your ETL/ELT code

I know you think your ETL code is perfect.

But does every developer test to the same standards? Do you maybe have a QA team who separately validates that the ETL is doing what you expect?

If so, having declared primary, unique, and foreign key constraints in your data warehouse gives the team more information they can use to ensure the quality of the data. In fact, using the Snowflake Information Schema, a QA engineer can potentially generate SQL to test that the loaded data conforms to the defined constraints.
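For example, here is a hedged sketch of where such checks could start, using hypothetical database, schema, and table names:

-- List the declared (though unenforced) constraints for a schema
select table_name, constraint_name, constraint_type
from my_db.information_schema.table_constraints
where table_schema = 'MY_SCHEMA'
order by table_name, constraint_type;

-- A generated QA check for a declared primary key: no duplicate key values allowed
select business_key, load_dts, count(*) as dup_count
from my_schema.my_satellite
group by business_key, load_dts
having count(*) > 1;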

Defining RI Constraints in Snowflake

You, of course, can (and IMHO should) define RI constraints in Snowflake. You can define primary keys, unique keys, foreign keys, and NOT NULL constraints. Because Snowflake is specifically engineered for data warehousing, only the NOT NULL constraints are enforced. The rest are always created as disabled.

The syntax is standard SQL. You can define the constraints both inline and out-of-line.

Here is a simple example of inline constraints:

Create or replace TABLE SAT_REGIONS (
HUB_REGION_KEY NUMBER(38,0) NOT NULL,
SAT_LOAD_DTS DATE NOT NULL,
REGION_COMMENT VARCHAR(152),
HASH_DIFF VARCHAR(32) NOT NULL,
SAT_REC_SRC VARCHAR(50) NOT NULL,
constraint SAT_REGIONS_PK primary key (HUB_REGION_KEY, SAT_LOAD_DTS),
constraint SAT_REGIONS_FK1 foreign key (HUB_REGION_KEY)
references KENT_DB.DATAVAULT_MAIN.HUB_REGION(HUB_REGION_KEY)
);

Example of an out-of-line constraint:

ALTER TABLE SAT_REGIONS ADD constraint 
SAT_REGIONS_PK primary key (HUB_REGION_KEY, SAT_LOAD_DTS);

It’s that easy. So why don’t you have constraints in your data warehouse?

 

Did you hear? Snowflake was declared the #1 Cloud Data Warehouse in a recent GigaOM analyst report. Go here to get your copy of the report.

As always, keep an eye on this blog site, our Snowflake Twitter feeds (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for updates on all the action and activities here at Snowflake Computing.

SnowSQL Part I: Introducing SnowSQL – a modern command line tool built for the cloud

What & Why

This month, we introduced a new, modern command line tool for an interactive query experience with the Snowflake Elastic Data Warehouse cloud service. Yes indeed – a new command line tool!

Now you might ask – why did Snowflake decide to invest in building a new SQL command line tool in the 21st century when there are so many different ways of accessing and developing against the Snowflake service? For example, you could use our current user interface, or various drivers, or other SQL editor tools available today.

The answer is simple:

  • First of all, developers still care very deeply about light-weight mechanisms to quickly ask SQL questions via a text-based terminal with simple text input and output.
  • Secondly, we recognized the need for a more modern command line tool designed for the cloud that is easy to use, built on high security standards, and is tightly integrated with the actual core Snowflake elastic cloud data service.
  • Finally, we wanted to build a command line tool which offers users more powerful scripting capabilities overall.   

The result of these efforts is SnowSQL – our new SQL command line tool that is entirely built in Python and leverages Snowflake’s Python connector underneath.

Looking at existing, more generic command line tools such as SQLLine or HenPlus, it became obvious that these tools lack ease of use and do not offer sufficient scripting features. Also, Snowflake’s approach was to treat the command line tool as part of the overall cloud service. That is, SnowSQL can be seen as an extension of the Snowflake Elastic DW service. This cloud-first philosophy has important implications for the entire life cycle of Snowflake’s command line tool, including the agility of delivering new SnowSQL features as part of the frequent Snowflake service updates (see the auto-upgrade section below).

Getting Started

To get started, you can download SnowSQL from the Snowflake UI after logging into your Snowflake account. SnowSQL is currently supported on all three major platforms including:

  • Linux 64-bit,
  • Mac OS X 10.6+, and
  • Windows 64-bit

We provide a native installer for each platform with easy-to-follow installation steps. Once you have installed SnowSQL, open a new terminal and type:

$ SNOWSQL_PWD=<password> snowsql -a <account_name> -u <user_name>

Upon first use, SnowSQL users will notice a progress bar indicating the download of the initial version of SnowSQL. This is a one-time operation. Any future downloads happen in the background and remain entirely transparent to the user. The ability to automatically and transparently upgrade the command line tool is one of the key benefits of building it as a cloud extension rather than as a stand-alone software component interacting with a remote service (see the auto-upgrade section below).

Using SnowSQL

There are a few capabilities we would like to highlight in this blog.  

SnowSQL Commands

First, SnowSQL offers a wide range of commands. As a general rule, all SnowSQL commands start with a bang character ‘!’. For a complete list of currently supported commands, please see our documentation here.
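As a quick illustration (a minimal sketch; the commands shown are the ones I’d expect, so treat the official documentation as authoritative), an interactive session mixes ordinary SQL with bang commands:

-- Show the available SnowSQL commands
!help

-- Ordinary SQL runs as you would expect
select current_warehouse(), current_database();

-- Adjust a session option, then exit
!set output_format=psql
!quit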

Auto-Complete and Syntax Highlighting

Secondly, with our context-sensitive auto-complete feature, SnowSQL users are relieved of the cumbersome and error-prone typing of long object names. Instead, they can complete SQL keywords and functions after typing the first three letters. By leveraging auto-complete, SnowSQL users can become more productive and quickly explore data in Snowflake. Furthermore, SQL statements are highlighted in different colors, resulting in better readability when interacting with a terminal.

 

Auto-Upgrade

Thinking as a service, the auto-upgrade framework enables users to always stay up to date with both Snowflake’s and SnowSQL’s latest features, streamlining the end-user experience. No additional downloads or tedious re-installations are needed. The upgrade is transparent to the end user and takes place as a background process when you start SnowSQL. The next time a user runs SnowSQL, the new version is picked up automatically while their workflows remain unaffected and uninterrupted during the upgrade.

Secure Connection and Encryption

Finally, security is core to SnowSQL’s design. SnowSQL secures connections to Snowflake using TLS (Transport Layer Security) with OCSP (Online Certificate Status Protocol) checks. The auto-upgrade binaries are always validated using an RSA signature. In addition to the secured connection, SnowSQL provides end-to-end security of data moving in and out of Snowflake by using AES (Advanced Encryption Standard) encryption for Snowflake’s PUT and GET commands.

In the next few weeks, we will provide a deeper dive into some of the unique capabilities of SnowSQL. Please stay tuned as we gather and incorporate feedback from our customers. We would like to acknowledge our main software engineers Shige Takeda (@smtakeda) and Baptiste Vauthey (@thabaptiser) for their main contributions.

Finally, all of this would not have been possible without the active Python community who inspired us in many ways and offered us tools and packages to build SnowSQL. Thank you.   

As always, keep an eye on this blog site, and our Snowflake Twitter feed (@SnowflakeDB) for updates on all the action and activities here at Snowflake Computing.

Elasticity & Separation of Compute and Storage

Hopefully you had a chance to read our previous top 10 posts. As promised, we continue the series with a deeper dive into another of the Top 10 Cool Features from Snowflake:

#4 Elasticity & Separation of Storage and Compute

The innovative, patent-pending Multi-Cluster, Shared Data Architecture in Snowflake takes full advantage of the elasticity of the cloud. Because the data warehouse was built from the ground up, Snowflake provides an environment that truly allows customers to scale their data warehouse as they want. This unique environment gives users the ability to add more compute power on demand, even in the middle of a process.

How it works

  1. Data is stored once, on elastic disk storage in the cloud (for example, it is currently stored in AWS S3)
  2. Compute clusters (for example using AWS EC2 nodes) are allocated dynamically as required to do work (queries or loads)
  3. The compute clusters are connected to the data storage. The access to the data is based on the approved security credentials of the user that is running the query

The important point is that storage and compute (the virtual warehouses) are not tightly coupled together. This means you can grow and shrink each independently of the other.

More Power at the Push of a Button

Say you are running a series of complex SQL queries on a Small (2 node) warehouse in Snowflake. Normally you have until the close of business to run the reports and prepare the data. But today, at 8 AM, your management requests the results of that analysis for a board meeting at noon. So now you have half the time to get the reports to run. Yikes!

In a traditional data warehouse you would pretty much have only one option: kick everyone off the system and hope things run faster.

If, however, you are running the reports on the Snowflake Elastic Data Warehouse, all you have to do is increase the size of the warehouse to a Medium, which doubles the horsepower of the data warehouse (i.e., it doubles the number of nodes). And you can perform this change with a few clicks in our easy-to-use web-based interface, even if the process has already started.
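If you prefer SQL to the web interface, the same resize is a single statement (the warehouse name here is hypothetical):

-- Double the horsepower mid-process; statements already running finish on the
-- old size, and the next statement picks up the new size
alter warehouse report_wh set warehouse_size = 'MEDIUM';

-- When the crunch is over, scale back down or suspend the warehouse entirely
alter warehouse report_wh set warehouse_size = 'SMALL';
alter warehouse report_wh suspend;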

Warehouse Resize 1

Warehouse Resize 2

Once you bump up the size of the warehouse, the next statement that starts will immediately use the additional resources. Notice that the first statement running with the Medium warehouse runs in about half the time of the prior statement, with about double the data!

Warehouse Resize 3

You don’t need to interrupt the process, start over, or restart the service. It just takes effect automatically, at the push of a button. And once the process is done, you can reduce the warehouse back to a Small or even just turn it off.

And in this way, the unique cloud-based architecture of Snowflake allows you to truly take advantage of the elasticity of the cloud for your data warehouse.

Alternatives

In many other architectures, both cloud and on-premises appliances, you have a certain amount of disk space associated with the compute nodes. If you need more storage, you must add more nodes. If you need more compute power, you likewise add more nodes. Other architectures may not allow you to scale down at all. And in most of these cases, when adding more nodes, you must also shut down the environment, or make it read-only for a time, while you redistribute the data manually across the new nodes.

Thus, in legacy data warehouses, elasticity is limited. This forces you to plan for, acquire, and manage the resources for your expected future peak capacity and peak data demand. Otherwise you risk having your queries or your hardware fail (because you run out of disk capacity or CPU power, or both). In addition, trying to scale beyond existing capacity may lead to disruptions to the service (as you rebuild or extend the hardware), or to some workloads being rejected in favor of others. And if you over-provision and buy too much capacity, there may be no way to scale back after the fact.

The Snowflake Advantage

For Snowflake, this is not an issue. You can grow and shrink the environment dynamically. The data storage grows and shrinks as you add or remove data, while the compute nodes can be ramped up or down, or turned off, as you require. You are not forced to pay for capacity up front, or kick other workloads off, or plan downtimes when ramping up your data warehouse capacity. That is the promise of cloud. This is why Snowflake is the real elastic Data Warehouse as a Service.

Very cool indeed.

As always, keep an eye on this blog site, our Snowflake Twitter feed (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for more Top 10 Cool Things About Snowflake and for updates on all the action and activities here at Snowflake Computing.

Kent Graziano and Saqib Mustafa

ANSI SQL with Analytic Functions

Hopefully you had a chance to read our previous top 10 posts. As promised, we continue the series with a deeper dive into another of the Top 10 Cool Features from Snowflake:

#5 ANSI compliant SQL with Analytic Functions

At Snowflake, we believe that it should be easy to access, query, and derive insights from your data. To support that, we provide our users with the ability to query all their data using ANSI-compliant SQL. (Hard to call yourself a relational database otherwise, right?)

However, Snowflake goes beyond basic SQL, delivering sophisticated analytic and windowing functions as part of our data warehouse service. Functions like:

  • CUME_DIST
  • DENSE_RANK
  • FIRST_VALUE
  • LAG
  • LAST_VALUE
  • LEAD
  • NTILE
  • PERCENT_RANK
  • RANK
  • ROW_NUMBER

select Nation, Customer, Total
from (select n.n_name Nation,
             c.c_name Customer,
             sum(o.o_totalprice) Total,
             rank() over (partition by n.n_name
                          order by sum(o.o_totalprice) desc) customer_rank
      from orders o,
           customer c,
           nation n
      where o.o_custkey = c.c_custkey
        and c.c_nationkey = n.n_nationkey
      group by 1, 2)
where customer_rank <= 3
order by 1, customer_rank;

As you can see in the example, we support not only analytic windowing functions, but all the other features you would expect in SQL. This includes, but is not limited to, general aggregation functions (e.g., sum), nested virtual tables, subqueries, order by, and group by.

In addition to general aggregation functions, we also have:

  • Bitwise aggregation functions
  • Linear regression functions, and
  • Cardinality estimation functions (i.e., HyperLogLog) – a quick sketch follows this list
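For example, here is a minimal sketch of cardinality estimation against a hypothetical table:

-- Exact distinct count vs. the HyperLogLog-based approximation
select count(distinct c_custkey)        as exact_customers,
       approx_count_distinct(c_custkey) as approx_customers
from orders_history;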

So, if your existing queries are written with standard SQL, they will run in Snowflake. And, as we noted in the previous blog on JSON, you can apply all these functions to your semi-structured data natively using Snowflake.

Another reason to love the Snowflake Elastic Data Warehouse.

As always, keep an eye on this blog site (or better – sign up for the RSS feed), and our Snowflake Twitter feeds (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for more Top 10 Cool Things About Snowflake and for updates on all the action and activities here at Snowflake Computing.

JSON Support with Snowflake

Hopefully you had a chance to read our previous top 10 posts. As promised, we continue the series with a deeper dive into another of the Top 10 Cool Features from Snowflake. 

#6 JSON support with Snowflake

One of the things that got people at #Strata Hadoop excited this week was our support for JSON and other semi-structured data types. For traditional data warehouse users, the world of big data can be challenging. We are used to using SQL to query data, used to having a well-defined data model, and used to knowing what the source and target schemas look like. We needed a simpler way to handle the flexible schemas that come with semi-structured data like JSON documents. As much as they try, legacy data warehouse systems do not provide extensive support for JSON data, and big data systems require learning extensive new programming skills.

When our founders started from scratch to build a data warehouse for the cloud, they wanted a solution that could combine all your data in one place without resorting to multiple platforms or programming paradigms. As a result, combining structured and semi-structured data in one place and making it available through standard ANSI SQL is a core strength of the Snowflake service and is used extensively by our customers.

Snowflake was built with features to simplify access to JSON data and provide the ability to combine it with structured data! Using Snowflake, you can learn to query JSON data using SQL, and join it to traditional tabular data in relational tables easily. Our innovative approach allows the user to store the JSON documents in a relational table using a new data type (VARIANT) that is optimized automatically in the background for MPP and columnar access.

This is a great way to close the gap between the big data world and the relational world and to simplify access for users. Most legacy databases (with their legacy code bases) cannot do this efficiently. Some legacy data warehouse providers cannot support JSON data at all, so you may have to acquire and manage a separate big data system. Others may require some sort of pre-processing of the data, such as conversion into simplified CSV-style data. That may make the data easier to ingest, but it requires time and resources, and it limits the ability to easily accommodate the schema changes that come with flexible, semi-structured data. In addition, the JSON may end up stored in a plain text field rather than an optimized data type, which has its cost in query speed and data storage.

Snowflake makes semi-structured data available inside the data warehouse service seamlessly. The data can be ingested directly into a table in Snowflake and can then be queried easily. And any changes to the schema of the inbound JSON are accommodated automatically without impact to existing queries.

Example Code

In this scenario, we are going to use Snowflake’s SQL extensions for querying semi-structured data, and our innovative data type (VARIANT), to join JSON data to other, purely relational tables. We are going to combine Twitter data (JSON) with product data in relational tables.

The main table that stores the Twitter JSON data, twitter.data.tweets, has two columns: tweet and created_at. The column tweet is defined as a VARIANT type and holds the JSON from a Twitter feed, while created_at is a relational column with a data type of TIMESTAMP_NTZ (NTZ = no time zone).

Tweet table

Here is an example showing a fairly simple SQL query with the JSON extensions. In this query we are joining some Twitter data to product data in relational tables in order to get a count of Tweets that contain hashtags related to a particular product:

select extract('day', created_at) janday, count(*) cnt
  from
    twitter.data.tweets t,

     -- unnest each tweet on the hashtags in its entities element
     lateral flatten (input => t.tweet, path => 'entities.hashtags') tags,
     (select distinct ph_hashtag
        from
          sales.public.producthashtags,
          sales.public.product
        where p_name = 'Blue Sky'
        and   p_productkey = ph_productkey) p

     where tags.value:text::string = p.ph_hashtag
     and   created_at >= '2014-01-01 00:00:00'
     and   created_at <  '2014-02-01 00:00:00'

    group by 1
    order by 1

The following section of the code pivots the elements in the JSON string into a set of rows so we can do traditional joins:

     -- unnest each tweet on the hashtags in its entities element
     lateral flatten (input => t.tweet, path => 'entities.hashtags') tags,

Specifically, it is pulling out a nested array of hashtags within the entities element. Then the predicate is where we join those hashtag values in the tweet string to the hashtag column within the Product table (aliased as “p”):

     where tags.value:text::string = p.ph_hashtag

In this case, “tags” is the alias for the virtual table created by the FLATTEN function, and the keyword “value” indicates we want the content of that row. The rest of the specification indicates it is text data that we want cast as a STRING so it matches the data type of the column p.ph_hashtag.

Then the last part of the predicate is a normal filter for a date range using the date column in the TWEETS table:

     and   created_at >= '2014-01-01 00:00:00'
     and   created_at <  '2014-02-01 00:00:00'

So there you have it, using SQL to combine semi-structured data with traditional structured data in a relational data warehouse in the cloud. No big data system required.

Not bad!

But that is just a flavor of how you can utilize Snowflake to easily get value from your JSON data. There’s way more than we can cover in a simple blog post (like building relational views on the JSON for example).
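As a small taste of that, here is a hedged sketch of a relational view over the tweets table from the example above (the user.screen_name path is an assumption about the tweet payload):

-- Expose the nested hashtags as a simple relational view for BI tools
create or replace view twitter.data.tweet_hashtags as
select t.created_at,
       t.tweet:user.screen_name::string as screen_name,
       tag.value:text::string           as hashtag
from twitter.data.tweets t,
     lateral flatten(input => t.tweet, path => 'entities.hashtags') tag;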

Want to find out more? Ask us for a demo or check out the presentation by Grega Kaspret (@gregakespret) from Celtra Mobile at Strata Hadoop World (San Jose) this week, talking about simplifying a JSON data pipeline using Snowflake. And follow our Twitter feeds: (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for more Top 10 Cool Things About Snowflake and updates on all the action at Snowflake Computing.

Kent Graziano and Saqib Mustafa

Data Vault Modeling and Snowflake

Since I joined Snowflake, I have been asked multiple times which data warehouse modeling approach Snowflake supports best. Well, the cool thing is that we support multiple data modeling approaches equally.

It turns out we have a few customers with existing data warehouses built using an approach known as Data Vault modeling, and they have decided to move to Snowflake (yeah!).

So the conversation often goes like this:

Customer: “Can you do Data Vault on Snowflake?”

Me: “Yes you can! Why do you ask?”

Customer: “Well, your name is “snowflake” so we thought that might mean you only support snowflake-type schemas.”

Me: “Well, yes I can see your confusion in that case, but the name has nothing to do with data warehouse design really. In fact, we support any type of relational design, including Data Vault.”

What is a Data Vault?

For those of you who have not yet heard of the Data Vault System of Business Intelligence, or simply Data Vault modeling, it provides (among other things) a method and approach to modeling your enterprise data warehouse (EDW) that is agile, flexible, and scalable.

The formal definition as written by the inventor Dan Linstedt:

The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business.

It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent, and adaptable to the needs of the enterprise. It is a data model that is architected specifically to meet the needs of today’s enterprise data warehouses.

The main point here is that Data Vault (DV) was developed specifically to address the agility, flexibility, and scalability issues found in the other mainstream data modeling approaches used in the data warehousing space. It was built to be a granular, non-volatile, auditable, historical repository of enterprise data.

At its core is a repeatable modeling technique that consists of just three main types of tables:

  • Hubs = Unique list of Business Keys
  • Links = Unique List of Associations / Transactions
  • Satellites = Descriptive Data for Hubs and Links (Type 2 with history)

Hubs make it business driven and allow for semantic integration across systems.

Links give you the flexibility to absorb structural and business rule changes without re-engineering (and therefore without reloading any data).

Satellites give you the adaptability to record history at any interval you want plus unquestionable auditability and traceability to your source systems.

Here is a simple example of what a Data Vault 2.0 model looks like:
Data Vault 2.0 Example

Snowflake Features to use in a Data Vault

As we have said in the past, Snowflake is an ANSI SQL RDBMS with pay-as-you-go pricing. We support tables and views like all the relational solutions on the market today. Since, from a data modeling perspective, Data Vault is a specific way and pattern of designing tables for your data warehouse, there are no issues implementing one in Snowflake.

In fact, with our combination of MPP compute clusters, optimized columnar storage format, and our patent-pending Adaptive Data Warehouse technology, I think you will get better results with your Data Vault loads and queries with less effort than you get today on your legacy data warehouse solutions. Remember that with Snowflake you don’t need to pre-plan partitioning or distribution keys, or build indexes to get great performance. That is all handled as part of our Dynamic Query Optimization feature that uses our secure cloud-based metadata store and sophisticated feedback loop to monitor and tune your queries based on data access patterns and resource availability among other things.

Because our customers are getting such great query performance (some up to 100x improvement), I think Snowflake will be a great place to try virtualizing your information mart (i.e., reporting) layer that you expose to your BI tools.

Data Vault 2.0

For those of you interested in implementing the Data Vault 2.0 specification, Snowflake can handle that as well. We have a built-in MD5 hash function, so you can implement MD5-based keys and do your change data capture using the DV 2.0 HASH_DIFF concept.
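Here is a hedged sketch of computing a DV 2.0-style hash key and HASH_DIFF with Snowflake’s MD5 function (the staging table and columns are the hypothetical ones used in the load example below):

-- Hash key over the business key; HASH_DIFF over the descriptive attributes
select md5(upper(trim(country_abbrv))) as hash_key,
       md5(upper(trim(country_abbrv)) || '|' ||
           coalesce(upper(trim(country_name)), '')) as hash_diff,
       country_abbrv,
       country_name
from stage.country;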

Not only does Snowflake support DV 2.0 use of hash functions, you can also take advantage of Snowflake’s Multi-table Insert (MTI) when loading your main Data Vault tables. With this feature you can load multiple tables in parallel from a single source table.

For example, you can take the data from a stage table and load it into a Hub and a Satellite with one statement. Assuming your source table is in a schema called STAGE and your vault tables are in a schema called DV, a first pass at loading a Hub and its associated Satellite might look like this:

INSERT ALL
INTO DV.HUB_COUNTRY (HUB_COUNTRY_KEY, COUNTRY_ABBRV, HUB_LOAD_DTS, HUB_REC_SRC)
              VALUES (HASH_KEY, COUNTRY_ABBRV, LOAD_DTS, REC_SRC)
INTO DV.SAT_COUNTRIES (HUB_COUNTRY_KEY, SAT_LOAD_DTS, HASH_DIFF, SAT_REC_SRC, COUNTRY_NAME)
               VALUES (HASH_KEY, LOAD_DTS, HASH_DIFF, REC_SRC, COUNTRY_NAME)
SELECT HASH_KEY,
       COUNTRY_ABBRV,
       COUNTRY_NAME,
       HASH_DIFF,
       CURRENT_TIMESTAMP AS LOAD_DTS,
       STAGE_REC_SRC AS REC_SRC
FROM STAGE.COUNTRY;

Likewise you can use an MTI to load Links and their associated Satellites.

Data Vault Resources

If you are interested in learning more about Data Vault, there is a website dedicated to Data Vault, and a few books on the subject that you might want to peruse:

  1. Introduction to Agile Data Engineering by Kent Graziano (me)
  2. Intro to Data Vault – free white paper on my personal blog
  3. Free Data Vault Intro Videos – from the inventor Dan Linstedt
  4. Building a Scalable Data Warehouse with Data Vault 2.0 by Dan Linstedt
  5. Dan Linstedt’s Blog

So hopefully this answers the basic questions about doing Data Vault on Snowflake (yes you can). If you want to start using Snowflake, feel free to take advantage of our On-Demand offer with $400 in free usage. In the future, as our customers roll out their solutions, I hope we will be able to give you some real-world case studies on how they implemented their Data Vaults on the Snowflake Elastic Data Warehouse.

Until then, keep an eye on this blog site, our Snowflake Twitter feeds (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for updates on all the action and activities here at Snowflake Computing.