Apr 1, 2016
Hopefully you had a chance to read our previous top 10 posts. As promised, we continue the series with a deeper dive into another of the Top 10 Cool Features from Snowflake.
#6 JSON support with Snowflake
One of the things that got people at #Strata Hadoop excited this week was our support for JSON and other semi-structured data types. For traditional data warehouse users, the world of the big data can be challenging. We are used to using SQL to query data, used to having a well defined data model, and knowing what the source and target schemas look like. We needed a simpler way to easily handle the flexible schemas that come with using semi-structured data like JSON documents. As much as they try, legacy data warehouse systems do not provide extensive support for JSON data, and big data systems require learning new extensive programming skills.
When our founders started out from scratch to build a data warehouse for the cloud, they wanted a solution that could combine all your data in one place without the need to resort to using multiple platforms or programming paradigms. As a result, combining structured and semi-structured in one place and making it available through standard ANSI SQL is a strong feature of the Snowflake service and extensively used by our customers.
Snowflake was built with features to simplify access to JSON data and provide the ability to combine it with structured data! Using Snowflake, you can learn to query JSON data using SQL, and join it to traditional tabular data in relational tables easily. Our innovative approach allows the user to store the JSON documents in a relational table using a new data type (VARIANT) that is optimized automatically in the background for MPP and columnar access.
This is a great way to eliminate the gap between the big data world and the relational world and simplify access for users. Most legacy databases (with their legacy code base) can not do this efficiently. Some legacy data warehouse providers cannot support JSON data at all, and you may have to acquire and manage a separate big data system. Others may require some sort of pre-processing of the data such as conversion into simplified CSV type data. This may make it easier to ingest the data, but requires time and resources. And this also limits the ability to accommodate potential changes that come with a flexible schema data type into the relational data model easily. In addition, the JSON data may be stored in a text field, rather than an optimized data type, which has its cost in speed of query execution and data storage.
Snowflake makes semi-structured data available inside the data warehouse service seamlessly. The data can be ingested directly into a table in Snowflake and can then be queried easily. And any changes to the schema of the inbound JSON are accommodated automatically without impact to existing queries.
In this scenario, we are going to use Snowflake SQL extensions for querying semi-structured data, and our innovative data type (VARIANT) to join data to other purely relational tables. We are going to combine Twitter data (JSON data) with product data in relational tables.
The main table which stores the Twitter JSON data, twitter.data.tweets, has two columns: tweet and created_at. The column Tweet is defined as a VARIANT type and holds the JSON from a Twitter feed, while created_at is a relational column with a data type of TIMESTAMP_NTZ (NTZ = no time zone).
Here is an example showing a fairly simple SQL query with the JSON extensions. In this query we are joining some Twitter data to product data in relational tables in order to get a count of Tweets that contain hashtags related to a particular product:
select extract('day',created_at) janday,count(*) cnt from twitter.data.tweets t, -- unnest a tweet on the hashtags of each entities lateral flatten (input=> t.tweet,'entities.hashtags')tags, (select distinct ph_hashtag from sales.public.producthashtags, sales.public.product where p_name ='Blue Sky' and p_productkey = ph_productkey) p where tags.value:text::string = p.ph_hashtag and created_at >= '2014-01-01 00:00:00' and created_at >= '2014-02-01 00:00:00' group by 1 order by 1
The following section of the code pivots the elements in the JSON string into a set of rows so we can do traditional joins:
-- unnest a tweet on the hashtags of each entities lateral flatten (input=> t.tweet,'entities.hashtags')tags,
Specifically it is pulling out a nested array of hastags within the entities element. Then the predicate is where we join those hashtag values in the Tweet string to the hashtag column within the Product table (aliased as “p”):
where tags.value:text::string = p.ph_hashtag
In this case “tags” equal the virtual table alias created by the FLATTEN function and the keyword “value” indicates we want the content of that row. The rest of the specification indicates it is text data that we want cast as a STRING so it matches the data type of the column p.ph_hastag.
Then the last part of the predicate is a normal filter for a date range using the date column in the the TWEETS table:
and created_at >= '2014-01-01 00:00:00' and created_at >= '2014-02-01 00:00:00'
So there you have it, using SQL to combine semi-structured data with traditional structured data in a relational data warehouse in the cloud. No big data system required.
But that is just a flavor of how you can utilize Snowflake to easily get value from your JSON data. There’s way more than we can cover in a simple blog post (like building relational views on the JSON for example).
Want to find out more? Ask us for a demo or Check out the presentation by Grega Kaspret (@gregakespret) from Celtra Mobile at Strata Hadoop World (San Jose) this week, talking about simplifying a JSON data pipeline using Snowflake. And follow our Twitter feeds: (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for more Top 10 Cool Things About Snowflake and updates on all the action at Snowflake Computing.
Kent Graziano and Saqib Mustafa