Part II: Making Schema-on-Read a Reality

In the first article of this series, I discussed the Snowflake data type VARIANT, showed a simple example of loading a JSON document into a table with a VARIANT column, and then showed how easy it is to query data directly from that data type. In this post I will show you how to access an array of data within the JSON document and how we handle nested arrays. Finally, I will give you an example of doing an aggregation using data in the JSON structure and show how simple it is to filter your query results by referring to values within an array.

Handling Arrays of Data

One of the features of JSON is the ability to embed an array of data within the document. In my example, one such array is children:

    "children": [
         { "name": "Jayden", "gender": "Male", "age": "10" },
         { "name": "Emma", "gender": "Female", "age": "8" },
         { "name": "Madelyn", "gender": "Female", "age": "6" }
     ]

You will notice there are effectively 3 rows in the array and each row has 3 sub-columns – name, gender, and age. Each of those rows constitutes the value of that array entry which includes all the sub-column labels and data (remember that for later). So how do you know how many rows there are if you do not have access to the raw data?

Like this:

select array_size(v:children) from json_demo;

The ARRAY_SIZE function figures it out for us (here it returns 3). To pull the data for each “row” of the array, we use the dot notation from before, but now with an added specification for the row number of the array (in the [] brackets):

select v:children[0].name from json_demo
union all
select v:children[1].name from json_demo
union all
select v:children[2].name from json_demo;

[Screenshot: the children's names]

This works, but I really do not want to write a UNION ALL for every element just to traverse the entire array (and to do that, I would need to know how many values are in the array, right?).

We solve that problem with another extended SQL function, FLATTEN. FLATTEN takes an array and returns a row for each element in the array. With it, you can select all the data in the array as though it were in table rows (so no need to figure out how many entries there are).

Instead of doing the set of UNION ALLs, we add the FLATTEN into the FROM clause and give it a table alias:

select f.value:name
from json_demo, table(flatten(v:children)) f;

This syntax allows us to create an inline virtual table in the FROM clause.

In the SELECT, we can then reference it like a table. Notice the notation f.value:name.

f = the alias for the virtual table from the children array

value = the contents of the element returned by the FLATTEN function

name = the label of the specific sub-column we want to extract from the value

The results, in this case, are the same as the SELECT with the UNIONs but the output column header reflects the different syntax (since I have not yet added any column aliases).

[Screenshot: FLATTEN result showing the children's names]

Now, if another element is added to the array (i.e., a 4th child), the SQL will not have to be changed. FLATTEN allows us to determine the structure and content of the array on-the-fly! This makes the SQL resilient to changes in the JSON document.

With this in hand, we can of course get all the array sub-columns and format them just like a relational table:

select
   f.value:name::string as child_name,
   f.value:gender::string as child_gender,
   f.value:age::string as child_age
from json_demo, table(flatten(v:children)) f;

[Screenshot: children formatted as relational columns]
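A side note of my own: value is not the only column FLATTEN returns. Among its output columns is index, which gives each element's position within the array; a quick sketch:

select
   f.index as child_position,
   f.value:name::string as child_name
from json_demo, table(flatten(v:children)) f;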

Putting this all together, I can write a query to get the parent’s name and all the children like this:

select
   v:fullName::string as parent_name,
   f.value:name::string as child_name,
   f.value:gender::string as child_gender,
   f.value:age::string as child_age
from json_demo, table(flatten(v:children)) f;

Which results in this output:

[Screenshot: parent name with each child]

If I just want a quick count of children by parent, I do not need FLATTEN at all; I can simply refer back to ARRAY_SIZE:

select
   v:fullName::string as Parent_Name,
   array_size(v:children) as Number_of_Children
from json_demo;

[Screenshot: count of children per parent]
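And since ARRAY_SIZE is an ordinary function, nothing stops us from using it in a predicate. A quick sketch of my own: find the parents with more than two children:

select v:fullName::string as Parent_Name
from json_demo
where array_size(v:children) > 2;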

Handling Multiple Arrays

You may recall there are multiple arrays in our sample JSON string. I can pull from several arrays at once with no problem:

select
   v:fullName::string as Parent_Name,
   array_size(v:citiesLived) as Cities_lived_in,
   array_size(v:children) as Number_of_Children
from json_demo;

[Screenshot: parent name, cities lived in, and number of children]
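One caution worth adding here (my note, not from the original post): if you FLATTEN two independent arrays in the same FROM clause, you get their Cartesian product, 3 children times 4 cities, or 12 rows, for our sample row. That is why counts across several arrays are best done with ARRAY_SIZE, as above, rather than with multiple FLATTENs:

select
   c.value:name::string as child_name,
   t.value:cityName::string as city_name
from json_demo,
     table(flatten(v:children)) c,
     table(flatten(v:citiesLived)) t;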

What about an Array within an Array?

Snowflake can handle that too. From our sample data we can see that yearsLived is an array nested inside each element of the citiesLived array:

"citiesLived": [
 { "cityName": "London",
   "yearsLived": [ "1989", "1993", "1998", "2002" ]
 },
 { "cityName": "San Francisco",
   "yearsLived": [ "1990", "1993", "1998", "2008" ]
 },
 { "cityName": "Portland",
   "yearsLived<": [ "1993", "1998", "2003", "2005" ]
 },
 { "cityName": "Austin",
   "yearsLived": [ "1973", "1998", "2001", "2005" ]
 }
 ]

To pull that data out, we add a second FLATTEN that transforms the yearsLived array within each element of the FLATTENed citiesLived array:

select
  tf.value:cityName::string as city_name,
  yl.value::string as year_lived
from json_demo,
     table(flatten(v:citiesLived)) tf,
     table(flatten(tf.value:yearsLived)) yl;

In this case the 2nd FLATTEN (alias “yl”) is un-nesting the yearsLived array for each value returned from the 1st FLATTEN of the citiesLived array (alias “tf”).

The results output shows Year Lived by City:

[Screenshot: year lived by city]

Like my earlier example, I can then augment this result by adding in the name too (so I know who lived where):

select
  v:fullName::string as parent_name,
  tf.value:cityName::string as city_name,
  yl.value::string as year_lived
from json_demo,
     table(flatten(v:citiesLived)) tf,
     table(flatten(tf.value:yearsLived)) yl;

[Screenshot: name, city, and year lived]

Aggregations?

Yup, we can even aggregate data within semi-structured data. (We would not be much of a data warehouse service if we couldn’t, right?)

So, just like ANSI SQL, we can do a count(*) and a group by:

select
   tf.value:cityName::string as city_name,
   count(*) as years_lived
from json_demo,
     table(flatten(v:citiesLived)) tf,
     table(flatten(tf.value:yearsLived)) yl
group by 1;

And the results:

[Screenshot: count of years per city]
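The same pattern works with any other aggregate. As a sketch of my own (the age values are stored as strings in the document, so we cast them before averaging):

select
   v:fullName::string as parent_name,
   avg(f.value:age::int) as avg_child_age
from json_demo, table(flatten(v:children)) f
group by 1;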

Filtering?

Of course! Just add a WHERE clause.

select
  tf.value:cityName::string as city_name,
  count(*) as years_lived
from json_demo,
     table(flatten(v:citiesLived)) tf,
     table(flatten(tf.value:yearsLived)) yl
where city_name = 'Portland'
group by 1;

[Screenshot: years lived in Portland]

To simplify things, notice I used the column alias city_name in the predicate (a convenience Snowflake allows); you can also use the full sub-column specification tf.value:cityName instead.
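That version would look like this (the same query, with the filter spelled out against the sub-column; I cast it to string to keep the comparison straightforward):

select
  tf.value:cityName::string as city_name,
  count(*) as years_lived
from json_demo,
     table(flatten(v:citiesLived)) tf,
     table(flatten(tf.value:yearsLived)) yl
where tf.value:cityName::string = 'Portland'
group by 1;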

Schema-on-Read is a Reality

I could go on, but by now I think you can see we have made it very easy to load and extract information from semi-structured data using Snowflake. We added a brand-new data type, VARIANT, that lives in a relational table structure in a relational database, without the need to analyze the structure ahead of time, design appropriate database tables, and then shred the data into that predefined schema. Then I showed you some easy-to-learn extensions to ANSI-standard SQL for accessing that data in a very flexible and resilient manner.

With these features, Snowflake gives you the real ability to quickly and easily load semi-structured data into a relational data warehouse and make it available for immediate analysis.

Part I: Making Schema-on-Read a Reality

(Note: This is the first in a two-part series discussing how we handle semi-structured data in Snowflake.)

Schema? I don’t need no stinking schema!

Over the last several years, I have heard the phrase schema-on-read used to explain the benefit of loading semi-structured data into a Big Data platform like Hadoop. The idea is that you can delay data modeling and schema design until long after the data is loaded (so as not to slow down getting your data while waiting for those darn data modelers).

Every time I heard it, I thought (and sometimes said) – “but that implies there is a knowable schema.”  So really you are just delaying the inevitable need to understand the structure in order to derive some business value from that data. Pay me now or pay me later.

Why delay the pain?

So even though folks are able to quickly load this type of data into Hadoop or NoSQL, there is still more work ahead to actually pull the data apart so it can be analyzed. The person writing the query often has the burden of figuring out the schema and writing code to extract it. Additionally there may be a query performance penalty in this process (over that of querying columns in a relational database).

Not so with the Snowflake Elastic Data Warehouse (#ElasticDW)! With Snowflake, you can load your semi-structured data directly into a relational table, then query the data with a SQL statement, join it to other structured data, all while not fretting about future changes to the “schema” of that data. Snowflake actually keeps track of the self-describing schema so you don’t have to. No ETL or fancy shredding required.

One of the key differentiators that really attracted me to Snowflake is our built-in support to load and query semi-structured data such as JSON, XML, and AVRO. In most conventional data warehouse and Big Data environments today, you have to first load this type of data to a Hadoop or NoSQL platform, then shred it (using, for example, MapReduce) just to load it into columns in a relational database (that is, if you want to run SQL queries or a BI/analytics tool against that data).

How did we do it?

Simple – we invented a new data type called VARIANT that allows us to load semi-structured data (i.e., flexible schema) as-is into a column in a relational table.

Read that again – we load the data directly into a relational table.

Okay, so that means no Hadoop or NoSQL needed in your data warehouse architecture just to hold semi-structured data. Just an RDBMS (in the cloud) that uses SQL that your staff already knows how to write.

But that is only half the equation. Once the data is in, how do you get it out?

Our brilliant founders and excellent engineering team (#DataSuperStars) have created extensions to SQL to reference the internal schema of the data (it is self-describing after all) so you can query the components and join it to columns in other tables as if it had been shredded into a standard relational table. Except there is no coding or shredding required to prep the data. Cool.
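As a minimal sketch of what such a join looks like (the employee table, its columns, and its single row are invented purely for illustration; json_demo and the v:fullName notation come from the walkthrough below):

-- hypothetical structured table, invented for this example
create or replace table employee (full_name string, department string);
insert into employee values ('Johnny Appleseed', 'Orchard Operations');

-- join the VARIANT data to the structured table like any other join
select
   e.department,
   j.v:age::int as age
from json_demo j,
     employee e
where e.full_name = j.v:fullName::string;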

That also means that as the data source evolves and changes over time (e.g., new attributes, nesting, or arrays are added), there is no re-coding of ETL (or even ELT) code required to adapt. The VARIANT data type does not care if the schema varies.

What does it really look like?

Enough of the theory – let’s walk through an example of how this all works.

1 – Create a table

I have a Snowflake account, database, and virtual warehouse set up already, so just like I would in any other db, I simply issue a create table DDL statement:

create or replace table json_demo (v variant);

Now I have a table with one column (“v”) with a declared data type of VARIANT.
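If you want to double-check the definition (an optional aside; standard Snowflake introspection), a quick describe shows the single VARIANT column:

desc table json_demo;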

2 – Load some data

Now I load a sample JSON document using an INSERT and our PARSE_JSON function. We are not simply loading it as text; rather, we store it as an object in the VARIANT data type, while at the same time converting it to an optimized columnar format (for when we query it later):

insert into json_demo
 select
 parse_json(
 '{
     "fullName": "Johnny Appleseed",
     "age": 42,
     "gender": "Male",
     "phoneNumber": {
                     "areaCode": "415",
                     "subscriberNumber": "5551234"
                    },
     "children": [
                  { "name": "Jayden", "gender": "Male", "age": "10" },
                  { "name": "Emma", "gender": "Female", "age": "8" },
                  { "name": "Madelyn", "gender": "Female", "age": "6" }
                 ],
     "citiesLived": [
                    { "cityName": "London",
                      "yearsLived": [ "1989", "1993", "1998", "2002" ]
                    },
                    { "cityName": "San Francisco",
                      "yearsLived": [ "1990", "1993", "1998", "2008" ]
                    },
                    { "cityName": "Portland",
                      "yearsLived": [ "1993", "1998", "2003", "2005" ]
                    },
                    { "cityName": "Austin",
                      "yearsLived": [ "1973", "1998", "2001", "2005" ]
                    }
                   ]
  }');

3 – Start pulling data out

So let’s start with just getting the name:

select v:fullName from json_demo;

[Screenshot: fullName query result]

Where:

v = the column name in the json_demo table

fullName = attribute in the JSON schema

v:fullName = notation to indicate which attribute in column “v” we want to select. 

So, similar to the table.column notation all SQL people are familiar with, in Snowflake we added the ability to effectively specify a column within the column (i.e., a sub-column), which is dynamically derived based on the schema definition embedded in the JSON string.
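A side note of my own: Snowflake also accepts bracket notation for the same traversal, which comes in handy when an attribute name contains characters that dot notation cannot express:

select v['fullName'] from json_demo;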

4 – Casting the Data

Usually we do not want to see the double quotes around the data in the report output (unless we were going to create an extract file of some sort), so we can format it as a string and give it a nicer column alias (like we would do with a normal column):

select v:fullName::string as full_name
from json_demo;

[Screenshot: formatted full_name]

Next let’s look at a bit more of the data using the same syntax from above:

select
   v:fullName::string as full_name,
   v:age::int as age,
   v:gender::string as gender
from json_demo;

[Screenshot: full_name, age, and gender]

Again, simple SQL and the output looks like the results from any table you might have built in your traditional data warehouse.

Safe to say at this point, with what I have already shown you, you could look at a table in Snowflake with a VARIANT column and quickly start “shredding” the JSON with SQL.
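One practical pattern worth mentioning (my suggestion; the view name is invented): wrap those casts in a view, so report writers and BI tools see ordinary relational columns without ever touching the JSON notation:

create or replace view json_demo_vw as
select
   v:fullName::string as full_name,
   v:age::int as age,
   v:gender::string as gender
from json_demo;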

How long did that take to learn?

This is why I love Snowflake! I can now query semi-structured data and I did not have to learn a new programming language or framework or whatever over in Big Data land – yet I have the same capabilities as if I did.

Much lower learning curve for sure.

Let’s get a little more complex

Yes, those examples are very simple, so let’s dive deeper. Notice in the original string there is some nesting of the data:

{
    "fullName": "Johnny Appleseed",
    "age": 42,
    "gender": "Male",
    "phoneNumber": {
                    "areaCode": "415",
                    "subscriberNumber": "5551234"
                   },
...

How do we pull that apart? With a very familiar table.column dot notation:

select
   v:phoneNumber.areaCode::string as area_code,
   v:phoneNumber.subscriberNumber::string as subscriber_number
from json_demo;

So just as fullName, age, and gender are sub-columns, so too is phoneNumber, and areaCode and subscriberNumber are in turn sub-columns of that sub-column. Not only can we pull apart nested objects like this, you might also infer how easily we can adapt if the schema changes and another sub-column is added.

What happens if the structure changes?

Imagine in a subsequent load the provider changed the specification to this:

{
    "fullName": "Johnny Appleseed",
    "age": 42,
    "gender": "Male",
    "phoneNumber": {
                    "areaCode": "415",
                    "subscriberNumber": "5551234",
                    "extensionNumber": "24"
                   },
...

They added a new attribute (extensionNumber)! What happens to the load?

Nothing – it keeps working because we ingest the string into the VARIANT column in the table.

What about the ETL code?

What ETL code? There is no ETL so there is nothing to break.

What about existing reports?

They keep working too. The previous query will work fine. If you want to see the new column, then the SQL needs to be refactored to account for the change:

select
   v:phoneNumber.areaCode::string as area_code,
   v:phoneNumber.subscriberNumber::string as subscriber_number,
   v:phoneNumber.extensionNumber::string as extension_number
from json_demo;

In addition, if the reverse happens and an attribute is dropped, the query will not fail. Instead, it simply returns a NULL value. In this way we insulate all the code you write from these types of dynamic changes.
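And if a NULL is not what you want downstream, the usual functions apply here as anywhere else; a small sketch using COALESCE to supply a default:

select
   coalesce(v:phoneNumber.extensionNumber::string, 'n/a') as extension_number
from json_demo;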

Next time

This post has looked at the basics of how Snowflake handles semi-structured data, using JSON as a specific example. In Part 2, I will show you how we handle more complex schema structures like arrays and nested arrays within the JSON document as well as give you a brief example of an aggregation and filtering against the contents of a VARIANT data type.