Customizing Oracle SQL Developer Data Modeler (SDDM) to Support Snowflake VARIANT

On a recent customer call, the data architects asked if Snowflake provided a data model diagramming tool to design and generate data warehouse tables or to view a data model of an existing Snowflake data warehouse, or if we knew of any tools that would work with Snowflake.

Well, we do not provide one of our own – our service is the Snowflake Elastic Data Warehouse (#ElasticDW).

The good news is that there are data modeling tools in the broader ecosystem that you can of course use (since we are ANSI SQL compliant).

You may have seen several posts (on my personal blog) where I used Oracle SQL Developer Data Modeler (aka SDDM) to reverse engineer and re-engineer a Snowflake database. (As a reminder, SDDM is a free-to-use enterprise class data modeling tool that you can download from Oracle here.)

If you have read my previous posts on using JSON within Snowflake, you also know that we have a new data type called VARIANT for storing semi-structured data like JSON, AVRO, and XML.

In this post I will bring it all together and show you the steps to customize SDDM so you can model and generate table DDL that contains columns using the VARIANT data type.

Creating a User Defined Data Type

One of the features of Snowflake is that you can run most Oracle DDL syntax (among others) without edits to create your tables, views, and constraints in a Snowflake database, which makes it easier to migrate your current data warehouse to our cloud-based data warehouse service. That means you can create your data warehouse model in SDDM, generate the DDL using Oracle syntax, then run it in Snowflake without any edits.
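
To make that concrete, here is a minimal sketch (table and column names are made up for illustration) of the kind of statement that runs as-is in both Oracle and Snowflake:

create table customer_dim
  (
    customer_id   number(38,0) not null,
    customer_name varchar(100),
    signup_date   date,
    constraint customer_dim_pk primary key (customer_id)
  );

Snowflake accepts the primary key constraint syntax as well, though it does not enforce the constraint.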

Of course Oracle does not have a VARIANT data type, so SDDM does not have it either. Fortunately, you can fairly easily define your own custom data type to use in models and include in generated create table DDL statements.

I knew the feature was there but had a hard time getting it to do what I wanted, so I posted a question on the OTN Forum for Data Modeler and, as usual, Philip (the lead architect for Data Modeler) answered and laid out the steps for me. As he said, and as you will see, it is not totally straightforward and is a bit of a chicken-and-egg scenario.

Define a User Defined Native Type

First, go to Tools -> Types Administration and open the tab for user defined native types. Then pick the RDBMS you want to associate the new type with. I used Oracle Database 12c (remember, there is no native Snowflake option). Press the green plus (+) to create a new entry. Under Native type, give it a name (in my case VARIANT), then under Logical type select one of the standard types. I used CLOB for now (this really is a temporary setting that I will change shortly).

Define Custom Type 1

Notice that there are other options for “Has size” and “Has precision and scale” that you might use if the type you are defining needs those properties, but VARIANT does not take a size or precision, so I left those unchecked.

Add a new Logical Type

This was the step that was not obvious to me at all. It turns out that when you define the data type on a column, you are usually selecting a Logical Data Type (unless you are picking a Domain). To make this work, we need to add a new Logical Type to map our VARIANT data type to.

So now we need to switch to the first tab in Types Administration dialog. It is the one labeled “Logical types to native types.”

On this tab, click the Add button at the bottom left. Now you can type in a new logical type name. I named mine VARIANT to prevent any confusion. The key piece now is to map it to the Native Type that was defined in the previous step. To do that, open the drop down for the database you set up (in this case Oracle Database 12c); in that list you will see the VARIANT type. Once that is selected, be sure to hit APPLY and then SAVE.

Define New Logical Type

Note that if you wanted to define a new type for multiple databases and versions, you would need to repeat this process for all others too.

Now change the original Native Type

Again, not that obvious a step, but go back to the “User defined native types” tab (the 3rd tab) and click on the Logical type drop down. Scroll all the way to the bottom and you should see your newly defined logical type. Pick it, then press the SAVE button again. Now you can CLOSE the dialog (and save your design to be sure!).

Refine Custom Type 2

Now you have the custom data type defined so it can be used!

Apply the User Defined Type

At this point you can either define a new column using your custom data type or edit an existing one to change the data type.

In my case, I reverse engineered a table in Snowflake that had a VARIANT column in it (before I made these changes to the data types).

VARIANT in Snowflake UI

Since SDDM did not know what VARIANT was, it guessed and assigned it to the Source Type of VARCHAR.

Import of VARIANT

Note that the Data Type property is set to the Logical radio button – hence the need to set up a custom Logical Type too.

With the new type defined, I can now simply edit the column and pick VARIANT from the drop down list. It does appear that custom types go to the bottom of the list.

Assign VARIANT to Column

Don’t forget to press APPLY or OK to save the reassignment.

Generate the DDL

With the assignment done, right-click the table in the diagram and pick DDL Preview. You should see that the DDL includes the columns using the newly defined data type.

VARIANT DDL
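
Your generated DDL will of course reflect your own model, but for a table with a VARIANT column it should look roughly like this (the table name and extra column here are just illustrative):

create table json_docs
  (
    doc_id   number(38,0) not null,
    doc      variant
  );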

Test It!

In my case I just copied the DDL from the preview window and pasted it into the Snowflake Web UI, then executed it.

Success! It worked!

I can now use Oracle SQL Developer Data Modeler to design tables for a Snowflake database that include columns using our innovative VARIANT data type.

And now you know how to do it too!


Part II: Making Schema-on-Read a Reality

In the first article of this series, I discussed the Snowflake data type VARIANT, showed a simple example of how to load a JSON document into a VARIANT column in a table, and then showed how easy it is to query data directly from that data type. In this post I will show you how to access an array of data within the JSON document and how we handle nested arrays. Then finally I will give you an example of doing an aggregation using data in the JSON structure and how simple it is to filter your query results by referring to values within an array.

Handling Arrays of Data

One of the features of JSON is the ability to specify and embed an array of data within the document. In my example one such array is children:

    "children": [
         { "name": "Jayden", "gender": "Male", "age": "10" },
         { "name": "Emma", "gender": "Female", "age": "8" },
         { "name": "Madelyn", "gender": "Female", "age": "6" }
     ]

You will notice there are effectively 3 rows in the array and each row has 3 sub-columns – name, gender, and age. Each of those rows constitutes the value of that array entry which includes all the sub-column labels and data (remember that for later). So how do you know how many rows there are if you do not have access to the raw data?

Like this:

select array_size(v:children) from json_demo;

The function array_size figures it out for us. To pull the data for each “row” in the array, we use the dot notation from before, but now with the added specification for the row number of the array (in the [] brackets):

select v:children[0].name from json_demo
union all
select v:children[1].name from json_demo
union all
select v:children[2].name from json_demo;

Children's names

This is interesting, but I really do not want to write UNION ALL SQL to traverse the entire array (and to do that I would need to know how many values are in the array, right?).

We solve that problem with another new extended SQL function called FLATTEN. FLATTEN takes an array and returns a row for each element in the array. With that you can select all the data in the array as though they were in table rows (so no need to figure out how many entries there are).

Instead of doing the set of UNION ALLs, we add the FLATTEN into the FROM clause and give it a table alias:

select f.value:name
from json_demo, table(flatten(v:children)) f;

This syntax allows us to create an inline virtual table in the FROM clause.

In the SELECT, we can then reference it like a table. Notice the notation f.value:name.

f = the alias for the virtual table from the children array

value = the contents of the element returned by the FLATTEN function

name = the label of the specific sub-column we want to extract from the value

The results, in this case, are the same as the SELECT with the UNIONs but the output column header reflects the different syntax (since I have not yet added any column aliases).

Flatten Children 1

Now, if another element is added to the array (i.e., a 4th child), the SQL will not have to be changed. FLATTEN allows us to determine the structure and content of the array on-the-fly! This makes the SQL resilient to changes in the JSON document.

With this in hand, we can of course get all the array sub-columns and format them just like a relational table:

select
   f.value:name::string as child_name,
   f.value:gender::string as child_gender,
   f.value:age::string as child_age
from json_demo, table(flatten(v:children)) f;

Flatten Format Children

Putting this all together, I can write a query to get the parent’s name and all the children like this:

select
   v:fullName::string as parent_name,
   f.value:name::string as child_name,
   f.value:gender::string  as child_gender,
   f.value:age::string as child_age
from json_demo, table(flatten(v:children)) f;

Which results in this output:

Parent and Children

If I just want a quick count of children by parent, I do not need FLATTEN; I can refer back to array_size:

select
   v:fullName::string as Parent_Name,
   array_size(v:children) as Number_of_Children
from json_demo;

Count Children

Handling Multiple Arrays

You may recall there are multiple arrays in our sample JSON string. I can pull from several arrays at once with no problem:

select
   v:fullName::string as Parent_Name,
   array_size(v:citiesLived) as Cities_lived_in,
   array_size(v:children) as Number_of_Children
from json_demo;

Parent, Children, City

What about an Array within an Array?

Snowflake can handle that too. From our sample data we can see yearsLived is an array nested inside the array described by citiesLived:

"citiesLived": [
 { "cityName": "London",
   "yearsLived": [ "1989", "1993", "1998", "2002" ]
 },
 { "cityName": "San Francisco",
   "yearsLived": [ "1990", "1993", "1998", "2008" ]
 },
 { "cityName": "Portland",
   "yearsLived<": [ "1993", "1998", "2003", "2005" ]
 },
 { "cityName": "Austin",
   "yearsLived": [ "1973", "1998", "2001", "2005" ]
 }
 ]

To pull that data out, we add a second FLATTEN clause that transforms the yearsLived array within the FLATTENed citiesLived array.

select
  tf.value:cityName::string as city_name,
  yl.value::string as year_lived
from json_demo,
     table(flatten(v:citiesLived)) tf,
     table(flatten(tf.value:yearsLived)) yl;

In this case the 2nd FLATTEN (alias “yl”) is transforming (really pivoting) the yearsLived array for each value returned from the 1st FLATTEN of the citiesLived array (“tf”).

The results output shows Year Lived by City:

Years Lived by City

Like my earlier example, I can then augment this result by adding in the name too (so I know who lived where):

select
  v:fullName::string as parent_name,
  tf.value:cityName::string as city_name,
  yl.value::string as year_lived
from json_demo,
     table(flatten(v:citiesLived)) tf,
     table(flatten(tf.value:yearsLived)) yl;

Name and YearLived

Aggregations?

Yup, we can even aggregate data within semi-structured data. (We would not be much of a data warehouse service if we couldn’t, right?)

So, just like ANSI SQL, we can do a count(*) and a group by:

select
   tf.value:cityName::string as city_name,
   count(*) as years_lived
from json_demo,
     table(flatten(v:citiesLived)) tf,
     table(flatten(tf.value:yearsLived)) yl
group by 1;

And the results:

Count Years

Filtering?

Of course! Just add a WHERE clause.

select
  tf.value:cityName::string as city_name,
  count(*) as years_lived
from json_demo,
     table(flatten(v:citiesLived)) tf,
     table(flatten(tf.value:yearsLived)) yl
where city_name = 'Portland'
group by 1;

Where City

To simplify things, notice that I used the column alias city_name in the predicate, but you could also use the full sub-column specification tf.value:cityName.
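
For example, here is the same query with the predicate written against the full sub-column specification instead of the alias (I added an explicit ::string cast to keep the comparison unambiguous):

select
  tf.value:cityName::string as city_name,
  count(*) as years_lived
from json_demo,
     table(flatten(v:citiesLived)) tf,
     table(flatten(tf.value:yearsLived)) yl
where tf.value:cityName::string = 'Portland'
group by 1;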

Schema-on-Read is a Reality

I could go on, but by now I think you can see we have made it very easy to load and extract information from semi-structured data using Snowflake. We added a brand new data type, VARIANT, that lives in a relational table structure in a relational database, without the need to analyze the structure ahead of time, design appropriate database tables, and then shred the data into that predefined schema. Then I showed you some easy-to-learn extensions to ANSI-standard SQL for accessing that data in a very flexible and resilient manner.

With these features, Snowflake gives you the real ability to quickly and easily load semi-structured data into a relational data warehouse and make it available for immediate analysis.

Part I: Making Schema-on-Read a Reality

(Note: This is the first in a two-part series discussing how we handle semi-structured data in Snowflake.)

Schema? I don’t need no stinking schema!

Over the last several years, I have heard the phrase schema-on-read used to explain the benefit of loading semi-structured data into a Big Data platform like Hadoop. The idea being that you could delay data modeling and schema design until long after the data was loaded (so as not to slow down getting your data while waiting for those darn data modelers).

Every time I heard it, I thought (and sometimes said) – “but that implies there is a knowable schema.”  So really you are just delaying the inevitable need to understand the structure in order to derive some business value from that data. Pay me now or pay me later.

Why delay the pain?

So even though folks are able to quickly load this type of data into Hadoop or NoSQL, there is still more work ahead to actually pull the data apart so it can be analyzed. The person writing the query often has the burden of figuring out the schema and writing code to extract it. Additionally there may be a query performance penalty in this process (over that of querying columns in a relational database).

Not so with the Snowflake Elastic Data Warehouse (#ElasticDW)! With Snowflake, you can load your semi-structured data directly into a relational table, query the data with a SQL statement, and join it to other structured data, all without fretting about future changes to the “schema” of that data. Snowflake actually keeps track of the self-describing schema so you don’t have to. No ETL or fancy shredding required.

One of the key differentiators which really attracted me to Snowflake is our built in support to load and query semi-structured data such as JSON, XML, and AVRO. In most conventional data warehouse and Big Data environments today, you have to first load this type of data to a Hadoop or NoSQL platform, then shred it (using for example MapReduce) in order to then load it into columns in a relational database (that is if you want to then run SQL queries or a BI/Analytics tool against that data).

How did we do it?

Simple – we invented a new data type called VARIANT that allows us to load semi-structured data (i.e., flexible schema) as-is into a column in a relational table.

Read that again – we load the data directly into a relational table.

Okay, so that means no Hadoop or NoSQL needed in your data warehouse architecture just to hold semi-structured data. Just an RDBMS (in the cloud) that uses SQL that your staff already knows how to write.

But that is only half the equation. Once the data is in, how do you get it out?

Our brilliant founders and excellent engineering team (#DataSuperStars) have created extensions to SQL to reference the internal schema of the data (it is self-describing after all) so you can query the components and join it to columns in other tables as if it had been shredded into a standard relational table. Except there is no coding or shredding required to prep the data. Cool.
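
As a hypothetical sketch of what that looks like (assuming a separate, ordinary relational table named contact_info with full_name and email columns; these names are made up for illustration and are not part of my demo), the join reads just like any other SQL join:

select
   c.email,
   j.v:phoneNumber.areaCode::string as area_code
from json_demo j
join contact_info c
  on c.full_name = j.v:fullName::string;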

That also means that as the data source evolves and changes over time (e.g., new attributes, nesting, or arrays are added), there is no re-coding of ETL (or even ELT) code required to adapt. The VARIANT data type does not care if the schema varies.

What does it really look like?

Enough of the theory – let’s walk through an example of how this all works.

1 – Create a table

I have a Snowflake account, database, and virtual warehouse set up already, so just like I would in any other database, I simply issue a create table DDL statement:

create or replace table json_demo (v variant);

Now I have a table with one column (“v”) with a declared data type of VARIANT.

2 – Load some data

Now I load a sample JSON Document using an INSERT and our PARSE_JSON function. We are not simply loading it as text but rather storing it as an object in the VARIANT data type while at the same time converting it to an optimized columnar format (for when we query it later):

insert into json_demo
 select
 parse_json(
 '{
     "fullName": "Johnny Appleseed",
     "age": 42,
     "gender": "Male",
     "phoneNumber": {
                     "areaCode": "415",
                     "subscriberNumber": "5551234"
                    },
     "children": [
                  { "name": "Jayden", "gender": "Male", "age": "10" },
                  { "name": "Emma", "gender": "Female", "age": "8" },
                  { "name": "Madelyn", "gender": "Female", "age": "6" }
                 ],
     "citiesLived": [
                    { "cityName": "London",
                      "yearsLived": [ "1989", "1993", "1998", "2002" ]
                    },
                    { "cityName": "San Francisco",
                      "yearsLived": [ "1990", "1993", "1998", "2008" ]
                    },
                    { "cityName": "Portland",
                      "yearsLived": [ "1993", "1998", "2003", "2005" ]
                    },
                    { "cityName": "Austin",
                      "yearsLived": [ "1973", "1998", "2001", "2005" ]
                    }
                   ]
  }');

3 – Start pulling data out

So let’s start with just getting the name:

select v:fullName from json_demo;

Get fullname

Where:

v = the column name in the json_demo table

fullName = attribute in the JSON schema

v:fullName = notation to indicate which attribute in column “v” we want to select. 

So, similar to the table.column notation all SQL people are familiar with, in Snowflake we added the ability to effectively specify a column within the column (i.e., a sub-column), which is dynamically derived based on the schema definition embedded in the JSON string.

4 – Casting the Data

Usually we don’t want to see the double quotes around the data in the report output (unless we were going to create an extract file of some sort), so we can cast it to a string and give it a nicer column alias (like we would do with a normal column):

select v:fullName::string as full_name
from json_demo;

Format FullName

Next let’s look at a bit more of the data using the same syntax from above:

select
   v:fullName::string as full_name,
   v:age::int as age,
   v:gender::string as gender
from json_demo;

Demographics

Again, simple SQL and the output looks like the results from any table you might have built in your traditional data warehouse.

Safe to say at this point, with what I have already shown you, you could look at a table in Snowflake with a VARIANT column and quickly start “shredding” the JSON with SQL.

How long did that take to learn?

This is why I love Snowflake! I can now query semi-structured data and I did not have to learn a new programming language or framework or whatever over in Big Data land – yet I have the same capabilities as if I did.

Much lower learning curve for sure.

Let’s get a little more complex

Yes, those examples are very simple, so let’s dive deeper. Notice in the original string there is some nesting of the data:

{
    "fullName": "Johnny Appleseed",
    "age": 42,
    "gender": "Male",
    "phoneNumber": {
                    "areaCode": "415",
                    "subscriberNumber": "5551234"
                   },
...

How do we pull that apart? With a very familiar table.column dot notation:

select
   v:phoneNumber.areaCode::string as area_code,
   v:phoneNumber.subscriberNumber::string as subscriber_number
from json_demo;

So just as fullName, age, and gender are sub-columns, so too is phoneNumber. In turn, areaCode and subscriberNumber are sub-columns of that sub-column. Not only can we pull apart nested objects like this, but you can also infer how easily we can adapt if the schema changes and another sub-column is added.

What happens if the structure changes?

Imagine in a subsequent load the provider changed the specification to this:

{
    "fullName": "Johnny Appleseed",
    "age": 42,
    "gender": "Male",
    "phoneNumber": {
                    "areaCode": "415",
                    "subscriberNumber": "5551234",
                    "extensionNumber": "24"
                   },
...

They added a new attribute (extensionNumber)! What happens to the load?

Nothing – it keeps working because we ingest the string into the VARIANT column in the table.
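
For example, re-running the same style of INSERT with the revised document (trimmed here to just the changed phoneNumber block) loads with no change to the load statement itself:

insert into json_demo
 select
 parse_json(
 '{
     "fullName": "Johnny Appleseed",
     "age": 42,
     "gender": "Male",
     "phoneNumber": {
                     "areaCode": "415",
                     "subscriberNumber": "5551234",
                     "extensionNumber": "24"
                    }
  }');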

What about the ETL code?

What ETL code? There is no ETL so there is nothing to break.

What about existing reports?

They keep working too. The previous query will work fine. If you want to see the new column, then the SQL needs to be refactored to account for the change:

select
   v:phoneNumber.areaCode::string as area_code,
   v:phoneNumber.subscriberNumber::string as subscriber_number,
   v:phoneNumber.extensionNumber::string as extension_number
from json_demo;

In addition, if the reverse happens and an attribute is dropped, the query will not fail. Instead it simply returns a NULL value. In this way we insulate all the code you write from these types of dynamic changes.
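
For example, the json_demo document loaded earlier has no extensionNumber attribute, yet this query still runs and simply shows NULL for that column:

select
   v:phoneNumber.extensionNumber::string as extension_number
from json_demo;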

Next time

This post has looked at the basics of how Snowflake handles semi-structured data, using JSON as a specific example. In Part 2, I will show you how we handle more complex schema structures like arrays and nested arrays within the JSON document as well as give you a brief example of an aggregation and filtering against the contents of a VARIANT data type.