RI (Referential Integrity) Constraints: 3 Reasons to Include Them in Your Data Warehouse

Over the years, I have had numerous conversations about the value of having referential integrity (RI) constraints, such as primary and foreign keys, in a relational data warehouse or data mart.

Many DBAs object that RI constraints slow the load process. This is a valid point if you are talking about enforced constraints that are checked in real time during the load. But this is not an issue if you define the constraints as disabled.

Which then leads to this common question:

Is there any reason to maintain a permanently disabled FK in the data model?  If it is not going to be enabled, then from my perspective, it doesn’t make any sense to define the FK.  Instead, the relationship can be described in the comment of the child column.

So, why would I want RI constraints in my data warehouse?

Mostly it has to do with good design and development best practices. Here is my rationale for why you should consider including RI constraints in your data warehouse design.

#1 – Design Metadata

RI constraints are valuable metadata/documentation. If somebody reverse engineers the database (say with ERWin or Oracle Data Modeler), the PKs and FKs show up in the diagram (much better than having to read a column comment to discover a relationship). This is quite valuable for new people on your project to orient themselves to the existing schema design and understand how the various tables in your data warehouse are related to each other.

RI constraints in a reverse-engineered diagram: a picture is worth a thousand words

#2 – BI Metadata

If you want to use any sort of reporting or BI tool against the database (it is a data warehouse, after all), you will find that most modern business intelligence and visualization tools can import the foreign key definitions along with the tables and build the proper join conditions for you. This is much better than having someone guess what the join should be and then manually add it to the metadata layer in the reporting tool. It also ensures that different developers don't interpret the joins differently.

Examples of tools that can read the Snowflake data dictionary include Looker, Tableau, COGNOS, MicroStrategy, and many others. Some of these tools actually use the FK definitions for join culling to provide better query performance.

#3 – QA your ETL/ELT code

I know you think your ETL code is perfect.

But does every developer test to the same standards? Do you maybe have a QA team who separately validates that the ETL is doing what you expect?

If so, having declared primary, unique, and foreign key constraints in your data warehouse gives the team more information they can use to ensure the quality of the data. In fact, using the Snowflake Information Schema, a QA engineer can potentially generate SQL to test that the loaded data conforms to the defined constraints.
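
For example, a simple orphan check for a foreign key relationship might look like this (a minimal sketch; the table and column names match the SAT_REGIONS example later in this post, so adjust them to your own model):

-- Find child rows whose foreign key value has no matching parent row,
-- i.e., rows that would have violated the (disabled) FK constraint.
SELECT s.HUB_REGION_KEY, COUNT(*) AS orphan_rows
FROM SAT_REGIONS s
LEFT JOIN HUB_REGION h
  ON h.HUB_REGION_KEY = s.HUB_REGION_KEY
WHERE h.HUB_REGION_KEY IS NULL
GROUP BY s.HUB_REGION_KEY;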

Defining RI Constraints in Snowflake

You, of course, can (and IMHO should) define RI constraints in Snowflake. You can define primary keys, unique keys, foreign keys, and NOT NULL constraints. Because Snowflake is specifically engineered for data warehousing, only the NOT NULL constraints are enforced. The rest are always created as disabled.

The syntax is standard SQL. You can define the constraints both inline and out-of-line.

Here is a simple example of inline constraints:

CREATE OR REPLACE TABLE SAT_REGIONS (
  HUB_REGION_KEY  NUMBER(38,0) NOT NULL,
  SAT_LOAD_DTS    DATE NOT NULL,
  REGION_COMMENT  VARCHAR(152),
  HASH_DIFF       VARCHAR(32) NOT NULL,
  SAT_REC_SRC     VARCHAR(50) NOT NULL,
  CONSTRAINT SAT_REGIONS_PK PRIMARY KEY (HUB_REGION_KEY, SAT_LOAD_DTS),
  CONSTRAINT SAT_REGIONS_FK1 FOREIGN KEY (HUB_REGION_KEY)
    REFERENCES KENT_DB.DATAVAULT_MAIN.HUB_REGION (HUB_REGION_KEY)
);

Example of an out-of-line constraint:

ALTER TABLE SAT_REGIONS ADD CONSTRAINT SAT_REGIONS_PK
  PRIMARY KEY (HUB_REGION_KEY, SAT_LOAD_DTS);
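
If you want to double-check what you have defined, the constraint definitions show up in the Snowflake Information Schema (a quick sketch; swap in your own table name):

-- List the declared constraints on a table (metadata only; they are not enforced).
SELECT constraint_name, constraint_type
FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS
WHERE table_name = 'SAT_REGIONS';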

It’s that easy. So why don’t you have constraints in your data warehouse?

 

Did you hear? Snowflake was declared the #1 Cloud Data Warehouse in a recent GigaOM analyst report. Go here to get your copy of the report.

As always, keep an eye on this blog site, our Snowflake Twitter feeds (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for updates on all the action and activities here at Snowflake Computing.

Data Vault Modeling and Snowflake

Since I joined Snowflake, I have been asked multiple times which data warehouse modeling approach Snowflake supports best. Well, the cool thing is that we support multiple data modeling approaches equally well.

It turns out we have a few customers with existing data warehouses built using an approach known as Data Vault modeling, and they have decided to move to Snowflake (yeah!).

So the conversation often goes like this:

Customer: “Can you do Data Vault on Snowflake?”

Me: “Yes you can! Why do you ask?”

Customer: “Well, your name is “snowflake” so we thought that might mean you only support snowflake-type schemas.”

Me: “Well, yes I can see your confusion in that case, but the name has nothing to do with data warehouse design really. In fact, we support any type of relational design, including Data Vault.”

What is a Data Vault?

For those of you who have not yet heard of the Data Vault System of Business Intelligence, or simply Data Vault modeling, it provides (among other things) a method and approach to modeling your enterprise data warehouse (EDW) that is agile, flexible, and scalable.

The formal definition as written by the inventor Dan Linstedt:

The Data Vault is a detail oriented, historical tracking and uniquely linked set of normalized tables that support one or more functional areas of business.

It is a hybrid approach encompassing the best of breed between 3rd normal form (3NF) and star schema. The design is flexible, scalable, consistent, and adaptable to the needs of the enterprise. It is a data model that is architected specifically to meet the needs of today’s enterprise data warehouses.

The main point here is that Data Vault (DV) was developed specifically to address the agility, flexibility, and scalability issues found in the other mainstream data modeling approaches used in the data warehousing space. It was built to be a granular, non-volatile, auditable, historical repository of enterprise data.

At its core is a repeatable modeling technique that consists of just three main types of tables:

  • Hubs = Unique list of Business Keys
  • Links = Unique List of Associations / Transactions
  • Satellites = Descriptive Data for Hubs and Links (Type 2 with history)

Hubs make it business driven and allow for semantic integration across systems.

Links give you the flexibility to absorb structural and business rule changes without re-engineering (and therefore without reloading any data).

Satellites give you the adaptability to record history at any interval you want plus unquestionable auditability and traceability to your source systems.

Here is a simple example of what a Data Vault 2.0 model looks like:
Data Vault 2.0 Example
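
In DDL terms, a minimal sketch of a Hub and a Link might look like the following (DV 2.0 style hash keys; all table, column, and constraint names here are illustrative, not taken from the diagram above):

-- Illustrative only: a minimal Hub and Link.
CREATE OR REPLACE TABLE HUB_COUNTRY (
  HUB_COUNTRY_KEY VARCHAR(32)   NOT NULL,  -- MD5 hash of the business key
  COUNTRY_ABBRV   VARCHAR(10)   NOT NULL,  -- the business key itself
  HUB_LOAD_DTS    TIMESTAMP_NTZ NOT NULL,
  HUB_REC_SRC     VARCHAR(50)   NOT NULL,
  CONSTRAINT HUB_COUNTRY_PK PRIMARY KEY (HUB_COUNTRY_KEY)
);

CREATE OR REPLACE TABLE LNK_COUNTRY_REGION (
  LNK_COUNTRY_REGION_KEY VARCHAR(32)   NOT NULL,  -- hash of the combined business keys
  HUB_COUNTRY_KEY        VARCHAR(32)   NOT NULL,
  HUB_REGION_KEY         VARCHAR(32)   NOT NULL,
  LNK_LOAD_DTS           TIMESTAMP_NTZ NOT NULL,
  LNK_REC_SRC            VARCHAR(50)   NOT NULL,
  CONSTRAINT LNK_COUNTRY_REGION_PK PRIMARY KEY (LNK_COUNTRY_REGION_KEY),
  CONSTRAINT LNK_COUNTRY_REGION_FK1 FOREIGN KEY (HUB_COUNTRY_KEY)
    REFERENCES HUB_COUNTRY (HUB_COUNTRY_KEY)
);

A Satellite (like the SAT_REGIONS example earlier in this post) then hangs off a Hub or Link and carries the descriptive attributes plus the load timestamp and record source.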

Snowflake Features to use in a Data Vault

As we have said in the past, Snowflake is an ANSI SQL RDBMS with pay-as-you-go pricing. We support tables and views like all the relational solutions on the market today. Since, from a data modeling perspective, Data Vault is simply a specific way of designing and patterning the tables in your data warehouse, there are no issues implementing one in Snowflake.

In fact, with our combination of MPP compute clusters, optimized columnar storage format, and our patent-pending Adaptive Data Warehouse technology, I think you will get better results from your Data Vault loads and queries, with less effort, than you get today on legacy data warehouse solutions. Remember that with Snowflake you don't need to pre-plan partitioning or distribution keys, or build indexes, to get great performance. That is all handled by our Dynamic Query Optimization feature, which uses our secure cloud-based metadata store and a sophisticated feedback loop to monitor and tune your queries based on data access patterns, resource availability, and other factors.

Because our customers are getting such great query performance (some up to 100x improvement), I think Snowflake will be a great place to try virtualizing your information mart (i.e., reporting) layer that you expose to your BI tools.
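
As a rough sketch of what that virtualization could look like (the schema, table, and column names here are illustrative, loosely following the SAT_REGIONS example from earlier), a dimension view over a Hub and its Satellite might be as simple as:

-- A "virtual" dimension: the Hub joined to the most recent Satellite row per key.
CREATE OR REPLACE VIEW DM.DIM_REGION AS
SELECT
  h.HUB_REGION_KEY,
  s.REGION_COMMENT,
  s.SAT_LOAD_DTS
FROM DV.HUB_REGION h
JOIN DV.SAT_REGIONS s
  ON s.HUB_REGION_KEY = h.HUB_REGION_KEY
QUALIFY ROW_NUMBER() OVER (
          PARTITION BY s.HUB_REGION_KEY
          ORDER BY s.SAT_LOAD_DTS DESC) = 1;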

Data Vault 2.0

For those of you interested in implementing the Data Vault 2.0 specification, Snowflake can handle that as well. We have a built-in MD5 hash function, so you can implement MD5-based keys and do your change data capture using the DV 2.0 HASH_DIFF concept.
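
For example, computing the hash key and HASH_DIFF in the staging layer might look something like this (a minimal sketch; the column names are illustrative):

SELECT
  MD5(UPPER(TRIM(COUNTRY_ABBRV)))            AS HASH_KEY,   -- hash of the business key
  MD5(UPPER(TRIM(COUNTRY_ABBRV)) || '|' ||
      UPPER(TRIM(COUNTRY_NAME)))             AS HASH_DIFF,  -- hash of the descriptive attributes, used to detect changes
  COUNTRY_ABBRV,
  COUNTRY_NAME
FROM STAGE.COUNTRY;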

Not only does Snowflake support the DV 2.0 use of hash functions, but you can also take advantage of Snowflake's Multi-table Insert (MTI) when loading your main Data Vault tables. With this feature you can load multiple tables in parallel from a single source table.

For example, you can take the data from a stage table and load it into a Hub and a Satellite with one statement. Assuming your source table is in a schema called STAGE and your vault tables are in a schema called DV, the first pass at loading a Hub and its associated Satellite might look like this:

-- Note: the VALUES clauses reference the columns returned by the SELECT at the bottom.
INSERT ALL
  INTO DV.HUB_COUNTRY (HUB_COUNTRY_KEY, COUNTRY_ABBRV, HUB_LOAD_DTS, HUB_REC_SRC)
    VALUES (HASH_KEY, COUNTRY_ABBRV, LOAD_DTS, REC_SRC)
  INTO DV.SAT_COUNTRIES (HUB_COUNTRY_KEY, SAT_LOAD_DTS, HASH_DIFF, SAT_REC_SRC, COUNTRY_NAME)
    VALUES (HASH_KEY, LOAD_DTS, HASH_DIFF, REC_SRC, COUNTRY_NAME)
SELECT HASH_KEY,
       COUNTRY_ABBRV,
       COUNTRY_NAME,
       HASH_DIFF,
       CURRENT_TIMESTAMP AS LOAD_DTS,
       STAGE_REC_SRC     AS REC_SRC
FROM STAGE.COUNTRY;

Likewise you can use an MTI to load Links and their associated Satellites.
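
That load can look much the same as the Hub/Satellite statement above (a sketch only; the Link, Satellite, and staging names here are illustrative):

INSERT ALL
  INTO DV.LNK_COUNTRY_REGION (LNK_COUNTRY_REGION_KEY, HUB_COUNTRY_KEY, HUB_REGION_KEY, LNK_LOAD_DTS, LNK_REC_SRC)
    VALUES (LNK_HASH_KEY, COUNTRY_HASH_KEY, REGION_HASH_KEY, LOAD_DTS, REC_SRC)
  INTO DV.SAT_COUNTRY_REGION (LNK_COUNTRY_REGION_KEY, SAT_LOAD_DTS, HASH_DIFF, SAT_REC_SRC)
    VALUES (LNK_HASH_KEY, LOAD_DTS, HASH_DIFF, REC_SRC)
SELECT LNK_HASH_KEY,
       COUNTRY_HASH_KEY,
       REGION_HASH_KEY,
       HASH_DIFF,
       CURRENT_TIMESTAMP AS LOAD_DTS,
       STAGE_REC_SRC     AS REC_SRC
FROM STAGE.COUNTRY_REGION;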

Data Vault Resources

If you are interested in learning more about Data Vault, there is a website dedicated to Data Vault, and a few books on the subject that you might want to peruse:

  1. Introduction to Agile Data Engineering by Kent Graziano (me)
  2. Intro to Data Vault – free white paper on my personal blog
  3. Free Data Vault Intro Videos – from the inventor Dan Linstedt
  4. Building a Scalable Data Warehouse with Data Vault 2.0 by Dan Linstedt
  5. Dan Linstedt’s Blog

So hopefully this answers the basic questions about doing Data Vault on Snowflake (yes you can). If you want to start using Snowflake, feel free to take advantage of our On-Demand offer with $400 in free usage. In the future, as our customers roll out their solutions, I hope we will be able to give you some real-world case studies on how they implemented their Data Vaults on the Snowflake Elastic Data Warehouse.

Until then, keep an eye on this blog site, our Snowflake Twitter feeds (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for updates on all the action and activities here at Snowflake Computing.

 

Customizing Oracle SQL Developer Data Modeler (SDDM) to Support Snowflake VARIANT

On a recent customer call, the data architects were asking whether Snowflake provided a data model diagramming tool to design and generate data warehouse tables or to view a data model of an existing Snowflake data warehouse, or whether we knew of any tools that would work with Snowflake.

Well, we do not provide one of our own – our service is the Snowflake Elastic Data Warehouse (#ElasticDW).

The good news is that there are data modeling tools in the broader ecosystem that you can of course use (since we are ANSI SQL compliant).

You may have seen several posts (on my personal blog) where I used Oracle SQL Developer Data Modeler (aka SDDM) to reverse engineer and re-engineer a Snowflake database. (As a reminder, SDDM is a free-to-use enterprise class data modeling tool that you can download from Oracle here.)

If you have read my previous posts on using JSON within Snowflake, you also know that we have a new data type called VARIANT for storing semi-structured data like JSON, AVRO, and XML.
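
If you have not seen it before, here is a minimal sketch of VARIANT in action (the table and column names are mine, just for illustration):

-- A VARIANT column holding JSON, queried with Snowflake's path notation.
CREATE OR REPLACE TABLE RAW_EVENTS (
  EVENT_ID NUMBER(38,0),
  PAYLOAD  VARIANT
);

-- Load a JSON document via INSERT ... SELECT with PARSE_JSON.
INSERT INTO RAW_EVENTS
  SELECT 1, PARSE_JSON('{"region": "AMER", "country": "US"}');

SELECT PAYLOAD:region::STRING AS region
FROM RAW_EVENTS;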

In this post I will bring it all together and show you the steps to customize SDDM so you can model and generate table DDL containing columns that use the VARIANT data type.

Creating a User Defined Data Type

One of the features of Snowflake is that you can run most Oracle DDL syntax (among others) without edits to create your tables, views, and constraints in a Snowflake database (this makes it easier for you to migrate your current data warehouse to our cloud-based data warehouse service). That means you can create your data warehouse model in SDDM, generate the DDL using Oracle syntax, and then run it in Snowflake unchanged.

Of course, Oracle does not have a VARIANT data type, so SDDM does not have it either. Fortunately, you can fairly easily define your own custom data type to use in your models and output in CREATE TABLE DDL statements.

I knew the feature was there but had a hard time getting it to do what I wanted, so I posted a question on the OTN Forum for Data Modeler and, as usual, Philip (the lead architect for Data Modeler) answered and laid out the steps for me. As he said, and as you will see, it is not totally straightforward and is a bit of a chicken-and-egg scenario.

Define a User Defined Native Type

First, go to Tools -> Types Administration and open the tab for user-defined native types. Then pick the RDBMS type you want to associate the new type with; I used Oracle Database 12c (remember, there is no native Snowflake option). Press the green plus (+) to create a new entry. Under native type, give it a name (in my case VARIANT), then under Logical type select one of the standard types. I used CLOB for now (this really is a temporary setting that I will change shortly).

Define Custom Type 1

Notice that there are other options, “Has size” and “Has precision and scale”, that you might use if the type you want to define needs those properties, but for VARIANT you do not define size or precision, so I left those unchecked.

Add a new Logical Type

This was the step that was not obvious at all to me. It turns out when you are defining the data type on a column you are usually selecting a Logical Data Type (unless you are picking a Domain). To make this work we need to add a new Logical Type to map our VARIANT data type to.

So now we need to switch to the first tab in the Types Administration dialog. It is the one labeled “Logical types to native types.”

On this tab click the Add button at the bottom left. Now you can type in a new logical type name; I named mine VARIANT to prevent any confusion. The key piece now is to map it to the Native Type that was defined in the previous step. To do that, open the drop-down for the database you set up (in this case Oracle 12c); in that list you will find the VARIANT type. Once that is selected, be sure to hit APPLY and then SAVE.

Define New Logical Type

Note that if you wanted to define a new type for multiple databases and versions, you would need to repeat this process for all others too.

Now change the original Native Type

Again, not that obvious a step, but go back to the “User defined native types” tab (the 3rd tab) and click on the Logical type drop-down. Scroll all the way to the bottom and you should see your newly defined logical type. Pick it, then press the SAVE button again. Now you can CLOSE the dialog (and save your design to be sure!).

Refine Custom Type 2

Now you have the custom data type defined so it can be used!

Apply the User Defined Type

At this point you can either define a new column using your custom data type or edit an existing one to change the data type.

In my case, I reverse engineered a table in Snowflake that had a VARIANT column in it (before I made these changes to the data types).

VARIANT in Snowflake UI

Since SDDM did not know what VARIANT was, it guessed and assigned it to the Source Type of VARCHAR.

Import of VARIANT

Note that the Data Type property is set to the Logical radio button – hence the need to set up a custom Logical Type too.

With the new type defined, I can now simply edit the column and pick VARIANT from the drop down list. It does appear that custom types go to the bottom of the list.

Assign VARIANT to Column

Don’t forget to press APPLY or OK to save the reassignment.

Generate the DDL

With the assignment done, right-click on the table in the diagram and pick DDL Preview. You should see that the DDL includes the columns using the newly defined data type.

VARIANT DDL
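
The generated DDL is essentially standard CREATE TABLE syntax with the new type in place, roughly like this (the table and column names below are illustrative, not the actual ones from my model):

CREATE TABLE JSON_DEMO (
  ID       NUMBER (38,0) NOT NULL,
  DOC_JSON VARIANT
);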

Test It!

In my case I just copied the DDL from the preview window and pasted it into the Snowflake Web UI, then executed it.

Success! It worked!

I can now use Oracle SQL Developer Data Modeler to design tables for a Snowflake database that include columns using our innovative VARIANT data type.

And now you know how to do it too!