RI (Referential Integrity) Constraints: 3 Reasons to Include Them in Your Data Warehouse

Over the years, I have had numerous conversations about the value of having referential integrity (RI) constraints, such as primary and foreign keys, in a relational data warehouse or data mart.

Many DBAs object that RI constraints slow the load process. This is a valid point if you are talking about enforced constraints that are checked in real time during the load. But this is not an issue if you define the constraints as disabled.
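For example, on a platform like Oracle (shown here only to illustrate the idea; the equivalent Snowflake syntax appears later in this post), you can declare a foreign key that is never checked at load time. The table and column names below are just placeholders:

-- Declare the FK for metadata purposes only: existing and newly loaded rows
-- are not validated, but the optimizer is told it can RELY on the relationship
ALTER TABLE fact_sales ADD CONSTRAINT fact_sales_customer_fk
  FOREIGN KEY (customer_key) REFERENCES dim_customer (customer_key)
  RELY DISABLE NOVALIDATE;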

Which then leads to this common question:

Is there any reason to maintain a permanently disabled FK in the data model?  If it is not going to be enabled, then from my perspective, it doesn’t make any sense to define the FK.  Instead, the relationship can be described in the comment of the child column.

So, why would I want RI constraints in my data warehouse?

Mostly it has to do with good design and development best practices. Here is my rationale for why you should consider including RI constraints in your data warehouse design.

#1 – Design Metadata

RI constraints are valuable metadata/documentation. If somebody reverse engineers the database (say with ERWin or Oracle Data Modeler), the PKs and FKs show up in the diagram (much better than having to read a column comment to discover a relationship). This is quite valuable for new people on your project to orient themselves to the existing schema design and understand how the various tables in your data warehouse are related to each other.

[Figure: RI constraints shown in a reverse-engineered diagram]
A picture is worth a thousand words.

#2 – BI Metadata

If you want to use any sort of reporting or BI tool against the database (it is a data warehouse, after all), you will be glad to know that most modern business intelligence and visualization tools import the foreign key definitions along with the tables and build the proper join conditions. This is much better than having someone guess what the join should be and then manually add it to the metadata layer in the reporting tool. It also ensures that different developers don't interpret the joins differently.

Examples of tools that can read the Snowflake data dictionary include Looker, Tableau, COGNOS, MicroStrategy, and many others. Some of these tools actually use the FK definitions for join culling to provide better query performance.

#3 – QA your ETL/ELT code

I know you think your ETL code is perfect.

But does every developer test to the same standards? Do you maybe have a QA team who separately validates that the ETL is doing what you expect?

If so, having declared primary, unique, and foreign key constraints in your data warehouse gives the team more information they can use to ensure the quality of the data. In fact, using the Snowflake Information Schema, a QA engineer can potentially generate SQL to test that the loaded data conforms to the defined constraints.
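Here is a rough sketch of what that could look like. The INFORMATION_SCHEMA views are standard; the database, schema, table, and key column names below are hypothetical placeholders:

-- List the declared (but unenforced) PK/UK/FK constraints so QA knows what to test
SELECT table_name, constraint_name, constraint_type
  FROM my_dw_db.information_schema.table_constraints
 WHERE constraint_schema = 'EDW'
 ORDER BY table_name, constraint_type;

-- One test a QA engineer might generate from a declared FK:
-- count fact rows whose key value has no matching dimension row
SELECT COUNT(*) AS orphan_rows
  FROM edw.fact_orders f
  LEFT JOIN edw.dim_customer d
    ON f.customer_key = d.customer_key
 WHERE d.customer_key IS NULL;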

Defining RI Constraints in Snowflake

You, of course, can (and IMHO should) define RI constraints in Snowflake. You can define primary keys, unique keys, foreign keys, and NOT NULL constraints. Because Snowflake is specifically engineered for data warehousing, only the NOT NULL constraints are enforced. The rest are always created as disabled.

The syntax is standard SQL. You can define the constraints both inline and out-of-line.

Here is a simple example of inline constraints:

CREATE OR REPLACE TABLE SAT_REGIONS (
    HUB_REGION_KEY  NUMBER(38,0) NOT NULL,
    SAT_LOAD_DTS    DATE NOT NULL,
    REGION_COMMENT  VARCHAR(152),
    HASH_DIFF       VARCHAR(32) NOT NULL,
    SAT_REC_SRC     VARCHAR(50) NOT NULL,
    CONSTRAINT SAT_REGIONS_PK PRIMARY KEY (HUB_REGION_KEY, SAT_LOAD_DTS),
    CONSTRAINT SAT_REGIONS_FK1 FOREIGN KEY (HUB_REGION_KEY)
        REFERENCES KENT_DB.DATAVAULT_MAIN.HUB_REGION (HUB_REGION_KEY)
);

Example of an out-of-line constraint:

ALTER TABLE SAT_REGIONS ADD CONSTRAINT SAT_REGIONS_PK
    PRIMARY KEY (HUB_REGION_KEY, SAT_LOAD_DTS);
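Once the constraints are declared, you can also list them right from SQL. As a quick check (a sketch using the SAT_REGIONS table from the example above), something like this should show the keys you just defined:

-- List the primary key declared on the table
SHOW PRIMARY KEYS IN TABLE SAT_REGIONS;

-- List the foreign keys declared on the table
SHOW IMPORTED KEYS IN TABLE SAT_REGIONS;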

It’s that easy. So why don’t you have constraints in your data warehouse?

 


Customizing Oracle SQL Developer Data Modeler (SDDM) to Support Snowflake VARIANT

On a recent customer call, the data architects were asking if Snowflake provided a data model diagramming tool to design and generate data warehouse tables or to view a data model of an existing Snowflake data warehouse. Or if we knew of any that would work with Snowflake.

Well, we do not provide one of our own – our service is the Snowflake Elastic Data Warehouse (#ElasticDW).

The good news is that there are data modeling tools in the broader ecosystem that you can of course use (since we are ANSI SQL compliant).

You may have seen several posts (on my personal blog) where I used Oracle SQL Developer Data Modeler (aka SDDM) to reverse engineer and re-engineer a Snowflake database. (As a reminder, SDDM is a free-to-use enterprise class data modeling tool that you can download from Oracle here.)

If you have read my previous posts on using JSON within Snowflake, you also know that we have a new data type called VARIANT for storing semi-structured data like JSON, Avro, and XML.

In this post I will bring it together and show you the steps to customize SDDM to allow you to model and generate table DDL that contain columns that use the VARIANT data type.

Creating a User Defined Data Type

One of the features of Snowflake is that you can run most Oracle DDL syntax (among others) without edits to create your tables, views, and constraints in a Snowflake database (this makes it easier to migrate your current data warehouse to our cloud-based data warehouse service). That means you can create your data warehouse model in SDDM, generate the DDL using Oracle syntax, and then run it in Snowflake as-is.

Of course, Oracle does not have a VARIANT data type, so SDDM does not have it either. Fortunately, you can fairly easily define your own custom data type to use in your models and output in CREATE TABLE DDL statements.

I knew the feature was there but had a hard time getting it to do what I wanted, so I posted a question on the OTN Forum for Data Modeler, and, as usual, Philip (the lead architect for Data Modeler) answered and laid out the steps for me. As he said, and as you will see, it is not totally straightforward and is a bit of a chicken-and-egg scenario.

Define a User Defined Native Type

First, go to Tools -> Types Administration and open the tab for user-defined native types. Then pick the RDBMS you want to associate the new type with; I used Oracle Database 12c (remember, there is no native Snowflake option). Press the green plus (+) to create a new entry. Under Native type, give it a name (in my case VARIANT), then under Logical type select one of the standard types. I used CLOB for now (this is a temporary setting that I will change shortly).

[Screenshot: Define Custom Type 1]

Notice that there are other options, "Has size" and "Has precision and scale," that you might use if the type you want to define needs those properties. For VARIANT you do not define size or precision, so I left those unchecked.

Add a new Logical Type

This was the step that was not obvious to me at all. It turns out that when you define the data type on a column, you are usually selecting a Logical Data Type (unless you are picking a Domain). To make this work, we need to add a new Logical Type to map our VARIANT data type to.

So now we need to switch to the first tab in the Types Administration dialog, the one labeled "Logical types to native types."

On this tab, click the Add button on the bottom left and type in a new logical type name; I named mine VARIANT to prevent any confusion. The key piece now is to map it to the Native Type defined in the previous step. To do that, open the drop-down for the database you set up (in this case Oracle 12c); in that list you will find the VARIANT type. Once it is selected, be sure to hit APPLY and then SAVE.

[Screenshot: Define New Logical Type]

Note that if you wanted to define a new type for multiple databases and versions, you would need to repeat this process for all others too.

Now change the original Native Type

Again, not the most obvious step, but go back to the "User defined native types" tab (the third tab) and click the Logical type drop-down. Scroll all the way to the bottom and you should see your newly defined logical type. Pick it, then press the SAVE button again. Now you can CLOSE the dialog (and save your design to be sure!).

[Screenshot: Refine Custom Type 2]

Now you have the custom data type defined so it can be used!

Apply the User Defined Type

At this point you can either define a new column using your custom data type or edit an existing one to change the data type.

In my case, I had reverse engineered a table in Snowflake that had a VARIANT column in it (before I made these changes to the data types).

[Screenshot: VARIANT in Snowflake UI]

Since SDDM did not know what VARIANT was, it guessed and assigned it to the Source Type of VARCHAR.

[Screenshot: Import of VARIANT]

Note that the Data Type property is set to the Logical radio button – hence the need to set up a custom Logical Type too.

With the new type defined, I can now simply edit the column and pick VARIANT from the drop down list. It does appear that custom types go to the bottom of the list.

[Screenshot: Assign VARIANT to Column]

Don’t forget to press APPLY or OK to save the reassignment.

Generate the DDL

With the assignment done, right-click on the table in the diagram and pick DDL Preview. You should see that the DDL includes the columns using the newly defined data type.

[Screenshot: VARIANT DDL]

Test It!

In my case I just copied the DDL from the preview window and pasted it into the Snowflake Web UI, then executed it.
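For reference, the DDL I pasted looked something like the simplified sketch below (the table and column names are illustrative, not the exact ones from my model), followed by a quick smoke test of the VARIANT column:

-- Table generated from the SDDM model, including the custom VARIANT column
CREATE TABLE raw_region_stage (
    load_dts     TIMESTAMP_NTZ NOT NULL,
    rec_src      VARCHAR(50)   NOT NULL,
    region_json  VARIANT
);

-- Load one JSON document and query it back with Snowflake's path syntax
INSERT INTO raw_region_stage
  SELECT CURRENT_TIMESTAMP::TIMESTAMP_NTZ, 'manual test',
         PARSE_JSON('{"region_name": "AMERICA", "nation_count": 5}');

SELECT region_json:region_name::STRING  AS region_name,
       region_json:nation_count::NUMBER AS nation_count
  FROM raw_region_stage;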

Success! It worked!

I can now use Oracle SQL Developer Data Modeler to design tables for a Snowflake database that include columns using our innovative VARIANT data type.

And now you know how to do it too!