RI (Referential Integrity) Constraints: 3 Reasons to Include Them in Your Data Warehouse

Over the years, I have had numerous conversations about the value of having referential integrity (RI) constraints, such as primary and foreign keys, in a relational data warehouse or data mart.

Many DBAs object that RI constraints slow the load process. This is a valid point if you are talking about enforced constraints that are checked in real time during the load. But this is not an issue if you define the constraints as disabled.

Which then leads to this common question:

Is there any reason to maintain a permanently disabled FK in the data model? If it is not going to be enabled, then from my perspective, it doesn’t make any sense to define the FK. Instead, the relationship can be described in the comment of the child column.

So, why would I want RI constraints in my data warehouse?

Mostly it has to do with good design and development best practices. Here is my rationale for why you should consider including RI constraints in your data warehouse design.

#1 – Design Metadata

RI constraints are valuable metadata/documentation. If somebody reverse engineers the database (say with ERWin or Oracle Data Modeler), the PKs and FKs show up in the diagram (much better than having to read a column comment to discover a relationship). This is quite valuable for new people on your project to orient themselves to the existing schema design and understand how the various tables in your data warehouse are related to each other.

Figure: RI constraints shown in a reverse-engineered diagram. A picture is worth a thousand words.

#2 – BI Metadata

If you want to use any sort of reporting or BI tool against the database (it is a data warehouse, after all), you will find that most modern business intelligence and visualization tools import the foreign key definitions along with the tables and build the proper join conditions. This is much better than having someone guess what the join should be and then manually add it to the metadata layer in the reporting tool. It also ensures that different developers don’t interpret the joins differently.

Examples of tools that can read the Snowflake data dictionary include Looker, Tableau, COGNOS, MicroStrategy, and many others. Some of these tools actually use the FK definitions for join culling to provide better query performance.
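For reference, here is a rough sketch of the kind of dictionary query a tool (or a curious developer) could run to list the declared foreign keys; the INFORMATION_SCHEMA views shown are standard, but the exact metadata a given BI tool reads will vary:

-- Sketch: list declared foreign keys from the Snowflake data dictionary
SELECT tc.TABLE_NAME              AS child_table,
       tc.CONSTRAINT_NAME         AS fk_name,
       rc.UNIQUE_CONSTRAINT_NAME  AS referenced_constraint
FROM INFORMATION_SCHEMA.TABLE_CONSTRAINTS tc
JOIN INFORMATION_SCHEMA.REFERENTIAL_CONSTRAINTS rc
  ON rc.CONSTRAINT_NAME = tc.CONSTRAINT_NAME
 AND rc.CONSTRAINT_SCHEMA = tc.CONSTRAINT_SCHEMA
WHERE tc.CONSTRAINT_TYPE = 'FOREIGN KEY';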

#3 – QA your ETL/ELT code

I know you think your ETL code is perfect.

But does every developer test to the same standards? Do you maybe have a QA team who separately validates that the ETL is doing what you expect?

If so, having declared primary, unique, and foreign key constraints in your data warehouse gives the team more information they can use to ensure the quality of the data. In fact, using the Snowflake Information Schema, a QA engineer can potentially generate SQL to test that the loaded data conforms to the defined constraints.
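To make that concrete, here is a sketch of the kind of checks such generated SQL might perform, using the SAT_REGIONS and HUB_REGION tables from the DDL example later in this post. Both queries should return zero rows if the loaded data conforms to the declared constraints:

-- Sketch: verify the declared PK holds (no duplicate key combinations)
SELECT HUB_REGION_KEY, SAT_LOAD_DTS, COUNT(*) AS dup_count
FROM SAT_REGIONS
GROUP BY HUB_REGION_KEY, SAT_LOAD_DTS
HAVING COUNT(*) > 1;

-- Sketch: verify the declared FK holds (no orphaned satellite rows)
SELECT s.HUB_REGION_KEY
FROM SAT_REGIONS s
LEFT JOIN HUB_REGION h
  ON h.HUB_REGION_KEY = s.HUB_REGION_KEY
WHERE h.HUB_REGION_KEY IS NULL;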

Defining RI Constraints in Snowflake

You, of course, can (and IMHO should) define RI constraints in Snowflake. You can define primary keys, unique keys, foreign keys, and NOT NULL constraints. Because Snowflake is specifically engineered for data warehousing, only the NOT NULL constraints are enforced. The rest are always created as disabled.

The syntax is standard SQL. You can define the constraints both inline and out-of-line.

Here is a simple example of inline constraints:

CREATE OR REPLACE TABLE SAT_REGIONS (
  HUB_REGION_KEY NUMBER(38,0) NOT NULL,
  SAT_LOAD_DTS DATE NOT NULL,
  REGION_COMMENT VARCHAR(152),
  HASH_DIFF VARCHAR(32) NOT NULL,
  SAT_REC_SRC VARCHAR(50) NOT NULL,
  CONSTRAINT SAT_REGIONS_PK PRIMARY KEY (HUB_REGION_KEY, SAT_LOAD_DTS),
  CONSTRAINT SAT_REGIONS_FK1 FOREIGN KEY (HUB_REGION_KEY)
    REFERENCES KENT_DB.DATAVAULT_MAIN.HUB_REGION (HUB_REGION_KEY)
);

Example of an out-of-line constraint:

ALTER TABLE SAT_REGIONS ADD CONSTRAINT SAT_REGIONS_PK
  PRIMARY KEY (HUB_REGION_KEY, SAT_LOAD_DTS);
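
The foreign key from the inline example can be added out-of-line the same way (shown here as a sketch that mirrors the FK declared above):

ALTER TABLE SAT_REGIONS ADD CONSTRAINT SAT_REGIONS_FK1
  FOREIGN KEY (HUB_REGION_KEY)
  REFERENCES KENT_DB.DATAVAULT_MAIN.HUB_REGION (HUB_REGION_KEY);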

It’s that easy. So why don’t you have constraints in your data warehouse?

 

Did you hear? Snowflake was declared the #1 Cloud Data Warehouse in a recent GigaOM analyst report. Go here to get your copy of the report.

As always, keep an eye on this blog site, our Snowflake Twitter feeds (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for updates on all the action and activities here at Snowflake Computing.

Automatic Concurrency Scaling in Snowflake – Another Way the Cloud Changes the Game

The challenge with concurrency

Today we take a major step forward by extending our elastic architecture to solve another major pain point in existing on-premises and cloud data warehousing solutions: how to run massively concurrent workloads at scale in a single system.

Have you had any of the following experiences when building mission-critical applications that incorporate data analytics?

  • My application can only support a certain level of user concurrency due to the underlying data warehouse, which only allows 32-50 concurrent user queries.
  • To build my application, I need to acquire multiple data warehouse instances in order to isolate numerous workloads and users from each other. This adds to costs and complexity.
  • We have built our own scheduling policies around the data warehouse. We use query queues to control and prioritize incoming queries issued by our numerous users.
  • During peak times, users are getting frustrated because their requests are getting queued or fail entirely.

At Snowflake, we separate compute from storage through the unique concept of virtual data warehouses. That separation makes it possible to instantly resize virtual warehouses or pause them entirely, and it is why Snowflake is the only cloud data warehousing solution that allows concurrent workloads to run without impacting each other.

However, we saw the need to go a step further to offer a service that adapts to changing workloads and addresses concurrency at the same time:

  • Imagine you didn’t have any concurrency limitations on your mission-critical business application.
  • Imagine you didn’t need users to adjust their workloads to accommodate data warehouse bottlenecks.
  • Imagine the data warehouse itself could detect increasing workloads and add compute resources as needed, or shut down and pause those resources when workload activity subsides.
  • Imagine your application could scale out-of-the-box with one single (virtual) data warehouse without the need to provision additional data warehouses.
  • Imagine a world without any scheduling scripts and queued queries – a world in which you can leverage a smart data warehousing service that ensures all your users get their questions answered within the application’s SLA.

With Snowflake, you can do all of this for real, not just in your imagination, with our new multi-cluster data warehouse feature.

Multi-cluster data warehouses

A virtual warehouse represents a number of physical nodes a user can provision to perform data warehousing tasks, e.g. running analytical queries. While a user can instantly resize a warehouse by choosing a different size (e.g. from small to 3X large), until now a virtual data warehouse in Snowflake always consisted of one physical cluster.

With the recent introduction of multi-cluster warehouses, Snowflake supports allocating, either statically or dynamically, more resources for a warehouse by specifying additional clusters for the warehouse.


Figure 1: How automatic scaling works

 

The figure above shows a multi-cluster DW that consists of three compute clusters. All compute clusters in the warehouse are of the same size. The user can choose from two different modes for the warehouse:

  • Maximized: When the warehouse is started, Snowflake always starts all the clusters to ensure maximum resources are available while the warehouse is running.
  • Auto-scale: Snowflake starts and stops clusters as needed to dynamically manage the workload on the warehouse.

As always, in Snowflake a user can either leverage the user interface or use SQL to specify the minimum/maximum number of clusters per multi-cluster DW:

Create Warehouse UI wizard


Figure 2: Create Warehouse in UI Wizard

 

Create Warehouse SQL script


Figure 3: Create Warehouse via SQL
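
For readers who prefer text to screenshots, a minimal sketch of such a statement might look like the following (the warehouse name and values are illustrative, not the exact script shown in the figure):

CREATE WAREHOUSE REPORTING_WH
  WAREHOUSE_SIZE = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1   -- setting MIN equal to MAX gives Maximized mode
  MAX_CLUSTER_COUNT = 3   -- MIN lower than MAX enables Auto-scale mode
  AUTO_SUSPEND = 300
  AUTO_RESUME = TRUE;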

 

Similar to regular virtual warehouses, a user can instantly resize all clusters of a multi-cluster warehouse by choosing a different size (e.g. XS, S, M, L, …) either through the UI or programmatically via the corresponding SQL DDL statements. In auto-scale mode, Snowflake automatically adds or resumes additional clusters (up to the maximum number defined by the user) as soon as the workload increases. If the load subsides again, Snowflake shuts down or pauses the additional clusters. No user interaction is required; this all takes place transparently to the end user.
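
As a sketch of the corresponding DDL (again with an illustrative warehouse name), resizing the warehouse and adjusting its cluster range look like this:

ALTER WAREHOUSE REPORTING_WH SET WAREHOUSE_SIZE = 'LARGE';
ALTER WAREHOUSE REPORTING_WH SET MAX_CLUSTER_COUNT = 5;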

Internally, the query scheduler takes multiple factors into account when making these decisions. The two main factors in this context are:

  1. The memory capacity of the cluster, i.e. whether clusters have reached their maximum memory capacity
  2. The degree of concurrency in a particular cluster, i.e. whether there are many queries executing concurrently on the cluster

As we learn more from our customers’ use cases, we will extend this feature further and share interesting use cases where multi-cluster data warehouses make a difference. Please stay tuned as we continue reinventing modern data warehousing and analytics by leveraging the core principles of cloud computing.

We would like to thank our co-authors, Florian Funke and Benoit Dageville, for their significant contributions as main engineer and architect, respectively, to making multi-cluster warehouses a reality. As always, keep an eye on the blog and our Snowflake Twitter feed (@SnowflakeDB) for updates on Snowflake Computing.

 

 

Making Data Warehousing Easy

Legacy Problems

Organizations with legacy on-premises data warehouses spend a lot of time and money managing their environments and keeping up with business demands. Because of the size of the investments, organizations often run their data warehouses close to full utilization. While this may meet their current needs, the inherent lack of scalability can mean compromises on performance, or failure to meet SLAs, when more workloads, data sources, and users need to be added. Then the journey begins to add more capacity. Organizations often need to acquire specialized resources or become reliant on legacy vendors to manage and maintain the environment. All of this means that if there is a spike in demand, these organizations either cannot accommodate the growth without impacting performance or must absorb the additional cost of a larger footprint that sits unused until the next brief spike arrives.

Dealing with Growth

With performance concerns come the typical headaches of any data warehouse environment. These include, but are not limited to, growing the environment, finding qualified resources for performance tuning, optimizing queries, and dealing with concurrency and user growth. At the same time, businesses are facing stiffer competition, and end users are clamoring for faster answers to their business questions. In the past, data warehousing was limited to a small set of users, typically in marketing or finance. Now even field sales reps want access to up-to-date data, creating more load on the data warehouse. And the more data you have, the more important security becomes and the more performance costs. So organizations are not only keeping the lights on, but also spending more to get performance and to secure the environment. In short, data warehouses have become more difficult to maintain and run!

An example of an organization facing this challenge of scalability is CapSpecialty, a leading provider of specialty insurance for small to mid-sized businesses. CapSpecialty used a legacy data warehouse to support the analytics its actuarial users needed to understand how to price and package products in various geographies. With the increased demand for access to this data, the legacy environment required a significant upgrade. Performance impacts led to users having to start their queries before leaving the office for the weekend, hoping they would be completed when they returned to work on Monday. The legacy environment also limited their ability to report in a timely manner on important KPIs that were critical to running the business. As with any financial organization, the environment also needed to be very secure, since it stored the crown jewels: customers’ risk profiles and related financial data. Unfortunately, upgrading their environment to meet this increased demand was going to cost $500K just for licensing, and that would only give them a 2X increase in performance. That does not even include the costs of deployment, management, and hosting for the new environment.

Making Data Warehousing Easy

The need for a scalable, more cost-effective solution led them to Snowflake. After evaluating a number of data warehouse options, CapSpecialty decided to implement the Snowflake cloud-based Elastic Data Warehouse. Besides offering an attractive cost structure, Snowflake’s true cloud solution delivered ease of migration and scalability. With Snowflake, CapSpecialty was up and running in less than a week. In addition to achieving a 200x increase in query performance, they leveraged existing infrastructure and were set up to scale for future growth. Snowflake also provided end-to-end enterprise-level security to protect their sensitive financial data in the cloud.

CapSpecialty underwriters are now able to analyze 10 years’ worth of governed data in 15 minutes. The stage has also been set for CapSpecialty executives to view dashboards that display real-time profitability and KPIs. Using Snowflake, CapSpecialty can also bring semi-structured data to the environment, and serve the analytics to their field agents to effectively market their products in various geographies.

To learn the details of how Snowflake made data warehousing easy for CapSpecialty, we encourage you to read more in the case study. You can also attend our webinar on April 27th, 2016, 10:00 AM PST/1:00 PM EST, to find out how Snowflake and MicroStrategy enable CapSpecialty analysts to understand data in real time.