The Dream Data Warehouse Development Environment

Earlier this month, Snowflake’s Customer Enablement Team was assigned an email from one of our customers. The customer stated that he was not happy about the idea of cloning full copies of production databases for development purposes. “Do we really want to develop and maintain a system to copy a fraction of the production DB to dev?”, citing the reason for his message that, by just copying the entire production database, the dev team would have access to too much data. Being a veteran of Snowflake, I initially dismissed his concern because of Snowflake’s zero-copy clone capability, as outlined in this article. From my perspective, the zero-copy clone would not incur any additional cost for development purposes, so why not give the dev team all of the data?

The answer of course, as the customer pointed out, has to do with making the development effort more efficient. The assumption of zero-copy clone equating to zero-cost development is, of course, incorrect. There’s the cost of querying the data (requiring virtual warehouse credits) and the cost behind each development hour. After all, longer query times lead to longer development iterations and longer testing cycles. To create a “blissful” development environment in Snowflake, we need a more refined approach towards building the development data set.

The approach outlined by the customer was rooted in complimenting Snowflake’s zero-copy clone with the additional technique of using Block Sampling. In other words, they proposed creating a view containing a sample set from the original table. This approach enables an administrator to quickly set up a dev environment with minimal data. Controlling the amount of sample data is a good thing for many development situations because developers seldom require access to the full dataset.

Ok, let’s take a look at SAMPLE / TABLESAMPLE and see how we can do this. The syntax is quite simple:

 
 SELECT ...
 FROM ...
   { SAMPLE | TABLESAMPLE } [ samplingMethod ] ( <probability> ) [ { REPEATABLE | SEED } ( <seed> ) ]
 [ ... ]
 
 -- Where:
    samplingMethod :: = { { BERNOULLI | ROW } | { SYSTEM | BLOCK } }


Note that there are some interchangeable terms we should pay attention to when writing the query. These terms are synonyms and only differ in syntax:

SAMPLE | TABLESAMPLE

BERNOULLI | ROW

SYSTEM | BLOCK

REPEATABLE | SEED

The two main methods of sampling are ROW (or BERNOULLI) and BLOCK (or SYSTEM) sampling. Let’s take a closer look at each one. 

Row Sampling

This approach uses the Bernoulli principle to select data by applying a probability of p/100 to each row. In the Snowflake documentation, we mention that this is similar to “flipping a weighted coin” on each row. The number of sampled (selected) rows should be equal to (p/100) * n, where n is the total number of rows in the table and p is the sample probability value set by the user.

This method implies that we will iterate through each row and calculate a probability at each row to match a row candidate. This iteration is going to impact query performance. However, the advantage here is that we will end up with a formal and concise distribution of samples from our data. Regarding creating a dev environment, using the Bernoulli sampling method to create a view does not result in better performance. However, you could certainly create a new data table based on this sampling method and still scan fewer data. For example:

create table dev_Sales_LineItem_Bernoulli_20 as

    select *

    from SALES.PUBLIC.LINEITEM

    sample bernoulli (20);

This statement will create a development table from the LINEITEM table in our SALES database with 20% of the rows from the original table.

I won’t focus too much on this method in this post. You can feel free to experiment on your own to see if this sampling method would suit your needs. Instead, I will talk more about Block Sampling.

Block Sampling

In this approach, we apply a probability of p/100 to each block of rows. For those who are familiar with Snowflake’s micro partitions (details below), block sampling chooses individual partitions based on a specific probability. If you simply want quick access to production data and to run queries against a small percentage of the rows in a table, leveraging Block Sampling is a good way to go.

Let’s look at this with a specific example using the Block Sampling method on the LINEITEM table in the SALES database. Here are the table metadata:

In this example, I have a developer who is working on an aggregation query against the LINEITEM table. He created a simple query like this:

select

l_returnflag,

l_linestatus,

sum(l_quantity) as sum_qty,

sum(l_extendedprice) as sum_base_price,

sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,

sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,

avg(l_quantity) as avg_qty,

avg(l_extendedprice) as avg_price,

avg(l_discount) as avg_disc,

count(*) as count_order

from

lineitem

where

l_orderdate <= dateadd(day, -90, to_date('1998-12-01'))

group by

l_returnflag,

l_linestatus

order by

l_returnflag,

L_linestatus;

Result:

 

This query reports the amount of business that was billed, shipped and returned. If the developer has decided to issue a query against this table, we would see the following outcome in our query plan. (Note: the below result was performed on an XS warehouse):


It took about 19 seconds on an XS warehouse, which is not bad by any means. However, we can see that the query still performed a full table scan and performed a pretty good amount of aggregation.

Now, let’s see how we can improve the experience for the developer by creating a sample set of the LINEITEM table. We’ll create a special view (against our cloned dev database) for this developer and call it V_SALES_LINEITEM_SAMPLE_10. In this view, we’ll use the Block Sampling technique and only scan 10% of the partitions required by this query.

First create the view:

create view v_Sales_LineItem_Sample_10 as

    select *

    from SALES.PUBLIC.LINEITEM

    sample block (10);

Next, let’s update the query to use the view instead.

select

       l_returnflag,

       l_linestatus,

       sum(l_quantity) as sum_qty,

       sum(l_extendedprice) as sum_base_price,

       sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,

       sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,

       avg(l_quantity) as avg_qty,

       avg(l_extendedprice) as avg_price,

       avg(l_discount) as avg_disc,

       count(*) as count_order

  from

       v_Sales_LineItem_Sample_10

where

       l_orderdate <= dateadd(day, -90, to_date('1998-12-01'))

group by

       l_returnflag,

       l_linestatus

 order by

       l_returnflag,

       l_linestatus;

Result:

 

Let’s take a look at the query profile for this query:

 

This query ran about 2.5 seconds on an XS warehouse, and we scanned about 10% of the total table partition. This result is much better! Snowflake is sampling 10% of the partition to pull based on the WHERE filter. The result should still be accurate enough to let the developer know this query is working as expected.

In conclusion, leveraging Snowflake’s SAMPLE clause considerably reduces the amount of test dataset for a developer without losing data integrity. Even if a specific id or a timestamp biases the native data, chances are the developer is going to accept the nuance and continue with the development instead of spending additional compute credit and hours waiting for the results to return. The alternative is to use the LIMIT function perhaps to reduce the number of rows being returned. However, this involves modifying the original query or creating a view that still needs to be maintained. Using the SAMPLE clause, you can reliably subset a large table with acceptable result mix and performance. I hope you find this helpful. Feel free to leave comments in our community and let me know your thoughts.

 

TPC-DS at 100TB and 10TB Scale Now Available in Snowflake’s Samples

We are happy to announce that a full 100 TB version of TPC-DS data, along with samples of all the benchmark’s 99 queries, are available now to all Snowflake customers for exploration and testing. We also provide a 10TB version if you are interested in smaller scale testing.

STORE_SALES sub-schema from the TPC-DS Benchmark

The STORE_SALES sub-schema from the TPC-DS Benchmark
Source: TPC Benchmark™ DS Specification

You can find the tables in:

  • Database: SNOWFLAKE_SAMPLE_DATA
  • Schema: TPCDS_SF100TCL (100TB version) or TPCDS_SF10TCL (10TB version) .

(Note that the raw data compresses in Snowflake to less than 1/3 of it’s original size.)

Sample TPC-DS queries are available as a tutorial under the + menu in the Snowflake Worksheet UI:

TPC-DS tutorial in Snowflake web interface

Accessing Sample TPC-DS queries in the Snowflake Worksheet UI

What is TPS-DS?

TPC-DS data has been used extensively by Database and Big Data companies for testing performance, scalability and SQL compatibility across a range of Data Warehouse queries — from fast, interactive reports to complex analytics. It reflects a multi-dimensional data model of a retail enterprise selling through 3 channels (stores, web, and catalogs), while the data is sliced across 17 dimensions including Customer, Store, Time, Item, etc. The bulk of the data is contained in the large fact tables: Store Sales, Catalog Sales, Web Sales — representing daily transactions spanning 5 years.

The 100TB version of TPC-DS is the largest public sample relational database we know of available on any platform for public testing and evaluation. For perspective, the STORE_SALES table alone contains over 280 billion rows loaded using 42 terabytes of CSV files.

Full details of the TPC-DS schema and queries, including business descriptions of each query, can be found in the TPC Benchmark™ DS Specification. To test examples of different types of queries, consider:

 Type  Queries
 Interactive (1-3 months of data scanned) — Simple star-join queries  19, 42, 52, 55
 Reporting (1 year of data scanned) — Simple star-join queries  3, 7, 53, 89
 Analytic (Multiple years, customer patterns) — Customer extracts, star joins  34, 34, 59
 Complex — Fact-to-fact joins, windows, extensive subqueries  23, 36, 64, 94
  • At 10 TB scale, the full set of 99 queries should complete in well under 2 hours on a Snowflake 2X-Large virtual warehouse.
  • At 100 TB, we recommend using the largest size virtual warehouse available. For example, on a 3X-Large warehouse, you can expect all 99 queries to complete within 7 hours.

Note that, if you plan to run identical queries multiple times or concurrently, be sure to disable result caching in Snowflake when you run tests by adding the following to your script:

alter session set use_cached_result = false;

TPS-DS Benchmark Kit and Working with Date Ranges

While we provide samples of the 99 queries containing specific parameter values, the TPC-DS Benchmark Kit includes tools for generating random permutations of parameters for each query — which is what we use in our internal testing.

In all queries, the date ranges are supplied using predicates on the DATE_DIM table — as specified by the TPC-DS benchmark — rather than using Date Key restrictions directly on the large fact tables (a strategy that some vendors have used to unrealistically simplify queries). If you want to create variations on these queries without using the benchmark kit, you can create versions that scan different ranges by changing the year, month and day restrictions in the WHERE clauses.

Conclusion

TPC-DS data (and other sample data sets) are made available to you through Snowflake’s unique Data Sharing feature, which allows the contents of any database in Snowflake to be shared with other Snowflake customers without requiring copies of the data.

We hope you enjoy working with this demanding and diverse workload, and invite you to compare your Snowflake results with other platforms.

And, be sure to keep an eye on this blog or follow us on Twitter (@snowflakedb) for all the news and happenings here at Snowflake.