TPC-DS at 100TB and 10TB Scale Now Available in Snowflake’s Samples

We are happy to announce that a full 100 TB version of the TPC-DS data set, along with samples of all 99 of the benchmark's queries, is now available to all Snowflake customers for exploration and testing. We also provide a 10 TB version if you are interested in smaller-scale testing.

The STORE_SALES sub-schema from the TPC-DS Benchmark
Source: TPC Benchmark™ DS Specification

You can find the tables in:

  • Database: SNOWFLAKE_SAMPLE_DATA
  • Schema: TPCDS_SF100TCL (100TB version) or TPCDS_SF10TCL (10TB version).

(Note that the raw data compresses in Snowflake to less than 1/3 of its original size.)
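
A quick way to confirm you can see the shared data is to run a small query against one of the fact tables. Here is a minimal sketch (any query against the sample schema will do); a COUNT(*) is answered from table metadata in Snowflake, so it returns quickly even at this scale:

-- Minimal sanity check against the 10TB sample schema.
select count(*)
from snowflake_sample_data.tpcds_sf10tcl.store_sales;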

Sample TPC-DS queries are available as a tutorial under the + menu in the Snowflake Worksheet UI:

Accessing the sample TPC-DS queries in the Snowflake Worksheet UI

What is TPC-DS?

TPC-DS data has been used extensively by database and big data companies to test performance, scalability, and SQL compatibility across a range of data warehouse queries, from fast, interactive reports to complex analytics. It reflects a multi-dimensional data model of a retail enterprise selling through three channels (stores, web, and catalog), with the data sliced across 17 dimensions including Customer, Store, Time, and Item. The bulk of the data is contained in the large fact tables (Store Sales, Catalog Sales, and Web Sales), which represent daily transactions spanning five years.

The 100TB version of TPC-DS is the largest sample relational database we know of that is available on any platform for public testing and evaluation. For perspective, the STORE_SALES table alone contains over 280 billion rows, loaded from 42 terabytes of CSV files.

Full details of the TPC-DS schema and queries, including business descriptions of each query, can be found in the TPC Benchmark™ DS Specification. To test examples of different types of queries, consider:

| Type | Queries |
|------|---------|
| Interactive (1-3 months of data scanned): simple star-join queries | 19, 42, 52, 55 |
| Reporting (1 year of data scanned): simple star-join queries | 3, 7, 53, 89 |
| Analytic (multiple years, customer patterns): customer extracts, star joins | 34, 34, 59 |
| Complex: fact-to-fact joins, windows, extensive subqueries | 23, 36, 64, 94 |
  • At 10 TB scale, the full set of 99 queries should complete in well under 2 hours on a Snowflake 2X-Large virtual warehouse.
  • At 100 TB, we recommend using the largest virtual warehouse size available. For example, on a 3X-Large warehouse, you can expect all 99 queries to complete within 7 hours.

If you plan to run identical queries multiple times or concurrently, be sure to disable Snowflake's result caching during your tests by adding the following to your script:

alter session set use_cached_result = false;
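
Putting it all together, a test session might look like the following sketch. The warehouse name is made up for illustration; choose a size and auto-suspend setting appropriate for your account:

-- Hypothetical setup for a 10TB TPC-DS test run (warehouse name is illustrative).
create warehouse if not exists tpcds_test_wh
  warehouse_size = '2X-LARGE'
  auto_suspend = 300
  auto_resume = true;

use warehouse tpcds_test_wh;
use schema snowflake_sample_data.tpcds_sf10tcl;

-- As noted above, disable the result cache so repeated runs measure real execution time.
alter session set use_cached_result = false;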

TPC-DS Benchmark Kit and Working with Date Ranges

While we provide samples of the 99 queries containing specific parameter values, the TPC-DS Benchmark Kit includes tools for generating random permutations of parameters for each query — which is what we use in our internal testing.

In all queries, the date ranges are supplied using predicates on the DATE_DIM table — as specified by the TPC-DS benchmark — rather than using Date Key restrictions directly on the large fact tables (a strategy that some vendors have used to unrealistically simplify queries). If you want to create variations on these queries without using the benchmark kit, you can create versions that scan different ranges by changing the year, month and day restrictions in the WHERE clauses.
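
For example, the following sketch (an illustrative star join, not one of the 99 official queries) restricts the STORE_SALES scan entirely through DATE_DIM predicates; changing the d_year and d_moy values changes how much data is read:

-- Revenue by brand for a single month; the scan range is controlled by DATE_DIM.
select i.i_brand,
       sum(ss.ss_ext_sales_price) as brand_revenue
from snowflake_sample_data.tpcds_sf10tcl.store_sales ss
  join snowflake_sample_data.tpcds_sf10tcl.date_dim d
    on ss.ss_sold_date_sk = d.d_date_sk
  join snowflake_sample_data.tpcds_sf10tcl.item i
    on ss.ss_item_sk = i.i_item_sk
where d.d_year = 2002
  and d.d_moy = 11   -- widen or move this range to vary the amount of data scanned
group by i.i_brand
order by brand_revenue desc
limit 10;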

Conclusion

TPC-DS data (and other sample data sets) are made available to you through Snowflake’s unique Data Sharing feature, which allows the contents of any database in Snowflake to be shared with other Snowflake customers without requiring copies of the data.

We hope you enjoy working with this demanding and diverse workload, and invite you to compare your Snowflake results with other platforms.

And, be sure to keep an eye on this blog or follow us on Twitter (@snowflakedb) for all the news and happenings here at Snowflake.

What makes Snowflake a data warehouse?

One of the most common questions I get when speaking to people about Snowflake is: “Why do you call it a data warehouse and not a database?” This is a very reasonable question given some of the characteristics of Snowflake.

At Snowflake, we describe ourselves, in part, as a full relational database management system (RDBMS) built for the cloud. We are ACID compliant and we support standard SQL. That sounds like a database to me, too. Let's take a closer look just to be sure.

What is a database?

A database is a collection of information organized to be easily accessed, managed and updated. While there are many types of databases available today, the most common is an RDBMS. But when most folks say “database”, they usually mean a traditional RDBMS that handles Online Transaction Processing (OLTP).

So, what are some of the defining characteristics of an OLTP database?

  • Designed for rapid storage and retrieval of small sets of current data records in support of transactions and interactions within an enterprise.
  • Data is organized in tables and columns, allowing users access via structured query language (SQL).
  • Handles quick, real-time activity such as entering a customer name, recording a sale and recording all accounting activity of that sale.
  • Works well for basic operational reporting of a limited number of records. Analytic reporting is relegated to simple, static reports often driven by IT.

What is a data warehouse?

Some of the defining characteristics of a data warehouse are:

  • A database designed to store and process large volumes of current and historical data collected from multiple sources inside and outside the enterprise for deep analysis.
  • Organizes data into tables and columns, and allows users access via SQL.
  • Optimized for loading, integrating and analyzing very large amounts of data.
  • Designed to support descriptive, diagnostic, predictive and prescriptive analytic workloads.

Snowflake definitely has the overlapping characteristics of both a database and a data warehouse: ACID compliance, support for standard SQL, and so on. But Snowflake also embodies all of the defining characteristics of a data warehouse.

One of the key differentiators between Snowflake and other solutions is that Snowflake was specifically designed for data warehousing and high-speed analytic processing. Rather than a general-purpose SQL database that has been "tuned" or adapted to handle these types of workloads, Snowflake was built from the ground up for the cloud to optimize loading, processing, and query performance for very large volumes of data. Therefore, hands down, Snowflake is a data warehouse.

So, why do we still need a specialized data warehouse engine?

As OLTP databases have been able to scale higher and innovations like in-memory databases have emerged, some organizations have questioned whether they still need a separate technology or specialized system for reporting and analytics. The answer, again, requires us to look at the basics: What benefits emerge from storing and analyzing data in a separate system?

  1. It eases the burden of reporting from transactional systems by removing the contention for limited and expensive resources.
  2. It produces more business-friendly data results by allowing the data to be restructured to a more suitable format.
  3. It provides access to a wider array of reports more quickly because all the resources in the data warehouse are dedicated to reporting and analysis.
  4. It integrates valuable data from across the enterprise for richer insight, something that can't (and shouldn't) be done in an OLTP system.

For more information on how you can up your data warehousing game with a modern, built-for-the-cloud approach, check out some of our free resources such as our ebook The Data Warehouse: The Engine That Drives Analytics. We would love to help you on your journey to the cloud so keep an eye on this blog or follow us on Twitter (@snowflakedb and @kentgraziano) to keep up with all the news and happenings here at Snowflake.

Understanding Your Snowflake Utilization, Part 2: Storage Profiling

This article is the second in a three-part series to help you use Snowflake’s Information Schema to better understand and effectively utilize Snowflake.

As a Customer Success Engineer, my daily job entails helping our customers get the most value from our service. And I’m now passing along some of what I’ve learned to help you become more self-sufficient. In my first post, I discussed getting a handle on your utilization of compute resources by using various Information Schema views and functions to profile your virtual warehouse usage.

In this post, I provide a deep-dive into understanding how you are utilizing data storage in Snowflake at the database, stage, and table level. To do this, I will show you examples of two functions and a view provided in the Information Schema for monitoring storage usage. I will also show you a handy page in the UI that provides an account-level view of your storage. Keep in mind that you need ACCOUNTADMIN access to perform any of the tasks described in this post.
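
For example, before running any of the queries below, you would switch to that role (assuming it has been granted to you):

use role accountadmin;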

Let’s get started.

Summary Storage Profiling in the UI

Before diving into our detailed analysis of data storage, let’s take a quick look at the summary, account-level storage view provided by Snowflake. As a user with the ACCOUNTADMIN role, you can navigate to the Account page in the Snowflake UI to get a visual overview of the data storage for your account.


This page provides a view, by month, of the average and daily storage usage across your entire account. You can use the filters on the page to break this down by database, Snowflake stage, and data maintained in Fail-safe (for disaster recovery).

Detailed Storage Profiling Using the Information Schema

The Snowflake Information Schema provides two functions and one view for monitoring detailed data storage at the database, stage, and table level:

  • DATABASE_STORAGE_USAGE_HISTORY (function)
  • STAGE_STORAGE_USAGE_HISTORY (function)
  • TABLE_STORAGE_METRICS (view)

The DATABASE_STORAGE_USAGE_HISTORY table function shows daily storage usage for all databases in your account, or for a specified database. Here's an example of the usage over the last 10 days for a database named SALES:

use warehouse mywarehouse;

select *
from table(sales.information_schema.database_storage_usage_history(dateadd('days', -10, current_date()), current_date(), 'SALES'));

Note that the above screenshot only displays some of the output columns. For full details about the output, see the online documentation. Also, per the Snowflake documentation:

If a database has been dropped and its data retention period has passed (i.e. database cannot be recovered using Time Travel), then the database name is reported as DROPPED_id.

At its core, the most useful insight from this function is the average growth of your databases. Keep in mind that the output includes both AVERAGE_DATABASE_BYTES and AVERAGE_FAILSAFE_BYTES. Using these data points to derive the percentage of Fail-safe storage relative to actual database size should give you an idea of how much you are investing in Fail-safe storage. If certain data is not mission critical and doesn't require Fail-safe, consider making those tables transient. More granular information about Fail-safe data is provided in TABLE_STORAGE_METRICS, which we will look at more closely later in this post.
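
As a sketch of that calculation (the column names follow the function's documented output, and SALES is the same example database used above), you could compute the Fail-safe share per day like this:

-- Fail-safe bytes as a share of database bytes, per day, for the last 10 days.
select usage_date,
       average_database_bytes,
       average_failsafe_bytes,
       average_failsafe_bytes / nullif(average_database_bytes, 0) as failsafe_ratio
from table(sales.information_schema.database_storage_usage_history(dateadd('days', -10, current_date()), current_date(), 'SALES'))
order by usage_date;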

Next, let's look at STAGE_STORAGE_USAGE_HISTORY. This function shows you how much storage is being used for staged files across all your Snowflake staging locations, including named internal stages. Note that this function does not allow querying storage for individual stages.

Here’s an example of staged file usage for the last 10 days:

select *
from table(sales.information_schema.stage_storage_usage_history(dateadd('days',-10,current_date()),current_date()));

Note that the above screenshot only displays some of the output columns. For full details about the output, see the online documentation.

Also note that you can only query up to 6 months' worth of data using this function. Some of our users like to use Snowflake stages to store their raw data. For example, one user leverages table staging locations for raw data storage in case they need to access the data in the future. There's nothing wrong with this approach, and since Snowflake compresses your staged data files, it certainly makes sense; just remember that only the last 6 months of staged storage history is available through this function.
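
If you want a month-over-month view within that window, a small aggregation over the same function works. This is a sketch that assumes the documented USAGE_DATE and AVERAGE_STAGE_BYTES output columns:

-- Average staged bytes by month over the (at most) 6-month history window.
select date_trunc('month', usage_date) as usage_month,
       avg(average_stage_bytes) as avg_stage_bytes
from table(sales.information_schema.stage_storage_usage_history(dateadd('months', -6, current_date()), current_date()))
group by 1
order by 1;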

Finally, the TABLE_STORAGE_METRICS view shows your table-level storage at runtime. This is a snapshot of your table storage, including both active and Fail-safe storage. Additionally, you can derive cloned storage using the CLONE_GROUP_ID column. As of today, this is the most granular level of storage detail available to users.

Here’s a general use example:

select *
from sales.information_schema.table_storage_metrics
where table_catalog = 'SALES';

Note that the above screenshot only shows a portion of the output columns. For full details about the output, see the online documentation.

One interesting analysis I’ve been helping our customers with is deriving how much of their table storage is based on cloned data. In Snowflake, cloning data has no additional costs (until the data is modified or deleted) and it’s done very quickly. All users benefit from “zero-copy cloning”, but some are curious to know exactly what percentage of their table storage actually came from cloned data. To determine this, we’ll leverage the CLONE_GROUP_ID column in TABLE_STORAGE_METRICS.

For example:

with storage_sum as (
  select clone_group_id,
         sum(owned_active_and_time_travel_bytes) as owned_bytes,
         sum(active_bytes) + sum(time_travel_bytes) as referred_bytes
  from concurrency_wh.information_schema.table_storage_metrics
  where active_bytes > 0
  group by 1)
select *, referred_bytes / owned_bytes as ratio
from storage_sum
where referred_bytes > 0 and ratio > 1
order by owned_bytes desc;

The ratio in the above query gives you an idea of how much of the original data is being “referred to” by the clone. In general, when you make a clone of a table, the CLONE_GROUP_ID for the original table is assigned to the new, cloned table. As you perform DML on the new table, your REFERRED_BYTES value gets updated.  If you join the CLONE_GROUP_ID back into the original view, you get the output of the original table along with the cloned table. A ratio of 1 in the above example means the table data is not cloned.

If you need to find out the exact table name from the above query, then simply join the CTE back to the TABLE_STORAGE_METRICS view and ask for the TABLE_NAME column.

For example:

with storage_sum as (
  select clone_group_id,
         sum(owned_active_and_time_travel_bytes) as owned_bytes,
         sum(active_bytes) + sum(time_travel_bytes) as referred_bytes
  from concurrency_wh.information_schema.table_storage_metrics
  where active_bytes > 0
  group by 1)
select b.table_name, a.*, referred_bytes / owned_bytes as ratio
from storage_sum a
join concurrency_wh.information_schema.table_storage_metrics b
on a.clone_group_id = b.clone_group_id
where referred_bytes > 0 and ratio > 1
order by owned_bytes desc;


Conclusion

By using the UI and the Information Schema functions and views described in this post, you can profile your data storage, keep your storage costs under control, and understand how your business is growing over time. It's a good idea to take regular snapshots of your storage so that you can analyze growth month-over-month, which will help you both develop insight into your usage and take action.
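
One simple way to do that is to append a dated copy of TABLE_STORAGE_METRICS to a history table on a schedule. This is only a sketch; the ADMIN_DB.MONITORING.TABLE_STORAGE_HISTORY name is made up for illustration:

-- First run: create an empty history table with the same shape plus a snapshot date.
create table if not exists admin_db.monitoring.table_storage_history as
select current_date() as snapshot_date, t.*
from sales.information_schema.table_storage_metrics t
where 1 = 0;

-- Each snapshot: append the current table-level storage metrics.
insert into admin_db.monitoring.table_storage_history
select current_date() as snapshot_date, t.*
from sales.information_schema.table_storage_metrics t;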

To dig deeper into this subject, check out our online documentation.

I hope this article has given you some good ideas for how to manage your Snowflake instance. Look for Part 3 of this series in the coming weeks, where I will show you how to analyze your query performance. As shown in Parts 1 and 2, there are a lot of options to play with in Snowflake, and they're all intended to give you the flexibility and control you need to make the best use of Snowflake. Please share your thoughts with us!

Also, for more information, please feel free to reach out to us at info@snowflake.net. We would love to help you on your journey to the cloud. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.