Modern Data Sharing: The Opportunities Are Endless

Data sharing between organizations for commercial purposes has been around for over 100 years. But until very recently, enterprises have been forced to rely on traditional data sharing methods that are labor-intensive, costly and error-prone. These methods are also more open to hackers and produce stale data. Snowflake Data Sharing, one of the newest innovations to Snowflake’s cloud-built data warehouse, has eliminated those barriers and enabled enterprises to easily share live data in real time via one-to-one, one-to-many and many-to-many relationships. Best of all, the shared data between data providers and data consumers doesn’t move.

Below is an example of how Snowflake Data Sharing created a live, secure data share at a fraction of the time and cost of a standard method. Most interestingly, this application of Snowflake Data Sharing shows that the range of problems modern data sharing can solve is endless.

The Challenge

The Federal Risk and Authorization Management Program (FedRAMP) “is a government-wide program that provides a standardized approach to security assessment, authorization, and continuous monitoring for cloud products and services.” Complying with the program’s approximately 670 security requirements and collecting supporting evidence is a significant challenge. But having to do so monthly, as required, is a Sisyphean task if you attempt it manually. Cloud providers must complete an inventory of all of their FedRAMP assets, which includes binaries, running network services, asset types and more. Automation is really the only logical approach to solving this FedRAMP inventory conundrum.

The Run-of-the-mill Method

Often, people gravitate to what’s familiar, so it’s no surprise we initially considered a solution that combined an AWS tool, an IT automation tool and some Python to clean up the data. However, we estimated a significant effort to develop, test, and deploy the code, separate from the required, ongoing maintenance. Instead, we took a different approach.

The Solution

Snowflake’s Data Sharing technology is a fast, simple and powerful solution that allows us to maintain our security governance posture while sharing live data in real time without having to move the data. With modern data sharing, there is the concept of a data provider and a data consumer. For this project we engaged Lacework, a Snowflake security partner, as our initial data provider.

Lacework provides us with advanced threat detection by running their agents on our servers to help capture relevant activities, organize events logically, baseline behaviors, and identify deviations. In doing so, Lacework tracks file modifications, running processes, installed software packages and network sockets, and monitors for suspicious activities in one’s AWS account. All of that data is then analyzed and stored in their Snowflake account. Lacework is both a Snowflake security vendor and a Snowflake data warehouse-as-a-service customer: they use Snowflake to deliver their security services to Snowflake. In short, Lacework already collects the data required to complete the FedRAMP system inventory collection task.

We contacted Lacework, presented them with our FedRAMP challenge and suggested leveraging data sharing between their Snowflake account and our Security Snowflake account. Within a couple of hours, they provided us with live FedRAMP data through Snowflake Data Sharing. Yes, you read that right: it took only a couple of hours. The following describes the steps for creating, sharing, and consuming the data:

Data Provider (Lacework) steps

  1. Create a share and give it a name
  2. Grant privilege to the database and its objects (schemas, tables, views)
  3. Alter share to add other Snowflake accounts to the share
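
As a rough illustration only, here is what those three provider-side steps might look like in Snowflake SQL. The share, database, schema, table and account names below are hypothetical placeholders, not the actual objects Lacework used:

-- Step 1: create a named share (all object names are hypothetical)
create share fedramp_share;

-- Step 2: grant privileges on the database and the objects to be shared
grant usage on database lacework_db to share fedramp_share;
grant usage on schema lacework_db.inventory to share fedramp_share;
grant select on table lacework_db.inventory.fedramp_assets to share fedramp_share;

-- Step 3: add the consumer's Snowflake account to the share
alter share fedramp_share add accounts = snowflake_security_account;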

Data Consumer (Snowflake Security) steps

  1. Create Database from the share
  2. Perform some SQL commands and voila! We have our FedRAMP system inventory data (we redacted the results for security reasons), as sketched below
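
Again as a hedged sketch, with hypothetical names standing in for the real share and objects, the consumer-side steps reduce to roughly this:

-- Step 1: create a database from the incoming share
-- (provider account, share, schema and table names are hypothetical)
create database fedramp_inventory from share lacework_account.fedramp_share;

-- Step 2: query the shared, live data directly
select asset_type, count(*) as asset_count
from fedramp_inventory.inventory.fedramp_assets
group by asset_type;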

Again, the whole data sharing effort took just a few hours.

Beyond FedRAMP

You are probably asking yourself at this point, “Why didn’t you just ask Lacework to generate a FedRAMP report instead of doing data sharing?” I would wholeheartedly agree with you if we were dealing with a conventional data warehouse built on a 1990s share-nothing philosophy. But Snowflake is the farthest thing from a conventional data warehouse, and data sharing is nothing short of transformational. How so?

In addition to consuming data from Lacework, we also consume data from other data providers that share application logs, JIRA cases and more. We combine these data sources to automatically track software packages and determine whether they are approved. Before data sharing, this was a time-consuming, manual process. Now the team is free to focus on more critical security activities, since data sharing has both helped with FedRAMP and improved our overall security posture.

Conclusion

As I wrote this blog post, I watched my kids enthusiastically playing with their new Lego set. It’s remarkable how simple blocks can form such complex structures. Modern data sharing has similar properties: it offers data consumers and data providers infinite ways of solving challenges, creating boundless business opportunities.

Learn more about Lacework and FedRAMP.

Try Snowflake for free. Sign up and receive $400 worth of free usage. You can create a sandbox or launch a production implementation from the same Snowflake environment.

Dresner ADI Report: Snowflake Recommended by 100% of Customer Survey Respondents

Snowflake ranked highly among the data warehouse vendors in the quadrants of the 2018 Analytical Data Infrastructure (ADI) Market Study, achieving the highest or second-highest marks across all of the metrics within the study’s two models: Customer Experience (product/technology, sales and service) and Vendor Credibility (product value and ethics).

The ADI study is based on a detailed survey completed by customers from each competing vendor. The customer responses formed the basis for how each vendor ranked a specific data warehouse across the two models.

The first is the Customer Experience Model. Within the survey, customers report their experience with the sales and service staff of their data warehouse vendor, along with their experience with their vendor’s technology. Snowflake had the best combined score across both of these attributes. Download the report to see our placement.

The Vendor Credibility Model illustrates the overall perception of the value customers receive from a product along with their confidence in their chosen data warehouse vendor. Vendors that rated highly on the vertical axis of this quadrant are perceived as highly trustworthy, while a placement to the right shows significant value being driven by the product within the customer organization. Snowflake also ranked high in this model. Download the report to see our placement.

Most importantly, 100 percent of the Snowflake customer respondents said they would recommend Snowflake to others. We’re thankful for the confidence our customers have placed in us. Snowflake is a values-driven company, and our most important value is “Put Customers First”. It’s the value that drove us to create the only cloud-built data warehouse, so customers could experience infinite scalability. More recently, it’s the value that led us to develop Instant Elasticity, lowering customer costs by up to 80% in some cases. We believe this report validates our approach and our focus on our customers. Download the 2018 Analytical Data Infrastructure Market Study to learn more.

Try Snowflake for free. Sign up and receive $400 worth of free usage. You can create a sandbox or launch a production implementation from the same Snowflake environment.

Data is Only Transformative with Transformative Technology

At the recent AWS re:Invent show in Las Vegas, The Cube host, Lisa Martin, had the chance to sit down with Bob Muglia, CEO and President of Snowflake. Bob shared his thoughts on Snowflake’s latest addition to its cloud-built data warehouse, Snowpipe, while looking back at Snowflake’s origins and ahead to its future in order to enable the data-driven enterprise.

What is Snowpipe, and how do customers get started with it?

Muglia: Snowpipe is a way of ingesting data into Snowflake in a streaming, continuous way. You simply drop new data that’s coming into S3 and we ingest it for you automatically. Snowpipe makes it simple to bring the data into your data warehouse on a continuous basis, ensuring that you’re always up-to-date and that your analysts are getting the latest insights and the latest data.
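
For readers wondering what that looks like in practice, here is a minimal, hypothetical sketch of a Snowpipe definition. It assumes an external stage (@raw_events_stage) over the S3 location and a target table (raw_events) already exist; all names are placeholders:

-- A pipe that continuously copies newly arriving staged files into a table
create pipe events_pipe
  auto_ingest = true
  as
  copy into raw_events
  from @raw_events_stage
  file_format = (type = 'JSON');

With AUTO_INGEST enabled, new files landing in the S3 location trigger event notifications that prompt Snowflake to load them automatically, which is the continuous ingestion Muglia describes.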

In the five years since you launched, how has the opportunity around cloud data warehousing changed? How has Snowflake evolved to become a leader in this space?

Muglia: If you go back five years, this was a timeframe where NoSQL was all the rage. Everybody was talking about how SQL was passé and something you’re not going to see in the future. Our founders had a different view. They had been working on true relational databases for almost 20 years, and they recognized the power of SQL and relational database technology. But they also saw that customers were experiencing significant limitations with existing technology. They saw in the cloud, and in what Amazon had done, the ability to build an all new database that takes advantage of the full elasticity and power of the cloud to deliver whatever analytics the business requires. However much data you want, however many queries you want to run simultaneously, Snowflake takes what you love about a relational database and allows you to operate in a very different way. Our founders had that vision five years ago and successfully executed on it. The product has worked beyond the dreams of our customers, and that response from our customers is what we get so excited about.

How did you identify what data should even be streamed to Snowpipe?

Muglia: As an example, in entertainment we’re experiencing a data explosion. You have streaming video data, subscription data, billing data, social media data and on and on. None of this is arriving in any sort of regular format. It’s coming as semi-structured data, like JSON or XML. Up until Snowflake came onto the scene with a truly cloud-based solution for data warehousing, everyone was struggling to wrangle all these data sets. Snowpipe lets you bring in multiple data sets, merge them in real-time and get the analytics back to your business in an agile way that’s never been seen before.
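
As a hypothetical illustration of that point, once semi-structured events have been loaded into a VARIANT column (here a made-up raw_events.payload column), they can be queried with ordinary SQL alongside relational data:

-- payload is assumed to be a VARIANT column holding the ingested JSON
select
  payload:subscription.plan::string    as plan,
  payload:billing.amount::number(10,2) as amount,
  count(*)                             as events
from raw_events
where payload:event_type::string = 'playback_start'
group by 1, 2;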

How does your partnership with AWS extend Snowflake’s capabilities?

Muglia: People don’t want their data scattered all over the place. With the cloud, with what Amazon’s done and with a product like Snowflake, you can bring all of your data together. That can change the culture of a company and the way people work. All of a sudden, data is not power. Data is available to everyone, and it’s democratized so every person can work with that data and help to bring the business forward. It can really change the dynamics around the way people work.

Tell us a little bit about Snowflake’s collaboration with its customers. How are they helping to influence your future?

Muglia: As a company, we run Snowflake on Snowflake. All of our data is in Snowflake, all of our sales data, our financial data, our marketing data, our product support data and our engineering data. Every time a user runs a query, that query is logged in Snowflake and the intrinsics about it are logged. When you have a tool with the power of Snowflake, you can effectively answer any business question in just a matter of minutes. And that’s transformative to the way people work. And to me, that’s what it means to build a data-driven culture: The answers to business questions are inside what customers are doing and are encapsulated in the data.

Try Snowflake for free. Sign up and receive $400 worth of free usage. You can create a sandbox or launch a production implementation from the same Snowflake environment.

5 Must-Have Features for High Concurrency Environments

Whether you’re new to cloud data warehousing or comparing multiple cloud data warehouse technologies, it’s critical to assess whether your data warehouse environment will need to support concurrent processing of any sort. Unless you’re a lone database shop, in all likelihood the answer is yes. Concurrency, or concurrent data processing, is simultaneous access to and/or manipulation of the same data. This is not to be confused with parallel processing, which is multiple operations happening at the same time, but not against the same data.

Concurrency can take the form of multiple users interactively exploring and manipulating a particular data set, concurrent applications querying and visualizing the same data set, transactional updates to the data, or even concurrent loading of new data or change data into the data set. And if you thought to yourself, “What? I can concurrently load new data, while supporting queries on the same data set?,” then keep reading.

If you require concurrency, at any capacity, you will want these five cloud data warehouse features:

  • High relational performance across a broad range of data types: Of course you want high performance; that’s a given. However, the more challenging aspect to plan for is fast queries on a broad range of data types, including semi-structured/JSON data. You don’t want your relational data warehouse to bog down and hold up corporate users because it now must handle non-traditional data from groups like the web team or product engineering. JSON is the new normal.
  • Automatic and instant warehouse scaling: Let’s say your warehouse can handle a high number of concurrent accesses, but it bogs down during a period of high demand. Now what? This is a very important question to have answered as you compare technologies. Will you have to kick off users? Will you have to schedule after-hours jobs? Will you have to add nodes? Will this require you to redistribute data? If so, redistributing data takes time, and it’s a double whammy because the existing data in the warehouse has to be unloaded. This is a huge disruption to your users. Instead, you want the ability to load balance across new virtual warehouses (compute engines) and have your cloud data warehouse continue to execute at fast speeds, including loading new data, all against the same data set. The more automatically this load balancing happens, the better (see the sketch after this list).
  • ACID compliance: Right up there with poor performance is inaccurate results or reporting due to inconsistent data or dirty reads. The more people, workgroups, or applications accessing the same data simultaneously, the more pressure there is to maintain data consistency. A data warehouse with ACID compliance ensures consistency and data integrity without requiring you to write scripts or manually manage data integrity and consistency yourself.
  • Multi-statement transaction support: Tied to ACID compliance is multi-statement transaction support. If you have users that group multiple statements within a single transaction, you want a warehouse solution you can trust to execute the transaction completely and with integrity, or to automatically roll it back should any statement in the chain fail.
  • Data sharing: Data integrity and freshness should not stop just because you have to share data with an external stakeholder. Traditional approaches require engaging in an ETL process or spending time manually deconstructing, securing and transferring data. Data consumers on the receiving end must also spend time reversing your steps and reconstructing the data. This process is slow and manual, and it does not ensure your data consumers are always working with live, up-to-date data. Data sharing eliminates all of this effort and ensures access to live data.
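
Since this is a Snowflake blog, here is a hedged sketch of how two of the features above, automatic warehouse scaling and multi-statement transactions, can be expressed in Snowflake SQL. The warehouse name, sizes and table names are hypothetical, and multi-cluster warehouses assume an edition that supports them:

-- A multi-cluster warehouse that adds clusters automatically under concurrent load
create warehouse reporting_wh
  warehouse_size = 'MEDIUM'
  min_cluster_count = 1
  max_cluster_count = 4
  scaling_policy = 'STANDARD'
  auto_suspend = 300
  auto_resume = true;

-- A multi-statement transaction: the insert and update take effect only when
-- COMMIT succeeds; a ROLLBACK (or a failure you handle) discards both
begin;
  insert into orders (order_id, status) values (1001, 'NEW');
  update inventory set qty = qty - 1 where sku = 'ABC-123';
commit;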

The business value of proactively planning for concurrency is ensuring your cloud data warehouse can support your environment regardless of what it throws at you, especially during times of sudden, unpredictable, heavy query loads against a common data set.

Try Snowflake for free. Sign up and receive $400 worth of free usage. You can create a sandbox or launch a production implementation from the same Snowflake environment.

Deliveroo Delivers with Real-time Data

In a field of struggling food delivery startups, one notable success story has emerged from the fray. Termed “the European unicorn” by TechCrunch, Deliveroo is a British startup that offers fast and reliable food delivery service from a premium network of restaurants.

Deliveroo recently raised a $385 million funding round, boasts an estimated $2 billion valuation and is credited with transforming the way people think about food delivery. What is this unicorn doing differently? How has it found success where so many others have failed?

“Data is baked into every aspect of the organization,” said Deliveroo’s head of business intelligence, Henry Crawford. “Having instant access to data reveals which geographic areas are experiencing a shortage of restaurants and a shortage of particular cuisines, so we can create these hubs right at the consumer’s doorstep.”

Deliveroo analyzes customer behavior, gains insight into market trends and responds with swift decisions and rapid execution, all by using data-driven insights. Snowflake makes all of this possible.

“With data coming from a variety of sources, including web traffic, transactions and customer behavior, having a data warehouse built for the cloud provides one repository for a single source of truth,” Henry explains. “The shift to Snowflake’s cloud data warehouse has enabled us to make good on our promise that got Deliveroo started: To connect consumers with great food from great restaurants, wherever you are, and whatever it takes.”

Snowflake also accommodated Deliveroo’s 650% growth in 2016. Such rapid momentum prompted Deliveroo to expand its business intelligence team from two employees to 14. Additional team members triggered the need for more access to the same data, but without impacting performance.

Since Snowflake is built for the cloud, an unlimited number of users can access all of an organization’s data from a single repository, which is critical to Deliveroo’s success. There’s no replicating data, shifting queries and other workloads to non-business hours, or queueing users to preserve performance. Instead, Snowflake’s true cloud elasticity means Deliveroo can automatically scale up, down and out (concurrency) to load and analyze data without disruption.

“None of these future plans would be possible without real-time, concurrent access to massive volumes of data,” Henry said.

What’s next for Deliveroo? Using real-time logistics algorithms to increase the number and the speed of deliveries. Deliveroo’s expansion plans also include an “Editions” program—delivery-only kitchens so partner restaurants can expand their footprint without opening brick-and-mortar locations.

Learn more about how Snowflake can accelerate your data storage and analytics initiatives.

Financial Services: Welcome to Virtual Private Snowflake

Correct, consistent data is the lifeblood of the financial services industry. If your data is correct and consistent, it’s valuable. If it’s wrong or inconsistent, it’s useless and may be dangerous to your organization.

I saw this firsthand during the financial meltdown of 2007/08. At that time, I had been working in the industry for nearly 20 years as a platform architect. Financial services companies needed that “single source of truth” more than ever. To remain viable, we needed to consolidate siloed data sets before we could calculate risk exposure. Most financial services companies were on the brink of collapse. Those that survived did so because they had access to the right data.

At my employer, we looked for a way to achieve this single source with in-house resources, but my team and I quickly realized it would be an extraordinary challenge. Multiple data marts were sprawled across the entire enterprise, and multiple sets of the same data existed in different places, so the numbers didn’t add up. In a global financial services company, even a one percent difference can represent billions of dollars and major risk.

We ultimately built an analytics platform powered by a data warehouse. It was a huge success. It was so successful that everybody wanted to use it for wide-ranging production use cases. However, it couldn’t keep up with demand, and no amount of additional investment would solve that problem.

That’s when I began my quest to find a platform that could provide universal access, true data consistency and unlimited concurrency. And for financial services, it had to be more secure than anything enterprises were already using. I knew the cloud could address most of these needs. However, even with the right leap forward in technical innovation, would the industry accept it as secure? Then I found Snowflake. But my story doesn’t end there.

I knew Snowflake, the company, was onto something. So, I left financial services to join Snowflake and lead its product team. Snowflake represents a cloud-first approach to data warehousing, with a level of security and unlimited concurrency that financial services companies demand.

We’ve since taken that a step further with Virtual Private Snowflake (VPS), our most secure version of Snowflake. VPS gives each customer a dedicated and managed instance of Snowflake within a separate Amazon Web Services (AWS) Virtual Private Cloud (VPC). In addition, customers get our existing, best-in-class Snowflake security features, including end-to-end encryption of data at rest and in transit. VPS also includes Tri-Secret Secure, which combines a customer-provided encryption key, a Snowflake-provided encryption key and user credentials. Together, these features thwart an attempted data decryption attack by instantly rendering data unreadable. Tri-Secret Secure also uses user credentials to authenticate approved users.

VPS is more secure than any on-premises solution and provides unlimited access to a single source of data without degrading performance. This means financial services companies don’t have to look at the cloud as a compromise between security and performance.

To find out more, read our VPS white paper and solution brief: Snowflake for Sensitive Data.

3 Ways a Data Sharehouse™ Can Catapult Your Data-Driven Initiatives

How would your organization benefit from effortlessly sharing limitless amounts of data with business partners and commercializing that data to share with other organizations?

With most data sharing methods today, the best you can do is imagine these benefits, because it’s cumbersome, time-consuming and costly to share even small slices of data. If you share data using an FTP approach, you will spend time deconstructing, scripting, securing and governing the data, and your data consumers will spend time reconstructing, scripting and rebuilding the data. Sharing data via email can be very slow and insecure. Email is not even a practical option for large data sets, and it is exponentially more difficult if you’re sharing a large database.

Sharing data using a basic cloud storage service is equally inefficient, not least because neither you nor your data consumers can query the data directly. And if you want to enable “direct” data queries without loading, you will likely evaluate a traditional cloud data warehouse or Hadoop platform that is “integrated” with a data lake storage platform. Do you think this will be simple? Not so fast! You’ll have to factor in the need to manage a separate data catalogue and external tables to share data, or you’ll have to contend with data inconsistency and performance issues.

These limitations all add up: traditional options make it difficult to take data sharing to the next level of capability, which goes beyond finding insights to commercializing your data. If any of this feels familiar, consider these three reasons why Snowflake Data Sharing – the technology making it possible to create a Data Sharehouse™ – is so compelling.

1 – Drive simplicity: Allocate more of your time and resources to strategic data sharing projects

Traditional options to share data create unnecessary data processing and platform complexity. Complexity adds burdens and requires a lot of extra time and resources, including infrastructure costs, for you and for your data consumers. 

Snowflake Data Sharing, on the other hand, is an all-in-one, data warehouse-as-a-service solution that makes sharing data simple and easy. The vision for the Snowflake architecture, from day one, featured separation of compute, storage and services to allow unlimited scaling, concurrency and data sharing.

For example, with Snowflake you don’t need to manage an external data catalogue or undertake a separate effort to build data security. It’s all built in. In addition, Snowflake’s metadata management makes it possible to keep tabs on and track all data warehouse activities. To share data, simply follow these steps:

  • Point your pipeline to land your data into Snowflake and set up your warehouse.
  • To share any portion of your data warehouse, use Snowflake’s included SQL client to CREATE a new database shell that can contain the entire database or any portion of it, creating a reference list for the objects you wish to share.
  • Issue GRANT statements that enable access to the shared database and any objects referenced within “the share”.

These are all the steps you need to quickly and easily share data. All of the related metadata activity, including cataloguing, happens automatically within the Snowflake data warehouse service. No additional infrastructure or separate cataloguing effort is required.
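
As a hedged illustration of the second and third steps, and of the granular secure views discussed later in this post, sharing a narrow slice of a database might look roughly like this (all names are hypothetical):

-- Expose only an approved slice of the data through a secure view
create secure view sales_db.public.partner_orders as
  select order_id, order_date, region, amount
  from sales_db.public.orders
  where region = 'EMEA';

-- Create the share and grant access to the database, schema and secure view
create share partner_share;
grant usage on database sales_db to share partner_share;
grant usage on schema sales_db.public to share partner_share;
grant select on view sales_db.public.partner_orders to share partner_share;

-- Add the data consumer's account to the share
alter share partner_share add accounts = partner_account;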

2 – Enable more valuable insights: Your data consumers will always operate with live and current data

Most options for sharing data require you to unload the data from your data warehouse in order to transmit or email it to your data consumers. A cloud data warehouse platform that avoids this instead relies on a physically separate storage pool in order to scale and share data. The downside with either option is that the shared data set is read-only and disconnected from any updates that occur within the data warehouse. If you modify the data set, or if the data set is regularly updated, you must reload (ETL) it into the data warehouse, operate on it, and then unload it again to retransmit it or place it in the shared storage pool.

For data consumers that received or were connected to the old data set, this means there is a period of time during which they’ll be exposed to stale and inconsistent data. Fresh data won’t be available until a new transmission is made or until a new connection is made to the fresh data. No one wants to run analytics on stale data.

Snowflake Data Sharing delivers a better choice. Because data sets shared with data consumers within the Snowflake environment are live, in real time, and served from the data warehouse itself, data consumers immediately see fresh data as soon as an update or transaction is committed and successfully executed. Data consistency is maintained without any extra effort from you. Your data consumers will not have to struggle with the decision to run analytics now or wait for a new update.

3 – Support any scale, with ease: Seamlessly and cost-effectively share data with any number of data consumers

At the end of the day, it’s all about growing and expanding your business and services while providing excellent experiences for the customers and consumers of your data. If you anticipate sharing data with tens, hundreds, or thousands of data consumers, each with unique data sharing requirements, how can you easily support this? And can you support that growth without manually building more clusters, managing external tables or metadata stores, suffering through performance penalties or creating data inconsistencies? With other architectural approaches, accomplishing these objectives would be very difficult, if not impossible.

Snowflake data sharing allows you to scale and easily add data consumers and specify granular secure views, all at the highest performance profile and with data consistency. On the other end, with Snowflake, your data consumers can immediately use and query the shared data, also at the highest performance profile.

Summary: Share and imagine more

These are just a few compelling examples of how Snowflake Data Sharing, and the Data Sharehouse approach, can transform how high-value business assets are created from data. It’s a new, exciting, powerful and easy-to-use feature of all Snowflake data warehouse deployments.

Also check out data sharing blogs from Snowflake customers Snagajob and Playfab. To learn more, you can easily sign up for Snowflake here. Or jump to our webpage to review our data sharing ebook, white paper, demo, and other resources. In future posts, we’ll cover more details about the technical capabilities and business-value advantages of Snowflake Data Sharing – built for the cloud. Be sure to follow us on Twitter (@snowflakedb and @miclnixon1) for all the news and happenings here at Snowflake. Stay tuned!

Understanding Your Snowflake Utilization, Part 2: Storage Profiling

This article is the second in a three-part series to help you use Snowflake’s Information Schema to better understand and effectively utilize Snowflake.

As a Customer Success Engineer, my daily job entails helping our customers get the most value from our service. And I’m now passing along some of what I’ve learned to help you become more self-sufficient. In my first post, I discussed getting a handle on your utilization of compute resources by using various Information Schema views and functions to profile your virtual warehouse usage.

In this post, I provide a deep-dive into understanding how you are utilizing data storage in Snowflake at the database, stage, and table level. To do this, I will show you examples of two functions and a view provided in the Information Schema for monitoring storage usage. I will also show you a handy page in the UI that provides an account-level view of your storage. Keep in mind that you need ACCOUNTADMIN access to perform any of the tasks described in this post.

Let’s get started.

Summary Storage Profiling in the UI

Before diving into our detailed analysis of data storage, let’s take a quick look at the summary, account-level storage view provided by Snowflake. As a user with the ACCOUNTADMIN role, you can navigate to the Account page in the Snowflake UI to get a visual overview of the data storage for your account.


This page provides a view, by month, of the average and daily storage usage across your entire account. You can use the filters on the page to filter by database, Snowflake stage, and data maintained in Fail-safe (for disaster recovery).

Detailed Storage Profiling Using the Information Schema

The Snowflake Information Schema provides two functions and one view for monitoring detailed data storage at the database, stage, and table level:

  • DATABASE_STORAGE_USAGE_HISTORY (function)
  • STAGE_STORAGE_USAGE_HISTORY (function)
  • TABLE_STORAGE_METRICS (view)

The DATABASE_STORAGE_USAGE_HISTORY table function shows your database status and usage for all databases in your account or a specified database. Here’s an example of the usage over the last 10 days for a database named sales:

use warehouse mywarehouse;

select * from
table(sales.information_schema.database_storage_usage_history(dateadd('days',-10,current_date()), current_date(), 'SALES'));

Note that the above screenshot only displays some of the output columns. For full details about the output, see the online documentation. Also, per the Snowflake documentation:

If a database has been dropped and its data retention period has passed (i.e. database cannot be recovered using Time Travel), then the database name is reported as DROPPED_id.

At its core, the most useful insight from this function is the average growth in your database. Keep in mind, the output includes both AVERAGE_DATABASE_BYTES and AVERAGE_FAILSAFE_BYTES. Leveraging these data points to derive a percentage of Fail-safe over actual database size should give you an idea of how much you should be investing towards your Fail-safe storage. If certain data is not mission critical and doesn’t require Fail-safe, try setting these tables to transient. More granular information about Fail-safe data is provided in TABLE_STORAGE_METRICS, which we will look at more closely later in this post.
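
As a rough sketch of that Fail-safe calculation, the ratio can be derived directly from the function’s output. The column names below (USAGE_DATE, AVERAGE_DATABASE_BYTES, AVERAGE_FAILSAFE_BYTES) are taken from the documented output, and the sales database is the one from the earlier example:

-- Ratio of Fail-safe storage to active database storage over the last 10 days
select usage_date,
       average_database_bytes,
       average_failsafe_bytes,
       average_failsafe_bytes / nullif(average_database_bytes, 0) as failsafe_ratio
from table(sales.information_schema.database_storage_usage_history(
       dateadd('days', -10, current_date()), current_date(), 'SALES'))
order by usage_date;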

Next, let’s look at STAGE_STORAGE_USAGE_HISTORY. This function shows you how much storage is being used for staged files across all your Snowflake staging locations, including named internal stages. Note that this function does not allow querying storage on individual stages.

Here’s an example of staged file usage for the last 10 days (using the same database, sales, from the previous example):

select * from
table(sales.information_schema.stage_storage_usage_history(dateadd('days',-10,current_date()),current_date()));

Note that the above screenshot only displays some of the output columns. For full details about the output, see the online documentation.

Also note that you can only query up to 6 months’ worth of data using this function. Some of our users like to use Snowflake stages to store their raw data. For example, one user leverages table staging locations for raw data storage just in case they need to access the data in the future. There’s nothing wrong with this approach, and since Snowflake compresses your staged data files, it certainly makes sense; just keep in mind that only the last 6 months of staged data storage is available.

Finally, the TABLE_STORAGE_METRICS view shows your table-level storage at runtime. This is a snapshot of your table storage which includes your active and Fail-safe storage. Additionally, you can derive cloned storage as well utilizing the CLONE_GROUP_ID column. As of today, this is the most granular level of storage detail available to users.

Here’s a general use example (using the sales database):

select *
from sales.information_schema.table_storage_metrics
where table_catalog = 'SALES';

Note that the above screenshot only shows a portion of the output columns. For full details about the output, see the online documentation.

One interesting analysis I’ve been helping our customers with is deriving how much of their table storage is based on cloned data. In Snowflake, cloning data has no additional costs (until the data is modified or deleted) and it’s done very quickly. All users benefit from “zero-copy cloning”, but some are curious to know exactly what percentage of their table storage actually came from cloned data. To determine this, we’ll leverage the CLONE_GROUP_ID column in TABLE_STORAGE_METRICS.

For example (using a database named concurrency_wh):

with storage_sum as (
  select clone_group_id,
         sum(owned_active_and_time_travel_bytes) as owned_bytes,
         sum(active_bytes) + sum(time_travel_bytes) as referred_bytes
  from concurrency_wh.information_schema.table_storage_metrics
  where active_bytes > 0
  group by 1)
select * , referred_bytes / owned_bytes as ratio
from storage_sum
where referred_bytes > 0 and ratio > 1
order by owned_bytes desc;

The ratio in the above query gives you an idea of how much of the original data is being “referred to” by the clone. In general, when you make a clone of a table, the CLONE_GROUP_ID for the original table is assigned to the new, cloned table. As you perform DML on the new table, your REFERRED_BYTES value gets updated.  If you join the CLONE_GROUP_ID back into the original view, you get the output of the original table along with the cloned table. A ratio of 1 in the above example means the table data is not cloned.

If you need to find out the exact table name from the above query, then simply join the CTE back to the TABLE_STORAGE_METRICS view and ask for the TABLE_NAME column.

For example (using the same database, concurrency_wh, from the previous example):

with storage_sum as (
  select clone_group_id,
         sum(owned_active_and_time_travel_bytes) as owned_bytes,
         sum(active_bytes) + sum(time_travel_bytes) as referred_bytes
  from concurrency_wh.information_schema.table_storage_metrics
  where active_bytes > 0
  group by 1)
select b.table_name, a.* , referred_bytes / owned_bytes as ratio
from storage_sum a
join concurrency_wh.information_schema.table_storage_metrics b
on a.clone_group_id = b.clone_group_id
where referred_bytes > 0 and ratio > 1
order by owned_bytes desc;

Storage Profiling - Example

Conclusion

By utilizing the UI and the Information Schema functions and views described in this post, you can profile your data storage to help keep your storage costs under control and understand how your business is growing over time. It’s a good idea to take regular snapshots of your storage so you can analyze growth month over month. This will help you both derive usage insights and take action.

To dig deeper into this subject, check out our online documentation.

I hope this article has given you some good ideas for how to manage your Snowflake instance. Look for Part 3 of this series in the coming weeks, where I will show you how to analyze your query performance. As already shown in Parts 1 and 2, there are a lot of options to play with in Snowflake, and they’re all intended to give you the flexibility and control you need to best use Snowflake. Please share your thoughts with us!

Also, for more information, please feel free to reach out to us at info@snowflake.net. We would love to help you on your journey to the cloud. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.