Snowflake: Seriously Serious About Security

How Serious Are We About Security? Extremely.

No self-respecting security team is ever satisfied with the existing security controls it has in place. Some mistake this dissatisfaction for a personality disorder, referring to their security team members as “control freaks” or “over-achievers”. Let’s face it: security professionals tend to be an eccentric group. However, for a truly committed and competent security team, this eccentricity is simply a symptom of the healthy paranoia that comes with being responsible for the protection of vital infrastructure and sensitive data.

Snowflake’s security team has channeled this paranoia into a program we call Seriously Serious Security Testing. There are several components of this program, including the audit of all the usual administrative and technical controls you would expect from a cloud company. However, where Snowflake’s eccentricity truly surfaces is in our embrace of the dreaded Penetration Test. Here are the highlights of Snowflake’s security testing program and the key role that penetration testing plays.

First: What is a Penetration Test?

A penetration test is a controlled attempt to exploit vulnerabilities to determine whether unauthorized access or other malicious activity is possible within the target environment. This is a requirement for PCI-DSS compliance, and is also considered best practice for any organization that takes security seriously. Snowflake engages globally-recognized experts to perform this activity within specific constraints and guidelines as described in the Methodology section below.

Frequency

Most companies avoid penetration tests altogether. Others perform them annually at best, the minimum frequency required to meet the standards and certifications their auditors tell them they need. What many auditors don’t challenge is whether adequate penetration testing has been performed after every “material change” to the company’s product or infrastructure. Annual penetration testing is unlikely to be sufficient in a cloud environment, where most vendors take pride in frequently deploying new features and functionality (Snowflake is no different in this regard, with releases several times a month, at least). Because of these frequent changes, it’s important to verify that your cloud vendors perform penetration testing often enough to catch any vulnerabilities that new releases may inadvertently introduce.

Security Penetration Test - Frequency

source: https://www.rapid7.com/globalassets/_pdfs/whitepaperguide/rapid7-research-report-under-the-hoodie.pdf

Much to the irritation of our Operations and Engineering teams, Snowflake has performed more than 5 penetration tests in the past 6 months.

Why would we do this to ourselves? Because we want to know what our weaknesses are! The frequency with which we perform these tests provides Snowflake with the assurance that changes to the Snowflake service, as well as newly discovered vulnerabilities within other components of our environment, are not putting Snowflake or (more importantly) Snowflake’s customers and their data at risk.

Methodology

Another example of Snowflake Security’s paranoia is the approach we take with our penetration testers. Typical penetration testing engagements at Snowflake are designed to simulate the compromise of an employee’s or customer’s credentials by providing the tester with limited access to a non-production environment. Engagements run a minimum of two weeks and begin by providing the testers not only with the aforementioned credentials, but also with substantial information about the architecture, network design, and, when applicable, our application code. (This method is sometimes referred to as White Box Testing.) If, after a specific period of time, the testers have not been able to find an entry point, Snowflake gradually grants them more access until they are able to uncover vulnerabilities, or until the time is up.

Why would we divulge so much information? We want to know what ALL our weaknesses are! This provides us with visibility into what would happen if, for example, we had an insider attempting to gain unauthorized access to data. How far would they get? How quickly could we detect them? How would we contain them? And so on. The information is invaluable.

Security Penetration Test - Vulnerabilities

Most common vulnerabilities found by penetration testers

source: https://www.rapid7.com/globalassets/_pdfs/whitepaperguide/rapid7-research-report-under-the-hoodie.pdf

Transparency

The final example of Snowflake’s Seriously Serious Security Testing program is the highly unusual practice of sharing penetration test reports and remediation documentation with qualified prospects and customers (under NDA, transmitted securely, and with the promise of their firstborn if there is a compromise). By sharing our reports, we are able to solicit additional feedback on ways to improve our testing.

I’ve been on both sides of the audit fence for years, and I’ve yet to find an organization as willing to share as much information about its penetration testing frequency and methodology as Snowflake. However, it comes as no surprise to anyone who has worked with Snowflake. Snowflake’s corporate culture is based on teamwork and collaboration, which spills over into Snowflake’s relationships with customers and vendors. We believe that transparency is the cornerstone of trust, and trust is the cornerstone of a healthy partnership between Snowflake and our customers. Providing the penetration test report and remediation evidence allows customers to see for themselves how seriously we take security, as well as how effective we are at achieving it. This allows our customers and prospects to make an informed decision about the risks they’re taking.

Conclusion

Security is a constantly moving target. Our team will never stop this extreme security testing of our infrastructure because threats are constantly evolving.

So…
Call us control freaks.
Call us over-achievers.
Call us paranoid.

One thing you’ll never call us is complacent…seriously.

For more information, please feel free to reach out to us at info@snowflake.net. We would love to help you on your journey to the cloud, securely. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

 

Saving Time & Space: Simplifying DevOps with Fast Cloning

As I have written before, I think our fast cloning feature is pretty cool. Not only is it very fast (as in it took me barely a minute to clone a 2TB database!), it allows you to have multiple copies of your data without the additional cost of storage usually associated with replicating data.  

That is all well and good, but out in the real world, folks want to see lots of examples on how to make good use of this feature. A common question from customers is how to utilize the Snowflake fast clone feature to support DevOps. Specifically, can it be used to support a promotion process from development to integration/testing to production?

The answer is yes! Following are a few examples of how that might work.

Getting Started: Cloning Your Production Database

Starting out, let’s assume you have one database in Snowflake (i.e. production by default) and it has a schema named PUBLIC. In this database you have two tables (A and B) and they are being populated daily by a load process.

Cloning Figure 1

Now you find you need to make changes and additions, and you realize you should not do that in production, so you need a development database. How do you do that in Snowflake?

You could do it the old-fashioned way: create a new database, then copy the existing tables and data over from production (using a “CREATE TABLE AS SELECT …”, or CTAS, operation). If the tables are large, that would take some time and cost extra, since it also requires more storage. However, using Snowflake’s fast clone feature, you can do it way faster and not incur the cost of the extra space.
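For contrast, the old-fashioned CTAS approach would look something like this for each table (using the table names from the example above):

```sql
-- Old approach: physically copy each table into a new Dev database.
-- Every row is rewritten, so this takes time and consumes extra storage.
CREATE DATABASE Dev;
CREATE TABLE Dev.public.A AS SELECT * FROM Prod.public.A;
CREATE TABLE Dev.public.B AS SELECT * FROM Prod.public.B;
```

With more than a handful of tables, you would also need to script this per table, which is exactly the overhead fast cloning avoids.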

Here is how to do it:

CREATE DATABASE Dev CLONE Prod;

That’s it, one command! It creates the new database, schema, and tables and gives you a logical copy of the production data to work with almost instantly. 

While you are at it, you can also create the integration testing database (Int database):

CREATE DATABASE Int CLONE Prod;

Now you are set to continue your data warehouse development efforts with a proper environment in place.

Scenario 2: Promoting New Development

Based on what we did above, your current Development database has tables A and B, cloned from Prod. As you begin your development iteration, you can add more tables. In this example, we added two new tables C and D, which are under development.

Initial State:

Prod database: Tables A, B

Int database: Tables A, B (previously cloned from Prod)

Dev database:

  • Tables A, B (previously cloned from Prod)
  • Tables C, D (created as part of new development)

Cloning Figure 2

Mini Release:

Now suppose you want to do a mini-release, wherein you:

  • Promote only Table C to Int for testing
  • Leave Table D as is in Dev (work-in-progress, not ready to deploy)

To do this mini-release you need to execute the following steps in a script, using the command-line client, or in the Snowflake worksheet:

  1. Deploy Table C to Int by simple clone:
        • If Table C already contains production data (copy the data):
          USE Int.public;
          CREATE TABLE C CLONE Dev.public.C;  // Yes, that simple!
          • Load incremental new data into Table C
          • Validate results
        • If Table C has only test data and you do not want to copy the data:
          USE Int.public;
          CREATE TABLE C LIKE Dev.public.C;  // copy the table structure only
          • Load new data into Table C
          • Validate results
  2. If the validation of Table C succeeds in Int, then deploy it to Prod:
          USE Prod.public;
          CREATE TABLE C LIKE Int.public.C;
          • Load production data into Table C
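Taken together, the happy path of the mini-release above can be sketched as a single script (this sketch assumes Table C already contains production-ready data, and that loading and validation happen out of band between the two steps):

```sql
-- Step 1: promote Table C from Dev to Int as a zero-copy clone.
USE Int.public;
CREATE TABLE C CLONE Dev.public.C;
-- ...load incremental data and validate results in Int...

-- Step 2: once validated, create the empty structure in Prod,
-- then load production data through the normal load process.
USE Prod.public;
CREATE TABLE C LIKE Int.public.C;
-- ...load production data into Prod.public.C and validate...
```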

Now you have the new Table C live in your Prod database.

Cloning Figure 3

But you are not quite done yet. You need to complete the cycle.

Refresh Dev:

We now want to refresh Dev to be in sync with Prod. To do that, we need to create a production clone of Table C and update Tables A and B with refreshed data. This will also reduce your storage by replacing the original version of Table C in Dev with a cloned table. You should sync Int as well.

If there are just a few tables in Prod (as in our example) then:

USE Dev.public;
CREATE OR REPLACE TABLE A CLONE Prod.public.A;
CREATE OR REPLACE TABLE B CLONE Prod.public.B;
CREATE OR REPLACE TABLE C CLONE Prod.public.C;
CREATE OR REPLACE DATABASE Int CLONE Prod;

Cloning Figure 4

Note: CREATE OR REPLACE TABLE will drop and rebuild the table. This will release any space used in Dev (or Int) by those tables subsequent to the last clone operation.

Hint: If there are many tables in Prod, and only a few new tables in Dev, then this approach will be more efficient:

CREATE DATABASE Dev2 CLONE Prod;
USE Dev2.public;
CREATE TABLE D CLONE Dev.public.D; // preserving the WIP table in Dev
DROP DATABASE Dev;
ALTER DATABASE Dev2 RENAME TO Dev;

This process removes all previous clones and allows you to preserve Table D as it is currently in development. When you are done with development on Table D, you follow the same process as we did for Table C to promote Table D to Int for testing, then up to Prod for go-live. After the promotion, be sure to replace Table D in Dev and Int with a fresh clone from Prod in order to release the space.

Scenario 3: Promoting a Change to an Existing Table

Now let’s suppose you have three tables already in production and you need to alter one of those tables (e.g., add a new column) and then promote the revised table to Production. This is an important scenario, especially when rows are constantly being added to the production table. In this scenario you do not want to reload the modified table; rather, you want to change it in place and then continue with incremental loads.

Initial State:

Prod database: Tables A, B, C

Int database: Tables A, B, C (cloned from Prod)

Dev database: Tables A, B, C (cloned from Prod)

Cloning Figure5

Mini Release:

In this scenario, you want to do a mini-release, wherein you will:

  • Alter Table C by adding one new column
  • Adjust the load process to include that column
  • Promote Table C to Int for testing

To do this mini-release you need to execute the following steps in a script or by hand in the Snowflake worksheet:  

1. Change the table and the code (in Dev), then refresh the table:
USE Dev.public;
ALTER TABLE C ADD COLUMN NEW_COL1 VARCHAR(200);
      • Modify load to include NEW_COL1
      • Write a one-time update script to update existing rows to populate NEW_COL1
        • Pull data from Production source if possible
        • Review and validate results
      • Execute modified load routine to load incremental new rows
        • Review and validate results
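As a sketch, the one-time backfill script for the new column might look like this (the default value and predicate here are hypothetical; the real logic should derive values from your production source where possible):

```sql
-- Hypothetical one-time backfill for the new column in Dev.
USE Dev.public;
UPDATE C
SET NEW_COL1 = 'UNKNOWN'   -- placeholder; derive from source data if possible
WHERE NEW_COL1 IS NULL;    -- only touch rows loaded before the schema change
```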

2. Deploy Table C to Int by simple clone:

USE Int.public;
CREATE OR REPLACE TABLE C CLONE Dev.public.C;
      • Promote revised load process to Int environment
      • Promote Update script to Int environment
      • Execute Update script
        • Review and validate results
      • Execute load process to load incremental new data into Table C
        • Review and validate results

3. If promotion to Int succeeds, then deploy to Prod:

USE Prod.public;
ALTER TABLE C ADD COLUMN NEW_COL1 VARCHAR(200);
      • Promote revised load process to Prod environment
      • Promote Update script to Prod environment
      • Execute Update script
        • Review and validate results
      • Execute load process to load incremental new data into Table C
        • Review and validate results
      • Start writing queries and reports that use the new column

Refresh Dev:

Once the promotion to Production is complete, you can refresh Dev and Int again using the same process as mentioned above.

USE Dev.public;
CREATE OR REPLACE TABLE C CLONE Prod.public.C;
USE Int.public;
CREATE OR REPLACE TABLE C CLONE Prod.public.C;

Now you have a production clone of Table C in Dev & Int with the new column added and updated with fresh production data.


Conclusion

These are just a few examples of how you might use Snowflake’s fast cloning to support an agile, lightweight DevOps process that does not require a lot of DBA management and does not take up extra space! Hopefully this gives you some ideas about how you can use fast cloning in your development environment.

Thanks to Saqib Mustafa (@cloudsommelier) for his help on this post!

As always, keep an eye on this blog site, our Snowflake-related Twitter feeds (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for more Cool Things About Snowflake and for updates on all the action and activities here at Snowflake Computing.

 

Automatic Query Optimization. No Tuning!

Hopefully you had a chance to read all our previous top 10 posts. We now conclude the series with a final deep dive into my #1 Cool Feature from Snowflake:

#1 Automatic Query Optimization. No Tuning!

As a long time data architect, and not a DBA, this is my favorite part of Snowflake. I do not have to worry about query performance at all. It is all handled “auto-magically” via a dynamic optimization engine in our cloud services layer. I just model, load, and query the data.

Cloud Services

Snowflake is an MPP, columnar store, designed from the ground up for high-speed analytic queries. Our cloud services layer does all the query planning and optimization, based on data profiles that are collected automatically as the data is loaded. It automatically collects and maintains the statistics required to determine how to most effectively distribute the data and queries across the available compute nodes.

So, no indexes, no need to figure out partitions and partition keys, no need to pre-shard any data for even distribution, and no need to remember to update statistics.

Another key point to remember is that not only does this dynamic optimization auto-tune queries that work against traditional structured data, it also works against semi-structured data (like JSON, Avro, or XML). When semi-structured data is loaded into our VARIANT data type, as mentioned in an earlier blog post, the data is profiled and columnarized behind the scenes. With this approach, querying the data (using our JSON or XML extensions) is optimized just like for structured relational data, without having to spend the time and effort to split it out into relational tables and columns. Because of this unique approach to handling semi-structured data, Snowflake not only supports joining structured and semi-structured data with SQL, it optimizes those joins too.
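For example, if JSON events were loaded into a table with a single VARIANT column, they can be queried with ordinary SQL plus path notation (the table and field names here are hypothetical, for illustration only):

```sql
-- Query JSON stored in a VARIANT column named v (hypothetical schema).
SELECT
  v:customer.name::string  AS customer_name,      -- extract and cast a nested field
  v:items[0].price::number AS first_item_price    -- index into a JSON array
FROM raw_events
WHERE v:event_type::string = 'purchase';
```

Because the VARIANT data was columnarized and profiled at load time, a query like this is optimized the same way as a query over ordinary relational columns.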

Let go Luke – Trust the Force!

As you can imagine, there are some folks out there who do not like change and want us to add more knobs so they can manually tune their queries.

Why? Well, they say that they know their workload better than Snowflake and therefore can optimize it better than Snowflake software. Others say that they’ve already done all the work to optimize another platform for their workload and don’t want to “waste” all that time and effort.

Since I come from a traditional data warehouse background, and know many expert data warehouse architects who are world class at tuning specific databases to do specific types of queries,  I definitely understand and can sympathize with this point of view.

But the world of data has changed! The way we build, deploy, and use data (of all kinds) is continually evolving, and we need to evolve our approach to be successful.

Because of that, there are a few things to consider with regard to tuning:

  1. Your workloads will change.
  2. The sheer number of different workloads is going to grow faster than your bandwidth to manually tune (so someone has to prioritize who wins).
  3. What happens if your expert query tuner decides to sell real estate instead?

Relying on yourself, and other expert DBA tuning staff, is ultimately going to result in a performance tuning bottleneck, which will in turn lead to increased contention for resources in the data warehouse. No matter how good you are, in the end you will not be able to keep up. So why not use a data warehouse service to handle that work for you so you can focus on getting the business people access to all this data in a form they can use to make better business decisions? In short, people are the hardest thing to scale in any data analytics solution. Thus any solution that allows you to scale without bottlenecks on people is a winner.

And if the query is still not fast enough, with Snowflake you always have the option to resize your virtual warehouse to add additional compute resources, as we described in the post about elasticity. Without the benefit of a dynamic optimization engine, you could not change the compute configuration on the fly and see immediate improvements. In many other environments, if you changed the number of compute nodes, you would have to redistribute the data, or do some other manual tuning, in order to get the improved performance. That would not be elastic at all.

For all these reasons, our patent-pending dynamic optimization is, to me, one of the most important features when it comes to making the Snowflake Elastic Data Warehouse a true SaaS offering and enabling agile, high-speed analytics at any scale. As part of our vision, we developed an offering that helps customers concentrate on understanding and deriving value from their data rather than spending time refining, managing, and optimizing the environment. Business users and data scientists can easily analyze their data by connecting natively to Snowflake through various tools, without having to worry about tuning and query optimization.

As I said before, as an architect and data warehouse designer, I really like this model!

End of this trail…more to come

Well, that is the last of my Top 10 list of cool features in Snowflake.  I want to thank Saqib Mustafa for all his help collaborating on this series of posts.

There are of course many other cool features in Snowflake, and more to come in the future. So keep your eyes on this blog for future posts with more details on the Snowflake Elastic Data Warehouse, the cloud-native data warehouse as a service.

As always, keep an eye on our Snowflake-related Twitter feeds (@SnowflakeDB), (@kentgraziano), and (@cloudsommelier) for continuing updates on all the action and activities here at Snowflake Computing.