Cloud-Based Data Warehouse Architecture
The Snowflake Architectural Difference
Snowflake is a fully relational SQL data warehouse. It’s all new and built for the cloud on Amazon Web Services (AWS). Snowflake provides complete relational database support for both structured and semi-structured data (JSON, Avro, XML), and implements comprehensive support for the SQL language. It requires no administration and is delivered as a turn-key cloud service. Snowflake provides broad support for ETL and BI tools, and enables developers to build modern data applications. It is secure by design.
Beyond these attributes, what makes Snowflake different from a traditional data warehouse, Hadoop system, or other cloud database?
In one word, architecture.
Snowflake is built on a patent-pending, multi-cluster, shared data architecture born and built for the cloud to revolutionize big data analysis.
Snowflake is a single integrated system with fully independent scaling for compute, storage and services.
Unlike traditional architectures that couple storage and compute together, Snowflake enables automatic scaling of storage, analytics or workgroup resources for any job, quickly and easily.
Built on Amazon S3 cloud storage, the storage layer holds all the diverse data, tables and query results for Snowflake. Because the storage layer is engineered to scale completely independently of compute resources, it delivers maximum scalability, elasticity, and performance for big data warehousing and analytics. As a result, Snowflake delivers unique capabilities such as loading or unloading data without impacting running queries.
Within the storage layer, Snowflake uses micro-partitions to store customer data securely and efficiently. When data is loaded into Snowflake, it is automatically split into modest-sized micro-partitions, and metadata is extracted to enable efficient query processing. The micro-partitions are then compressed in columnar form and fully encrypted using a secure key hierarchy.
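To make the role of micro-partition metadata concrete, here is a minimal Python sketch. It is an illustration only, not Snowflake's implementation: the class name MicroPartition, the functions load and query_range, the tiny partition size, and the choice of min/max values as the extracted metadata are all assumptions made for this example. It shows how metadata recorded at load time can let a query skip partitions that cannot contain matching rows.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class MicroPartition:
    """Hypothetical model: an immutable chunk of values plus min/max metadata."""
    rows: Tuple[int, ...]
    min_value: int
    max_value: int


def load(values: List[int], partition_size: int) -> List[MicroPartition]:
    """Split incoming values into partitions and record metadata at load time."""
    partitions = []
    for start in range(0, len(values), partition_size):
        chunk = tuple(values[start:start + partition_size])
        partitions.append(MicroPartition(chunk, min(chunk), max(chunk)))
    return partitions


def query_range(partitions: List[MicroPartition], low: int, high: int):
    """Return matching values and the number of partitions actually scanned."""
    scanned = 0
    matches = []
    for p in partitions:
        # Metadata check: skip partitions whose value range cannot match.
        if p.max_value < low or p.min_value > high:
            continue
        scanned += 1
        matches.extend(v for v in p.rows if low <= v <= high)
    return matches, scanned


# Assumed toy data: 100 ordered values split into 10 partitions.
partitions = load(list(range(1, 101)), partition_size=10)
matches, scanned = query_range(partitions, 42, 47)
print(matches, scanned)
```

In this toy run only one of the ten partitions is read, because the other nine are ruled out by their metadata alone.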
Snowflake Architecture Benefits
Transactional SQL Data Warehouse
- Separation of services from storage and compute allows multiple virtual warehouses to operate on the same data simultaneously. Concurrency is unlimited and can be automatically scaled with a multi-cluster warehouse.
- Activity in one virtual warehouse has zero impact on all other virtual warehouses. For example, data loading in a virtual warehouse does not impact performance on queries running in other virtual warehouses, even when they are accessing the same data.
- Full ACID transactional integrity is maintained across separate virtual warehouses. Queries always see a consistent view of data, while transaction commits are immediately visible to new queries running in all virtual warehouses.
- Zero-copy clones of databases or tables are created in a matter of seconds without incurring extra storage cost, because a new clone references the original object's data rather than copying it. The clone behaves as a full replica of the original object with an independent lifecycle.
- Time travel enables any SELECT statement or zero-copy clone to view the database in a consistent, “as of” state from the past over a user-determined retention period.
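The zero-copy clone and time travel behavior described above can be sketched with a toy model in Python. This is a simplified illustration under stated assumptions, not how Snowflake is implemented: the Table class, its method names, and the use of integer version numbers for the "as of" point are all hypothetical. The idea it demonstrates is that if stored partitions are immutable and a table version is just a list of partition references, then a clone only needs to copy references, and an older version remains readable.

```python
class Table:
    """Toy table: each version is an immutable tuple of partition ids."""

    def __init__(self, store, versions=None):
        self.store = store  # shared partition storage: id -> tuple of rows
        self.versions = versions if versions is not None else [()]

    def insert(self, rows):
        """Write a new immutable partition and publish a new table version."""
        pid = len(self.store)
        self.store[pid] = tuple(rows)
        self.versions.append(self.versions[-1] + (pid,))

    def select(self, as_of=None):
        """Read the latest version, or an earlier one (toy 'time travel')."""
        version = self.versions[-1 if as_of is None else as_of]
        return [row for pid in version for row in self.store[pid]]

    def clone(self, as_of=None):
        """Create a new table that references existing partitions (no copy)."""
        version = self.versions[-1 if as_of is None else as_of]
        return Table(self.store, [version])


store = {}
orders = Table(store)
orders.insert(["a", "b"])            # version 1
orders.insert(["c"])                 # version 2

snapshot = orders.clone()            # references partitions 0 and 1
partitions_after_clone = len(store)  # the clone wrote no new data

orders.insert(["d"])                 # the original keeps changing
snapshot.insert(["x"])               # the clone has its own lifecycle

print(orders.select())               # current state of the original
print(snapshot.select())             # current state of the clone
print(orders.select(as_of=1))        # the original as of version 1
```

After the clone is taken, changes to either table add new partitions for that table only, so the two diverge without affecting each other.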
Performance and Throughput
- Snowflake outperforms all other solutions. Compute resources scale linearly, while efficient query optimization delivers answers in a fraction of the time of legacy systems.
- Performance issues are addressed in seconds. Virtual warehouse size is specified based on the service level and performance required. It can be changed at any time, even while a warehouse is running.
- You pay only for the compute resources you use, because any virtual warehouse can be paused at any time.
- Multi-cluster warehouses deliver a consistent SLA to an unlimited number of concurrent users. As the concurrent workload increases, Snowflake automatically adds clusters to the virtual warehouse and distributes queries across those clusters. When the workload decreases, the clusters are paused. Charges only accrue for active clusters so you only pay for performance and throughput you need.
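As a rough illustration of the multi-cluster behavior described above, the following Python sketch computes how many clusters would be active for a changing workload and how that affects what is billed. Everything numeric here is an assumption for the example: the per-cluster capacity of 8 concurrent queries, the minimum and maximum cluster counts, the hourly workload figures, and the function names are hypothetical and do not describe Snowflake's actual scaling policy.

```python
import math


def clusters_needed(concurrent_queries, queries_per_cluster,
                    min_clusters, max_clusters):
    """Clusters required for the workload, clamped to the configured range."""
    wanted = math.ceil(concurrent_queries / queries_per_cluster)
    return max(min_clusters, min(max_clusters, wanted))


# Assumed workload: concurrent queries observed in each of six hours.
workload = [3, 10, 30, 50, 12, 2]

# Assumed settings: 8 queries per cluster, between 1 and 4 clusters.
active = [clusters_needed(q, 8, 1, 4) for q in workload]

# Charges accrue only for active clusters in this model.
billed_cluster_hours = sum(active)

# For comparison: keeping the maximum number of clusters running all day.
always_on_cluster_hours = 4 * len(workload)

print(active, billed_cluster_hours, always_on_cluster_hours)
```

In this toy run the warehouse grows during the busy hours and shrinks again afterward, so fewer cluster-hours are billed than if the maximum cluster count were kept running throughout.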
Storage and Support for All Data
- Storage is inexpensive and can scale indefinitely. Snowflake is the optimal platform for a data lake, delivering cost-effective and highly performant support for multi-petabyte databases. All storage costs are based on actual usage for compressed data and measured in TB stored per month.
- You can query both structured and machine-generated, semi-structured data (e.g. JSON, Avro, XML) using relational SQL operators with similar performance characteristics. Loading semi-structured data is painless. The schema is dynamic and is automatically discovered during load. This support for dynamic schemas enables efficient query execution using natural extensions to SQL.
- With Snowflake, there is no need to implement separate systems to process structured and semi-structured data. You can eliminate the complex Hadoop + data warehouse pipeline. Snowflake can perform both roles much more efficiently with better business results at a lower cost.
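The idea of discovering a dynamic schema while loading semi-structured data can be sketched in a few lines of Python. This is a conceptual illustration only, not Snowflake's loader: the sample records, the dotted-path notation, and the functions discover_schema and get_path are assumptions made for this example. It shows that records with differing fields can be loaded without a predefined schema, and that the observed paths can then be queried uniformly.

```python
import json


def discover_schema(records):
    """Map each dotted path to the set of Python type names seen under it."""
    schema = {}

    def walk(value, path):
        if isinstance(value, dict):
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        else:
            schema.setdefault(path, set()).add(type(value).__name__)

    for record in records:
        walk(record, "")
    return schema


def get_path(record, path):
    """Look up a dotted path; return None when the path is absent."""
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Assumed machine-generated input: two JSON records with different fields.
raw = [
    '{"device": {"id": 7, "os": "linux"}, "temp": 21.5}',
    '{"device": {"id": 9}, "temp": 19, "tags": ["lab"]}',
]
records = [json.loads(line) for line in raw]
schema = discover_schema(records)

for path in sorted(schema):
    print(path, sorted(schema[path]))

# Query one path across all records; missing values come back as None.
print([get_path(r, "device.os") for r in records])
```

The second record has no "device.os" field and adds a "tags" field, yet both records load and can be queried through the same paths.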
Availability and Security
- High availability is achieved using a scale-out architecture that is fully distributed across multiple Amazon availability zones. Snowflake can continue operations and withstand the loss of any Amazon availability zone. The system is designed to tolerate failures with minimal impact to the customer.
- Snowflake is secure by design. All data is encrypted in transit over the Internet and at rest on disk. Snowflake supports two-factor and federated authentication with single sign-on. Authorization is role-based. You can enable policies to limit access to predefined client addresses.
- Snowflake is SOC 2 Type 2 certified, and support for PHI for HIPAA customers is available with a Business Associate Agreement. Additional levels of security, such as encryption across all network communications and virtual private or dedicated isolation, are also available.
Seamless Database Sharing
- Snowflake’s unique built-for-the-cloud architecture enables you to share databases in your account with other Snowflake accounts.
- Database Sharing doesn’t require copying or transferring data to the consumer’s account: it functions like secure and curated access to your database.
- Accounts that receive shared data pay only for the compute resources they use to consume the data. Data from a shared database can easily be combined with data in another Snowflake database without onerous effort or third-party tools.
- Avoid the burdens and complexities of decades-old FTP and EDI technology. Simply decide what you want to share with your data consumers, and share the database through easy-to-use SQL commands.
We chose Snowflake to help power the web performance reporting and analytics that we provide our customers because it differentiated itself from the alternatives.
– Matt Solnit, Founder and VP of Engineering, Soasta
The whole company rests on top of Snowflake – all of our analytics. It’s the base for our entire strategy.
– Michael Bigby, CTO, Research Now
Inside the Data Warehouse Built for the Cloud
Snowflake's architecture departs from the limitations of traditional shared-disk and shared-nothing architectures. Learn more in our SIGMOD whitepaper.