Connecting a Jupyter Notebook to Snowflake through Python (Part 3)

In part two of this four-part series, we learned how to create a Sagemaker Notebook instance. In part three, we’ll learn how to connect that Sagemaker Notebook instance to Snowflake. If you’ve completed the steps outlined in part one and part two, the Jupyter Notebook instance is up and running and you have access to your Snowflake instance, including the demo dataset. Now, you’re ready to connect the two platforms.

You can review the entire blog series here: Part One > Part Two > Part Three > Part Four.  

The Sagemaker Console

The first step is to open the Jupyter service using the link on the Sagemaker console.


There are two options for creating a Jupyter Notebook. You can create the notebook from scratch by following the step-by-step instructions below, or you can download sample notebooks here. If you decide to build the notebook from scratch, select the conda_python3 kernel. Alternatively, if you decide to work with a pre-made sample, make sure to upload it to your Sagemaker notebook instance first.

There are several options for connecting Sagemaker to Snowflake. The simplest way to get connected is through the Snowflake Connector for Python. Keep in mind that the connector doesn’t come pre-installed with Sagemaker, so you will need to install it through the Python package manager. (Note: As of the writing of this post, the Snowflake Python connector depends on the C Foreign Function Interface (CFFI) package and requires cffi version 1.11 or higher.) Sagemaker, on the other hand, comes pre-installed with cffi 1.10, which unfortunately causes the Snowflake Python connector to fail.

The step outlined below automatically detects the failure and, if necessary, uninstalls the old cffi package and installs the most recent version. Note: If re-installation of the cffi package is necessary, you must also restart the kernel so the new version is picked up.

%%bash
# Check which version of the cffi package is currently installed
CFFI_VERSION=$(pip list 2>/dev/null | grep cffi)
echo $CFFI_VERSION

# The pre-installed cffi 1.10.0 is too old for the Snowflake connector; remove it
if [[ "$CFFI_VERSION" == "cffi (1.10.0)" ]]
then
   pip uninstall --yes cffi
fi

# Install the build dependencies and the Snowflake Connector for Python
# (pip pulls in a current cffi version as a dependency)
yum_log=$(sudo yum install -y libffi-devel openssl-devel)
pip_log=$(pip install --upgrade snowflake-connector-python)

# If cffi was replaced, the kernel must be restarted to pick up the new version
if [[ "$CFFI_VERSION" == "cffi (1.10.0)" ]]
then
   echo "configuration has changed; restart notebook"
fi

This problem recurs each time the Sagemaker server is shut down and restarted, and the Jupyter kernel needs to be restarted after each fix. The good news is that the cffi package will eventually be updated on the Sagemaker AMI, at which point the step above will automatically recognize the newer version and skip the re-installation.


The next step is to connect to the Snowflake instance with your credentials.

import snowflake.connector
# Connecting to Snowflake using the default authenticator
ctx = snowflake.connector.connect(
  user=<user>,
  password=<password>,
  account=<account>
)

Here you have the option to hard code all credentials and other specific information, including the S3 bucket names. However, for security reasons, it’s advisable to not store credentials in the notebook. Another option is to enter your credentials every time you run the notebook.
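If you choose the interactive approach, a minimal sketch using Python’s built-in getpass module might look like the following (the variable names are only examples):

import getpass
import snowflake.connector

# Prompt for connection details at runtime so nothing is stored in the notebook
sf_account = input("Snowflake account: ")
sf_user = input("Snowflake user: ")
sf_password = getpass.getpass("Snowflake password: ")   # input is hidden

ctx = snowflake.connector.connect(
    user=sf_user,
    password=sf_password,
    account=sf_account
)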

Rather than storing credentials directly in the notebook, I opted to store a reference to the credentials. The actual credentials are automatically stored in a secure key/value management system called AWS Systems Manager Parameter Store (SSM).  

As with most AWS services, the first step is to set up permissions for SSM through AWS IAM. Please ask your AWS security admin to create a policy with the following Actions on KMS and SSM.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "ssm:PutParameter",
                "ssm:DeleteParameter",
                "kms:Decrypt",
                "ssm:GetParameterHistory",
                "ssm:GetParametersByPath",
                "ssm:GetParameters",
                "ssm:GetParameter",
                "ssm:DeleteParameters"
            ],
            "Resource": [
                "arn:aws:kms:<region>:<accountid>:key/SNOWFLAKE/*",
                "arn:aws:ssm:<region>:<accountid>:parameter/SNOWFLAKE/*"
            ]
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "ssm:DescribeParameters",
                "kms:ListAliases"
            ],
            "Resource": "*"
        }
    ]
}

Adhering to the best-practice principle of least privilege, I recommend limiting the allowed Actions to the specific Resources that need them. Also, be sure to replace <region> and <accountid> in the policy shown above or, alternatively, grant access to all resources (i.e. “*”).

In the code segment shown above, I created a root name of “SNOWFLAKE”. This is only an example. You’re free to create your own, unique naming convention.

Next, check the permissions for your login. Assuming the new policy has been called SagemakerCredentialsPolicy, it should appear among the policies attached to your IAM user or role.

With the SagemakerCredentialsPolicy in place, you’re ready to begin configuring all your secrets (i.e. credentials) in SSM. Be sure to use the same namespace that you used in the credentials policy as the prefix for your secrets.
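One way to create these key/value pairs is directly from the notebook with boto3 (the SSM console or the AWS CLI works just as well). The minimal sketch below assumes the /SNOWFLAKE/ namespace from the policy above and uses placeholder values that you would replace with your own:

import boto3

ssm = boto3.client('ssm', 'us-east-1')

# Non-sensitive configuration values stored as plain strings
for name, value in [
    ('/SNOWFLAKE/ACCOUNT_ID', '<account>'),
    ('/SNOWFLAKE/USER_ID', '<user>'),
    ('/SNOWFLAKE/WAREHOUSE', '<warehouse>'),
    ('/SNOWFLAKE/DATABASE', '<database>'),
    ('/SNOWFLAKE/SCHEMA', '<schema>'),
]:
    ssm.put_parameter(Name=name, Value=value, Type='String', Overwrite=True)

# The password is stored as a SecureString so SSM encrypts it via KMS
ssm.put_parameter(Name='/SNOWFLAKE/PASSWORD', Value='<password>',
                  Type='SecureString', Overwrite=True)

# Create any remaining parameters (URL, BUCKET, PREFIX, etc.) the same way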

After setting up your key/value pairs in SSM, use the following step to read the key/value pairs into your Jupyter Notebook.

import boto3

params = ['/SNOWFLAKE/URL', '/SNOWFLAKE/ACCOUNT_ID',
          '/SNOWFLAKE/USER_ID', '/SNOWFLAKE/PASSWORD',
          '/SNOWFLAKE/DATABASE', '/SNOWFLAKE/SCHEMA',
          '/SNOWFLAKE/WAREHOUSE', '/SNOWFLAKE/BUCKET',
          '/SNOWFLAKE/PREFIX']
region = 'us-east-1'

def get_credentials(params):
    ssm = boto3.client('ssm', region)
    response = ssm.get_parameters(
        Names=params,
        WithDecryption=True
    )
    # Build dict of credentials
    param_values = {k['Name']: k['Value'] for k in response['Parameters']}
    return param_values

param_values = get_credentials(params)

Instead of hard coding the credentials, you can reference key/value pairs via the variable param_values. In addition to the credentials (account_id, user_id, password), I also stored the warehouse, database and schema.

import snowflake.connector
# Connecting to Snowflake using the default authenticator
ctx = snowflake.connector.connect(
  user=param_values['/SNOWFLAKE/USER_ID'],
  password=param_values['/SNOWFLAKE/PASSWORD'],
  account=param_values['/SNOWFLAKE/ACCOUNT_ID'],
  warehouse=param_values['/SNOWFLAKE/WAREHOUSE'],
  database=param_values['/SNOWFLAKE/DATABASE'],
  schema=param_values['/SNOWFLAKE/SCHEMA']
)

Now, you’re ready to read data from Snowflake. To illustrate the benefits of using data in Snowflake, we will read semi-structured weather data from the SNOWFLAKE_SAMPLE_DATA database.

When data is stored in Snowflake, you can use the Snowflake JSON parser and the SQL engine to easily query, transform, cast and filter JSON data before it gets to the Jupyter Notebook.

From the JSON documents stored in WEATHER_14_TOTAL, the following step retrieves the minimum and maximum temperature values (converted to Fahrenheit), a timestamp, and the latitude/longitude coordinates for New York City.

cs = ctx.cursor()
allrows = cs.execute(
    "select (V:main.temp_max - 273.15) * 1.8000 + 32.00 as temp_max_far, " +
    "       (V:main.temp_min - 273.15) * 1.8000 + 32.00 as temp_min_far, " +
    "       cast(V:time as timestamp) time, " +
    "       V:city.coord.lat lat, " +
    "       V:city.coord.lon lon " +
    "from snowflake_sample_data.weather.weather_14_total " +
    "where v:city.name = 'New York' " +
    "and   v:city.country = 'US' ").fetchall()

The final step converts the result set into a Pandas DataFrame, which is suitable for machine learning algorithms.

import pandas as pd                               # For munging tabular data

data = pd.DataFrame(allrows)
data.columns=['temp_max_far','temp_min_far','time','lat','lon']
pd.set_option('display.max_columns', 500)     # Make sure we can see all of the columns
pd.set_option('display.max_rows', 10)         # Keep the output on one page
data


Conclusion

Now that we’ve connected a Jupyter Notebook in Sagemaker to the data in Snowflake using the Snowflake Connector for Python, we’re ready for the final stage: Connecting Sagemaker and a Jupyter Notebook to both a local Spark instance and a multi-node EMR Spark cluster. I’ll cover how to accomplish this connection in the fourth and final installment of this series — Connecting a Jupyter Notebook to Snowflake via Spark.

You can review the entire blog series here: Part One > Part Two > Part Three > Part Four.  



Quickstart Guide for Sagemaker + Snowflake (Part One)

Machine Learning (ML) and predictive analytics are quickly becoming irreplaceable tools for small startups and large enterprises. The questions that ML can answer are boundless. For example, you might want to ask, “What job would appeal to someone based on their interests or the interests of jobseekers like them?” Or, “Is this attempt to access the network an indication of an intruder?” Or, “What type of credit card usage indicates fraud?”

Setting up a machine learning environment, in particular for on-premise infrastructure, has its challenges. An infrastructure team must request physical and/or virtual machines and then build and integrate those resources. This approach is both time-consuming and error prone due to the number of manual steps involved. It may work in a small environment, but the task becomes exponentially more complicated and impractical at scale.

There are many different ML systems to choose from, including TensorFlow, XGBoost, Spark ML and MXNet, to name a few. They all come with their own installation guides, system requirements and dependencies. What’s more, implementation is just the first step. The next challenge is figuring out how to make the output from the machine learning step (e.g. a model) available for consumption. Then, all of the components for building a model within the machine learning tier and the access to the model in the API tier need to scale to provide predictions in real-time. Last but not least, the team needs to figure out where to store all the data needed to build the model.

Managing this whole process from end-to-end becomes significantly easier when using cloud-based technologies. The ability to provision infrastructure on demand (IaaS) solves the problem of manually requesting virtual machines. It also provides immediate access to compute resources whenever they are needed. But that still leaves the administrative overhead of managing the ML software and the platform to store and manage the data.   

At last year’s AWS developer conference, AWS announced Sagemaker, a “Fully managed end-to-end machine learning service that enables data scientists, developers and machine learning experts to quickly build, train and host machine learning models at scale.”

Sagemaker, through its underlying kernels (Python, PySpark, Spark and R), can access data from many different sources, including data provided by Snowflake. Storing data in Snowflake also has significant advantages.

Single source of truth

If data is stored in multiple locations, inevitably those locations will get out of sync. Even if data is supposed to be immutable, often one location is modified to fix a problem for one system while other locations are not. In contrast, if data is stored in one central, enterprise-grade, scalable repository, it serves as a “single source of truth” because keeping data in sync is made easy. Different tools are not required for structured and semi-structured data. Data can be modified transactionally, which immediately reduces the risk of problems.

Shorten the data preparation cycle

According to a study published in Forbes, data preparation accounts for about 80% of the work performed by data scientists. Shortening the data preparation cycle therefore has a major impact on the overall efficiency of data scientists.

Snowflake is uniquely positioned to shorten the data preparation cycle due to its excellent support for both structured and semi-structured data in a single language, SQL. This means that semi-structured data and structured data can be seamlessly parsed, combined, joined and modified through SQL statements in set-based operations. This enables data scientists to use the power of a full SQL engine for rapid data cleansing and preparation.

Scale as you go

Another problem that ML implementers frequently encounter is what we at Snowflake call “works on my machine” syndrome. Small datasets work easily on a local machine, but at production dataset sizes, reading all the data into a single machine doesn’t scale and may behave unexpectedly. Even if the job does finish, it can take hours to load a terabyte-sized dataset. In Snowflake there is no infrastructure that needs to be provisioned, and Snowflake’s elasticity allows you to scale horizontally as well as vertically, all with the push of a button.

Connecting Sagemaker and Snowflake

Sagemaker and Snowflake both utilize cloud infrastructure-as-a-service offerings from AWS, which enables us to build the infrastructure when we need it, where we need it (geographically) and at any scale required.

Since Sagemaker and Snowflake make building the individual services simple, the question becomes how to connect the two. And that’s exactly the subject of the following parts of this post. How do you get started? What additional configuration do you need in AWS for security and networking? How do you store credentials?

In part two of this four-part blog, I’ll explain how to build a Sagemaker ML environment in AWS from scratch. In the third post, I will put it all together and show you how to connect a Jupyter Notebook to Snowflake via the Snowflake Python connector.  With the Python connector, you can import data from Snowflake into a Jupyter Notebook. Once connected, you can begin to explore data, run statistical analysis, visualize the data and call the Sagemaker ML interfaces.

However, to perform any analysis at scale, you really don’t want to use a single-server setup like Jupyter running a Python kernel. Jupyter running a PySpark kernel against a Spark cluster on EMR is a much better solution for that use case. So, in part four of this series I’ll connect a Jupyter Notebook to a local Spark instance and an EMR cluster using the Snowflake Spark connector.

The Snowflake difference

Snowflake is the only data warehouse built for the cloud. Snowflake delivers the performance, concurrency and simplicity needed to store and analyze all data available to an organization in one location. Snowflake’s technology combines the power of data warehousing, the flexibility of big data platforms, the elasticity of the cloud, and live data sharing at a fraction of the cost of traditional solutions. Snowflake: Your data, no limits.

You can review the entire blog series here: Part One > Part Two > Part Three > Part Four.  

New Snowflake features released in Q1’17

We recently celebrated an important milestone in reaching 500+ customers since Snowflake became generally available in June 2015. As companies of all sizes increasingly adopt Snowflake, we wanted to look back and provide an overview of the major new Snowflake features we released during Q1 of this year, and highlight the value these features provide for our customers.

Expanding global reach and simplifying on-boarding experience

Giving our customers freedom of choice, along with a simple, secure, and guided “Getting Started” experience, was a major focus of the last quarter.

  • We added a new region outside of the US; customers now have the option to analyze and store their data in Snowflake accounts deployed in EU-Frankfurt. Choosing the appropriate region is integrated into our self-service portal when new customers sign up.
  • In addition, we added our high-value product editions, Enterprise and Enterprise for Sensitive Data (ESD), to our self-service offerings across all available regions. For example, with Enterprise, customers can quickly implement auto-scale mode for multi-cluster warehouses to support varying, high concurrency workloads. And customers requiring HIPAA compliance can choose ESD.
  • Exploring other venues for enabling enterprises to get started quickly with Snowflake, we partnered with the AWS Marketplace team to include our on-demand Snowflake offerings, including the EU-Frankfurt option, in their newly-launched SaaS subscriptions.

Improving out-of-the-box performance & SQL coverage

We are committed to building the fastest cloud DW for your concurrent workloads with the SQL you love.

  • One key performance improvement introduced this quarter was the reduction of compilation times for JSON data. Internal TPC-DS tests demonstrate a reduction of 30-60% for most of the TPC-DS queries (single stream on a single, 100TB JSON table). In parallel, we worked on improving query compile time in general, providing up to a 50% improvement in performance for short queries.
  • Another new key capability is the support for bulk data inserts on a table concurrently with other DML operations (e.g. DELETE, UPDATE, MERGE). By introducing more fine-grained locking at the micro-partition level, we are able to allow concurrent DML statements on the same table.
  • To improve our data clustering feature (currently in preview), we added support for specifying expressions on table columns in clustering keys. This enables more fine-grained control over the data in the columns used for clustering.
  • Also, we reduced the startup time for virtual warehouses (up to XL in size) to a few seconds, ensuring almost instantaneous provisioning for most virtual warehouses.
  • We extended our SQL by adding support for the ANSI SQL TABLESAMPLE clause. This is useful when a user wants to limit a query operation performed on a table to only a random subset of rows from the table.
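For illustration, here is a minimal sketch of the clause run through the Python connector used elsewhere on this blog (the table name and credentials are hypothetical placeholders):

import snowflake.connector

# A regular connection; replace the placeholders with your own credentials
ctx = snowflake.connector.connect(user='<user>', password='<password>', account='<account>')

# TABLESAMPLE (10) includes each row with roughly 10% probability,
# returning an approximately 10% random subset of my_table
sample_rows = ctx.cursor().execute(
    "select * from my_table tablesample (10)"
).fetchall()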

Staying Ahead with Enterprise-ready Security

From day one, security has always been core to Snowflake’s design.

  • We expanded Snowflake’s federated authentication and single sign-on capability by integrating with many of the most popular SAML 2.0-compliant identity providers. In addition to Okta, Snowflake now supports ADFS/AD, Azure AD, Centrify, and OneLogin, to name just a few.
  • To advance Snowflake’s built-in auditing, we introduced new Information Schema table functions (LOGIN_HISTORY and LOGIN_HISTORY_BY_USER) that users can query to retrieve the short-term history of all successful and failed login requests in the previous 7 days. If required, users can maintain a long-term history by copying the output from these functions into regular SQL tables.
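As a sketch of how these table functions can be queried (again via the Python connector; the time range and selected columns are only illustrative):

import snowflake.connector

# A regular connection; replace the placeholders with your own credentials
ctx = snowflake.connector.connect(user='<user>', password='<password>', account='<account>')

# Successful and failed login attempts over the last 24 hours
logins = ctx.cursor().execute(
    "select event_timestamp, user_name, is_success, error_message "
    "from table(information_schema.login_history("
    "    dateadd('hours', -24, current_timestamp()), current_timestamp()))"
).fetchall()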

Improving our ecosystem

Enabling developers and builders to create applications with their favorite tools and languages remains a high priority for us.

  • With respect to enterprise-class ETL, we successfully collaborated with Talend in building a native Snowflake connector based on Talend’s new and modern connector SDK. The connector, currently in preview, has already been deployed by a number of joint customers with great initial feedback on performance and ease-of-use.
  • To tighten the integration of our Snowflake service with platforms suited for machine learning and advanced data transformations, we released a new version of our Snowflake Connector for Spark, drastically improving performance by pushing more query operations, including JOINs and various aggregation functions, down to Snowflake. Our internal 10 TB TPC-DS performance benchmark tests demonstrate that running TPC-DS queries using this new v2 Spark connector is up to 70% faster compared to executing SQL in Spark with Parquet or CSV (see this Blog post for details).
  • We continue to improve our drivers for our developer community. Listening to feedback from our large Python developer community, we worked on a new version of Snowflake’s native Python client driver, resulting in up to 40% performance improvements when fetching result sets from Snowflake. And, after we open-sourced our JDBC driver last quarter, we have now made the entire source code available on our official GitHub repository.
  • And, last, but not least, to enhance our parallel data loading via the COPY command, ETL developers can now dynamically add file metadata information, such as the actual file name and row number, which might not be part of the initial payload.
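A rough sketch of what this can look like, again via the Python connector; the stage, target table and column layout below are hypothetical:

import snowflake.connector

# A regular connection; replace the placeholders with your own credentials
ctx = snowflake.connector.connect(user='<user>', password='<password>', account='<account>')

# Load staged files and record, for every row, the source file name and the
# row number within that file via the METADATA$ columns
ctx.cursor().execute(
    "copy into my_table (src_file, src_row, col1, col2) "
    "from (select metadata$filename, metadata$file_row_number, t.$1, t.$2 "
    "      from @my_stage t)"
)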

Increasing transparency and usability

These features are designed to strike the right balance between offering a service that is easy to operate and exposing actionable insights into the running service.

  • One major addition to our service is Query Profile, now generally available and fully integrated into Snowflake’s web interface. Query Profile is a graphical tool you can use to detect performance bottlenecks and areas for improving query performance.
  • Various UI enhancements were implemented: Snowflake’s History page now supports additional filtering by the actual SQL text and query identifier. We also added UI support for creating a Parquet file format in preparation for loading Parquet data into variant-type table columns in Snowflake.
  • A new Information Schema table function (TABLE_STORAGE_METRICS) exposes information about the data storage for individual tables. In particular, a user can now better understand how tables are impacted by Continuous Data Protection, particularly Time Travel and Fail-safe retention periods, as well as which tables contain cloned data.
  • We also recently introduced smarter virtual warehouse billing through Warehouse Billing Continuation (see this Blog post for details). If a warehouse is suspended and resumed within 60 minutes of the last charge, we do not charge again for the servers in the warehouse. WBC eliminates additional credit charges, and we hope it will reduce the need for our customers to strictly monitor and control when warehouses are suspended and resized.

Scaling and investing in service robustness

These service enhancements aren’t customer visible, but are crucial for scaling to meet the demands of our rapidly growing base of customers.

  • As part of rolling out the new EU (Frankfurt) region, we increased automation of our internal deployment procedures to (a) further improve engineering efficiency while (b) laying the foundation for rapidly adding new regions based on customer feedback.
  • We further streamlined and strengthened our various internal testing and pre-release activities, allowing us to ship new features to our customers on a weekly basis – all in a fully transparent fashion with no downtime or impact to users.

Conclusion and Acknowledgements

This summary list of features delivered in Q1 highlights the high velocity and broad range of features the Snowflake Engineering Team has successfully delivered in a short period of time. We are committed to putting our customers first and maintaining this steady pace of shipping enterprise-ready features each quarter. Stay tuned for another feature-rich Q2.

For more information, please feel free to reach out to us at info@snowflake.net. We would love to help you on your journey to the cloud. And keep an eye on this blog or follow us on Twitter (@snowflakedb) to keep up with all the news and happenings here at Snowflake Computing.

SnowSQL Part I: Introducing SnowSQL – a modern command line tool built for the cloud

What & Why

Starting this month, we have introduced a new modern command line tool for an interactive query experience with the Snowflake Elastic Data Warehouse cloud service. Yes indeed – a new command line tool!

Now you might ask – why did Snowflake decide to invest in building a new SQL command line tool in the 21st century when there are so many different ways of accessing and developing against the Snowflake service? For example, you could use our current user interface, or various drivers, or other SQL editor tools available today.

The answer is simple:

  • First of all, developers still care very deeply about light-weight mechanisms to quickly ask SQL questions via a text-based terminal with simple text input and output.
  • Secondly, we recognized the need for a more modern command line tool designed for the cloud that is easy to use, built on high security standards, and is tightly integrated with the actual core Snowflake elastic cloud data service.
  • Finally, we wanted to build a command line tool which offers users more powerful scripting capabilities overall.   

The result of these efforts is SnowSQL – our new SQL command line tool that is entirely built in Python and leverages Snowflake’s Python connector underneath.

Looking at existing, more generic command line tools, such as SQLLine or HenPlus, it became obvious that these tools lack ease of use and do not offer sufficient scripting features. Also, Snowflake’s approach was to treat the command line tool as part of the overall cloud service. That is, SnowSQL can be seen as an extension of the Snowflake Elastic DW service. This cloud-first philosophy has important implications for the entire life cycle of Snowflake’s command line tool, including the agility of delivering new SnowSQL features as part of the frequent Snowflake service updates (please see the auto-upgrade section below).

Getting Started

To get started, you can download SnowSQL from the Snowflake UI after logging into your Snowflake account. SnowSQL is currently supported on all three major platforms including:

  • Linux 64-bit,
  • Mac OS X 10.6+, and
  • Windows 64-bit

We provide a native installer for each platform with easy-to-follow installation steps. Once you have installed SnowSQL, open a new terminal and type:

$ SNOWSQL_PWD=<password> snowsql -a <account_name> -u <user_name>

Upon first use, SnowSQL users will notice a progress bar indicating the download of the initial version of SnowSQL. This is a one-time operation. Any future downloads happen in the background and remain entirely transparent to the user. The ability to automatically and fully transparently upgrade the command line tool is one of the key benefits of building it as an extension of the cloud service rather than as a stand-alone software component interacting with a remote service (please see the auto-upgrade section below).

Using SnowSQL

There are a few capabilities we would like to highlight in this blog.  

SnowSQL Commands

First, SnowSQL offers a wide range of commands. As a general rule, all SnowSQL commands start with a bang character (‘!’). For a complete list of currently supported commands, please see our documentation here.

Auto-Complete and Syntax Highlighting

Secondly, with our context-sensitive auto-complete feature, SnowSQL users are relieved of the cumbersome and error-prone typing of long object names. Instead, they can complete SQL keywords and functions after typing the first three letters. By leveraging auto-complete, SnowSQL users can become more productive and quickly explore data in Snowflake. Furthermore, SQL statements are highlighted in different colors, resulting in better readability when interacting with a terminal.


Auto-Upgrade

Because SnowSQL is delivered as part of the service, the auto-upgrade framework enables users to always stay up to date with the latest Snowflake and SnowSQL features, streamlining the end-user experience. No additional downloads or tedious re-installations are needed. The upgrade is transparent to the end user and takes place as a background process when you start SnowSQL. The next time a user runs SnowSQL, the new version is picked up automatically, while their workflows remain unaffected and uninterrupted during the upgrade.

Secure Connection and Encryption

Finally, security is core to SnowSQL’s design. SnowSQL secures connections to Snowflake using TLS (Transport Layer Security) with OCSP (Online Certificate Status Protocol) checks. The auto-upgrade binaries are always validated using an RSA signature. In addition to the secure connection, SnowSQL provides end-to-end security for data moving in and out of Snowflake by using AES (Advanced Encryption Standard) encryption for Snowflake’s PUT and GET commands.

In the next few weeks, we will provide a deeper dive into some of the unique capabilities of SnowSQL. Please stay tuned as we gather and incorporate feedback from our customers. We would like to acknowledge our software engineers Shige Takeda (@smtakeda) and Baptiste Vauthey (@thabaptiser) for their main contributions.

Finally, all of this would not have been possible without the active Python community who inspired us in many ways and offered us tools and packages to build SnowSQL. Thank you.   

As always, keep an eye on this blog site, and our Snowflake Twitter feed (@SnowflakeDB) for updates on all the action and activities here at Snowflake Computing.