Create Snowflake table from Pandas DataFrame, explained (2022)

Working with Pandas DataFrames in Python is a common workflow, but say you want to take that data and create a new table in Snowflake with it for it later use, what’s the best way of doing that? That’s what I’m going to be discussing in this post today.

Create Snowflake table from Pandas DataFrame

The current best practice for creating a Snowflake table from a Pandas DataFrame is:

  1. Install the required packages pandas, snowflake-connector-python, sqlalchemy
  2. Connect to Snowflake using sqlalchemy
  3. Define what to do if the table already exists
  4. Write DataFrame to Snowflake using to_sql from Pandas and pd_writer from the Python connector

to_sql, write_pandas, and pd_writer, what’s the deal?

When creating tables Snowflake from Pandas there are three methods or functions that get mentioned, which leads to a bit of confusion around what to use, when, and how. Here I will outline what these three functions do and how we will use them in this example:

write_pandas

write_pandas is a method in the Snowflake Connector for Python package which allows the user to append data from a DataFrame to an existing table in Snowflake.

The positive of using write_pandas is that everything is contained within the Snowflake Connector, however the tradeoff is that you can only append tables that already exist.

For this reason it is of course not helpful for our use case so we will not be using this method and instead will use to_sql.

to_sql

to_sql is a method in the Pandas package which can load a DataFrame to a SQL database. It requires a connection to your database, which is provided through SQLAlchemy.

to_sql is what most people use to send data to Snowflake as it allows for the possibility of creating tables if they don’t currently exist yet, as well as options for what to do if it does exist. For this reason, we will be using it in our example.

pd_writer

pd_writer is a function in the Snowflake Connector package which can be used in the to_sql method to speed up the insertion into your Snowflake table. We will be using this in our example as it provides a significant performance improvement, especially for large datasets.

Install SQLAlchemy, Snowflake Connector for Python and Pandas to your machine

In order to execute the code described in this post you need to first install some required packages to your environment or machine, these are:

  • pandas
  • snowflake-connector-python
  • sqlalchemy

The best practice is to create a separate Python environment for each project, so I will first create a Conda environment and then install the required packages:

conda create -n env_name python=3.9 pandas sqlalchemy
conda activate env_name
pip install "snowflake-connector-python[pandas]"

Connect to Snowflake using SQLAlchemy

Now onto the Python code. First of all you need to connect to Snowflake using your credentials. We do this by using the SQLAlchemy package to create the engine which allows us to interact with Snowflake.

from sqlalchemy import create_engine

#Create connection to Snowflake using your account and user

account_identifier = '<account_identifier>'
user = '<user_login_name>'
password = '<password>'
database_name = '<database_name>'
schema_name = '<schema_name>'

conn_string = f"snowflake://{user}:{password}@{account_identifier}/{database_name}/{schema_name}"
engine = create_engine(conn_string)

Entering Snowflake credentials in Python

As you can see, your Snowflake credentials are required to create this connection. You can of course decide what the best insertion method is for your project, whether that be environment variables or something else, but given that this is sensitive information you should remember the golden rule:

Do not enter your credentials in plain text, especially when working within a Git repository

Gaining write access to Snowflake tables

OK, one last note on connecting to your instance, I promise. Your Snowflake user will have a certain level of access which grants you rights to certain databases, schemas, and tables. So make sure that you have write access to the database and schema you are interacting with.

What if the Snowflake table already exists?

If the table you want to create already exists then we need to know what to do with it. to_sql has a parameter if_exists for this eventuality, and there are three options that can be parsed in:

  • append: Add the data at the end of the existing table
  • replace: Drop the table and create a new one in it’s place
  • fail: Return an error

Use to_sql to create Snowflake table from Pandas

Now that we have a connection to Snowflake and decided what to do if the table already exists, we can define the table name. Once this is decided upon then we can let the to_sql method take our DataFrame, do some SQL magic in the background, and create our new table:

from snowflake.connector.pandas_tools import pd_writer
import pandas as pd

#Create your DataFrame

table_name = 'cities'
df = pd.DataFrame(data=[['Stephen','Oslo'],['Jane','Stockholm']],columns=['Name','City'])

#What to do if the table exists? replace, append, or fail?

if_exists = 'replace'

#Write the data to Snowflake, using pd_writer to speed up loading

with engine.connect() as con:
        df.to_sql(name=table_name.lower(), con=con, if_exists=if_exists, method=pd_writer)

You may have noticed that we use table_name.lower() as the table name in the method, this is because Snowflake requires only lower case table names, otherwise an error is returned.

Create empty Snowflake table from Pandas DataFrame

There could be situations that arise where you need to create an empty table using just the column definition of the DataFrame, for example to prepare for a larger data import from an external stage. This is simple to do with just a minor modification:

from snowflake.connector.pandas_tools import pd_writer
import pandas as pd

#Create your DataFrame

table_name = 'cities'
df = pd.DataFrame(data=[['Stephen','Oslo'],['Jane','Stockholm']],columns=['Name','City'])

#Only take the columns of the DataFrame

df = df.head(0)

#What to do if the table exists? replace, append, or fail?

if_exists = 'replace'

#Write the data to Snowflake, using pd_writer to speed up loading

with engine.connect() as con:
        df.to_sql(name=table_name.lower(), con=con, if_exists=if_exists, method=pd_writer)

Create Snowflake table from Pandas DataFrame, tying it all together

Now that we have explored all aspects of the task as hand, it’s time to bring it all together into one code snippet:

from snowflake.connector.pandas_tools import pd_writer
import pandas as pd
from sqlalchemy import create_engine

#Create connection to Snowflake using your account and user

account_identifier = '<account_identifier>'
user = '<user_login_name>'
password = '<password>'
database_name = '<database_name>'
schema_name = '<schema_name>'

conn_string = f"snowflake://{user}:{password}@{account_identifier}/{database_name}/{schema_name}"
engine = create_engine(conn_string)

#Create your DataFrame

table_name = 'cities'
df = pd.DataFrame(data=[['Stephen','Oslo'],['Jane','Stockholm']],columns=['Name','City'])

#What to do if the table exists? replace, append, or fail?

if_exists = 'replace'

#Write the data to Snowflake, using pd_writer to speed up loading

with engine.connect() as con:
        df.to_sql(name=table_name.lower(), con=con, if_exists=if_exists, method=pd_writer)

Write Pandas data to Snowflake table
Connect to Snowflake from Python

References

write_pandas documentation
Python connector installation documentation
Pandas to_sql documentation
Snowflake pd_writer documentation
SQLAlchemy create engine documentation

Stephen Allwright

Stephen Allwright

I'm a Data Scientist currently working for Oda, an online grocery retailer, in Oslo, Norway. These posts are my way of sharing some of the tips and tricks I've picked up along the way.
Oslo, Norway