
Create Snowflake table from Pandas DataFrame, explained (2022)
Working with Pandas DataFrames in Python is a common workflow, but say you want to take that data and create a new table in Snowflake for later use, what's the best way of doing that? That's what I'm going to be discussing in this post today.
Create Snowflake table from Pandas DataFrame
The current best practice for creating a Snowflake table from a Pandas DataFrame is:
- Install the required packages: pandas, snowflake-connector-python, sqlalchemy
- Connect to Snowflake using sqlalchemy
- Define what to do if the table already exists
- Write the DataFrame to Snowflake using to_sql from Pandas and pd_writer from the Python connector
to_sql, write_pandas, and pd_writer, what’s the deal?
When creating Snowflake tables from Pandas there are three methods or functions that get mentioned, which leads to a bit of confusion around what to use, when, and how. Here I will outline what these three functions do and how we will use them in this example:
write_pandas
write_pandas is a method in the Snowflake Connector for Python package which allows the user to append data from a DataFrame to an existing table in Snowflake. The advantage of write_pandas is that everything is contained within the Snowflake Connector; the tradeoff is that you can only append to tables that already exist. For this reason it is not helpful for our use case, so we will use to_sql instead.
to_sql
to_sql is a method in the Pandas package which can load a DataFrame to a SQL database. It requires a connection to your database, which is provided through SQLAlchemy. to_sql is what most people use to send data to Snowflake, as it can create a table if it doesn't exist yet and offers options for what to do if it does. For this reason, we will be using it in our example.
pd_writer
pd_writer is a function in the Snowflake Connector package which can be passed to the to_sql method to speed up the insertion into your Snowflake table. We will be using it in our example as it provides a significant performance improvement, especially for large datasets.
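Under the hood, the method parameter of to_sql accepts any callable with the signature (table, conn, keys, data_iter), and pd_writer is simply one such callable. To make the mechanism concrete, here is a minimal sketch of a custom insertion method, run against an in-memory SQLite engine rather than Snowflake so that no credentials are needed:

```python
import pandas as pd
from sqlalchemy import create_engine

def counting_insert(table, conn, keys, data_iter):
    # Same signature that pd_writer implements: receive the rows and insert them.
    rows = [dict(zip(keys, row)) for row in data_iter]
    print(f"inserting {len(rows)} rows into {table.name}")
    conn.execute(table.table.insert(), rows)

engine = create_engine("sqlite://")  # stand-in for the Snowflake engine
df = pd.DataFrame({"name": ["Stephen", "Jane"], "city": ["Oslo", "Stockholm"]})
df.to_sql("cities", engine, if_exists="replace", index=False, method=counting_insert)
```

pd_writer does the same job but stages the DataFrame as Parquet and issues a COPY INTO behind the scenes, which is why it is so much faster for large datasets.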
Install SQLAlchemy, Snowflake Connector for Python and Pandas to your machine
In order to execute the code described in this post you first need to install some required packages in your environment or machine. These are:
pandas
snowflake-connector-python
sqlalchemy
The best practice is to create a separate Python environment for each project, so I will first create a Conda environment and then install the required packages:
conda create -n env_name python=3.9 pandas sqlalchemy
conda activate env_name
pip install "snowflake-connector-python[pandas]"
Connect to Snowflake using SQLAlchemy
Now onto the Python code. First of all you need to connect to Snowflake using your credentials. We do this by using the SQLAlchemy package to create the engine which allows us to interact with Snowflake.
from sqlalchemy import create_engine
#Create connection to Snowflake using your account and user
account_identifier = '<account_identifier>'
user = '<user_login_name>'
password = '<password>'
database_name = '<database_name>'
schema_name = '<schema_name>'
conn_string = f"snowflake://{user}:{password}@{account_identifier}/{database_name}/{schema_name}"
engine = create_engine(conn_string)
Entering Snowflake credentials in Python
As you can see, your Snowflake credentials are required to create this connection. You can of course decide how best to supply them for your project, whether that be environment variables or something else, but given that this is sensitive information you should remember the golden rule:
Do not enter your credentials in plain text, especially when working within a Git repository
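One common approach is to read the credentials from environment variables so they never appear in the code or the Git history. A minimal sketch (the variable names here are my own choice, not a Snowflake convention):

```python
import os

def build_conn_string() -> str:
    # Read credentials from the environment instead of hard-coding them;
    # raises KeyError if any variable is missing, which fails fast and loudly.
    user = os.environ["SNOWFLAKE_USER"]
    password = os.environ["SNOWFLAKE_PASSWORD"]
    account = os.environ["SNOWFLAKE_ACCOUNT"]
    database = os.environ["SNOWFLAKE_DATABASE"]
    schema = os.environ["SNOWFLAKE_SCHEMA"]
    return f"snowflake://{user}:{password}@{account}/{database}/{schema}"

# engine = create_engine(build_conn_string())
```

Set the variables in your shell or a local, git-ignored .env file before running the script.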
Gaining write access to Snowflake tables
OK, one last note on connecting to your instance, I promise. Your Snowflake user will have a certain level of access which grants you rights to certain databases, schemas, and tables. So make sure that you have write access to the database and schema you are interacting with.
What if the Snowflake table already exists?
If the table you want to create already exists then we need to decide what to do with it. to_sql has a parameter if_exists for this eventuality, and there are three options that can be passed in:
- append: Add the data at the end of the existing table
- replace: Drop the table and create a new one in its place
- fail: Raise an error
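These options behave the same against any SQLAlchemy backend, so you can try them out locally before touching Snowflake. A quick sketch using an in-memory SQLite engine as a stand-in:

```python
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("sqlite://")  # stand-in for the Snowflake engine
df = pd.DataFrame({"name": ["Stephen"], "city": ["Oslo"]})

df.to_sql("cities", engine, if_exists="replace", index=False)  # creates the table
df.to_sql("cities", engine, if_exists="append", index=False)   # adds a second copy of the row

try:
    df.to_sql("cities", engine, if_exists="fail", index=False)
except ValueError as err:
    print(f"refusing to overwrite: {err}")
```

Note that fail surfaces as a ValueError from Pandas, so you can catch it and decide how to proceed.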
Use to_sql to create Snowflake table from Pandas
Now that we have a connection to Snowflake and have decided what to do if the table already exists, we can define the table name. Once this is decided we can let the to_sql method take our DataFrame, do some SQL magic in the background, and create our new table:
from snowflake.connector.pandas_tools import pd_writer
import pandas as pd
#Create your DataFrame
table_name = 'cities'
df = pd.DataFrame(data=[['Stephen','Oslo'],['Jane','Stockholm']],columns=['Name','City'])
#What to do if the table exists? replace, append, or fail?
if_exists = 'replace'
#Write the data to Snowflake, using pd_writer to speed up loading
with engine.connect() as con:
    # index=False avoids writing the DataFrame index as an extra column
    df.to_sql(name=table_name.lower(), con=con, if_exists=if_exists, index=False, method=pd_writer)
You may have noticed that we use table_name.lower() as the table name in the method. This is because the Snowflake SQLAlchemy dialect expects table names in lower case; passing a name containing upper case characters returns an error.
Create empty Snowflake table from Pandas DataFrame
There could be situations that arise where you need to create an empty table using just the column definition of the DataFrame, for example to prepare for a larger data import from an external stage. This is simple to do with just a minor modification:
from snowflake.connector.pandas_tools import pd_writer
import pandas as pd
#Create your DataFrame
table_name = 'cities'
df = pd.DataFrame(data=[['Stephen','Oslo'],['Jane','Stockholm']],columns=['Name','City'])
#Only take the columns of the DataFrame
df = df.head(0)
#What to do if the table exists? replace, append, or fail?
if_exists = 'replace'
#Write the data to Snowflake, using pd_writer to speed up loading
with engine.connect() as con:
    # index=False avoids writing the DataFrame index as an extra column
    df.to_sql(name=table_name.lower(), con=con, if_exists=if_exists, index=False, method=pd_writer)
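The df.head(0) trick isn't Snowflake-specific, so you can verify locally exactly what it produces. A sketch against an in-memory SQLite engine:

```python
import pandas as pd
from sqlalchemy import create_engine, inspect

engine = create_engine("sqlite://")  # stand-in for the Snowflake engine
df = pd.DataFrame(data=[["Stephen", "Oslo"], ["Jane", "Stockholm"]], columns=["name", "city"])

# head(0) keeps the column definitions but drops every row
df.head(0).to_sql("cities", engine, if_exists="replace", index=False)

columns = [col["name"] for col in inspect(engine).get_columns("cities")]
row_count = int(pd.read_sql("select count(*) as n from cities", engine)["n"].iloc[0])
```

The resulting table has the DataFrame's columns but zero rows, ready for a bulk load from an external stage.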
Create Snowflake table from Pandas DataFrame, tying it all together
Now that we have explored all aspects of the task at hand, it's time to bring it all together into one code snippet:
from snowflake.connector.pandas_tools import pd_writer
import pandas as pd
from sqlalchemy import create_engine
#Create connection to Snowflake using your account and user
account_identifier = '<account_identifier>'
user = '<user_login_name>'
password = '<password>'
database_name = '<database_name>'
schema_name = '<schema_name>'
conn_string = f"snowflake://{user}:{password}@{account_identifier}/{database_name}/{schema_name}"
engine = create_engine(conn_string)
#Create your DataFrame
table_name = 'cities'
df = pd.DataFrame(data=[['Stephen','Oslo'],['Jane','Stockholm']],columns=['Name','City'])
#What to do if the table exists? replace, append, or fail?
if_exists = 'replace'
#Write the data to Snowflake, using pd_writer to speed up loading
with engine.connect() as con:
    # index=False avoids writing the DataFrame index as an extra column
    df.to_sql(name=table_name.lower(), con=con, if_exists=if_exists, index=False, method=pd_writer)
Related articles
Write Pandas data to Snowflake table
Connect to Snowflake from Python
References
write_pandas documentation
Python connector installation documentation
Pandas to_sql documentation
Snowflake pd_writer documentation
SQLAlchemy create engine documentation