Fit Predict #1: Evolving from Pandas to Polars

In this issue of the Fit Predict Data Science newsletter, we look at Snowflake extensions for VSCode, Reinforcement Learning at Spotify and more.

Stephen Allwright
📥 What is Fit Predict?
This is a light-hearted overview of what's been going on in the world of Data Science this week. See it as your 5-minute update so that you can sound at least slightly knowledgeable at your next coffee chat ☕

Have you been forwarded this? You can subscribe here!

Hey there,

I'm back in Norway after spending the Christmas break in England, where I was trying ever so hard to avoid the urge to work on a little (OK, maybe not so little) side project. When the idea popped into my mind, I believe I even uttered the famous phrase:

"that would be a fun side project and shouldn't take too long either"

It turns out that wasn't true.

Even after all these years, I still clearly have not learnt that projects always take longer than you expect!

But anyway, enough of me rambling about not being able to set good boundaries. Let's get to the main reason we are here...

🧰 Tools

The tools that will make your life that little bit easier, or at least more interesting... but either way it's fun to play with new toys.

Snowflake extension for VSCode

This neat little extension allows you to interact with your Snowflake instance in VSCode, with many of the same features you would use in their GUI.

Polars

Polars is a fast DataFrame library for Python, written in Rust. It's been the talk of the town lately in my company and could be the answer for those of us who struggle with slow operations on large datasets in Pandas.

PRQL

Some more Rust for you! PRQL is a query language pitched as a SQL replacement. Its compiler is written in Rust and turns PRQL into SQL, and it is supposed to be simpler to write and easier to analyse data with.

🧑‍🔬 In practice

Stories of those who are genuinely implementing Data Science. Step aside, Titanic dataset, this is the real deal.

Reinforcement Learning for Personalization at Spotify

In this interview on the TWIML AI Podcast, Tony Jebara (Head of Machine Learning) from Spotify explains how they use reinforcement learning to improve their personalized recommendations.

Machine Learning at Monzo in 2022

This blog post from Neal Lathia (Staff Machine Learning Engineer) is a good reminder of the choices and tradeoffs we must make when machine learning becomes an established part of the business.

🐦 The best of Data Twitter

Data Twitter is the best Twitter.

It's an oldie but a goldie. The recent popularity of ChatGPT means it's time to bring out this evergreen tweet.

2022: “WOW you can write a prompt and an AI will draw it!”

2028: “You want to write a prompt? First you need to hire 10-15 promptOps Engineers to build out your PromptFlow pipelines which sends promptjobs to your PromptLake from the PromptQueue using the EventPrompt stream”

from @chrisalbon

Could this be the next big breakthrough in machine learning? Pigeon learning 🐦

new Deep Learning benchmark: your model doesn't have to be perfect but it does have to be better, cheaper, and more carbon efficient than a pigeon!

from @ChelseaParlett

Simple machine learning tutorials are good, but it's helpful to be reminded of how real-world data science looks.

What beginners think ML pipelines are:

• Data comes in
• Model makes predictions
• Done

What they actually are:

• Raw data comes in
• Goes through multiple transformation layers
• Quality checks, anomaly detection, etc
• Feature engineering

(continued)

from @marktenenholtz

Who doesn't enjoy a good data visualisation? 📊

New personal viz #quantifiedself 🤓

My goal this year was to get a little lighter and fitter. This #dataviz illustrates how I went above and beyond that goal, and got addicted to cardio in the process! 🕺🏻

Who's ready for a fit 2023? 🙃

from @parabolestudio

💭 Thought-provoking

Content to inspire, or at the very least keep you informed.

A recent TalkPython podcast covered the topic of using the command line for data science 🤯. There's a lot here that I didn't know was possible before, so I guarantee you will come away with at least one new trick after watching this.

Data Science from the Command Line
When you think data science, Jupyter notebooks and associated tools probably come to mind. But I want to broaden your toolset a bit and encourage you to look around at other tools that are literally at your fingertips. The terminal and shell command line tools. On this episode, you’ll meet Je…

How could homework work at educational institutions in the age of ChatGPT and other AI language models? Well, here is one suggestion:

AI Homework
The first obvious casualty of large language models is homework: the real training for everyone, though, and the best way to leverage AI, will be in verifying and editing information.

🔧 Updates

Did you know that your favourite Python packages actually get updated regularly and you should update your requirements.txt file?

scikit-learn 1.2.0

You can now use PredictionErrorDisplay to easily make plots that show the error of your model predictions. Nice 👌

Streamlit 1.16.0

New theme for you! The Streamlit theme for Altair, Plotly, and Vega-Lite charts is here 📈

A few other minor updates you should be aware of:

Pandas released 1.5.2
Python released 3.11.1
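
On the requirements.txt point: here is a small, standard-library-only sketch for spotting pins that no longer match what is installed in your environment. The function name find_mismatched_pins is hypothetical, and it only understands simple name==version lines, so version ranges and extras are skipped or treated as not installed.

```python
from importlib.metadata import PackageNotFoundError, version


def find_mismatched_pins(requirements_text):
    """Return (name, pinned, installed) for each `name==version` pin that
    differs from the installed version (installed is None if not installed)."""
    mismatches = []
    for raw_line in requirements_text.splitlines():
        # Drop comments and environment markers, then surrounding whitespace
        line = raw_line.split("#", 1)[0].split(";", 1)[0].strip()
        if "==" not in line:
            continue  # skip blanks, ranges such as >=, and anything unpinned
        name, pinned = (part.strip() for part in line.split("==", 1))
        if not name or not pinned:
            continue  # malformed line, nothing sensible to compare
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None
        if installed != pinned:
            mismatches.append((name, pinned, installed))
    return mismatches


if __name__ == "__main__":
    # Hypothetical file contents, using the versions mentioned above
    example = "pandas==1.5.2\nscikit-learn==1.2.0\nstreamlit==1.16.0\n"
    for name, pinned, installed in find_mismatched_pins(example):
        print(f"{name}: pinned {pinned}, installed {installed}")
```

This only tells you where the file and the environment disagree. It does not check what the latest release is.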

Stephen Allwright

I'm a Data Scientist currently working for Oda, an online grocery retailer, in Oslo, Norway. These posts are my way of sharing some of the tips and tricks I've picked up along the way.