Fit Predict #2: Fluff your SQL and don't be imbalanced

In this issue of the Fit Predict newsletter, we look at fixing imbalanced datasets, running causal inference, the reality of working as a Data Scientist, and more.

Stephen Allwright

10 Jan 2023

📥

What is Fit Predict?
This is a light-hearted overview of what's been going on in the world of Data Science this week. See it as your 5-minute update such that you can sound at least slightly knowledgeable at your next coffee chat ☕

Have you been forwarded this? You can subscribe here!

Hey there,

If you're anything like me, you'll just about be getting back into the swing of things after the Christmas holidays. Coming back to work is always a bit of a shock to the system, that's for sure. Going from solving sudokus by a log fire, to putting out fires in production models can be rather jarring. 🥴

On that fire I put out in production, let me share a completely new and unique learning from that experience /s . 👨‍🏫

It's not a smart idea to drastically increase your dataset size and make significant changes to your training pipeline the day before you sign off for 2 weeks. If you do, you are likely to find yourself debugging in your parent's kitchen at 11 pm on their dial-up speed wifi. 🙃

You're welcome.

So, now that the model is running smoothly in production, let's get on with the newsletter!

🧰 Tools

The tools that will make your life that little bit easier, or at least more interesting... but either way it's fun to play with new toys.

Imbalanced learn

Imbalanced data is a common problem that needs to be contended with when developing machine learning models. This package provides various out-of-the-box techniques to deal with it, and it's compatible with scitkit-learn which is handy.

CausalPy from PyMC Labs

This package from PyMC Labs is for running causal inference using Bayesian models. It's still in beta, but it would be fun to test out.

SQLFluff linter

I've been using a linter in Python for a long time now, but I've never considered using one for my SQL scripts, until recently. I was introduced to this linter, which I'm happy to say is making my queries look smoother than ever.

🧑‍🔬 In practice

Stories of those who are genuinely implementing Data Science. Step aside Titanic dataset, this is the real deal.

What’s a “Listening Personality”?

This year's Wrapped from Spotify included a "Listening Personality", which it turns out has a very interesting set of calculations behind it. One can only dream of having this kind of data available to work with!

How AI Text Generation Models Are Reshaping Customer Support at Airbnb

Managing customer service at Airbnb's scale must be a real challenge. To tackle it, they're employing some interesting machine learning techniques and products.

Notifications: why less is more

A common trap that we fall into is to try and improve metrics by adding features, but it can be just as effective to remove features instead. This story from Facebook is a great example of that.

🐦 The best of Data Twitter

Data Twitter is the best Twitter.

It’s easy to succeed in data/AI. All you need is to put data into a DWH for OLAP using ETL, or ELT with DBT. Spot any issues on your DAG, run SQL for some BI, reverse ETL into SaaS or use a CDP, then leverage for ML/AI with CNNs, RNNs, RLs, GANs, and maybe LLMs or LDMs for GenAI.

from @mattturck

How do I even evaluate a fashion recommender? I don't know YSL from Gucci.

from @eigenhector

"Correlation doesn't imply causation

Unless the two things are actually causally related, in which case it does

But even when that's the case, it might not be"

Causal inference, everyone!

from @PhDemetri

For every tree I cut down, I train a new (gradient boosted) one.

from @marktenenholtz

💭 Thought-provoking

Content to inspire, or at the very least keep you informed.

Chris Albon's new newsletter, Short Notes on AI, was launched last week and the first issue covers the famous MNIST dataset. It's a fascinating history lesson on one of the cornerstone datasets in our industry.

These two blog posts, the latter inspired by the former, are brutally honest portrayals of how being a Data Scientist can look in reality. I must say, that this is personally not my experience, but I learned about some red flags to look out for at least!

🔧 Updates

Did you know that your favourite Python packages actually get updated regularly and you should update your requirements.txt file?

Data2vec 2.0

The new version of Meta's self-supervised learning model, data2vec, is out and promises major significant performance improvements.

🔗 stephenallwright.com/newsletter-issue-2
🐦 Share on Twitter
✉️ Forward via email

Stephen Allwright Twitter

I'm a Data Scientist currently working for Oda, an online grocery retailer, in Oslo, Norway. These posts are my way of sharing some of the tips and tricks I've picked up along the way.

🧰 Tools

🧑‍🔬 In practice

🐦 The best of Data Twitter

💭 Thought-provoking

🔧 Updates

💬 Enjoyed this issue? Share it

Stephen Allwright Twitter

Check out our handy topic pages