The biggest Python topics of 2023

Data Engineering Pipelines

This topic covers open-source tools and frameworks at the intersection of data engineering, software engineering, and machine learning operations, with a focus on dataflows and pipeline development. It includes technologies such as XGBoost, Fugue, Hamilton, Feast, Meltano, and Prefect, which address tasks like improving classification models, orchestrating data workflows, and managing machine learning pipelines. The projects and articles listed here aim to streamline data processing and model deployment.

prefect: Workflow Orchestration Tool (Project)

Prefect is a workflow orchestration tool that empowers developers to build, observe, and react to data pipelines.

feast: Feature Store for Machine Learning (Project)

Feature Store for Machine Learning

modelscope: Model-as-a-Service Platform for Machine Learning (Project)

ModelScope brings the notion of Model-as-a-Service to life.

Improving Classification Models With XGBoost (Article)

How can you improve a classification model while avoiding overfitting? Once you have a model, what tools can you use to explain it to others? This week on the show, we talk with author and Python trainer Matt Harrison about his new book Effective XGBoost: Tuning, Understanding, and Deploying Classification Models.

ML System Design: 200 Case Studies (Article)

A collection of links to 200 different blog posts / case studies from leaders in the ML space. Learn how companies such as Netflix and Airbnb implement and use ML in their organizations.

The Dangers Behind Image Resizing (Article)

When training an ML model on image data, you likely want smaller, consistently sized images. That means adding image processing to your pipeline, but assuming that image resizing behaves the same across libraries can cause unforeseen problems.
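As a rough illustration of why this matters, the sketch below downsamples the same tiny image with two common strategies and gets very different pixels. It is a pure-Python toy, not the code of any particular imaging library: the helper names `resize_nearest` and `resize_box` are hypothetical, and it assumes the image dimensions are exact multiples of the factor.

```python
def resize_nearest(img, factor):
    """Downsample by keeping only the top-left pixel of each factor x factor block."""
    return [row[::factor] for row in img[::factor]]


def resize_box(img, factor):
    """Downsample by averaging every pixel in each factor x factor block."""
    height, width = len(img), len(img[0])
    out = []
    for y in range(0, height, factor):
        out_row = []
        for x in range(0, width, factor):
            block = [
                img[yy][xx]
                for yy in range(y, y + factor)
                for xx in range(x, x + factor)
            ]
            out_row.append(sum(block) / len(block))
        out.append(out_row)
    return out


# A 4x4 grayscale checkerboard: alternating white (255) and black (0) pixels.
checker = [[255 if (x + y) % 2 == 0 else 0 for x in range(4)] for y in range(4)]

nearest = resize_nearest(checker, 2)  # samples only white pixels
box = resize_box(checker, 2)          # blends white and black into mid-gray

print(nearest)
print(box)
```

Nearest-neighbor sampling turns the checkerboard solid white, while block averaging turns it uniform gray. A model trained on one and served with the other would see different inputs for the same source image.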

Failed ML Project About Real Estate (Article)

“There aren’t enough failed data science projects out there. Usually, projects only show up in public if they work. I think that’s a shame. If we learn more from our failures than our successes, it makes sense to share more failures to help those around us.”

fugue: Unified Interface for Distributed Computing (Project)

A unified interface for distributed computing. Fugue executes SQL, Python, Pandas, and Polars code on Spark, Dask and Ray without any rewrites.

meltano: CLI for ELT+ (Project)

Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.

hamilton: Micro-Framework for Defining Dataflows (Project, started in 2023)

Hamilton helps data scientists and engineers define testable, modular, self-documenting dataflows that encode lineage and metadata. It runs and scales everywhere Python does.
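To make the idea of a function-defined dataflow concrete, here is a minimal standard-library sketch. It is not Hamilton's API: the `execute` helper and the example functions are hypothetical, written only to show the general technique in which each function's name is a node and its parameter names are that node's dependencies. It does no cycle detection.

```python
import inspect


def total(price, quantity):
    """Node 'total', computed from the inputs 'price' and 'quantity'."""
    return price * quantity


def total_with_tax(total, tax_rate):
    """Node 'total_with_tax', computed from node 'total' and input 'tax_rate'."""
    return total * (1 + tax_rate)


def execute(target, functions, inputs):
    """Compute `target` by resolving dependencies from parameter names.

    Returns the computed value and a lineage dict mapping each computed
    node to the names it depends on.
    """
    registry = {func.__name__: func for func in functions}
    cache = dict(inputs)  # values known so far: raw inputs plus computed nodes
    lineage = {}

    def resolve(name):
        if name in cache:
            return cache[name]
        if name not in registry:
            raise KeyError(f"no input or function named {name!r}")
        func = registry[name]
        deps = list(inspect.signature(func).parameters)
        lineage[name] = deps
        cache[name] = func(**{dep: resolve(dep) for dep in deps})
        return cache[name]

    return resolve(target), lineage


value, lineage = execute(
    "total_with_tax",
    [total, total_with_tax],
    {"price": 10.0, "quantity": 3, "tax_rate": 0.5},
)
print(value)
print(lineage)
```

Because every node is a plain function, each one can be unit tested on its own, and the lineage falls out of the function signatures rather than being documented separately.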

Xorbits: Compatible, Scalable Data Science (Project)

Scalable Python data science and ML, in an API-compatible and lightning-fast way.

Python Stateful Stream Processing OSS Framework (Project)

Python Stream Processing

sematic: An Open-Source ML Pipeline Development Toolkit (Project)

An open-source ML pipeline development platform

cleanvision: Find Issues in Image Datasets (Project)

Automatically find issues in image datasets and practice data-centric computer vision.

ML-Recipes: Collection of Machine Learning Recipes (Project)

A collection of stand-alone Python machine learning recipes

pipeless-ai: Open-Source Computer Vision Framework (Project, started in 2023)

An open-source computer vision framework to build and deploy apps in minutes