Published: 2022

Designing Machine Learning Systems

An Iterative Process for Production-Ready Machine Learning Applications

Learn to design, build, and maintain ML systems that actually work in production — from data pipelines to model monitoring.

Chip Huyen

Most ML courses stop at model accuracy. This book starts where they end. Chip Huyen walks you through every layer of a production ML system (data engineering, feature stores, training pipelines, deployment strategies, and monitoring) and gives you the mental models and practical tools to build systems that hold up under real-world conditions. At 388 pages, it covers the full lifecycle without padding, which makes it a clear, practitioner-focused guide to ML system design.

About this book

Training a model is the easy part. Keeping it accurate, reliable, and cost-effective in production is where most ML projects fail. Chip Huyen wrote this book because the gap between a Jupyter notebook and a live system serving millions of requests is enormous, and almost no resource addressed it head-on.

This book gives you a complete mental model of what a production ML system actually looks like. You will learn how data flows through an organization, how features are computed and stored at scale, how training pipelines are structured to support fast iteration, and how deployment decisions affect latency, cost, and reliability. Each chapter builds on the last, so by the end you have a coherent picture of the entire lifecycle rather than a collection of isolated techniques.

Huyen is direct about tradeoffs. Batch inference versus online inference. Feature stores versus on-the-fly computation. Shadow deployment versus canary releases. You will understand not just how to implement each approach but when to choose it and what you are giving up. That kind of reasoning is what separates engineers who ship ML systems from engineers who demo them.
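
To make the last of those tradeoffs concrete, here is a minimal Python sketch of how shadow deployment and a canary release differ at the request-routing level. It is an illustration under simple assumptions, not code from the book: the function names `serve_shadow` and `serve_canary`, the `Model` type alias, and the `fraction` parameter are all hypothetical.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical stand-in for a deployed model: one numeric input, one numeric output.
Model = Callable[[float], float]


def serve_shadow(
    x: float,
    current: Model,
    candidate: Model,
    log: List[Tuple[float, float]],
) -> float:
    """Shadow deployment: both models run, but only the current model's output is returned."""
    live = current(x)
    # The candidate's prediction is recorded for offline comparison and never shown to the user.
    log.append((live, candidate(x)))
    return live


def serve_canary(
    x: float,
    current: Model,
    candidate: Model,
    fraction: float,
    rng: random.Random,
) -> float:
    """Canary release: roughly `fraction` of requests are answered by the candidate model."""
    if rng.random() < fraction:
        return candidate(x)
    return current(x)
```

In the shadow path the candidate never affects what users see, so a bad model costs only extra compute. In the canary path a controlled slice of users receives the candidate's output, which gives real feedback at the price of real exposure.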

The book also confronts the operational reality that most practitioners face: data drift, model decay, feedback loops, and the organizational friction of keeping a system accurate over time. A chapter dedicated to monitoring and observability shows you what to measure, what to alert on, and how to diagnose degradation before users notice it.
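
As a rough illustration of what a drift check can look like, the sketch below computes the Population Stability Index (PSI) between a reference sample and a live sample of one numeric feature. PSI is one common choice and is used here as an assumption; this page does not say which statistics the book recommends. The function name `psi`, the bin count, and the 0.2 alert threshold (a widely quoted rule of thumb, not a guarantee) are illustrative.

```python
import math
from typing import List, Sequence


def psi(reference: Sequence[float], live: Sequence[float], bins: int = 10) -> float:
    """Population Stability Index of `live` against `reference` for one numeric feature."""
    lo, hi = min(reference), max(reference)
    # Equal-width bins over the reference range; fall back to width 1.0 if the range is empty.
    width = (hi - lo) / bins or 1.0

    def fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the reference range are clamped into the first or last bin.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        eps = 1e-6  # floor for empty bins so the logarithm below stays defined
        return [max(c / len(values), eps) for c in counts]

    ref, cur = fractions(reference), fractions(live)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


# Hypothetical data: the live sample is concentrated in the upper half of the reference range.
reference = [i / 100 for i in range(100)]
live = [0.5 + i / 200 for i in range(100)]

ALERT_THRESHOLD = 0.2  # assumed rule-of-thumb cutoff; tune it for your own system
drifted = psi(reference, live) > ALERT_THRESHOLD
```

A check like this would typically run on a schedule for each monitored feature, with the reference sample taken from the training data and the live sample from recent production traffic.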

Data engineering foundations: collection, labeling, versioning, and validation
Feature engineering at scale, including feature stores and real-time pipelines
Model development practices that support reproducibility and fast iteration
Deployment patterns including batch, online, streaming, and edge inference
Infrastructure choices: serving frameworks, containers, and orchestration basics
Monitoring strategies for data drift, concept drift, and system health
The business and organizational context that shapes every technical decision

Whether you are the first ML engineer at a startup or moving from research into a platform role at a larger company, this book gives you the vocabulary, the frameworks, and the practical judgment to design systems that survive contact with production.

🎯 What you'll learn

Map the full lifecycle of a production ML system from data ingestion to model retirement
Design data pipelines that handle real-world messiness: missing labels, distribution shift, and schema drift
Choose between feature computation strategies based on latency, freshness, and cost tradeoffs
Structure training workflows to support reproducibility, fast iteration, and safe rollback
Select deployment patterns — batch, online, streaming, edge — based on your product's actual requirements
Build monitoring systems that detect data drift and model decay before they damage user experience
Reason through the organizational and business constraints that shape every technical ML decision

👤 Who is this book for?

ML engineers who can train models but struggle to get them into reliable production systems
Software engineers transitioning into machine learning roles who want a systems-level foundation
Data scientists ready to move beyond notebooks and take ownership of the full pipeline
Applied researchers joining industry teams where deployment and maintenance are part of the job
Engineering managers overseeing ML teams who need a shared vocabulary with their practitioners

Table of contents

01

Overview of Machine Learning Systems

Establishes what production ML systems are and why they differ fundamentally from one-off model training. You will learn how the components fit together and what makes ML systems uniquely difficult to build and maintain.

02

Introduction to Machine Learning Systems Design

Frames the design process around business objectives, requirements, and constraints. You will practice translating a vague product goal into concrete system requirements before writing a single line of code.

03

Data Engineering Fundamentals

Covers the data layer: sources, formats, storage engines, and data flow patterns. You will understand how data moves through an organization and where the common failure points are.

04

Training Data

Addresses labeling, sampling strategies, class imbalance, and data augmentation. You will learn how the quality and composition of training data shapes every downstream modeling decision.

05

Feature Engineering

Explains how to create, transform, and store features for both batch and real-time use. You will evaluate when a feature store is worth the investment and how to avoid common feature leakage mistakes.

06

Model Development and Offline Evaluation

Covers model selection, experiment tracking, hyperparameter tuning, and evaluation metrics that reflect business goals. You will build workflows that make experiments reproducible and comparable.

07

Model Deployment and Prediction Service

Walks through batch, online, streaming, and edge deployment patterns and the infrastructure each requires. You will match deployment strategy to product latency and cost constraints.

08

Data Distribution Shifts and Monitoring

Defines data drift, concept drift, and feedback loops, then shows you how to detect and respond to each. You will design a monitoring setup that catches model degradation before it reaches users.

09

Continual Learning and Test in Production

Explains how to update models safely using shadow deployment, canary releases, and A/B testing. You will learn when continual retraining is worth the infrastructure cost and when it is not.

10

Infrastructure and Tooling for MLOps

Surveys the tooling landscape (orchestration, serving frameworks, feature platforms, and experiment trackers) and gives you criteria for evaluating and selecting them for your team's context.

Frequently asked questions

Do I need a strong ML background to read this book?

You should be comfortable with basic ML concepts — supervised learning, model evaluation, training loops — at roughly the level of a university ML course or equivalent self-study. The book does not teach modeling fundamentals; it focuses on systems design around them.

Is this book heavy on code or is it more conceptual?

It is primarily conceptual and design-oriented, with code samples used to illustrate specific points rather than as the main vehicle of instruction. If you want line-by-line implementation tutorials, this is not that book — it is focused on decision-making and architecture.

Is the content still relevant, given that the book was published in 2022?

The core concepts — data pipelines, deployment patterns, drift monitoring, and system design tradeoffs — are stable and remain directly applicable. Specific tool names in the MLOps ecosystem evolve quickly, so treat those sections as a framework for evaluation rather than a current vendor guide.

Is this book relevant if I work at a small company without a dedicated ML platform team?

Yes. Huyen addresses resource-constrained environments and explains which practices scale down to small teams. The frameworks can be applied when working solo or on a team of two or three engineers.

Does the book come with code files or a companion repository?

The book includes code snippets throughout the text. Check the publisher's page at O'Reilly for any associated resources or errata the author has released since publication.

Specs

Publisher: O'Reilly Media, Inc.
Published: May 2022
Pages: 388
Language: English

About the author

Chip Huyen
