Storytelling with Data
A Practical Guide to Communicating Effectively with Data Visualizations and Charts
Pages
398
Published
2019
Building every algorithm and model from scratch in Python to genuinely understand how data science works
Learn data science by writing the code yourself — no black-box libraries, just Python, math, and clear thinking.
Data Science from Scratch strips away the abstraction. Instead of calling sklearn and hoping for the best, you build linear regression, neural networks, clustering algorithms, and more from raw Python. Joel Grus walks you through the statistics, probability, and linear algebra you actually need, then shows you how each fundamental technique works internally. By the end, you understand not just how to apply the tools but why they work — and what to do when they don't.
Most data science tutorials hand you a library and tell you to call fit(). That works until something breaks, a result looks wrong, or an interviewer asks you to explain what gradient descent is actually doing. This book takes a different approach.
Joel Grus builds every core technique from scratch using nothing but standard Python. No scikit-learn, no TensorFlow, no statsmodels. You start with the Python fundamentals you need, work through the statistics and probability that underpin every model, and then implement — line by line — the algorithms that power real data science work: linear regression, logistic regression, decision trees, neural networks, k-means clustering, natural language processing, and more.
The goal is not to avoid libraries forever. It is to understand what those libraries are doing so you can use them with confidence, debug them when they fail, and explain your choices to colleagues and stakeholders who push back. Building things from scratch is the fastest route to genuine intuition.
Along the way, the book covers the practical glue that textbooks often skip: working with data files, scraping the web, using APIs, cleaning messy data, and visualising results. Each chapter is self-contained enough to read in a focused session, yet the book builds a coherent mental model across its full arc.
The second edition updates every example to Python 3 with type hints, and adds new chapters on deep learning foundations and going to production. Whether you come from a non-technical background or are a software engineer moving into data roles, this is the book that replaces mystery with understanding.
Sets out the philosophy of the book — building from scratch to gain genuine understanding — and walks through the Python environment setup you need to follow along.
Covers the specific Python features used throughout the book: list comprehensions, generators, default arguments, type hints, and the standard library tools that replace heavy dependencies.
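As an illustration of the features named in this chapter summary, here is a minimal sketch (not taken from the book) that combines a list comprehension, a generator, a default argument, and type hints. The function names are hypothetical.

```python
from itertools import islice
from typing import Iterator, List


def squares_of_evens(numbers: List[int]) -> List[int]:
    """List comprehension: keep the even numbers and square them."""
    return [n * n for n in numbers if n % 2 == 0]


def naturals(start: int = 1) -> Iterator[int]:
    """Generator with a default argument: yields start, start + 1, ... lazily."""
    n = start
    while True:
        yield n
        n += 1


def take(values: Iterator[int], count: int = 3) -> List[int]:
    """Pull the first `count` items from a (possibly infinite) iterator."""
    return list(islice(values, count))
```

Because `naturals` is lazy, `take(naturals(5), 2)` only ever computes two values even though the generator itself never terminates.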
Introduces matplotlib for exploratory visualization and shows you how to choose the right chart type to surface patterns and outliers in a dataset.
Builds vectors and matrices as plain Python lists and implements the operations — dot products, matrix multiplication, transpose — that appear repeatedly in every ML algorithm.
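The operations this chapter summary lists can be sketched with plain lists as follows. This is an illustrative version under the assumption that a vector is a list of floats and a matrix is a list of rows; the names are not necessarily those used in the book.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dot(v: Vector, w: Vector) -> float:
    """Sum of componentwise products; both vectors must have the same length."""
    assert len(v) == len(w), "vectors must be the same length"
    return sum(v_i * w_i for v_i, w_i in zip(v, w))


def transpose(a: Matrix) -> Matrix:
    """Swap rows and columns."""
    return [list(column) for column in zip(*a)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Entry (i, j) of the product is the dot product of row i of a and column j of b."""
    assert len(a[0]) == len(b), "inner dimensions must match"
    b_columns = transpose(b)
    return [[dot(row, column) for column in b_columns] for row in a]
```

Writing `matmul` in terms of `dot` and `transpose` makes the reuse explicit: matrix multiplication is nothing more than many dot products.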
Derives mean, variance, covariance, correlation, and common probability distributions from first principles, with code examples that make the math concrete.
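To make these statistics concrete, here is a hedged sketch that is not the book's own code. It assumes the sample (n - 1) convention for variance and covariance, and the function names are illustrative.

```python
import math
from typing import List


def mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)


def de_mean(xs: List[float]) -> List[float]:
    """Shift the data so that its mean is zero."""
    x_bar = mean(xs)
    return [x - x_bar for x in xs]


def variance(xs: List[float]) -> float:
    """Sample variance: average squared deviation, dividing by n - 1."""
    assert len(xs) >= 2, "variance needs at least two points"
    return sum(d * d for d in de_mean(xs)) / (len(xs) - 1)


def covariance(xs: List[float], ys: List[float]) -> float:
    """Sample covariance: how two variables vary together around their means."""
    assert len(xs) == len(ys) and len(xs) >= 2
    products = (dx * dy for dx, dy in zip(de_mean(xs), de_mean(ys)))
    return sum(products) / (len(xs) - 1)


def correlation(xs: List[float], ys: List[float]) -> float:
    """Covariance rescaled by both standard deviations, so it lies in [-1, 1]."""
    std_x = math.sqrt(variance(xs))
    std_y = math.sqrt(variance(ys))
    if std_x == 0 or std_y == 0:
        return 0.0  # no variation in one variable: report zero correlation
    return covariance(xs, ys) / (std_x * std_y)
```

Each function builds on the previous one, which mirrors how the definitions themselves are layered.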
Implements gradient descent in Python from a single update step up to stochastic and minibatch variants, giving you the optimization backbone every subsequent model depends on.
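A single update step, and a loop that repeats it, might look like the sketch below. It is an illustration rather than the book's implementation: the function being minimised (a sum of squares, whose gradient is 2v) and the parameter defaults are assumptions chosen to keep the example small. Stochastic and minibatch variants are not shown.

```python
from typing import Callable, List

Vector = List[float]


def gradient_step(v: Vector, gradient: Vector, step_size: float) -> Vector:
    """Move `step_size` along the negative gradient, i.e. downhill."""
    return [v_i - step_size * g_i for v_i, g_i in zip(v, gradient)]


def sum_of_squares_gradient(v: Vector) -> Vector:
    """Gradient of f(v) = sum(v_i ** 2)."""
    return [2 * v_i for v_i in v]


def minimize(gradient_fn: Callable[[Vector], Vector],
             start: Vector,
             step_size: float = 0.1,
             steps: int = 1000) -> Vector:
    """Repeat the single update step a fixed number of times."""
    v = list(start)
    for _ in range(steps):
        v = gradient_step(v, gradient_fn(v), step_size)
    return v
```

For the sum of squares, each step with `step_size=0.1` shrinks every coordinate by a factor of 0.8, so the iterate approaches the minimum at the origin.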
Fits a line and then a decision boundary to data by minimising a loss function with the gradient descent you built in the previous chapter, and evaluates each model honestly.
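The line-fitting half of this chapter can be sketched as below. This is a minimal, self-contained illustration, not the book's code: it assumes a mean-squared-error loss, and the learning rate and epoch count are values chosen to suit the toy data in the usage note.

```python
from typing import List, Tuple


def fit_line(xs: List[float],
             ys: List[float],
             learning_rate: float = 0.05,
             epochs: int = 2000) -> Tuple[float, float]:
    """Fit y = slope * x + intercept by gradient descent on mean squared error."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every point under the current parameters.
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error.
        grad_slope = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_intercept = 2 * sum(errors) / n
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept
```

On noiseless data generated from y = 3x + 1, for example `fit_line([0, 1, 2, 3, 4], [1, 4, 7, 10, 13])`, the returned slope and intercept converge towards 3 and 1. A learning rate that is too large for the scale of the inputs would make the updates diverge instead.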
Constructs a neural network one neuron at a time, then adds hidden layers and backpropagation so you see exactly how a net learns rather than relying on a framework to do it invisibly.
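The forward pass of such a network can be sketched in a few lines. The example below is illustrative only: it assumes sigmoid neurons whose last weight is the bias, and it uses hand-picked weights that compute XOR instead of learning them, so backpropagation is not shown.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(t: float) -> float:
    return 1 / (1 + math.exp(-t))


def neuron_output(weights: Vector, inputs: Vector) -> float:
    """Weighted sum of the inputs plus a bias (the last weight), squashed by sigmoid."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs + [1.0])))


def feed_forward(network: List[List[Vector]], inputs: Vector) -> Vector:
    """Pass the inputs through each layer in turn; a layer is a list of neurons."""
    for layer in network:
        inputs = [neuron_output(neuron, inputs) for neuron in layer]
    return inputs


# Hand-set weights (an assumption for illustration, not trained values).
xor_network = [
    [[20.0, 20.0, -30.0],    # hidden neuron that behaves like AND
     [20.0, 20.0, -10.0]],   # hidden neuron that behaves like OR
    [[-60.0, 60.0, -30.0]],  # output: fires for "OR but not AND"
]
```

Calling `feed_forward(xor_network, [1.0, 0.0])` gives an output close to 1, while `[1.0, 1.0]` gives an output close to 0. Training replaces the hand-set weights with ones found by gradient descent.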
Codes ID3 decision tree splitting, bootstrap aggregation for random forests, and k-means clustering, comparing their assumptions and failure modes along the way.
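Of the three techniques in this chapter summary, k-means is the shortest to sketch. The version below is a hedged illustration, not the book's code: it assumes random initial centroids drawn from the data, squared Euclidean distance, and a hypothetical `seed` parameter for reproducibility.

```python
import random
from typing import List, Optional

Vector = List[float]


def squared_distance(v: Vector, w: Vector) -> float:
    return sum((a - b) ** 2 for a, b in zip(v, w))


def vector_mean(vectors: List[Vector]) -> Vector:
    """Componentwise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def k_means(points: List[Vector], k: int, seed: int = 0,
            max_iter: int = 100) -> List[Vector]:
    """Alternate between assigning points to the nearest centroid and re-averaging."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments: Optional[List[int]] = None
    for _ in range(max_iter):
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the centroids are stable
        assignments = new_assignments
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # leave the centroid of an empty cluster where it is
                centroids[c] = vector_mean(members)
    return centroids
```

One failure mode is visible in the code itself: the result depends on the initial centroids, so poorly separated data can yield different clusterings for different seeds.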
Applies the techniques from earlier chapters to text data and collaborative filtering, building a bag-of-words classifier and a simple item-based recommender in plain Python.
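Both ideas in this chapter summary rest on turning items into count vectors and comparing them. The sketch below is an assumption-laden illustration, not the book's code: it uses a simple regex tokenizer and cosine similarity, which is one common way to compare bags of words or item vectors.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bag_of_words(text: str) -> Counter:
    """Count how often each token occurs, ignoring word order."""
    return Counter(tokenize(text))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two count vectors: 1.0 means same direction."""
    shared = a.keys() & b.keys()
    dot = sum(a[token] * b[token] for token in shared)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty bag has no direction to compare
    return dot / (norm_a * norm_b)
```

A classifier or recommender would add a decision rule on top, for example picking the label or item whose vector is most similar to the query.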
High-school algebra and basic probability help, but the book re-derives every concept it uses in code. If you can read a for-loop, you can follow the math.
The second edition (2019) uses Python 3 throughout, including type hints. Python 2 is not covered at all.
No. The book deliberately avoids black-box libraries in order to build intuition. After finishing, you will be better equipped to learn those libraries because you will understand what they abstract away.
It depends on your background. If you trained on tutorials and library calls rather than fundamentals, the from-scratch implementations will fill real gaps. If you have a strong ML theory background, you may find the pace slow.
Code examples appear throughout each chapter. Check O'Reilly's catalogue page for any officially hosted companion repository linked to this edition.
It is far more practical and readable, but less mathematically rigorous. It is a starting point that prepares you to engage with more formal texts, not a replacement for them.