Data Science from Scratch

Building every algorithm and model from scratch in Python to genuinely understand how data science works

Learn data science by writing the code yourself — no black-box libraries, just Python, math, and clear thinking.

Data Science from Scratch strips away the abstraction. Instead of calling sklearn and hoping for the best, you build linear regression, neural networks, clustering algorithms, and more from raw Python. Joel Grus walks you through the statistics, probability, and linear algebra you actually need, then shows you how each fundamental technique works internally. By the end, you understand not just how to apply the tools but why they work — and what to do when they don't.


About this book

Most data science tutorials hand you a library and tell you to call fit(). That works until something breaks, a result looks wrong, or an interviewer asks you to explain what gradient descent is actually doing. This book takes a different approach.

Joel Grus builds every core technique from scratch in plain Python. No scikit-learn, no TensorFlow, no statsmodels. You start with the Python fundamentals you need, work through the statistics and probability that underpin every model, and then implement, line by line, the algorithms that power real data science work: linear regression, logistic regression, decision trees, neural networks, k-means clustering, natural language processing, and more.

The goal is not to avoid libraries forever. It is to understand what those libraries are doing so you can use them with confidence, debug them when they fail, and explain your choices to colleagues and stakeholders who push back. Building things from scratch is the fastest route to genuine intuition.

Along the way, the book covers the practical glue that textbooks often skip: working with data files, scraping the web, using APIs, cleaning messy data, and visualizing results. Each chapter is self-contained enough to read in a focused session, yet the book builds a coherent mental model across its full arc.

Probability and statistics refreshed with concrete, code-driven examples
Linear algebra explained through the operations you actually use in ML
Gradient descent implemented step by step, not imported from a package
Neural networks built from a single neuron up to a working multi-layer net
Recommender systems, NLP basics, and network analysis, all in plain Python

The second edition updates every example to Python 3 with type hints and adds new chapters on deep learning foundations and going to production. Whether you are coming from a non-technical background or you are a software engineer making the move into data roles, this is the book that replaces mystery with understanding.

🎯 What you'll learn

Implement linear algebra building blocks (vectors, matrices, dot products) in pure Python without NumPy
Build a working gradient descent optimizer from scratch and apply it to real regression problems
Write a neural network forward and backward pass by hand to understand exactly how training works
Apply Naive Bayes, decision trees, and k-nearest neighbors by coding each algorithm yourself
Clean, parse, and explore real datasets using only Python's standard library and minimal dependencies
Use k-means clustering and principal component analysis to find structure in unlabeled data
Scrape web pages and consume REST APIs to gather your own data for analysis
Explain model behavior and trade-offs confidently because you understand what the code is actually doing

👤 Who is this book for?

Software engineers who can write Python but want to understand the math behind machine learning models they use at work
Analysts and domain experts who are comfortable with data but have never written a machine learning algorithm from scratch
Career changers entering data science who want intuition, not just the ability to call library functions
Computer science students who want a practical complement to a formal ML course
Self-taught programmers preparing for data science interviews that require algorithmic understanding

01

Introduction

Sets out the philosophy of the book — building from scratch to gain genuine understanding — and walks through the Python environment setup you need to follow along.
02

A Crash Course in Python

Covers the specific Python features used throughout the book: list comprehensions, generators, default arguments, type hints, and the standard library tools that replace heavy dependencies.
03

Visualizing Data

Introduces matplotlib for exploratory visualization and shows you how to choose the right chart type to surface patterns and outliers in a dataset.
04

Linear Algebra

Builds vectors and matrices as plain Python lists and implements the operations — dot products, matrix multiplication, transpose — that appear repeatedly in every ML algorithm.
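To give a flavor of what this looks like, here is a minimal sketch of vectors and matrices as plain Python lists. It is an illustration written for this page, not the book's own code; the names `dot`, `transpose`, and `matmul` are assumptions chosen for readability.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # a list of rows


def dot(v: Vector, w: Vector) -> float:
    """Sum of the componentwise products v_1*w_1 + ... + v_n*w_n."""
    assert len(v) == len(w), "vectors must be the same length"
    return sum(v_i * w_i for v_i, w_i in zip(v, w))


def transpose(m: Matrix) -> Matrix:
    """Swap rows and columns."""
    return [list(column) for column in zip(*m)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Entry (i, j) is the dot product of row i of a with column j of b."""
    b_columns = transpose(b)
    return [[dot(row, column) for column in b_columns] for row in a]
```

Representing a matrix as a list of rows keeps every operation readable as a comprehension, at the cost of the speed a library such as NumPy would provide.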
05

Statistics and Probability

Derives mean, variance, covariance, correlation, and common probability distributions from first principles, with code examples that make the math concrete.
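As a rough illustration of the "math made concrete" idea, the statistics named above can be sketched in a few lines of standard-library Python. This is an assumed, simplified version (sample statistics with an n - 1 denominator), not code taken from the book.

```python
import math
from typing import List


def mean(xs: List[float]) -> float:
    return sum(xs) / len(xs)


def de_mean(xs: List[float]) -> List[float]:
    """Shift xs so that its mean is zero."""
    x_bar = mean(xs)
    return [x - x_bar for x in xs]


def covariance(xs: List[float], ys: List[float]) -> float:
    """Sample covariance: average product of paired deviations (n - 1 denominator)."""
    assert len(xs) == len(ys) and len(xs) >= 2
    return sum(dx * dy for dx, dy in zip(de_mean(xs), de_mean(ys))) / (len(xs) - 1)


def variance(xs: List[float]) -> float:
    """Sample variance is the covariance of xs with itself."""
    return covariance(xs, xs)


def correlation(xs: List[float], ys: List[float]) -> float:
    """Covariance rescaled by both standard deviations; 0.0 if either has no spread."""
    sd_x, sd_y = math.sqrt(variance(xs)), math.sqrt(variance(ys))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return covariance(xs, ys) / (sd_x * sd_y)
```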
06

Gradient Descent

Implements gradient descent in Python from a single update step up to stochastic and minibatch variants, giving you the optimization backbone every subsequent model depends on.
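The single update step mentioned here can be sketched as follows. This is a minimal example under stated assumptions: the function being minimized is the sum of squares of a vector's components (gradient `2 * v_i`), and the names `gradient_step` and `minimize` plus the default step size are illustrative choices, not the book's API.

```python
from typing import Callable, List

Vector = List[float]


def gradient_step(v: Vector, gradient: Vector, step_size: float) -> Vector:
    """Move v a small distance against the gradient (downhill)."""
    return [v_i - step_size * g_i for v_i, g_i in zip(v, gradient)]


def sum_of_squares_gradient(v: Vector) -> Vector:
    """Gradient of f(v) = v_1^2 + ... + v_n^2."""
    return [2 * v_i for v_i in v]


def minimize(start: Vector,
             gradient_fn: Callable[[Vector], Vector],
             step_size: float = 0.1,
             steps: int = 1000) -> Vector:
    """Repeat the update step a fixed number of times."""
    v = start
    for _ in range(steps):
        v = gradient_step(v, gradient_fn(v), step_size)
    return v
```

Calling `minimize([3.0, -4.0], sum_of_squares_gradient)` should drive both components toward zero, the minimum of the sum of squares. Stochastic and minibatch variants keep the same update step but compute the gradient from one data point or a small batch at a time.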
07

Linear Regression and Logistic Regression

Fits a line and then a decision boundary to data by minimizing a loss function with the gradient descent you built in the previous chapter, and evaluates each model honestly.
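A minimal sketch of the line-fitting half of this chapter, assuming mean squared error as the loss and full-batch gradient descent. The function name `fit_line`, the learning rate, and the epoch count are hypothetical choices for illustration; logistic regression is not shown.

```python
from typing import List, Tuple


def fit_line(xs: List[float],
             ys: List[float],
             learning_rate: float = 0.01,
             epochs: int = 5000) -> Tuple[float, float]:
    """Fit y = slope * x + intercept by gradient descent on mean squared error."""
    slope, intercept = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Prediction error for every point under the current parameters.
        errors = [slope * x + intercept - y for x, y in zip(xs, ys)]
        # Partial derivatives of mean((prediction - y)^2).
        grad_slope = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_intercept = 2 / n * sum(errors)
        slope -= learning_rate * grad_slope
        intercept -= learning_rate * grad_intercept
    return slope, intercept
```

On points that lie exactly on y = 2x + 1, such as `fit_line([0, 1, 2, 3, 4], [1, 3, 5, 7, 9])`, the result should approach a slope of 2 and an intercept of 1. Larger input values would need a smaller learning rate or rescaled data.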
08

Neural Networks

Constructs a neural network one neuron at a time, then adds hidden layers and backpropagation so you see exactly how a net learns rather than relying on a framework to do it invisibly.
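The "one neuron at a time" idea can be sketched with a forward pass only. This is an illustrative example, not the book's code: the weights below are picked by hand so that a two-layer network computes XOR, and the training step (backpropagation) is left out.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(t: float) -> float:
    return 1 / (1 + math.exp(-t))


def neuron_output(weights: Vector, inputs: Vector) -> float:
    """Weighted sum of inputs passed through a sigmoid; the last weight is the bias."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs + [1.0])))


def feed_forward(network: List[List[Vector]], inputs: Vector) -> Vector:
    """Pass inputs through each layer in turn; a layer is a list of neurons."""
    outputs = inputs
    for layer in network:
        outputs = [neuron_output(neuron, outputs) for neuron in layer]
    return outputs


# Hand-picked (not learned) weights, chosen so the network computes XOR.
xor_network = [
    [[20.0, 20.0, -30.0],    # hidden neuron that behaves like AND
     [20.0, 20.0, -10.0]],   # hidden neuron that behaves like OR
    [[-60.0, 60.0, -30.0]],  # output: OR but not AND
]
```

For each pair of binary inputs, rounding `feed_forward(xor_network, [x, y])[0]` should give `x XOR y`. Backpropagation replaces the hand-picked weights with ones learned from examples.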
09

Decision Trees, Random Forests, and Clustering

Codes ID3 decision tree splitting, bootstrap aggregation for random forests, and k-means clustering, comparing their assumptions and failure modes along the way.
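ID3 chooses splits by how much they reduce entropy. A minimal sketch of that calculation, with hypothetical function names and no claim to match the book's implementation:

```python
import math
from collections import Counter
from typing import Any, List


def entropy(labels: List[Any]) -> float:
    """Shannon entropy in bits of the class labels in one subset."""
    total = len(labels)
    return sum(-(count / total) * math.log2(count / total)
               for count in Counter(labels).values())


def partition_entropy(subsets: List[List[Any]]) -> float:
    """Entropy of a split: each subset's entropy weighted by its share of the data."""
    total = sum(len(subset) for subset in subsets)
    return sum(len(subset) / total * entropy(subset) for subset in subsets)
```

A split that separates the classes perfectly has a partition entropy of zero, while one that leaves every subset evenly mixed between two classes scores one bit. ID3 greedily picks the attribute whose split gives the lowest value.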
10

Natural Language Processing and Recommender Systems

Applies the techniques from earlier chapters to text data and collaborative filtering, building a bag-of-words classifier and a simple item-based recommender in plain Python.
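A bag-of-words representation and a similarity measure of the kind an item-based recommender relies on can be sketched as below. The tokenizer regex and the function names are assumptions made for this illustration, not the book's code.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase the text and pull out runs of letters, digits, and apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


def bag_of_words(text: str) -> Counter:
    """Word counts, ignoring word order."""
    return Counter(tokenize(text))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two count vectors; 0.0 if either is empty."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)
```

The same cosine measure works on user-rating vectors: two items are similar when the same users rate them in a similar way.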

Frequently asked questions

Do I need a math background before reading this book?

High-school algebra and basic probability help, but the book re-derives every concept it uses in code. If you can read a for-loop, you can follow the math.

Which Python version does the book use?

The second edition (2019) uses Python 3 throughout, including type hints. Python 2 is not covered at all.

Should I already know NumPy or pandas?

No — the book deliberately avoids them to build intuition. After finishing, you will be better equipped to learn those libraries because you will understand what they are abstracting.

Is this book suitable if I already work in data science professionally?

It depends on your background. If you trained on tutorials and library calls rather than fundamentals, the from-scratch implementations will fill real gaps. If you have a strong ML theory background, you may find the pace slow.

Does the book include exercises or companion code?

Code examples appear throughout each chapter. Check O'Reilly's catalog page for any officially hosted companion repository linked to this edition.

How does this compare to a formal machine learning textbook?

It is far more practical and readable, but less mathematically rigorous. It is a starting point that prepares you to engage with more formal texts, not a replacement for them.