Pages: 662
Published: 2017

Programming ✨ New

Designing Data-Intensive Applications

The big ideas behind reliable, scalable, and maintainable systems

Understand the internals of databases, distributed systems, and data pipelines well enough to make the right architecture decisions for your application.

Martin Kleppmann

Modern applications are built on layers of databases, caches, queues, and stream processors — but most engineers use these tools without understanding what happens when they fail, scale, or disagree with each other. This book works through the fundamental problems of data systems: how data is stored, retrieved, encoded, replicated, partitioned, and processed. It gives you the mental models to reason clearly about trade-offs, so you can choose the right tool and trust the systems you build.

Buy on Amazon →

About this book

Most backend engineers can wire together a Postgres database, a Redis cache, and a message queue. Far fewer can explain what happens to your data when a network partition splits your cluster, or why two databases can return different answers to the same query asked at the same moment. That gap between using data systems and understanding them is where production incidents are born.

Martin Kleppmann's book closes that gap. It works from first principles: how storage engines actually write data to disk, why indexes are designed the way they are, what replication lag means for the consistency guarantees your application can offer, and how distributed transactions got complicated enough that most systems quietly gave up on them. Each concept is grounded in real systems: Postgres, Cassandra, Kafka, HBase, ZooKeeper, Flink, and many others appear not as brand names but as concrete examples of specific design decisions.

The book is organized in three parts. The first covers the foundations of data systems on a single node: storage, retrieval, encoding, and the evolution of data formats over time. The second tackles the hard problems of distributing data across multiple machines: replication strategies, partitioning schemes, and the consistency models that arise from each combination. The third part looks at derived data — how batch processing, stream processing, and the lambda and kappa architectures connect systems together into a coherent data pipeline.

Throughout, Kleppmann is honest about what the field does not yet have good answers to. He names the trade-offs that no tool can make for you, and he gives you the vocabulary to discuss them with your team. By the time you finish, you will be able to read a distributed systems paper, evaluate a vendor's consistency claims, and design a data architecture with eyes open to its failure modes.

662 pages of rigorously sourced content, with nearly 200 references to academic papers and production postmortems
Covers relational databases, document stores, column-family stores, graph databases, message brokers, and stream processors in a single coherent framework
Explains the CAP theorem, linearizability, causal consistency, and eventual consistency with concrete examples rather than abstract proofs
Addresses schema evolution, backward and forward compatibility, and the practical realities of running rolling upgrades in production

🎯 What you'll learn

Explain how B-tree and LSM-tree storage engines differ and which workloads favor each
Reason about replication lag and choose the consistency model that matches your application's requirements
Identify the partitioning strategy (range, hash, or composite) that fits your access patterns and avoids hot spots
Distinguish linearizability from serializability and know when each guarantee matters in practice
Evaluate distributed transaction protocols, including two-phase commit and its failure modes, without relying on marketing documentation
Design a fault-tolerant stream processing pipeline that handles late data and duplicate messages and provides exactly-once semantics
Assess the consistency and durability claims in a database vendor's documentation against what the system can actually guarantee

👤 Who is this book for?

Backend engineers who build applications on top of databases and want to understand what those databases are actually doing
Platform or infrastructure engineers designing multi-service architectures and needing a principled framework for choosing data stores
Senior engineers preparing for system design interviews at companies where distributed systems knowledge is tested seriously
Data engineers building pipelines and wanting to understand the consistency and ordering guarantees of the systems they connect
Engineering managers who want to speak precisely about reliability and scalability trade-offs when reviewing architecture proposals

Table of contents

01

Reliable, Scalable, and Maintainable Applications

Introduces the three core properties that data-intensive applications must satisfy and defines what each term actually means in practice. You build a shared vocabulary for evaluating every design decision that follows.
02

Data Models and Query Languages

Surveys relational, document, and graph models, tracing how each shapes the queries you can express and the trade-offs you accept. You learn why the choice of data model is also a choice about what questions you can ask efficiently.
03

Storage and Retrieval

Opens the black box of storage engines, contrasting B-tree indexes with log-structured merge-trees and explaining how each handles reads, writes, and compaction. You finish with a clear picture of why OLTP and analytics workloads demand different storage designs.
04

Encoding and Evolution

Covers how data is serialized to bytes, why schema evolution is harder than it looks, and how formats like Protocol Buffers, Avro, and JSON handle backward and forward compatibility. You learn how to change a data format without breaking running services.
05

Replication

Works through single-leader, multi-leader, and leaderless replication, tracing the consistency anomalies each approach introduces. You gain a concrete understanding of replication lag and the guarantees — and non-guarantees — each strategy provides.
06

Partitioning

Explains how to split a large dataset across multiple nodes using range and hash partitioning, and how to route queries to the right partition. You see how partitioning interacts with replication and secondary indexes to produce subtle correctness risks.
07

Transactions

Defines isolation levels from read committed to serializable and maps them to the anomalies each prevents or permits. You learn why serializable isolation is both the goal and the performance problem, and which techniques, such as two-phase locking and serializable snapshot isolation (SSI), address it.
08

The Trouble with Distributed Systems

Catalogs what can go wrong when components communicate over a network: packet loss, clock skew, partial failures, and Byzantine faults. You develop the mental model needed to reason about correctness when you cannot assume reliable communication.
09

Consistency and Consensus

Builds from linearizability and causal consistency up to the consensus problem, explaining what ZooKeeper and similar systems actually provide. You learn why total order broadcast, atomic commit, and leader election are all facets of the same underlying problem.
10

Batch Processing, Stream Processing, and the Future of Data Systems

Traces the MapReduce model through modern stream processing frameworks, covering fault tolerance, windowing, and exactly-once semantics. You finish with a framework for composing multiple data systems into a unified architecture with clear consistency boundaries.

Frequently asked questions

Do I need a computer science degree to follow this book?

No formal CS background is required. You should be comfortable writing code and have worked with at least one database before. The book explains theoretical concepts from scratch using concrete examples, not proofs.

Is the content still accurate given it was published in 2017?

The core concepts — replication, partitioning, consensus, stream processing — are foundational and have not changed. Specific version numbers and product details may be dated, but the reasoning frameworks apply directly to current systems.

Does this book cover any specific programming language?

No. The book is language-agnostic and focuses on system concepts rather than code. Examples reference real databases and frameworks by name, but you do not need to know any particular language to benefit from it.

Is this a book for beginners or experienced engineers?

It is best suited to engineers who already build software with databases and want to understand what those systems are doing internally. True beginners may find it slow going without some hands-on experience first.

Does the book include exercises or companion code?

The book is primarily a conceptual text with no programming exercises or official code repository. The value is in the mental models and trade-off analysis, which you apply to your own systems.


Specs

Publisher: O'Reilly Media, Inc.
Published: Mar 2017
Pages: 662
Language: English

About the author

Martin Kleppmann

