[Book cover: Designing Data-Intensive Applications by Martin Kleppmann]

Pages: 662

Published: 2017

Programming ✨ New

Designing Data-Intensive Applications

The big ideas behind reliable, scalable, and maintainable systems

Understand the internals of databases, distributed systems, and data pipelines well enough to make the right architecture decisions for your application.

Modern applications are built on layers of databases, caches, queues, and stream processors, but most engineers use these tools without understanding what happens when they fail, scale, or disagree with each other. This book works through the fundamental problems of data systems: how data is stored, retrieved, encoded, replicated, partitioned, and processed. It gives you the mental models to reason clearly about trade-offs, so you can choose the right tool and trust the systems you build.

About this book

Most backend engineers can wire together a Postgres database, a Redis cache, and a message queue. Far fewer can explain what happens to your data when a network partition splits your cluster, or why two databases can return different answers to the same query asked at the same moment. That gap between using data systems and understanding them is where production incidents are born.

Martin Kleppmann's book closes that gap. It works from first principles: how storage engines actually write data to disk, why indexes are designed the way they are, what replication lag means for the consistency guarantees your application can offer, and how distributed transactions got complicated enough that most systems quietly gave up on them. Each concept is grounded in real systems. Postgres, Cassandra, Kafka, HBase, ZooKeeper, Flink, and many others appear not as brand names but as concrete examples of specific design decisions.
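
To give a flavor of that first-principles approach, here is a minimal sketch of one of the simplest storage designs: an append-only log file with an in-memory hash index. It is an illustration under stated assumptions, not code from the book. The class name `AppendOnlyStore`, the tab-separated record format, and the restriction that keys and values contain no tabs or newlines are all invented for this example.

```python
import os
import tempfile


class AppendOnlyStore:
    """Toy key-value store (hypothetical): every write is appended to a log
    file, and an in-memory dict maps each key to the byte offset of its
    most recent record."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> byte offset of the latest record for that key
        open(self.path, "ab").close()  # create the log file if it is missing

    def put(self, key, value):
        # Assumption: keys and values contain no tab or newline characters.
        record = f"{key}\t{value}\n".encode("utf-8")
        with open(self.path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)  # position where this record starts
            f.write(record)
        self.index[key] = offset  # older records for this key become garbage

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)  # one seek, one read: no scan of the log
            line = f.readline().decode("utf-8").rstrip("\n")
        _, value = line.split("\t", 1)
        return value


# Usage: overwriting a key appends a new record instead of editing the old one.
path = os.path.join(tempfile.mkdtemp(), "data.log")
store = AppendOnlyStore(path)
store.put("a", "1")
store.put("b", "2")
store.put("a", "3")
```

Writes never modify old records, so the file keeps growing until something compacts it. That tension between cheap sequential writes and background cleanup is the kind of trade-off the storage chapters examine.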

The book is organized in three parts. The first covers the foundations of data systems on a single node: storage, retrieval, encoding, and the evolution of data formats over time. The second tackles the hard problems of distributing data across multiple machines: replication strategies, partitioning schemes, and the consistency models that arise from each combination. The third part looks at derived data: how batch processing, stream processing, and the lambda and kappa architectures connect systems together into a coherent data pipeline.

Throughout, Kleppmann is honest about what the field does not yet have good answers to. He names the trade-offs that no tool can make for you, and he gives you the vocabulary to discuss them with your team. By the time you finish, you will be able to read a distributed systems paper, evaluate a vendor's consistency claims, and design a data architecture with eyes open to its failure modes.

  • 662 pages of rigorously sourced content, with nearly 200 references to academic papers and production postmortems
  • Covers relational databases, document stores, column-family stores, graph databases, message brokers, and stream processors in a single coherent framework
  • Explains CAP theorem, linearizability, causal consistency, and eventual consistency with concrete examples rather than abstract proofs
  • Addresses schema evolution, backward and forward compatibility, and the practical realities of running rolling upgrades in production
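
The compatibility point in the last bullet can be made concrete with a small sketch. During a rolling upgrade, old and new code run side by side, so new code must read old data (backward compatibility) and old code must read new data (forward compatibility). The record layout, field names, and reader functions below are hypothetical and use plain JSON for brevity; they are not taken from the book.

```python
import json

# Hypothetical record written by OLD code (schema v1).
V1_RECORD = json.dumps({"user_id": 42, "name": "Ada"})
# Hypothetical record written by NEW code (schema v2 adds an optional field).
V2_RECORD = json.dumps({"user_id": 42, "name": "Ada", "email": "ada@example.com"})


def read_v1(raw):
    """Old reader: keeps the fields it knows and ignores anything else.
    Ignoring unknown fields is what makes it forward compatible."""
    data = json.loads(raw)
    return {"user_id": data["user_id"], "name": data["name"]}


def read_v2(raw):
    """New reader: falls back to a default when the new field is absent.
    The default is what makes it backward compatible."""
    data = json.loads(raw)
    return {
        "user_id": data["user_id"],
        "name": data["name"],
        "email": data.get("email"),  # None for records written by old code
    }


old_code_reads_new_data = read_v1(V2_RECORD)
new_code_reads_old_data = read_v2(V1_RECORD)
```

Both directions work here only because the new field is optional. Making `email` required in `read_v2` would break every record written before the upgrade.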

🎯 What you'll learn

  • Explain how B-tree and LSM-tree storage engines differ and which workloads favor each
  • Reason about replication lag and choose the consistency model that matches your application's requirements
  • Identify the partition strategy (range, hash, or composite) that fits your access patterns and avoids hot spots
  • Distinguish linearizability from serializability and know when each guarantee matters in practice
  • Evaluate distributed transaction protocols, including two-phase commit and its failure modes, without relying on marketing documentation
  • Design a fault-tolerant stream processing pipeline that handles late data, duplicate messages, and exactly-once semantics
  • Assess the consistency and durability claims in a database vendor's documentation against what the system can actually guarantee
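
As one illustration of the partitioning outcome above, the following sketch compares range and hash partitioning on timestamp-like keys. The partition count, the range boundaries, and the key format are assumptions made up for this example, and MD5 is used only as a convenient stable hash, not as a recommendation.

```python
import bisect
import hashlib
from collections import Counter

NUM_PARTITIONS = 4
# Hypothetical range boundaries: partition i holds keys that sort below
# RANGE_BOUNDARIES[i]; the last partition holds everything else.
RANGE_BOUNDARIES = ["2024-04", "2024-07", "2024-10"]


def range_partition(key):
    """Assign a key to a partition by where it falls in sorted order."""
    return bisect.bisect_right(RANGE_BOUNDARIES, key)


def hash_partition(key):
    """Assign a key to a partition by a hash of the key, modulo the count."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_PARTITIONS


# A write-heavy workload where every key is a timestamp from the same month.
keys = [
    f"2024-11-{day:02d}T12:00:{sec:02d}"
    for day in range(1, 29)
    for sec in range(60)
]

range_load = Counter(range_partition(k) for k in keys)  # one hot partition
hash_load = Counter(hash_partition(k) for k in keys)    # spread across all
```

Under range partitioning every one of these writes lands on the last partition, which is the hot spot. Hashing spreads the load but gives up the ability to scan a time range from a single partition, which is why the choice depends on your access patterns.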

👀 Who is this book for?

  • Backend engineers who build applications on top of databases and want to understand what those databases are actually doing
  • Platform or infrastructure engineers designing multi-service architectures and needing a principled framework for choosing data stores
  • Senior engineers preparing for system design interviews at companies where distributed systems knowledge is tested seriously
  • Data engineers building pipelines and wanting to understand the consistency and ordering guarantees of the systems they connect
  • Engineering managers who want to speak precisely about reliability and scalability trade-offs when reviewing architecture proposals

Table of contents

  1. Reliable, Scalable, and Maintainable Applications

    Introduces the three core properties that data-intensive applications must satisfy and defines what each term actually means in practice. You build a shared vocabulary for evaluating every design decision that follows.

  2. Data Models and Query Languages

    Surveys relational, document, graph, and column models, tracing how each shapes the queries you can express and the trade-offs you accept. You learn why the choice of data model is also a choice about what questions you can ask efficiently.

  3. Storage and Retrieval

    Opens the black box of storage engines, contrasting B-tree indexes with log-structured merge-trees and explaining how each handles reads, writes, and compaction. You finish with a clear picture of why OLTP and analytics workloads demand different storage designs.

  4. Encoding and Evolution

    Covers how data is serialized to bytes, why schema evolution is harder than it looks, and how formats like Protocol Buffers, Avro, and JSON handle backward and forward compatibility. You learn how to change a data format without breaking running services.

  5. Replication

    Works through single-leader, multi-leader, and leaderless replication, tracing the consistency anomalies each approach introduces. You gain a concrete understanding of replication lag and of the guarantees, and non-guarantees, each strategy provides.

  6. Partitioning

    Explains how to split a large dataset across multiple nodes using range and hash partitioning, and how to route queries to the right partition. You see how partitioning interacts with replication and secondary indexes to produce subtle correctness risks.

  7. Transactions

    Defines isolation levels from read committed to serializable and maps them to the anomalies each prevents or permits. You learn why serializable isolation is both the goal and the performance problem, and which techniques, such as two-phase locking and serializable snapshot isolation (SSI), address it.

  8. The Trouble with Distributed Systems

    Catalogs what can go wrong when components communicate over a network: packet loss, clock skew, partial failures, and Byzantine faults. You develop the mental model needed to reason about correctness when you cannot assume reliable communication.

  9. Consistency and Consensus

    Builds from linearizability and causal consistency up to the consensus problem, explaining what ZooKeeper and similar systems actually provide. You learn why total order broadcast, atomic commit, and leader election are all facets of the same underlying problem.

  10. Batch Processing, Stream Processing, and the Future of Data Systems

    Traces the MapReduce model through modern stream processing frameworks, covering fault tolerance, windowing, and exactly-once semantics. You finish with a framework for composing multiple data systems into a unified architecture with clear consistency boundaries.
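
The windowing and exactly-once ideas named in the final chapter can be sketched in a few lines. This is a simplified illustration, not the book's code: the function name, the 60-second window, and the `(message_id, event_time)` tuple layout are assumptions. It shows duplicate suppression by message ID, one common way to get an exactly-once effect on top of at-least-once delivery.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical tumbling-window size


def count_per_window(events):
    """Count events per tumbling window, keyed by window start time.

    `events` is an iterable of (message_id, event_time_seconds) tuples that
    may arrive out of order and may contain redelivered duplicates.
    """
    seen_ids = set()  # unbounded here; a real system would expire old IDs
    counts = defaultdict(int)
    for message_id, event_time in events:
        if message_id in seen_ids:
            continue  # redelivery: skip it so the effect is applied once
        seen_ids.add(message_id)
        # Assign by event time, not arrival order, so late data still
        # lands in the window it belongs to.
        window_start = (event_time // WINDOW_SECONDS) * WINDOW_SECONDS
        counts[window_start] += 1
    return dict(counts)


events = [
    ("m1", 5),
    ("m2", 61),
    ("m1", 5),    # duplicate delivery of m1
    ("m3", 59),   # late: arrives after m2 but belongs to the first window
    ("m4", 130),
]
result = count_per_window(events)
```

The sketch never closes a window, so it sidesteps the hard question of how long to wait for stragglers before emitting a result.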

Frequently asked questions

Do I need a computer science degree to follow this book?

No formal CS background is required. You should be comfortable writing code and have worked with at least one database before. The book explains theoretical concepts from scratch using concrete examples, not proofs.

Is the content still accurate given it was published in 2017?

The core concepts (replication, partitioning, consensus, stream processing) are foundational and have not changed. Specific version numbers and product details may be dated, but the reasoning frameworks apply directly to current systems.

Does this book cover any specific programming language?

No. The book is language-agnostic and focuses on system concepts rather than code. Examples reference real databases and frameworks by name, but you do not need to know any particular language to benefit from it.

Is this a book for beginners or experienced engineers?

It is best suited to engineers who already build software with databases and want to understand what those systems are doing internally. True beginners may find it slow going without some hands-on experience first.

Does the book include exercises or companion code?

The book is primarily a conceptual text with no programming exercises or official code repository. The value is in the mental models and trade-off analysis, which you apply to your own systems.
