Cover of Fundamentals of Data Engineering by Joe Reis and Matt Housley, showing an abstract representation of data flow and pipeline stages

Published

2022

Data Analytics ✨ New

Fundamentals of Data Engineering

A practical guide to the complete data engineering lifecycle, from ingestion to serving

Build a solid mental model of every stage in the data engineering lifecycle so you can make better architectural decisions and deliver data that teams actually trust.

Joe Reis, Matt Housley

Data engineering sits at the center of every modern analytics operation, yet most practitioners learn it piecemeal. This book by Joe Reis and Matt Housley gives you a coherent framework for the entire lifecycle: source systems, ingestion, transformation, storage, and serving. Whether you're choosing between batch and streaming, evaluating a new tool, or explaining trade-offs to stakeholders, you'll have the vocabulary and the mental models to make sound decisions confidently.

Buy on Amazon →

About this book

Data engineering is one of the fastest-growing roles in tech, and one of the least formally defined. Most practitioners learn the job by firefighting — stitching together pipelines, inheriting legacy systems, and making architectural calls without a clear framework to reason from. This book changes that.

Joe Reis and Matt Housley spent years working in the field before writing the reference they wish had existed when they started. The result is a structured, tool-agnostic treatment of the data engineering lifecycle — the sequence of stages every data team navigates to move raw data from source systems into reliable, queryable form for downstream consumers.

The book opens by defining the data engineering lifecycle precisely: generation, ingestion, transformation, serving, and storage. Beneath all of these stages run the undercurrents: security, data management, DataOps, data architecture, orchestration, and software engineering. This framing lets you evaluate any tool or architecture choice against a stable set of criteria rather than chasing vendor narratives.
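To make the idea of evaluating tools against the lifecycle concrete, here is a minimal Python sketch. It is an illustration only, not code from the book: the `ToolAssessment` class, the `uncovered_stages` helper, and the tool names are all hypothetical, and the only things taken from the description above are the five stage names and the two undercurrent labels used in the notes.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class Stage(Enum):
    """The five lifecycle stages named in the description above."""
    GENERATION = "generation"
    INGESTION = "ingestion"
    TRANSFORMATION = "transformation"
    SERVING = "serving"
    STORAGE = "storage"


@dataclass
class ToolAssessment:
    """Hypothetical record of where one tool sits in the lifecycle."""
    name: str
    stages: set[Stage]
    # Free-form notes keyed by undercurrent name, e.g. "security".
    undercurrent_notes: dict[str, str] = field(default_factory=dict)


def uncovered_stages(tools: list[ToolAssessment]) -> list[Stage]:
    """Return the lifecycle stages that no tool in the stack addresses."""
    covered: set[Stage] = set()
    for tool in tools:
        covered |= tool.stages
    return [stage for stage in Stage if stage not in covered]


# Assumed example stack; both tool names are placeholders.
stack = [
    ToolAssessment(
        "hypothetical_ingest_tool",
        {Stage.INGESTION},
        {"security": "credentials stored in a secrets manager"},
    ),
    ToolAssessment(
        "hypothetical_warehouse",
        {Stage.STORAGE, Stage.TRANSFORMATION, Stage.SERVING},
        {"orchestration": "scheduled externally"},
    ),
]

print([stage.value for stage in uncovered_stages(stack)])  # ['generation']
```

Here the check shows that nothing in the assumed stack accounts for the generation stage, which is the kind of gap the lifecycle framing is meant to surface before a tool decision is made.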

From there, Reis and Housley work through each stage in depth. Source systems, APIs, databases, streaming platforms — you'll learn how to assess them and what questions to ask. You'll understand why batch and streaming pipelines exist and how to decide between them. You'll see how storage abstractions like data lakes, data warehouses, and lakehouses relate to each other and where each fits. And you'll examine how data reaches analysts, data scientists, and machine learning systems — because a pipeline that nobody trusts or uses has failed no matter how cleverly it was built.

A defining feature of the book is its emphasis on trade-offs over prescriptions. Rather than advocating for a particular stack, the authors give you a way to think. You'll come away able to articulate why one architecture suits a given organization better than another — and defend that view to engineers, product managers, and executives alike.

The full data engineering lifecycle, defined precisely and applied consistently throughout
Source systems: databases, APIs, event streams, files, and how to work with each
Batch versus streaming: when each makes sense and how to reason about latency and cost
Storage tiers: data lakes, warehouses, and lakehouses compared on practical criteria
Transformation patterns, orchestration, and DataOps practices that keep pipelines healthy
Serving data to analysts, data scientists, and ML systems reliably and at scale

If you've been in data engineering for a year or two and still feel like you're guessing at big decisions, this book gives you the map. If you're a data analyst or scientist who wants to understand the infrastructure beneath your work, it gives you the language. And if you're moving into the field from software engineering, it gives you the context you can't get from tutorials alone.

🎯 What you'll learn

Define the data engineering lifecycle and use it as a consistent lens for evaluating tools, systems, and architectural decisions
Assess source systems — databases, event streams, APIs, and flat files — against practical criteria for ingestion reliability
Choose between batch and streaming architectures based on latency requirements, cost, and organizational maturity
Compare storage abstractions — data lakes, warehouses, and lakehouses — and select the right fit for a given use case
Design transformation pipelines that stay maintainable as data volumes and team size grow
Apply DataOps and orchestration practices that catch problems before they reach downstream consumers
Serve data to analysts, data scientists, and ML systems in ways that build trust and enable self-service
Articulate architectural trade-offs clearly to both technical peers and non-technical stakeholders

👤 Who is this book for?

Data engineers with one to three years of experience who want a coherent framework to replace intuition built from trial and error
Data analysts and analytics engineers who need to understand the pipeline infrastructure upstream of their work
Software engineers transitioning into data roles who already know how to code but lack the architectural context specific to data systems
Data architects and tech leads evaluating tool choices and needing a stable vocabulary for comparing options across the lifecycle
Engineering managers who oversee data teams and want to reason more clearly about trade-offs their teams present

01

The Data Engineering Lifecycle

Introduces the central framework of the book: the data engineering lifecycle and its five stages, plus the undercurrents that apply throughout. You'll learn to use this model as a stable reference point for every decision in later chapters.
02

The Data Engineering Landscape

Surveys the current ecosystem of tools, roles, and organizational contexts data engineers work within. You'll develop a way to read the market critically rather than reacting to hype.
03

Designing Good Data Architecture

Covers the principles behind sound data architecture — scalability, flexibility, and simplicity — and how to apply them before committing to a specific stack.
04

Choosing Technologies Across the Data Engineering Lifecycle

Gives you a practical decision framework for evaluating technologies at each lifecycle stage, including how to weigh build versus buy and open-source versus managed services.
05

Source Systems

Examines the origin points of data: relational databases, NoSQL systems, APIs, event streams, and files. You'll learn what to look for when assessing source reliability and schema stability.
06

Storage

Compares the major storage abstractions — raw object storage, data lakes, data warehouses, and lakehouses — and explains how to match each to specific workload and access patterns.
07

Ingestion

Works through batch ingestion, streaming ingestion, and the architectural patterns that support each, including how to handle schema changes and failures gracefully.
08

Transformation

Covers transformation patterns from simple SQL-based models to complex multi-stage pipelines, with attention to maintainability, testing, and orchestration.
09

Serving Data for Analytics, ML, and Reverse ETL

Explores how data reaches its consumers — BI tools, data scientists, machine learning systems, and operational applications — and what it takes to make those handoffs reliable.
10

Security, Privacy, and the Future of Data Engineering

Addresses the undercurrents of security and data governance that apply at every stage, and looks at where the field is heading so you can position your skills accordingly.

Frequently asked questions

Do I need a specific programming language background to get value from this book?

No specific language is required. The book is deliberately tool-agnostic and conceptual in orientation. Code examples appear where they aid understanding, but the core value is the frameworks and mental models, not syntax.

Is this book for beginners or experienced practitioners?

It suits practitioners with at least some hands-on exposure to data work — a year or more as a data engineer, analyst, or software engineer touching data systems. Complete beginners may find the conceptual density challenging without prior context.

Does the book cover specific tools like Spark, dbt, Airflow, or Snowflake?

Yes, these and many others are discussed as examples, but the book does not teach you how to operate any single tool. The goal is to help you evaluate and position tools relative to each other, not to serve as a product manual.

Is the content still relevant given how fast the data tooling landscape moves?

The lifecycle framework and architectural principles the book teaches are stable — they apply regardless of which specific tools are current. Some vendor-specific details will date, but the reasoning methodology holds.

Does the book include hands-on exercises or a companion dataset?

The book is primarily conceptual and explanatory rather than exercise-driven. It does not ship with a companion dataset or lab environment, though the authors reference real-world scenarios throughout.

Specs

Publisher: O'Reilly Media, Inc.
Published: Jun 2022
Pages: 446
Language: English

About the authors

Joe Reis

Matt Housley
