Storytelling with Data
A Practical Guide to Communicating Effectively with Data Visualizations and Charts
Pages: 446
Published: 2022
A practical guide to the complete data engineering lifecycle, from ingestion to serving
Build a solid mental model of every stage in the data engineering lifecycle so you can make better architectural decisions and deliver data that teams actually trust.
Data engineering sits at the center of every modern analytics operation, yet most practitioners learn it piecemeal. This book by Joe Reis and Matt Housley gives you a coherent framework for the entire lifecycle: source systems, ingestion, transformation, storage, and serving. Whether you're choosing between batch and streaming, evaluating a new tool, or explaining trade-offs to stakeholders, you'll have the vocabulary and the mental models to make sound decisions confidently.
Data engineering is one of the fastest-growing roles in tech, and one of the least formally defined. Most practitioners learn the job by firefighting — stitching together pipelines, inheriting legacy systems, and making architectural calls without a clear framework to reason from. This book changes that.
Joe Reis and Matt Housley spent years working in the field before writing the reference they wish had existed when they started. The result is a structured, tool-agnostic treatment of the data engineering lifecycle — the sequence of stages every data team navigates to move raw data from source systems into reliable, queryable form for downstream consumers.
The book opens with a precise definition of the data engineering lifecycle: generation, storage, ingestion, transformation, and serving. Beneath all five stages run the undercurrents: security, data management, DataOps, data architecture, orchestration, and software engineering. This framing lets you evaluate any tool or architecture choice against a stable set of criteria rather than chasing vendor narratives.
From there, Reis and Housley work through each stage in depth. You'll learn how to assess source systems such as APIs, databases, and streaming platforms, and what questions to ask of each. You'll understand why batch and streaming pipelines both exist and how to decide between them. You'll see how storage abstractions like data lakes, data warehouses, and lakehouses relate to each other and where each fits. And you'll examine how data reaches analysts, data scientists, and machine learning systems, because a pipeline that nobody trusts or uses has failed no matter how cleverly it was built.
A defining feature of the book is its emphasis on trade-offs over prescriptions. Rather than advocating for a particular stack, the authors give you a framework for reasoning about architectural choices. You'll come away able to articulate why one architecture suits a given organization better than another, and to defend that view to engineers, product managers, and executives alike.
If you've been in data engineering for a year or two and still feel like you're guessing at big decisions, this book gives you the map. If you're a data analyst or scientist who wants to understand the infrastructure beneath your work, it gives you the language. And if you're moving into the field from software engineering, it gives you the context you can't get from tutorials alone.
Introduces the central framework of the book: the data engineering lifecycle and its five stages, plus the undercurrents that apply throughout. You'll learn to use this model as a stable reference point for every decision in later chapters.
Surveys the current ecosystem of tools, roles, and organizational contexts data engineers work within. You'll develop a way to read the market critically rather than reacting to hype.
Covers the principles behind sound data architecture — scalability, flexibility, and simplicity — and how to apply them before committing to a specific stack.
Gives you a practical decision framework for evaluating technologies at each lifecycle stage, including how to weigh build versus buy and open-source versus managed services.
Examines the origin points of data: relational databases, NoSQL systems, APIs, event streams, and files. You'll learn what to look for when assessing source reliability and schema stability.
Compares the major storage abstractions — raw object storage, data lakes, data warehouses, and lakehouses — and explains how to match each to specific workload and access patterns.
Works through batch ingestion, streaming ingestion, and the architectural patterns that support each, including how to handle schema changes and failures gracefully.
Covers transformation patterns from simple SQL-based models to complex multi-stage pipelines, with attention to maintainability, testing, and orchestration.
Explores how data reaches its consumers — BI tools, data scientists, machine learning systems, and operational applications — and what it takes to make those handoffs reliable.
Addresses the undercurrents of security and data governance that apply at every stage, and looks at where the field is heading so you can position your skills accordingly.
No specific language is required. The book is deliberately tool-agnostic and conceptual in orientation. Code examples appear where they aid understanding, but the core value is the frameworks and mental models, not syntax.
It suits practitioners with at least some hands-on exposure to data work — a year or more as a data engineer, analyst, or software engineer touching data systems. Complete beginners may find the conceptual density challenging without prior context.
Yes, these and many others are discussed as examples, but the book does not teach you how to operate any single tool. The goal is to help you evaluate and position tools relative to each other, not to serve as a product manual.
The lifecycle framework and architectural principles the book teaches are stable — they apply regardless of which specific tools are current. Some vendor-specific details will date, but the reasoning methodology holds.
The book is primarily conceptual and explanatory rather than exercise-driven. It does not ship with a companion dataset or lab environment, though the authors reference real-world scenarios throughout.