Storytelling with Data
A Practical Guide to Communicating Effectively with Data Visualizations and Charts
Pages: 746
Published: 2022
A hands-on guide to scalable data analytics using Python and PySpark
Learn to process and analyze large datasets with PySpark by building real pipelines you can run in production.
Data Analysis with Python and PySpark teaches you to work with large-scale data using Apache Spark's Python API. Starting from Python fundamentals, you'll move through data ingestion, transformation, aggregation, and machine learning pipelines using PySpark's DataFrame API. Each chapter builds on a realistic dataset, so by the time you finish, you have the skills and the code to tackle analytics problems that would crush a single-machine workflow.
Most data analysis tutorials top out at a few million rows. Real-world datasets don't. When your CSV file won't fit in memory and your pandas script times out overnight, you need a different tool, and PySpark is that tool.
Data Analysis with Python and PySpark walks you through the entire analytics workflow at scale: reading raw data from files and databases, cleaning and reshaping it with the DataFrame API, joining and aggregating across billions of rows, and finally feeding the results into machine learning models with MLlib. Jonathan Rioux builds every concept around a concrete dataset, so you're writing code from the first chapter, not absorbing abstract theory before the work begins.
The book assumes you know Python. It does not assume you know Spark, distributed computing, or anything about JVM tuning. You'll learn how Spark's execution model actually works (lazy evaluation, the DAG scheduler, shuffle operations) well enough to understand why your job is slow and how to fix it, without needing to become a cluster administrator.
Along the way you'll cover:
Every technique is demonstrated on real datasets pulled from public sources, so you can see exactly what the inputs look like and verify your own results. The code is written for Python 3 and Spark 3, the versions you'll encounter on any modern cloud platform or on-premises cluster.
If you're a Python developer or data analyst who needs to work at a scale that a single machine can't handle, this book gives you a direct, practical path to production-ready Spark code.
You install PySpark, launch a SparkSession, and run your first DataFrame operations on a small dataset to establish the development environment you'll use throughout the book.
You learn how Spark represents data in DataFrames and schemas, and practice reading CSV and JSON files while inspecting column types and null distributions.
You move from the interactive shell to submitting self-contained Spark jobs, and learn how Spark distributes work across partitions and executors.
You apply the core set of built-in column functions (string, numeric, date, and conditional) to clean and reshape a realistic broadcast dataset.
You compute grouped summaries and multi-dimensional aggregations using GroupedData and pivot operations, then examine the execution plan to understand what Spark is actually doing.
You join multiple DataFrames using inner and left joins, and learn which join strategies, such as broadcasting a small table, avoid expensive shuffle operations on large tables.
You apply window functions to compute running totals, rankings, and lag-lead comparisons within ordered partitions, then chain these with user-defined functions for custom logic.
You work with Parquet, Delta, and other columnar formats using DataFrameReader and DataFrameWriter, controlling partitioning and compression to optimize downstream query performance.
You build classification and regression models using MLlib's Pipeline API, applying feature engineering transformers and evaluating model performance with cross-validation.
You write pytest-based unit tests for individual transformations and full pipelines, then apply code organization patterns that keep large PySpark projects maintainable.
No. The book assumes you know Python but starts Spark from scratch. You'll pick up the distributed computing concepts you need as each one becomes relevant to the code you're writing.
The book is written for Python 3 and Apache Spark 3, the versions offered on the major cloud platforms. It was published in April 2022, and the APIs it covers remain stable and widely used.
Yes. The first chapter walks you through running PySpark locally on your own machine. A cluster is useful for the largest examples but is not required to complete any chapter.
Both roles are covered. The majority of the book focuses on data transformation and pipeline construction, but two chapters address MLlib machine learning pipelines end to end.
Yes. The final chapter covers writing pytest-based unit tests for PySpark transformations and organizing pipeline code so it stays maintainable as it grows.