Published: 2022

Data Analytics ✨ New

Data Analysis with Python and PySpark

A hands-on guide to scalable data analytics using Python and PySpark

Learn to process and analyze large datasets with PySpark by building real pipelines you can run in production.

Data Analysis with Python and PySpark teaches you to work with large-scale data using Apache Spark's Python API. Building on the Python you already know, you'll move through data ingestion, transformation, aggregation, and machine learning pipelines using PySpark's DataFrame API. Each chapter builds on a realistic dataset, so by the time you finish, you'll have the skills and the code to tackle analytics problems that would overwhelm a single-machine workflow.

About this book

Most data analysis tutorials top out at a few million rows. Real-world datasets don't. When your CSV file won't fit in memory and your pandas script times out overnight, you need a different tool — and PySpark is that tool.

Data Analysis with Python and PySpark walks you through the entire analytics workflow at scale: reading raw data from files and databases, cleaning and reshaping it with the DataFrame API, joining and aggregating across billions of rows, and finally feeding the results into machine learning models with MLlib. Jonathan Rioux builds every concept around a concrete dataset, so you're writing code from the first chapter, not absorbing abstract theory before the work begins.

The book assumes you know Python. It does not assume you know Spark, distributed computing, or anything about JVM tuning. You'll learn how Spark's execution model actually works — lazy evaluation, the DAG scheduler, shuffle operations — well enough to understand why your job is slow and how to fix it, without needing to become a cluster administrator.

Along the way you'll cover:

Reading and writing CSV, JSON, Parquet, and Delta formats with the DataFrameReader and DataFrameWriter APIs
Transforming data with column expressions, user-defined functions, and the full suite of built-in functions
Aggregating and grouping data using GroupedData, window functions, and pivot operations
Joining large DataFrames efficiently and avoiding the shuffle traps that kill performance
Building and evaluating classification, regression, and recommendation models with MLlib Pipelines
Testing PySpark code with pytest so your pipelines stay correct as they evolve

Every technique is demonstrated on real datasets pulled from public sources, so you can see exactly what the inputs look like and verify your own results. The code is written for Python 3 and Spark 3, the versions you'll encounter on any modern cloud platform or on-premises cluster.

If you're a Python developer or data analyst who needs to work at a scale that a single machine can't handle, this book gives you a direct, practical path to production-ready Spark code.

🎯 What you'll learn

Set up a local PySpark environment and run your first distributed job without a cloud account
Read, inspect, and clean messy real-world datasets using the DataFrame and Column APIs
Transform data at scale using built-in functions, user-defined functions, and window operations
Join and aggregate large DataFrames while avoiding costly shuffle operations
Build end-to-end machine learning pipelines with MLlib, from feature engineering to model evaluation
Write unit tests for PySpark transformations so your code stays reliable as requirements change
Read and write Parquet and other columnar formats to cut storage costs and speed up queries
Interpret Spark execution plans to diagnose and fix slow jobs

👤 Who is this book for?

Python developers who need to process datasets too large for pandas or a single machine
Data analysts moving from SQL-based tools to a programmatic, scalable pipeline workflow
Data engineers who want a thorough grounding in PySpark's DataFrame API before building production pipelines
Machine learning practitioners who need to prepare and transform large feature sets before modeling
Students or self-taught analysts who know Python basics and want to learn distributed data processing from first principles

Table of contents

01 Getting Started with PySpark

You install PySpark, launch a SparkSession, and run your first DataFrame operations on a small dataset to establish the development environment you'll use throughout the book.

02 The DataFrame and Its Structure

You learn how Spark represents data in DataFrames and schemas, and practice reading CSV and JSON files while inspecting column types and null distributions.

03 Submitting and Scaling Spark Applications

You move from the interactive shell to submitting self-contained Spark jobs, and learn how Spark distributes work across partitions and executors.

04 Transforming Data with Column Expressions

You apply the core built-in column functions (string, numeric, date, and conditional) to clean and reshape a realistic broadcast dataset.

05 Aggregating and Grouping Data

You compute grouped summaries and multi-dimensional aggregations using GroupedData and pivot operations, then examine the execution plan to understand what Spark is actually doing.

06 Joining and Combining DataFrames

You join multiple DataFrames using inner, left, and broadcast joins, and learn which join strategies avoid expensive shuffle operations on large tables.

07 Window Functions and Advanced Transformations

You apply window functions to compute running totals, rankings, and lag and lead comparisons within ordered partitions, then chain these with user-defined functions for custom logic.

08 Reading and Writing Data at Scale

You work with Parquet, Delta, and other columnar formats using DataFrameReader and DataFrameWriter, controlling partitioning and compression to optimize downstream query performance.

09 Machine Learning Pipelines with MLlib

You build classification and regression models using MLlib's Pipeline API, applying feature engineering transformers and evaluating model performance with cross-validation.

10 Testing and Maintaining PySpark Code

You write pytest-based unit tests for individual transformations and full pipelines, then apply code organization patterns that keep large PySpark projects maintainable.

Frequently asked questions

Do I need prior Spark or distributed computing experience?

No. The book assumes you know Python but starts Spark from scratch. You'll pick up the distributed computing concepts you need as each one becomes relevant to the code you're writing.

What version of Python and Spark does the book use?

The book is written for Python 3 and Apache Spark 3, both of which are supported on the major cloud platforms. The book was published in April 2022, and the APIs it covers remain stable and widely used.

Can I follow along without a cloud cluster?

Yes. The first chapter walks you through running PySpark locally on your own machine. A cluster is useful for the largest examples but is not required to complete any chapter.

Is this book suitable for machine learning engineers, or is it mainly about data engineering?

Both roles are covered. The majority of the book focuses on data transformation and pipeline construction, and the chapter on MLlib covers machine learning pipelines end to end, from feature engineering to model evaluation.

Does the book cover DataFrame testing and code quality?

Yes. The final chapter covers writing pytest-based unit tests for PySpark transformations and organizing pipeline code so it stays maintainable as it grows.

Specs

Publisher: Simon and Schuster
Published: Apr 2022
Pages: 746
Language: English

About the author

Jonathan Rioux
