Cover of Data Analysis with Python and PySpark by Jonathan Rioux, featuring abstract representations of distributed data flow and transformation

Pages: 746

Published: 2022

Data Analytics ✨ New

Data Analysis with Python and PySpark

A hands-on guide to scalable data analytics using Python and PySpark

Learn to process and analyze large datasets with PySpark by building real pipelines you can run in production.

Data Analysis with Python and PySpark teaches you to work with large-scale data using Apache Spark's Python API. Starting from Python fundamentals, you'll move through data ingestion, transformation, aggregation, and machine learning pipelines using PySpark's DataFrame API. Each chapter builds on a realistic dataset, so by the time you finish, you'll have both the skills and the code to tackle analytics problems that would overwhelm a single-machine workflow.

About this book

Most data analysis tutorials top out at a few million rows. Real-world datasets don't. When your CSV file won't fit in memory and your pandas script times out overnight, you need a different tool, and PySpark is that tool.

Data Analysis with Python and PySpark walks you through the entire analytics workflow at scale: reading raw data from files and databases, cleaning and reshaping it with the DataFrame API, joining and aggregating across billions of rows, and finally feeding the results into machine learning models with MLlib. Jonathan Rioux builds every concept around a concrete dataset, so you're writing code from the first chapter, not absorbing abstract theory before the work begins.

The book assumes you know Python. It does not assume you know Spark, distributed computing, or anything about JVM tuning. You'll learn how Spark's execution model actually works (lazy evaluation, the DAG scheduler, shuffle operations) well enough to understand why your job is slow and how to fix it, without needing to become a cluster administrator.

Along the way you'll cover:

  • Reading and writing CSV, JSON, Parquet, and Delta formats with the DataFrameReader and DataFrameWriter APIs
  • Transforming data with column expressions, user-defined functions, and the full suite of built-in functions
  • Aggregating and grouping data using GroupedData, window functions, and pivot operations
  • Joining large DataFrames efficiently and avoiding the shuffle traps that kill performance
  • Building and evaluating classification, regression, and recommendation models with MLlib Pipelines
  • Testing PySpark code with pytest so your pipelines stay correct as they evolve

Every technique is demonstrated on real datasets pulled from public sources, so you can see exactly what the inputs look like and verify your own results. The code is written for Python 3 and Spark 3, the versions you're most likely to encounter on modern cloud platforms and on-premises clusters.

If you're a Python developer or data analyst who needs to work at a scale that a single machine can't handle, this book gives you a direct, practical path to production-ready Spark code.

🎯 What you'll learn

  • Set up a local PySpark environment and run your first distributed job without a cloud account
  • Read, inspect, and clean messy real-world datasets using the DataFrame and Column APIs
  • Transform data at scale using built-in functions, user-defined functions, and window operations
  • Join and aggregate large DataFrames while avoiding costly shuffle operations
  • Build end-to-end machine learning pipelines with MLlib, from feature engineering to model evaluation
  • Write unit tests for PySpark transformations so your code stays reliable as requirements change
  • Read and write Parquet and other columnar formats to cut storage costs and speed up queries
  • Interpret Spark execution plans to diagnose and fix slow jobs

👀 Who is this book for?

  • Python developers who need to process datasets too large for pandas or a single machine
  • Data analysts moving from SQL-based tools to a programmatic, scalable pipeline workflow
  • Data engineers who want a thorough grounding in PySpark's DataFrame API before building production pipelines
  • Machine learning practitioners who need to prepare and transform large feature sets before modeling
  • Students or self-taught analysts who know Python basics and want to learn distributed data processing from first principles

Table of contents

  1. Getting Started with PySpark

    You install PySpark, launch a SparkSession, and run your first DataFrame operations on a small dataset to establish the development environment you'll use throughout the book.

  2. The DataFrame and Its Structure

    You learn how Spark represents data in DataFrames and schemas, and practice reading CSV and JSON files while inspecting column types and null distributions.

  3. Submitting and Scaling Spark Applications

    You move from the interactive shell to submitting self-contained Spark jobs, and learn how Spark distributes work across partitions and executors.

  4. Transforming Data with Column Expressions

    You apply the core set of built-in column functions (string, numeric, date, and conditional) to clean and reshape a realistic broadcast dataset.

  5. Aggregating and Grouping Data

    You compute grouped summaries and multi-dimensional aggregations using GroupedData and pivot operations, then examine the execution plan to understand what Spark is actually doing.

  6. Joining and Combining DataFrames

    You join multiple DataFrames using inner, left, and broadcast joins, and learn which join strategies avoid expensive shuffle operations on large tables.

  7. Window Functions and Advanced Transformations

    You apply window functions to compute running totals, rankings, and lag-lead comparisons within ordered partitions, then chain these with user-defined functions for custom logic.

  8. Reading and Writing Data at Scale

    You work with Parquet, Delta, and other columnar formats using DataFrameReader and DataFrameWriter, controlling partitioning and compression to optimize downstream query performance.

  9. Machine Learning Pipelines with MLlib

    You build classification and regression models using MLlib's Pipeline API, applying feature engineering transformers and evaluating model performance with cross-validation.

  10. Testing and Maintaining PySpark Code

    You write pytest-based unit tests for individual transformations and full pipelines, then apply code organization patterns that keep large PySpark projects maintainable.

Frequently asked questions

Do I need prior Spark or distributed computing experience?

No. The book assumes you know Python but starts Spark from scratch. You'll pick up the distributed computing concepts you need as each one becomes relevant to the code you're writing.

What version of Python and Spark does the book use?

The book is written for Python 3 and Apache Spark 3, both widely available on the major cloud platforms. It was published in April 2022, and the APIs it covers remain stable and widely used.

Can I follow along without a cloud cluster?

Yes. The first chapter walks you through running PySpark locally on your own machine. A cluster is useful for the largest examples but is not required to complete any chapter.

Is this book suitable for machine learning engineers, or is it mainly about data engineering?

Both roles are covered, though the emphasis is on data work. Most of the book focuses on data transformation and pipeline construction, while chapter 9 covers MLlib machine learning pipelines end to end.

Does the book cover DataFrame testing and code quality?

Yes. The final chapter covers writing pytest-based unit tests for PySpark transformations and organizing pipeline code so it stays maintainable as it grows.
