Pages: 317
Published: 2017
Statistical Methods and Concepts Every Data Scientist Needs to Know
Build a working command of the statistical methods that actually show up in data science, from sampling and distributions to regression and machine learning.
Most statistics textbooks were written for mathematicians, not practitioners. This book flips that. Peter Bruce and Andrew Bruce cut through academic formality to focus on the concepts and techniques that matter in real data science work. Covering probability, sampling, statistical experiments, regression, and classification, it gives you the mental models to reason clearly about data — and the vocabulary to communicate what you find.
Statistics is the foundation of data science, but most introductions to the field bury the signal in academic notation and proof-heavy explanations. This book takes a different approach. It focuses on the concepts that actually appear in practice — the ones you need to understand before you can reason correctly about data, build models, or interpret results.
Peter Bruce and Andrew Bruce have worked in applied statistics and data mining for decades. The result is a book written from the practitioner's perspective: what you need to know, why it matters, and how to use it. The examples are grounded in real data. The explanations skip the unnecessary formalism without dumbing anything down.
You will start with exploratory data analysis — summary statistics, distributions, and visualizations that tell you what your data actually contains before you start modeling. From there the book covers the mechanics of sampling and bias, the logic of statistical experiments and A/B tests, and the probability theory you need without the graduate-course overhead. The second half moves into regression, classification, and the statistical ideas that sit underneath common machine learning methods.
Each topic is organized around a clear definition, key terms, and worked examples in both R and Python. The structure lets you read cover to cover or jump directly to the concept you need right now.
If you are a developer or analyst who works with data and you know you need a stronger statistical foundation, this is where to start. It is rigorous enough to be useful and practical enough to read in a weekend.
You learn how to summarize and describe datasets using measures of location and variability, and how to spot patterns, outliers, and distribution shapes before any modeling begins.
You work through the principles of random sampling, selection bias, and the central limit theorem, building the foundation for every inferential technique that follows.
You learn the mechanics of A/B testing, hypothesis testing, p-values, and confidence intervals, with clear explanations of what each of these actually tells you — and what it does not.
You build linear and multiple regression models, learn to interpret coefficients correctly, and check the residual diagnostics that distinguish a trustworthy model from a misleading one.
You apply logistic regression, linear discriminant analysis, Naive Bayes, and decision trees to classification problems, and learn how to evaluate classifier performance beyond raw accuracy.
You explore tree-based ensemble methods such as bagging and random forests, connecting modern machine learning tools back to the statistical principles that explain when and why they work.
You apply clustering methods including K-means and hierarchical clustering, and use principal components analysis to reduce dimensionality in high-dimensional datasets.
No prior coursework in statistics or probability is required. The book assumes you are comfortable with basic algebra and have some experience working with data.
Examples are provided in both R and Python. You do not need to be fluent in both — following along in one language is enough to get full value from the worked examples.
The core statistical concepts covered — sampling, regression, hypothesis testing, classification — are not framework-dependent and remain fully valid. Specific library syntax may have evolved, but the reasoning and methods translate directly to current practice.
It leans firmly practical. Formal proofs are avoided in favor of clear definitions, worked examples, and intuition-building explanations aimed at practitioners rather than statisticians.
It is not the right starting point if you want rigorous mathematical statistics with proofs and derivations. For that, a graduate-level textbook would be more appropriate.