Review of "Biostatistics with Python" by Darko Medin
A book with many Python coding examples but no mathematical notation
Recently, Packt Publishing asked me to review a book called “Biostatistics with Python” by Darko Medin. He is a biostatistician with a Master of Science degree from the Faculty of Science and Mathematics at the University of Montenegro. He published this book in November 2024, and it is available in paperback and electronically.
Here is Packt’s own description of this book:
The chapters show you how to clean and describe your data effectively, setting a solid foundation for accurate analysis and proficiency in biostatistical inference to help you draw meaningful conclusions from your data through hypothesis testing and effect size analysis. The book walks you through predictive modeling to harness the power of Python to create robust predictive analytics that can drive your research and professional projects forward. You'll explore clinical biostatistics, learn how to design studies, conduct survival analysis, and synthesize evidence from multiple studies with meta-analysis – skills that are crucial for making informed decisions based on comprehensive data reviews. The concluding chapters will enhance your ability to analyze biological variables, enabling you to perform detailed and accurate data analysis for biological research.
Generally, the book accomplishes these goals. It teaches the basic concepts clearly, contains many examples in Python, and provides a good introduction to the relevant concepts in medical research.
What I liked about the book
It is a good introduction for someone who wants to start analyzing data right away and dive into the basic concepts of biostatistics. If you don’t have a strong background in mathematics, statistics, or data analysis, that’s OK. You will find many useful lessons in this book, and you will be able to apply them quickly. For example, I did not know much about how to design clinical studies, and I learned the basic process while reading Chapter 10.
It provides many coding examples in Python. Most jobs in biostatistics use SAS or R, but Python is the most popular programming language in data science. This book fills a significant gap in today’s discourse about biostatistics and Python. (A quick Google search shows only one other book about biostatistics and Python, and it is a self-published book available only in paperback.) I did not know how to run Cox proportional hazards regression in Python, so I followed the example in Chapter 11.
The explanations are generally clear and straightforward, keeping the complexity to an understandable level for an absolute beginner. This is not easy to do, because biostatistics can be a difficult subject, especially for someone without a technical background. Darko often mentions an advanced topic only briefly and provides a reference for further exploration. An example is the restricted maximum likelihood (REML) method, which is something that I studied briefly during my Master of Science degree in Statistics at the University of Toronto. He wisely keeps the description of REML to a minimum on Page 238, and he directs the reader to another resource to learn more.
What I did not like about the book
1) The index is very sparse. Key biostatistics terms like “exposure” and “R-squared” are missing; this made it difficult to find relevant discussions for a concept of interest.
2) This book does not have any mathematical equations or expressions, which are necessary to provide a basic understanding of biostatistics. When teaching a technical topic, the instructor must choose a level of rigour and detail that fits the audience, and this is subjective to some extent. I am sympathetic to why Darko might have chosen to reduce the level of mathematical detail in his book, because math can be intimidating and distracting to a newcomer. However, avoiding math and mathematical notation altogether is a mistake. Concepts like R-squared, hazard ratios, and Cox regression are too difficult to describe and understand without math. Even simpler concepts would benefit from some mathematical notation.
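Page 241 mentions the concept of odds ratios, but does not show that an odds ratio is simply a fraction. In the context of comparing a treatment group versus a control group, the odds ratio is

$$\text{odds ratio} = \frac{\text{odds of the outcome in the treatment group}}{\text{odds of the outcome in the control group}}$$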
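To make the fraction concrete, here is a minimal Python sketch of the calculation. It is an illustration only and is not taken from the book; the function names and the counts are hypothetical, and the odds in each group are assumed to be the number of participants with the outcome divided by the number without it.

```python
def odds(events, non_events):
    """Odds of the outcome: participants with the outcome / participants without it."""
    return events / non_events


def odds_ratio(treat_events, treat_non_events, ctrl_events, ctrl_non_events):
    """Odds in the treatment group divided by odds in the control group."""
    return odds(treat_events, treat_non_events) / odds(ctrl_events, ctrl_non_events)


# Hypothetical counts, made up purely for illustration.
# Treatment group: 10 participants with the outcome, 40 without.
# Control group:   20 participants with the outcome, 30 without.
or_value = odds_ratio(10, 40, 20, 30)

print(f"Odds ratio: {or_value:.3f}")  # prints "Odds ratio: 0.375"
```

With these made-up counts, the odds of the outcome in the treatment group are 0.375 times the odds in the control group. An odds ratio of 1 would mean that the odds are the same in both groups.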
Page 101 introduces linear regression, but it does not state the model equation. This deprives the reader of seeing the mathematical structure of linear regression and what the individual components are. By stating the model equation explicitly, the reader can see how the response variable, intercept, slope, covariate, and error term relate to each other. Even if I remove the normal distribution underlying the error term and the subscripts of the observations, the following equation would still be informative.

$$Y = \beta_0 + \beta_1 X + \varepsilon$$

Here, $Y$ is the response variable, $\beta_0$ is the intercept, $\beta_1$ is the slope, $X$ is the covariate, and $\varepsilon$ is the error term.
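The same structure can be seen in a few lines of Python. The sketch below is not from the book; it uses only the standard library, the closed-form least-squares estimates for a single covariate, and a small made-up dataset. The function names and the variables `dose` and `response` are hypothetical. It also computes R-squared, another concept that is easier to grasp once the formula is visible.

```python
def fit_simple_linear_regression(x, y):
    """Return the least-squares (intercept, slope) for one covariate."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Sum of squares of x and sum of cross-products of x and y, both centred.
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    return intercept, slope


def r_squared(x, y, intercept, slope):
    """Proportion of the variation in y explained by the fitted line."""
    y_mean = sum(y) / len(y)
    ss_residual = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_total = sum((yi - y_mean) ** 2 for yi in y)
    return 1 - ss_residual / ss_total


# Hypothetical data, made up purely for illustration.
dose = [1, 2, 3, 4, 5]                   # covariate X
response = [2.1, 3.9, 6.2, 7.8, 10.1]    # response variable Y

intercept, slope = fit_simple_linear_regression(dose, response)
r2 = r_squared(dose, response, intercept, slope)

print(f"intercept = {intercept:.2f}, slope = {slope:.2f}, R-squared = {r2:.3f}")
```

The fitted intercept and slope are the estimates of the two coefficients in the model equation, and the residuals inside `r_squared` are the estimated error terms.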
The intended audience of the book is “science professionals, researchers, biomedical professionals, and aspiring biostatisticians who want to integrate biostatistics into their work or research”. All of them would be able to understand the above concepts in mathematical notation.
3) Packt Publishing provided Python code for this book in Jupyter Notebooks on GitHub, but some of the notebooks do not work. Others require the user to modify them, such as by inserting a dataset. I prefer code that executes successfully as given without any modification; this is especially valuable for beginners who are not familiar with Python or computer programming.
One malfunctioning script was the code for Cox regression in Chapter 11. Even after installing all the relevant packages, that block of code does not work - I tried to execute it in Google Colab, but it encountered an error. Thankfully, the code on Pages 226-227 does implement Cox regression successfully. Thus, I encourage you to rely on the code within the book, not the GitHub repository.
Conclusion
“Biostatistics with Python” is a decent book for a beginner in biostatistics. You will learn the basic concepts of biostatistics and implement common techniques with Python programming. It is clear and straightforward, and I encourage the reader to run the code within the book (but not from the Jupyter notebooks in the associated GitHub repository).
This book does not have any mathematics, so I encourage you to read a more foundational book about biostatistics that does have mathematical equations and formulas. In a future article, I will recommend a book and link back to this conclusion.
Darko and I are acquainted on LinkedIn. We have exchanged several friendly messages with each other, and we have engaged positively with each other’s LinkedIn posts. However, Darko and I do not have a close relationship - professionally or socially. I have not told Darko that I am publishing this review, and nobody has pressured me to write this review with any bias. I am writing this review entirely based on my own volition - with no outside influence or conflict of interest. The only benefit that I received was a paperback copy of this book.