Seaborn's built-in datasets in Python
A set of tools for illustrating programming problems in a public and reproducible way
Suppose that you are analyzing a confidential dataset, and you have a bug in your code. You want to ask about it using a non-secure large language model or on a public discussion forum like Stack Overflow. How can you ask the question while maintaining confidentiality?
When communicating with statisticians about programming problems and data structures, it is helpful to illustrate the issues with example datasets. In Python, I often use the datasets in the Seaborn package (which is primarily used for data visualization).
For example, the Iris dataset is convenient for many illustrative purposes: it is small, has no missing values, and contains both categorical and continuous data.
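These properties can be checked in code before you rely on a dataset for illustration. The sketch below is one possible way to do that with pandas; the helper name describe_example_dataset and the tiny stand-in frame are my own assumptions, and the values in the frame are for illustration only rather than the real Iris data.

```python
import pandas as pd


def describe_example_dataset(df: pd.DataFrame) -> dict:
    """Summarize the properties that make a dataset convenient for illustration."""
    return {
        # Number of rows, to judge whether the dataset is small
        "n_rows": len(df),
        # Total count of missing cells across all columns
        "n_missing": int(df.isna().sum().sum()),
        # Continuous (numeric) columns
        "numeric_columns": df.select_dtypes(include="number").columns.tolist(),
        # Everything else, treated here as categorical
        "categorical_columns": df.select_dtypes(exclude="number").columns.tolist(),
    }


# Hypothetical stand-in with Iris-style column names; values are illustrative only.
sample = pd.DataFrame({
    "sepal_length": [5.1, 7.0, 6.3],
    "sepal_width": [3.5, 3.2, 3.3],
    "species": ["setosa", "versicolor", "virginica"],
})

summary = describe_example_dataset(sample)
print(summary)
```

The same helper should work on the DataFrame returned by Seaborn's load_dataset('iris'), since it only relies on standard pandas methods.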
Here is some code to load the Iris dataset and display its first 5 rows.
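import pandas as pd
import numpy as np
import seaborn as sb

# load_dataset() returns a pandas DataFrame; it needs an internet connection on first use
iris = sb.load_dataset('iris')

# display() is available in Jupyter; in a plain script, use print(iris.head()) instead
display(iris.head())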
Here is what the output looks like in a Jupyter Notebook.
A common operation in data analysis is aggregating over a categorical variable. We can use the groupby() method in pandas to illustrate this. Let’s calculate the average sepal width by species.
iris.groupby('species')['sepal_width'].mean()
Here is what the result looks like.
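If you cannot download the Iris dataset, the same aggregation can be sketched on a small hand-typed frame. The frame below is a hypothetical stand-in with Iris-style column names and illustrative values; the column name mean_sepal_width is also my own choice. The sketch shows that groupby() followed by mean() returns a Series indexed by the grouping variable, and that reset_index() turns it into a flat table.

```python
import pandas as pd

# Hypothetical stand-in with Iris-style columns; values are illustrative only.
df = pd.DataFrame({
    "species": ["setosa", "setosa", "versicolor", "versicolor"],
    "sepal_width": [3.5, 3.0, 3.2, 2.8],
})

# Average sepal width per species: a Series indexed by species
mean_width = df.groupby("species")["sepal_width"].mean()

# The same result as a flat two-column DataFrame, which is often easier to share
mean_width_df = mean_width.reset_index(name="mean_sepal_width")

print(mean_width)
print(mean_width_df)
```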
For security and confidentiality, you may not be able to show your actual dataset to a co-worker, to a large language model, or on a discussion forum like Stack Overflow. If you want to ask a question about a Python function, you can use an example dataset like Iris instead. Your question will be clear and reproducible, and your actual dataset will stay private.
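When the person answering may not have Seaborn installed, it can help to paste a few rows of the example dataset directly into the question. The sketch below is one possible way to generate such a snippet; the helper name as_pasteable is hypothetical, the stand-in frame uses illustrative values, and the approach assumes the rows contain no missing values. It is meant for public example data only, never for the confidential dataset itself.

```python
import pandas as pd


def as_pasteable(df: pd.DataFrame, n: int = 5) -> str:
    """Return Python source that rebuilds the first n rows of df.

    Assumes those rows contain no missing values, since NaN does not
    round-trip through repr().
    """
    columns = df.head(n).to_dict(orient="list")
    return f"pd.DataFrame({columns!r})"


# Hypothetical stand-in for a public example dataset; values are illustrative only.
example = pd.DataFrame({
    "species": ["setosa", "versicolor", "virginica"],
    "sepal_width": [3.5, 3.2, 3.3],
})

# A one-line constructor that can be pasted into a forum post or a prompt
snippet = as_pasteable(example)
print(snippet)
```

Anyone who runs the printed line after importing pandas as pd gets the same small DataFrame, so the question stays reproducible without any download.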
If you are reading my posts for the first time: I'm Eric Cai, a statistician based in Toronto, Canada. I write about statistics, communication, and career development for professionals in data & analytics.