58. Anscombe's quartet

# Import modules
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
In 1973, the English statistician Francis John Anscombe emphasized the crucial role of graphical analysis in statistics in a seminal paper, arguing that visual inspection of data through graphs reveals both the main characteristics of a dataset and insights that summary statistics alone can miss. To illustrate his point, Anscombe created a synthetic dataset, known as Anscombe's quartet, consisting of four distinct pairs of x-y variables. Each of the four datasets in the quartet has the same mean, variance, correlation, and regression line:
- Number of observations: 11
- Mean of the x variable: 9.0
- Mean of the y variable: 7.5
- Variance of the x variable: 11.0
- Variance of the y variable: ~4.12
- Equation of the linear regression model: y = 3 + 0.5x
- Coefficient of determination (r^2): 0.667
- Coefficient of correlation (r): 0.816
Question: Despite their identical statistical metrics, can these four datasets be considered similar?
# Load Anscombe's dataset
df = pd.read_csv('../datasets/anscombe_quartet.csv')
# Display entire dataset
df.head(11)
|    | obs | x1 | y1    | x2 | y2   | x3 | y3    | x4 | y4    |
|----|-----|----|-------|----|------|----|-------|----|-------|
| 0  | 1   | 10 | 8.04  | 10 | 9.14 | 10 | 7.46  | 8  | 6.58  |
| 1  | 2   | 8  | 6.95  | 8  | 8.14 | 8  | 6.77  | 8  | 5.76  |
| 2  | 3   | 13 | 7.58  | 13 | 8.74 | 13 | 12.74 | 8  | 7.71  |
| 3  | 4   | 9  | 8.81  | 9  | 8.77 | 9  | 7.11  | 8  | 8.84  |
| 4  | 5   | 11 | 8.33  | 11 | 9.26 | 11 | 7.81  | 8  | 8.47  |
| 5  | 6   | 14 | 9.96  | 14 | 8.10 | 14 | 8.84  | 8  | 7.04  |
| 6  | 7   | 6  | 7.24  | 6  | 6.13 | 6  | 6.08  | 8  | 5.25  |
| 7  | 8   | 4  | 4.26  | 4  | 3.10 | 4  | 5.39  | 19 | 12.50 |
| 8  | 9   | 12 | 10.84 | 12 | 9.13 | 12 | 8.15  | 8  | 5.56  |
| 9  | 10  | 7  | 4.82  | 7  | 7.26 | 7  | 6.42  | 8  | 7.91  |
| 10 | 11  | 5  | 5.68  | 5  | 4.74 | 5  | 5.73  | 8  | 6.89  |
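Before fitting any regression models, it is worth checking the claim of identical summary statistics directly. The following minimal sketch (assuming the column layout shown in the table above) prints the mean, variance, and correlation of each group:

```python
# Print summary statistics for each of the four groups
for i in range(1, 5):
    x, y = df[f'x{i}'], df[f'y{i}']
    print(f"Group {i}: mean_x={x.mean():.2f}, mean_y={y.mean():.2f}, "
          f"var_x={x.var():.2f}, var_y={y.var():.2f}, r={x.corr(y):.3f}")
```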
# Define a linear model
lm = lambda x, intercept, slope: intercept + slope*x
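For instance, lm(10, 3, 0.5) evaluates to 3 + 0.5*10 = 8.0; the lambda simply evaluates a fitted line at a given x.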
# Fit linear model to each dataset
slope_1, intercept_1, r_val_1, p_val_1, std_err_1 = stats.linregress(df.x1, df.y1)
slope_2, intercept_2, r_val_2, p_val_2, std_err_2 = stats.linregress(df.x2, df.y2)
slope_3, intercept_3, r_val_3, p_val_3, std_err_3 = stats.linregress(df.x3, df.y3)
slope_4, intercept_4, r_val_4, p_val_4, std_err_4 = stats.linregress(df.x4, df.y4)
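As a quick sanity check, the fitted parameters can be printed side by side; a short sketch using the variables defined above:

```python
# Compare the fitted regression lines of the four groups
fits = [(intercept_1, slope_1), (intercept_2, slope_2),
        (intercept_3, slope_3), (intercept_4, slope_4)]
for i, (b, m) in enumerate(fits, start=1):
    print(f"Group {i}: y = {b:.2f} + {m:.3f}x")
```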
# Plot points and fitted line
plt.figure(figsize=(8,8))

plt.subplot(2,2,1)
plt.title('Group 1')
plt.scatter(df['x1'], df['y1'], facecolor='w', edgecolor='k')
plt.plot(df['x1'], lm(df['x1'], intercept_1, slope_1), color='tomato')
plt.xlabel('x')
plt.ylabel('y')
plt.xlim([0,20])
plt.ylim([0,15])
plt.text(2, 14, f"Mean={df['y1'].mean():.1f}")
plt.text(2, 13, f"Variance={df['y1'].var():.1f}")
plt.text(2, 12, f"r={r_val_1:.2f}")
plt.text(2, 11, f"y={intercept_1:.1f} + {slope_1:.1f}x")

plt.subplot(2,2,2)
plt.title('Group 2')
plt.scatter(df['x2'], df['y2'], facecolor='w', edgecolor='k')
plt.plot(df['x2'], lm(df['x2'], intercept_2, slope_2), color='tomato')
plt.xlabel('x')
plt.ylabel('y')
plt.xlim([0,20])
plt.ylim([0,15])
plt.text(2, 14, f"Mean={df['y2'].mean():.1f}")
plt.text(2, 13, f"Variance={df['y2'].var():.1f}")
plt.text(2, 12, f"r={r_val_2:.2f}")
plt.text(2, 11, f"y={intercept_2:.1f} + {slope_2:.1f}x")

plt.subplot(2,2,3)
plt.title('Group 3')
plt.scatter(df['x3'], df['y3'], facecolor='w', edgecolor='k')
plt.plot(df['x3'], lm(df['x3'], intercept_3, slope_3), color='tomato')
plt.xlabel('x')
plt.ylabel('y')
plt.xlim([0,20])
plt.ylim([0,15])
plt.text(2, 14, f"Mean={df['y3'].mean():.1f}")
plt.text(2, 13, f"Variance={df['y3'].var():.1f}")
plt.text(2, 12, f"r={r_val_3:.2f}")
plt.text(2, 11, f"y={intercept_3:.1f} + {slope_3:.1f}x")

plt.subplot(2,2,4)
plt.title('Group 4')
plt.scatter(df['x4'], df['y4'], facecolor='w', edgecolor='k')
plt.plot(df['x4'], lm(df['x4'], intercept_4, slope_4), color='tomato')
plt.xlabel('x')
plt.ylabel('y')
plt.xlim([0,20])
plt.ylim([0,15])
plt.text(2, 14, f"Mean={df['y4'].mean():.1f}")
plt.text(2, 13, f"Variance={df['y4'].var():.1f}")
plt.text(2, 12, f"r={r_val_4:.2f}")
plt.text(2, 11, f"y={intercept_4:.1f} + {slope_4:.1f}x")

plt.subplots_adjust(wspace=0.3, hspace=0.4)
plt.show()
Practice
Compute the root mean square error, mean absolute error, and mean bias error for each dataset. Can any of these error metrics provide a better description of the goodness of fit for each dataset?
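For reference, one common set of definitions for these metrics (sign conventions for the mean bias error vary) is, with residuals $e_i = y_i - \hat{y}_i$ over $n$ observations:

$$
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n} e_i^2}, \qquad
\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert e_i \rvert, \qquad
\mathrm{MBE} = \frac{1}{n}\sum_{i=1}^{n} e_i
$$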
Instead of creating the figure subplots one by one, can you write a for loop that will iterate over each dataset, fit a linear model, and then populate its corresponding subplot? One possible structure is sketched below.
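A minimal sketch of such a loop, assuming the DataFrame and the lm model defined above (axis limits and text annotations omitted for brevity):

```python
# Fit and plot each group inside a single loop
plt.figure(figsize=(8,8))
for i in range(1, 5):
    x, y = df[f'x{i}'], df[f'y{i}']
    slope, intercept, r_val, p_val, std_err = stats.linregress(x, y)
    plt.subplot(2, 2, i)
    plt.title(f'Group {i}')
    plt.scatter(x, y, facecolor='w', edgecolor='k')
    plt.plot(x, lm(x, intercept, slope), color='tomato')
    plt.xlabel('x')
    plt.ylabel('y')
plt.subplots_adjust(wspace=0.3, hspace=0.4)
plt.show()
```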
References
Anscombe, F.J., 1973. Graphs in statistical analysis. The American Statistician, 27(1), pp.17-21.