Anscombe's Quartet Revisited

To follow along at this point in the lesson, you can use the following code:

Python
import pandas as pd

# Anscombe's Quartet
x  = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

I   = pd.DataFrame([x, y1], index=["x", "y1"]).T
II  = pd.DataFrame([x, y2], index=["x", "y2"]).T
III = pd.DataFrame([x, y3], index=["x", "y3"]).T
IV  = pd.DataFrame([x4, y4], index=["x4", "y4"]).T

00:00 Before we dig into this layer by layer, I want to give a quick throwback to Anscombe’s quartet, which you learned about in the first lesson of this course.

00:10 Now, here’s the data behind these four datasets, which all have nearly the same summary statistics but produce very different plots. I want to show you how quickly you can plot these using plotnine.

00:24 Now you can get this data from the description of this course lesson if you want to run it as well, but you can also just watch. So, I have to import pandas explicitly: it gets installed as a dependency of plotnine, but it’s not automatically imported, of course.

00:39 And now you can see I have these datasets and if you .describe() them, you would see what we saw before.

00:50 I could say…

00:55 You could compare these statistical values and see that they’re very similar, if not exactly the same. But now, if you take a different approach and you actually go ahead and visualize these datasets, using plotnine in this case, you can very quickly see a difference.
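If you'd like to check that claim yourself, here's a minimal sketch that puts a few summary statistics side by side. The `quartet` dictionary and the `summary` table are names made up for this illustration and aren't shown in the video. The DataFrames are built from dictionaries here, which should give the same columns as the listing at the top.

```python
import pandas as pd

# Anscombe's quartet, same values as in the listing above
x  = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

# Hypothetical helper structure: dataset name -> (x values, y values)
quartet = {"I": (x, y1), "II": (x, y2), "III": (x, y3), "IV": (x4, y4)}

rows = {}
for name, (xs, ys) in quartet.items():
    df = pd.DataFrame({"x": xs, "y": ys})
    rows[name] = {
        "x_mean": df["x"].mean(),
        "y_mean": df["y"].mean(),
        "x_var": df["x"].var(),          # sample variance
        "y_var": df["y"].var(),
        "corr": df["x"].corr(df["y"]),   # Pearson correlation
    }

# One row per dataset, one column per statistic
summary = pd.DataFrame(rows).T.round(2)
print(summary)
```

The rows of `summary` should come out nearly identical, which is exactly why the numbers alone can't tell the four datasets apart.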

01:13 So I need to import three things from plotnine: ggplot, aes (the aesthetics), and geom_point (the geometric object). With ggplot, with this first one, I can add the data layer, so to say.

01:26 And this is the syntax that you can use. You can say ggplot(), pass in the data. So here, I’m passing in the pandas DataFrame as the data layer.

01:37 Then, you’re adding the aesthetics layer, where you define the mappings. So x is going to map to the "x" column here, and y is going to map to "y1", in this case.

01:49 So, you want to plot this first dataset.

01:54 And now, if I execute this,

01:57 you can see the plot popping up here. And it looks a certain way, okay. One plot alone doesn’t tell you much yet, but now if you make the second one…

02:07 In the same way, I’m just going to say ggplot(), but pass in the second dataset. I’m going to say + aes() (plus aesthetics), where I’m going to map x to "x" and y to "y2", in this case.

02:24 And finally, you need to define the geometric objects, and this is just going to be a point plot. So if I run this, you right away see that this dataset actually has a completely different distribution of values.

02:38 So something that was basically impossible to see by just the statistical descriptions, you can very easily distinguish by a quick plot that doesn’t take more than one import line and then three lines of code for each plot.

02:54 So, you can play around with this a bit more. Also, you can plot the other ones. You can plot number III and number IV and compare them, and if you want, research a little how you can change the colors and size of these dots.

03:08 So, see you in the next lesson, where you’re going to start looking at the data layer in a bit more detail.
