Introspect Groupby Objects via Iteration

Idiomatic pandas: Tricks & Features You May Not Know Joe Tatusko 05:35

In this lesson you’ll have a closer look at pandas groupby functionality. As the resulting pandas groupby objects can be a bit opaque, because they are lazily instantiated and do not have any meaningful representation on their own, you’ll learn how to access their data as well as how to gain more insights into it.

00:00 This video is going to cover a trick to help understand groupby objects in Pandas a little bit better. If you’ve ever called .groupby() on a DataFrame, you may have noticed that the resulting object is somewhat hard to comprehend.

00:13 A groupby object is literally just a grouped set of data, so it doesn’t really have any representation on its own. Let’s take a look in the terminal so I can show you what I’m talking about.

00:25 I’m just going to use the same data set from the Pandas settings video, so I’ll copy that right now. Then open up a terminal, start the interpreter, import pandas as pd, and then paste those lines in.

00:44 To set up some meaningful grouped data, make a new column on abalone called 'ring_quartile', using the quantile cut from pandas, so just pd.qcut(),

01:06 pass in abalone['rings'],

01:11 you want to cut this into 4 quartiles, and set the labels equal to range(1, 5) so that you’ll get 1, 2, 3, 4.

01:24 Close that off. And then make a new DataFrame, or actually groupby object, called grouped. And set this equal to abalone.groupby(), and group by that new 'ring_quartile',

01:41 just like that. So with a DataFrame or Series, you could just call it in the interpreter and it would return a printout of the DataFrame or Series. But if you do that with a groupby object, you just get this, a generic.DataFrameGroupBy object.
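The setup described so far can be sketched roughly as follows. This is only an illustration: the small DataFrame is a made-up stand-in for the abalone data (same column names, invented values), and observed=True is an addition not mentioned in the video, included here only because recent pandas versions warn when grouping on a categorical column without it.

```python
import pandas as pd

# Hypothetical stand-in for the abalone data: invented values that reuse
# three of the column names from the course's dataset.
abalone = pd.DataFrame({
    "height": [0.02, 0.03, 0.05, 0.06, 0.08, 0.09, 0.10, 0.11,
               0.12, 0.13, 0.14, 0.15, 0.15, 0.16, 0.17, 0.18],
    "weight": [0.05, 0.02, 0.09, 0.07, 0.25, 0.31, 0.22, 0.40,
               0.61, 0.48, 0.75, 0.55, 0.92, 1.10, 0.85, 1.30],
    "rings": list(range(1, 17)),
})

# Bin 'rings' into four equal-sized quantile buckets labeled 1 through 4.
abalone["ring_quartile"] = pd.qcut(abalone["rings"], q=4, labels=range(1, 5))

# observed=True is an assumption added for newer pandas versions.
grouped = abalone.groupby("ring_quartile", observed=True)

# Echoing the object only shows its type, not its contents.
print(grouped)
print(type(grouped).__name__, "with", grouped.ngroups, "groups")
```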

01:57 So that doesn’t really help us. To learn more, you can call the help() function on the object’s dunder iter method, so help(grouped.__iter__), and you’ll see that groupby objects are actually iterable.

02:14 So if you iterate through this, each step yields a pair: the name of the group, and then the subsetted object, which is the DataFrame that holds the data from that group.

02:27 So we can try this out. I’m just going to hit Q to get out of there. And go ahead and set up a loop to try this out! So, for idx, frame in grouped:

02:40 you’re going to print an f-string, f'Ring Quartile: {idx}', with that idx inside the braces.

02:53 Then just print some lines to separate everything out.

03:03 And finally, print out that DataFrame, so frame, and just print out the 3 largest items

03:12 based on the 'weight'. And for an ending character, just do two newlines to keep everything separated. Okay, now I can exit this loop. And look at this!

03:28 You should see each quartile set up like this with your data, and then the top three by weight of each quartile.
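Put together, the loop might look like the sketch below. The DataFrame is again a hypothetical stand-in with invented values, and observed=True is an assumption added for newer pandas versions, so the printed rows will differ from what the video shows.

```python
import pandas as pd

# Hypothetical sample rows standing in for the abalone data.
abalone = pd.DataFrame({
    "weight": [0.05, 0.02, 0.09, 0.07, 0.25, 0.31, 0.22, 0.40,
               0.61, 0.48, 0.75, 0.55, 0.92, 1.10, 0.85, 1.30],
    "rings": list(range(1, 17)),
})
abalone["ring_quartile"] = pd.qcut(abalone["rings"], q=4, labels=range(1, 5))
grouped = abalone.groupby("ring_quartile", observed=True)

# Each iteration yields the group name and the DataFrame for that group.
for idx, frame in grouped:
    print(f"Ring Quartile: {idx}")
    print("-" * 16)
    # Show the three heaviest rows, then a blank line between groups.
    print(frame.nlargest(3, "weight"), end="\n\n")
```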

03:41 So, that’s pretty cool. If you don’t want to iterate through everything, groupby objects also have a getter. So if you do .groups.keys()—

03:54 and it helps if you’re typing .keys() to actually put an s at the end—you can see that these are the keys that you assigned when you created the groupby object. So, let’s say you just wanted to grab the second group.

04:07 You could say grouped.get_group(), and just the second one, and just print out the head of this. And the resulting DataFrame that prints out is everything from the second quartile.
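As a minimal sketch of that lookup, assuming the same kind of made-up stand-in data (invented values, with observed=True added for newer pandas versions):

```python
import pandas as pd

# Hypothetical sample rows standing in for the abalone data.
abalone = pd.DataFrame({
    "weight": [0.05, 0.02, 0.09, 0.07, 0.25, 0.31, 0.22, 0.40,
               0.61, 0.48, 0.75, 0.55, 0.92, 1.10, 0.85, 1.30],
    "rings": list(range(1, 17)),
})
abalone["ring_quartile"] = pd.qcut(abalone["rings"], q=4, labels=range(1, 5))
grouped = abalone.groupby("ring_quartile", observed=True)

# .groups maps each group name to the row labels that belong to it.
print(list(grouped.groups.keys()))

# Pull out a single group by its key without looping over all of them.
second = grouped.get_group(2)
print(second.head())
```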

04:19 groupby objects will also let you call aggregate functions, so if you did something like grouped and you wanted to check for the 'height' and the 'weight', you can call .agg() and then pass in a list.

04:35 So if you want to look at the 'mean' and the 'median', you can type those in. And when you run this, you can see that each group had these aggregate functions called on them.

04:48 And these can be any functions you want, and I didn’t have to define mean() or median(), because pandas recognizes these string names and maps them to its own built-in aggregation methods.
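The aggregation call could be written along these lines. The data is a hypothetical stand-in with invented values, observed=True is an assumption for newer pandas versions, and the columns are selected with a list in double brackets.

```python
import pandas as pd

# Hypothetical sample rows standing in for the abalone data.
abalone = pd.DataFrame({
    "height": [0.02, 0.03, 0.05, 0.06, 0.08, 0.09, 0.10, 0.11,
               0.12, 0.13, 0.14, 0.15, 0.15, 0.16, 0.17, 0.18],
    "weight": [0.05, 0.02, 0.09, 0.07, 0.25, 0.31, 0.22, 0.40,
               0.61, 0.48, 0.75, 0.55, 0.92, 1.10, 0.85, 1.30],
    "rings": list(range(1, 17)),
})
abalone["ring_quartile"] = pd.qcut(abalone["rings"], q=4, labels=range(1, 5))
grouped = abalone.groupby("ring_quartile", observed=True)

# Select two columns, then compute both statistics for every group.
summary = grouped[["height", "weight"]].agg(["mean", "median"])
print(summary)
```

The result has one row per group and a two-level column index, with a (column, statistic) pair for each combination.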

04:55 The big takeaway from this is to realize that these groupby objects are iterable, so if you call a method or a function on them, each of the groups is passed one by one as an argument to whatever that function is.

05:09 You may have heard of the term split, apply, and combine, which is where you break up the data into different groups, apply some calculation, and then combine everything back together.

05:19 And this is a good example of where that term could come from. So if you’re ever working with groupby objects and something isn’t behaving the way you think it should, try iterating through them like a loop, and it might uncover what’s going on. Thanks for watching.
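To make the split, apply, and combine steps concrete, here is a rough sketch that does them by hand through iteration and compares the result with the built-in group mean. The data is a hypothetical stand-in with invented values, and observed=True is an assumption for newer pandas versions.

```python
import pandas as pd

# Hypothetical sample rows standing in for the abalone data.
abalone = pd.DataFrame({
    "weight": [0.05, 0.02, 0.09, 0.07, 0.25, 0.31, 0.22, 0.40,
               0.61, 0.48, 0.75, 0.55, 0.92, 1.10, 0.85, 1.30],
    "rings": list(range(1, 17)),
})
abalone["ring_quartile"] = pd.qcut(abalone["rings"], q=4, labels=range(1, 5))
grouped = abalone.groupby("ring_quartile", observed=True)

# Split: iterate to collect one DataFrame per group.
pieces = {name: frame for name, frame in grouped}

# Apply: run a calculation on each piece separately.
means = {name: frame["weight"].mean() for name, frame in pieces.items()}

# Combine: gather the per-group results back into a single Series.
manual = pd.Series(means, name="weight")

# The built-in method performs the same three steps in one call.
builtin = grouped["weight"].mean()

print(manual)
print(builtin)
```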

Amitesh Sinha on Sept. 1, 2019

Very helpful set of videos

bglynch17 on Nov. 3, 2020

Would be useful if you gave the code snippet from the start of the video in the Description

import pandas as pd

url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'
cols = ['sex', 'length', 'diam', 'height', 'weight', 'rings']
abalone = pd.read_csv(url, usecols=[0,1,2,3,4,8], names=cols)
