Mapping Trick for Membership Binning

Idiomatic pandas: Tricks & Features You May Not Know Joe Tatusko 04:10

This lesson shows you a mapping trick for membership binning. Let’s have a look at a simple situation where this can be useful.

Assume you have a Series and a corresponding “mapping table” where each value belongs to a multi-member group, or to no groups at all:

>>> import pandas as pd
>>> countries = pd.Series([
...     'United States',
...     'Canada',
...     'Mexico',
...     'Belgium',
...     'United Kingdom',
...     'Thailand'
... ])
>>> groups = {
...     'North America': ('United States', 'Canada', 'Mexico', 'Greenland'),
...     'Europe': ('France', 'Germany', 'United Kingdom', 'Belgium')
... }

In other words, you need to map countries to the following result:

0    North America
1    North America
2    North America
3           Europe
4           Europe
5            other
dtype: object

00:00 Now it’s time to learn a little trick for membership binning of categorical data. Let’s say you have a Series like this, a number of countries, and a mapping table like this, where each country either belongs to one of the groups in here or doesn’t belong anywhere. You basically need something that’s similar to pandas’ cut() function but bins values based on categories instead of numbers.
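For comparison, here is a minimal sketch of the numeric binning that pd.cut() does. The ages, bin edges, and labels are made-up values for illustration and are not part of the lesson.

```python
import pandas as pd

# Hypothetical numeric data, not from the lesson
ages = pd.Series([5, 25, 70])

# pd.cut() assigns each number to an interval. By default the intervals
# are closed on the right: (0, 18], (18, 65], (65, 120]
binned = pd.cut(ages, bins=[0, 18, 65, 120], labels=['child', 'adult', 'senior'])

print(list(binned))  # ['child', 'adult', 'senior']
```

Membership binning does the same kind of labeling, except that group membership is listed explicitly instead of being defined by numeric ranges.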

00:26 We’re going to build on the Series.map() method that you used in the categorical data video to do this. So before we get started, let me just copy all this,

00:38 and open up the terminal.

00:44 Start the Python interpreter, import pandas as pd, and then paste in the Series and the dictionary. Before you get to writing the function, from typing import Any so that you can add type hints to the parameters of the function.

01:01 Now define a new function called membership_map(),

01:06 which is going to take in a couple parameters. So, s will be a Series, groups will be a dict.

01:19 fillvalue, for something that’s not found in the mapping table, can be anything, and set the default value to -1. And this function is going to return a Series.

01:34 Like before, make a dictionary called groups, and we’re going to use a pretty large dictionary comprehension here, {x: k for k, v in groups.items() for x in v}.

01:58 And then just return s.map(groups), and then fill in any values that are not found with the fillvalue, like that. So before you use this, try to think about what’s going on in this dictionary comprehension. So let me scroll up.
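Put together, the function described so far might look like the sketch below. The transcript doesn’t name the method used to fill in missing values, so the call to .fillna() is an assumption, as are the comments.

```python
from typing import Any

import pandas as pd


def membership_map(s: pd.Series, groups: dict, fillvalue: Any = -1) -> pd.Series:
    # Invert and expand the mapping: each member becomes a key
    # that points back to the name of its group
    groups = {x: k for k, v in groups.items() for x in v}
    # Values missing from the lookup become NaN after .map();
    # .fillna() (assumed here) replaces them with fillvalue
    return s.map(groups).fillna(fillvalue)
```

Reassigning groups inside the function only rebinds the local name, so the dictionary you pass in is left unchanged.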

02:19 groups is going to be that mapping dictionary, like this right here. When you call groups.items(), you’re going to return the key and then the value. So in this case, the key would be 'Europe' and then the value would be this entire tuple here.

02:36 The pandas .map() method is not going to go inside this tuple, so you need to break it out further, and that’s where this second loop for x in v comes into play.

02:45 So if this is v, now x represents each of these countries. And that’s how you get the final dictionary, where each country x is a key that maps to its continent k.

02:58 If that’s still confusing, feel free to write this out as two nested for loops to try to understand what’s going on. Okay. Time to see if this works.
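Written out as two nested for loops, the comprehension could look like this minimal sketch. The name lookup is chosen here for illustration and doesn’t appear in the lesson.

```python
groups = {
    'North America': ('United States', 'Canada', 'Mexico', 'Greenland'),
    'Europe': ('France', 'Germany', 'United Kingdom', 'Belgium'),
}

lookup = {}  # hypothetical name for the inverted dictionary
for k, v in groups.items():  # k is a continent, v is its tuple of countries
    for x in v:              # x is a single country from that tuple
        lookup[x] = k        # the country becomes the key, the continent the value

# The loops build the same dictionary as the one-line comprehension
assert lookup == {x: k for k, v in groups.items() for x in v}

print(lookup['Belgium'])  # Europe
```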

03:08 So call membership_map(),

03:12 and now you’re going to pass in s, which is countries in this case. groups will be groups, or the mapping dictionary.

03:21 And then the fillvalue for anything not found in that mapping dictionary, you can just say 'other'. And there you go. The first three countries were from North America, the next two were from Europe, and 'Thailand' wasn’t in either of those, so it returned 'other'. This was a small dataset, so it wasn’t very noticeable, but on a larger dataset, building the lookup with this dictionary comprehension and then mapping the values would be a lot faster than checking every value against every group with nested for loops. This is a nice use of a dictionary for mapping, which, while it’s helpful with pandas, is useful in a lot of other situations in Python. So, that’s it!
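As one example of using the same idea outside pandas, here is a minimal sketch with a plain list and dict.get(). The names lookup and labels are chosen for illustration and are not from the lesson.

```python
groups = {
    'North America': ('United States', 'Canada', 'Mexico', 'Greenland'),
    'Europe': ('France', 'Germany', 'United Kingdom', 'Belgium'),
}
countries = ['United States', 'Canada', 'Mexico', 'Belgium', 'United Kingdom', 'Thailand']

# Invert the mapping once, then look each country up directly
lookup = {x: k for k, v in groups.items() for x in v}

# dict.get() returns the default ('other') when a country has no group
labels = [lookup.get(country, 'other') for country in countries]

print(labels)
# ['North America', 'North America', 'North America', 'Europe', 'Europe', 'other']
```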

03:58 You’ve learned how to write a pretty useful function that you can use to map datasets to categorical bins in a very quick and concise way. Thanks for watching.

raulfz on May 6, 2021

Thank you for your Tutorial. The accessor methods and groupby iteration are really great tricks.

However, about this mapping trick I don’t really see any benefit over this:

grp = {ctry : cont for cont, ctries in groups.items() for ctry in ctries}
grp.update({ctry : 'Other' for ctry in countries if ctry not in grp.keys()})
countries.replace(grp)

Best,

arcarlos00 on Sept. 2, 2021

Thanks for the tuto. I understand that it fills some entries with the string ‘Other’, but why do we need to do

from typing import Any

at all?

Bartosz Zaczyński RP Team on Sept. 2, 2021

@arcarlos00 Strictly speaking, you don’t need them. These are called annotations or type hints, which Python doesn’t enforce at runtime. Some tools and libraries can leverage those type hints, for example, to improve auto-completion and type checking in your code editor. However, in this case, they seem to only serve as documentation for the reader.
