Mark Your DataFrames With Keys

Martin Breuss

Combining Data in pandas With concat() and merge() Martin Breuss 03:34

Transcript

00:00 In this lesson, you’ll look at the optional keys argument to pd.concat(), which can help you deal with the situation where multiple things end up with the same name as a result of the concatenation, whether that happens along the column axis or the row axis.

00:18 Now, before I apply this, let’s take another look at the function signature,

00:24 and you can see in here, there’s the keys argument that you’ll use. There’s even an explanatory note in the docstring about the keys argument, which says that it can also add a layer of hierarchical indexing on the concatenation axis, which may be useful if the labels are the same or overlapping on the passed axis number.

00:45 Let’s look at that practically by redoing this exact concatenation, but this time you’ll pass the keys argument, so you’ll say keys=.

01:00 The first one that you’re passing in is the fruits DataFrame, so let’s just call it fruits. And the second one is the veggies DataFrame.

01:08 So you’re just giving the names of the DataFrames. And now when you execute this call, you get a better visual understanding of what the different columns relate to.

01:19 So as you can see, it correctly shows you that this part belongs to the fruits DataFrame and this part—well, actually not this part down here, but this part—is the veggies DataFrame.

01:33 However, because the veggies DataFrame has no third row, the missing values in that row still show up under the columns of the veggies DataFrame.
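A minimal sketch of the column-wise call described above. The transcript doesn't show the actual data, so the fruits and veggies DataFrames here are hypothetical stand-ins, chosen only so that veggies has one row fewer and one column more than fruits:

```python
import pandas as pd

# Hypothetical stand-ins for the lesson's DataFrames (the real data is not in the transcript).
fruits = pd.DataFrame(
    {"name": ["apple", "banana", "cherry"], "price": [1.0, 0.5, 3.0]}
)
veggies = pd.DataFrame(
    {"name": ["carrot", "leek"], "price": [0.8, 1.2], "color": ["orange", "green"]}
)

# keys= adds an outer column level that records which DataFrame each column came from.
side_by_side = pd.concat(
    [fruits, veggies], axis="columns", keys=["fruits", "veggies"]
)
print(side_by_side)

# Row 2 exists only in fruits, so the veggies columns are filled with NaN there.
print(side_by_side["veggies"].loc[2])
```

Both DataFrames have name and price columns, but the outer "fruits" and "veggies" level keeps them apart.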

01:43 Now, turn this around and use the same call, but instead of using axis="columns", use the default of rows. I’ll put it in explicitly, but you could also just skip this whole argument.

01:56 Then you get the same concatenation that you got at the beginning. And again, you’ll see that down here, this part relates to your veggies DataFrame. There are no NaN values in this part, but you can see them in the part up there because fruits has one less column.

02:12 This is the fruits DataFrame, and here are the NaN values that it needs to fill to create a full table down here. So this is how you can use the keys argument to pd.concat() to give a better idea of where the data came from and get rid of the ambiguity of having, for example, multiple 0 indices or multiple columns of the same name.
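The row-wise version can be sketched the same way. Again, the two DataFrames are hypothetical stand-ins rather than the lesson's real data, and axis=0 is used here as the explicit spelling of the default row axis:

```python
import pandas as pd

# Hypothetical stand-ins for the lesson's DataFrames (the real data is not in the transcript).
fruits = pd.DataFrame(
    {"name": ["apple", "banana", "cherry"], "price": [1.0, 0.5, 3.0]}
)
veggies = pd.DataFrame(
    {"name": ["carrot", "leek"], "price": [0.8, 1.2], "color": ["orange", "green"]}
)

# axis=0 is the default, so it could be left out entirely.
stacked = pd.concat([fruits, veggies], axis=0, keys=["fruits", "veggies"])
print(stacked)

# fruits has no "color" column, so that column is NaN for every fruits row.
print(stacked.loc["fruits", "color"])
```

The original row labels 0 and 1 still appear twice in the inner index level, but the outer key makes each row label unique.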

02:39 In this lesson, you’ve used the optional keys argument to pd.concat() to mark your DataFrames and avoid collisions that come from duplicate label values after concatenation.

02:50 You did that by calling pd.concat() and passing an iterable to the keys parameter. Often, you’d just put in the names of the DataFrames that you’re concatenating.

03:01 And this constructs a multi-index DataFrame that you can then also use to access specific parts of the DataFrame despite duplicate labels.
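As a rough preview of what that access can look like, here's a short sketch using .loc on the outer key. The DataFrames are hypothetical stand-ins, and this is just one possible way to select from a multi-index result:

```python
import pandas as pd

# Hypothetical stand-ins for the lesson's DataFrames (the real data is not in the transcript).
fruits = pd.DataFrame(
    {"name": ["apple", "banana", "cherry"], "price": [1.0, 0.5, 3.0]}
)
veggies = pd.DataFrame(
    {"name": ["carrot", "leek"], "price": [0.8, 1.2], "color": ["orange", "green"]}
)

combined = pd.concat([fruits, veggies], keys=["fruits", "veggies"])

# Select every row that came from veggies; the outer level is dropped from the result.
veggies_part = combined.loc["veggies"]

# Select one specific row with a (key, original label) tuple.
first_veggie = combined.loc[("veggies", 0)]

print(veggies_part)
print(first_veggie["name"])
```

Without keys, the label 0 would match a row from each DataFrame, so the tuple is what makes the selection unambiguous.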

03:12 In the next lesson, you’ll learn how you can actually access specific data items in such a multi-index DataFrame, which means that you’ll take a quick break from the different arguments to pd.concat(), stick with the keys one, and figure out what the advantages of using it are and which errors you’d run into if you don’t.
