Specify the Join Columns Explicitly

Combining Data in pandas With concat() and merge() Martin Breuss 06:07

00:00 In previous lessons, you’ve heard me talk about how pandas automatically figures out the intersection between two DataFrames to pick which columns to use to perform the join operation on. And in our case, between the veggies and fruits DataFrame, those were the name and image columns.

00:17 You can also define specifically which column you want to use to join the DataFrames on. And you learn about how to do this in the current lesson.

00:29 Over here in the Jupyter notebook, I’m first going to reset the fruits DataFrame to drop the taste column that I added for an example in the previous lesson, because currently fruits looks different than it did originally.

00:43 And I’m just going to go back to the previous version by saying `fruits.drop(“taste”) on the columns axis.

00:56 And we will need to reassign this to the fruits DataFrame so that now we’re back to the start with the familiar fruits DataFrame.

01:07 Now what you did previously—you used pd.merge(fruits, veggies) with the default of allowing pandas to figure out what is the intersection between those two DataFrames, and those are the columns name and image, and then pandas performed the merge.

01:25 And in this case, an inner merge, dropping all the data that didn’t have a match. Now I can define a different specific column that I want to match on. So to say pd.merge(fruits, veggies) and then say on, for example, just the "name" column.

01:45 Then pandas is going to compare the veggies and the fruits DataFrame and figure out where is a match in the name column, and then keep that around.

01:55 You might already think, oh, well it’s still going to be just tomato and you’re right with this, but since you’re not matching on the image column, something different is going to happen.

02:04 pandas is not comparing whether they’re the same. And so it needs to do something with the second image data. And what it does here by default is it keeps both of those around. The only match and combination happened on name, so here you do not have a duplicate value, and the name only appears once.

02:22 But both in the X column, the left column, as well as in the right column, you had an image column, and they both contain some data on that matched row. pandas wants to keep both of those around, so now in your result, you get two images of the tomato with this label added automatically added to the column names that allows you to distinguish where did they come from originally?

02:48 So if you take one step back from here and try to figure out what actually happened up in pd.merge(), you’ll see that the on parameter, like I mentioned earlier, matched on the name as well as the image column.

03:04 This is how you get the result where neither of those are duplicated because pandas looked at both of those columns and found matching data in both of them, so there was no need to duplicate it.

03:16 Now you might already know what happens when you do the same thing, but define "image". Give it a try.

03:25 And as you can see, now it keeps the information of the name column of both the left as well as the right DataFrame. So you have duplicate values for the name of the tomato.

03:38 Now you might wonder what happens if you try to merge on a different column, let’s say the color column. Merge, and you know, you can always give it a try and see what happens,

03:50 on="color". Pause the video and think about what you expect to happen. Is it going to be an error or an empty DataFrame or something different?

04:02 And pandas gives you a KeyError that tells you that color, the name of the column that you passed, does not exist in both of the DataFrames.

04:12 If you go back into the function signature of pd.merge(), you will see what that is the case. There’s the docstring:

04:23 on column or index level names to join on these, must be found in both DataFrames. So as you can see, color isn’t found in both of the DataFrames, which is why pandas throws a KeyError if you try to pass a column name that doesn’t exist in both of them.

05:17 And you can also define a single or a composite list of column names and perform the merge on a single column. If you do that and there’s duplicate column names, then the data gets duplicated as well, and pandas puts some indicator to figure out where did the data come from initially.

05:38 And you also saw that if you use a column name that is not present in both DataFames, then you will run into an error when you try to perform the merge.

05:51 Now, if you think about the function signature here and the docstring, it also mentioned that you can join on indices. So in next lesson, you will take a look at how you can join two DataFrames using pd.merge() on an index column.

Become a Member to join the conversation.