Cleaning Your Data

Explore Your Dataset With pandas Douglas Starnes 04:28

For more examples of what you can do with data cleanup, check out Pythonic Data Cleaning With pandas and NumPy.

00:00 You can tell a lot from the data returned by the .info() method. Take another look. For each column, the table lists the number of non-null values.

00:11 Most of the columns have 126,314 non-null values. And the DataFrame has 126,314 rows, thus those columns have no null or missing values. However, the notes column has only 5,424 non-null values, which says there are a lot of missing values in the notes column.

00:35 This is a problem because missing data results in what is essentially guessing for some data tasks. Thus, removing or resolving missing data is important.

00:46 One way to manage missing data is to completely remove it. You can simply omit any row that has a column with missing values. The nba DataFrame only has missing values for the notes column, but the .dropna() method will search all columns in the DataFrame for nulls.

01:48 And now you can see that clean_nba only has 23 columns.

01:54 Also, the notes column has been removed. If the notes column did have relevant information, you could keep it and replace just the missing values with a default value.

02:06 The .fillna() method on a Series will take the value to insert.

02:12 Also, you used the inplace argument to modify the calling DataFrame instead of returning the modified DataFrame. Look at .info() again.

02:24 There are now 126,314 non-null values in the notes column. And if you .describe() the 'notes' column, the value occurring the most is 'no notes at all'.

02:36 This makes sense as more than 90% of the values were missing before being replaced. Run .describe() on nba again. Look at the minimum value of the pts column.

02:49 Apparently, there was at least one game in which a team scored no points. Did they even try? Let’s take a deeper look. If you look at the notes column, it says that this game was forfeited. Now since there’s only one game with 0 points it might not have much of an impact, but you can now proceed to either remove or handle this row.

03:10 Several times, you’ve seen the game_result column that records a win or loss for that game. But does every win match up with the rest of the data in the row?

03:21 If a game result is a win, then the points scored for that row should be more than the opposition points scored. And for a loss, the opposition points should be greater.

03:31 You can test these conditions using a query.

03:35 The first condition checks that the number of points scored by a team is greater than the number of opposition points. The second condition checks that the game was not a win. When joined with the AND operator (&), the query should yield no results. And it does, but there’s an easier way to see. Simply get the .empty attribute.

03:57 A query returns a DataFrame, and a DataFrame with no rows is empty. To check the opposite, reverse the first comparison to check for games in which the opponents had more points and make sure the game was not a loss. Again, this should be empty.

04:14 You have only seen part of what you can do with data cleanup. For more examples, check out this post on Real Python. In the next lesson, you’ll learn how to combine data from multiple sources.

Become a Member to join the conversation.