Doing Rolling-Window Analysis

The pandas DataFrame: Working With Data Efficiently Cesar Aguilar 04:31

Note: Though you can’t see it onscreen, calling temp.rolling(window=3, center=True).mean() at this point in the video will actually return two NaN values, with the second one being the final value. The final value won’t have another value following it, and so there’s not enough data to compute the mean for that time.
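To see what the note describes, here is a minimal sketch. The hourly temperatures and the `temp_c` column name are hypothetical stand-ins rather than the data used in the video; only `rolling(window=3, center=True).mean()` comes from the lesson.

```python
import pandas as pd

# Hypothetical hourly temperatures in degrees Celsius (not the video's data).
index = pd.date_range("2019-10-27 00:00", periods=6, freq=pd.Timedelta(hours=1))
temp = pd.DataFrame({"temp_c": [8.0, 7.1, 6.8, 6.2, 6.0, 5.4]}, index=index)

# With center=True each label sits in the middle of its 3-wide window,
# so the first and the last rows lack one neighbor and come out as NaN.
centered = temp.rolling(window=3, center=True).mean()

print(centered)
print(centered["temp_c"].isna().tolist())
```

With six rows, only the first and the final entries should be missing, because each of them has a neighbor on one side only.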

00:00 Another type of operation that you may want to do with time-series data is called rolling-window analysis. Now, one of the reasons why you may want to do something like this is when you’ve got data that varies greatly in very small time intervals and you want a way to smooth out the data.

00:17 So, a common application of this is when you’ve got, say, stock prices. The data, as you know, varies quite a bit, even in very small time intervals, and if you want to get sort of a smoothed-out version of the data you may want to do what’s called rolling-window analysis.

00:34 So, the method for this on the DataFrame is called .rolling(), and we need to specify the width of the window that we want to perform the aggregate function on. In this case, the aggregate function is going to be the mean.

00:52 Let me set the keyword argument window to 3, and then the aggregate function that we’re going to use is called .mean().

01:00 Or again, you could use .min() or .max() just depending on what makes more sense for your application. So let me run that and then let me explain what we get.
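The call described so far might look like the sketch below. The temperature values and the `temp_c` column name are assumptions made for illustration (chosen so that the first full window averages to 7.3, as in the video); they are not taken from the video's dataset.

```python
import pandas as pd

# Hypothetical hourly temperatures in degrees Celsius (not the video's data).
index = pd.date_range("2019-10-27 00:00", periods=6, freq=pd.Timedelta(hours=1))
temp = pd.DataFrame({"temp_c": [8.0, 7.1, 6.8, 6.2, 6.0, 5.4]}, index=index)

# Rolling window of width 3, aggregated with the mean.
rolling_mean = temp.rolling(window=3).mean()

# Other aggregates work the same way on the same window.
rolling_min = temp.rolling(window=3).min()
rolling_max = temp.rolling(window=3).max()

print(rolling_mean)
```

The first two rows of each result are NaN because a window of 3 is not yet full at those times.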

01:10 Let’s focus on the 2:00 value that we get of 7.3. So, we specified a window of 3. What happens here is that the value at 2:00, the 7.3, is computed from the temperature at 2:00 together with the temperatures at the two previous times, and so 2:00 is the right end point of the window.

01:35 We take those first three values, compute the mean, and we get 7.3. If you want to see this explicitly, let me just comment this out, and let me call the .head() and say just the first 3.

01:50 If we average out the temperatures at 12:00 in the morning, 1:00 in the morning, and 2:00 in the morning—average these out—we get the 7.3

02:00 that we had over here. And then to calculate the value at 3:00, we take from the original time-series data the temperatures at 1:00 in the morning, 2:00 in the morning, and 3:00 in the morning, compute the mean, and in that case, we get 6.7.
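The hand calculation can be checked with a short sketch. The data is again hypothetical, picked so that the two windows average to roughly 7.3 and 6.7 like the values read out in the video.

```python
import pandas as pd

# Hypothetical hourly temperatures in degrees Celsius (not the video's data).
index = pd.date_range("2019-10-27 00:00", periods=6, freq=pd.Timedelta(hours=1))
temp = pd.DataFrame({"temp_c": [8.0, 7.1, 6.8, 6.2, 6.0, 5.4]}, index=index)

rolled = temp["temp_c"].rolling(window=3).mean()

# Window ending at 2:00 covers 12:00, 1:00, and 2:00 (the first three rows).
manual_2am = temp.head(3)["temp_c"].mean()

# Window ending at 3:00 covers 1:00, 2:00, and 3:00 (rows 1 through 3).
manual_3am = temp["temp_c"].iloc[1:4].mean()

print(manual_2am, rolled.iloc[2])
print(manual_3am, rolled.iloc[3])
```

Each manual mean should agree with the rolling result at the right end point of its window, up to floating-point rounding.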

02:17 So, the value that’s computed at any given time uses that time as the right end point of, in this case, a window of size 3. Now, maybe it makes sense why we get NaN values for the temperature window value at 12:00 in the morning and at 1:00 in the morning: at 12:00 in the morning, there aren’t two values before 12:00 in the morning in the data, and so there is no computation to be done.

02:41 And then likewise at 1:00 in the morning, if that’s the right end point of the window, we’ve only got one value before it, and so we don’t have enough data points. Now, there’s a keyword argument called center, whose default value is False, and an alternative is to pass in a value of True.

03:02 What this will do is instead of taking the data point where we’re going to compute a value as the right end point of the window, it’s going to be the center of the window. And so in this case, if we run this code,