NumPy Histograms

Python Histogram Plotting: NumPy, Matplotlib, Pandas & Seaborn Joe Tatusko 04:23

So far, you’ve been working with what could best be called frequency tables. But mathematically, a histogram is a mapping of bins (intervals) to frequencies. More technically, once it is normalized so that the bar areas sum to 1, it can be used to approximate the probability density function (PDF) of the underlying variable.

A true histogram first bins the range of values and then counts the number of values that fall into each bin. This is what NumPy’s histogram() does, and it’s the basis for other functions you’ll see here later in Python libraries such as Matplotlib and Pandas.
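As a minimal sketch of those two steps (binning the range, then counting), the toy array below is made up for illustration and is not part of the lesson's data:

```python
import numpy as np

# Hypothetical toy data, chosen so the bins are easy to check by hand.
values = np.array([0.5, 1.5, 1.7, 2.2, 2.9, 3.0])

# Split the range [0, 3] into 3 equal-width bins and count values per bin.
counts, edges = np.histogram(values, bins=3, range=(0, 3))

print(edges)   # the 4 edges that delimit the 3 bins: 0, 1, 2, 3
print(counts)  # number of values in each bin: 1, 2, 3
```

Every bin is half-open except the last one, which also includes its right edge. That is why the value 3.0 lands in the final bin instead of being dropped.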

Consider a sample of floats drawn from the Laplace distribution. This distribution has fatter tails than a normal distribution and has two descriptive parameters (location and scale):

Python

>>> import numpy as np
>>> # `numpy.random` uses its own PRNG.
>>> np.random.seed(444)
>>> np.set_printoptions(precision=3)

>>> d = np.random.laplace(loc=15, scale=3, size=500)
>>> d[:5]
array([18.406, 18.087, 16.004, 16.221,  7.358])
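The "fatter tails" claim can be checked without sampling. The sketch below assumes the standard closed-form tail probabilities: for a Laplace distribution with scale b, P(|X - loc| > t) = exp(-t / b), and its standard deviation is b * sqrt(2). The helper names are hypothetical and only used here.

```python
import math

scale = 3.0                   # Laplace scale b, matching the sample above
sigma = scale * math.sqrt(2)  # standard deviation of a Laplace with scale b

def laplace_tail(t, b):
    """Two-sided tail P(|X - loc| > t) for a Laplace distribution."""
    return math.exp(-t / b)

def normal_tail(t, sd):
    """Two-sided tail P(|X - mean| > t) for a normal distribution."""
    return math.erfc(t / (sd * math.sqrt(2)))

# Compare both distributions at the same standard deviation.
for k in (2, 3, 4):
    t = k * sigma
    print(k, laplace_tail(t, scale), normal_tail(t, sigma))
```

At 2, 3, and 4 standard deviations from the center, the Laplace tail probability is larger than the normal one, and the gap widens as you move further out.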

In this case, you’re working with a continuous distribution, and it wouldn’t be very helpful to tally each float independently, down to the umpteenth decimal place. Instead, you can bin or bucket the data and count the observations that fall into each bin. The histogram is the resulting count of values within each bin.
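A short sketch of that binning step, applied to the same seeded sample. It relies on the default of `np.histogram()`, which is 10 equal-width bins spanning the minimum to the maximum of the data. The `density=True` call is an addition here, included to show the link to the PDF.

```python
import numpy as np

np.random.seed(444)
d = np.random.laplace(loc=15, scale=3, size=500)

# Default: 10 equal-width bins between d.min() and d.max().
counts, bin_edges = np.histogram(d)

print(counts)           # observations per bin
print(bin_edges)        # 11 edges delimiting the 10 bins
print(counts.sum())     # every one of the 500 observations is counted

# With density=True the bar areas sum to 1, approximating a PDF.
density, _ = np.histogram(d, density=True)
area = np.sum(density * np.diff(bin_edges))
print(area)
```

Because the default range runs from the smallest to the largest value, no observation falls outside the bins, so the counts always add up to the sample size.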

00:00 To make actual histograms instead of frequency tables, you’ll need to define bins to collect your data points. Bins are simply intervals with a lower and an upper limit, and you count how many points fall into each one. To make this a bit clearer, we’re going to generate a set of random data using NumPy.

00:17 Go ahead and import numpy as np. And to make sure you get the same values as I do, because NumPy has its own random number generator, you can set a seed here, which this time will be 444.

00:35 And to make sure things are easily viewable, you can also set a print option to set the precision to something like 3 decimal places. Okay.

00:48 Now go ahead and make an array using NumPy’s random module, drawing from a Laplace distribution. Set the loc (location) equal to 15, the scale to 3, and generate 500 data points. And to see what this looks like, you can go ahead and print the first few values.
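As a rough sketch of why the seed matters, the snippet below reseeds NumPy's global generator and draws the sample twice. The variable names `first` and `second` are only for illustration.

```python
import numpy as np

np.set_printoptions(precision=3)

# Seeding the global generator makes the draw repeatable.
np.random.seed(444)
first = np.random.laplace(loc=15, scale=3, size=500)

# Reseeding with the same value restarts the same sequence.
np.random.seed(444)
second = np.random.laplace(loc=15, scale=3, size=500)

print(np.array_equal(first, second))  # the two samples are identical
print(first[:5])                      # first few values, shown to 3 decimals
```

Newer NumPy code often uses `np.random.default_rng(seed)` instead of the global seed. That generator produces a different sequence, so it would not reproduce the values shown in this lesson.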