Using the ColumnDataSource Object

Interactive Data Visualization in Python With Bokeh Christopher Bailey 09:53

This video covers Bokeh’s ColumnDataSource object. The ColumnDataSource is foundational in passing the data to the glyphs you are using to visualize. Its primary functionality is to map names to the columns of your data, making it easier for you to reference data elements when building your visualization.

For information about integrating data sources, check out the Bokeh user guide’s post on the ColumnDataSource and other source objects available.

Bokeh provides a helpful list of CSS color names categorized by their general hue. Also, htmlcolorcodes.com is a great site for finding CSS, hex, and RGB color codes.

File: read_nba_data.py

import pandas as pd 

# Read the csv files
player_stats = pd.read_csv('data/2017-18_playerBoxScore.csv',
                           parse_dates=['gmDate'])
team_stats = pd.read_csv('data/2017-18_teamBoxScore.csv',
                          parse_dates=['gmDate'])
standings = pd.read_csv('data/2017-18_standings.csv',
                         parse_dates=['stDate'])

# Create west_top_2
west_top_2 = (standings[(standings['teamAbbr'] == 'HOU') | 
              (standings['teamAbbr'] == 'GS')]
              .loc[:, ['stDate', 'teamAbbr', 'gameWon']]
              .sort_values(['teamAbbr', 'stDate']))

File: WestConfTop2.py

# Bokeh libraries
from bokeh.io import output_file
from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, CDSView, GroupFilter

# Import the data
from read_nba_data import west_top_2

# Output to static HTML file
output_file('west_top_2_standings_race.html',
            title='Western Conference Top 2 Teams Wins Race')

# Isolate the data for the Rockets and Warriors
rockets_data = west_top_2[west_top_2['teamAbbr'] == 'HOU']
warriors_data = west_top_2[west_top_2['teamAbbr'] == 'GS']

# Create a ColumnDataSource object for each team
rockets_cds = ColumnDataSource(rockets_data)
warriors_cds = ColumnDataSource(warriors_data)

# Create and configure the figure
# (Bokeh 3.0 and later spell the size arguments height= and width=)
fig = figure(x_axis_type='datetime',
             plot_height=300, plot_width=600,
             title='Western Conference Top 2 Teams Wins Race, 2017-18',
             x_axis_label='Date', y_axis_label='Wins',
             toolbar_location=None)

# Render the race as step lines
# (legend_label replaces the deprecated legend keyword in newer Bokeh)
fig.step('stDate', 'gameWon',
         color='#CE1141', legend_label='Rockets',
         source=rockets_cds)
fig.step('stDate', 'gameWon',
         color='#006BB6', legend_label='Warriors',
         source=warriors_cds)

# Move the legend to the upper left corner
fig.legend.location = 'top_left'

# Show the plot
show(fig)

00:00 Now it’s time to practice using the ColumnDataSource object. All the previous examples have employed Python lists and NumPy arrays to represent your data, and Bokeh is well equipped to handle these data types.

00:12 However, when it comes to data in Python, you’re most likely to come across Python dictionaries and pandas DataFrames, especially when reading data from a file or an external data source. Bokeh is well equipped to work with these more complex data structures and even has built-in functionality to handle them: namely, the ColumnDataSource.

00:31 You may be asking yourself, “Why use a ColumnDataSource when Bokeh can interface with other data types directly?” Well, for one, whether you reference a list, an array, a dictionary, or a DataFrame directly, Bokeh’s going to turn it into a ColumnDataSource behind the scenes anyway. And more importantly, the ColumnDataSource makes it much easier to implement Bokeh’s interactive features.
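
A minimal sketch of that first point, assuming a reasonably recent Bokeh release: plain lists passed to a glyph method still end up in a ColumnDataSource attached to the renderer that comes back. The sample values are made up for illustration.

```python
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

fig = figure()

# Pass plain Python lists straight to the glyph method (made-up values)
renderer = fig.line([1, 2, 3], [4, 6, 5])

# Bokeh wrapped the lists in a ColumnDataSource on our behalf
auto_source = renderer.data_source
print(type(auto_source).__name__)
print(sorted(auto_source.data.keys()))
```

Creating the source yourself gives you a named handle on that object, which is what the later interactive features build on.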

00:50 The ColumnDataSource is the foundation for the data that you want to pass to the glyphs that you want to visualize. The primary functionality is to map names to the columns of your data.

01:00 That makes it easier for you to reference elements of your data when you’re building your visualization. A ColumnDataSource can interpret three types of data. One is a Python dictionary, where the keys are names associated with the respective value sequences.

01:14 It could be a dictionary like this, where you have "growth" and "months" as the keys, and then you have lists or arrays as the values.
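
As a rough sketch of the dictionary case, the snippet below builds a ColumnDataSource from a dictionary with "months" and "growth" keys. The numbers are invented for illustration and do not come from the course data.

```python
from bokeh.models import ColumnDataSource

# Hypothetical sample values, invented for illustration
data = {
    "months": [1, 2, 3, 4],
    "growth": [0.5, 1.2, 1.9, 2.4],
}

# Each dictionary key becomes a column name in the source
source = ColumnDataSource(data=data)

column_names = sorted(source.column_names)
print(column_names)
print(source.data["growth"])
```

All value sequences should have the same length, since each one acts as a column of the same table.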

01:23 Another is a pandas DataFrame, where the columns of the DataFrame become reference names for the ColumnDataSource. This would be like the example you just did of reading in a CSV file using pandas. In the upcoming example, you’ll do just that and put the DataFrame into the ColumnDataSource. The last is a pandas groupby, where the columns of the ColumnDataSource reference the columns as seen by calling groupby.describe(). In the case of groupby, you choose a column that you’re going to group things by, in this case "name". Then you apply an aggregate function, in this case .sum().

01:57 And also, you can choose what columns you’d like to keep, in this case "points". So once that’s placed into a ColumnDataSource, you can then access the various players’ names and the sum of their points.
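
The grouping described above might look something like this. It is a sketch with a made-up DataFrame: the "name" and "points" columns follow the transcript, but the player names and values are hypothetical.

```python
import pandas as pd
from bokeh.models import ColumnDataSource

# Hypothetical per-game rows, invented for illustration
games = pd.DataFrame({
    "name": ["Player A", "Player B", "Player A", "Player B"],
    "points": [10, 20, 5, 7],
    "assists": [3, 1, 4, 2],
})

# Group by name, keep only the points column, and sum within each group
totals = games.groupby("name")[["points"]].sum()

# The aggregated DataFrame goes into a ColumnDataSource like any other;
# its named index ("name") becomes a column alongside "points"
source = ColumnDataSource(totals)
print(list(source.data["name"]))
print(list(source.data["points"]))
```

Here the grouped result is aggregated first and then handed over as a regular DataFrame. Passing the groupby object itself is the other route, in which case the column names are derived from the summary statistics rather than from the original columns.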

02:10 You’ll get to try it out in an upcoming example. To use the ColumnDataSource for this next example, you’re going to start by doing a little bit of filtering. Back in your editor, reopen read_nba_data, and you’re going to create a new DataFrame called west_top_2.

02:26 You’re going to create a filter statement. You’re going to take standings, the DataFrame, and narrow it by setting up a conditional: standings where the team abbreviation is equal to Houston or where the team abbreviation matches Golden State.

02:49 And here, you’re filtering the columns you want to keep, which is the 'stDate' (standing date), the 'teamAbbr' (team abbreviation), and then 'gameWon'.

02:59 Then call .sort_values() to sort by team abbreviation, and then by standing date. Go ahead and save. What does that object look like? Back in the terminal, inside your virtual environment, type python.

03:20 So here, inside the REPL, I’m importing. And to look at that object, we can look at the top of it by calling .head() as a method. Here are the first five rows. So again, you’re keeping just these three columns by using this command here, and then you’ve filtered all of the standings down to Houston or Golden State. The values are sorted, so G, Golden State, comes up before H, Houston.

03:48 And if you type west_top_2.tail(), you can see the Houston values at the end. Great!
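
To make the filter, column selection, and sort easier to follow in isolation, here is a small sketch on a stand-in DataFrame. The rows are hypothetical, and the 'BOS' team and 'gameLost' column are included only so there is something to filter out.

```python
import pandas as pd

# Tiny stand-in for the standings DataFrame (hypothetical rows)
standings = pd.DataFrame({
    "stDate": pd.to_datetime(
        ["2017-10-18", "2017-10-17", "2017-10-17", "2017-10-17"]),
    "teamAbbr": ["HOU", "HOU", "GS", "BOS"],
    "gameWon": [2, 1, 0, 0],
    "gameLost": [0, 0, 1, 1],
})

# Each comparison gets its own parentheses; | is the element-wise "or"
mask = (standings["teamAbbr"] == "HOU") | (standings["teamAbbr"] == "GS")

west_top_2 = (standings[mask]
              .loc[:, ["stDate", "teamAbbr", "gameWon"]]
              .sort_values(["teamAbbr", "stDate"]))

# An equivalent mask built with isin()
same_mask = standings["teamAbbr"].isin(["HOU", "GS"])

print(west_top_2)
```

Swapping the | for Python's `or`, or dropping the parentheses around each comparison, is a common way to run into the "truth value of a Series is ambiguous" error.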

03:58 Okay, so with this saved away, it’s time for you to create a new visualization.

04:06 Call it WestConfTop2.py.

04:14 Start this off by importing the Bokeh libraries: from bokeh.io import output_file again, and from bokeh.plotting import figure, show.

04:28 Here’s the new piece. from bokeh.models import Column—not ColumnarDataSource, but ColumnDataSource. Make sure you get ColumnDataSource. Great. Here, import the data that you created in that module. Again, the reason for using the module is just to save some effort.

04:52 And you can import all of it. Great! You’re going to output this to a static HTML file.

05:08 So the output will go to an HTML file named 'west_top_2_standings_race.html', and the title will be 'Western Conference Top 2 Teams Wins Race'.

05:20 To isolate the data for the Rockets and the Warriors, you’re going to do a little filtering here. Create rockets_data by taking the west_top_2 DataFrame that you imported and filtering it to the rows where 'teamAbbr' (team abbreviation) is equal to 'HOU' (Houston).

05:47 And for the Warriors, it’s the same statement, but the filter will be to 'GS', for Golden State. Now that you’ve isolated the data for the Rockets and the Warriors, it’s time to create the column data sources.

06:02 So, to create one for each team, start with rockets_cds: call ColumnDataSource() with rockets_data as the argument. Then do the same thing for warriors_cds.

06:22 Enter in the warriors_data. Okay, now that your column data sources are set, it’s time to create the figure.
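
If you want to check what a source built from a DataFrame actually contains, something along these lines should work. It is a sketch using a three-row stand-in for rockets_data rather than the full CSV.

```python
import pandas as pd
from bokeh.models import ColumnDataSource

# Three sample rows standing in for rockets_data (chosen for illustration)
rockets_data = pd.DataFrame(
    {
        "stDate": pd.to_datetime(["2017-10-17", "2017-10-18", "2017-10-19"]),
        "teamAbbr": ["HOU", "HOU", "HOU"],
        "gameWon": [1.0, 2.0, 2.0],
    },
    index=[21, 81, 141],
)

rockets_cds = ColumnDataSource(rockets_data)

# Column names that glyphs can reference; the unnamed DataFrame index
# is carried along as a column called "index"
print(rockets_cds.column_names)

# Convert back to a DataFrame to inspect the contents
print(rockets_cds.to_df())
```

The .data attribute holds the same information as a dictionary of column name to values, which is handy when you only need one column.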

06:31 So, for the figure, create an instance of figure() and assign it to fig. Set x_axis_type='datetime',

06:41 and then set plot_height=300 and plot_width=600. The title is going to be similar to the one you used a moment ago, 'Western Conference Top 2 Teams Wins Race', so you can paste it in there. Just add that it’s from the 2017-18 season, and set up an x_axis_label='Date' and a y_axis_label='Wins'.

07:08 And to complete your fig, you’ll have the toolbar_location=None so it does not show. Okay. How are you going to see the data on your new figure? Well, from fig, you’re going to choose a new type of glyph called a step line, so fig.step(). For the parameters you’re going to set inside here, you’re going to set 'stDate', which is a column from the ColumnDataSource.

07:33 That’ll be your x, all the dates. And the y will be 'gameWon', which holds each team’s running win total as of that standings date. So for each win, the stair step will move up one step. Next up, instead of picking a color from the named colors, you can also enter in a hexadecimal value for color.

07:52 And this color is the color of the Rockets’ jersey. Also create a legend marker for it, so legend will be 'Rockets'. And last but not least here, you need to set a source.

08:03 Where’s this data coming from? It’s coming from the rockets_cds ColumnDataSource. It looks kind of in reverse, but the source being rockets_cds is going to pull out the x and the y columns from that ColumnDataSource, and then you’re setting a color and a legend value. In fact, this format is going to be the same.

08:21 You’re overlaying two pieces of data for the two teams. So in this case, you can copy and paste it. You need to change it to 'Warriors' for the legend, and then same thing with the data source. You need it to be using the warriors_cds data source. And you need to get the jersey color here too, so '#006BB6'. Great!

08:41 Okay. A real quick thing with the legend—you need to pick where you want it to show. So that’s fig.legend.location and the text string is the 'top_left'. Great.
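
Pulling the glyph calls together, here is a compact sketch of two step lines that share column names but read from different sources. The team names and win counts are hypothetical, and it uses legend_label, which newer Bokeh releases expect in place of legend.

```python
import pandas as pd
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

# Hypothetical running win totals for two made-up teams
dates = pd.to_datetime(["2017-10-17", "2017-10-18", "2017-10-20"])
team_a = ColumnDataSource(pd.DataFrame({"stDate": dates, "gameWon": [1, 2, 2]}))
team_b = ColumnDataSource(pd.DataFrame({"stDate": dates, "gameWon": [0, 0, 1]}))

fig = figure(x_axis_type="datetime",
             x_axis_label="Date", y_axis_label="Wins")

# The first two arguments name columns in `source`, not the data itself
line_a = fig.step("stDate", "gameWon", source=team_a,
                  color="#CE1141", legend_label="Team A")
line_b = fig.step("stDate", "gameWon", source=team_b,
                  color="#006BB6", legend_label="Team B")

# The legend exists once a glyph has a label, so it can be positioned now
fig.legend.location = "top_left"
```

Because both calls use the same column names, switching teams only means switching the source, color, and label.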

08:53 And this should feel familiar. Show that plot by calling show() with the argument fig inside of it. Okay. This all looks good. So with that all set, make sure to save your script. Down here in the terminal,

09:09 you can type in the name of your script, python3 WestConfTop2.py. And here it is. You can see it’s created that static HTML page.

09:25 Here’s the title, Western Conference Top 2 Teams Wins Race. You have your Wins and your Date axes, with the Rockets in red and the Warriors in blue.

09:32 And you can see the glyphs for each step line. Each of the Warriors’ wins moves them up, giving them a little bit of a lead, and then the Rockets come back and overtake them at the end of the season. In the next video, you’ll take the ColumnDataSource object a little further by talking about GroupFilter and CDSView.

Pygator on Aug. 18, 2019

I get the following error when trying to build the dataframe:

FileNotFoundError                         Traceback (most recent call last)
<ipython-input-1-90fbea3810d4> in <module>
      3 # Read the csv files
      4 player_stats = pd.read_csv('data/2017-18_playerBoxScore.csv',
----> 5 parse_dates=['gmDate'])
      6 team_stats = pd.read_csv('data/2017-18_teamBoxScore.csv',
      7 parse_dates=['gmDate'])

~/Bokeh/venv/lib/python3.7/site-packages/pandas/io/parsers.py in parser_f(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)
    683 )
    684
--> 685 return _read(filepath_or_buffer, kwds)
    686
    687 parser_f.__name__ = name

~/Bokeh/venv/lib/python3.7/site-packages/pandas/io/parsers.py in _read(filepath_or_buffer, kwds)
    455
    456 # Create the parser.
--> 457 parser = TextFileReader(fp_or_buf, **kwds)
    458
    459 if chunksize or iterator:

~/Bokeh/venv/lib/python3.7/site-packages/pandas/io/parsers.py in __init__(self, f, engine, **kwds)
    893 self.options["has_index_names"] = kwds["has_index_names"]
    894
--> 895 self._make_engine(self.engine)
    896
    897 def close(self):

~/Bokeh/venv/lib/python3.7/site-packages/pandas/io/parsers.py in _make_engine(self, engine)
   1133 def _make_engine(self, engine="c"):
   1134 if engine == "c":
-> 1135 self._engine = CParserWrapper(self.f, **self.options)
   1136 else:
   1137 if engine == "python":

~/Bokeh/venv/lib/python3.7/site-packages/pandas/io/parsers.py in __init__(self, src, **kwds)
   1904 kwds["usecols"] = self.usecols
   1905
-> 1906 self._reader = parsers.TextReader(src, **kwds)
   1907 self.unnamed_cols = self._reader.unnamed_cols
   1908

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader.__cinit__()

pandas/_libs/parsers.pyx in pandas._libs.parsers.TextReader._setup_parser_source()

FileNotFoundError: [Errno 2] File b'data/2017-18_playerBoxScore.csv' does not exist: b'data/2017-18_playerBoxScore.csv'

Pygator on Aug. 21, 2019

I downloaded the data from the link Dan gave and moved the three CSVs into the data/ subfolder. Datetimes aren’t working for me; I don’t understand how to format them at all.

Chris Bailey RP Team on Aug. 21, 2019

Hi Pygator, Are you getting the same error as you posted above? I just tried the link that Dan provided for the data, and it downloaded the CSVs with incorrect names. Each file name should start with 2017-18 and what I got was 017-18. That would cause the error above of file not found. I will work with Dan on how to fix the issue, but you can rename the files, and add a “2” to the front of each. If the issue is past the first error you got, and is more specific to datetimes, can you send me more info, as to where it is failing. Thanks for taking the time to comment, and I hope I can help you solve this.

mcng4570 on March 26, 2020

You use fig for the figure but at the end you use west_fig twice. Also I am getting the warning: BokehDeprecationWarning: ‘legend’ keyword is deprecated, use explicit ‘legend_label’, ‘legend_field’, or ‘legend_group’ keywords instead

Nice work through so far! Thanks

andresgtn on March 30, 2020

There’s a typo in the file WestConfTop2.py

you declare fig but then call west_fig

Chris Bailey RP Team on March 30, 2020

Hi @andresgtn, Thanks for spotting that typo. The video is correct, using only fig. I have updated the text below the video to just use fig.

ellefore on June 1, 2020

I am working within spyder.

The rockets_data and warriors_data are populated:

rockets_data.head()
Out[13]:
        stDate teamAbbr  gameWon
21  2017-10-17      HOU      1.0
81  2017-10-18      HOU      2.0
141 2017-10-19      HOU      2.0
201 2017-10-20      HOU      2.0
261 2017-10-21      HOU      3.0

warriors_data.head()
Out[14]:
        stDate teamAbbr  gameWon
19  2017-10-17       GS      0.0
79  2017-10-18       GS      0.0
139 2017-10-19       GS      0.0
199 2017-10-20       GS      1.0
259 2017-10-21       GS      1.0

When I run the script, I get the chart with the legend (line colors are present in the legend) but the actual step lines are not in the chart.

Is there a way to print out rockets_cds and warriors_cds to see what they look like? I assume the .step call is what is not working. Have there been changes to the inputs to the .step call? This is the code as I have it:

fig.step('stDate', 'gameWon', color='#CE1141', legend_label='Rockets', source=rockets_cds) fig.step('stDate', 'gameWon', color='#006BB6', legend_label='Warriors', source=warriors_cds)

ellefore on June 1, 2020

The code is separated into 2 lines, copy and paste crammed them together in the previous post.

fig.step('stDate', 'gameWon', color='#CE1141', legend_label='Rockets', source=rockets_cds)

fig.step('stDate', 'gameWon', color='#006BB6', legend_label='Warriors', source=warriors_cds)

patientwriter on April 16, 2022

If you came here because of an error like this:

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

Go here for an answer and explanation: stackoverflow.com/questions/36921951/truth-value-of-a-series-is-ambiguous-use-a-empty-a-bool-a-item-a-any-o Even some of the unaccepted answers are helpful.

andrewodrain on May 17, 2023

Hello Christopher,

I believe the filter statement you have displayed in the video will eventually lead to the chart being incorrect. The problem seems to lie in the ‘west_top_2’ dataframe. There is not a ‘gameWon’ column. This is what I have:

west_top_2 = df.loc[(df['teamAbbr'] == 'HOU') | 
                    (df['teamAbbr'] == 'GS'), 
                    ['stDate', 'teamAbbr','gameWon']]
