7. Bar and Categorical Data Plots

from bokeh.io import show, output_notebook
from bokeh.plotting import figure
# nbi:hide_out
# Render all Bokeh output inline in the notebook
output_notebook()

Basic Bar Charts

Bar charts are a common and important type of plot. Bokeh makes it simple to create all sorts of stacked or nested bar charts, and to deal with categorical data in general. The example below shows a simple bar chart created using the vbar method for drawing vertical bars. (There is a corresponding hbar for horizontal bars.) We also set a few plot properties to make the chart look nicer; see the chapter Styles and Themes for information about visual properties.

# Here is a list of categorical values (or factors)
fruits = ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries']

# Set the x_range to the list of categories above
p = figure(
    plot_height=400,
    plot_width=800,    
    x_range=fruits,
    title="Fruit Counts"
)

# Categorical values can also be used as coordinates
p.vbar(
    x=fruits,
    top=[5, 3, 4, 2, 4, 6],
    width=0.9
)

# Set some properties to make the plot look better
p.xgrid.grid_line_color = None
p.y_range.start = 0

# show the results
show(p)

When we want to create a plot with a categorical range, we pass the ordered list of categorical values to figure, e.g. x_range=['a', 'b', 'c']. In the plot above, we passed the list of fruits as x_range, and we can see those reflected as the x-axis.
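
Because the axis follows the order of the list passed as x_range, reordering that list reorders the bars. The sketch below is an illustration rather than part of the original example: it sorts the fruits by count before building the range. The names counts and sorted_fruits are introduced here only for demonstration, and show(p) is left commented out so the snippet can run outside a notebook.

```python
from bokeh.plotting import figure

fruits = ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries']
counts = [5, 3, 4, 2, 4, 6]

# Sort the categories by their count (ascending); ties keep their original order
sorted_fruits = sorted(fruits, key=lambda fruit: counts[fruits.index(fruit)])

# The axis order is taken from the factors in x_range, not from the data order
p = figure(x_range=sorted_fruits, title="Fruit Counts (sorted)")
p.vbar(x=fruits, top=counts, width=0.9)

print(sorted_fruits)
# show(p)  # uncomment to render
```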

The vbar glyph method takes an x location for the center of the bar, a top and a bottom (which defaults to 0), and a width. When we are using a categorical range, as we are here, each category implicitly has a width of 1, so setting width=0.9 leaves a small gap between adjacent bars. (Another option would be to add some padding to the range.)
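
The horizontal counterpart, hbar, takes the mirrored set of parameters: a y location for the center of the bar, a right and a left (which defaults to 0), and a height. A minimal sketch of the same fruit chart drawn horizontally, assuming the same data as above:

```python
from bokeh.plotting import figure

fruits = ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries']
counts = [5, 3, 4, 2, 4, 6]

# The categories now go on the y-axis, so the list is passed as y_range
p = figure(y_range=fruits, title="Fruit Counts (horizontal)")

# y is the bar center, right is the bar length; left defaults to 0
p.hbar(y=fruits, right=counts, height=0.9)

p.ygrid.grid_line_color = None
p.x_range.start = 0

# show(p)  # uncomment to render
```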

Since vbar is a glyph method, we can use it with a ColumnDataSource just as we would with any other glyph. In the example below, we put the data (including color data) in a ColumnDataSource and use that to drive our plot. We also add a legend; see the chapter Adding Annotations for more information about legends and other annotations.

from bokeh.models import ColumnDataSource
from bokeh.palettes import Spectral6

fruits = ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries']
counts = [5, 3, 4, 2, 4, 6]

source = ColumnDataSource(
    data=dict(
        fruits=fruits,
        counts=counts,
        color=Spectral6
    )
)

p = figure(
    plot_height=400,
    plot_width=800, 
    x_range=fruits,
    y_range=(0, 9),
    title="Fruit Counts"
)
p.vbar(
    x='fruits',
    top='counts',
    width=0.9,
    color='color',
    legend_field="fruits",
    source=source
)

p.xgrid.grid_line_color = None
p.legend.orientation = "horizontal"
p.legend.location = "top_center"

# show the results
show(p)

Stacked Bars

It's often desirable to stack bars together. Bokeh makes this straightforward using the vbar_stack and hbar_stack methods. When passing data to one of these methods, the data source should have a series for each "row" in the stack. You will provide an ordered list of column names to stack together from the data source.
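
As a minimal sketch of the vertical variant, the snippet below stacks three yearly series with vbar_stack. The data values are made up for illustration, and the totals list is computed only to show how high each stacked bar ends up.

```python
from bokeh.models import ColumnDataSource
from bokeh.palettes import GnBu3
from bokeh.plotting import figure

fruits = ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries']
years = ['2015', '2016', '2017']

# One column per layer of the stack, plus the column of categories
data = {'fruits': fruits,
        '2015': [2, 1, 4, 3, 2, 4],
        '2016': [5, 3, 4, 2, 4, 6],
        '2017': [3, 2, 4, 4, 5, 3]}

p = figure(x_range=fruits, title="Fruit counts, stacked by year")

# The list of column names gives the stacking order, bottom to top
renderers = p.vbar_stack(
    years,
    x='fruits',
    width=0.9,
    color=GnBu3,
    legend_label=years,
    source=ColumnDataSource(data)
)

# Height of each complete stack
totals = [sum(layer) for layer in zip(*(data[year] for year in years))]
print(totals)

# show(p)  # uncomment to render
```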

In the example below, we see simulated data for fruit exports (positive values) and imports (negative values) stacked using two calls to hbar_stack. The values in the columns for each year are ordered according to the fruits, i.e. this is not a "tidy" data format.

from bokeh.palettes import GnBu3, OrRd3

years = ['2015', '2016', '2017']
exports = {'fruits' : fruits,
           '2015'   : [2, 1, 4, 3, 2, 4],
           '2016'   : [5, 3, 4, 2, 4, 6],
           '2017'   : [3, 2, 4, 4, 5, 3]}
imports = {'fruits' : fruits,
           '2015'   : [-1, 0, -1, -3, -2, -1],
           '2016'   : [-2, -1, -3, -1, -2, -2],
           '2017'   : [-1, -2, -1, 0, -2, -2]}

p = figure(
    plot_height=400,
    plot_width=800,    
    y_range=fruits,
    x_range=(-16, 16),
    title="Fruit import/export, by year"
)

p.hbar_stack(
    years,
    y='fruits',
    height=0.9,
    color=GnBu3,     
    legend_label=["%s exports" % x for x in years],
    source=ColumnDataSource(exports)
)

p.hbar_stack(
    years,
    y='fruits',
    height=0.9,
    color=OrRd3,
    legend_label=["%s imports" % x for x in years],
    source=ColumnDataSource(imports)
)

p.y_range.range_padding = 0.1
p.ygrid.grid_line_color = None
p.legend.location = "center_left"

# show the results
show(p)

Notice we also added some padding around the categorical range (i.e. at both ends of the axis) by specifying:

p.y_range.range_padding = 0.1

Grouped Bar Charts

Sometimes we want to group bars together, instead of stacking them. Bokeh can handle up to three levels of nested (hierarchical) categories, and will automatically group output according to the outermost level. To specify nested categorical coordinates, the columns of the data source should contain tuples, for example:

x = [ ("Apples", "2015"), ("Apples", "2016"), ("Apples", "2017"), ("Pears", "2015"), ... ]

Values in other columns correspond to each item in x, exactly as in other cases. When plotting with these kinds of nested coordinates, we must tell Bokeh the contents and order of the axis range by explicitly passing a FactorRange to figure. In the example below, this is seen as:

p = figure(x_range=FactorRange(*x), ....)

from bokeh.models import FactorRange

fruits = ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries']
years  = ['2015', '2016', '2017']
data   = {'fruits' : fruits,
        '2015'   : [2, 1, 4, 3, 2, 4],
        '2016'   : [5, 3, 3, 2, 4, 6],
        '2017'   : [3, 2, 4, 4, 5, 3]}

# this creates [ ("Apples", "2015"), ("Apples", "2016"), ("Apples", "2017"), ("Pears", "2015"), ... ]
x = [ (fruit, year) for fruit in fruits for year in years ]

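# Interleave the per-year values so that counts lines up with x (like an hstack);
# summing the zipped tuples onto an empty tuple concatenates them
counts = sum(
    zip(data['2015'], data['2016'], data['2017']),
    ()
)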

source = ColumnDataSource(
    data=dict(
        x=x,
        counts=counts
    )
)

p = figure(
    plot_height=400,
    plot_width=800,    
    x_range=FactorRange(*x),
    title="Fruit Counts by Year"
)

p.vbar(
    x='x',
    top='counts',
    width=0.9,
    source=source
)

p.y_range.start = 0
p.x_range.range_padding = 0.1
p.xaxis.major_label_orientation = 1
p.xgrid.grid_line_color = None

# show the results
show(p)
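
The flattening step in the example above is compact but easy to misread. The plain-Python sketch below, which needs no Bokeh at all, builds the same counts with an explicit comprehension and checks that each value lines up with its (fruit, year) tuple. The lookup dictionary is introduced here only for illustration.

```python
fruits = ['Apples', 'Pears', 'Nectarines', 'Plums', 'Grapes', 'Strawberries']
years = ['2015', '2016', '2017']
data = {'fruits': fruits,
        '2015': [2, 1, 4, 3, 2, 4],
        '2016': [5, 3, 3, 2, 4, 6],
        '2017': [3, 2, 4, 4, 5, 3]}

# Nested coordinates: fruit is the outer level, year the inner level
x = [(fruit, year) for fruit in fruits for year in years]

# Same iteration order as x, so the i-th count belongs to the i-th tuple
counts = [data[year][i] for i, _fruit in enumerate(fruits) for year in years]

# Equivalent to the sum(zip(...), ()) idiom, which returns a tuple
assert tuple(counts) == sum(zip(data['2015'], data['2016'], data['2017']), ())

lookup = dict(zip(x, counts))
print(lookup[("Pears", "2016")])
```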

Another way we can set the color of the bars is to use a transform. We first saw some transforms in the previous chapter, Data Sources and Transformations. Here we use a new one, factor_cmap, that accepts the name of a column to use for colormapping, as well as the palette and factors that define the color mapping.

Additionally, we can configure it to map just the sub-factors if desired. For instance, in this case we don't want to shade each (fruit, year) pair differently. Instead, we want to shade based only on the year. So we pass start=1 and end=2 to specify the slice of each factor to use when colormapping. Then we pass the result as the fill_color value:

fill_color=factor_cmap('x', palette=['firebrick', 'olive', 'navy'], factors=years, start=1, end=2)

to have the colors be applied automatically based on the underlying data.

from bokeh.transform import factor_cmap

p = figure(
    plot_height=400,
    plot_width=800,    
    x_range=FactorRange(*x),
    title="Fruit Counts by Year"
)

p.vbar(
    x='x',
    top='counts',
    width=0.9,
    line_color="white",
    # use the palette to colormap based on the x[1:2] values
    fill_color=factor_cmap(
        'x',
        palette=['firebrick', 'olive', 'navy'],
        factors=years,
        start=1,
        end=2
    ),
    source=source
)

p.y_range.start = 0
p.x_range.range_padding = 0.1
p.xaxis.major_label_orientation = 1
p.xgrid.grid_line_color = None

# show the results
show(p)
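
To make the effect of start and end concrete, here is a rough pure-Python stand-in for the mapping idea. It is not Bokeh's implementation: the function name sketch_factor_cmap and the fallback color are assumptions made for this illustration only.

```python
def sketch_factor_cmap(values, palette, factors, start=0, end=None,
                       fallback='gray'):
    """Illustrative only: pick a palette color from a slice of each factor."""
    color_for = dict(zip(factors, palette))
    colors = []
    for value in values:
        key = value[start:end]
        # A one-element slice is compared against the plain strings in factors
        if len(key) == 1:
            key = key[0]
        colors.append(color_for.get(key, fallback))
    return colors


years = ['2015', '2016', '2017']
x = [("Apples", "2015"), ("Apples", "2016"), ("Apples", "2017"),
     ("Pears", "2015")]

# With start=1, end=2 only the year part of each tuple decides the color
colors = sketch_factor_cmap(x, ['firebrick', 'olive', 'navy'], years,
                            start=1, end=2)
print(colors)
```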

It is also possible to achieve grouped bar plots using another technique called "visual dodge". That would be useful e.g. if you only wanted to have the axis labeled by fruit type, and not include the years on the axis. This tutorial does not cover that technique but you can find information in the User's Guide.

Mixing Categorical Levels

If you have created a range with nested categories as above, it is possible to plot glyphs using only the "outer" categories, if desired. The plot below shows monthly values grouped by quarter as bars. The data for these are in the familiar format:

factors = [("Q1", "jan"), ("Q1", "feb"), ("Q1", "mar"), ....]

The plot also overlays a line representing average quarterly values, and this is accomplished by using only the "quarter" part of each nested category:

p.line(x=["Q1", "Q2", "Q3", "Q4"], y=....)

factors = [("Q1", "jan"), ("Q1", "feb"), ("Q1", "mar"),
           ("Q2", "apr"), ("Q2", "may"), ("Q2", "jun"),
           ("Q3", "jul"), ("Q3", "aug"), ("Q3", "sep"),
           ("Q4", "oct"), ("Q4", "nov"), ("Q4", "dec")]

p = figure(
    plot_height=400,
    plot_width=800,    
    x_range=FactorRange(*factors)
)

x = [ 10, 12, 16, 9, 10, 8, 12, 13, 14, 14, 12, 16 ]
p.vbar(
    x=factors,
    top=x,
    width=0.9,
    alpha=0.5
)

qs, aves = ["Q1", "Q2", "Q3", "Q4"], [12, 9, 13, 14]
p.line(
    x=qs,
    y=aves,
    color="red",
    line_width=3
)
p.circle(
    x=qs,
    y=aves,
    line_color="red",
    fill_color="white",
    size=10
)

p.y_range.start = 0
p.x_range.range_padding = 0.1
p.xgrid.grid_line_color = None

# show the results
show(p)
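
The quarterly averages in the example above are hard-coded. As a sketch, they could instead be derived from the monthly values by grouping on the outer category. The names values and by_quarter are illustrative. Computed this way, Q1 comes out near 12.67 rather than the rounded 12 used above.

```python
from collections import defaultdict

factors = [("Q1", "jan"), ("Q1", "feb"), ("Q1", "mar"),
           ("Q2", "apr"), ("Q2", "may"), ("Q2", "jun"),
           ("Q3", "jul"), ("Q3", "aug"), ("Q3", "sep"),
           ("Q4", "oct"), ("Q4", "nov"), ("Q4", "dec")]
values = [10, 12, 16, 9, 10, 8, 12, 13, 14, 14, 12, 16]

# Collect the monthly values under their outer ("quarter") category
by_quarter = defaultdict(list)
for (quarter, _month), value in zip(factors, values):
    by_quarter[quarter].append(value)

# Dictionaries preserve insertion order, so the quarters stay in axis order
qs = list(by_quarter)
aves = [sum(vals) / len(vals) for vals in by_quarter.values()]

print(qs)
print([round(a, 2) for a in aves])
```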

Using Pandas GroupBy

We may want to make charts based on the results of "group by" operations. Bokeh can utilize Pandas GroupBy objects directly to make this simpler. Let's take a look at how Bokeh deals with GroupBy objects by examining the "cars" data set.

from bokeh.sampledata.autompg import autompg_clean as df

df.cyl = df.cyl.astype(str)
df.head()
    mpg  cyl  displ   hp  weight  accel  yr         origin                       name        mfr
0  18.0    8  307.0  130    3504   12.0  70  North America  chevrolet chevelle malibu  chevrolet
1  15.0    8  350.0  165    3693   11.5  70  North America          buick skylark 320      buick
2  18.0    8  318.0  150    3436   11.0  70  North America         plymouth satellite   plymouth
3  16.0    8  304.0  150    3433   12.0  70  North America              amc rebel sst        amc
4  17.0    8  302.0  140    3449   10.5  70  North America                ford torino       ford

Suppose we would like to display some values grouped according to "cyl". If we create group = df.groupby('cyl') and then call group.describe(), we can see that Pandas automatically computes various statistics for each group.

group = df.groupby('cyl')
group.describe()
       mpg                                                        displ              ...  accel           yr
     count       mean       std   min    25%    50%    75%   max  count        mean  ...    75%   max  count       mean       std   min    25%   50%    75%   max
cyl
  3    4.0  20.550000  2.564501  18.0  18.75  20.25  22.05  23.7    4.0   72.500000  ...   13.5  13.5    4.0  75.500000  3.696846  72.0  72.75  75.0  77.75  80.0
  4  199.0  29.283920  5.670546  18.0  25.00  28.40  32.95  46.6  199.0  109.670854  ...   18.0  24.8  199.0  77.030151  3.737484  70.0  74.00  77.0  80.00  82.0
  5    3.0  27.366667  8.228204  20.3  22.85  25.40  30.90  36.4    3.0  145.000000  ...   20.0  20.1    3.0  79.000000  1.000000  78.0  78.50  79.0  79.50  80.0
  6   83.0  19.973494  3.828809  15.0  18.00  19.00  21.00  38.0   83.0  218.361446  ...   17.6  21.0   83.0  75.951807  3.264381  70.0  74.00  76.0  78.00  82.0
  8  103.0  14.963107  2.836284   9.0  13.00  14.00  16.00  26.6  103.0  345.009709  ...   14.0  22.2  103.0  73.902913  3.021214  70.0  72.00  73.0  76.00  81.0

5 rows × 48 columns

Bokeh allows us to create a ColumnDataSource directly from Pandas GroupBy objects, and when this happens, the data source is automatically filled with the summary values from group.describe(). Observe the column names below, which correspond to the output above.

source = ColumnDataSource(group)
",".join(source.column_names)
'cyl,mpg_count,mpg_mean,mpg_std,mpg_min,mpg_25%,mpg_50%,mpg_75%,mpg_max,displ_count,displ_mean,displ_std,displ_min,displ_25%,displ_50%,displ_75%,displ_max,hp_count,hp_mean,hp_std,hp_min,hp_25%,hp_50%,hp_75%,hp_max,weight_count,weight_mean,weight_std,weight_min,weight_25%,weight_50%,weight_75%,weight_max,accel_count,accel_mean,accel_std,accel_min,accel_25%,accel_50%,accel_75%,accel_max,yr_count,yr_mean,yr_std,yr_min,yr_25%,yr_50%,yr_75%,yr_max'
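
The column names above follow a "column_statistic" pattern. The sketch below reproduces that naming by hand with plain Pandas on a tiny made-up frame. It only illustrates the naming scheme and makes no claim about how Bokeh builds the data source internally.

```python
import pandas as pd

# Tiny made-up data set, just to show the naming pattern
df = pd.DataFrame({
    "cyl": ["4", "4", "8", "8"],
    "mpg": [30.0, 28.0, 15.0, 13.0],
})

# describe() on a GroupBy yields two-level columns such as ("mpg", "mean")
stats = df.groupby("cyl").describe()

# Join the two levels with an underscore to get names like "mpg_mean"
stats.columns = ["_".join(col) for col in stats.columns]

print(list(stats.columns))
print(stats["mpg_mean"].to_dict())
```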

Knowing these column names, we can immediately create bar charts based on Pandas GroupBy objects. The example below plots the average MPG per cylinder count, i.e. columns "mpg_mean" vs "cyl".

from bokeh.palettes import Spectral5

cyl_cmap = factor_cmap(
    'cyl',
    palette=Spectral5,
    factors=sorted(df.cyl.unique())
)

# create a new plot using figure
p = figure(
    plot_height=400,
    plot_width=800,    
    x_range=group
)
p.vbar(
    x='cyl',
    top='mpg_mean',
    width=1,
    line_color="white", 
    fill_color=cyl_cmap,
    source=source
)

p.xgrid.grid_line_color = None
p.xaxis.axis_label = "number of cylinders"
p.yaxis.axis_label = "Mean MPG"
p.y_range.start = 0

# show the results
show(p)

Categorical Scatterplots

So far we have seen categorical data used together with various bar glyphs, but Bokeh can use categorical coordinates with almost any glyph. Let's create a scatter plot with categorical coordinates on one axis. The commits data set is simply a series of datetimes of GitHub commits. Additional columns expressing the day and the hour of day for each commit have already been added.

from bokeh.sampledata.commits import data

data.head()
                           day      time
datetime
2017-04-22 15:11:58-05:00  Sat  15:11:58
2017-04-21 14:20:57-05:00  Fri  14:20:57
2017-04-20 14:35:08-05:00  Thu  14:35:08
2017-04-20 10:34:29-05:00  Thu  10:34:29
2017-04-20 09:17:23-05:00  Thu  09:17:23

To create our scatter plot, we pass the list of categories as the range, just as before:

p = figure(y_range=DAYS, ...)

Then we can plot circles for each commit, with "time" driving the x-coordinate, and "day" driving the y-coordinate.

p.circle(x='time', y='day', ...)

To make the values more distinguishable, we can also add a jitter transform to the y-coordinate, which is shown in the complete example below.

from bokeh.transform import jitter

DAYS = ['Sun', 'Sat', 'Fri', 'Thu', 'Wed', 'Tue', 'Mon']
source = ColumnDataSource(data)

p = figure(
    plot_height=400,
    plot_width=800,
    y_range=DAYS,
    x_axis_type='datetime', 
    title="Commits by Time of Day (US/Central) 2012—2016"
)

p.circle(
    x='time',
    y=jitter('day', width=0.6, range=p.y_range),
    alpha=0.3,
    source=source
)

p.xaxis[0].formatter.days = ['%Hh']
p.x_range.range_padding = 0
p.ygrid.grid_line_color = None

# show the results
show(p)
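
As a rough mental model of the jitter transform used above, each point is moved by a small random offset around the center of its category. The plain-Python sketch below assumes a uniform offset within plus or minus half of width, and assumes that the i-th category of an un-nested range is centered at i + 0.5. The function name sketch_jitter is hypothetical, and this is an illustration of the idea, not Bokeh's code.

```python
import random

DAYS = ['Sun', 'Sat', 'Fri', 'Thu', 'Wed', 'Tue', 'Mon']


def sketch_jitter(day, width=0.6, rng=random):
    """Illustrative only: category center plus a uniform random offset."""
    center = DAYS.index(day) + 0.5              # assumed center of the category
    return center + (rng.random() - 0.5) * width


rng = random.Random(0)  # seeded so repeated runs give the same offsets
ys = [sketch_jitter('Thu', rng=rng) for _ in range(5)]

# All points stay inside the band belonging to 'Thu'
low, high = 3.5 - 0.3, 3.5 + 0.3
print(all(low <= y <= high for y in ys))
```

With width=0.6 the points fill a little over half of each category band, which keeps neighboring days visually separate.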