Altair vs. Bokeh (part 3): Scatterplots



Overview

This is my third entry in a series comparing two interactive data visualization libraries for Python, Bokeh and Altair. In part 1, I went through a basic overview of some of the differences in terms of syntactic style and defaults. In part 2, I went slightly deeper into some of the interactive features and more fundamental limitations. In this part, I want to deepen the comparison further, by exploring what each package offers for one common chart type, namely scatterplots.

Scatterplots are extremely common, and will be familiar to most people. Typically, they are used to visualize two quantitative variables (on the x and y axes), with the ability to encode more data attributes via additional mark properties, such as size, shape, and colour. Although this is not the only thing they are useful for, we typically think of scatterplots as being especially well suited for showing relationships between the two quantitative variables, and, to a lesser extent, for showing clusters.

Naturally, both Bokeh and Altair make it very easy to create basic scatterplots. The differences grow more dramatic, however, once you start delving into interactivity, which of course opens up a great many more possibilities.

As a brief preview, both Altair and Bokeh are capable of creating basic interactive scatterplots. However, going through a number of scenarios, it quickly becomes apparent that a lot of relatively standard designs one might want to implement are much easier to create in Altair than Bokeh. Ultimately, you may still run into frustrations with Altair when you want to go beyond some basic limitations. However, the more time I spend with it, the more impressed I am by how well Altair's designers have anticipated various use cases, and that ends up paying huge benefits in terms of what Altair makes it easy to do.

Basic scatterplots

For a basic comparison, let's use the cars dataset to make a simple scatterplot, and use colour to encode an additional nominal (i.e., categorical) variable (origin).

For the Altair version, we can do this all with a set of chained functions, as we might expect. As before, all we need to do is specify the variable and data type for each channel we wish to use (which here will be x, y, and colour). Moreover, chaining the interactive() function automatically enables panning and pinch zooming, and we can easily create tooltips for individual data points by specifying that as an additional channel. Already, this is much better than a static scatterplot (such as we might create with something like matplotlib), since we can zoom in or out, and easily identify individual data points using tooltips, without having to label every single point (which would likely be impossible to read).

For this version, I have added both car name and production year as information to be encoded as tooltips, and these are nicely formatted automatically, partly thanks to the year() function, which extracts the year from a date. Because Altair establishes a data type for each encoding, when we assign origin to the colour channel, and tell Altair that it is nominal (N), Altair automatically chooses an appropriate colour palette and creates a legend with the various regions represented by origin. As we saw before in Part 2, Altair also adds x and y axis labels automatically, which we can make slightly nicer by telling it to use a label (title) in which we have removed the underscores.

Here is the basic Altair code and corresponding (interactive) plot, for which you can pan, zoom, and hover:

# import the basic packages we need
import altair as alt
import vega_datasets as vd

# get the dataset url
cars_url = vd.data.cars.url

# create the chart, using points, with x, y, and colour channels
chart = alt.Chart(cars_url).mark_point().encode(
    # set both Horsepower and Miles per Gallon to be quantitative (Q) variables
    x='Horsepower:Q',
    # override the default axis label for y
    y=alt.Y('Miles_per_Gallon:Q', axis=alt.Axis(title='Miles per gallon')), 
    # set Origin to be a nominal (N) variable
    color='Origin:N',
    # also add tooltips with name and year, adjusting the label for year
    tooltip=['Name:N', alt.Tooltip('year(Year)', title='Year')] 
).interactive()     # Make the chart interactive
chart



Creating the same chart in Bokeh is similar, but with a few minor differences. The biggest difference is that you need to manually tell Bokeh what colour scheme to use. There are a few different ways to do this, but here I have used the factor_cmap() function, which maps from the various values in the origin variable (which need to be separately extracted) to a particular colourmap. Here, I have used Bokeh's default colour scheme, Category10.

Other minor default differences include the need to manually add a legend, the lack of axis labels by default, and filled as opposed to open circles. In addition, a minor frustration is that pinch zooming is not enabled by default in Bokeh; rather, the user needs to manually select the zoom tool from the Bokeh toolbar beside the plot, which I think is strictly worse than Altair's default. Bokeh also defaults to placing the legend inside the plot area, whereas Altair places it beside the plot, though obviously both of these can be changed.

Importantly, note that Bokeh requires that we download the dataset and provide it as a dataframe when creating the plot, rather than just providing a URL for the dataset, as can be done with Altair. (I believe there are ways to set up something similar for Bokeh, but it requires some additional infrastructure). The effect of this is that the entire dataset will be embedded in the HTML code when we save the chart, rather than being retrieved when the page is loaded, which is what our Altair chart is doing. This doesn't make much difference for such a simple chart, but it leads to a more pronounced difference when dealing with larger datasets, as I will return to at the end.

Here's the vanilla Bokeh version:

from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource
from bokeh.transform import factor_cmap
import pandas as pd

# get the dataset as a dataframe, and convert it to a column source
df = vd.data.cars()
source = ColumnDataSource(df)

# get the set of values of the Origin variable
origins = sorted(df['Origin'].unique())

# create the chart
p = figure(width=350, height=350)

# add the points, setting the x, y, colour, and legend
p.scatter(
    'Horsepower',
    'Miles_per_Gallon',
    # map the values in Origin to the first three colours of Category10
    color=factor_cmap('Origin', 'Category10_3', origins),
    legend_field="Origin",
    # specify the data source to use
    source=source
)
show(p)

Making the Bokeh plot a little more like the Altair one requires a few interventions. Things like adding axis labels and a legend title are pretty easy. Using a different palette requires changing the colour map. Creating tooltips is similar to Altair, except that in this case, we need to separately extract the year from the dataframe, and add it as a new column. I think there should be a way to turn on pinch zoom by default, but I was not able to figure this out, so that still requires clicking the appropriate toolbar icon.
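On the zoom question, one option that may help is the active_scroll argument to figure(), which marks a scroll tool as active when the plot first loads. The sketch below assumes a reasonably recent Bokeh release and uses a few made-up points purely for illustration. Whether this also covers pinch gestures on a particular device is something to check rather than assume.

```python
from bokeh.models import WheelZoomTool
from bokeh.plotting import figure

# request the wheel zoom tool, and ask for it to be active from the start
p = figure(
    width=350,
    height=350,
    tools='pan,wheel_zoom,box_zoom,reset',
    active_scroll='wheel_zoom',
)

# a few made-up points, just so there is something to zoom in on
p.scatter([1, 2, 3, 4], [4, 6, 5, 7], size=8)

# the toolbar should now report a wheel zoom tool as its active scroll tool
print(isinstance(p.toolbar.active_scroll, WheelZoomTool))
```

Calling show(p) on this figure should then let you zoom with the scroll wheel without first clicking the toolbar icon.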

Here's a Bokeh version that is a bit more like the Altair defaults:

from bokeh.plotting import figure, show
# CategoricalColorMapper lives in bokeh.models (not bokeh.transform)
from bokeh.models import ColumnDataSource, CategoricalColorMapper

# get the data and extract the years
df = vd.data.cars()
df['intYear'] = [date.year for date in df['Year']]
source = ColumnDataSource(df)

# create our desired colour palette
palette = ['#4c78a8', '#f58518', '#e45756']
color_map = CategoricalColorMapper(factors=sorted(df['Origin'].unique()), palette=palette)

# create the figure and specify the tooltips
p = figure(
    width=350,
    height=350,
    tooltips=[("Name", "@Name"), ("Year", "@intYear")]
)

# plot the points with some adjusted properties
p.scatter('Horsepower',
         'Miles_per_Gallon',
         size=5.5,
         line_color={'field': 'Origin', 'transform': color_map},
         fill_alpha=0,
         line_width=2.,
         line_alpha=0.75,
         legend_field="Origin",
         source=source
        )

# add the desired labels
p.xaxis.axis_label = 'Horsepower'
p.yaxis.axis_label = 'Miles per gallon'
p.legend.title = 'Origin'
show(p)

And here is the modified Bokeh plot, which you'll notice handles the tooltips slightly differently:


Linking and Brushing

Going beyond the basics is where things get more complicated, especially when we want to start expanding the options for interactivity.

For example, a simple feature that is nice to have for paired scatterplots is to create two linked plots with brushing enabled. That is, if we select points in one plot, it should automatically select the same points in the other. This kind of thing is perhaps most relevant to creating an entire scatterplot matrix (SPLOM), but here we'll just use a pair of plots to keep things simple.

For the most basic set of linked plots, both Altair and Bokeh are pretty convenient. For Altair, we can define a selection_interval (which supports selecting multiple points) and set the resolution to global, meaning that it will operate across both plots. We then add this to the chart as a parameter, and set a condition for colour, which will selectively colour points based on whether or not they are currently selected. Learning the semantics of these different types of selections requires a bit of investment, but they are fairly straightforward once one knows how to use them.

For Bokeh, by contrast, linking and brushing in some sense seems to happen automatically. If we simply create a grid layout with two subplots, and turn on the box_select tool for both, Bokeh automatically provides this kind of linked brushing behavior. However, because it's less explicit, it's somewhat less clear how this is working, or how one can control it (or even turn it off). The actual affordances are also somewhat different. In the Altair version, for example, you can drag around a selected region once you have created it, which is kind of nice. The same functionality does not seem to exist in Bokeh.

Here's the Altair version, which you can try out by selecting a region on either plot:

cars_url = vd.data.cars.url

# create the selection interval
brush = alt.selection_interval(resolve='global')

# create a base chart, in which we omit the x encoding
base = alt.Chart(cars_url).mark_point().encode(
    # only add the y-axis variable for now, since we are making a base chart
    y=alt.Y('Miles_per_Gallon:Q', axis=alt.Axis(title='Miles per Gallon')),
    # set a condition to determine colour based on the selection
    color=alt.condition(brush, 'Origin:N', alt.ColorValue('gray')),
).add_params(
    # add our selection interval as a param
    brush
)

# plot two versions of the base chart with different x encodings
base.encode(x=alt.X('Horsepower:Q')) | base.encode(x='Acceleration:Q')



and here's the Bokeh version:

from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure, show
from bokeh.transform import factor_cmap

# create the data as before
df = vd.data.cars()
source = ColumnDataSource(df)
origins = sorted(df['Origin'].unique())

# specify a shared set of tools, including box_select
TOOLS="box_select,wheel_zoom,reset"

# make each of the left and right plots
left = figure(width=350, height=350, title=None, tools=TOOLS)
left.scatter("Horsepower", "Miles_per_Gallon", source=source,
            color=factor_cmap('Origin', 'Category10_3', origins), alpha=0.5, size=8)
left.xaxis.axis_label = 'Horsepower'
left.yaxis.axis_label = 'Miles per Gallon'

right = figure(width=350, height=350, title=None, tools=TOOLS)
right.scatter("Acceleration", "Miles_per_Gallon", source=source,
            color=factor_cmap('Origin', 'Category10_3', origins), alpha=0.5, size=8,
            legend_field="Origin")
right.xaxis.axis_label = 'Acceleration'
right.yaxis.axis_label = 'Miles per Gallon'

# show them in a grid arrangement
show(gridplot([[left, right]]))
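As for how the Bokeh linking works: selections in Bokeh are stored on the data source, so two plots that draw from the same ColumnDataSource share a selection. The sketch below illustrates this with a tiny made-up dataframe and a hypothetical make_pair() helper. Giving each plot its own source should be one way to turn the linking off.

```python
import pandas as pd
from bokeh.layouts import gridplot
from bokeh.models import ColumnDataSource
from bokeh.plotting import figure

# a tiny made-up dataframe, purely for illustration
df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [4, 3, 2, 1], 'c': [2, 4, 1, 3]})
TOOLS = 'box_select,reset'


def make_pair(left_source, right_source):
    """Hypothetical helper: build two side-by-side scatterplots."""
    left = figure(width=250, height=250, tools=TOOLS)
    left_renderer = left.scatter('a', 'b', source=left_source)
    right = figure(width=250, height=250, tools=TOOLS)
    right_renderer = right.scatter('c', 'b', source=right_source)
    return gridplot([[left, right]]), left_renderer, right_renderer


# linked: both plots read from (and select on) the same source object
shared = ColumnDataSource(df)
linked_grid, l1, r1 = make_pair(shared, shared)

# unlinked: each plot gets its own copy of the data, so selections stay separate
unlinked_grid, l2, r2 = make_pair(ColumnDataSource(df), ColumnDataSource(df))

print(l1.data_source is r1.data_source)   # same source object
print(l2.data_source is r2.data_source)   # different source objects
```

Passing linked_grid or unlinked_grid to show() lets you compare the two behaviours. The trade-off of the unlinked version is that the data is embedded twice.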



Both of these are perfectly functional, and either seems like a fine choice. Where the differences actually show up, however, is when we start trying to go beyond the basics. In particular, Altair makes it trivial to create this same sort of linked behavior across different types of plots. For example, Altair provides a nice demo of creating a bar plot next to a scatterplot, where the bar plot dynamically shows the counts of selected item types. Because this is based on the exact same semantics of the selection interval, it's trivial to add the same selector as a filter (which filters the data to that which is selected), rather than a condition. I suspect it's possible to do this in Bokeh, but I have no idea how. For Altair, here's the code and plot:

# create the selection interval, as before
brush = alt.selection_interval(resolve='global')

# create the scatterplot, including the selection
points = alt.Chart(cars_url).mark_point().encode(
    x=alt.X('Horsepower:Q'),
    y=alt.Y('Miles_per_Gallon:Q', axis=alt.Axis(title='Miles per Gallon')),
    color=alt.condition(brush, 'Origin:N', alt.ColorValue('gray')),
).add_params(
    brush
)

# create a barplot, also including the selection
bars = alt.Chart(cars_url).mark_bar().encode(
    x=alt.X('Origin:N', scale=alt.Scale(domain=['Europe', 'Japan', 'USA'])),
    y=alt.Y('count()', scale=alt.Scale(domain=(0, 260))),
    color='Origin:N'
).add_params(
    brush
).transform_filter(
    # include a transform to filter based on selection
    brush
)

# combine the two side by side
points | bars



Again, it's worth emphasizing, as in my previous post, that all of this just works on a static html page, totally seamlessly, and without the need for a dedicated server.

Finally, this is sort of just a convenience, but it's kind of nice that Altair makes it super easy and compact to create a full scatterplot matrix, by using the repeat() function, whereas the Bokeh version of a SPLOM is a bit more gnarly.

Regression

Beyond linking and brushing, one of the most common uses we see for scatterplots is for visualizing the relationship between two variables, such as the output of a linear regression. The R package, ggplot2, for example, not only makes such plots simple and beautiful, but does so in a way that has a subtle yet distinctive look, so you can easily guess how such a plot was made (which has both pros and cons).

Unfortunately, neither Altair nor Bokeh seems to have much support for this kind of integration with statistical modeling, where you can automatically add the line of best fit, along with a confidence band, with just a couple of lines of code. As far as I know, the only way to do it would be to compute the regression in something like statsmodels or sklearn, extract the confidence band, and then plot that as an area layered on top of the scatterplot.

That being said, Altair does provide some basic support for curve fitting with transform_regression(), which includes things like quadratic and exponential models. It's just a shame that this does not seem to support visualizing uncertainty out of the box. Here's a simple example, which fits a separate curve for each group of points:

cars_url = vd.data.cars.url

# create the basic scatterplot
chart = alt.Chart(cars_url).mark_point(color='grey').encode(
    x='Horsepower:Q',
    y=alt.Y('Miles_per_Gallon:Q', axis=alt.Axis(title='Miles per Gallon')),
    color='Origin:N',
    opacity=alt.value(0.3)
)

# Add a quadratic fit for each group
line = chart.transform_regression(
    on='Horsepower',
    regression='Miles_per_Gallon',
    method='quad',
    groupby=['Origin']    
).mark_line(strokeWidth=3).encode(
    opacity=alt.value(1)
)

(chart + line).interactive()

Larger datasets

One issue with Altair that people tend to run up against pretty quickly is that the maximum number of data points it will accept is set to be pretty low by default (only 5,000). This is not a hard limit, but is intended as a kind of guardrail, to remind users that things could get very slow when dealing with larger datasets. In other words, if you're going above 5,000 data points, you should probably think carefully about what you're doing, and whether you actually need to plot all of those points individually, or whether it would be better to do some sort of aggregation.

Fortunately, Altair can still handle moderately large datasets quite easily. Here, I'll use the flights dataset with 20,000 rows to create an interactive example that has tooltips, and is linked to a widget that allows subset selection. (Note that this is just for demonstration purposes, and is not necessarily intended to be a sensible design).

The Altair code here is a bit trickier, in terms of how I'm linking a widget to the chart, but hopefully it should be fairly easy to understand. As you can see, the result has a little bit of lag compared to a similar plot with less data (especially when zooming in or out), but overall it is quite responsive.

# disable the limitation on maximum number of data points
alt.data_transformers.disable_max_rows()

# get the flights dataset with 20,000 rows
flights = vd.data.flights_20k.url

# Create a dropdown box, with three months as options
input_dropdown = alt.binding_select(options=[0, 1, 2], labels=['Jan', 'Feb', 'Mar'], name='month: ')

# Create a selector, to link the dropdown to the chart
selector = alt.selection_point(
    # Create a name for this selector
    name="MonthSelector",
    # control which field this will operate on
    fields=['month'],
    # bind this to the dropdown box
    bind=input_dropdown,
    # set a default value
    value=[{'month': 0}]
)

# Create the basic scatterplot, with opacity and colour linked to the selector
chart = alt.Chart(flights).mark_point(filled=True).encode(
    x='distance:Q',
    y='delay:Q',
    tooltip=['date:T', 'origin:N', 'destination:N'],
    # set conditions which extract the month of the data and match to the selector
    color=alt.condition('month(datum.date) == MonthSelector.month', alt.value('steelblue'), alt.value('grey')),
    opacity=alt.condition('month(datum.date) == MonthSelector.month', alt.value(0.6), alt.value(0.2)),    
).properties(
    width=300,
    height=300
).add_params(
    # add the selector as a parameter
    selector
).interactive()

chart




It is of course possible to make a similar plot in Bokeh, but doing so requires writing a custom callback in JavaScript. (As I discussed in Part 2, this is both Bokeh's biggest advantage, and its biggest liability). In addition, it seems to me that the Bokeh version is, if anything, slower to update when you zoom in or pan across the plot area.

Moreover, as discussed above, without additional infrastructure, Bokeh requires that we embed the entire dataframe in the HTML output. As such, this creates a massive HTML output file (1.3 MB, compared to Altair's 2 kB), since it has to encode all the data in an extremely redundant and inefficient format. (Unfortunately this also makes the source for this blog post a bit horrible). There may well be a better way to implement this (both for speed and size), but here is the best that I could come up with in Bokeh:


from bokeh.plotting import figure, show
from bokeh.models import ColumnDataSource, HoverTool
from bokeh.models import CustomJS, Dropdown
from bokeh.layouts import column

# Load the data as a dataframe and create a new month column
df2 = vd.data.flights_20k()
month_mapping = {1: 'Jan', 2: 'Feb', 3:'Mar'}
df2["month"] = [month_mapping[int(i)] for i in pd.to_datetime(df2["date"]).dt.strftime("%m")]

# set default values, and convert to a column data source
df2['color'] = 'grey'
df2['alpha'] = 0.2
source = ColumnDataSource(df2)

# Make the figure and axis labels
p = figure(
    width=400,
    height=400,
    tools='pan,box_select,wheel_zoom,box_zoom,reset,hover',
    tooltips=[("Date", "@date"), ("Origin", "@origin"), ("Destination", "@destination")]
)
p.xaxis.axis_label = 'Distance'
p.yaxis.axis_label = 'Delay'

# add a scatter renderer, with color and alpha read from the data source
p.scatter('distance',
         'delay',
         alpha='alpha',
         color='color',
         source=source)

# create the dropdown menu
menu = [("Jan", "Jan"), ("Feb", "Feb"), ("Mar", "Mar")]
dropdown = Dropdown(label="Month", menu=menu)

# create the callback to link to the dropdown box
callback = CustomJS(args=dict(source=source, dropdown=dropdown), code="""
    const months = source.data.month
    const colors = []
    const alphas = []
    for (var i = 0; i < months.length; i++) {
        if (months[i] == this.item) {
            colors.push('blue')
            alphas.push(0.3)
        }
        else {
            colors.push('grey')
            alphas.push(0.2)
        }
    }
    source.data.color = colors
    source.data.alpha = alphas
    source.change.emit();
    """)

# link the dropdown to the callback
dropdown.js_on_event("menu_item_click", callback)

# show the plot and dropdown in a column
show(column(p, dropdown))

And here is the resulting Bokeh plot:


Of course, for handling truly large datasets where you want to plot all of the data, neither Altair nor Bokeh is likely to be a suitable choice. There are some specialized packages, like Datashader, which are specifically designed to deal with very large datasets, but that's beyond the scope of this post.

Wrap Up and Limitations

As a final point of comparison, I will note that the one use case where Altair finally failed me was for contour plots. This is one case where Bokeh has a distinct advantage, as it has a dedicated contour plot function, which calls contourpy under the hood.

By contrast, it's possible to create contour plots in Altair, but it's definitely a bit of a pain. First, you have to compute the contours using something like contourpy or even matplotlib. Then you have to convert those contours into lines or areas to be drawn by Altair. This isn't too bad, except that if you have two distinct contours to be drawn at the same level (such as for data that has a bimodal distribution), there doesn't seem to be a simple way to tell Altair to treat the two parts as separate lines (without connecting them), other than putting them on separate layers. Sadly, the problem is that it can sometimes be relatively inconvenient in Altair to simply add a line to a plot, compared to alternatives.

Ultimately, the more you try to push Altair into use cases its designers did not anticipate, the more likely you are to run into frustrations. However, the designers of Altair have done such an exceptional job of anticipating a wide range of potential uses, and integrating these seamlessly into the chart semantics, that overall Altair feels to me like a much better choice than Bokeh for most work in exploratory data visualization, as well as for many communicative visualizations. Bokeh still seems to command a larger audience (based on GitHub stars and watchers), but the more time I spend with both libraries, the more convinced I am that Altair is the nicer choice for most purposes.