Altair vs. Bokeh (part 2) - Granular Material

In the first part of this series, I used a basic bar plot to illustrate the differences between Altair and Bokeh in terms of defaults, chart configuration, and syntactic style. In this post, I'll get into some of the more substantive differences, including more complicated chart types, combining plots, basic interactivity, and how to deploy the output online. Note that this is not intended to be a systematic comparison, but rather more of a preliminary exploration of the options.

To illustrate these, I'll work here with histograms and density plots. These are ideal plots to play around with, as both are extremely common and useful (especially in exploratory data analysis, where they can be used to visualize the distribution of data), and yet both involve choices that go beyond basic aesthetics, like color, symbol, or plot area. In particular, it might matter a lot how we set parameters like the number of bins in a histogram or the bandwidth of the density estimator. Finally, both chart types can benefit from interactivity in important ways.

It doesn't really matter what data we use for this, so I'll use the IMDB rating data provided by the vega_datasets package, which has around 3000 rows. Note that we'll provide the data to Altair as a URL, rather than as a dataframe, to avoid embedding the full dataset in the chart's JSON specification when we output it as HTML. (As an aside, there seems to be an issue such that the density transform we'll use below only works properly for data in JSON format, not CSV.)

As we would expect from the previous post, creating a histogram is quite straightforward in Altair. As before, we will use mark_bar() as the mark type. However, there is one helpful trick here: rather than giving Altair a variable name for the y-encoding, we can give it the special aggregate shorthand count(), which counts the number of data elements that fall into each bin, as defined by whatever we put for the x-encoding (which in this case will be IMDB_Rating). Altair will also choose the number of bins semi-intelligently, with the option to set a maximum number.

In this case, since we are dealing with float-valued ratings that go up to 10, we'll make things a bit more readable by setting the maximum number of bins to 10. (Note that this will actually end up giving us 9 bins, since the minimum rating in the data is greater than 1 and the maximum is less than 10.) As a nice feature, Altair also makes it extremely easy to add tooltips, by just listing the variables to include in the tooltip, which we can use to display the exact count values for each bin. Also note that we are going to specify the data type for each variable (here, everything is Q for quantitative, as opposed to nominal or ordinal), since Altair cannot infer the types when the data is passed as a URL rather than as a dataframe. Here is the code snippet for our first figure:

import altair as alt
from vega_datasets import data

alt.Chart(
    data.movies.url
).mark_bar(
    color='grey'
).encode(
    x=alt.X('IMDB_Rating:Q', bin=alt.Bin(maxbins=10)),
    y='count()',
    tooltip=['count()']
)

And here is the resulting plot (with functional tooltips):

If we want to show a density plot instead, things get a little bit more complicated, but Altair has definitely planned for this particular use case. Here, we'll take advantage of one of the many transform functions provided by Altair, in this case transform_density(). This function will take the values in the data, and compute density values using a kernel density estimator, effectively transforming one variable into another. Here, we'll ask it to produce the outputs in the form of counts (to better match the histogram), rather than as a normalized probability distribution. Having applied this transform, its output is then available to be encoded on the y-channel, just like any other variable. To actually plot the data, we'll use mark_area() to fill in the area below the line. Once again, we end up with very compact code (and output HTML), including tooltips.

import altair as alt
from vega_datasets import data

alt.Chart(
    data.movies.url
).transform_density(
    'IMDB_Rating',
    as_=['IMDB_Rating', 'density'],
    counts=True,
).mark_area().encode(
    x="IMDB_Rating:Q",
    y='density:Q',
    tooltip=['IMDB_Rating:Q', 'density:Q']
)

And here is the corresponding figure:

As with the histogram, this looks quite good, out of the box. As noted, the transform_density() function is applying a kernel density estimator. If we wanted to, we could manually set the bandwidth for this. Since we haven't specified a value, Altair determines one automatically.
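To make the counts option and the role of the bandwidth a bit more concrete, here is a minimal NumPy sketch of a Gaussian kernel density estimate scaled to counts. The function name, the synthetic ratings, and the bandwidth values are all made up for illustration, and this is not meant to reproduce how Altair computes the transform internally.

```python
import numpy as np

def gaussian_kde_counts(samples, grid, bandwidth):
    """Sum of Gaussian kernels (one per sample), evaluated on grid.

    Dividing the result by len(samples) would give a normalized density;
    leaving it unnormalized gives a smoothed-count curve.
    """
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.sum(axis=1)

# Synthetic stand-in for the ratings (not the real IMDB data)
rng = np.random.default_rng(0)
ratings = np.clip(rng.normal(6.5, 1.2, size=500), 1, 10)

grid = np.linspace(0, 12, 601)
counts_curve = gaussian_kde_counts(ratings, grid, bandwidth=0.3)

# The area under the count-scaled curve is close to the sample size
area = counts_curve.sum() * (grid[1] - grid[0])
print(round(float(area), 1))

# A wider kernel flattens the curve, a narrower one makes it spikier
narrow_peak = gaussian_kde_counts(ratings, grid, bandwidth=0.1).max()
wide_peak = gaussian_kde_counts(ratings, grid, bandwidth=2.0).max()
print(narrow_peak > wide_peak)
```

The bandwidth here is simply the standard deviation of each Gaussian bump, in the same units as the ratings.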

Comparing the two charts, it's obvious that both have their advantages. The histogram gives more precise information about the counts in each bin. However, a small change in the number of bins might give us quite a different figure, depending on exactly where the data lies. By contrast, the density plot will change more gradually as we change the bandwidth. On the other hand, with an inappropriate bandwidth, the density estimator will smooth over sharp changes in the density of the data.
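As a small illustration of how sensitive a histogram can be to the binning, here is a toy example using np.histogram. The values are invented so that most of them sit just on either side of 5; nothing here comes from the IMDB data.

```python
import numpy as np

# Made-up values, clustered just either side of 5 (illustrative only)
values = np.array([2.5, 4.8, 4.9, 4.95, 5.05, 5.1, 5.2, 7.5])

# Same data, same range, different number of bins
two_bins, _ = np.histogram(values, bins=2, range=(0, 10))
three_bins, _ = np.histogram(values, bins=3, range=(0, 10))

print(two_bins.tolist())    # [4, 4]
print(three_bins.tolist())  # [1, 6, 1]
```

With two bins the boundary at 5 splits the cluster in half and the data looks flat; with three bins the same data looks sharply peaked.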

One way we might think of combining these plots (to get the best of both) is to allow the user to view both at the same time. Putting them side by side in Altair is especially easy. All we need to do is to use the pipe symbol, i.e., hist | density.

However, for easier comparison, it would probably be better to overlay them on top of each other. To make this more interesting, let's add a slider to control the opacity of each element, to allow the user to transition smoothly from one to the other. To do so, we'll make use of Altair's interactive widgets, which allow for interactively controlling certain aspects of plots.

Like most things in Altair, this is quite straightforward, as long as we're working within the scope of things that have been planned for by Altair's creators. The first step is to create a slider. Referring to the docs, we find that we can do this using the binding_range() function, which takes min, max, and step values. Next, we need to bind this slider to a variable. To do this, we'll use a parameter (param), which basically maps the value of the slider to something that can be used in charts. Here are the two lines required to set this up:

slider = alt.binding_range(min=0, max=1.0, step=0.01)
opacity_var = alt.param(name='Opacity', bind=slider, value=0.5)

Next, we can set the opacity of each plot. We'll set one of these directly to opacity_var, and the other to 1 - opacity_var, such that they will trade off against each other. To make sure Altair recognizes this expression as something it can work with, we'll wrap it in alt.value(). Finally, we'll overlay the two plots on top of each other by using the + symbol (rather than the | symbol), and attach the parameter to the plot using the add_params() function. We'll also turn off the background grid, drop the tooltips, and change the density plot from mark_area() to mark_line(), to make this a bit easier to read.

Here is the full code:

import altair as alt
from vega_datasets import data

# create the slider and associated parameter
slider = alt.binding_range(min=0, max=1.0, step=0.01)
opacity_var = alt.param(name='Opacity', bind=slider, value=0.5)

# create the histogram
hist = alt.Chart(
    data.movies.url
).mark_bar(
    color='grey'
).encode(
    x=alt.X('IMDB_Rating:Q', bin=alt.Bin(maxbins=10)),
    y='count()',
    opacity=opacity_var,
)

# create the density plot
kde = alt.Chart(
    data.movies.url
).transform_density(
    'IMDB_Rating',
    counts=True,
    as_=['IMDB_Rating', 'density'],                        
).mark_line().encode(
    x="IMDB_Rating:Q",
    y='density:Q',
    opacity=alt.value(1-opacity_var),    
)

# overlay the two, and link it to the opacity parameter
(hist + kde).configure_axis(
    grid=False
).add_params(
    opacity_var
)

And here is the resulting interactive plot, which allows the user to easily slide back and forth between the histogram and the density plot:

To make the same thing in Bokeh, we will take a somewhat different approach. First of all, there are the basic differences we looked at last time, like how we create the plot area and add elements using separate plotting commands (more like matplotlib). In this case, however, Bokeh doesn't have a dedicated plotting function for either histograms or density plots. Instead, we'll have to compute these ourselves, and then plot them using the basic plotting commands quad and line, respectively.

The fact that something like computing a density smoother is not built into Bokeh has both pros and cons. On the one hand, being forced to do it ourselves means we will likely be more intentional about what we are doing. On the other hand, if we just want a basic density plot for exploration, this may seem like extra work.

The other thing that is different is how Bokeh supports the interactivity. As in Altair, Bokeh provides a series of widgets for things like sliders, which can be linked to elements of plots. However, the way this is implemented is somewhat different, such that I couldn't find an easy way to directly support the kind of trading off of opacity between the histogram and density plot as I used above. Using the value of the slider directly is easy, but unfortunately Bokeh doesn't seem to accept expressions involving parameters, like one minus the value of the slider.

In order to get this to work, I used what is arguably the one killer feature of Bokeh, which I don't think Altair can match. Specifically, it is possible to write custom callbacks in Javascript, and link them to widgets. I'll dig into this a bit more below, but for the moment, we can just write a very simple callback to allow for the trading off of opacity. For the histogram, we'll use the js_link() function of the slider, which directly links the value of the slider to some glyph attribute (in this case the fill opacity, or alpha, value of the histogram). For the density plot, by contrast, we'll use js_on_change() to write a custom function which sets the value of the line opacity (alpha) to one minus the value of the slider.

Putting it all together, here is what I came up with. To compute the histogram, I manually set the number of bins, and used numpy's histogram function. To compute the density, I used gaussian_kde() in scipy. In both cases, I have been more explicit about the relevant parameters, compared to just using the defaults in Altair.

import numpy as np
from bokeh.layouts import column
from bokeh.models import ColumnDataSource, CustomJS, Slider
from bokeh.plotting import figure, show
from scipy import stats
from vega_datasets import data

# load the data
source = data.movies()
temp = sorted(source['IMDB_Rating'].dropna().values)

# make the plot canvas
plot = figure(width=300, height=300, toolbar_location=None)

# compute the histogram
bins = np.arange(np.floor(min(temp)), np.ceil(max(temp))+1, 1)
hist, edges = np.histogram(temp, bins=bins)

# plot the histogram
quad = plot.quad(top=hist, bottom=0, left=edges[:-1], right=edges[1:],
         fill_color="grey", line_color="white", alpha=0.5)

# compute the density
kernel = stats.gaussian_kde(temp, bw_method='scott')
x = np.linspace(min(temp), max(temp), 100)
y = kernel(x) * len(temp)

# plot the density
col_source = ColumnDataSource(data=dict(x=x, y=y))
line = plot.line('x', 'y', source=col_source, alpha=0.5, width=2)

# make the interactive slider
sl = Slider(start=0.0, end=1.0, step=0.01, value=0.5, title='Opacity')

# link it to the histogram
sl.js_link('value', quad.glyph, 'fill_alpha')

# link it to the density line
sl.js_on_change('value',
    CustomJS(args=dict(other=line),
             code="other.glyph.line_alpha = 1 - this.value"
    )
)

# combine everything together
layout = column(plot, sl)
show(layout)

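Overall, it produces a very similar result. Note, however, that the bandwidth parameters are not directly comparable between the two: a numeric bw_method in scipy is a factor that multiplies the standard deviation of the data, whereas the bandwidth in Altair's density transform appears to be specified directly in the units of the data. As a result, the same number will not necessarily give the same amount of smoothing in both.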
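Part of the difference comes from how scipy interprets its bandwidth argument. The short sketch below (on synthetic data, not the IMDB ratings) shows that gaussian_kde stores a scaling factor, and that the actual kernel width in data units is that factor times the sample standard deviation. Whether Altair's automatic choice lands on the same width is something I have not checked, so treat this only as a way to inspect the scipy side.

```python
import numpy as np
from scipy import stats

# Synthetic stand-in data (illustrative only)
rng = np.random.default_rng(0)
sample = rng.normal(6.5, 1.2, size=1000)

# Scott's rule: for 1-D data the factor is n ** (-1/5)
kde = stats.gaussian_kde(sample, bw_method="scott")
scott_factor = kde.factor

# Standard deviation of each Gaussian kernel, in the units of the data
kernel_sd = np.sqrt(kde.covariance[0, 0])
print(np.isclose(kernel_sd, scott_factor * sample.std(ddof=1)))  # True

# A scalar bw_method is used directly as the factor, not as a width
kde_scalar = stats.gaussian_kde(sample, bw_method=0.5)
print(kde_scalar.factor)  # 0.5
```

So to match an absolute bandwidth h from another tool, one option is to pass bw_method=h / sample.std(ddof=1) to gaussian_kde.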

As before, the HTML output from Bokeh is quite a bit longer (which you can see if you check the HTML source for this page), but that's not necessarily a big deal.

The fact that both Altair and Bokeh support this kind of interactivity, without requiring a server connection, is one of the main reasons to use them over a static plotting library like matplotlib. One could easily produce a static plot using other options, but the ability to create a simple interactive chart that can be trivially embedded in a static webpage is huge!

Unfortunately, this basic interactivity only extends so far. In particular, we've seen how we can link a slider to opacity. There is similarly support for things like choosing which variable to display on which axis (using a drop-down menu), or other aspects of the plot where hooks have been explicitly provided. It does not, however, extend to all aspects of the plot.

For example, it would be extremely convenient if we could allow the user to interactively set the bandwidth of the kernel density estimator, or control the number of bins in the histogram. Both of these would be helpful, because histograms and density plots can look dramatically different depending on how they are configured. Allowing the user to scrub across a range of values allows them to get a better sense of what the data really looks like.

Unfortunately, if we try to link the value of the slider to the bandwidth (either in Altair, or using js_link() in Bokeh), it will not work. For Bokeh, this makes sense: the histogram and density are computed once, when we execute the Python script, rather than by JavaScript in the browser. (In Altair, the transforms are evaluated in the browser, but as far as I can tell the bandwidth is written into the chart specification as a fixed number, rather than something a slider can drive.)

However, as you might have guessed, with Bokeh we can make use of the same trick as above. As mentioned, Bokeh supports custom JavaScript callbacks. This means that, if we are willing to do the work, we can write custom scripts that are executed directly in the browser. This is a huge advantage of Bokeh, and theoretically allows for much more complex interactivity on static sites than can be achieved with Altair. On the other hand, doing so requires embedding JavaScript code in the middle of a Python script, which is not exactly convenient or easy to debug. Moreover, if we're already writing our own JavaScript, at some point it makes more sense to just switch to something like D3, where everything is written in JavaScript and there is much better support for a variety of features.

Nevertheless, as a proof of concept, here is a simple Bokeh script that includes a custom javascript callback that allows the user to interactively set the number of bins in a histogram. (I'll leave it to someone else to write the javascript for computing the KDE!). This is basically the same approach as above, but with many more lines of javascript code embedded within the python script.

import numpy as np
from bokeh.layouts import column
from bokeh.models import ColumnDataSource, CustomJS, Range1d, Slider
from bokeh.plotting import figure, show
from vega_datasets import data

source = data.movies()
temp = sorted(source['IMDB_Rating'].dropna().values)
temp_cs = ColumnDataSource(data=dict(temp=temp))

plot = figure(width=300, height=300, toolbar_location=None)

# compute the basic histogram
bins = np.arange(np.floor(min(temp)), np.ceil(max(temp))+1, 1)
hist, edges = np.histogram(temp, bins=bins)

# put the relevant parameters into a column data source
col_source = ColumnDataSource(data=dict(top=hist, bottom=[0]*len(hist), left=edges[:-1], right=edges[1:]))

# plot the initial data
plot.quad(top='top', bottom='bottom', left='left', right='right', source=col_source,
         fill_color="grey", line_color="white", alpha=0.5)

# recompute the histogram parameters as the slider changes
callback = CustomJS(args=dict(source=col_source, temp_cs=temp_cs), code="""
    const n_bins = cb_obj.value;
    const vals = temp_cs.data.temp;
    const vals_min = Math.floor(Math.min(...vals));
    const vals_max = Math.ceil(Math.max(...vals));
    const bin_width = (vals_max - vals_min) / n_bins;
    // left and right edges of each bin
    const left = Array.from({length: n_bins}, (_, i) => vals_min + i * bin_width);
    const right = left.map(edge => edge + bin_width);
    const bottom = Array(n_bins).fill(0);
    const top = Array(n_bins).fill(0);
    // count the values falling into each bin
    for (const v of vals) {
        // clamp so that a value equal to vals_max lands in the last bin
        const bin_num = Math.min(Math.floor((v - vals_min) / bin_width), n_bins - 1);
        top[bin_num] += 1;
    }
    source.data = {top, bottom, left, right};
    """
)

# create the slider (starting at the number of bins drawn initially) and attach the callback
slider = Slider(start=1, end=50, value=len(hist), step=1, title="Number of bins")
slider.js_on_change('value', callback)

# maintain the same x-range regardless
plot.x_range = Range1d(min(temp), max(temp)+1)

layout = column(plot, slider)
show(layout)

And the result works more or less as expected:
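One way to sanity-check the arithmetic in a callback like this is to port it to Python and compare it against np.histogram. The sketch below does that on synthetic data. The function name is made up, and the port clamps the bin index so that a value sitting exactly on the upper edge is counted in the last bin, which is how np.histogram treats it.

```python
import numpy as np

def js_style_histogram(vals, n_bins):
    """Python port of the equal-width binning used in the JavaScript callback."""
    vals = np.asarray(vals, dtype=float)
    lo = np.floor(vals.min())
    hi = np.ceil(vals.max())
    width = (hi - lo) / n_bins
    idx = np.floor((vals - lo) / width).astype(int)
    # A value equal to hi would otherwise index one past the last bin
    idx = np.minimum(idx, n_bins - 1)
    top = np.bincount(idx, minlength=n_bins)
    left = lo + width * np.arange(n_bins)
    return top, left, left + width

# Synthetic stand-in data, plus one value exactly on the upper edge
rng = np.random.default_rng(0)
raw = rng.normal(6.5, 1.2, size=2000)
sample = np.append(raw, np.ceil(raw.max()))

lo, hi = np.floor(sample.min()), np.ceil(sample.max())
for n_bins in (1, 9, 20, 50):
    top, left, right = js_style_histogram(sample, n_bins)
    reference, _ = np.histogram(sample, bins=n_bins, range=(lo, hi))
    print(n_bins, np.array_equal(top, reference))
```

Without the clamp, the appended edge value would produce an out-of-range bin index, which is the kind of off-by-one that is much easier to spot in Python than in a JavaScript string.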

If we do want to more easily incorporate custom interactions into either Altair or Bokeh, without writing any JavaScript, there is one final option. In fact, it is quite easy to do so, as long as there is a live Python process behind the plots to recompute them (a Bokeh server, for example, or a notebook kernel). This means that we can easily run them locally (e.g., for exploratory data analyses), but unfortunately we can't simply insert them into static web pages.

To deploy such interactive visualizations online (assuming you're not running your own server), your best bet is to use a package like Streamlit or Panel, and host the visualization someplace like Streamlit Cloud or HuggingFace. For example, here is a plot I made using Bokeh and Panel that I have uploaded to HuggingFace. It may take a minute to load up (click Restart this Space, if necessary), but it does work quite seamlessly, and allows the user to separately control both the kernel density estimator bandwidth and the number of bins in the histogram. (If you click on Files, you can see the script used to produce this in app.py, and you will see that it is quite a simple and straightforward Python script, where we wrap the plot generation in a function, and then link that to sliders using Panel.)

Wrapping up, both Altair and Bokeh give us extensive options for interactivity. If we're only using these locally, or we don't care where they are hosted online, then we can easily code up a wide range of interactive plots by combining either Altair or Bokeh with a package like Panel. For deployment on static web pages, however, we'll either need to limit ourselves to the forms of interactivity that are explicitly supported, or we can use Bokeh and write custom JavaScript callbacks. Overall, Bokeh seems to require more external libraries to create some basic chart types (which has both advantages and disadvantages), and may not be as intuitive or compact as Altair, but this ability to support custom callbacks is ultimately a big advantage for deploying interactive visualizations on static pages.