A microbiologist with a data science problem

Making trees in the Jupyter notebook (part 2 - Bokeh)

Wed 12 August 2015

This post will work with the tree and information we produced in the last post. This time we'll use a Python plotting library called Bokeh to visualize the tree.

IToL is a great and flexible tool but on the flexibility front it can't compete with charting libraries like Matplotlib, ggvis, ggplot2, or Bokeh. To demonstrate the power of such flexibility let's make a tree with the new-ish Bokeh. There are several features I want to include in this visualization:

  1. I want the tree to be unrooted. I didn't include an explicit outgroup in the tree from the previous post so I think it's probably just best to display the tree in radial (as opposed to circular) format.
  2. I want the tips to show information when you mouse over them. I think interactive graphics are a great way to include collaborators in the data exploration. Also, interactive graphics can dramatically cut down on communication overhead. Luckily Bokeh makes adding hover interaction really easy as we'll see.
  3. I want to add interaction that takes you to the protein webpage on GenBank when you click a tip node in the tree.
  4. Last, the tree figure should be "zoomable" so that users can drill down into the figure to investigate the details.

Ok, so (as always) we will first bring in all the modules we need. To get Bokeh plots to render inline in the Jupyter notebook you have to call the output_notebook function.

from bokeh.io import output_notebook, show
from bokeh.models.glyphs import Circle, Segment
from bokeh.models import ColumnDataSource, Range1d, DataRange1d, Plot
from cogent.draw.dendrogram import UnrootedDendrogram
from cogent import LoadTree
import pandas as pd
from bokeh.models import HoverTool, TapTool, BoxZoomTool, ResetTool, OpenURL
output_notebook()
BokehJS successfully loaded.

Next we'll load our tree. I'm using pycogent for this post but there are many great tree objects and parsers out there for both Python and R. Pycogent was the only module that had code for plotting an unrooted tree layout that I could find. I didn't do an exhaustiv search and I wouldn't be surprise if there are other implementations. Biopython appears to use graphviz for displaying trees in the radial format and the network layout makes from some interesting looking trees. I prefer the unrooted tree layout algorithm in PyCogent. (FYI, I expect the tree layout code will be moved to scikit-bio at some point if it hasn't already.)

The coords method below returns a 2D array where every row represents a node.

The rows/nodes information in the array are in order the node name, node ID, x position , y position, and a list of child nodes.

tree = LoadTree("data/GH5.tree")
dendrogram = UnrootedDendrogram(tree)
coords = dendrogram.coords(500,500)

Now I'll define a couple functions to help us work with the coords array. The first just checks to see if the child nodes list is empty (an empty list denotes the node is a terminal node). The second function expands each row. That is to say it makes a row for each child of each parent node. This effectively converts our coords array from an array of nodes to an array of node connections. This is useful because when we plot the tree we are really drawing connections or edges as opposed to nodes. The expand functions skips terminal nodes (aka tips) because they don't connect to any downstream nodes (hence "terminal").

def is_tip(coord):
    if not coord[-1]:
        return True
    else:
        return False
    
def expand(coords):
    data = []
    for coord in coords:
        if is_tip(coord):
            continue
        else:
            child_ids = coord[-1]
            for child in child_ids:
                l = coord[1:4]
                l.append(child)
                data.append(l)
    return data

We will also want an dictionary of the x, y coordinates for each node keyed by node ID, a list/set of the tip node IDs, and a dictionary of tip node accessions keyed by corresponding node IDs. If you remember from our last post, the tree sequences are identified by their accession numbers and therefore each tip in the tree has an associated accession number.

node_locs = {r[1]: (r[2], r[3]) for r in coords}
leaves = set([r[1] for r in coords if is_tip(r)])
acc2id = {r[1] : r[0] for r in coords if is_tip(r)}

Let's also bring in the color definitions file from the last post. We uploaded this file to IToL last time to get the tree coloring we wanted. We'll use it for the same purpose here.

tip_info = pd.read_csv("data/color_defs.tsv", sep="\t", header=None, names=["acc","type","color","ec"])

As I wrote above, we plot edges not nodes. So, I want a dataframe where each row represents an edge and has all the information I need for plotting that edge. This information includes the beginning and ending coordinates of the edge, the accession number for terminal edges, and a column denoting whether the edge leads to a tip or not.

df = pd.DataFrame(expand(coords), columns = ["parent","x0","y0","child"])
df.set_index("parent", inplace=True)
df["x1"] = [node_locs[c][0] for c in df["child"].values]
df["y1"] = [node_locs[c][1] for c in df["child"].values]
df["acc"] = [acc2id[c] if c in acc2id else None for c in df["child"].values]
df["is_tip"] = [True if c in leaves else False for c in df["child"]]

Now I'll merge the color definition file info to our edge dataframe. For convenience I also want a dataframe of just terminal edges (df_tips below).

df_all = pd.merge(df, tip_info, on=["acc"], how="left")
df_tips = df_all[df_all["is_tip"]]

This is all the we need to make an unrooted tree figure in any charting library!

To make the tree in Bokeh I'm going to use the chart interface. First I'll set up a dictionary of plot options.

plot_options = dict(plot_width=800, plot_height=800, toolbar_location=None, 
                    outline_line_color=None, title = "Characterized GH5 Tree")

Now I'll make all the glyphs I want. We'll use circle glyphs for terminal nodes and segment glyphs for edges. I am going to make the backbone tree in black and overlay segment glyphs for just the tip edges; the tip edges will be colored by their respective EC activities.

radius = dict(value=1, units="data")

leaf_glyph = Circle(x="x1", y="y1", 
                    radius=radius, 
                    fill_color="#707070", 
                    name="circles", 
                    fill_alpha=0.5) 

tree_glyph = Segment(x0="x0", y0="y0", 
                     x1="x1", y1="y1", 
                     line_color="#151515",
                     line_alpha=0.75)

tip_glyph = Segment(x0="x0", y0="y0", 
                     x1="x1", y1="y1", 
                     line_color="color",
                     line_width=1.65)

When we add these glyphs to our plot we need to upload a corresponding ColumnDataSource. This function just converts or Pandas dataframes to Bokeh data sources.

def df2ds(df):
    return ColumnDataSource(ColumnDataSource.from_df(df))

Below I make the plot and add the glyphs.

ydr = DataRange1d(range_padding=0.05)  
xdr = DataRange1d(range_padding=0.05)  

plot = Plot(x_range=xdr, y_range=ydr, **plot_options)
plot.add_glyph(df2ds(df_all), tree_glyph)
plot.add_glyph(df2ds(df_tips), tip_glyph)
plot.add_glyph(df2ds(df_tips), leaf_glyph)
Out[23]:
<bokeh.models.renderers.GlyphRenderer at 0x7f528b5c5a10>

...and we render the plot inline with the show command. Voila! Pretty cool but no interaction yet...

show(plot)

Let's make the same plot with the hover and zoom action. We start as before but this time I'm going to put a toolbar on the plot that will let us hit a reset button to get back the original figure (It's pretty essential to add the reset function in addition to box zoom).

plot_options = dict(plot_width=800, plot_height=800, toolbar_location="left", 
                    outline_line_color=None, title = "Characterized GH5 Tree")

ydr = DataRange1d(range_padding=0.05)  
xdr = DataRange1d(range_padding=0.05)  

plot = Plot(x_range=xdr, y_range=ydr, **plot_options)
plot.add_glyph(df2ds(df_all), tree_glyph)
plot.add_glyph(df2ds(df_tips), tip_glyph)
leaf = plot.add_glyph(df2ds(df_tips), leaf_glyph)

Let's make the hover tool tip. It's just defined in html. I want to display the accession and EC information for each tip (add and "@" symbol to denote columns in the data source that you want to display in the tooltip).

tooltip = """
    <div>
        <span style="font-size: 17px; font-weight: bold;">accession: </span>
        <span style="font-size: 15px; color: #151515;">@acc</span>
    </div>
    <div>
        <span style="font-size: 17px; font-weight: bold;">EC: </span>
        <span style="font-size: 15px; color: #151515;">@ec</span>
    </div>
"""

Now we make our hover tool and add it to the plot. Be cause we assigned the renderer for our "nodes" (i.e. leaf) we can specify that we want this hover tool to be specific to that glyph in the plot.

hover = HoverTool(renderers = [leaf], tooltips=tooltip)
plot.add_tools(hover, BoxZoomTool(), ResetTool())

Now hover over the points to see the accession and EC information for each tip. Also note the toolbar on the left. You can box zoom into the tree with your mouse to explore the topology at better resolution. Try it out! The reset button will get you back to the original orientation.

show(plot)

That's everything I wanted to do except for the click action. As I stated above, it would be great if clicking on tip nodes brought you to the GenBank webpage for the protein that the each node represents. We can get this interaction with the tap tool. The following is a bit of a hack but the url takes the accession info and returns a url to a CGI script that searches the GenBank protein database for that accession value. In the end, clicking a tip takes you to a GenBank page. Give it a shot!

taptool = TapTool(action=OpenURL(url="http://www.ncbi.nlm.nih.gov/protein/?term=@acc%5Bacc%5D"))
plot_options = dict(plot_width=800, plot_height=800, toolbar_location="left", 
                    outline_line_color=None, title = "Characterized GH5 Tree")

ydr = DataRange1d(range_padding=0.05)  
xdr = DataRange1d(range_padding=0.05)  

plot = Plot(x_range=xdr, y_range=ydr, **plot_options)
plot.add_glyph(df2ds(df_all), tree_glyph)
plot.add_glyph(df2ds(df_tips), tip_glyph)
leaf = plot.add_glyph(df2ds(df_tips), leaf_glyph)

hover = HoverTool(renderers =[leaf],tooltips=tooltip)
plot.add_tools(hover, BoxZoomTool(), ResetTool(), taptool)

show(plot)

This would be a great way to show some results to collaborators without having to clobber it all up with a bunch of labels. And the syntax is really concise. Bokeh is an exciting tool that allows you to make clean, rich visualizations quickly and in my experience it works seamlessly with the Jupyter notebook. Exciting stuff!!!