phylogenetics

Table of Contents

Home

Theory

Phylogeny

phylogeny.png

Newick grammar

See here for a formal specification of the Newick grammar by Gary Olsen (1990). The important part of this is

Conventions:
   Items in { } may appear zero or more times.
   Items in [ ] are optional, they may appear once or not at all.
   All other punctuation marks (colon, semicolon, parentheses, comma and
         single quote) are required parts of the format.


              tree ==> descendant_list [ root_label ] [ : branch_length ] ;

   descendant_list ==> ( subtree { , subtree } )

           subtree ==> descendant_list [internal_node_label] [: branch_length]
                   ==> leaf_label [: branch_length]

            root_label ==> label
   internal_node_label ==> label
            leaf_label ==> label

                 label ==> unquoted_label
                       ==> quoted_label

        unquoted_label ==> string_of_printing_characters
          quoted_label ==> ' string_of_printing_characters '

         branch_length ==> signed_number
                       ==> unsigned_number

Software

biopython

biopython-logo.png

Biopython is a Python library that aims to provide a comprehensive set of modules for computational biology and bioinformatics.

  • the Bio Phylo BaseTree module provides foundational classes for working with phylogenies.
  • the Clade class in particular provides a recursively defined tree.
    • instances of Clade have branch_length (the length of the branch to their parent) and clades (a list of their children).
    • the Clade class has a method is_terminal which is a predicate for if the current clade is a leaf node.
    • iterating over a Clade will iterate over its direct children.
    • the __len__ of a Clade is its number of children.

Example: Reading Newick from file

from Bio import Phylo

newick_tree = "((A:0.1, B:0.2):0.3, (C:0.4, D:0.5):0.6);"

filename = "foobar.newick"
with open(filename, "w+") as file:
    file.write(newick_tree)

tree = Phylo.read(filename, "newick")
clade = tree.root

assert clade[0][0].branch_length == 0.1
assert clade[0][0].name == "A"

Example: Reading Nexus from file

from Bio import Phylo

nexus_tree = """
#NEXUS

BEGIN TAXA;
    TAXLABELS A B C D;
END;

BEGIN TREES;
    TREE tree1 = [&R] ((A:0.1,B:0.2):0.3,(C:0.4,D:0.5):0.6);
END;
"""

filename = "foobar.nexus"
with open(filename, "w+") as file:
    file.write(nexus_tree)

tree = Phylo.read(filename, "nexus")
clade = tree.root

assert clade[0][0].branch_length == 0.1
assert clade[0][0].name == "A"

Example: Reading Newick from string

from io import StringIO
from Bio import Phylo

newick_tree = StringIO("((A:0.1, B:0.2):0.3, (C:0.4, D:0.5):0.6);")

tree = Phylo.read(newick_tree, "newick")
clade = tree.root

assert clade[0][0].branch_length == 0.1
assert clade[0][0].name == "A"

Example: Plotting a tree

biopython-phylo-draw.png

Figure 1: Plot generated by the Phylo.draw method.

from io import StringIO
from Bio import Phylo

newick_tree = StringIO("((A:0.1, B:0.2):0.3, (C:0.4, D:0.5):0.6);")

tree = Phylo.read(newick_tree, "newick")
Phylo.draw(tree)

Example: Plotting the lineages through time (LTT)

biopython-ltt-phylo.png

Figure 2: Facetted figure of a phylogeny and its LTT.

The computation of the LTT information from the tree is handled by the ltt_data function which returns two lists: the times of the LTT changes and the corresponding values. The sharex=True argument to plt.subplots is used to ensure the facets share a common x-axis.

from Bio import Phylo
from io import StringIO
import matplotlib.pyplot as plt
import numpy as np


def ltt_data(tree):
    node_times = []
    def traverse(clade, current_time):
        bl = clade.branch_length if clade.branch_length else 0
        if clade.clades:
            node_times.append(('branch', current_time))
        for child in clade.clades:
            if child.is_terminal():
                node_times.append(('tip', current_time + child.branch_length))
            else:
                traverse(child, current_time + child.branch_length)
    traverse(tree.root, 0)
    times = []
    num_lines = []
    curr_ltt = 1
    for (node_type, time) in node_times:
        times.append(time)
        if node_type == 'branch':
            curr_ltt += 1
            num_lines.append(curr_ltt)
        else:
            curr_ltt -= 1
            num_lines.append(curr_ltt)
    return times, num_lines


newick_tree = StringIO("((A:0.1, B:0.2):0.3, (C:0.4, D:0.5):0.6);")
tree = Phylo.read(newick_tree, "newick")

fig, (ax1, ax2) = plt.subplots(2, 1, sharex=True, gridspec_kw={'height_ratios': [3, 2]})

Phylo.draw(tree, do_show = False, axes = ax1)
ax1.yaxis.set_ticks([])
ax1.yaxis.set_ticklabels([])
ax1.set_ylabel('')
ax1.set_xlabel('')

ltt_x, ltt_y = ltt_data(tree)
ax2.plot(ltt_x, ltt_y, 'bo')
ax2.step(ltt_x, ltt_y, 'b-', where='post', label='Sine Wave')
ax2.set_ylabel('Lineage count', color='b')
ax2.set_xlabel('Time (branch length)')
ax2.tick_params(axis='y', labelcolor='b')

plt.subplots_adjust(right=0.85)
fig.savefig("demo-plotting-ltt.png", dpi=300)

Example: Counting the number of leaves

from io import StringIO
from Bio import Phylo

newick_tree = StringIO("((A:0.1, B:0.2):0.3, (C:0.4, D:0.5):0.6);")

tree = Phylo.read(newick_tree, "newick")
print(f"There are {len(tree.get_terminals())} leaves")
There are 4 leaves

Example: Normalising a tree to unit branch lengths

from io import StringIO
from Bio import Phylo
import copy

newick_tree = StringIO("((A:0.1, B:0.2):0.3, (C:0.4, D:0.5):0.6);")

tree = Phylo.read(newick_tree, "newick")
assert max(tree.depths().values()) == 1.1

def normalise_branch_lengths(tree):
    """
    Normalise a tree to unit branch lengths.
    """
    new_tree = copy.deepcopy(tree)
    branch_lens = [c.branch_length for c in new_tree.find_clades()
                   if c.branch_length is not None]
    mean_branch_len = sum(branch_lens) / len(branch_lens)
    scale_factor = 1 / mean_branch_len
    for c in new_tree.find_clades():
        if c.branch_length is not None:
            c.branch_length *= scale_factor
    return mean_branch_len, new_tree

mean_branch_len, new_tree = normalise_branch_lengths(tree)

print(f"The mean branch length was {format(mean_branch_len, '.3f')}")
print(f"The old tree height is {max(tree.depths().values())}")
print(f"The new tree height is {format(max(new_tree.depths().values()), '.3f')}")
assert max(tree.depths().values()) == 1.1
assert abs(max(new_tree.depths().values()) - (1.1 * scale_factor)) < 1e-6

The output of this script is

The mean branch length was 0.350
The old tree height is 1.1
The new tree height is 3.143

Example: Querying node depths

This script demonstrates how to query the depth of a node in a tree given a reference to the corresponding clade. The documentation for the depths method is here.

from io import StringIO
from Bio import Phylo
import copy

newick_tree = StringIO("((A:0.1, B:0.2):0.3, (C:0.4, D:0.5):0.6);")

tree = Phylo.read(newick_tree, "newick")
depth_dict = tree.depths()

# The root node should have a depth of 0.0
assert depth_dict[tree.clade] == 0.0
# The depth of the left internal node should be 0.3
assert depth_dict[tree.clade.clades[0]] == 0.3
# The depth of leaf A should be 0.4
assert depth_dict[tree.clade.clades[0].clades[0]] == 0.4
# The depth of leaf B should be 0.5
assert depth_dict[tree.clade.clades[0].clades[1]] == 0.5
# The depth of the right internal node should be 0.6
assert depth_dict[tree.clade.clades[1]] == 0.6
# The depth of leaf C should be 1.0
assert depth_dict[tree.clade.clades[1].clades[0]] == 1.0
# The depth of leaf D should be 1.1
assert depth_dict[tree.clade.clades[1].clades[1]] == 1.1

Example: Quickly reading a tree from disk

TLDR: pickle is faster than Newick. Also, don't worry about cPickle, that's what Python 3.x is using anyway.

The following code writes a big tree to one of two types of files: a plain text Newick file or a pickle. It then reads those trees back multiple times to get an estimate of how long it takes. Across a range of tests we see that the pickled version is about twice as fast, but this seems to diminish as we get to larger trees.

from io import StringIO
from Bio import Phylo
import pickle
import timeit

def _tree_string(n):
    if n == 0:
        return "(A:0.1, A:0.1):0.1"
    else:
        tmp = _tree_string(n-1)
        return f"({tmp}, {tmp}):0.1"

def tree_string(n):
    return _tree_string(n) + ";"

newick_tree = tree_string(9)

filename_newick = "foobar.newick"
with open(filename_newick, "w+") as file:
    file.write(newick_tree)

filename_pickle = "foobar.pickle"
newick_tree_obj = Phylo.read(StringIO(newick_tree), "newick")
with open(filename_pickle, "wb+") as file:
    pickle.dump(newick_tree_obj, file)


def test_a():
    tree = Phylo.read(filename_newick, "newick")
    return len(tree.get_terminals())

def test_b():
    with open(filename_pickle, "rb") as file:
        tree = pickle.load(file)
        return len(tree.get_terminals())

assert test_a() == test_b()

num_reps = 250
time_a = timeit.timeit(lambda: test_a(), number=num_reps)
time_b = timeit.timeit(lambda: test_b(), number=num_reps)
print(f"The ratio of time needed is {(time_a/time_b):.2f} for Bio/pickle with {test_a()} nodes with {num_reps} replicates.")

Gives us the following examples for different tree sizes.

The ratio of time needed is 1.98 for Bio/pickle with 256 nodes with 250 replicates.
The ratio of time needed is 2.00 for Bio/pickle with 512 nodes with 250 replicates.
The ratio of time needed is 1.80 for Bio/pickle with 1024 nodes with 250 replicates.
The ratio of time needed is 1.69 for Bio/pickle with 2048 nodes with 250 replicates.

TempEst

Example 1

The following examples demonstrates how to use ape::dist.dna and ape::nj to get a simple tree that can be read into TempEst.

alignment_dna <- read.dna("alignment.fasta", format = "fasta")
dist_matrix <- dist.dna(alignment_dna, model = "K80", pairwise.deletion = TRUE)
nj_phylo <- nj(dist_matrix)
write.nexus(nj_phylo, file = "nj-tree.nexus")

Example 2

We will use the tree shown in Figure 1 an example

tempest-demo-tree.png

When we simulate some sequences on this and then create a neighbour-joining tree based on these sequences we can put this into TempEst to check if there is any signal in the data. The screen shot in Figure 2 shows the results. Note that even with ideal data the result is not perfect.

tempest-results.png

Here is the code that simulated the data set.

library(ape)
library(phangorn)
set.seed(1)
#' Read in a simple tree to use as an example so we do not
#' need to mess around trying to simulate one.
demo_tree <- read.tree(text = "(((((A:2.0,B:1.0):2.0,C:1.0):2.0,D:1.0):2.0,E:1.0):2.0,F:1.0);")
png(filename="tempest-demo-tree.png")
plot(demo_tree)
dev.off()
num_tips <- length(demo_tree$tip.label)
tip_times <- head(
  node.depth.edgelength(demo_tree),
  num_tips
)
#' Relabel the tips to preserve the sampling date in a
#' suitable format for subsequent use.
demo_tree$tip.label <- sprintf("%s_%f", demo_tree$tip.label, tip_times)
demo_seqs <- demo_tree |> simSeq(l = 1e4, rate = 1e-3) |> as.character()
#' Compute the distances manually because the functions
#' provided seem to be a bit buggy.
demo_distances <- matrix(0, num_tips, num_tips)
rownames(demo_distances) <- demo_tree$tip.label
colnames(demo_distances) <- demo_tree$tip.label
for (ix in 1:(num_tips - 1)) {
  for (jx in ix:num_tips) {
    jac <- sum(demo_seqs[ix,] != demo_seqs[jx,])
    demo_distances[ix,jx] <- jac
    demo_distances[jx,ix] <- jac
  }
}
#' Estimate the tree using the distances and write this to a
#' nexus file so it can be read into TempEst.
demo_nj <- nj(demo_distances)
write.nexus(demo_nj, file = "tempest-demo-data.nexus")

Rentrez in R (downloading sequence data)

By default, rentrez returns fetched data as an XML object from the XML package, rather than as an xml_node from the xml2 package, which is much easier to work with, so I'll add a helper here to move between them.

#' Fetch Entrez data as XML object
#'
#' @param ... Arguments to pass to =rentrez::entrez_fetch=
#'
#' @return An =xml_node= object
#'
#' @examples
#' entrez_obj <- fetch_as_xml(db = "nucleotide", id = accession_numbers)
#'
fetch_as_xml <- function(...) {
  temp_file <- tempfile()
  on.exit(file.remove(temp_file))

  entrez_obj <- rentrez::entrez_fetch(
    ...,
    rettype = "xml",
    parsed = TRUE
  )
  XML::saveXML(entrez_obj, file = temp_file)
  return(xml2::read_xml(temp_file))
}

APE in R

FULL DOCUMENTATION

Use rlineage to simulate the complete tree. To get the sub-tree consisting of just those that made it to the present use drop.fossils. If you want to filter for particular nodes, there is keep.tip and drop.tip.

There is a plot.phylo method and ltt.plot which can be used to get quick visualisations of a phylogeny, but for publication quality figures you probably want to use Ggtree.

The phylo class

The phylo class is pretty much the defacto standard for representing phylogenies in R. A phylogeny with \(n\) tips is represented as a phylo which has the following attributes:

  • Nnode is the number of internal nodes in the tree,
  • tip.label is a string indicating the name of each tip,
  • edge is a matrix with two columns where a row, \([i,j]\), indicates an edge from node \(i\) to node \(j\),
  • edge.length is an array where element is the length of an edge in the corresponding row of the edge matrix.

Note that the indexing of the nodes of the tree is \(1\) to \(n\) for the tips and the internal nodes get allocated the integers \(n+1,n+2,\dots\). You do not want to reconstruct the tree from this information yourself though so ape offers convenience functions such as node.depth.edgelength which gives you the depths from the TMRCA down to each node.

Simulation

I wrote a little tool that uses ape to simulate from the birth-death-sampling-occurrence process. It is available in this gist.

To simulate from the process with time varying rates, here is a little example. Note that this uses forwards time starting from the origin as time zero. Figure X shows the output of this simulation.

library(ape)
library(purrr)
set.seed(1)

birth_rate <- function(ts) {
  ifelse(ts < 0.5, 5.0, 1.0)
}

death_rate <- function(ts) {
  rep(2.0, length(ts))
}

max_ltt <- function(x) {
  max(ltt.plot.coords(x)[,2])
}

ps <- purrr::map(1:100, \(x) rlineage(birth_rate, death_rate, Tmax = 1))

y_upper_lim <- purrr::map(ps, max_ltt) |> as_vector() |> max()
png('ape-rlineage-demo.png',
    width = 1.618 * 300,
    height = 300,
    units = "px")
ltt.plot(ps[[1]], ylim = c(1, y_upper_lim))
purrr::walk(tail(ps, -1), ltt.lines)
abline(v=-0.5, col='red')
dev.off()

ape-rlineage-demo.png

Phangorn in R

The Trees vignette with the phangorn package gives a very clear explanation of how to estimate a maximum likelihood tree to describe a time stamped MSA. Here is a summary of the steps involved with estimating a tree for H3N2 sequences that are bundled with the package.

First we load in the time stamped sequences that are bundled with the package and construct a named vector with the tip dates. Note that older versions of the package do not come with this dataset so you may need to update to a more recent version, otherwise you can get it from a recent copy of the source code on GitHub.

library(ape)
library(phangorn)

fdir <- system.file("extdata/trees", package = "phangorn")
tmp <- read.csv(file.path(fdir, "H3N2_NA_20.csv"))
H3N2 <- read.phyDat(file.path(fdir, "H3N2_NA_20.fasta"), format = "fasta")
dates <- setNames(tmp$numdate_given, tmp$name)

Here we use the black box function to estimate the tree with a \(\text{GTR} + \Gamma(4)\) model and dated tips. There are far more details in the the vignette about how to do this without just calling a black box, but this is a quick way to get something to start subsequent analyses with.

fit_td <- pml_bb(
  H3N2,
  model = "GTR+G(4)",
  method = "tipdated",
  tip.dates = dates,
  rearrangment = "NNI",
  control = pml.control(trace = 0)
)

Then we can make a simple plot showing the estimated tree.

tree_td <- fit_td$tree
root_time <- max(dates) - max(node.depth.edgelength(tree_td))

png("phangorn-demo.png",
  width = 1.618 * 300,
  height = 300,
  units = "px"
)
plot(tree_td, show.tip.label = FALSE)
title(main = "H3N2", adj = 1)
title(sub = "Timed tree estimated using phangorn", line = -13, adj = 1)
axisPhylo(root.time = root_time, backward = FALSE)
dev.off()

phangorn-demo.png

Ggtree and friends

To parse the various tree formats there is the treeio package which has functions such as read.beast and read.nhx. This will produce a value of the treedata class which is defined in the tidytree package. The treedata is an S4 class, so it has slots. So to get the phylo object storing the topology you would use mytreedata@phylo and to get associated data there is a data slot.

To actually visualise these objects there is the ggtree package which provides geometries in the style of Ggplot2. Note that to install these packages you will likely want to use Bioconductor.

BiocManager::install("ggtree")

Author: Alexander E. Zarebski

Created: 2024-11-01 Fri 10:54

Validate