AANN 06/06/2024

Speed reading
Details

Speed reading

In this example we will use the timeit library to see how long it takes to get data that has been stored in various ways. Since I'm working with phylogenies, a natural way to store them is in plaintext Newick format. Unfortunately, reading trees from disk is painfully slow. Here are my attempts to try and speed this up.

Let's consider three methods:

reading a tree in from a Newick file with Bio.Phylo,
reading a pickled version of the tree (either individually or in a batch) in with pickle
and finally reading a pickle file out of a HDF5 file and unpickling that (either individually or in a batch).

A stock standard pickle file is the way to go! Loading a pickled tree out of a bytestring stored in HDF5 (which I thought was going to be faster) is not any faster than regular pickle file (although it does have the convenience of only needing a single file). You can perform this experiment yourself using script, but here are the results anyway.

$ python main.py
The example tree has 1024 terminals
We average times over 300 evaluations
Reading with pickle (individually) is 1.96 times faster than reading newick
Reading with pickle (batched) is 4.27 times faster than reading newick
Reading from HDF5 (individually) is 1.78 times faster than reading newick
Reading from HDF5 (batched) is 3.66 times faster than reading newick

Details

Here are the details of the script. We start by loading in packages (shocker!) and specifying NUM_REPS which is the number of times to repeat the experiment to get a meaningful average of the time needed.

from io import StringIO
from Bio import Phylo
import h5py
import numpy as np
import pickle
import timeit

NUM_REPS = 300

Next we, set up a way to recursively generate a large tree with the _tree_string function which produces a large tree in Newick format. The result is stored in three ways: written to a Newick file as plain text, parsed into a tree and pickled, and parsed and picked and stored in an HDF5 file.

def _tree_string(n):
    if n == 0:
        return "(A:0.1, A:0.1):0.1"
    else:
        tmp = _tree_string(n-1)
        return f"({tmp}, {tmp}):0.1"

def tree_string(n):
    return _tree_string(n) + ";"

newick_tree = tree_string(9)

filename_newick = "foobar.newick"
with open(filename_newick, "w+") as file:
    file.write(newick_tree)

filename_pickle = "foobar.pickle"
newick_tree_obj = Phylo.read(StringIO(newick_tree), "newick")
with open(filename_pickle, "wb+") as file:
    pickle.dump(newick_tree_obj, file)

filename_hdf5 = "foobar.hdf5"
with h5py.File(filename_hdf5, "w") as file:
    pickle_blob = np.void(pickle.dumps(newick_tree_obj))
    for i in range(NUM_REPS):
        file.create_dataset(f"tree_{i}", data=pickle_blob)

The three functions here each read the tree from their respective sources. The functions with a 1 in the function name read the file once per invocation and those with a 2 read it 20 times. This gives us a way to understand how much of the time is spent actually doing the function call and how much goes to the actual un-pickling of the tree.

def test_a():
    tree = Phylo.read(filename_newick, "newick")
    return len(tree.get_terminals())

def test_b1():
    with open(filename_pickle, "rb") as file:
        tree = pickle.load(file)
        return len(tree.get_terminals())

def test_b2():
    for i in range(int(NUM_REPS/20)):
        with open(filename_pickle, "rb") as file:
            tree = pickle.load(file)
    return len(tree.get_terminals())

hdf5_conn = h5py.File(filename_hdf5, "r")
def test_c1():
    tree = pickle.loads(hdf5_conn[f"tree_0"][()])
    return len(tree.get_terminals())

def test_c2():
    for i in range(int(NUM_REPS/20)):
        tree = pickle.loads(hdf5_conn[f"tree_{i}"][()])
    return len(tree.get_terminals())

assert test_a() == test_b1()
assert test_a() == test_b2()
assert test_a() == test_c1()
assert test_a() == test_c2()

Finally to actually carry out the timing and report the results we use timeit and print the results.

time_a = timeit.timeit(lambda: test_a(), number=NUM_REPS)
time_b1 = timeit.timeit(lambda: test_b1(), number=NUM_REPS)
time_b2 = timeit.timeit(lambda: test_b2(), number=20)
time_c1 = timeit.timeit(lambda: test_c1(), number=NUM_REPS)
time_c2 = timeit.timeit(lambda: test_c2(), number=20)
hdf5_conn.close()

print(f"The example tree has {test_a()} terminals")
print(f"We average times over {NUM_REPS} evaluations")
print(f"Reading with pickle (individually) is {(time_a/time_b1):.2f} times faster than reading newick")
print(f"Reading with pickle (batched) is {(time_a/time_b2):.2f} times faster than reading newick")
print(f"Reading from HDF5 (individually) is {(time_a/time_c1):.2f} times faster than reading newick")
print(f"Reading from HDF5 (batched) is {(time_a/time_c2):.2f} times faster than reading newick")

AANN 06/06/2024

Table of Contents

Speed reading

Details