AANN 06/06/2024
Table of Contents
Speed reading
In this example we will use the timeit
library to see how long it
takes to get data that has been stored in various ways. Since I'm
working with phylogenies, a natural way to store them is in plaintext
Newick format. Unfortunately, reading trees from disk is painfully
slow. Here are my attempts to try and speed this up.
Let's consider three methods:
- reading a tree in from a Newick file with
Bio.Phylo
, - reading a pickled version of the tree (either individually or in a
batch) in with
pickle
- and finally reading a pickle file out of a HDF5 file and unpickling that (either individually or in a batch).
A stock standard pickle file is the way to go! Loading a pickled tree out of a bytestring stored in HDF5 (which I thought was going to be faster) is not any faster than regular pickle file (although it does have the convenience of only needing a single file). You can perform this experiment yourself using script, but here are the results anyway.
$ python main.py The example tree has 1024 terminals We average times over 300 evaluations Reading with pickle (individually) is 1.96 times faster than reading newick Reading with pickle (batched) is 4.27 times faster than reading newick Reading from HDF5 (individually) is 1.78 times faster than reading newick Reading from HDF5 (batched) is 3.66 times faster than reading newick
Details
Here are the details of the script. We start by loading in packages
(shocker!) and specifying NUM_REPS
which is the number of times to
repeat the experiment to get a meaningful average of the time needed.
from io import StringIO from Bio import Phylo import h5py import numpy as np import pickle import timeit NUM_REPS = 300
Next we, set up a way to recursively generate a large tree with the
_tree_string
function which produces a large tree in Newick format.
The result is stored in three ways: written to a Newick file as plain
text, parsed into a tree and pickled, and parsed and picked and stored
in an HDF5 file.
def _tree_string(n): if n == 0: return "(A:0.1, A:0.1):0.1" else: tmp = _tree_string(n-1) return f"({tmp}, {tmp}):0.1" def tree_string(n): return _tree_string(n) + ";" newick_tree = tree_string(9) filename_newick = "foobar.newick" with open(filename_newick, "w+") as file: file.write(newick_tree) filename_pickle = "foobar.pickle" newick_tree_obj = Phylo.read(StringIO(newick_tree), "newick") with open(filename_pickle, "wb+") as file: pickle.dump(newick_tree_obj, file) filename_hdf5 = "foobar.hdf5" with h5py.File(filename_hdf5, "w") as file: pickle_blob = np.void(pickle.dumps(newick_tree_obj)) for i in range(NUM_REPS): file.create_dataset(f"tree_{i}", data=pickle_blob)
The three functions here each read the tree from their respective
sources. The functions with a 1
in the function name read the file
once per invocation and those with a 2
read it 20 times. This gives
us a way to understand how much of the time is spent actually doing
the function call and how much goes to the actual un-pickling of the
tree.
def test_a(): tree = Phylo.read(filename_newick, "newick") return len(tree.get_terminals()) def test_b1(): with open(filename_pickle, "rb") as file: tree = pickle.load(file) return len(tree.get_terminals()) def test_b2(): for i in range(int(NUM_REPS/20)): with open(filename_pickle, "rb") as file: tree = pickle.load(file) return len(tree.get_terminals()) hdf5_conn = h5py.File(filename_hdf5, "r") def test_c1(): tree = pickle.loads(hdf5_conn[f"tree_0"][()]) return len(tree.get_terminals()) def test_c2(): for i in range(int(NUM_REPS/20)): tree = pickle.loads(hdf5_conn[f"tree_{i}"][()]) return len(tree.get_terminals()) assert test_a() == test_b1() assert test_a() == test_b2() assert test_a() == test_c1() assert test_a() == test_c2()
Finally to actually carry out the timing and report the results we use
timeit
and print the results.
time_a = timeit.timeit(lambda: test_a(), number=NUM_REPS) time_b1 = timeit.timeit(lambda: test_b1(), number=NUM_REPS) time_b2 = timeit.timeit(lambda: test_b2(), number=20) time_c1 = timeit.timeit(lambda: test_c1(), number=NUM_REPS) time_c2 = timeit.timeit(lambda: test_c2(), number=20) hdf5_conn.close() print(f"The example tree has {test_a()} terminals") print(f"We average times over {NUM_REPS} evaluations") print(f"Reading with pickle (individually) is {(time_a/time_b1):.2f} times faster than reading newick") print(f"Reading with pickle (batched) is {(time_a/time_b2):.2f} times faster than reading newick") print(f"Reading from HDF5 (individually) is {(time_a/time_c1):.2f} times faster than reading newick") print(f"Reading from HDF5 (batched) is {(time_a/time_c2):.2f} times faster than reading newick")