For my Ph.D. thesis I studied parallel algorithms for sequence comparison, specifically, for computing sequence alignments like this:
In biology, variants of such alignments are used for comparing DNA or protein sequences.
The following theoretical aspects of parallel string alignment are discussed:
Also, I developed an efficient method for computing local string alignments (“The Seaweed Code”), which is available at code.google.com/p/seaweeds (I will upload an updated version soon)
My Ph.D. supervisor was Dr. Alexander Tiskin.
Here is the full thesis in PDF format.
Krusche, P. & Tiskin, A., 2010. New algorithms for efficient parallel string comparison. In the 22nd ACM symposium. New York, New York, USA: ACM Press, pp. 209–216. Available at: http://portal.acm.org/citation.cfm?doid=1810479.1810521.
Krusche, P. & Tiskin, A., 2010. Parallel longest increasing subsequences in scalable time and memory. Parallel Processing and Applied Mathematics, 6067, pp.176–185. Available at: http://link.springer.com/chapter/10.1007%2F978-3-642-14390-8_19.
Krusche, P. & Tiskin, A., 2010. Computing alignment plots efficiently. In B. Chapman et al., eds. Parallel Computing: From Multicores and GPU’s to Petascale. pp. 158–165. Available at: http://www.booksonline.iospress.nl/Content/View.aspx?piid=16553.
Krusche, P. & Tiskin, A., 2009. String comparison by transposition networks. Texts in Algorithmics, Vol. 11 of London Algorithmics 2008: Theory and Practice. Available at: http://www.collegepublications.co.uk/algorithmics/?00011 and http://arxiv.org/abs/0903.3579.
Slides for the talk at WPCCS’08.
Krusche, P. & Tiskin, A., 2007. Efficient parallel string comparison. In Proceedings of ParCo 2007, vol. 38 of NIC Series. John von Neumann Institute for Computing, pp. 193–200. Available at: http://www.fz-juelich.de/nic-series/volume38/volume38.html.
Slides for the talk at ParCo 2007 in Jülich, Germany, 4th of September 2007
Slides for the talk at the AFM seminar at Warwick, 23rd of April 2007
String comparison is a fundamental component in many of today’s applications of computing, including bioinformatics, signal processing, databases, internet search and many others. A specific type of practical string comparison problem is computing the longest common subsequence (LCS) of two input strings. The LCS problem is equivalent to computing edit distances and strongly connected to computing string alignments, which are of great importance for applications in computational biology. In this thesis, we show new approaches to using parallel computation in string comparison, which allow us to understand algorithms for computing the LCS on the word-RAM, the PRAM and the BSP models in a unified fashion. This approach is based on recent results on algorithms for semi-local string comparison. Semi-local string comparison is a straightforward extension of the standard LCS problem, in which we ask to compute the LCS for one string and all substrings of another string. We use a revised approach to analyzing BSP algorithms which includes the analysis of communication and I/O cost, as well as of the local memory requirements. We define asymptotic scalability in memory and communication. Scalable communication captures the input/output cost of partitioning a problem into subproblems, and achieving scalable memory allows to obtain small subproblems which can fit into processor caches. In this thesis, we show how to achieve scalable memory and communication for computing longest common subsequences, as well as for computing longest increasing subsequences. Furthermore, we present new algorithms for semi-local string comparison that are parameterized by the similarity of the input strings, and we discuss aspects of their practical implementation. Alongside with theoretical results, this thesis also comprises an algorithm engineering project, which has the goal of allowing practical applications to make use of the techniques of semi-local string comparison. For a particular problem of local string comparison in evolutionary biology, our methods have improved upon the fastest existing methods. We have obtained a speedup by a factor of more than 14 over the fastest existing implementation, running on a single processor, and achieved the potential to scale to hundreds of processors. This has greatly increased the feasible size of the input sequences that can be compared using our method, potentially allowing loss-free local comparison of entire genomes.