Reports

Speeding up dynamic programming with applications to molecular biology

Galil, Zvi; Giancarlo, Raffaele

Consider the problem of computing E[j] = mit:! {D[k] + w(k, j)}, j = 1, ... , n, O~k~]-l where w is a given weight function, D[D] is given and for every k = 1, ... , n, D[k] is easily computable from E[k]. This problem appears as a subproblem in dynamic programming solutions to various problems. Obviously, it can be solved in time O( n2 ), and for a general weight function no better algorithm is possible. We consider two dual cases that arise in applications: In the concave case, the weight function satisfies the quadrangle inequality: w(k,j) + w(l,j') ~ w(l,j) +w(k,j'), for all k ~ 1 ~ j ~ j'. In the convex case, the weight function satisfies the inverse quadrangle inequality. In both cases we show how to use the assumed property of w to derive an O( n log n) algorithm. Even better, linear-time algorithms are obtained if w satisfies the following additional closest zero property: for every two integers 1 and k, 1 < k, and real number a, the smallest zero of f(x) = w(l,x) - w(k,x) - a which is larger than 1 can be found in constant time. Surprisingly, the two algorithms are also dual in the following sense: Both work in stages. In the j-th stage they compute Elj]. They maintain a set of candidates which satisfies the property that Elj] depends only on D[k] + w(k, j) for k's in the set. Moreover, each algorithm discards candidates from the set, and discarded candidates never rejoin the set. To be able to maintain such a set of candidates efficiently one uses the following "dual" data structures: a queue in the concave case and a stack in the convex case. The two algorithms speed up several dynamic programming routines that solve as a subproblem the problem above. The speed-up is from O(n3 ) to O(n2Iogn) or O(n2 ). Applications include algorithms for comparing DNA sequences, algorithms for determining the secondary structure of RNA, and algorithms used in speech recognition and geology. One typical problem is the following: Given the cost of substituting any pair of symbols and a convex cost function g for gaps (where g(r) is the cost of a gap of size r), compute the modified edit distance between the two given sequences.

Files

More About This Work

Academic Units
Computer Science
Publisher
Department of Computer Science, Columbia University
Series
Columbia University Computer Science Technical Reports, CUCS-300-87
Published Here
December 2, 2011