Biosequence Algorithms 2005 Home Exam Questions

BSA 2005: Kotikokeen kysymykset

Return solutions to four assignments by Wednesday, March 23 at noon (See general instructions)
Palauta ratkaisut neljään tehtävään keskiviikkoon 23.3. klo 12:00 mennessä (Ks. yleisiä ohjeita)
  1. (Gusfield, Ex. 1.11) Let T be a text string of length m and let S be a multiset of n characters. The problem is to find all substrings in T of length n that are formed by the characters of S. For example, let S = {a, a, b, c} and T = abahgcabah. Then caba is of substring of T formed from the characters of S. Give a solution to this problem that runs in O(m) time. The method should also be able to state, for each position i, the length of the longest substring in T starting at i that can be formed from S. Explain the correctness and the complexity of the method.

  2. Implement the Boyer-Moore method with both the basic and the extended version of the bad character rule in some programming language and environment which allow you to measure execution times. (Good suffix shifts are not needed. When an occurrence has been found, shift the pattern to the right simply by one.) Experiment with your implementation. Report your observations on how the basic vs. extended bad character rule effects the search time. (You could use as your test material some bio-sequences available in public archives; a few somewhat randomly selected ones can be found in directory data. You can make your target sequences long enough to get some observable times by concatenating multiple copies together.)

  3. (Gusfield, Ex. 6.3) The relationship between the suffix tree for a string S and for the reverse string Sr is not obvious. However, there is a significant relationship between the two trees. Find it, state it, and prove it. (Hint: Suffix links help.)

  4. (Gusfield, Ex. 7.3) We can define the suffix tree in terms of an Aho-Corasick (AC) automaton, which is a compact representation of the set of given patterns. For a single string S we can think of the n suffixes of S as a set of patterns. Then one can build a suffix tree for S by first constructing the AC tree for those n patterns, and then compressing, into a single edge, any maximal path through nodes with only a single child. If we take this approach, what is the relationship between the failure links used in the AC automaton and the suffix links used in Ukkonen's algorithm? Why aren't suffix trees built this way?

  5. Study and explain the main ideas and the complexity results of Hirschberg's algorithm for computing optimal alignments in linear space. (See Gusfield, Sect. 12.1, or "A linear space algorithm for computing maximal common subsequences" by D. S. Hirschberg, Communications of the ACM, Volume 18 , Issue 6 (June 1975), 341 - 343. (available from IP addresses of our university) Notice that the original article presents an algorithm for computing longest common subsequences, which is a special case of computing optimal alignments.

  6. Discuss the uses of multiple string alignment in analyzing and representing protein families. (See for example Sect. 14 in Gusfield.)