Biosequence Algorithms 2005 Home Exam Questions
BSA 2005: Kotikokeen kysymykset
Return solutions to four assignments by Wednesday, March 23 at noon
(See general instructions)
Palauta ratkaisut neljään
tehtävään keskiviikkoon 23.3. klo 12:00 mennessä
(Ks. yleisiä ohjeita)
-
(Gusfield, Ex. 1.11) Let T be a text string of length m
and let S be a multiset of n
characters. The problem is to find all substrings in
T of length n that are
formed
by the characters of S. For example, let S = {a, a, b, c} and
T = abahgcabah.
Then caba is of substring of
T formed from the characters of S.
Give a solution to this problem that runs in O(m) time. The method should
also be able to state, for each position i, the length of the longest
substring in T
starting at i that can be formed from S.
Explain the correctness and the complexity of the method.
-
Implement the Boyer-Moore method with both the basic and the
extended version of the bad character rule in some programming
language and
environment which allow you to measure execution times.
(Good suffix shifts are not needed. When an occurrence has been found, shift
the pattern to the right simply by one.)
Experiment with your implementation. Report your observations on how the
basic vs. extended bad character rule effects the search time.
(You could use as your test material some bio-sequences available in public
archives; a few
somewhat randomly selected ones can be found in directory data.
You can make your target sequences long enough to get some observable times
by concatenating multiple copies together.)
-
(Gusfield, Ex. 6.3)
The relationship between the suffix tree for a string S and
for the reverse string Sr is not obvious.
However, there is a significant relationship between the two trees.
Find it, state it, and prove it. (Hint: Suffix links help.)
-
(Gusfield, Ex. 7.3) We can define the suffix tree in terms of an
Aho-Corasick
(AC) automaton, which is a compact representation of the set of given
patterns.
For a single string S we can think of the n suffixes of
S
as a set of patterns. Then one can build a suffix tree for S by first
constructing the AC tree for those n patterns, and then compressing,
into a single edge, any maximal path through nodes with only a single child.
If we take this approach, what is the relationship between the failure links
used in the AC automaton and the suffix links used in Ukkonen's algorithm?
Why aren't suffix trees built this way?
-
Study and explain the main ideas and the complexity results of Hirschberg's
algorithm for computing optimal alignments in linear space.
(See Gusfield, Sect. 12.1, or
"A linear space algorithm for computing maximal common subsequences"
by D. S. Hirschberg,
Communications of the ACM,
Volume 18 , Issue 6 (June 1975), 341 - 343.
(available from IP addresses of our university) Notice that
the original article presents an algorithm for computing longest
common subsequences, which is a special case of computing
optimal alignments.
-
Discuss the uses of multiple string alignment in analyzing and representing
protein families.
(See for example Sect. 14 in Gusfield.)