The DNA double helix consists of a long strand of four bases:  adenine (abbreviated A), cytosine (C), guanine (G) and thymine
 (T).  Thus, it can be represented as a long string containing the 
characters A, C, G, and T. The field of bio-informatics involves storing
 and searching of DNA sequence data. What is a good data structure to 
facilitate storage and string match/search operation on this kind of 
data? Write a class that stores the DNA sequence and implement a method 
which takes another DNA sub-sequence and returns the position of this 
sub-sequence in the first DNA sequence.
There are several linear time string matching algorithms. See http://en.wikipedia.org/wiki/String_searching_algorithm for a list of such algorithms.
Often a data structure called Suffix Tree or PAT tree is used in DNA sequencing applications in the field of Bio-informatics. A Suffix tree data structure facilitates string match/search in O(m) complexity, where m is the length of the sub-string. It takes an initial O(n) time required to build the suffix tree.
One concern with suffix tree is the high amount of space/memory needed. But, with a small alphabet ( only 4 characters in the DNA search problem ) , its not too bad. More over, there are some advanced techniques that can be used to further reduce the storage requirements.
A thorough treatment of suffix trees can be found here.
There are several linear time string matching algorithms. See http://en.wikipedia.org/wiki/String_searching_algorithm for a list of such algorithms.
Often a data structure called Suffix Tree or PAT tree is used in DNA sequencing applications in the field of Bio-informatics. A Suffix tree data structure facilitates string match/search in O(m) complexity, where m is the length of the sub-string. It takes an initial O(n) time required to build the suffix tree.
One concern with suffix tree is the high amount of space/memory needed. But, with a small alphabet ( only 4 characters in the DNA search problem ) , its not too bad. More over, there are some advanced techniques that can be used to further reduce the storage requirements.
A thorough treatment of suffix trees can be found here.
 
No comments:
Post a Comment