The DNA double helix consists of a long strand of four bases: adenine (abbreviated A), cytosine (C), guanine (G) and thymine
(T). Thus, it can be represented as a long string containing the
characters A, C, G, and T. The field of bio-informatics involves storing
and searching of DNA sequence data. What is a good data structure to
facilitate storage and string match/search operation on this kind of
data? Write a class that stores the DNA sequence and implement a method
which takes another DNA sub-sequence and returns the position of this
sub-sequence in the first DNA sequence.
There are several linear time string matching algorithms. See http://en.wikipedia.org/wiki/String_searching_algorithm for a list of such algorithms.
Often a data structure called Suffix Tree or PAT tree is used in DNA sequencing applications in the field of Bio-informatics. A Suffix tree data structure facilitates string match/search in O(m) complexity, where m is the length of the sub-string. It takes an initial O(n) time required to build the suffix tree.
One concern with suffix tree is the high amount of space/memory needed. But, with a small alphabet ( only 4 characters in the DNA search problem ) , its not too bad. More over, there are some advanced techniques that can be used to further reduce the storage requirements.
A thorough treatment of suffix trees can be found here.
There are several linear time string matching algorithms. See http://en.wikipedia.org/wiki/String_searching_algorithm for a list of such algorithms.
Often a data structure called Suffix Tree or PAT tree is used in DNA sequencing applications in the field of Bio-informatics. A Suffix tree data structure facilitates string match/search in O(m) complexity, where m is the length of the sub-string. It takes an initial O(n) time required to build the suffix tree.
One concern with suffix tree is the high amount of space/memory needed. But, with a small alphabet ( only 4 characters in the DNA search problem ) , its not too bad. More over, there are some advanced techniques that can be used to further reduce the storage requirements.
A thorough treatment of suffix trees can be found here.
No comments:
Post a Comment