Code Zone: Suffix Tree

Saturday, February 18, 2012

Suffix Tree

Discuss suffix trees and their application

Useful Links:
http://www.allisons.org/ll/AlgDS/Tree/Suffix/
http://en.wikipedia.org/wiki/Suffix_tree

Suffix Tree Applications

Suffix Trees can be used to solve a large number of string problems that occur in text-editing, free-text search, computational biology, and other application areas. Some examples are given below.

String Search
Searching for a substring, pat[1..m], in txt[1..n], can be solved in O(m) time (after the suffix tree for txt has been built in O(n) time).

Longest Repeated Substring
Add a special ``end of string'' character, e.g. `$', to txt[1..n] and build a suffix tree; the longest repeated substring of txt[1..n] is indicated by the deepest fork node in the suffix tree, where depth is measured by the number of characters traversed from the root,i.e., `issi' in the case of `mississippi'. The longest repeated substring can be found in O(n) time using a suffix tree.

Longest Common Substring
The longest common substring of two strings, txt1 and txt2, can be found by building a generalized suffix tree for txt1 and txt2: Each node is marked to indicate if it represents a suffix of txt1 or txt2 or both. The deepest node marked for both txt1 and txt2 represents the longest common substring.
Equivalently, one can build a (basic) suffix tree for the string txt1$txt2#, where `$' is a special terminator for txt1 and `#' is a special terminator for txt2. The longest common substring is indicated by the deepest fork node that has both `...$...' and `...#...' (no $) beneath it.

Note that the `longest common substring problem' is different to the `longest common subsequence problem' which is closely related to the `edit-distance problem': An instance of a subsequence can have gaps where it appears in txt1 and in txt2, but an instance of a substringcannot have gaps.

Palindromes
A palindrome is a string, P, such that P=reverse(P). e.g. `abba'=reverse(`abba'). e.g. `ississi' is the longest palindrome in `mississippi'.
The longest palindrome of txt[1..n] can be found in O(n) time, e.g. by building the suffix tree for txt$reverse(txt)# or by building the generalized suffix tree for txt and reverse(txt).

Code Zone

Labels

Saturday, February 18, 2012

Suffix Tree

No comments:

Post a Comment