Discuss suffix trees and their application
Useful Links:
http://www.allisons.org/ll/AlgDS/Tree/Suffix/
http://en.wikipedia.org/wiki/Suffix_tree
Suffix Trees can be used to solve a large number of string problems that occur in text-editing, free-text search, computational biology, and other application areas. Some examples are given below.
String Search
Searching for a substring, pat[1..m], in txt[1..n], can be solved in O(m) time (after the suffix tree for txt has been built in O(n) time).
Longest Repeated Substring
Add a special ``end of string'' character, e.g. `$', to txt[1..n] and build a suffix tree; the longest repeated substring of txt[1..n] is indicated by the deepest fork node in the suffix tree, where depth is measured by the number of characters traversed from the root,i.e., `issi' in the case of `mississippi'. The longest repeated substring can be found in O(n) time using a suffix tree.
Longest Common Substring
The longest common substring of two strings, txt1 and txt2, can be found by building a generalized suffix tree for txt1 and txt2: Each node is marked to indicate if it represents a suffix of txt1 or txt2 or both. The deepest node marked for both txt1 and txt2 represents the longest common substring.
Equivalently, one can build a (basic) suffix tree for the string txt1$txt2#, where `$' is a special terminator for txt1 and `#' is a special terminator for txt2. The longest common substring is indicated by the deepest fork node that has both `...$...' and `...#...' (no $) beneath it.
Note that the `longest common substring problem' is different to the `longest common subsequence problem' which is closely related to the `edit-distance problem': An instance of a subsequence can have gaps where it appears in txt1 and in txt2, but an instance of a substringcannot have gaps.
Palindromes
A palindrome is a string, P, such that P=reverse(P). e.g. `abba'=reverse(`abba'). e.g. `ississi' is the longest palindrome in `mississippi'.
The longest palindrome of txt[1..n] can be found in O(n) time, e.g. by building the suffix tree for txt$reverse(txt)# or by building the generalized suffix tree for txt and reverse(txt).
Useful Links:
http://www.allisons.org/ll/AlgDS/Tree/Suffix/
http://en.wikipedia.org/wiki/Suffix_tree
- Suffix Array Introduction
- Suffix Array nLogn Algorithm
- kasai’s Algorithm for Construction of LCP array from Suffix Array
- Suffix Tree Introduction
- Ukkonen’s Suffix Tree Construction – Part 1
- Ukkonen’s Suffix Tree Construction – Part 2
- Ukkonen’s Suffix Tree Construction – Part 3
- Ukkonen’s Suffix Tree Construction – Part 4,
- Ukkonen’s Suffix Tree Construction – Part 5
- Ukkonen’s Suffix Tree Construction – Part 6
- Generalized Suffix Tree
- Build Linear Time Suffix Array using Suffix Tree
- Substring Check
- Searching All Patterns
- Longest Repeated Substring,
- Longest Common Substring, Longest Palindromic Substring
Suffix Trees can be used to solve a large number of string problems that occur in text-editing, free-text search, computational biology, and other application areas. Some examples are given below.
String Search
Searching for a substring, pat[1..m], in txt[1..n], can be solved in O(m) time (after the suffix tree for txt has been built in O(n) time).
Longest Repeated Substring
Add a special ``end of string'' character, e.g. `$', to txt[1..n] and build a suffix tree; the longest repeated substring of txt[1..n] is indicated by the deepest fork node in the suffix tree, where depth is measured by the number of characters traversed from the root,i.e., `issi' in the case of `mississippi'. The longest repeated substring can be found in O(n) time using a suffix tree.
Longest Common Substring
The longest common substring of two strings, txt1 and txt2, can be found by building a generalized suffix tree for txt1 and txt2: Each node is marked to indicate if it represents a suffix of txt1 or txt2 or both. The deepest node marked for both txt1 and txt2 represents the longest common substring.
Equivalently, one can build a (basic) suffix tree for the string txt1$txt2#, where `$' is a special terminator for txt1 and `#' is a special terminator for txt2. The longest common substring is indicated by the deepest fork node that has both `...$...' and `...#...' (no $) beneath it.
Note that the `longest common substring problem' is different to the `longest common subsequence problem' which is closely related to the `edit-distance problem': An instance of a subsequence can have gaps where it appears in txt1 and in txt2, but an instance of a substringcannot have gaps.
Palindromes
A palindrome is a string, P, such that P=reverse(P). e.g. `abba'=reverse(`abba'). e.g. `ississi' is the longest palindrome in `mississippi'.
The longest palindrome of txt[1..n] can be found in O(n) time, e.g. by building the suffix tree for txt$reverse(txt)# or by building the generalized suffix tree for txt and reverse(txt).
No comments:
Post a Comment