Name
  • String
  • Substring
  • Search
Edit

The Forward Dawg Matching algorithm computes the longest factor of the pattern ending at each position in the text. This is make possible by the use of the smallest suffix automaton (also called DAWG for Directed Acyclic Word Graph) of the pattern. The smallest suffix automaton of a word w is a Deterministic Finite Automaton S(w) = (Qq0TE). The language accepted by S(w) isL(S(w))={u in Sigma* :  exists v in Sigma* such that w=vu}.

The preprocessing phase of the Forward Dawg Matching algorithm consists in computing the smallest suffix automaton for the pattern x. It is linear in time and space in the length of the pattern.

During the searching phase the Forward Dawg Matching algorithm parses the characters of the text from left to right with the automaton S(x) starting with state q0. For each state q in S(x) the longest path from q0 to p is denoted by length(q). This structure extensively uses the notion of suffix links. For each state p the suffix link of p is denoted by S[p]. For a state p, let Path(p)=(p0,p1, ... ,pell) be the suffix path of p such that p0=p, for 1 leq i leq ellpi=S[pi-1] and pell=q0. For each text character y[j] sequentially, let p be the current state, then the Forward Dawg Matching algorithm takes a transition defined for y[j] for the first state of Path(p) for which such a transition is defined. The current state p is updated with the target state of this transition or with the initial state q0 if no transition exists labelled with y[j] from a state of Path(p).

An occurrence of x is found when length(p)=m.

The Forward Dawg Matching algorithm performs exactly n text character inspections.


Main features:

  • uses the suffix automaton of x;
  • O(n) worst case time complexity;
  • performs exactly n text character inspections.

Example:

Preprocessing phase

Forward Dawg Automaton
Suffix automaton used by Forward Dawg Matching Search algorithm.

Searching phase

First attempt
GCATCGCAGAGAGTATACAGTACG
1230 
Second attempt
GCATCGCAGAGAGTATACAGTACG
 20 
Third attempt
GCATCGCAGAGAGTATACAGTACG
 123457910 
 
GCATCGCAGAGAGTATACAGTACG
 -------1 
Fourth attempt
GCATCGCAGAGAGTATACAGTACG
 60 
Fifth attempt
GCATCGCAGAGAGTATACAGTACG
 62340 
Sixth attempt
GCATCGCAGAGAGTATACAGTACG
 621

The Forward Dawg Matching algorithm performs 24 character comparisons on the example.

C