Gestalt pattern matching

Gestalt pattern matching, also called Ratcliff/Obershelp pattern recognition, is a string-matching algorithm for determining the similarity of two strings. It was developed in 1983 by John W. Ratcliff and John A. Obershelp and published in Dr. Dobb's Journal in July 1988.

Algorithm
The similarity of two strings $$S_1$$ and $$S_2$$ is twice the number of matching characters $$K_m$$ divided by the total number of characters in both strings. The matching characters are defined as some longest common substring plus, recursively, the number of matching characters in the non-matching regions on both sides of that longest common substring:

$$D_{ro} = \frac{2K_m}{|S_1|+|S_2|}$$

where the similarity metric can take a value between zero and one:
 * $$0 \leq D_{ro} \leq 1$$

A value of 1 means the two strings match completely, whereas a value of 0 means there is no match, not even a single common character.
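The recursive definition above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names are hypothetical, ties between equally long common substrings are broken by taking the one that starts earliest in the first string (the definition only says "some" longest common substring), and returning 1.0 for two empty strings is an assumed convention.

```python
def longest_common_substring(s1, s2):
    """Return (i, j, k) such that s1[i:i+k] == s2[j:j+k] is a longest common substring."""
    best_i = best_j = best_k = 0
    # prev[j] holds the length of the common suffix of s1[:i-1] and s2[:j].
    prev = [0] * (len(s2) + 1)
    for i in range(1, len(s1) + 1):
        cur = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                cur[j] = prev[j - 1] + 1
                # Strict comparison keeps the earliest match on ties.
                if cur[j] > best_k:
                    best_k = cur[j]
                    best_i = i - best_k
                    best_j = j - best_k
        prev = cur
    return best_i, best_j, best_k


def matching_characters(s1, s2):
    """K_m: longest common substring plus the matches to its left and right."""
    i, j, k = longest_common_substring(s1, s2)
    if k == 0:
        return 0
    left = matching_characters(s1[:i], s2[:j])
    right = matching_characters(s1[i + k:], s2[j + k:])
    return left + k + right


def similarity(s1, s2):
    """D_ro = 2 * K_m / (|S1| + |S2|)."""
    total = len(s1) + len(s2)
    if total == 0:
        return 1.0  # assumed convention for two empty strings
    return 2.0 * matching_characters(s1, s2) / total
```

Each recursion level runs a quadratic dynamic program, so this sketch favours readability over speed.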

Sample
For the strings $$S_1 = \text{WIKIMEDIA}$$ and $$S_2 = \text{WIKIMANIA}$$, the longest common substring is WIKIM, with 5 characters. There is no further substring on the left. The non-matching substrings on the right side are EDIA and ANIA. They again have a longest common substring, IA, with length 2. The similarity metric is determined by:

$$\frac{2K_m}{|S_1|+|S_2|} = \frac{2 \cdot (|\text{WIKIM}|+|\text{IA}|)}{|S_1|+|S_2|} = \frac{2 \cdot (5 + 2)}{9 + 9} = \frac{14}{18} = 0.\overline{7}$$
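The arithmetic above can be checked with Python's `difflib.SequenceMatcher`. The sketch assumes the two sample strings are WIKIMEDIA and WIKIMANIA, which is consistent with the lengths 9 + 9 and the substrings WIKIM and IA in the formula; the variable names are illustrative only.

```python
from difflib import SequenceMatcher

s1, s2 = "WIKIMEDIA", "WIKIMANIA"  # assumed sample strings
matcher = SequenceMatcher(None, s1, s2, autojunk=False)

# get_matching_blocks() ends with a zero-length sentinel, which is filtered out.
blocks = [s1[m.a:m.a + m.size] for m in matcher.get_matching_blocks() if m.size]
k_m = sum(len(block) for block in blocks)
score = 2 * k_m / (len(s1) + len(s2))

print(blocks)  # ['WIKIM', 'IA']
print(k_m)     # 7
print(score)   # 14/18, about 0.778
```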

Properties
The Ratcliff/Obershelp matching characters can be substantially different from each longest common subsequence of the given strings. For example, $$S_1 = q \; ccccc \; r \; ddd \; s \; bbbb \; t \; eee \; u$$ and $$S_2 = v \; ddd \; w \; bbbb \; x \; eee \; y \; ccccc \; z$$ have $$ccccc$$ as their only longest common substring, with no common characters to the right of its occurrence and none to the left, leading to $$K_m = 5$$. However, the longest common subsequence of $$S_1$$ and $$S_2$$ is $$(ddd) \; (bbbb) \; (eee)$$, with a total length of $$10$$.
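The gap between the two quantities in this example can be reproduced with a short sketch. It assumes the spaces in the formulas are only visual separators, so the strings are written without them. The helper names are hypothetical: `lcs_subsequence_length` is a standard dynamic program for the longest common subsequence, and `gestalt_matching_characters` uses `difflib.SequenceMatcher` to obtain $$K_m$$.

```python
from difflib import SequenceMatcher


def lcs_subsequence_length(s1, s2):
    """Length of a longest common subsequence (row-by-row dynamic program)."""
    prev = [0] * (len(s2) + 1)
    for ch1 in s1:
        cur = [0]
        for j, ch2 in enumerate(s2, start=1):
            if ch1 == ch2:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def gestalt_matching_characters(s1, s2):
    """K_m as the total size of the matching blocks found by difflib."""
    matcher = SequenceMatcher(None, s1, s2, autojunk=False)
    return sum(block.size for block in matcher.get_matching_blocks())


s1 = "qcccccrdddsbbbbteeeu"
s2 = "vdddwbbbbxeeeycccccz"
k_m = gestalt_matching_characters(s1, s2)
lcs_len = lcs_subsequence_length(s1, s2)
print(k_m, lcs_len)  # 5 10
```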

Complexity
The execution time of the algorithm is $$O(n^3)$$ in the worst case and $$O(n^2)$$ in the average case. By changing the computing method, the execution time can be improved significantly.

Commutative property
The Python library implementation of the gestalt pattern matching algorithm is not commutative:



$$D_{ro}(S_1, S_2) \neq D_{ro}(S_2, S_1).$$

For example, for the two strings

$$S_1 = \text{GESTALT PATTERN MATCHING}$$ and

$$S_2 = \text{GESTALT PRACTICE}$$

the metric result for
 * $$D_{ro}(S_1, S_2)$$ is $$\frac{24}{40}$$ with the substrings GESTALT P, A, T, E, and for
 * $$D_{ro}(S_2, S_1)$$ the metric is $$\frac{26}{40}$$ with the substrings GESTALT P, R, A, C, I.
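The asymmetry can be observed directly with `difflib.SequenceMatcher`. The following sketch is illustrative (the helper name `blocks_and_ratio` is hypothetical); it lists the matched substrings and the ratio for both argument orders.

```python
from difflib import SequenceMatcher


def blocks_and_ratio(a, b):
    """Return the matched substrings of a and the similarity ratio of (a, b)."""
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    blocks = [a[m.a:m.a + m.size] for m in matcher.get_matching_blocks() if m.size]
    return blocks, matcher.ratio()


s1 = "GESTALT PATTERN MATCHING"
s2 = "GESTALT PRACTICE"

forward_blocks, forward = blocks_and_ratio(s1, s2)
backward_blocks, backward = blocks_and_ratio(s2, s1)

print(forward_blocks, forward)    # ['GESTALT P', 'A', 'T', 'E'] 0.6
print(backward_blocks, backward)  # ['GESTALT P', 'R', 'A', 'C', 'I'] 0.65
```

The difference arises because single-character ties are resolved by the position in the first argument, so swapping the arguments changes which characters are picked and what remains for the recursion.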

Applications
The Python difflib library, which was introduced in version 2.1, implements a similar algorithm that predates the Ratcliff-Obershelp algorithm. Because of the unfavourable runtime behaviour of this similarity metric, three methods have been implemented. Two of them return an upper bound in a shorter execution time. The fastest variant only compares the lengths of the two strings:


 * $$D_{rqr} = \frac{2 \cdot \min(|S_1|, |S_2|)}{|S_1| + |S_2|}$$

The second upper bound is twice the number of characters of $$S_1$$ that also occur in $$S_2$$, counted with multiplicity, divided by the total length of both strings; the order of the characters is ignored:


 * $$D_{qr} = \frac{2 \cdot \big | \{\!\vert S_1 \vert\!\} \cap \{\!\vert S_2 \vert\!\} \big |}{|S_1| + |S_2|}$$
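Both upper bounds can be written out directly from their formulas. The sketch below is an illustration only: the function names are hypothetical, the multiset intersection is modelled with `collections.Counter`, and returning 1.0 for two empty strings is an assumed convention.

```python
from collections import Counter


def length_upper_bound(s1, s2):
    """D_rqr: only the two lengths are compared."""
    total = len(s1) + len(s2)
    if total == 0:
        return 1.0  # assumed convention for two empty strings
    return 2 * min(len(s1), len(s2)) / total


def multiset_upper_bound(s1, s2):
    """D_qr: characters shared with multiplicity, character order ignored."""
    total = len(s1) + len(s2)
    if total == 0:
        return 1.0  # assumed convention for two empty strings
    common = Counter(s1) & Counter(s2)  # multiset intersection
    return 2 * sum(common.values()) / total


s1 = "GESTALT PATTERN MATCHING"
s2 = "GESTALT PRACTICE"
print(length_upper_bound(s1, s2))    # 32/40 = 0.8
print(multiset_upper_bound(s1, s2))  # 30/40 = 0.75
```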

Trivially the following applies:
 * $$0 \leq D_{ro} \leq D_{qr} \leq D_{rqr} \leq 1$$ and
 * $$0 \leq K_m \leq \big | \{\!\vert S_1 \vert\!\} \cap \{\!\vert S_2 \vert\!\} \big | \leq \min(|S_1|, |S_2|) \leq \frac {|S_1| + |S_2|}{2}$$.
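The chain of inequalities can be checked on a concrete pair of strings with the three `difflib.SequenceMatcher` methods `ratio()`, `quick_ratio()` and `real_quick_ratio()`. Mapping them to $$D_{ro}$$, $$D_{qr}$$ and $$D_{rqr}$$ is inferred here from the method names and their documented role as progressively cheaper upper bounds; the example strings are chosen only for illustration.

```python
from difflib import SequenceMatcher

matcher = SequenceMatcher(
    None, "GESTALT PATTERN MATCHING", "GESTALT PRACTICE", autojunk=False
)

d_ro = matcher.ratio()              # full matching-blocks computation
d_qr = matcher.quick_ratio()        # upper bound from shared character counts
d_rqr = matcher.real_quick_ratio()  # upper bound from the lengths alone

print(d_ro, d_qr, d_rqr)
ordering_holds = 0 <= d_ro <= d_qr <= d_rqr <= 1
print(ordering_holds)  # True
```

A typical use of the cheaper bounds is to discard candidate pairs whose upper bound already falls below a chosen threshold before computing the full ratio.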