User:Van Parunak/sandbox

Grammar-based distances are a family of string metrics that take into account the grammars that generate the strings being compared. Most string measures simply count the character changes needed to transform one string into another (edit distance) or, more generally, the number of shared and distinct characters (e.g., the overlap coefficient). The fundamental insight of grammar-based distance is that if two strings differ at a character generated by the same grammatical production, they should be reckoned closer to one another than if the distinguishing character arises from higher-level productions.

Most work on grammar-based distances induces the grammar underlying a set of strings by applying Lempel-Ziv compression, which effectively reconstructs a production sequence that yields the strings. This approach is increasingly popular in biological settings. An alternative approach relies on knowing the underlying grammar of the strings in advance.
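
The compression-based approach can be illustrated with a minimal sketch. It assumes an LZ78-style parse, in which each new phrase is the shortest prefix of the remaining input not yet in the dictionary, and it uses the number of phrases as a rough stand-in for the size of the induced grammar. The function names `lz78_phrases` and `lz_distance` are hypothetical, and the normalization is borrowed from the normalized compression distance; it is not necessarily the formula used in the literature on grammar-based distances.

```python
def lz78_phrases(s):
    """Split s into LZ78-style phrases (illustrative sketch only)."""
    seen = set()      # dictionary of phrases produced so far
    phrases = []
    current = ""
    for ch in s:
        current += ch
        if current not in seen:
            # Shortest extension not yet in the dictionary: emit it.
            seen.add(current)
            phrases.append(current)
            current = ""
    if current:
        # Trailing fragment that matched an existing phrase.
        phrases.append(current)
    return phrases


def lz_distance(x, y):
    """Assumed NCD-style distance built on LZ78 phrase counts."""
    cx = len(lz78_phrases(x))
    cy = len(lz78_phrases(y))
    cxy = len(lz78_phrases(x + y))
    if max(cx, cy) == 0:
        return 0.0  # both strings empty
    # How many extra phrases the concatenation needs, normalized.
    return (cxy - min(cx, cy)) / max(cx, cy)


x = "abababababab"
z = "cdefghijklmn"
print(lz78_phrases("abab"))  # ['a', 'b', 'ab']
print(lz_distance(x, x))     # 0.5
print(lz_distance(x, z))     # 1.0
```

Strings that share repeated structure reuse dictionary phrases, so the concatenation needs few additional phrases and the value is small. Note that this simple sketch does not return zero for identical strings, so it is a dissimilarity score rather than a true metric.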