User:Cacycle/diff





wikEd diff is a free JavaScript visual diff library for inline text comparisons. It is the only available JavaScript diff library that detects and highlights block moves and that works on the word and character level. While wikEd diff has been developed and optimized for comparing Wikipedia source texts, it works great for any type of text, including program code. The library is customizable, has Unicode and multilingual support, is fully commented and documented, and is free (public domain). The script is used by the Wikipedia/MediaWiki in-browser editor wikEd and by the gadget wikEdDiff. You can test the library by checking any of these gadgets in your English Wikipedia preferences. For easy text comparisons by copy-and-pasting you can also use the wikEd diff online tool and demo. The library has also been ported to PHP in the form of a MediaWiki extension.

Features

 * Visual inline style, changes are shown in a single output text
 * Block move detection and highlighting
 * Resolution down to characters level
 * Unicode and multilingual support
 * Stepwise split (paragraphs, sentences, words, characters)
 * Recursive diff
 * Optimized code for resolving unmatched sequences
 * Minimization of length of moved blocks
 * Alignment of ambiguous unmatched sequences to next line break or word border
 * Clipping of unchanged irrelevant parts from the output (optional)
 * Fully customizable
 * Text split optimized for MediaWiki source texts
 * Well commented and documented code

Code
The JavaScript code can be found under User:Cacycle/diff.js (raw code).

Supported browsers
The code works for all browsers, including Internet Explorer 8 under Windows XP.

Usage
To call this library, please use the following code:

To raw code of the library is available under:

https://en.wikipedia.org/w/index.php?title=User:Cacycle/diff.js&action=raw&ctype=text/javascript

To use the library in other JavaScript programs, add the following line to the HTML tag:

Customization
The library is fully customizable. The following settings can be changed on your User:YourUsername/common.js page. Please check the top of the diff.js program code for additional settings.Please see wikEdDiff for wikEdDiff-specific settings.

Colors can be customized by adding CSS code to your User:YourUsername/common.css page. Please check the code for available CSS classes.

Block mark symbols:

Show complete un-clipped diff text:

Enable block move layout with highlighted blocks and marks at their original positions (default):

Reject blocks if they are too short and their words are not unique, prevents fragmentated diffs for very different versions (default):

Reject blocks if shorter than this number of real words (default):

Enable character-refined diff (default):

Enable recursive diff to resolve problematic sequences (default):

Maximum recursion depth (default):

Display blocks in differing colors (rainbow color scheme):

Show debug infos and stats (block and group data object) in browser console:

Show timing results in browser console:

Run unit tests to prove correct working, display results in browser console:

Text type customization
The regular expressions for splitting the texts are optimized for natural language containing wiki markup. While this default also works great for other text types such as source code, it can be customized:

Implementation
The code is an implementation and extension of the following basic algorithm:


 * Paul Heckel: A technique for isolating differences between files
 * Communications of the ACM 21(4):264 (1978)
 * http://doi.acm.org/10.1145/359460.359467 (access restricted)
 * Mirror: http://documents.scribd.com/docs/10ro9oowpo1h81pgh1as.pdf (open access)

This method is word-based and uses unique words as anchor points to identify matching text and moved blocks.

wikEd diff has additional code for resolving unmatched islands that are caused by common tokens at the border of sequences of common tokens or by sequence crossing-overs. From these matching data, the program extracts moved blocks and compiles a new text with added insertion, deletion, and moved blocks, and block marks.

Details

 * 1) Matching tokens (Heckel method):
 * 2) A list of real words and chunks is compiled from the old and the new text for later "unlinking" step
 * 3) Old and new text versions are split into tokens ("symbols", "words"), creating two token lists
 * 4) The number of token occurrences in each version are counted in the symbol tables
 * 5) Tokens unique in both texts serve as anchors or seed crystals and are connected ("linked", "matched") between old and new version
 * 6) Starting from these connected pairs, adjacent identical pairs are connected outwards
 * 7) In addition to the classical Heckel method, the following improvements have been implemented:
 * 8) The process is repeated ("levels") while decreasing the size of unmatched tokens from paragraphs to lines, chunks (i.e. html and link markup), words, and single characters
 * 9) Words are split into characters  only in corresponding unmatched regions ("gaps") of identical token number and strong similarity in all tokens:
 * 10) Identical tokens (i.e. space separators) will be linked, resulting in word-wise character-level linking
 * 11) One token became connected or separated by space or dash (or any token) (axb → a b)
 * 12) Addition or deletion of flanking strings in tokens (xabx → ab)
 * 13) Addition or deletion of internal string in tokens (axb → ab)
 * 14) Same length and at least 50 % identity (abc → abx)
 * 15) Same start or end, same text longer than different text (abcd → abcxx)
 * 16) Same length and at least 50 % identity (abcd → abcxx)
 * 17) Gaps caused by "crossing-overs" are resolved by repeating the procedure at each level with a fresh symbol table for unlinked tokens
 * 18) Corresponding gaps caused by non-unique tokens on both sides of the gap borders are resolved by repeating the procedure at each level for each gap pair recursively
 * 19) Ambiguous gaps with the same front and back tokens are aligned with newlines or word borders if possible ("sliding")
 * 20) Extracting block information:
 * 21) Unchanged text blocks ('same' blocks, sequences of matched tokens) are added to the blocks table in new text order
 * 22) To set blocks as fixed vs. moved, the longest sequences of blocks in increasing old text order are identified (longest increasing subsequence problem for moved blocks):
 * 23) Independent block sections with no block moves outside the sections are identified
 * 24) Continuous blocks in old text order are identified as groups
 * 25) The longest subsequence of groups in sections in old text order is identified using a cached recursive method   and set fixed
 * 26) Blocks outside sections are set fixed
 * 27) Preventing fragmented outputs caused by false positive matchings in texts with many changes:
 * 28) Unchanged blocks are converted into insertion/deletion pairs in token lists ("unlinking"):
 * 29) Unlink whole group if no block is shorter than three real words   and contains no word unique to the whole text
 * 30) Unlink group blocks stepwise from the front and the end until a block contains more than one real word or a word unique to the whole text
 * 31) After unlinking, the whole process of extracting block information from the token lists has to be repeated
 * 32) Unlinking is repeated as long as possible
 * 33) Deletion blocks ('del' blocks, unmatched sequences of text blocks from the old text token list) are added to the blocks table
 * 34) Deletion blocks are sorted-in in between the unchanged blocks at the correct position in new text order:
 * 35) Deletion blocks that are next to a fixed neighbor block in old text order are positioned next to that block
 * 36) Deletion blocks without a fixed neighbor in old text order are positioned next to a neighbor if that block is not the start or end of a group
 * 37) Alternatively, the deletion block is positioned next to the closest fixed neighbor to the left
 * 38) Deletion blocks around unchanged blocks will be arranged in order of their old text positions
 * 39) Insertion blocks ('ins' blocks, unmatched text blocks from the new text token list) are added to the blocks table
 * 40) Insertion block group numbers and fixed status are set:
 * 41) If the insertion is inside a group, that group number and fixed status is used
 * 42) For insertions outside existing groups a new single-block group is created
 * 43) Adding marks at the original position of moved groups ('mark' blocks) to the blocks table:
 * 44) This is similar to positioning deletion blocks
 * 45) Marks of moved goups that are next to a fixed group in old text order are positioned next to that group
 * 46) Alternatively, the mark is positioned next to the closest fixed group to the left
 * 47) Mark blocks and deletion blocks around unchanged blocks will be arranged in order of their old text positions
 * 48) Moved group colors are set in order of the marks in the new text
 * 49) Collect diff fragment list for markup (abstraction layer for customized diffs):
 * 50) Groups are processed in new text order
 * 51) If block moves are not displayed as per configuration , then this text is inserted as a deletion
 * 52) Otherwise, a mark is added with this text as a popup title
 * 53) Clip unchanged sections from unmoved block text:
 * 54) Unmoved block text is clipped at the closest feature that is detected from the borders (lines, heading, paragraph, line break, blank, fixed number of chars)
 * 55) Create html formatted diff code from:

License
The code has been released into the public domain.