Template talk:Strlen quick

This is the discussion/talk-page for: Template:Strlen_quick.

Created
The fast string-length counter, Template:Strlen_quick, was created by long-term user Wikid77 on 30 January 2011 to provide a very fast string-length template, optimized for actual Wikipedia data. It is also designed to use fewer wiki-markup resources in the NewPP MediaWiki preprocessor, with an expansion depth of only 5 levels rather than the 9 to 14 levels used by other string-length templates. -Wikid77 10:09, 30 January 2011 (UTC)

Optimizing for actual string lengths
30-Jan-2011: Template:Strlen_quick was created as a faster alternative to {{str_len}}, by optimizing for real string data as used in articles. Using the actual string data from existing Wikipedia articles, it is possible to determine the most likely string lengths, such as 17 or 18 characters for titles, and then to match those lengths sooner. For example, if the top 1,000 articles all used an infobox code of 9 letters, then checking for length 9 first would avoid checking other lengths. In the 353,000 articles using {{Italic_title}}, the title lengths range from 2 to 99 characters, with the most common lengths between 16 and 19, and 88% of all titles under 30 characters long. The distribution of title lengths has been as follows:
 * 84% > 10, 12% < 10, 51% in 10-19, 25% in 20-29, 7% in 30-39, 1.7% in 40-49, 0.6% >50.

For lengths 0-9, the increase is dramatic: almost no titles are 1 or 2 characters, a few are 3, some are 4, then more have lengths 5, 6, 7 and 8, with length 9 being 19 times more common than length 3. To match a title length quickly in this range, check the most common lengths first, that is, from 9 down to 1. Among lengths 10-19, the most common are 17 and 18, with frequency falling off farther away, and 10 is the least frequent length in that range. Above 20, the frequency decreases from 21 to 29, mirroring the 9-to-1 pattern, so checking 21 first is 3 times more likely to match than checking 29. Among 30-39, titles are quite rare, with 31 being as rare as length 5, and 39 being 3 times rarer still, occurring in only 43 per 10,000 titles. By optimizing for the actual lengths of titles, those lengths can be matched perhaps twice as quickly. A pure binary search would give an unfair advantage to rare lengths, so the search should be prioritized in favor of the more common lengths.
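As a rough illustration of why the checking order matters, the sketch below compares the average number of range tests needed when length ranges are tried in ascending order versus most-common-first order. It is written in Python rather than template markup, it uses only the range percentages quoted above (which sum to 97.3 and are therefore normalized), and it assumes a simplified model in which every range costs one test. The names `BUCKETS` and `expected_checks` are hypothetical and are not part of the template.

```python
# Share of titles per length range, in percent, as quoted in the comment above.
# The quoted figures sum to 97.3, so results are normalized by that total.
BUCKETS = {
    "0-9": 12.0,
    "10-19": 51.0,
    "20-29": 25.0,
    "30-39": 7.0,
    "40-49": 1.7,
    "50+": 0.6,
}


def expected_checks(order, weights):
    """Average number of range tests when ranges are tried in `order`.

    Simplified model (an assumption): a title in the i-th range tried
    costs i tests, and every test costs the same.
    """
    total = sum(weights.values())
    cost = sum((i + 1) * weights[name] for i, name in enumerate(order))
    return cost / total


# Ascending order: 0-9 first, then 10-19, and so on.
ascending = list(BUCKETS)

# Prioritized order: most common range first.
by_frequency = sorted(BUCKETS, key=BUCKETS.get, reverse=True)

print("prioritized order:", by_frequency)
print("ascending  :", round(expected_checks(ascending, BUCKETS), 2))
print("prioritized:", round(expected_checks(by_frequency, BUCKETS), 2))
```

Under this model the prioritized order needs fewer tests on average than the ascending order, because the 10-19 range alone covers about half of all titles. The same reasoning applies within each range, for example trying 17 and 18 before 10.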

The markup logic, below, uses prioritized steps (the actual markup handles length over 70): LOGIC to match 1-to-60 lengths in order of most common real data:

Tests of the above code show that it does, in fact, process actual title lengths about twice as fast as the binary-search markup logic that has been used in template {{str_len}}. -Wikid77 10:09, 30 January 2011, revised 01:21, 22 February 2011 (UTC)

Zero length string returns length=1
testing:
 * →
 * →
 * →
 * →
 * →
 * →
 * I think the last three are in error. -DePiep (talk) 07:54, 15 June 2012 (UTC)