Knuth–Plass line-breaking algorithm

The Knuth–Plass algorithm is a line-breaking algorithm designed for use in Donald Knuth's typesetting program TeX. It integrates the problems of text justification and hyphenation into a single algorithm by using a discrete dynamic programming method to minimize a loss function that attempts to quantify the aesthetic qualities desired in the finished output.

The algorithm works by dividing the text into a stream of three kinds of objects: boxes, which are non-resizable chunks of content; glue, which are flexible, resizable elements; and penalties, which represent places where breaking is undesirable (or, if negative, desirable). The loss function, known as "badness", is defined in terms of the deformation of the glue elements and any extra penalties incurred through line breaking.
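As an illustration, the three kinds of objects and a badness function might be sketched as follows. The class names and the exact formula are simplified assumptions for this sketch, not TeX's full definitions; TeX's actual badness is approximately 100 times the cube of the glue adjustment ratio, which is what is modeled here:

```python
from dataclasses import dataclass

@dataclass
class Box:
    """A fixed-width chunk of content (a word or word fragment)."""
    width: float

@dataclass
class Glue:
    """Flexible space: a natural width plus allowed stretch and shrink."""
    width: float
    stretch: float
    shrink: float

@dataclass
class Penalty:
    """A potential break point; negative cost encourages breaking here."""
    width: float  # extra width if the break is taken (e.g. a hyphen)
    cost: float

def badness(adjustment, stretch, shrink):
    # Simplified TeX-style badness: 100 * |ratio|^3, where the ratio is
    # how far the line's glue must stretch (adjustment > 0) or shrink
    # (adjustment < 0) relative to what the glue allows.
    if adjustment == 0:
        return 0.0
    available = stretch if adjustment > 0 else shrink
    if available <= 0:
        return float('inf')  # the line cannot be deformed to fit
    ratio = abs(adjustment) / available
    return 100.0 * ratio ** 3
```

A line whose glue must stretch exactly as far as its total stretchability allows (ratio 1) gets badness 100; a line that would need to shrink beyond its shrinkability gets infinite badness and is never chosen.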

Making hyphenation decisions follows naturally from the algorithm, but the choice of possible hyphenation points within words, and optionally their preference weighting, must be performed first, and that information inserted into the text stream in advance. Knuth and Plass' original algorithm does not include page breaking, but may be modified to interface with a pagination algorithm, such as the algorithm designed by Plass in his PhD thesis.

Typically, the cost function for this technique should be modified so that it does not count the space left on the final line of a paragraph; this modification allows a paragraph to end in the middle of a line without penalty. The same technique can also be extended to take into account other factors such as the number of lines or costs for hyphenating long words.
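A minimal sketch of such a modified cost function, assuming a squared-leftover badness and a `last` flag marking the final line of the paragraph (the function name and signature are illustrative):

```python
def line_cost(words, i, j, width, last):
    """Cost of setting words[i:j] as one line: squared leftover space,
    except that the final line of the paragraph (last=True) is free."""
    # spaces between words count toward the line's length
    length = sum(len(w) for w in words[i:j]) + (j - i - 1)
    if length > width:
        return float('inf')  # the words do not fit on the line
    return 0 if last else (width - length) ** 2
```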

Computational complexity
A naive brute-force exhaustive search for the minimum badness, trying every possible combination of breakpoints, would take an impractical $$O(2^n)$$ time. The classic Knuth–Plass dynamic programming approach to the minimization problem is a worst-case $$O(n^2)$$ algorithm, but in practice it typically runs in close to linear time.
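The quadratic dynamic program can be sketched as follows. This is a simplified model, not TeX's implementation: it uses a squared-leftover badness on every line (including the last, matching the worked example in the next section) instead of glue-based badness, and it ignores hyphenation and penalties:

```python
def minimum_raggedness(words, width):
    """O(n^2) dynamic program over breakpoints: best[j] is the minimum
    total cost of typesetting words[:j]."""
    n = len(words)
    INF = float('inf')

    def cost(i, j):
        # squared leftover space if words[i:j] form one line
        length = sum(len(w) for w in words[i:j]) + (j - i - 1)
        return INF if length > width else (width - length) ** 2

    best = [0] + [INF] * n   # best[j]: minimum cost of the first j words
    back = [0] * (n + 1)     # back[j]: start of the last line in that optimum
    for j in range(1, n + 1):
        for i in range(j):
            c = best[i] + cost(i, j)
            if c < best[j]:
                best[j], back[j] = c, i
    # walk the back-pointers to recover the chosen lines
    lines, j = [], n
    while j > 0:
        i = back[j]
        lines.append(" ".join(words[i:j]))
        j = i
    return best[n], lines[::-1]
```

On the example input from the next section, `minimum_raggedness("AAA BB CC DDDDD".split(), 6)` returns a total cost of 11 with the lines `AAA`, `BB CC`, `DDDDD`.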

Solving for the Knuth–Plass optimum can be shown to be a special case of the convex least-weight subsequence problem, which can be solved in $$O(n)$$ time. Methods to do this include the SMAWK algorithm.

Simple example of minimum raggedness metric
For the input text

AAA BB CC DDDDD

with a line width of 6, a greedy algorithm that fills each line with as many words as possible, in order, before moving to the next line would produce:

------    Line width: 6
AAA BB    Remaining space: 0
CC        Remaining space: 4
DDDDD     Remaining space: 1

The sum of squared space left over by this method is $$0^2 + 4^2 + 1^2 = 17$$. However, the optimal solution achieves the smaller sum $$3^2 + 1^2 + 1^2 = 11$$:

------    Line width: 6
AAA       Remaining space: 3
BB CC     Remaining space: 1
DDDDD     Remaining space: 1

The difference here is that the first line is broken before BB instead of after it, yielding a more even right margin and the lower total cost of 11.
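The two totals can be verified with a short computation (the variable names are illustrative):

```python
# Recompute the sums of squared leftover space for both breakings.
width = 6

def leftover(line_words):
    # spaces between words count toward the line's length
    return width - (sum(len(w) for w in line_words) + len(line_words) - 1)

greedy  = [["AAA", "BB"], ["CC"], ["DDDDD"]]
optimal = [["AAA"], ["BB", "CC"], ["DDDDD"]]

greedy_cost  = sum(leftover(line) ** 2 for line in greedy)   # 0 + 16 + 1 = 17
optimal_cost = sum(leftover(line) ** 2 for line in optimal)  # 9 + 1 + 1 = 11
```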