User:Jayden001/sandbox

Summary generation is done by extracting a set of sentences from a webpage that best summarizes its content.

Currently, the generated summary is limited to 320 characters and 3 selected sentences.

This section describes the summary model and the algorithm for finding the optimal set of sentences subject to those constraints.

Given sentences $$s_1, s_2, \ldots, s_n$$, we want to maximize the joint information content by solving the constrained optimization problem
 * $$ \max_{i < j < k} I(s_i, s_j, s_k), \qquad l_i + l_j + l_k \leq 320 $$,

where $$l_i$$ is the length of sentence $$i$$.
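As a baseline, the problem can be solved by exhaustive search. The sketch below assumes a hypothetical callable `info(i, j, k)` that returns the joint information content $$I(s_i, s_j, s_k)$$ and an array `lengths` of sentence lengths $$l_i$$:

```python
from itertools import combinations

def best_summary_brute_force(info, lengths, max_len=320):
    """Exhaustively search all 3-sentence combinations (O(n^3)).

    `info(i, j, k)` is a hypothetical callable returning the joint
    information content I(s_i, s_j, s_k); `lengths[i]` is l_i.
    """
    best, best_triple = float("-inf"), None
    for i, j, k in combinations(range(len(lengths)), 3):
        if lengths[i] + lengths[j] + lengths[k] <= max_len:
            value = info(i, j, k)
            if value > best:
                best, best_triple = value, (i, j, k)
    return best_triple
```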

Rather than iterating over all $$O(n^3)$$ 3-sentence combinations to find the optimal summary, we resort to an approximation: maximizing a lower bound of the objective function.

\begin{align} I(s_i, s_j, s_k) &= I(s_i) + I(s_j | s_i) + I(s_k | s_i, s_j) \\ &\geq I(s_i | s_1, \ldots, s_{i-1}) + I(s_j | s_1, \ldots, s_{j-1}) + I(s_k | s_1, \ldots, s_{k-1}) \\ &= v_i + v_j + v_k, \end{align}

where the inequality holds because conditioning on additional sentences can only reduce the marginal information content.

We model the information value of each sentence based on the occurrence of salient terms in the sentence.
 * $$ V(s_i | s_1, \ldots, s_{i-1}) = \sum_{t \in \text{salient terms}} \text{score}_t \times P(\text{important}_t \;|\; \text{term signals}),$$

where the term-level signals are based on all previous sentences.
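A minimal sketch of this scoring, assuming the per-term scores and importance probabilities (already conditioned on the preceding sentences) are precomputed upstream; the input names are hypothetical:

```python
def sentence_value(tokens, term_score, term_importance):
    """Marginal information value V(s_i | s_1, ..., s_{i-1}) of one sentence.

    `term_score[t]` is score_t and `term_importance[t]` is
    P(important_t | term signals); only salient terms appear in `term_score`,
    so non-salient tokens contribute nothing.
    """
    return sum(term_score[t] * term_importance[t]
               for t in tokens if t in term_score)
```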

By analogy with information theory, the salient terms are like independent occurrences of rare events, each contributing its expected information content within the sentence.

Of course, a sentence with many salient-term hits is not necessarily a good summary sentence (e.g., one that merely names terms). The likelihood of being a good summary sentence also depends on sentence-level signals such as position, length, and formatting. Therefore, we model the expected marginal information as



\begin{align} v_i &= I(s_i | s_1, \ldots, s_{i-1}) \\ &= V(s_i | s_1, \ldots, s_{i-1}) \times P(\text{good summary} \;|\; \text{sentence signals}). \end{align}
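Combining the two factors is then a per-sentence product; the sketch below assumes both inputs are hypothetical per-sentence arrays computed upstream:

```python
def marginal_values(V_values, p_good):
    """v_i = V(s_i | s_1, ..., s_{i-1}) * P(good summary | sentence signals).

    `V_values[i]` is the salient-term value of sentence i and `p_good[i]`
    is the sentence-level probability of being a good summary sentence.
    """
    return [v * p for v, p in zip(V_values, p_good)]
```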

Maximizing $$v_i + v_j + v_k$$ subject to the length and sentence-count constraints can be done using dynamic programming.

The optimization problem becomes
 * $$ \max_{i < j < k} v_i + v_j + v_k, \qquad l_i + l_j + l_k \leq 320 $$,

which is a generalized knapsack problem with an item-count constraint in addition to the usual capacity constraint.
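The knapsack DP above can be sketched as follows. This is one straightforward formulation (function and variable names are ours, not from the production system): `dp[c][l]` holds the best value achievable with `c` sentences totalling `l` characters, and items are processed in order so the chosen indices satisfy $$i < j < k$$ automatically.

```python
def best_three_sentences(values, lengths, max_len=320, max_count=3):
    """Maximize v_i + v_j + v_k subject to l_i + l_j + l_k <= max_len
    and exactly max_count sentences, in O(n * max_count * max_len) time.

    dp[c][l] = (best value, chosen indices) using c sentences of length l.
    """
    NEG = float("-inf")
    dp = [[(NEG, None)] * (max_len + 1) for _ in range(max_count + 1)]
    dp[0][0] = (0.0, ())
    for i, (v, l) in enumerate(zip(values, lengths)):
        for c in range(max_count - 1, -1, -1):       # iterate backwards so
            for used in range(max_len - l, -1, -1):  # each sentence is picked once
                val, picked = dp[c][used]
                if picked is not None and val + v > dp[c + 1][used + l][0]:
                    dp[c + 1][used + l] = (val + v, picked + (i,))
    best = max(dp[max_count], key=lambda t: t[0])
    return best[1]  # indices of the selected sentences, or None if infeasible
```

Returning `None` when no 3-sentence combination fits within the budget lets the caller fall back to a shorter summary.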

Final algorithm: