V-optimal histograms

Histograms are most commonly used as visual representations of data. However, Database systems use histograms to summarize data internally and provide size estimates for queries. These histograms are not presented to users or displayed visually, so a wider range of options are available for their construction. Simple or exotic histograms are defined by four parameters, Sort Value, Source Value, Partition Class and Partition Rule. The most basic histogram is the equi-width histogram, where each bucket represents the same range of values. That histogram would be defined as having a Sort Value of Value, a Source Value of Frequency, be in the Serial Partition Class and have a Partition Rule stating that all buckets have the same range.

V-optimal histograms are an example of a more "exotic" histogram. V-optimality is a Partition Rule which states that the bucket boundaries are to be placed as to minimize the cumulative weighted variance of the buckets. Implementation of this rule is a complex problem and construction of these histograms is also a complex process.

Definition
A v-optimal histogram is based on the concept of minimizing a quantity which is called the weighted variance in this context. This is defined as
 * $$W=\sum_{j=1}^Jn_jV_j \, ,$$

where the histogram consists of J bins or buckets, nj is the number of items contained in the jth bin and where Vj is the variance between the values associated with the items in the jth bin.

Examples
The following example will construct a V-optimal histogram having a Sort Value of Value, a Source Value of Frequency, and a Partition Class of Serial. In practice, almost all histograms used in research or commercial products are of the Serial class, meaning that sequential sort values are placed in either the same bucket, or sequential buckets. For example, values 1, 2, 3 and 4 will be in buckets 1 and 2, or buckets 1, 2 and 3, but never in buckets 1 and 3. That will be taken as an assumption in any further discussion.

Take a simple set of data, for example, a list of integers:

1, 3, 4, 7, 2, 8, 3, 6, 3, 6, 8, 2, 1, 6, 3, 5, 3, 4, 7, 2, 6, 7, 2

Compute the value and frequency pairs (1, 2), (2, 4), (3, 5), (4, 2), (5, 1), (6, 4), (7, 3), (8, 2)

Our V-optimal histogram will have two buckets. Since one bucket must end at the data point for 8, we must decide where to put the other bucket boundary. The V-optimality rule states that the cumulative weighted variance of the buckets must be minimized. We will look at two options and compute the cumulative variance of those options.

Option 1: Bucket 1 contains values 1 through 4. Bucket 2 contains values 5 through 8.

Bucket 1: Average frequency 3.25 Weighted variance 2.28

Bucket 2: Average frequency 2.5 Weighted variance 2.19

Sum of Weighted Variance 4.47

Option 2: Bucket 1 contains values 1 through 2. Bucket 2 contains values 3 through 8.

Bucket 1: Average frequency 3 Weighted variance 1.41

Bucket 2: Average frequency 2.83 Weighted variance 3.29

Sum of Weighted Variance 4.70

The first choice is better, so the histogram that would wind up being stored is: Bucket 1: Range (1–4), Average Frequency 3.25 Bucket 2: Range (5–8), Average Frequency 2.5

Advantages of V-optimality vs. equi-width or equi-depth
V-optimal histograms do a better job of estimating the bucket contents. A histogram is an estimation of the base data, and any histogram will have errors. The partition rule used in VOptimal histograms attempts to have the smallest variance possible among the buckets, which provides for a smaller error. Research done by Poosala and Ioannidis 1 has demonstrated that the most accurate estimation of data is done with a VOptimal histogram using value as a sort parameter and frequency as a source parameter.

Disadvantages of V-optimality vs. equi-width or equi-depth
While the V-optimal histogram is more accurate, it does have drawbacks. It is a difficult structure to update. Any changes to the source parameter could potentially result in having to re-build the histogram entirely, rather than updating the existing histogram. An equi-width histogram does not have this problem. Equi-depth histograms will experience this issue to some degree, but because the equi-depth construction is simpler, there is a lower cost to maintain it. The difficulty in updating VOptimal histograms is an outgrowth of the difficulty involved in constructing these histograms.

Computing the V-optimal histogram is computationally expensive to compute compared to other types of histograms.

Construction issues
The above example is a simple one. There are only 7 choices of bucket boundaries. One could compute the cumulative variance for all 7 options easily and choose the absolute best placement. However, as the range of values gets larger and the number of buckets gets larger, the set of possible histograms grows exponentially and it becomes a dauntingly complex problem to find the set of boundaries that provide the absolute minimum variance using the naïve approach. Using dynamic programming, it is possible to compute the V-optimal histogram in $$O(N^2 B)$$ where N is the number of data points and B is the number buckets.

Since finding the optimal histogram is quadratic, it is common to instead approximate the V-optimal histogram. By creating random solutions, using those as a starting point and improving upon them, one can find a solution that is a fair approximation of the "best" solution. One construction method used to get around this problem is the Iterative Improvement algorithm. Another is Simulated Annealing. The two may be combined in Two Phase Optimization, or 2PO. These algorithms are put forth in "Randomized Algorithms..." (cited below) as a method to optimize queries, but the general idea may be applied to construction of V-optimal Histograms.

Iterative improvement
Iterative Improvement (II) is a fairly naïve greedy algorithm. Starting from a random state, iterative steps in many directions are considered. The step that offers the best improvement of cost (in this case Total Variance) is taken. The process is repeated until one settles at the local minimum, where no further improvement is possible. Applied to the construction of V-optimal histograms, the initial random state would be a set of values representing the bucket boundary placements. The iterative improvement steps would involve moving each boundary until it was at its local minimum, then moving to the next boundary and adjusting it accordingly.

Simulated annealing
A basic explanation of Simulated Annealing is that it is a lot like II, only instead of taking the greedy step each time, it will sometimes accept a step that results in an increase in cost. In theory, SA will be less likely to stop at a very local minimum, and more likely to find a more global one. A useful piece of imagery is an M-shaped graph, representing overall cost on the Y axis. If the initial state is on the V-shaped part of the "M", II will settle into the high valley, the local minimum. Because SA will accept uphill moves, it is more likely to climb up the slope of the "V" and wind up at the foot of the "M", the global minimum.

Two phase optimization
Two phase optimization, or 2PO, combines the II and SA methods. II is run until a local minimum is reached, then SA is run on that solution in an attempt to find less obvious improvements.

Variation
The idea behind V-optimal histograms is to minimize the variance inside each bucket. In considering this, a thought occurs that the variance of any set with one member is 0. This is the idea behind "End-Biased" V-optimal Histograms. The value with the highest frequency is always placed in its own bucket. This ensures that the estimate for that value (which is likely to be the most frequently requested estimate, since it is the most frequent value) will always be accurate and also removes the value most likely to cause a high variance from the data set.

Another thought that might occur is that variance would be reduced if one were to sort by frequency, instead of value. This would naturally tend to place like values next to each other. Such a histogram can be constructed by using a Sort Value of Frequency and a Source Value of Frequency. At this point, however, the buckets must carry additional information indicating what data values are present in the bucket. These histograms have been shown to be less accurate, due to the additional layer of estimation required.