ZPEG

ZPEG is a motion video technology that applies a human visual acuity model to a decorrelated transform-domain space, thereby optimally reducing the redundancies in motion video by removing the subjectively imperceptible. This technology is applicable to a wide range of video processing problems such as video optimization, real-time motion video compression, subjective quality monitoring, and format conversion.

The ZPEG company produces modified versions of x264, x265, AV1, and FFmpeg under the name ZPEG Engine.

Decorrelated Transform Space
Pixel distributions are well modeled as a stochastic process, and a transformation to their ideal decorrelated representation is accomplished by the Karhunen–Loève transform (KLT) defined by the Karhunen–Loève theorem. The discrete cosine transform (DCT) is often used as a computationally efficient transform that closely approximates the Karhunen–Loève transform for video data due to the strong correlation in pixel space typical of video frames. As the correlation in the temporal direction is just as high as that of the spatial directions, a three-dimensional DCT may be used to decorrelate motion video.
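As an illustration of the three-dimensional DCT described above, the following minimal sketch applies an orthonormal DCT-II separably along the pixel, line, and frame axes of a small block. The function names, the 4×4×4 block size, and the pure-Python implementation are assumptions made for readability; they are not taken from the ZPEG implementation.

```python
import math

def dct_1d(x):
    """Orthonormal DCT-II of a sequence of numbers."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def dct_3d(block):
    """Separable 3D DCT of block[t][y][x]: along x, then y, then t."""
    T, H, W = len(block), len(block[0]), len(block[0][0])
    # Transform each line (pixel within line).
    a = [[dct_1d(block[t][y]) for y in range(H)] for t in range(T)]
    # Transform each column (line within frame).
    b = [[[0.0] * W for _ in range(H)] for _ in range(T)]
    for t in range(T):
        for x in range(W):
            col = dct_1d([a[t][y][x] for y in range(H)])
            for y in range(H):
                b[t][y][x] = col[y]
    # Transform along time (temporal sequence of frames).
    c = [[[0.0] * W for _ in range(H)] for _ in range(T)]
    for y in range(H):
        for x in range(W):
            tube = dct_1d([b[t][y][x] for t in range(T)])
            for t in range(T):
                c[t][y][x] = tube[t]
    return c

# Hypothetical example: a static, flat block is perfectly correlated in all
# three directions, so all of its energy lands in the DC coefficient.
flat = [[[10.0] * 4 for _ in range(4)] for _ in range(4)]
coeffs = dct_3d(flat)
# coeffs[0][0][0] is approximately 80.0 (10 * 4**1.5); all others are near 0.
```

A production implementation would use a fast transform rather than the direct sums shown here; the direct form is used only to keep the sketch self-contained.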

Human Visual Model
A Human Visual Model may be formulated based on the contrast sensitivity of the visual perception system. A time-varying Contrast Sensitivity model may be specified, and is applicable to the three-dimensional discrete cosine transform (DCT). A three-dimensional Contrast Sensitivity model is used to generate quantizers for each of the three-dimensional basis vectors, resulting in a near-optimal visually lossless removal of imperceptible motion video artifacts.

Perceptual strength in visiBels
The perceptual strength of the Human Visual Model quantizer generation process is calibrated in visiBels (vB), a logarithmic scale roughly corresponding to perceptibility as measured in screen heights. As the eye moves further from the screen, it becomes less able to perceive details in the image. The ZPEG model also includes a temporal component, and thus is not fully described by viewing distance. In terms of viewing distance, the visiBel value decreases by six each time the screen distance halves. The standard viewing distance for Standard Definition television (about 7 screen heights) is defined as 0 vB. The normal viewing distance for high-definition (HD) video, about 4 screen heights, is therefore roughly −6 vB (the value that corresponds to 3.5 screen heights).
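The two anchor points given above (7 screen heights at 0 vB and 3.5 screen heights at −6 vB) are consistent with a decibel-style relation of 20·log10(distance / 7). The sketch below assumes that relation; the formula and the function names are inferred for illustration and are not stated by the source, and the temporal component of the model is ignored.

```python
import math

SD_REFERENCE_HEIGHTS = 7.0  # viewing distance defined as 0 vB in the text

def visibels_from_distance(screen_heights):
    """Assumed mapping: about 6 vB per halving or doubling of distance."""
    return 20.0 * math.log10(screen_heights / SD_REFERENCE_HEIGHTS)

def distance_from_visibels(vb):
    """Inverse of the assumed mapping, in screen heights."""
    return SD_REFERENCE_HEIGHTS * 10.0 ** (vb / 20.0)

sd = visibels_from_distance(7.0)   # 0.0
hd = visibels_from_distance(3.5)   # about -6.02
```

Under this assumption a viewing distance of exactly 4 screen heights works out to slightly less than −5 vB, which is why the text treats the HD figure as approximate.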

Video optimization
The ZPEG pre-processor optimizes motion video sequences for compression by existing motion estimation-based video compressors, such as Advanced Video Coding (AVC, H.264) and High Efficiency Video Coding (HEVC, H.265). The human visual acuity model is converted into quantizers that are applied directly to a three-dimensional transformed block of the motion video sequence, followed by an inverse quantization step using the same quantizers. The motion video sequence returned from this process is then used as input to the existing compressor.
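The quantization and inverse quantization round trip described above can be sketched as follows. The flat list of coefficients and the quantizer values are hypothetical; in practice the quantizers would come from the human visual model and would be applied to a full three-dimensional block of DCT coefficients.

```python
def quantize_dequantize(coeffs, quantizers):
    """Snap each coefficient to the nearest multiple of its quantizer.

    Quantizing (divide and round) and then dequantizing (multiply back)
    zeroes any coefficient smaller than about half its quantizer step.
    """
    return [q * round(c / q) for c, q in zip(coeffs, quantizers)]

# Hypothetical coefficients, ordered from low to high frequency, with
# coarser (larger) quantizers for the less visible high frequencies.
coeffs = [80.0, 3.2, -1.4, 0.6]
quantizers = [1.0, 2.0, 4.0, 8.0]
cleaned = quantize_dequantize(coeffs, quantizers)
# cleaned == [80.0, 4.0, 0.0, 0.0]
```

An inverse transform of the cleaned coefficients would then yield the pre-processed video that is handed to the downstream compressor.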

Compression boost strength
The application of Human Visual System-generated quantizers to a block-based Discrete Cosine Transform results in increased compressibility of a motion video stream by removing imperceptible content from the stream. The result is a curated stream from which fine spatial and temporal details have been removed, details that the compressor would otherwise be required to reproduce. The curated stream also produces better matches for motion estimation algorithms. The quantizers are generated to be imperceptible at a specified viewing distance, expressed in visiBels. Typical pre-processing viewing conditions in common use are:
 * Standard Definition (SD) video is processed at −6 vB.
 * High-definition video (HD) is processed at −12 vB.
 * Ultra-High Definition video (UHD, 4K) is processed at −12 vB.
 * Immersive Ultra-High Definition video (Virtual Reality) is processed at −18 vB.

Average compression savings for 6 Mbps HD video using the x264 codec when processed at −12 vB is 21.88%. Average compression savings for 16 Mbps Netflix 4K test suite video using the x264 codec processed at −12 vB is 29.81%. The same Netflix test suite when compressed for immersive viewing (−18 vB) generates a 25.72% savings. These results are reproducible through use of a publicly accessible test bed.

Deblocking
While the effects of ZPEG pre-processing are imperceptible to the average viewer at the specified viewing distance, edge effects introduced by block-based transform processing still affect the performance advantage of the video optimization process. While existing deblocking filters may be applied to improve this performance, optimal results are obtained through use of a multi-plane deblocking algorithm. Each plane is offset by one-half the block size in each of four directions, such that the offset of the plane is one of (0,0), (0,4), (4,0), and (4,4) in the case of 8×8 blocks and four planes. Pixel values are then chosen according to their distance to the block edge, with interior pixel values being preferred to boundary pixel values. The resulting deblocked video generates substantially better optimization over a wide range of pre-processing strengths.
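One way to picture the multi-plane selection is sketched below for 8×8 blocks and the four offsets listed above. For each pixel, the sketch picks the plane in which that pixel lies deepest inside its block. Treating the offsets as (row, column) pairs, breaking ties in favour of the first plane, and the function names are all assumptions; the source does not specify them.

```python
BLOCK = 8
OFFSETS = [(0, 0), (0, 4), (4, 0), (4, 4)]  # assumed (row, column) grid shifts

def edge_distance(coord, offset, block=BLOCK):
    """Distance from a coordinate to the nearest edge of its block
    when the block grid is shifted by `offset`."""
    pos = (coord - offset) % block  # position inside the block, 0..block-1
    return min(pos, block - 1 - pos)

def best_plane(y, x):
    """Index of the plane in which pixel (y, x) is deepest in a block interior."""
    def depth(offset):
        oy, ox = offset
        return min(edge_distance(y, oy), edge_distance(x, ox))
    return max(range(len(OFFSETS)), key=lambda i: depth(OFFSETS[i]))

def merge_planes(planes, height, width):
    """Build the output frame by taking each pixel from its best plane."""
    return [
        [planes[best_plane(y, x)][y][x] for x in range(width)]
        for y in range(height)
    ]

# Hypothetical demo: fill plane i with the constant value i so the merged
# frame shows which plane supplied each pixel.
planes = [[[i] * 8 for _ in range(8)] for i in range(4)]
merged = merge_planes(planes, 8, 8)
```

In the demo, the pixel at (0, 0) sits on a block corner of the unshifted plane, so it is taken from the plane offset by (4, 4), where it lies in a block interior.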

Real-time video compression
Conventional motion compression solutions are based on motion estimation technology. While some transform-domain video codec technologies exist, ZPEG is based on the three-dimensional Discrete Cosine Transform (DCT), where the three dimensions are pixel within line, line within frame, and temporal sequence of frames. The extraction of redundant visual data is performed by the computationally-efficient process of quantization of the transform-domain representation of the video, rather than the far more computationally expensive process of searching for object matches between blocks. Quantizer values are derived by applying a Human Visual Model to the basis set of DCT coefficients at a pre-determined perceptual processing strength. All perceptually redundant information is thereby removed from the transform domain representation of the video. Compression is then performed by an entropy removal process.

Quantization
Once the viewing conditions have been chosen under which the compressed content is to be viewed, a Human Visual Model generates quantizers for application to the three-dimensional Discrete Cosine Transform (DCT). These quantizers are tuned to remove all imperceptible content from the motion video stream, greatly reducing the entropy of the representation. The viewing conditions expressed in visiBels and the correlation of pixels before transformation are generated for reference by the entropy encoder.

Context-driven entropy coding
While quantized DCT coefficients have traditionally been modeled as Laplace distributions, more recent work has suggested that the Cauchy distribution better models the quantized coefficient distributions. The ZPEG entropy encoder encodes quantized three-dimensional DCT values according to a distribution that is completely characterized by the quantization matrix and the pixel correlations. This side-band information, carried in the compressed stream, enables the decoder to synchronize its internal state to the encoder.

Subband decomposition
Each DCT band is entropy coded separately from all other bands. These coefficients are transmitted in band-wise order, starting with the DC component, followed by the successive bands in order of low resolution to high, similar to wavelet packet decomposition. Following this convention ensures that the receiver will always receive the maximum possible resolution for any bandpass pipe, enabling a no-buffering transmission protocol.
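The band-wise transmission order can be sketched as below. The source does not define how coefficients are grouped into bands, so this example assumes, purely for illustration, that the band of a coefficient is the largest of its three frequency indices; the function names are likewise hypothetical.

```python
def band_of(t, y, x):
    """Assumed band index: the highest frequency index of the coefficient."""
    return max(t, y, x)

def bandwise_order(size):
    """Coordinates of a size**3 coefficient block, lowest band first."""
    coords = [
        (t, y, x)
        for t in range(size)
        for y in range(size)
        for x in range(size)
    ]
    # Sort by band, then by coordinate for a deterministic order within a band.
    return sorted(coords, key=lambda c: (band_of(*c), c))

order = bandwise_order(4)
# order[0] is the DC coefficient (0, 0, 0); band 1 follows with 7 coefficients.
```

With this grouping, a receiver that stops after band k has every coefficient needed to reconstruct the block at (k+1)/size of full resolution in each dimension.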

Subjective quality metrics
The gold standard measure of perceived quality difference between a reference video and its degraded representation is defined in ITU-R Recommendation BT.500. The double-stimulus continuous quality-scale (DSCQS) method rates the perceived difference between the reference and distorted videos to create an overall difference score derived from individual scores ranging from −3 to 3:
 * −3: impaired video is much worse.
 * −2: impaired video is worse.
 * −1: impaired video is slightly worse.
 * 0: videos are the same.
 * 1: impaired video is slightly better.
 * 2: impaired video is better.
 * 3: impaired video is much better.

By analogy with the single-stimulus continuous quality-scale (SSCQS) normalized metric Mean Opinion Score (MOS), the overall DSCQS score is normalized to the range (−100, 100) and is termed the Differential Mean Opinion Score (DMOS), a measure of subjective video quality. An ideal objective measure will correlate strongly with the DMOS score when applied to a reference/impaired video pair. A survey of existing techniques and their overall merits may be found on the Netflix blog. ZPEG extends the list of available techniques by providing a subjective quality metric generated by comparing the Mean Squared Error metric of the difference between the reference and impaired videos after pre-processing at various perceptual strengths (in visiBels). The effective viewing distance at which the impairment difference is no longer perceivable is reported as the impairment metric.
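The sweep over perceptual strengths can be sketched as follows. The `preprocess` callable stands in for the ZPEG pre-processor and is hypothetical, as are the threshold, the scan order (closest viewing distance first), and the toy pre-processor used in the demonstration; none of these are specified by the source.

```python
def mse(a, b):
    """Mean squared error between two equal-length sample sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)

def impairment_strength(reference, impaired, preprocess, strengths, threshold=1e-9):
    """Return the first strength (in vB) at which the pre-processed reference
    and impaired signals no longer differ, or None if they always differ."""
    for vb in strengths:
        if mse(preprocess(reference, vb), preprocess(impaired, vb)) <= threshold:
            return vb
    return None

def toy_preprocess(samples, vb):
    """Toy stand-in: a uniform quantizer whose step doubles every 6 vB."""
    step = 2.0 ** ((vb + 18) / 6)
    return [step * round(s / step) for s in samples]

reference = [16.0, 32.0, 48.0]
impaired = [16.0, 33.4, 48.0]
score = impairment_strength(reference, impaired, toy_preprocess, [-18, -12, -6, 0])
# score == -6: the toy impairment disappears once pre-processing reaches -6 vB.
```

A real implementation would operate on full video sequences and would likely interpolate between tested strengths rather than report only the grid values.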

Format conversion
Statistically ideal format conversion is done by interpolation of video content in Discrete Cosine Transform space. The conversion process, particularly in the case of up-sampling, must consider the ringing artifacts that occur when abrupt discontinuities take place in a sequence of pixels being re-sampled. The resulting algorithm can down-sample or up-sample video formats by changing the frame dimensions, pixel aspect ratio, and frame rate.
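A minimal one-dimensional sketch of re-sampling in DCT space is shown below: the signal is transformed, its coefficient list is zero-padded (up-sampling) or truncated (down-sampling), and the result is inverse transformed at the new length. This is a generic textbook technique presented as an illustration, not the ZPEG algorithm; in particular it makes no attempt to suppress the ringing artifacts mentioned above, and the function names are assumptions.

```python
import math

def _scale(k, n):
    """Orthonormal DCT scaling factor for coefficient k of n."""
    return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

def dct(x):
    """Orthonormal DCT-II."""
    n = len(x)
    return [
        _scale(k, n)
        * sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]

def idct(c):
    """Inverse of the orthonormal DCT-II (an orthonormal DCT-III)."""
    n = len(c)
    return [
        sum(_scale(k, n) * c[k] * math.cos(math.pi * (i + 0.5) * k / n) for k in range(n))
        for i in range(n)
    ]

def resample(x, new_len):
    """Re-sample x to new_len samples by padding or truncating its DCT."""
    n = len(x)
    c = dct(x)
    if new_len >= n:
        c = c + [0.0] * (new_len - n)  # up-sample: no new high frequencies
    else:
        c = c[:new_len]                # down-sample: drop high frequencies
    gain = math.sqrt(new_len / n)      # keep sample amplitudes unchanged
    return idct([gain * v for v in c])

up = resample([5.0, 5.0, 5.0, 5.0], 8)   # eight samples, each about 5.0
down = resample([5.0] * 8, 4)            # four samples, each about 5.0
```

The same operation applied along the horizontal, vertical, and temporal axes would change frame width, frame height, and frame rate respectively.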