Steered Mixture-of-Experts
For Image Coding

"A Universal Image Coding Approach Using Sparse Steered Mixture-of-Experts Regression"

Verhack, R., Sikora, T., Lange, L., Van Wallendael, G., and Lambert, P.

Our challenge is the design of a "universal", bit-efficient image compression approach. The prime goal is to allow high-quality reconstruction of images. In addition, we attempt to make the coder and decoder "universal", such that MPEG-7-like low- and mid-level descriptors are an integral part of the coded representation.


To this end, we introduce a sparse Mixture-of-Experts regression approach for coding images in the pixel domain. The underlying stochastic process of the pixel amplitudes is modelled as a 3-dimensional, multi-modal Mixture-of-Gaussians with K modes. This closed-form, continuous analytical model is estimated using the Expectation-Maximization algorithm and describes segments of pixels by local 3-D Gaussian steering kernels with global support. As such, each component in the mixture of experts steers along the direction of highest correlation. The conditional density then serves as the regression function.

Experiments show that a considerable compression gain over JPEG is achievable at low bitrates for a large class of images, while the model also yields attractive low-level descriptors for the image, such as the local segmentation boundaries, the direction of intensity flow, and the distribution of these parameters over the image.

Small example

Let's have a closer look at what our modeling actually does by following an example. The next figure shows a 32-by-32 pixel crop from the well-known Lena image. The original was coded with JPEG and with SMoE (using 10 components) with the same bit budget (45 bytes). The JPEG header was left out to ensure a fair comparison.

[Figure, three panels: The original, JPEG, SMoE]

Notice how well the dominant edges are reconstructed by SMoE, compared to JPEG. OK, so how do we get to this reconstruction?

[Figure, left to right: Mixture Model, Top view, Softmax]

We represent the image as a joint probability density function (pdf) over the 3-D space (2-D pixel location, 1-D gray value). This allows you to pose the following query: "Give me the most probable luminance for a pixel at location X". In our case, we assume that the model is a Mixture-of-Gaussians, meaning that it is a network of single Gaussian distributions, or components. In the figure, we have a network of 10 components.

The second image depicts the same model, projected onto the pixel domain. You can see that it identifies the main regions. It only becomes really interesting in the right image above, though, because of the softmax function used at reconstruction time. For every pixel, the influence of each component is assessed, while the influences of all components on that pixel sum to one. Consequently, every pixel is assured to be covered, i.e. the reconstruction is assured to be global.
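In symbols, the gating can be sketched as follows. This is the standard form of Gaussian-mixture regression written in assumed notation (the symbols are chosen here for illustration and may differ from those in the paper [1]): x is the 2-D pixel location, y the gray value, \pi_j the mixing weight of component j, and \mu_j, \Sigma_j its 3-D center and 3x3 covariance, split into spatial (X) and gray-value (Y) parts.

```latex
\mu_j = \begin{pmatrix} \mu_{X_j} \\ \mu_{Y_j} \end{pmatrix}, \qquad
\Sigma_j = \begin{pmatrix} \Sigma_{X_j X_j} & \Sigma_{X_j Y_j} \\
                           \Sigma_{Y_j X_j} & \Sigma_{Y_j Y_j} \end{pmatrix}

w_j(x) = \frac{\pi_j \, \mathcal{N}(x;\, \mu_{X_j}, \Sigma_{X_j X_j})}
              {\sum_{i=1}^{K} \pi_i \, \mathcal{N}(x;\, \mu_{X_i}, \Sigma_{X_i X_i})},
\qquad \sum_{j=1}^{K} w_j(x) = 1

m_j(x) = \mu_{Y_j} + \Sigma_{Y_j X_j} \Sigma_{X_j X_j}^{-1} (x - \mu_{X_j})

\hat{y}(x) = \mathbb{E}[\, Y \mid X = x \,] = \sum_{j=1}^{K} w_j(x) \, m_j(x)
```

The weights w_j(x) are a softmax over the log-densities of the spatial marginals, which is why they sum to one at every pixel, and each m_j(x) is a linear function of the pixel location.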

Every component describes a gradient. This can clearly be seen on the right side of the image, where the whole bottom-right region is approximated by just one large component. Our reconstruction is thereby a very careful arrangement of gradients. This is why we refer to the system as a Mixture-of-Experts: each component acts as an expert for its softmaxed region. Our approach thus requires the transmission of those 10 components. Each component is described by its 3-D center and its 3x3 covariance matrix.
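The reconstruction described above can be sketched in a few lines of NumPy. This is a minimal sketch of Gaussian-mixture regression with softmax gating, not the authors' implementation: the function name smoe_reconstruct, the (x, y, gray) parameter layout, and the explicit mixing weights (priors) are assumptions made for illustration.

```python
import numpy as np

def smoe_reconstruct(centers, covs, priors, height, width):
    """Evaluate a mixture-of-experts regression on a height-by-width grid.

    centers: (K, 3) array, one row (x, y, gray) per component (assumed layout)
    covs:    (K, 3, 3) array of full covariance matrices
    priors:  (K,) array of mixing weights (assumed to be part of the model)
    Returns the reconstruction (height, width) and the gates (K, height, width).
    """
    ys, xs = np.mgrid[0:height, 0:width]
    pix = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)  # (N, 2)

    num = centers.shape[0]
    log_gate = np.empty((num, pix.shape[0]))
    expert = np.empty((num, pix.shape[0]))
    for j in range(num):
        mu_x, mu_y = centers[j, :2], centers[j, 2]
        s_xx = covs[j, :2, :2]   # spatial covariance
        s_yx = covs[j, 2, :2]    # gray-value / location cross-covariance
        d = pix - mu_x
        sol = np.linalg.solve(s_xx, d.T)           # s_xx^{-1} (x - mu_x), (2, N)
        maha = np.sum(d.T * sol, axis=0)           # squared Mahalanobis distance
        _, logdet = np.linalg.slogdet(s_xx)
        # Log of prior times 2-D Gaussian; the shared 2*pi factor cancels.
        log_gate[j] = np.log(priors[j]) - 0.5 * (maha + logdet)
        # Linear expert: a plane (gradient) through the component center.
        expert[j] = mu_y + s_yx @ sol

    # Softmax over components, computed stably in the log domain.
    log_gate -= log_gate.max(axis=0, keepdims=True)
    gate = np.exp(log_gate)
    gate /= gate.sum(axis=0, keepdims=True)

    recon = np.sum(gate * expert, axis=0)
    return recon.reshape(height, width), gate.reshape(num, height, width)

# Toy usage with two hypothetical components on an 8x8 grid.
toy_centers = np.array([[2.0, 2.0, 60.0], [6.0, 5.0, 180.0]])
toy_covs = np.array([[[4.0, 0.0, 3.0], [0.0, 4.0, 0.0], [3.0, 0.0, 20.0]],
                     [[5.0, 1.0, 0.0], [1.0, 5.0, 2.0], [0.0, 2.0, 20.0]]])
toy_priors = np.array([0.5, 0.5])
toy_recon, toy_gate = smoe_reconstruct(toy_centers, toy_covs, toy_priors, 8, 8)
```

The gates returned alongside the reconstruction are the per-pixel influences of the components; they sum to one at every pixel, so every pixel is covered.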

[Figure, three panels: Segmentation, Intensity Flow, Edges]

Most current standards are blind, which means that they operate on the lowest possible computer vision level, that of pixels. The pixels are not grouped by any underlying semantics. Consider the three images above. Our model has a mid-level understanding of the image, as it identifies segments in the image. Because of the structure of our model, we are able to infer a variety of extra visual information about the image, e.g. the intensity flow (also: local orientation), segmentation, edge orientation, and edge strengths. These are the basic building blocks for many computer vision approaches, such as image retrieval or comparison. They are freely available at the decoder, right in the pixel code.
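One plausible way to read such descriptors directly from the model parameters is sketched below. The definitions used here are assumptions for illustration and may differ from those in the paper [1]: the segmentation is taken as the component with the largest softmax gate per pixel, the intensity flow as the gradient direction of that component's linear expert, and edges as pixels where the segment label changes. The function names and the (K, H, W) gate layout are hypothetical.

```python
import numpy as np

def expert_gradients(covs):
    """Slope (d gray / dx, d gray / dy) of each component's linear expert.

    covs: (K, 3, 3) covariances over (x, y, gray) (assumed layout).
    """
    return np.stack([np.linalg.solve(c[:2, :2], c[:2, 2]) for c in covs])

def descriptors(gate, covs):
    """Derive simple descriptors from per-pixel gates of shape (K, H, W)."""
    labels = np.argmax(gate, axis=0)              # segmentation map, (H, W)
    flow = expert_gradients(covs)[labels]         # gradient per pixel, (H, W, 2)
    orientation = np.arctan2(flow[..., 1], flow[..., 0])  # angle in radians

    # Mark a pixel as an edge if its right or lower neighbour has another label.
    edges = np.zeros(labels.shape, dtype=bool)
    edges[:, :-1] |= labels[:, :-1] != labels[:, 1:]
    edges[:-1, :] |= labels[:-1, :] != labels[1:, :]
    return labels, orientation, edges
```

No pixel data is needed for any of these steps; everything is computed from the transmitted component parameters and the gates they induce.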

For a more in-depth, mathematical discussion of the framework, we refer to the paper [1].

Examples

JPEG vs SMoE

Let's start off with a comparison of the well-known Peppers image at the same bpp (bits-per-pixel).

[Figure: Peppers coded with JPEG and SMoE at 0.14 bpp and at 0.45 bpp]

It is clear that at low bitrates SMoE reconstructs the dominant edges very well, while JPEG suffers from heavy block artefacts. At high bitrates, SMoE suffers from the fact that it does not have enough components to model the noise-like detail completely. By this we mean that the model would need one component per noise speckle to achieve an exact reconstruction. This is infeasible, as the cost per component is relatively high. Nonetheless, this can be mitigated by extending our approach with some form of texture coding. This will be the logical next step in our framework.

Let's have a look at the descriptors for this Peppers example for the lowest bitrate.

[Figure, three panels: Segmentation, Edges, Intensity flow]

Lena

SMoE: 0.15 bpp, 26.93 dB PSNR, 0.78 SSIM
JPEG: 0.14 bpp, 24.83 dB PSNR, 0.67 SSIM

Cameraman

SMoE: 0.25 bpp, 28.8 dB PSNR, 0.866 SSIM
JPEG: 0.24 bpp, 31.3 dB PSNR, 0.87 SSIM

Computationally efficient reconstruction (*)

The normal reconstruction consists of calculating a weighted sum over all components per pixel. Some observations:

  1. As every pixel is calculated independently of the others, this allows for massive parallelization, e.g. on GPUs.
  2. In theory, every component has global support (every component has an influence on every pixel), but in practice only nearby components have an observable influence. Consequently, we can limit the range of the sum to cover only nearby components, based on the pixel location and the component centers.

Based on these observations, we developed a local reconstruction, in which we process the reconstruction in blocks of pixels. We define a relevance window around each block. Only components in this relevance window are considered during the reconstruction of that block. From experiments we have seen that this local reconstruction method speeds up the reconstruction immensely for large images.

(*) This part is not tackled in the paper [1].

Video and higher dimensional visual data

An extension to video has been made; more information can be found in [2] or at http://www.nue.tu-berlin.de/research/smoefvc.

Note that this technique is readily extendable to higher-dimensional visual data. The key advantage of SMoE is that it treats inter-dimensional correlation as just correlation: it does not distinguish between the spatial correlation between pixels and the temporal correlation. Consequently, it scales flexibly in dimensionality.

Code

Code will be published online as soon as the paper is formally accepted. For now, please contact me.

References

[1] Verhack, R., Sikora, T., Lange, L., Van Wallendael, G., and Lambert, P. (2016). "A Universal Image Coding Approach Using Sparse Steered Mixture-of-Experts Regression". Under submission for IEEE International Conference on Image Processing, ICIP 2016.

[2] Lange, L., Verhack, R., and Sikora, T. (2016). "Video Representation and Coding using a Sparse Steered Mixture-of-Experts Network". Under submission for IEEE International Conference on Image Processing, ICIP 2016.