System for Coding Video Conference Sequences Based on Vector Quantization

Abstract
The goal of our project was to build a system for compressing video-conferencing sequences by the method of vector quantization. This approach combines the ability to achieve a high compression ratio with low distortion and a relatively simple implementation.
In the first stage of the project we built a general-purpose vector quantizer that can be used as a stand-alone system for different quantization tasks. In the second stage we built a system specially designed to give good performance in the compression of videoconference sequences. We used the ideas of localized multi-level quantization proposed in the article "Localized Compression of Video Sequences" by M. Porat, IEEE International Conference on Image Processing. We then studied the behavior and performance of the system (compression ratio, distortion, running time, levels actually used) on different sequences and with different distortion measures. Using these results we designed three motion-based subsystems that increased the compression ratio of the system while providing a significant speedup. We tested the system on different videoconference sequences to find the optimal parameter setup. On the "Miss America" sequence we obtained a compression ratio of over 200 with good PSNR values (over 37 dB) and good visual quality.
The system is implemented as an MFC-based object-oriented C++ application. It provides a simple GUI and can serve as a useful API for other applications dealing with compression, vector quantization, image processing and computer vision tasks.

System description
The system is built from three main components:

1. General-purpose vector quantizer.
2. The system for localized compression of video sequences, based on the ideas presented in the article "Localized Compression of Video Sequences" by M. Porat, IEEE International Conference on Image Processing.

3. Sub-systems that improve performance by using motion analysis for three main tasks:

3.1. Definition of the codebook size for each superblock.

3.2. Finding blocks where no motion occurred, so that they need not be transmitted again and a block from the receiver's memory can be used instead.

3.3. Definition of relations between codebooks caused by motion of the pixels to which a codebook originally belonged.

1.General purpose Vector Quantizer
The vector quantizer produces a 2^N-level reproduction alphabet according to the algorithms for optimal vector quantizer design described in part A of this work: we use the "splitting" algorithm to obtain an initial alphabet with 2^I (I = 1..N) levels, and then run the algorithm for calculation of the optimal alphabet with the average distortion function:
D = (1/n) * sum_{j=1..n} d( x_j, q(x_j) )        (1)

as described for an unknown input distribution. The resulting alphabet is then split again to obtain the initial alphabet of the next level.
The procedure is repeated until the 2^N-level optimal reproduction alphabet is obtained. We used the k-means method, as described in the ISODATA algorithm, to produce the optimal partition for each given alphabet:

· The initial alphabet is used as a set of "bins" into which all training-sequence vectors are placed.

· Each training-sequence vector is placed into the bin for which the distortion between the vector and the bin's centroid is minimal.

· The centroids of the bins, computed after all training-sequence vectors have been placed, form the new optimal alphabet for the resulting optimal partition.

This final alphabet is used as the codebook for coding data outside the training sequence (each input vector is assigned its reproduction codeword according to the "nearest neighbor" rule).

While working with pixel vectors in video processing we used the splitting constant Epsilon = 12 and the squared-error distortion measure:

d(x, y) = sum_{i} ( x_i - y_i )^2        (2)

2. The system for localized compression of video sequence
We implemented the system using the ideas presented in the article "Localized Compression of Video Sequences" by M. Porat, IEEE International Conference on Image Processing.

We used frame packets of 4 frames; superblocks were therefore 32x32x4 pixels in size. We also used 4 levels of codeword vector sizes: 8x8x4, 4x4x4, 2x2x4 and 1x1x4. The system receives its input as non-compressed BMP files, which are compressed in the transmitter object by the method described in the article. The compressed data is then decompressed in the receiver object, and the reconstructed frames are returned in PPM format. The whole process is controlled by routines that produce statistical data about the system's work and make it possible to run experiments and evaluate changes in the system. We used the following PSNR definition for numerical evaluation of decompressed image quality:

Assume we are given a source image f(i,j) that contains N by N pixels, and a reconstructed image F(i,j), where F is obtained by decoding the encoded version of f(i,j). First, compute the mean squared error (MSE) of the reconstructed image:

MSE = (1/N^2) * sum_{i,j} [ f(i,j) - F(i,j) ]^2        (3)

The summation is over all pixels. The root mean squared error (RMSE) is the square root of MSE.

PSNR in decibels (dB) is then computed as:

PSNR = 20 * log10( 255 / RMSE )        (4)

3. Sub-systems that improve performance in videoconference coding applications by using motion analysis
In most typical videoconference sequences we can find the following common features:

·Static (or very slowly changing) background.

· Information-carrying motion, usually caused by the persons participating in the videoconference.

· Fluctuations caused by small camera motion or illumination changes, which do not carry any important information.

· Different areas of the frame differ both in their rate of motion and in how important that motion is to the visual quality of the image. For example, a speaking person's face contains a lot of important motion, the body moves less, and the motion of flat surfaces (without edges) matters much less to the visual quality.

We tried to use these features to improve both the compression ratio and the visual and numerical quality, and to speed up the system. We designed three sub-systems to achieve these goals.

3.1 The first system assigns a codebook size N specific to each superblock

The size is obtained by analyzing the "amount of motion" inside each superblock through the training-sequence frames. The "amount of motion" is defined as the sum of the changes in pixel intensity over the superblock area (32x32) through the training-sequence frames:

amount_of_motion = sum over superblock pixels { sum over frames [ change(x,y,t) ] }

The intensity changes are obtained by background subtraction:

change(x,y,t) = abs { frame(x,y,t) – background(x,y,t) }

The background is initialized to the first frame and then updated as follows:

background(x,y,t) = alpha* background(x,y,t) + (1-alpha)*frame(x,y,t)

alpha = 0.33
The more "amount of motion" a superblock has, the more levels its codebook receives. Currently we enable codebooks with N = 8, 16, 32, 64, 128, 256, 512 and 1024. Typically, superblocks that belong to the background receive N = 8–32, superblocks on the human face N = 512–1024, and superblocks on the human body N = 64–256.

Figure 5: codebook sizes assigned to the superblocks (color-coded).

Codebook sizes legend:
White = 512, 1024
Blue = 256
Yellow = 128
Red = 64
Green = 32
Black = 16
Gray = 8

3.2 Finding blocks where no motion occurred, so that they need not be transmitted again and a block from the receiver's memory can be used instead

We have added to the receiver a "memory" that stores the images obtained by decompressing the last transmitted frame packet.

On the transmitter side we have a system that detects superblocks inside which no significant motion has occurred since the last frame packet. When such a block is found, instead of transmitting its codebook indices we transmit only an order to the receiver to fetch this block from memory. The procedure works on the biggest blocks available in the system (8x8x4).

The decision is made with the help of background subtraction, using the following procedure:

background = last frame of previous frame packet

for each block {
    counter = 0
    for each pixel (x,y) in block {
        for each frame t in the frame packet {
            if ( abs( frame(x,y,t) - background(x,y) ) > threshold1 ) { counter = counter + 1 }
        }
    }
    if ( counter < threshold2 ) { there is no significant motion in the block }
}

3.3 Definition of relations between codebooks caused by motion of the pixels to which a codebook originally belonged
If a suitable codeword is found in the same non-local codebook a predefined number of times, the local codebook "remembers" (both in the receiver and in the transmitter) the index of the codebook where the suitable codewords were found (in addition to the regular codebook-updating procedure). In this way we obtain codebooks that are able, to some degree, "to follow" the motion of the object part they belonged to in the training sequence. We call the codebook found by this algorithm the "Good Neighbor" of the current superblock. To use this feature, a special bit is added to the codeword format; it orders the receiver to take the codeword from the codebook found by this algorithm for this superblock, instead of transmitting the index of the target codebook. This algorithm is used for the 8x8x4, 4x4x4 and 2x2x4 vector codebooks.

In addition to all these systems, we update the codebooks after each frame packet transmission. In order to do so, we decompress the transmitted data not only in the receiver but also in the transmitter. The pixels received in each superblock from decoding at all levels are then used to create the vectors that serve as input for this superblock's codebook-updating procedure. The updating algorithm is therefore:

· Obtain the superblock's pixels by decoding the data transmitted for this superblock.

· Create from these pixels vectors of all levels (8x8x4 down to 1x1x4) to use as the input sequence for the updating procedure.

· Use the current codebook as the "bins".

· For each vector of the input sequence, find the bin with minimal distortion between the vector and the bin's centroid, and add the vector to this bin. This yields an optimal partition of the input sequence.

· Find the optimal alphabet for this partition by calculating the centroids of all resulting bins. These centroids form the updated codebook. Bins that contain no input-sequence vectors leave their codewords unchanged, while the other codewords vary slightly to adjust the codebook to the changes.


Results
1. We have tested the system on different video sequences in order to estimate its performance in different setups.
We checked the PSNR and visual quality of the decoded sequences, the relation between distortion and compression ratio, the efficiency of each part of the system, and different parameter setups for different conditions in each input sequence.
In our tests we required PSNR values of over 35 dB.
We reached a compression ratio of over 211 with PSNR > 37 dB when coding the "Miss America" sequence (frames 29–65):

Figures 7-8: initial (frame 36) and decoded frame.
Figures 9-10: initial (frame 57) and decoded frame.

The statistics routines provided the following data for this sequence:

Big Blocks: 2840
Medium Blocks: 2877
Small Blocks: 4156
Scalar Blocks: 640
NoMotion Blocks: 10427
GoodNeighbours Used: 101

Average Rate = 211.7495
Average Distortion = 37.4752 dB

Figures 11-12: compression ratio and PSNR per frame packet.
As expected, in the frame packets where a lot of motion occurred, both the compression ratio and the PSNR decreased, but they always remained above 100 and 35 dB, respectively.

Acknowledgments
We would like to thank our supervisor Furman Dimitry for his support and guidance throughout this project. We would also like to thank the Ollendorff Minerva Center, which supported this project.