Goluck KONUKO

PhD Student

L2S, CentraleSupélec
3 rue Joliot Curie
91190 Gif-sur-Yvette, France

Publications

MULTI-REFERENCE GENERATIVE FACE VIDEO COMPRESSION WITH CONTRASTIVE LEARNING (MMSP 2024)

Generative face video coding (GFVC) has been demonstrated as a promising approach to low-latency, low-bitrate video conferencing. GFVC frameworks achieve extreme gains in coding efficiency, with over 70% bitrate savings compared to conventional codecs at bitrates below 10 kbps. In recent MPEG/JVET standardization efforts, all the information required to reconstruct video sequences with GFVC frameworks has been adopted as part of the supplemental enhancement information (SEI) in existing compression pipelines. In light of this development, we address a challenge that was only weakly addressed in prior GFVC frameworks: reconstruction drift as the distance between the reference and target frames increases. This drift forces the reference buffer to be updated more frequently by transmitting more intra refresh frames, which are the most expensive element of the GFVC bitstream. To overcome this problem, we instead propose multi-reference animation as a robust approach to minimizing reconstruction drift, especially when used in a bi-directional prediction mode. Further, we propose a contrastive learning formulation for multi-reference animation and observe that the contrastive framework enhances the representation capabilities of the animation generator. The resulting framework, MRDAC (Multi-Reference Deep Animation Codec), can therefore compress longer sequences with fewer reference frames, or achieve a significant gain in reconstruction accuracy at bitrates comparable to previous frameworks. Quantitative and qualitative results show significant coding and reconstruction quality gains over previous GFVC methods, and more accurate animation in the presence of large pose and facial expression changes.
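To make the contrastive formulation concrete, below is a minimal, hypothetical sketch of an InfoNCE-style objective that pulls the animated frame's features toward the ground-truth target and away from distractor frames. All names, shapes, and the choice of negatives are illustrative assumptions, not the MRDAC implementation.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(animated_feat, target_feat, negative_feats, temperature=0.07):
    """InfoNCE-style contrastive loss (illustrative).

    animated_feat:  (B, D) features of the animated/reconstructed frame
    target_feat:    (B, D) features of the ground-truth target frame (positive)
    negative_feats: (B, N, D) features of distractor frames (negatives)
    """
    q = F.normalize(animated_feat, dim=-1)
    k_pos = F.normalize(target_feat, dim=-1)
    k_neg = F.normalize(negative_feats, dim=-1)

    # Positive logit: similarity between animated and ground-truth features.
    l_pos = (q * k_pos).sum(dim=-1, keepdim=True)            # (B, 1)
    # Negative logits: similarity to the distractors.
    l_neg = torch.einsum("bd,bnd->bn", q, k_neg)             # (B, N)

    logits = torch.cat([l_pos, l_neg], dim=1) / temperature  # (B, 1 + N)
    # The positive is always at index 0.
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```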

IMPROVED PREDICTIVE CODING FOR ANIMATION-BASED VIDEO COMPRESSION (EUVIP 2024)

This paper addresses the limitations of generative face video compression (GFVC) under substantial head movement and complex facial deformations. Previous GFVC frameworks focus on perceptual compression and reconstruct videos solely with perceptual quality in mind; as a result, they often show a large disparity relative to conventional codecs when evaluated for pixel fidelity. We propose a robust learned predictive coding framework aiming for both perceptual quality and improved pixel fidelity under low-bitrate conditions. Our method uses a dual residual learning strategy: it learns the frame residual between the animated frame and the ground truth (spatial residual coding) and further exploits redundancies between neighboring frame residuals (temporal residual coding). We specifically formulate low-bitrate conditional residual coding mechanisms for both the spatial and temporal residuals. In addition, we propose a zero-cost residual alignment mechanism to refine the prediction accuracy of the frame residuals. Through end-to-end optimization, the proposed framework achieves a balance between perceptual quality, pixel fidelity, and compression efficiency. Experimental evaluations on the test sequences and conditions proposed in JVET-AH0114 show significant gains over the HEVC and VVC standards in terms of perceptual metrics. Compared to other GFVC frameworks, the proposed framework achieves state-of-the-art performance on both perceptual and pixel-fidelity metrics, and it is competitive with HDAC, HEVC, and VVC in terms of pixel fidelity at low bitrates.
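As a rough illustration of the dual residual idea, the sketch below forms the spatial residual between the animation-based prediction and the ground truth, and the temporal residual between consecutive frame residuals. Function names, tensor layouts, and the decoder-side reconstruction comment are assumptions for exposition, not the paper's implementation.

```python
def dual_residuals(animated, target, prev_residual=None):
    """Form the spatial and temporal residuals (illustrative sketch).

    animated, target, prev_residual: tensors/arrays of shape (C, H, W), assumed.
    """
    # Spatial residual coding: what the animation predictor failed to capture.
    spatial_residual = target - animated
    if prev_residual is None:
        # First frame after the reference: only the spatial residual is available.
        return spatial_residual, spatial_residual
    # Temporal residual coding: exploit redundancy between consecutive residuals.
    temporal_residual = spatial_residual - prev_residual
    return spatial_residual, temporal_residual

# Decoder-side reconstruction would mirror these steps, e.g.
#   residual_hat = prev_residual_hat + decoded_temporal_residual
#   frame_hat    = animated + residual_hat
```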

PREDICTIVE CODING FOR ANIMATION-BASED VIDEO COMPRESSION (ICIP 2024)

We address the problem of efficiently compressing video for conferencing-type applications. We build on recent approaches based on image animation, which can achieve good reconstruction quality at very low bitrates by representing face motion with a compact set of sparse keypoints. However, these methods encode video in a frame-by-frame fashion, i.e., each frame is reconstructed from a reference frame, which limits the reconstruction quality when more bandwidth is available. Instead, we propose a predictive coding scheme which uses image animation as a predictor and codes the residual with respect to the actual target frame. The residuals can in turn be coded in a predictive manner, thus efficiently removing temporal dependencies. Our experiments indicate a significant bitrate gain, in excess of 70% compared to the HEVC video standard and over 30% compared to VVC, on a dataset of talking-head videos.
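The sketch below outlines what such a predictive coding loop could look like: the reference frame is coded conventionally, each subsequent frame is predicted by animation, and only the prediction residual is coded. `animate`, `code_residual`, and `decode_residual` are hypothetical callables standing in for the learned components, not the paper's actual modules.

```python
def encode_sequence(frames, animate, code_residual, decode_residual):
    """Illustrative animation-based predictive coding loop.

    frames: list of frame tensors; frames[0] is the (conventionally coded) reference.
    """
    reconstructed = [frames[0]]          # decoder starts from the reference frame
    bitstream = []
    for target in frames[1:]:
        prediction = animate(reconstructed[-1], target)   # animation-based predictor
        residual = target - prediction                    # what the predictor misses
        bits = code_residual(residual)                    # learned residual coding
        bitstream.append(bits)
        # Closed-loop reconstruction: decode the residual the receiver would see.
        reconstructed.append(prediction + decode_residual(bits))
    return bitstream, reconstructed
```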

A HYBRID DEEP ANIMATION CODEC FOR LOW-BITRATE VIDEO CONFERENCING (ICIP 2022)

Deep generative models, and particularly facial animation schemes, can be used in video conferencing applications to efficiently compress a video through a sparse set of keypoints, without the need to transmit dense motion vectors. While these schemes bring significant coding gains over conventional video codecs at low bitrates, their performance saturates quickly when the available bandwidth increases. In this paper, we propose a layered, hybrid coding scheme to overcome this limitation. Specifically, we extend a codec based on facial animation by adding an auxiliary stream consisting of a very low-bitrate version of the video, obtained through a conventional video codec (e.g., HEVC). The animated and auxiliary videos are combined through a novel fusion module. Our results show consistent average BD-rate gains in excess of -30% on a large dataset of video conferencing sequences, extending the operational bitrate range of a facial animation codec.
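For intuition, here is a hypothetical version of a learned fusion step that blends the animated frame with the auxiliary low-bitrate reconstruction through a per-pixel soft mask; it is a sketch of the general idea, not the HDAC fusion module itself.

```python
import torch
import torch.nn as nn

class FusionSketch(nn.Module):
    """Blend an animated frame with an auxiliary (e.g. heavily quantized HEVC)
    frame using a learned per-pixel mask. Illustrative architecture only."""

    def __init__(self, channels=3, hidden=32):
        super().__init__()
        self.mask_net = nn.Sequential(
            nn.Conv2d(2 * channels, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 1, kernel_size=3, padding=1),
            nn.Sigmoid(),                 # per-pixel blending weight in [0, 1]
        )

    def forward(self, animated, auxiliary):
        mask = self.mask_net(torch.cat([animated, auxiliary], dim=1))
        # Take animation detail where the mask is high, auxiliary content elsewhere.
        return mask * animated + (1.0 - mask) * auxiliary
```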

ULTRA-LOW BITRATE VIDEO CONFERENCING USING DEEP IMAGE ANIMATION (ICASSP 2021)

In this work, we propose a novel deep learning approach to ultra-low-bitrate video compression for video conferencing applications. To address the shortcomings of current video compression paradigms when the available bandwidth is extremely limited, we adopt a model-based approach that employs deep neural networks to encode motion information as keypoint displacements and to reconstruct the video signal at the decoder side. The overall system is trained in an end-to-end fashion, minimizing a reconstruction error on the encoder output. Objective and subjective quality evaluations demonstrate that the proposed approach provides an average bitrate reduction of more than 80% for the same visual quality compared to HEVC.
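A back-of-the-envelope calculation illustrates why keypoint-based coding reaches ultra-low bitrates; the numbers below (keypoint count, coordinate precision, frame rate) are assumptions chosen for the example, not the paper's exact configuration.

```python
# Illustrative inter-frame payload for keypoint-based coding.
K = 10                    # number of keypoints (assumed)
bits_per_coord = 8        # quantized precision per (x or y) coordinate (assumed)
fps = 25                  # frame rate (assumed)

bits_per_frame = K * 2 * bits_per_coord        # 160 bits per inter frame
kbps = bits_per_frame * fps / 1000             # 4.0 kbps of motion payload
print(f"{bits_per_frame} bits/frame ≈ {kbps} kbps (plus the coded reference frame)")
```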