Enterprise AI Analysis
TreeNet: A Light Weight Model for Low Bitrate Image Compression
This analysis explores TreeNet, a novel low-complexity image compression model leveraging a binary tree-structured encoder-decoder architecture for efficient representation and reconstruction. TreeNet significantly outperforms JPEG AI in low-bitrate scenarios while drastically reducing model complexity.
Executive Impact
TreeNet represents a significant leap in image compression efficiency, particularly critical for enterprise applications requiring high-volume data handling with minimal computational overhead. Its advancements translate directly into substantial cost savings and improved operational performance.
Deep Analysis & Enterprise Applications
Abstract
Reducing computational complexity remains a critical challenge for the widespread adoption of learning-based image compression techniques. In this work, we propose TreeNet, a novel low-complexity image compression model that leverages a binary tree-structured encoder-decoder architecture to achieve efficient representation and reconstruction. We employ an attentional feature fusion mechanism to effectively integrate features from multiple branches. We evaluate TreeNet on three widely used benchmark datasets and compare its performance against competing methods including JPEG AI, a recent standard in learning-based image compression. At low bitrates, TreeNet achieves an average improvement of 4.83% in Bjøntegaard delta bitrate over JPEG AI, while reducing model complexity by 87.82%. Furthermore, we conduct extensive ablation studies to investigate the influence of various latent representations within TreeNet, offering deeper insights into the factors contributing to reconstruction.
I. Introduction
With the proliferation of digital media, lossy image compression has become vital. Traditional image compression methods such as JPEG2000 [1], BPG [2], and VTM [3] have achieved impressive rate-distortion performance. However, these methods are constrained by their pronounced reliance on complex, manually engineered modules. Recently, end-to-end optimized learned image compression has gained popularity. The foundational learned compression methods [4], [5] broadly follow the structure of a variational auto-encoder [15]. The analysis transform ga transforms an image x into a latent representation y. A quantizer Q quantizes y into a discrete-valued latent representation ŷ, followed by lossless entropy coding using an entropy model Eŷ. The synthesis transform gs maps the entropy-decoded latent representation ŷe to the image space to produce the reconstructed image x̂. The overall process can be summarized as follows:
y = ga(x; θ),  ŷ = Q(y),  x̂ = gs(ŷ; φ)
where θ and φ represent the learnable parameters of ga and gs, respectively. As quantization is non-differentiable, uniform noise is added during training to simulate the quantization process [4]. Such methods are optimized with the following loss function: L = R(ŷ) + λ · D(x, x̂), where R(ŷ) represents the bitrate of the discrete latent ŷ and D(x, x̂) indicates the distortion between the original image x and the reconstructed image x̂. The Lagrangian multiplier λ controls the trade-off between rate and distortion.
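To make this pipeline concrete, the following is a minimal PyTorch sketch of the additive-noise quantization proxy and the rate-distortion loss; the function names and shape conventions are illustrative assumptions, not the authors' code.

```python
import torch

def quantize(y: torch.Tensor, training: bool = True) -> torch.Tensor:
    # During training, additive uniform noise in [-0.5, 0.5] acts as a
    # differentiable proxy for rounding [4]; at test time we round.
    if training:
        return y + torch.empty_like(y).uniform_(-0.5, 0.5)
    return torch.round(y)

def rd_loss(x, x_hat, likelihoods, lam):
    # Rate: bits of the quantized latent under the entropy model,
    # normalized per pixel. Distortion: MSE between original and
    # reconstruction, weighted by the Lagrangian multiplier lambda.
    num_pixels = x.shape[0] * x.shape[2] * x.shape[3]
    rate_bpp = -torch.log2(likelihoods).sum() / num_pixels
    distortion = torch.mean((x - x_hat) ** 2)
    return rate_bpp + lam * distortion
```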
Several works concentrating on different facets of learned image compression, such as network design [9], [10] and context modeling [7], [19], [12], [11], [13], [14], have been proposed to enhance rate-distortion performance. Remarkably, some learned image compression methods [11], [13], [14] outperform state-of-the-art traditional methods such as the intra-coding mode of VVC. Recently, the JPEG standardization committee introduced JPEG AI [16], a learning-based image coding standard. However, to achieve better rate-distortion performance, these methods often necessitate complex model architectures, which limits their widespread adoption.
II. Model Architecture
In this section, we describe TreeNet in detail. Fig. 2 illustrates the schematic representation of the model architecture. The model consists of an analysis transform ga(·), four entropy bottlenecks geb(·), and a synthesis transform gs(·). The analysis transform ga is designed as a perfect binary tree of height 3 with 8 leaf nodes, each node functioning as a learnable block. In our experiments, the nodes in ga are structured as residual downsampling blocks, as depicted in Fig. 2. The input image x is processed by the root node of ga. The output of each parent node forms the input to its successors, i.e., the two child nodes of a parent receive identical input feature maps. This gives the two child nodes the scope to capture unique and complementary features from the complete input during training. We combine the feature maps emerging from leaf nodes that share a common parent using attentional feature fusion [28]. This results in four latents y1, y2, y3, and y4, which are then sent to four separate entropy bottlenecks for latent-specific entropy coding.
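A structural sketch of such a tree-shaped analysis transform in PyTorch follows. The residual downsampling block is a simplified placeholder, and the fusion step is a plain 1×1 convolution standing in for attentional feature fusion [28]; the channel count follows the paper's 32, but the block internals are assumptions.

```python
import torch
import torch.nn as nn

class ResidualDown(nn.Module):
    """Simplified residual 2x downsampling node (placeholder block)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
            nn.LeakyReLU(),
            nn.Conv2d(c_out, c_out, 3, padding=1),
        )
        self.skip = nn.Conv2d(c_in, c_out, 1, stride=2)

    def forward(self, x):
        return self.conv(x) + self.skip(x)

class TreeEncoder(nn.Module):
    """Perfect binary tree of height 3: both children of a parent receive
    the same feature map; sibling leaves are fused into one latent."""
    def __init__(self, c=32):
        super().__init__()
        self.root = ResidualDown(3, c)
        self.level1 = nn.ModuleList([ResidualDown(c, c) for _ in range(2)])
        self.level2 = nn.ModuleList([ResidualDown(c, c) for _ in range(4)])
        self.leaves = nn.ModuleList([ResidualDown(c, c) for _ in range(8)])
        # Stand-in for attentional feature fusion [28]: 1x1 conv on concat.
        self.fuse = nn.ModuleList([nn.Conv2d(2 * c, c, 1) for _ in range(4)])

    def forward(self, x):
        r = self.root(x)
        l1 = [m(r) for m in self.level1]
        l2 = [self.level2[i](l1[i // 2]) for i in range(4)]
        latents = []
        for i in range(4):
            a = self.leaves[2 * i](l2[i])      # sibling leaves share
            b = self.leaves[2 * i + 1](l2[i])  # the same parent input
            latents.append(self.fuse[i](torch.cat([a, b], dim=1)))
        return latents  # y1..y4, each at 1/16 spatial resolution
```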
The architecture of each entropy bottleneck is the same as in [10], consisting of a hyper analysis transform ha(·), a context model gc(·), a hyper synthesis transform hs(·), and an entropy parameter estimation block hep(·). However, instead of an autoregressive context model, we utilize the checkerboard context model [12] for faster inference.
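As background, the checkerboard context model splits latent positions into "anchors", coded with the hyperprior alone, and "non-anchors", coded conditioned on their already-decoded checkerboard neighbours, so decoding takes two dense passes instead of one pass per pixel. A minimal mask-construction sketch, assuming the standard two-pass scheme of [12]:

```python
import torch

def checkerboard_masks(h, w):
    # Anchor positions: (i + j) even; non-anchors are decoded in a
    # second pass, conditioned on the decoded anchor neighbourhood.
    ii, jj = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    anchor = (ii + jj) % 2 == 0
    return anchor, ~anchor

anchor, non_anchor = checkerboard_masks(4, 4)
# Both passes run as dense convolutions, so inference needs only two
# steps rather than h*w sequential autoregressive steps.
```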
The outputs of the four entropy bottlenecks are fed into the synthesis transform gs. The synthesis transform consists of three upsampling layers. Each upsampling layer contains N residual upsample nodes and N-1 feature fusion nodes, where N ∈ {3, 2, 1}. In each feature fusion node, the outputs from two residual upsample nodes are combined, as illustrated in Fig. 2. Finally, a single residual upsample node generates the reconstructed image x̂ from the features received from the preceding upsampling layer. Existing model architectures [6], [7], [10], [11] use 128 or 192 channels in their convolutional layers, whereas TreeNet is configured with 32 channels, which reduces complexity further.
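For completeness, a sketch of a residual upsample node, the building block of gs; the subpixel-convolution (PixelShuffle) design here is our assumption, and the exact block in TreeNet follows Fig. 2 of the paper.

```python
import torch.nn as nn

class ResidualUp(nn.Module):
    """Simplified residual 2x upsampling node (placeholder block)."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c_in, c_out * 4, 3, padding=1),
            nn.PixelShuffle(2),          # subpixel 2x spatial upsampling
            nn.LeakyReLU(),
            nn.Conv2d(c_out, c_out, 3, padding=1),
        )
        self.skip = nn.Sequential(
            nn.Conv2d(c_in, c_out * 4, 1),
            nn.PixelShuffle(2),
        )

    def forward(self, x):
        return self.conv(x) + self.skip(x)
```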
III. Experiments
A. Training Setup
We trained the models on the combined train and test sets of the COCO 2017 dataset [29], consisting of 150,000 images. We randomly cropped patches of size 256 × 256 from the original images for training. The models are trained with a batch size of 16 with random shuffling, using the loss function L = R(ŷ, ẑ) + λ · D(x, x̂), where R(ŷ, ẑ) represents the total rate of the quantized latent ŷ and the quantized hyper-latent ẑ (the latter modeled with a factorized prior Ψ), and λ is the Lagrangian multiplier. We compute the mean square error (MSE) and the multi-scale SSIM (MS-SSIM) between the original and reconstructed images as distortion measures. In our experiments, we empirically found the MSE multipliers λ1 to be {0.01, 0.005, 0.0025, 0.00125} and the MS-SSIM multipliers λ2 to be {2.4, 1.2, 0.6, 0.3}, culminating in bitrates between 0.1 and 0.4 bpp. For optimization we used the Adam optimizer [30] with an initial learning rate of 10⁻⁴. Each model corresponding to a λ value is trained for 450 epochs.
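A condensed training step consistent with this setup; `model`, `train_dataset`, and `rd_loss` (from the earlier sketch) are assumed to exist, and the λ shown is one of the MSE multipliers listed above.

```python
import torch
from torch.utils.data import DataLoader

# `model` (tree encoder + entropy bottlenecks + decoder) and
# `train_dataset` (random 256x256 crops of COCO 2017) are assumptions.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loader = DataLoader(train_dataset, batch_size=16, shuffle=True)

for epoch in range(450):              # one model is trained per lambda
    for x in loader:
        x_hat, likelihoods = model(x)  # reconstruction + entropy-model output
        loss = rd_loss(x, x_hat, likelihoods, lam=0.01)  # e.g. lambda_1 = 0.01
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```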
B. Testing Setup
For evaluation, we used the Kodak [18], CLIC Professional Valid [31], and Tecnick [17] datasets. We benchmarked TreeNet against BPG [2], VTM [3], JPEG AI [16], Factorized Prior [4], Hyper Prior [6], Color Learning [8], and SLIC [9]. For quantitative comparison, we computed quality metrics such as PSNR, MS-SSIM [32], LPIPS [33], and DISTS [34] using the PyTorch Image Quality package [35]. Additionally, we compute NLPD [36] using the implementation provided in [37]. We also calculate the Bjøntegaard delta bitrate (BD-rate) [38] for comparing different codecs.
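The quality metrics can be reproduced with the PyTorch Image Quality (piq) package [35]; a minimal sketch, assuming images as float tensors in [0, 1] (random placeholder tensors stand in for real image pairs):

```python
import torch
import piq

x = torch.rand(1, 3, 512, 768)      # original image (placeholder data)
x_hat = torch.rand(1, 3, 512, 768)  # reconstruction (placeholder data)

psnr = piq.psnr(x_hat, x, data_range=1.0)                 # higher is better
ms_ssim = piq.multi_scale_ssim(x_hat, x, data_range=1.0)  # higher is better
lpips = piq.LPIPS()(x_hat, x)                             # lower is better
dists = piq.DISTS()(x_hat, x)                             # lower is better
```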
IV. Results
A. Quantitative Comparison
For evaluating the performance of TreeNet quantitatively, we compute the rate-distortion (RD) performance on the datasets mentioned in Subsection III-B. The BD-rate computations are reported in Table I (a computation sketch follows the table) and the RD plots are shown in Figs. 3, 4, and 5. Overall, TreeNet outperforms Factorized Prior [4], Color Learning [8], Hyper Prior [6], and BPG [2], while performing competitively with VTM [3], JPEG AI [16], and SLIC [9] across all metrics. TreeNet has an average BD-rate gain in PSNR of 4.83% over JPEG AI. Notably, TreeNet achieves BD-rate savings of 12.50% in the NLPD metric over JPEG AI [16] while being significantly less complex.
Table I: BD-rate (%) with BPG [2] as the anchor (lower is better).

| Methods | Kodak [18] PSNR | Kodak MS-SSIM | Kodak 1-NLPD | Kodak 1-LPIPS | CLIC Prof. Valid [31] PSNR | CLIC MS-SSIM | CLIC 1-NLPD | CLIC 1-LPIPS | Tecnick [17] PSNR | Tecnick MS-SSIM | Tecnick 1-NLPD | Tecnick 1-LPIPS |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| BPG [2] | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| VTM [3] | -22.62 | -15.01 | -23.52 | -10.93 | -27.98 | -20.20 | -27.46 | -17.39 | -28.93 | -16.27 | -25.50 | -17.65 |
| Color Learning [8] | 49.03 | -7.02 | -6.59 | 31.60 | 41.28 | -7.63 | -3.55 | 22.10 | 45.55 | 1.23 | 7.26 | 33.86 |
| Factorized Prior [4] | 23.88 | -14.59 | -15.97 | 30.22 | 24.01 | -16.77 | -11.04 | 39.42 | 21.13 | -7.33 | -1.63 | 37.71 |
| Hyper Prior [6] | -2.39 | -18.80 | -23.18 | 19.84 | -7.60 | -26.79 | -26.07 | 15.27 | -5.88 | -19.16 | -18.51 | 15.99 |
| SLIC [9] | -13.67 | -30.31 | -35.65 | 9.23 | -8.65 | -34.05 | -31.80 | 12.81 | -18.37 | -33.07 | -33.46 | 7.77 |
| JPEG AI [16] | 7.47 | -37.51 | -14.36 | 2.12 | -15.06 | -38.61 | -33.11 | -24.71 | -8.30 | -29.08 | -17.62 | -17.09 |
| TreeNet (Ours) | -2.95 | -32.15 | -33.30 | -10.18 | -13.34 | -35.05 | -38.17 | -13.72 | -14.08 | -30.73 | -31.31 | -10.52 |
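For reference, BD-rate [38] compares two codecs by fitting each RD curve with a cubic polynomial of log-rate against quality and averaging the horizontal gap over the overlapping quality range. A minimal sketch of this standard calculation (production implementations often use piecewise-cubic interpolation instead):

```python
import numpy as np

def bd_rate(rate_anchor, psnr_anchor, rate_test, psnr_test):
    """Average bitrate difference (%) of the test codec vs. the anchor."""
    lr_a, lr_t = np.log(rate_anchor), np.log(rate_test)
    # Fit log-rate as a cubic polynomial of the quality metric.
    p_a = np.polyfit(psnr_anchor, lr_a, 3)
    p_t = np.polyfit(psnr_test, lr_t, 3)
    lo = max(min(psnr_anchor), min(psnr_test))
    hi = min(max(psnr_anchor), max(psnr_test))
    # Integrate both fits over the overlapping quality interval.
    int_a = np.polyval(np.polyint(p_a), hi) - np.polyval(np.polyint(p_a), lo)
    int_t = np.polyval(np.polyint(p_t), hi) - np.polyval(np.polyint(p_t), lo)
    avg_diff = (int_t - int_a) / (hi - lo)
    return (np.exp(avg_diff) - 1) * 100  # negative => bitrate savings
```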
B. Qualitative Comparison
To showcase the efficacy of TreeNet in producing high-quality reconstructions, we visually compare the outputs of various methods with that of TreeNet, as shown in Fig. 6. Along with the overall comparison, we focus on specific areas of the image to highlight the differences. Upon closer inspection, we observe that TreeNet has higher reconstruction fidelity, both in terms of structure and color, compared to the other codecs. Even though the structural fidelity of JPEG AI [16] is better than that of our method, TreeNet outperforms it in terms of color fidelity. We observe that subtle changes in color are well preserved by TreeNet.
C. Complexity Analysis
For quantifying complexity, we compute thousand multiply-accumulate operations per pixel (kMACs/pixel) using the torchinfo module (a measurement sketch follows Table III below). Table II compares the kMACs/pixel of various learning-based models. The overall complexity of TreeNet, encompassing both encoder and decoder, is 60.4 kMACs/pixel; the decoder accounts for 51.08 kMACs/pixel, while the encoder contributes 9.32 kMACs/pixel. Notably, TreeNet has 87.82% less complexity than JPEG AI (495.99 kMACs/pixel). We further compute module-wise complexities along with the number of parameters for TreeNet and report them in Table III.
Table II: Encoder and decoder complexity of various learning-based codecs.

| Codec Names | Encoder Complexity [kMACs/pixel] | Decoder Complexity [kMACs/pixel] |
|---|---|---|
| TreeNet (ours) | 9.32 | 51.08 |
| Factorized Prior | 36.84 | 147.25 |
| Hyper Prior | 40.80 | 149.89 |
| JPEG AI | 277.16 | 218.83 |
| Color Learning | 128.89 | 305.58 |
| SLIC | 93.52 | 745.57 |
Table III: Module-wise complexity and parameter count of TreeNet.

| Module Names | Complexity [kMACs/pixel] | No. Params. [millions] |
|---|---|---|
| ga | 6.86 | 0.32 |
| gs | 49.89 | 0.86 |
| ha (×4) | 0.37 | 0.19 |
| hs (×4) | 0.85 | 0.68 |
| hep (×4) | 0.8 | 0.03 |
| Context Model (×4) | 0.44 | 0.21 |
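The kMACs/pixel figures above can be reproduced with torchinfo; a sketch, where `model` and the input resolution are assumptions:

```python
import torchinfo

# `model` is an instantiated codec (e.g. the TreeEncoder sketch above);
# the input resolution is illustrative.
H, W = 768, 512
stats = torchinfo.summary(model, input_size=(1, 3, H, W), verbose=0)
kmacs_per_pixel = stats.total_mult_adds / (H * W) / 1000
print(f"{kmacs_per_pixel:.2f} kMACs/pixel")
```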
D. Ablation Study
1) Latent Interpretation
To showcase the impact of the four input feature maps of gs on the reconstructed image, we conducted eight experiments belonging to two broad categories: selective propagation and accumulative propagation. In selective propagation, we provide gs with a single input feature map out of the four at a time, as shown in (5), and inspect the output. In doing so, we determine the influence of each feature map in the pixel space. The first four columns in Fig. 7 depict the output osp when individual input feature maps are provided to gs. From these columns we can infer that y1 and y2 are responsible for the reconstruction of low-frequency and high-frequency content in the image, whereas y3 and y4 impart color to the reconstruction. Second, in accumulative propagation, we gradually accumulate the inputs to gs one after the other, starting from the feature map y1, as shown in (6).
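A sketch of the two probing schemes, assuming that non-selected branches are suppressed by zeroing their latents and that gs accepts a list of four latents (the paper's exact formulation is given by its equations (5) and (6), not reproduced here):

```python
import torch

def selective(gs, latents, keep):
    # Keep one latent (keep=0 corresponds to y1), zero the rest, decode.
    probe = [y if i == keep else torch.zeros_like(y)
             for i, y in enumerate(latents)]
    return gs(probe)

def accumulative(gs, latents, upto):
    # Keep y1..y_(upto+1), zero the remainder, and decode.
    probe = [y if i <= upto else torch.zeros_like(y)
             for i, y in enumerate(latents)]
    return gs(probe)
```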
2) Spatial Rate Distribution
We visualize the average number of bits required for encoding the latent feature maps. The bitmap is computed by averaging the bits (negative log-likelihoods) of latent pixels across channels. Formally, the bitmap generation process can be stated as: My = Ω( −(1/C) Σⱼ log₂ P(ŷi,j | ẑi,j) ), j = 1, …, C, where My represents the upscaled bitmap used for visualization, Ω(·) is the nearest-neighbour operation used to scale the bitmap to the image dimensions, C is the number of channels in a latent feature map, ŷi,j indicates the jth channel of the ith quantized latent ŷ, and ẑi,j represents the jth channel of the ith quantized hyper-latent ẑ. We visualize the bitmaps alongside the original image in Fig. 8. For latent y1, the bitmap is more spread out than for latent y2, whose bitmap is concentrated around the high-frequency parts of the image. The bitmaps for latents y3 and y4 focus on color gradients. The checkerboard patterns visible in the bitmaps are due to the checkerboard context model.
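A sketch of this bitmap computation, assuming per-element likelihoods from the entropy model; `rate_bitmap` is a hypothetical helper, not the paper's code:

```python
import torch
import torch.nn.functional as F

def rate_bitmap(likelihoods, img_h, img_w):
    # Average bits per latent pixel across channels: -log2 likelihood.
    bits = -torch.log2(likelihoods)          # shape (1, C, h, w)
    bitmap = bits.mean(dim=1, keepdim=True)  # shape (1, 1, h, w)
    # Nearest-neighbour upscaling to image resolution for visualization.
    return F.interpolate(bitmap, size=(img_h, img_w), mode="nearest")
```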
V. Conclusion
In this paper, we propose a novel learning-based image compression method called TreeNet that leverages a binary tree-structured architecture for complexity reduction. We present a detailed quantitative and qualitative evaluation of our method and compare it with various state-of-the-art methods, including JPEG AI. The experiments showcase the competitiveness of TreeNet across three test sets and five evaluation metrics while being significantly less complex. Finally, we elucidate the contribution of the latent blocks to reconstruction, lending interpretability to our model. In future work, we aim to further reduce the decoder complexity and improve rate-distortion performance through better context modeling.
References
- A. Skodras, C. Christopoulos, and T. Ebrahimi, "The JPEG 2000 still image compression standard," IEEE Signal Processing Magazine, vol. 18, no. 5, pp. 36-58, 2001.
- G. J. Sullivan, J.-R. Ohm, W.-J. Han, and T. Wiegand, "Overview of the high efficiency video coding (HEVC) standard," IEEE Transactions on Circuits and Systems for Video Technology, vol. 22, no. 12, pp. 1649-1668, 2012.
- B. Bross, Y.-K. Wang, Y. Ye, S. Liu, J. Chen, G. J. Sullivan, and J.-R. Ohm, "Overview of the versatile video coding (VVC) standard and its applications," IEEE Transactions on Circuits and Systems for Video Technology, vol. 31, no. 10, pp. 3736-3764, 2021.
- J. Ballé, V. Laparra, and E. P. Simoncelli, "End-to-end optimized image compression," in International Conference on Learning Representations, 2017.
- L. Theis, W. Shi, A. Cunningham, and F. Huszár, "Lossy image compression with compressive autoencoders," in International Conference on Learning Representations, 2017.
- J. Ballé, D. Minnen, S. Singh, S. J. Hwang, and N. Johnston, "Variational image compression with a scale hyperprior," in International Conference on Learning Representations, 2018.
- D. Minnen, J. Ballé, and G. D. Toderici, "Joint autoregressive and hierarchical priors for learned image compression," in Advances in Neural Information Processing Systems, S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, Eds., vol. 31. Curran Associates, Inc., 2018.
- S. Prativadibhayankaram, T. Richter, H. Sparenberg, and S. Foessel, "Color learning for image compression," in 2023 IEEE International Conference on Image Processing (ICIP). IEEE, 2023, pp. 2330-2334.
- S. Prativadibhayankaram, M. P. Panda, T. Richter, H. Sparenberg, S. Fößel, and A. Kaup, "SLIC: A learned image codec using structure and color," in 2024 Data Compression Conference (DCC). IEEE, 2024, pp. 3-12.
- Z. Cheng, H. Sun, M. Takeuchi, and J. Katto, "Learned image compression with discretized gaussian mixture likelihoods and attention modules," in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2020, pp. 7939-7948.
- D. He, Z. Yang, W. Peng, R. Ma, H. Qin, and Y. Wang, "ELIC: Efficient learned image compression with unevenly grouped space-channel contextual adaptive coding," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 5718-5727.
- D. He, Y. Zheng, B. Sun, Y. Wang, and H. Qin, "Checkerboard context model for efficient learned image compression," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 14771-14780.
- W. Jiang, J. Yang, Y. Zhai, P. Ning, F. Gao, and R. Wang, "MLIC: Multi-reference entropy model for learned image compression," in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 7618-7627.
- W. Jiang, J. Yang, Y. Zhai, F. Gao, and R. Wang, "MLIC++: Linear complexity multi-reference entropy modeling for learned image compression," arXiv preprint arXiv:2307.15421, 2023.
- D. P. Kingma, M. Welling et al., "Auto-encoding variational bayes," 2013.
- E. Alshina, J. Ascenso, and T. Ebrahimi, "JPEG AI: The first international standard for image coding based on an end-to-end learning-based approach," IEEE MultiMedia, vol. 31, no. 4, pp. 60-69, 2024.
- N. Asuni, A. Giachetti et al., "Testimages: a large-scale archive for testing visual devices and basic image processing algorithms." in STAG, 2014, pp. 63-70.
- E. Kodak, "Kodak lossless true color image suite," Tech. Rep, 1993.
- D. Minnen and S. Singh, "Channel-wise autoregressive entropy models for learned image compression," in 2020 IEEE International Conference on Image Processing (ICIP). IEEE, 2020, pp. 3339-3343.
- F. Mentzer, E. Agustsson, and M. Tschannen, "M2T: Masking transformers twice for faster decoding," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 5340-5349.
- Y. Zhu, Y. Yang, and T. Cohen, “Transformer-based transform coding,” in International Conference on Learning Representations, 2022.
- Y. Yang and S. Mandt, "Computationally-efficient neural image compression with shallow decoders," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 530-540.
- F. Galpin, M. Balcilar, F. Lefebvre, F. Racapé, and P. Hellier, "Entropy coding improvement for low-complexity compressive auto-encoders," in 2023 Data Compression Conference (DCC), 2023, pp. 338-338.
- T. Leguay, T. Ladune, P. Philippe, G. Clare, F. Henry, and O. Déforges, "Low-complexity overfitted neural image codec," in 2023 IEEE 25th International Workshop on Multimedia Signal Processing (MMSP). IEEE, 2023, pp. 1-6.
- H. Kim, M. Bauer, L. Theis, J. R. Schwarz, and E. Dupont, “C3: High-performance and low-complexity neural compression from a single image or video," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9347-9358.
- T. Blard, T. Ladune, P. Philippe, G. Clare, X. Jiang, and O. Déforges, "Overfitted image coding at reduced complexity," in 2024 32nd European Signal Processing Conference (EUSIPCO). IEEE, 2024, pp. 927-931.
- R. Flepp, A. Ignatov, R. Timofte, and L. Van Gool, "Real-world mobile image denoising dataset with efficient baselines," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 22368-22377.
- Y. Dai, F. Gieseke, S. Oehmcke, Y. Wu, and K. Barnard, "Attentional feature fusion," in IEEE Winter Conference on Applications of Computer Vision, WACV 2021, 2021.
- T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in Computer Vision - ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V. Springer, 2014, pp. 740-755.
- D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
- G. Toderici, W. Shi, R. Timofte, L. Theis, J. Ballé, E. Agustsson, N. Johnston, and F. Mentzer, "Workshop and challenge on learned image compression (CLIC 2020)," in CVPR, 2020.
- Z. Wang, E. P. Simoncelli, and A. C. Bovik, "Multiscale structural similarity for image quality assessment," in The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, vol. 2. IEEE, 2003, pp. 1398-1402.
- R. Zhang, P. Isola, A. A. Efros, E. Shechtman, and O. Wang, "The unreasonable effectiveness of deep features as a perceptual metric," in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 586-595.
- K. Ding, K. Ma, S. Wang, and E. P. Simoncelli, "Image quality assessment: Unifying structure and texture similarity," IEEE transactions on pattern analysis and machine intelligence, vol. 44, no. 5, pp. 2567-2581, 2020.
- S. Kastryulin, J. Zakirov, D. Prokopenko, and D. V. Dylov, "Pytorch image quality: Metrics for image quality assessment," 2022.
- V. Laparra, J. Ballé, A. Berardino, and E. P. Simoncelli, "Perceptual image quality assessment using a normalized laplacian pyramid," Electronic Imaging, vol. 28, pp. 1-6, 2016.
- K. Ding, K. Ma, S. Wang, and E. P. Simoncelli, "Comparison of image quality models for optimization of image processing systems," CoRR, vol. abs/2005.01338, 2020.
- G. Bjøntegaard, "Calculation of average PSNR differences between RD-curves," ITU-T SG16 Doc. VCEG-M33, 2001.
Implementation Roadmap
A phased approach to integrate TreeNet into your existing infrastructure for optimal results.
Phase 1: Discovery & Strategy
Detailed assessment of current image processing workflows, infrastructure, and identifying key optimization areas. Define success metrics and a tailored implementation plan.
Phase 2: Pilot Deployment & Testing
Deploy TreeNet in a controlled environment. Conduct rigorous testing with your specific datasets to validate performance, compression ratios, and system compatibility. Gather feedback.
Phase 3: Integration & Optimization
Seamless integration of TreeNet into your production systems. Fine-tune parameters and workflows to maximize efficiency and achieve the desired rate-distortion performance. Training for your teams.
Phase 4: Monitoring & Scaling
Continuous monitoring of TreeNet's performance and system health. Scale the solution across your enterprise, ensuring ongoing support and adaptability to future needs.
Ready to Transform Your Image Compression?
Book a personalized consultation with our AI specialists to explore how TreeNet can integrate into your enterprise and drive significant efficiencies.