TextEconomizer: Enhancing Lossy Text Compression with Denoising Transformers and Entropy Coding
Abstract
Lossy text compression reduces data size while preserving core meaning, making it well-suited
for tasks like summarization, automated analysis, and digital archives where exact
fidelity is less critical. Despite the dominance of transformer-based models in language
modeling, the integration of context vectors and lossless entropy coding into Seq2Seq text
generation remains underexplored. A key challenge lies in identifying the most informative
context vectors from the encoder output and incorporating entropy coding into the transformer
framework to enhance storage efficiency while maintaining high-quality outputs,
even in the presence of noisy text. Previous studies have primarily focused on near-lossless
token generation, often overlooking space efficiency. In this paper, we introduce TextEconomizer,
an encoder-decoder framework paired with a transformer neural network. This
framework utilizes its latent representation to reduce variable-sized inputs by 50% to 80%,
without prior knowledge of dataset dimensions. Our model achieves competitive compression
ratios by incorporating entropy coding, while delivering near-perfect text quality, as
assessed by BLEU, ROUGE, and semantic similarity scores. Notably, TextEconomizer
operates with approximately 153 times fewer parameters than comparable models, achieving
a compression ratio of 5.39× without sacrificing semantic quality. Additionally, we
evaluate our framework by implementing an LSTM-based autoencoder, commonly used
in image compression, and by integrating advanced modules within the transformer architecture
as alternatives to conventional techniques. Our autoencoder achieves a state-of-the-art compression ratio of 67× with 196 times fewer parameters, while our modified
transformer outperforms the autoencoder with a 263-fold reduction in parameters. The
TextEconomizer framework significantly surpasses existing transformer-based models in
balancing memory efficiency and high-fidelity outputs, marking a breakthrough in lossy
compression with optimal space utilization. Additionally, we propose another transformer-based method named RejuvenateFormer for text decompression, addressing prior issues by
harnessing a new pre-processing technique and a lossless compression method. Our meticulous
pre-processing technique, incorporating the Lempel-Ziv-Welch algorithm, achieves
compression ratios of 12.57, 13.38, and 11.42 on the BookCorpus, EN-DE, and EN-FR corpora, respectively, which are state-of-the-art compared to other deep learning and traditional approaches. Furthermore, the RejuvenateFormer achieves BLEU scores of 27.31, 25.78, and 50.45 on the EN-DE, EN-FR, and BookCorpus corpora, respectively, showcasing its comprehensive efficacy. In contrast, the pre-trained T5-Small outperforms prior state-of-the-art models.
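
For readers unfamiliar with the two compression notions named above, the Lempel-Ziv-Welch (LZW) step and the compression ratio, the short Python sketch below shows a textbook LZW encoder and a fixed-width compression-ratio estimate. It is a minimal illustration only, not the TextEconomizer or RejuvenateFormer pipeline: the sample string, the 16-bit code width, and the byte-level starting dictionary are assumptions made for this example, and the learned transformer and entropy-coding stages described in the thesis are omitted.

    # Illustrative sketch only (not the thesis pipeline): a textbook LZW
    # encoder plus a crude compression-ratio estimate. Assumes characters
    # within the Latin-1 range and a fixed 16-bit code width.

    def lzw_encode(text: str) -> list[int]:
        """Encode a string into a list of LZW dictionary codes."""
        dictionary = {chr(i): i for i in range(256)}  # single-character seeds
        next_code = 256
        current = ""
        codes = []
        for ch in text:
            candidate = current + ch
            if candidate in dictionary:
                current = candidate                 # extend the current phrase
            else:
                codes.append(dictionary[current])   # emit code for known phrase
                dictionary[candidate] = next_code   # learn the new phrase
                next_code += 1
                current = ch
        if current:
            codes.append(dictionary[current])
        return codes

    def compression_ratio(text: str, bits_per_code: int = 16) -> float:
        """Uncompressed size (8 bits per byte) divided by encoded size."""
        codes = lzw_encode(text)
        original_bits = 8 * len(text.encode("utf-8"))
        compressed_bits = bits_per_code * len(codes)
        return original_bits / compressed_bits

    if __name__ == "__main__":
        sample = "to be or not to be, that is the question. " * 50
        print(f"compression ratio: {compression_ratio(sample):.2f}x")

On this repetitive toy sample the ratio is well above 1 because repeated phrases collapse to single dictionary codes; the ratios of 12.57, 13.38, and 11.42 reported above come from the thesis' full pre-processing pipeline, not from this sketch.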
Collections
- M.Sc Thesis/Project
