TextEconomizer: Enhancing Lossy Text Compression with Denoising Transformers and Entropy Coding
Abstract
Lossy text compression reduces data size while preserving core meaning, making it well-suited
for tasks like summarization, automated analysis, and digital archives where exact
fidelity is less critical. Despite the dominance of transformer-based models in language
modeling, the integration of context vectors and lossless entropy coding into Seq2Seq text
generation remains underexplored. A key challenge lies in identifying the most informative
context vectors from the encoder output and incorporating entropy coding into the transformer
framework to enhance storage efficiency while maintaining high-quality outputs,
even in the presence of noisy text. Previous studies have primarily focused on near-lossless
token generation, often overlooking space efficiency. In this paper, we introduce TextEconomizer,
an encoder-decoder framework paired with a transformer neural network. This
framework utilizes its latent representation to reduce variable-sized inputs by 50% to 80%,
without prior knowledge of dataset dimensions. Our model achieves competitive compression
ratios by incorporating entropy coding, while delivering near-perfect text quality, as
assessed by BLEU, ROUGE, and semantic similarity scores. Notably, TextEconomizer
operates with approximately 153 times fewer parameters than comparable models, achieving
a compression ratio of 5.39× without sacrificing semantic quality. Additionally, we
evaluate our framework by implementing an LSTM-based autoencoder, commonly used
in image compression, and by integrating advanced modules within the transformer architecture
as alternatives to conventional techniques. Our autoencoder achieves a state-of-the-art compression ratio of 67× with 196 times fewer parameters, while our modified
transformer outperforms the autoencoder with a 263-fold reduction in parameters. The
TextEconomizer framework significantly surpasses existing transformer-based models in
balancing memory efficiency and high-fidelity outputs, marking a breakthrough in lossy
compression with optimal space utilization. Additionally, we propose another transformer-based method named RejuvenateFormer for text decompression, addressing prior issues by
harnessing a new pre-processing technique and a lossless compression method. Our meticulous
pre-processing technique, incorporating the Lempel-Ziv-Welch algorithm, achieves
compression ratios of 12.57, 13.38, and 11.42 on the BookCorpus, EN-DE, and EN-FR corpora, respectively, which are state-of-the-art compared to other deep learning and traditional approaches. Furthermore, the RejuvenateFormer achieves BLEU scores of 27.31, 25.78, and 50.45 on the EN-DE, EN-FR, and BookCorpus corpora, respectively, showcasing its comprehensive efficacy. In contrast, the pre-trained T5-Small outperforms prior state-of-the-art models.
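
For readers unfamiliar with the two compression notions named above, the Lempel-Ziv-Welch (LZW) step and the compression ratio, the short Python sketch below shows a textbook LZW encoder and a fixed-width compression-ratio estimate. It is a minimal illustration only, not the TextEconomizer or RejuvenateFormer pipeline: the sample string, the 16-bit code width, and the byte-level starting dictionary are assumptions made for this example, and the learned transformer and entropy-coding stages described in the thesis are omitted.

    # Illustrative sketch only (not the thesis pipeline): a textbook LZW
    # encoder plus a crude compression-ratio estimate. Assumes characters
    # within the Latin-1 range and a fixed 16-bit code width.

    def lzw_encode(text: str) -> list[int]:
        """Encode a string into a list of LZW dictionary codes."""
        dictionary = {chr(i): i for i in range(256)}  # single-character seeds
        next_code = 256
        current = ""
        codes = []
        for ch in text:
            candidate = current + ch
            if candidate in dictionary:
                current = candidate                 # extend the current phrase
            else:
                codes.append(dictionary[current])   # emit code for known phrase
                dictionary[candidate] = next_code   # learn the new phrase
                next_code += 1
                current = ch
        if current:
            codes.append(dictionary[current])
        return codes

    def compression_ratio(text: str, bits_per_code: int = 16) -> float:
        """Uncompressed size (8 bits per byte) divided by encoded size."""
        codes = lzw_encode(text)
        original_bits = 8 * len(text.encode("utf-8"))
        compressed_bits = bits_per_code * len(codes)
        return original_bits / compressed_bits

    if __name__ == "__main__":
        sample = "to be or not to be, that is the question. " * 50
        print(f"compression ratio: {compression_ratio(sample):.2f}x")

On this repetitive toy sample the ratio is well above 1 because repeated phrases collapse to single dictionary codes; the ratios of 12.57, 13.38, and 11.42 reported above come from the thesis' full pre-processing pipeline, not from this sketch.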
Collections
- M.Sc Thesis/Project
