Show simple item record

dc.contributor.author: Mahbub, E Sobhani
dc.date.accessioned: 2025-12-22T05:10:03Z
dc.date.available: 2025-12-22T05:10:03Z
dc.date.issued: 2025-07-02
dc.identifier.uri: http://dspace.uiu.ac.bd/handle/52243/3387
dc.description: MSc Thesis [en_US]
dc.description.abstract: Lossy text compression reduces data size while preserving core meaning, making it well-suited for tasks like summarization, automated analysis, and digital archives where exact fidelity is less critical. Despite the dominance of transformer-based models in language modeling, the integration of context vectors and lossless entropy coding into Seq2Seq text generation remains underexplored. A key challenge lies in identifying the most informative context vectors from the encoder output and incorporating entropy coding into the transformer framework to enhance storage efficiency while maintaining high-quality outputs, even in the presence of noisy text. Previous studies have primarily focused on near-lossless token generation, often overlooking space efficiency. In this thesis, we introduce TextEconomizer, an encoder-decoder framework paired with a transformer neural network. This framework utilizes its latent representation to reduce variable-sized inputs by 50% to 80% without prior knowledge of dataset dimensions. Our model achieves competitive compression ratios by incorporating entropy coding, while delivering near-perfect text quality, as assessed by BLEU, ROUGE, and semantic similarity scores. Notably, TextEconomizer operates with approximately 153 times fewer parameters than comparable models, achieving a compression ratio of 5.39× without sacrificing semantic quality. Additionally, we evaluate our framework by implementing an LSTM-based autoencoder, commonly used in image compression, and by integrating advanced modules within the transformer architecture as alternatives to conventional techniques. Our autoencoder achieves a state-of-the-art compression ratio of 67× with 196 times fewer parameters, while our modified transformer outperforms the autoencoder with a 263-fold reduction in parameters. The TextEconomizer framework significantly surpasses existing transformer-based models in balancing memory efficiency and high-fidelity outputs, marking a breakthrough in lossy compression with optimal space utilization. Additionally, we propose another transformer-based method, named RejuvenateFormer, for text decompression, addressing prior issues by harnessing a new pre-processing technique and a lossless compression method. Our meticulous pre-processing technique, incorporating the Lempel-Ziv-Welch algorithm, achieves compression ratios of 12.57, 13.38, and 11.42 on the BookCorpus, EN-DE, and EN-FR corpora, showing state-of-the-art compression ratios compared to other deep learning and traditional approaches. Furthermore, RejuvenateFormer achieves BLEU scores of 27.31, 25.78, and 50.45 on the EN-DE, EN-FR, and BookCorpus corpora, showcasing its comprehensive efficacy. In contrast, the pre-trained T5-Small exhibits better performance than prior state-of-the-art models. [en_US]
dc.language.iso: en_US [en_US]
dc.publisher: UIU [en_US]
dc.subject: Lossy Text Compression [en_US]
dc.subject: Denoising Transformers [en_US]
dc.subject: Entropy Coding [en_US]
dc.subject: LLM [en_US]
dc.title: TextEconomizer: Enhancing Lossy Text Compression with Denoising Transformers and Entropy Coding [en_US]
dc.type: Thesis [en_US]
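
The abstract above reports compression ratios obtained with Lempel-Ziv-Welch (LZW) pre-processing. For orientation only, the short Python sketch below shows how a basic dictionary-based LZW coder and a compression ratio could be measured; it is a toy illustration under assumed 16-bit codes and UTF-8 input, not the thesis implementation, and the function names (lzw_encode, compression_ratio) and sample text are hypothetical.

def lzw_encode(text: str) -> list[int]:
    """Return LZW integer codes for `text`, seeding the dictionary with single bytes."""
    data = text.encode("utf-8")
    dictionary = {bytes([i]): i for i in range(256)}  # all single-byte sequences
    next_code = 256
    current = b""
    codes = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            current = candidate  # extend the current match
        else:
            codes.append(dictionary[current])      # emit code for the longest match
            dictionary[candidate] = next_code      # learn the new sequence
            next_code += 1
            current = bytes([byte])
    if current:
        codes.append(dictionary[current])
    return codes

def compression_ratio(text: str, bits_per_code: int = 16) -> float:
    """Original size in bits divided by encoded size in bits (assumed fixed-width codes)."""
    original_bits = len(text.encode("utf-8")) * 8
    encoded_bits = len(lzw_encode(text)) * bits_per_code
    return original_bits / encoded_bits

if __name__ == "__main__":
    sample = "the cat sat on the mat " * 50  # toy input; corpus-level figures need real data
    print(f"compression ratio: {compression_ratio(sample):.2f}x")

Because LZW exploits repetition, the measured ratio grows with longer, more repetitive input, which is why the abstract reports corpus-level figures (e.g., on BookCorpus) rather than per-sentence numbers.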

