Towards Comprehensive Bangla Computing: Corpus and Lexicon with Spell and Grammar Checker
Abstract
In this thesis, we present a comprehensive Bangla spell and grammar checker and the techniques used to build it. To make the grammar checker accurate and robust, we have built the largest Bangla monolingual corpus, comprising over 100 million words. To enrich the spell checker, we have also built the largest Bangla lexicon, with over 1 million unique words extracted from this corpus. Because the system embeds a large amount of data, we use hashing, predefined double metaphone codes and predefined counts for language model probabilities to improve efficiency and reduce processing time. In addition, the spell and grammar checker improves over time by keeping a local log of each user's previous suggestions and choices, which it uses to customize future suggestions.
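As a rough illustration of how the efficiency techniques named above can fit together, the following minimal sketch combines a hash-based lexicon, a precomputed phonetic index, stored counts for bigram probabilities, and a per-user log. It is not the thesis implementation: the romanized word list, the toy corpus, the `phonetic_key` function (a simplified stand-in, not Double Metaphone), the add-one smoothing and all function names are assumptions made for this example.

```python
from collections import Counter, defaultdict

# Toy data: romanized placeholders chosen for illustration, not thesis data.
LEXICON = ["kotha", "kota", "khata", "bangla", "bhasha"]
CORPUS = [["bangla", "bhasha"], ["bangla", "kotha"], ["bangla", "kotha"]]


def phonetic_key(word):
    """Hypothetical phonetic encoder (NOT Double Metaphone):
    keep the first letter, then drop vowels and 'h' from the rest."""
    head, tail = word[0], word[1:]
    return head + "".join(c for c in tail if c not in "aeiouh")


# Built once: a hash set for constant-time membership tests and a
# phonetic-key index so candidates are found without scanning the lexicon.
WORDS = set(LEXICON)
BY_KEY = defaultdict(list)
for w in LEXICON:
    BY_KEY[phonetic_key(w)].append(w)

# Built once: unigram and bigram counts, so probabilities are computed
# from stored counts instead of re-reading the corpus.
UNIGRAMS = Counter(w for sent in CORPUS for w in sent)
BIGRAMS = Counter((a, b) for sent in CORPUS for a, b in zip(sent, sent[1:]))


def bigram_prob(prev, word):
    """Add-one smoothed estimate of P(word | prev) from the stored counts."""
    return (BIGRAMS[(prev, word)] + 1) / (UNIGRAMS[prev] + len(WORDS))


def suggest(prev, word, user_log):
    """Return spelling candidates for `word`, best first.

    `user_log` is a Counter of suggestions this user accepted before.
    """
    if word in WORDS:
        return [word]  # correctly spelled: nothing to suggest
    candidates = BY_KEY.get(phonetic_key(word), [])
    # Rank by the user's past choices first, then by bigram probability.
    return sorted(
        candidates,
        key=lambda c: (-user_log[c], -bigram_prob(prev, c)),
    )
```

With an empty log, `suggest("bangla", "kothha", Counter())` ranks `kotha` first because the bigram ("bangla", "kotha") is the most frequent in the toy corpus. Passing `Counter({"khata": 2})` instead moves `khata` to the front, which mirrors the idea of tailoring suggestions to a user's earlier choices.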
Bangla is spoken by over 300 million people around the world. Writing a polished Bangla article for publication requires a robust spell and grammar checker. However, little research has been done in this area of language processing, and as a result no existing Bangla spell and grammar checker produces satisfactory output. Some studies have addressed spell checking or grammar checking individually, but a well-written article needs both checked at the same time. Moreover, no previous study has worked with as much data as ours. In addition, these studies were carried out purely for research purposes, without regard to practical applications, and so they show several shortcomings when processing real-life text. In this thesis, we describe the techniques used to build our corpus, lexicon, and spell and grammar checker, along with the limitations of other studies and our solutions to them.