BSpell: A CNN-blended BERT Based Bengali Spell Checker
Bengali typing is mostly performed using English keyboard and can be highly erroneous due to the presence of compound and similarly pronounced letters. Spelling correction of a misspelled word requires understanding of word typing pattern as well as the context of the word usage. We propose a specialized BERT model, BSpell targeted towards word for word correction in sentence level. BSpell contains an end-to-end trainable CNN sub-model named SemanticNet along with specialized auxiliary loss. This allows BSpell to specialize in highly inflected Bengali vocabulary in the presence of spelling errors. We further propose hybrid pretraining scheme for BSpell combining word level and character level masking. Utilizing this pretraining scheme, BSpell achieves 91.5 accuracy on real life Bengali spelling correction validation set. Detailed comparison on two Bengali and one Hindi spelling correction dataset shows the superiority of proposed BSpell over existing spell checkers.
READ FULL TEXT