Using Multilingual Topic Models for Improved Alignment in English-Hindi MT

Diptesh Kanojia, Aaditya Joshi, Pushpak Bhattacharyya, Mark J. Carman

January 2015

Abstract

Parallel corpora are often injected with bilingual dictionaries to improve Indian language machine translation (MT). In the absence of such dictionaries, a coarse dictionary may be required. This paper demonstrates the use of a multilingual topic model for creating coarse dictionaries for English-Hindi MT. We compare our approaches with: (a) a baseline with no additional dictionary injection, and (b) a corpus injected with a good-quality dictionary. Our results show that the existing Cartesian product approach, which is used to create the pseudo-parallel data, results in a degradation on tourism and health datasets for English-Hindi MT. Our paper points to the fact that the existing Cartesian approach using multilingual topics (devised for European languages) may be detrimental for Indian language MT. On the other hand, we present an alternative ‘sentential’ approach that leads to a slight improvement. However, our sentential approach (using a parallel corpus injected with a coarse dictionary) outperforms a system trained using a parallel corpus and a good-quality dictionary.

Type

Conference paper

Publication

Twelfth International Conference on Natural Language Processing