"Keep Your Dimensions on a Leash": True Cognate Detection using Siamese Deep Neural Networks

Abstract

Automatic Cognate Detection helps NLP tasks of Machine Translation, Information Retrieval, and Phylogenetics. Cognate words are defined as word pairs across languages which exhibit partial or full lexical similarity and mean the same (e.g., hund-hound in German-English). In this paper, we use a Siamese Feed-forward neural network with word-embeddings to detect such word pairs. Our experiments with various embedding dimensions show larger embedding dimensions can only be used for large corpora sizes for this task. On a dataset built using linked Indian Wordnets, our approach beats the baseline approach with a significant margin (up to 71%) with the best F-score of 0.85% on the Hindi-Gujarati language pair.

Publication
In ACM India Joint International Conference on Data Science and Management of Data