Cognate Identification to improve Phylogenetic trees for Indian Languages

Abstract

Cognates are present in multiple variants of the same text across different languages. Computational phylogenetics uses algorithms and techniques to analyze these variants and infer phylogenetic trees as a hypothesized representation of the relationships among them, based on the output of the computational algorithm used. In our work, we detect cognates among four Indian languages, namely Hindi, Marathi, Punjabi, and Sanskrit, to help build cognate sets for phylogenetic inference. Cognate detection aids phylogenetic inference by helping isolate diachronic sound changes and thereby identify words of common origin. Phylogenetic trees are generally inferred automatically from a cognate set that has been manually annotated with the help of a lexicographer. Our work creates cognate sets for each language pair and infers phylogenetic trees within a Bayesian framework using the maximum likelihood method. We also integrate our work into an online interface and infer phylogenetic trees from automatically detected cognate sets. The online interface creates phylogenetic trees from textual data provided as input. It lets a lexicographer enter data manually, edit the data based on their expert opinion, and eventually create phylogenetic trees using various algorithms, including our method for automatically creating cognate sets. We go on to discuss the nuances of detecting cognates in these Indian languages and the categorization of cognate words, i.e., ‘Tatasama’ and ‘Tadbhava’ words.
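The abstract does not state which features or model are used for cognate detection. Purely as an illustration of pairwise cognate-candidate scoring, the sketch below uses normalized edit distance over romanized word forms, a common baseline that is assumed here and not taken from the paper. The function names, the 0.5 threshold, and the romanized example words are all hypothetical choices made for this sketch.

```python
from typing import List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion from a
                            curr[j - 1] + 1,      # insertion into a
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1]; 1.0 means identical strings."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def candidate_cognates(pairs: List[Tuple[str, str]],
                       threshold: float = 0.5) -> List[Tuple[str, str]]:
    """Keep word pairs whose similarity reaches the (assumed) threshold."""
    return [(a, b) for a, b in pairs if similarity(a, b) >= threshold]


if __name__ == "__main__":
    # Hypothetical romanized Sanskrit/Hindi pairs, for illustration only.
    pairs = [("hasta", "haath"), ("agni", "aag"), ("hasta", "pani")]
    for a, b in pairs:
        print(a, b, round(similarity(a, b), 2))
    print(candidate_cognates(pairs))
```

A surface-similarity score of this kind would tend to favor near-identical borrowings over forms that have undergone more sound change, so it should be read only as a starting point for building candidate cognate sets that a lexicographer can then review.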

Publication
Proceedings of the ACM India Joint International Conference on Data Science and Management of Data