Utilizing Word Embeddings based Features for Phylogenetic Tree Generation of Sanskrit Texts


Tracing the root of a text, i.e., its original version, by inferring phylogenetic trees has been a topic of interest in philological studies. However, existing methods suffer from meaning conflation deficiency because they rely on lexical similarity-based measures to build the distance matrix that is fed to clustering algorithms. In this pilot study, we use word embeddings as features to compute inter-manuscript distances and thereby provide an effective distance matrix for inferring phylogenetic trees. We conduct experiments on the historical Sanskrit text Kāśikāvṛtti (KV) and infer phylogenetic trees using this approach. For comparison, we also develop baseline methods that infer phylogenetic trees for KV using lexical distance-based measures. We show that our methodology produces better trees than the baseline methods, in that closely related manuscripts are grouped together.
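As a rough illustration of the pipeline described above (embeddings, then inter-manuscript distances, then a tree), the following minimal sketch may help. It is not the method evaluated in the paper: the mean-pooling of word vectors, the cosine distance, the average-linkage clustering, the toy two-dimensional embeddings, and the placeholder tokens `w1`, `w2`, `w3` and manuscript names `M1`, `M2`, `M3` are all assumptions made purely for illustration.

```python
import math
from itertools import combinations

# Hypothetical toy embeddings; real ones would be trained on a Sanskrit corpus.
TOY_EMBEDDINGS = {
    "w1": [1.0, 0.0],
    "w2": [0.9, 0.1],
    "w3": [0.0, 1.0],
}


def manuscript_vector(tokens, embeddings):
    """Mean-pool the embeddings of known tokens (an assumed pooling choice)."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        raise ValueError("no token in this manuscript has an embedding")
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def distance_matrix(manuscripts, embeddings):
    """Map each unordered manuscript pair to its embedding-based distance."""
    vecs = {name: manuscript_vector(toks, embeddings)
            for name, toks in manuscripts.items()}
    return {frozenset((p, q)): cosine_distance(vecs[p], vecs[q])
            for p, q in combinations(vecs, 2)}


def average_linkage_tree(names, dist):
    """Agglomerative clustering with average linkage; returns nested tuples."""
    clusters = [(name, [name]) for name in names]  # (subtree, leaf names)
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            leaves_i, leaves_j = clusters[i][1], clusters[j][1]
            # Average of all leaf-to-leaf distances between the two clusters.
            d = sum(dist[frozenset((p, q))]
                    for p in leaves_i for q in leaves_j)
            d /= len(leaves_i) * len(leaves_j)
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]


# Three hypothetical manuscripts given as token lists.
manuscripts = {
    "M1": ["w1", "w1", "w2"],
    "M2": ["w1", "w2", "w2"],
    "M3": ["w3", "w3", "w2"],
}
dist = distance_matrix(manuscripts, TOY_EMBEDDINGS)
tree = average_linkage_tree(list(manuscripts), dist)
print(tree)
```

In this toy setup `M1` and `M2` share most of their vocabulary, so they are merged first and `M3` joins last. A lexical baseline could be sketched the same way by replacing `distance_matrix` with, for example, an edit-distance computation, while keeping the clustering step unchanged.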

In 6th International Sanskrit Computational Linguistics Symposium (ISCLS 2019).