Publications | Kanojia, Diptesh

Utilizing Weak Supervision to Create S3D: A Sarcasm Annotated Dataset

Sarcasm is prevalent in all corners of social media, posing many challenges within Natural Language Processing (NLP), particularly for …

Jordan Painter, Helen Treharne, Diptesh Kanojia

PDF Code Dataset Slides

Findings of the WMT 2022 Shared Task on Quality Estimation

We report the results of the WMT 2022 shared task on Quality Estimation, in which the challenge is to predict the quality of the output …

Chrysoula Zerva, Frédéric Blain, Ricardo Rei, Piyawat Lertvittayakumjorn, José G. C. de Souza, Steffen Eger, Diptesh Kanojia, Duarte Alves, Constantin Orăsan, Marina Fomicheva, André F. T. Martins, Lucia Specia

PDF Dataset Slides Source Document

Findings of the WMT 2022 Shared Task on Automatic Post-Editing

We present the results from the 8th round of the WMT shared task on MT Automatic Post-Editing, which consists in automatically …

Pushpak Bhattacharyya, Rajen Chatterjee, Markus Freitag, Diptesh Kanojia, Matteo Negri, Marco Turchi

PDF Code Dataset Slides

Harnessing Abstractive Summarization for Fact-Checked Claim Detection

Social media platforms have become new battlegrounds for anti-social elements, with misinformation being the weapon of choice. …

Varad Bhatnagar, Diptesh Kanojia, Kameswari Chebrolu

Preprint PDF Code Dataset Slides Video

PLOD: An Abbreviation Detection Dataset for Scientific Documents

The detection and extraction of abbreviations from unstructured texts can help to improve the performance of Natural Language …

Leonardo Zilio, Hadeel Saadany, Prashant Sharma, Diptesh Kanojia, Constantin Orăsan

Preprint PDF Code Dataset Slides Video

HiNER: A Large Hindi Named Entity Recognition Dataset

Named Entity Recognition (NER) is a foundational NLP task that aims to provide class labels like Person, Location, Organisation, Time, …

Rudra Murthy, Pallab Bhattacharjee, Rahul Sharnagat, Jyotsana Khatri, Diptesh Kanojia, Pushpak Bhattacharyya

Preprint PDF Code Dataset Poster Video

SURREY-CTS-NLP at WASSA2022: An Experiment of Discourse and Sentiment Analysis for the Prediction of Empathy, Distress and Emotion

This paper summarises the submissions our team, SURREY-CTS-NLP, has made for the WASSA 2022 Shared Task for the prediction of empathy, …

Shenbin Qian, Constantin Orăsan, Diptesh Kanojia, Hadeel Saadany, Félix Do Carmo

PDF

An Ensemble Approach to Acronym Extraction using Transformers

Acronyms are abbreviated units of a phrase constructed by using initial components of the phrase in a text. Automatic extraction of …

Prashant Sharma, Hadeel Saadany, Leonardo Zilio, Diptesh Kanojia, Constantin Orăsan

Preprint PDF Code

Automated Evidence Collection for Fake News Detection

Fake news, misinformation, and unverifiable facts on social media platforms propagate disharmony and affect society, especially when …

Mrinal Rawat, Diptesh Kanojia

Preprint PDF Code Dataset Slides Video

Pushing the Right Buttons: Adversarial Evaluation of Quality Estimation

Current Machine Translation (MT) systems achieve very good results on a growing variety of language pairs and datasets. However, they …

Diptesh Kanojia, Marina Fomicheva, Tharindu Ranasinghe, Frédéric Blain, Constantin Orăsan, Lucia Specia

Preprint PDF Code Dataset Slides Video

'So You Think You’re Funny?': Rating the Humour Quotient in Standup Comedy

Computational Humour (CH) has attracted the interest of Natural Language Processing and Computational Linguistics communities. Creating …

Anirudh Mittal, Pranav Jeevan, Prerak Gandhi, Diptesh Kanojia, Pushpak Bhattacharyya

Preprint PDF Code Dataset Poster Slides Video

FrameNet-assisted Noun Compound Interpretation

Given a noun compound (NC), we address the problem of predicting the appropriate semantic label linking the constituents of the NC. …

Girishkumar Ponkiya, Diptesh Kanojia, Pushpak Bhattacharyya, Girish Palshikar

PDF Dataset

Cognition-aware Cognate Detection

Automatic detection of cognates helps downstream NLP tasks of Machine Translation, Cross-lingual Information Retrieval, Computational …

Diptesh Kanojia, Prashant Sharma, Sayali Ghodekar, Pushpak Bhattacharyya, Gholamreza Haffari, Malhar Kulkarni

Preprint PDF Code Poster Slides Video

Harnessing Cross-lingual Features to Improve Cognate Detection for Low-resource Languages

Cognates are variants of the same lexical form across different languages; for example ‘fonema’ in Spanish and …

Diptesh Kanojia, Raj Dabre, Shubham Dewangan, Pushpak Bhattacharyya, Gholamreza Haffari, Malhar Kulkarni

Preprint PDF Dataset Slides Video

Happy Are Those Who Grade without Seeing: A Multi-Task Learning Approach to Grade Essays Using Gaze Behaviour

The gaze behaviour of a reader is helpful in solving several NLP tasks such as automatic essay grading. However, collecting gaze …

Sandeep Mathias, Rudra Murthy, Diptesh Kanojia, Abhijit Mishra, Pushpak Bhattacharyya

Preprint PDF Dataset Video

Cognitively Aided Zero-Shot Automatic Essay Grading

Automatic essay grading (AEG) is a process in which machines assign a grade to an essay written in response to a topic, called the …

Sandeep Mathias, Rudra Murthy, Diptesh Kanojia, Pushpak Bhattacharyya

Preprint PDF Dataset Slides

A Survey on Using Gaze Behaviour for Natural Language Processing

Gaze behaviour has been used as a way to gather cognitive information for a number of years. In this paper, we discuss the use of gaze …

Sandeep Mathias, Diptesh Kanojia, Abhijit Mishra, Pushpak Bhattacharyya

Preprint PDF Poster Slides Video

Recommendation Chart of Domains for Cross-Domain Sentiment Analysis: Findings of A 20 Domain Study

Cross-domain sentiment analysis (CDSA) helps to address the problem of data scarcity in scenarios where labelled data for a domain …

Akash Sheoran, Diptesh Kanojia, Aditya Joshi, Pushpak Bhattacharyya

Preprint PDF Dataset

Challenge Datasets of Cognate and False Friend Pairs for Indian Languages

Cognates are present in multiple variants of the same text across different languages (e.g., hund in German and hound in English …

Diptesh Kanojia, Pushpak Bhattacharyya, Malhar Kulkarni, Gholamreza Haffari

Preprint PDF Dataset

"A Passage to India": Pre-trained Word Embeddings for Indian Languages

Dense word vectors, or ‘word embeddings’, which encode semantic properties of words, have now become integral to NLP tasks …

Kumar Saurav, Kumar Saunack, Diptesh Kanojia, Pushpak Bhattacharyya

Preprint PDF Code Dataset

Strategies of Effective Digitization of Commentaries and Sub-commentaries: Towards the Construction of Textual History

This paper describes additional aspects of a digital tool called the ‘Textual History Tool’. We describe its various salient features …

Diptesh Kanojia, Malhar Kulkarni, Sayali Ghodekar, Eivind Kahrs, Pushpak Bhattacharyya

Preprint PDF Project Slides

Harnessing Deep Cross-lingual Word Embeddings to Infer Accurate Phylogenetic Trees

Establishing language relatedness by inferring phylogenetic trees has been a topic of interest in the area of diachronic linguistics. …

Yashasvi Mantha, Diptesh Kanojia, Pushpak Bhattacharyya, Malhar Kulkarni

PDF Poster

"Keep Your Dimensions on a Leash": True Cognate Detection using Siamese Deep Neural Networks

Automatic Cognate Detection helps NLP tasks of Machine Translation, Information Retrieval, and Phylogenetics. Cognate words are defined …

Diptesh Kanojia, Sravan Munukutla, Sayali Ghodekar, Pushpak Bhattacharyya, Malhar Kulkarni

PDF Code Dataset Poster

Utilizing Word Embeddings based Features for Phylogenetic Tree Generation of Sanskrit Texts

Tracing the root of a text i.e., the original version of the text, by inferring phylogenetic trees has been a topic of interest in …

Diptesh Kanojia, Abhijeet Dubey, Malhar Kulkarni, Pushpak Bhattacharyya, Gholamreza Haffari

PDF Slides

Utilizing Wordnets for Cognate Detection among Indian Languages

Automatic Cognate Detection (ACD) is a challenging task which has been utilized to help NLP applications like Machine Translation, …

Diptesh Kanojia, Kevin Patel, Pushpak Bhattacharyya, Malhar Kulkarni, Gholamreza Haffari

Preprint PDF Slides

An Introduction to the Textual History Tool

This paper describes a digital tool called the Textual History Tool in detail. This tool captures the historical evolution of a text …

Diptesh Kanojia, Malhar Kulkarni, Pushpak Bhattacharyya, Sayali Ghodekar, Irawati Kulkarni, Nilesh Joshi, Eivind Kahrs

PDF Project Slides

Some Strategies to Capture Karaka-Yogyata with Special Reference to apadana

In today’s digital world, language technology has gained importance. Several software tools have been developed and are available in the …

Swaraja Salaskar, Diptesh Kanojia, Malhar Kulkarni

Preprint PDF Poster

Cognate Identification to improve Phylogenetic trees for Indian Languages

Cognates are present in multiple variants of the same text across different languages. Computational Phylogenetics uses algorithms and …

Diptesh Kanojia, Malhar Kulkarni, Pushpak Bhattacharyya, Gholamreza Haffari

PDF Poster Slides

Synthesizing Audio for Hindi Wordnet

In this paper, we describe our work on the creation of a voice model using a speech synthesis system for the Hindi Language. We use …

Diptesh Kanojia, Preethi Jyothi, Pushpak Bhattacharyya

PDF Poster

Semi-automatic WordNet Linking using Word Embeddings

Wordnets are rich lexico-semantic resources. Linked wordnets are extensions of wordnets, which link similar concepts in wordnets of …

Kevin Patel, Diptesh Kanojia, Pushpak Bhattacharyya

Preprint PDF Code Dataset Slides

pyiwn: A Python-based API to access Indian Language WordNets

Indian language WordNets have their individual web-based browsing interfaces along with a common interface for IndoWordNet. These …

Ritesh Panjwani, Diptesh Kanojia, Pushpak Bhattacharyya

PDF Code Poster

New Vistas to study Bhartṛhari: Cognitive NLP

A sentence is an important notion in the Indian grammatical tradition. The collection of the definitions of a sentence can be found in …

Jayashree Gajjam, Diptesh Kanojia, Malhar Kulkarni

Preprint PDF Slides

Indian Language Wordnets and their Linkages with Princeton WordNet

Wordnets are rich lexico-semantic resources. Linked wordnets are extensions of wordnets, which link similar concepts in wordnets of …

Diptesh Kanojia, Kevin Patel, Pushpak Bhattacharyya

Preprint PDF Code Dataset Poster

Hindi Wordnet for Language Teaching: Experiences and Lessons Learnt

This paper reports the work related to making Hindi Wordnet available as a digital resource for language learning and teaching, and …

Hanumant Redkar, Rajita Shukla, Sandhya Singh, Jaya Saraswati, Laxmi Kashyap, Diptesh Kanojia, Preethi Jyothi, Malhar Kulkarni, Pushpak Bhattacharyya

PDF Slides

Eyes are the Windows to the Soul: Predicting the Rating of Text Quality Using Gaze Behaviour

Predicting a reader’s rating of text quality is a challenging task that involves estimating different subjective aspects of the …

Sandeep Mathias, Diptesh Kanojia, Kevin Patel, Samarth Agarwal, Abhijit Mishra, Pushpak Bhattacharyya

Preprint PDF Poster

Scanpath Complexity: Modeling Reading Effort using Gaze Information

Measuring reading effort is useful for practical purposes such as designing learning material and personalizing text comprehension …

Abhijit Mishra, Diptesh Kanojia, Seema Nagar, Kuntal Dey, Pushpak Bhattacharyya

PDF Dataset Slides

Sarcasm Suite: A browser-based engine for sarcasm detection and generation

Sarcasm Suite is a browser-based engine that deploys five of our past papers in sarcasm detection and generation. The sarcasm detection …

Aditya Joshi, Diptesh Kanojia, Pushpak Bhattacharyya

PDF Project Slides

Is your Statement Purposeless? Predicting Computer Science Graduation Admission Acceptance based on Statement Of Purpose

We present a quantitative, data-driven machine learning approach to mitigate the problem of unpredictability of Computer Science …

Diptesh Kanojia, Nikhil Wani, Pushpak Bhattacharyya

PDF Slides

That’ll do fine!: A coarse lexical resource for English-Hindi MT, using polylingual topic models

Parallel corpora are often injected with bilingual lexical resources for improved Indian language machine translation (MT). In absence …

Diptesh Kanojia, Aaditya Joshi, Pushpak Bhattacharyya, Mark J. Carman

PDF Poster

Sophisticated Lexical Databases - Simplified Usage: Mobile Applications and Browser Plugins For Wordnets

India is a country with 22 officially recognized languages and 17 of these have WordNets, a crucial resource. Web browser based …

Diptesh Kanojia, Raj Dabre, Pushpak Bhattacharyya

PDF Slides

SlangNet: A WordNet like resource for English Slang

We present a WordNet like structured resource for slang words and neologisms on the internet. The dynamism of language is often an …

Shehzaad Dhuliawala, Diptesh Kanojia, Pushpak Bhattacharyya

PDF Slides

Predicting Readers' Sarcasm Understandability by Modeling Gaze Behavior

Sarcasm understandability or the ability to understand textual sarcasm depends upon readers’ language proficiency, social knowledge, …

Abhijit Mishra, Diptesh Kanojia, Pushpak Bhattacharyya

PDF Dataset Poster

Mapping it differently: A solution to the linking challenges

This paper reports the work of creating bilingual mappings in English for certain synsets of Hindi wordnet, the need for doing this, …

Meghna Singh, Rajita Shukla, Jaya Jha, Laxmi Kashyap, Diptesh Kanojia, Pushpak Bhattacharyya

PDF Slides

Leveraging Cognitive Features for Sentiment Analysis

Sentiments expressed in user-generated short text and sentences are nuanced by subtleties at lexical, syntactic, semantic and pragmatic …

Abhijit Mishra, Diptesh Kanojia, Seema Nagar, Kuntal Dey, Pushpak Bhattacharyya

Preprint PDF Dataset Slides

Harnessing Cognitive Features for Sarcasm Detection

In this paper, we propose a novel mechanism for enriching the feature vector, for the task of sarcasm detection, with cognitive …

Abhijit Mishra, Diptesh Kanojia, Seema Nagar, Kuntal Dey, Pushpak Bhattacharyya

Preprint PDF Dataset Poster

Civique: Using Social Media to detect Urban Emergencies

We present the Civique system for emergency detection in urban areas by monitoring micro blogs like Tweets. The system detects …

Diptesh Kanojia, Vishwajeet Kumar, Krithi Ramamritham

Preprint PDF Poster Slides

A picture is worth a thousand words: Using OpenClipArt library for enriching IndoWordNet

WordNet has proved to be immensely useful for Word Sense Disambiguation, and thence Machine translation, Information Retrieval and …

Diptesh Kanojia, Shehzaad Dhuliawala, Pushpak Bhattacharyya

PDF Slides

World WordNet database structure: an efficient schema for storing information of WordNets of the world

WordNet is an online lexical resource which expresses unique concepts in a language. English WordNet is the first WordNet which was …

Hanumant Harichandra Redkar, Sudha Baban Bhingardive, Diptesh Kanojia, Pushpak Bhattacharyya

PDF Slides

Using Multilingual Topic Models for Improved Alignment in English-Hindi MT

Parallel corpora are often injected with bilingual dictionaries for improved Indian language machine translation (MT). In absence of …

Diptesh Kanojia, Aaditya Joshi, Pushpak Bhattacharyya, Mark J. Carman

PDF Slides

TransChat: Cross-Lingual Instant Messaging for Indian Languages

We present TransChat, an open-source, cross platform, Indian language Instant Messaging (IM) application that facilitates cross lingual …

Diptesh Kanojia, Shehzaad Dhuliawala, Naman Gupta, Abhijit Mishra, Pushpak Bhattacharyya

PDF Poster

PanchBhoota: Hierarchical phrase based machine translation systems for five Indian languages

We present our work on developing fifteen Hierarchical Phrase Based Statistical Machine Translation (HPBSMT) systems for five Indian …

Neha R Prabhugaonkar, Apurva S Nagvenkar, Diptesh Kanojia, Jyoti D. Pawar, Pushpak Bhattacharyya, Manish Shrivastava

PDF Source Document

PaCMan: Parallel Corpus Management Workbench

We present a Parallel Corpora Management tool that aids parallel corpora generation for the task of Machine Translation (MT). It takes …

Diptesh Kanojia, Manish Shrivastava, Raj Dabre, Pushpak Bhattacharyya

PDF Poster

Do not do processing, when you can look up: Towards a Discrimination Net for WSD

The task of Word Sense Disambiguation (WSD) incorporates in its definition the role of ‘context’. We present our work on the …

Diptesh Kanojia, Pushpak Bhattacharyya, Raj Dabre, Siddhartha Gunti, Manish Shrivastava

PDF Slides

More than meets the eye: Study of Human Cognition in Sense Annotation

Word Sense Disambiguation (WSD) approaches have reported good accuracies in recent years. However, these approaches can be classified …

Salil Joshi, Diptesh Kanojia, Pushpak Bhattacharyya

PDF Slides

Discrimination-net for Hindi

Current state-of-the-art Word Sense Disambiguation (WSD) algorithms are mostly supervised and use the P(Sense|Word) statistic for …

Diptesh Kanojia, Arindam Chatterjee, Salil Joshi, Pushpak Bhattacharyya

PDF Video

A Study of the Sense Annotation Process: Man v/s Machine

Does context help determine sense? This question might seem frivolous, even preposterous to anybody sensible. However, our long time …

Arindam Chatterjee, Salil Joshi, Pushpak Bhattacharyya, Diptesh Kanojia, Akhlesh Kumar Meena

PDF Slides