Hi! I am currently a Research Engineer (on contract) at Google Research India, working with Dr. Partha Talukdar in the NLU Group. Here, I am working on entity-aware translation of educational videos to create subtitles for a host of Indian languages.
Before this, I spent an amazing year at Microsoft Research India, and worked with Dr. Monojit Choudhury and Dr. Kalika Bali on various problems in low-resource language systems and probing NLI models.
I did my undergraduate thesis at Microsoft Research under Navin Goyal and Monojit Choudhury, on semantic parsing applied to the conversion of natural language to regular expressions and SQL. I graduated with a B.E. (Hons.) in Computer Science from BITS Pilani, Goa, India, in 2019.
-- Creating a scalable entity-aware translation+transliteration pipeline to generate subtitles for English-medium college lecture videos (from NPTEL) for various Indian languages.
-- Probed large pretrained language models for NLI reasoning by designing a taxonomy of reasoning capabilities, annotating an existing NLI dataset based on those capabilities, and then evaluating model performance on this new dataset.
-- Worked on a range of multilingual problems, such as quantitative analysis of language diversity in ACL, efficacy of code-mixing chatbots, measuring quality of crowdsourced speech data, among others.
-- Works published at ACL, CoNLL, CSCW, LREC, EMNLP Workshop, and ICON (see publications below).
-- Worked on semantic parsing and its applications to natural language to code. Particularly focused on natural language to regular expressions (NL2Regex), and to SQL queries.
Implemented and experimented with different neural architectures for NL2Regex, and conducted an error analysis.
Subsequently, I also analyzed the quality of various semantic parsing datasets using a range of metrics and indicators.
-- Research carried out also contributed to my undergraduate thesis.
-- Created a voice interface application for truck drivers. Used pocketsphinx-android from CMUSphinx to power voice recognition.
Customized it to recognize key commands in English, Hindi, and Indian-English.
-- Created capabilities like an in-built support system, location and route support, and verbal data entry.
-- Presented the prototype to the I-Loads administration (CEO, CFO, CTO) and tech team.
The State and Fate of Linguistic Diversity and Inclusion in the NLP World
, Sebastin Santy*, Amar Budhiraja*, Kalika Bali, Monojit Choudhury
Annual Conference of the Association for Computational Linguistics (ACL) 2020
website | pdf | abstract
Coverage/Media Mentions: NLP with Friends | Quartz | Ruder's Blog | LacunaFund | Underrated ML | NLP Newsletter | SIGTYP Newsletter
Do Multilingual Users Prefer Chat-bots that Code-mix? Let's Nudge and Find Out!
Anshul Bawa, Pranav Khadpe, , Kalika Bali and Monojit Choudhury
ACM Conference on Computer Supported Cooperative Work and Social Computing (CSCW) 2020
Crowdsourcing Speech Data for Low-Resource Languages from Low-Income Workers
Basil Abraham, Danish Goel, Divya Siddarth, Kalika Bali, Manu Chopra, Monojit Choudhury, , Preethi Jyoti, Sunayana Sitaram and Vivek Seshadri (Alphabetically Ordered)
International Conference on Language Resources and Evaluation (LREC) 2020
Unsung Challenges of Building and Deploying Language Technologies for Low Resource Language Communities
, Christain Barnes, Sebastin Santy, Simran Khanuja, Sanket Shah, Anirudh Srinivasan, Satwik Bhattamishra, Sunayana Sitaram, Monojit Choudhury and Kalika Bali
International Conference on Natural Language Processing (ICON) 2019
pdf | abstract | slides