Understanding human language plays a pivotal role in creating intelligent systems. With that in view, Dr. Banerjee's research spans areas that bring together machine learning (ML) and natural language processing (NLP): biomedical knowledge discovery for better healthcare, and misinformation analysis.

The way we use language varies a lot, depending on the what (content), why (intent), who (speaker/writer and audience), and how (style). Based on these, we can understand natural language communication, and learn from it. Intelligent systems built using a deeper understanding of human language can be employed for immense social good.

  1. The language used in medical research, for instance, is highly specialized as it is meant for technical comprehension by other researchers in that field; but a system capable of understanding it, and extracting useful information from it, can help healthcare practitioners and patients.

  2. Quotidian language use, on the other hand, is dictated by various aspects of individual and collective human behavior. An intelligent system can be refined to provide better help to individuals and/or social groups by interpretation of (and inference from) language.

LAIR, headed by Dr. Banerjee, delves into these aspects of information use and their implications for society, working toward elegant, streamlined solutions with practical and beneficial social applications.

Current Research Projects

I. Tracking semantic change in medical information

Healthcare-related misinformation is a severe threat to society. This is a major area of current research, carried out with support from the National Science Foundation (NSF). A first step in this direction was to identify what information is worth checking for credibility [Zuo, Karakas, and Banerjee; 2018, Zuo, Karakas, and Banerjee; 2019]. Subsequent work studied whether and how medical information changes across three genres: research literature, traditional news media, and social media.

II. Industry project: Financial information extraction

This is a multi-year project focused on fine-grained document-type classification and information extraction from complex financial documents of various types. This work is being carried out jointly with (and with support from) Broadridge Financial Solutions, Inc.

Past Research Projects

Deception in language

Language is a medium for conveying information, but unfortunately, it is also often used to deceive. We have explored the detection of such language in online reviews, and discovered that the stylometric aspects of language play an important role in exposing the deceptive intent of writers [Feng, Banerjee, and Choi; 2012a]. We have also carried out experiments on the process of creating such language by investigating the differences in how people type when writing truthful as opposed to deceptive texts, and revealed interesting parallels between typing patterns and speech patterns when people lie [Banerjee et al. 2014]. On a related note, we also investigated stylometric aspects of language to identify the traits of individual writers, and were therefore able to develop algorithms to identify the authors even in highly formal writing [Feng, Banerjee, and Choi; 2012b].


  • Ritwik Banerjee, Song Feng, Jun S. Kang, and Yejin Choi (Computer Science, Stony Brook University)

Research Products

  • Song Feng, Ritwik Banerjee, and Yejin Choi. 2012. Characterizing Stylistic Elements in Syntactic Structure. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pp. 1522 - 1533, Jeju Island, Korea. Association for Computational Linguistics.

  • Song Feng, Ritwik Banerjee, and Yejin Choi. 2012. Syntactic Stylometry for Deception Detection. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Vol. 2: Short Papers), pp. 171 - 175, Jeju Island, Korea. Association for Computational Linguistics.

  • Ritwik Banerjee, Song Feng, Jun Seok Kang, and Yejin Choi. 2014. Keystroke Patterns as Prosody in Digital Writings: A Case Study with Deceptive Reviews and Essays. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1469 - 1473, Doha, Qatar. Association for Computational Linguistics.

    • The dataset contains truthful and deceptive writings from two domains: business reviews, and essays on two topics of social interest (gun control and gay marriage). The data is available for download as compressed tar.bz2 files:

The uncompressed dataset consists of files with tab-separated values. The key log data is found in the last column, titled ReviewMeta. This field has a list of KeyUp, KeyDown and MouseUp event logs. Note that the first event timestamp is not always zero. The event logs have the following formats:

[timestamp] KeyUp/KeyDown [javascript keycode]

[timestamp] MouseUp [begin-index] [end-index]
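The two event formats above can be read with a small parser. The following is a minimal sketch, not the code used in the study. It assumes that the fields of each event are separated by whitespace, that timestamps are numeric, and that the ReviewMeta field has already been split into individual event records (the delimiter between events is not stated here). The names LogEvent, parse_event, and rebase_timestamps are hypothetical. Because the first timestamp is not always zero, the sketch also shows how timestamps could be shifted so that the first event starts at zero.

```python
from dataclasses import dataclass, replace
from typing import List, Optional


@dataclass
class LogEvent:
    timestamp: float                # value as logged; units are not stated in the source
    kind: str                       # "KeyUp", "KeyDown", or "MouseUp"
    keycode: Optional[int] = None   # JavaScript keycode (key events only)
    begin: Optional[int] = None     # begin-index (MouseUp only)
    end: Optional[int] = None       # end-index (MouseUp only)


def parse_event(text: str) -> LogEvent:
    """Parse one event record, assuming whitespace-separated fields."""
    parts = text.split()
    if len(parts) == 3 and parts[1] in ("KeyUp", "KeyDown"):
        return LogEvent(float(parts[0]), parts[1], keycode=int(parts[2]))
    if len(parts) == 4 and parts[1] == "MouseUp":
        return LogEvent(float(parts[0]), parts[1],
                        begin=int(parts[2]), end=int(parts[3]))
    raise ValueError(f"unrecognized event record: {text!r}")


def rebase_timestamps(events: List[LogEvent]) -> List[LogEvent]:
    """Return copies of the events, shifted so the first one is at time zero."""
    if not events:
        return []
    t0 = events[0].timestamp
    return [replace(e, timestamp=e.timestamp - t0) for e in events]


# Made-up records for illustration only; they are not taken from the dataset.
sample = ["120 KeyDown 72", "180 KeyUp 72", "900 MouseUp 3 7"]
events = rebase_timestamps([parse_event(s) for s in sample])
```

Keeping the parsing step separate from the rebasing step leaves the original timestamps available in case absolute values are needed later.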

Semantic textual similarity in clinical notes

In the modern world of increasingly digitized healthcare infrastructure, clinical notes and other reports are maintained digitally. They are, however, almost always manually created. This has resulted in pervasive copy-paste actions across the board, leading to an immense amount of redundant information in such notes and reports. Using an ensemble of traditional ontology-based methods and state-of-the-art neural networks, a lightweight but highly accurate system was developed to detect semantic duplication and similarity in clinical texts [Salek Faramarzi et al. 2022].


  • Ritwik Banerjee, Noushin Salek Faramarzi, and Akanksha Dara (Computer Science, Stony Brook University)

Research Products

Authorship in multi-author documents

This project identified the authorship of, as well as authorship changes in, multi-author documents [Zuo, Zhao, and Banerjee; 2019], using neural networks instead of explicit stylometric features.


  • Ritwik Banerjee, Chaoyuan Zuo, and Yu Zhao (Computer Science, Stony Brook University)

Research Products

  • Chaoyuan Zuo, Yu Zhao, and Ritwik Banerjee. 2019. Style Change Detection with Feed-forward Neural Networks. In Working Notes of CLEF 2019 – Conference and Labs of the Evaluation Forum, Vol. 2380, Lugano, Switzerland. Central Europe Workshop Proceedings.

Literature-based medical relation inference

Due to the sheer volume of new research taking place, it is nearly impossible for practicing physicians to keep up with the latest findings. This calls for automated knowledge discovery from research literature. Even though the problem is extremely difficult because of the complex and highly specialized language, an AI-driven solution could automatically keep medical databases updated with state-of-the-art knowledge. Such databases can be combined with clinical notes to improve diagnostics and patient care. Arguably, the most important knowledge in clinical practice is the understanding of the relation between a drug and a disease or symptom. This can be broadly understood in terms of whether a drug is beneficial or harmful for a patient.

As part of his doctoral thesis, Dr. Banerjee designed a global inference system (modeled as a linear programming optimization) based on the pharmacodynamic similarities between drugs. It was shown that even if the system has no prior knowledge of newly studied drug classes, it can identify these drugs as being potentially beneficial. For example, for type-2 diabetes patients, our system identified the class "sodium glucose co-transporter 2" as a potentially beneficial drug class, even though no drug from this category was known a priori to the system. The key novelty in this work was that the similarity between medical entities was computed based on the pharmacological actions of these entities [Banerjee 2015].
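To illustrate the intuition of inferring from similar drugs, here is a minimal sketch that is deliberately much simpler than the system described above. It is not the linear programming formulation from the thesis. It swaps in two assumptions of its own: Jaccard similarity over sets of pharmacological actions, and a similarity-weighted average of known labels (+1 for beneficial, -1 for harmful). The function names, drug names, and action names are hypothetical placeholders, not real data.

```python
from typing import Dict, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity between two sets of pharmacological actions."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def benefit_score(candidate_actions: Set[str],
                  known_actions: Dict[str, Set[str]],
                  known_labels: Dict[str, int]) -> float:
    """Similarity-weighted average of known labels (+1 beneficial, -1 harmful).

    Returns 0.0 when the candidate shares no action with any known drug.
    """
    weighted = 0.0
    total = 0.0
    for name, actions in known_actions.items():
        sim = jaccard(candidate_actions, actions)
        weighted += sim * known_labels[name]
        total += sim
    return weighted / total if total > 0 else 0.0


# Hypothetical placeholder data, for illustration only.
known_actions = {
    "drug_a": {"action_x", "action_y"},
    "drug_b": {"action_z"},
}
known_labels = {"drug_a": 1, "drug_b": -1}

# An unseen drug that shares two actions with drug_a and none with drug_b.
score = benefit_score({"action_x", "action_y", "action_w"},
                      known_actions, known_labels)
```

A positive score suggests that the unseen drug resembles drugs already known to be beneficial. A global optimization, such as the one described above, can additionally enforce consistency across many drug-disease pairs at once, which this pairwise sketch does not attempt.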


  • Ritwik Banerjee, I. V. Ramakrishnan, and Yejin Choi (Computer Science, Stony Brook University)

Research Products

NLP for precision healthcare informatics

A problem in healthcare is that, in spite of the recent focus on precision medicine, much of the relevant data is not patient-specific. Corroborating relevant information and discarding the rest thus remains the manual endeavor of clinicians. This is a rather complex problem with several aspects to it: laboratory tests, prescription drugs, diet, etc. We developed AI-driven systems that can distill patient-specific information from large amounts of natural language data as well as structured databases. This has led to automatic recommendation of the most relevant laboratory tests for a patient, depending on the precise circumstances [Banerjee et al. 2014], and personalized identification of adverse drug reactions and attribution of a patient's symptoms to their drug regimen [Banerjee et al. 2015].


  • Ritwik Banerjee, I. V. Ramakrishnan, Yejin Choi, Gaurav Piyush, and Ameya Naik (Department of Computer Science, Stony Brook University)

  • Mark Henry and Matthew Perciavall (School of Medicine, Stony Brook University)

Research Products

Analysis of network outages

An important problem pertaining to securing trustworthy communication is to maintain the ability to communicate. This ability is susceptible to adversarial attacks, especially in today's Internet-driven society. The technical strategies required to thwart such attacks are not an area of research for Dr. Banerjee. However, it was observed that a lot of information about such incidents is usually available as human conversations. This led to a collaboration with researchers in computer networks, to identify and analyze the nature of such breakdowns in cyber communication [Banerjee et al. 2015].


  • Ritwik Banerjee, Yejin Choi, Abbas Razaghpanah, Luis Chiang, and Phillipa Gill (Computer Science, Stony Brook University)

  • Vyas Sekar (Electrical & Computer Engineering, Carnegie Mellon University)

Research Products