Research

Temporary page for the Language & AI Research (LAIR) group.

Understanding human language plays a pivotal role in creating intelligent systems. With that in view, Dr. Banerjee's research spans areas that bring together machine learning (ML) and natural language processing (NLP): biomedical knowledge discovery for better healthcare, and misinformation analysis.

The way we use language varies a lot, depending on the what (content), why (intent), who (speaker/writer and audience), and how (style). Based on these, we can understand natural language communication, and learn from it. Intelligent systems built using a deeper understanding of human language can be employed for immense social good.

LAIR, headed by Dr. Banerjee, studies these aspects of information use and their implications for society, working toward elegant, streamlined solutions with practical and beneficial social applications.

Current Projects

Healthcare-related misinformation is a severe threat to society. This is a major area of current research, carried out with support from the National Science Foundation (NSF). A first step in this direction was to identify what information is worth checking for credibility [Zuo, Karakas, and Banerjee; 2018, Zuo, Karakas, and Banerjee; 2019]. Subsequent work studied whether and how medical information changes across three genres: research literature, traditional news media, and social media.

II. Industry project: Financial information extraction

This is a multi-year project focused on fine-grained document-type classification and information extraction from complex financial documents of various types. This work is being carried out jointly with (and with support from) Broadridge Financial Solutions, Inc.

Past Projects

Deception in language

Language is a medium for conveying information, but unfortunately, it is also often used to deceive. We have explored the detection of such language in online reviews, and discovered that the stylometric aspects of language play an important role in exposing the deceptive intent of writers [Feng, Banerjee, and Choi; 2012a]. We have also carried out experiments on the process of creating such language by investigating the differences in how people type when writing truthful as opposed to deceptive texts, and revealed interesting parallels between typing patterns and speech patterns when people lie [Banerjee et al. 2012]. On a related note, we also investigated stylometric aspects of language to identify the traits of individual writers, and were therefore able to develop algorithms that identify authors even in highly formal writing [Feng, Banerjee, and Choi; 2012b].

Group

Research Products

The uncompressed dataset consists of files with tab-separated values. The key log data is found in the last column, titled ReviewMeta. This field has a list of KeyUp, KeyDown and MouseUp event logs. Note that the first event timestamp is not always zero. The event logs have the following formats:

[timestamp] KeyUp/KeyDown [javascript keycode]

[timestamp] MouseUp [begin-index] [end-index]
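The format above can be read with a short script. The following is only a minimal sketch, not the group's own tooling, and it rests on several assumptions that are not stated on this page: that the bracketed names are placeholders for non-negative integers, that the tokens of an event are separated by whitespace, and that the file has a header row containing the column name ReviewMeta. The function names (`parse_review_meta`, `rebase`, `read_review_events`) and the `ReviewID` column in the demo are hypothetical. Because the first timestamp is not always zero, the sketch also shifts each log so that it starts at zero.

```python
import csv
import io
import re
from typing import Iterator, List, NamedTuple, Optional

# Assumption: timestamps, keycodes and indices are non-negative integers,
# and the tokens of one event are separated by whitespace.
EVENT_RE = re.compile(
    r"(?P<ts>\d+)\s+"
    r"(?:(?P<key>KeyUp|KeyDown)\s+(?P<code>\d+)"
    r"|MouseUp\s+(?P<begin>\d+)\s+(?P<end>\d+))"
)


class Event(NamedTuple):
    timestamp: int
    kind: str                      # "KeyUp", "KeyDown" or "MouseUp"
    keycode: Optional[int] = None  # JavaScript keycode, key events only
    begin: Optional[int] = None    # selection start, MouseUp only
    end: Optional[int] = None      # selection end, MouseUp only


def parse_review_meta(field: str) -> List[Event]:
    """Extract every event found in one ReviewMeta field, in order."""
    events = []
    for m in EVENT_RE.finditer(field):
        ts = int(m.group("ts"))
        if m.group("key"):
            events.append(Event(ts, m.group("key"), keycode=int(m.group("code"))))
        else:
            events.append(Event(ts, "MouseUp",
                                begin=int(m.group("begin")),
                                end=int(m.group("end"))))
    return events


def rebase(events: List[Event]) -> List[Event]:
    """Shift timestamps so that the first event of the log is at time zero."""
    if not events:
        return []
    t0 = events[0].timestamp
    return [e._replace(timestamp=e.timestamp - t0) for e in events]


def read_review_events(tsv_file) -> Iterator[List[Event]]:
    """Yield one rebased event list per row of a tab-separated file."""
    for row in csv.DictReader(tsv_file, delimiter="\t"):
        yield rebase(parse_review_meta(row["ReviewMeta"]))


# Tiny made-up example; "ReviewID" is a hypothetical column name.
demo = io.StringIO(
    "ReviewID\tReviewMeta\n"
    "r1\t120 KeyDown 65 200 KeyUp 65 350 MouseUp 0 5\n"
)
for log in read_review_events(demo):
    for event in log:
        print(event)
```

If the real files separate events differently (for example with commas or brackets), only `EVENT_RE` should need adjusting. Very long key logs may also exceed the default field size limit of the `csv` module, which can be raised with `csv.field_size_limit`.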

Semantic textual similarity in clinical notes

In the modern world of increasingly digitized healthcare infrastructure, clinical notes and other reports are maintained digitally. They are, however, almost always created manually. This has resulted in pervasive copy-paste actions, leading to an immense amount of redundant information in such notes and reports. Using an ensemble of traditional ontology-based methods and state-of-the-art neural networks, a lightweight but highly accurate system was developed to detect semantic duplication and similarity in clinical texts [Salek Faramarzi et al. 2022].

Authorship in multi-author documents

This project identified the authorship of, as well as authorship changes in, multi-author documents [Zuo, Zhao, and Banerjee; 2019], using neural networks instead of explicit stylometric features.

Literature-based medical relation inference

Due to the sheer volume of new research taking place, it is nearly impossible for practicing physicians to keep up with the latest findings. This calls for automated knowledge discovery from research literature. Even though the problem is extremely difficult because of the complex and highly specialized language, an AI-driven solution could keep medical databases automatically updated with state-of-the-art knowledge. Such databases can be combined with clinical notes to improve diagnostics and patient care. Arguably, the most important knowledge in clinical practice is the understanding of the relation between a drug and a disease or symptom. This can be broadly understood in terms of whether a drug is beneficial or harmful for a patient.

As part of his doctoral thesis, Dr. Banerjee designed a global inference system (modeled as a linear programming optimization) based on the pharmacodynamic similarities between drugs. It was shown that even if the system has no prior knowledge of newly studied drug classes, it can identify these drugs as being potentially beneficial. For example, for type-2 diabetes patients, our system identified the class "sodium glucose co-transporter 2" as potentially beneficial, even though no drug from this class was known to the system a priori. The key novelty in this work was that the similarity between medical entities was computed based on the pharmacological actions of these entities [Banerjee 2015].

NLP for precision healthcare informatics

A problem in healthcare is that in spite of the recent focus on precision medicine, much of the relevant data is not patient-specific, and thus, corroborating relevant information and discarding the rest remains the manual endeavor of clinicians. This is a rather complex problem with several aspects to it: laboratory tests, prescription drugs, diet, etc. We developed AI-driven systems that can distill patient-specific information from large amounts of natural language data as well as structured databases. This has led to automatic recommendation of the most relevant laboratory tests for a patient, depending on the precise circumstances [Banerjee et al. 2014], and to personalized identification of adverse drug reactions and attribution of a patient's symptoms to their drug regimen [Banerjee et al. 2015].

Analysis of network outages

An important problem in securing trustworthy communication is maintaining the ability to communicate at all. This ability is susceptible to adversarial attacks, especially in today's Internet-driven society. The technical strategies required to thwart such attacks are not an area of research for Dr. Banerjee. However, it was observed that a lot of information about such incidents is usually available as human conversations. This led to a collaboration with researchers in computer networks to identify and analyze the nature of such breakdowns in cyber communication [Banerjee et al. 2015].
