Understanding human language plays a pivotal role in creating intelligent systems. With that in view, Banerjee's research spans multiple areas that bring together machine learning (ML) and natural language processing (NLP): biomedical knowledge discovery for better healthcare, misinformation analysis, and linguistics for security and privacy.

Language use varies a lot depending on the what (content), why (intent), who (speaker/writer and audience), and how (style). Natural language understanding can be improved based on these insights, and intelligent systems built using a deeper understanding of human language can be employed for immense social good. The language used in medical research, for instance, is highly specialized as it is meant for technical comprehension by other researchers in that field; but a system capable of understanding it, and extracting useful information from it, can help healthcare practitioners and patients. Quotidian language use, on the other hand, is dictated by various aspects of individual and collective human behavior. An intelligent system can be refined to provide better help to individuals and/or social groups by interpretation of (and inference from) language.

Prospective students: please see the section at the bottom of this page.

Current Research Projects

Privacy Compliance of Live Medical Data

Developers face hurdles understanding complex privacy laws in handling medical data. Our project seeks to simplify legal jargon, aiding app design for data safety. We aim to create tools offering transparency and user control in health apps, closing the knowledge gap for a secure digital health landscape.

(Project page under development)

Information Extraction in Clinical Nephrology

Chronic Kidney Disease (CKD) poses significant health risks, necessitating the identification of new early-stage risk factors. Major depression and post-traumatic stress disorder (PTSD), which are often linked, may be related to CKD progression. This study applies supervised machine learning and natural language processing to clinical notes and radiology reports to predict the connection between PTSD and CKD.

(Project page under development)

Fallacious Argumentation in Information Disorder

This study targets the recognition of subtle misinformation that propagates through manipulative argumentation in the language of contemporary information ecosystems. It aims to develop computational models that detect such manipulative and fallacious argumentation, expanding beyond explicit falsehoods to address epistemic corruption.

(Project page under development)

Past Research Projects

Tracking Semantic Change in Medical Information

This project created algorithms and datasets to distinguish unchanged medical information from its distorted versions circulating online. We achieved this by identifying semantic differences between the information found in news articles and the corresponding information in the original scientific publications.

Project Page

Extraction & Classification of Financial Information

A multi-year project focused on fine-grained document-type classification and information extraction from complex financial texts applied to scalable automation of a broad range of tasks.

(Technical details of this research are proprietary)

Semantic Similarity of Clinical Texts

Hospitals amass crucial textual data for healthcare, often in disorganized forms within Electronic Health Record (EHR) systems. Measuring semantic textual similarity (STS) between clinical texts mitigates the associated problems by streamlining data and reducing redundancy, while preserving valuable information and highlighting new information.
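As a rough illustration of what a similarity score between two clinical sentences can look like, here is a minimal sketch based on bag-of-words cosine similarity. It is not the method developed in this project; it is a simple word-overlap baseline, and the function names `tokenize` and `cosine_similarity` as well as the example sentences are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words count vectors, in [0, 1]."""
    counts_a = Counter(tokenize(text_a))
    counts_b = Counter(tokenize(text_b))
    # Only tokens shared by both texts contribute to the dot product.
    shared = counts_a.keys() & counts_b.keys()
    dot = sum(counts_a[t] * counts_b[t] for t in shared)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


# Hypothetical example sentences (not taken from any real record).
note_1 = "Patient reports mild chest pain after exercise."
note_2 = "Patient reports chest pain following exercise."
score = cosine_similarity(note_1, note_2)
```

A word-overlap score like this treats "after" and "following" as unrelated, which is one reason semantic similarity, rather than lexical overlap, is the quantity of interest for clinical texts.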

Project Page

Literature-based Medical Knowledge Discovery

This research designed an AI-driven solution, and developed a prototype system, to automate updating medical databases with new findings. It uses pharmacodynamic similarities between drugs to identify potentially beneficial drugs and drug categories for specific diseases, symptoms, or syndromes, even when the model had no prior knowledge of such drugs during training.

Project Page

Deception in Language

We think of language as a medium of conveying information, but unfortunately, it is also often used to deceive. This research explored the detection of such deception in online reviews using deep interpretable linguistic properties.

Project Page

Forensic Linguistics

Research into "idiolects" (the unique use of language by individuals) to identify authors, including the authorship of collaborative multi-author documents.
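To make the idea of an idiolect concrete, the sketch below compares texts by how often they use a few common function words, a classic stylometric signal. It is only an illustrative toy under stated assumptions, not the approach used in this research: the word list `FUNCTION_WORDS`, the helper names `style_profile` and `closest_author`, and the sample texts are all hypothetical.

```python
import re
from collections import Counter

# A small, hypothetical list of English function words.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "is", "it", "but"]


def style_profile(text):
    """Relative frequency of each function word in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty text
    return [counts[w] / total for w in FUNCTION_WORDS]


def closest_author(unknown_text, known_texts):
    """Return the author whose profile is nearest (Manhattan distance)."""
    target = style_profile(unknown_text)

    def distance(author):
        profile = style_profile(known_texts[author])
        return sum(abs(x - y) for x, y in zip(target, profile))

    return min(known_texts, key=distance)


# Hypothetical writing samples keyed by author name.
samples = {
    "author_a": "The cause of the outage is the router, and it is old.",
    "author_b": "But to fix that in a day is hard, but to wait is worse.",
}
guess = closest_author("The age of the router is the issue.", samples)
```

Realistic authorship analysis needs far longer samples and richer features; collaborative multi-author documents are harder still, because a single profile per document no longer applies.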

Project Page

Personalized Healthcare

Healthcare faces a challenge in utilizing patient-specific data for personalized medicine. The AI developed in this research extracts relevant patient details from diverse data sources, facilitating tailored lab tests, detecting drug reactions, and linking symptoms to safe medications.

Project Page

Network Outage Analysis

A collaboration with researchers in computer networks to extract network outage information and develop supervised machine learning approaches that categorize outages into multiple causal categories.

Project Page

Prospective students

If you are not a student at Stony Brook University and want to join Banerjee's research team, please understand that Banerjee may be unable to respond to individual emails or queries regarding research positions. If you are interested in joining his research team as part of your M.S. or Ph.D., please apply to one of the following graduate programs offered by the Department of Computer Science at Stony Brook University:

Students at Stony Brook University

Ph.D. students

If you are a current Ph.D. student in the Department of Computer Science at Stony Brook University, and have a strong background in

feel free to contact Banerjee via email about potential opportunities within his research group.

Ph.D. students are expected to have the ability to formulate a research problem that they want to pursue. Please be prepared to start your journey by presenting a convincing proposal for your research.

Undergraduate students

Research opportunities in the form of CSE 487 or CSE 495/496 exist for exceptional B.S. students. Outstanding performance in coursework relevant to machine learning is a prerequisite (ideally, in CSE 353 and/or CSE 354).

If you are a current undergraduate student of computer science or a closely related area, feel free to contact Banerjee via email about these options.

The goal is that, by the end of the project/thesis, your work should be deemed publishable at a peer-reviewed research conference or journal. If you have made a significant contribution, you will be a co-author of the published work (subject to acceptance at a reputable venue).

M.S. students

If you are a student in the M.S. or five-year B.S./M.S. program in the Department of Computer Science, and want to work with Banerjee for your advanced graduate project (CSE 523/524) or your graduate thesis (CSE 599), contact him via email with the following in mind:

A successful applicant will typically have good grades, a strong programming background in Python with knowledge of version control, and a good understanding of machine learning fundamentals. Experience with libraries like PyTorch or Keras is a strong plus.

As prerequisites, the graduate machine learning and/or natural language processing courses are strongly recommended, since having taken them is an important indicator of your area of interest. If contacted, please be prepared for an initial technical interview on NLP/ML concepts. If your interview is satisfactory, you will be placed in a specific project team, with a Ph.D. student leading the project. Banerjee and his team are dedicated to their research, and expectations of team members include:

At the end of the project/thesis, the work is expected to be of a caliber that is publishable at a peer-reviewed research conference or journal. If you have made a significant contribution, you will be a co-author of the published work (subject to acceptance at a reputable venue).