Tracking semantic changes in medical information


Changes in the meaning of information as it passes through cyberspace can mislead those who access the information. This project will develop a new dataset and algorithms to identify and categorize medical information according to whether it remains true to the original meaning or undergoes distortion. Instead of imposing an external true/false label on this information, this project examines the series of changes within the news coverage itself that gradually lead to a deviation from the original medical claims. Identifying important differences between original medical articles and news stories is a challenging, high-risk, high-reward venture. Broader impacts of this work include benefits to the research community, through novel contributions to understanding temporal changes in natural language information, as well as social benefits in the form of improved informational tools such as question-answering systems. For the medical domain in particular, understanding temporal distortions and deviations from actual medical findings can reduce occurrences of harmful health choices, for instance, by embedding the research outcomes in news, social media, or search engines.

This project will develop a large dataset of medical scientific publications, and record their characteristics as they change over time across news by designing and developing discrete time-series representations of entities and their attributes and relations. This task will provide the basis for designing and implementing machine learning tasks that exploit stylometric features in natural language in conjunction with temporal distributions to identify and categorize such changes. This research will go beyond current approaches limited to true/false classification of individual articles, and will thus be able to identify and analyze information change in narratives, including semantic changes and nuances, or selective emphasis of related information. The research entails unsupervised and semi-supervised machine learning approaches with bootstrapping, and explores two tasks: a binary labeling task to distinguish distorted pieces of information from those that are faithful to the scientific finding, and a multi-label categorization task to learn the type of semantic change occurring through time.
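As a rough illustration of what a discrete time-series representation might look like, the sketch below counts how often each entity is mentioned in dated news articles, binned by month. It is a minimal sketch under stated assumptions, not the project's actual representation: the function names, the monthly bin width, and the use of plain substring counts are all hypothetical, and attributes and relations of entities are not modeled here.

```python
from datetime import date


def month_index(start, d):
    """Number of whole calendar months between `start` and `d`."""
    return (d.year - start.year) * 12 + (d.month - start.month)


def entity_time_series(articles, entities, start, n_bins):
    """Build one mention-count series per entity.

    `articles` is a list of (publication_date, text) pairs. Each series has
    `n_bins` monthly bins beginning at `start` (assumed bin width: one month).
    """
    series = {e: [0] * n_bins for e in entities}
    for published, text in articles:
        b = month_index(start, published)
        if not 0 <= b < n_bins:
            continue  # skip articles outside the observed window
        lowered = text.lower()
        for e in entities:
            # Case-insensitive substring count; a real system would use
            # entity recognition and linking instead.
            series[e][b] += lowered.count(e.lower())
    return series


articles = [
    (date(2020, 1, 15), "Flagellin shows promise against H3N2."),
    (date(2020, 1, 20), "H3N2 cases rise."),
    (date(2020, 3, 2), "Flagellin trial begins; flagellin dosing studied."),
]
series = entity_time_series(articles, ["flagellin", "h3n2"], date(2020, 1, 1), 3)
```

Series of this kind could then be compared against the original publication to see when the emphasis in news coverage shifts.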

As a first step in this direction, we focused on identifying what information is worth verifying, and developed a hybrid method comprising heuristics and supervised learning to identify "check-worthy" information [Zuo, Karakas, and Banerjee; 2018]. Our approach achieved state-of-the-art detection performance, as measured by several metrics. An expanded version of this work was invited to the CLEF 2019 conference [Zuo, Karakas, and Banerjee; 2019].

Next, we looked into how healthcare information is first presented in research literature, and then in newswires for general readership. We developed a novel dataset of 5,034 news articles paired with the research abstracts of the work being mentioned, and explored how to identify identical or near-identical content expressed in vastly different syntax and vocabulary. For this, we took a two-step approach: (1) select the most relevant candidates from a collection of 222,000 research abstracts, and (2) re-rank this list of most relevant candidates. We compared the classical approach of information retrieval (IR) using BM25 with more recent transformer-based models, and found that cross-genre medical IR is a viable task, but that incorporating domain-specific knowledge is crucial for its success [Zuo, Acharya, and Banerjee; 2020].
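The two-step approach can be sketched as follows. This is a minimal illustration rather than the implementation from the cited work: the BM25 scorer is written from scratch with commonly used default parameters (k1 = 1.5, b = 0.75), and the re-ranking step uses a simple Jaccard token overlap as a stand-in for the transformer-based models. The class and function names, the tokenizer, and the example texts are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphanumeric runs; deliberately simple."""
    return re.findall(r"[a-z0-9]+", text.lower())


class BM25:
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.tokens = [tokenize(d) for d in docs]
        self.lengths = [len(t) for t in self.tokens]
        self.avg_len = sum(self.lengths) / len(docs)
        self.tf = [Counter(t) for t in self.tokens]
        df = Counter()
        for t in self.tokens:
            df.update(set(t))
        n = len(docs)
        # Smoothed IDF variant that is always positive.
        self.idf = {w: math.log(1 + (n - f + 0.5) / (f + 0.5)) for w, f in df.items()}

    def score(self, query_tokens, i):
        total = 0.0
        for w in query_tokens:
            f = self.tf[i].get(w, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * self.lengths[i] / self.avg_len)
            total += self.idf[w] * f * (self.k1 + 1) / norm
        return total

    def top_k(self, query, k):
        q = tokenize(query)
        scored = [(self.score(q, i), i) for i in range(len(self.tokens))]
        scored.sort(key=lambda p: (-p[0], p[1]))  # ties broken by index
        return [i for _, i in scored[:k]]


def jaccard_rerank(query, docs, candidates):
    """Step 2 stand-in: reorder candidates by token-set overlap with the query."""
    q = set(tokenize(query))

    def overlap(i):
        d = set(tokenize(docs[i]))
        return len(q & d) / len(q | d) if q | d else 0.0

    return sorted(candidates, key=lambda i: (-overlap(i), i))


def retrieve_and_rerank(query, docs, k=2):
    candidates = BM25(docs).top_k(query, k)       # step 1: candidate selection
    return jaccard_rerank(query, docs, candidates)  # step 2: re-ranking


abstracts = [
    "Flagellin treatment reduces influenza H3N2 viral load in mice",
    "Statin therapy and cardiovascular risk in older adults",
    "Influenza vaccination rates among healthcare workers",
]
ranking = retrieve_and_rerank("Flagellin shows therapeutic potential with H3N2", abstracts)
```

Keeping the first step cheap matters when the collection is large, since only the short candidate list is passed to the more expensive re-ranker.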

Through the course of this project, we observed that the complex nature of medical misinformation can be attributed largely to two phenomena. First, (mis)information propagates across multiple distinct genres, from research literature to newswires to social media, where each genre has its own linguistic properties and pragmatic hurdles to overcome. Second, a large amount of information constitutes paltering, or what is often called "less than lying". We have pursued scientific investigations in both directions.

(Mis)information propagation across genres

In the former, we looked into the linguistic transformations that happen when medical information transitions from specialized research literature into news intended for wider readership. This transition makes the information vulnerable to misinterpretation, misrepresentation, and incorrect attribution, all of which may be difficult to identify without adequate domain knowledge and may exist even in the presence of explicit citations. Moreover, news articles seldom provide a precise correspondence between a specific claim and its origin, making it harder to identify which claims, if any, reflect the original findings. For instance, an article stating “Flagellin shows therapeutic potential with H3N2, known as Aussie Flu.” contains two claims (“Flagellin ... H3N2,” and “H3N2, known as Aussie Flu”) that may be true or false independent of each other, and it is prima facie unclear which claims, if any, are supported by the cited research. We developed a corpus of sentences from medical news along with the sources, from peer-reviewed medical research journals, that these news articles cite. Then, we used this corpus to study what a general reader perceives to be true, and how to verify the scientific source of claims. Unlike existing corpora, this corpus captures the metamorphosis of information across two genres with disparate readership and vastly different vocabularies, and it supports the first empirical study of health-related fact-checking across them [Zuo et al.; 2022a].

We delved further into the cross-genre propagation of misinformation and the perception of truth. For this part of our research, we collaborated with a team led by Dr. Indrakshi Ray at Colorado State University, Fort Collins. As prior research has repeatedly demonstrated, social media posts often leverage the trust readers have in prestigious news agencies and cite news articles as a way of gaining credibility. It is not, however, always the case that the cited article supports the claim being upheld in the social media post. In other words, the post makes it "look" like the claim originates from a credible source, but it really does not! We developed a cross-genre ad hoc information retrieval model to identify whether the information in a Twitter post is, indeed, supported by the news article it cites. This leg of our work rests on a large corpus of 46.86 million Twitter posts about COVID-19, and is divided into two tasks: (i) developing models to detect Tweets that contain claims worth fact-checking, and (ii) verifying whether the claims made in a Tweet are supported by the newswire article it cites. Unlike previous studies that detect unsubstantiated information by post hoc analysis of the patterns of propagation, our approach is capable of identifying deceptive support before the misinformation begins to spread. Among our chief findings is the observation that, of the posts that contain a seemingly factual claim while citing a news article as supporting evidence, at least 1% include a citation intended to deceive the reader [Zuo et al.; 2022b].

Less than lying

The latter consists of selective reporting, non-disclosure of conflicts of interest, disease-mongering, etc. These manifold attributes make the automatic detection of medical misinformation a daunting challenge, one that has so far been explored only by journalists and healthcare professionals in purely qualitative studies. We delved into a significantly more complex multi-class classification task to test whether medical news articles (most of which are not considered "fake" by any existing fact-checking system) actually satisfy criteria deemed important by medical experts and healthcare journalists (as far as misinformation is concerned). We collected a corpus of 1,119 health news articles paired with systematic reviews, where each review addresses six criteria essential to the accuracy of medical news. Our experiments compared classical token-based approaches with the more recent transformer-based models, and found that detecting qualitative lapses is an extremely challenging task with direct ramifications for misinformation. Moreover, it is an important direction to pursue beyond assigning True or False labels to short claims [Zuo, Zhang, and Banerjee; 2021].

Research Group

This project was led by Dr. Ritwik Banerjee at the Department of Computer Science, Stony Brook University.


  • Chaoyuan Zuo, PhD (Computer Science) ↦ Faculty at the School of Journalism & Communication, Nankai University (China)

  • Noushin Salek Faramarzi, PhD Candidate

  • Kritik Mathur, MS ↦ Software Engineer @ Amazon

  • Dhruv Kela, MS ↦ Software Engineer @ DigitalOcean

  • Narayan Acharya, MS ↦ Research Engineer @ dmetrics, Inc.

  • Ayla Karakas, BS (Linguistics) ↦ PhD, Computational Linguistics @ Yale

  • Qi Zhang, BS (Computer Science) ↦ MS, University of California San Diego


Collaborators

  • Dr. Indrakshi Ray, Professor of Computer Science at Colorado State University

  • Dr. Hossein Shirazi, Postdoctoral Fellow (Computer Science) at Colorado State University

  • Fateme Hashemi Chaleshtori, MS (Computer Science) at Colorado State University


Publications

  1. Chaoyuan Zuo, Ayla Karakas, and Ritwik Banerjee. A Hybrid Recognition System for Check-worthy Claims Using Heuristics and Supervised Learning. In Working Notes of CLEF 2018 – Conference and Labs of the Evaluation Forum, CLEF 2018 – Vol. 2125. CEUR-WS, 2018.

  2. Chaoyuan Zuo, Ayla Karakas, and Ritwik Banerjee. To Check or not to Check: Syntax, Semantics, and Context in the Language of Check-worthy Claims. In Crestani et al. (Eds.) Experimental IR Meets Multilinguiality, Multimodality, and Interaction: Proceedings of the 10th International Conference of the CLEF Association, CLEF 2019 – LNCS Vol. 11696. Springer, 2019. [ Invited Paper to "Best of the Labs" Track ]

  3. Chaoyuan Zuo, Narayan Acharya, and Ritwik Banerjee. Querying Across Genres for Medical Claims in News. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL, 2020.

  4. Chaoyuan Zuo, Qi Zhang, and Ritwik Banerjee. An Empirical Assessment of the Qualitative Aspects of Misinformation in Health News. In Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda (NLP4IF). ACL, 2021.

  5. Ritwik Banerjee and Indrakshi Ray. Diagnosis, Prevention, and Cure for Misinformation. In IEEE International Conference on Cognitive Machine Intelligence (CogMI), pp. 156–162. IEEE, 2021. [ Vision Paper ]

  6. Sina Mahdipour Saravani, Ritwik Banerjee, and Indrakshi Ray. An Investigation into the Contribution of Locally Aggregated Descriptors to Figurative Language Identification. In Proceedings of the Second Workshop on Insights from Negative Results in NLP. ACL, 2021.

  7. Chaoyuan Zuo, Kritik Mathur, Dhruv Kela, Noushin Salek Faramarzi, and Ritwik Banerjee. Beyond belief: a cross-genre study on perception and validation of health information online. In International Journal of Data Science and Analytics, 13:299–314. Springer, 2022.

  8. Chaoyuan Zuo, Ritwik Banerjee, Hossein Shirazi, Fateme Hashemi Chaleshtori, and Indrakshi Ray. Seeing Should Probably not be Believing: The Role of Deceptive Support in COVID-19 Misinformation on Twitter. In Journal of Data and Information Quality. ACM, 2022.