Tracking semantic changes in medical information
Changes in the meaning of information as it passes through cyberspace can mislead those who access it. This project will develop a new dataset and algorithms to identify and categorize medical information according to whether it remains true to its original meaning or undergoes distortion. Instead of imposing an external true/false label on this information, the project examines the series of changes within the news coverage itself that gradually lead to a deviation from the original medical claims. Identifying important differences between original medical articles and news stories is a challenging, high-risk, high-reward venture. Broader impacts of this work include benefits to the research community, through novel contributions to understanding temporal changes in natural language information, as well as social benefits in the form of improved informational tools such as question-answering systems. For the medical domain in particular, understanding temporal distortions and deviations from actual medical findings can reduce occurrences of harmful health choices, for instance, by embedding the research outcomes in news, social media, or search engines.
This project will develop a large dataset of medical scientific publications and record how their characteristics change over time across news coverage, by designing and developing discrete time-series representations of entities and their attributes and relations. This dataset will provide the basis for designing and implementing machine learning tasks that exploit stylometric features of natural language in conjunction with temporal distributions to identify and categorize such changes. This research will go beyond current approaches, which are limited to true/false classification of individual articles, and will therefore be able to identify and analyze information change in narratives, including semantic changes and nuances, or selective emphasis of related information. The research entails unsupervised and semi-supervised machine learning approaches with bootstrapping. It explores a binary labeling task that distinguishes distorted pieces of information from those that are faithful to the scientific finding, and a multi-label categorization task that learns the type of semantic change occurring over time.
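As a rough illustration of what a discrete time-series representation of an entity and its attributes might look like, the Python sketch below tracks the set of attributes asserted about one entity in the source publication and in later news stories, and measures how far each story has drifted from the source. Everything in it is an assumption made for illustration: the class names `Snapshot` and `EntitySeries`, the use of Jaccard distance as the drift measure, and the toy attribute strings. It is not the project's actual representation or dataset.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from datetime import date


@dataclass(frozen=True)
class Snapshot:
    """Attributes asserted about one entity in one document (hypothetical)."""
    published: date
    attributes: frozenset[str]


@dataclass
class EntitySeries:
    """Discrete time series for one entity, anchored at the source paper."""
    entity: str
    source: Snapshot
    coverage: list[Snapshot] = field(default_factory=list)

    def drift(self) -> list[tuple[date, float]]:
        """Jaccard distance of each news snapshot from the source, by date."""
        points = []
        for snap in sorted(self.coverage, key=lambda s: s.published):
            union = self.source.attributes | snap.attributes
            shared = self.source.attributes & snap.attributes
            distance = 1.0 - len(shared) / len(union) if union else 0.0
            points.append((snap.published, distance))
        return points


# Toy data, invented for this example only.
series = EntitySeries(
    entity="compound X",
    source=Snapshot(
        date(2019, 12, 1),
        frozenset({"reduces risk", "in mice", "preliminary"}),
    ),
    coverage=[
        Snapshot(date(2020, 2, 1), frozenset({"reduces risk", "cures disease"})),
        Snapshot(date(2020, 1, 5), frozenset({"reduces risk", "in mice"})),
    ],
)

for day, distance in series.drift():
    print(day.isoformat(), round(distance, 2))
```

In this toy series the first story drops the "preliminary" qualifier (distance about 0.33), and the later one also loses "in mice" and adds a stronger claim (distance 0.75). A set-based distance is only one possible choice; it ignores relations between entities and the stylometric cues mentioned above.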
As a first step in this direction, we focused on identifying what information is worth verifying, and developed a hybrid method comprising heuristics and supervised learning to identify "check-worthy" information [Zuo, Karakas, and Banerjee; 2018]. Our approach achieved state-of-the-art detection performance, as measured by several metrics. An extended version of this work was invited to the CLEF 2019 conference [Zuo, Karakas, and Banerjee; 2019].
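The general shape of a hybrid of heuristics and supervised learning can be sketched in Python as a rule-based filter followed by a learned classifier. The sketch below is not the system described in the cited papers: the heuristic in `passes_heuristics`, the small Naive Bayes model, and the four training sentences are all hypothetical stand-ins chosen to keep the example self-contained.

```python
import math
import re
from collections import Counter

TOKEN = re.compile(r"[a-z']+|\d+")


def tokens(sentence):
    """Lowercased word and number tokens."""
    return TOKEN.findall(sentence.lower())


def passes_heuristics(sentence):
    """Hypothetical rule: discard very short sentences and questions."""
    return len(tokens(sentence)) >= 4 and not sentence.strip().endswith("?")


class NaiveBayes:
    """Tiny two-class bag-of-words model with add-one smoothing."""

    def fit(self, sentences, labels):
        self.counts = {0: Counter(), 1: Counter()}
        self.priors = Counter(labels)
        for sentence, label in zip(sentences, labels):
            self.counts[label].update(tokens(sentence))
        self.vocab = set(self.counts[0]) | set(self.counts[1])
        self.totals = {c: sum(self.counts[c].values()) for c in (0, 1)}
        return self

    def log_odds(self, sentence):
        """Log odds of class 1 (check-worthy) over class 0."""
        score = math.log(self.priors[1] / self.priors[0])
        size = len(self.vocab)
        for tok in tokens(sentence):
            if tok not in self.vocab:
                continue  # ignore words never seen in training
            p1 = (self.counts[1][tok] + 1) / (self.totals[1] + size)
            p0 = (self.counts[0][tok] + 1) / (self.totals[0] + size)
            score += math.log(p1 / p0)
        return score


def is_check_worthy(sentence, model):
    """Heuristic filter first, then the supervised model's decision."""
    return passes_heuristics(sentence) and model.log_odds(sentence) > 0


# Toy training data, invented for this example (1 = check-worthy).
model = NaiveBayes().fit(
    [
        "Unemployment fell by 5 percent last year",
        "The drug reduced heart attacks by 30 percent",
        "Thank you all for coming tonight",
        "I think we can do better together",
    ],
    [1, 1, 0, 0],
)

print(is_check_worthy("Crime fell by 12 percent in our city", model))
print(is_check_worthy("Thank you for coming", model))
print(is_check_worthy("Did crime fall by 12 percent?", model))
```

Only the first sentence is flagged. The third contains the same statistical vocabulary but is rejected by the rule before the classifier is consulted, which is the point of layering heuristics over a learned model.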
(Mis)information propagation across genres
Less than lying
This project was led by Dr. Ritwik Banerjee at the Department of Computer Science, Stony Brook University.
Chaoyuan Zuo, PhD (Computer Science) ↦ Faculty at the School of Journalism & Communication, Nankai University (China)
Noushin Salek Faramarzi, PhD Candidate
Kritik Mathur, MS ↦ Software Engineer @ Amazon
Dhruv Kela, MS ↦ Software Engineer @ DigitalOcean
Narayan Acharya, MS ↦ Research Engineer @ dmetrics, Inc.
Ayla Karakas, BS (Linguistics) ↦ Ph.D., Computational Linguistics @ Yale
Qi Zhang, BS (Computer Science) ↦ MS, University of California San Diego
Dr. Indrakshi Ray, Professor of Computer Science at Colorado State University.
Dr. Hossein Shirazi, Postdoctoral Fellow (Computer Science) at Colorado State University.
Fateme Hashemi Chaleshtori, MS (Computer Science) at Colorado State University.
Chaoyuan Zuo, Ayla Karakas, and Ritwik Banerjee. A Hybrid Recognition System for Check-worthy Claims Using Heuristics and Supervised Learning. In Working Notes of CLEF 2018 – Conference and Labs of the Evaluation Forum, Vol. 2125. CEUR-WS, 2018.
Chaoyuan Zuo, Ayla Karakas, and Ritwik Banerjee. To Check or not to Check: Syntax, Semantics, and Context in the Language of Check-worthy Claims. In Crestani et al. (Eds.) Experimental IR Meets Multilinguiality, Multimodality, and Interaction: Proceedings of the 10th International Conference of the CLEF Association, CLEF 2019 – LNCS Vol. 11696. Springer, 2019. [ Invited Paper to "Best of the Labs" Track ]
Chaoyuan Zuo, Narayan Acharya, and Ritwik Banerjee. Querying Across Genres for Medical Claims in News. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). ACL, 2020.
Chaoyuan Zuo, Qi Zhang, and Ritwik Banerjee. An Empirical Assessment of the Qualitative Aspects of Misinformation in Health News. In Proceedings of the Fourth Workshop on NLP for Internet Freedom: Censorship, Disinformation, and Propaganda (NLP4IF). ACL, 2021.
Chaoyuan Zuo, Kritik Mathur, Dhruv Kela, Noushin Salek Faramarzi, and Ritwik Banerjee. Beyond belief: a cross-genre study on perception and validation of health information online. In International Journal of Data Science and Analytics. 13:299 – 314. Springer, 2022.