Two-Layer Retrieval-Augmented Generation Framework for Low-Resource Medical Question Answering Using Reddit Data: Proof-of-Concept Study

Bibliographic Details
Main Authors: Sudeshna Das, Yao Ge, Yuting Guo, Swati Rajwal, JaMor Hairston, Jeanne Powell, Drew Walker, Snigdha Peddireddy, Sahithi Lakamana, Selen Bozkurt, Matthew Reyna, Reza Sameni, Yunyu Xiao, Sangmi Kim, Rasheeta Chandler, Natalie Hernandez, Danielle Mowery, Rachel Wightman, Jennifer Love, Anthony Spadaro, Jeanmarie Perrone, Abeed Sarker
Format: Article
Language: English
Published: JMIR Publications 2025-01-01
Series: Journal of Medical Internet Research
Online Access: https://www.jmir.org/2025/1/e66220
_version_ 1841557097806823424
author Sudeshna Das
Yao Ge
Yuting Guo
Swati Rajwal
JaMor Hairston
Jeanne Powell
Drew Walker
Snigdha Peddireddy
Sahithi Lakamana
Selen Bozkurt
Matthew Reyna
Reza Sameni
Yunyu Xiao
Sangmi Kim
Rasheeta Chandler
Natalie Hernandez
Danielle Mowery
Rachel Wightman
Jennifer Love
Anthony Spadaro
Jeanmarie Perrone
Abeed Sarker
author_sort Sudeshna Das
collection DOAJ
description Background: The increasing use of social media to share lived and living experiences of substance use presents a unique opportunity to obtain information on side effects, use patterns, and opinions on novel psychoactive substances. However, due to the large volume of data, obtaining useful insights through natural language processing technologies such as large language models is challenging.
Objective: This paper aims to develop a retrieval-augmented generation (RAG) architecture for medical question answering pertaining to clinicians’ queries on emerging issues associated with health-related topics, using user-generated medical information on social media.
Methods: We proposed a two-layer RAG framework for query-focused answer generation and evaluated a proof of concept for the framework in the context of query-focused summary generation from social media forums, focusing on emerging drug-related information. Our modular framework generates individual summaries followed by an aggregated summary to answer medical queries from large amounts of user-generated social media data in an efficient manner. We compared the performance of a quantized large language model (Nous-Hermes-2-7B-DPO), deployable in low-resource settings, with GPT-4. For this proof-of-concept study, we used user-generated data from Reddit to answer clinicians’ questions on the use of xylazine and ketamine.
Results: Our framework achieves comparable median scores in terms of relevance, length, hallucination, coverage, and coherence when evaluated using GPT-4 and Nous-Hermes-2-7B-DPO, evaluated for 20 queries with 76 samples. There was no statistically significant difference between GPT-4 and Nous-Hermes-2-7B-DPO for coverage (Mann-Whitney U=733.0; n1=37; n2=39; P=.89 two-tailed), coherence (U=670.0; n1=37; n2=39; P=.49 two-tailed), relevance (U=662.0; n1=37; n2=39; P=.15 two-tailed), length (U=672.0; n1=37; n2=39; P=.55 two-tailed), and hallucination (U=859.0; n1=37; n2=39; P=.01 two-tailed). A statistically significant difference was noted for the Coleman-Liau Index (U=307.5; n1=20; n2=16; P<.001 two-tailed).
Conclusions: Our RAG framework can effectively answer medical questions about targeted topics and can be deployed in resource-constrained settings.
format Article
id doaj-art-9d0ab59424e6461c94b86837d9f8a22b
institution Kabale University
issn 1438-8871
language English
publishDate 2025-01-01
publisher JMIR Publications
record_format Article
series Journal of Medical Internet Research
spelling doaj-art-9d0ab59424e6461c94b86837d9f8a22b
2025-01-06T20:30:38Z
eng
JMIR Publications
Journal of Medical Internet Research
1438-8871
2025-01-01
27
e66220
10.2196/66220
Two-Layer Retrieval-Augmented Generation Framework for Low-Resource Medical Question Answering Using Reddit Data: Proof-of-Concept Study
Sudeshna Das https://orcid.org/0000-0002-2112-6986
Yao Ge https://orcid.org/0000-0002-3323-7130
Yuting Guo https://orcid.org/0000-0002-8919-0888
Swati Rajwal https://orcid.org/0000-0002-3826-5069
JaMor Hairston https://orcid.org/0000-0001-6069-5869
Jeanne Powell https://orcid.org/0000-0002-3494-2376
Drew Walker https://orcid.org/0000-0002-4216-2396
Snigdha Peddireddy https://orcid.org/0000-0002-2972-1122
Sahithi Lakamana https://orcid.org/0000-0003-1304-7484
Selen Bozkurt https://orcid.org/0000-0003-1234-2158
Matthew Reyna https://orcid.org/0000-0003-4688-7965
Reza Sameni https://orcid.org/0000-0003-4913-6825
Yunyu Xiao https://orcid.org/0000-0002-0479-1781
Sangmi Kim https://orcid.org/0000-0002-1761-4696
Rasheeta Chandler https://orcid.org/0000-0003-2021-6346
Natalie Hernandez https://orcid.org/0000-0001-8911-6613
Danielle Mowery https://orcid.org/0000-0003-3802-4457
Rachel Wightman https://orcid.org/0000-0001-6141-1776
Jennifer Love https://orcid.org/0000-0002-5882-4390
Anthony Spadaro https://orcid.org/0000-0002-0941-4651
Jeanmarie Perrone https://orcid.org/0000-0001-7073-9060
Abeed Sarker https://orcid.org/0000-0001-7358-544X
https://www.jmir.org/2025/1/e66220
title Two-Layer Retrieval-Augmented Generation Framework for Low-Resource Medical Question Answering Using Reddit Data: Proof-of-Concept Study
title_sort two layer retrieval augmented generation framework for low resource medical question answering using reddit data proof of concept study
url https://www.jmir.org/2025/1/e66220