Two-Layer Retrieval-Augmented Generation Framework for Low-Resource Medical Question Answering Using Reddit Data: Proof-of-Concept Study
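| Main Authors: | Sudeshna Das, Yao Ge, Yuting Guo, Swati Rajwal, JaMor Hairston, Jeanne Powell, Drew Walker, Snigdha Peddireddy, Sahithi Lakamana, Selen Bozkurt, Matthew Reyna, Reza Sameni, Yunyu Xiao, Sangmi Kim, Rasheeta Chandler, Natalie Hernandez, Danielle Mowery, Rachel Wightman, Jennifer Love, Anthony Spadaro, Jeanmarie Perrone, Abeed Sarker |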
|---|---|
| Format: | Article | 
| Language: | English | 
| Published: | JMIR Publications, 2025-01-01 |
| Series: | Journal of Medical Internet Research | 
| Online Access: | https://www.jmir.org/2025/1/e66220 | 
| _version_ | 1841557097806823424 | 
    
|---|---|
| author | Sudeshna Das, Yao Ge, Yuting Guo, Swati Rajwal, JaMor Hairston, Jeanne Powell, Drew Walker, Snigdha Peddireddy, Sahithi Lakamana, Selen Bozkurt, Matthew Reyna, Reza Sameni, Yunyu Xiao, Sangmi Kim, Rasheeta Chandler, Natalie Hernandez, Danielle Mowery, Rachel Wightman, Jennifer Love, Anthony Spadaro, Jeanmarie Perrone, Abeed Sarker |
    
    
| author_sort | Sudeshna Das | 
    
| collection | DOAJ | 
    
| description | Background: The increasing use of social media to share lived and living experiences of substance use presents a unique opportunity to obtain information on side effects, use patterns, and opinions on novel psychoactive substances. However, due to the large volume of data, obtaining useful insights through natural language processing technologies such as large language models is challenging. Objective: This paper aims to develop a retrieval-augmented generation (RAG) architecture for medical question answering pertaining to clinicians’ queries on emerging issues associated with health-related topics, using user-generated medical information on social media. Methods: We proposed a two-layer RAG framework for query-focused answer generation and evaluated a proof of concept for the framework in the context of query-focused summary generation from social media forums, focusing on emerging drug-related information. Our modular framework generates individual summaries followed by an aggregated summary to answer medical queries efficiently from large amounts of user-generated social media data. We compared the performance of a quantized large language model (Nous-Hermes-2-7B-DPO), deployable in low-resource settings, with GPT-4. For this proof-of-concept study, we used user-generated data from Reddit to answer clinicians’ questions on the use of xylazine and ketamine. Results: With GPT-4 and Nous-Hermes-2-7B-DPO, our framework achieved comparable median scores for relevance, length, hallucination, coverage, and coherence across 20 queries with 76 samples. There was no statistically significant difference between GPT-4 and Nous-Hermes-2-7B-DPO for coverage (Mann-Whitney U=733.0; n1=37; n2=39; P=.89 two-tailed), coherence (U=670.0; n1=37; n2=39; P=.49 two-tailed), relevance (U=662.0; n1=37; n2=39; P=.15 two-tailed), length (U=672.0; n1=37; n2=39; P=.55 two-tailed), and hallucination (U=859.0; n1=37; n2=39; P=.01 two-tailed). A statistically significant difference was noted for the Coleman-Liau Index (U=307.5; n1=20; n2=16; P<.001 two-tailed). Conclusions: Our RAG framework can effectively answer medical questions about targeted topics and can be deployed in resource-constrained settings. |
    
| format | Article | 
    
| id | doaj-art-9d0ab59424e6461c94b86837d9f8a22b | 
    
| institution | Kabale University | 
    
| issn | 1438-8871 | 
    
| language | English | 
    
| publishDate | 2025-01-01 | 
    
| publisher | JMIR Publications | 
    
| record_format | Article | 
    
| series | Journal of Medical Internet Research | 
    
| spelling | doaj-art-9d0ab59424e6461c94b86837d9f8a22b; 2025-01-06T20:30:38Z; eng; JMIR Publications; Journal of Medical Internet Research; 1438-8871; 2025-01-01; 27; e66220; 10.2196/66220; Two-Layer Retrieval-Augmented Generation Framework for Low-Resource Medical Question Answering Using Reddit Data: Proof-of-Concept Study; Sudeshna Das (https://orcid.org/0000-0002-2112-6986); Yao Ge (https://orcid.org/0000-0002-3323-7130); Yuting Guo (https://orcid.org/0000-0002-8919-0888); Swati Rajwal (https://orcid.org/0000-0002-3826-5069); JaMor Hairston (https://orcid.org/0000-0001-6069-5869); Jeanne Powell (https://orcid.org/0000-0002-3494-2376); Drew Walker (https://orcid.org/0000-0002-4216-2396); Snigdha Peddireddy (https://orcid.org/0000-0002-2972-1122); Sahithi Lakamana (https://orcid.org/0000-0003-1304-7484); Selen Bozkurt (https://orcid.org/0000-0003-1234-2158); Matthew Reyna (https://orcid.org/0000-0003-4688-7965); Reza Sameni (https://orcid.org/0000-0003-4913-6825); Yunyu Xiao (https://orcid.org/0000-0002-0479-1781); Sangmi Kim (https://orcid.org/0000-0002-1761-4696); Rasheeta Chandler (https://orcid.org/0000-0003-2021-6346); Natalie Hernandez (https://orcid.org/0000-0001-8911-6613); Danielle Mowery (https://orcid.org/0000-0003-3802-4457); Rachel Wightman (https://orcid.org/0000-0001-6141-1776); Jennifer Love (https://orcid.org/0000-0002-5882-4390); Anthony Spadaro (https://orcid.org/0000-0002-0941-4651); Jeanmarie Perrone (https://orcid.org/0000-0001-7073-9060); Abeed Sarker (https://orcid.org/0000-0001-7358-544X); https://www.jmir.org/2025/1/e66220 |
    
    
| title | Two-Layer Retrieval-Augmented Generation Framework for Low-Resource Medical Question Answering Using Reddit Data: Proof-of-Concept Study | 
    
    
| url | https://www.jmir.org/2025/1/e66220 | 
    
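
The description above outlines a two-layer flow: individual query-focused summaries are generated first, then combined into one aggregated summary. The Python sketch below illustrates that general shape only and is not the authors' implementation. The function names (`retrieve`, `two_layer_answer`), the prompt wording, and the keyword-overlap retriever are all assumptions made for illustration; the record does not say how retrieval or prompting was done. The `summarize` argument stands in for any language model call (for example, a locally hosted quantized model).

```python
from typing import Callable, List

# Any function that maps a prompt string to generated text.
# In practice this would wrap an LLM call; here it is a hypothetical placeholder type.
Summarizer = Callable[[str], str]


def retrieve(query: str, posts: List[str], top_k: int = 5) -> List[str]:
    """Naive retriever (assumption): rank posts by word overlap with the query."""
    query_terms = set(query.lower().split())
    scored = []
    for post in posts:
        overlap = len(query_terms & set(post.lower().split()))
        if overlap > 0:
            scored.append((overlap, post))
    # Highest overlap first; sorted() is stable, so ties keep input order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [post for _, post in scored[:top_k]]


def two_layer_answer(query: str, posts: List[str], summarize: Summarizer) -> str:
    """Layer 1 summarizes each post for the query; layer 2 aggregates the summaries."""
    # Layer 1: one query-focused summary per retrieved post.
    individual = []
    for post in posts:
        prompt = (
            f"Question: {query}\nPost: {post}\n"
            "Summarize only the parts relevant to the question."
        )
        summary = summarize(prompt)
        if summary.strip():
            individual.append(summary.strip())

    if not individual:
        return ""

    # Layer 2: a single aggregated answer built from the layer-1 summaries.
    bullet_list = "\n".join(f"- {s}" for s in individual)
    final_prompt = (
        f"Question: {query}\nSummaries:\n{bullet_list}\n"
        "Write one aggregated answer to the question."
    )
    return summarize(final_prompt)


if __name__ == "__main__":
    # Stand-in summarizer for demonstration; a real deployment would call a model here.
    def echo_summarizer(prompt: str) -> str:
        return prompt.splitlines()[1][:60]

    forum_posts = [
        "xylazine caused slow-healing wounds for me",
        "unrelated post about cooking",
    ]
    question = "xylazine wounds"
    print(two_layer_answer(question, retrieve(question, forum_posts), echo_summarizer))
```

Passing the model as a plain callable keeps the sketch independent of any particular model or hosting setup, so the same flow could be tried with either a small local model or a hosted one.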