47
views
0
recommends
+1 Recommend
1 collections
    1
    shares

      Submit your manuscript to the new open access journal Drug Repurposing. Open for research articles, reviews, discussions, case studies, negative results across the whole spectrum of drug repurposing.

      No article processing charges.

      scite_
       
      • Record: found
      • Abstract: found
      • Article: found
      Is Open Access

      DrugRepoChatter: A Drug Repurposing Expert Chatbot Curated by the REPO4EU Consortium

      Published
      research-article
      Bookmark

            Abstract

            The amount of scientific literature available is overwhelming, especially in fast-evolving fields like drug repurposing. Researchers face a significant challenge: staying up-to-date is nearly impossible due to the sheer volume of publications, databases, and tools available. This situation creates an urgent need for more efficient ways to access and integrate information from these articles. To tackle this problem, we developed “DrugRepoChatter,” a tool designed to help researchers navigate the flood of recent literature on data sources, methods, and tools relevant for mechanism-based drug repurposing. DrugRepoChatter uses a vector database containing 285 open-access articles carefully selected by experts from the REPO4EU project to cover the latest developments in drug repurposing. The chatbot works by finding semantic similarities between a user’s question and the information stored in the database, effectively answering questions by drawing directly from the content of these articles. DrugRepoChatter is available as a web tool at https://apps.cosy.bio/drugrepochatter/ This tool makes the review process much more efficient and ensures that researchers can quickly find relevant information. By facilitating faster access to the scientific literature and enabling researchers to easily find the tools and information they need, DrugRepoChatter accelerates knowledge discovery. This tool not only streamlines the process of reviewing the literature but also helps integrate scattered information into a cohesive, accessible format.

            Main article text

            INTRODUCTION

            The challenge of keeping pace with the exponential growth in the scientific literature is a significant barrier to effectively translating research evidence into practice. This growing gap between knowledge and application often leads to overlooking valuable evidence, emphasizing the need for efficient strategies to navigate and integrate this vast amount of information. 1 Addressing the replication crisis and improving scientific research practices have become crucial in this context. Strategies such as pre-registration of research, peer reviews throughout the research process, and promoting open science practices are vital for maintaining scientific integrity and trust. 2

            Moreover, the recent global pandemic has underscored the challenges in quality control and the need for reliable information amidst a surge in biomedical publications, exacerbating the difficulty of keeping up-to-date with the latest scientific findings. 3 The pollution of scientific literature, or “infollution,” necessitates a critical evaluation of current peer-review systems and strategies to enhance the reproducibility of published data, aiming to curtail the dissemination of misleading or low-quality research. 4 Furthermore, the pressure to publish for career advancement can lead to a prioritization of quantity over quality, undermining the validity of scientific research. As an incentive, this publication strategy is detrimental to conducting high-quality scientific research and, thus, also prevents advances in health services. 5 Strategies for handling knowledge gaps and conducting systematic reviews are crucial for navigating these challenges and fostering a culture prioritizing scientific integrity and relevance. 6 Expert reviews can offer researchers a starting point when entering a new domain. However, such reviews are typically outdated quickly in fast-moving research fields. To remain succinct and accessible, expert reviews give only a high-level view and often focus on specific questions and topics. Interested researchers, thus, have to engage in systematic and time-consuming literature reviews, in which it is challenging to recognize high-quality research.

            Large language models (LLMs) are revolutionizing how scientists access and utilize the vast repository of scientific knowledge. LLMs can dramatically improve their response accuracy by integrating tools like retrieval-augmented generation (RAG) from the scientific literature. This technique enables the dynamic incorporation of information from external databases, such as PubMed or Google Scholar, ensuring that the provided information is factual and up-to-date. As outlined by Lewis et al. 7 in their seminal work on RAG for knowledge-intensive Natural Language Processing (NLP) tasks, this approach combines the extensive database of algorithms with an intuitive, natural chatting manner, offering a seamless and efficient way for scientists to retrieve relevant scientific literature. By building a RAG tool for semantic similarity-based literature retrieval, LLMs can provide contextual user assistance, effectively bridging the gap between complex scientific queries and the vast expanse of existing scientific research. This integration not only empowers scientists with the fastest access to information in a conversational manner but also leverages factual information to answer questions based on scientific articles, thereby facilitating a more informed and evidence-based scientific inquiry. Importantly, the effectiveness of RAG hinges on the quality and relevance of the curated literature collection used as its knowledge base, emphasizing the need for expert-curated databases to ensure accurate and reliable information retrieval.

            We previously performed a community-driven literature search and evaluation of drug repurposing tools, reviews, and databases as part of the REPO4EU consortium (https://repo4.eu/). 8 Our article curation method, detailed in the Methods section, ensured a rigorous selection of high-quality, relevant, and publicly available publications in the drug repurposing field. This carefully curated collection serves as the basis for a new tool we created to help researchers overcome the difficulties they encounter when sifting through the enormous and disorganized amount of drug repurposing data.

            RESULTS

            To address the pressing need for efficient access to high-quality drug repurposing information, we developed “DrugRepoChatter,” an LLM-powered chatbot designed to serve as a fundamental resource for academics, scientists, and drug repurposing enthusiasts. DrugRepoChatter aims to combine the fragmented landscape of drug repurposing literature into a cohesive and accessible format. This tool utilizes our curated collection of scientific literature as its knowledge base. DrugRepoChatter employs RAG technology to dynamically incorporate information from this curated list, enabling immediate, accurate, and contextually relevant interactions. By simplifying access to valuable scientific information, DrugRepoChatter aims to accelerate drug repurposing research and facilitate more informed decision-making in the field.

            DrugRepoChatter is an advancement toward democratizing access to scientific knowledge in drug repurposing, making a wealth of expert-curated open-access literature accessible to a wide audience. DrugRepoChatter is the direct result of the REPO4EU consortium’s efforts that aim to foster a more involved and knowledgeable community on drug repurposing, hence expediting the discovery and use of repurposed drugs.

            Figure 1 illustrates the process and architecture of DrugRepoChatter, showing how the chatbot processes and retrieves information from the curated articles. DrugRepoChatter uses an LLM model to embed text splits from expert-selected open-access articles into a vector space for similarity searches. This system enables the chatbot to match user queries with relevant literature effectively, providing clear and accurate responses based on the embedded scientific content.

            Figure 1.

            A schematic representation of DrugRepoChatter. Initially, 285 drug repurposing articles have been carefully selected by experts. The articles, in PDF format, have been loaded, preprocessed, and split into chunks. Each text split has been embedded into a vector space using an LLM model and stored. These data are then used for similarity searches, where the embedding of the user query is compared to the embedding of text splits. Then, the most relevant splits are selected by the retriever. Next, the relevant splits are used by the LLM model to generate answers. Users can further personalize the database by adding their own papers. LLM, large language model.

            DrugRepoChatter is based on a database containing 285 expert-curated open-access articles selected for their relevance and quality in drug repurposing. This curation process was critical in ensuring that the chatbot could provide accurate, informed responses to user queries, significantly enhancing the efficiency of literature review processes for researchers. Following the open science paradigm, we ensured unrestricted access to this information by focusing solely on open-access publications.

            DrugRepoChatter’s design and technology enable it to match user queries with relevant articles from the database effectively. Users receive answers informed by the embedded literature, presented clearly and accessibly, streamlining the process of obtaining scientific information. The tool can help integrate scattered information into a cohesive, user-friendly format, eliminating the need for researchers to manually navigate multiple databases or sift through extensive volumes of literature.

            The acceleration of discovery in drug repurposing is another notable outcome of DrugRepoChatter’s implementation. By providing quick access to the scientific literature and equipping researchers with the necessary tools and information, DrugRepoChatter supports identifying novel drug repurposing opportunities and increases the pace of discovery in this field.

            Feedback from initial users indicates a strong appreciation for the chatbot’s intuitive design and the credibility of its responses, underscoring the value of direct source citation. Overall, DrugRepoChatter exemplifies the transformative potential of leveraging advanced technologies to support scientific inquiry. Its development and successful implementation demonstrate a significant advancement in how researchers access and interact with scientific literature, fostering a more efficient and informed exploration of new therapeutic opportunities.

            DISCUSSION

            The development and implementation of DrugRepoChatter represent a significant advancement in how researchers access and interact with scientific literature. By leveraging advanced technologies such as vector databases, semantic similarity searches, and RAG, our chatbot provides accurate, informed responses to user queries, significantly enhancing the efficiency of the literature review processes for researchers.

            The careful curation of scientific articles was critical in ensuring that DrugRepoChatter could provide reliable, informed answers. By focusing exclusively on open-access publications, we promote an open science approach and ensure unrestricted access to this information. This curated database is the foundation of DrugRepoChatter’s ability to match user queries with relevant articles effectively, providing users with clear, accessible answers that streamline the process of accessing scientific information.

            DrugRepoChatter eliminates the need for researchers to manually navigate multiple databases or sift through extensive volumes of literature. Its design and technology enable it to accelerate education in drug repurposing by providing quick access to the relevant scientific literature and equipping researchers with the necessary tools and information. This positive reception highlights the tool’s potential to foster more efficient and informed exploration of new therapeutic opportunities by researchers. The interaction data collected from these users will be instrumental in further refining DrugRepoChatter, ensuring it continues to serve the research community’s needs effectively.

            CONCLUSION AND FUTURE RESEARCH

            In conclusion, DrugRepoChatter exemplifies the transformative potential of leveraging advanced technologies to support scientific inquiry. Its development and successful implementation demonstrate that researchers can access and interact with the scientific literature more efficiently and informedly, ultimately fostering a more productive exploration of new therapeutic opportunities in drug repurposing.

            DrugRepoChatter facilitates a more streamlined review process, enabling researchers to stay abreast of the latest developments in their field. The chatbot’s positive reception by the research community highlights its potential to significantly impact how scientific literature is accessed and utilized in drug repurposing research.

            Future research will focus on expanding and updating the chatbot’s database through a systematic and efficient process. We have already created targeted PubMed queries for drug repurposing reviews, databases, and tools as part of the REPO4EU community effort. We intend to use these customized queries to create a periodic querying system that consistently pulls new publications from PubMed. Basic filtering criteria, like choosing only articles with valid PubMed IDs, will be applied during this process. We will incorporate the new articles’ titles and abstracts into our knowledge base to produce time-stamped versions. This approach allows us to continually expand our knowledge base without the computational demands and potential user experience issues that could arise from embedding and chatting with full-text articles, even if limited to open-access publications. By using this approach, DrugRepoChatter will be able to remain up to date with the most recent research while still being user-friendly and efficient. The process of updating articles will be streamlined because they will not need to be re-embedded into the database. Users will always have access to the most recent information without overloading the system or degrading response times thanks to this approach, which strikes a compromise between comprehensiveness and practicality.

            Additionally, further development will aim to enhance the chatbot’s natural language processing capabilities in order to improve the relevance and accuracy of its responses. As new open-source LLMs are released, we plan to potentially replace the currently running model with newer, more advanced versions. This ongoing update of the underlying LLM could significantly benefit future users by providing improved performance, expanded capabilities, and potentially better alignment with the specific needs of drug repurposing research.

            As artificial intelligence (AI) and machine learning continue to advance, there is significant potential for tools like DrugRepoChatter to become integral components of the research ecosystem, facilitating faster and more informed scientific inquiry. Continued collaboration between AI developers and the scientific community will be crucial in realizing this potential and ensuring that such tools are tailored to meet the evolving needs of researchers.

            METHODS

            The REPO4EU consortium aims to develop a platform that sets global standards for drug repurposing, both in clinical and information management aspects. Partners conducted a literature review and gathered scientific papers on drug repurposing to create a knowledge base for this platform.

            Article Curation Process

            The initial phase involved compiling an exhaustive list of potential articles from databases such as PubMed 9 and Google Scholar. 10 This list was refined through a meticulous review process that assessed each article’s relevance, quality, and contribution to drug repurposing. Our article curation strategy began with identifying relevant and novel research papers, reviews, and databases in the drug repurposing field. Following this first step, we conducted a thorough evaluation to assess the quality and relevance of the literature in light of current research trends and clinical utility. In particular, we applied the following inclusion criteria to filter the publications for our study:

            1. Tools and methods have to be usable in a programmatic fashion by including the terms package, Application Programming Interface (API), or code. Regarding tools designed for users who are not usually expected to have programming knowledge (e.g., clinicians), a web app or Graphical User Interface (GUI) should be provided.

            2. Databases and data resources should contain the term database.

            3. We consider publications to be accepted by the community if they have more than 30 PubMed citations. Publications with fewer citations are only considered if published within the last 5 years.

            4. All publications should be open-access.

            This approach guaranteed that only publications that met our high scientific validity requirements and were publicly available made it into our refined selection. Special attention was given to ensuring that only open-access articles were included to facilitate unrestricted access. The selected articles were cataloged based on their relevance to different categories within drug repurposing, such as databases, reviews, and tools. This organization facilitated easier access and retrieval of information relevant to specific queries or topics. For each article, we obtained its corresponding PDF version.

            Embedding Articles in the DrugRepoChatter Vector Database

            To extract text from the curated articles’ PDF files, we utilized the PyPDFLoader module from the LangChain-community Python package. 11 We configured PyPDFLoader to extract text from each page of the PDF as a separate document while storing the page numbers in the metadata. This involved extracting the text while discarding references to focus on the core content. Additional article information, like title, abstract, and DOI, was included as metadata.

            The text extracted from the articles was then embedded using the nomic-embed-text model. 12 These embeddings were saved in a Facebook AI Similarity Search (FAISS) vector store, 13 a library developed by Facebook for efficient similarity searches. FAISS uses clustering methods, such as k-means, to organize the feature space, allowing for the quick identification of the cluster nearest to a queried vector. This facilitates efficient and rapid text retrieval based on the similarity and relevance of user queries. This functionality is essential for DrugRepoChatter’s ability to quickly source and cite pertinent articles in response to user inquiries.

            Chatbot Implementation

            We built DrugRepoChatter using LangChain 11 as our foundation, a powerful library for constructing LLM applications with RAG capabilities. This technique allows the LLM to dynamically tap into external knowledge sources, like our expert-curated database, to deliver accurate and up-to-date information in its responses. 7,14,15

            To make DrugRepoChatter user-friendly, we incorporated the Llama3:8b model. 16 This system allows for fine-tuning of search results through several key parameters: score_threshold, k, and fetch-k. These parameters can be adjusted via the web interface, giving researchers control over the retrieval process. The score threshold, defaulting to 0.9, can be adjusted from 0.0 to 1.0 with a step size of 0.1, allowing users to set the minimum relevance score for retrieved documents. The k parameter, which determines the number of results returned, defaults to 3 and can be modified between 1 and 50. Similarly, the fetch-k parameter, controlling the initial retrieval scope before filtering, defaults to 20 and is also adjustable from 1 to 50. These customization options ensure that the chatbot’s answers are tailored to the user’s specific query needs. While the model’s temperature is set to 0.0 by default in the ollama_connector script to minimize hallucinations, this is not directly controlled by users in the GUI. LangChain plays a crucial role in seamlessly bridging the gap between the chatbot interface and the Llama3 models, ultimately empowering DrugRepoChatter to deliver precise and insightful responses based on the curated literature.

            DrugRepoChatter retrieves the most relevant articles’ embeddings to generate informed and accurate responses when posed with queries. This system ensures that the chatbot’s answers are only grounded in the embedded scientific literature. To provide responses that are concise and easy to understand, DrugRepoChatter was designed to adapt its answers based on the retrieved articles. It can also admit when it does not have enough information, ensuring that users are provided with informative and trustworthy responses.

            We used Streamlit 17 to implement our chatbot, a faster way to build and share data apps. Streamlit turns data scripts into shareable web apps. This allows us to create a user-friendly interface for DrugRepoChatter, making it accessible and easy-to-use for researchers and other users.

            CONFLICTS OF INTEREST

            The authors declare no conflict of interest.

            DATA AND CODE AVAILABILITY

            SUPPLEMENTARY

            The dataset supporting the findings of this study is publicly available and can be accessed through Zenodo. This dataset includes the curated articles utilized in the REPO4EU’s D2.1 review, providing a comprehensive resource for researchers in the field of drug repurposing. Interested readers can access and cite the dataset using the following reference:

            Delgado-Chaves FM. Articles included in REPO4EU’s D2.1 review [Data set]. Zenodo. 2024. Available from: https://doi.org/10.5281/zenodo.13767731

            REFERENCES

            1. Wallace J. The science of searching – how to find the evidence quickly and efficiently. Ir Med J. 2012. Vol. 105(4):121–122

            2. Eben C, Bőthe B, Brevers D, et al.. The landscape of open science in behavioral addiction research: current practices and future directions. J Behav Addict. 2023. Vol. 12(4):862–870. [Cross Ref]

            3. Persson PB, Persson AB. Scientific due diligence [in times of need for reliable information]. Acta Physiol. 2022. Vol. 234(1):e13740. [Cross Ref]

            4. Shrout PE, Rodgers JL. Psychology, science, and knowledge construction: broadening perspectives from the replication crisis. Annu Rev Psychol. 2018. Vol. 69:487–510. [Cross Ref]

            5. Dinis-Oliveira RJ, Magalhães T. The inherent drawbacks of the pressure to publish in health sciences: good or bad science. F1000Res. 2015. Vol. 4:419[Cross Ref]

            6. Ioannidis JPA, Berkwits M, Flanagin A, Bloom T. Peer review and scientific publication at a crossroads: call for research for the 10th international congress on peer review and scientific publication. JAMA. 2023. Vol. 330(13):1232–1235. [Cross Ref]

            7. Lewis P, Perez E, Piktus A, et al.. Retrieval-augmented generation for knowledge-intensive NLP tasks. Adv Neural Inf Process Syst. 2020. Vol. 33:9459–9474

            8. List M, Manz Q, Bernett J, et al.. D2.1 Whitepaper on the platform knowledge base and data standards for in silico drug repurposing. 2024. [Cross Ref]

            9. Canese K, Weis S. PubMed: The Bibliographic Database. The NCBI Handbook. March. 2013. ftp://ftp.sanru.ac.ir/system%20of%20virtual%20training%20file/Faculty%20of%20Crop%20Sciences/Plant%20Protection/MohammadAli%20Tajick/Fungal%20Systematics%20Doctori/The%20NCBI%20Handbook/2%20Bookshelf_NBK21094.pdf

            10. Vine R. Google scholar. J Med Libr Assoc. 2006. Vol. 94(1):97–99

            11. Chase H. LangChain, 10 2022. n.d. https://python.langchain.com/docs/introduction/

            12. Nussbaum Z, Morris JX, Duderstadt B, Mulyar A. Nomic embed: training a reproducible long context text embedder. arXiv [cs.CL]. arXiv. 2024. http://arxiv.org/abs/2402.01613

            13. Jégou H, Douze M, Johnson J, Hosseini L, Deng C. Faiss: similarity search and clustering of dense vectors library. Astrophysics Source Code Library. 2022. https://ui.adsabs.harvard.edu/abs/2022ascl.soft10024J

            14. Huang J, Tan M. The role of ChatGPT in scientific communication: writing better scientific review articles. Am J Cancer Res. 2023. Vol. 13(4):1148–1154

            15. Pollesello P, Papp Z. Integrating an artificial intelligence chatbot in scientific communication: dos and don’ts. Eur Sci Ed. 2023. Vol. 49:e112023. [Cross Ref]

            16. Huang W, Ma X, Qin H, et al.. How good are low-bit quantized LLaMA3 models? An empirical study. arXiv [cs.LG]. arXiv. 2024. [Cross Ref]

            17. Khorasani M, Abdou M, Fernández JH. Web Application Development with Streamlit. Apress. n.d. https://streamlit.io/Accessed June 12, 2024

            Author and article information

            Journal
            dr
            Drug Repurposing
            ScienceOpen (Berlin )
            2941-2528
            09 October 2024
            : 1
            : 2
            : e20240014
            Affiliations
            [1 ] Institute for Computational Systems Biology, University of Hamburg, Hamburg, Germany ( https://ror.org/00g30e956)
            [2 ] Data Science in Systems Biology, TUM School of Life Sciences, Technical University of Munich, Freising, Germany ( https://ror.org/02kkvpp62)
            [3 ] Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA, USA ( https://ror.org/04b6nzv94)
            [4 ] Ludwig Boltzmann Institute for Network Medicine at the University of Vienna, Vienna, Austria ( https://ror.org/03prydq77)
            [5 ] Max Perutz Labs, Vienna Biocenter Campus (VBC), Vienna, Austria ( https://ror.org/04khwmr87)
            [6 ] Department of Structural and Computational Biology, University of Vienna, Center for Molecular Biology, Vienna, Austria ( https://ror.org/03prydq77)
            [7 ] Faculty of Mathematics, University of Vienna, Vienna, Austria ( https://ror.org/03prydq77)
            [8 ] CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, Vienna, Austria ( https://ror.org/02z2dfb58)
            [9 ] GeneSurge GmbH, Munich, Germany;
            [10 ] Discovery and Data Science (DDS) Unit, STALICLA SL, Barcelona, Spain;
            [11 ] Department of Pharmacology and Personalised Medicine, FHML, MeHNS, Maastricht University, Maastricht, The Netherlands ( https://ror.org/02jz4aj89)
            [12 ] Munich Data Science Institute, Technical University of Munich, Garching, Germany;
            Author notes
            *Correspondence to: Fernando Miguel Delgado-Chaves, Institute for Computational Systems Biology, University of Hamburg, Albert-Einstein-Ring 8-10, 22761 Hamburg, Germany. Tel: +49-40-42838-7687. E-mail: fernando.miguel.delgado-chaves@ 123456uni-hamburg.de
            Author information
            https://orcid.org/0000-0002-6171-1215
            Article
            10.58647/DRUGREPO.24.2.0014
            d15f3dc9-26c4-443b-b90d-11692130a096
            2024 The Author(s).

            This work has been published open access under Creative Commons Attribution License (CC BY) 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Conditions, terms of use and publishing policy can be found at www.scienceopen.com.

            History
            : 25 June 2024
            : 20 September 2024
            Page count
            Figures: 1, References: 17, Pages: 6
            Funding
            Funded by: European Union’s Horizon Europe research and innovation programme
            Award ID: 101057619
            Funded by: Swiss State Secretariat for Education, Research and Innovation (SERI)
            Award ID: 22.00115
            As part of the REPO4EU project, this work was supported by the European Union’s Horizon Europe research and innovation programme under grant agreement No. 101057619. Views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the European Union or the European Health and Digital Executive Agency (HADEA). Neither the European Union nor the granting authority can be held responsible for them. This work was also partly supported by the Swiss State Secretariat for Education, Research and Innovation (SERI) under contract No. 22.00115.
            Categories

            semantic similarity,large language models,drug repurposing,chatbot,retrieval-augmented generation

            Comments

            Comment on this article