<b>DrugRepoChatter: a drug repurposing expert chatbot curated by the REPO4EU consortium</b>





This is not the latest version of this article.

Author(s): Fernando Miguel Delgado-Chaves ¹,*, Lisa Spindler ², Farzaneh Firoozbakht ¹, Andreas Maier ¹, Michael Hartung ¹, Olga Tsoy ¹, Quirin Manz ², Johannes Kersting ², Judith Bernett ², Rui-Sheng Wang ³, Joseph Loscalzo ³, Julia Guthrie ⁴,⁵,⁶, Jörg Menche ⁴,⁵,⁶,⁷,⁸, Robert P. Loewe ⁹, Francesco Sirci ¹⁰, Montserrat Puiggros ¹⁰, Emre Guney ¹⁰, Zina Piper ¹¹, Harald H.H.W. Schmidt ¹¹, Markus List ², Jan Baumbach ¹

Publication date (Electronic preprint): 16 September 2024

Journal: DrugRxiv

Publisher: REPO4EU

Keywords: Drug repurposing, large language models, semantic similarity, chatbot, retrieval augmented generation (RAG), literature review

Revision notes

Dear Editor and Esteemed Reviewers,

We would like to express our sincere gratitude for the thoughtful and constructive feedback provided on our manuscript titled “DrugRepoChatter: a drug repurposing expert chatbot curated by the REPO4EU consortium.” We deeply appreciate the time and effort each reviewer has invested in evaluating our work. The insightful comments and suggestions have been instrumental in enhancing the quality and clarity of our study.

In response to the feedback, we have carefully addressed each point raised by Reviewer I and Reviewer II. Below, we provide a detailed response to each comment, outlining the specific revisions and improvements made to the manuscript. We believe these changes have significantly strengthened our paper and have better positioned our chatbot, DrugRepoChatter, as a valuable tool in the field of drug repurposing research.

Thank you once again for your invaluable contributions to our work. We look forward to your continued guidance and hope that the revised manuscript meets your expectations.

Sincerely,

Dr. Fernando M. Delgado Chaves

Postdoctoral Researcher

Institute for Computational Systems Biology

Universität Hamburg

Albert-Einstein-Ring 8-10, (3. Floor)

22761 Hamburg

Germany

https://www.cosy.bio/

Reviewer I

The paper describes the development of a chatbot designed to help researchers in the drug repurposing field. The chatbot has a database composed of 285 open-access articles curated by experts to provide accurate responses to user queries. The chatbot uses LLM and RAG techniques to improve the literature review process. The main strengths include the use of LLM and RAG as a cutting-edge approach to navigating a large amount of literature, the involvement of experts from the REPO4EU consortium in the curation process, and the use of open-access papers to promote open science. As a weakness, it must be emphasized that the literature used covers only 285 articles, which limits the generalizability of the results that the chatbot can provide. Apart from that, as the number of relevant publications grows, maintaining the quality and relevance of the curated database will become increasingly challenging. The manuscript does not detail strategies for scaling the curation process effectively.

We appreciate the reviewer's comment regarding the limited size of our current database and the challenges of scaling the curation process. We acknowledge that the current database of 285 articles does, indeed, constrain the generalizability of the results our chatbot can provide. However, we have been actively working on addressing this limitation.

In recent efforts to update the knowledge base of the REPO4EU project, we have developed a semi-automated approach to expand our literature database. This involves using new PubMed queries to create databases of relevant literature for three categories: drug-repurposing reviews, bioinformatics tools for drug repurposing, and drug-repurposing-related databases (REPO4EU D2.2). Our plan is to use this expanded database with DrugRepoChatter, allowing users to chat with the titles and abstracts of a much larger set of publications. This approach will significantly increase the breadth of information available through our tool.

It is important to note that while we can embed and interact with the titles and abstracts of these additional publications, downloading and embedding the full text of articles remains limited to open-access publications. This restriction is due to legal and ethical considerations, as we cannot provide access to publications behind a paywall.

We believe this semi-automated approach strikes a balance between expanding our knowledge base and maintaining the quality and relevance of the information provided. It allows us to scale up the amount of literature our chatbot can draw upon while still adhering to open-access principles. This development represents a significant step forward in addressing the limitation highlighted by the reviewer, and we plan to implement this feature in the near future to enhance the capabilities and usefulness of DrugRepoChatter within the larger scope of REPO4EU.
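The PubMed-query step described in this response can be illustrated with a minimal sketch. It only builds request URLs for the public NCBI E-utilities ESearch endpoint, one per literature category. The query strings and the names `CATEGORY_QUERIES`, `build_esearch_url`, and `build_all_urls` are hypothetical placeholders chosen for illustration; they are not the queries or code used for REPO4EU D2.2.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Hypothetical query terms, one per literature category named in the response.
# These are illustrative placeholders, not the queries used in REPO4EU D2.2.
CATEGORY_QUERIES = {
    "reviews": '"drug repurposing"[Title/Abstract] AND review[Publication Type]',
    "tools": '"drug repurposing"[Title/Abstract] AND bioinformatics[Title/Abstract]',
    "databases": '"drug repurposing"[Title/Abstract] AND database[Title/Abstract]',
}


def build_esearch_url(term: str, retmax: int = 100) -> str:
    """Return an ESearch URL asking PubMed for up to `retmax` matching PMIDs as JSON."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def build_all_urls(queries: dict[str, str], retmax: int = 100) -> dict[str, str]:
    """Build one ESearch URL per category so the searches can be re-run periodically."""
    return {category: build_esearch_url(term, retmax) for category, term in queries.items()}
```

Fetching each URL returns a list of PubMed identifiers whose titles and abstracts could then be retrieved and embedded. Keeping the queries in one dictionary makes it straightforward to rerun them on a schedule.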

Reviewer II

Level of importance: DrugRepoChatter appears to be a very promising approach to tackling the overwhelming process of gathering and extracting relevant information from the literature. The approach is innovative: I am looking forward to seeing how this will develop further, and whether such an approach will be picked up in fields other than drug repurposing. In that sense, it could serve a larger community.

Level of validity: The authors describe the different problems at hand quite well, and the approach is logical and well described.

Level of completeness: The article is well referenced.

Regarding the Methods section: it may lack some of the detail needed to re-implement the tool; see the next comments.

In the introduction, you specify “Importantly, the quality of an RAG hinges on the quality of the training material, creating a need for expert-curated literature collections as input.” However, the method section does not mention anything about whether training was performed and, if so, how. It would be interesting to develop this part of the methods.

We appreciate the reviewer’s pointing out this inconsistency. We apologize for the confusion caused by our statement about training in the introduction.

In fact, our method does not involve training. Instead, we use a zero-shot model for question answering in combination with a vector database, implementing retrieval-augmented generation (RAG). This approach allows us to leverage pre-trained language models without the need for additional training or fine-tuning on our specific dataset.
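To make the "no training" point concrete, here is a minimal sketch of how a zero-shot RAG prompt might be assembled: passages retrieved from the vector database are pasted into the prompt as context, and the pre-trained model answers without worked examples or weight updates. The `Passage` class, the function name, and the prompt wording are assumptions for illustration and are not taken from the DrugRepoChatter code base.

```python
from dataclasses import dataclass


@dataclass
class Passage:
    source: str  # identifier of the article the text chunk came from
    text: str    # the retrieved text chunk


def build_zero_shot_prompt(question: str, passages: list[Passage]) -> str:
    """Assemble a prompt from instructions, retrieved context, and the question.

    No worked question-answer examples are included (zero-shot), and nothing
    here updates model weights: the curated literature only enters as context.
    """
    context = "\n".join(
        f"[{i}] ({p.source}) {p.text}" for i, p in enumerate(passages, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
```

Under this design, the quality of the answer depends on which passages are retrieved, which is why the curated collection matters even though no training takes place.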

To clarify this point in the manuscript, we have revised the mentioned statement in the introduction and provided a more accurate description. We emphasized that our system relies on the quality of the curated literature for populating the vector database rather than for training purposes.

In the web tool, options such as score threshold, k, and fetch_k can be used to fine-tune the search. Those are not specified in the Methods section. It would be good to harmonize the information between the paper and the tool’s web interface.

We appreciate the reviewer’s insightful comment regarding the omission of specific parameters such as score_threshold, k, and fetch_k from the methods section. These parameters are, indeed, relevant to the fine-tuning of search results.

Thus, we have added a detailed description of these parameters to the methods section. The score_threshold parameter defines a minimum relevance score for retrieved documents, ensuring that only results above a certain threshold are considered relevant; this helps improve precision by filtering out less relevant documents.

The parameter k controls the number of results returned, allowing users to adjust the trade-off between recall (returning more documents) and precision (focusing on the most relevant results). The fetch_k parameter, on the other hand, defines how many documents are initially retrieved before applying further filtering or re-ranking. This parameter is useful for broadening the initial retrieval scope while still allowing the system to apply additional ranking based on relevance.
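The interplay of the three parameters can be sketched in plain Python. This is an illustrative approximation under stated assumptions, not the tool's actual retrieval code: it uses cosine similarity as the relevance score, the default values are hypothetical, and the optional re-ranking step between fetching and final selection is omitted.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(
    query_vec: list[float],
    doc_vecs: dict[str, list[float]],
    k: int = 4,                    # hypothetical default
    fetch_k: int = 20,             # hypothetical default
    score_threshold: float = 0.0,  # hypothetical default
) -> list[tuple[str, float]]:
    """Return up to k (doc_id, score) pairs, best first."""
    # Score every document against the query and sort by descending relevance.
    scored = sorted(
        ((doc_id, cosine(query_vec, vec)) for doc_id, vec in doc_vecs.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # fetch_k: size of the initial candidate pool.
    candidates = scored[:fetch_k]
    # score_threshold: drop candidates below the minimum relevance score.
    relevant = [(doc_id, s) for doc_id, s in candidates if s >= score_threshold]
    # k: number of results finally returned.
    return relevant[:k]
```

Raising `score_threshold` or lowering `k` favors precision, while a larger `fetch_k` widens the pool from which the final `k` results are chosen.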

We have ensured that the functionality of these parameters is reflected both in the manuscript and in the algorithm. Users can adjust score_threshold, k, and fetch_k directly through the tool’s interface, aligning the online application with the methodology described in the paper.

Regarding the conclusion section: In the “further research” paragraph, could you (if already defined) describe how you will approach/organize keeping the tool up to date, in terms of cadence, for example, how often will the team curate new articles? How often will the web page be updated?

We acknowledge that the scalability of our curation process and the update cadence are areas for improvement, and we are actively working on solutions. We are developing a semi-automated approach to expand our literature database, which involves systematically and periodically using PubMed queries to create databases of relevant literature for key categories in drug repurposing, including reviews, bioinformatics tools, and related databases (REPO4EU D2.2). This method will allow us to significantly increase the breadth of information available through DrugRepoChatter. Importantly, to maintain sustainability and efficiency, our plan is to focus on chatting with titles and abstracts of this expanded set of publications, rather than attempting to curate and embed full texts, which would be resource-intensive and difficult to maintain over time. This approach strikes a balance between expanding our knowledge base and maintaining practicality, allowing users to interact with a much larger and more current set of publications. While the core, human-reviewed curated dataset of full-text articles will remain as a foundation, this new layer of title and abstract interaction will provide up-to-date coverage of the rapidly evolving field.

Regarding the web tool itself, I understand it is work in progress, but here are some suggestions of what could be helpful to add:

Add list of the 285 references/papers currently used.

To enhance transparency and allow for further validation, we have made the metadata of our curated articles publicly available. Interested parties can access and cite this dataset through Zenodo using the following citation:

Delgado-Chaves, F. M. (2024). Articles included in REPO4EU's D2.1 review [Data set]. Zenodo. https://doi.org/10.5281/zenodo.13767731

This resource provides comprehensive details about the articles included in our review, facilitating reproducibility and future expansions of our database.

Add paper selection criteria (4 bullet points mentioned in the paper introduction) to the web tool page for more visibility when landing on the page.

Following the reviewer’s advice, we have included the details of how the REPO4EU D2.1 database was constructed in our “About” page, where we provide a link to an online document with the screening performed (also available as supplementary material to this manuscript). Additionally, we have clarified there how the reviewed articles were assessed against the bullet-point criteria, which are now also stated in the Methods section.

Add a short user guide on how to fine-tune options such as Score threshold, k, fetch_k (“temperature” is mentioned in the method section of the paper but is not explicitly explained in the web tool, for example), along with a few test examples of questions to ask the bot that proved useful.

We thank the reviewer for this valuable suggestion. We appreciate the attention to detail in highlighting the need for a more comprehensive explanation of the fine-tuning options available in our web tool. In response to the reviewer’s comment, we have made the following improvements:

We have now specified all the fine-tuning options (score threshold, k, and fetch_k) in detail in the Methods section of the paper. This addition provides a clear explanation of each parameter, its default value, and its range of possible values.
Regarding the "temperature" parameter, we apologize for the confusion. While it was mentioned in the Methods section, it is not directly controllable in the web interface. We have clarified in the paper that the temperature is set to 0.0 by default in our backend to minimize hallucinations but is not exposed as a user-adjustable parameter in the GUI. However, it can be modified when running the Docker image locally.
Taking your suggestion, we have developed a short user guide that will be integrated into the web tool. This guide is available as hover elements in the Q&A page and explains in situ how to use and fine-tune the available options (score threshold, k, and fetch_k), providing users with a clear understanding of how these parameters affect the search results and response generation.
We are also compiling a set of example questions demonstrating the capabilities of DrugRepoChatter. These are included as suggestions in our Q&A page.
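The effect of the temperature setting mentioned above can be illustrated with a temperature-scaled softmax over next-token scores. This is a generic sketch of the standard formulation, not the backend's code; treating a temperature of exactly 0.0 as greedy selection of the highest-scoring token is an assumption about how the limit is handled, and the function name is hypothetical.

```python
import math


def next_token_probs(logits: list[float], temperature: float) -> list[float]:
    """Turn raw token scores into probabilities at a given sampling temperature."""
    if temperature == 0.0:
        # Limit case (assumed behavior): all probability mass goes to the
        # highest-scoring token, so generation becomes deterministic.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]
```

Lower temperatures concentrate probability on the top-scoring token, which reduces run-to-run variation in the generated answers.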

The Q&A section is for the moment empty.

We have checked the web tool following the reviewer’s comment and verified that the Q&A section is available and the chat works. We hypothesize that the reviewer accessed the web tool at a time when it was temporarily unavailable.

Level of comprehensibility: While the overall article is comprehensible, I have some suggestions on how some sections could be restructured.

Introduction: keep the focus on why you created the tool. My suggestions:

Move the 4 bullet points of the publication filtering criteria to the “Article Curation Process” of the “Methods” section and refer to it in the introduction. In general, I consider the publication selection part to rather belong to the “Results” section, even though it was already published.

Likewise, I would suggest moving the part that refers to DrugRepoChatter to the “Results” section.

We thank the reviewer for these suggestions, and we have now rearranged the sections accordingly. We have also adapted the end of the introduction and the beginning of the results section to introduce our tool more clearly.

Discussion:

“In conclusion, DrugRepoChatter exemplifies…” Could be moved to the conclusion section.

We thank the reviewer for pointing this out and have moved this paragraph to the conclusion section in the new version of the manuscript.

Finally, here are a couple of typos / rephrasing suggestions: “quality of an RAG” -> “quality of a RAG”

We have fixed this grammatical error in the new version of the manuscript.

“Previously, in (List et al. 2024), we performed” -> “We previously performed (List et al. 2024)”

We have revised this sentence to improve its flow and clarity.


Author and article information


Affiliations

[1] Institute for Computational Systems Biology, University of Hamburg, Albert-Einstein-Ring 8-10, 22761 Hamburg, Germany (https://ror.org/00g30e956)

[2] Data Science in Systems Biology, TUM School of Life Sciences, Technical University of Munich, 85354 Freising, Germany (https://ror.org/02kkvpp62)

[3] Department of Medicine, Brigham and Women's Hospital, Harvard Medical School, Boston, Massachusetts, USA (https://ror.org/04b6nzv94)

[4] Ludwig Boltzmann Institute for Network Medicine at the University of Vienna, Augasse 2-6, A-1090 Vienna, Austria (https://ror.org/03prydq77)

[5] Max Perutz Labs, Vienna Biocenter Campus (VBC), Dr.-Bohr-Gasse 9, 1030 Vienna, Austria (https://ror.org/04khwmr87)

[6] University of Vienna, Center for Molecular Biology, Department of Structural and Computational Biology, Dr.-Bohr-Gasse 9, 1030 Vienna, Austria (https://ror.org/03prydq77)

[7] Faculty of Mathematics, University of Vienna, Oskar-Morgenstern-Platz 1, A-1090 Vienna, Austria (https://ror.org/03prydq77)

[8] CeMM Research Center for Molecular Medicine of the Austrian Academy of Sciences, Lazarettgasse 14, AKH BT 25.3, A-1090 Vienna, Austria (https://ror.org/02z2dfb58)

[9] GeneSurge GmbH, Ottostr. 3, 80333 Munich, Germany

[10] Discovery and Data Science (DDS) Unit, STALICLA SL, Moll de Barcelona, s/n, Edif Este, 08039 Barcelona, Spain

[11] Department of Pharmacology and Personalised Medicine, FHML, MeHNS, Maastricht University, The Netherlands (https://ror.org/02jz4aj89)

Author notes

[*] Email: fernando.miguel.delgado-chaves@uni-hamburg.de

Author information

Fernando Miguel Delgado-Chaves https://orcid.org/0000-0002-6171-1215

Article

DOI: 10.58647/DRUGARXIV.PR000014.v3

SO-VID: deda3fa1-39de-4f45-b071-d82adee5be82

License:

This work has been published open access under Creative Commons Attribution License CC BY 4.0, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. Conditions, terms of use and publishing policy can be found at www.scienceopen.com.


Funding

Funded by: HORIZON EUROPE Framework Programme (funder ID: http://dx.doi.org/10.13039/100018693)

Award ID: 101057619

Funded by: Swiss State Secretariat for Education, Research and Innovation (SERI)

Award ID: 22.00115