Appendix for: Charting the Evolution of Artificial Intelligence Mental Health Chatbots: From Rule-Based to Large Language Models
Appendix A: Classification of AI Systems
1. Rule-Based Systems: These rely on deterministic scripts (e.g., rule-based conversation systems, simple decision trees) with no data-driven learning. They are ideal for structured, low-risk tasks (e.g., symptom checklists) where predictability ensures safety. However, their rigidity limits their utility in dynamic therapeutic contexts.
2. Machine-Learning-Based Systems: These include traditional ML (e.g., support vector machines [SVM]) and non-generative deep learning (e.g., recurrent neural networks [RNN], long short-term memory [LSTM], and bidirectional encoder representations from transformers [BERT]). While RNNs/LSTMs and traditional ML differ technically (e.g., sequential vs. static data processing), both lack natural language fluency. Grouping these under "ML-based" reflects their shared limitation in mental health: adaptability without generative capacity.
3. (Generative) LLM-Based Systems: These leverage generative models (e.g., GPT-3, Large Language Model Meta AI [LLaMA]) trained on vast text corpora to produce humanlike dialogue. Given the absence of a universally accepted definition of large language models, this review considers only generative LLMs (i.e., those capable of producing free-form text) as LLM-based chatbots. This category includes multimodal models that can process images, audio, or other modalities in addition to text, as long as they maintain the core LLM architecture for language generation.
Appendix B: Search Queries and Additional Search Strategies
This study systematically investigated mental health chatbot research published between January 1st, 2020, and January 1st, 2025. The search was conducted across APA PsycNet, PubMed, Scopus, and Web of Science, adhering to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) 2020 guidelines. To ensure comprehensiveness, we designed our search strings based on two systematic scoping reviews on generative AI across different topics. Additionally, we used the shortest matching strings (e.g., “psycho”) to capture all lexical variations. The search strings combined terms such as “chatbot,” “conversational agent,” “virtual assistant,” “large language model,” and “mental health.” Search results were supplemented with manual searches of Google Scholar and conference proceedings from ACL (Association for Computational Linguistics), EMNLP (Empirical Methods in Natural Language Processing), COLING (Computational Linguistics), NAACL (North American Chapter of the Association for Computational Linguistics), NeurIPS (Conference on Neural Information Processing Systems), and CHI (ACM Conference on Human Factors in Computing Systems).
PubMed Query
("virtual assistant"[Title/Abstract] OR "large language model"[Title/Abstract] OR "large language models"[Title/Abstract] OR "conversational agent"[Title/Abstract] OR "conversational agents"[Title/Abstract] OR "chatbot"[Title/Abstract] OR "chatbots"[Title/Abstract]) AND ("mental"[Title/Abstract] OR "psychiatry"[Title/Abstract] OR "psychiatric"[Title/Abstract] OR "psychological"[Title/Abstract] OR "psychology"[Title/Abstract] OR "psycho"[Title/Abstract] OR "emotional support"[Title/Abstract]) AND ((fft[Filter]) AND (2020:2024[pdat]))
APA PsycNet Query
(Any Field: virtual assistant OR Any Field: large language model OR Any Field: large language models OR Any Field: conversational agent OR Any Field: conversational agents OR Any Field: chatbot OR Any Field: chatbots) AND (Any Field: mental OR Any Field: psychiatry OR Any Field: psychiatric OR Any Field: psychological OR Any Field: psychology OR Any Field: psycho OR Any Field: emotional support)
Web of Science Query
(TI=("generative artificial intelligence" OR "large language models" OR "generative model" OR "chatbot") OR AB=("generative artificial intelligence" OR "large language models" OR "generative model" OR "chatbot")) AND (TI=("mental" OR "psychiatr*" OR "psycho*" OR "emotional support") OR AB=("mental" OR "psychiatr*" OR "psycho*" OR "emotional support"))
Scopus Query
( TITLE-ABS-KEY ( "virtual assistant" OR "conversational agent" OR "conversational AI" OR "digital assistant" ) ) AND ( TITLE-ABS-KEY ( "mental" OR "psychiatr" OR "psycho" OR "emotional support" ) ) AND PUBYEAR > 2019 AND PUBYEAR < 2025 AND ( LIMIT-TO ( SUBJAREA , "COMP" ) OR LIMIT-TO ( SUBJAREA , "MEDI" ) OR LIMIT-TO ( SUBJAREA , "PSYC" ) OR LIMIT-TO ( SUBJAREA , "NEUR" ) OR LIMIT-TO ( SUBJAREA , "NURS" ) OR LIMIT-TO ( SUBJAREA , "HEAL" ) OR LIMIT-TO ( SUBJAREA , "SOCI" ) ) AND ( LIMIT-TO ( DOCTYPE , "ar" ) OR LIMIT-TO ( DOCTYPE , "cp" ) )
Additional Sources
Additional articles were identified through manual searches of the same word combinations on Google Scholar and major AI conference proceedings of interest, including ACL, EMNLP, COLING, NAACL, NeurIPS, and CHI. These supplementary searches were conducted to ensure comprehensive coverage of relevant literature not indexed in the primary databases.
Appendix C: The Annotation Protocol
Evaluation of chatbots for mental health care: a systematic review
Annotation Training Program Guidebook
1. Scope:
This review examines the evaluation of chatbots in mental health care. While our initial aim was to cover holistic areas of mental health care (i.e., psychiatry, psychotherapy, and general mental wellness support), most research focuses on more specific subconstructs, such as empathy, or on care in specific situations.
2. Background:
We are building an evaluation framework for mental health chatbot studies to help stakeholders understand what to focus on when assessing a mental healthcare chatbot. This work is the second step in our project. In the first step, we developed a general framework for AI chatbots in healthcare.
This review will provide the basis for creating a similar evaluation framework for psychotherapy, making this study more focused on evaluation.
3. Inclusion criteria:
1. Focus on Mental Health Care: The study must involve mental health care, not just related fields like psychosocial demographics or psycholinguistics.
2. Chatbots Involved: The paper must involve chatbots, not prediction-only models, robots, or voice bots; the systems must be able to hold conversations.
3. Full, Peer-Reviewed Papers: Exclude reviews, meta-analyses, and retracted papers. The paper must be a complete, peer-reviewed study.
4. Actual Chatbot Evaluation: The study must include an actual evaluation of the chatbot's performance, not just protocols, registries, or unrelated qualitative analyses.
4. Types of Studies
1. Automated or Expert Evaluation: Studies where chatbots are evaluated using pre-designed scripts or scenarios without the involvement of human participants. These evaluations may involve automated testing or assessments by experts to analyse the chatbot's performance, functionality, or adherence to predefined standards.
2. Cross-sectional Feedback Studies: Studies that collect feedback from human participants, including clinicians, patients, or potential users, based on their experiences interacting with the chatbot. These studies typically focus on user satisfaction, perceived usefulness, and usability at a single point in time.
3. Prospective Human Interaction Studies: Studies where human participants engage with the chatbot over a specified period, with their interactions monitored and outcomes measured to assess the chatbot's impact on mental health. These studies are typically designed to evaluate the effectiveness and therapeutic value of the chatbot in real-world settings.
Note: If a study fits into more than one category, note all relevant types (e.g., Type 1, 2). Keep track of any new categories for potential future classification.
5. Extraction Guidelines
Study Information:
Target Condition or Symptoms: Specify the mental health condition or symptoms the bot is designed to address, whether specific (e.g., general anxiety) or broad (e.g., general mental health support).
Bot Information:
- Base Bot Name: Record the original name and version of the chatbot. This is important because some papers revise existing bots but may not fully understand or accurately describe the version they are using, which will be discussed in our policy proposals.
- Adopted Bot Name: Note any new names given to modified versions of the bot, even if the changes are minor or not clearly defined.
- Functional Purpose of Evaluated Bot: Record the bot's specific design purpose, which may be narrower than the study's stated aim. For example, a study might aim to address schizophrenia, while the bot itself was designed to assist with managing auditory hallucinations or to help users cope with negative symptoms such as social withdrawal and lack of motivation.
- Bot Type: Whether the bot is rule-based, AI-based, LLM-based, or a combination of these.
- LLM: Refers to large language models developed after T5 and GPT-3, such as those with in-context learning capabilities like ChatGPT. Most studies using LLMs will state this clearly. However, be cautious of studies that claim to use LLMs but actually use models like BERT; these are classified as ML.
- ML: Includes any machine learning (ML) or non-rule-based models not classified as LLMs.
- Rule: Covers rule-based systems, including NLP models built on dependency trees, as they rely on predefined rules rather than statistical methods or probabilities.
Evaluation Measures:
- Constructs Measured: List the specific constructs the paper measured to determine the chatbot's effectiveness. This information should come directly from the paper, but if necessary, you can infer it from the qualitative analysis. However, avoid over-extrapolating; the goal is to understand what has been tested to inform our framework.
- Scales Used: List the scales by their established names, ensuring that they are validated and can be referenced. For each scale, specify the construct it measures using the format: scale name (construct). For example, GAD-7 (anxiety). This helps to verify that the scales were not developed specifically for this study and allows for external reference and validation.
- Type I (Automated/Expert Review):
- Test Questions: Provide a high-level summary of the questions, test sets, and test cases used to test the bot's performance.
- Evaluator expertise: If the evaluation was conducted by experts rather than automated, note the evaluators' backgrounds (e.g., trainees, psychology or medical students, social workers, psychologists). If the evaluation was automated, record NA.
- Type II (User Perceived Performance):
- Evaluator information: Identify the type of human evaluators involved—e.g., controls (healthy individuals), patients, doctors, general users, etc.
- Bot Usage Duration: Record how long the participants used the bot. Be as specific as possible, noting any details on frequency and duration, even though most studies may not specify this.
- Type III (Prospective Interaction with Outcomes):
- Evaluator expertise: If the evaluation was conducted by experts rather than automated, note the evaluators' backgrounds (e.g., trainees, psychology or medical students, social workers, psychologists).
- Primary Outcome: The primary mental health-related outcome (e.g., reduction in depressive symptoms).
- New Study Types: If you encounter papers that don’t fit into the existing three categories, note these for potential further classification. If enough such papers are found, we may define a new study type.
- NA vs NS: When something is not mentioned in the paper, note it as "ns" for "not specified." When the paper takes an unconventional approach that makes it unnecessary to report something, note it as "na" for "not applicable."
- Gate Symbols: Use the following symbols to denote relationships between constructs or methods (a minimal parsing sketch follows this list):
- "/" for "or" (rarely seen).
- "+" for "integration" (e.g., "trustworthiness+consistency" if they are evaluated together).
- ";" for "and" (e.g., "trustworthiness; consistency" if both are evaluated separately).
Appendix D: Data Processing and Analysis Methodology
Following the initial screening and annotation of 161 papers, we developed a systematic approach to transform raw annotation data into meaningful analytic categories while preserving the integrity of the original findings. This process involved several stages of data refinement and categorical structuring.
Annotation Processing Workflow
For each annotated dimension, we applied one or more of the following transformation processes:
1. Validation and preservation: We maintained original annotations where they provided clear, discrete classifications (e.g., chatbot architecture: LLM | Machine Learning | Rule-based).
2. Label splitting and weighting: For papers where multiple characteristics were identified within a single dimension (e.g., multiple target conditions or functional purposes), we disaggregated these into separate entries with proportional weighting to maintain the original study count. For instance, a study addressing both depression and anxiety would generate two entries, each weighted at 0.5, ensuring that quantitative analyses reflected actual study counts (a short sketch appears at the end of this subsection).
3. Canonical labeling: Where dimensions contained semantically equivalent terms across studies (e.g., "depressive symptoms" and "depression"), we mapped these to standardized canonical labels through a hybrid approach combining computational text analysis and expert review.
4. Hierarchical clustering: For dimensions with substantial variability (e.g., Target Condition, Functional Purpose, Outcome Measures), we developed a two-level hierarchical framework of themes and subthemes to enable both granular and high-level analysis.
Dimension-Specific Processing
The transformation process was tailored to each annotation dimension based on its complexity and analytical requirements, as summarized in Table 1.
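As an illustration of steps 2 and 3 above, the following minimal Python sketch (assumed, not the authors' actual pipeline; the canonical map shown is hypothetical) splits a multi-label cell into weighted entries and maps each label to its canonical form, so that per-label tallies still sum to the original study count.

from collections import Counter

# Hypothetical canonical-label map; the real mapping was produced by
# computational text analysis plus expert review (step 3 above).
CANONICAL = {
    "depressive symptoms": "depression",
    "general anxiety": "anxiety",
}

def split_and_weight(study_id, raw_cell):
    """Disaggregate a multi-label cell into weighted, canonicalized entries."""
    labels = [label.strip() for label in raw_cell.split(";") if label.strip()]
    weight = 1.0 / len(labels)          # weights sum to one study
    return [(study_id, CANONICAL.get(label, label), weight) for label in labels]

entries = split_and_weight("study_042", "depressive symptoms; general anxiety")
print(entries)   # [('study_042', 'depression', 0.5), ('study_042', 'anxiety', 0.5)]

tally = Counter()
for _, label, weight in entries:
    tally[label] += weight
print(sum(tally.values()))  # 1.0, i.e., still exactly one study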
Computational Assistance and Validation
Initial clustering and categorization drafts were generated using large language models (GPT-4o and GPT-o3 mini high) to propose potential thematic structures based on term similarities. These computational suggestions were then rigorously reviewed, significantly refined, and validated by the research team to ensure:
1. Relevance: Categories reflected meaningful distinctions for the given topic
2. Research Fidelity: The transformed data accurately represented the annotated characteristics
3. Analytical Utility: The resulting framework enabled insightful cross-comparisons between different chatbot approaches and evaluation methodologies
This hybrid approach, combining computational pattern recognition with expert domain knowledge, allowed us to develop a robust analytical framework while accommodating the substantial heterogeneity present in the literature. A minimal sketch of the LLM-assisted drafting step follows.
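The sketch below is a hedged illustration only of how such a draft could be requested (assumptions: the openai Python package, an OPENAI_API_KEY in the environment, and an illustrative prompt that is not the authors' actual wording); the output is a draft, and, as described above, always requires expert review and refinement.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

terms = ["loneliness", "social anxiety", "sleep problems", "panic attacks"]
prompt = (
    "Group these target conditions from mental health chatbot studies into "
    "themes and subthemes. Return JSON shaped as "
    '{"theme": {"subtheme": ["term", ...]}}.\n' + "\n".join(terms)
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
draft = response.choices[0].message.content  # a draft structure, not a final taxonomy
print(draft)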