2025 Data (Information)
Evaluating Trust and Inclusivity: A Machine-Driven Benchmark for Large Language Model Chatbots in LGBTQ+ Suicide Prevention: Data
The advent of conversational artificial intelligence (AI) has drawn increasing attention to evaluating the performance of large language models (LLMs) in high-risk mental health scenarios, particularly for marginalized groups. However, methodologies for assessing chatbot performance in such contexts remain underexplored. To address this gap, we introduce a comprehensive machine-driven evaluation pipeline for generative AI chatbots that provide mental health support, focusing on scenarios of suicidality among LGBTQ+ individuals. The pipeline assesses chatbot responses across six metrics: ROUGE (Recall-Oriented Understudy for Gisting Evaluation), METEOR (Metric for Evaluation of Translation with Explicit Ordering), ethical alignment, sentiment distribution, cultural inclusivity, and linguistic complexity. Nine general-purpose LLM-based chatbots (ChatGPT-5, ChatGPT-4.0, Claude, Gemini, LLaMA-3, DeepSeek, Mistral, Perplexity AI, and HuggingChat) and two LGBTQ+-focused chatbots (JackAI and the RUBIES Gender Journey chatbot) were evaluated using this pipeline. The evaluation revealed moderate lexical and semantic similarity between chatbot outputs (ROUGE ranging from 0.22 to 0.36; METEOR from 0.19 to 0.27), but also inconsistent ethical alignment (scores from 0.61 to 0.98) and potential deficiencies in cultural inclusivity (most scores below 0.2 and only two above 0.3). Sentiment distribution varied considerably across models (scores ranging from 0.04 to 1.00), while linguistic complexity scores averaged around 56 (out of 100), indicating moderately complex language. These findings highlight fine-grained differences in chatbot performance across evaluation dimensions and underscore the need for a holistic interpretation of chatbot effectiveness. The insights gained can guide the appropriate use and improvement of AI chatbots for supporting LGBTQ+ individuals dealing with suicidality.
Keywords: Large language models; Chatbot evaluation; Mental health support; LGBTQ+; Ethical alignment; Inclusivity
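For readers unfamiliar with the lexical-overlap metrics named in the abstract, the following is a minimal sketch of a ROUGE-1 style score (unigram overlap between a reference text and a chatbot response). It is an illustration only: the function name `rouge1`, the whitespace tokenization, and the example sentences are assumptions made here, and the archive's actual pipeline may use a different ROUGE variant or library.

```python
from collections import Counter


def rouge1(reference: str, candidate: str) -> dict:
    """Unigram-overlap precision, recall, and F1 (hypothetical helper).

    Tokenization is a simple lowercase whitespace split, which is an
    assumption of this sketch rather than the method used in the dataset.
    """
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Clipped overlap: a token counts at most as often as it occurs in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative sentences, not taken from the dataset.
scores = rouge1(
    "you are not alone and help is available",
    "help is available and you matter",
)
print(scores)
```

In this example five of the candidate's six tokens also occur in the eight-token reference, so precision is 5/6 and recall is 5/8. Scores of this kind reward shared wording only, which is one reason the abstract pairs them with separate ethical-alignment and inclusivity measures.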
Files
AI_Society_Data_Archive.zip
application/zip
305 KB
More About This Work
- Academic Units: Social Work
- Published Here: October 13, 2025