Large language models (LLMs), used by over half of England’s local authorities to support social workers, may be introducing gender bias into care decisions, according to new research from the London School of Economics and Political Science (LSE), funded by the National Institute for Health and Care Research.
Published in the journal BMC Medical Informatics and Decision Making, the research found that Google’s widely used AI model ‘Gemma’ downplays women’s physical and mental health issues compared with men’s when used to generate and summarise case notes.
Terms associated with significant health concerns, such as “disabled,” “unable,” and “complex,” appeared significantly more often in descriptions of men than women. Similar care needs among women were more likely to be omitted or described in less serious terms.
Large language models are increasingly being used to ease the administrative workload of social workers and the public sector more generally. However, it remains unclear which specific models are being deployed by councils—and whether they may be introducing bias.
Dr Sam Rickman, lead author of the report and a researcher in LSE’s Care Policy and Evaluation Centre (CPEC), said: “If social workers are relying on biased AI-generated summaries that systematically downplay women’s health needs, they may assess otherwise identical cases differently based on gender rather than actual need. Since access to social care is determined by perceived need, this could result in unequal care provision for women.”
To investigate potential gender bias, Dr Rickman used large language models to generate 29,616 pairs of summaries based on real case notes from 617 adult social care users. Each pair described the same individual, with only the gender swapped, allowing for a direct comparison of how male and female cases were treated by the AI. The analysis revealed statistically significant gender differences in how physical and mental health issues were described.
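To give a sense of how such a paired, gender-swapped comparison can be set up, the sketch below is a minimal Python illustration, not the study’s actual pipeline. The gender-swap mapping, the severity term list (taken from the example terms above), the generate_summary placeholder and the choice of a Wilcoxon signed-rank test are all assumptions made for illustration; the paper itself describes the real prompts, models and statistics used.

```python
# Illustrative sketch only: a paired comparison of LLM summaries for
# gender-swapped versions of the same case note. All names below are
# assumptions, not the study's actual code.
import re
from scipy.stats import wilcoxon

# Minimal gender-swap map (assumption; real case notes need a far richer mapping).
SWAPS = {"he": "she", "him": "her", "his": "her", "mr": "ms", "man": "woman"}

# Example terms the press release associates with significant health concerns.
SEVERITY_TERMS = {"disabled", "unable", "complex"}


def swap_gender(text: str) -> str:
    """Return a copy of the case note with gendered terms swapped."""
    def repl(match: re.Match) -> str:
        word = match.group(0)
        swapped = SWAPS.get(word.lower(), word)
        return swapped.capitalize() if word[0].isupper() else swapped

    pattern = r"\b(" + "|".join(SWAPS) + r")\b"
    return re.sub(pattern, repl, text, flags=re.IGNORECASE)


def generate_summary(case_note: str) -> str:
    """Placeholder for an LLM call (e.g. to Gemma or Llama 3); stubbed here."""
    return case_note  # a real run would send the note to the model


def severity_count(summary: str) -> int:
    """Count occurrences of the severity terms in a summary."""
    words = re.findall(r"[a-z]+", summary.lower())
    return sum(words.count(term) for term in SEVERITY_TERMS)


def paired_bias_test(case_notes: list[str]):
    """Summarise male and female versions of each note and compare term counts."""
    male_counts, female_counts = [], []
    for note in case_notes:
        male_counts.append(severity_count(generate_summary(note)))
        female_counts.append(severity_count(generate_summary(swap_gender(note))))
    # Paired, non-parametric test of whether counts differ systematically by gender.
    return wilcoxon(male_counts, female_counts)
```

Because each pair describes the same person with only the gender changed, any systematic difference in the summaries can be attributed to how the model treats gender rather than to differences in underlying need.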
Among the models tested, Google’s AI model, Gemma, exhibited more pronounced gender-based disparities than benchmark models developed by either Google or Meta in 2019. Meta’s Llama 3 model, which is of the same generation as Google’s Gemma, did not use different language based on gender.
Dr Rickman said: “Large language models are already being used in the public sector, but their use must not come at the expense of fairness. While my research highlights issues with one model, more are being deployed all the time, making it essential that all AI systems are transparent, rigorously tested for bias and subject to robust legal oversight.”
The study is the first to quantitatively measure gender bias in LLM-generated case notes from real-world care records, using both state-of-the-art and benchmark models. It offers a detailed, evidence-based evaluation of the risks of AI in practice, specifically in the context of adult social care.
ENDS
For more information
Sue Windebank, LSE Media Relations Office, E: s.windebank@lse.ac.uk, T: +44 (0)20 7955 7060