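UK Biobank, one of the world’s largest health information databases, faces a risk of data leaks: researchers inadvertently published datasets containing sensitive medical information on a code-sharing platform, prompting an ethical debate about the balance between large-scale health research and the protection of personal privacy.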
UK Biobank, which holds the medical records of 500,000 British volunteers, is one of the world’s most comprehensive stores of health information and is credited with driving breakthroughs in cancer, dementia and diabetes research. But scientists approved to access Biobank’s sensitive data appear to have sometimes been cavalier about its security.
One data expert said the scale and persistence of the problem was “shocking” at a time when AI and social media were making it ever easier to cross-reference information online.
UK Biobank rejected the concerns, saying that no identifying data, such as names and addresses, were provided to researchers.
In a statement, Prof Sir Rory Collins, the chief executive of UK Biobank, said: “We have never seen any evidence of any UK Biobank participant being re-identified by others.”
Founded in 2003 by the Department of Health and medical research charities, UK Biobank holds genome sequences, scans, blood samples and lifestyle information of 500,000 volunteers. Last month, the government extended Biobank’s access to volunteers’ GP records.
Scientists at universities and private companies across the world apply for access and, until late 2024, were free to download data directly on to their own computer systems.
Before that change, data had been inadvertently published online, and Biobank appears still to be grappling with the problem.
The issue emerged because journals and funders increasingly require researchers to publish the code they have used to analyse large datasets. When intending to upload code, some researchers have also accidentally published partial or entire Biobank datasets to GitHub, a popular online code-sharing platform. UK Biobank prohibits researchers from sharing data outside their systems and says it has introduced further training for all researchers.
In the past year, the data leaks appear to have become a more urgent concern to UK Biobank. Between July and December 2025, it issued 80 legal notices to GitHub, which has complied with requests to remove data from the internet. Yet much remains available.
A data expert who reviewed the file said: “It sent shivers down my spine to even open. I deleted the file immediately. It was very detailed and felt like a gross invasion of privacy even to glance at.”
One volunteer, who provided treatment dates for a fracture and seizure, could not be located in the dataset. A second volunteer, a woman in her 70s, shared her month and year of birth and the month and year she had a hysterectomy. Only one person in the dataset matched these details. The apparent match was corroborated by five other diagnoses from the records that the volunteer had not initially disclosed.
“Effectively you were rehearsing the main parts of my medical history to me without me having given you any information at all. I didn’t expect that,” the volunteer said.
The woman said she was not too concerned about her own data being exposed and intended to remain a participant, saying that she viewed UK Biobank’s work as “extremely important”. But, she added: “I’m more concerned about whether Biobank has broken its agreement with people. They said they would hold our data securely … I just feel as though that has to come into the equation.”
A Biobank spokesperson said: “As we have communicated to our participants, including on our website: ‘If a participant puts information that reveals something about their health and identity, such as genealogy data, on a public website, this could make it possible for their identity to be discovered by cross-referencing UK Biobank research data.’
“You have simply demonstrated why we tell participants not to do this.”
The spokesperson added that Biobank had taken extensive measures to protect participants’ privacy, including proactively searching GitHub, contacting researchers directly and issuing legal takedown notices, actions which they said had led to about 500 repositories being removed. Many of these, it said, contained only patient IDs, not health data.
Privacy experts said UK Biobank’s approach appeared at odds with the reality that many people, reasonably, shared some health information online and that in an age of AI this could readily be identified and cross-referenced.
“Are these people aware that the internet exists?” asked Prof Felix Ritchie, an economist at the University of the West of England. “The idea that they can rely on their volunteers never putting any other information out there about themselves is an entirely unreasonable thing to expect.”
Dr Luc Rocher, associate professor at the Oxford Internet Institute, who reviewed several Biobank datasets found online, said that removing identifiers often did not guarantee anonymity and that simply knowing a person’s birthday and, say, the date they broke a leg might be enough to pinpoint their record with high confidence.
“Once identified, that record could reveal sensitive information such as a psychiatric diagnosis, an HIV test result, or a history of drug abuse,” they said.
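The cross-referencing Rocher describes can be illustrated with a rough sketch. The records below are entirely synthetic, and the field names (`birth`, `procedure`) and helper functions (`match`, `unique_fraction`) are hypothetical, chosen only for illustration; nothing here reflects UK Biobank's actual data format or any method used in the reporting. The sketch shows how a dataset with names removed can still narrow to a single record once two dates are known.

```python
from collections import Counter

# Entirely synthetic records. Field names are hypothetical and are not
# UK Biobank's schema. Dates are stored as "YYYY-MM" strings.
records = [
    {"id": "A", "birth": "1950-03", "procedure": "1998-06"},
    {"id": "B", "birth": "1950-03", "procedure": "2001-11"},
    {"id": "C", "birth": "1950-03", "procedure": "2001-11"},
    {"id": "D", "birth": "1962-09", "procedure": "1998-06"},
]


def match(rows, **known):
    """Return the rows whose fields equal every known value."""
    return [r for r in rows if all(r.get(k) == v for k, v in known.items())]


def unique_fraction(rows, keys):
    """Fraction of rows whose combination of `keys` appears exactly once."""
    counts = Counter(tuple(r[k] for k in keys) for r in rows)
    singles = sum(1 for r in rows if counts[tuple(r[k] for k in keys)] == 1)
    return singles / len(rows)


# Knowing only the birth month leaves several candidates.
by_birth = match(records, birth="1950-03")

# Adding one procedure date narrows the candidates to a single record.
by_both = match(records, birth="1950-03", procedure="1998-06")

print(len(by_birth), len(by_both))
print(unique_fraction(records, ["birth"]))
print(unique_fraction(records, ["birth", "procedure"]))
```

In this toy example, each extra known detail shrinks the pool of candidates, and the share of records that are unique on the combined fields is higher than on birth month alone. That is the general reason removing names and addresses does not by itself guarantee anonymity.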
Prof Niels Peek, professor of data science and healthcare improvement at the University of Cambridge, said the scale of the problem was “shocking”. “If it had happened once or 10 times I’d probably say: ‘It’s not great that it’s happened but at the same time zero risk is impossible,’” he said. “Hundreds. That’s a little bit too much.”
In Peek’s view, Biobank’s actions show it has taken the issue seriously and “done everything that one can reasonably expect”. But, he added: “The scale and persistence with which this has happened demonstrates that there are huge tensions between the ambition to drive health research with data at scale and the legal and ethical imperative to protect people’s privacy.”
Experts questioned whether Biobank will be able to fully regain control of the data released online. Despite researchers and GitHub having taken down most of the offending repositories in response to Biobank’s requests, many of the relevant files remained available on a code archive website until shortly before publication.