> The following steps require the ukbunpack and ukbconv utilities from the UK Biobank website. The file decrypt_all.sh will run through the following steps on one of the on-prem servers.
> Once the data is downloaded, it needs to be "ukbunpacked" which decrypts it, and then converts it to a file format of choice. Both ukbunpack and ukbconv are available from the UK Biobank's website. The decryption has to happen on a linux system if you download the linux tools, e.g. the Broad's on-prem servers. Note that you need plenty of space to decrypt/unpack, and the programs may fail silently if disk space runs out during the middle.
Good catch! The data is everywhere, re-uploaded every week.
I am aware of ~30 repositories that UK Biobank has asked GitHub to delete, and can still be found elsewhere online. They know the site, they have managed to delete data from that site before, and yet the files are still there.
> It has given 20,000 researchers around the world access under strict agreements that prohibit sharing data further.
To me it seems rather naive to have done that.
After all, you can't un-leak medical data. So even if the "strict agreement" included huge punishments, there's no getting the toothpaste back in the tube.
If you want to ensure compliance before a leak happens you have to (ugh) audit their compliance. And that isn't something that scales to 20,000 researchers.
The people who agreed to contribute their biodata did not consent to that.
If you want such a project you need to have a new project with a different agreement. I doubt you could get as many volunteers to freely give away such intimate data to anyone who wants though
You mean giving anyone access to the data? Or open sourcing the code? If the latter, I think that's a generally a good practice. Security through obscurity is never good for public infrastructure. In this case, UK Biobank has now switched to a remote access platform (not particularly secure, as the data was found for sale on Alibaba today), but contracting it to DNAnexus and Amazon. Private companies have no incentives to open source data, unless mandated to do so.
In the EU, there is a bigger interest in building scalable but also secure platforms for health data. Hopefully good innovation will come from there.
Hard to do. The same people with the collection and tracking infrastructure required are infinitely sue-able so you need legal protection if anything goes wrong.
Took me 5 minutes to find more: https://github.com/tanaylab/Mendelson_et_al_2023/blob/9c5a65... (Uses Date of Birth column).
And some information on how they were distributing it to researchers: https://github.com/broadinstitute/ml4h/blob/master/ingest/uk...
> The following steps require the ukbunpack and ukbconv utilities from the UK Biobank website. The file decrypt_all.sh will run through the following steps on one of the on-prem servers.
> Once the data is downloaded, it needs to be "ukbunpacked" which decrypts it, and then converts it to a file format of choice. Both ukbunpack and ukbconv are available from the UK Biobank's website. The decryption has to happen on a linux system if you download the linux tools, e.g. the Broad's on-prem servers. Note that you need plenty of space to decrypt/unpack, and the programs may fail silently if disk space runs out during the middle.
https://biobank.ctsu.ox.ac.uk/crystal/download.cgi
Good catch! The data is everywhere, re-uploaded every week.
I am aware of ~30 repositories that UK Biobank has asked GitHub to delete, and can still be found elsewhere online. They know the site, they have managed to delete data from that site before, and yet the files are still there.
> It has given 20,000 researchers around the world access under strict agreements that prohibit sharing data further.
To me it seems rather naive to have done that.
After all, you can't un-leak medical data. So even if the "strict agreement" included huge punishments, there's no getting the toothpaste back in the tube.
If you want to ensure compliance before a leak happens you have to (ugh) audit their compliance. And that isn't something that scales to 20,000 researchers.
Too late to do anything about it now though :(
What are the pros/cons of just open-sourcing everything for future bio bank projects?
The people who agreed to contribute their biodata did not consent to that.
If you want such a project you need to have a new project with a different agreement. I doubt you could get as many volunteers to freely give away such intimate data to anyone who wants though
You mean giving anyone access to the data? Or open sourcing the code? If the latter, I think that's a generally a good practice. Security through obscurity is never good for public infrastructure. In this case, UK Biobank has now switched to a remote access platform (not particularly secure, as the data was found for sale on Alibaba today), but contracting it to DNAnexus and Amazon. Private companies have no incentives to open source data, unless mandated to do so.
In the EU, there is a bigger interest in building scalable but also secure platforms for health data. Hopefully good innovation will come from there.
Hard to do. The same people with the collection and tracking infrastructure required are infinitely sue-able so you need legal protection if anything goes wrong.