FASTA files for BRCA1, BRCA2, SMA, KIR, MHC (bash scripting Entrez Direct)

I needed to create some FASTA files of important regions of the human genome for work. BRCA1 & 2 are important in breast cancer research, SMA is implicated in  spinal muscular atrophy, and KIR and MHC are important for your immune system (e.g. why you have to have organ donor “compatibility”). I wrote a bash script that used Entrez Direct to automatically download these files from the NCBI servers.

FASTA files

If you just want the FASTA files to play with, they are here.

They were downloaded from the NCBI website and based on the NCBI Gene database coordinates against hg38.

source code

https://github.com/nouyang-curoverse/GA4GH_regions

This repository contains a link to download the final FASTA files, the ids.csv file I used as the master list of “mhc” and “kir” genes, the bash script file, and the xsl “xml transform” file I used to extract the information I needed from the xml file.

what i learned

Hey, what’s a gene anyway? If you use various databases (ensembl, ncbi, genenames, lumc, omim, lrg), you”ll get a whole range of coordinates for the same gene.

Take, for instance, BRCA2:

Start End Length Source
32,889,611 32,974,403 84,792 ensembl
32,884,617 32,975,809 91,192 lrg
32,889,617 32,973,809 84,192 ncbi entrez
32,889,616 32,973,808 84,192 OMIM
32,890,598 32,972,907 82,309 COSMIC sanger
32,889,641 32,907,422 17,781 CGAP

 

In the end I standardized around using the NCBI Gene database and ignored the rest.

Hey, what’s the KIR gene region anyway? Turns out there’s a gazillion KIR genes, and there’s not exactly a “list” of them. Same for HLA. I just used my best human judgement and culled them from searches on NCBI Gene. From the git repo Readme:

For the HLA genes, I used the IMGT list of gene names, found at ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/fasta/ . For the KIR genes, I used NCBI Gene query to list all the KIR genes. http://www.ncbi.nlm.nih.gov/gene and "Homo sapiens"[porgn] AND KIR

The final list is in ids.csv.

The manual process was to download the FASTA file from the NCBI Gene database.

Screenshot from 2015-01-24 23:47:21

 

But for some genes, that is the KIR and HLA genes, each gene (e.g. just HLA-E) has many “alternate locii” versions.

Screenshot from 2015-01-24 23:49:03

Downloading these by hand would take days. Therefore, I needed to figure out a way to script this. My initial thought was to write a scraper using Google Sheet Scripts or a python library like Beautiful Soup. However, I thought this was dumb, because this NCBI Gene site is clearly the front-end to some database that hopefully had an API for programmatic access of some kind.

After a few days (seriously, this all took way longer than I thought it would) and with help from the biostars community I was able to figure out how to use Entrez Direct and write a bash script to automate the process for all the  genes.

(It has a hacky fix, where the bash script needs to be run again for each line in the file,

for run in {1..48} #change this!

 

I was too lazy to debug it that day and just wanted to get this finally done).

 

The source code is provided here, and contains reasonable comments and spits out some debugging info when you run it.

That’s all.

Leave a comment if you have questions 🙂

appendix

Emailed from me to ga4gh-dwg :

With help from biostars*, I finished pulling out the KIR and HLA regions and fixed the SMA region.

The updated collection may be publicly viewed at
https://workbench.qr1hi.arvadosapi.com/collections/download/qr1hi-4zz18-7zk4muy5grnaqpv/4qji0cfumh25dttlwteo6rj2b83z2b8vz1l0rja3uzo82bf3s/

The updated collection README and scripts are at
https://github.com/nouyang-curoverse/GA4GH_regions (the FASTA files are named by gene name and ncbi gene id, e.g. BRCA1-672.fa)

Of note, I did not mirror the IMGT HLA contents (ftp://ftp.ebi.ac.uk/pub/databases/ipd/imgt/hla/fasta/), even though that was requested on the minutes from the DWG meeting, due to their policy https://www.ebi.ac.uk/ipd/imgt/hla/licence.html .

Feedback appreciated!

Thanks,
–Nancy

*credit to https://www.biostars.org/p/122680/ and https://www.biostars.org/p/122522/