NCBI

library(webseq)

Download genome assemblies

Download the GenBank file for GCF_003007635.1.

The function will access files within this directory: https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/007/635/
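The directory path above appears to follow directly from the accession: the prefix (`GCF`) comes first, then the nine digits split into groups of three. The sketch below reproduces that pattern in base R. The helper name `assembly_ftp_dir` is hypothetical and this is not necessarily how webseq builds the URL internally; it also assumes a nine-digit accession.

```r
# Hypothetical helper (not part of webseq): rebuild the NCBI FTP directory
# for an assembly accession such as "GCF_003007635.1".
assembly_ftp_dir <- function(accession) {
  # Split "GCF_003007635.1" into the prefix and the numeric part
  parts <- strsplit(accession, "_", fixed = TRUE)[[1]]
  prefix <- parts[1]
  # Drop the version suffix (".1") to keep only the nine digits
  digits <- sub("\\..*$", "", parts[2])
  # Cut the digits into three groups of three characters
  chunks <- substring(digits, c(1, 4, 7), c(3, 6, 9))
  paste0(
    "https://ftp.ncbi.nlm.nih.gov/genomes/all/",
    paste(c(prefix, chunks), collapse = "/"),
    "/"
  )
}

assembly_ftp_dir("GCF_003007635.1")
# "https://ftp.ncbi.nlm.nih.gov/genomes/all/GCF/003/007/635/"
```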

Let’s download GCF_003007635.1_ASM300763v1_genomic.gbff.gz!

ncbi_download_genome("GCF_003007635.1", type = "genomic.gbff")

If we try to download it again, the function will indicate that the file is already downloaded.

ncbi_download_genome("GCF_003007635.1", type = "genomic.gbff")
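The skip-if-present behaviour described above can be illustrated with a few lines of base R. This is only a sketch of the general idea, not webseq's actual implementation; the helper name `download_once` and the placeholder URL are assumptions made for the example.

```r
# Hypothetical helper (not part of webseq): download a file only if it is
# not already present on disk.
download_once <- function(url, destfile) {
  if (file.exists(destfile)) {
    message("File already downloaded: ", destfile)
    return(invisible(FALSE))
  }
  utils::download.file(url, destfile, mode = "wb")
  invisible(TRUE)
}

# Offline demonstration: pretend the file is already on disk, so no
# network request is made. The URL is a placeholder and is never contacted.
destfile <- file.path(tempdir(), "GCF_003007635.1_ASM300763v1_genomic.gbff.gz")
file.create(destfile)
downloaded <- download_once("https://example.org/placeholder", destfile)
downloaded
# FALSE
```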

Download metadata

Let’s download some metadata for this assembly!

assembly_meta <- ncbi_get_meta("GCF_003007635.1", db = "assembly")

Using this metadata we can find the BioSample ID associated with the assembly. Let’s use this ID to get the BioSample UID of the sample within the NCBI BioSample database.

biosample_uid <- ncbi_get_uid(assembly_meta$biosample, db = "biosample")
biosample_uid

Then retrieve the metadata itself.

biosample_meta <- ncbi_get_meta(biosample_uid$uid, db = "biosample")
biosample_meta

Parse metadata files

Accessing metadata for one assembly at a time can take quite a while if you have a large number of queries. If you want metadata for all hits of a search term, you can follow a hybrid approach instead: download the metadata manually from the NCBI website and parse it with webseq.
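To see why per-assembly access adds up, consider a simple loop that issues one request per accession and pauses between requests. The sketch below is an illustration only: `fetch_each` and its `pause` argument are hypothetical, and the commented usage assumes `ncbi_get_meta()` is called with a single accession, as in the example earlier in this document.

```r
# Hypothetical helper (not part of webseq): call `fetch` once per ID,
# pausing between calls, and collect the results in a named list.
fetch_each <- function(ids, fetch, pause = 0.5) {
  results <- setNames(vector("list", length(ids)), ids)
  for (i in seq_along(ids)) {
    # Wrap in list() so a NULL result does not drop the element
    results[i] <- list(fetch(ids[[i]]))
    # Wait between requests, but not after the last one
    if (i < length(ids)) Sys.sleep(pause)
  }
  results
}

# Possible usage with webseq (requires network access, so not run here):
# ids <- c("GCF_003007635.1")  # add further accessions as needed
# metas <- fetch_each(ids, function(id) ncbi_get_meta(id, db = "assembly"))
```

With N accessions the loop makes N separate requests, which is what the hybrid approach avoids.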

Let’s download assembly metadata for all Thiobacillus denitrificans genomes!

NCBI link: https://www.ncbi.nlm.nih.gov/assembly/?term=Thiobacillus+denitrificans

In the upper right corner of the results page, choose: Send to -> File -> Format = XML -> Create File

Download the file and then parse it.

ncbi_parse_assembly_xml("assembly_summary.xml")

Let’s download biosample metadata as well for all Thiobacillus denitrificans samples!

NCBI link: https://www.ncbi.nlm.nih.gov/biosample/?term=Thiobacillus+denitrificans

In the upper right corner of the results page, choose: Send to -> File -> Format = Full (text) -> Create File

Download the file and then parse it.

ncbi_parse_biosample_txt("biosample_summary.txt")