Researchdata.se

COI reference sequences from BOLD DB

https://doi.org/10.17044/SCILIFELAB.20514192

Dataset description

This item contains COI (mitochondrial cytochrome oxidase subunit I) sequences collected from the BOLD database (https://boldsystems.org/). The FASTA file bold_clustered_cleaned.fasta.gz has record IDs that can be queried in the Public Data Portal (https://boldsystems.org/index.php/Public_BINSearch?searchtype=records), and each FASTA header contains the taxonomic ranks plus the BIN ID assigned to the record. The taxonomic information for each record is also given in the tab-separated file bold_info_filtered.tsv.gz. The file bold_clustered.sintax.fasta.gz is directly compatible with the SINTAX algorithm in vsearch, while the files bold_clustered.assignTaxonomy.fasta.gz and bold_clustered.addSpecies.fasta.gz are directly compatible with the assignTaxonomy and addSpecies functions from DADA2, respectively.

The dataset was last created on December 16, 2022.

NOTE: We have noticed that the gzipped files in this upload have been compressed twice for some reason. A quick fix is to unzip any file with a ".gz" extension, rename the unzipped file by adding the ".gz" extension back, and then unzip it once more. Sorry for the inconvenience.

Methods

The code used to generate this dataset consists of a snakemake workflow wrapped into a Python package that can be installed with conda (https://docs.conda.io/en/latest/miniconda.html) (`conda install -c bioconda coidb`). First, sequence and taxonomic information for records in the BOLD database is downloaded from the GBIF Hosted Datasets (https://hosted-datasets.gbif.org/ibol/). This data is then filtered to keep only records annotated as 'COI-5P' and assigned to a BIN ID. The taxonomic information is parsed in order to assign species names and resolve higher-level ranks for each BIN ID. Sequences are processed to remove gap characters and leading and trailing `N`s. After this, any sequences with remaining non-standard characters are removed. Sequences are then clustered at 100% identity using vsearch (https://github.com/torognes/vsearch) (Rognes _et al._ 2016). This clustering is done separately for the sequences assigned to each BIN ID. For more information, see https://github.com/biodiversitydata-se/coidb
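
The sequence cleaning step described in the methods can be illustrated with a short Python sketch. This is not the coidb implementation. It assumes that the gap character is `-`, that "standard" characters are the four bases A, C, G and T, and that sequences left empty after trimming are dropped; the function names are hypothetical.

```python
import re

# Assumption: only unambiguous DNA bases count as "standard" characters.
STANDARD_ONLY = re.compile(r"[ACGT]+")


def clean_sequence(seq):
    """Remove gaps and flanking Ns; return None if the result should be dropped."""
    # Assumption: gaps are written as "-"; upper-casing is a convenience
    # of this sketch and is not mentioned in the dataset description.
    trimmed = seq.upper().replace("-", "").strip("N")
    # fullmatch rejects internal Ns, other ambiguity codes and empty strings.
    if STANDARD_ONLY.fullmatch(trimmed) is None:
        return None
    return trimmed


def filter_records(records):
    """Yield (record_id, cleaned_sequence) for records that survive cleaning."""
    for record_id, seq in records:
        cleaned = clean_sequence(seq)
        if cleaned is not None:
            yield record_id, cleaned
```

With these assumptions, `clean_sequence("NN-ACGT-ACN")` gives `"ACGTAC"`, while a sequence with an internal `N`, such as `"ACNGT"`, is rejected.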



scilifelab
Swedish Museum of Natural History