COI reference sequences from BOLD DB

John Sundh

doi:10.17044/SCILIFELAB.20514192

Dataset description

This item contains COI (mitochondrial cytochrome oxidase subunit I) sequences collected from the BOLD database. The dataset is based on the BOLD Data Package (https://boldsystems.org/data/data-packages) dated 30 January 2026 and was created on 1 February 2026. The fasta file coidb.clustered.fasta.gz represents a non-redundant set of filtered sequences (clustered at 100% identity, see Methods) with record IDs that can be queried in the Public Data Portal (https://portal.boldsystems.org). Each fasta header also contains the BIN ID assigned to the record (with the exception of prokaryotic records, which instead have process IDs as BIN IDs). The taxonomic information for all filtered records is given in the tab-separated file coidb.info.tsv.gz. Files compatible with specific tools for taxonomic assignments are found under the dada2/, sintax/, and qiime2/ folders.

Methods

This dataset was generated with the coidb (https://github.com/insect-biome-atlas/coidb) package (v0.6.0). Briefly, records from the BOLD Data Package are filtered to:

- keep only records assigned a proper BOLD BIN (e.g. `BOLD:AAA0008`), as well as records assigned to Bacteria or Archaea
- keep only records with marker_code 'COI-5P'
- remove records shorter than 500 bp
- remove records containing non-standard DNA characters

Remaining sequences are then clustered at 100% identity separately for each BOLD BIN using vsearch (Rognes et al. 2016) (records without BOLD BINs that are assigned to Bacteria/Archaea are not clustered). The taxonomic information for records is processed to handle missing data and non-unique parent lineages.

A consensus taxonomy for each BOLD BIN is calculated by taking into account the taxonomic information given for records assigned to each BIN.
This is done in two ways:

- the `inclNA` method calculates a consensus based on all taxonomic labels, even the ones with missing data
- the `exclNA` method excludes taxonomic labels with missing data when calculating the consensus

Because both methods have pros and cons (in short, `exclNA` resolves more species but `inclNA` is more conservative), both versions of the downstream files are available in this item, and it is up to the user to decide which one to use.

Description of files

- coidb.clustered.fasta.gz
  This file contains nucleotide sequences of all filtered records, clustered at 100% identity within each BOLD BIN. The fasta headers have the format `>{processid} bin_uri:{BOLD BIN}`, where '{processid}' corresponds to the record identifier chosen as the cluster centroid and '{BOLD BIN}' shows which BOLD BIN the record belongs to.
- coidb.info.tsv.gz
  This file contains taxonomic information (including BOLD BIN where applicable) as well as nucleotide sequences for all filtered records.
- coidb.stats.exclNA.txt / coidb.stats.inclNA.txt
  These files contain summary statistics such as the total number of records, unique BINs, and clustered sequences. The first seven lines are identical in both files, as they refer to general statistics of the database, while the rest is specific to the method used to calculate the consensus taxonomy (see Methods).
- timestamps.txt
  This file shows the name of the BOLD Data Package and the TSV file extracted and used as input to coidb.
- logs/fix_nonunique.coidb.log
  This logfile shows how taxa with non-unique parent lineages were modified during database creation.
- shasum.txt
  This file contains checksums and can be used to verify file integrity by running `shasum -c shasum.txt`.

Tool-specific files

DADA2

The dada2/ folder contains fasta files that are compatible with the DADA2 assignTaxonomy and addSpecies functions. See more information at https://benjjneb.github.io/dada2/assign.html.
The files with 'toGenus' and 'toSpecies' in their names have taxonomic information down to the genus and species level, respectively. The files with 'addSpecies' in their names contain only the species name and should be used with the addSpecies function.

SINTAX

The sintax/ folder contains fasta files that are compatible with taxonomic assignments using the SINTAX algorithm as implemented in `vsearch`. See more information in the vsearch manual (https://github.com/torognes/vsearch/releases/download/v2.30.4/vsearch_manual.pdf).

QIIME2

The qiime2/ folder contains info files that can be imported with QIIME2. For more information, see the README file at https://github.com/insect-biome-atlas/coidb.
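
Example: record filtering

The filtering criteria listed under Methods can be illustrated with a short Python sketch. This is not the coidb implementation. The field names `nuc` and `kingdom`, the BIN pattern (inferred from the single example `BOLD:AAA0008`), and the reading of "standard DNA characters" as A, C, G and T only are all assumptions made for illustration.

```python
import re

# Assumed BIN format, inferred from the example BOLD:AAA0008.
BIN_PATTERN = re.compile(r"^BOLD:[A-Z]{3}[0-9]{4}$")
# Assumption: only unambiguous bases count as standard DNA characters.
STANDARD_DNA = set("ACGT")
MIN_LENGTH = 500
PROKARYOTES = {"Bacteria", "Archaea"}


def keep_record(record: dict) -> bool:
    """Return True if a record passes all four filters.

    `record` is a hypothetical dict with the keys bin_uri, marker_code,
    nuc (sequence) and kingdom; the real column names may differ.
    """
    seq = (record.get("nuc") or "").upper()
    has_bin = bool(BIN_PATTERN.match(record.get("bin_uri") or ""))
    is_prokaryote = record.get("kingdom") in PROKARYOTES
    return (
        (has_bin or is_prokaryote)                 # proper BIN, or Bacteria/Archaea
        and record.get("marker_code") == "COI-5P"  # marker filter
        and len(seq) >= MIN_LENGTH                 # drop records shorter than 500 bp
        and set(seq) <= STANDARD_DNA               # drop non-standard characters
    )
```

A record without a BIN passes only if it is assigned to Bacteria or Archaea, which mirrors the exception described for prokaryotic records.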
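
Example: inclNA versus exclNA

The difference between the `inclNA` and `exclNA` methods can be shown on a toy example. The exact consensus rule is not described here, so the sketch below uses a deliberately simple stand-in: a label is accepted only when every considered label agrees. The rule used by coidb may differ, and the function name `consensus` is hypothetical.

```python
from typing import List, Optional


def consensus(labels: List[Optional[str]], exclude_na: bool) -> Optional[str]:
    """Toy consensus for one taxonomic rank within one BIN.

    None stands for a missing label. Assumed rule (not taken from coidb):
    return a label only if all considered labels are identical.
    """
    if exclude_na:
        considered = [label for label in labels if label is not None]
    else:
        considered = list(labels)
    unique = set(considered)
    if len(unique) == 1:
        return unique.pop()  # this is None when every label is missing
    return None


species = ["Apis mellifera", "Apis mellifera", None]
resolved = consensus(species, exclude_na=True)     # missing label ignored
unresolved = consensus(species, exclude_na=False)  # missing label blocks agreement
```

Under this assumed rule, ignoring the missing label lets the BIN resolve to a species, while counting it leaves the rank unresolved, which is consistent with `exclNA` resolving more species and `inclNA` being more conservative.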
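
Example: reading fasta headers

The header format `>{processid} bin_uri:{BOLD BIN}` described for coidb.clustered.fasta.gz can be parsed with the Python standard library. The following is a minimal sketch that assumes every header follows that format exactly; the function names `parse_header` and `read_headers` are hypothetical and not part of coidb.

```python
import gzip
from typing import Iterator, Tuple

PREFIX = "bin_uri:"


def parse_header(line: str) -> Tuple[str, str]:
    """Split one fasta header line into (processid, BIN ID)."""
    if not line.startswith(">"):
        raise ValueError(f"not a fasta header: {line!r}")
    processid, _, rest = line[1:].strip().partition(" ")
    if not rest.startswith(PREFIX):
        raise ValueError(f"missing {PREFIX} field: {line!r}")
    return processid, rest[len(PREFIX):]


def read_headers(path: str) -> Iterator[Tuple[str, str]]:
    """Yield (processid, BIN ID) for every record in a gzipped fasta file."""
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith(">"):
                yield parse_header(line)
```

For example, `dict(read_headers("coidb.clustered.fasta.gz"))` would map each centroid record ID to its BIN ID. For prokaryotic records the two values are expected to be the same, since those records carry the process ID in place of a BIN ID.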

Go to data source

https://doi.org/10.17044/SCILIFELAB.20514192