COI reference sequences from BOLD DB
https://doi.org/10.17044/SCILIFELAB.20514192
Dataset description
This item contains COI (mitochondrial cytochrome oxidase subunit I) sequences collected from the BOLD database. The dataset is based on the BOLD Data Package (https://boldsystems.org/data/data-packages) from 30 January 2026 and was created on 1 February 2026.
The fasta file coidb.clustered.fasta.gz represents a non-redundant set of filtered sequences (clustered at 100% identity, see Methods) with record ids that can be queried in the Public Data Portal (https://portal.boldsystems.org). Each fasta header also contains the BIN ID assigned to the record (with the exception of prokaryotic records, which instead have process ids as BIN IDs).
The taxonomic information for all filtered records is given in the tab-separated file coidb.info.tsv.gz.
Files compatible with specific tools for taxonomic assignments are found under the dada2/, sintax/, and qiime2/ folders.
Methods
This dataset was generated with the coidb (https://github.com/insect-biome-atlas/coidb) package (v0.6.0).
Briefly, records from the BOLD Data Package are filtered to
- keep only records assigned a proper BOLD BIN (e.g. `BOLD:AAA0008`), as well as records assigned to Bacteria or Archaea
- keep only records with marker_code 'COI-5P'
- remove records shorter than 500 bp
- remove records containing non-standard DNA characters
Remaining sequences are then clustered at 100% identity separately for each BOLD BIN using vsearch (Rognes et al. 2016) (records without BOLD BINs that are assigned to Bacteria/Archaea are not clustered).
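The four filtering criteria listed above, which are applied before clustering, can be illustrated with a minimal Python sketch. This is not the coidb implementation. The field names `kingdom` and `nuc`, the BIN pattern (generalised from the example `BOLD:AAA0008`), and the reading of "standard DNA characters" as A, C, G and T are all assumptions made for illustration; only `marker_code`, `bin_uri` and the 500 bp threshold come from the description.

```python
import re

# Assumed BIN pattern, generalised from the example `BOLD:AAA0008`.
BIN_PATTERN = re.compile(r"BOLD:[A-Z]{3}[0-9]{4}")
# Assumption: "standard DNA characters" means the four unambiguous bases.
STANDARD_BASES = set("ACGT")
PROKARYOTES = {"Bacteria", "Archaea"}
MIN_LENGTH = 500


def keep_record(record):
    """Return True if a record (a dict) passes all four filters."""
    # Filter 1: a proper BIN, or an assignment to Bacteria/Archaea.
    # The "kingdom" key is a hypothetical field name.
    bin_uri = record.get("bin_uri") or ""
    has_bin = BIN_PATTERN.fullmatch(bin_uri) is not None
    is_prokaryote = record.get("kingdom") in PROKARYOTES
    if not (has_bin or is_prokaryote):
        return False
    # Filter 2: only the COI-5P marker.
    if record.get("marker_code") != "COI-5P":
        return False
    # Filter 3: drop sequences shorter than 500 bp.
    # The "nuc" key is a hypothetical field name for the sequence.
    seq = (record.get("nuc") or "").upper()
    if len(seq) < MIN_LENGTH:
        return False
    # Filter 4: drop sequences with any non-standard character.
    return set(seq) <= STANDARD_BASES
```

A record is kept only if every check passes, so the order of the checks affects speed but not the result.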
The taxonomic information for records is processed to handle missing data and non-unique parent lineages. A consensus taxonomy for each BOLD BIN is calculated by taking into account the taxonomic information given for records assigned to each BIN. This is done in two ways:
- the `inclNA` method calculates a consensus based on all taxonomic labels, even the ones with missing data
- the `exclNA` method excludes taxonomic labels with missing data when calculating the consensus
Because these methods have their pros and cons (in short `exclNA` resolves more species but `inclNA` is more conservative) both versions of downstream files are available in this item and it is up to the user to decide which one to use.
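The difference between the two methods can be illustrated with a toy example. The sketch below assumes that a consensus at a given rank requires all considered labels to agree and that missing data is represented as `None`. Both are simplifying assumptions for illustration; the actual consensus rule used by coidb is not specified here and may differ.

```python
def consensus_label(labels, exclude_missing):
    """Return the label shared by all considered records, else None.

    Assumption: unanimity is required. Missing labels are given as None.
    With exclude_missing=True (the `exclNA` idea) missing labels are
    ignored; with exclude_missing=False (the `inclNA` idea) a missing
    label counts as a disagreement with any named label.
    """
    if exclude_missing:
        labels = [label for label in labels if label is not None]
    distinct = set(labels)
    if len(distinct) == 1:
        return distinct.pop()
    return None


# Hypothetical species labels for three records in one BIN.
species_labels = ["Apis mellifera", "Apis mellifera", None]
excl_result = consensus_label(species_labels, exclude_missing=True)
incl_result = consensus_label(species_labels, exclude_missing=False)
```

Under these assumptions `excl_result` is the species name while `incl_result` is unresolved, which mirrors the trade-off described above: ignoring missing labels resolves more species, while counting them is more conservative.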
Description of files
- coidb.clustered.fasta.gz
This file contains nucleotide sequences of all filtered records, clustered at 100% identity within each BOLD BIN. The fasta headers have the format:
>{processid} bin_uri:{BOLD BIN}
where '{processid}' corresponds to the record identifier chosen as the cluster centroid and '{BOLD BIN}' shows which BOLD BIN the record belongs to.
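As a minimal sketch, a header in this format could be split into its two parts in Python as shown below. The function name `parse_header` and the process id used in the example are hypothetical, and the sketch assumes exactly the `>{processid} bin_uri:{BOLD BIN}` layout described above.

```python
def parse_header(line):
    """Split a fasta header line into (processid, bin_id)."""
    if not line.startswith(">"):
        raise ValueError("not a fasta header: " + line)
    # The process id is everything up to the first space.
    processid, _, rest = line[1:].strip().partition(" ")
    prefix = "bin_uri:"
    if not rest.startswith(prefix):
        raise ValueError("missing bin_uri field: " + line)
    return processid, rest[len(prefix):]


# Hypothetical header; the process id is made up for illustration.
processid, bin_id = parse_header(">ABCDE001-20 bin_uri:BOLD:AAA0008")
```

To read headers directly from the compressed file, the lines can be obtained with `gzip.open("coidb.clustered.fasta.gz", "rt")` from the Python standard library and passed to `parse_header` whenever a line starts with `>`.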
- coidb.info.tsv.gz
This file contains taxonomic information (including BOLD BIN where applicable) as well as nucleotide sequences for all filtered records.
- coidb.stats.exclNA.txt / coidb.stats.inclNA.txt
These files contain summary statistics such as the total number of records, unique BINs and clustered sequences. The first seven lines are identical in both files, as they refer to general statistics of the database, while the remaining lines are specific to the method used to calculate the consensus taxonomy (see Methods).
- timestamps.txt
This file shows the name of the BOLD Data Package and the TSV file extracted and used as input to coidb.
- logs/fix_nonunique.coidb.log
This logfile shows how taxa with non-unique parent lineages were modified during database creation.
- shasum.txt
This file contains checksums and can be used to verify file integrity by running
shasum -c shasum.txt
Tool-specific files
DADA2
The dada2/ folder contains fasta files that are compatible with the DADA2 assignTaxonomy and addSpecies functions. See more information at https://benjjneb.github.io/dada2/assign.html.
The files with 'toGenus' and 'toSpecies' in their names have taxonomic information down to the genus and species level, respectively. The files with 'addSpecies' in their names contain only the species name and should be used with the 'addSpecies' function.
SINTAX
The sintax/ folder contains fasta files that are compatible with taxonomic assignments using the SINTAX algorithm as implemented in `vsearch`. See more information in the vsearch manual (https://github.com/torognes/vsearch/releases/download/v2.30.4/vsearch_manual.pdf).
QIIME2
The qiime2/ folder contains info files that can be imported with QIIME2. For more information, see the README file at https://github.com/insect-biome-atlas/coidb.
Funding
Funding agency:
- Swedish Research Council

Swedish Museum of Natural History