COI reference sequences from BOLD DB
https://doi.org/10.17044/SCILIFELAB.20514192
Dataset description
This item contains COI (mitochondrial cytochrome c oxidase subunit I) sequences collected from the BOLD database. The dataset is based on the BOLD Data Package (https://boldsystems.org/data/data-packages) from 30 January 2026 and was created on 1 February 2026.
The fasta file coidb.clustered.fasta.gz represents a non-redundant set of filtered sequences (clustered at 100% identity, see Methods) with record ids that can be queried in the Public Data Portal (https://portal.boldsystems.org). Each fasta header also contains the BIN ID assigned to the record (with the exception of prokaryotic records, which instead have process ids as BIN IDs).
The taxonomic information for all filtered records is given in the tab-separated file coidb.info.tsv.gz.
Files compatible with specific tools for taxonomic assignments are found under the dada2/, sintax/, and qiime2/ folders.
Methods
This dataset was generated with the coidb (https://github.com/insect-biome-atlas/coidb) package (v0.6.0).
Briefly, records from the BOLD Data Package are filtered to
- keep only records assigned a proper BOLD BIN (e.g. `BOLD:AAA0008`), as well as records assigned to Bacteria or Archaea
- keep only records with marker_code 'COI-5P'
- remove records shorter than 500 bp
- remove records containing non-standard DNA characters
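The four filters above can be sketched as a single predicate. This is a minimal illustration, not the coidb implementation: the field names `kingdom` and `nuc`, the BIN pattern, and the restriction to plain A/C/G/T are assumptions made for the example.

```python
import re

# Assumption: "standard" DNA means plain A/C/G/T only.
STANDARD_DNA = re.compile(r"[ACGT]+")
# Assumption: a proper BIN looks like "BOLD:" + three letters + four digits.
BIN_PATTERN = re.compile(r"BOLD:[A-Z]{3}[0-9]{4}")


def keep_record(record, min_length=500):
    """Return True if a record (a dict) passes all four filters."""
    has_bin = bool(BIN_PATTERN.fullmatch(record.get("bin_uri") or ""))
    # Hypothetical field name for the top-level taxonomic assignment.
    is_prokaryote = record.get("kingdom") in {"Bacteria", "Archaea"}
    if not (has_bin or is_prokaryote):
        return False
    if record.get("marker_code") != "COI-5P":
        return False
    # Hypothetical field name for the nucleotide sequence.
    seq = (record.get("nuc") or "").upper()
    if len(seq) < min_length:
        return False
    return bool(STANDARD_DNA.fullmatch(seq))
```

A record with a valid BIN, marker_code 'COI-5P', and a 500 bp sequence of A/C/G/T passes; the same record with a 496 bp sequence or an `N` in the sequence does not.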
Remaining sequences are then clustered at 100% identity separately for each BOLD BIN using vsearch (Rognes et al. 2016) (records without BOLD BINs that are assigned to Bacteria/Archaea are not clustered).
The taxonomic information for records is processed to handle missing data and non-unique parent lineages. A consensus taxonomy for each BOLD BIN is calculated by taking into account the taxonomic information given for records assigned to each BIN. This is done in two ways:
- the `inclNA` method calculates a consensus based on all taxonomic labels, even the ones with missing data
- the `exclNA` method excludes taxonomic labels with missing data when calculating the consensus
Each method has its pros and cons: in short, `exclNA` resolves more species, while `inclNA` is more conservative. Both versions of the downstream files are therefore available in this item, and it is up to the user to decide which one to use.
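The difference between the two methods can be illustrated with a toy consensus function. The exact rule coidb applies is not described here, so this sketch assumes a simple unanimity rule (a rank is resolved only if all considered labels agree); the rank list and the function name are likewise hypothetical.

```python
RANKS = ("kingdom", "phylum", "class", "order", "family", "genus", "species")


def bin_consensus(records, method="exclNA", ranks=RANKS):
    """Consensus lineage for the records of one BIN (unanimity rule, assumed)."""
    consensus = {}
    resolved = True
    for rank in ranks:
        labels = [rec.get(rank) for rec in records]  # None marks missing data
        if method == "exclNA":
            # Ignore records that have no label at this rank.
            labels = [label for label in labels if label is not None]
        unique = set(labels)
        if resolved and len(unique) == 1 and None not in unique:
            consensus[rank] = unique.pop()
        else:
            # Once a rank is unresolved, all lower ranks stay unresolved.
            resolved = False
            consensus[rank] = None
    return consensus
```

With two records in a BIN, one labelled `Apis mellifera` at species level and one with the species missing, `exclNA` yields a species-level consensus under this rule, whereas `inclNA` stops at genus.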
Description of files
- coidb.clustered.fasta.gz
This file contains nucleotide sequences of all filtered records, clustered at 100% identity within each BOLD BIN. The fasta headers have the format:
>{processid} bin_uri:{BOLD BIN}
where '{processid}' corresponds to the record identifier chosen as the cluster centroid and '{BOLD BIN}' shows which BOLD BIN the record belongs to.
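A header in this format can be split into its two parts with a few lines of Python. The function name and the process id in the usage note are made up for illustration.

```python
def parse_header(line):
    """Split a '>{processid} bin_uri:{BOLD BIN}' header into (processid, bin)."""
    if not line.startswith(">"):
        raise ValueError("not a fasta header: " + line)
    # The process id is everything up to the first space.
    processid, _, rest = line[1:].strip().partition(" ")
    prefix = "bin_uri:"
    if not rest.startswith(prefix):
        raise ValueError("missing bin_uri field: " + line)
    return processid, rest[len(prefix):]
```

For example, `parse_header(">ABCDE123-20 bin_uri:BOLD:AAA0008")` returns `("ABCDE123-20", "BOLD:AAA0008")`.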
- coidb.info.tsv.gz
This file contains taxonomic information (including BOLD BIN where applicable) as well as nucleotide sequences for all filtered records.
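Since the file is a gzipped, tab-separated table, it can be streamed row by row with the Python standard library. This is a generic sketch; the column names are not listed in this description, so the example does not assume any.

```python
import csv
import gzip


def iter_info_rows(path):
    """Yield each row of a gzipped tab-separated file as a dict keyed by column name."""
    with gzip.open(path, "rt", newline="") as handle:
        yield from csv.DictReader(handle, delimiter="\t")
```

Calling `next(iter_info_rows("coidb.info.tsv.gz")).keys()` on the downloaded file shows which columns are available.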
- coidb.stats.exclNA.txt / coidb.stats.inclNA.txt
These files contain summary statistics such as the total number of records, unique BINs, and clustered sequences. The first seven lines are identical in both files, as they refer to general statistics of the database, while the remaining lines are specific to the method used to calculate the consensus taxonomy (see Methods).
- timestamps.txt
This file shows the name of the BOLD Data Package and the TSV file extracted and used as input to coidb.
- logs/fix_nonunique.coidb.log
This logfile shows how taxa with non-unique parent lineages were modified during database creation.
- shasum.txt
This file contains checksums and can be used to verify file integrity by running
shasum -c shasum.txt
Tool-specific files
DADA2
The dada2/ folder contains fasta files that are compatible with the DADA2 assignTaxonomy and addSpecies functions. See more information at https://benjjneb.github.io/dada2/assign.html.
The files with 'toGenus' and 'toSpecies' in their names have taxonomic information down to the genus and species level, respectively. The files with 'addSpecies' in their names contain only the species name and should be used with the 'addSpecies' function.
SINTAX
The sintax/ folder contains fasta files that are compatible with taxonomic assignments using the SINTAX algorithm as implemented in `vsearch`. See more information in the vsearch manual (https://github.com/torognes/vsearch/releases/download/v2.30.4/vsearch_manual.pdf).
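For orientation, SINTAX reference headers carry the lineage in a `;tax=` annotation with one-letter rank prefixes. The sketch below builds a header of that general shape; which ranks appear in the files of this item is not stated here, so the rank set, the function name, and the example id are assumptions.

```python
# One-letter rank prefixes used in SINTAX "tax=" annotations.
SINTAX_PREFIX = {
    "domain": "d", "kingdom": "k", "phylum": "p", "class": "c",
    "order": "o", "family": "f", "genus": "g", "species": "s",
}


def sintax_header(seq_id, lineage):
    """Build a '>id;tax=k:...,g:...' style header from a rank -> name dict."""
    parts = []
    for rank, prefix in SINTAX_PREFIX.items():  # fixed top-down rank order
        name = lineage.get(rank)
        if name:
            # Replace spaces so each name is a single token.
            parts.append(prefix + ":" + name.replace(" ", "_"))
    return ">" + seq_id + ";tax=" + ",".join(parts)
```

For example, `sintax_header("X1", {"kingdom": "Animalia", "genus": "Apis", "species": "Apis mellifera"})` gives `>X1;tax=k:Animalia,g:Apis,s:Apis_mellifera`.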
QIIME2
The qiime2/ folder contains info files that can be imported with QIIME2. For more information, see the README file at https://github.com/insect-biome-atlas/coidb.
Funding
Funder:
- Swedish Research Council
Subject area and keywords
Keywords:
- Ecology not elsewhere classified
- Bioinformatics and computational biology not elsewhere classified
- Computational ecology and phylogenetics
Metadata

Naturhistoriska riksmuseet