<codeBook xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xsi:schemaLocation="ddi:codebook:2_5 http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" xmlns="ddi:codebook:2_5">
  <docDscr>
    <citation>
      <titlStmt>
        <titl xml:lang="sv"></titl>
        <parTitl xml:lang="en">COI reference sequences from BOLD DB</parTitl>
        <IDNo agency="SND">doi-10-17044-scilifelab-20514192-0</IDNo>
        <IDNo agency="DOI">https://doi.org/10.17044/SCILIFELAB.20514192</IDNo>
      </titlStmt>
      <prodStmt>
        <producer xml:lang="en" abbr="SND">Swedish National Data Service</producer>
        <producer xml:lang="sv" abbr="SND">Svensk nationell datatjänst</producer>
      </prodStmt>
      <holdings URI="https://doi.org/10.17044/SCILIFELAB.20514192">Landing page</holdings>
    </citation>
  </docDscr>
  <stdyDscr>
    <citation>
      <titlStmt>
        <titl xml:lang="sv"></titl>
        <parTitl xml:lang="en">COI reference sequences from BOLD DB</parTitl>
        <IDNo agency="SND">doi-10-17044-scilifelab-20514192-0</IDNo>
        <IDNo agency="DOI">https://doi.org/10.17044/SCILIFELAB.20514192</IDNo>
      </titlStmt>
      <rspStmt>
        <AuthEnty xml:lang="en" affiliation="Science for Life Laboratory">Sundh, John</AuthEnty>
      </rspStmt>
      <prodStmt />
      <distStmt>
        <distrbtr xml:lang="en" abbr="SND" URI="https://snd.se">Swedish National Data Service</distrbtr>
        <distrbtr xml:lang="sv" abbr="SND" URI="https://snd.se">Svensk nationell datatjänst</distrbtr>
        <distDate xml:lang="en" date="2026-02-06" />
      </distStmt>
      <verStmt>
        <version elementVersion="0" elementVersionDate="2026-02-06" />
      </verStmt>
      <holdings URI="https://doi.org/10.17044/SCILIFELAB.20514192">Landing page</holdings>
    </citation>
    <stdyInfo>
      <subject />
      <abstract xml:lang="en" contentType="abstract">Dataset description

This item contains COI (mitochondrial cytochrome oxidase subunit I) sequences collected from the BOLD database. The dataset is based on the BOLD Data Package (https://boldsystems.org/data/data-packages/) from 30 January 2026 and was created on 1 February 2026.

The fasta file coidb.clustered.fasta.gz contains a non-redundant set of filtered sequences (clustered at 100% identity, see Methods) with record ids that can be queried in the Public Data Portal (https://portal.boldsystems.org/). Each fasta header also contains the BIN ID assigned to the record, except for prokaryotic records, which instead have process ids as BIN IDs.

The taxonomic information for all filtered records is given in the tab-separated file coidb.info.tsv.gz.
Files compatible with specific tools for taxonomic assignments are found under the dada2/, sintax/, and qiime2/ folders.

Methods

This dataset was generated with the coidb package (v0.6.0, https://github.com/insect-biome-atlas/coidb).

Briefly, records from the BOLD Data Package are filtered to:

- keep only records assigned a proper BOLD BIN (e.g. `BOLD:AAA0008`), as well as records assigned to Bacteria or Archaea
- keep only records with marker_code 'COI-5P'
- remove records shorter than 500 bp
- remove records containing non-standard DNA characters

The remaining sequences are then clustered at 100% identity separately for each BOLD BIN using vsearch (Rognes et al. 2016). Records without BOLD BINs that are assigned to Bacteria/Archaea are not clustered.
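
The four filtering steps above can be sketched in Python as follows. This is a minimal illustration, not the coidb implementation: the dictionary keys (`bin_uri`, `kingdom`, `marker_code`, `nuc`), the BIN pattern (three letters followed by four digits, modelled on the `BOLD:AAA0008` example), and the reading of "non-standard DNA characters" as anything other than A, C, G or T are all assumptions.

```python
import re

# Assumed BIN format, modelled on the example BOLD:AAA0008.
BIN_PATTERN = re.compile(r"BOLD:[A-Z]{3}[0-9]{4}")
# Assumed definition of "standard" DNA characters.
STANDARD_DNA = re.compile(r"[ACGT]+")
PROKARYOTES = {"Bacteria", "Archaea"}
MIN_LENGTH = 500


def keep_record(record):
    """Return True if a record (a dict with hypothetical keys) passes all filters."""
    bin_uri = record.get("bin_uri") or ""
    has_bin = BIN_PATTERN.fullmatch(bin_uri) is not None
    is_prokaryote = record.get("kingdom") in PROKARYOTES
    # Keep records with a proper BIN, or records assigned to Bacteria/Archaea.
    if not (has_bin or is_prokaryote):
        return False
    # Keep only the COI-5P marker.
    if record.get("marker_code") != "COI-5P":
        return False
    seq = (record.get("nuc") or "").upper()
    # Remove records shorter than 500 bp.
    long_enough = len(seq) >= MIN_LENGTH
    # Remove records with characters outside the assumed alphabet.
    clean = STANDARD_DNA.fullmatch(seq) is not None
    return long_enough and clean
```

A record would be kept only if `keep_record(record)` returns True. The clustering step itself is done with vsearch and is not reproduced here.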

The taxonomic information for records is processed to handle missing data and non-unique parent lineages. A consensus taxonomy for each BOLD BIN is calculated by taking into account the taxonomic information given for records assigned to each BIN. This is done in two ways:

- the `inclNA` method calculates a consensus based on all taxonomic labels, even the ones with missing data
- the `exclNA` method excludes taxonomic labels with missing data when calculating the consensus

Each method has trade-offs: in short, `exclNA` resolves more species, while `inclNA` is more conservative. Both versions of the downstream files are therefore available in this item, and it is up to the user to decide which one to use.
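
The difference between the two methods can be illustrated with a small Python sketch for a single taxonomic rank within one BIN. This is not the coidb algorithm: the majority-vote rule and the `threshold` value are hypothetical and serve only to show how counting or dropping missing labels changes the outcome.

```python
from collections import Counter


def consensus_label(labels, method="inclNA", threshold=0.8):
    """Return the consensus label for one rank within one BIN, or None.

    labels: one entry per record, with None marking missing data.
    threshold: hypothetical agreement cutoff, not taken from coidb.
    """
    if method == "exclNA":
        # Missing labels do not take part in the vote.
        labels = [label for label in labels if label is not None]
    if not labels:
        return None
    # With inclNA, None is counted like any other label.
    label, count = Counter(labels).most_common(1)[0]
    if count / len(labels) >= threshold:
        return label
    return None
```

Under these assumptions, a BIN with three records labelled with the same species and two records without a species label gets that species with `exclNA` (3 of 3 votes) but stays unresolved with `inclNA` (3 of 5 votes).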

Description of files

- coidb.clustered.fasta.gz
This file contains nucleotide sequences of all filtered records, clustered at 100% identity within each BOLD BIN. The fasta headers have the format:

&gt;{processid} bin_uri:{BOLD BIN}

where '{processid}' corresponds to the record identifier chosen as the cluster centroid and '{BOLD BIN}' shows which BOLD BIN the record belongs to.
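
A header in this format can be split into its two parts with a few lines of Python. The function name `parse_header` and the process id in the usage note are made up for illustration.

```python
def parse_header(line):
    """Split a fasta header of the form described above into (processid, bin)."""
    if not line.startswith(">"):
        raise ValueError("not a fasta header: " + line)
    # The process id is everything up to the first space.
    processid, _, rest = line[1:].strip().partition(" ")
    prefix = "bin_uri:"
    if not rest.startswith(prefix):
        raise ValueError("missing bin_uri field: " + line)
    # Whatever follows the prefix is the BIN (or process id for prokaryotes).
    return processid, rest[len(prefix):]
```

For a header with the hypothetical process id ABCDE001-20 and the BIN BOLD:AAA0008, the function returns the tuple ("ABCDE001-20", "BOLD:AAA0008").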

- coidb.info.tsv.gz
This file contains taxonomic information (including BOLD BIN where applicable) as well as nucleotide sequences for all filtered records.

- coidb.stats.exclNA.txt / coidb.stats.inclNA.txt
These files contain summary statistics such as the number of total records, unique BINs and clustered sequences. The first seven lines are identical in both files because they report general statistics of the database, while the remaining lines are specific to the method used to calculate the consensus taxonomy (see Methods).

- timestamps.txt
This file shows the name of the BOLD Data Package and the TSV file extracted and used as input to coidb.

- logs/fix_nonunique.coidb.log
This logfile shows how taxa with non-unique parent lineages were modified during database creation.

- shasum.txt
This file contains checksums and can be used to verify file integrity by running

shasum -c shasum.txt

Tool-specific files

DADA2

The dada2/ folder contains fasta files that are compatible with the DADA2 assignTaxonomy and addSpecies functions. See more information at https://benjjneb.github.io/dada2/assign.html. 

The files with 'toGenus' and 'toSpecies' in their names have taxonomic information down to the genus and species level, respectively. The files with 'addSpecies' in their names contain only the species name and should be used with the 'addSpecies' function.

SINTAX

The sintax/ folder contains fasta files that are compatible with taxonomic assignments using the SINTAX algorithm as implemented in `vsearch`. See more information in the vsearch manual (https://github.com/torognes/vsearch/releases/download/v2.30.4/vsearch_manual.pdf).

QIIME2

The qiime2/ folder contains info files that can be imported with QIIME2. For more information, see the README file at https://github.com/insect-biome-atlas/coidb.</abstract>
      <sumDscr />
    </stdyInfo>
    <method>
      <dataColl />
    </method>
    <dataAccs>
      <useStmt>
        <restrctn xml:lang="en">Access to data through an external actor. </restrctn>
        <restrctn xml:lang="sv">Åtkomst till data via extern aktör. </restrctn>
      </useStmt>
    </dataAccs>
    <othrStdyMat />
  </stdyDscr>
</codeBook>