<codeBook xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xsi:schemaLocation="ddi:codebook:2_5 http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" xmlns="ddi:codebook:2_5">
  <docDscr>
    <citation>
      <titlStmt>
        <titl xml:lang="sv"></titl>
        <parTitl xml:lang="en">nf-core/metatdenovo taxonomy</parTitl>
        <IDNo agency="SND">doi-10-17044-scilifelab-28211678-0</IDNo>
        <IDNo agency="DOI">https://doi.org/10.17044/SCILIFELAB.28211678</IDNo>
      </titlStmt>
      <prodStmt>
        <producer xml:lang="en" abbr="SND">Swedish National Data Service</producer>
        <producer xml:lang="sv" abbr="SND">Svensk nationell datatjänst</producer>
      </prodStmt>
      <holdings URI="https://doi.org/10.17044/SCILIFELAB.28211678">Landing page</holdings>
    </citation>
  </docDscr>
  <stdyDscr>
    <citation>
      <titlStmt>
        <titl xml:lang="sv"></titl>
        <parTitl xml:lang="en">nf-core/metatdenovo taxonomy</parTitl>
        <IDNo agency="SND">doi-10-17044-scilifelab-28211678-0</IDNo>
        <IDNo agency="DOI">https://doi.org/10.17044/SCILIFELAB.28211678</IDNo>
      </titlStmt>
      <rspStmt>
        <AuthEnty xml:lang="en" affiliation="Science for Life Laboratory">Lundin, Daniel</AuthEnty>
      </rspStmt>
      <prodStmt />
      <distStmt>
        <distrbtr xml:lang="en" abbr="SND" URI="https://snd.se">Swedish National Data Service</distrbtr>
        <distrbtr xml:lang="sv" abbr="SND" URI="https://snd.se">Svensk nationell datatjänst</distrbtr>
        <distDate xml:lang="en" date="2025-02-24" />
      </distStmt>
      <verStmt>
        <version elementVersion="0" elementVersionDate="2025-02-24" />
      </verStmt>
      <holdings URI="https://doi.org/10.17044/SCILIFELAB.28211678">Landing page</holdings>
    </citation>
    <stdyInfo>
      <subject />
      <abstract xml:lang="en" contentType="abstract">The data in this repository can be used to assign taxonomy to sequences with Diamond [Buchfink et al. 2015], particularly via the --diamond_dbs parameter in nf-core/metatdenovo (https://nf-co.re/metatdenovo), release 1.1 or later. Currently, the available data represent species-representative genomes from the Genome Taxonomy Database (GTDB), release R09-RS220 [Parks et al. 2018].

File preparation
All species-representative genomes from GTDB were downloaded from the National Center for Biotechnology Information (NCBI) and annotated with Prokka [v. 1.14.6; Seemann 2014]; the sequences of all resulting proteins make up this dataset. The taxonomy dump files (in NCBI taxonomy dump format) were created from the GTDB metadata with TaxonKit [v. 0.18.0; Shen and Ren 2021], and the Diamond database was built with Diamond [v. 2.1.10; Buchfink et al. 2015] in "taxonomy mode", i.e. using the taxonomy dump created with TaxonKit. (See below for the commands used.)

File descriptions
There are five files:

- gtdb-r220.faa.gz: Fasta file with protein sequences. Not used by nf-core/metatdenovo but can be used to create the Diamond database below.
- gtdb-r220.taxonomy.dmnd: Diamond database with taxonomy information.
- gtdb-r220.names.dmp: Taxonomy dump file.
- gtdb-r220.nodes.dmp: Nodes dump file.
- gtdb-r220.seqid2taxid.tsv.gz: Mapping from protein accession to taxon.

The Diamond database and taxonomy dump files can be used with nf-core/metatdenovo (version 1.1 or later) by providing a CSV file like the one below to the --diamond_dbs parameter. (Although Nextflow can use HTTPS URLs as paths, it is usually better to download the very large files and keep local copies.)

db,dmnd_path,taxdump_names,taxdump_nodes,ranks,parse_with_taxdump
gtdb,gtdb_r220_repr.dmnd,gtdb_taxdump/names.dmp,gtdb_taxdump/nodes.dmp,domain;phylum;class;order;genus;species;strain,
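
As an illustration only, a CSV file with the six columns shown above could be generated with a short Python script. This is a minimal sketch: the function name diamond_dbs_csv is hypothetical, and the paths assume that the files listed under "File descriptions" were downloaded to the working directory under their original names.

```python
import csv
import io

# Column names are copied from the header line of the example CSV.
FIELDS = [
    "db",
    "dmnd_path",
    "taxdump_names",
    "taxdump_nodes",
    "ranks",
    "parse_with_taxdump",
]


def diamond_dbs_csv(rows):
    """Return CSV text with a header line and one line per Diamond database."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Assumption: local copies of the files, kept under their original names.
example_row = {
    "db": "gtdb",
    "dmnd_path": "gtdb-r220.taxonomy.dmnd",
    "taxdump_names": "gtdb-r220.names.dmp",
    "taxdump_nodes": "gtdb-r220.nodes.dmp",
    "ranks": "domain;phylum;class;order;genus;species;strain",
    "parse_with_taxdump": "",  # left empty, as in the example row
}

if __name__ == "__main__":
    print(diamond_dbs_csv([example_row]), end="")
```

The printed text can be saved to a file and passed to --diamond_dbs; the ranks value is taken unchanged from the example row.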

Commands used to prepare taxonomy dump files and the Diamond database
- Taxonomy dump: cut -f 1,19-20 *metadata.tsv | grep -v 'accession' | awk 'BEGIN { FS="\t" } { if ( $2 == "t" ) { print $1 "\t" $3 } }' | taxonkit create-taxdump --gtdb -O .
- Diamond database: gunzip -c gtdb-r220.faa.gz | sed '/^&gt;/s/ .*//' | diamond makedb --taxonmap gtdb-r220.seqid2taxid.tsv.gz --taxonnames gtdb-r220.names.dmp --taxonnodes gtdb-r220.nodes.dmp --db gtdb-r220.taxonomy.dmnd --no-parse-seqids
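
For readers less familiar with cut, grep and awk, the row selection in the taxonomy dump command can be sketched in Python. This is an illustrative equivalent, not the code that was used: the function name representative_taxonomy is hypothetical, and the demo table is made-up data with placeholder column names rather than real GTDB metadata.

```python
def representative_taxonomy(lines):
    """Yield (accession, taxonomy) pairs from tab-separated metadata lines.

    Mirrors the shell pipeline: take fields 1, 19 and 20, skip lines that
    contain the word "accession" (the header), and keep rows whose field 19
    equals "t", emitting field 1 and field 20.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        selected = fields[0:1] + fields[18:20]
        if len(selected) != 3:
            continue  # line has fewer than 20 fields
        if "accession" in "\t".join(selected):
            continue  # same effect as grep -v 'accession'
        accession, flag, taxonomy = selected
        if flag == "t":
            yield accession, taxonomy


# Made-up demo input with 20 tab-separated columns (placeholder names).
header = ["accession"] + ["col%d" % i for i in range(2, 21)]
row_a = ["ACC_A"] + ["x"] * 17 + ["t", "d__Bacteria;p__Example"]
row_b = ["ACC_B"] + ["x"] * 17 + ["f", "d__Archaea;p__Example"]
demo_lines = ["\t".join(row) + "\n" for row in (header, row_a, row_b)]

if __name__ == "__main__":
    for accession, taxonomy in representative_taxonomy(demo_lines):
        print(accession + "\t" + taxonomy)
```

In the original command, the two-column output of this step is piped to taxonkit create-taxdump --gtdb, which writes the dump files.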
Revision history
20250211: First version</abstract>
      <sumDscr />
    </stdyInfo>
    <method>
      <dataColl />
    </method>
    <dataAccs>
      <useStmt>
        <restrctn xml:lang="en">Access to data through an external actor. </restrctn>
        <restrctn xml:lang="sv">Åtkomst till data via extern aktör. </restrctn>
      </useStmt>
    </dataAccs>
    <othrStdyMat />
  </stdyDscr>
</codeBook>