<codeBook xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xsi:schemaLocation="ddi:codebook:2_5 http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" xmlns="ddi:codebook:2_5">
  <docDscr>
    <citation>
      <titlStmt>
        <titl xml:lang="sv">Data för: "Widespread potential for phototrophy and convergent reduction of lifecycle complexity in the dimorphic order Caulobacterales" (Hallgren et al., 2025)</titl>
        <parTitl xml:lang="en">Data for: "Widespread potential for phototrophy and convergent reduction of lifecycle complexity in the dimorphic order Caulobacterales" (Hallgren et al., 2025)</parTitl>
        <IDNo agency="SND">2025-117-1</IDNo>
        <IDNo agency="DOI">https://doi.org/10.58141/0bz5-dc62</IDNo>
      </titlStmt>
      <prodStmt>
        <producer xml:lang="en" abbr="SND">Swedish National Data Service</producer>
        <producer xml:lang="sv" abbr="SND">Svensk nationell datatjänst</producer>
      </prodStmt>
      <holdings URI="https://doi.org/10.58141/0bz5-dc62">Landing page</holdings>
    </citation>
  </docDscr>
  <stdyDscr>
    <citation>
      <titlStmt>
        <titl xml:lang="sv">Data för: "Widespread potential for phototrophy and convergent reduction of lifecycle complexity in the dimorphic order Caulobacterales" (Hallgren et al., 2025)</titl>
        <parTitl xml:lang="en">Data for: "Widespread potential for phototrophy and convergent reduction of lifecycle complexity in the dimorphic order Caulobacterales" (Hallgren et al., 2025)</parTitl>
        <IDNo agency="SND">2025-117-1</IDNo>
        <IDNo agency="DOI">https://doi.org/10.58141/0bz5-dc62</IDNo>
        <IDNo agency="DOI">10.1038/s41467-025-65642-x</IDNo>
        <IDNo agency="DOI">10.1101/2025.05.26.656076</IDNo>
      </titlStmt>
      <rspStmt>
        <AuthEnty xml:lang="en" affiliation="Department of Molecular Biosciences, The Wenner-Gren Institute, Stockholm University">Hallgren, Joel</AuthEnty>
        <AuthEnty xml:lang="sv" affiliation="Institutionen för molekylär biovetenskap, Wenner-Grens institut, Stockholms universitet">Hallgren, Joel</AuthEnty>
        <AuthEnty xml:lang="en" affiliation="Department of Molecular Biosciences, The Wenner-Gren Institute, Stockholm University">Jonas, Kristina</AuthEnty>
        <AuthEnty xml:lang="sv" affiliation="Institutionen för molekylär biovetenskap, Wenner-Grens institut, Stockholms universitet">Jonas, Kristina</AuthEnty>
      </rspStmt>
      <prodStmt />
      <distStmt>
        <distrbtr xml:lang="en" abbr="SND" URI="https://snd.se">Swedish National Data Service</distrbtr>
        <distrbtr xml:lang="sv" abbr="SND" URI="https://snd.se">Svensk nationell datatjänst</distrbtr>
        <distDate xml:lang="en" date="2025-06-19" />
      </distStmt>
      <verStmt>
        <version elementVersion="1" elementVersionDate="2025-06-19" />
      </verStmt>
      <holdings URI="https://doi.org/10.58141/0bz5-dc62">Landing page</holdings>
    </citation>
    <stdyInfo>
      <subject>
        <keyword xml:lang="en" vocab="EnvThes" vocabURI="http://vocabs.lter-europe.net/EnvThes/21655">phylogenetic tree</keyword>
        <keyword xml:lang="en" vocab="EnvThes" vocabURI="http://vocabs.lter-europe.net/EnvThes/21641">microbiology</keyword>
        <keyword xml:lang="en" vocab="EnvThes" vocabURI="http://vocabs.lter-europe.net/EnvThes/21645">phylogenetic relationship</keyword>
        <keyword xml:lang="en" vocab="EnvThes" vocabURI="http://vocabs.lter-europe.net/EnvThes/21677">phylogenetic diversity</keyword>
        <keyword xml:lang="en" vocab="EnvThes" vocabURI="http://vocabs.lter-europe.net/EnvThes/20662">bacteria</keyword>
        <keyword xml:lang="en" vocab="EnvThes" vocabURI="http://vocabs.lter-europe.net/EnvThes/20201">DNA sequence analysis</keyword>
        <keyword xml:lang="en" vocab="EnvThes" vocabURI="http://vocabs.lter-europe.net/EnvThes/77">microbial ecology</keyword>
        <keyword xml:lang="en" vocab="EnvThes" vocabURI="http://vocabs.lter-europe.net/EnvThes/21638">phylogenetic analysis</keyword>
        <keyword xml:lang="en" vocab="EnvThes" vocabURI="http://vocabs.lter-europe.net/EnvThes/22062">bacterial diversity</keyword>
        <keyword xml:lang="en" vocab="YSO" vocabURI="http://www.yso.fi/onto/yso/p15748">bioinformatics</keyword>
        <keyword xml:lang="sv" vocab="YSO" vocabURI="http://www.yso.fi/onto/yso/p15748">bioinformatik</keyword>
        <keyword xml:lang="en" vocab="YSO" vocabURI="http://www.yso.fi/onto/yso/p18492">cell biology</keyword>
        <keyword xml:lang="sv" vocab="YSO" vocabURI="http://www.yso.fi/onto/yso/p18492">cellbiologi</keyword>
      </subject>
      <abstract xml:lang="en" contentType="abstract">Additional data for the article "Widespread potential for phototrophy and convergent reduction of lifecycle complexity in the dimorphic order Caulobacterales" (Hallgren et al., 2025).

The data includes the following:
- Sequence alignment files and tree files for all species phylogenies and gene phylogenies of the article.
- Genome annotation data files for all genomes of the "core" and "extended" datasets of the article. Genome annotations have been generated as outlined in the article using the following tools and methods: DIAMOND blastp, eggNOG-mapper (emapper), GhostKOALA, InterProScan, and the reciprocal best blast hit algorithm (RBH).

Datasets have been compressed into a .zip archive that contains the following folders:

CORE-dataset_Gene_Annotations/
CORE-dataset_Gene_Phylogenies/
CORE-dataset_Species_Phylogenies/
EXTENDED-dataset_Gene_Annotations/
EXTENDED-dataset_Species_Phylogenies/

Some of these folders include files that in turn have been packaged into .tar.gz archives, which can be unpacked using the "tar" command in the Linux, macOS, or Windows (Windows 10 or later) command line.


Description of datasets:

1. CORE-dataset_Gene_Annotations/ contains the following data files:

all_genomes.annotations.tar.gz
all_genomes.diamond-blastp.tar.gz
all_genomes.emapper.annotations.tar.gz
all_genomes.ghostKOALA-KOs.tar.gz
all_genomes.interproscan.tar.gz
all_genomes.RBH.tar.gz

These .tar.gz packages contain the genome annotation data files for the genomes of the "core" dataset of the article. Genome annotations have been generated as outlined in the article using the following tools and methods: DIAMOND blastp, eggNOG-mapper (emapper), GhostKOALA, InterProScan, and the reciprocal best blast hit algorithm (RBH). Moreover, all annotations, with the exception of RBH annotations, have been compiled into an overview table (all_genomes.annotations.tar.gz).


2. CORE-dataset_Gene_Phylogenies/ contains the following folders:

bchY/
creS_ALL/
creS_SUBSET/
pufM/

These folders correspond to each of the four sets of gene phylogenies inferred for the article. They contain data files including sequence alignments (.align), trimmed alignments (.trim), and tree files (.treefile), generated as outlined in the article. Moreover, the creS_ALL/ folder includes files for protein domain mapping using iTOL (.dataset_protein_domains_template.txt and .interproscan.tsv files).


3. CORE-dataset_Species_Phylogenies/ contains the following folders:

GToTree/
Martijn_etal_2018_marker_genes/

These folders correspond to the two species phylogenies inferred for the "core" dataset, as outlined in the article.

For the GToTree phylogeny inferred using the 117 alphaproteobacterial marker genes provided by GToTree, the concatenated sequence file (.fa) and the tree file (.treefile) are included.

For the refined species phylogeny inferred using the marker genes compiled by Martijn et al. (2018; Nature 557:101-105), individual gene trees were first inferred separately and visualized. After manual inspection of these initial gene trees, sequences representing putative paralogs, contamination, long branches, or horizontal transfers, as well as duplicate sequences, were removed. Annotated phylogenies of the initial gene trees are provided (marker_single_gene_trees_round-one.pdf), with the sequences removed during curation highlighted by red branches. A list of sequences removed during curation, or removed due to poor alignment, is also provided (marker_single_gene_trees.xlsx). After this curation step, gene trees were re-inferred (marker_single_gene_trees_round-two.pdf). The folder marker_single_gene_trees/ contains sequence alignments (.align), trimmed alignments (.trim), tree files (.treefile), and annotated trees (.pdf) for each separate gene tree, before curation (without the "_v2" file name suffix) and after curation (with the "_v2" file name suffix). Lastly, for the final concatenated phylogenies inferred after curation, the concatenated sequence file (.fa), the non-parametric bootstrap tree file (.NPboot.treefile), and the ultrafast bootstrap tree file (.ufboot.treefile) are included.


4. EXTENDED-dataset_Species_Phylogenies/ contains the following folders:

16S_23S_rRNA_genes/
GToTree/

These folders correspond to the two species phylogenies inferred for the "extended" dataset, as outlined in the article.

For the phylogeny inferred from concatenated 16S and 23S rRNA genes of the "core" dataset together with the Acaudatibacter ("Palsa-881") species representatives of the "extended" dataset, as outlined in the article, the sequence alignments (.align), trimmed alignments (.trim), concatenated sequence file (.fa), and the tree file (.treefile) are included.

For the GToTree phylogeny inferred using the 117 alphaproteobacterial marker genes for the Acaudatibacter ("Palsa-881") species of the "extended" dataset together with a selection of Caulobacterales genomes from the "core" dataset, as outlined in the article, the concatenated sequence file (.faa) and the tree file (.treefile) are included.


5. EXTENDED-dataset_Gene_Annotations/ contains the following data files:

all_proteins.emapper.annotations.tar.gz
all_proteins.ghostKOALA-KOs.tar.gz
all_proteins.RBH.tar.gz

These .tar.gz packages contain the genome annotation data files for the genomes of the "extended" dataset of the article. Genome annotations have been generated as outlined in the article using the following tools and methods: eggNOG-mapper (emapper), GhostKOALA, and the reciprocal best blast hit algorithm (RBH).</abstract>
      <abstract xml:lang="sv" contentType="abstract">Additional data for the article "Widespread potential for phototrophy and convergent reduction of lifecycle complexity in the dimorphic order Caulobacterales" (Hallgren et al., 2025).

The data includes the following:
- Sequence alignment files and tree files for all species phylogenies and gene phylogenies of the article.
- Genome annotation data files for all genomes of the "core" and "extended" datasets of the article. Genome annotations have been generated as outlined in the article using the following tools and methods: DIAMOND blastp, eggNOG-mapper (emapper), GhostKOALA, InterProScan, and the reciprocal best blast hit algorithm (RBH).

Datasets have been compressed into a .zip archive that contains the following folders:

CORE-dataset_Gene_Annotations/
CORE-dataset_Gene_Phylogenies/
CORE-dataset_Species_Phylogenies/
EXTENDED-dataset_Gene_Annotations/
EXTENDED-dataset_Species_Phylogenies/

Some of these folders include files that in turn have been packaged into .tar.gz archives, which can be unpacked using the "tar" command in the Linux, macOS, or Windows (Windows 10 or later) command line.


Description of datasets:

1. CORE-dataset_Gene_Annotations/ contains the following data files:

all_genomes.annotations.tar.gz
all_genomes.diamond-blastp.tar.gz
all_genomes.emapper.annotations.tar.gz
all_genomes.ghostKOALA-KOs.tar.gz
all_genomes.interproscan.tar.gz
all_genomes.RBH.tar.gz

These .tar.gz packages contain the genome annotation data files for the genomes of the "core" dataset of the article. Genome annotations have been generated as outlined in the article using the following tools and methods: DIAMOND blastp, eggNOG-mapper (emapper), GhostKOALA, InterProScan, and the reciprocal best blast hit algorithm (RBH). Moreover, all annotations, with the exception of RBH annotations, have been compiled into an overview table (all_genomes.annotations.tar.gz).


2. CORE-dataset_Gene_Phylogenies/ contains the following folders:

bchY/
creS_ALL/
creS_SUBSET/
pufM/

These folders correspond to each of the four sets of gene phylogenies inferred for the article. They contain data files including sequence alignments (.align), trimmed alignments (.trim), and tree files (.treefile), generated as outlined in the article. Moreover, the creS_ALL/ folder includes files for protein domain mapping using iTOL (.dataset_protein_domains_template.txt and .interproscan.tsv files).


3. CORE-dataset_Species_Phylogenies/ contains the following folders:

GToTree/
Martijn_etal_2018_marker_genes/

These folders correspond to the two species phylogenies inferred for the "core" dataset, as outlined in the article.

For the GToTree phylogeny inferred using the 117 alphaproteobacterial marker genes provided by GToTree, the concatenated sequence file (.fa) and the tree file (.treefile) are included.

For the refined species phylogeny inferred using the marker genes compiled by Martijn et al. (2018; Nature 557:101-105), individual gene trees were first inferred separately and visualized. After manual inspection of these initial gene trees, sequences representing putative paralogs, contamination, long branches, or horizontal transfers, as well as duplicate sequences, were removed. Annotated phylogenies of the initial gene trees are provided (marker_single_gene_trees_round-one.pdf), with the sequences removed during curation highlighted by red branches. A list of sequences removed during curation, or removed due to poor alignment, is also provided (marker_single_gene_trees.xlsx). After this curation step, gene trees were re-inferred (marker_single_gene_trees_round-two.pdf). The folder marker_single_gene_trees/ contains sequence alignments (.align), trimmed alignments (.trim), tree files (.treefile), and annotated trees (.pdf) for each separate gene tree, before curation (without the "_v2" file name suffix) and after curation (with the "_v2" file name suffix). Lastly, for the final concatenated phylogenies inferred after curation, the concatenated sequence file (.fa), the non-parametric bootstrap tree file (.NPboot.treefile), and the ultrafast bootstrap tree file (.ufboot.treefile) are included.


4. EXTENDED-dataset_Species_Phylogenies/ contains the following folders:

16S_23S_rRNA_genes/
GToTree/

These folders correspond to the two species phylogenies inferred for the "extended" dataset, as outlined in the article.

For the phylogeny inferred from concatenated 16S and 23S rRNA genes of the "core" dataset together with the Acaudatibacter ("Palsa-881") species representatives of the "extended" dataset, as outlined in the article, the sequence alignments (.align), trimmed alignments (.trim), concatenated sequence file (.fa), and the tree file (.treefile) are included.

For the GToTree phylogeny inferred using the 117 alphaproteobacterial marker genes for the Acaudatibacter ("Palsa-881") species of the "extended" dataset together with a selection of Caulobacterales genomes from the "core" dataset, as outlined in the article, the concatenated sequence file (.faa) and the tree file (.treefile) are included.


5. EXTENDED-dataset_Gene_Annotations/ contains the following data files:

all_proteins.emapper.annotations.tar.gz
all_proteins.ghostKOALA-KOs.tar.gz
all_proteins.RBH.tar.gz

These .tar.gz packages contain the genome annotation data files for the genomes of the "extended" dataset of the article. Genome annotations have been generated as outlined in the article using the following tools and methods: eggNOG-mapper (emapper), GhostKOALA, and the reciprocal best blast hit algorithm (RBH).</abstract>
      <sumDscr>
        <dataKind xml:lang="en">Numeric</dataKind>
        <dataKind xml:lang="en">Text</dataKind>
      </sumDscr>
    </stdyInfo>
    <method>
      <dataColl />
    </method>
    <dataAccs>
      <useStmt>
        <restrctn xml:lang="en">Access to data through SND. Data are freely accessible.</restrctn>
        <restrctn xml:lang="sv">Åtkomst till data via SND. Data är fritt tillgängliga.</restrctn>
        <conditions elementVersion="info:eu-repo-Access-Terms vocabulary">openAccess</conditions>
      </useStmt>
    </dataAccs>
    <othrStdyMat>
      <relPubl>
        <citation>
          <titlStmt>
            <titl xml:lang="sv">Joel Hallgren, Jennah E. Dharamshi, Alejandro Rodríguez-Gijón, Julia Nuy, Sarahi L. Garcia, Kristina Jonas. Widespread potential for phototrophy and convergent reduction of lifecycle complexity in the dimorphic order Caulobacterales. 2025. Nature Communications. 16: 11003.</titl>
            <parTitl xml:lang="en">Joel Hallgren, Jennah E. Dharamshi, Alejandro Rodríguez-Gijón, Julia Nuy, Sarahi L. Garcia, Kristina Jonas. Widespread potential for phototrophy and convergent reduction of lifecycle complexity in the dimorphic order Caulobacterales. 2025. Nature Communications. 16: 11003.</parTitl>
            <IDNo agency="DOI">10.1038/s41467-025-65642-x</IDNo>
          </titlStmt>
          <distStmt>
            <distDate date="2025">2025</distDate>
          </distStmt>
        </citation>
      </relPubl>
      <relPubl>
        <citation>
          <titlStmt>
            <titl xml:lang="sv">Joel Hallgren, Jennah E. Dharamshi, Alejandro Rodríguez-Gijón, Julia Nuy, Sarahi L. Garcia, Kristina Jonas. Widespread potential for phototrophy and convergent reduction of lifecycle complexity in the dimorphic order Caulobacterales. bioRxiv 2025.05.26.656076 [preprint]</titl>
            <parTitl xml:lang="en">Joel Hallgren, Jennah E. Dharamshi, Alejandro Rodríguez-Gijón, Julia Nuy, Sarahi L. Garcia, Kristina Jonas. Widespread potential for phototrophy and convergent reduction of lifecycle complexity in the dimorphic order Caulobacterales. bioRxiv 2025.05.26.656076 [preprint]</parTitl>
            <IDNo agency="DOI">10.1101/2025.05.26.656076</IDNo>
          </titlStmt>
          <distStmt>
            <distDate date="2025">2025</distDate>
          </distStmt>
        </citation>
      </relPubl>
    </othrStdyMat>
  </stdyDscr>
</codeBook>