Data for: "Widespread potential for phototrophy and convergent reduction of lifecycle complexity in the dimorphic order Caulobacterales" (Hallgren et al., 2025)

2025-117-1
https://doi.org/10.58141/0bz5-dc62
Swedish National Data Service (Svensk nationell datatjänst)
Hallgren, Joel; Jonas, Kristina
Related publications: 10.1038/s41467-025-65642-x (article), 10.1101/2025.05.26.656076 (preprint)

Keywords: phylogenetic tree; microbiology; phylogenetic relationship; phylogenetic diversity; bacteria; DNA sequence analysis; microbial ecology; phylogenetic analysis; bacterial diversity; bioinformatics; cell biology; Parvularculaceae Garrity et al., 2003; Caulobacterales Henrici & Johnson, 1935; Alphaproteobacteria Garrity et al., 2006; Hyphomonadaceae Lee et al., 2005; Caulobacteraceae Henrici & Johnson, 1935; Caulobacter vibrioides Henrici & Johnson, 1935

Additional data for the article "Widespread potential for phototrophy and convergent reduction of lifecycle complexity in the dimorphic order Caulobacterales" (Hallgren et al., 2025). The data includes the following:
- Sequence alignment files and tree files for all species phylogenies and gene phylogenies of the article.
- Genome annotation data files for all genomes of the "core" and "extended" datasets of the article.
Genome annotations have been generated as outlined in the article using the following tools and methods: DIAMOND blastp, eggNOG-mapper (emapper), GhostKOALA, InterProScan, and the reciprocal best blast hit algorithm (RBH).

Datasets have been compressed into a .zip archive that contains the following folders:
- CORE-dataset_Gene_Annotations/
- CORE-dataset_Gene_Phylogenies/
- CORE-dataset_Species_Phylogenies/
- EXTENDED-dataset_Gene_Annotations/
- EXTENDED-dataset_Species_Phylogenies/

Some of these folders include files that in turn have been packaged into .tar.gz archives, which can be unpacked using the "tar" command in the Linux, macOS, or Windows (Windows 10 or later) command line.

Description of datasets:

1. CORE-dataset_Gene_Annotations/ contains the following data files:
- all_genomes.annotations.tar.gz
- all_genomes.diamond-blastp.tar.gz
- all_genomes.emapper.annotations.tar.gz
- all_genomes.ghostKOALA-KOs.tar.gz
- all_genomes.interproscan.tar.gz
- all_genomes.RBH.tar.gz

These .tar.gz packages contain the genome annotation data files for the genomes of the "core" dataset of the article. Genome annotations have been generated as outlined in the article using the following tools and methods: DIAMOND blastp, eggNOG-mapper (emapper), GhostKOALA, InterProScan, and the reciprocal best blast hit algorithm (RBH). Moreover, all annotations, with the exception of RBH annotations, have been compiled into an overview table (all_genomes.annotations.tar.gz).

2. CORE-dataset_Gene_Phylogenies/ contains the following folders:
- bchY/
- creS_ALL/
- creS_SUBSET/
- pufM/

These folders correspond to the four sets of gene phylogenies inferred for the article. They contain data files including sequence alignments (.align), trimmed alignments (.trim), and tree files (.treefile), generated as outlined in the article. Moreover, the creS_ALL/ folder includes files for protein domain mapping using iTOL (.dataset_protein_domains_template.txt and .interproscan.tsv files).

3. CORE-dataset_Species_Phylogenies/ contains the following folders:
- GToTree/
- Martijn_etal_2018_marker_genes/

These folders correspond to the two species phylogenies inferred for the "core" dataset of the article, as outlined in the article.

For the GToTree phylogeny inferred using the 117 alphaproteobacterial marker genes provided by GToTree, the concatenated sequence file (.fa) and the tree file (.treefile) are included.

For the refined species phylogeny inferred using the marker genes compiled by Martijn et al. (2018; Nature 557:101-105), individual gene trees were first inferred separately and visualized. After manual inspection of the initial gene trees, putative paralogs, contamination, long-branching, horizontal transfers, and duplicate sequences were removed. Annotated phylogenies of these initial gene trees are provided (marker_single_gene_trees_round-one.pdf), with sequences removed after curation highlighted with red branches. A list of sequences removed during the curation, or removed due to poor alignment, is also provided (marker_single_gene_trees.xlsx). After this curation step, gene trees were re-inferred (marker_single_gene_trees_round-two.pdf). The folder marker_single_gene_trees/ contains sequence alignments (.align), trimmed alignments (.trim), tree files (.treefile), and annotated trees (.pdf) for each separate gene tree, before curation (without the "_v2" file name suffix) and after curation (with the "_v2" file name suffix). Lastly, for the final concatenated phylogenies inferred after curation, the concatenated sequence file (.fa), the non-parametric bootstrap tree file (.NPboot.treefile), and the ultrafast bootstrap tree file (.ufboot.treefile) are included.

4. EXTENDED-dataset_Species_Phylogenies/ contains the following folders:
- 16S_23S_rRNA_genes/
- GToTree/

These folders correspond to the two species phylogenies inferred for the "extended" dataset of the article, as outlined in the article.

For the phylogeny inferred for concatenated 16S and 23S rRNA genes from the "core" dataset together with the Acaudatibacter ("Palsa-881") species representatives of the "extended" dataset, as outlined in the article, the sequence alignments (.align), trimmed alignments (.trim), concatenated sequence file (.fa), and the tree file (.treefile) are included.

For the GToTree phylogeny inferred using the 117 alphaproteobacterial marker genes for the Acaudatibacter ("Palsa-881") species of the "extended" dataset together with a selection of Caulobacterales genomes from the "core" dataset, as outlined in the article, the concatenated sequence file (.faa) and the tree file (.treefile) are included.

5. EXTENDED-dataset_Gene_Annotations/ contains the following data files:
- all_proteins.emapper.annotations.tar.gz
- all_proteins.ghostKOALA-KOs.tar.gz
- all_proteins.RBH.tar.gz

These .tar.gz packages contain the genome annotation data files for the genomes of the "extended" dataset of the article. Genome annotations have been generated as outlined in the article using the following tools and methods: eggNOG-mapper (emapper), GhostKOALA, and the reciprocal best blast hit algorithm (RBH).
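As an alternative to running "tar" by hand, unpacking the .zip archive and its nested .tar.gz packages can be scripted. The following is a minimal Python sketch using only the standard library, not a tool shipped with the dataset. The function names are hypothetical, and the second helper assumes that the "_v2" marker sits directly before the file extension (for example a name of the form geneA_v2.treefile), which the description does not state explicitly.

```python
import tarfile
import zipfile
from pathlib import Path


def unpack_dataset(zip_path: Path, dest: Path) -> list[Path]:
    """Extract the .zip archive, then every nested .tar.gz found inside it.

    Returns the list of .tar.gz packages that were unpacked.
    """
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)

    unpacked = []
    for tar_path in sorted(dest.rglob("*.tar.gz")):
        with tarfile.open(tar_path, "r:gz") as tf:
            # Use the safer "data" extraction filter where this Python
            # version provides it; otherwise fall back to the default.
            if hasattr(tarfile, "data_filter"):
                tf.extractall(tar_path.parent, filter="data")
            else:
                tf.extractall(tar_path.parent)
        unpacked.append(tar_path)
    return unpacked


def split_by_curation(folder: Path, suffix: str = ".treefile"):
    """Split files with the given extension into (before, after) curation.

    Assumption: curated files end in "_v2" immediately before the extension.
    """
    before, after = [], []
    for path in sorted(folder.rglob(f"*{suffix}")):
        stem = path.name[: -len(suffix)]
        (after if stem.endswith("_v2") else before).append(path)
    return before, after
```

For example, calling unpack_dataset(Path("dataset.zip"), Path("unpacked")) with the actual name of the downloaded archive in place of the placeholder "dataset.zip", and then split_by_curation on the marker_single_gene_trees/ folder, would give the gene tree files from before and after curation as two separate lists.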
Numeric, Text

Access to data through SND. Data are freely accessible (openAccess).

Joel Hallgren, Jennah E. Dharamshi, Alejandro Rodríguez-Gijón, Julia Nuy, Sarahi L. Garcia, Kristina Jonas. Widespread potential for phototrophy and convergent reduction of lifecycle complexity in the dimorphic order Caulobacterales. 2025. Nature Communications. 16: 11003. 10.1038/s41467-025-65642-x

Joel Hallgren, Jennah E. Dharamshi, Alejandro Rodríguez-Gijón, Julia Nuy, Sarahi L. Garcia, Kristina Jonas. Widespread potential for phototrophy and convergent reduction of lifecycle complexity in the dimorphic order Caulobacterales. 2025. bioRxiv 2025.05.26.656076 [preprint]. 10.1101/2025.05.26.656076