Data for: "Widespread potential for phototrophy and convergent reduction of lifecycle complexity in the dimorphic order Caulobacterales" (Hallgren et al., 2025)
https://doi.org/10.58141/0bz5-dc62
Additional data for the article "Widespread potential for phototrophy and convergent reduction of lifecycle complexity in the dimorphic order Caulobacterales" (Hallgren et al., 2025).
The data includes the following:
- Sequence alignment files and tree files for all species phylogenies and gene phylogenies of the article.
- Genome annotation data files for all genomes of the "core" and "extended" datasets of the article. Genome annotations have been generated as outlined in the article using the following tools and methods: DIAMOND blastp, eggNOG-mapper (emapper), GhostKOALA, InterProScan, and the reciprocal best blast hit algorithm (RBH).
Datasets have been compressed into a .zip archive that contains the following folders:
CORE-dataset_Gene_Annotations/
CORE-dataset_Gene_Phylogenies/
CORE-dataset_Species_Phylogenies/
EXTENDED-dataset_Gene_Annotations/
EXTENDED-dataset_Species_Phylogenies/
Some of these folders include files that in turn have been packaged into .tar.gz archives, which can be unpacked using the "tar" command in the Linux, macOS, or Windows (Windows 10 or later) command line.
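For readers who prefer a scripted alternative to the "tar" command, the unpacking can be sketched in Python using only the standard library. This is a minimal sketch, not part of the deposit: the function name unpack_dataset and the choice to unpack each .tar.gz into a sibling folder named after the archive are assumptions made for illustration.

```python
import tarfile
import zipfile
from pathlib import Path


def unpack_dataset(zip_path, out_dir):
    """Extract the .zip archive, then every nested .tar.gz, under out_dir."""
    out_dir = Path(out_dir)
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(out_dir)

    unpacked = []
    # sorted() materializes the list before any new folders are created.
    for tar_path in sorted(out_dir.rglob("*.tar.gz")):
        # Assumed layout: unpack next to the archive, in a folder named
        # after it (e.g. all_genomes.RBH.tar.gz -> all_genomes.RBH/).
        dest = tar_path.with_name(tar_path.name[: -len(".tar.gz")])
        dest.mkdir(exist_ok=True)
        with tarfile.open(tar_path, "r:gz") as tar:
            # Use the safer "data" extraction filter where Python offers it.
            if hasattr(tarfile, "data_filter"):
                tar.extractall(dest, filter="data")
            else:
                tar.extractall(dest)
        unpacked.append(dest)
    return unpacked
```

Calling unpack_dataset with the path of the downloaded .zip archive and an output folder of your choice returns the list of folders created for the nested archives.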
Description of datasets:
1. CORE-dataset_Gene_Annotations/ contains the following data files:
all_genomes.annotations.tar.gz
all_genomes.diamond-blastp.tar.gz
all_genomes.emapper.annotations.tar.gz
all_genomes.ghostKOALA-KOs.tar.gz
all_genomes.interproscan.tar.gz
all_genomes.RBH.tar.gz
These .tar.gz packages contain the genome annotation data files for the genomes of the "core" dataset of the article. Genome annotations have been generated as outlined in the article using the following tools and methods: DIAMOND blastp, eggNOG-mapper (emapper), GhostKOALA, InterProScan, and the reciprocal best blast hit algorithm (RBH). Moreover, all annotations, with the exception of RBH annotations, have been compiled into an overview table (all_genomes.annotations.tar.gz).
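As a reminder of what a reciprocal best hit is, the idea can be sketched as follows. This is a generic illustration and not the pipeline used for the article: the input format (dictionaries mapping each query protein to its single best-scoring hit in the other genome) and all protein identifiers are hypothetical.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return pairs (a, b) where b is a's best hit and a is b's best hit."""
    pairs = []
    for a, b in best_a_to_b.items():
        # Keep the pair only if the best hit in the reverse search
        # points back to the original query.
        if best_b_to_a.get(b) == a:
            pairs.append((a, b))
    return sorted(pairs)


# Hypothetical identifiers, for illustration only.
forward = {"A1": "B1", "A2": "B2", "A3": "B2"}
reverse = {"B1": "A1", "B2": "A3"}
print(reciprocal_best_hits(forward, reverse))
```

In this toy input, A2 is dropped because the best hit of B2 in the reverse direction is A3, not A2.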
2. CORE-dataset_Gene_Phylogenies/ contains the following folders:
bchY/
creS_ALL/
creS_SUBSET/
pufM/
These folders correspond to each of the four sets of gene phylogenies inferred for the article. They contain data files including sequence alignments (.align), trimmed alignments (.trim), and tree files (.treefile), generated as outlined in the article. Moreover, the creS_ALL/ folder includes files for protein domain mapping using iTOL (.dataset_protein_domains_template.txt and .interproscan.tsv files).
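A quick way to check which sequences a tree contains is to list its leaf labels. The sketch below assumes that the .treefile files are plain Newick text with unquoted labels, which is not stated in this description; the example tree and taxon names are invented. A dedicated phylogenetics library is the better choice for anything beyond a quick check.

```python
import re


def leaf_names(newick):
    """Return leaf labels from a Newick string (unquoted labels only)."""
    # A leaf label directly follows "(" or ","; internal node labels
    # (e.g. support values) follow ")" and are therefore skipped.
    return re.findall(r"[(,]\s*([^\s():,;]+)", newick)


# Invented example tree, for illustration only.
example = "((taxonA:0.1,taxonB:0.2)95:0.05,taxonC:0.3);"
print(leaf_names(example))
```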
3. CORE-dataset_Species_Phylogenies/ contains the following folders:
GToTree/
Martijn_etal_2018_marker_genes/
These folders correspond to the two species phylogenies inferred for the "core" dataset of the article, as outlined in the article.
For the GToTree phylogeny inferred using the 117 alphaproteobacterial marker genes provided by GToTree, the concatenated sequence file (.fa) and the tree file (.treefile) are included.
For the refined species phylogeny inferred using the marker genes compiled by Martijn et al. (2018; Nature 557:101-105), individual gene trees were first inferred separately and visualized. After manual inspection of the initial gene trees, putative paralogs, contamination, long-branching, horizontal transfers, and duplicate sequences were removed. Annotated phylogenies of these initial gene trees are provided (marker_single_gene_trees_round-one.pdf), with sequences removed after curation highlighted with red branches. A list of sequences removed during the curation, or removed due to poor alignment, is also provided (marker_single_gene_trees.xlsx). After this curation step, gene trees were re-inferred (marker_single_gene_trees_round-two.pdf). The folder marker_single_gene_trees/ contains sequence alignments (.align), trimmed alignments (.trim), tree files (.treefile), and annotated trees (.pdf) for each separate gene tree, before curation (without the "_v2" file name suffix) and after curation (with the "_v2" file name suffix). Lastly, for the final concatenated phylogenies inferred after curation, the concatenated sequence file (.fa), the non-parametric bootstrap tree file (.NPboot.treefile), and the ultrafast bootstrap tree file (.ufboot.treefile) are included.
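The before/after curation naming described above can be used to separate the two rounds of files programmatically. The sketch below assumes only that "_v2" appears somewhere in the name of each post-curation file; its exact position relative to the file extension is not specified here, and the function name is hypothetical.

```python
from pathlib import Path


def split_by_curation(folder):
    """Split file names into (before, after) curation using the "_v2" tag."""
    before, after = [], []
    for path in sorted(Path(folder).iterdir()):
        if not path.is_file():
            continue
        # Assumption: "_v2" occurs somewhere in post-curation file names.
        if "_v2" in path.name:
            after.append(path.name)
        else:
            before.append(path.name)
    return before, after
```

Pointing the function at an unpacked marker_single_gene_trees/ folder returns two sorted lists of file names, one per curation round.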
4. EXTENDED-dataset_Species_Phylogenies/ contains the following folders:
16S_23S_rRNA_genes/
GToTree/
These folders correspond to the two species phylogenies inferred for the "extended" dataset of the article, as outlined in the article.
For the phylogeny inferred from concatenated 16S and 23S rRNA genes of the "core" dataset together with the Acaudatibacter ("Palsa-881") species representatives of the "extended" dataset, as outlined in the article, the sequence alignments (.align), trimmed alignments (.trim), concatenated sequence file (.fa), and tree file (.treefile) are included.
For the GToTree phylogeny inferred using the 117 alphaproteobacterial marker genes for the Acaudatibacter ("Palsa-881") species of the "extended" dataset together with a selection of Caulobacterales genomes from the "core" dataset, as outlined in the article, the concatenated sequence file (.faa) and the tree file (.treefile) are included.
5. EXTENDED-dataset_Gene_Annotations/ contains the following data files:
all_proteins.emapper.annotations.tar.gz
all_proteins.ghostKOALA-KOs.tar.gz
all_proteins.RBH.tar.gz
These .tar.gz packages contain the genome annotation data files for the genomes of the "extended" dataset of the article. Genome annotations have been generated as outlined in the article using the following tools and methods: eggNOG-mapper (emapper), GhostKOALA, and the reciprocal best blast hit algorithm (RBH).
Accessibility level:
Creator/principal investigator:
Research principal:
Data contains personal data: No
