Data associated with Bergsten et al. (2025) "Whole Genome Shotgun Phylogenomics Resolve the Diving Beetle Tree of Life" in Systematic Entomology.
Dataset used to infer the phylogeny of Dytiscidae in Bergsten et al. (2025) "Whole Genome Shotgun Phylogenomics Resolve the Diving Beetle Tree of Life" in Systematic Entomology.

Included files:

- aa-baits-macse-trees_nov_concat.faa.gz (30M) Amino-acid sequence data in FASTA format. Gzip-compressed.
- aa-baits-macse-trees_nov_concat.partitions.gz (48K) Gene-partition definitions in plain text format. Gzip-compressed.
- ortho.aa.tsv.gz (40K) Tab-separated translation table between TC-V9-V7 ortholog definitions. Gzip-compressed.
- README.md (8K) Readme file, plain text format.
- MANIFEST.txt Manifest file listing the files in this item, plain text format.

Abstract of original article:

Diving beetles (Dytiscidae) are important generalist predators in freshwater ecosystems that have been around since the Jurassic. Previous phylogenetic studies have identified a largely stable set of monophyletic named groups (subfamilies, tribes and subtribes), however backbone relationships among these have remained elusive. Here we use whole-genome sequencing to reconstruct the phylogeny of Dytiscidae. We mine de novo assemblies and combine them with others available from transcriptome studies of Adephaga to compile a dataset of 149 taxa and 5364 orthologous genes. Species tree and concatenated maximum likelihood methods provide largely congruent results resolving in agreement all but two inter-subfamily nodes. All eleven subfamilies are monophyletic supporting previous results, possibly also all tribes but Hydroporini is recovered as paraphyletic with weak support and monophyly of Dytiscini is method dependent. One large clade includes eight of eleven subfamilies (excluding Laccophilinae, Lancetinae and Coptotominae). Matinae is sister to Hydrodytinae + Hydroporinae in contrast with previous studies that have hypothesized Matinae as sister to the remaining Dytiscidae. Copelatinae belong in a clade with Cybistrinae, Dytiscinae, Agabinae and Colymbetinae.
Strongly confirmed sister-group relationships of subfamilies include Cybistrinae + Dytiscinae, Agabinae + Colymbetinae, Lancetinae + Coptotominae, and Hydrodytinae + Hydroporinae. Remaining problems include resolving with confidence the basal ingroup trichotomy and relationships between tribes in Hydroporinae. Resolution of tribes in Dytiscinae is affected by methodological inconsistencies. Platynectini, new tribe, is described and Hydrotrupini redefined within subfamily Agabinae. This study is a step forward towards completely resolving the backbone phylogeny of Dytiscidae which we hope will stimulate further work on remaining challenges.

DNA extraction

Specimen DNAs were extracted with Qiagen DNeasy or Puregene kits (Valencia, California, USA) following the animal tissue protocols.

Library preparation

DNA extractions of 14 samples were prepared with the Chromium Genome kit to generate linked reads with 10X Genomics technology. The 14-sample library was sequenced on 8 lanes of Illumina HiSeqX using a 2x151 bp setup and the 'HiSeq X SBS' chemistry. Twelve of the 14 samples were re-sequenced in a second run with the same run parameters but on 6 lanes, and the assemblies for these 12 samples are based on the merged data from runs 1 and 2.

Illumina libraries were prepared for an additional 62 samples following Prum et al. (2015). In short, a Covaris ultrasonicator was used to fragment extracted DNA to a size range of 200-700 bp. Using a Beckman-Coulter Biomek FXp liquid-handling robot, we performed blunt-end repair followed by size selection to 200-400 bp using SPRIselect beads (Beckman-Coulter Inc.; 0.9x ratio of bead to sample volume). Adapters containing sample-specific indexes were also ligated (for details, see Prum et al. 2015). After assessing DNA concentration using Qubit, we pooled libraries equally in groups of ~16, and verified library quality using qPCR.
Initial sequencing took place on an Illumina NovaSeq6000 S2 flow cell (shared with 38 other samples), with the PE150bp protocol and dual 8bp indexing. After assessing sequencing coverage from this initial run, we re-pooled the libraries (to optimize coverage uniformity) and collected additional reads (same protocol) on a portion of an S4 flow cell. The re-pooling/re-sequencing process was repeated twice more using SP flow cells.

De novo assembly

For samples prepared using 10X Genomics, draft de novo assemblies were generated using Supernova v.2.1.1 with non-default parameters "--nopreflight" and "--accept-extreme-coverage" (Weisenfeld et al. 2017). For the remaining samples, processed reads from the four sequenced lanes were concatenated and used as input for ABySS v2.2 in paired-end mode. After testing several k-mer sizes, we chose a k-mer size of 48 based on the quality of the resulting alignments and the length of the resulting contigs (N50). We also set ABySS to run with a Bloom filter size of 100G, three hash functions (-H argument) and a k-mer count threshold of 3 (-kc).

Extraction of orthologous gene dataset

We used three previous transcriptome-based studies focusing on Coleoptera (McKenna et al. 2019), Dytiscoidea (Vasilikopoulos et al. 2019) or Neuropterida (Vasilikopoulos et al. 2020) (see ReadMe file) to assemble an orthologous gene dataset. Each orthologous gene is identified by an OrthoDB code, and we used OrthoDB V.10 and a translation table to match the codes between OrthoDB V.7 (McKenna et al. 2019; Vasilikopoulos et al. 2020) and V.9 (Vasilikopoulos et al. 2019) (see ortho.aa.tsv.gz). The matching and merging of the datasets resulted in 6,413 preliminary reference genes. The exon-capture study of Adephaga (Vasilikopoulos et al. 2021) targeted 651 of the 3,085 genes in Vasilikopoulos et al. (2019) and is therefore a subset of that study; the 651 genes from the Adephaga terminals were downloaded, matched and included as well.
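As an illustration of the code matching described above, the following is a minimal Python sketch for building a lookup from the translation table. It is not the authors' script. The layout of ortho.aa.tsv.gz is not documented here, so the column indices are assumptions: the file description "TC-V9-V7" suggests three columns in that order, but this should be checked against README.md before use. The function name load_translation and its parameters are hypothetical.

```python
import csv
import gzip
from collections import defaultdict


def load_translation(path, key_col, value_col, skip_header=False):
    """Map IDs in one column of a gzipped TSV to the set of IDs in another.

    key_col and value_col are 0-based column indices. Which column holds
    which OrthoDB version is an ASSUMPTION the caller must verify.
    """
    mapping = defaultdict(set)
    with gzip.open(path, "rt", newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        if skip_header:
            next(reader, None)
        for row in reader:
            # Skip rows that are too short to contain both columns.
            if len(row) <= max(key_col, value_col):
                continue
            key, value = row[key_col].strip(), row[value_col].strip()
            # Skip rows where either ID is missing.
            if key and value:
                mapping[key].add(value)
    return dict(mapping)


# Hypothetical usage, assuming columns are ordered TC, V9, V7:
# v7_to_v9 = load_translation("ortho.aa.tsv.gz", key_col=2, value_col=1)
```

Values are kept as sets because a code in one OrthoDB version may correspond to more than one code in another; whether that happens in this table is not stated in the description.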
The amino-acid alignments from the published data were subsequently used as baits when extracting corresponding regions from the new genome assemblies (scaffolds files). Genes were extracted using the ALiBaSeq workflow v1.2 (Knyshov et al. 2021), which performs sequence extraction based on a local alignment search. We used tblastn v2.12.0+ (Camacho et al. 2009) with an E-value threshold of 1e-10, and ran ALiBaSeq as "alibaseqPy3.py -x a -f M -b blast_results -t assemblies -e 1e-10 --is --amalgamate-hits --ac tdna-tdna". The combined nucleotide dataset was translated to amino acids and aligned using the program suite MACSE v10.02 (Ranwez et al. 2011). This process included multiple sequence alignment with MAFFT v7.271 (Katoh et al. 2002) at both nucleotide and amino-acid level, and both pre- and post-alignment filtering steps using HMMCleaner v1.8.VR2 (Di Franco et al. 2019), in which longer indel regions, shorter isolated codons and frameshifts are identified and masked, and alignment ends are trimmed.

Alignment and filtering

The gene files were aligned with MAFFT v7.453 (option --auto). The multiple sequence alignments were then filtered using BMGE v1.12 (Criscuolo and Gribaldo 2010) with default settings. Maximum likelihood phylogenies were then estimated using RAxML-NG v1.1.0 (Kozlov et al. 2019) with a fixed substitution model (LG+G8+F) (Yang 1994, Le and Gascuel 2008). These trees were used together with the multiple sequence alignments as input to TreeShrink v1.3.9 (Mai and Mirarab 2018) with default settings, which filters out sequences whose terminal appears as an outlier in a tree, as determined by its branch length.
The resulting filtered alignment (5,364 genes, 825,452 aa positions, 149 terminals; aa file: aa-baits-macse-trees_nov_concat.faa, gene partition file: aa-baits-macse-trees_nov_concat.partitions) was then re-aligned with MAFFT and subjected to a new tree inference with RAxML-NG, this time with automatic selection of the substitution model using ModelTest-NG v.0.2.0 (Darriba et al. 2019). The final set of gene trees was used as input to ASTRAL-III v5.6.3 (Zhang et al. 2018), and the concatenated dataset was used for maximum likelihood analysis with IQ-TREE v2.1.2 (Nguyen et al. 2015). See the original paper and its supplementary information for further details on analyses and data sources.
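For readers who want to work with individual genes, the concatenated alignment can in principle be split using the partition file. The Python sketch below shows one way to do this under stated assumptions. It is not part of the original analysis. The partition file format is not described here; the sketch ASSUMES RAxML-style lines of the form "MODEL, gene_name = start-end" (the model prefix being optional) with 1-based inclusive coordinates. If the file uses another format (for example NEXUS charsets), the pattern must be adapted. All function names are hypothetical.

```python
import gzip
import re


def read_fasta(path):
    """Return {terminal_name: sequence} from a gzip-compressed FASTA file."""
    sequences = {}
    name = None
    chunks = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # Store the previous record before starting a new one.
                if name is not None:
                    sequences[name] = "".join(chunks)
                name = line[1:].split()[0]
                chunks = []
            else:
                chunks.append(line)
    if name is not None:
        sequences[name] = "".join(chunks)
    return sequences


# ASSUMPTION: RAxML-style partition lines, e.g. "LG+G4, gene_name = 1-350".
PARTITION_RE = re.compile(
    r"^\s*(?:[^,=]+,)?\s*(\S+)\s*=\s*(\d+)\s*-\s*(\d+)\s*$"
)


def read_partitions(path):
    """Return a list of (gene_name, start, end) with 1-based inclusive ends."""
    partitions = []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            match = PARTITION_RE.match(line)
            if match:
                gene, start, end = match.groups()
                partitions.append((gene, int(start), int(end)))
    return partitions


def slice_gene(sequences, start, end):
    """Extract alignment columns start..end (1-based, inclusive) per terminal."""
    return {name: seq[start - 1:end] for name, seq in sequences.items()}
```

If these assumptions hold, read_fasta on aa-baits-macse-trees_nov_concat.faa.gz should return 149 terminals, each 825,452 positions long, and read_partitions on aa-baits-macse-trees_nov_concat.partitions.gz should return 5,364 entries, matching the figures given above. A mismatch would suggest that the assumed partition format is wrong.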
Citation and access
Administrative information
Topic and keywords
Metadata
