<codeBook xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xsi:schemaLocation="ddi:codebook:2_5 http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" xmlns="ddi:codebook:2_5">
  <docDscr>
    <citation>
      <titlStmt>
        <titl xml:lang="sv"></titl>
        <parTitl xml:lang="en">References and test datasets for the Cactus pipeline</parTitl>
        <IDNo agency="SND">doi-10-17044-scilifelab-20171347-0</IDNo>
        <IDNo agency="DOI">https://doi.org/10.17044/SCILIFELAB.20171347</IDNo>
      </titlStmt>
      <prodStmt>
        <producer xml:lang="en" abbr="SND">Swedish National Data Service</producer>
        <producer xml:lang="sv" abbr="SND">Svensk nationell datatjänst</producer>
      </prodStmt>
      <holdings URI="https://doi.org/10.17044/SCILIFELAB.20171347">Landing page</holdings>
    </citation>
  </docDscr>
  <stdyDscr>
    <citation>
      <titlStmt>
        <titl xml:lang="sv"></titl>
        <parTitl xml:lang="en">References and test datasets for the Cactus pipeline</parTitl>
        <IDNo agency="SND">doi-10-17044-scilifelab-20171347-0</IDNo>
        <IDNo agency="DOI">https://doi.org/10.17044/SCILIFELAB.20171347</IDNo>
      </titlStmt>
      <rspStmt>
        <AuthEnty xml:lang="en" affiliation="Science for Life Laboratory">Salignon, Jerome</AuthEnty>
      </rspStmt>
      <prodStmt />
      <distStmt>
        <distrbtr xml:lang="en" abbr="SND" URI="https://snd.se">Swedish National Data Service</distrbtr>
        <distrbtr xml:lang="sv" abbr="SND" URI="https://snd.se">Svensk nationell datatjänst</distrbtr>
        <distDate xml:lang="en" date="2022-12-21" />
      </distStmt>
      <verStmt>
        <version elementVersion="0" elementVersionDate="2022-12-21" />
      </verStmt>
      <holdings URI="https://doi.org/10.17044/SCILIFELAB.20171347">Landing page</holdings>
    </citation>
    <stdyInfo>
      <subject />
      <abstract xml:lang="en" contentType="abstract">Overview

This item contains references and test datasets for the Cactus pipeline.

Cactus (Chromatin ACcessibility and Transcriptomics Unification Software) is an mRNA-Seq and ATAC-Seq analysis pipeline that aims to provide advanced molecular insights into the conditions under study.

Test datasets

The test datasets contain all data needed to run Cactus on each of the 4 supported organisms. This includes ATAC-Seq and mRNA-Seq data (*.fastq.gz), parameter files (*.yml) and design files (*.tsv). They were created for each species by downloading publicly available datasets with fetchngs (Ewels et al., 2020) and subsampling reads to the minimum required to have enough DAS (Differential Analysis Subsets) for enrichment analysis.

Datasets downloaded:

- Worm and human: GSE98758

- Fly: GSE149339

- Mouse: GSE193393

References

One of the goals of Cactus is to make the analysis as simple and fast as possible for the user while providing detailed insights into molecular mechanisms. This is achieved by parsing all needed references for the 4 ENCODE (Dunham et al., 2012; Stamatoyannopoulos et al., 2012; Luo et al., 2020) and modENCODE (The modENCODE Consortium et al., 2010; Gerstein et al., 2010) organisms (H. sapiens, M. musculus, D. melanogaster and C. elegans). This parsing step was done with a Nextflow pipeline in which most tools are encapsulated within containers, both for improved efficiency and reproducibility and to allow the creation of customized references.

Genomic sequences and annotations were downloaded from Ensembl (Cunningham et al., 2022). The ENCODE API (Luo et al., 2020) was used to download the ChIP-Seq profiles of 2,714 transcription factors (TFs) (Landt et al., 2012; Boyle et al., 2014) and chromatin states in the form of 899 ChromHMM profiles (Boix et al., 2021; van der Velde et al., 2021) and 6 HiHMM profiles (Ho et al., 2014). Slim annotations (cell, organ, development, and system) were parsed and used to create groups of ChIP-Seq profiles that share the same annotations, allowing users to analyze only the ChIP-Seq profiles relevant to their study. 2,779 TF motifs were obtained from the Cis-BP database (Lambert et al., 2019). GO terms and KEGG pathways were obtained via the R packages AnnotationHub (Morgan and Shepherd, 2021) and clusterProfiler (Yu et al., 2012; Wu et al., 2021), respectively.

Documentation

More information on how to use Cactus and how references and test datasets were created is available on the documentation website: https://github.com/jsalignon/cactus.</abstract>
      <sumDscr />
    </stdyInfo>
    <method>
      <dataColl />
    </method>
    <dataAccs>
      <useStmt>
        <restrctn xml:lang="en">Access to data through an external actor.</restrctn>
        <restrctn xml:lang="sv">Åtkomst till data via extern aktör. </restrctn>
      </useStmt>
    </dataAccs>
    <othrStdyMat />
  </stdyDscr>
</codeBook>