<codeBook xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xsi:schemaLocation="ddi:codebook:2_5 http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" xmlns="ddi:codebook:2_5">
  <docDscr>
    <citation>
      <titlStmt>
        <titl xml:lang="sv">CSAW-CC (mammografi) – ett dataset för AI-forskning för att förbättra screening, diagnostik och prognostik för bröstcancer</titl>
        <altTitl>Cohort of Screen-age Women - Case control (CSAW-CC)</altTitl>
        <parTitl xml:lang="en">CSAW-CC (mammography) – a dataset for AI research to improve screening, diagnostics and prognostics of breast cancer</parTitl>
        <IDNo agency="SND">2021-204-1-1</IDNo>
        <IDNo agency="ki.se">4-3790/2016</IDNo>
        <IDNo agency="DOI">https://doi.org/10.5878/45vm-t798</IDNo>
      </titlStmt>
      <prodStmt>
        <producer xml:lang="en" abbr="SND">Swedish National Data Service</producer>
        <producer xml:lang="sv" abbr="SND">Svensk nationell datatjänst</producer>
      </prodStmt>
      <holdings URI="https://doi.org/10.5878/45vm-t798">Landing page</holdings>
    </citation>
  </docDscr>
  <stdyDscr>
    <citation>
      <titlStmt>
        <titl xml:lang="sv">CSAW-CC (mammografi) – ett dataset för AI-forskning för att förbättra screening, diagnostik och prognostik för bröstcancer</titl>
        <altTitl>Cohort of Screen-age Women - Case control (CSAW-CC)</altTitl>
        <parTitl xml:lang="en">CSAW-CC (mammography) – a dataset for AI research to improve screening, diagnostics and prognostics of breast cancer</parTitl>
        <IDNo agency="SND">2021-204-1-1</IDNo>
        <IDNo agency="ki.se">4-3790/2016</IDNo>
        <IDNo agency="DOI">https://doi.org/10.5878/45vm-t798</IDNo>
        <IDNo agency="DOI">10.1148/radiol.2019190872</IDNo>
        <IDNo agency="URN">urn:nbn:se:kth:diva-267834</IDNo>
        <IDNo agency="DOI">10.1007/s10278-019-00278-0</IDNo>
        <IDNo agency="DOI">10.1016/S2589-7500(20)30185-0</IDNo>
        <IDNo agency="URN">urn:nbn:se:kth:diva-281510</IDNo>
        <IDNo agency="DOI">10.1001/jamaoncol.2020.3321</IDNo>
        <IDNo agency="URN">urn:nbn:se:kth:diva-284972</IDNo>
      </titlStmt>
      <rspStmt>
        <AuthEnty xml:lang="en" affiliation="Department of Oncology-Pathology, Karolinska Institutet">Strand, Fredrik</AuthEnty>
        <AuthEnty xml:lang="sv" affiliation="Institutionen för Onkologi-Patologi, Karolinska Institutet">Strand, Fredrik</AuthEnty>
      </rspStmt>
      <prodStmt>
        <grantNo xml:lang="en" agency="Vinnova">2017-01382_Vinnova</grantNo>
        <grantNo xml:lang="sv" agency="Vinnova">2017-01382_Vinnova</grantNo>
      </prodStmt>
      <distStmt>
        <distrbtr xml:lang="en" abbr="SND" URI="https://snd.se">Swedish National Data Service</distrbtr>
        <distrbtr xml:lang="sv" abbr="SND" URI="https://snd.se">Svensk nationell datatjänst</distrbtr>
        <distDate xml:lang="en" date="2022-04-22" />
      </distStmt>
      <verStmt>
        <version elementVersion="1" elementVersionDate="2022-04-22" />
      </verStmt>
      <holdings URI="https://doi.org/10.5878/45vm-t798">Landing page</holdings>
    </citation>
    <stdyInfo>
      <subject>
        <keyword xml:lang="en" vocab="MeSH" vocabURI="http://id.nlm.nih.gov/mesh/D001943">Breast Neoplasms</keyword>
        <keyword xml:lang="sv" vocab="MeSH" vocabURI="http://id.nlm.nih.gov/mesh/D001943">Brösttumörer</keyword>
        <keyword xml:lang="en" vocab="MeSH" vocabURI="http://id.nlm.nih.gov/mesh/D008327">Mammography</keyword>
        <keyword xml:lang="sv" vocab="MeSH" vocabURI="http://id.nlm.nih.gov/mesh/D008327">Mammografi</keyword>
      </subject>
      <abstract xml:lang="en" contentType="abstract">The dataset contains mammograms (x-ray images of the breast) from breast cancer screening at the Karolinska University Hospital, Stockholm, Sweden, collected by principal investigator Fredrik Strand at Karolinska Institutet. The dataset was compiled for AI research to improve screening, diagnostics and prognostics of breast cancer.

The dataset is based on a selection of cases with and without a breast cancer diagnosis, taken from a more comprehensive source dataset.

The source dataset includes 1,103 cases of first-time breast cancer among women in the screening age range (40-74 years) during the study period (November 2008 to December 2015). Of these, a random selection of 873 cases is included in the published dataset.

The source dataset also includes a random selection of 10,000 healthy controls from the same time period. Of these, a random selection of 7,850 controls is included in the published dataset.

For each individual, all screening mammograms, including repeated screenings over time, were included, along with the date of screening and the age at screening. In addition, there are pixel-level annotations of the tumors created by a breast radiologist (small lesions such as micro-calcifications have been annotated as an area). Annotations were also drawn in mammograms prior to diagnosis; a single-pixel annotation means that no cancer was visible, and the pixel marks the estimated location of the center of the future cancer.

In addition to images, the dataset also contains cancer data created at the Karolinska University Hospital and extracted through the Regional Cancer Center Stockholm-Gotland. This data contains information about the time of diagnosis and cancer characteristics including tumor size, histology and lymph node metastasis. 

The precision of non-image data was decreased, through categorisation and jittering, to ensure that no single individual can be identified.

The following types of files are available:
- CSV: The following data is included (if applicable): cancer/no cancer (meaning breast cancer during 2008 to 2015), age group at screening, days from image to diagnosis (if any), cancer histology, cancer size group, ipsilateral axillary lymph node metastasis. There is one CSV file for the entire dataset, with one row per image. For an individual who was diagnosed, the cancer diagnosis information is repeated on all of that individual's rows, including rows for images taken before diagnosis. For each exam date there is the assessment by radiologist 1, the assessment by radiologist 2 and the consensus decision.
- DICOM: Mammograms. For each screening, four images for the standard views were acquired: left and right, mediolateral oblique and craniocaudal. There should be four files per examination date.
- PNG: Cancer annotations, one for each DICOM image containing a visible tumor.

Access:
The dataset is available upon request due to the size of the material. The image files in DICOM and PNG format comprise approximately 2.5 TB.
If only the CSV file with parametric data is needed, it can be downloaded as associated documentation.</abstract>
      <abstract xml:lang="sv" contentType="abstract">Detta dataset innehåller röntgenbilder, mammografi, från bröstcancerscreening på Karolinska Universitetssjukhuset för perioden november 2008 till december 2015. Datasetet har sammanställts med syftet att utföra AI-forskning för att förbättra screening, diagnostik och prognostik för bröstcancer.

Datasetet bygger på ett urval av individer med och utan bröstcancerdiagnos som är hämtat från ett mer omfattande källdataset. 

Källdatasetet innehåller bröstcancerdiagnosfall för 1 103 individer, där följande ej är inkluderade: de vars ålder är utanför screeningintervallet 40 till 74 år, de som saknar komplett screeningundersökning. Från källdatasetet har ett slumpmässigt urval av 873 fall med bröstcancerdiagnos inkluderats i det publicerade datasetet. 

Källdatasetet innehåller vidare ett slumpmässigt urval av 10 000 friska individer som inte fått bröstcancerdiagnos år 2018 eller tidigare. Från källdatasetet har ett slumpmässigt urval av 7 850 friska individer inkluderats i det publicerade datasetet. 

För varje individ är samtliga mammografier inkluderade från 2008 fram till diagnos eller senast 31 december 2015. Utöver mammografibilderna finns annoteringsbilder där en bröstradiolog har annoterat tumörens utbredning på pixelnivå (små förändringar som t.ex. förkalkningar har annoterats som ett område). Även mammografibilden för föregående screening granskades och om tumörtecken var synliga annoterades de även där. Om inga tumörtecken var synliga markerades motsvarande lokalisation med en punkt. 

Utöver bilder finns även parametriska data som kommer från Karolinska Universitetssjukhuset men inhämtats via Regionalt Cancer Centrum Stockholm Gotland. Dessa data innehåller information om kvinnans ålder vid mammografi, tid från bild till diagnos, tumörstorlek, histologi och lymfkörtelmetastas. Parametriska data har begränsats, kategoriserats, och perturberats för att säkerställa anonymiteten (se vidare i bilaga).

Tillgängliga filer:
- CSV: Följande data är inkluderade (om relevant): cancer ja/nej (d.v.s. bröstcancer 2008 till 2015), åldersgrupp, dagar från mammografibild till diagnos (om någon), cancerhistologi, cancerns storleksgrupp, ipsilateral axillär lymfkörtelmetastas. Det finns en csv-fil för hela datasetet, med en rad per bild. Om någon cancerdiagnos erhållits är denna information upprepad för alla rader - även för de som hör till undersökning före diagnos. För varje undersökningsdatum finns bedömning av radiolog 1, av radiolog 2 samt consensusbeslut.
- DICOM-filer: Mammografibilder. För varje screening finns de fyra standardbilderna: vänster/höger, mediolateral oblik och kraniokaudal. Det ska därmed finnas fyra filer per examinationsdatum.
- PNG: Cancer-annoteringar. För varje DICOM bild där en tumör kan visualiseras.

Åtkomst:
Datasetet är tillgängligt efter förfrågan på grund av materialets storlek. Bildmaterialet i form av DICOM-filer och PNG-filer omfattar ca 2,5 TB. 
Önskas endast tillgång till CSV-filen med parametriska data finns den att ladda ned som tillhörande dokumentation.</abstract>
      <sumDscr>
        <collDate xml:lang="en" date="2008" event="start">2008</collDate>
        <collDate xml:lang="en" date="2015" event="end">2015</collDate>
        <nation xml:lang="en" abbr="SE">Sweden</nation>
        <nation xml:lang="sv" abbr="SE">Sverige</nation>
        <anlyUnit xml:lang="en" unit="Individual">Individual<concept vocab="DDI Analysis Unit" vocabURI="https://vocabularies.cessda.eu/v2/vocabularies/AnalysisUnit/2.1.3?languageVersion=en-2.1.3">Individual</concept></anlyUnit>
        <anlyUnit xml:lang="sv" unit="Individ">Individ<concept vocab="DDI Analysis Unit" vocabURI="https://vocabularies.cessda.eu/v2/vocabularies/AnalysisUnit/2.1.3?languageVersion=sv-2.1.3">Individ</concept></anlyUnit>
        <universe xml:lang="en">Women 40-74 years of age who were invited to mammography screening</universe>
        <universe xml:lang="sv">Kvinnor 40-74 år som inbjudits till mammografiscreening</universe>
        <dataKind xml:lang="en">Numeric</dataKind>
        <dataKind xml:lang="en">Text</dataKind>
        <dataKind xml:lang="en">Still image</dataKind>
      </sumDscr>
    </stdyInfo>
    <method>
      <dataColl>
        <sampProc xml:lang="en">Cases: Consecutive breast cancer diagnoses within the population of women who were invited to mammography screening before Dec 31, 2015.
Controls: Randomly selected women who were not diagnosed with breast cancer before Dec 31, 2015.<concept vocab="DDI Sampling Procedure" vocabURI="https://vocabularies.cessda.eu/v2/vocabularies/SamplingProcedure/2.0.1?languageVersion=en-2.0.1">Cases: Consecutive breast cancer diagnoses within the population of women who were invited to mammography screening before Dec 31, 2015.
Controls: Randomly selected women who were not diagnosed with breast cancer before Dec 31, 2015.</concept></sampProc>
        <sampProc xml:lang="sv">Fall: Konsekutiva bröstcancerdiagnoser inom populationen kvinnor som inbjudits till mammografiscreening före 2015-12-31
Kontroller: Slumpmässigt urval av kvinnor som ej erhållit bröstcancerdiagnos före 2015-12-31<concept vocab="DDI Sampling Procedure" vocabURI="https://vocabularies.cessda.eu/v2/vocabularies/SamplingProcedure/2.0.1?languageVersion=sv-2.0.1">Fall: Konsekutiva bröstcancerdiagnoser inom populationen kvinnor som inbjudits till mammografiscreening före 2015-12-31
Kontroller: Slumpmässigt urval av kvinnor som ej erhållit bröstcancerdiagnos före 2015-12-31</concept></sampProc>
        <sampProc xml:lang="en">Total universe/Complete enumeration<concept vocab="DDI Sampling Procedure" vocabURI="https://vocabularies.cessda.eu/v2/vocabularies/SamplingProcedure/2.0.1?languageVersion=en-2.0.1">Total universe/Complete enumeration</concept></sampProc>
        <sampProc xml:lang="sv">Hela populationen/total räkning<concept vocab="DDI Sampling Procedure" vocabURI="https://vocabularies.cessda.eu/v2/vocabularies/SamplingProcedure/2.0.1?languageVersion=sv-2.0.1">Hela populationen/total räkning</concept></sampProc>
        <sampProc xml:lang="en">Probability: Systematic random<concept vocab="DDI Sampling Procedure" vocabURI="https://vocabularies.cessda.eu/v2/vocabularies/SamplingProcedure/2.0.1?languageVersion=en-2.0.1">Probability: Systematic random</concept></sampProc>
        <sampProc xml:lang="sv">Sannolikhetsurval: systematiskt slumpmässigt urval<concept vocab="DDI Sampling Procedure" vocabURI="https://vocabularies.cessda.eu/v2/vocabularies/SamplingProcedure/2.0.1?languageVersion=sv-2.0.1">Sannolikhetsurval: systematiskt slumpmässigt urval</concept></sampProc>
        <collMode xml:lang="en">Registry extract and/or access to biobank sample<concept vocab="DDI Mode of Collection" vocabURI="https://vocabularies.cessda.eu/v2/vocabularies/ModeOfCollection/5.0.0?languageVersion=en-5.0.0">Registry extract and/or access to biobank sample</concept></collMode>
        <collMode xml:lang="sv">Registerutdrag och/eller tillgång till prov i biobank<concept vocab="DDI Mode of Collection" vocabURI="https://vocabularies.cessda.eu/v2/vocabularies/ModeOfCollection/5.0.0?languageVersion=sv-5.0.0">Registerutdrag och/eller tillgång till prov i biobank</concept></collMode>
      </dataColl>
    </method>
    <dataAccs>
      <useStmt>
        <restrctn xml:lang="en">Access to data through SND. Access to data is restricted.</restrctn>
        <restrctn xml:lang="sv">Åtkomst till data via SND. Tillgång till data är begränsad.</restrctn>
        <conditions elementVersion="info:eu-repo-Access-Terms vocabulary">restrictedAccess</conditions>
      </useStmt>
    </dataAccs>
    <othrStdyMat>
      <relPubl>
        <citation>
          <titlStmt>
            <titl xml:lang="sv">Dembrower, K., Liu, Y., Azizpour, H., Eklund, M., Smith, K., Lindholm, P., &amp; Strand, F. (2020). Comparison of a deep learning risk score and standard mammographic density score for breast cancer risk prediction. Radiology, 294(2), 265–272. https://doi.org/10.1148/radiol.2019190872</titl>
            <parTitl xml:lang="en">Dembrower, K., Liu, Y., Azizpour, H., Eklund, M., Smith, K., Lindholm, P., &amp; Strand, F. (2020). Comparison of a deep learning risk score and standard mammographic density score for breast cancer risk prediction. Radiology, 294(2), 265–272. https://doi.org/10.1148/radiol.2019190872</parTitl>
            <IDNo agency="DOI">10.1148/radiol.2019190872</IDNo>
            <IDNo agency="URN">urn:nbn:se:kth:diva-267834</IDNo>
          </titlStmt>
          <distStmt>
            <distDate date="2020">2020</distDate>
          </distStmt>
        </citation>
      </relPubl>
      <relPubl>
        <citation>
          <titlStmt>
            <titl xml:lang="sv">Dembrower K, Lindholm P, Strand F. A Multi-million Mammography Image Dataset and Population-Based Screening Cohort for the Training and Evaluation of Deep Neural Networks-the Cohort of Screen-Aged Women (CSAW). J Digit Imaging. 2019.</titl>
            <parTitl xml:lang="en">Dembrower K, Lindholm P, Strand F. A Multi-million Mammography Image Dataset and Population-Based Screening Cohort for the Training and Evaluation of Deep Neural Networks-the Cohort of Screen-Aged Women (CSAW). J Digit Imaging. 2019.</parTitl>
            <IDNo agency="DOI">10.1007/s10278-019-00278-0</IDNo>
          </titlStmt>
          <distStmt>
            <distDate date="2019">2019</distDate>
          </distStmt>
        </citation>
      </relPubl>
      <relPubl>
        <citation>
          <titlStmt>
            <titl xml:lang="sv">Dembrower, K., Wahlin, E., Liu, Y., Salim, M., Smith, K., Lindholm, P., Eklund, M., &amp; Strand, F. (2020). Effect of artificial intelligence-based triaging of breast cancer screening mammograms on cancer detection and radiologist workload : a retrospective simulation study. The Lancet Digital Health, 2(9), E468–E474. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-281510</titl>
            <parTitl xml:lang="en">Dembrower, K., Wahlin, E., Liu, Y., Salim, M., Smith, K., Lindholm, P., Eklund, M., &amp; Strand, F. (2020). Effect of artificial intelligence-based triaging of breast cancer screening mammograms on cancer detection and radiologist workload : a retrospective simulation study. The Lancet Digital Health, 2(9), E468–E474. http://urn.kb.se/resolve?urn=urn:nbn:se:kth:diva-281510</parTitl>
            <IDNo agency="DOI">10.1016/S2589-7500(20)30185-0</IDNo>
            <IDNo agency="URN">urn:nbn:se:kth:diva-281510</IDNo>
          </titlStmt>
          <distStmt>
            <distDate date="2020">2020</distDate>
          </distStmt>
        </citation>
      </relPubl>
      <relPubl>
        <citation>
          <titlStmt>
            <titl xml:lang="sv">Salim, M., Wåhlin, E., Dembrower, K., Azavedo, E., Foukakis, T., Liu, Y., Smith, K., Eklund, M., &amp; Strand, F. (2020). External Evaluation of 3 Commercial Artificial Intelligence Algorithms for Independent Assessment of Screening Mammograms. JAMA Oncology, 6(10), 1581. https://doi.org/10.1001/jamaoncol.2020.3321</titl>
            <parTitl xml:lang="en">Salim, M., Wåhlin, E., Dembrower, K., Azavedo, E., Foukakis, T., Liu, Y., Smith, K., Eklund, M., &amp; Strand, F. (2020). External Evaluation of 3 Commercial Artificial Intelligence Algorithms for Independent Assessment of Screening Mammograms. JAMA Oncology, 6(10), 1581. https://doi.org/10.1001/jamaoncol.2020.3321</parTitl>
            <IDNo agency="DOI">10.1001/jamaoncol.2020.3321</IDNo>
            <IDNo agency="URN">urn:nbn:se:kth:diva-284972</IDNo>
          </titlStmt>
          <distStmt>
            <distDate date="2020">2020</distDate>
          </distStmt>
        </citation>
      </relPubl>
    </othrStdyMat>
  </stdyDscr>
</codeBook>