ACROBAT - a multi-stain breast cancer histological whole-slide-image data set from routine diagnostics for computational pathology

Mattias Rantalainen; Johan Hartman

doi:10.48723/w728-p041

The ACROBAT data set consists of 4,212 whole slide images (WSIs) from 1,153 female primary breast cancer patients. The WSIs are available at 10X magnification and show tissue sections from breast cancer resection specimens stained with hematoxylin and eosin (H&E) or immunohistochemistry (IHC). For each patient, one WSI of H&E-stained tissue and between one and four WSIs of corresponding tissue stained with the routine diagnostic stains ER, PGR, HER2 and KI67 are available. The data set was acquired as part of the CHIME study (chimestudy.se), and its primary purpose was to facilitate the ACROBAT WSI registration challenge (acrobat.grand-challenge.org).

The histopathology slides originate from routine diagnostic pathology workflows and were digitised for research purposes at Karolinska Institutet (Stockholm, Sweden). The image acquisition process resembles the routine digital pathology digitisation workflow and used three Hamamatsu WSI scanners: one NanoZoomer S360 and two NanoZoomer XR. The WSIs are accompanied by a data table with one row for each WSI, specifying an anonymised patient ID, the stain or IHC antibody type of the WSI, and the magnification and microns per pixel at each available resolution level.

Automated evaluation of registration algorithm performance is possible through the ACROBAT challenge website, based on over 37,000 landmark pair annotations from 13 annotators. While the primary purpose of this data set was the development and evaluation of WSI registration methods, it has the potential to facilitate further research in computational pathology, for example in the areas of stain-guided learning, virtual staining, unsupervised learning and stain-independent models.

The data set consists of three subsets (training, validation and test), following the ACROBAT WSI registration challenge. The training set contains 750 cases, each with one H&E WSI and one to four IHC WSIs, for 3,406 WSIs in total. The validation set consists of 100 cases with 200 WSIs in total, and the test set of 303 cases with 606 WSIs in total. For each case in the validation and test sets, one H&E WSI and one randomly selected IHC WSI are available.

WSIs were anonymised by deleting the associated macro images, by generating filenames with random case IDs and by overwriting metadata fields that could contain personal information. Hamamatsu NDPI files were then converted using libvips (libvips.org/). WSIs are available as generic tiled TIFF WSIs (openslide.org/formats/generic-tiff/) at 10X magnification and lower image levels.

The data set is available for download in seven separate ZIP archives: five for the training data (train_part1.zip (71.47 GB), train_part2.zip (70.59 GB), train_part3.zip (75.91 GB), train_part4.zip (71.63 GB) and train_part5.zip (69.09 GB)), one for the validation data (valid.zip (21.79 GB)) and one for the test data (test.zip (68.11 GB)). File listings and SHA1 checksums are available for checking archive and data integrity after downloading. It would be helpful to notify SND of any publications using this data set by sending an email to request@snd.gu.se, but this is not required to use the data.
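The per-case structure described above (exactly one H&E WSI and one to four IHC WSIs per case) can be checked against a table with one row per WSI. The following is a minimal sketch only: the column names `case_id` and `stain`, the stain label `HE`, and the sample rows are hypothetical and are not taken from df_acrobat_meta.csv, whose actual layout is documented in df_acrobat_meta_readme.txt.

```python
import csv
import io
from collections import defaultdict

# IHC stains named in the data set description.
IHC_STAINS = {"ER", "PGR", "HER2", "KI67"}


def check_cases(rows):
    """Return {case_id: [problems]} for cases that break the expected structure.

    `rows` is an iterable of dicts with the HYPOTHETICAL keys
    'case_id' and 'stain' (assumed labels: 'HE' plus the IHC stain names).
    """
    stains_by_case = defaultdict(list)
    for row in rows:
        stains_by_case[row["case_id"]].append(row["stain"].upper())

    problems = {}
    for case_id, stains in stains_by_case.items():
        issues = []
        n_he = sum(1 for s in stains if s == "HE")
        n_ihc = sum(1 for s in stains if s in IHC_STAINS)
        if n_he != 1:
            issues.append(f"expected 1 H&E WSI, found {n_he}")
        if not 1 <= n_ihc <= 4:
            issues.append(f"expected 1 to 4 IHC WSIs, found {n_ihc}")
        if issues:
            problems[case_id] = issues
    return problems


# Synthetic rows for illustration only; NOT real ACROBAT metadata.
SAMPLE_CSV = """case_id,stain
1,HE
1,ER
1,KI67
2,HE
3,ER
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
problems = check_cases(rows)
for case_id, issues in sorted(problems.items()):
    print(case_id, "; ".join(issues))
```

With the real table, the in-memory sample would be replaced by a `csv.DictReader` over the downloaded file, and the column names adjusted to those given in the readme.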

Download data and documentation (17 files / 448.6 GiB)

Data files

test.zip (68.11 GiB)
train_part1.zip (71.47 GiB)
train_part2.zip (70.59 GiB)
train_part3.zip (75.91 GiB)
train_part4.zip (71.63 GiB)
train_part5.zip (69.09 GiB)
valid.zip (21.79 GiB)

Documentation files

df_acrobat_meta_readme.txt (2.91 KiB)
df_acrobat_meta.csv (1.11 MiB)
test_zip_listing.txt (30.52 KiB)
train_part1_zip_listing.txt (35.68 KiB)
train_part2_zip_listing.txt (36.54 KiB)
train_part3_zip_listing.txt (36.17 KiB)
train_part4_zip_listing.txt (35.48 KiB)
train_part5_zip_listing.txt (36.01 KiB)
valid_zip_listing.txt (10.06 KiB)
zipfiles_sha1_checksums.txt (418 Bytes)
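The SHA1 checksums can be compared against downloaded archives with the Python standard library. The sketch below assumes that zipfiles_sha1_checksums.txt uses the common `sha1sum` layout of one `<hex digest>  <filename>` pair per line; that layout is an assumption and should be checked against the actual file. The function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha1_of_file(path, chunk_size=1 << 20):
    """Compute the SHA1 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksums(text):
    """Parse lines of the ASSUMED form '<hex digest>  <filename>'."""
    expected = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2:
            # sha1sum marks binary mode with a leading '*' on the filename.
            expected[parts[-1].lstrip("*")] = parts[0].lower()
    return expected


def verify(checksum_file, directory="."):
    """Return {filename: True/False}; False means missing or mismatching."""
    expected = parse_checksums(Path(checksum_file).read_text())
    results = {}
    for name, digest in expected.items():
        archive = Path(directory) / name
        results[name] = archive.exists() and sha1_of_file(archive) == digest
    return results


# Runs only if the checksum file is present in the working directory.
checksum_path = Path("zipfiles_sha1_checksums.txt")
if checksum_path.exists():
    for name, ok in verify(checksum_path).items():
        print(name, "OK" if ok else "MISSING OR MISMATCH")
```

Reading in chunks keeps memory use constant, which matters for archives of roughly 20 to 76 GiB each.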

Citation and access

Data access level:

Data are openly accessible

Creator/Principal investigator(s):

Research principal:

Karolinska Institutet

Data contains personal data:

No

Citation:

License:

Creative Commons Attribution 4.0 International (CC BY 4.0)

Language:

English

Method and outcome

Unit of analysis:

Individual

Population:

Anonymised female primary breast cancer patients from the Stockholm region

Study design:

Observational study

Description of sampling:

A subset of the whole-slide-images generated as part of the CHIME study was randomly selected for the ACROBAT data set. The training and validation data are random subsets, whereas the test data were selected using stratified sampling that took into account biomarker status and the scanner model used to generate the respective whole-slide-image.

Time period(s) investigated:

2012 - 2018

Number of individuals/objects:

1153

Data format/data structure:

Still image

Data collection

Description of the mode of collection:

Archived routine clinical diagnostic tissue slides with tissue material were scanned using whole-slide-image scanners at Karolinska Institutet.

Time period(s) for data collection:

2012 - 2018

Data collector:

Karolinska Institutet

Instrument

Name:

NanoZoomer XR

Type:

Technical instrument(s)

Description of the instrument:

Hamamatsu whole-slide-imaging scanner.

Name:

NanoZoomer S360

Type:

Technical instrument(s)

Description of the instrument:

Hamamatsu whole-slide-imaging scanner

Geographic coverage

Geographic location:

Stockholm County

Administrative information

Responsible department/unit:

Department of Medical Epidemiology and Biostatistics [C8]

Contributor(s):

Masi Valkonen – University of Turku - Institute of Biomedicine
Kimmo Kartasalo – Karolinska Institutet - Department of Medical Epidemiology and Biostatistics
Kajsa Ledesma Eriksson – Karolinska Institutet - Department of Medical Epidemiology and Biostatistics
Leena Latonen – University of Eastern Finland - Institute of Biomedicine
Constance Boissin – Karolinska Institutet - Department of Medical Epidemiology and Biostatistics
Yanbo Feng – Karolinska Institutet - Department of Medical Epidemiology and Biostatistics
Philippe Weitz – Karolinska Institutet - Department of Medical Epidemiology and Biostatistics
Dusan Rasic – Zealand University Hospital - Department of Surgical Pathology
Sonja Koivukoski – University of Eastern Finland - Institute of Biomedicine
Pekka Ruusuvuori – University of Turku - Institute of Biomedicine
Circe Carr – University of Turku - Institute of Biomedicine
Sandra Pouplier – Zealand University Hospital - Department of Surgical Pathology
Leslie Solorzano – Karolinska Institutet - Department of Medical Epidemiology and Biostatistics
Abhinav Sharma – Karolinska Institutet - Department of Medical Epidemiology and Biostatistics
Anne-Vibeke Laenkholm – Zealand University Hospital - Institute of Biomedicine
Aino Kuusela – University of Turku - Institute of Biomedicine

Ethical Review

Reviewer:

Stockholm Ethical Review Board

Registration number:

2017/2106-31

Ethical review information:

Amendment: 2018/1462-32

Funding

Funding agency:

Swedish Research Council

Award number:

2019-00947_VR

Award title:

Advancing Breast Cancer histopathology towards AI-based Personalised medicine (ABCAP)

Funding information:

Manual histopathological assessment is the main way to detect the presence of breast cancer (BC), identify clinically relevant cancer, and establish a diagnosis. However, there is a shortage of pathology expertise and also high inter-assessor variability. This leads to prolonged response times and unequal access to top-quality histopathology assessments for cancer patients. Misclassifications in histopathology assessments cause both over- and under-treatment. We hypothesise that it is now possible to develop advanced image-based prediction models based on artificial intelligence (AI) and deep-learning (DL) techniques for BC histopathology assessment that match or outperform top-level human experts. In this research programme we will develop and validate state-of-the-art AI-based models for BC routine histopathology and for improved patient stratification with respect to prognosis and treatment response. Through both retrospective and prospective validation we will establish evidence towards clinical translation. Our studies are based on large-scale population samples, ensuring unbiased data and models. Novel methodologies for stain-free and multi-stain analysis will also be developed. The project aims to improve the quality of BC histopathology assessments by reducing errors and inter-assessor variability, to enhance patient stratification and reduce over- and under-treatment of patients, and to contribute towards more efficient and reliable routine pathology.

Funding agency:

ERA PerMed

Award number:

ERAPERMED2019-224-ABCAP

Award title:

Advancing Breast Cancer histopathology towards AI-based Personalised medicine

Funding agency:

Swedish Cancer Society

Topic and keywords

CESSDA topic classification:

Swedish Standard Classification of Research Subjects 2025:

Keywords:

Relations

Website:

Is referenced by:

https://github.com/rantalainenGroup/ACROBAT

Publications

Citation:

Weitz P, Valkonen M, Solorzano L, Carr C, Kartasalo K, Boissin C, Koivukoski S, Kuusela A, Rasic D, Feng Y, Sinius Pouplier S, Sharma A, Ledesma Eriksson K, Latonen L, Laenkholm AV, Hartman J, Ruusuvuori P, Rantalainen M. A Multi-Stain Breast Cancer Histological Whole-Slide-Image Data Set from Routine Diagnostics. Sci Data. 2023 Aug 24;10(1):562.

DOI:
10.1038/s41597-023-02422-6

Citation:

Weitz, P. et al., (2022). ACROBAT -- a multi-stain breast cancer histological whole-slide-image data set from routine diagnostics for computational pathology. doi:10.48550/ARXIV.2211.13621

DOI:
10.48550/ARXIV.2211.13621

Contact

Philippe Weitz, philippe.weitz@ki.se

Mattias Rantalainen, mattias.rantalainen@ki.se

Metadata

Version 1
