ACROBAT - a multi-stain breast cancer histological whole-slide-image data set from routine diagnostics for computational pathology
https://doi.org/10.48723/w728-p041
The ACROBAT data set consists of 4,212 whole slide images (WSIs) from 1,153 female primary breast cancer patients. The WSIs in the data set are available at 10X magnification and show tissue sections from breast cancer resection specimens stained with hematoxylin and eosin (H&E) or immunohistochemistry (IHC). For each patient, one WSI of H&E stained tissue and at least one, and up to four, WSIs of corresponding tissue stained with the routine diagnostic stains ER, PGR, HER2 and KI67 are available. The data set was acquired as part of the CHIME study (chimestudy.se) and its primary purpose was to facilitate the ACROBAT WSI registration challenge (acrobat.grand-challenge.org). The histopathology slides originate from routine diagnostic pathology workflows and were digitised for research purposes at Karolinska Institutet (Stockholm, Sweden). The image acquisition process resembles the routine digital pathology digitisation workflow and used three different Hamamatsu WSI scanners: one NanoZoomer S360 and two NanoZoomer XR. The WSIs in this data set are accompanied by a data table with one row for each WSI, specifying an anonymised patient ID, the stain or IHC antibody type of each WSI, as well as the magnification and microns per pixel at each available resolution level. Automated evaluation of registration algorithm performance is possible through the ACROBAT challenge website, based on over 37,000 landmark pair annotations from 13 annotators. While the primary purpose of this data set was the development and evaluation of WSI registration methods, it has the potential to facilitate further research in computational pathology, for example in the areas of stain-guided learning, virtual staining, unsupervised learning and stain-independent models.
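As an illustration of how the accompanying data table could be used, the following minimal sketch groups WSIs by patient and selects cases that have a given combination of stains. The column names `anon_id` and `stain`, the stain labels and the CSV layout are assumptions made for this example; the actual table may use different headers and values.

```python
import csv
import io
from collections import defaultdict


def stains_per_patient(table_file, id_col="anon_id", stain_col="stain"):
    """Map each patient ID to the list of stains found in the table.

    The column names are hypothetical defaults; adjust them to the real table.
    """
    by_patient = defaultdict(list)
    for row in csv.DictReader(table_file):
        by_patient[row[id_col]].append(row[stain_col])
    return dict(by_patient)


def cases_with_stains(by_patient, required):
    """Return the sorted patient IDs that have every stain in `required`."""
    required = set(required)
    return sorted(pid for pid, stains in by_patient.items()
                  if required <= set(stains))


# Made-up rows standing in for the real data table.
example = io.StringIO("anon_id,stain\n1,HE\n1,ER\n1,KI67\n2,HE\n2,HER2\n")
groups = stains_per_patient(example)
pairs = cases_with_stains(groups, {"HE", "ER"})
print(pairs)
```

A selection like this is a typical first step for registration or stain-to-stain experiments, where each H&E WSI is paired with a specific IHC WSI of the same patient.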
The data set consists of three subsets, the training, validation and test set, based on the ACROBAT WSI registration challenge. There are 750 cases in the training set, for each of which one H&E WSI and one to four IHC WSIs are available, with 3,406 WSIs in total. The validation set consists of 100 cases with 200 WSIs in total and the test set of 303 cases with 606 WSIs in total. For each case in the validation and test sets, one H&E WSI and one randomly selected IHC WSI are available.
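The subset sizes can be cross-checked against the overall totals with a few lines of arithmetic. The sketch below only uses the numbers stated in this description; the dictionary layout is an illustrative choice.

```python
# Cases and WSIs per subset, as stated in the description.
subsets = {
    "training": {"cases": 750, "wsis": 3406},
    "validation": {"cases": 100, "wsis": 200},
    "test": {"cases": 303, "wsis": 606},
}

total_cases = sum(s["cases"] for s in subsets.values())
total_wsis = sum(s["wsis"] for s in subsets.values())

# Each training case has one H&E WSI, so the remainder are IHC WSIs.
train = subsets["training"]
train_ihc_wsis = train["wsis"] - train["cases"]
mean_ihc_per_train_case = train_ihc_wsis / train["cases"]

print(total_cases, total_wsis, train_ihc_wsis, round(mean_ihc_per_train_case, 2))
```

The totals agree with the 1,153 patients and 4,212 WSIs given for the full data set, and the training set averages roughly 3.5 IHC WSIs per case.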
WSIs were anonymised by deleting the associated macro images, by generating filenames with random case IDs and by overwriting metadata fields that could contain personal information. Hamamatsu NDPI files were then converted using libvips (libvips.org). WSIs are available as generic tiled TIFF WSIs (openslide.org/formats/generic-tiff) at 10X magnification and lower image levels.
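Because the files are generic tiled TIFFs, they can in principle be opened with OpenSlide. The sketch below lists the pyramid levels of a slide and derives the microns per pixel at each level from a base value, which would be taken from the accompanying data table. The helper names, the example dimensions and the base value of 1.0 are hypothetical; `open_wsi` requires the third-party `openslide-python` package, while the demonstration uses a stand-in object so that it runs without any WSI file.

```python
from types import SimpleNamespace


def describe_levels(slide, base_mpp):
    """Summarise each pyramid level of an OpenSlide-like object.

    `base_mpp` is the microns per pixel at level 0, e.g. from the data table.
    """
    rows = []
    for level in range(slide.level_count):
        width, height = slide.level_dimensions[level]
        downsample = slide.level_downsamples[level]
        rows.append({
            "level": level,
            "width": width,
            "height": height,
            "downsample": downsample,
            "mpp": base_mpp * downsample,  # pixel size grows with downsampling
        })
    return rows


def open_wsi(path):
    """Open a WSI with OpenSlide (imported lazily; third-party dependency)."""
    import openslide
    return openslide.OpenSlide(path)


# Stand-in with made-up dimensions, mimicking the OpenSlide attributes used above.
fake_slide = SimpleNamespace(
    level_count=3,
    level_dimensions=((40000, 30000), (20000, 15000), (10000, 7500)),
    level_downsamples=(1.0, 2.0, 4.0),
)
levels = describe_levels(fake_slide, base_mpp=1.0)  # 1.0 is a placeholder value
for row in levels:
    print(row)
```

With a real file, `describe_levels(open_wsi("some_slide.tiff"), base_mpp=...)` would be the corresponding call, where the filename is a placeholder.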
The data set is available for download in seven separate ZIP archives: five for the training data (train_part1.zip (71.47 GB), train_part2.zip (70.59 GB), train_part3.zip (75.91 GB), train_part4.zip (71.63 GB) and train_part5.zip (69.09 GB)), one for the validation data (valid.zip, 21.79 GB) and one for the test data (test.zip, 68.11 GB).
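Given the archive sizes, it can be useful to check the available disk space before downloading. The following sketch sums the listed sizes and compares them with the free space at a target directory. It assumes that GB means 10^9 bytes and does not account for the additional space needed to extract the archives.

```python
import shutil

# Archive sizes in GB, as listed in the description.
archive_sizes_gb = {
    "train_part1.zip": 71.47,
    "train_part2.zip": 70.59,
    "train_part3.zip": 75.91,
    "train_part4.zip": 71.63,
    "train_part5.zip": 69.09,
    "valid.zip": 21.79,
    "test.zip": 68.11,
}

total_gb = sum(archive_sizes_gb.values())
train_gb = sum(size for name, size in archive_sizes_gb.items()
               if name.startswith("train_"))


def has_room(target_dir, needed_gb, bytes_per_gb=10**9):
    """Return True if `target_dir` has at least `needed_gb` of free space.

    Assumes decimal gigabytes; extraction needs additional space on top.
    """
    return shutil.disk_usage(target_dir).free >= needed_gb * bytes_per_gb


print(f"training: {train_gb:.2f} GB, all archives: {total_gb:.2f} GB")
print("enough space here:", has_room(".", total_gb))
```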
File listings and checksums in SHA1 format are available for checking archive/data integrity when downloading.
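A downloaded archive can be checked against its published SHA1 checksum with the Python standard library, as sketched below. The layout of the checksum listing is an assumption here (one `<hex digest>  <filename>` pair per line, as produced by `sha1sum`); the actual listing files may be formatted differently.

```python
import hashlib


def sha1_of_file(path, chunk_size=1024 * 1024):
    """Compute the SHA1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_checksum_listing(text):
    """Parse lines of the assumed form '<hex digest>  <filename>'."""
    expected = {}
    for line in text.splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) == 2:
            digest, name = parts
            # sha1sum marks binary mode with a leading '*' on the filename.
            expected[name.strip().lstrip("*")] = digest.lower()
    return expected


def verify(path, name, expected):
    """Return True if the file at `path` matches the digest listed for `name`."""
    return sha1_of_file(path) == expected.get(name)
```

For example, `verify("valid.zip", "valid.zip", parse_checksum_listing(listing_text))` would return `True` for an intact download, where `listing_text` holds the contents of the checksum file.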
Notifying SND of any publications that use this data set, by sending an email to request@snd.gu.se, would be helpful but is not required to use the data.
Data files
Documentation files
Citation and access
Method and outcome
Data collection
Geographic coverage
Administrative information
Topic and keywords
Relations
Publications
Metadata
Version 1

Karolinska Institutet