<codeBook xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xsd="http://www.w3.org/2001/XMLSchema" xsi:schemaLocation="ddi:codebook:2_5 http://www.ddialliance.org/Specification/DDI-Codebook/2.5/XMLSchema/codebook.xsd" xmlns="ddi:codebook:2_5">
  <docDscr>
    <citation>
      <titlStmt>
        <titl xml:lang="sv"></titl>
        <parTitl xml:lang="en">The Klarna Product-Page Dataset</parTitl>
        <IDNo agency="SND">doi-10-5281-zenodo-12605480-0</IDNo>
        <IDNo agency="DOI">https://doi.org/10.5281/zenodo.12605480</IDNo>
      </titlStmt>
      <prodStmt>
        <producer xml:lang="en" abbr="SND">Swedish National Data Service</producer>
        <producer xml:lang="sv" abbr="SND">Svensk nationell datatjänst</producer>
      </prodStmt>
      <holdings URI="https://doi.org/10.5281/zenodo.12605480">Landing page</holdings>
    </citation>
  </docDscr>
  <stdyDscr>
    <citation>
      <titlStmt>
        <titl xml:lang="sv"></titl>
        <parTitl xml:lang="en">The Klarna Product-Page Dataset</parTitl>
        <IDNo agency="SND">doi-10-5281-zenodo-12605480-0</IDNo>
        <IDNo agency="DOI">https://doi.org/10.5281/zenodo.12605480</IDNo>
      </titlStmt>
      <rspStmt>
        <AuthEnty xml:lang="en" affiliation="KTH Royal Institute of Technology">Hotti, Alexandra</AuthEnty>
        <AuthEnty xml:lang="en" affiliation="KTH Royal Institute of Technology">Lagergren, Jens</AuthEnty>
      </rspStmt>
      <prodStmt />
      <distStmt>
        <distrbtr xml:lang="en" abbr="SND" URI="https://snd.se">Swedish National Data Service</distrbtr>
        <distrbtr xml:lang="sv" abbr="SND" URI="https://snd.se">Svensk nationell datatjänst</distrbtr>
        <distDate xml:lang="en" date="2024-07-01" />
      </distStmt>
      <verStmt>
        <version elementVersion="0" elementVersionDate="2024-07-01" />
      </verStmt>
      <holdings URI="https://doi.org/10.5281/zenodo.12605480">Landing page</holdings>
    </citation>
    <stdyInfo>
      <subject />
      <abstract xml:lang="en" contentType="abstract">Description

The Klarna Product Page Dataset is a dataset of publicly available pages corresponding to products sold online on various e-commerce websites. The dataset contains offline snapshots of 51,701 product pages collected from 8,175 distinct merchants across 8 different markets (US, GB, SE, NL, FI, NO, DE, AT) between 2018 and 2019. On each page, analysts labelled 5 elements of interest: the price of the product, its image, its name and the add-to-cart and go-to-cart buttons (if found). These labels are present in the HTML code as an attribute called klarna-ai-label taking one of the values: Price, Name, Main picture, Add to cart and Cart.

The snapshots are available in 3 formats: as MHTML files (~24GB), as WebTraversalLibrary (WTL) snapshots (~7.4GB), and as screeshots (~8.9GB). The MHTML format is less lossy, a browser can render these pages though any Javascript on the page is lost. The WTL snapshots are produced by loading the MHTML pages into a chromium-based browser. To keep the WTL dataset compact, the screenshots of the rendered MTHML are provided separately; here we provide the HTML of the rendered DOM tree and additional page and element metadata with rendering information (bounding boxes of elements, font sizes etc.). The folder structure of the screenshot dataset is identical to the one the WTL dataset and can be used to complete the WTL snapshots with image information. For convenience, the datasets are provided with a train/test split in which no merchants in the test set are present in the training set.

Corresponding Publication

For more information about the contents of the datasets (statistics etc.) please refer to the following TMLR paper.

GitHub Repository

The code needed to re-run the experiments in the publication accompanying the dataset can be accessed here.

Citing

If you found this dataset useful in your research, please cite the paper as follows:

@article{hotti2024the,
title={The Klarna Product Page Dataset: Web Element Nomination with Graph Neural Networks and Large Language Models},
author={Alexandra Hotti and Riccardo Sven Risuleo and Stefan Magureanu and Aref Moradi and Jens Lagergren},
journal={Transactions on Machine Learning Research},
issn={2835-8856},
year={2024},
url={https://openreview.net/forum?id=zz6FesdDbB},
note={}
}</abstract>
      <sumDscr />
    </stdyInfo>
    <method>
      <dataColl />
    </method>
    <dataAccs>
      <useStmt>
        <restrctn xml:lang="en">Access to data through an external actor. Data are freely accessible.</restrctn>
        <restrctn xml:lang="sv">Åtkomst till data via extern aktör. Data är fritt tillgängliga.</restrctn>
        <conditions elementVersion="info:eu-repo-Access-Terms vocabulary">openAccess</conditions>
      </useStmt>
    </dataAccs>
    <othrStdyMat />
  </stdyDscr>
</codeBook>