News articles and front pages from 19 Swedish news sites during the covid-19/corona pandemic 2020–2021

Peter M. Dahlgren

doi:10.5878/d18f-q220

https://doi.org/10.5878/d18f-q220

This dataset contains news articles from Swedish news sites during the covid-19 (corona) pandemic 2020–2021. The purpose was to develop and test new methods for collecting and analyzing large news corpora by computational means. In total, 677,151 articles were collected from 19 news sites between 2020-01-01 and 2021-04-26. The articles were collected by scraping all links on the homepages and main sections of each site every two hours, day and night.

The dataset also includes about 45 million timestamps at which the articles were present on the front pages (homepages and main sections of each news site, such as domestic news, sports, editorials, etc.). This allows for detailed analysis of which articles a reader was likely exposed to when visiting a news site. The time resolution is two hours, meaning that changes in which articles were on the front pages can be detected at two-hour intervals.

The 19 news sites are aftonbladet.se, arbetet.se, da.se, di.se, dn.se, etc.se, expressen.se, feministisktperspektiv.se, friatider.se, gp.se, nyatider.se, nyheteridag.se, samnytt.se, samtiden.nu, svd.se, sverigesradio.se, svt.se, sydsvenskan.se and vlt.se.

Due to copyright, the full text is not available. Instead, it has been transformed into a document-term matrix (in long format) that contains the frequency of all words for each article (in total, 80 million words). Each article also includes extensive metadata that was extracted from the articles themselves (URL, document title, article heading, author, publish date, edit date, language, section, tags, category) and metadata that was inferred by simple heuristic algorithms (page type, article genre, paywall).

The dataset consists of the following files:

article_metadata.csv (53 MB): Information about each news article, one article per row. In total, 677,151 observations and 17 variables.

article_text.csv (236 MB): The id of each news article and how many times (count) a specific word occurs in the article. In total, 80,090,784 observations and 3 variables in long format.

frontpage_timestamps.csv (175 MB): The times at which each news article was found on the front page (homepage and main sections) of the news sites. In total, 45,337,740 observations and 4 variables in long format.

More information about the content of the files is found in the README file, which also contains the R script for using the data.
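To illustrate what the long-format document-term matrix looks like in practice, here is a minimal Python sketch that rebuilds per-article word counts and a corpus-wide frequency table. It runs on a tiny made-up sample, not on the real file. The column names `id`, `word` and `count` are assumptions based on the description above; the README documents the actual variable names.

```python
import csv
import io
from collections import Counter, defaultdict

# Made-up sample rows in long format: one row per (article, word) pair.
# The column names id, word and count are assumptions, not confirmed names.
SAMPLE = """id,word,count
1,corona,3
1,vaccin,1
2,corona,2
2,sport,4
"""


def load_term_counts(handle):
    """Return {article_id: Counter({word: count})} from long-format rows."""
    articles = defaultdict(Counter)
    for row in csv.DictReader(handle):
        articles[row["id"]][row["word"]] += int(row["count"])
    return articles


def corpus_frequency(articles):
    """Sum the word counts of all articles into one Counter."""
    total = Counter()
    for counts in articles.values():
        total.update(counts)
    return total


articles = load_term_counts(io.StringIO(SAMPLE))
totals = corpus_frequency(articles)
print(totals.most_common(2))
```

For the real article_text.csv, the `io.StringIO(SAMPLE)` argument would be replaced by an opened file handle. Because the file has about 80 million rows, keeping every per-article counter in memory may not be practical, and a streaming or chunked approach is likely a better fit.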

Download data and documentation (2 files / 456.06 MiB)

Data files

News articles and front pages corona pandemic 2020–2021.zip
455.82 MiB

Documentation files

README.pdf
249.82 KiB

Citation and access

Data contains personal data:

No

Citation:

License:

Creative Commons Attribution 4.0 International (CC BY 4.0)

Language:

Method and outcome

Unit of analysis:

Population:

News articles

Time method:

Longitudinal

Sampling procedure:

Total universe/Complete enumeration

Description of sampling:

An open source web scraper scraped news articles from 19 Swedish news sites every two hours. Code in Python for the web scraper is available at: https://github.com/peterdalle/mechanicalnews
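Because the scraper visited each site every two hours, front-page presence is recorded as a series of snapshots rather than as continuous intervals. The Python sketch below shows one way to summarize such snapshots per article. The observations are made up, and the field layout (an article id paired with a timestamp string) and the timestamp format are assumptions; the README documents the actual variables in frontpage_timestamps.csv.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Scraping interval stated in the dataset description.
SCRAPE_INTERVAL = timedelta(hours=2)

# Made-up observations: (article_id, timestamp). The field layout and the
# timestamp format are assumptions for illustration only.
OBSERVATIONS = [
    ("a1", "2020-03-01 08:00"),
    ("a1", "2020-03-01 10:00"),
    ("a1", "2020-03-01 12:00"),
    ("a2", "2020-03-01 10:00"),
]


def summarize_presence(observations):
    """Summarize first/last sighting and snapshot count per article."""
    seen = defaultdict(list)
    for article_id, stamp in observations:
        seen[article_id].append(datetime.strptime(stamp, "%Y-%m-%d %H:%M"))

    summary = {}
    for article_id, stamps in seen.items():
        stamps.sort()
        summary[article_id] = {
            "first_seen": stamps[0],
            "last_seen": stamps[-1],
            "snapshots": len(stamps),
            # Rough estimate: each snapshot stands for one scraping interval.
            "approx_exposure": len(stamps) * SCRAPE_INTERVAL,
        }
    return summary


presence = summarize_presence(OBSERVATIONS)
for article_id, info in presence.items():
    print(article_id, info["snapshots"], info["approx_exposure"])
```

The exposure figure is only an approximation: an article that appeared and disappeared between two scrapes is never observed, and one seen in a single snapshot may have been on the page for anything from a few minutes to almost four hours.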

Time period(s) investigated:

2020-01-01 - 2021-04-26

Variables:

17

Number of individuals/objects:

677151

Data format/data structure:

Text

Data collection - Other

Mode of collection:

Other

Time period(s) for data collection:

2020 - 2021

Source of the data:

Communications: Public
Communications

Geographic coverage

Geographic location:

Sweden

Administrative information

Responsible department/unit:

Department of Journalism, Media and Communication (JMG)

Funding

Funding agency:

The Swedish Civil Contingencies Agency (MSB)

Topic and keywords

CESSDA topic classification:

Swedish Standard Classification of Research Subjects 2025:

Keywords:

Covid-19

Relations

Homepage:

KRISAMS (Kriskommunikation och samhällsförtroende i det multipublika samhället)

Publications

Citation:

Dahlgren, P. M. (2021). Svenskar eller utrikesfödda i medierna? – att identifiera födelseland från namn. I L. Truedson & J. Lundqvist (Red.), Vitt eller brett? – vilka får ta plats i medier och på redaktioner. Stockholm: Institutet för mediestudier.

ISBN:
978-91-987098-0-3

Citation:

Dahlgren, P. M. (2021). Medieinnehåll och mediekonsumtion under coronapandemin: Datoriserade metoder för insamling och analys av stora mängder text- och mediedata. Göteborg: Institutionen för journalistik, medier och kommunikation (JMG), Göteborgs universitet.

ISSN:
1101-4679

Metadata

Version 1
