The Arabic E-Book Corpus
https://doi.org/10.5878/7rbh-gy93
The Arabic E-Book Corpus is a freely available collection of 1,745 books (81.5 million words) published by the Hindawi Foundation between 2008 and 2024. The books span various genres, including non-fiction, novels, children's literature, poetry, and plays. The corpus is provided in two versions: HTML and unformatted plain text. The plain-text version is appropriate for most purposes.
For additional detail, see Hallberg, A. (2025). An 81-million-word multi-genre corpus of Arabic books. Data in Brief, 60, 111456. https://doi.org/10.1016/j.dib.2025.111456
Citation and access
Data access level:
Creator/Principal investigator(s):
Research principal:
Data contains personal data:
Yes
Type of personal data:
The data contain the names of copyright holders, such as authors and translators, as well as of historical, political, and other public figures mentioned in the works.
Citation:
Language:
Corpus
Foreseen use:
NLP application, Human use
Text part
Linguality:
Monolingual
Language:
Arabic (ara)
Modality:
Written Language
Size:
Words: 80.5 million
Files: 1,745
Annotation:
Original source:
Link to other media:
Text: https://www.hindawi.org
Method and outcome
Time period(s) investigated:
Data format/data structure:
Geographic coverage
Geographic location:
Administrative information
Responsible department/unit:
Department of Languages and Literatures
Topic and keywords
Standard for Swedish classification of research subjects 2025:
Relations
Compiles:
Publications
Citation:
Hallberg, A. (2025). An 81-million-word multi-genre corpus of Arabic books. Data in Brief, 60, 111456. https://doi.org/10.1016/j.dib.2025.111456
