February 16, 2023

SautiDB: Nigerian Accent Dataset Collection

A TRI AI initiative

Abstract

SautiDB is an ongoing effort to build a comprehensive collection of Nigerian-accented English speech recordings. Data is collected through a publicly accessible webapp where speakers from different Nigerian language backgrounds can contribute voice samples. The dataset covers speakers of Yorùbá, Ìgbò, Ẹdó, Efik-Ibibio, Igala, and Hausa, and is designed to support the development of voice technology that is inclusive of Nigerian users. First released in February 2021 with 919 samples across five languages, the collection grew to 1,137 samples in version 1.2, which added Hausa. The work originated from research into accent transfer technology aimed at improving online learning experiences for non-native English speakers.

Background

Voice technology development has largely overlooked Nigerian accents, leaving a significant gap for the hundreds of millions of people who speak English with Nigerian linguistic influence. SautiDB was created to address this gap directly. It grew out of a broader project on improving online learning experiences through accent transfer technology.

Contributions

SautiDB contributes a crowdsourced collection of Nigerian-accented English speech, recorded through a public webapp and covering native speakers of Yorùbá, Ìgbò, Ẹdó, Efik-Ibibio, Igala, and Hausa. Each recording is labelled with the speaker’s native language, fluent language, speaker ID, and gender, along with a sentence ID drawn from the CMU Arctic sentence set. The dataset also includes a metadata CSV file so that researchers can work with these labels directly.
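As a rough sketch of how the metadata file might be used, the Python snippet below counts recordings per native language and collects the distinct speaker IDs. The column names (`file_name`, `native_language`, `fluent_language`, `speaker_id`, `gender`, `sentence_id`) and the sample rows are assumptions made for illustration; the actual CSV header and values in the dataset may differ.

```python
import csv
import io
from collections import Counter

# Hypothetical header and rows, invented for illustration only.
# The real SautiDB metadata CSV may use different column names.
SAMPLE_METADATA = """\
file_name,native_language,fluent_language,speaker_id,gender,sentence_id
spk001_a0001.wav,Yorùbá,Yorùbá,spk001,female,arctic_a0001
spk001_a0002.wav,Yorùbá,Yorùbá,spk001,female,arctic_a0002
spk002_a0001.wav,Hausa,Hausa,spk002,male,arctic_a0001
spk003_a0003.wav,Ìgbò,Ìgbò,spk003,male,arctic_a0003
"""


def load_metadata(handle):
    """Read metadata rows into a list of dicts, one per recording."""
    return list(csv.DictReader(handle))


def count_by(rows, column):
    """Count recordings per distinct value of the given column."""
    return Counter(row[column] for row in rows)


rows = load_metadata(io.StringIO(SAMPLE_METADATA))
per_language = count_by(rows, "native_language")
speakers = {row["speaker_id"] for row in rows}

for language, count in per_language.most_common():
    print(f"{language}: {count} recording(s)")
print(f"Distinct speakers: {len(speakers)}")
```

To run this against a downloaded copy of the dataset, replace the in-memory sample with `open(path, newline="", encoding="utf-8")` and adjust the column names to match the real header.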

Findings

The initial release contained 919 clean samples totalling just under one hour of audio. Version 1.2 expanded the collection to 1,137 samples at just over one hour and fifteen minutes, with Hausa added as a new language. The collection is ongoing, with updated versions published as new contributions come in.
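The growth between the two releases can be worked through with a few lines of Python. The sample counts come from the text above; the durations are only stated approximately, so the round values of 60 and 75 minutes used here are assumptions and the resulting clip lengths are rough estimates.

```python
# Sample counts as stated in the text.
V1_SAMPLES = 919
V12_SAMPLES = 1137

# Assumed round durations in seconds. The text gives only
# "just under one hour" and "just over one hour and fifteen minutes".
V1_SECONDS_APPROX = 60 * 60
V12_SECONDS_APPROX = 75 * 60

added_samples = V12_SAMPLES - V1_SAMPLES
growth_pct = 100 * added_samples / V1_SAMPLES
avg_clip_v1 = V1_SECONDS_APPROX / V1_SAMPLES
avg_clip_v12 = V12_SECONDS_APPROX / V12_SAMPLES

print(f"Samples added: {added_samples}")
print(f"Growth in sample count: {growth_pct:.1f}%")
print(f"Approx. mean clip length: {avg_clip_v1:.1f} s -> {avg_clip_v12:.1f} s")
```

Under these assumptions, both releases average roughly four seconds per recording.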

Future Work

The project is designed to grow continuously. Because data is collected through the public webapp, anyone can contribute recordings, and further languages are expected to be added in future versions.
