Speech Datasets

ROMANIAN READ-SPEECH CORPUS (RSC)

License

Licensed under Creative Commons BY-NC-ND 4.0.

Description

“RSC” is a read speech corpus collected by the Speech and Dialogue Research Laboratory. The recordings were made under different conditions (various microphones and various audio recording systems), using an online audio recording application developed by the same research group. The speakers were mainly students and staff of the Faculty of Electronics, Telecommunications and Information Technology of the University “Politehnica” of Bucharest.

The corpus consists of 136,120 audio files collected from 164 native speakers of Romanian. The audio files contain Romanian utterances taken from literature and online news, as well as isolated words. There are between 130 and 11,000 audio files per speaker. The total size of the database is around 100 hours. The average length of an utterance is 2.6 seconds.

“RSC” is split into training and evaluation sets, as follows:

training set: 133,616 files from 156 speakers
evaluation set: 2,504 files from 21 speakers (out of which 13 speakers are also part of the training set)

Note: the above overlap occurs only in terms of speakers (voices), not in terms of utterances.
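The split figures above can be cross-checked with simple arithmetic. The Python sketch below uses only the numbers stated in this description; the variable names are illustrative and are not part of the corpus distribution.

```python
# Figures taken from the RSC description; variable names are illustrative only.
train_files, train_speakers = 133_616, 156
eval_files, eval_speakers = 2_504, 21
shared_speakers = 13  # speakers present in both sets

# Utterances do not overlap between the sets, so file counts simply add up.
total_files = train_files + eval_files

# Speakers do overlap, so shared voices are counted once (inclusion-exclusion).
total_speakers = train_speakers + eval_speakers - shared_speakers

# Speakers that appear only in the evaluation set (voices unseen in training).
eval_only_speakers = eval_speakers - shared_speakers

# Rough total duration from the stated average utterance length of 2.6 s.
approx_hours = total_files * 2.6 / 3600

print(total_files, total_speakers, eval_only_speakers, round(approx_hours, 1))
```

The totals come out at 136,120 files and 164 speakers, matching the corpus description, with 8 evaluation speakers whose voices never occur in training. The estimated duration of about 98 hours is consistent with the stated "around 100 hours".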

If you use this corpus in your research, please cite one of the following papers:

Alexandru-Lucian Georgescu, Horia Cucu, Andi Buzo, Corneliu Burileanu, “RSC: A Romanian Read Speech Corpus for Automatic Speech Recognition,” submitted to the 12th International Conference on Language Resources and Evaluation, 2020.
Alexandru-Lucian Georgescu, Horia Cucu, Corneliu Burileanu, “SpeeD’s DNN Approach to Romanian Speech Recognition,” in the Proceedings of the 9th Conference on Speech Technology and Human-Computer Dialogue (SpeD), Bucharest, 2017, 8p, ISBN 978-1-5090-6496-0.
Horia Cucu, Andi Buzo, Lucian Petrică, Dragoş Burileanu, Corneliu Burileanu, “Recent Improvements of the SpeeD Romanian LVCSR System,” in the Proceedings of the 10th International Conference on Communications (COMM), Bucharest, 2014, pp. 111-114.

Please contact horia.cucu@upb.ro for download details.

SPONTANEOUS SPEECH CORPUS evaluation set 1 (SSC-eval1) – version 2

License

Licensed under Creative Commons BY-NC-ND 4.0.

Version update

Version 2 of the dataset comprises corrected transcripts and was released at the beginning of 2020. All results reported on this dataset prior to 2020 were obtained on version 1. Results reported in 2020 or later are expected to use version 2. Further information regarding the update from version 1 to version 2 can be found in the TADARAV 2020 Technical report: http://tadarav.speed.pub.ro/storage/rapoarte/41.1._RST_in_extenso_TADARAV_2020_v3.pdf.

Description

SSC-eval1 is a spontaneous speech corpus collected by the Speech and Dialogue Research Laboratory. The recordings were collected from several online TV and radio stations and were annotated manually. Evaluation set 1 comprises 3,035 audio files with an average length of 4.15 seconds. The total size of the corpus is around 3.5 hours. The audio files comprise news recordings in various acoustic conditions.

The SSC-eval1 speech corpus comprises a “clean” part, recorded in ideal acoustic conditions, and an “other” part, recorded in suboptimal acoustic conditions:

“clean” part: 1,935 files, 2.2 hours of speech recorded in ideal acoustic conditions;
“other” part: 1,100 files, 1.3 hours of speech recorded in noisy or degraded acoustic conditions.

If you use this corpus in your research, please cite the following paper:

Alexandru-Lucian Georgescu, Horia Cucu, Corneliu Burileanu, “Improvements of SpeeD’s Romanian ASR system during ReTeRom project,” in the Proceedings of the 11th Conference on Speech Technology and Human-Computer Dialogue (SpeD), Bucharest, 2021.

Please contact horia.cucu@upb.ro for download details.

SPONTANEOUS SPEECH CORPUS evaluation set 2 (SSC-eval2)

License

Licensed under Creative Commons BY-NC-ND 4.0.

Description

SSC-eval2 is a spontaneous speech corpus collected by the Speech and Dialogue Research Laboratory. The recordings were collected from several online TV and radio stations and were annotated manually. Evaluation set 2 comprises 100 audio files with an average length of 54 seconds. The total size of the corpus is around 1.5 hours. The audio files comprise news recordings in various acoustic conditions.

If you use this corpus in your research, please cite the following paper:

Alexandru-Lucian Georgescu, Horia Cucu, Corneliu Burileanu, “Improvements of SpeeD’s Romanian ASR system during ReTeRom project,” in the Proceedings of the 11th Conference on Speech Technology and Human-Computer Dialogue (SpeD), Bucharest, 2021.

Please contact horia.cucu@upb.ro for download details.

CHAMBER OF DEPUTIES SPEECH CORPUS evaluation set (CDP-eval)

License

Licensed under Creative Commons BY-NC-ND 4.0.

Description

CDP-eval is a speech corpus extracted from video recordings of the sittings of the Chamber of Deputies of Romania held between January 2003 and February 2019. It comprises 300 files of around 1 minute each, for a total of 5 hours of manually annotated speech.

If you use this corpus in your research, please cite the following paper:

Alexandru-Lucian Georgescu, Horia Cucu, Corneliu Burileanu, “Improvements of SpeeD’s Romanian ASR system during ReTeRom project,” in the Proceedings of the 11th Conference on Speech Technology and Human-Computer Dialogue (SpeD), Bucharest, 2021.

Please contact horia.cucu@upb.ro for download details.

KITE – A SPEECH DATABASE FOR UAV CONTROL

“Kite” is a multi-modal dataset for the control of unmanned aerial vehicles (UAVs).
Please see the Kite website for details and download information.

RODIGITS SPEECH CORPUS

License

Licensed under Creative Commons BY-NC-ND 4.0.

Description

The “RoDigits” speech corpus was collected by the Speech and Dialogue Research Laboratory. The recordings were made under different conditions (various microphones and various audio recording systems), using an online audio recording application developed by the same research group. The speakers were mainly students of the Faculty of Electronics, Telecommunications and Information Technology of the University “Politehnica” of Bucharest.

The corpus consists of 15,389 audio files collected from 154 native speakers of Romanian. Each audio file contains an utterance of 12 random digits [0-9] spoken in Romanian. There are generally 100 audio files per speaker, with the exception of 11 speakers for whom the corpus comprises only 99 audio files each. The total size of the database is around 38 hours. The average length of an utterance is 8.7 seconds.

The “RoDigits” speech corpus is split into training, development and evaluation sets, as follows:

training set: 11,120 files – 80 files from each of 139 speakers (file IDs 1-50 and 71-100)
development set: 2,780 files – 20 files from each of 139 speakers (file IDs 51-70)
evaluation set: 1,489 files – about 100 files from each of 15 speakers
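The file-ID rule above can be sketched in Python as follows. This is a minimal illustration, not the authors' tooling: the function name `rodigits_split`, the integer `file_id` argument and the `is_eval_speaker` flag are assumptions made for the example, since the description does not specify how file IDs are encoded or how the 15 evaluation speakers are identified.

```python
def rodigits_split(file_id: int, is_eval_speaker: bool) -> str:
    """Return 'train', 'dev' or 'eval' for one RoDigits recording.

    Assumptions (hypothetical, not from the corpus documentation):
    file_id is the integer 1..100 identifying a speaker's recording, and
    the caller already knows whether the speaker is one of the 15
    evaluation speakers.
    """
    if not 1 <= file_id <= 100:
        raise ValueError(f"file ID out of range: {file_id}")
    if is_eval_speaker:
        return "eval"   # all files of the 15 evaluation speakers
    if 51 <= file_id <= 70:
        return "dev"    # file IDs 51-70
    return "train"      # file IDs 1-50 and 71-100


# Sanity check against the published counts for the 139 non-evaluation speakers.
per_speaker = [rodigits_split(i, is_eval_speaker=False) for i in range(1, 101)]
train_total = per_speaker.count("train") * 139
dev_total = per_speaker.count("dev") * 139
print(train_total, dev_total)
```

With 80 training IDs and 20 development IDs per speaker, the check reproduces the 11,120 and 2,780 files listed above. Together with the 1,489 evaluation files, the three sets add up to the 15,389 files of the full corpus.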

If you use this corpus in your research, please cite the following paper:

Alexandru Lucian Georgescu, Alexandru Caranica, Horia Cucu, Corneliu Burileanu, “RoDigits – a Romanian connected-digits speech corpus for automatic speech and speaker recognition,” in University “Politehnica” of Bucharest Scientific Bulletin, Series C, vol. 80, issue 3, pp. 45-62, Bucharest, 2018, ISSN: 2286-3540.

Download RoDigits Speech Corpus (pass: rodigits)

Note: a first version of the corpus, available online before December 8, 2017, contained some corrupted files. If you downloaded the corpus before this date, please download the corrected version, which is now available.