Noise-robust, domain-adaptable, large-vocabulary automatic speech recognition system for the Romanian language (LVCSR-ROM)

The LVCSR-ROM project was funded by the Romanian-American Foundation through the Applied Research, Technological Innovation and Entrepreneurship (ARTIE) Fellowship Program. The project proposal ranked first out of more than 150 submitted proposals in the ARTIE Proof of Concept (POC) 2013 Call for Proposals. The project was implemented between October 2013 and July 2014.

The implementation team consisted of Lect. Horia Cucu (project manager), Prof. Dragoș Burileanu, Lect. Andi Buzo, and Lect. Lucian Petrică.

Summary

The main goal of this project was to develop a Rich Speech Transcription (RST) service for audio documents. The final outcome of the project is a web service that enables individuals to access the textual content of an audio document (news bulletin, interview, lecture, meeting recording, etc.) without listening to it. This feature is of critical importance in many applications, such as multimedia database indexing and retrieval, real-time radio/TV monitoring, and transcription of self-recorded documents.

The RST service is based on the first speaker-independent, large-vocabulary continuous speech recognition (LVCSR) system for Romanian, developed by our research laboratory in 2011. Developing the RST service involved enhancing the LVCSR system and adapting it to the particularities of multimedia document transcription. To achieve the main objective, the existing LVCSR system was augmented with several modules:

a speech enhancement module that reduces the noise effect on the transcription accuracy,
a speaker diarization module that divides the speech signal into segments (based on the speaker who uttered the speech) and identifies the speaker (from a set of previously known speakers),
a text post-processing module that formats paragraphs, numbers, dates, etc. and restores diacritics, punctuation marks and capital letters, increasing the intelligibility of the output text,
better acoustic and language models that improve the accuracy of the system.

The first version of the service transcribes the Romanian speech within multimedia documents into text, while future versions may be adapted for other under-resourced languages as well. As opposed to high-resourced languages, such as English, Spanish or Mandarin Chinese, under-resourced languages are those for which the available acoustic, phonetic and linguistic databases are insufficient for the straightforward development of spoken language technology (SLT) systems and applications.

We believe that adapting the system to other under-resourced languages will have an important social and economic impact, because for many such languages, there are currently no automatic solutions for speech transcription. In this context, the continuous growth of multimedia production, sharing and consumption leaves us with large multimedia databases that cannot be efficiently accessed and exploited. Their content can only be classified and accessed based on metadata and this is insufficient when one wants to find multimedia documents on specific topics or sub-topics. Moreover, complete and rich transcriptions of these multimedia documents can only be generated manually and this is a non-scalable and time/cost inefficient process.

The beneficiaries of this service could be: a) the individuals and companies that need to transcribe multimedia documents, b) the companies that possess large, un-annotated multimedia databases and have no means of efficiently accessing and exploiting them and c) individual users of public multimedia libraries and online multimedia-sharing websites.

Achievements

A functional and solid proof-of-concept. At the end of the project we were able to deliver a proof-of-concept in a form that is very close to the final commercial service. The user connects to an RST server via a web-based client, uploads audio files with speech content in different formats (wav and mp3), and receives each transcription as soon as it is available. With the increased accuracy and intelligibility, the transcription is very close to its final form, leaving very few corrections for the user. The user can follow the transcription while listening to the audio recording: as the recording is played, the corresponding words in the transcription are highlighted.
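The synchronized highlighting described above can be illustrated with a minimal sketch. It assumes the recognizer returns a start and end time for every word, which this page does not state, and the function name `word_index_at` and the tuple layout are hypothetical. It is written in Python for illustration only and is not the web client's actual code.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple

# Hypothetical layout: (start_seconds, end_seconds, word), sorted by start
# time and non-overlapping. The real service's output format is not
# described in the source.
TimedWord = Tuple[float, float, str]


def word_index_at(words: List[TimedWord], t: float) -> Optional[int]:
    """Return the index of the word spoken at playback time t, or None
    if t falls in a pause or outside the recording."""
    starts = [start for start, _end, _word in words]
    # Last word whose start time is <= t.
    i = bisect_right(starts, t) - 1
    if i >= 0 and t < words[i][1]:
        return i
    return None
```

A player would call `word_index_at` on every playback time update and highlight the returned word. For long recordings the `starts` list could be computed once rather than on each call.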

WER reduction. The improvements brought by the various activities led to a relative word error rate (WER) reduction of 13.8% for read speech and 12.7% for spontaneous speech.
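For readers unfamiliar with the metric, the sketch below shows the standard definition of WER (word-level edit distance divided by the number of reference words) and how a relative reduction is derived from two WER values. The function names are illustrative, and this is not the project's own scoring tool. The absolute WER values behind the figures above are not given on this page.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_reduction(baseline_wer: float, improved_wer: float) -> float:
    """Fraction of the baseline WER that the improved system removed."""
    if baseline_wer <= 0:
        raise ValueError("baseline WER must be positive")
    return (baseline_wer - improved_wer) / baseline_wer
```

As a purely hypothetical illustration, a drop from 20.0% to 17.24% WER is a 2.76-point absolute reduction, which is a 13.8% relative reduction.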

Increased intelligibility. The transcriptions are organized in paragraphs, contain diacritics and punctuation marks, and have appropriate true casing. Dates and numbers are converted from their spelled-out form to conventional numeric format. If the audio recording contains multiple speakers, the output is formatted in a dialogue-like style with speaker IDs.
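The dialogue-like layout can be sketched as follows. This assumes the diarization step yields an ordered list of (speaker ID, text) segments. The function name `format_dialogue`, the segment layout and the `ID: text` output style are all hypothetical, since the service's actual output format is not specified here.

```python
from typing import List, Tuple


def format_dialogue(segments: List[Tuple[str, str]]) -> str:
    """Merge consecutive segments from the same speaker and render one
    'SPEAKER_ID: text' line per speaker turn."""
    turns: List[Tuple[str, str]] = []
    for speaker, text in segments:
        text = text.strip()
        if not text:
            continue  # skip empty segments (e.g. silence)
        if turns and turns[-1][0] == speaker:
            # Same speaker keeps talking: extend the current turn.
            turns[-1] = (speaker, turns[-1][1] + " " + text)
        else:
            turns.append((speaker, text))
    return "\n".join(f"{speaker}: {text}" for speaker, text in turns)
```

Merging adjacent segments from the same speaker keeps one line per turn, even when the diarizer splits a long utterance into several pieces.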

Dissemination. The results of the project have been published at international scientific conferences, and a page has been created on the laboratory’s official website to make the activities and achievements of the project public.

Patent registration. The intellectual property has been protected by registering two patent applications at the Romanian State Office for Inventions and Trademarks (OSIM). The applications cover inventions related to automatic diacritics restoration and real-time diarization.

Team consolidation. Beyond the technical achievements, the project helped the team grow in terms of management, business skills, expertise and team spirit. The team has clearly evolved from a scientifically oriented one into a more business-oriented one. Optimism is high because the team feels capable of closing the gap between the proof-of-concept and a final commercial service.

Deliverables

RST Service Demo. The proof-of-concept demo is available online. Usage credentials are provided on request (please email us at horia.cucu@upb.ro). A few audio recording samples with reference transcriptions are available here and here.

Scientific papers. We wrote and submitted two papers to the COMM 2014 International Conference, two papers to the SLTU 2014 International Conference and two papers to the Interspeech 2014 International Conference. The submitted papers illustrate the various research activities and results obtained within the ARTIE-POC project: ASR noise robustness, speaker diarization, unsupervised acoustic modeling, and diacritics restoration methodology. Of these six papers, five were published or accepted for publication:

Horia Cucu, Andi Buzo, Lucian Petrică, Dragoş Burileanu and Corneliu Burileanu, “Recent Improvements of the SpeeD Romanian LVCSR System”, Proceedings of COMM, pp. 111-114, 2014.
Andi Buzo, Horia Cucu, Lucian Petrică and Dragoş Burileanu, “An Automatic Speech Recognition Solution with Speaker Identification Support”, Proceedings of COMM, pp. 119-122, 2014.
Lucian Petrică, Horia Cucu, Andi Buzo, and Corneliu Burileanu, “A robust diacritics restoration system using unreliable raw text data”, Proceedings of SLTU, pp. 215-221, 2014.
Horia Cucu, Andi Buzo and Corneliu Burileanu, “Unsupervised acoustic model training using multiple seed ASR systems”, Proceedings of SLTU, pp. 124-130, 2014.
Horia Cucu, Andi Buzo, Laurent Besacier, Corneliu Burileanu, “Enhancing ASR Systems for Under-Resourced Languages through a Novel Unsupervised Acoustic Model Training Technique”, submitted to Interspeech 2014, but rejected.
Valentin Andrei, Corneliu Burileanu, Horia Cucu, Andi Buzo, “Detecting the number of competing speakers – human selective hearing versus spectrogram distance based estimator”, accepted for publication at Interspeech 2014.

Patent applications. We have also filed two patent applications with the Romanian State Office for Inventions and Trademarks (OSIM), covering methods for automatic diacritics restoration and real-time diarization:

Andi Buzo, Horia Cucu, Lucian Petrică and Dragoş Burileanu, “Metodă și sistem pentru diarizare în timp real a semnalelor audio, utilizate pentru recunoașterea automată a vorbirii și a vorbitorului” (Method and system for real-time diarization of audio signals, with applications in automatic speech and speaker recognition), patent application registered at OSIM, no. A2014/00346.
Lucian Petrică, Horia Cucu and Andi Buzo, “Metodă pentru restaurarea automată a semnelor diacritice, folosind texte achiziționate electronic, utilizată în procesarea limbajului natural” (Automatic diacritics restoration method using electronically collected texts with applications in natural language processing), patent application registered at OSIM, no. A2014/00347.