Research Projects

Current Research Projects


Intelligent Systems for Video and Audio Analysis (SPIA-VA)

The SPIA-VA project is funded by the Romanian Government through UEFISCDI, under the programme “Solutions – Technologies and Innovative Video Systems for Person Re-Identification and Analysis of Dissimulated Behavior”. The project consortium is formed by four partners: (1) University Politehnica of Bucharest, through the Center for Advanced Research on New Materials, Products and Innovative Processes (CAMPUS), as Project Coordinator, (2) UTI Grup, a Romanian IT&C company, as Partner #1, (3) the Military Equipment and Technologies Research Agency of the Romanian Ministry of National Defence, as Partner #2, and (4) the Protection and Guard Service, as public beneficiary of the project results. The project started in May 2017 and is expected to be completed by April 2020.

The main objectives of SPIA-VA are: (1) the development of a system for automatic person re-identification, capable of building a virtual profile of the target from voice characteristics, spoken keywords, movement topology, analysis of the main physiological characteristics, and event detection; (2) the development of a system for automatic detection of dissimulated behavior, capable of identifying emotions and the associated physiological characteristics; (3) the development of a system for automatic analysis of spoken Romanian, enabling lipreading and voice-to-text conversion. The main outcomes of SPIA-VA consist of algorithms and software tools for multimedia data processing and analysis, algorithms and tools for artificial intelligence and machine learning, and software tools and services for person re-identification, analysis of dissimulated behavior, and interpretation of spoken Romanian.


Resources and Technologies for Developing Human-Machine Interfaces in Romanian (ReTeRom)

The ReTeRom project is funded by the Romanian Government through the Romanian Ministry of Research and Innovation, PCCDI – UEFISCDI. The project consortium is formed by four partners: (1) the Research Institute for Artificial Intelligence “Mihai Drăgănescu”, as Project Coordinator, (2) University Politehnica of Bucharest, through the Speech and Dialogue Research Laboratory, as Partner #1, (3) the Technical University of Cluj-Napoca, as Partner #2, and (4) the University “Alexandru Ioan Cuza” of Iași, as Partner #3. The project started in March 2018 and is expected to be completed by November 2022.

The ReTeRom project aims at integrating four component projects: COBILIRO (Multi-level annotated bimodal Corpus for Romanian), TEPROLIN (Technologies for processing natural language – text), TADARAV (Technologies for automatic annotation of audio data and for the creation of automatic speech recognition interfaces) and SINTERO (Technologies for the realization of human-machine interfaces for text-to-speech synthesis with expressivity).

ReTeRom – TADARAV (Technologies for automatic annotation of audio data and for the creation of automatic speech recognition interfaces) is the sub-project coordinated and developed by the Speech and Dialogue (SpeeD) research team. The main purpose of the project is the design, implementation and validation of automated annotation technologies for speech units. TADARAV primarily aims at developing a set of advanced technologies for generating transcriptions correctly aligned with the voice signal of the corpus collected in the COBILIRO component project. As a side effect, the project aims to increase the accuracy of SpeeD’s current automatic speech recognition system by retraining its acoustic model on the entire collected speech corpus and by using the more powerful language models generated in the TEPROLIN component project.


Enhanced Text to Speech (TTS) synthesis in Romanian

Text-to-speech synthesis has been an important area of research for SpeeD over the last 15 years. Several versions of a Romanian-language TTS system were built successively in order to improve the performance of the constituent modules and, consequently, the overall quality of the system. The work was split into two directions, and a series of achievements has been accomplished regarding both the Natural Language Processing (NLP) stage and the speech generation techniques.

Currently, the system’s most important NLP sub-stages are: diacritic restoration; preprocessing and normalization (including acronym/abbreviation expansion, proper-name detection, and sentence boundary detection); syllabification; letter-to-phone conversion; lexical stress positioning; and prosody prediction. Our team has worked continuously to improve all the NLP modules, using the methods that yield the best possible results, while also extending the base of Romanian linguistic resources for TTS purposes. Another issue currently being addressed is the development of a new, efficient prosody model for the NLP stage.
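To illustrate how such a front end fits together, the sketch below chains two of the sub-stages listed above (preprocessing/normalization and letter-to-phone conversion). All rules, abbreviation entries, and phone symbols are toy placeholders for illustration only, not SpeeD's actual models:

```python
# Toy sketch of a TTS text-processing (NLP) front end for Romanian.
# The sub-stage names follow the ones listed above; the rules are invented.

ABBREVIATIONS = {"str.": "strada", "dr.": "doctor"}  # hypothetical entries

def normalize(text: str) -> list[str]:
    """Preprocessing/normalization: lowercase, tokenize, expand abbreviations."""
    return [ABBREVIATIONS.get(tok, tok) for tok in text.lower().split()]

def letter_to_phone(word: str) -> list[str]:
    """Letter-to-phone conversion: Romanian spelling is largely phonemic,
    so a per-letter mapping is a reasonable toy baseline."""
    mapping = {"c": "k", "ă": "@", "ș": "sh", "ț": "ts"}
    return [mapping.get(ch, ch) for ch in word]

def front_end(text: str) -> list[list[str]]:
    """Run the (abridged) pipeline: normalization -> letter-to-phone."""
    return [letter_to_phone(w) for w in normalize(text)]

print(front_end("Str. Mare"))  # [['s','t','r','a','d','a'], ['m','a','r','e']]
```

A real front end would insert the remaining sub-stages (syllabification, stress positioning, prosody prediction) between these two steps, each consuming the previous stage's output.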

We are also developing two different speech engines. The first uses a classic TD-PSOLA algorithm based on acoustic segment concatenation with multiple instances of non-uniform speech units (diphones and polyphones, the latter to handle a number of difficult vowel-semivowel transitions), labeled offline according to contextual and phonetic-prosodic information from the recorded speech corpus; this engine uses a two-stage unit selection procedure to generate the speech signal. The second engine uses a statistical parametric synthesis technique based on Hidden Markov Models (HMMs).
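The unit-selection idea behind the first engine can be sketched as a Viterbi search: each target unit has several recorded candidate instances, and the search picks the candidate sequence minimizing the sum of target costs and concatenation costs. The units and cost functions below are toy stand-ins, not SpeeD's actual features:

```python
# Minimal unit-selection sketch: Viterbi search over candidate units.

def select_units(targets, candidates, target_cost, concat_cost):
    """targets: list of target unit specs; candidates: per-target candidate lists.
    Returns the cheapest candidate sequence."""
    n = len(targets)
    # best[i][j] = (cumulative cost reaching candidate j of target i, backpointer)
    best = [{j: (target_cost(targets[0], c), None)
             for j, c in enumerate(candidates[0])}]
    for i in range(1, n):
        layer = {}
        for j, c in enumerate(candidates[i]):
            cost, back = min(
                (best[i - 1][k][0] + concat_cost(candidates[i - 1][k], c), k)
                for k in best[i - 1])
            layer[j] = (cost + target_cost(targets[i], c), back)
        best.append(layer)
    # Backtrace from the cheapest final candidate.
    j = min(best[-1], key=lambda k: best[-1][k][0])
    path = [j]
    for i in range(n - 1, 0, -1):
        j = best[i][j][1]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]

# Toy usage: "units" are numbers, both costs are absolute differences.
print(select_units([1, 2, 3], [[1, 5], [9, 2], [3, 0]],
                   lambda t, c: abs(t - c),
                   lambda a, b: abs(a - b)))  # [1, 2, 3]
```

In a real engine, target cost compares the candidate's contextual and prosodic labels with the target specification, while concatenation cost measures the spectral mismatch at the join.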


Enhanced Large Vocabulary Continuous Speech Recognition (LVCSR) for Romanian

Although Automatic Speech Recognition (ASR), i.e. transforming a speech signal into text, has been an important research direction since the 1970s, current academic and commercial systems are truly efficient only under specific conditions: medium vocabularies, small speech/speaker variability, lack of background noise, etc. For high-resourced languages, such as English, French, or Mandarin, the performance of LVCSR systems under ideal conditions is much higher than for languages disadvantaged by the lack of resources and the small number of speech researchers, such as Romanian.

In 2008, SpeeD started an intense research effort aimed at developing the first LVCSR system for Romanian. A prototype of this system was released in October 2011 and has been available online ever since. Our current objective is to enhance this speech recognition system, improving its recognition rate and making it more robust to speech/speaker variability, background noise, etc.
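The recognition rate mentioned above is conventionally measured via the word error rate (WER): the word-level edit distance between the reference transcript and the ASR hypothesis, divided by the reference length. A minimal sketch (the example sentences are invented):

```python
# Word error rate via word-level edit distance (single rolling row).

def wer(ref: str, hyp: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    r, h = ref.split(), hyp.split()
    d = list(range(len(h) + 1))
    for i, rw in enumerate(r, 1):
        prev_diag, d[0] = d[0], i
        for j, hw in enumerate(h, 1):
            prev_diag, d[j] = d[j], min(prev_diag + (rw != hw),  # sub/match
                                        d[j] + 1,                # deletion
                                        d[j - 1] + 1)            # insertion
        # prev_diag now holds d[i-1][len(h)], unused past the last column
    return d[-1] / len(r)

print(wer("the cat sat", "the cat sat"))    # 0.0
print(wer("the cat sat", "a cat sat down")) # 1 sub + 1 ins over 3 words ≈ 0.667
```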


Spoken Term Detection (STD) for under-resourced languages

Spoken Term Detection is a relatively new research direction (introduced in 2006) that aims at finding spoken content within a speech database by using a spoken query. STD systems are especially useful for under-resourced languages for which no phonetic dictionaries are available. In 2012 SpeeD participated in the Spoken Web Search competition (part of the MediaEval Benchmarking Initiative) and created its first STD system. Since then, we have continued research in this direction, aiming to improve the performance of our STD system and to participate in the 2013 competition.

Our current approach involves adapting our Romanian ASR system to any other under-resourced language and then performing ASR on both the query and the speech database. Once the speech data is converted into text, the problem becomes one of text search. The main remaining difficulty stems from the inaccuracy of the ASR system: one has to search for an approximate text query in an approximate text database. Obviously, higher ASR accuracy yields higher search performance. The project aims at building accurate acoustic models for under-resourced languages and providing efficient search algorithms for approximate text databases.
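The "approximate query in approximate database" search can be sketched as word-level approximate substring matching: edit distance where the query may start and end anywhere in the transcript. The example words below are invented ASR output, not real data:

```python
# Approximate substring search at word level: the query aligns against
# any contiguous span of the transcript (free start/end in the text).

def approx_find(query: str, text: str) -> int:
    """Minimum word-level edit distance between the query and any span of text."""
    q, t = query.split(), text.split()
    prev = [0] * (len(t) + 1)              # row 0: free start anywhere in text
    for i, qw in enumerate(q, 1):
        cur = [i] + [0] * len(t)
        for j, tw in enumerate(t, 1):
            cur[j] = min(prev[j - 1] + (qw != tw),  # match/substitute
                         prev[j] + 1,               # delete a query word
                         cur[j - 1] + 1)            # insert a text word
        prev = cur
    return min(prev)                       # free end: best span anywhere

print(approx_find("buna ziua", "asr output buna zua domnule"))  # 1 (ziua ~ zua)
print(approx_find("buna ziua", "x buna ziua y"))                # 0 (exact span)
```

A hit is then declared when this distance falls below a threshold, which trades precision against recall.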


Speaker recognition system

Speaker recognition is a generic name covering two distinct applications: speaker verification and speaker identification. Speaker verification systems have to decide whether a speech utterance belongs to a claimed speaker or not. Speaker identification systems have to determine which speaker uttered a given speech signal. In both cases the speakers' characteristics are usually modelled statistically, most commonly with Gaussian Mixture Models (GMMs). Common speech features, such as MFCC or PLP coefficients, are used to model the acoustic characteristics of the speakers. The objective of the project is to build a GMM-based speaker identification system with state-of-the-art performance.
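The identification step can be sketched as follows, assuming each speaker already has a trained diagonal-covariance GMM (real systems estimate the parameters with EM on MFCC/PLP features; the speaker names and parameters below are toy values):

```python
import numpy as np

def gmm_loglik(X, weights, means, variances):
    """Total log-likelihood of features X (T x D) under a
    diagonal-covariance GMM with K components."""
    diff = X[:, None, :] - means[None, :, :]                  # (T, K, D)
    quad = np.sum(diff ** 2 / variances[None, :, :], axis=2)  # (T, K)
    log_det = np.sum(np.log(variances), axis=1)               # (K,)
    D = X.shape[1]
    log_comp = -0.5 * (D * np.log(2 * np.pi) + log_det + quad) + np.log(weights)
    m = log_comp.max(axis=1, keepdims=True)                   # log-sum-exp
    return float(np.sum(m.squeeze(1) + np.log(np.exp(log_comp - m).sum(axis=1))))

def identify(X, speaker_models):
    """Pick the speaker whose GMM best explains the utterance."""
    return max(speaker_models, key=lambda s: gmm_loglik(X, *speaker_models[s]))

# Toy models: one component each, 2-D features (placeholders for MFCCs).
models = {
    "alice": (np.array([1.0]), np.array([[0.0, 0.0]]), np.array([[1.0, 1.0]])),
    "bob":   (np.array([1.0]), np.array([[5.0, 5.0]]), np.array([[1.0, 1.0]])),
}
X = np.array([[0.1, -0.2], [0.3, 0.1]])  # frames near alice's mean
print(identify(X, models))  # alice
```

For verification rather than identification, the same log-likelihood is typically compared against that of a background model and thresholded.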