Speech Synthesis in Romanian

Platform Based on Expressive Speech Synthesis in Romanian Language
for Accessing Text Information on Communication Channels

Project manager: Prof. Dragoş Burileanu

This project was implemented between 2007 and 2010 and was funded by the Romanian Government through the National Research Authority (CNCSIS “IDEI” programme), project ID 782, with University Politehnica of Bucharest as the owner.

Spoken language processing today holds a key position among Information Society technologies. The main reason for this privileged position is that speech technology can add the simplest and most natural interface to a computing environment, allowing easier information exchange between humans, or between humans and computers. As customer demand for more convenient access to a variety of services in network-based applications increases, speech synthesis technology is becoming more important for service providers. However, in many network-based applications the message that needs to be spoken cannot be predicted (e.g., e-mail and SMS messages, text database records), and the system must generate sentences from arbitrary text. This task can be accomplished only by text-to-speech (TTS) synthesis systems, which must provide at least very good intelligibility for the resulting speech to be helpful and accepted by the user.

The research project designed and implemented a telecommunication platform able to offer services based on speech synthesis in the Romanian language over current communication networks, whether mobile (2G, 3G) or fixed. The platform is based on a client-server architecture and on standard communication protocols. Such a centralized platform offers several advantages:

  • the reliability of the service can be guaranteed;
  • the end user needs only the basic functions of the terminal to access the services, which keeps costs low;
  • new services can be deployed without affecting the end user;
  • the intellectual property of the TTS technology remains protected.

The services take various text sources as input (e-mails, SMS messages, instant messages, news) and can cover the following areas of interest:

  • real-time alarms, using mobile communication networks;
  • access to text messages from standard fixed (voice-only) terminals;
  • access to text messages by persons with disabilities, and so on.

We designed a good-quality TTS system for the Romanian language and built a telecommunication-oriented platform based on this system, a client-server architecture, and standard communication protocols. The practical implementation uses only open source software products and commercial off-the-shelf (COTS) hardware, in order to minimize production costs. Because today's communication networks are quite heterogeneous and at different levels of technological maturity, and taking into account the present tendency to migrate towards IMS (“IP Multimedia Subsystem”) architectures, the proposed platform follows 3GPP (“3rd Generation Partnership Project”) / TISPAN (“Telecommunications and Internet converged Services and Protocols for Advanced Networking”) requirements. The platform fulfils the function of a media processor (MRFP, “Multimedia Resource Function Processor”) as defined by the 3GPP TS 23.228 Release 6 standard.

In order to test our TTS system in real working conditions, while still using minimal resources and without the involvement of a phone operator, we built a prototype platform that provides access to the messages stored on an e-mail server over an ordinary phone line. This platform allows the messages to be read aloud through a classical phone terminal.

The prototype platform is functionally organized as follows:

  • TTS application, realized on a Linux platform;
  • HTTP server, closely tied to the TTS system, running on the same platform;
  • Media processor, which provides the interface with the phone network through a hardware module; the operating system is Linux, and the finite state machine that implements the application is written in the Perl programming language.
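
A call-handling finite state machine of the kind run by the media processor can be sketched as follows. This is only a minimal illustration: the project's implementation was written in Perl and is not reproduced here, Python is used purely for readability, and every state name, event name, and function name below is hypothetical.

```python
from enum import Enum, auto


class State(Enum):
    """Hypothetical session states, loosely following a phone-to-e-mail session."""
    IDLE = auto()
    FETCHING_MAIL = auto()
    SYNTHESIZING = auto()
    PLAYING = auto()
    DONE = auto()


# Transition table: (current state, event) -> next state.
# Event names are assumptions made for this sketch.
TRANSITIONS = {
    (State.IDLE, "incoming_call"): State.FETCHING_MAIL,
    (State.FETCHING_MAIL, "mail_fetched"): State.SYNTHESIZING,
    (State.SYNTHESIZING, "audio_ready"): State.PLAYING,
    (State.PLAYING, "playback_finished"): State.DONE,
}


def step(state, event):
    """Return the next state; a hang-up or any unexpected event ends the session."""
    if event == "hangup":
        return State.DONE
    return TRANSITIONS.get((state, event), State.DONE)


def run_session(events):
    """Feed events through the machine and return the list of visited states."""
    state = State.IDLE
    visited = [state]
    for event in events:
        state = step(state, event)
        visited.append(state)
        if state is State.DONE:
            break
    return visited
```

Keeping the transitions in a table makes it easy to see that every event either advances the session or terminates it, which is one simple way to make sure no event is left unhandled.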

Hence, the main stages of a typical session are as follows:

  • The platform is activated by a phone call to a specific phone number. The media processor initiates the session and handles all events that may occur during execution. Furthermore, the media processor creates an instance of the connector used to attach to the e-mail server through the POP3 protocol.
  • The object that implements the connector initiates the connection to the e-mail server.
  • The most recent message is retrieved from the server.
  • The media processor closes the e-mail connector and initiates an HTTP connection to the TTS module through the HTTP server. The text extracted from the e-mail is sent to the TTS module for processing.
  • The HTTP response contains the audio message encoded as a binary stream, using the MIME (“Multipurpose Internet Mail Extensions”) standard. The result is saved on the media processor as an audio file.
  • A Perl script suite on the media processor picks up the audio file, re-encodes it, and sends it over the phone line through the hardware interface.
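
The e-mail and HTTP stages above can be sketched with the Python standard library. This is a hedged illustration rather than the project's Perl code: the endpoint `TTS_URL`, the function names, the choice to send the subject together with the body, and the assumption that the HTTP response body is the raw audio stream are all hypothetical.

```python
import poplib
import urllib.request
from email import message_from_bytes, policy

TTS_URL = "http://localhost:8080/tts"  # hypothetical endpoint of the TTS HTTP server


def fetch_latest(host, user, password):
    """Retrieve the highest-numbered (normally the most recent) message over POP3."""
    conn = poplib.POP3(host)
    try:
        conn.user(user)
        conn.pass_(password)
        count, _size = conn.stat()
        if count == 0:
            return b""
        _resp, lines, _octets = conn.retr(count)
        return b"\r\n".join(lines)
    finally:
        conn.quit()


def extract_text(raw_message):
    """Return the subject and plain-text body of a raw e-mail as one string."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    body = msg.get_body(preferencelist=("plain",))
    subject = msg["Subject"] or ""
    text = body.get_content() if body is not None else ""
    return f"{subject}\n{text}".strip()


def build_tts_request(text, url=TTS_URL):
    """Build (but do not send) the HTTP POST carrying the text to synthesize."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"Content-Type": "text/plain; charset=utf-8"},
        method="POST",
    )


def synthesize_to_file(text, path, url=TTS_URL):
    """Send the text to the TTS server and save the returned audio bytes."""
    with urllib.request.urlopen(build_tts_request(text, url)) as response:
        payload = response.read()  # assumed to be the raw audio stream
    with open(path, "wb") as audio_file:
        audio_file.write(payload)
    return len(payload)
```

A session would call `fetch_latest`, then `extract_text`, then `synthesize_to_file`, after which the saved file is ready to be re-encoded for the phone line. Re-encoding is left out because the source does not state which audio formats are involved.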

The platform described above opened the way to the next step: integrating the TTS system with existing telecom services in order to create new convergent real-time services. One such service would be real-time instant messaging with bidirectional voice-text media translation over existing telecommunication networks.