RAPP Platform Wiki
v0.6.0
RAPP Platform is a collection of ROS nodes and back-end processes that aim to deliver ready-to-use generic services to robots
In our case we desire a flexible speech recognition module, capable of multi-language extension and good performance on limited or generalized language models. For this reason, Sphinx-4 was selected. Since we aim for speech detection services for RApps regardless of the actual robot at hand, we decided to install Sphinx-4 in the RAPP Cloud and provide services for uploading or streaming audio, as well as configuring the ASR towards language or vocabulary-specific decisions.
Before proceeding to the actual description, it should be noted that the RAPP Sphinx-4 detection was created in order to handle limited-vocabulary speech recognition in various languages. This means that it is not recommended for detecting words in free speech or words from a vocabulary larger than 10-15 words.
Before describing the actual implementation, it is necessary to investigate the Sphinx-4 configuration capabilities. In order to perform speech detection, Sphinx-4 requires an acoustic model, a dictionary containing the grapheme-to-phoneme (G2P) transformations of the supported words, and a language model (either statistical or grammar-based).
The overall Speech Detection architecture utilizing the Sphinx4 library is presented in the figure below:
[[images/sphinx_diagram.png]]
As evident, two distinct parts exist: the NAO robot and the RAPP Cloud. Let’s assume that a robotic application (RApp) deployed in the robot needs to perform speech detection. The first step is to invoke the Capture Audio service the Core agent provides, which in turn captures an audio file via the NAO microphones. This audio file is sent to the cloud RAPP ASR node in order to perform ASR. The most important module of the RAPP ASR is the Sphinx-4 wrapper. This node is responsible for receiving the service call data and configuring the Sphinx-4 software according to the request. The actual Sphinx-4 library is executed as a separate process, and the Sphinx-4 wrapper communicates with it via sockets.
Between the RAPP ASR and the Sphinx-4 wrapper lies the Sphinx-4 handler node, which is responsible for handling the actual service request. It maintains a number of Sphinx wrappers in different threads, each of which is capable of handling a different request. The Sphinx-4 handler is responsible for scheduling the Sphinx-4 wrapper threads and, for this purpose, maintains information about the state of each thread (active/idle) and each thread's previous configuration parameters. Three possible situations exist:
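Such scheduling can be sketched as follows. This is a minimal illustration, assuming hypothetical class and method names (`SphinxHandler`, `pick_thread`) and a simple string-valued configuration; the actual RAPP implementation differs:

```python
# Sketch of idle/active thread scheduling with per-thread configuration
# memory. Names and the configuration representation are illustrative.

class SphinxWrapperThread(object):
    def __init__(self):
        self.active = False
        self.configuration = None  # last language/vocabulary configuration

class SphinxHandler(object):
    def __init__(self, pool_size=4):
        self.threads = [SphinxWrapperThread() for _ in range(pool_size)]

    def pick_thread(self, configuration):
        """Return (thread, needs_reconfiguration) or (None, False)."""
        idle = [t for t in self.threads if not t.active]
        # An idle wrapper already configured for this request: reuse it as-is.
        for t in idle:
            if t.configuration == configuration:
                return t, False
        # An idle wrapper with a different configuration: reconfigure it.
        if idle:
            idle[0].configuration = configuration
            return idle[0], True
        # All wrappers are busy: the request has to wait.
        return None, False
```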
Regarding the Sphinx-4 configuration, the user is able to select the ASR language and whether they desire ASR on a limited vocabulary or on a generalized one stored in the RAPP Cloud. If a limited vocabulary is selected, the user can also define the language model (the sentences of the statistical language model or the grammar). The configuration task is performed by the Sphinx-4 Configuration module. There, the ASR language is retrieved and the corresponding language modules are employed (currently Greek, English and their combination). If the user has requested ASR on a limited vocabulary, the corresponding language module must feed the Limited vocabulary creator with the correct grapheme-to-phoneme (G2P) transformations, in order to create the necessary configuration files. In the English case, this task is easy, since Sphinx-4 provides a generalized English vocabulary which includes the words' G2P transformations. When Greek is requested, a simplified G2P method is applied, which will be discussed next. In the case where the user requests generalized ASR, the predefined generalized dictionaries are used (currently only English is supported).
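The branching described above can be sketched as follows. The function, its return values, and the dictionary labels are illustrative assumptions, not the actual RAPP code; only the language codes (`en`, `el`) come from the service documentation below:

```python
def select_configuration(language, words):
    """Sketch of the Sphinx-4 Configuration module's decision logic."""
    if words:  # a limited vocabulary was provided
        if language == "en":
            # English: G2P transformations are taken from the bundled
            # generalized English dictionary
            return {"mode": "limited", "g2p": "english-dictionary"}
        elif language == "el":
            # Greek: a simplified G2P method generates the phonemes
            return {"mode": "limited", "g2p": "greek-simplified"}
    # Generalized ASR: predefined dictionaries (currently English only)
    return {"mode": "generalized", "language": "en"}
```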
The second major task that must be performed before the actual Sphinx-4 ASR is audio preparation. This involves the SoX audio library, employed via the Audio processing node. The audio file is then provided to the Sphinx-4 Java library, and the resulting words are extracted and transmitted back to the RApp as a response to the HOP service call.
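Sphinx-4 acoustic models expect 16 kHz, 16-bit, single-channel PCM audio, so the preparation step resamples and downmixes the captured file. A minimal sketch using the standard SoX command-line flags follows; the helper functions and file paths are illustrative, not RAPP's actual code:

```python
import subprocess

def sox_command(in_path, out_path, rate=16000, bits=16, channels=1):
    """Build the SoX command line that converts in_path to the
    Sphinx-friendly format at out_path."""
    return ["sox", in_path,
            "-r", str(rate),      # target sample rate
            "-b", str(bits),      # bits per sample
            "-c", str(channels),  # channel count (mono)
            out_path]

def prepare_audio(in_path, out_path):
    """Run the conversion (requires the sox binary to be installed)."""
    subprocess.check_call(sox_command(in_path, out_path))
```

Denoising itself is handled separately by the Audio processing node, as described above.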
Regarding Greek support, a description of some basic rules of the Greek language is first presented. The Greek language has 24 letters and 25 phonemes. Phonemes are the structural sound components that define a word’s acoustic properties. Some of the Greek language's pronunciation rules follow:
Finally, there are several other trivial and rare rules that we did not take into consideration in our approach.
Let’s assume that some Greek words are available and the Sphinx-4 library must be configured to perform speech recognition on them. These words must be converted to the Sphinx-4-supported Arpabet format, which contains 39 phonemes. The individual steps followed are:
The appropriate files are then created (custom dictionary and language model) and the Sphinx-4 library is configured. Next, the audio pre-processing takes place, performing denoising similarly to the Google Speech Recognition module by employing the ROS services of the Audio processing node.
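The simplified grapheme-to-phoneme idea can be illustrated with a toy sketch: digraphs are matched before single letters, mirroring Greek rules such as "ου" producing a single vowel sound. The mapping tables below are a small illustrative subset, not RAPP's actual G2P table:

```python
# -*- coding: utf-8 -*-
# Toy Greek grapheme-to-phoneme sketch (illustrative mapping only).

DIGRAPHS = {
    u"ου": "UW", u"αι": "EH", u"ει": "IY", u"οι": "IY", u"μπ": "B",
}
LETTERS = {
    u"α": "AA", u"ε": "EH", u"ι": "IY", u"ο": "OW", u"υ": "IY",
    u"κ": "K", u"λ": "L", u"μ": "M", u"ν": "N", u"π": "P",
    u"ρ": "R", u"σ": "S", u"τ": "T",
}

def g2p(word):
    """Map a Greek word to a list of Arpabet-like phonemes,
    preferring two-letter digraphs over single letters."""
    phones, i = [], 0
    while i < len(word):
        pair = word[i:i + 2]
        if pair in DIGRAPHS:
            phones.append(DIGRAPHS[pair])
            i += 2
        elif word[i] in LETTERS:
            phones.append(LETTERS[word[i]])
            i += 1
        else:
            i += 1  # skip letters the toy table does not cover
    return phones
```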
The RAPP Speech Detection using Sphinx component diagram is depicted in the figure.
[[images/sphinx_speech_component_diagram.png]]
It should be stated that the language model is created based on the ARPA model for the sentences parameter and on JSGF for the grammar parameter; nevertheless, only pure sentences are supported (i.e. the advanced JSGF features cannot be employed). More information on the Sphinx language model can be found here.
The Sphinx4 ROS node provides a ROS service, dedicated to perform speech recognition.
Service URL: /rapp/rapp_speech_detection_sphinx4/batch_speech_to_text
Service type:
```bash
#The language we want ASR for
string language
#The limited vocabulary. If this is empty, a general vocabulary is assumed
string[] words
#The language model in the form of grammar
string[] grammar
#The language model in the form of sentences
string[] sentences
#The audio file path
string path
#The audio file type
string audio_source
#The user requesting the ASR
string user
---
#The words recognized
string[] words
#Possible error
string error
```
The speech_recognition_sphinx RPS is of type 3, since it contains a HOP service front-end that contacts a RAPP ROS node, which in turn utilizes the Sphinx-4 library. The speech_recognition_sphinx RPS can be invoked using the following URI:
Service URL: localhost:9001/hop/speech_recognition_sphinx4
The speech_recognition_sphinx RPS has several input arguments, which are encoded in JSON format in an ASCII string representation. It returns the recognized words in JSON format.
```
Input = {
  "language": "gr, en",
  "words": "[WORD_1, WORD_2 …]",
  "grammar": "[WORD_1, WORD_2 …]",
  "sentences": "[WORD_1, WORD_2 …]",
  "file": "AUDIO_FILE_URI",
  "audio_source": "nao_ogg, nao_wav_1_ch, nao_wav_4_ch, headset"
}

Output = {
  "words": "[WORD_1, WORD_2 …]",
  "error": "Possible error"
}
```
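Assembling such a request can be sketched in Python. The helper below only builds the JSON payload documented above; the commented HTTP call assumes the HOP front-end accepts a plain POST of that payload (the real client may handle the audio upload differently), and the `requests` package is a third-party dependency:

```python
import json

def build_asr_request(language, words, audio_uri, audio_source,
                      grammar=None, sentences=None):
    """Assemble the JSON input for the speech_recognition_sphinx4
    HOP service, following the structure documented above."""
    return json.dumps({
        "language": language,
        "words": words,
        "grammar": grammar or [],
        "sentences": sentences or [],
        "file": audio_uri,
        "audio_source": audio_source,
    })

# Posting it (transport details are an assumption):
#
#   import requests
#   r = requests.post(
#       "http://localhost:9001/hop/speech_recognition_sphinx4",
#       data=build_asr_request("en", ["yes", "no"],
#                              "/tmp/speech.wav", "headset"))
#   print(r.json()["words"])
```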
The request parameters are:

language
: The language to perform ASR (Automatic Speech Recognition). Supported values:

en
: English

el
: Greek (also supports English)

words[]
: The limited vocabulary from which Sphinx4 will do the matching. Individual words must be provided in the language declared in the language parameter. If left empty, a generalized vocabulary is assumed. This is valid for English, but the results are not good.

grammar[]
: A form of language model. Contains either words or sentences composed of the words declared in the words parameter. If grammar is declared, Sphinx4 will either return results that exist as-is in the grammar or <nul> if no match exists.

sentences[]
: The second form of language model. Same functionality as grammar, but Sphinx4 can return individual words contained in the provided sentences. This is essentially used to extract probabilities regarding the phonemes' succession.

file
: The audio file path.

audio_source
: Declares the source of the audio capture in order to perform correct denoising. The supported types are nao_ogg, nao_wav_1_ch, nao_wav_4_ch and headset.

user
: The user invoking the service. Must exist as a username in the database to work. A noise profile for the declared user must also exist (check the rapp_audio_processing node for the set_noise_profile service).

The returned parameters are:

words[]
: The recognized words.

error
: Possible errors.

The full documentation exists here.