Ludusan, Ioan Bogdan (2010) Beyond short units in speech recognition: a syllable centric and prominence based approach. [Doctoral thesis] (Unpublished)
|Item Type:||Doctoral thesis|
|Uncontrolled Keywords:||speech recognition; syllable; prominence|
|Date Deposited:||02 Dec 2010 15:33|
|Last Modified:||30 Apr 2014 19:45|
Current state-of-the-art speech recognition systems operate well under normal conditions, but the framework on which they are built has reached its performance ceiling for speech. Furthermore, these systems use sub-phonetic units, with features computed from short analysis frames (usually 20-30 ms), which do not capture well the variation present in the speech signal. The work presented in this thesis consisted of developing a speech recognition system based on long-term units. The unit chosen for recognition was the syllable, because it is better able to capture the variation caused by coarticulation effects. The proposed system is composed of three blocks: an automatic segmentation procedure, a classification stage, and a decoding algorithm. The segmentation procedure uses a rule-based approach to segment the continuous speech signal into syllables, employing as features the energy of the signal, its pitch, and harmonicity information. Syllable nuclei are searched for at energy maxima, while syllable boundaries are placed at the energy minima between consecutive syllable nucleus candidates. The classification stage uses a multi-class Support Vector Machine classifier to compute the acoustic probability of the segments, given the syllable classes. Several kinds of features were used to train the models: spectral, prosodic, and long-term information. The best model was obtained as a combination of Mel Frequency Cepstral Coefficients and modulation spectrogram features, showing that features representing different time spans offer complementary information. The probabilities of the classifier, along with linguistic information coming from an N-gram language model, are then employed in a search algorithm to obtain the most likely syllable sequence. Because the segmentation stage produces errors, the decoding algorithm was modified so that it can recover from them.
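The segmentation rule described above (nuclei at energy maxima, boundaries at the energy minima between consecutive nuclei) can be illustrated with a minimal sketch. It is not the thesis implementation: it uses a frame-level energy contour only and ignores the pitch and harmonicity features, and the function names and the `min_energy` threshold are hypothetical choices made for illustration.

```python
from typing import List, Tuple


def find_nuclei(energy: List[float], min_energy: float = 0.1) -> List[int]:
    """Return frame indices of local energy maxima at or above min_energy.

    min_energy is a hypothetical threshold, not a value from the thesis.
    """
    nuclei = []
    for i in range(1, len(energy) - 1):
        is_peak = energy[i] > energy[i - 1] and energy[i] >= energy[i + 1]
        if is_peak and energy[i] >= min_energy:
            nuclei.append(i)
    return nuclei


def find_boundaries(energy: List[float], nuclei: List[int]) -> List[int]:
    """Place one boundary at the energy minimum between consecutive nuclei."""
    boundaries = []
    for left, right in zip(nuclei, nuclei[1:]):
        span = energy[left:right + 1]
        boundaries.append(left + span.index(min(span)))
    return boundaries


def segment_syllables(energy: List[float],
                      min_energy: float = 0.1) -> List[Tuple[int, int]]:
    """Split the contour into (start, end) frame ranges, end exclusive."""
    nuclei = find_nuclei(energy, min_energy)
    edges = [0] + find_boundaries(energy, nuclei) + [len(energy)]
    return list(zip(edges[:-1], edges[1:]))


# Toy energy contour with three peaks (hypothetical data).
energy = [0.0, 0.5, 1.0, 0.4, 0.1, 0.6, 0.9, 0.3, 0.05, 0.7, 0.2]
segments = segment_syllables(energy)
```

If no nucleus is found, the sketch returns the whole contour as a single segment. A real system would also need smoothing and the additional cues mentioned in the abstract to reject spurious peaks.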
Several approaches for recovering from segmentation errors were evaluated, and considerable improvement was obtained. Although our system does not reach the same recognition rates as state-of-the-art systems, it achieves performance similar to that of other recognizers that use more complex architectures and significantly more knowledge sources. In the last part of the thesis, the role of syllabic prominence in speech recognition was explored. Two systems for the detection of prominent syllables were developed and tested on Italian and French, obtaining good results. To incorporate information about syllabic prominence into the recognition process, we departed from the classical left-to-right search method in favour of an island-driven approach. Island-driven methods start the search from the more reliable regions of speech, called islands, and proceed towards the less reliable regions, called gaps. To verify our claim that prominent syllables represent the island regions, the classification results were examined, and they showed a significantly higher accuracy for prominent syllables than for non-prominent syllables. The results obtained with manual segmentation are similar to those obtained using the left-to-right search, but a larger corpus is needed to explore this approach further.
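The island-driven idea can be sketched as a visiting order over syllable segments: start from the most reliable segments and expand outward into the gaps. The sketch below is an assumption-laden illustration and not the decoder from the thesis. The per-segment `scores` (for example, prominence scores), the `island_threshold` value, and the function name are all hypothetical, and the code only produces the order in which segments would be processed, not a recognition result.

```python
from collections import deque
from typing import List


def island_driven_order(scores: List[float],
                        island_threshold: float = 0.5) -> List[int]:
    """Return segment indices in island-first order.

    Segments whose score is at least island_threshold are treated as
    islands and visited first, most reliable first. The remaining
    segments (gaps) are then reached by expanding outward from the
    islands, one neighbour at a time.
    """
    n = len(scores)
    islands = sorted(
        (i for i in range(n) if scores[i] >= island_threshold),
        key=lambda i: -scores[i],
    )
    if not islands and n > 0:
        # Fall back to the single most reliable segment.
        islands = [max(range(n), key=lambda i: scores[i])]

    visited = [False] * n
    order: List[int] = []
    queue: deque = deque()
    for i in islands:
        visited[i] = True
        order.append(i)
        queue.append(i)

    # Breadth-first expansion from the islands into the gaps.
    while queue:
        i = queue.popleft()
        for j in (i - 1, i + 1):
            if 0 <= j < n and not visited[j]:
                visited[j] = True
                order.append(j)
                queue.append(j)
    return order


# Hypothetical reliability scores for six syllable segments.
order = island_driven_order([0.2, 0.9, 0.1, 0.3, 0.8, 0.4])
```

With these toy scores, segments 1 and 4 act as islands and the remaining segments are visited as their neighbours. A left-to-right search would instead visit the segments in index order.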