Poster A4 Eurospeech'99 - Jérôme Farinas

Comparison of two phonetic approaches to Language Identification

François PELLEGRINO 1 - Jérôme FARINAS 2 - Régine ANDRE-OBRECHT 2

1 Laboratoire Dynamique Du Langage, F-69363 Lyon Cedex 07 - France, [email protected]

2 Institut de Recherche en Informatique de Toulouse, F-31062 Toulouse Cedex - France, (Jerome.Farinas, obrecht)@irit.fr

Corpus

Synopsis

[Synopsis figure, block labels: {c i}; Acoustic Modeling; {a i, d i, c i}; VS Model - 1; CS Model - 1; VS Model - 2]

Introduction

Automatic language identification is one of the main emerging challenges for the next decade in automatic speech processing. Today, the standard approach is based on the modeling of phonotactic rules. In most systems, phonetic recognition is merely considered as a front-end and not really exploited for language identification, because phonetic models usually require a substantial amount of labeled data. These methods make suboptimal use of the phonetic and phonological differences among languages, even though these differences are essential from a linguistic point of view.

[Synopsis figure, block labels continued: CS Model - 2; ...; VS Model - N; CS Model - N; Vowel System Decision Rule]

We propose an alternative approach that requires no labeled data and that considers phonological cues, resulting in an efficient unsupervised modeling. It consists in segmenting speech according to a vowel / non-vowel label and then modeling separately the vocalic and the consonantal systems of each language as a whole, without looking for individual phone models.

[Synopsis figure, block labels continued: Consonant System Decision Rule; Statistical Merging]

For the VSM, the LBG-Rissanen algorithm correctly handles the language-specific structure of the vowel system and reaches 78% correct identification, while the best constant-size system (20 components) gives only 68%.

[Figure: % of identification (axis ticks 50 to 80) versus VSM model topology (ticks 5, 20, 40, 60, 80 and Rissanen)]

For the CSM, the best identification rate is given by a constant-size system (30 components) and it is similar to the best VSM result (78% correct identification). An interpretation of these differences is that vowel segments share a common acoustic structure that enables LBG-Rissanen to reach an optimal topology. On the contrary, consonantal segments are acoustically heterogeneous (voiced/unvoiced, etc.) and dynamically more complex. The LBG-Rissanen algorithm is then trapped in a local optimum.

The corpus is recorded over a telephone channel (8 kHz, often noisy) and mainly consists of spontaneous speech. For the experiments, the data are divided into two non-overlapping sets:
- Learning set: 50 speakers per language, 2 minutes per speaker.
- Test set: 15-20 speakers per language, 10-second and 45-second utterances.

Because of the small number of female speakers (less than 20%), our experiments mostly ignore them.

[Figure panel titles: Vowel System Identification; Consonant System Identification]

Vowel detection

Vocalic System Modeling

It consists in Gaussian Mixture Modeling (GMM) of the vowel segments automatically detected in the learning sets. GMMs are estimated using the EM algorithm in a 17-parameter space (8 MFCC, 8 ∆MFCC and segmental duration). The initialization step is performed using Vector Quantization (VQ) algorithms. The number of Gaussian components is either given a priori and constant among the languages, or determined by a data-driven implementation of the VQ algorithm. This LBG-Rissanen algorithm [Rissanen 83, Pellegrino 99] provides the optimal number of components for each language according to a Minimum Description Length criterion.
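As an illustration of choosing a model size with a Minimum Description Length criterion, the sketch below scores candidate diagonal-covariance GMMs and keeps the one with the smallest description length. It is not the LBG-Rissanen algorithm itself: the poster does not give its exact criterion, so the usual two-part form, -log L + (k/2) log N, is assumed here, and all function names are hypothetical.

```python
import math

def diag_gmm_loglik(data, weights, means, variances):
    """Total log-likelihood of `data` under a diagonal-covariance GMM."""
    total = 0.0
    for x in data:
        comp_logs = []
        for w, mu, var in zip(weights, means, variances):
            lp = math.log(w)
            for xi, mi, vi in zip(x, mu, var):
                lp += -0.5 * (math.log(2 * math.pi * vi) + (xi - mi) ** 2 / vi)
            comp_logs.append(lp)
        # log-sum-exp over components, for numerical stability
        m = max(comp_logs)
        total += m + math.log(sum(math.exp(c - m) for c in comp_logs))
    return total

def free_parameters(n_components, dim):
    """Weights (K - 1) plus means and diagonal variances (2 * K * dim)."""
    return (n_components - 1) + 2 * n_components * dim

def mdl_score(loglik, n_components, dim, n_samples):
    """Assumed two-part description length: data cost plus model cost."""
    penalty = 0.5 * free_parameters(n_components, dim) * math.log(n_samples)
    return -loglik + penalty

def select_by_mdl(data, candidates):
    """`candidates` is a list of (weights, means, variances) tuples.
    Returns the index of the candidate with the smallest MDL score."""
    dim, n = len(data[0]), len(data)
    scores = [
        mdl_score(diag_gmm_loglik(data, w, mu, var), len(w), dim, n)
        for (w, mu, var) in candidates
    ]
    best = min(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Toy one-dimensional data with two well-separated clusters (illustrative only).
toy = [[-5.1], [-5.0], [-4.9], [4.9], [5.0], [5.1]]
one_gaussian = ([1.0], [[0.0]], [[25.0]])
two_gaussians = ([0.5, 0.5], [[-5.0], [5.0]], [[0.01], [0.01]])
best_index, _ = select_by_mdl(toy, [one_gaussian, two_gaussians])
```

With this penalty, a 20-component model in the 17-parameter space carries 699 free parameters, so extra components are kept only when they buy a clear gain in likelihood.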

Consonantal System Modeling

Consonantal system modeling is similar to vowel system modeling. GMMs are estimated using the non-vowel segments of each training set. Note that these non-vowel segments cannot be considered exactly as consonants (vowel transitions may also be labeled as non-vowels), so the name Consonantal System Model is slightly inaccurate.

Differentiated modeling (VSM + CSM) reaches an 85% correct identification rate.

However, additional experiments show that the two approaches are not redundant since merging GSM and differentiated modeling results in an improvement.

Merging VSM with GSM results in a 91% correct identification rate. Merging CSM and GSM results in a drop in performance. Once again, it seems that our consonant modeling is not discriminative enough to capture the heterogeneous structure of consonants.

Differentiated modeling consists in a statistical merging of VSM and CSM. Results are given in the next frame.
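The poster does not state the merging rule. The sketch below assumes one common choice, a weighted sum of the per-language log-likelihoods from the two models followed by an argmax over languages. The function name, the `vowel_weight` parameter and the scores are hypothetical illustrations, not results from the experiments.

```python
def identify_language(vowel_scores, consonant_scores, vowel_weight=0.5):
    """Merge per-language log-likelihoods and return the best language.

    `vowel_scores` and `consonant_scores` map a language name to the
    log-likelihood of the utterance under that language's VSM or CSM.
    The weighted-sum rule is an assumption made for this sketch.
    """
    merged = {
        lang: vowel_weight * vowel_scores[lang]
        + (1.0 - vowel_weight) * consonant_scores[lang]
        for lang in vowel_scores
    }
    best = max(merged, key=merged.get)
    return best, merged

# Made-up scores for two of the five languages (illustrative only).
vsm = {"French": -120.0, "Spanish": -118.0}
csm = {"French": -200.0, "Spanish": -210.0}
language, merged_scores = identify_language(vsm, csm)
```

Setting `vowel_weight` to 1.0 or 0.0 reduces the rule to the vowel-only or consonant-only decision, which makes it easy to compare the merged system with each model taken alone.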

Differentiated vs. Global Approach

This score is similar to the one given by a Global modeling of all the segments (GSM).

[Figure: % of identification versus CSM model topology (ticks include 40, 60, 80 and Rissanen)]


A language-independent procedure performs the vowel detection. It consists of the following steps:
1. The utterance is segmented using the "Forward-Backward Divergence" algorithm [André-Obrecht 88]. It provides steady and transient segments (see example) and gives duration information for vowel segments.
2. A speech activity detector is applied. Since pauses carry no phonetic cues, they are not considered.
3. A vowel detection algorithm [Pellegrino 97] is implemented. It is based on spectral analysis and locates the steady parts of vowels (see example).
4. A segmental cepstral analysis extracts parameters from each segment. Mel Frequency Cepstral Coefficients (MFCC) and ∆MFCC are computed. A cepstral subtraction handles speaker normalization and channel effect removal.
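The last step can be sketched as follows. The poster does not detail the cepstral subtraction, so this sketch assumes the common form in which the mean cepstral vector of the utterance is subtracted from every segment vector. The helper that assembles the 17-parameter vector (8 MFCC, 8 ∆MFCC, duration) takes already computed coefficients as input. Both function names are hypothetical.

```python
def cepstral_mean_subtraction(vectors):
    """Subtract the per-utterance mean from each cepstral vector.

    Assumed reading of "cepstral subtraction": removing the utterance
    mean attenuates stationary channel and speaker effects.
    """
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[d] for v in vectors) / n for d in range(dim)]
    return [[v[d] - mean[d] for d in range(dim)] for v in vectors]

def segment_features(mfcc, delta_mfcc, duration):
    """Build one 17-parameter segment vector: 8 MFCC, 8 delta MFCC, duration."""
    if len(mfcc) != 8 or len(delta_mfcc) != 8:
        raise ValueError("expected 8 MFCC and 8 delta MFCC coefficients")
    return list(mfcc) + list(delta_mfcc) + [duration]

# Two toy two-dimensional vectors (illustrative values only).
normalized = cepstral_mean_subtraction([[1.0, 2.0], [3.0, 6.0]])
# normalized == [[-1.0, -2.0], [1.0, 2.0]]
```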

System Models (VSM) and Consonantal System Models (CSM). Only male speakers are considered. The two initializing algorithms lead to models with a constant number of Gaussian components among languages and models whose size is data-driven (see histograms).

Example


GSM: Global Segmental Modeling
VSM: Vowel System Modeling
CSM: Consonantal System Modeling

[Example figure: segmentation of the signal into Vowel, Non Vowel and Silence segments]

Conclusion

This work proves that a significant part of the language characterization is embedded in its

[Figure: % of identification for merged systems (combinations of VSM, CSM and GSM)]

Experiments are performed to test the discriminative power of Vowel

[Synopsis figure, block labels: A priori Segmentation {s i}; Speech Activity Detection; Vowel Detection; {d i}]

Multi Language Telephone Speech corpus. The phonological differences between the vowel systems of these languages motivated the use of this subset (see below). Spanish and Japanese vowel systems are rather elementary (5 vowels) and quasi-identical. Korean and French systems are quite complex, and they make use of secondary articulations (long vs. short vowel opposition in Korean and nasalization in French). The Vietnamese system is of average size.

Experiments


This paper presents two unsupervised approaches to Automatic Language Identification (ALI) based on a segmental preprocessing. In the Global Segmental Model approach, the language system is modeled by a Gaussian Mixture Model (GMM) trained with automatically detected segments. In the Phonetic Differentiated Model approach, an unsupervised vowel / non-vowel detection is performed and the language model is defined with two GMMs, one to model the vowel segments and a second one to model the other segments. For each approach, no labeled data are required. GMMs are initialized using an efficient data-driven variant of the LBG algorithm: the LBG-Rissanen algorithm. With 5 languages from the OGI MLTS corpus and in a closed-set identification task, we reach 85% correct identification with each system using 45-second utterances for the male speakers. We increase this performance (91%) when we merge the two systems.

Experiments are performed upon 5 languages (French, Japanese, Korean, Spanish and Vietnamese) of the OGI


Abstract

[Example figure: phonetic transcription aligned with the signal along a time axis in seconds (ticks at 0.5, 1.0 and 1.5)]

The speaker pronounced « euh, à proximité d'un petit château », which means "uh, near a little castle". Despite the noisy conditions, no errors occur in vowel detection.

vowel system: vowel segments seem to be highly discriminative, since the same level of performance is reached with vowel system modeling and consonantal system modeling even though the consonantal duration is twice the vocalic duration in the utterances. Moreover, vowel system modeling using the LBG-Rissanen algorithm provides additional identification cues that are not exploited in the global segmental model (GSM).

The benefit of the differentiated modeling approach is real, and several advantages of acoustic modeling in homogeneous spaces may be pointed out:
- Minimum Description Length algorithms (like LBG-Rissanen) are able to handle the structure of the acoustic-phonetic system.
- A better discrimination is reached inside each model.
- The parameter space can be adapted to the characteristics of the acoustic class that is modeled.

We will extend the notion of differentiated model by introducing different model structures (GMM, HMM) and different acoustic parameters depending on the phonetic classes (occlusive, fricative, etc.). Then, to compare this approach with the classical ones, it will be necessary to complete our system with a phonotactic model appropriate to our own acoustic projection.