1.7
Speech Synthesis and Voice Recognition

D. H. F. LIU (1995)
B. G. LIPTÁK (2005)

THE NATURE OF SOUND

Sound normally reaches the ear through pressure waves in the air. Vibratory motion at frequencies between 16 Hz and 20 kHz is recognized as audible sound. Sound pressure levels and velocities are quite small, as listed in Table 1.7a. Sound pressures are measured in root-mean-square (RMS) values. The RMS values are arrived at by taking the square root of the arithmetic mean of the squared instantaneous values. The RMS pressure of spherical sound waves in a free field of air can be described as:

P_r = P_0 / r   (dynes/cm^2)

where P_r is the RMS sound pressure at a distance r from the sound source and P_0 is the RMS pressure at a unit distance from the source.
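As a quick check on these definitions, the short Python sketch below computes an RMS value and applies the inverse-distance relation P_r = P_0/r. The 1 kHz test tone, sampling rate, and distances are illustrative choices, not values from the text.

```python
import math

def rms(samples):
    """Root-mean-square: square root of the arithmetic mean of the squared values."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def pressure_at(p0, r):
    """Spherical spreading in a free field: Pr = P0 / r, with P0 taken at unit distance."""
    return p0 / r

# One cycle of a 1 kHz tone sampled at 8 kHz; the RMS of a sine is peak / sqrt(2).
peak = 1.0
tone = [peak * math.sin(2 * math.pi * 1000 * n / 8000) for n in range(8)]
print(round(rms(tone), 3))          # ~0.707
print(pressure_at(p0=2.0, r=4.0))   # 0.5 dynes/cm^2 at four unit distances
```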

INTRODUCTION

Several companies and universities have developed voice recognition systems. These systems and others built under grants from the Defense Advanced Research Projects Agency (DARPA) couple the ability to convert speech into electronic text with the artificial intelligence to understand the text. In these systems the computer acts as an agent that knows what the users want and how it is to be accomplished, including the voice output. Altogether, this type of speech synthesis allows untrained manufacturing workers the uninterrupted use of their hands, which could translate into increased productivity and cost savings.

SPEECH SYNTHESIS

Figure 1.7b shows the components of a computer voice response system. A synthesis program must access from a stored vocabulary a description of the required sequence of words. It must put the words together with the proper duration, intensity, and inflection for the prescribed text. This description of the connected utterance is given to a synthesizer device, which generates a signal for transmission over a voice circuit. Three different techniques can be used for generating the speech: (1) adaptive differential pulse-code modulation (ADPCM), (2) formant synthesis, and (3) text synthesis. Their typical bit rates for vocabulary storage are shown in Table 1.7c.

TABLE 1.7a
Mechanical Characteristics of Sound Waves

                               RMS Sound      RMS Sound          RMS Sound Particle   Sound Pressure
                               Pressure       Particle Velocity  Motion at 1,000 Hz   Level (dB re
                               (dynes/cm^2)   (cm/s)             (cm)                 0.0002 µbar)
Threshold of hearing           0.0002         0.0000048          0.76 × 10^-9           0
                               0.002          0.000048           7.6 × 10^-9           20
Quiet room                     0.02           0.00048            76.0 × 10^-9          40
Normal speech at 3 ft          0.2            0.0048             760 × 10^-9           60
                               2.0            0.048              7.6 × 10^-6           80
Possible hearing impairment    20.0           0.48               76.0 × 10^-6         100
                               200            4.80               760 × 10^-6          120
Threshold of pain              2000           48.0               7.6 × 10^-3          140
Incipient mechanical damage    20 × 10^3      480                76.0 × 10^-3         160
                               200 × 10^3     4800               760 × 10^-3          180
Atmospheric pressure           2000 × 10^3    48000              7.6                  200
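The decibel column of Table 1.7a follows from the standard sound-pressure-level definition, SPL = 20 log10(P/P_ref) with P_ref = 0.0002 dynes/cm^2 (0.0002 µbar); each factor of 10 in pressure adds 20 dB. A minimal Python sketch that reproduces a few table rows:

```python
import math

P_REF = 0.0002  # dynes/cm^2, the 0-dB reference used in Table 1.7a

def spl_db(p_rms):
    """Sound pressure level in dB relative to 0.0002 dynes/cm^2."""
    return 20 * math.log10(p_rms / P_REF)

for p in (0.0002, 0.2, 2000.0):
    print(f"{p:>8} dynes/cm^2 -> {spl_db(p):5.0f} dB")
# 0.0002 -> 0 dB, 0.2 -> 60 dB, 2000 -> 140 dB
```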

FIG. 1.7b
Block diagram of a computer voice response system. (Courtesy of Bell Laboratories.)

Adaptive Differential Pulse-Code Modulation (ADPCM)

Figure 1.7d shows the schematic of a simple ADPCM decoder. The synthesis program pulls the digitally coded words, stored on disk, in the required sequence and supplies them to an ADPCM decoder, which produces the analog signal. The only control of prosody (accent and voice modulation) that can be effected in the assembling of the message is the adjustment of the time intervals between successive words, and possibly of their intensities. No control of voice pitch or merging of vocal resonance at word boundaries is possible. This technique is suitable for messages requiring modest vocabulary size and very lax semantic constraints. Typical messages are voice readouts of numerals, such as telephone numbers, and instructions for sequential operations, such as equipment assembly or wiring.
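The following Python sketch illustrates the adaptive idea of Figure 1.7d with a toy 1-bit delta coder whose step size grows on steep slopes and shrinks on fine detail. It is a simplified stand-in, not the Bell Laboratories coder or a standards-compliant ADPCM implementation, and the step-size constants are arbitrary assumptions.

```python
def adaptive_delta_encode(samples, step=0.05):
    """Toy 1-bit adaptive delta coder: emit +1/-1 per sample and adapt the step size.
    Repeated codes mean the estimate is lagging a steep slope, so the step grows;
    alternating codes mean granular noise, so the step shrinks."""
    codes, estimate, prev = [], 0.0, 0
    for x in samples:
        code = 1 if x >= estimate else -1
        codes.append(code)
        estimate += code * step                 # what the decoder will reconstruct
        step = min(max(step * (1.5 if code == prev else 0.66), 1e-4), 1.0)
        prev = code
    return codes

def adaptive_delta_decode(codes, step=0.05):
    """Mirror of the encoder: the step-size adaptation must be identical."""
    estimate, prev, out = 0.0, 0, []
    for code in codes:
        estimate += code * step
        out.append(estimate)
        step = min(max(step * (1.5 if code == prev else 0.66), 1e-4), 1.0)
        prev = code
    return out
```

Because the decoder replays exactly the same step-size adaptation as the encoder, only the one-bit codes need to be stored for each vocabulary word.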


Formant Synthesis

Formant synthesis uses a synthesis model in which a word library is analyzed and parametrically stored as the time variations of the vocal-tract resonances, or formants. Speech formants are represented as dark bars on the sound spectrogram in the bottom panel of Figure 1.7e. These frequency variations and voice pitch periods can be analyzed by computer, as shown in the upper two panels of the figure.

TABLE 1.7c
Features of the Different Coding Techniques Used in Speech Generation

Coding Technique     Data Rate (bits/s)   Duration of Speech in 10^6 Bits of Storage (min)
ADPCM                20k                  1
Formant synthesis    500                  30
Text synthesis       700                  240

FIG. 1.7d
Adaptive differential coding of speech. (a) Block diagram of the coder (adapted from U.S. Patent 3,772,682). (b) Comparison of waveforms coded by 3-bit DPCM (differential pulse-code modulation) and ADPCM (adaptive DPCM). (Courtesy of Bell Laboratories.)

In formant synthesis, the synthesizer device is a digital filter, as shown in Figure 1.7f. The filter's transmission resonances and antiresonances are controlled by the computer, and its input excitation is derived from programmed rules for voice pitch and for the amplitudes of voiced sounds and of noise-like unvoiced sounds. The parametric description of the speech information permits control and modification of the prosodic characteristics (pitch, duration, intensity) of the synthetic speech and produces smooth formant transitions at successive word boundaries. This technique has also been used experimentally for generating automatic-intercept messages and for producing wiring instructions by voice.

FIG. 1.7e
Analysis of the sentence "We were away a year ago." The sound spectrogram in the bottom panel shows the time variation of vocal-tract resonances, or formants (dark areas). Computer-derived estimates of the formant frequencies and the voice pitch period are shown in the top two panels. (Courtesy of Bell Laboratories.)

FIG. 1.7f
Block diagram of a digital formant synthesizer. (Courtesy of Bell Laboratories.)
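A minimal sketch of the resonance-filter idea behind formant synthesis: a pitch-rate impulse train excites a cascade of two-pole digital resonators, one per formant. The sampling rate, pitch, and the /a/-like formant frequencies and bandwidths are illustrative assumptions, and no antiresonances, unvoiced (noise) excitation, or prosody rules are modeled.

```python
import numpy as np

def resonator(x, freq, bw, fs):
    """Two-pole digital resonator: one formant with center frequency `freq` (Hz)
    and bandwidth `bw` (Hz), run as a simple recursive (IIR) filter."""
    r = np.exp(-np.pi * bw / fs)          # pole radius set by the bandwidth
    theta = 2.0 * np.pi * freq / fs       # pole angle set by the formant frequency
    a1, a2 = 2.0 * r * np.cos(theta), -r * r
    b0 = 1.0 - a1 - a2                    # rough gain normalization
    y1 = y2 = 0.0
    y = np.empty_like(x)
    for n, xn in enumerate(x):
        y[n] = b0 * xn + a1 * y1 + a2 * y2
        y2, y1 = y1, y[n]
    return y

fs = 8000                                             # sampling rate (assumed)
n = np.arange(int(0.3 * fs))
pitch_period = fs // 120                              # ~120 Hz voice pitch
excitation = (n % pitch_period == 0).astype(float)    # impulse train = voiced source

speech = excitation
for f, bw in [(730, 60), (1090, 90), (2440, 120)]:    # rough /a/-vowel formants
    speech = resonator(speech, f, bw, fs)
speech /= np.max(np.abs(speech))                      # normalize the amplitude
```

Changing the formant tracks, pitch period, and amplitudes over time is what lets a formant synthesizer impose prosody and smooth word-boundary transitions.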

Text Synthesis

Text synthesis is the most forward-looking voice response technique. It provides for storing and voice-accessing voluminous amounts of information, or for converting any printed message into spoken form. It generates speech from an information input corresponding to a typewriter rate. In one design, the vocabulary is literally a stored pronouncing dictionary. Each entry has the word, a phonetic translation of the word, word stress marks, and some rudimentary syntax information. The synthesis program includes a syntax analyzer that examines the message to be generated and determines the role of each word in the sentence. The stored rules for prosody then calculate the sound intensity, sound duration, and voice pitch for each phoneme (elemental speech sound) in the particular context.

Figure 1.7g shows a dynamic articulatory model of the human vocal tract. The calculated controls cause the vocal-tract model to execute a sequence of "motions." These motions are described as changes in the coefficients of a wave equation for sound propagation in a nonuniform tube. From the time-varying wave equation, the formants of the deforming tract are computed iteratively. These resonant frequencies and the calculated pitch and intensity information are sent to the same digital formant synthesizer shown in Figure 1.7f. The system generates its synthetic speech completely from stored information; language text can be typed into the system and the message synthesized online.

FIG. 1.7g
Speech synthesizer based upon computer models for vocal-cord vibration, sound propagation in a yielding-wall tube, and turbulent flow generation at places of constriction. The control inputs are analogous to human physiology and represent the subglottal air pressure in the lungs (Ps), the vocal-cord tension (Q) and area of opening at rest (AGo), the cross-sectional shape of the vocal tract A(x), and the area of coupling to the nasal tract (N). (Courtesy of Bell Laboratories.)
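A hypothetical sketch of the stored pronouncing-dictionary entries described above; the words, phoneme symbols, and field names are invented for illustration, not taken from any particular system.

```python
# Each entry: phonetic translation, stress marks, and rudimentary syntax information.
LEXICON = {
    "measure": {"phonemes": ["m", "eh", "zh", "er"], "stress": [1, 0], "pos": "noun/verb"},
    "flow":    {"phonemes": ["f", "l", "ow"],        "stress": [1],    "pos": "noun/verb"},
}

def transcribe(text):
    """Look each word of the message up in the stored pronouncing dictionary."""
    out = []
    for word in text.lower().split():
        entry = LEXICON.get(word)
        if entry is None:
            raise KeyError(f"no pronunciation stored for {word!r}")
        out.append((word, entry["phonemes"], entry["stress"]))
    return out

print(transcribe("measure flow"))
```

In a full system, the prosody rules would then assign a duration, intensity, and pitch contour to each phoneme before driving the articulatory or formant model.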

VOICE RECOGNITION

Figure 1.7h illustrates one general structure for a machine to acquire knowledge of expected pronunciations and to compare input data with those expectations. A precompiling stage involves the user speaking sample pronunciations of each allowable word, phrase, or sentence, while identifying each with a vocabulary item number. Later, when an unknown utterance is spoken, it is compared with the lexicon of expected pronunciations to find the training sample that it most closely resembles.

FIG. 1.7h
Typical structure for a trainable speech recognizer that could recognize words, phrases, or sentences.


Word Boundary Detection

Silent pauses permit easy detection of word onsets, at the transition from no signal to high energy, and of word offsets, as the energy dips below a threshold. Word boundary confusions can occur when the short silences during stop consonants (/p, t, k/) resemble intended pauses. A word such as "transportation" could appear to divide into three words: "trans," "port," and "ation." Recognizers must measure the duration of each pause to distinguish short consonantal silences from longer, deliberate pauses.
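A minimal energy-threshold sketch of this idea: frames above an energy threshold are treated as speech, and a silence is accepted as a word boundary only if it lasts longer than a minimum pause duration, so that stop-consonant gaps are not split into separate words. The frame length, threshold, and 150 ms pause limit are arbitrary assumptions.

```python
import numpy as np

def word_boundaries(x, fs, frame_ms=10, threshold=0.01, min_pause_ms=150):
    """x: NumPy array of samples. Returns a list of (onset_ms, offset_ms) per word;
    a silence shorter than `min_pause_ms` (e.g., a stop-consonant gap) is ignored."""
    frame = int(fs * frame_ms / 1000)
    energy = np.array([np.mean(x[i:i + frame] ** 2)
                       for i in range(0, len(x) - frame, frame)])
    voiced = energy > threshold
    words, start, silence = [], None, 0
    for i, v in enumerate(voiced):
        if v:
            if start is None:
                start = i                      # word onset: silence -> energy
            silence = 0
        elif start is not None:
            silence += 1
            if silence * frame_ms >= min_pause_ms:   # long enough to be a real pause
                words.append((start * frame_ms, (i - silence + 1) * frame_ms))
                start, silence = None, 0
    if start is not None:
        words.append((start * frame_ms, len(voiced) * frame_ms))
    return words
```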

Feature Extraction

Acoustic data provide features to detect contrasts of linguistic significance and to detect segments such as vowels and consonants. Figure 1.7i shows a few typical acoustic parameters used in recognizers. One can extract local peak values, such as point 1 in the top panel, as an amplitude measure. The sum of the squares of all waveform values over a time window provides a measure of energy. Vowel-like smooth waveforms give few zero crossings per unit time; noise-like segments give many crossings per unit time. The time between the prominent peaks at the onsets of pitch cycles determines the pitch period T0; its inverse, the rate of vibration of the vocal cords, is called the fundamental frequency, F0. Resonance frequencies are indicated by the number of peaks per pitch period. The third pitch period shows seven local peaks, indicating a ringing resonance at seven times the fundamental frequency. This resonance is the first formant, or vocal-tract resonance, of the /a/-like vowel, and it is the best cue to the vowel's identity.

The frequency content of the short speech segment is analyzed in panel (b), which superimposes the results of two methods of frequency analysis. The jagged Fourier spectrum, with its peaks at harmonics of the fundamental frequency, is determined using a fast Fourier transform (FFT), and the exact positions of the major spectral peaks at the formants of the speech are determined using a method called linear predictive coding (LPC). The peaks indicate the basic resonances of the speaker's vocal tract and can be tracked as a function of time, as in panel (c); this indicates the nature of the vowel being articulated.
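The sketch below computes the kinds of per-frame parameters just described: local peak amplitude, sum-of-squares energy, and zero-crossing count. For the pitch estimate it substitutes an autocorrelation peak search for the peak-spacing measurement described in the text; the 60 to 400 Hz search range is an assumption.

```python
import numpy as np

def frame_features(frame, fs):
    """Per-frame acoustic parameters of the kind shown in Figure 1.7i.
    `frame` should hold at least a couple of pitch periods of samples."""
    peak = np.max(np.abs(frame))                       # local peak amplitude
    energy = np.sum(frame ** 2)                        # sum of squared values
    zero_crossings = np.sum(np.abs(np.diff(np.sign(frame)))) / 2
    # Pitch period T0 from the autocorrelation peak in a plausible lag range
    # (60-400 Hz); the fundamental frequency is F0 = fs / T0_in_samples.
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(fs / 400), int(fs / 60)
    lag = lo + int(np.argmax(ac[lo:hi]))
    f0 = fs / lag
    return {"peak": peak, "energy": energy, "zcr": zero_crossings, "f0": f0}
```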



FIG. 1.7i
Typical acoustic parameters used in speech recognizers. (a) Time waveform showing parameters that can be extracted. (b) Frequency spectrum of the waveform of (a), with the detailed frequency spectrum derived from a fast Fourier transform (FFT) smoothed by linear predictive coding (LPC) to yield a smooth spectrum from which formants can be found as the spectral peaks. (c) Smoothed LPC spectra for five successive short time segments, with formants F1, F2, and F3 tracked as the spectral peaks.

Pattern Standardization and Normalization

A word spoken on two successive occasions has different amplitudes and timing. A recognizer must realign the data so that the proper portions of the utterance are aligned with the corresponding portions of the template. This requires nonuniform time normalization, as suggested by the differing phonemic durations. Dynamic programming is a method for trying all reasonable alignments, and it yields the closest match to each template. Another form of pattern normalization is speaker normalization, such as moving one speaker's formants up or down the frequency axis to match those of a "standard" speaker for which the machine has been trained.

Word Matching

The above processes yield an array of feature values versus time. During training, this array is stored as a template of the expected pronunciation. During recognition, a new array can be compared with all stored arrays to determine which word it is closest to. Ambiguities in possible wording result from errors and uncertainties in detecting the expected sound structure of an utterance. To prevent errors in word identification, recognizers will often give a complete list of possible words in decreasing order of agreement with the data.
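A minimal sketch of this comparison step, assuming the feature arrays have already been time-normalized to a common length (for example by the dynamic programming alignment described above); it returns the full candidate list ranked by increasing distance rather than a single answer.

```python
import numpy as np

def rank_candidates(unknown, templates):
    """Compare an unknown feature array (frames x features) against stored templates
    of the same shape and return vocabulary items in decreasing order of agreement."""
    scores = []
    for word, template in templates.items():
        distance = np.mean(np.linalg.norm(unknown - template, axis=1))
        scores.append((distance, word))
    return [word for _, word in sorted(scores)]
```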


High-Level Linguistic Components

Speech recognizers need to limit the consideration of alternative word sequences. That is the primary purpose of high-level linguistic components such as prosodics, syntax, semantics, and pragmatics. Prosodic information such as intonation can help distinguish questions from commands, can divide utterances into phrases, and can rule out word sequences with incorrect stress patterns. A syntactic rule can disallow ungrammatical sequences, such as "plus divide" or "multiply clear plus." A semantic rule might disallow meaningless but grammatical sequences, such as "zero divide by zero." A pragmatic constraint might eliminate unlikely sequences, such as the useless initial zero in "zero one nine." It is possible to avoid the propagation of errors through the successive stages of a recognizer by restructuring the components so that they all intercommunicate directly through a central control component, which might allow syntax or semantics to affect the feature-extraction or word-matching processes, or vice versa.
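A toy illustration of a syntactic constraint of this kind for a voice-operated calculator: a table of allowed word successions rejects sequences such as "plus divide." The vocabulary and the allowed pairs are invented for illustration.

```python
# Only word pairs listed here may follow one another.
ALLOWED_NEXT = {
    "clear":    {"zero", "one", "nine"},
    "zero":     {"plus", "divide", "multiply"},
    "one":      {"plus", "divide", "multiply"},
    "nine":     {"plus", "divide", "multiply"},
    "plus":     {"zero", "one", "nine"},
    "divide":   {"zero", "one", "nine"},
    "multiply": {"zero", "one", "nine"},
}

def grammatical(words):
    """True if every adjacent word pair in the hypothesized sequence is allowed."""
    return all(b in ALLOWED_NEXT.get(a, set()) for a, b in zip(words, words[1:]))

print(grammatical(["one", "plus", "nine"]))   # True
print(grammatical(["plus", "divide"]))        # False - rejected by the syntax rule
```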

PRACTICAL IMPLEMENTATION

Figure 1.7j illustrates a typical circuit-board implementation of a method of word recognition that uses LPC coding for spectral data representation and dynamic programming for time alignment and word matching.


FIG. 1.7j
Practical speech recognizer based on linear predictive coding (LPC) and dynamic programming. (a) Major components of a circuit-board speech recognizer using LPC analysis. (b) Matrix of speech-sound differences between reference words and word portions of unknown inputs. Alternative alignments are allowed within the parallelogram, and the path with the least accumulated distance (heavy line) is chosen as the best alignment of reference and input words.


An analog-to-digital converter chip amplifies and digitizes the microphone handset signal. An LPC chip performs the acoustic analysis and determines the coefficients that specify an inverse filter, which separates the smooth resonance structure from the harmonically rich impulses produced by the vocal cords. A matrix of these LPC coefficients versus time represents the articulatory structure of a word. Word matching can be based on comparing such coefficients with those stored for each word of the vocabulary for which the recognizer is trained.

As shown in Figure 1.7j, the analysis frames of the input are on the horizontal axis; those of a candidate reference word from the training data are on the vertical axis. The distances between the respective frames of the input and reference words are entered into the corresponding intersection cells of the matrix, and the distances along the best alignment of the reference and input data are accumulated. If that accumulated distance is less than the accumulated distance for any other reference word, that reference word is accepted as the identity of the input. Dynamic programming is a method of picking the path through the successive distance increments that produces the lowest accumulated distance. As shown by the heavy line in Figure 1.7j, a couple of frames of "f"-like sounds in the reference word may be identified with a single "f"-like frame of the input, or two input frames of "y"-like sounds may be associated with one such frame in the reference.

Dynamic programming is also applicable to word sequences: it looks for the beginning and end points of each word and matches them with reference words by the best paths between such end points. The same procedure can be applied to larger units in successive steps to yield "multiple-level" dynamic programming, which is used in some commercial recognition devices.
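A minimal dynamic-programming (dynamic time warping) sketch of the matching just described: frame-to-frame distances fill a matrix, the lowest accumulated distance over an alignment path is found, and the reference word with the smallest such distance is accepted. The parallelogram (slope) constraint shown in Figure 1.7j is omitted for brevity.

```python
import numpy as np

def dtw_distance(ref, inp):
    """Accumulate frame distances over the best alignment path, in the spirit of the
    matrix of Figure 1.7j. `ref` and `inp` are (frames x features) arrays."""
    m, n = len(ref), len(inp)
    d = np.linalg.norm(ref[:, None, :] - inp[None, :, :], axis=2)   # frame distances
    acc = np.full((m, n), np.inf)
    acc[0, 0] = d[0, 0]
    for i in range(m):
        for j in range(n):
            if i == 0 and j == 0:
                continue
            best_prev = min(
                acc[i - 1, j] if i > 0 else np.inf,                 # several reference
                acc[i, j - 1] if j > 0 else np.inf,                 # or input frames map
                acc[i - 1, j - 1] if i > 0 and j > 0 else np.inf,   # to one frame, or 1:1
            )
            acc[i, j] = d[i, j] + best_prev
    return acc[m - 1, n - 1]

def recognize(unknown, vocabulary):
    """Pick the reference word whose best-path accumulated distance is smallest."""
    return min(vocabulary, key=lambda w: dtw_distance(vocabulary[w], unknown))
```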

Bibliography

Bristow, G., Electronic Speech Recognition, New York: McGraw-Hill, 1986.
Bristow, G., Electronic Speech Synthesis, New York: McGraw-Hill, 1984.
Burnett, D. C., et al., "Speech Synthesis Markup Language," http://www.w3.org/TR/speech-synthesis, copyright 2002.
Dutoit, T., An Introduction to Text-To-Speech Synthesis, Dordrecht, The Netherlands: Kluwer Academic Publishers, 1996.
Flanagan, J. L., "Synthesis and Recognition of Speech: Teaching Computers to Listen," Bell Laboratories Record, May–June 1981, pp. 146–151.
http://www.3ibm.com/software/speech/
Levinson, S. E., and Liberman, M. Y., "Speech Recognition by Computers," Scientific American, April 1981, pp. 64–76.
Yarowsky, D., "Homograph Disambiguation in Speech Synthesis," Proceedings, 2nd ESCA/IEEE Workshop on Speech Synthesis, New Paltz, NY, 1994.