Speech

Speech production

• What’s so good about speech?

1. Speech is a faculty unique to humans and one of the most important, requiring the precise control and co-ordination of over eighty different muscles; this makes speech one of the most complex learned skills a human can achieve.

2. Speaking generally requires around 1,500 muscle commands every second, yet children need only a few years to perfect it.

3. Although a primary means of communication in itself, speech can convey other messages through accent, tone, pitch, and quality.

4. …but there have been only limited achievements in machine speech recognition and synthesis.


• The act of speech involves three major anatomical subsystems:

1. the respiratory system, including the lungs, rib cage, and diaphragm;

2. the phonatory system, which includes the larynx;

3. the articulatory system: the lips, teeth, tongue, and jaw.

 2001 Christian Martyn Jones

Respiratory system


Phonatory system

[Diagram: cross-section of the vocal tract, labelled with the nasal cavity, alveolar ridge, velum, teeth, lips, epiglottis, tongue, esophagus, glottis, larynx, lungs, and diaphragm]

 2001 Christian Martyn Jones

Speech and Natural Language Processing

Speech – Slide 3

Articulatory system

 2001 Christian Martyn Jones

Speech and Natural Language Processing

Speech – Slide 4

Articulation of consonants

• …considering in turn how we create consonants, vowels, and other sounds.

• How we classify the production of consonants involves:

1. the place of articulation (the relative position of the lips, teeth, and tongue),

2. the manner of articulation …using the glottis, tongue, teeth, lips, and nasal cavity

 2001 Christian Martyn Jones

Speech and Natural Language Processing

Speech – Slide 5

(how the air-stream from the lungs is obstructed: stops, fricatives, affricates, nasals, liquids, and glides),

3. and whether the vocal cords are set to vibrate (voicing).
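This three-way classification lends itself to a simple feature representation. Below is a minimal Python sketch, not from the slides: the phone symbols, class names, and table entries are my own illustrative assumptions based on standard phonetics.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Consonant:
    place: str    # bilabial, labiodental, interdental, alveolar, alveo-palatal, velar
    manner: str   # stop, fricative, affricate, nasal, liquid, glide
    voiced: bool  # whether the vocal cords are set to vibrate

# A few English consonants encoded on the three dimensions above.
CONSONANTS = {
    "p": Consonant("bilabial",    "stop",      voiced=False),
    "b": Consonant("bilabial",    "stop",      voiced=True),
    "f": Consonant("labiodental", "fricative", voiced=False),
    "v": Consonant("labiodental", "fricative", voiced=True),
    "t": Consonant("alveolar",    "stop",      voiced=False),
    "d": Consonant("alveolar",    "stop",      voiced=True),
    "k": Consonant("velar",       "stop",      voiced=False),
    "g": Consonant("velar",       "stop",      voiced=True),
    "m": Consonant("bilabial",    "nasal",     voiced=True),
    "n": Consonant("alveolar",    "nasal",     voiced=True),
    "s": Consonant("alveolar",    "fricative", voiced=False),
    "z": Consonant("alveolar",    "fricative", voiced=True),
}

# /p/ and /b/ differ only in voicing; /p/ and /t/ differ only in place.
assert CONSONANTS["p"].voiced != CONSONANTS["b"].voiced
assert CONSONANTS["p"].place != CONSONANTS["t"].place
```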

 2001 Christian Martyn Jones

Speech and Natural Language Processing

Speech – Slide 6

1

Annotation

• …will use American English rather than British English, but why?

• my work

• most other researchers’ work

• …and will also simplify: regional accents are not considered.

• ASR takes an acoustic waveform as input and produces as output a string of words.

• TtS takes a sequence of text words and produces an acoustic waveform.
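As a signature-level sketch only (hypothetical function names; no recognition or synthesis logic is implied), the two mappings just described are opposites of each other:

```python
# Hedged sketch: ASR and TtS as inverse mappings between waveforms and words.
def asr(waveform: list[float]) -> str:
    """Automatic speech recognition: acoustic waveform in, string of words out."""
    raise NotImplementedError  # placeholder; a real recogniser goes here

def tts(words: str) -> list[float]:
    """Text-to-speech: sequence of text words in, acoustic waveform out."""
    raise NotImplementedError  # placeholder; a real synthesiser goes here
```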

 2001 Christian Martyn Jones

Speech and Natural Language Processing

Speech – Slide 7

 2001 Christian Martyn Jones

• …two main alphabet standards:

1. International Phonetic Alphabet (IPA): a standard originally developed by the International Phonetic Association in 1888 with the aim of transcribing all human languages. It is more than just a set of symbols (e.g. a one-to-one relationship of symbols to sounds), and its use differs for different languages.

2. ARPAbet: uses ASCII characters rather than more ‘non-standard’ characters, which makes it much easier to create phonetic dictionaries and syntactic and semantic rules, and to build them into ASR systems… as we will see.
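Because ARPAbet is plain ASCII, a phonetic dictionary can be an ordinary text-keyed lookup table. A minimal sketch with a handful of illustrative entries (the transcriptions follow standard ARPAbet; the dictionary and function names are hypothetical):

```python
# Hedged sketch: a tiny ARPAbet pronunciation dictionary as a plain dict.
ARPABET_DICT = {
    "beet": ["b", "iy", "t"],
    "bit":  ["b", "ih", "t"],
    "bait": ["b", "ey", "t"],
    "bet":  ["b", "eh", "t"],
    "bat":  ["b", "ae", "t"],
    "boot": ["b", "uw", "t"],
    "book": ["b", "uh", "k"],
}

def transcribe(words: str) -> list[str]:
    """Look up each word and concatenate its ARPAbet phones."""
    phones = []
    for w in words.lower().split():
        phones.extend(ARPABET_DICT[w])  # raises KeyError for unknown words
    return phones

print(transcribe("beet bat"))  # ['b', 'iy', 't', 'b', 'ae', 't']
```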


Place of articulation


 2001 Christian Martyn Jones

Speech and Natural Language Processing

Speech – Slide 9

• …refers to the relative positions of the lips, teeth, and tongue.

• There are six distinct types of classification: bilabial, labiodental, interdental, alveolar, alveo-palatal, and velar.
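Continuing the illustrative CONSONANTS table sketched earlier, the six classes can be recovered by grouping phones on their place feature:

```python
from collections import defaultdict

# Group the illustrative consonant table by place of articulation.
by_place: dict[str, list[str]] = defaultdict(list)
for phone, c in CONSONANTS.items():
    by_place[c.place].append(phone)

print(by_place["bilabial"])  # ['p', 'b', 'm']
print(by_place["alveolar"])  # ['t', 'd', 'n', 's', 'z']
```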

 2001 Christian Martyn Jones

Speech and Natural Language Processing

Speech – Slide 11

• The six places of articulation

[Diagram: the six places of articulation, labelled with the nasal cavity, soft palate (velum), uvula, hard palate, alveolar ridge, lips, tip/blade/back of the tongue, and jaw]

 2001 Christian Martyn Jones

Speech and Natural Language Processing

• Bilabial, labiodental, interdental, alveolar, alveo-palatal, and velar describe the parts of the vocal tract responsible for obstructing the air flow from the lungs.

• The degree of obstruction the airstream incurs must also be considered… this is the manner of articulation.


Manner of articulation

• The manner of articulation describes… how, and to what degree, air from the lungs is obstructed.

• Terms used: stops, fricatives, affricates, nasals, lateral, retroflex, and glides.

Voicing

• Voicing is the vibration of the vocal cords in order to change the characteristics of the airstream through the mouth or nose, and the overall acoustic nature of the phone.

• …sounds that are generated with the vocal cords vibrating are voiced; conversely, those sounds requiring static vocal cords are voiceless.
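Using the illustrative CONSONANTS table again, voiced/voiceless pairs can be found mechanically: two phones that share place and manner but differ in voicing. A hedged sketch (hypothetical function name):

```python
def voicing_counterpart(phone: str) -> str | None:
    """Return the phone with the same place and manner but opposite voicing."""
    c = CONSONANTS[phone]
    for other, oc in CONSONANTS.items():
        if (oc.place, oc.manner) == (c.place, c.manner) and oc.voiced != c.voiced:
            return other
    return None  # e.g. nasals have no voiceless partner in this small table

print(voicing_counterpart("p"))  # 'b'
print(voicing_counterpart("s"))  # 'z'
print(voicing_counterpart("m"))  # None
```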


Articulation of vowels


• The articulation of American English vowels is not as well defined as that of consonants, and can vary a great deal from speaker to speaker, especially due to dialect variation.

• Vowels generally present little obstruction and require wide-open mouth positions.

• The description of vowels will consider the motion of the tongue and lips.

Features to consider

• What we look for are: 1. tongue elevation, 2. the part of the tongue involved, 3. tongue muscle tension, 4. mouth shape.
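These four features again suggest a simple record type. A minimal sketch (the field names are my assumptions, not the slides' terminology); a fuller illustrative table appears after the examples below:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Vowel:
    height: str     # tongue elevation: high, mid, or low
    frontness: str  # part of the tongue involved: front, central, or back
    tense: bool     # tongue muscle tension: tense vs. relaxed (lax)
    rounded: bool   # mouth shape: whether the lips are rounded

# e.g. /iy/ as in 'beet': a high, front, tense, unrounded vowel
IY = Vowel(height="high", frontness="front", tense=True, rounded=False)
```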

 2001 Christian Martyn Jones

Speech and Natural Language Processing

Speech – Slide 21

 2001 Christian Martyn Jones

Elevation of the tongue


Examples

• Tongue elevation is normally categorised simply in terms of high, mid, or low positions:

high front: /iy/ in ‘beet’ and /ih/ in ‘bit’
mid front: /ey/ in ‘bait’ and /eh/ in ‘bet’
low front: /ae/ in ‘bat’

high back: /uw/ in ‘boot’ and /uh/ in ‘book’
mid back: /ow/ in ‘boat’ and /ao/ in ‘bought’
low back: /aa/ in ‘bott’
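Filling the illustrative Vowel record from above with the examples in this table gives a small feature lexicon. The rounding values, and the tense/lax values for vowels outside the pairs discussed on the next slides, are my assumptions following standard descriptions of American English:

```python
VOWELS = {
    # front vowels
    "iy": Vowel("high", "front", tense=True,  rounded=False),  # 'beet'
    "ih": Vowel("high", "front", tense=False, rounded=False),  # 'bit'
    "ey": Vowel("mid",  "front", tense=True,  rounded=False),  # 'bait'
    "eh": Vowel("mid",  "front", tense=False, rounded=False),  # 'bet'
    "ae": Vowel("low",  "front", tense=False, rounded=False),  # 'bat'
    # back vowels
    "uw": Vowel("high", "back", tense=True,  rounded=True),    # 'boot'
    "uh": Vowel("high", "back", tense=False, rounded=True),    # 'book'
    "ow": Vowel("mid",  "back", tense=True,  rounded=True),    # 'boat'
    "ao": Vowel("mid",  "back", tense=False, rounded=True),    # 'bought'
    "aa": Vowel("low",  "back", tense=False, rounded=False),   # 'bott'
}
```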

 2001 Christian Martyn Jones

Speech and Natural Language Processing

Speech – Slide 23

 2001 Christian Martyn Jones

Speech and Natural Language Processing

Speech – Slide 24

3

Region of the tongue

• We classify the region of the tongue as either front, central, or back.

• We have already seen front vowels: ‘beet’, ‘bit’, ‘bait’, ‘bet’, ‘bat’;

back vowels such as: ‘boot’, ‘book’, ‘boat’, ‘bought’, and ‘bott’;

and central vowels such as: /ah/ in ‘but’ and the schwa vowel in ‘machine’.

Tongue tenseness

• Vowels requiring above the ‘normal’ level of muscle tension are termed tense, whilst those in which this degree of tension is not needed are simply relaxed.

• Consider the pairs ‘beat’ and ‘bit’, ‘bait’ and ‘bet’, and ‘boot’ and ‘book’:

- the first phone in each pair is tense, with the tongue slightly higher in the mouth than for the relaxed second phone;

- in addition, tense front vowels are articulated with the tongue slightly ahead of a similarly relaxed phone, whilst tense back vowels are pronounced with the tongue further back.
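As a quick consistency check against the illustrative VOWELS table sketched above, the first member of each pair carries the tense flag:

```python
# 'beat'/'bit', 'bait'/'bet', 'boot'/'book': tense first, relaxed second.
for tense_phone, lax_phone in [("iy", "ih"), ("ey", "eh"), ("uw", "uh")]:
    assert VOWELS[tense_phone].tense and not VOWELS[lax_phone].tense
```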

 2001 Christian Martyn Jones

Speech and Natural Language Processing

Speech – Slide 25

 2001 Christian Martyn Jones

Mouth shape


• There is a range of lip positions for all phones; however, vowels possess some degree of generalisation.

• The lips are unrounded for the vowels in ‘beet’, ‘bit’, ‘bait’, ‘bet’, ‘bat’, ‘machine’, and ‘but’,

whilst the back vowels of ‘boot’, ‘book’, ‘boat’, and ‘bought’ are rounded.

Other phonetic articulations

• These require more dynamic movement of the tongue and lips than can be described within the constraints of consonants and vowels, and include:

off-glides, diphthongs, r-colourisation, and the consonant /h/.

 2001 Christian Martyn Jones

Speech and Natural Language Processing

Speech – Slide 27

 2001 Christian Martyn Jones

Off-glides


• Already considered are the high, front, tense /iy/ (‘beet’), the mid, front, tense /ey/ (‘bait’), the high, back, tense /uw/ (‘boot’), and the mid, back, tense /ow/ (‘boat’) vowels; however, the articulation of each is more than these simple descriptors can suggest.

• The front vowels (/iy/ and /ey/) are composed of a pure vowel, /i/ (idealised position) or /e/, followed immediately by the glide /y/ (blade of the tongue near the hard palate); similarly, the back vowels involve the vowels /u/ and /o/ followed by /w/.

• Each of the constituent phones is pronounced, and these vowels are known as off-glides, indicating the increased motion of the tongue and lips.

Diphthongs

• Similar in composition to off-glides, diphthongs are complex vowels consisting of a vowel sound followed by the glide /y/ or /w/. However, what separates diphthongs from off-glides is that diphthongs involve considerably greater tongue motion.

• The diphthongs of American English are /aw/ in ‘bout’, /ay/ in ‘bite’, and /oy/ in ‘boy’.
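A minimal sketch of the decompositions just described. The pure-vowel symbols /i/, /e/, /u/, /o/, /a/ stand for the idealised positions named above; this is an illustrative encoding, not a standard ARPAbet table:

```python
# Off-glides: pure vowel + glide, with modest tongue motion.
OFF_GLIDES = {
    "iy": ("i", "y"),  # 'beet'
    "ey": ("e", "y"),  # 'bait'
    "uw": ("u", "w"),  # 'boot'
    "ow": ("o", "w"),  # 'boat'
}

# Diphthongs: vowel + glide, with considerably greater tongue motion.
DIPHTHONGS = {
    "aw": ("a", "w"),  # 'bout'
    "ay": ("a", "y"),  # 'bite'
    "oy": ("o", "y"),  # 'boy'
}
```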

 2001 Christian Martyn Jones

Speech and Natural Language Processing

Speech – Slide 30

 2001 Christian Martyn Jones

Speech and Natural Language Processing

Speech – Slide 31

4

R-colourisation

• The combination of vowel and /r/ sounds is termed r-colourisation; these differ from off-glides and diphthongs in that they are in fact two symbols representing a single sound.

• The r-coloured, mid, central vowel /er/, as in ‘bird’, involves the articulation of two previously described phones at one time: the mid, central vowel pronunciation of the schwa, as in ‘machine’, together with the tongue curl associated with the retroflexed /r/.

• The tongue is again in motion during pronunciation; however, it does not exceed the boundaries of tongue elevation and region, as is the case with diphthongs.

/h/

• As already mentioned, the consonant /h/ assumes the tongue and lip positions of the following vowel. As the articulation of /h/ is very much dependent on that of the vowel, and it is produced without vibration of the vocal cords, it is often described as a voiceless vowel, although grammatically it is a consonant.

 2001 Christian Martyn Jones

Speech and Natural Language Processing

Speech – Slide 32

 2001 Christian Martyn Jones

Speech and Natural Language Processing

Speech – Slide 33

 2001 Christian Martyn Jones

Speech and Natural Language Processing

Speech – Slide 36

Deaf studies

• We will consider the field of speechreading, addressing the visual similarity in speech.

• Research: a number of studies from the 1950s to the present day, with major works from Jeffers, Nitchie, Berger, and Fisher, as well as my own studies; the work considered here is Nitchie (1979).

 2001 Christian Martyn Jones

Speech and Natural Language Processing

Speech – Slide 34

…less information for speechreaders

lip cues for all…

• Larry Thronson, a sign language instructor and counsellor at the Central Coast Centre for Independent Living in California, has said: “Lipreading is guesswork; under ideal conditions, only about 30 to 40 percent (of the speech) is retained.”

• In my research into lip syncing I found that it can be more like 50% to 60% of the information that is lost.

 2001 Christian Martyn Jones

Speech and Natural Language Processing

Speech – Slide 37

• How much attention we pay to the visual and auditory information can vary depending on the ‘cocktail party effect’ and also on the subject matter.

• Predictable words in a conversation are spoken less clearly, as are references to objects present or passed between speakers in view; however, when an object is mentioned for the first time it is named more clearly than on subsequent occasions.

• When speakers are face-to-face their speech becomes degraded, even when they are not actually facing each other. Although visual cues can clearly greatly assist communication, speakers generally rarely look at each other during conversations, which seems to contradict suggestions that speakers adapt their articulation to the needs of the listener. If that were the case, one might speak less clearly when being watched; however, the reverse is true.

• Although we generally use visual cues less frequently than might initially be imagined (typically less than 50% of the time), when the listener does look at the speaker’s face, the speaker then articulates much more clearly than before.

 2001 Christian Martyn Jones

Speech and Natural Language Processing

Speech – Slide 38

5

Eye gaze


• Comfort:

In the situations of video teleconferencing or cartoon-style animation, the viewer is aware that the image is synthetic and is less prone to feeling uneasy looking at the speaker’s face. The viewer will therefore look at the image for a greater percentage of the overall time, implying that the accuracy of the articulation should be as high as possible.

However, the psychology of human subjects is not always consistent, and in fact listeners find excessive precision in the articulation of the speaker generally annoying.

Similarly, during face-to-face conversations, subjects subconsciously negotiate an acceptable level of mutual eye contact depending on the level of intimacy and comfort.

Note that the very latest computer facial models provide a very life-like persona, and it is plausible that in future such models may well be perceived as real-life images, in which case listeners will again find themselves unable to make prolonged eye contact during the communication.

 2001 Christian Martyn Jones

Speech and Natural Language Processing

Speech – Slide 39



• Although most people have not had any formal speechreading training and therefore cannot accurately lip-read, bad lip synchronisation is easily detectable and highly unacceptable to the viewer.

• Extreme cases have been cited by Groß at Oxford University, England, in which films dubbed into different languages cannot be synchronised with the visuals and cause irritation to the audience.

• In countries such as Germany and France, which regularly dub British or American films, spectators appear able to ignore the discrepancy between what they see and what they hear, as if the brain can be trained to accept bad lip synchronisation when the situation is nonsensical.

• However, if the same audience watches a German actor dubbed into another language such as French, they strongly object to the inconsistency between the speech and the mouth synchronisation.

 2001 Christian Martyn Jones

Cartoon lip sync.



• Cartoon lip synchronisation does not require the same level of accuracy as its human counterpart. Instead, animators generally supply only keyframes of the articulation, and the human mind ‘fills in’ the gaps. The audience perceive the characters in an animation to be life-like yet at the same time not human, and so are happy to ignore inaccuracies in the lip synchronisation which would otherwise be unacceptable. Animators are therefore able to get away with poor articulation models for simple characters.


• Disney animators found that although mouth motion required exact synchronisation, other speechreading cues, such as head or body gestures, needed to be synchronised three to four frames ahead of the visual action.
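As a toy illustration of that three-to-four-frame lead (hypothetical function and constant names; only the frame counts come from the slide):

```python
GESTURE_LEAD_FRAMES = 3  # the rule of thumb above: gestures lead by 3-4 frames

def gesture_start_frame(mouth_frame: int, lead: int = GESTURE_LEAD_FRAMES) -> int:
    """Schedule a head/body gesture slightly ahead of the mouth action."""
    return max(0, mouth_frame - lead)

print(gesture_start_frame(120))  # 117: the nod begins before the syllable
```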

Audio/visual mis-matches

• McGurk effect: what do you …hear? …see? …with both?

 2001 Christian Martyn Jones

Speech and Natural Language Processing

Speech – Slide 41

 2001 Christian Martyn Jones

McGurk research

• ……see results



• The work involved filming a female subject while she repeatedly uttered ‘ba-ba’, ‘ga-ga’, ‘pa-pa’, and ‘ka-ka’, and generating four dubbed audio-video sequences in which the original soundtrack and lip movements were combined correctly and mis-matched as: [ba] (voice)-[ga] (lips), [ga] (voice)-[ba] (lips), [pa] (voice)-[ka] (lips), and [ka] (voice)-[pa] (lips).

 2001 Christian Martyn Jones

Speech and Natural Language Processing

Speech – Slide 43

Why does it occur?

• The McGurk effect can be explained in terms of visual similarity.

• Speechreading studies show that the lip positions for [ga] are frequently misread as [da], [ka] is misread as [ta], and [pa] as [ba].

• McGurk assumed that the acoustic information for [ba] and [da] contained some common features which were not present in [ga]. Thus a [ba] (voice)-[ga] (lips) presentation provided the viewer with visual information common to [ga] and [da], and auditory information with features common to [da] and [ba]. The spectator would then respond with the phone-code for which there was most data: [da].

• A similar explanation was proposed for the [pa] (voice)-[ka] (lips) effect, and for reversing the audio and visual stimuli. When the acoustic information does not bear any similarity to the articulation, as in the case of [ka] (voice)-[pa] (lips), the viewer invariably must guess the spoken message and responds with combinations such as [kapka], [pakpa], etc.
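A toy sketch of that feature-overlap account (my illustrative encoding, not McGurk's actual model): each stimulus contributes a set of candidate phones, and the percept is the candidate supported by both channels.

```python
# Candidate phones suggested by each stimulus, per the account above.
AUDITORY = {"ba": {"ba", "da"}, "pa": {"pa", "ta"}}   # acoustic cues overlap
VISUAL   = {"ga": {"ga", "da"}, "ka": {"ka", "ta"}}   # lip shapes are misread

def percept(voice: str, lips: str) -> str | None:
    """Return the phone supported by both channels, if any."""
    common = AUDITORY.get(voice, set()) & VISUAL.get(lips, set())
    return next(iter(common), None)   # None -> the viewer must guess

print(percept("ba", "ga"))  # 'da': the classic fused McGurk percept
print(percept("pa", "ka"))  # 'ta'
print(percept("ka", "pa"))  # None: no shared features, hence [kapka] guesses
```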

 2001 Christian Martyn Jones

Speech and Natural Language Processing

Speech – Slide 44

6

…further work

• Burnham and Dodd, researching in Australia, found that the McGurk effect was observed in infants as young as 4 months old, and that the auditory [ba] and visual [ga] could be perceived as [dh] as well as [da].

• They also noted that the effect transcends language and phonological constraints.

• Work by Massaro has considered the use of synthetic faces in the McGurk effect. Massaro has extended the sensory mismatch to include auditory /b/ and visual /d/ being perceived as /w/.

…by now

• You should have an appreciation of: speech, how it is produced, which articulators are involved, how to annotate speech using symbols, how speech appears visually, similarities in visual speech, and mis-information cues and difficulties for the hearing impaired.

• Next… how this affects the acoustics of speech…

© 2001 Christian Martyn Jones, Speech and Natural Language Processing