Speech
Speech production
• What's so good about speech?
1. Speech is a faculty unique to humans and one of the most important, requiring the precise control and co-ordination of over eighty different muscles, making speech the highest learned skill a human can achieve.
2. Speaking generally requires around 1,500 muscle commands every second, yet needs only a few years, as children, to perfect.
3. Although a primary means of communication in itself, speech can convey other messages through accents, tone, pitch, and quality.
4. ….but there have been only limited achievements in machine speech recognition and synthesis.
• The act of speech involves three major anatomical subsystems:
1. the respiratory system, including the lungs, rib cage, and diaphragm;
2. the phonatory system, which includes the larynx;
3. the articulatory system, featuring the lips, teeth, tongue, and jaw.
Respiratory system
Phonatory system
[Diagram of the vocal tract, labelling the nasal cavity, alveolar ridge, velum, teeth, lips, epiglottis, tongue, esophagus, glottis, larynx, lungs, and diaphragm]
Articulatory system
Articulation of consonants
• ….considering in turn how we create consonants, vowels, and other sounds.
• The classification of consonant production involves:
1. the place of articulation (the relative positions of the lips, teeth, and tongue),
2. the manner of articulation (how the airstream from the lungs is obstructed: stops, fricatives, affricates, nasals, liquids, and glides), using the glottis, tongue, teeth, lips, and nasal cavity,
3. and whether the vocal cords are set to vibrate.
Annotation
• …..will use American English rather than British English, but why?
• my work • most other researchers' work
• ….and will also simplify: regional accents are not considered.
• ASR takes an acoustic waveform as input and produces a string of words as output.
• TtS takes a sequence of text words and produces an acoustic waveform.
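A minimal sketch of these two input/output contracts, assuming Python with NumPy for the waveform; recognise and synthesise are hypothetical stand-ins for illustration, not a real library API:

```python
# Sketch of the ASR and TtS input/output contracts; the function bodies
# are placeholders, not real recognisers or synthesisers.
import numpy as np

def recognise(waveform: np.ndarray) -> str:
    """ASR: acoustic waveform in, string of words out."""
    raise NotImplementedError  # placeholder for a real recogniser

def synthesise(words: str) -> np.ndarray:
    """TtS: sequence of text words in, acoustic waveform out."""
    raise NotImplementedError  # placeholder for a real synthesiser
```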
• …two main alphabet standards:
• 1. International Phonetic Alphabet (IPA): a standard originally developed by the International Phonetic Association in 1888 with the idea of transcribing all human languages. It is more than just a set of symbols (e.g. a one-to-one relationship to sounds), and transcriptions differ for different languages.
• 2. ARPAbet: uses ASCII characters rather than more 'non-standard' characters, which makes it much easier to create phonetic dictionaries, syntactic and semantic rules, and build them into ASR systems.....as we will see.
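To illustrate why ASCII symbols make dictionary building easy, here is a minimal sketch of an ARPAbet pronunciation dictionary in Python, in the spirit of resources such as the CMU Pronouncing Dictionary; the entries are illustrative transcriptions, not taken from the slides:

```python
# A tiny ARPAbet phonetic dictionary: plain ASCII strings, so entries can
# be stored, compared, and searched with ordinary string operations.
ARPABET_DICT = {
    "beet": ["B", "IY", "T"],
    "bit":  ["B", "IH", "T"],
    "bait": ["B", "EY", "T"],
    "boot": ["B", "UW", "T"],
    "book": ["B", "UH", "K"],
}

def transcribe(word: str) -> list[str]:
    """Look up the ARPAbet phone sequence for a word."""
    return ARPABET_DICT[word.lower()]

print(transcribe("beet"))  # ['B', 'IY', 'T']
```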
Place of articulation
• …..refers to the relative positions of the lips, teeth, and tongue.
• There are six distinct types of classification: bilabial, labiodental, interdental, alveolar, alveo-palatal, and velar.
• The six places of articulation
[Diagram of the six places of articulation, labelling the nasal cavity, soft palate, uvula, hard palate, alveolar ridge, lips, tip, blade, and back of the tongue, and jaw]
• Bilabial, labiodental, interdental, alveolar, alveo-palatal, and velar describe the parts of the vocal tract which are responsible for the obstruction of the airflow from the lungs.
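A minimal sketch of this classification as a lookup table; the phone-to-place pairings are standard phonetics examples, assumed rather than listed on the slides:

```python
# Mapping a few English consonants (ARPAbet) to their place of articulation.
PLACE = {
    "P":  "bilabial",       # lips together, as in 'pat'
    "B":  "bilabial",
    "F":  "labiodental",    # lower lip against upper teeth, as in 'fat'
    "TH": "interdental",    # tongue between the teeth, as in 'thin'
    "T":  "alveolar",       # tongue tip at the alveolar ridge, as in 'tap'
    "SH": "alveo-palatal",  # blade of tongue near the hard palate, as in 'ship'
    "K":  "velar",          # back of tongue against the soft palate, as in 'cat'
}

print(PLACE["T"])  # alveolar
```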
• The degree of obstruction the airstream incurs must also be considered ……this is the manner of articulation.
Manner of articulation
• The manner of articulation describes how, and to what degree, air from the lungs is obstructed.
• Terms used: stops, fricatives, affricates, nasals, lateral, retroflex, and glides.

Voicing
• Voicing is the vibration of the vocal cords in order to change the characteristics of the airstream through the mouth or nose, and the overall acoustic nature of the phone.
• ...sounds that are generated with the vocal cords vibrating are voiced, and conversely those sounds requiring static vocal cords are voiceless.
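Manner and voicing combine naturally into one table. A minimal sketch, with phone choices assumed from standard phonetics descriptions rather than taken from the slides:

```python
# Pairing a few ARPAbet consonants with (manner, voiced) classifications.
MANNER_VOICING = {
    "P":  ("stop", False),       # voiceless bilabial stop, 'pat'
    "B":  ("stop", True),        # voiced bilabial stop, 'bat'
    "S":  ("fricative", False),  # 'sip'
    "Z":  ("fricative", True),   # 'zip'
    "CH": ("affricate", False),  # 'chip'
    "M":  ("nasal", True),       # 'mat'
    "L":  ("lateral", True),     # 'lap'
    "R":  ("retroflex", True),   # 'rap'
    "W":  ("glide", True),       # 'wet'
}

manner, voiced = MANNER_VOICING["Z"]
print(manner, "voiced" if voiced else "voiceless")  # fricative voiced
```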
Articulation of vowels
• The articulation of American English vowels is not as well defined as that of consonants, and can vary a great deal from speaker to speaker, especially due to dialect variations.
• Vowels generally present little obstruction and require wide-open mouth positions.

Features to consider
• What we look for are: 1. tongue elevation, 2. the part of the tongue involved, 3. tongue muscle tension, 4. mouth shape.
• The description of vowels will consider the motion of the tongue and lips.
Elevation of the tongue
• Tongue elevation is normally categorised simply in terms of high, mid, or low positions.

Examples
• high front: /iy/ in 'beet' and /ih/ in 'bit'
• mid front: /ey/ in 'bait' and /eh/ in 'bet'
• low front: /ae/ in 'bat'
• high back: /uw/ in 'boot' and /uh/ in 'book'
• mid back: /ow/ in 'boat' and /ao/ in 'bought'
• low back: /aa/ in 'bott'
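These height and region labels can be captured as simple feature pairs. A minimal sketch following the examples above; the back-vowel height labels are assumed from standard descriptions of American English:

```python
# The slide's vowel examples as (height, region) pairs, keyed by ARPAbet symbol.
VOWEL_FEATURES = {
    "IY": ("high", "front"),  # 'beet'
    "IH": ("high", "front"),  # 'bit'
    "EY": ("mid",  "front"),  # 'bait'
    "EH": ("mid",  "front"),  # 'bet'
    "AE": ("low",  "front"),  # 'bat'
    "UW": ("high", "back"),   # 'boot'
    "UH": ("high", "back"),   # 'book'
    "OW": ("mid",  "back"),   # 'boat'
    "AO": ("mid",  "back"),   # 'bought'
    "AA": ("low",  "back"),   # 'bott'
}

height, region = VOWEL_FEATURES["IY"]
print(height, region)  # high front
```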
Region of the tongue
• We classify the region of the tongue as either front, central, or back.
• We have already seen front vowels: 'beet', 'bit', 'bait', 'bet', and 'bat',
• and back vowels such as: 'boot', 'book', 'boat', 'bought', and 'bott',
• together with central vowels such as /ah/ in 'but' and the schwa vowel in 'machine'.

Tongue tenseness
• Vowels requiring above the 'normal' level of muscle tension are termed tense,
• whilst those in which this degree of tension is not needed are simply relaxed.
• Consider the pairs 'beat' and 'bit', 'bait' and 'bet', and 'boot' and 'book':
- the first phone in each pair is tense, with the tongue slightly higher in the mouth than for the relaxed second phone;
- in addition, front vowels which are tense are articulated with the tongue slightly ahead of a similarly relaxed phone, whilst back tense vowels are pronounced with the tongue further back.
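A minimal sketch of the tense/relaxed pairings as a lookup table, following the word pairs above:

```python
# Each entry maps a tense vowel to its relaxed counterpart, with the slide's
# example word pair.
TENSE_TO_LAX = {
    "IY": ("IH", ("beat", "bit")),
    "EY": ("EH", ("bait", "bet")),
    "UW": ("UH", ("boot", "book")),
}

for tense, (lax, (w1, w2)) in TENSE_TO_LAX.items():
    print(f"/{tense}/ in '{w1}' is tense; /{lax}/ in '{w2}' is relaxed")
```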
Mouth shape
• There is a range of lip positions for all phones,
• however vowels possess some degree of generalisation:
• the vowels of 'beet', 'bit', 'bait', 'bet', 'bat', 'machine', and 'but' are pronounced with unrounded lips,
• whilst the back vowels of 'boot', 'book', 'boat', and 'bought' are rounded.

Other phonetic articulations
• These require more dynamic movement of the tongue and lips than can be described within the constraints of consonants and vowels, and include:
• off-glides, diphthongs, r-colourisation, and the consonant /h/.
Off-glides
• Already considered are the high, front, tense /iy/ ('beet'), the mid, front, tense /ey/ ('bait'), the high, back, tense /uw/ ('boot'), and the mid, back, tense /ow/ ('boat') vowels; however, the articulation of each is more than these simple descriptors can suggest.
• The front vowels (/iy/ and /ey/) are composed of a pure vowel, /i/ and /e/ respectively (idealised positions), followed immediately by the glide /y/ (blade of tongue near the hard palate); similarly, the back vowels involve the vowels /u/ and /o/ followed by /w/.
• Each of the constituent phones is pronounced, and these vowels are known as off-glides, indicating the increased motion of the tongue and lips.

Diphthongs
• Similar in composition to off-glides, diphthongs are complex vowels consisting of a vowel sound followed by the glide /y/ or /w/. However, what separates diphthongs from off-glides is that diphthongs involve considerably greater tongue motion.
• The diphthongs of American English are /aw/ in 'bout', /ay/ in 'bite', and /oy/ in 'boy'.
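A minimal sketch decomposing these complex vowels into their constituent pure vowel and glide, following the analysis above:

```python
# Off-glides and diphthongs as (pure vowel, glide) pairs.
OFF_GLIDES = {
    "iy": ("i", "y"),  # 'beet'
    "ey": ("e", "y"),  # 'bait'
    "uw": ("u", "w"),  # 'boot'
    "ow": ("o", "w"),  # 'boat'
}

DIPHTHONGS = {
    "aw": ("a", "w"),  # 'bout'
    "ay": ("a", "y"),  # 'bite'
    "oy": ("o", "y"),  # 'boy'
}

vowel, glide = DIPHTHONGS["ay"]
print(f"/ay/ = vowel /{vowel}/ + glide /{glide}/, with greater tongue motion")
```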
R-colourisation
• Combinations of a vowel and /r/ sound are termed r-colourisation, and differ from off-glides and diphthongs in that they are in fact two symbols representing a single sound.
• The r-coloured, mid, central vowel /er/, as in 'bird', involves the articulation of two previously described phones at one time:
• the mid, central vowel pronunciation of the schwa, as in 'machine', together with the tongue curl associated with the retroflexed /r/.
• The tongue is again in motion during pronunciation; however, it does not exceed the boundaries of tongue elevation and region, as is the case with diphthongs.

/h/
• As already mentioned, the consonant /h/ assumes the tongue and lip positions of the following vowel. As the articulation of /h/ is very much dependent on that of the vowel, yet unlike vowels does not require the vibration of the vocal cords, it is often described as a voiceless vowel, although grammatically it is a consonant.
Deaf studies
• ….will consider the field of speechreading, addressing the visual similarity in speech.
• Research: a number of studies from the 1950s to the present day, with major works from Jeffers, Nitchie, Berger, and Fisher, plus my own studies; the one we look at here is Nitchie 1979.
….less info for speechreaders
• lip cues for all…..
• Larry Thronson, a sign language instructor and counsellor at the Central Coast Centre for Independent Living in California, has said: "Lipreading is guesswork; under ideal conditions, only about 30 to 40 percent (of the speech) is retained".
• From my research into lip syncing, I found that it can be more like 50% to 60% of the information that is lost.
• How much attention we pay to the visual and auditory information can vary depending on the 'cocktail party effect' and also on the subject matter:
• Predictable words in a conversation are spoken less clearly, as are references to objects present or passed between speakers in view; however, when an object is mentioned for the first time it is named more clearly than on subsequent occasions.
• When speakers are face-to-face their speech becomes degraded, even when they are not actually facing each other. Although visual cues can greatly assist communication, speakers generally rarely look at each other during conversations, which seems to contradict suggestions that speakers adapt their articulation to the needs of the listener. If this were the case then one might speak less clearly when being watched; however, the reverse is true.
• Although we generally use visual cues less frequently than might initially be imagined (typically less than 50% of the time), when the listener does look at the speaker's face the speaker then articulates much more clearly than before.
Eye gaze
• Comfort:
• In the situations of video teleconferencing or cartoon-style animation, the viewer is aware that the image is synthetic and is less prone to feel uneasy looking at the speaker's face. Therefore, the viewer will look at the image for a greater percentage of the overall time, implying that the level of accuracy of the articulation should be as high as possible.
• However, the psychology of human subjects is not always consistent, and in fact listeners find excessive precision in the articulation of the speaker to be generally annoying.
• Similarly, during face-to-face conversations subjects subconsciously negotiate an acceptable level of mutual eye contact depending on the level of intimacy and comfort.
• Note that the very latest computer facial models provide very life-like personas, and it is plausible that in the future such models may well be perceived to be real-life images, in which case listeners will find themselves again unable to make prolonged eye contact during the communication.
Bad lip sync.
• Although most people have not had any formal speechreading training and therefore cannot accurately lip-read, bad lip synchronisation is easily detectable and highly unacceptable to the viewer.
• Extreme cases have been cited by Groß at Oxford University, England, in which films dubbed into different languages cannot be synchronised with the visuals and can cause irritation to the audience.
• In countries such as Germany and France, which regularly dub British or American films, the spectators appear able to ignore the discrepancy between what they see and what they hear, as if the brain can be trained to accept bad lip synchronisation when the situation is nonsensical.
• However, if the same audience watches a German actor dubbed into another language such as French, then they strongly object to the inconsistency between the speech and the mouth synchronisation.
Cartoon lip sync.
• Cartoon lip synchronisation does not require the same level of accuracy as its human counterpart. Instead, animators generally supply only keyframes in the articulation and the human mind 'fills in' the gaps. The audience perceives the characters in animation to be life-like, but at the same time not human, and is thus happy to ignore inaccuracies in the lip synchronisation which would otherwise be unacceptable. Therefore animators are able to get away with poor articulation models for simple characters.
• Disney animators found that although mouth motion required exact synchronisation, other speechreading cues such as head, body, or gestures needed to be synchronised three to four frames ahead of the visual action.
Audio/visual mis-matches
• McGurk effect
• what do you…. ...hear? ...see? …with both?
McGurk research
• ……see results
• The work involved filming a female subject whilst she repeatedly uttered 'ba-ba', 'ga-ga', 'pa-pa', and 'ka-ka', and generating four dubbed audio-video sequences in which the original soundtrack and lip movements were combined correctly and mis-matched as: [ba] (voice)-[ga] (lips), [ga] (voice)-[ba] (lips), [pa] (voice)-[ka] (lips), and [ka] (voice)-[pa] (lips).
why does it occur?
• The McGurk effect can be explained in terms of visual similarity.
• Speechreading studies show that lip positions for [ga] are frequently misread as [da], [ka] as [ta], and [pa] as [ba].
• McGurk assumed that the acoustic information for [ba] and [da] contained some common features which were not present in [ga]. Thus a [ba] (voice)-[ga] (lips) presentation provided the viewer with visual information common to [ga] and [da], and auditory information with features common to [da] and [ba]. The spectator would then respond with the phone-code for which there was most data: [da].
• A similar explanation was presupposed for the [pa] (voice)-[ka] (lips) effects, and for reversing the audio and visual stimuli. When the acoustic information does not bear any similarity to the articulation, as in the case of [ka] (voice)-[pa] (lips), the viewer invariably must guess the spoken message and responds with combinations such as [kapka], [pakpa], etc.
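A minimal sketch of this feature-overlap account: each channel contributes a set of candidate phones, and the response is the candidate with the most combined evidence. The [ba]/[ga] candidate sets follow the slides; the auditory [pa] set is assumed by analogy:

```python
# Feature-overlap account of the McGurk effect: the percept is the phone
# supported by both the auditory and the visual channel, if one exists.
from collections import Counter

# Phones visually confusable with each lip pattern (from speechreading studies).
VISUAL = {"ga": {"ga", "da"}, "ka": {"ka", "ta"}, "pa": {"pa", "ba"}}
# Phones sharing acoustic features with each sound.
AUDITORY = {"ba": {"ba", "da"}, "pa": {"pa", "ta"}}

def perceive(voice: str, lips: str) -> str:
    """Respond with the phone-code for which there is most data."""
    votes = Counter(AUDITORY.get(voice, {voice})) + Counter(VISUAL.get(lips, {lips}))
    best, count = votes.most_common(1)[0]
    # With no overlap between the channels, the viewer must guess.
    return best if count > 1 else f"guess among {sorted(votes)}"

print(perceive("ba", "ga"))  # da: evidence from both channels
print(perceive("pa", "ka"))  # ta
print(perceive("ka", "pa"))  # no overlap, so a guess
```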
….further work
• Burnham and Dodd, researching in Australia, found that the McGurk effect was observed in infants as young as four months old, and that the auditory [ba] and visual [ga] could be perceived as [dh] as well as [da].
• They also noted that the effect transcends language and phonological constraints.
• Work by Massaro has considered the use of synthetic faces in the McGurk effect. Massaro has extended the sensory mismatch to include auditory /b/ and visual /d/ perceived as /w/.

…by now
• you should have an appreciation of: speech, how it is produced, what articulators are involved, how to annotate speech using symbols, how speech appears visually, similarities in visual speech, and mis-information cues and difficulties for the hearing impaired.
• next…....how this affects the acoustics of speech......