
ARTICULATORY MODELLING OF THE VOCAL TRACT IN FEEDING FROM X-RAY IMAGES

A. Serrurier 1, A. Barney 1

1 Institute of Sound and Vibration Research, University of Southampton, UK

Abstract: Two of the major functions of the human vocal tract are feeding and speaking. Ontogenetically and phylogenetically, feeding tasks precede speaking tasks, and it has been hypothesized that speaking movements constitute a subset of feeding movements. This study investigates whether the vowels /a/ /i/ /u/ can be articulated using feeding movements. Midsagittal tongue surfaces have been extracted from a Digital Videofluoroscopy film of liquid swallowing, and a 5-component articulatory model has been derived, explaining 96% of the tongue variance. Acoustic transfer functions have been estimated by means of an expansion model from midsagittal measurements to area function and an acoustic wave propagation model. The articulations optimally approaching the acoustic and articulatory characteristics of /a/ /i/ /u/ have been extracted from both the data and the model. The results show that the model can produce three /a/ /i/ /u/-like articulations whose points in the acoustic plane F1-F2 reach the /a/ /i/ /u/ ellipses of the literature, suggesting that speech articulations could indeed be producible from feeding movements. These results support the hypothesis that speech movements might have evolved from feeding movements.

Keywords: Speech, Feeding, Articulatory Modelling

I. INTRODUCTION

Three main functions can be ascribed to the human vocal tract: breathing, eating and speaking. The tasks of eating and speaking both involve articulatory movements requiring significant coordination. Since both ontogenetically and phylogenetically feeding tasks precede speaking tasks, MacNeilage [1] suggests in his influential paper on the frame/content theory of the evolution of speech production that speech cyclicity may have evolved from feeding cyclicity. Following this theory, Hiiemae & Palmer [2] suggest that, although controversial, it is reasonable to hypothesise that tongue movements in speech are derived from tongue movements in feeding. Our study explores this hypothesis on the basis of articulatory measurement and modelling, to determine whether the movements found in speech are a subset of those found in feeding. We present here an analysis based on an imaging study of the vocal tract (VT) during feeding.

An earlier study [3] used ElectroMagnetic Articulography (EMA) to record the motion of the jaw, tongue and lips during speech and feeding tasks. The articulatory model developed from these data, recorded on a single subject, suggested that the movements of the tongue and jaw in speech might indeed form a subset of the tongue- and jaw-based feeding movements. For speech, the underlying assumptions of models built from EMA data have generally been well tested, but for feeding we must question whether they are equally valid. For example, it has been demonstrated that three tongue points, as measured by EMA, are sufficient to recover the full 3D shape of the tongue for speech tasks (see e.g. [4]), but no such validation has been attempted for feeding. A similar question arises for the other articulators. Further, EMA gives no data about the static parts of the VT boundary and therefore does not allow estimates of the acoustic response associated with the various geometrical configurations adopted in feeding. To overcome these limitations it is necessary to use a data collection method that images the VT boundaries from glottis to lips but also has sufficient time resolution to capture the dynamics of the motion of the articulators. We have therefore chosen to use Digital Video Fluoroscopy (DVF), which permits the recording of X-ray films with a relatively low dose exposure. For the work presented here we focus on the motion of the tongue only.

In the debate on speech evolution, emphasis has been placed on the three point vowels /a/, /i/ and /u/, known as the quantal vowels. These three vowels are considered to delimit acoustically the maximum space of a vowel system (e.g. [5]). We are therefore interested here in whether we can extract from the data, or from an articulatory model based on the data, articulations which are geometrically and/or acoustically close to the quantal vowels. As benchmarks, we define the acoustic targets as typical F1-F2 values for /a/, /i/ and /u/, and the articulatory targets as the tongue shapes representing the typical /a/, /i/ and /u/ geometry [6].

II. METHOD

Data were collected at Hospitais da Universidade de Coimbra, Portugal using a single, native European Portuguese, female subject, age 40 years, with no known pathology of the upper respiratory tract and normal


swallowing function. Full local ethical approval was obtained prior to carrying out the study. DVF images of the upper VT, from the lips to just below the glottis, were recorded during feeding tasks at a frame rate of 25 images/s. The feeding corpus [3] was designed in collaboration with a speech and language therapist, guided by the UK National Descriptors for Texture Modification in Adults [7]. The food was divided into three categories covering a wide range of food textures, from saliva to Hard Food. For this study, DVF images were obtained of Liquids (water, pourable custard), Solid Food (thick custard) and Hard Food (shortbread biscuit). A typical image from the DVF sequence is shown in Fig. 1a.

Fig. 1 A typical frame from the DVF image sequence (a, original DVF image) and the manual segmentation of the hard palate, the tongue and the jaw contours (b, manually segmented image).

Ideally, all images from all recordings would be analysed. Manual processing of images is, however, time consuming, so we have limited our pilot study to the recordings of pourable custard: a film sequence lasting about 14 s and containing two complete swallows. This choice is motivated by our EMA study [3], where we could extract most of the representative articulations from liquid swallowing. Custard was chosen here over water since it produces higher-contrast images, which facilitates analysis. Additionally, to make processing tractable, only every second image has been analysed (i.e. 186 images), equivalent to a frame rate of 12.5 images per second. Each image has been manually segmented to extract the contours of the tongue from epiglottis to tip. Despite some spatial blurring and some masking of the tongue surface by other anatomical features on certain images, the complete tongue surface has been carefully outlined (Fig. 1b) by the first author of this article, who has significant experience in manual VT segmentation. The contour of the hard palate has also been extracted for each image, to act as a reference for image alignment. Finally, the full set of VT articulators (i.e. epiglottis, larynx, back pharyngeal wall, velum, jaw, and lips) has been outlined for one specific articulation, for use as a basis for acoustic modelling. The entire set of contours has then been calibrated in cm and aligned using the hard palate reference. Following common practice, these

contours are considered as the midsagittal outlines of the VT (see e.g. [8]).

Two methods were used to assess whether the feeding articulations gave good approximations to the targets: one based on the raw tongue profile data and the other based on an articulatory tongue model derived from the data. The modelling process consists of extracting the statistically independent articulatory components of the tongue movement during the feeding task. The articulatory model has the advantage over the raw data of encompassing all task-derivable articulations theoretically producible by the tongue, even though they may not be present in the recorded data. The tongue surface modelling was based on Principal Component Analysis (PCA), following the protocol described in [9]. Each tongue contour was re-interpolated using 200 regularly spaced points from tongue tip to epiglottis. A PCA was then applied to the 186 × 200 × 2 matrix of X-Y coordinates of the tongue points (186 frames × 200 points × 2 coordinates). Table 1 shows that five components are enough to explain 96% of the variance of the tongue points, with a reconstruction error of 0.12 cm.

Table 1 Explained variance, cumulative explained variance and cumulative RMS reconstruction error for the tongue for each of the articulatory parameters

Parameter   Var.    Cum. Var.   RMS
P1          62 %    62 %        0.35 cm
P2          17 %    79 %        0.26 cm
P3           9 %    88 %        0.20 cm
P4           5 %    93 %        0.15 cm
P5           3 %    96 %        0.12 cm
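To make the modelling step concrete, the sketch below shows how such an articulatory model can be derived with a standard PCA, assuming the 186 segmented contours have already been re-interpolated to 200 points each; the array names and file layout are illustrative, not those of the original study.

```python
import numpy as np

# contours: hypothetical array of segmented tongue contours,
# shape (186, 200, 2) -- 186 frames, 200 points, X-Y coordinates in cm.
contours = np.load("tongue_contours.npy")  # illustrative file name
n_frames = contours.shape[0]

# Flatten each contour into a 400-dimensional observation vector.
X = contours.reshape(n_frames, -1)            # shape (186, 400)
mean_shape = X.mean(axis=0)                   # mean (neutral) tongue shape
Xc = X - mean_shape

# PCA via SVD of the centred data matrix.
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)               # variance fraction per component

n_comp = 5                                    # five components, as in Table 1
components = Vt[:n_comp]                      # (5, 400) articulatory basis
params = Xc @ components.T                    # (186, 5) parameter time series

# Cumulative RMS reconstruction error (cm) with the first 5 components,
# comparable to the last column of Table 1.
X_hat = params @ components + mean_shape
rms = np.sqrt(np.mean((X - X_hat) ** 2))
print(f"cum. variance: {explained[:n_comp].sum():.2%}, RMS error: {rms:.2f} cm")
```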

For both the articulatory and the acoustic cases we have defined a distance measure for the error between prediction and target. The articulatory distance is the Root Mean Square (RMS) distance between the target set of points and a measured set describing the tongue surface. The acoustic distance is the relative RMS distance between a target set of formants and those derived from an estimated acoustic transfer function.

The process to obtain, from the data and from the model, the articulations closest to our articulatory and acoustic targets was as follows. (1) An optimisation was performed on the articulatory model parameters to obtain the tongue contours that minimized the articulatory distance to three manual approximations of the articulatory target contours for the quantal vowels (Figs. 2a to 2c). (2) The three tongue contours were inserted into a fixed, midsagittal VT contour (Fig. 2d). Note that currently the epiglottis and the larynx are fixed, sometimes leading to anatomical abnormalities in the lower pharyngeal region (see e.g. Fig. 2a).
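As an illustration of the distance measures and of step (1), the sketch below implements the articulatory and acoustic distances and an optimisation over the five model parameters. It assumes the `components` and `mean_shape` arrays from the previous sketch and a hypothetical `target_contour` approximating one quantal-vowel tongue shape; the choice of scipy's Nelder-Mead optimiser is our assumption, as the paper does not state which optimiser was used.

```python
import numpy as np
from scipy.optimize import minimize

def articulatory_distance(contour, target):
    # RMS distance (cm) between two (200, 2) point sets.
    return np.sqrt(np.mean(np.sum((contour - target) ** 2, axis=1)))

def acoustic_distance(formants, target_formants):
    # Relative RMS distance between predicted and target formants (Hz).
    f, t = np.asarray(formants, float), np.asarray(target_formants, float)
    return np.sqrt(np.mean(((f - t) / t) ** 2))

def synthesize_contour(params, components, mean_shape):
    # Tongue contour generated by the 5-parameter articulatory model.
    return (params @ components + mean_shape).reshape(200, 2)

def fit_to_target(target_contour, components, mean_shape):
    # Step (1): find the model parameters whose contour is closest to a
    # manual approximation of a quantal-vowel tongue shape.
    def cost(p):
        c = synthesize_contour(p, components, mean_shape)
        return articulatory_distance(c, target_contour)
    res = minimize(cost, x0=np.zeros(5), method="Nelder-Mead")
    return res.x, res.fun
```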


(3) The VT contours with the three tongue articulations were intersected from the glottis to the lips by a fixed grid (Fig. 2d). (4) For each grid line we measured the distance between the points of intersection with the VT contours, considered as the diameter of the VT tube for the purpose of area function generation. Wherever grid lines were not perpendicular to the VT tube central axis, the measured distances were corrected using the method described in [9]. (5) Area functions were computed from each of the three sets of midsagittal distances according to the alpha-beta model [10], using values appropriate for women (Fig. 3). Since the articulations represent vowels, areas calculated as less than 0.1 cm² have been set equal to 0.1 cm². (6) Planar acoustic wave propagation was simulated with these area functions as described in [6]. The area function for /u/ was artificially lengthened by a tube 2 cm long and 0.2 cm² in area to simulate the expected lip protrusion. (7) F1-F2 for each of the acoustic transfer functions were extracted, and the acoustic distances between these F1-F2 and the target F1-F2 were computed. (8) An optimisation loop on the model parameters, including steps (2) to (7), was performed to obtain the three tongue contours minimizing the acoustic distance to the quantal vowel targets. These articulations and their associated area functions are considered as the closest model articulations to the acoustic targets (solid lines, Figs. 2 & 3). (9) The articulations from the data with minimal articulatory distance to the articulations obtained in step (8) were finally extracted (dashed lines, Figs. 2 & 3).

Fig. 2 Midsagittal contours of the tongue from the model (solid line) and from the data (dashed line) having best fit to the articulatory targets /a/ (a), /i/ (b) and /u/ (c), plotted on fixed midsagittal VT contours to give context. In (d), intersection of the VT articulated for /i/ with the grid. Axes X and Y in cm.
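Steps (5) to (7) can be sketched as follows: the midsagittal distances are expanded to an area function with a power-law (alpha-beta) model, the vowel area floor and the /u/ lip tube are applied, and formants are read off a lossless concatenated-tube transfer function with an ideal open termination. The alpha and beta values below are illustrative placeholders, not the region-dependent female values of [10], and the simple chain-matrix propagation stands in for the simulation described in [6].

```python
import numpy as np

RHO, C = 1.14e-3, 35000.0   # air density (g/cm^3), sound speed (cm/s); nominal values

def alpha_beta_area(d, alpha=1.8, beta=1.5):
    # Step (5): expand midsagittal distances d (cm) to areas (cm^2) with
    # A = alpha * d**beta. The real alpha/beta vary along the tract [10];
    # single illustrative values are used here.
    return alpha * np.power(d, beta)

def transfer_function(areas, lengths, freqs):
    # Step (6): lossless plane-wave propagation through concatenated tube
    # sections, chain (ABCD) matrices from glottis to lips, p = 0 at lips.
    H = np.zeros_like(freqs, dtype=complex)
    for i, f in enumerate(freqs):
        k = 2 * np.pi * f / C
        M = np.eye(2, dtype=complex)
        for A, L in zip(areas, lengths):
            Zc = RHO * C / A   # characteristic impedance of the section
            T = np.array([[np.cos(k * L), 1j * Zc * np.sin(k * L)],
                          [1j * np.sin(k * L) / Zc, np.cos(k * L)]])
            M = M @ T
        H[i] = 1.0 / M[1, 1]   # volume-velocity ratio U_lips / U_glottis
    return np.abs(H)

def first_formants(areas, lengths, n=2, fmax=4000.0, df=5.0):
    # Step (7): pick the first n peaks of |H(f)| as formant estimates.
    freqs = np.arange(df, fmax, df)
    mag = transfer_function(areas, lengths, freqs)
    peaks = [freqs[i] for i in range(1, len(mag) - 1)
             if mag[i] > mag[i - 1] and mag[i] > mag[i + 1]]
    return peaks[:n]

# Illustrative 15 cm tract: vowel area floor of 0.1 cm^2, then the
# /u/ lip-protrusion tube (2 cm long, 0.2 cm^2) appended at the lips.
areas = np.maximum(alpha_beta_area(np.linspace(0.3, 2.0, 30)), 0.1)
lengths = np.full_like(areas, 0.5)            # 30 sections of 0.5 cm
areas_u = np.append(areas, [0.2, 0.2, 0.2, 0.2])
lengths_u = np.append(lengths, [0.5] * 4)
print(first_formants(areas_u, lengths_u))     # approximate F1, F2 in Hz
```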

Fig. 3 Area functions (cm²) for /a/, /i/ and /u/ derived from the articulations plotted in Fig. 2, for the model (solid line) and the data (dashed line).

III. RESULTS

Three articulations extracted from the DVF film of pourable custard swallowing have been selected as the closest articulatory and acoustic articulations to the quantal vowels. Similarly, three articulations have been produced from the articulatory model as the closest articulatory and acoustic estimates theoretically producible by the subject. We observe (Figs. 2 & 3) that the tongue shape and position match well overall the patterns expected for these articulations: low and backwards for /a/, up and forwards for /i/, and up and backwards for /u/, although two constrictions are observed for the midsagittal contour of the model /u/ (solid line, Fig. 2c). Predicted formants are shown in the F1-F2 plane in Fig. 4, with dispersion ellipses adapted from [11]. For the articulations extracted from the data, none of the F1-F2 points are inside the ellipses. The low F1 observed for /i/ and /u/ may relate to the smaller than expected constrictions in the area functions (in fact, complete occlusion of the tract, see Figs. 2b & 2c); their null measurements have been artificially set to 0.1 cm² in the area functions (Fig. 3). The F1-F2 computed for the three model articulations fall just inside the ellipses (on the border for /a/ and /i/).
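Whether a predicted F1-F2 point falls inside a vowel dispersion ellipse can be checked, for instance, with a Mahalanobis-distance test; the ellipse centre and covariance below are placeholders for illustration, not the values adapted from [11].

```python
import numpy as np

def inside_ellipse(point, centre, cov, n_std=2.0):
    # A formant pair lies inside the dispersion ellipse if its Mahalanobis
    # distance to the ellipse centre is below the chosen number of std devs.
    d = np.asarray(point, float) - np.asarray(centre, float)
    m2 = d @ np.linalg.inv(cov) @ d
    return np.sqrt(m2) <= n_std

# Illustrative /i/ ellipse: centre (F1, F2) in Hz and diagonal covariance.
centre_i = np.array([270.0, 2290.0])
cov_i = np.array([[40.0**2, 0.0], [0.0, 200.0**2]])
print(inside_ellipse([300.0, 2200.0], centre_i, cov_i))  # True
```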


Fig. 4 Position in the F1-F2 plane (F1 and F2 in Hz) of the articulations of Figs. 2 & 3, for the model and the data. The ellipses for the vowels /a/, /i/ and /u/ are adapted from [11].

IV. DISCUSSION AND CONCLUSION

Our results show that, while feeding, we can articulate tongue shapes that match the typical midsagittal geometry of the quantal vowels, with predicted F1-F2 close to the corresponding ellipses reported by [11]. Using an articulatory model, we have shown that the feeding movements representative of liquid swallowing allow articulation of tongue shapes which follow the typical midsagittal patterns found for the quantal vowels, with predicted F1-F2 falling within the corresponding ellipses reported by [11]. In other words, it seems possible to articulate the point vowels /a/, /i/ and /u/ from feeding movements.

This article complements our previous study based on EMA recordings of a French male subject [3] by investigating another recording technique and a new subject (Portuguese, female). Our results support the general conclusion of [3] that speech articulations can largely be found within the set of feeding movements. More generally, our findings support the hypothesis that speech movements might have evolved from feeding movements, with the caveat that we have not considered the question of control.

The results obtained from this pilot study appear promising. We have shown the benefit of using DVF films, which allow derivation of acoustic propagation simulations in the VT. An extended study should, however, include a more representative set of feeding tasks, extending the analysis beyond the two swallows of liquid considered here to the complete data set. We have moreover seen that limiting our analysis to the tongue contours may lead to geometrical artefacts in the predicted VT. Future work should also include the other articulators which impact on area function computation (e.g. larynx, velum, lips).

ACKNOWLEDGEMENTS

The authors wish to acknowledge A. Matos & R. Santos (Hospitais da Universidade de Coimbra, Portugal), Dr M. Collins (Southampton General Hospital, UK) and Dr P. Badin (GIPSA-lab, Grenoble Universities, France) for their assistance with this work. This work is part of the HandtoMouth project funded under the EC NEST initiative.

REFERENCES

[1] P. F. MacNeilage, "The frame/content theory of evolution of speech production", Behav. Brain Sci., vol. 21, pp. 499-546, 1998.
[2] K. M. Hiiemae and J. B. Palmer, "Tongue movements in feeding and speech", Crit. Rev. Oral Biol. Med., vol. 14(6), pp. 413-429, 2003.
[3] A. Serrurier, A. Barney, P. Badin, L.-J. Boë and C. Savariaux, "Comparative articulatory modelling of the tongue in speech and feeding", Proc. 8th ISSP, Strasbourg, France, pp. 325-328, 2008.
[4] Y. Tarabalka, P. Badin, F. Elisei and G. Bailly, "Can you 'read tongue movements'? Evaluation of the contribution of tongue display to speech understanding", Proc. ASSISTH'2007, France, 2007.
[5] J. Liljencrants and B. Lindblom, "Numerical simulation of vowel quality systems: the role of perceptual contrast", Language, vol. 48, pp. 839-862, 1972.
[6] G. Fant, Acoustic Theory of Speech Production, The Hague: Mouton & Co., 1960.
[7] National Descriptors for Texture Modification in Adults, British Dietetic Association and the Royal College of Speech and Language Therapists, http://tinyurl.com/mf9lsc, last accessed September 3rd 2009.
[8] D. Beautemps, P. Badin and G. Bailly, "Linear degrees of freedom in speech production: Analysis of cineradio- and labio-film data and articulatory-acoustic modelling", J. Acoust. Soc. Am., vol. 109(5), pp. 2165-2180, 2001.
[9] A. Serrurier and P. Badin, "A three-dimensional articulatory model of the velum and nasopharyngeal wall based on MRI and CT data", J. Acoust. Soc. Am., vol. 123(4), pp. 2335-2355, 2008.
[10] A. Soquet, V. Lecuit, T. Metens and D. Demolin, "Mid-sagittal cut to area function transformations: Direct measurements of mid-sagittal distance and area with MRI", Speech Commun., vol. 36(3-4), pp. 169-180, 2002.
[11] G. E. Peterson and H. L. Barney, "Control methods used in a study of the vowels", J. Acoust. Soc. Am., vol. 24, pp. 175-184, 1952.
