An articulatory and acoustic study of "retroflex" and "bunched" American English rhotic sound based on MRI

Xinhui Zhou1, Carol Y. Espy-Wilson1, Mark Tiede2, Suzanne Boyce3

1 Department of Electrical and Computer Engineering, University of Maryland, College Park, USA
2 Haskins Laboratories and MIT R.L.E., USA
3 Department of Communication Sciences and Disorders, University of Cincinnati, USA

Abstract

The North American rhotic liquid has two maximally distinct articulatory variants, the classic "retroflex" and the classic "bunched" tongue postures. The evidence for acoustic differences between these two variants is reexamined using magnetic resonance images of the vocal tract in this study. Two subjects with similar vocal tract dimensions but different tongue postures for sustained /r/ are used. It is shown that these two variants have similar patterns of F1-F3 and zero frequencies. However, the "retroflex" variant has a larger difference between F4 and F5 than the "bunched" one (around 1400 Hz vs. around 700 Hz). This difference can be explained by the geometry differences between these two variants, in particular, the shorter and more forward palatal constriction of the "retroflex" /r/ and the sharper transition between palatal constriction and its anterior and posterior cavities. This formant pattern difference is confirmed by measurement from acoustic data of several additional subjects.

Index Terms: liquid sound, 3D vocal tract, magnetic resonance imaging, finite element analysis, sensitivity function

1. Introduction

It is well known that different speakers may use very different tongue configurations for producing the rhotic liquid of American English [1].
Traditionally, phoneticians have classified the tongue shape for American English /r/ into two maximally distinct types: "retroflex" (with a raised tongue tip and a lowered tongue dorsum) and "bunched" (with a lowered tongue tip and a raised tongue dorsum). Usually these shapes have three supraglottal constrictions along the vocal tract: a constriction narrowing the pharynx, a constriction along the palatal vault and a constriction at the lips. However, the classification as "retroflex" and "bunched" understates the degree of variability found across speakers. This variability is shown in [2], which obtained magnetic resonance (MR) images of the vocal tract for 22 subjects producing sustained /r/. There are a number of subjects whose /r/ configuration appears to be intermediate.

Given the large degree of articulatory difference between "bunched" /r/ and "retroflex" /r/, it might be expected that the two would be acoustically distinct. There have been several attempts to correlate particular tongue configurations and acoustic differences across different types of /r/ at F1, F2, and F3, but no consistent pattern has emerged [1, 3]. In recent years, Espy-Wilson and colleagues have suggested that the higher formants may contain clues to tongue configuration and vocal tract dimensions [4, 5].

In this study, we examine the acoustic variability in F1-F5 associated with different tongue configurations used to produce /r/. In particular, the two maximally distinct and classic configurations, "retroflex" /r/ and "bunched" /r/, are studied toward the long-term goal of finding acoustic signatures for the different types of /r/. Two subjects, producing "retroflex" /r/ and "bunched" /r/ respectively, are used in this study. Ideally, we would like to use a single speaker who naturally produces both a retroflex and a bunched /r/. However, it appears that speakers tend to produce /r/ in the same way across contexts.
Subjects who can produce more than one variant tend to produce /r/ with configurations that still have a lot of similarity [6].

2. Materials and methodologies

2.1. Subject information and data acquisition

Two male native speakers of American English from the database in [2], Speaker 22 with "retroflex" /r/ and Speaker 5 with "bunched" /r/ (see Figure 1), are chosen as subjects. These two subjects are similar in overall height, vocal tract length, and volume and length of palate. The data collected from both speakers includes MRI data of the vocal tract for sustained /r/ (sagittal, axial, and coronal slices) and booth acoustic data for the sustained /r/ and some nonsense and real words containing /r/. MR imaging was performed on a 1.5 Tesla G.E. machine. The scanning sequence used was FMPSPGR (Fast MultiPlanar SPoiled GRadient echo) with TR (repetition time) 110 ms and TE (echo time) 4.2 ms. The slice thickness is 3 mm for coronal slices at the palatal constriction and 5 mm for all other slices. The field of view of each image is 240 mm by 240 mm and the image size is 256 by 256 pixels.

2.2. 3D vocal tract reconstruction and FEM

The medical image processing software MIMICS (Materialise, Inc) was used to process the MR images to obtain a 3D reconstruction of the vocal tract, and the geometry is represented in STL (STereoLithography) format. The finite element method (FEM) was applied to this geometry using the COMSOL MULTIPHYSICS package. Harmonic analysis was performed with a hard-wall boundary condition and a pressure-release condition at the lips. The excitation at the glottis was the normal velocity profile of a sinusoidal signal.

2.3. Area function extraction

The area function of the vocal tract was extracted based on the reconstructed 3D geometry. The wave propagation property resulting from the 3D FEM is used to guide the area function extraction.
As the curvature of the vocal tract changes, the cutting orientation in our method was adjusted to be approximately parallel to the pressure isosurface at frequency 500 Hz, as shown in Figure 2.

Figure 1: Midsagittal MR images of the vocal tract for retroflex and bunched shapes (a subset of the database in [2]). (a) The retroflex tongue shape. (b) The bunched tongue shape.

Figure 2: Area function extraction for retroflex and bunched shapes on the reconstructed 3D geometry (straight line indicates the cutting plane). (a) The retroflex tongue shape (Speaker 22). (b) The bunched tongue shape (Speaker 5).

3. Results

3.1. Reconstructed 3D vocal tract geometries and FEM-based acoustic response

The reconstructed 3D vocal tract shapes for the retroflex and the bunched /r/ are shown in Figure 2. Both of them have a large front cavity. Neither has a sublingual space underneath the tongue. The retroflex /r/ has a shorter and more forward palatal constriction. The volume of the back cavity posterior to the palatal constriction is larger for the retroflex /r/ because the tongue dorsum is lowered, and the transition between the palatal constriction and its anterior and posterior cavities is sharper. The tongue root of the bunched /r/ is relatively closer to the pharyngeal wall. As a result, there is a tighter contact with the epiglottis in the bunched /r/ and the area of the cross section in the pharyngeal cavity is smaller.

Figures 3 and 4 show the 3D FEM result of the acoustic responses. It can be seen that F1-F3 are in the normal range of an /r/ sound for both subjects. However, F4 and F5 are significantly different between these two subjects. The difference between F4 and F5 is much larger for the retroflex /r/ than for the bunched /r/ (about 1400 Hz vs. about 700 Hz). We speculate that this difference in F4 and F5 is due to the difference in vocal tract shapes caused by the different tongue shapes. The formant values from 3D FEM are listed in Table 1 along with results from Section 3.2 and Section 3.3. Zero frequencies above 5000 Hz are produced in both cases due to the cross modes, but they lie above F1-F5, which is the range of interest.

Figure 3: Area function and the acoustic responses of the retroflex /r/

Figure 4: Area function and the acoustic responses of the bunched /r/

3.2. Sensitivity functions and simple tube modeling

The cutting planes for the area function are shown in Figure 2 and the resulting area functions are shown in Figures 3 and 4. The acoustic responses computed by VTAR [7] based on the area functions match well with the results from the 3D FEM (see Table 1). For simplification, the 3D FEM model does not take the effect of radiation into account, whereas VTAR can do so conveniently. It was found that radiation affects only F2, lowering it so that it is closer to that measured from acoustic data (see Table 1). However, the vocal tract modeling using the area function has not produced the zeros which are revealed by 3D FEM. This result is not surprising, given that there are no side branches in the computer vocal tract model, which assumes plane wave propagation.

To get insight into the formant-cavity affiliation, the sensitivity functions [8] of F1-F5 are computed, as shown in Figures 5 and 6. The sensitivity functions for F1, F2 and F3 have similar patterns for both the retroflex /r/ and the bunched /r/. In both cases, F2 is mainly affected by the front cavity, where the lip constriction and the large posterior volume act as a Helmholtz resonator. Due to the coupling effect along the vocal tract, F1 and F3 in both cases can be affected by area perturbation along almost the whole vocal tract. Sensitivity functions for F4 and F5 have very different patterns for the retroflex /r/ and the bunched /r/. In the retroflex /r/, F4 and F5 are affected only minimally by the area perturbation of the front cavity, starting around 14.8 cm from the glottis, which means that they are resonances of the cavities posterior to the palatal constriction. In the bunched /r/, F4 and F5 are not sensitive to the area perturbation of the cavity posterior to the pharyngeal constriction and they are affected to some extent by the front cavity. This sensitivity to the front cavity is probably due to the more gradual transition between the front and back cavities.

Figure 5: Sensitivity functions of F1-F5 for retroflex /r/

Figure 6: Sensitivity functions of F1-F5 for bunched /r/

Simple-tube models for the retroflex and bunched /r/s were derived from area functions in this study, as seen in Figure 7. In the case of the retroflex /r/, the simple model consists of four tubes: a lip constriction, a large volume behind the lip constriction, a palatal constriction and a long tube posterior to the palatal constriction. F2 comes from the front part of the vocal tract, which acts like a Helmholtz resonator (comprising the lip constriction and the large volume between the lip constriction and the palatal constriction). F1 comes from the back cavity, which acts as a Helmholtz resonator formed by the palatal constriction and the tube behind it. F3, F4 and F5 are half-wavelength resonances of the cavity posterior to the palatal constriction. Thus, they are fairly evenly spaced. In the case of the bunched /r/, the back cavity is more uniform, so we model the /r/ with only 3 cavities. When decoupled, the back cavity will act as a quarter-wavelength tube instead of a half-wavelength tube as in the retroflex /r/. In both cases, if the pharyngeal and laryngeal constrictions are modeled, the resulting formant values match well with the results from the area functions (see Table 1).

Figure 7: Simple-tube models for retroflex and bunched shapes. (a) The retroflex tongue shape. (b) The bunched tongue shape.

3.3. Formants in acoustic data

Table 1 shows that the formant frequencies measured acoustically for Speakers 22 and 5 match well with those obtained from both the 3D FEM and the simple-tube model derived from area functions. In order to see if the F4 and F5 pattern in Speakers 22 and 5 holds in other subjects, acoustic data of sustained /r/ from four more subjects are analyzed. Among them, two subjects (Speakers 1 and 20) have retroflex /r/ tongue shapes similar to Speaker 22 and the other two subjects (Speakers 17 and 19) have bunched /r/ tongue shapes similar to Speaker 5, as seen in Figure 1. The spectra of the sustained /r/ sound produced by the six subjects are shown in Figure 8. The differences between F4 and F5 for Speakers 1 and 20 are about 1900 Hz and 2000 Hz, respectively, while the differences between F4 and F5 for Speakers 17 and 19 are about 500 Hz and 600 Hz, respectively. These results are consistent with the result obtained from Speakers 22 and 5, in that the retroflex /r/ has a larger difference between F4 and F5 than the bunched /r/ (about 1400 Hz vs. 700 Hz). Additionally, the formant trajectories of the nonsense word "warav" also indicate the same difference pattern between F4 and F5 during dynamic speech.

4. Discussion

The salient difference between the retroflex and bunched tongue shapes, the spacing between F4 and F5, is due to the difference in the back cavities. In the case of the retroflex /r/, the back cavities consist of the palatal constriction and the long cavity posterior to it. Our simple tube modeling and the sensitivity functions show that F4 and F5 are resonances of the half-wavelength cavity posterior to the palatal constriction. In fact, F4 and F5 are the second and third resonances of the half-wavelength cavity (F3 is the first resonance of this cavity).
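As a rough numerical check on the tube interpretation in this discussion, the sketch below computes the resonances of an idealized uniform, lossless tube: n·c/(2L) for a half-wavelength cavity and (2n−1)·c/(4L) for a quarter-wavelength cavity. It is an illustration only, not the VTAR or FEM computation used in this study. The speed of sound (350 m/s) and the function names are assumptions; the lengths of about 12 cm (Speaker 22) and about 15 cm (Speaker 5) are the approximate back-cavity lengths quoted in this section.

```python
# Assumed speed of sound in cm/s; the paper does not state the value it used.
SPEED_OF_SOUND_CM_PER_S = 35000.0


def half_wavelength_resonances(length_cm, count, c=SPEED_OF_SOUND_CM_PER_S):
    """Resonances n*c/(2L) of a uniform tube with like terminations at both ends."""
    return [n * c / (2.0 * length_cm) for n in range(1, count + 1)]


def quarter_wavelength_resonances(length_cm, count, c=SPEED_OF_SOUND_CM_PER_S):
    """Resonances (2n-1)*c/(4L) of a uniform tube closed at one end, open at the other."""
    return [(2 * n - 1) * c / (4.0 * length_cm) for n in range(1, count + 1)]


def spacings(resonances):
    """Differences between consecutive resonances."""
    return [hi - lo for lo, hi in zip(resonances, resonances[1:])]


# Retroflex /r/: back cavity of about 12 cm treated as a half-wavelength tube.
# Its first three resonances play the role of F3, F4 and F5 in the simple model.
retroflex = half_wavelength_resonances(12.0, 3)

# Bunched /r/: back cavity of about 15 cm treated as a quarter-wavelength tube.
# Its third and fourth resonances play the role of F4 and F5 in the simple model.
bunched = quarter_wavelength_resonances(15.0, 4)

print("retroflex F5-F4 spacing (Hz):", round(retroflex[2] - retroflex[1]))
print("bunched   F5-F4 spacing (Hz):", round(bunched[3] - bunched[2]))
```

Under these assumptions the uniform-tube spacing is c/(2L) in both cases, roughly 1458 Hz for 12 cm and 1167 Hz for 15 cm. The constrictions considered in this discussion are what move the modeled spacings away from these idealized values.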
For Speaker 22, this half-wavelength cavity is about 12 cm long, which gives a spacing between the resonances of about 1460 Hz. The narrowing in the laryngeal region shifts F4 and F5 upwards by different amounts, so that the spacing changes to about 1300 Hz. This spacing agrees well with the 1380 Hz measured from Speaker 22's sustained /r/.

For the bunched /r/, the back cavity can be modeled as a quarter-wavelength tube. Our simple tube modeling shows that F4 and F5 are the third and fourth resonances of this cavity. The sensitivity functions, on the other hand, show that F4 and F5 are influenced by the front cavity. This is probably due to the higher degree of coupling between the front and back cavities for the bunched /r/ of Speaker 5. The length of the back cavity for Speaker 5 is about 15 cm. Thus, the spacing between F4 and F5 for the bunched /r/ should be about 1150 Hz. However, the narrowing in the laryngeal, pharyngeal and palatal regions decreases this difference to about 650 Hz, as seen in Table 1. This formant difference agrees well with the value of 700 Hz measured from Speaker 5's sustained /r/.

Table 1: Speakers 22 and 5 /r/ formants from acoustic measurement of sustained /r/ utterances, 3D FEM model, tube model with area function, and simple-tube model (Unit: Hz)

Figure 8: Spectra of sustained /r/ utterances from 6 speakers. (a) Retroflex /r/s (Left: Speaker 22, Middle: Speaker 1, Right: Speaker 20). (b) Bunched /r/s (Left: Speaker 5, Middle: Speaker 17, Right: Speaker 19).

5. Summary

The articulatory-acoustic relationship of retroflex and bunched /r/ in American English is examined in this paper using MRI. 3D FEM analysis shows that both the retroflex /r/ and the bunched /r/ produce zero frequencies above 5000 Hz. Both of them produce similar formant patterns in F1, F2 and F3, but differ in F4 and F5. The difference between F4 and F5 in the retroflex /r/ is much larger than in the case of the bunched /r/ (around 1400 Hz vs. around 700 Hz).
While both /r/s are produced with narrowings in the back cavity in the palatal, pharyngeal and laryngeal regions, there is a much larger difference in areas between the constricted and unconstricted regions for the retroflex /r/ than for the bunched /r/. Further, the palatal constriction for the retroflex /r/ is shorter and more forward. In both cases, F2 is produced by the front cavity. For the retroflex /r/, the palatal constriction decouples the vocal tract, and F3, F4 and F5 are mainly produced by the back cavity posterior to the palatal constriction. However, in the bunched /r/, it is difficult to decouple the vocal tract due to the more gradual change of the area function along the vocal tract, and F4 and F5 are sensitive to area perturbation over a much longer stretch of the vocal tract than in the case of the retroflex /r/. The acoustic data from other speakers further support these results, and the results in this study might be helpful in discriminating the tongue shapes used for producing /r/. Our future work will analyze more subjects along the configuration continuum to establish the connection between articulation and its acoustic consequences.

6. Acknowledgements

This work was supported by NIH grant 1 R01 DC05250-01.

7. References

[1] P. Delattre and D. C. Freeman, “A dialect study of American English r’s by x-ray motion picture,” Linguistics, vol. 44, pp. 28–69, 1968.
[2] M. Tiede, S. E. Boyce, C. Holland, and A. Chou, “A new taxonomy of American English /r/ using MRI and ultrasound,” JASA, vol. 115, no. 5, pp. 2633–2634, 2004.
[3] J. R. Westbury, M. Hashi, and M. J. Lindstrom, “Differences among speakers in lingual articulation for American English /r/,” Speech Communication, vol. 26, no. 3, pp. 203–226, 1998.
[4] C. Y. Espy-Wilson and S. E. Boyce, “The relevance of F4 in distinguishing between different articulatory configurations of American English /r/,” JASA, vol. 105, no. 2, p. 1400, 1999.
[5] C. Y. Espy-Wilson, “Articulatory strategies, speech acoustics and variability,” in Proc. of Sound to Sense: 50+ Years of Discoveries in Speech Communication, MIT, Cambridge, 2004.
[6] F. H. Guenther, C. Y. Espy-Wilson, S. E. Boyce, M. L. Matthies, M. Zandipour, and J. S. Perkell, “Articulatory tradeoffs reduce acoustic variability during American English /r/ production,” JASA, vol. 105, no. 5, pp. 2854–2865, 1999.
[7] X. H. Zhou, Z. Y. Zhang, and C. Y. Espy-Wilson, “VTAR: A MATLAB-based computer program for vocal tract acoustic modeling,” JASA, vol. 115, no. 5, p. 2543, 2004.
[8] G. Fant and S. Pauli, “Spatial characteristics of vocal tract resonance modes,” in Proc. of the Speech Communication Seminar 74, Stockholm, pp. 121–132, 1974.