t zhou etal icslp 07

An articulatory and acoustic study of “retroflex” and “bunched” American
English rhotic sound based on MRI
Xinhui Zhou¹, Carol Y. Espy-Wilson¹, Mark Tiede², Suzanne Boyce³
¹ Department of Electrical and Computer Engineering, University of Maryland, College Park, USA
² Haskins Laboratories and MIT R.L.E., USA
³ Department of Communication Sciences and Disorders, University of Cincinnati, USA
Abstract
The North American rhotic liquid has two maximally distinct articulatory variants, the classic “retroflex” and the classic
“bunched” tongue postures. This study reexamines the evidence for acoustic differences between these two variants using magnetic
resonance images of the vocal tract. Two subjects
with similar vocal tract dimensions but different tongue postures
for sustained /r/ are used. It is shown that these two variants
have similar patterns of F1-F3 and zero frequencies. However,
the “retroflex” variant has a larger difference between F4 and F5
than the “bunched” one (around 1400 Hz vs. around 700 Hz).
This difference can be explained by the geometry differences
between these two variants, in particular the shorter and more
forward palatal constriction of the “retroflex” /r/ and the sharper
transition between the palatal constriction and its anterior and posterior cavities. This formant pattern difference is confirmed by
measurements from acoustic data of several additional subjects.
Index Terms: liquid sound, 3D vocal tract, magnetic resonance
imaging, finite element analysis, sensitivity function
1. Introduction
It is well known that different speakers may use very different
tongue configurations for producing the rhotic liquid of American English [1]. Traditionally, phoneticians have classified
the tongue shape for American English /r/ into two maximally
distinct types: “retroflex” (with a raised tongue tip and a lowered tongue dorsum) and “bunched” (with a lowered tongue tip
and a raised tongue dorsum). Usually these shapes have three
supraglottal constrictions along the vocal tract: a constriction
narrowing the pharynx, a constriction along the palatal vault
and a constriction at the lips. However, the classification as
“retroflex” and “bunched” understates the degree of variability
found across speakers. This variability is shown in [2], which
obtained magnetic resonance (MR) images of the vocal tract for
22 subjects producing sustained /r/. A number of these subjects have /r/ configurations that appear to be intermediate.
Given the large degree of articulatory difference between
“bunched” /r/ and “retroflex” /r/, it might be expected that the
two would be acoustically distinct. There have been several attempts to correlate particular tongue configurations and acoustic
differences across different types of /r/ at F1, F2, and F3, but no
consistent pattern has emerged [1, 3].
In recent years, Espy-Wilson and colleagues have suggested
that the higher formants may contain clues to tongue configuration and vocal tract dimensions [4, 5]. In this study, we examine the acoustic variability in F1-F5 associated with different
tongue configurations used to produce /r/. In particular, the two
maximally distinct and classic configurations, “retroflex” /r/ and
“bunched” /r/, are studied toward the long-term goal of finding
acoustic signatures for the different types of /r/.
Two subjects, producing “retroflex” /r/ and “bunched” /r/
respectively, are used in this study. Ideally, we would like to
use a single speaker who naturally produces both a retroflex and
bunched /r/. However, it appears that speakers tend to produce
/r/ in the same way across contexts. Subjects who can produce
more than one variant tend to produce /r/ with configurations
that still have a lot of similarity [6].
2. Materials and methodologies
2.1. Subject information and data acquisition
Two male native speakers of American English from the database in [2], Speaker 22 with “retroflex” /r/ and Speaker 5 with
“bunched” /r/ (see Figure 1), are chosen as subjects. These two
subjects are similar in overall height, vocal tract length, and palate volume and length. The data collected from both speakers
include MRI data of the vocal tract for sustained /r/ (sagittal,
axial, and coronal slices) and booth acoustic data for the sustained /r/ and some nonsense and real words containing /r/.
MR imaging was performed on a 1.5 Tesla G.E. machine.
The scanning sequence used was FMPSPGR (Fast MultiPlanar SPoiled GRadient echo) with a repetition time (TR) of 110
ms and an echo time (TE) of 4.2 ms. The slice thickness is 3 mm
for coronal slices at the palatal constriction and 5 mm for all other
slices. The field of view of each image is 240 mm by 240 mm and
the image size is 256 by 256 pixels.
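As a quick check on the resolution implied by these scan parameters, the in-plane pixel spacing follows from dividing the field of view by the image size. The sketch below is illustrative arithmetic only: the function and variable names are ours, and the voxel volume assumes square pixels, which the text implies but does not state.

```python
# Scan parameters quoted in the text.
FOV_MM = 240.0        # field of view along each in-plane axis
IMAGE_SIZE_PX = 256   # number of pixels along each in-plane axis


def pixel_spacing_mm(fov_mm: float, n_pixels: int) -> float:
    """In-plane spacing between pixel centres, in millimetres."""
    return fov_mm / n_pixels


def voxel_volume_mm3(fov_mm: float, n_pixels: int, slice_thickness_mm: float) -> float:
    """Volume of one voxel, assuming square pixels (an assumption)."""
    spacing = pixel_spacing_mm(fov_mm, n_pixels)
    return spacing * spacing * slice_thickness_mm


spacing = pixel_spacing_mm(FOV_MM, IMAGE_SIZE_PX)
print(f"pixel spacing: {spacing} mm")  # 240 / 256 = 0.9375 mm
for thickness in (3.0, 5.0):  # the two slice thicknesses given in the text
    volume = voxel_volume_mm3(FOV_MM, IMAGE_SIZE_PX, thickness)
    print(f"{thickness} mm slice: voxel volume {volume:.3f} mm^3")
```

The in-plane spacing is therefore just under 1 mm, while the through-plane resolution is three to five times coarser.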
2.2. 3D vocal tract reconstruction and FEM
The medical image processing software MIMICS (Materialise,
Inc) was used to process the MR images and obtain a 3D reconstruction of the vocal tract, with the geometry represented in
STL (STereoLithography) format. The finite element method
(FEM) was applied to this geometry using the COMSOL MULTIPHYSICS package. Harmonic analysis was performed with
hard-wall boundaries and a pressure-release condition at the lips. The
excitation at the glottis was the normal velocity profile of a sinusoidal signal.
2.3. Area function extraction
The area function of the vocal tract was extracted based on the
reconstructed 3D geometry. The wave propagation property resulting from the 3D FEM is used to guide the area function extraction. As the curvature of the vocal tract changes, the cutting
orientation in our method was adjusted to be approximately parallel to the pressure isosurface at a frequency of 500 Hz, as shown
in Figure 2.
Figure 1: Midsagittal MR images of the vocal tract for retroflex and bunched shapes (a subset of the database in [2]). (a) The retroflex tongue shape. (b) The bunched tongue shape.
Figure 2: Area function extraction for retroflex and bunched shapes on the reconstructed 3D geometry (straight line indicates the cutting plane). (a) The retroflex tongue shape (Speaker 22). (b) The bunched tongue shape (Speaker 5).
Figure 3: Area function and the acoustic responses of the retroflex /r/
3. Results
3.1. Reconstructed 3D vocal tract geometries and FEM-based acoustic response
The reconstructed 3D vocal tract shapes for the retroflex and
the bunched /r/ are shown in Figure 2. Both of them have a
large front cavity. Neither has a sublingual space underneath the tongue. The retroflex /r/ has a shorter and more
forward palatal constriction. The volume of the back cavity
posterior to the palatal constriction is larger in the case of the
retroflex /r/ since the tongue dorsum is lowered and the transition between the palatal constriction and its anterior and posterior cavities is sharper. The tongue root of the bunched /r/ is
relatively closer to the pharyngeal wall. As a result, there is a
tighter contact with the epiglottis in the bunched /r/ and the area
of the cross section in the pharyngeal cavity is smaller.
Figures 3 and 4 show the 3D FEM result of the acoustic
responses. It can be seen that F1-F3 are in the normal range
of an /r/ sound for both subjects. However, F4 and F5 differ markedly between these two subjects. The difference
between F4 and F5 for the retroflex /r/ is much larger than for the
bunched /r/ (about 1400 Hz vs. about 700 Hz). We speculate
that this difference in F4 and F5 is due to the difference in vocal
tract shapes caused by the different tongue shapes. The formant
values from 3D FEM are listed in Table 1 along with results
from Section 3.2 and Section 3.3.
Zero frequencies above 5000 Hz are produced in both cases
due to the cross modes, but they lie above F1-F5, the frequency
range of interest.
Figure 4: Area function and the acoustic responses of the bunched /r/
3.2. Sensitivity functions and simple tube modeling
The cutting planes for the area function are shown in Figure 2
and the resulting area functions are shown in Figures 3 and 4.
The acoustic responses computed by VTAR [7] from the
area functions match well with the results from the 3D FEM (see
Table 1). For simplicity, the 3D FEM model does not take
the effect of radiation into account, whereas VTAR can include
it conveniently. It was found that
radiation affects only F2, lowering it so that it is closer to that
measured from acoustic data (see Table 1).
However, the vocal tract modeling using the area function
does not produce the zeros revealed by the 3D FEM. This
result is not surprising given that there are no side branches in the
computer vocal tract model, which assumes plane wave propagation.
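To illustrate how formants can be obtained from an area function under the plane-wave assumption, the following is a minimal sketch of a lossless concatenated-tube (chain-matrix) model. It is not VTAR and does not reproduce its implementation: wall losses and radiation are ignored, the lips are treated as a pressure release (mirroring the FEM boundary condition above), and the speed of sound, air density, function names and the uniform example tube are all assumptions chosen for illustration rather than the subjects' measured area functions.

```python
import math

SPEED_OF_SOUND = 35000.0  # cm/s, assumed value
AIR_DENSITY = 0.00114     # g/cm^3, assumed; it cancels out of the D element below


def chain_d(freq_hz, areas_cm2, section_len_cm):
    """D element of the glottis-to-lips chain matrix of lossless tube sections.

    With zero pressure at the lips, U_glottis = D * U_lips, so the formants
    of this idealized model are the frequencies where D crosses zero.
    """
    k = 2.0 * math.pi * freq_hz / SPEED_OF_SOUND
    cos_kl = math.cos(k * section_len_cm)
    sin_kl = math.sin(k * section_len_cm)
    # Each lossless section has the chain matrix [[a, j*b], [j*c, d]];
    # only the real numbers a, b, c, d are stored.
    a, b, c, d = 1.0, 0.0, 0.0, 1.0
    for area in areas_cm2:  # sections ordered from glottis to lips
        z = AIR_DENSITY * SPEED_OF_SOUND / area  # characteristic impedance
        a2, b2, c2, d2 = cos_kl, z * sin_kl, sin_kl / z, cos_kl
        a, b, c, d = (a * a2 - b * c2, a * b2 + b * d2,
                      c * a2 + d * c2, d * d2 - c * b2)
    return d


def formants(areas_cm2, section_len_cm, f_max_hz=5000.0, step_hz=10.0):
    """Locate zero crossings of D on a coarse grid, then refine by bisection."""
    found = []
    f_lo = 5.0
    d_lo = chain_d(f_lo, areas_cm2, section_len_cm)
    while f_lo + step_hz <= f_max_hz:
        f_hi = f_lo + step_hz
        d_hi = chain_d(f_hi, areas_cm2, section_len_cm)
        if d_lo * d_hi < 0.0:
            lo, hi, d_left = f_lo, f_hi, d_lo
            for _ in range(50):
                mid = 0.5 * (lo + hi)
                d_mid = chain_d(mid, areas_cm2, section_len_cm)
                if d_left * d_mid <= 0.0:
                    hi = mid
                else:
                    lo, d_left = mid, d_mid
            found.append(0.5 * (lo + hi))
        f_lo, d_lo = f_hi, d_hi
    return found


# Hypothetical uniform tube: 35 sections of 0.5 cm (17.5 cm long), area 3 cm^2.
uniform_tube = [3.0] * 35
print([round(f) for f in formants(uniform_tube, 0.5)])
```

For the uniform tube the zero crossings fall at the odd quarter-wavelength frequencies (about 500, 1500, 2500, 3500 and 4500 Hz with the assumed speed of sound), which is a convenient sanity check before substituting a measured area function. Because the model is a single unbranched tube, its transfer function has no zeros, consistent with the observation above.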
To get insight into the formant-cavity affiliation, the sensitivity functions [8] of F1-F5 are computed as shown in Figures 5
and 6. The sensitivity functions for F1, F2 and F3 have similar
patterns for both the retroflex /r/ and the bunched /r/. In both
cases, F2 is mainly affected by the front cavity where the lip
constriction and the large posterior volume act as a Helmholtz
resonator. Due to the coupling effect along the vocal tract, F1
and F3 in both cases can be affected by area perturbation along
almost the whole vocal tract.
Figure 5: Sensitivity functions of F1-F5 for retroflex /r/
Figure 6: Sensitivity functions of F1-F5 for bunched /r/
Sensitivity functions for F4 and F5 have very different patterns for the retroflex /r/ and the bunched /r/. In the retroflex
/r/, F4 and F5 are affected only minimally by the area perturbation of the front cavity, starting around 14.8 cm from the glottis,
which means that they are resonances of the cavities posterior
to the palatal constriction. In the bunched /r/, F4 and F5 are not
sensitive to the area perturbation of the cavity posterior to the
pharyngeal constriction and they are affected to some extent by
the front cavity. This sensitivity to the front cavity is probably
due to the more gradual transition between the front and back
cavities.
Simple-tube models for the retroflex and bunched /r/s
were derived from area functions in this study, as seen in Figure
7. In the case of the retroflex /r/, the simple model consists of
four tubes: a lip constriction, a large volume behind the lip constriction, a palatal constriction and a long tube posterior to the
palatal constriction. F2 comes from the front part of the vocal
tract which acts like a Helmholtz resonator (includes the lip constriction and the large volume between the lip constriction and
the palatal constriction). F1 comes from the back cavity which
acts as a Helmholtz resonator formed by the palatal constriction
and the tube behind it. F3, F4 and F5 are half-wavelength resonances of the cavity posterior to the palatal constriction. Thus,
they are fairly evenly spaced. In the case of the bunched /r/,
the back cavity is more uniform so that we model the /r/ with
only 3 cavities. When decoupled, the back cavity will act as a
quarter-wavelength tube instead of a half-wavelength tube as in
the retroflex /r/. In both cases, if the pharyngeal and laryngeal
constrictions are modeled, the resulting formant values match
well with the results from the area functions (see Table 1).
Figure 7: Simple-tube models for retroflex and bunched shapes. (a) The retroflex tongue shape. (b) The bunched tongue shape.
3.3. Formants in acoustic data
Table 1 shows that the acoustic measurements of the formant frequencies of Speakers 22 and 5 match well with those obtained
from both the 3D FEM and the simple-tube model derived from
area functions. To see whether the F4 and F5 pattern in Speakers 22 and 5 holds in other subjects, sustained /r/ acoustic data from four more subjects are analyzed. Among them, two subjects (Speakers 1 and 20) have retroflex /r/ tongue shapes similar
to Speaker 22 and the other two subjects (Speakers 17 and 19)
have bunched /r/ tongue shapes similar to Speaker 5, as seen in
Figure 1. The spectra of the sustained /r/ sounds produced by the
six subjects are shown in Figure 8. The differences between F4
and F5 for Speakers 1 and 20 are about 1900 Hz and 2000 Hz, respectively, while those for Speakers
17 and 19 are about 500 Hz and 600 Hz, respectively. These results are consistent with the result obtained from Speakers 22
and 5 (about 1400 Hz vs. 700 Hz) in that the retroflex /r/ has a larger difference between
F4 and F5 than the bunched /r/.
Additionally, the formant trajectories of the nonsense
word “warav” indicate the same difference pattern between
F4 and F5 during dynamic speech.
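The separation between the two groups can be made concrete by tabulating the approximate F4-F5 spacings quoted above. The sketch below is only an illustration: the midpoint threshold and the `guess_shape` helper are hypothetical and are not part of the study, and six speakers are far too few to establish a general decision rule.

```python
# Approximate F4-F5 spacings (Hz) as quoted in the text for the six speakers.
REPORTED_SPACING_HZ = {
    "Speaker 22": ("retroflex", 1400),
    "Speaker 1": ("retroflex", 1900),
    "Speaker 20": ("retroflex", 2000),
    "Speaker 5": ("bunched", 700),
    "Speaker 17": ("bunched", 500),
    "Speaker 19": ("bunched", 600),
}


def midpoint_threshold(data):
    """Hypothetical threshold: halfway between the two groups' closest values."""
    retroflex = [hz for shape, hz in data.values() if shape == "retroflex"]
    bunched = [hz for shape, hz in data.values() if shape == "bunched"]
    return 0.5 * (min(retroflex) + max(bunched))


def guess_shape(spacing_hz, threshold_hz):
    """Illustrative rule only: a wide F4-F5 spacing suggests a retroflex shape."""
    return "retroflex" if spacing_hz > threshold_hz else "bunched"


threshold = midpoint_threshold(REPORTED_SPACING_HZ)
for speaker, (shape, hz) in REPORTED_SPACING_HZ.items():
    print(speaker, shape, hz, guess_shape(hz, threshold))
```

With these six values the smallest retroflex spacing (about 1400 Hz) is twice the largest bunched spacing (about 700 Hz), so any threshold between them separates the groups.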
Table 1: Speakers 22 and 5 /r/ formants from acoustic measurement of sustained /r/ utterances, 3D FEM model, tube model with area function, and simple-tube model (Unit: Hz)
Figure 8: Spectra of sustained /r/ utterances from 6 speakers. (a) Retroflex /r/s (Left: Speaker 22, Middle: Speaker 1, Right: Speaker 20). (b) Bunched /r/s (Left: Speaker 5, Middle: Speaker 17, Right: Speaker 19).
4. Discussion
The salient difference between the retroflex and bunched tongue
shapes, the spacing between F4 and F5, is due to the difference in the back cavities. In the case of the retroflex /r/, the
back cavities consist of the palatal constriction and the long
cavity posterior to it. Our simple-tube modeling and the sensitivity functions show that F4 and F5 are resonances of the
half-wavelength cavity posterior to the palatal constriction. In
fact, F4 and F5 are the second and third resonances of the half-wavelength cavity (F3 is the first resonance of this cavity). For
Speaker 22, this half-wavelength cavity is about 12 cm long,
which gives a spacing between the resonances of about 1460
Hz. The narrowing in the laryngeal region shifts F4 and F5 upwards by different amounts so that the spacing changes to about
1300 Hz. This spacing agrees well with the 1380 Hz measured
from Speaker 22’s sustained /r/. For the bunched /r/, the back
cavity can be modeled as a quarter-wavelength tube. Our simple-tube modeling shows that F4 and F5 are the third and fourth
resonances of this cavity. The sensitivity functions, on the other
hand, show that F4 and F5 are influenced by the front cavity.
This is probably due to the higher degree of coupling between
the front and back cavities for the bunched /r/ of Speaker 5. The
length of the back cavity for Speaker 5 is about 15 cm. Thus, the
spacing between F4 and F5 for the bunched /r/ should be about
1150 Hz. However, the narrowing in the laryngeal, pharyngeal
and palatal regions decreases this difference to about 650 Hz as
seen in Table 1. This formant difference agrees well with the
value of 700 Hz measured from Speaker 5’s sustained /r/.
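The spacing estimates in this discussion follow from the standard resonance formulas for uniform tubes: a half-wavelength tube resonates at n·c/(2L) and a quarter-wavelength tube at (2n−1)·c/(4L), so adjacent resonances are c/(2L) apart in both cases. The sketch below reproduces that arithmetic under an assumed speed of sound of 350 m/s; the paper does not state the value it used, so the results match the quoted figures only approximately, and the function names are ours.

```python
SPEED_OF_SOUND_CM_S = 35000.0  # assumed; the value used in the paper is not stated


def half_wave_resonances(length_cm, count):
    """Resonances n*c/(2L), n = 1..count, of a uniform half-wavelength tube."""
    return [n * SPEED_OF_SOUND_CM_S / (2.0 * length_cm) for n in range(1, count + 1)]


def quarter_wave_resonances(length_cm, count):
    """Resonances (2n-1)*c/(4L), n = 1..count, of a uniform quarter-wavelength tube."""
    return [(2 * n - 1) * SPEED_OF_SOUND_CM_S / (4.0 * length_cm)
            for n in range(1, count + 1)]


def resonance_spacing(length_cm):
    """Spacing c/(2L) between adjacent resonances, the same for both tube types."""
    return SPEED_OF_SOUND_CM_S / (2.0 * length_cm)


# Retroflex /r/ (Speaker 22): back cavity of about 12 cm, half-wavelength tube.
retroflex = half_wave_resonances(12.0, 3)   # first three resonances
# Bunched /r/ (Speaker 5): back cavity of about 15 cm, quarter-wavelength tube.
bunched = quarter_wave_resonances(15.0, 4)  # first four resonances

print("retroflex spacing (Hz):", round(resonance_spacing(12.0)))
print("bunched spacing (Hz):", round(resonance_spacing(15.0)))
```

With this assumed speed of sound the 12 cm cavity gives a spacing of roughly 1458 Hz, close to the quoted 1460 Hz, and the 15 cm cavity gives roughly 1167 Hz, slightly above the quoted 1150 Hz; a somewhat lower speed of sound would bring the latter closer. These are uniform-tube values before the constriction effects described above are taken into account.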
5. Summary
The articulatory-acoustic relationship of retroflex and bunched
/r/ in American English is examined in this paper using MRI. 3D
FEM analysis shows that both the retroflex /r/ and the bunched
/r/ produce zero frequencies above 5000 Hz. Both of them produce similar formant patterns in F1, F2 and F3, but differ in F4
and F5. The difference between F4 and F5 in the retroflex /r/ is
much larger than in the case of bunched /r/ (around 1400 Hz vs.
around 700 Hz).
While both /r/s are produced with narrowings in the back
cavity in the palatal, pharyngeal and laryngeal regions, there is
a much larger difference in areas between the constricted and
unconstricted regions for the retroflex /r/ than for the bunched
/r/. Further, the palatal constriction for the retroflex /r/ is shorter
and more forward. In both cases, F2 is produced by the front
cavity. For the retroflex /r/, the palatal constriction decouples
the vocal tract and F3, F4 and F5 are mainly produced by the
back cavity posterior to the palatal constriction. However, in the
bunched /r/, it is difficult to decouple the vocal tract due to the
more gradual change of the area function along the vocal tract,
and F4 and F5 are sensitive to the area perturbation of a much
longer length along the vocal tract than in the case of retroflex
/r/. The acoustic data from other speakers further support
these results, which might be
helpful in discriminating the tongue shapes used for producing /r/.
Our future work will analyze more subjects along the
configuration continuum to better connect articulation with its acoustic consequences.
6. Acknowledgements
This work was supported by NIH grant 1 R01 DC05250-01.
7. References
[1] P. Delattre and D. C. Freeman, “A dialect study of American English r’s by x-ray motion picture,” Linguistics, vol. 44, pp. 28–69, 1968.
[2] M. Tiede, S. E. Boyce, C. Holland, and A. Chou, “A new taxonomy of American English /r/ using MRI and ultrasound,” JASA, vol. 115, no. 5, pp. 2633–2634, 2004.
[3] J. R. Westbury, M. Hashi, and M. J. Lindstrom, “Differences among speakers in lingual articulation for American English /r/,” Speech Communication, vol. 26, no. 3, pp. 203–226, 1998.
[4] C. Y. Espy-Wilson and S. E. Boyce, “The relevance of F4 in distinguishing between different articulatory configurations of American English /r/,” JASA, vol. 105, no. 2, p. 1400, 1999.
[5] C. Y. Espy-Wilson, “Articulatory strategies, speech acoustics and variability,” in Proc. of Sound to Sense: 50+ Years of Discoveries in Speech Communication, MIT, Cambridge, 2004.
[6] F. H. Guenther, C. Y. Espy-Wilson, S. E. Boyce, M. L. Matthies, M. Zandipour, and J. S. Perkell, “Articulatory tradeoffs reduce acoustic variability during American English /r/ production,” JASA, vol. 105, no. 5, pp. 2854–2865, 1999.
[7] X. H. Zhou, Z. Y. Zhang, and C. Y. Espy-Wilson, “VTAR: A MATLAB-based computer program for vocal tract acoustic modeling,” JASA, vol. 115, no. 5, p. 2543, 2004.
[8] G. Fant and S. Pauli, “Spatial characteristics of vocal tract resonance modes,” in Proc. of the Speech Communication Seminar 74, Stockholm, pp. 121–132, 1974.