Automatic recognition of L2 speech-in-noise
In this study, we compared four state-of-the-art automatic speech recognition (ASR) systems (Google, HuBERT, wav2vec 2.0, and Whisper) with human listeners on word recognition accuracy for second language (L2) speech embedded in noise.
We found that one system, Whisper, performed at levels similar to (or in some cases better than) those of human listeners.
However, the content of its responses diverged substantially from human responses.
This suggests that ASR could be used to predict how intelligible speech is to human listeners, but that such predictions should be treated with caution.
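A minimal sketch of how word recognition accuracy might be scored, and of how two responses can reach the same accuracy while differing in content. The scoring rule (in-order word matches via longest common subsequence), the function names, and the example sentences are all assumptions for illustration; the study's actual scoring procedure is not described here.

```python
import string


def normalize(text):
    """Lowercase, strip ASCII punctuation, and split into words."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def word_accuracy(target, response):
    """Proportion of target words recovered in order by the response.

    Uses the longest common subsequence (LCS) of the two word lists,
    which is one possible scoring rule, not necessarily the study's.
    """
    ref, hyp = normalize(target), normalize(response)
    if not ref:
        raise ValueError("target must contain at least one word")
    # lcs[i][j] holds the LCS length of ref[:i] and hyp[:j].
    lcs = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i, r in enumerate(ref, 1):
        for j, h in enumerate(hyp, 1):
            if r == h:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])
    return lcs[-1][-1] / len(ref)


# Hypothetical target sentence and responses (not data from the study).
target = "the boy ran to the store"
human_response = "the boy ran to the shore"
asr_response = "a boy ran to the store"

print(word_accuracy(target, human_response))        # accuracy of the human response
print(word_accuracy(target, asr_response))          # accuracy of the ASR response
print(word_accuracy(human_response, asr_response))  # overlap between the two responses
```

In this toy case both responses miss exactly one target word, so their accuracies match, yet they miss different words, so their overlap with each other is lower than either accuracy score.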
Proactive and reactive F0 adjustments in speech
A production experiment was conducted to investigate speakers' (i) pre-planned and (ii) adaptive F0 control.
In particular, the experiment examined whether speakers vary F0 parameters (i) according to the initially planned utterance length and (ii) in response to unanticipated changes in that length.
An experimental paradigm was developed in which the visual stimuli that cue the parts of the utterance are delayed until after participants initiate an utterance.
Analyses of F0 trajectories found strong evidence for both pre-planned and adaptive F0 control.
F0 control: pitch targets vs. pitch register
The present study examined what speakers control most directly to produce variation in F0, by evaluating target-control and register-control hypotheses.
In the target-control hypothesis, it is individual pitch targets that speakers mainly control to produce variations in F0, whereas in the register-control hypothesis, it is the control of pitch register (the F0 space in which the targets are realized) that induces F0 variations.
These hypotheses were assessed by examining the correlations between F0 peaks and valleys in empirical F0 trajectories and through computational modeling.
The results suggest that pitch register may be a more important control parameter than previous models have assumed.
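The correlation step can be illustrated with a short sketch that computes a Pearson correlation between per-utterance F0 peaks and valleys. The data values and the function name are hypothetical, and the sketch makes no claim about how the study measured peaks and valleys or how it interpreted the resulting correlations.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-utterance F0 values in Hz (not data from the study).
f0_peaks = [210.0, 225.0, 198.0, 240.0, 215.0]
f0_valleys = [150.0, 162.0, 141.0, 175.0, 155.0]

r = pearson_r(f0_peaks, f0_valleys)
print(f"peak-valley correlation: r = {r:.3f}")
```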
Functional relations between speech rate and phonetic variables
This study examined how phonetic measures covary with speech rate, specifically assessing whether there is evidence for linear and/or non-linear relations with rate, and how those relations may differ between phrase boundaries.
Productions of English non-restrictive (NRRCs) and restrictive relative clauses (RRCs) were collected using a method in which variation in speech rate was cued by the speed of motion of a visual stimulus.
Analyses of articulatory and acoustic variables showed that the variables associated with a phrase boundary that follows the RC were more susceptible to rate variation than those at a boundary that precedes the RC.
Phonetic variables at the post-RC boundary also showed evidence for non-linear relations with rate, which suggest floor or ceiling attenuation effects at extreme rates.
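One simple way to probe for a non-linear relation with rate is to fit both a linear and a quadratic polynomial and compare their residuals. The sketch below does this with ordinary least squares in plain Python. The quadratic form is a stand-in chosen for illustration, not necessarily the model used in the study, and the rate and duration values are hypothetical.

```python
def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients [c0, c1, ...] via the normal equations."""
    n = degree + 1
    # Augmented matrix [X^T X | X^T y] for the polynomial design matrix X.
    a = [
        [sum(x ** (i + j) for x in xs) for j in range(n)]
        + [sum(y * x ** i for x, y in zip(xs, ys))]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(n):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [v - factor * p for v, p in zip(a[row], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


def rss(xs, ys, coefs):
    """Residual sum of squares of the polynomial with the given coefficients."""
    def predict(x):
        return sum(c * x ** k for k, c in enumerate(coefs))
    return sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical data: speech rate (syllables/s) and a boundary-adjacent
# duration (ms) that levels off at fast rates, as a floor effect would.
rates = [2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
durations = [300.0, 240.0, 190.0, 155.0, 135.0, 125.0, 122.0]

rss_linear = rss(rates, durations, polyfit(rates, durations, 1))
rss_quadratic = rss(rates, durations, polyfit(rates, durations, 2))
print(f"linear RSS: {rss_linear:.1f}, quadratic RSS: {rss_quadratic:.1f}")
```

Because the linear model is nested within the quadratic one, the quadratic fit can never have a larger residual sum of squares. A real analysis would therefore need a comparison that penalizes the extra parameter, such as an information criterion or a likelihood-ratio test.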
Temporal localization of syntactic-prosodic information
This study used a novel neural network-based analysis method to temporally localize prosodic information associated with a syntactic contrast in acoustic and articulatory signals.
Neural networks were trained on multi-dimensional acoustic and articulatory data to classify the two types of relative clauses (RRCs vs. NRRCs), and the network accuracies on test data were analyzed.
The results revealed two different patterns: syntactically conditioned prosodic information was either (i) widely distributed around the boundaries or (ii) narrowly distributed at specific locations.
The findings suggest that prosodic expression of syntactic contrasts does not occur in a uniform way or at a fixed location.
Phonetic evidence for hierarchical prosodic phrases
This study shows that the existing phonetic evidence for hierarchical organization of prosodic phrases is ambiguous, and that a non-hierarchical organization of phrases is also consistent with the data.
To compare hierarchical and non-hierarchical organization models, the current study analyzed productions of English NRRCs and RRCs at varying speech rates.
We examined whether articulatory and acoustic variables at phrase boundaries exhibit evidence of speech rate-dependent mixtures of categories through regression mixture models.
Overall, the evidence for multiple levels of prosodic phrase categories was not very compelling. The measures that were most supportive of hierarchical phrase structure were measures of boundary-related slowing and gestural overlap at boundaries.