Joint modeling of text and acoustic-prosodic cues for neural parsing

T Tran, S Toshniwal, M Bansal, K Gimpel… - arXiv preprint arXiv:1704.07287, 2017 - ttmt001.github.io
Abstract
In conversational speech, the acoustic signal provides cues that help listeners disambiguate difficult parses. For automatically parsing a spoken utterance, we introduce a model that integrates transcribed text and acoustic-prosodic features using a convolutional neural network over energy and pitch trajectories coupled with an attention-based recurrent neural network that accepts text and word-based prosodic features. We find that different types of acoustic-prosodic features are individually helpful, and together improve parse F1 scores significantly over a strong text-only baseline. For this study with known sentence boundaries, error analysis shows that the main benefit of acoustic-prosodic features is in sentences with disfluencies and that attachment errors are most improved.
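
The abstract describes a concrete architecture: a CNN summarizes frame-level energy and pitch trajectories for each word, the summary is concatenated with word embeddings and word-level prosodic features, and an attention-based RNN encodes the result. Below is a minimal PyTorch sketch of that idea; the module names, feature dimensions, and the dot-product attention helper are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of the text + acoustic-prosodic encoder described in the
# abstract. Assumes PyTorch; all names and dimensions are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProsodyCNN(nn.Module):
    """1D convolution over frame-level energy/pitch trajectories,
    max-pooled over time to one vector per word span."""
    def __init__(self, in_channels=2, out_dim=32, kernel_size=5):
        super().__init__()
        self.conv = nn.Conv1d(in_channels, out_dim, kernel_size,
                              padding=kernel_size // 2)

    def forward(self, frames):
        # frames: (batch, 2, num_frames) -> (batch, out_dim)
        return F.relu(self.conv(frames)).max(dim=-1).values

class SpeechParserEncoder(nn.Module):
    """Concatenates word embeddings, word-level prosodic features
    (e.g. pause durations), and CNN summaries of energy/pitch, then
    runs a bidirectional LSTM whose states a parse decoder can attend over."""
    def __init__(self, vocab_size, word_dim=100, prosody_dim=4,
                 cnn_dim=32, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, word_dim)
        self.prosody_cnn = ProsodyCNN(out_dim=cnn_dim)
        self.rnn = nn.LSTM(word_dim + prosody_dim + cnn_dim, hidden,
                           batch_first=True, bidirectional=True)

    def forward(self, word_ids, word_prosody, frame_feats):
        # word_ids: (batch, seq); word_prosody: (batch, seq, prosody_dim)
        # frame_feats: (batch, seq, 2, num_frames) - energy/pitch per word
        b, t, c, n = frame_feats.shape
        cnn_out = self.prosody_cnn(frame_feats.view(b * t, c, n)).view(b, t, -1)
        x = torch.cat([self.embed(word_ids), word_prosody, cnn_out], dim=-1)
        outputs, _ = self.rnn(x)
        return outputs  # (batch, seq, 2 * hidden)

def dot_product_attention(decoder_state, encoder_outputs):
    """Simple attention: decoder_state (batch, 2*hidden) scores each
    encoder position; returns the weighted context vector."""
    scores = torch.bmm(encoder_outputs, decoder_state.unsqueeze(-1)).squeeze(-1)
    weights = F.softmax(scores, dim=-1)
    return torch.bmm(weights.unsqueeze(1), encoder_outputs).squeeze(1)

# Toy usage: 2 utterances of 7 words, 50 acoustic frames per word.
enc = SpeechParserEncoder(vocab_size=1000)
states = enc(torch.randint(0, 1000, (2, 7)),   # word ids
             torch.randn(2, 7, 4),             # word-level prosody
             torch.randn(2, 7, 2, 50))         # energy/pitch frames
context = dot_product_attention(states[:, -1], states)  # (2, 256)
```

The abstract's attention-based RNN presumably feeds a parse decoder; this sketch only shows how the lexical and acoustic-prosodic streams could be fused at the word level before attention.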