Alak Presentation-Hong
2021.08.22
Table of Contents
1. Research Background
2. Automated Evaluation of Pronunciation (Phoneme, Rhythm, Intonation, Stress)
3. Effectiveness of Automated Evaluation of Phoneme
4. Conclusion
5. Future Studies
1. Research Background
Research Background
CALL & CAPT
• Research topic 2: effectiveness of learning - testing learning effectiveness with CAPT,
analysis of effective feedback types, and learners' satisfaction levels
Research Background
Limitations of previous research
• Effective pronunciation learning requires quantitative feedback on each of 4 criteria: the segmental (phoneme) and
suprasegmental (rhythm, intonation, stress) levels
(Neri et al., 2002; Lee et al., 2015; McGregor and Reed, 2018; Lee, 2019; Perez-Ramon et al., 2020)
• State-of-the-art automated pronunciation evaluation technology is based on deep learning and requires extensive
computing resources (Zhang et al., 2020; Lin et al., 2020)
• Even then, these systems still return only holistic scores
• Without practicality in classrooms, CALL/CAPT is meaningless (Neri et al., 2002)
Research Background
Goal & Hypothesis
• Research Goal
• Develop an automated evaluation system for phoneme, rhythm, intonation, and stress —> feedback for effective pronunciation learning
• Run experiments to test the effectiveness of the automated pronunciation evaluation system —> technology considering educational
practicality
• Hypothesis
• Evaluation of English segmentals
• An ASR-based phoneme evaluation system will provide human-evaluator-level scores when it is modeled with native speakers' phoneme
information
• Evaluation of English suprasegmentals
• It will provide human-evaluator-level scores when rhythm is modeled on duration, intonation on pitch, and stress on energy and pitch
• Evaluation of the effectiveness of automated English phoneme scoring systems
• English learners' pronunciation will improve after practicing with the automated English phoneme evaluation system multiple times
• English learners' satisfaction will depend on the pronunciation score unit: word, syllable, or phoneme
2. Automated Evaluation of Pronunciation
Intro
Overview
[Pipeline diagram]
STEP 1. Modeling —> STEP 2. Scoring: the model scores test data (NNS speech), producing model-predicted scores —> STEP 3. Validation: model performance test = machine-human score agreement against test data (human scores)
Intro
Database: English Read by Japanese (ERJ)
1. Structure
• Read speech of English spoken by Japanese college students (Minematsu et al., 2004)
[Table: structure of the L2 speech database by evaluation criterion (Phoneme, Rhythm, Intonation, Stress); per-criterion utterance counts not recoverable from the slide]
2.1. Phoneme Evaluation
Intro
Overview
[Pipeline diagram]
STEP 1. Modeling: ASR-based phoneme scoring model built from an English L1 speech DB and a pronunciation dictionary —> STEP 2. Scoring: predict phoneme scores (model-predicted phoneme scores) —> STEP 3. Validation: machine-human agreement against 5 human raters' scores
Method
Modeling
[Diagram: DNN-HMM acoustic model]
- Input: frame-wise features every 10 ms: MFCC (speaker-independent) + i-Vector (speaker-dependent)
- Output: phoneme posteriors (AH2, ..., NG, R, ..., Z), illustrated on the utterance "He is running."
Method
Scoring
1. Forced alignment of ERJ L2 speech & text
- ERJ L2 speech with phoneme scores: 5,674 utterances
- Forced alignment using the DNN-HMM acoustic model
2. Percentage conversion of frame-wise phoneme log-likelihood
- Log-likelihood ranges from negative infinity to 0
- Log-likelihood distributions differ phoneme by phoneme in L1 speech
- Normalize with L1 phoneme log-likelihood statistics
[Diagram: "He is running." aligned as H IY1 IH1 Z R AH1 N IH0 NG at 10 ms frames; percentage conversion of the IH1 frames based on L1 IH1 log-likelihood statistics yields frame-wise percent scores such as 51, 76, 82, 93]
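A minimal sketch of the percentage conversion, assuming the L1 statistics are stored as raw per-phoneme frame log-likelihood samples and the percent score is an empirical percentile within that L1 distribution (the slide states only that L1 log-likelihood statistics are used for normalization):

```python
# Hypothetical sketch of the percentage conversion step, assuming
# per-phoneme L1 log-likelihood statistics are kept as frame-value samples.
import numpy as np

def percent_score(frame_llh: float, l1_llh_samples: np.ndarray) -> float:
    """Convert one frame's phoneme log-likelihood into a 0-100 score by
    locating it within the native (L1) log-likelihood distribution of the
    same phoneme (empirical percentile)."""
    return 100.0 * np.mean(l1_llh_samples <= frame_llh)

# Example: score the IH1 frames of an L2 utterance against L1 IH1 stats.
l1_ih1 = np.array([-8.2, -6.5, -5.9, -5.1, -4.8, -4.0, -3.2])  # toy L1 values
l2_frames = [-5.5, -7.9, -4.1]                                 # toy L2 frames
print([percent_score(f, l1_ih1) for f in l2_frames])  # frame-wise percent scores
```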
Method
Validation
1. Normalize phoneme scores per utterance into the 1-5 range
- For comparison with the human raters' 1-5 Likert-scale scores
2. Machine-human agreement
- Human-human agreement: agreement among the 5 human raters' scores, to test the credibility of the human scores
Result
Score agreement
[Columns inferred from the five-rater design: the 10 human-human pairs (H1-H2, H1-H3, H1-H4, H1-H5, H2-H3, H2-H4, H2-H5, H3-H4, H3-H5, H4-H5), their average, then the 5 human-machine pairs (H1-M ... H5-M) and their average]
Num of utt: 5674, 1890, 1890, 945, 1890, 1890, 945, 1890, 945, 945 || 5674, 5674, 1890, 1890, 945
PCC*: 0.69, 0.59, 0.55, 0.49, 0.54, 0.57, 0.51, 0.49, 0.46, 0.54 | avg 0.54 || 0.54, 0.55, 0.44, 0.53, 0.42 | avg 0.50
|SMD|: 0.06, 0.09, 0.46, 0.67, 0.15, 0.41, 0.35, 0.56, 0.68, 0.12 | avg 0.35 || 0.19, 0.13, 0.22, 0.26, 0.18 | avg 0.20
QWK: 0.69, 0.59, 0.49, 0.39, 0.53, 0.53, 0.49, 0.41, 0.36, 0.54 | avg 0.50 || 0.49, 0.51, 0.39, 0.50, 0.38 | avg 0.45
EPA: 47.02, 51.96, 35.56, 33.23, 46.24, 36.82, 37.78, 32.22, 32.70, 41.48 | avg 39.50 || 29.34, 31.81, 28.89, 33.92, 30.26 | avg 30.84
APA: 92.12, 94.34, 83.60, 81.69, 92.91, 85.87, 86.03, 81.01, 81.48, 87.09 | avg 86.61 || 73.37, 74.09, 76.93, 77.99, 74.39 | avg 75.35

Summary (averages): PCC* 0.54 (H-H) vs 0.50 (H-M); |SMD| 0.35 vs 0.20; QWK 0.50 vs 0.45; EPA 39.50 vs 30.84; APA 86.61 vs 75.35

>> Hypothesis
- Models using the raw log-likelihood of the users' speech perform poorly (Hu et al., 2015)
- An ASR-based phoneme evaluation system will show human-level performance if native speakers' log-likelihood statistics are used to normalize the raw log-likelihood of the users' speech
>> Results
- Human raters' scores are credible.
- Human-machine agreement is similar to human-human agreement.
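For reference, a sketch of the five agreement metrics computed between two raters' scores on the 1-5 scale; reading EPA/APA as exact and adjacent (within one point) percent agreement, and |SMD| as the absolute standardized mean difference with a pooled SD, are assumptions, since the slides do not define them:

```python
# A minimal sketch of the five agreement metrics on 1-5 integer scores.
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

def agreement(a: np.ndarray, b: np.ndarray) -> dict:
    pcc, p = pearsonr(a, b)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    smd = abs(a.mean() - b.mean()) / pooled_sd          # |SMD| (assumed form)
    qwk = cohen_kappa_score(a, b, weights="quadratic")  # quadratic weighted kappa
    epa = 100 * np.mean(a == b)                         # exact agreement (assumed)
    apa = 100 * np.mean(np.abs(a - b) <= 1)             # adjacent agreement (assumed)
    return {"PCC": pcc, "p": p, "|SMD|": smd, "QWK": qwk, "EPA": epa, "APA": apa}

rng = np.random.default_rng(0)
h = rng.integers(1, 6, 200)                      # toy human scores
m = np.clip(h + rng.integers(-1, 2, 200), 1, 5)  # toy machine scores
print(agreement(h, m))
```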
2.2. Rhythm Evaluation
Intro
Overview
[Pipeline diagram]
STEP 1. Modeling: rhythm feature selection + multiple linear regression on ERJ L1/L2 speech, trained on ERJ mean rhythm scores —> STEP 2. Scoring: predict rhythm scores (model-predicted rhythm scores) —> STEP 3. Validation: human-machine agreement against 4 human raters' scores, plus a rhythm-score comparison across Jap L1, Eng L1, and Eng L2 speech
Method
Modeling
1. Rhythm feature selection
[Diagram: candidate rhythm features (feat1 ... feat27) —> feature selection —> multiple linear regression, fit to the mean of the 4 human raters' rhythm scores]
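A minimal sketch of this step under stated assumptions: synthetic data stand in for the 27 candidate features and the 950 scored items, and forward sequential selection stands in for whatever selection procedure was actually used:

```python
# Sketch: feature selection + multiple linear regression fit to the mean
# human rhythm scores. Data and selector settings are placeholders.
import numpy as np
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(42)
X = rng.normal(size=(950, 27))  # 950 items x 27 candidate rhythm features
y = np.clip(2.5 + X[:, 0] - 0.5 * X[:, 3] + rng.normal(0, 0.5, 950), 1, 5)

select = SequentialFeatureSelector(LinearRegression(), n_features_to_select=8)
select.fit(X, y)                                   # keep the predictive subset
model = LinearRegression().fit(select.transform(X), y)
rhythm_scores = model.predict(select.transform(X))  # model-predicted scores
```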
Result
Score agreement
Metric: H1-H2, H1-H3, H1-H4, H2-H3, H2-H4, H3-H4 | H-H Avg || H1-M, H2-M, H3-M, H4-M | H-M Avg
Num of items: 950 for every pair
PCC*: 0.53, 0.55, 0.46, 0.45, 0.56, 0.42 | 0.50 || 0.24, 0.43, 0.29, 0.37 | 0.33
|SMD|: 0.06, 0.40, 0.18, 0.33, 0.24, 0.57 | 0.30 || 0.35, 0.29, 0.03, 0.50 | 0.29
QWK: 0.53, 0.50, 0.45, 0.42, 0.55, 0.35 | 0.47 || 0.18, 0.37, 0.24, 0.28 | 0.27
EPA: 38.21, 40.11, 39.05, 34.00, 39.47, 30.95 | 36.96 || 24.21, 26.84, 24.63, 25.47 | 25.29
APA: 84.95, 86.63, 81.26, 83.79, 85.37, 75.79 | 82.96 || 62.84, 71.68, 67.89, 64.63 | 66.76

>> Hypothesis
- A multiple-linear-regression rhythm scoring model trained with rhythm features found suitable in previous studies will show human-level performance.
>> Results
- Human raters' scores are credible.
- Human-machine agreement is similar to human-human agreement.
Result
ERJ L1 vs ERJ L2 vs JNAS
[Table fragment: predicted rhythm-score rankings across ERJ L1, ERJ L2, and JNAS per feature set; not fully recoverable from the slide]
Interpretation 1: Global Interval Proportion, Duration, nPVI, rPVI, Speech Rate —> nativeness
Interpretation 2: Pause, Isochrony —> English rhythm
Interpretation 3: With all the features combined, the resulting ranking is L1 > JNAS > L2, with the effects neutralized (BEST features)
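As a concrete reference for two of the duration features named in Interpretation 1, a short sketch of the raw and normalized Pairwise Variability Index (rPVI/nPVI) over a sequence of interval durations:

```python
# Raw and normalized Pairwise Variability Index over interval durations.
import numpy as np

def rpvi(durations):
    d = np.asarray(durations, dtype=float)
    return np.mean(np.abs(np.diff(d)))  # mean absolute successive difference

def npvi(durations):
    d = np.asarray(durations, dtype=float)
    # Each successive difference is normalized by the pair's mean duration.
    return 100 * np.mean(np.abs(np.diff(d)) / ((d[:-1] + d[1:]) / 2))

vowel_intervals = [0.12, 0.08, 0.21, 0.09, 0.15]  # toy durations in seconds
print(rpvi(vowel_intervals), npvi(vowel_intervals))
```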
Summary
>> Hypothesis
- A rhythm scoring model trained with English rhythm scores will predict scores in the following order:
Eng L1 > Eng L2 > Jap L1.
>> Results
- The overall result is Eng L1 > Jap L1 > Eng L2, confirming Eng L1 > Eng L2
- Trained with sentence-internal pause and isochrony features, the model yields Eng L1 > Eng L2 > Jap L1
- Trained with duration-related features of language units, it rejects the ordering Jap L1 > Eng L1 > Eng L2
>> Conclusion
2.3. Intonation Evaluation
Intro
Overview
[Pipeline diagram]
STEP 1. Modeling: feature extraction + multiple linear regression on ERJ L2 speech, trained on ERJ mean intonation scores —> STEP 2. Scoring: predict intonation scores (model-predicted intonation scores) —> STEP 3. Validation: human-model agreement against 4 human raters' scores
Method
Modeling
[Diagram: 8 features (feat1 ... feat8) comparing the L2 speech with L1 model speech —> multiple linear regression, fit to the mean of the 4 human raters' intonation scores]
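An illustrative sketch of one plausible L1-L2 pitch-similarity feature; the slides only name feat1 ... feat8 as pitch similarities against L1 model speech, so the F0 extractor (pYIN), the interpolation and smoothing, and the resample-then-correlate design here are all assumptions:

```python
# Hypothetical pitch-similarity feature: pYIN F0 -> fill unvoiced gaps ->
# light smoothing -> resample to a common length -> Pearson correlation.
import numpy as np
import librosa

def pitch_contour(path: str) -> np.ndarray:
    y, sr = librosa.load(path, sr=16000)
    f0, voiced, _ = librosa.pyin(y, fmin=60, fmax=400, sr=sr)
    idx = np.arange(len(f0))
    f0 = np.interp(idx, idx[voiced], f0[voiced])          # fill unvoiced frames
    return np.convolve(f0, np.ones(5) / 5, mode="same")   # smooth

def pitch_similarity(l1_path: str, l2_path: str) -> float:
    a, b = pitch_contour(l1_path), pitch_contour(l2_path)
    n = min(len(a), len(b))  # resample both contours to a common length
    a = np.interp(np.linspace(0, len(a) - 1, n), np.arange(len(a)), a)
    b = np.interp(np.linspace(0, len(b) - 1, n), np.arange(len(b)), b)
    return float(np.corrcoef(a, b)[0, 1])
```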
Method
Scoring & Validation
Example sentences with [Model, H1, H2, H3, H4] scores:
- "That's from my brother who lives in London." —> 5, 3, 4, 3, 4
- "Fred ate the beans." —> 2, 1, 2, 1, 1
- "The play ended, happily." —> 5, 2, 4, 3, 5
Result
Score agreement
Metric: H1-H2, H1-H3, H1-H4, H2-H3, H2-H4, H3-H4 | H-H Avg || H1-M, H2-M, H3-M, H4-M | H-M Avg
Num of items: 950 for every pair
PCC*: 0.45, 0.49, 0.54, 0.35, 0.47, 0.38 | 0.45 || 0.22, 0.17, 0.16, 0.19 | 0.19
|SMD|: 0.89, 0.34, 0.61, 0.61, 0.14, 0.35 | 0.49 || 0.86, 0.04, 0.57, 0.11 | 0.39
QWK: 0.32, 0.46, 0.45, 0.30, 0.44, 0.33 | 0.38 || 0.16, 0.17, 0.15, 0.18 | 0.17
EPA: 26.11, 37.89, 28.95, 31.16, 32.84, 23.79 | 30.12 || 22.95, 32.63, 28.95, 27.79 | 28.08
APA: 69.26, 85.37, 70.42, 79.68, 79.47, 70.84 | 75.84 || 65.89, 79.37, 74.95, 67.89 | 72.03

>> Hypothesis
- A multiple-linear-regression intonation scoring model trained with features of L1-L2 pitch similarity will show human-level performance.
>> Results
- The multiple-linear-regression intonation scoring model trained with features of L1-L2 pitch similarity shows human-level performance (e.g., APA 75.84 human-human vs 72.03 human-machine).
2.4. Stress Evaluation
Intro
Overview
[Pipeline diagram]
STEP 1. Modeling —> STEP 2. Scoring: predict stress scores (model-predicted stress scores) —> STEP 3. Validation: human-model agreement against the mean of 2 human raters' scores
Method-1
Modeling
1. DNN-HMM based acoustic model
- The same acoustic model used for the phoneme scoring system
- Word-to-pronunciation mapping based on the CMU dictionary
- Stress marks: 0 = no stress, 1 = primary stress, 2 = secondary stress
2. Vowel stress recognition model
- Multiply the reference pronunciations by applying the 3 stress types {0, 1, 2} to each vowel
- Build the vowel stress recognizer by combining this expanded pronunciation dictionary with the DNN-HMM acoustic model
[Diagram: IN = frame-wise features (MFCC, speaker-independent + i-Vector, speaker-dependent); OUT = 69 phonemes (24 consonants + 15 vowels x 3 stress levels). Example "About": AH{0,1,2} B AW{0,1,2} T, so 3 x 3 = 9 pronunciation combinations can possibly be recognized, e.g. AH0 B AW0 T, AH0 B AW1 T, ...]
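A minimal sketch of the dictionary expansion behind the "About" example; the 15-vowel inventory is the standard CMUdict vowel set:

```python
# Expand a dictionary pronunciation into all vowel-stress variants
# ({0,1,2} per vowel), as in the "About" example above.
from itertools import product

VOWELS = {"AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
          "EY", "IH", "IY", "OW", "OY", "UH", "UW"}  # 15 CMUdict vowels

def stress_variants(pron):
    """['AH', 'B', 'AW', 'T'] yields 3 x 3 = 9 stress-marked variants."""
    slots = [[p + s for s in "012"] if p in VOWELS else [p] for p in pron]
    return [" ".join(combo) for combo in product(*slots)]

for variant in stress_variants(["AH", "B", "AW", "T"]):
    print(variant)  # AH0 B AW0 T, AH0 B AW1 T, ..., AH2 B AW2 T
```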
Method-1
Scoring & Validation
*PCC values are statistically significant (p < 0.001) unless a p-value is noted otherwise.
[Results table not recoverable from the slide]
Summary-1
Discussion
>> Hypothesis
[Pipeline diagram]
STEP 1. Modeling: feature extraction + multiple linear regression on ERJ L2 speech, trained on ERJ mean stress scores —> STEP 2. Scoring: predict stress scores (model-predicted stress scores) —> STEP 3. Validation: machine-human agreement against the mean of 2 human raters' scores
Method-2
Modeling
1. Stress feature extraction from ERJ L2 & L1 speech
- ERJ L2 speech with stress scores: 1,900 utterances
- L1 speech reading the same sentences as ERJ L2 (1 audio per sentence)
- Dynamic Time Warping (DTW) using the MFCCs of the L1 and L2 speech
- Align the pitch and energy sequences of L1 and L2 based on the DTW alignment
(pitch: interpolation + smoothing; energy: interpolation)
- Extract 3 similarity features from the aligned pitch and energy sequences (Arias et al., 2010):
- Trend similarity correlation: normalized dot product(A, B) / (std A x std B)
- Energy-pitch KL divergence
- Combined: alpha x energy trend similarity + (1 - alpha) x pitch trend similarity
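A sketch of this feature extraction under stated assumptions: librosa's DTW over MFCCs stands in for the alignment step, and the trend similarity is read as a mean-centered normalized dot product of the DTW-aligned sequences; the pitch/energy sequences are assumed to be frame-synchronous with the MFCCs:

```python
# Hypothetical Method-2 features: DTW on MFCCs aligns L1/L2 frames; the
# aligned pitch/energy sequences are then compared by trend similarity.
# Pitch/energy must be computed with the same hop length as the MFCCs.
import numpy as np
import librosa

def dtw_frame_pairs(y_l1, y_l2, sr=16000):
    m1 = librosa.feature.mfcc(y=y_l1, sr=sr, n_mfcc=13)
    m2 = librosa.feature.mfcc(y=y_l2, sr=sr, n_mfcc=13)
    _, wp = librosa.sequence.dtw(X=m1, Y=m2)  # warping path, end to start
    return wp[::-1]                           # (l1_frame, l2_frame) pairs

def trend_similarity(seq_l1, seq_l2, pairs):
    # Mean-centering before the normalized dot product is an assumption.
    a = seq_l1[pairs[:, 0]] - seq_l1[pairs[:, 0]].mean()
    b = seq_l2[pairs[:, 1]] - seq_l2[pairs[:, 1]].mean()
    return float(np.dot(a, b) / (len(a) * a.std() * b.std()))

# Combined feature as read from the slide:
# alpha * energy_trend_similarity + (1 - alpha) * pitch_trend_similarity
```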
Method-2
Scoring & Validation
Metric: H1-H2 (= H-H) || H1-M, H2-M | H-M Avg
Num of items: 1900 per pair
|SMD|: 0.69 (H-H) vs 0.39 (H-M Avg)
QWK: 0.35 vs 0.12
EPA: 34.00 vs 32.84
APA: 75.32 vs 81.58
[Per-pair H1-M/H2-M values and the PCC row are not recoverable from the slide]

>> Results
- Human raters' scores are credible.
- Human-machine agreement is similar to human-human agreement.
>> Conclusion
- A multiple-linear-regression stress scoring model trained with features of L1-L2 pitch & energy similarity shows human-level performance.
3. Effectiveness of Automated Evaluation of Phoneme
Intro
Overview
1. Goal
Examine the effectiveness of pronunciation score feedback from an automated pronunciation evaluation system for
non-native English learners
2. Participants
17 Korean undergraduate students
3. Experiment
Practice pronunciation of English words with real-time pronunciation scores from the automated English
pronunciation evaluation system
3 types of pronunciation score feedback: word score, syllable score, phoneme score
4. Survey
User experience of the pronunciation evaluation system: subjective evaluation by the participants
Method
Participants
[Screenshot: practice interface with the word to practice, recording, playback, and real-time score]
1. The participants are guided to practice pronouncing words at the provided URL during the last 15 minutes of every
hour (in the order: word score type, phoneme score type, syllable score type)
2. Right before the 1st practice, the professor demonstrates how to record and how to interpret the score feedback
with the test sentence "This is a test."
3. The participants access the URL and log in with their student IDs on their own PCs (the class is online, so
they must find their own private places with a strong Internet connection and no noise)
5. After the practice, they answer the 21 questions on user experience of the system (Google Form)
Method
Experiment & survey for user experience
Research questions (X = independent variable, Y = dependent variable; each measured objectively and subjectively as Y/N + how):
RQ1: Feedback type —> Pronunciation improvement
RQ2: Feedback type —> Participation
RQ3: English proficiency —> Pronunciation improvement
RQ4: English proficiency —> Participation

Result
Feedback type on pronunciation improvement (obj)
• Feedback type has no statistically significant effect on score improvement, but it has a statistically significant effect on both trial 1 and trial 5 scores.
• In general, scores differ significantly between trial 1 and trial 5.
• But in the phoneme score type, there is no significant difference between trial 1 and trial 5 (Kim et al., 2020)
Result
Feedback type on participation (obj)
• Feedback type has an effect on learners' participation.
• The numbers of practices are in the following order: word > phoneme > syllable.
Result
English proficiency on pronunciation improvement (obj)
Pearson correlations between TOEIC scores and practice scores (Pearson's r / p-value):
             TOEIC & trial 1     TOEIC & trial 5     TOEIC & score improvement
All          0.305 / 0.233       0.420* / 0.093      -0.017 / 0.949
Word         0.309 / 0.228       0.210 / 0.418       -0.249 / 0.336
Syllable     0.201 / 0.439       0.515** / 0.035     0.252 / 0.329
Phone        0.137 / 0.601       0.109 / 0.677       -0.054 / 0.836
***p<0.01, **p<0.05, *p<0.1
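A table like this can be reproduced with a plain Pearson correlation test; the numbers below are toy values, not the study's data:

```python
# Pearson's r and p-value between proficiency and per-trial scores.
from scipy.stats import pearsonr

toeic  = [730, 815, 950, 680, 775, 900, 855, 705]  # toy TOEIC scores
trial5 = [78, 84, 88, 72, 80, 91, 86, 70]          # toy trial-5 scores
r, p = pearsonr(toeic, trial5)
print(f"Pearson's r = {r:.3f}, p = {p:.3f}")
```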
• The higher the TOEIC score, the higher the scores in both trial 1 and trial 5.
• The higher the TOEIC score, the smaller the pronunciation improvement (Derwing and Munro, 2005; Lee et al., 2015).
• In the syllable score feedback type, positive correlation between TOEIC scores and trial 5 scores.
• In the syllable score feedback type, positive correlation between TOEIC scores and score improvement.
—> The more English knowledge, the better learners utilize syllable score information.
Result
English proficiency on participation (obj)
• The higher the TOEIC score, the smaller the number of practices.
• Except for the syllable score feedback type.
Result
Feedback type on participation/satisfaction (subj)
Q17. Which feedback type (word, phoneme, syllable) was the most helpful?
Feedback type: word 11%, syllable 33%, phone 56%
• Those with fewer practices agreed more that the system enhances the motivation/opportunities to speak English.
• This was statistically significant in the word score feedback type.
—> The sooner they stopped practicing, the more positive their perception.
• Those with fewer practices in the word feedback type agreed more that the word score feedback was helpful.
—> Feedback satisfaction and number of practices are negatively correlated.
Result
English proficiency on pronunciation improvement (subj)
Q5. Using the automatic pronunciation evaluation system improved my English pronunciation.
q5: Pearson's r = 0.241*, p = 0.089
• Those with higher proficiency agreed more that their English pronunciation improved using the system.
• Those with lower proficiency agreed more that their English speaking improved using the feedback.
—> Those with lower proficiency consider explicit feedback more helpful.
Result
English proficiency on satisfaction/participation (subj)
Q3. I enjoy practicing English pronunciation with the automatic pronunciation evaluation system.
q3: Pearson's r = 0.336**, p = 0.016
Summary
RQ1: Does the subword score type have an effect on pronunciation improvement with the automated pronunciation evaluation system?
Ans: Yes
- Obj: In the order word > syllable > phoneme; statistically significant except for phoneme
- Subj: In the order phoneme > syllable > word (the finer the feedback granularity, the higher the learner satisfaction)
RQ2: Does the subword score type have an effect on participation/satisfaction with the automated pronunciation evaluation system?
Ans: Yes
- Obj: In the order word > phoneme > syllable; the lower the initial score, the lower the participation level
- Subj: Participants with lower participation showed higher satisfaction
RQ3: Does English proficiency have an effect on pronunciation improvement with the automated pronunciation evaluation system?
Ans: Yes
- Obj: Those with higher proficiency showed a smaller degree of improvement, despite the highest final scores.
But they were more capable of utilizing explicit feedback information in advanced pronunciation tasks.
- Subj: Those with higher proficiency agreed more that the automated pronunciation evaluation system helped improve their pronunciation.
RQ4: Does English proficiency have an effect on participation/satisfaction with the automated pronunciation evaluation system?
Ans: Yes
- Obj: The higher the proficiency, the fewer the practices.
- Subj: The higher the proficiency, the higher the satisfaction with the system.
4. Conclusion
Conclusion
• Research hypotheses
• Evaluation of English segmentals
• An ASR-based phoneme evaluation system will provide human-evaluator-level scores when it is modeled
with native speakers' phoneme information > TRUE
• Evaluation of English suprasegmentals
• It will provide human-evaluator-level scores when rhythm is modeled on duration, intonation
on pitch, and stress on energy and pitch > TRUE
• Evaluation of the effectiveness of automated English phoneme scoring systems
• English learners' pronunciation improves after practicing with the automated English
phoneme evaluation system multiple times > TRUE
• English learners' satisfaction depends on the pronunciation score unit (word,
syllable, phoneme) > TRUE
5. Future Studies
Future Studies
• Experiment on effectiveness of automated suprasegmental evaluation on
pronunciation learning
References
- Hönig, F., Batliner, A., & Nöth, E. (2012). Automatic assessment of non-native prosody: Annotation, modelling and evaluation.
- Hu, W., Qian, Y., Soong, F. K., & Wang, Y. (2015). Improved mispronunciation detection with deep neural network trained acoustic models and transfer learning based logistic regression classifiers. Speech Communication, 67, 154-166.
- Kim, J. E., Cho, Y., Cho, Y., Hong, Y., Kim, S., & Nam, H. (2020). The effects of L1-L2 phonological mappings on L2 phonological sensitivity: Evidence from self-paced listening. Studies in Second Language Acquisition, 42(5), 1041-1076.
- Kim, M. (2020). A study of rhythm improvements and relevant linguistic factors in the pronunciation of English learners. Studies in Foreign Language Education, 34(1), 237-261.
- Kommissarchik, J., & Komissarchik, E. (2000). Better Accent Tutor: Analysis and visualization of speech prosody. Proceedings of InSTILL 2000, 86-89.
- Lee, J., Jang, J., & Plonsky, L. (2015). The effectiveness of second language pronunciation instruction: A meta-analysis. Applied Linguistics, 36(3), 345-366.
- Lee, O. (2019). Suprasegmental instruction and the improvement of EFL learners' listening comprehension. English Language & Literature Teaching, 25(4), 41-60.
- Lin, B., Wang, L., Feng, X., & Zhang, J. (2020). Automatic scoring at multi-granularity for L2 pronunciation. Proc. Interspeech 2020, 3022-3026.
- Loukina, A., Zechner, K., Bruno, J., & Klebanov, B. B. (2018). Using exemplar responses for training and evaluating automated speech scoring systems. Proceedings of the Thirteenth Workshop on Innovative Use of NLP for Building Educational Applications, 1-12.
- McGregor, A., & Reed, M. (2018). Integrating pronunciation into the English language curriculum: A framework for teachers. CATESOL Journal, 30(1), 69-94.
- Minematsu, N., Tomiyama, Y., Yoshimoto, K., Shimizu, K., Nakagawa, S., Dantsuji, M., & Makino, S. (2004). Development of English speech database read by Japanese to support CALL research. Proc. ICA, 1, 557-560.
- Neri, A., Cucchiarini, C., & Strik, H. (2002). Feedback in computer assisted pronunciation training: When technology meets pedagogy.
- Pérez-Ramón, R., Cooke, M., & Lecumberri, M. L. G. (2020). Is segmental foreign accent perceived categorically? Speech Communication, 117, 28-37.
- Prince, J. B. (2014). Contributions of pitch contour, tonality, rhythm, and meter to melodic similarity. Journal of Experimental Psychology: Human Perception and Performance, 40(6), 2319.