by Sander Van Wijngaarden
This article focuses on ways an audio engineer can improve speech intelligibility for non-native listeners.
Our natural reflexes tell us what to do about poor intelligibility
Picture a foreign tourist, lost in the streets of a big city – the city where you live, perhaps. The tourist is looking at a city map, trying to make sense of it, but obviously failing. The tourist turns to you, and asks for directions – in heavily accented English. You try to help out, but the conversation quickly becomes very confused and, frankly, a little bit frustrating. After a few minutes, you find yourself repeating the same phrases over and over again. Suddenly you realize that you have raised your voice – you are, in fact, on the verge of shouting. You feel a bit embarrassed – certainly shouting is not going to help. Or is it?
If this has happened to you, don’t be embarrassed. Raising your speech level to increase intelligibility is a universal, natural reflex. When this reflex is triggered by the presence of ambient noise, it is known as the Lombard effect. Other causes of decreased intelligibility trigger the same response: “I’m not understood, so I need to speak up.”
So the reflex is natural, but more importantly: it usually works! To a certain degree, problems occurring in non-native speech communication can be fixed by speaking more loudly. The reason is that we are improving our speech-to-noise ratio by increasing our vocal effort.
Why non-native listeners require a better speech-to-noise ratio
In almost every real-life situation, noise and reverberation have a measurable effect on speech intelligibility. This effect is not always noticeable to the individual listener – real-life speech is very robust against degrading influences such as noise.
The way we form sentences introduces loads of redundancy. Even if a large proportion of phonemes in a (meaningful) sentence are knocked out by noise, we are still able to reconstruct the sentence without errors. We’re usually not even aware of it, but linguistic redundancy (introduced, for instance, by the use of grammar) is like a natural error correction code. Even if half of the individual phonemes in a sentence are lost, we can usually reproduce the sentence without much effort. That is: assuming that we are listening to speech in our native language, free from accents.
Non-native listeners also make use of linguistic redundancy to recreate meaningful messages from degraded speech, but they are not as efficient at making use of context: their vocabulary is smaller, their understanding of grammar is less intuitive, and their phonetic categorization process is flawed. In other words, the linguistic error correction algorithm does not perform as well. For the same level of intelligibility, a non-native listener requires a much better phoneme recognition rate, and hence a higher speech-to-noise ratio.
Expressing non-nativeness in decibels
This model of the non-native speech perception process is not new; we have come to think of non-native speech this way because of an enormous number of phonetic and linguistic studies – many of them done decades ago. Unfortunately, much of this research was done by phoneticians and linguists, and is not at all applicable to the acoustician or the audio engineer. From a scientist’s perspective, it’s nice to understand exactly how the non-native phonetic categorization process works – but that does not tell us how much effect this has, or how to compensate for it.
This is why we did a series of studies in the years 2000-2005 to determine:
- Whether a better speech-to-noise ratio (SNR) can compensate for the loss of intelligibility suffered by non-native listeners
- If so, how big the difference between native and non-native intelligibility is, and whether we can put a number on non-nativeness in decibels
The answers to these questions turned out to be: yes, we can put a dB number on the effect of non-nativeness. This is true for non-native listeners as well as non-native talkers (accented speech). The effect in terms of SNR can be anything between 2 and 10 dB – depending on proficiency.
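As a rough illustration of what a penalty expressed in decibels means in practice, the sketch below treats the non-native effect as an extra SNR margin added on top of whatever SNR is considered adequate for native listeners. The function names and the example noise level and native target are hypothetical and are not taken from the cited studies; only the 2 to 10 dB range comes from the text above.

```python
def snr_db(speech_level_db: float, noise_level_db: float) -> float:
    """Speech-to-noise ratio in dB: the difference between the two levels."""
    return speech_level_db - noise_level_db


def level_for_non_native(noise_level_db: float,
                         native_target_snr_db: float,
                         penalty_db: float) -> float:
    """Speech level (dB) needed so that the SNR exceeds the native
    target by the non-native penalty."""
    return noise_level_db + native_target_snr_db + penalty_db


# Hypothetical example values, chosen only for illustration.
noise = 60.0          # ambient noise level in dB
native_target = 3.0   # SNR assumed adequate for native listeners, in dB

# 2 dB and 10 dB are the two ends of the range quoted in the text.
for penalty in (2.0, 10.0):
    level = level_for_non_native(noise, native_target, penalty)
    print(f"penalty {penalty:4.1f} dB -> speech level {level:.1f} dB, "
          f"SNR {snr_db(level, noise):.1f} dB")
```

The point of the sketch is only that the correction is additive in decibels: a listener with an 8 dB penalty needs the speech 8 dB louder (or the noise 8 dB lower) to end up where a native listener would be.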
The relation between speech-to-noise ratio and intelligibility is shown in the figure below, for native as well as (worst-case) non-native listening (Source: Van Wijngaarden et al. (2002), “Quantifying the Intelligibility of Speech in Noise for Non-native Listeners,” J. Acoust. Soc. Am., 111(4), 1906-1916).
The definition of “worst-case non-native listener” in this case is: a listener whose proficiency is barely sufficient to take part in formal speech intelligibility tests. Most tourists in the US will understand English much better than this. Curves like this can be measured for every level of proficiency above this practical threshold. In fact, many of these curves were measured, differing not only in the proficiency level of the test subjects, but also in the language combinations.
Similar relations apply when studying the effect of non-native talkers: foreign accented speech is affected in a similar fashion, as shown in the figure below. (Source: Van Wijngaarden et al. (2002), “Quantifying the Intelligibility of Speech in Noise for Non-native Talkers,” J. Acoust. Soc. Am., 112(6), 3004-3013).
Interpreting Speech Transmission Index measurements in case of non-native speech
Figures such as the ones shown above may quantify intelligibility effects, but they are not easily applicable to the audio engineer. However, they are relatively easily applied to our interpretation of the Speech Transmission Index (STI).
The Speech Transmission Index measures speech transmission quality: it tells us how well speech intelligibility is preserved along the tested transmission path. With a number of assumptions (which exclude non-native speech), the STI is an accurate predictor of speech intelligibility.
With non-native speech, the transmission path is unaffected – the problem lies with speech perception and speech production. Since the STI characterizes the path, we cannot incorporate non-native communication in the STI model. What we can do, however, is modify the way we interpret STI results.
The STI is a number in the range of 0 to 1. Different labels (from “bad” to “excellent”) correspond to different ranges of the STI scale. Based on quantitative studies like the ones cited above, alternative interpretation tables can be compiled. In fact, in the latest edition of the STI standard (IEC 60268-16 4th ed., 2011), such tables are provided in Annex H.
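The idea of reinterpreting the same STI value with a different table can be sketched as a simple lookup. In the sketch below, the native band edges are the widely quoted boundaries between the five labels; the non-native edges are explicitly placeholders made up for illustration and must be replaced with the values from Annex H for the proficiency level in question. The function and constant names are hypothetical.

```python
from bisect import bisect_right

LABELS = ("bad", "poor", "fair", "good", "excellent")

# Widely quoted label boundaries for native listeners.
NATIVE_EDGES = (0.30, 0.45, 0.60, 0.75)

# PLACEHOLDER values, NOT taken from IEC 60268-16 Annex H.
# Substitute the real table for the relevant proficiency level.
PLACEHOLDER_NON_NATIVE_EDGES = (0.35, 0.55, 0.70, 0.85)


def sti_label(sti: float, edges=NATIVE_EDGES) -> str:
    """Map an STI value to a qualification label using the given band edges."""
    if not 0.0 <= sti <= 1.0:
        raise ValueError("STI must lie between 0 and 1")
    if len(edges) != len(LABELS) - 1:
        raise ValueError("expected one edge between each pair of labels")
    # bisect_right counts how many edges the value has reached or passed.
    return LABELS[bisect_right(edges, sti)]


measured = 0.50
print(sti_label(measured))                                # native reading
print(sti_label(measured, PLACEHOLDER_NON_NATIVE_EDGES))  # stricter reading
```

The measurement itself does not change; only the table used to read it does. With stricter band edges, the same measured STI earns a lower label, which is what "raising the bar" means for a design requirement.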
Practical rules of thumb
To the sound system designer, it is usually not practical to completely study and model the influence of non-native factors. However, in situations where a high number of people are likely to be non-native listeners, such as airports, it may be worthwhile to keep the following in mind:
- No matter how well you design your system, there is no way to reach 100% of the population. There will always be people who are wearing headphones, or are just completely distracted. It only makes sense to be concerned specifically with non-native intelligibility if a significant percentage of your listener population is non-native – say, more than 10%.
- If you do decide to take non-native perception into account in your design, this simply means that you need to raise the bar in terms of STI. The question is: by how much? You will need information on the average proficiency of the non-native listeners. This usually comes down to an educated guess, either by the audio engineer or his customer. This then needs to be translated into an STI requirement using the table shown above.
- The intelligibility of high-quality pre-recorded messages is always better than the intelligibility of live speech. This is always true, but even more so for non-native listeners. Never use foreign-accented speech in recordings – every listener will suffer a decrease in intelligibility, even a listener who shares the same accent. A strong regional accent, even if it does not bother native listeners, can have a profound impact on intelligibility for non-native listeners.
- It is always risky to rely on just “informal listening tests” to evaluate intelligibility – but if you need to design for non-native listeners, informal listening tells you… nothing at all. Just look at the curves in the figure above corresponding to non-native listeners (marked as “Fig.11”). If a native “informal listener” experiences no problems with intelligibility, then sentence intelligibility can be anything over (say) 90% – so the speech-to-noise ratio will be in the range between +3 dB and infinity. That means that, to a non-native listener, sentence intelligibility could still be as low as 25% – at which level, the sound system is utterly useless to the listener.