Full HD Voice Will Soon Give Your Phone an Audio Upgrade
Powerful algorithms will yield perfectly clear calls
HD Voice, the first major upgrade to telephone sound quality since the vacuum-tube era, has finally become widely available—just in time for a new generation of phone service called Full-HD Voice to take its place.
At the Mobile World Congress in Barcelona earlier this year, Fraunhofer IIS (Institute for Integrated Circuits) demonstrated a system based on a combination of powerful standard algorithms that can encode and decode in real time the full audio spectrum to 20 kilohertz in stereo. Switching to Full HD, which could be done in many devices as early as next year, would also mark the complete merging of voice into the mobile data stream, a goal long in the making.
Full-HD Voice converts speech into packets that can flow through the Internet along with data traffic, incorporating algorithms that can recover from packet loss, the problem that turns today’s Voice over Internet Protocol (VoIP) calls into choppy, unintelligible hash. The technology includes algorithms that encode music and other nonspeech audio, sounds that are typically mangled by codecs optimized to squeeze many voice calls into narrow slices of the spectrum. Because Full-HD Voice carries the whole audio spectrum, calls sound as if everybody’s in the same room; you can even hear soft background sounds, like the faint clatter of fingers on a keyboard. And the powerful coding-decoding (codec) software can run as a smartphone app.
“We want to bring telephony into the 21st century,” just as HD television has done for video, says H.P. Baumeister, director of Fraunhofer IIS’s U.S. branch, in San Jose, Calif.
There’s no doubt that voice telephony still has a foot in the 20th century. Modern landline phones have a frequency range of 300 to 3,400 hertz, a standard based on Bell Labs studies of the requirements for intelligible speech dating back to the 1920s. That range cuts off high frequencies needed to discriminate between consonants such as f and s, but it fit the limited bandwidth of old analog copper phone lines.
In 1988, the International Telecommunication Union approved the G.722 standard for HD Voice, which allows digital phone lines to carry 50 to 7,000 Hz. But it was little used because it would have required upgrading the landline phone network. The first three generations of cellular phones instead retained the 3,400-Hz narrowband landline audio, but they often sounded worse because of the way they compressed speech to squeeze more calls into the limited mobile spectrum. [See “Why Mobile Voice Quality Still Stinks—and How to Fix It,” IEEE Spectrum, October 2014.]
The broader bandwidth of the Internet allowed Skype and some other VoIP services to carry 7,000-Hz HD Voice, but VoIP calls into the phone network have been limited to 3,400 Hz. Most 4G smartphones include dedicated circuits running algorithms to code and decode 7,000-Hz HD Voice, but they can connect at that rate only if both phones and every link between them can handle the signals. In practice, that means it works only between 4G phones on the same carrier.
Full HD will be able to bridge the audio gap regardless of the network or the device connected to it. The technological heart of Full-HD Voice is a standard called the Enhanced Voice Services (EVS) codec. Its speech compression algorithms are more complex and powerful than those used for the decade-old HD Voice system, and it can squeeze stereo speech spanning the whole audible range into data rates as low as 9.6 kilobits per second. The codec also includes other algorithms developed to compress music.
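For a sense of scale, the short calculation below compares the 9.6 kb/s figure with uncompressed audio. It is only a back-of-the-envelope sketch: the article does not say what raw format the codec starts from, so CD-style PCM (44,100 samples per second, 16 bits, two channels) is assumed here as the baseline, and the function names are invented for illustration.

```python
# Back-of-the-envelope comparison of uncompressed audio with the coded rate
# quoted in the article. Using CD-style PCM as the raw baseline is an
# assumption made for illustration, not something the article states.

CD_SAMPLE_RATE_HZ = 44_100   # samples per second, per channel
CD_BITS_PER_SAMPLE = 16
CD_CHANNELS = 2


def uncompressed_kbps(sample_rate_hz, bits_per_sample, channels):
    """Raw PCM bit rate in kilobits per second."""
    return sample_rate_hz * bits_per_sample * channels / 1000


def compression_ratio(raw_kbps, coded_kbps):
    """How many times smaller the coded stream is than the raw one."""
    return raw_kbps / coded_kbps


cd_kbps = uncompressed_kbps(CD_SAMPLE_RATE_HZ, CD_BITS_PER_SAMPLE, CD_CHANNELS)
ratio_at_9_6 = compression_ratio(cd_kbps, 9.6)

print(f"CD-style PCM: {cd_kbps:.1f} kb/s")
print(f"Ratio at 9.6 kb/s: about {ratio_at_9_6:.0f} to 1")
```

Under that assumption the raw stream runs at 1,411.2 kb/s, so a 9.6 kb/s stream is roughly 147 times smaller.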
The separate algorithms are vital because speech and music are compressed in different ways. Voice compression typically relies on a technique called code-excited linear prediction (CELP), which is built on the physics underlying the human vocal system. CELP can reduce the data rate of voice signals by about a factor of 10. “That coding did a good job on speech but was terrible on everything else,” says Richard Stern, an electrical and computer engineering professor at Carnegie Mellon University, in Pittsburgh.
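The "linear prediction" half of CELP can be illustrated with a toy example. The sketch below is not the EVS codec or a complete CELP coder (it has no excitation codebook), and the synthetic "voice" is an invented stand-in: a pulse train passed through a two-pole resonator with made-up coefficients. It uses the standard Levinson-Durbin recursion to fit a predictor, then shows that what is left after prediction carries far less energy than the signal itself, which is why predictable signals such as speech compress well this way.

```python
def autocorrelation(signal, max_lag):
    """Autocorrelation r[0..max_lag] of a finite signal."""
    n = len(signal)
    return [
        sum(signal[i] * signal[i - lag] for i in range(lag, n))
        for lag in range(max_lag + 1)
    ]


def lpc_coefficients(signal, order):
    """Levinson-Durbin recursion. Returns a[1..order] such that
    signal[n] is approximated by sum(a[k] * signal[n - k])."""
    r = autocorrelation(signal, order)
    a = [0.0] * (order + 1)
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / error                      # reflection coefficient
        new_a = a[:]
        new_a[i] = k
        for j in range(1, i):
            new_a[j] = a[j] - k * a[i - j]
        a = new_a
        error *= 1.0 - k * k                 # remaining prediction error
    return a[1:]


def synthesize_toy_voice(num_samples=400, pitch_period=80):
    """Hypothetical test signal: periodic pulses through a two-pole resonator."""
    x = []
    for n in range(num_samples):
        excitation = 1.0 if n % pitch_period == 0 else 0.0
        prev1 = x[n - 1] if n >= 1 else 0.0
        prev2 = x[n - 2] if n >= 2 else 0.0
        x.append(1.6 * prev1 - 0.8 * prev2 + excitation)
    return x


def residual(signal, coeffs):
    """Prediction error: the part of each sample the predictor could not guess."""
    order = len(coeffs)
    return [
        signal[n] - sum(coeffs[k] * signal[n - 1 - k] for k in range(order))
        for n in range(order, len(signal))
    ]


voice = synthesize_toy_voice()
coeffs = lpc_coefficients(voice, order=2)
error = residual(voice, coeffs)
energy_ratio = sum(e * e for e in error) / sum(s * s for s in voice)

print("Estimated predictor coefficients:", [round(c, 3) for c in coeffs])
print(f"Residual energy / signal energy: {energy_ratio:.3f}")
```

In this toy case the fitted coefficients land close to the resonator's own (1.6 and -0.8), and the residual is mostly the sparse pulse train. A signal without that kind of structure, such as music, leaves a much larger residual, which is consistent with Stern's remark.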
Music-compression algorithms, such as the MP3 and AAC codecs used for streaming audio, are optimized for human auditory perception. For example, the algorithms don’t bother to accurately reproduce the soft components of sounds likely to be masked by louder sounds at other frequencies and times. That method can represent a wider range of sound, but it requires more bits per second than a speech-based codec, Stern says.
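The masking idea can be shown with a deliberately crude sketch. Real perceptual codecs use detailed psychoacoustic models that also account for masking over time; the fixed 20-decibel margin and 500-Hz neighborhood below are invented numbers, and the function name is hypothetical. The point is only that a soft component sitting next to a much louder one can be dropped, saving the bits it would have cost.

```python
def drop_masked_components(components, masking_margin_db=20.0, neighborhood_hz=500.0):
    """Toy frequency-masking rule.

    components: list of (frequency_hz, level_db) pairs.
    A component is treated as inaudible if some other component within
    neighborhood_hz is louder by more than masking_margin_db.
    Both thresholds are illustrative assumptions, not measured values.
    """
    kept = []
    for freq, level in components:
        masked = any(
            abs(other_freq - freq) <= neighborhood_hz
            and other_level - level > masking_margin_db
            for other_freq, other_level in components
        )
        if not masked:
            kept.append((freq, level))
    return kept


# A loud tone at 1 kHz, a soft one right beside it, and a soft one far away.
spectrum = [(1000.0, 70.0), (1200.0, 40.0), (5000.0, 45.0)]
audible = drop_masked_components(spectrum)
print(audible)
```

Here the soft 1,200-Hz component is discarded because the loud 1,000-Hz tone sits beside it, while the soft 5,000-Hz component survives because nothing nearby covers it.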
The new EVS codec is a hybrid, containing algorithms for both voice and music, and it switches between them as needed. The new voice algorithms are substantially more complex than those of the decade-old 7,000-Hz codec. Rather than being developed around characteristics of specific languages, as earlier codecs were, these are nearly language independent. The music part is the latest low-latency version of the AAC algorithm, developed for real-time streamed communications. Called AAC-ELDv2, it delivers CD-quality stereo sound in a stream of only 32 kb/s by transmitting one stereo channel plus a lower-data-rate signal that represents the difference between that channel and the other stereo channel.
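The channel-plus-difference idea described above can be sketched in a few lines. This is not the AAC-ELDv2 algorithm, whose internals the article does not describe. It is a minimal illustration in which one channel is quantized finely and the difference between the channels is quantized coarsely; the step sizes and the test signal are assumptions chosen to make the effect visible.

```python
import math


def quantize_to_index(value, step):
    """Uniform quantizer: index of the nearest multiple of step."""
    return round(value / step)


def encode_stereo(left, right, main_step=0.001, diff_step=0.05):
    """Send the left channel finely and (left - right) coarsely."""
    main_idx = [quantize_to_index(l, main_step) for l in left]
    diff_idx = [quantize_to_index(l - r, diff_step) for l, r in zip(left, right)]
    return main_idx, diff_idx


def decode_stereo(main_idx, diff_idx, main_step=0.001, diff_step=0.05):
    """Rebuild both channels from the main channel and the difference."""
    left = [i * main_step for i in main_idx]
    right = [l - d * diff_step for l, d in zip(left, diff_idx)]
    return left, right


def bits_per_sample(indices):
    """Bits needed for the largest index magnitude, plus one sign bit."""
    return max(abs(i) for i in indices).bit_length() + 1


# Hypothetical stereo signal: the right channel is a slightly quieter copy
# of the left, so the two channels are highly similar.
left = [math.sin(2 * math.pi * n / 50) for n in range(200)]
right = [0.9 * l for l in left]

main_idx, diff_idx = encode_stereo(left, right)
left_hat, right_hat = decode_stereo(main_idx, diff_idx)

main_bits = bits_per_sample(main_idx)
diff_bits = bits_per_sample(diff_idx)
max_right_error = max(abs(r - rh) for r, rh in zip(right, right_hat))

print(f"Main channel: {main_bits} bits/sample, difference: {diff_bits} bits/sample")
print(f"Largest error in the rebuilt right channel: {max_right_error:.4f}")
```

Because the two channels of a typical stereo signal are similar, the difference is small and tolerates coarse coding. In this toy case it needs 3 bits per sample against 11 for the main channel, at the cost of a small error in the rebuilt second channel.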
An important feature of the combined package, says Baumeister, is that EVS is the first codec designed to compensate for packet loss. Such losses degrade voice quality and are inevitable on IP networks such as 4G LTE.
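The article does not describe how EVS compensates for lost packets, so the sketch below shows a generic and much simpler concealment technique instead: when a frame goes missing, replay the previous frame at reduced volume rather than leaving a gap. The function name, the frame format (a list of samples, with `None` marking a lost frame), and the fade factor are all assumptions for illustration.

```python
def conceal_lost_frames(frames, frame_length, attenuation=0.5):
    """Fill each lost frame (None) with a faded copy of the previous output.

    A generic repeat-and-fade strategy, not the EVS method. Repeated losses
    fade further toward silence instead of looping at full volume.
    """
    output = []
    last = [0.0] * frame_length          # silence until the first frame arrives
    for frame in frames:
        if frame is None:
            last = [sample * attenuation for sample in last]
        else:
            last = list(frame)
        output.append(last)
    return output


# Four two-sample frames; the second and third were lost in transit.
received = [[0.8, -0.8], None, None, [0.4, -0.4]]
played = conceal_lost_frames(received, frame_length=2)
print(played)
```

The listener hears a brief fade rather than a dropout, and playback returns to the real audio as soon as the next frame arrives.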
To verify the performance of the codec and its loss tolerance, Fraunhofer IIS and 11 partners, including Ericsson, Huawei, Qualcomm, and Samsung, spent millions of euros on human listening tests. Those tests indicated that Full-HD Voice quality was possible even at data rates as low as 9.6 kb/s.
The processing power of modern smartphone chips is a key enabler for the new codecs. They can be implemented in digital signal processing chips as the 7,000-Hz codecs in 4G smartphones are, or as apps running on a smartphone’s applications processor. The EVS codec “is not complex compared to the apps in a smartphone,” says Baumeister.
Because Full-HD Voice can tolerate packet losses, it could feed compressed data straight into the Internet data stream for routing directly to other equipped devices, much like a Skype-to-Skype call between computers or smartphones. Fraunhofer’s Mobile World Congress demonstration did that using apps on Google Nexus 5 phones. With no need for network upgrades, Baumeister says, “you could conceptually roll out service this year, but next year is more realistic.”
You can hear samples at https://www.full-hd-voice.com, but be sure to use good headphones in a quiet environment. Stern compares the difference to the shift from standard resolution to HD television. “It’s going to be subtle, not a huge difference in intelligibility, but it will sound better and more natural, like a high-quality speaker system,” he says.
This article originally appeared in print as “Full-HD Voice is Nearly Here.”
Jeff Hecht writes about lasers, optics, fiber optics, electronics, and communications. Trained in engineering and a life senior member of IEEE, he enjoys figuring out how laser, optical, and electronic systems work and explaining their applications and challenges. At the moment, he’s exploring the challenges of integrating lidars, cameras, and other sensing systems with artificial intelligence in self-driving cars. He has chronicled the histories of laser weapons and fiber-optic communications and written tutorial books on lasers and fiber optics.