Headphone Immersive Audio, Head Tracking, and Virtual Speakers

July 13, 2023, 17:10
Headphones are a convenient way to listen to music, but they have several disadvantages. One major problem is that the sound seems to be localized inside the listener's head, rather than spaced away from the listener as with live music or conventional stereo or multichannel playback. Over the years, this issue has come to be understood by considering how the sound from real acoustic sources reaches both ears: the amplitude and timing differences between the ears, as well as the frequency-response shaping caused by diffraction and absorption at the head and pinnae.
 

Interaural Time Difference (ITD) and Interaural Level Difference (ILD) are well-known phenomena that help explain how we know something is to the left or right, but they cannot convey front/back, up/down, or range. You can easily see that a sound emanating from anywhere in the plane perpendicular to the axis between the ears produces no ITD or ILD, and thus no information as to front or back, above or below. For that, you need to rotate your head in azimuth or elevation to get the sound source out of that plane and re-introduce ITDs and ILDs.
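To put a number on the ITD, a classic back-of-the-envelope model is Woodworth's spherical-head approximation. The short sketch below is my own illustration, not taken from any of the systems discussed here (the head radius is an assumed average); it shows why a source in that perpendicular plane is ambiguous: the ITD collapses to zero at 0 degrees azimuth.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s in air at roughly room temperature
HEAD_RADIUS = 0.0875    # m; assumed average adult head radius

def itd_woodworth(azimuth_deg: float) -> float:
    """Approximate ITD (seconds) for a source at the given azimuth
    (0 = straight ahead, 90 = directly to one side), using Woodworth's
    spherical-head formula: ITD = a/c * (theta + sin(theta))."""
    theta = np.radians(azimuth_deg)
    return HEAD_RADIUS / SPEED_OF_SOUND * (theta + np.sin(theta))

# Source directly to one side: the maximum ITD, about 0.66 ms.
print(f"ITD at 90 degrees: {itd_woodworth(90) * 1e3:.2f} ms")
# Source straight ahead (or anywhere in the median plane): 0 ms,
# hence the front/back and up/down ambiguity described above.
print(f"ITD at 0 degrees:  {itd_woodworth(0) * 1e3:.2f} ms")
```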

But even without moving one's head, localization cues come from the Head-Related Transfer Function (HRTF), as well as from subtle reflections off the upper body, differences between direct and reflected sound, or the help of a hand held next to an ear - something we have all done to pinpoint the source of a sound. The HRTF can supply the angular localization information (azimuth and elevation). Range (distance to the source) is determined by the brain from changes in the HRTF with head movement and, to some extent, from the ratio of direct to reverberant energy, as set by the venue acoustics.
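As a small illustration of that last cue, the direct-to-reverberant ratio (DRR) can be estimated from a measured room impulse response by splitting the energy at the arrival of the direct sound. The sketch below is my own, with an assumed 2.5 ms direct-sound window; it is not a fragment of any product discussed here.

```python
import numpy as np

def direct_to_reverberant_ratio(rir: np.ndarray, fs: int,
                                direct_window_ms: float = 2.5) -> float:
    """Estimate the direct-to-reverberant energy ratio (in dB) of a
    room impulse response. Energy inside a short window around the
    first peak counts as direct sound; the rest as reverberation."""
    onset = int(np.argmax(np.abs(rir)))                # direct-sound arrival
    split = onset + int(direct_window_ms * 1e-3 * fs)  # end of direct window
    direct_energy = np.sum(rir[:split] ** 2)
    reverberant_energy = np.sum(rir[split:] ** 2)
    return 10.0 * np.log10(direct_energy / reverberant_energy)
```

The farther the source, the lower the DRR, which is one reason distance perception is so weak in very dry or anechoic listening conditions.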

We should also not forget that auditory perception is a cooperation of several senses to build a "sound field" in the brain. For example, if you are listening to a speaker setup, the fact that you can see the speakers helps a lot in localizing what you hear. Think about your typical reaction when you hear a sound: you turn around to face the source, so you know where it comes from (and what it is, of course). This is a very important factor, and it can be demonstrated by offering a listener an externalized sound field that does not match what the listener sees [1]. Even with head tracking, it is hard to present a convincing externalized sound field based on the characteristics of a specific listening room if that is at odds with what the listener sees.

Another issue that can come up with head tracking is when you virtualize the sound of a video or film presentation. Just as with loudspeaker listening, the latency between video and audio must be kept low to maintain the illusion of simultaneous listening/viewing. The added factor with head tracking is that the latency between head movements and the reaction of the externalization must also be small to be convincing [2]. For those interested in this topic, audioXpress has covered binaural reproduction extensively [3].
 
Have you ever experienced perfect lifelike audio via headphones? Brandenburg Labs is working to deliver that immersive experience.
Fast Rewind
About 12 years ago, I attended an electrostatic speaker listening event in Germany's beautiful Eifel area. Wandering through the demo rooms, I spotted an unusual sight: someone wearing a headset and listening intently to music while stepping sideways and rotating his head. Upon closer inspection, I noticed his headphones were connected by a long, thin wire to a very strange piece of equipment. There was also something that looked like a sensor glued to the front wall, connected to the same equipment. This turned out to be a Smyth Realiser A8. I will not delve into every detail here, as audioXpress regular contributor Stuart Yaniger and yours truly wrote an extensive review of it for Linear Audio [4].

The sensor I noticed on the wall in front of the listener was an IR receiver. There was an IR transmitter clipped to the top of the headphones, and this setup allowed the Realiser to detect head movements and adjust the sound in terms of EQ, ITD, and ILD. The result was uncanny: you really were convinced you were listening to speakers! I remember taking the whole thing to the residence of the late Siegfried Linkwitz in California for a demo. After some preliminaries, I activated the setup, played music, and turned off the speakers in Siegfried's system. After listening and turning his head a bit for a few seconds, Siegfried looked at me and said, "OK, you can turn off the speakers now." I said nothing, and finally he took off the headset and looked at me in total surprise. In my book, if you can fool Siegfried Linkwitz with headphones, making him think he is listening to his speakers, that's an achievement!
___
Note that before using the Realiser, there is an extensive setup procedure to calibrate the unit for your speakers and listening room. It includes fitting a pair of miniature microphones into your ear canal; the referenced article has details. One interesting side effect was that I could go home, activate the calibration done in someone else’s home with someone else’s speakers and get a reproduction as if I were in that room with those speakers!
 
The very impressive Smyth Realiser demonstration setup at the AES Headphone Conference in Aalborg, Denmark, in 2016. On the right is the Realiser A8 used at the time (since superseded by the Realiser A16).
Fast Forward
After all the COVID misery, I once again decided to attend a European Audio Engineering Society (AES) convention, this time the 154th AES International Convention, held May 13-15 at Aalto University in Helsinki, Finland. And in one of the demo setups I spotted someone wearing a headset with something clipped to the top of the headband, moving his head this way and that, and making some tentative steps left and right. Déjà vu! This turned out to be a demo of headphone immersive audio reproduction developed by Dr. Karlheinz Brandenburg and his colleagues at Brandenburg Labs.

The Nitty Gritty Details
ITD and ILD are central concepts in auditory (spatial) perception, and the Smyth Realiser manipulated those parameters during the listening process with great success. But research has shown that there is a lot more to it; some of the additional effects are higher-level, long-term cognitive ones. Room acoustics and latency play a big part in this, and the Brandenburg system processes these quantities as well to make the immersive experience even more realistic.

A critical component in the whole setup is the head position and movement sensor. Past solutions used emitters linked to static sensors to provide an absolute position reference, as well as USB-connected and wireless position and movement sensors. Early sensors had limited accuracy and limited degrees of freedom, were sensitive to stationary drift, and had to be recalibrated or reset from time to time. However, the advent of cheap, powerful digital processing and wireless connectivity has changed this. You can now get a digital plugin for your DAW that emulates the sound field of a "standard" listening room, a car, a studio, what have you [5].

One of the strong features of the Realiser was that you could record the room response with two small microphones inside your ears. That captured the room as well as (part of) your ears and your HRTF, so the result is totally personalized. It also allows you to capture the response in a friend's home that has great acoustics, or indeed in a concert hall, and play all your music as if it were performed in that venue.

The Brandenburg system has a similar setup, but it does mean that you need to perform some important measurements at the setup stage; it is not plug-and-play. Sony, Apple, and many other headphone companies have adopted an interesting solution for that: a short setup procedure in which you take a close-up picture of your ear with (of course) your phone, as well as a front view of your head, and presumably the system calculates an approximate HRTF and uses that to equalize the replay over the in-ear monitors [6].
 
The Brandenburg Labs’ setup at the AES Convention in Helsinki. audioXpress editor-in-chief J. Martins checks the "speakers" up close.
The current algorithm used in the Brandenburg Labs methodology is based on ideas described in several papers [7, 8]. At its core is a parametric extrapolation algorithm that calculates Binaural Room Impulse Responses (BRIRs) in real time from a single omnidirectional Room Impulse Response (RIR). A basic geometric model of the room, as well as the positions of the sound sources and the microphone, must be captured in advance. From the tracked head position, the directions of arrival of the direct sound are calculated, while the early reflections are estimated by a simplified image-source model. The RIR is processed in segments that are convolved with generic HRTF filters. The late reverberation is emulated by a noise-shaping approach. The algorithm allows the user six degrees of freedom (6DoF) in rotation and translation.
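To make the segment-wise idea concrete, here is a heavily simplified sketch of that convolution stage. This is my own toy illustration, not Brandenburg Labs' code: `segment_doas` (per-segment directions of arrival, which in the real system come from the image-source model and the tracked head pose) and `hrtf_lookup` (a nearest-direction HRTF fetch) are hypothetical inputs, and the noise-shaped late reverberation is left out entirely.

```python
import numpy as np
from scipy.signal import fftconvolve

MAX_HRTF_LEN = 512  # assumed upper bound on HRTF filter length (samples)

def binauralize_rir(rir, fs, segment_doas, hrtf_lookup, seg_ms=10.0):
    """Toy version of segment-wise binauralization: split a single
    omnidirectional RIR into short segments, assign each segment a
    direction of arrival, and convolve it with the HRTF pair for that
    direction. A real system treats late reverb separately."""
    seg_len = int(seg_ms * 1e-3 * fs)
    left = np.zeros(len(rir) + MAX_HRTF_LEN)
    right = np.zeros_like(left)
    for i, doa in enumerate(segment_doas):
        start = i * seg_len
        seg = rir[start:start + seg_len]
        if seg.size == 0:
            break
        h_l, h_r = hrtf_lookup(doa)  # hypothetical HRTF lookup per direction
        left[start:start + seg.size + len(h_l) - 1] += fftconvolve(seg, h_l)
        right[start:start + seg.size + len(h_r) - 1] += fftconvolve(seg, h_r)
    return left, right  # BRIR pair to convolve with the program material
```

The resulting BRIR pair would be recomputed whenever the tracker reports a new head pose, which is why low-latency tracking matters so much.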

"Positional tracking is still an issue, which has not been fully solved yet. We benefit from advancements on tracking, originating from the AR and VR markets (we are using the HTC VIVE System [9] with additional trackers). The VIVE system allows 6DoF tracking, so we get the head rotation of the user as well as the translation approximately in real time, and the users’ movements in the room," explains Professor Brandenburg. "This is achieved with the help of static infrared emitters in the corners of the room. The tracker devices receive emitted infrared light, which makes it possible for the system to estimate the position and orientation. This is the current solution for our research demo, we are already looking into other options more suitable for future consumer products, to reduce the hardware requirements. The ideal tracking solution combines low latency, high precision, no drift and works everywhere with little to no setup. This solution does not exist yet, but we can make certain trade-offs for certain use cases."

For its current demo system, Brandenburg Labs takes an acoustic measurement with an omnidirectional microphone. They play a sine sweep over the loudspeakers to obtain the room impulse response containing the room's reflections, and this impulse response is then used by the Brandenburg Labs algorithm. The process takes only a few minutes and is currently done for each room. The Brandenburg Labs team is already working on smarter ways to do this, so it can all be set up by the user.
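Sweep-and-deconvolve is a standard room-measurement technique (Farina's exponential sine sweep). The sketch below shows the general principle under my own assumptions; it is not Brandenburg Labs' measurement code.

```python
import numpy as np

def exponential_sweep(f1: float, f2: float, duration: float, fs: int):
    """Exponential (logarithmic) sine sweep from f1 to f2 Hz,
    the classic Farina excitation signal."""
    t = np.arange(int(duration * fs)) / fs
    r = np.log(f2 / f1)
    return np.sin(2 * np.pi * f1 * duration / r * (np.exp(t * r / duration) - 1))

def rir_from_sweep(recorded: np.ndarray, sweep: np.ndarray) -> np.ndarray:
    """Recover the room impulse response by deconvolving the microphone
    recording with the excitation (regularized frequency-domain division)."""
    n = len(recorded) + len(sweep) - 1
    rec_spec = np.fft.rfft(recorded, n)
    swp_spec = np.fft.rfft(sweep, n)
    # Small regularization keeps the division sane outside the sweep band.
    rir_spec = rec_spec * np.conj(swp_spec) / (np.abs(swp_spec) ** 2 + 1e-8)
    return np.fft.irfft(rir_spec, n)
```

In practice you would also window the result, since the exponential sweep conveniently pushes harmonic-distortion products away from the linear impulse response.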

"We are currently talking to several interested pilot customers. Systems like our current demo system will be shipped very soon to interested music recording studios and research institutions. Ultimately, we are convinced that every person using digital audio over headphones will benefit from our technology. Immersive audio has the potential to enrich every application, since it matches how we listen to the real world," Professor Brandenburg states.

"In the upcoming early adoption stages, our immersive audio technology will be first introduced in B2B markets, such as music production, communication, virtual prototyping, and many more, where high specialization and more complex hardware setups are less problematic. Once the headphone technology goes into mass production, it will initially be adopted by the entertainment industry, before spreading wide enough for general usage."
 
Brandenburg Labs GmbH was founded in 2019, initiated by Prof. Dr.-Ing. Karlheinz Brandenburg. Brandenburg Labs relies on a broad network and extensive knowledge in the fields of basic and applied research. The team's development work expands on the work on "Personalized Auditory Realities (PARty)," a concept developed at Technische Universität Ilmenau and the Fraunhofer Institute for Digital Media Technology.

The Proof Is in the Pudding
Brandenburg Labs had a small demo set up at the 2023 AES Convention in Helsinki, as mentioned. Music could be played over the demo immersive headphone setup as well as over a pair of small active monitor speakers. First, I listened to some music over the speakers before donning the headphones: no significant change that I could detect. Then I started turning my head, and sure enough, the acoustic "picture" stayed where I thought it was relative to the speaker positions. Nodding and raising my head also felt (sounded) completely accurate. Good, but not revolutionary.

But I was most impressed when I started to walk around: not only did the acoustic picture remain in place, I could also sense the subtle changes in sound due to the imaginary room geometry. In a traditional speaker setup, if you walk around the room, the tonal balance and timbre change as you move past a wall, for instance, or into a wider open area. I didn't know the geometry of the particular room I was listening to, but the changes were there, and I had the distinct feeling of walking around in a real room.

I was also very impressed when I walked from in front of the speakers (which were not playing), between them, to behind them. Totally natural, and if I had had more time, I probably could have determined the polar response of those not-playing speakers just by walking around them. So, all in all, a totally convincing experience.
 
Professor Brandenburg enjoying the Brandenburg Labs creation. (Photo courtesy of Eleonora Hamburg)
Afterthoughts
I eventually sold my Realiser to an interested audio designer, who sold it to yet another interested party a year later. Why didn't we keep it and instead sell our big, ugly, space-hogging speakers? I believe wearing headphones for hours at a time is not something we are comfortable with, even with perfect spatial reproduction. Or possibly, even when the sound field is totally convincing and realistically external, some part of us still knows that it is somehow faked.

Also, as previously mentioned [1], if the visual setup diverges from the externalized sound field, localization can be confusing and head tracking becomes less effective. Maybe my problem is that I am still hung up on traditional two-channel stereo for serious listening, so that is my frame of reference. For on-the-move and casual listening, it is more about creating a credible sound field, and there this technology is very effective. Developments are happening very fast, and this could become a standard feature in all (active) headsets, including in-ear monitors.

References
[1] S. Werner, G. Götz, and F. Klein, "Influence of Head Tracking on the Externalization of Auditory Events at Divergence between Synthesized and Listening Room Using a Binaural Headphone System," 142nd AES Convention, Berlin, May 20-23, 2017.
[2] A. Lindau, "The Perception of System Latency in Dynamic Binaural Synthesis," Audio Communications Group, Technical University Berlin, 2009.
[3] audioXpress coverage of binaural rendering.
[4] S. Yaniger and J. Didden, "Smyth Realiser A8 Review," Linear Audio, Vol. 7, April 2014, https://linearaudio.net/sites/linearaudio.net/files/2023-06/v7%20sy%26jd.pdf
[5] Steven Slate Audio VSX Headphones, https://stevenslateaudio.com/vsx
[6] Apple Support, https://support.apple.com/en-us/HT213318
[7] K. Brandenburg et al., "Creation of Auditory Augmented Reality Using a Position-Dynamic Binaural Synthesis System—Technical Components, Psychoacoustic Needs, and Perceptual Evaluation," Applied Sciences, Vol. 11, 1150, 2021.
[8] C. Pörschmann, P. Stade, and J. M. Arend, "Binauralization of Omnidirectional Room Impulse Responses - Algorithm and Technical Evaluation," Proceedings of the International Conference on Digital Audio Effects (DAFx), 2017.
[9] HTC Vive Tracker, https://www.vive.com/us/accessory/tracker3/

This article was originally published in The Audio Voice newsletter (#429), July 13, 2023.
 
The Problem Is All Inside Your Head - She Said to Me (1)

I do want to share an experience I've had with headphone virtualization. It may be just my personal experience, valid only for some listeners, or universal to all humans; I don't know, but it may interest you.

Start by listening to headphones on a virtualization system as described here, with the virtualization switched off. You will have the familiar "sound inside the head" experience. Now, without moving your head, switch on the virtualization. And nothing happens in your head. Now slightly move your head and - bang! - the sound field explodes, and you are listening to a virtual speaker setup! Move your head, walk around, get used to the virtual sound field.

Next, stop moving and keep your head still, and switch off the virtualization. Nothing happens in your head; you continue to hear the virtualized speaker sound field. Now slightly move your head and - bang! - the sound field collapses inside your head.

So, what is this? Apparently, virtualization by manipulating ITD, ILD, and whatnot does not, in itself, convince your brain that you are listening to an external sound source. It must be combined with head (ear) movement to be processed and brought up to your conscious perception. Maybe it is processed all along but only brought to the forefront of perception when you move your ears. Similarly, when the virtualization is switched off and all the ITD and ILD processing ceases, that in itself is not a reason for your brain to tell you the sources are no longer out there - until you move your head (ears) and realize you have been duped and there is no external sound field.

I find this highly intriguing, and I would love to hear from someone who knows more about such things.

1. Paul Simon, “50 Ways To Leave Your Lover,” December 1975
About Jan Didden
Jan Didden has written for audioXpress since the 1970s, and he is the magazine's Technical Editor. He is retired following a career with the Netherlands Air Force and NATO, where he worked in logistics, air defense, and information technology.
