Interaural Time Difference (ITD) and Interaural Level Difference (ILD) are well-known phenomena that help explain how we know a sound is to the left or right, but they cannot convey front/back, up/down, or range. You can easily imagine that a sound emanating from any point on the plane perpendicular to the axis between the ears provides no information as to front or back, above or below. To resolve this, you need to rotate your head in azimuth or elevation to take the sound source out of that plane and re-introduce ITDs and ILDs.
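To put a number on the ITD cue, here is a minimal sketch of Woodworth's classic spherical-head approximation (the head radius and speed of sound are typical assumed values, not measurements of any particular listener):

```python
import numpy as np

def woodworth_itd(azimuth_deg, head_radius_m=0.0875, c=343.0):
    """Approximate interaural time difference (seconds) for a source
    at a given azimuth, using Woodworth's spherical-head model:
    ITD = (r / c) * (theta + sin(theta))."""
    theta = np.radians(azimuth_deg)
    return (head_radius_m / c) * (theta + np.sin(theta))

# A source 90 degrees to the side arrives roughly 0.66 ms earlier at
# the near ear; a source straight ahead gives zero ITD -- as does every
# other point on the median plane, which is exactly the ambiguity
# described above.
print(woodworth_itd(90) * 1e3)  # ~0.66 (ms)
print(woodworth_itd(0))         # 0.0
```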
But even without moving one's head, localization cues come from the Head-Related Transfer Function (HRTF), from subtle reflections off the upper body, from differences between direct and reflected sound, or from a hand held next to an ear - something we have all done to pinpoint the source of a sound. The HRTF supplies the angular localization information (azimuth and elevation). Range (distance to the source) is determined by the brain from changes in the HRTF with head movement and, to some extent, from the ratio of direct to reverberant energy, as determined by the venue acoustics.
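In digital form, applying an HRTF for a given direction is simply a convolution of the source signal with the corresponding pair of head-related impulse responses (HRIRs). A minimal sketch, assuming equal-length HRIRs for the desired direction have already been loaded from some measured dataset:

```python
import numpy as np
from scipy.signal import fftconvolve

def binaural_render(mono, hrir_left, hrir_right):
    """Render a mono signal at the direction encoded by an HRIR pair.
    The time-domain equivalent of applying the HRTF is convolution;
    the ITD and ILD cues are baked into the two impulse responses.
    Assumes hrir_left and hrir_right have the same length."""
    left = fftconvolve(mono, hrir_left)
    right = fftconvolve(mono, hrir_right)
    return np.stack([left, right], axis=-1)  # shape: (samples, 2)
```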
We should also not forget that auditory perception is a cooperation of several senses that together build a “sound field” in the brain. For example, if you are listening to a speaker setup, the fact that you can see the speakers helps a lot in localizing what you hear. Think about your typical reaction when you hear a sound: you turn around to face the source, so you know where it comes from (and what it is, of course). This is a very important factor, and it can be demonstrated by offering a listener an externalized sound field that does not match what the listener sees [1]. Even with head tracking, it is hard to present a convincing externalized sound field based on the characteristics of a specific listening room if that sound field is at odds with what the listener sees.
Another issue that can come up with head tracking is when you virtualize the sound of a video or film presentation. Just as with loudspeaker listening, latency between video and audio must be kept low to maintain the illusion of simultaneous listening/viewing. The added factor with head tracking is that the latency between a head movement and the corresponding update of the externalized sound field must also be small to be convincing [2]. For those interested in this topic, audioXpress has covered binaural reproduction before [3].
About 12 years ago, I attended an electrostatic speaker listening event in Germany’s beautiful Eifel area. Wandering through the demo rooms, I spotted an unusual sight: someone wearing a headset and listening intently to music while stepping sideways and rotating his head. Upon further inspection, I noticed his headphones were connected by a long, thin wire to a very strange piece of equipment. There was also something that looked like a sensor glued to the front wall, connected to the same equipment. This turned out to be a Smyth Realiser A8. I will not delve into every detail here, as audioXpress regular contributor Stuart Yaniger and yours truly wrote an extensive review of it for Linear Audio [4].
The sensor I noticed on the wall in front of the listener was an IR receiver. There was an IR transmitter clipped to the top of the headphones, and this setup allowed the Realiser to detect head movements and adjust the sound in terms of EQ, ITD, and ILD. The result was uncanny: you really were convinced you were listening to speakers! I remember taking the whole thing to the residence of the late Siegfried Linkwitz in California for a demo. After some preliminaries, I activated the setup, played music, and turned off the speakers in Siegfried’s system. After listening and turning his head a bit for a few seconds, Siegfried looked at me and said, “OK, you can turn off the speakers now.” I said nothing, and finally he took off the headset and looked at me in total surprise. In my book, if you can fool Siegfried Linkwitz with headphones into thinking he is listening to his speakers, that’s an achievement!
___
Note that before using the Realiser, there is an extensive setup procedure to calibrate the unit for your speakers and listening room, which includes fitting a pair of miniature microphones into your ear canals; the referenced article has the details. One interesting side effect was that I could go home, activate a calibration done in someone else’s home with someone else’s speakers, and get a reproduction as if I were in that room with those speakers!
After all the COVID misery, I once again decided to attend a European Audio Engineering Society (AES) convention, this time the 154th AES International Convention, held May 13-15 at Aalto University in Helsinki, Finland. And in one of the demo setups, I spotted someone wearing a headset with something clipped to the top of the headband, moving his head this way and that, and making some tentative steps left and right. Déjà vu! This turned out to be a demo of immersive audio reproduction over headphones, developed by Dr. Karlheinz Brandenburg and his colleagues at Brandenburg Labs.
The Nitty Gritty Details
ITD and ILD are central concepts in spatial auditory perception, and the Smyth Realiser manipulated those parameters during listening with great success. But research has shown there is a lot more to it; some of the additional effects are higher-level, long-term cognitive effects. Room acoustics and latency play a big part, and the Brandenburg system processes these quantities as well to make the immersive experience even more realistic.
A critical component in the whole setup is the head position and movement sensor. Past solutions ranged from emitters linked to static sensors to provide an absolute position reference, to USB-connected and wireless position and movement sensors. Early sensors had limited accuracy and limited degrees of freedom, were sensitive to stationary drift, and had to be recalibrated or reset from time to time. However, the advent of cheap and powerful digital processing and wireless connectivity has changed this. You can now even get a digital plugin for your DAW that emulates the sound field of a “standard” listening room, a car, a studio, what have you [5].
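To illustrate how drift is commonly tamed in inertial trackers, here is a sketch of the classic complementary filter (the signal names and filter constant are illustrative assumptions, not details of any product mentioned here): the gyroscope is integrated for a fast, smooth response but drifts slowly, while the accelerometer's gravity-referenced angle is noisy but drift-free, so the two are blended.

```python
import numpy as np

def complementary_pitch(gyro_rate, accel_pitch, dt, alpha=0.98):
    """Fuse gyro and accelerometer readings into a drift-corrected pitch.
    gyro_rate: pitch rate samples in rad/s; accel_pitch: pitch angles in
    rad derived from the gravity vector; dt: sample period in seconds.
    alpha close to 1 trusts the gyro short-term, the accelerometer long-term."""
    pitch = accel_pitch[0]
    estimates = []
    for rate, abs_angle in zip(gyro_rate, accel_pitch):
        # Integrate the gyro, then nudge toward the absolute reference.
        pitch = alpha * (pitch + rate * dt) + (1.0 - alpha) * abs_angle
        estimates.append(pitch)
    return np.array(estimates)
```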
One of the strong features of the Realiser was that you could record the room response with two small microphones inside your ears. That captured the room as well as (part of) your ears and your HRTF, so the result is totally personalized. It also allows you to record the response in a friend’s home that has great acoustics, or indeed in a concert hall, and play all your music as if it were performed in that venue.
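Conceptually, virtualizing a stereo pair this way requires four binaural room impulse responses (BRIRs): each speaker to each ear. A minimal, non-head-tracked sketch, assuming the four BRIRs were captured with the in-ear microphones and have equal lengths:

```python
import numpy as np
from scipy.signal import fftconvolve

def virtualize_stereo(left_ch, right_ch, brir):
    """Virtualize a stereo speaker pair over headphones.
    brir is a dict of four equal-length BRIRs: brir['LL'] is left
    speaker to left ear, brir['LR'] left speaker to right ear, etc.
    Each ear hears the sum of both virtual speakers through the room."""
    out_left = fftconvolve(left_ch, brir['LL']) + fftconvolve(right_ch, brir['RL'])
    out_right = fftconvolve(left_ch, brir['LR']) + fftconvolve(right_ch, brir['RR'])
    return np.stack([out_left, out_right], axis=-1)
```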
The Brandenburg system has a similar setup, but it does mean that you need to perform some important measurements at the setup stage; it is not plug-and-play. Sony, Apple, and many other headphone companies have adopted an interesting solution for that: a short setup procedure in which you take a close-up picture of your ear with (of course) your phone, as well as a front view of your head, and presumably the system calculates the closest-matching HRTF and uses that to equalize playback over the in-ear monitors [6].
"Positional tracking is still an issue, which has not been fully solved yet. We benefit from advancements on tracking, originating from the AR and VR markets (we are using the HTC VIVE System [9] with additional trackers). The VIVE system allows 6DoF tracking, so we get the head rotation of the user as well as the translation approximately in real time, and the users’ movements in the room," explains Professor Brandenburg. "This is achieved with the help of static infrared emitters in the corners of the room. The tracker devices receive emitted infrared light, which makes it possible for the system to estimate the position and orientation. This is the current solution for our research demo, we are already looking into other options more suitable for future consumer products, to reduce the hardware requirements. The ideal tracking solution combines low latency, high precision, no drift and works everywhere with little to no setup. This solution does not exist yet, but we can make certain trade-offs for certain use cases."
For its current demo system, Brandenburg Labs takes an acoustic measurement with an omnidirectional microphone: a sine sweep is played over the loudspeakers to obtain the room impulse response containing the room’s reflections, and this impulse response is then used by Brandenburg Labs’ algorithm. The process takes only a few minutes and is currently repeated for each room. The Brandenburg Labs team is already working on smarter ways to do this, so it can all be set up by the user.
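A common way to implement such a measurement is the exponential sine sweep (Farina) method: play the sweep over the speaker, record it with the microphone, and convolve the recording with a matched inverse filter to recover the impulse response. A minimal sketch with typical parameters; the actual playback/recording I/O and Brandenburg Labs' specific processing are not shown:

```python
import numpy as np

def exp_sine_sweep(f1=20.0, f2=20000.0, duration=5.0, fs=48000):
    """Generate an exponential sine sweep from f1 to f2 Hz, plus the
    matched inverse filter (time-reversed sweep with a 6 dB/octave
    amplitude correction), so that sweep * inverse ~ a delayed impulse."""
    t = np.arange(int(duration * fs)) / fs
    rate = np.log(f2 / f1)
    sweep = np.sin(2 * np.pi * f1 * duration / rate *
                   (np.exp(t * rate / duration) - 1.0))
    inverse = sweep[::-1] * np.exp(-t * rate / duration)
    return sweep, inverse

def room_impulse_response(recording, inverse):
    """Deconvolve the mic recording; the linear impulse response shows
    up near sample index len(inverse) of the result."""
    return np.convolve(recording, inverse)
```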
"We are currently talking to several interested pilot customers. Systems like our current demo system will be shipped very soon to interested music recording studios and research institutions. Ultimately, we are convinced that every person using digital audio over headphones will benefit from our technology. Immersive audio has the potential to enrich every application, since it matches how we listen to the real world," Professor Brandenburg states.
"In the upcoming early adoption stages, our immersive audio technology will be first introduced in B2B markets, such as music production, communication, virtual prototyping, and many more, where high specialization and more complex hardware setups are less problematic. Once the headphone technology goes into mass production, it will initially be adopted by the entertainment industry, before spreading wide enough for general usage."
The Proof Is in the Pudding
Brandenburg Labs had a small demo set up at the Helsinki 2023 AES Convention, as mentioned. Music could be played over the immersive headphone demo system as well as over a pair of small active monitor speakers. First, I listened to some music over the speakers before donning the headphones. OK, no significant change that I could detect. Then I started turning my head, and sure enough, the acoustic ‘picture’ stayed where I expected it to be, relative to the speaker positions. Nodding and tilting my head also felt (sounded) completely accurate. Good, but not revolutionary.
But I was most impressed when I started to walk around: not only did the acoustic picture remain in place, I could also sense the subtle changes in sound due to the imaginary room geometry. In a traditional speaker setup, if you walk around the room, the tonal balance and timbre change when you move past a wall, for instance, or into a wider open area. I didn’t know the geometry of the particular room I was listening to, but the changes were there, and I had the distinct feeling of walking around in a real room.
I was also very impressed when I walked from in front of the speakers (which were not playing), between them, and then behind them. It was totally natural, and if I had had more time, I probably could have determined the polar response of those not-playing speakers just by walking around them. All in all, a totally convincing experience.
I eventually sold my Realiser to an interested audio designer, who sold it to yet another interested party a year later. Why didn’t we keep it and sell our big, ugly, space-hogging speakers instead? I believe that wearing headphones is not something we are comfortable with for hours at a time, even with perfect spatial reproduction. Or possibly, even when the sound field is totally convincing and realistically externalized, some part of us still knows that it is somehow faked.
Also, as previously mentioned [1], if the visual scene diverges from the externalized sound field, localization can be confusing and head tracking becomes less effective. Maybe my problem is that I am still hung up on traditional two-channel stereo for serious listening, so that is my frame of reference. For on-the-move and casual listening, it is more about creating a credible sound field, and there this technology is very effective. Developments are happening very fast, and this could become a standard feature in all (active) headsets, including in-ear monitors.
References
[1] S. Werner, G. Götz, and F. Klein, “Influence of Head Tracking on the Externalization of Auditory Events at Divergence between Synthesized and Listening Room Using a Binaural Headphone System,” 142nd Audio Engineering Society Convention, Berlin, May 20-23, 2017.
[2] A. Lindau, “The Perception of System Latency in Dynamic Binaural Synthesis,” Audio Communication Group, Technical University of Berlin, 2009.
[3] audioXpress coverage of binaural rendering.
[4] S. Yaniger and J. Didden, “Smyth Realiser A8 Review,” Linear Audio, Vol. 7, April 2014, https://linearaudio.net/sites/linearaudio.net/files/2023-06/v7%20sy%26jd.pdf
[5] Steven Slate Audio VSX Headphones, https://stevenslateaudio.com/vsx
[6] https://support.apple.com/en-us/HT213318
[7] K. Brandenburg et al., “Creation of Auditory Augmented Reality Using a Position-Dynamic Binaural Synthesis System—Technical Components, Psychoacoustic Needs, and Perceptual Evaluation,” Applied Sciences, Vol. 11, 1150, 2021.
[8] C. Pörschmann, P. Stade, and J. M. Arend, “Binauralization of Omnidirectional Room Impulse Responses - Algorithm and Technical Evaluation,” Proceedings of the International Conference on Digital Audio Effects (DAFx), 2017.
[9] HTC VIVE Tracker, https://www.vive.com/us/accessory/tracker3/
This article was originally published in The Audio Voice newsletter (#429), July 13, 2023.