Despite the availability of Dolby® Digital Surround Sound and everyone else’s 33.17 channel sound schemes, we have two and only two ears as standard issue. Setting aside those unfortunate enough to lose hearing, the perception of direction can only come from differences between the two ears in arrival time, in bulk level, and in level versus frequency, all of it shaped by the physical relationship between the source and the head … and all of it with only two ears. Remember that space has three orthogonal directions. We only have two ears. How do you solve for three variables when given only two parameters? … well you don’t … so clearly there are other variables that we bring to bear.
Go into the woods and just hold still. Listen to the position of a bird, a stream, a jet, or any other sound source that is holding relatively still. You will ascertain a position. Next swivel your head around and judge where the sound is coming from. Pretty often the static perception of direction is radically different from the dynamic version. In fact, think about it: we trust the dynamic version more. If we’re uncertain when we first hear a sound, don’t we invariably cock our heads or swivel around to localize it? It turns out that there are two simple algorithms and one complex one being applied in our brains to localize the sound.
The first simple one works either for nearby objects or when the object or our head is moving. Bulk loudness is registered at each ear, the brain compares the two levels, and the louder side keys our audio judgement that that ear is pointing towards the sound. There’s some subtlety in that, when motion is involved, the way the loudness rises and falls over time as we or the source moves can influence the decision regardless of the absolute loudness.
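Here is a minimal sketch of that level cue in Python with NumPy; the function name and the toy 3 dB offset are my own invention, just to make the comparison concrete.

```python
import numpy as np

def ild_cue(left, right, eps=1e-12):
    """Interaural level difference in dB: positive means the right ear is louder."""
    rms_l = np.sqrt(np.mean(left ** 2)) + eps
    rms_r = np.sqrt(np.mean(right ** 2)) + eps
    return 20.0 * np.log10(rms_r / rms_l)

# Toy usage: a source off to the right arrives a little louder in the right ear.
rng = np.random.default_rng(0)
src = rng.standard_normal(48000)       # one second of noise at 48 kHz
left, right = 0.7 * src, 1.0 * src     # right ear roughly 3 dB louder
print(f"ILD = {ild_cue(left, right):+.1f} dB")   # about +3.1 dB, so "it's to my right"
```

In the real head the comparison also happens per frequency band, and this cue is strongest at higher frequencies, where the head actually shadows the far ear.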
The second simple algorithm is based on arrival time. Given that the two ears are some distance apart, and given the speed of sound, there is roughly a 1 ms delay between ears for a sound coming straight from the side. For my fat head it may stretch a little further, but not by much. In fact there is significant research showing that the time difference will trump the level difference over a range of frequencies, particularly lower ones. This is exploited in Haas panning. Warning … Haas panning is a fool’s paradise because of the horrible comb filtering that takes place if Haas-delayed channels ever get mixed together. Haas is a live-sound tool, and it belongs only on the main buses.
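As a rough illustration, here is a small NumPy sketch (the numbers and names are mine, chosen only for the example) of both the geometric inter-ear delay and the comb-filter notches you buy yourself when a signal is summed with a 1 ms delayed copy of itself:

```python
import numpy as np

speed_of_sound = 343.0              # m/s at room temperature
ear_spacing = 0.18                  # m, assumed straight-line ear-to-ear distance
# Straight-line figure; the real path curls around the head, so the true
# maximum interaural delay comes out somewhat larger than this.
print(f"geometric max ITD ≈ {1000 * ear_spacing / speed_of_sound:.2f} ms")

fs = 48000
delay_ms = 1.0                      # a Haas-style delay between two channels
d = int(fs * delay_ms / 1000)       # 48 samples at 48 kHz

# Mixing a channel with a copy of itself delayed by d samples gives the
# response |1 + e^{-j 2 pi f d / fs}|: deep nulls at odd multiples of
# 1 / (2 * delay), i.e. 500 Hz, 1.5 kHz, 2.5 kHz ... for a 1 ms delay.
freqs = np.linspace(0.0, 20000.0, 2001)
mag = np.abs(1.0 + np.exp(-2j * np.pi * freqs * d / fs))
null = np.argmin(np.abs(freqs - 500.0))
print(f"summed response at 500 Hz: {mag[null]:.3f} (a near-total cancellation)")
```

The take-away matches the warning above: the moment those two channels collapse to mono, everything near 500 Hz, 1.5 kHz, 2.5 kHz and so on gets carved out.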
The complex algorithm gives us the real miracle of position perception given only two ears. The convoluted shape of the outer ear imposes spectral filtering that differs with the azimuth and elevation of the source with respect to that single ear. Even more miraculous is that we build catalogs of what the spectral signatures of so many sounds should be. For the static case, where neither the head nor the object is moving, we compare what the catalog says the sound’s spectrum should be to what we actually find. Then the decider, knowing the filtering function of the ear shape, de-convolves the two to deduce direction. It’s not as reliable, because the environment … that hillside, this boulder, some unseen house ’round the bend … can all shape the spectrum and fool the decider. Not to mention that one crow may actually have the same spectral dip in his voice-box that the ear has for the 3-o’clock filter function.
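To make the catalog-matching idea concrete, here is a deliberately toy Python/SciPy sketch. The two “directions” and their single-notch filters are pure invention (real pinna responses are vastly richer), and the “decider” simply picks whichever candidate filtering of the cataloged source best matches what arrived at the ear, rather than doing a true deconvolution.

```python
import numpy as np
from scipy.signal import iirnotch, lfilter, welch

fs = 48000
rng = np.random.default_rng(1)
source = rng.standard_normal(fs)        # flat-spectrum stand-in for a known sound

# Hypothetical pinna filters: pretend each direction stamps a notch at a
# different frequency.
directions = {"front": 6000.0, "behind": 9000.0}
filters = {name: iirnotch(f0, Q=8.0, fs=fs) for name, f0 in directions.items()}

# The "world" filters the sound from one direction; the listener doesn't know which.
true_dir = "behind"
b, a = filters[true_dir]
at_ear = lfilter(b, a, source)

# The "decider": compare the measured spectrum against a catalog of what the
# sound should look like after each candidate direction's filtering.
f, measured = welch(at_ear, fs=fs, nperseg=2048)
best, best_err = None, np.inf
for name, (b, a) in filters.items():
    _, expected = welch(lfilter(b, a, source), fs=fs, nperseg=2048)
    err = np.mean((np.log10(measured + 1e-12) - np.log10(expected + 1e-12)) ** 2)
    if err < best_err:
        best, best_err = name, err
print(f"decider picks: {best}")          # matches true_dir in this toy setup
```

The real system has to do this with a catalog learned over a lifetime and with filter functions it was never told explicitly, which is part of why the static case is so much easier to fool than the moving one.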
There is some small ability to detect sound direction as it hits skin, especially for either loud, sharp sounds or low-frequency sounds.
So what’s the most important reason for all this? I believe in evolution, so I’d say it’s the ability to figure out very, very quickly whether the lion is in front of you or behind you. Above isn’t where either our enemy or our food is, so the ear’s shape is most radically asymmetrical in the fore-aft dimension. Notice how scared rabbits put their ears back so they might hear the hawk from above. Even in the static case, and especially when the lion is close enough to be dangerous, we are never fooled front-to-back. We spin ’round to use the much more detailed position-deciding system, the eyes. We avoid getting eaten and we reproduce. Tough toenails for that one-eared iguana; his branch of the tree of life died and fell to the ground. The creationists have a much simpler explanation (which, by the way, I can’t disprove): God is just clever and knew to shape our ears so that we wouldn’t get eaten by the lion … poor lion … see-ya, wouldn’t want ta be ya.
So it would seem obvious that we could use these principles to place either synthetically generated or recorded sounds at any arbitrary location using just two sources rendering two streams. In practice this is very difficult to mimic, and theoretically impossible to reproduce perfectly except over headphones. The first big difficulty is that the speakers can’t physically be at the location being emulated; they must rely on a mathematical compromise to fool our ears into believing the sound is somewhere else. Second, and not completely independent of the first: as our heads move about, the frequency-response changes we hear are those imposed by the speaker’s actual location, because the sound can only be filtered according to where the speaker really is, not where the emulated source is supposed to be. Third, the filtering function versus direction is different for every ear on the planet. The list goes on and on.
So my goal, this being an audio site, is to do a survey and tutorial of the simplest position-emulation schemes in existence. These are separation and pan.
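As a preview of where that survey is headed, here is my own minimal sketch of a constant-power pan law in Python/NumPy; the mapping of the pan knob onto an angle is one common convention, not the only one.

```python
import numpy as np

def constant_power_pan(mono, pan):
    """Split a mono signal between left and right with a constant-power law.
    pan runs from -1.0 (hard left) through 0.0 (center) to +1.0 (hard right).
    Sine/cosine gains keep left^2 + right^2 constant, so loudness doesn't sag
    in the middle the way a naive linear crossfade does."""
    theta = (pan + 1.0) * np.pi / 4.0          # map [-1, +1] onto [0, pi/2]
    return np.cos(theta) * mono, np.sin(theta) * mono

# Center pan puts each channel about 3 dB down (gain ~0.707), not 6 dB down.
left, right = constant_power_pan(np.ones(1), 0.0)
print(f"center gains: L={left[0]:.3f}, R={right[0]:.3f}")
```

Note that this is pure level panning: it leans entirely on the first, loudness-based cue above and ignores the timing and spectral cues, which is exactly why a panned mono source never quite escapes the line between the speakers.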