SoundScape


The Next-Gen Hearing Assistant

Project Details

Showcase Video

Demo Video

Abstract

SoundScape: Real-Time 3D Sound Localization and Classification with Sensory Substitution for the Deaf and Hard of Hearing

Current devices geared towards the deaf and hard of hearing, such as hearing aids, struggle to localize and transmit sounds to those with severe hearing impairments. Advanced devices, like cochlear implants, are invasive and cost $30,000 to $50,000. Devices that classify sounds, such as home alert systems, are not suited for mobile use and recognize a limited number of noises. Our goal was to convey the directionality, pitch, amplitude, sound classification, and speech recognition of multiple sound sources to those with hearing impairments through a low-cost device. We localized sounds with the SRP-PHAT-HSDA algorithm computed on incoming audio captured by a 6-microphone array. We separated sounds through Geometric Source Separation (GSS) and beamforming, allowing us to isolate 4 audio sources. We classified each source in real time using a stacked generalization ensemble trained on an augmented audio dataset. Background noise is filtered through dynamic noise reduction for each source. The direction, along with the filtered amplitude and frequency, is transmitted to the user through vibration motors in a wearable device worn around the shoulders, while classification and speech recognition are displayed on a watchOS app. SoundScape can reliably localize four sounds with under 15.49 degrees of error, classify sounds with 93.4% accuracy, and operate with under 0.2 seconds of latency. The entire device costs $60 to manufacture with inexpensive TPU filament. SoundScape is the first assistive listening device to separate, localize, and classify multiple sound sources, with the potential to protect 466 million people with hearing loss worldwide.

Slides

Research Paper

Real-Time Sound Classification

SoundScape can classify environmental sounds with an accuracy of 93.4%. We do this using an ensemble of PyTorch models trained on an augmented dataset, which lets us classify sounds nearly anywhere, whether indoors, outdoors, or in noisy environments. Because we run in a GPU-accelerated environment, classification completes with under 0.2 seconds of latency.
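For illustration, here is a minimal Python sketch of how one base model could classify a single audio frame. The torchaudio mel-spectrogram front end, the 16 kHz frame, and the classify_frame helper are assumptions for the example, not the exact production pipeline.

    import torch
    import torchaudio

    # Illustrative front end: convert a mono waveform into a log-mel spectrogram.
    SAMPLE_RATE = 16000
    mel_spec = torchaudio.transforms.MelSpectrogram(
        sample_rate=SAMPLE_RATE, n_fft=1024, hop_length=256, n_mels=64
    )
    to_db = torchaudio.transforms.AmplitudeToDB()

    def classify_frame(model: torch.nn.Module, frame: torch.Tensor) -> int:
        """Return the predicted class index for a mono waveform of shape (samples,)."""
        features = to_db(mel_spec(frame)).unsqueeze(0).unsqueeze(0)  # (1, 1, mels, time)
        with torch.no_grad():
            logits = model(features.to(next(model.parameters()).device))
        return int(logits.argmax(dim=-1))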

Features

Sound Localization

We perform 3D sound localization with ODAS, which can localize up to four sound sources simultaneously.

Sound Separation

Using Geometric Source Separation, we separate each localized sound source from each other with limited interference.

Sound Classification

By using our stacked generalization ensemble, we classify each separated sound source produced from GSS.

Speech Recognition

We use Google's Speech Recognition API, one of the most accurate speech recognition services available, to identify human speech.

Background Noise

Using our dynamic noise reduction algorithm, we remove background noise so that it is not passed on to the user through sensory substitution.

Sensory Substitution

We convey music and the feeling of sound through haptic motor vibration with sensory substitution.


SoundScape App

The SoundScape app comes in two forms: a mobile version and a watchOS version. The mobile version displays in real time the location of each sound source (relative to the user) along with the sound classification or speech recognition output for the corresponding source. If the user wants a more convenient way of viewing classifications, they can use the watchOS app.

State of the Art Sound Localization

We use the ODAS system, which utilizes Steered Response Power with Phase Transform (SRP-PHAT) to localize up to 4 independent sound sources in 3D space simultaneously. We then track sound sources using M3K, a 3D Kalman filter, to ensure long-term position accuracy.

Prototypes

V5

V4

V3

V2

V1

Compact and Manufacturable Design

Through our various prototypes, we strove to make a compact and comfortable design that users can wear in their everyday lives. To keep the device accessible, we also made it easy to manufacture: the body is a simple shape printed from flexible TPU filament, allowing production to be scaled up easily.

Frequently Asked Questions

How much will it cost?

SoundScape will only cost $60. The body is printed with inexpensive but flexible TPU filament, and the design can be scaled to an injection mold for mass production. The other parts, such as the Raspberry Pi and Teensy 4.0, total under $45.

What sounds can it classify and how many?

SoundScape can classify 50 sounds as well as recognize speech. These range from emergency sounds, such as a baby crying or a car's engine, to common sounds like rain or footsteps. The user can choose which specific sounds to receive notifications for.

Do you plan to use a server going forward?

By streaming audio from the microphone array to the phone over Bluetooth, we can run the computational tasks on the phone itself. We can also compile our neural network models with Core ML to utilize the iPhone's Neural Engine.
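As a rough sketch of that plan, a traced PyTorch model can be converted with coremltools so it runs on the Neural Engine. The tiny stand-in model, the (1, 1, 64, 63) input shape, and the file name are placeholders, not our actual classifier.

    import torch
    import coremltools as ct

    # Placeholder classifier standing in for a real base model.
    base_model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(64 * 63, 50))
    example_input = torch.rand(1, 1, 64, 63)  # assumed mel-spectrogram shape

    traced = torch.jit.trace(base_model.eval(), example_input)
    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(shape=example_input.shape)],
        convert_to="mlprogram",
    )
    mlmodel.save("SoundScapeClassifier.mlpackage")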

How does background noise filtering work?

Noise reduction works by first recording a 0.5-second background clip during an initial 10-second interval. We then use Fourier transforms to extract the frequencies present in the background noise and produce a mask, which we use to filter those frequencies from the signal audio.
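A simplified sketch of this spectral-masking idea is shown below; the FFT size, hop length, and the 2x noise margin are illustrative assumptions rather than our exact parameters.

    import numpy as np

    def noise_mask_filter(signal, noise_clip, n_fft=1024, hop=256, margin=2.0):
        """Suppress frequency bins that are no louder than the recorded background noise."""
        # Per-frequency noise profile from the short background clip.
        threshold = margin * np.abs(np.fft.rfft(noise_clip, n=n_fft))

        out = np.zeros_like(signal, dtype=float)
        window = np.hanning(n_fft)
        for start in range(0, len(signal) - n_fft, hop):
            frame = signal[start:start + n_fft] * window
            spec = np.fft.rfft(frame)
            mask = np.abs(spec) > threshold          # keep bins above the noise floor
            out[start:start + n_fft] += np.fft.irfft(spec * mask, n=n_fft)
        return out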

How have you accounted for different environments?

We do this in two ways: background noise reduction and data augmentation. Dynamically reducing background noise allows us to adapt to a variety of environments. Data augmentation allows our classifier to account for outdoors environments.

Does your system work in real-time?

Yes - the entire system has a latency under 0.2 seconds. We achieve this with GPU acceleration using CuPy and by classifying audio cumulatively, similar to how voice assistants like Siri work.
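As a small example of the GPU offloading, CuPy exposes a NumPy-compatible API, so per-frame work such as the FFT can be moved to the GPU with only minor code changes; the gpu_spectrum helper below is illustrative, not our full pipeline.

    import numpy as np
    import cupy as cp

    def gpu_spectrum(frame: np.ndarray) -> np.ndarray:
        """Compute a magnitude spectrum on the GPU and copy it back to host memory."""
        frame_gpu = cp.asarray(frame)              # host -> device
        spectrum = cp.abs(cp.fft.rfft(frame_gpu))  # FFT runs on the GPU
        return cp.asnumpy(spectrum)                # device -> host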

Why do you do sound classification?

People who are deaf and hard of hearing rely on their eyes to identify what made a sound. However, sound sources are often out of sight, such as a baby crying in another room or a car approaching from behind. In such cases, our users need to differentiate between "emergency" sounds - cars, babies - and less urgent ones, like wind or rain.

Where and how have you tested this?

We've tested indoors, outdoors, and with varying degrees of background noise. We found that indoors, with constant background noise at 20 dB, we could localize sounds up to 35 ft away.

In your motor pairs, how did you decide when to turn on the red or green LED?

In each pair, the red LED represents low frequencies and the green LED high frequencies. We determined the threshold frequency by calculating the average median frequency across the audio samples in the ESC-50 dataset.
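For illustration, the threshold could be computed roughly as follows; the definition of median frequency used here (the frequency below which half of a clip's spectral energy lies) and the helper names are assumptions for the sketch.

    import numpy as np

    def median_frequency(clip: np.ndarray, sr: int) -> float:
        """Frequency below which half of the clip's spectral energy lies."""
        energy = np.abs(np.fft.rfft(clip)) ** 2
        freqs = np.fft.rfftfreq(len(clip), d=1.0 / sr)
        cumulative = np.cumsum(energy)
        idx = np.searchsorted(cumulative, cumulative[-1] / 2.0)
        return float(freqs[idx])

    def led_threshold(clips, sr: int) -> float:
        """Average of the per-clip median frequencies, e.g. over the ESC-50 samples."""
        return float(np.mean([median_frequency(c, sr) for c in clips]))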

What made you choose the four base models?

We chose the models based on three criteria: accuracy, run time, and diversity in classification methods. Our goal was to incorporate each model into the meta-classifier, so variation in how they classify audio data increased the meta-classifier's accuracy. The four models we chose, DenseNet, ResNet18, CNN10, and TALNet, had some of the highest accuracies on other sound classification datasets such as ESC-10 and UrbanSound8K.
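A minimal sketch of the stacking step itself: each base model's class probabilities are concatenated into one feature vector for the meta-classifier. The logistic-regression meta-learner shown here is only an illustrative stand-in for the meta-classifier we train.

    import numpy as np
    import torch
    from sklearn.linear_model import LogisticRegression

    def base_probabilities(models, features: torch.Tensor) -> np.ndarray:
        """Concatenate the softmax outputs of every base model for one batch."""
        outputs = []
        with torch.no_grad():
            for model in models:
                probs = torch.softmax(model(features), dim=-1)
                outputs.append(probs.cpu().numpy())
        return np.concatenate(outputs, axis=1)  # shape: (batch, 4 * n_classes)

    # Illustrative usage: fit the meta-classifier on held-out predictions.
    # meta = LogisticRegression(max_iter=1000).fit(base_probabilities(models, X_val), y_val)
    # prediction = meta.predict(base_probabilities(models, X_test))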

How do you localize multiple sound sources?

Each microphone hears the same sound with a slightly different time of arrival, which offsets the audio wave it captures. For every candidate direction, the array shifts each microphone's signal so that the waves produced by the same sound source line up in sync. This still works when multiple audio signals arrive from different directions: when the signal detected on one microphone also appears on another, the shifted waves are almost entirely in sync for that source and out of sync for the others. This allows the SRP model to score candidate directions and filter out repeated detections of the same source.
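The core of this alignment is a phase-transform cross-correlation (GCC-PHAT) between microphone pairs, which finds the delay that brings two signals into sync. The sketch below shows the idea for one pair; it is a simplified stand-in, not the ODAS implementation.

    import numpy as np

    def gcc_phat(sig: np.ndarray, ref: np.ndarray, fs: int, max_tau: float) -> float:
        """Estimate the time difference of arrival (in seconds) between two mics."""
        n = len(sig) + len(ref)
        cross = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
        cross /= np.abs(cross) + 1e-12            # PHAT weighting keeps only phase
        corr = np.fft.irfft(cross, n=n)
        max_shift = int(fs * max_tau)             # limit search to physically possible delays
        corr = np.concatenate((corr[-max_shift:], corr[:max_shift + 1]))
        shift = int(np.argmax(np.abs(corr))) - max_shift
        return shift / float(fs)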

How do you track sound sources?

Using localization data from the SRP-PHAT-HSDA model, we know the direction of arrival of each sound source. We then determine whether the source is in one of four states: static, moving at constant velocity, accelerating, or a new source. Finally, we use a 3D Kalman filter, called M3K, to accurately track the sound source over time.
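For reference, a bare-bones constant-velocity Kalman predict/update step over a 3D direction looks like the sketch below. M3K adds the mixture modelling and state selection that this omits, and the frame period and noise values are placeholders.

    import numpy as np

    dt = 0.05                                     # assumed frame period in seconds
    F = np.block([[np.eye(3), dt * np.eye(3)],    # state transition: position + velocity
                  [np.zeros((3, 3)), np.eye(3)]])
    H = np.hstack([np.eye(3), np.zeros((3, 3))])  # we only observe the direction
    Q = 1e-4 * np.eye(6)                          # process noise (placeholder)
    R = 1e-2 * np.eye(3)                          # measurement noise (placeholder)

    def kalman_step(x, P, z):
        """One predict/update cycle: state x (6,), covariance P (6, 6), measurement z (3,)."""
        x = F @ x                                 # predict
        P = F @ P @ F.T + Q
        K = P @ H.T @ np.linalg.inv(H @ P @ H.T + R)
        x = x + K @ (z - H @ x)                   # update with the new direction estimate
        P = (np.eye(6) - K @ H) @ P
        return x, P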


Meet The Team

We are high school juniors from TJHSST hoping to improve lives.

Eugene Choi

Co-Founder

Irfan Nafi

Co-Founder

Raffu Khondaker

Co-Founder