Sound Source Identification for Vocal Sounds of Human Origin

Advantages

Live human speech can be discriminated from environmental sounds, including speech played back from a TV or radio.
When used in smart speakers, it eliminates the need for wake words.

Technology Overview & Background

The invention is a voice identification method for smart speakers and interactive robots. It distinguishes human speech that reaches the device as part of the environmental sound (e.g., human speech from a TV) from live speech uttered directly by a person.
The inventors are conducting research to develop an interactive robot that can be used in the homes of the elderly and patients with mild cognitive impairment (MCI). In many homes of the elderly, for example, the TV is left on all day, which causes the interactive robot to respond to sounds originating from the TV and other sources. One common way to separate instructions from environmental sounds is to require a wake word, as smart speakers typically do. However, this is not suitable for interactive robots that are intended for natural interaction. Therefore, the inventors devised this method to enable the robot to easily distinguish between live human speech and human speech sounds derived from television, radio, and other sources.
The specific process is as follows: (1) the robot records audio, (2) a Voice Activity Detector (VAD) determines whether the recording contains a human voice (including voices that are part of the environmental sound), and finally (3) a trained convolutional neural network (CNN) is applied to the audio identified as a human voice. After these three steps, the robot responds only to direct human speech. Because speech is time-series data, processing it in raw form is computationally expensive. The researchers therefore propose converting the voice data into spectrogram images before classification. This conversion makes it possible to use CNNs, which are usually applied to images, reducing the overall amount of processing while still allowing discrimination.
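The three steps above can be sketched in Python as follows. This is a minimal illustration only, not the inventors' implementation: the energy threshold in `energy_vad` is a simple stand-in for a real VAD, `is_live_speech` is a hypothetical callable standing in for the trained CNN, and the frame length, hop size, and 16 kHz sample rate are assumed values chosen for the example.

```python
import numpy as np


def spectrogram(signal, frame_len=512, hop=256):
    """Log-magnitude spectrogram: rows are frequency bins, columns are time frames."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    # Slice the signal into overlapping, windowed frames.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # One FFT per frame; keep only the non-negative frequencies.
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    return np.log1p(magnitude).T


def energy_vad(signal, threshold=0.01):
    """Stand-in for a VAD (assumption): True when RMS energy exceeds a threshold."""
    return float(np.sqrt(np.mean(signal ** 2))) > threshold


def should_respond(signal, is_live_speech):
    """Steps (2) and (3): gate on the VAD, then classify the spectrogram image.

    `is_live_speech` is a hypothetical callable representing the trained CNN;
    it receives a 2-D spectrogram array and returns True for live speech.
    """
    if not energy_vad(signal):
        return False
    image = spectrogram(signal)
    return bool(is_live_speech(image))


# Step (1) is simulated here with one second of a synthetic 440 Hz tone.
sample_rate = 16000  # assumed sample rate
t = np.arange(sample_rate) / sample_rate
recording = 0.5 * np.sin(2 * np.pi * 440 * t)

image = spectrogram(recording)
print(image.shape)  # (frequency bins, time frames)
print(should_respond(recording, lambda img: True))
```

Keeping the classifier as a plain callable means the gating logic can be exercised without a trained model; in practice the callable would wrap whatever CNN has been trained on labeled live and broadcast speech.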
The technology and processing methods are described in more detail in the paper listed below.

Papers

Figueroa D., Nishio S., Yamazaki R., Ishiguro H., Int Rob Auto J. 2023;9(1):8‒13.
“Improving voice detection in real life scenarios: differentiating television and human speech at older adults’ houses.”
https://medcraveonline.com/IRATJ/IRATJ-09-00255.pdf

Patents

Filed in Japan, not yet published.

Principal Investigator & Academic Institution

Shuichi NISHIO (Specially Appointed Professor, Osaka University, Japan)

Expectations

Tech Manage is currently looking for companies willing to work with the researchers to develop this technology.
It is possible to license the above patent from Osaka University. In addition, joint research using the invention, provision of know-how through a nondisclosure agreement, and evaluation and licensing options for a certain period could also be considered.

Project No. JT-04562