FB AI distinguishes multiple speakers simultaneously

Category: IT Technology · Published: 5 years ago

What the research is

We’re introducing a new method to separate up to five voices speaking simultaneously on a single microphone. Our method surpasses previous state-of-the-art performance on several speech source separation benchmarks, including ones with challenging noise and reverberations. Using the WSJ0-2mix and WSJ0-3mix data sets, along with newly created variations with four and five simultaneous speakers, our model achieved an improvement in scale-invariant signal-to-noise ratio (SI-SNR, a common measure of separation quality) of more than 1.5 dB (decibels) over the current state-of-the-art models.
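For reference, SI-SNR is commonly defined as below. This is the standard textbook form, not necessarily the exact formulation used in the paper; the symbols $s$ (the clean reference signal) and $\hat{s}$ (the model's estimate) are notation assumed here, and both signals are usually mean-normalized first.

```latex
\tilde{s} = \frac{\langle \hat{s}, s \rangle}{\lVert s \rVert^{2}}\, s,
\qquad
e = \hat{s} - \tilde{s},
\qquad
\text{SI-SNR}(s, \hat{s}) = 10 \log_{10} \frac{\lVert \tilde{s} \rVert^{2}}{\lVert e \rVert^{2}}
```

Because the estimate is first projected onto the reference, rescaling $\hat{s}$ by any nonzero constant leaves the score unchanged, which is what "scale-invariant" refers to.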

To build our model, we use a novel recurrent neural network architecture that works directly on the raw audio waveform. The previous best-available models use a mask and a decoder to separate each speaker’s voice. The performance of these kinds of models degrades rapidly when the number of speakers is high or unknown.

As with standard speech separation systems, our model requires knowledge of the total number of speakers in advance. To handle cases where the number of speakers is unknown, however, we built a novel system that automatically detects the number of speakers and selects the most relevant model.

How it works

The main goal of speech separation models is to estimate the input sources, given an input mixture of speech signals, and generate an output of isolated channels for each speaker.

Our model uses an encoder network that maps the input signal to a latent representation. It then applies a voice separation network composed of several blocks, whose input is the latent representation and whose output is an estimated signal for each speaker. Previous methods typically apply a mask to perform separation, which is problematic when the mask is not well defined, and some signal information may be lost in the process.

We trained the model by directly optimizing SI-SNR with several loss functions under permutation invariant training. We inserted a loss function after every separation block to further improve the optimization process. Finally, to ensure each speaker is consistently mapped to a particular output channel, we added a perceptual loss function using a pretrained speaker recognition model.
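The idea behind permutation invariant training can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the paper's implementation: the function names `si_snr` and `pit_si_snr` are hypothetical, signals are plain lists of samples, and the search simply tries every assignment of output channels to reference speakers (at most 120 permutations for five speakers).

```python
import math
from itertools import permutations


def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB for two equal-length lists of samples."""
    n = len(target)
    est_mean, tgt_mean = sum(estimate) / n, sum(target) / n
    est = [e - est_mean for e in estimate]
    tgt = [t - tgt_mean for t in target]
    # Scale factor that projects the estimate onto the target.
    scale = sum(e * t for e, t in zip(est, tgt)) / (sum(t * t for t in tgt) + eps)
    signal = sum((scale * t) ** 2 for t in tgt)
    noise = sum((e - scale * t) ** 2 for e, t in zip(est, tgt))
    return 10 * math.log10((signal + eps) / (noise + eps))


def pit_si_snr(estimates, targets):
    """Return (best mean SI-SNR, best permutation).

    perm[i] is the index of the estimated channel matched to target i.
    """
    best_score, best_perm = -math.inf, None
    for perm in permutations(range(len(targets))):
        score = sum(
            si_snr(estimates[p], targets[i]) for i, p in enumerate(perm)
        ) / len(targets)
        if score > best_score:
            best_score, best_perm = score, perm
    return best_score, best_perm
```

In a training loop, the quantity to minimize would be the negative of the returned score, so that the model is never penalized merely for emitting the speakers in a different order than the references.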

We also built a new system to handle separation when the number of speakers is unknown. We did this by training different models for separating two, three, four, and five speakers. We fed the input mixture to the model designed to accommodate up to five simultaneous speakers so that it would detect the number of active (nonsilent) channels present. Then, we repeated the same process with a model trained for that number of active speakers and checked whether all output channels were active. We repeated this process until either all channels were active or we reached the model with the lowest number of target speakers.
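The selection procedure described above can be sketched as follows. Everything named here is an assumption for illustration: `count_active`, `separate_unknown`, the mean-power silence threshold, and the convention that `models` maps a speaker count to a callable returning that many output channels. The `make_fake_model` helper is only a stand-in so the sketch runs without trained networks.

```python
def count_active(channels, threshold=1e-3):
    """Count channels whose mean power exceeds an assumed silence threshold."""
    return sum(
        1 for ch in channels if sum(x * x for x in ch) / len(ch) > threshold
    )


def separate_unknown(mixture, models):
    """Pick a model by repeatedly counting active output channels.

    models: dict mapping speaker count -> callable(mixture) -> list of channels.
    Returns (chosen speaker count, output channels of the chosen model).
    """
    counts = sorted(models)
    k = counts[-1]  # start with the model for the most speakers
    while True:
        channels = models[k](mixture)
        active = count_active(channels)
        # Stop when every channel is active or no smaller model exists.
        if active == len(channels) or k == counts[0]:
            return k, channels
        # Otherwise move to the model trained for the detected count,
        # clamped to the smallest available model.
        k = max(c for c in counts if c <= max(active, counts[0]))


def make_fake_model(k):
    """Stand-in 'model': passes through up to k sources, pads with silence."""
    def model(sources):
        out = [list(s) for s in sources[:k]]
        while len(out) < k:
            out.append([0.0] * len(sources[0]))
        return out
    return model


# Usage with stand-in models for two to five speakers and three "speakers".
models = {k: make_fake_model(k) for k in (2, 3, 4, 5)}
sources = [[1.0] * 10, [0.5] * 10, [-1.0] * 10]
chosen, channels = separate_unknown(sources, models)
```

With the stand-in models, the five-speaker model reports three active channels, so the three-speaker model is tried next and accepted because all of its channels are active.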

Why it matters

The ability to separate a single voice from a conversation among many people can improve communication across a wide range of applications that we use in our daily lives, like voice messaging, assistants, and video tools, as well as AR/VR innovations. It can also improve audio quality for people with hearing aids, so it’s easier to hear others clearly in crowded and noisy environments such as parties, restaurants, or large video calls.

Beyond separating different voices, our novel system can also be applied to separate other types of speech signals from a mixture of sounds such as background noise. Our work can also be applied to music recordings, improving our previous work on separating different musical instruments from a single audio file. As a next step, we’ll work on improving the generative properties of the model until it achieves high performance in real-world conditions.

Read the full paper:

Voice separation with an unknown number of multiple speakers

Check out the audio samples here.

