Accurate Online Speaker Diarization with Supervised Learning


Speaker diarization, the process of partitioning an audio stream with multiple people into homogeneous segments associated with each individual, is an important part of speech recognition systems. By solving the problem of “who spoke when”, speaker diarization has applications in many important scenarios, such as understanding medical conversations, video captioning and more. However, training these systems with supervised learning methods is challenging: unlike standard supervised classification tasks, a robust diarization model must be able to associate distinct speech segments with new individuals who were not seen during training. This limits the quality of both online and offline diarization systems. Online systems usually suffer more, since they must produce diarization results in real time.


Online speaker diarization on streaming audio input. Different colors in the bottom axis indicate different speakers.

In “Fully Supervised Speaker Diarization”, we describe a new model that seeks to make use of supervised speaker labels in a more effective manner. Here “fully” implies that all components in the speaker diarization system, including the estimation of the number of speakers, are trained in supervised ways, so that they can benefit from increasing the amount of labeled data available. On the NIST SRE 2000 CALLHOME benchmark, our diarization error rate (DER) is as low as 7.6%, compared to 8.8% DER from our previous clustering-based method, and 9.9% from deep neural network embedding methods. Moreover, our method achieves this lower error rate with online decoding, making it especially suitable for real-time applications. We are therefore open sourcing the core algorithms in our paper to accelerate research in this direction.
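For readers unfamiliar with the metric, the following is a minimal sketch of a simplified, frame-level DER. It rests on stated assumptions: both label sequences cover the same speech frames, so only speaker confusion is counted, whereas standard DER scoring also counts missed speech and false-alarm speech. The function name `frame_der` is hypothetical, and the brute-force label matching is only practical for a handful of speakers.

```python
from itertools import permutations


def frame_der(reference, hypothesis):
    """Simplified frame-level DER that counts speaker confusion only.

    Assumes one label per speech frame in both sequences. Missed speech
    and false alarms, which full DER scoring includes, are ignored here.
    """
    if not reference or len(reference) != len(hypothesis):
        raise ValueError("need two non-empty label sequences of equal length")

    ref_ids = sorted(set(reference))
    hyp_ids = sorted(set(hypothesis))
    # Pad with None so every hypothesis label can get a distinct target,
    # even when the hypothesis contains more speakers than the reference.
    targets = ref_ids + [None] * len(hyp_ids)

    best_errors = len(reference)
    # Hypothesis labels are arbitrary, so score under the best one-to-one
    # mapping from hypothesis labels to reference labels.
    for assignment in permutations(targets, len(hyp_ids)):
        mapping = dict(zip(hyp_ids, assignment))
        errors = sum(
            1 for ref, hyp in zip(reference, hypothesis) if mapping[hyp] != ref
        )
        best_errors = min(best_errors, errors)
    return best_errors / len(reference)


reference = ["A", "A", "B", "B", "A"]
hypothesis = [0, 0, 1, 1, 1]
print(frame_der(reference, hypothesis))
```

In this toy input, the last frame is attributed to the wrong speaker under the best label mapping, so one frame out of five is in error.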

Clustering versus Interleaved-state RNN

Modern speaker diarization systems are usually based on clustering algorithms such as k-means or spectral clustering. Since these clustering methods are unsupervised, they cannot make good use of the supervised speaker labels available in data. Moreover, online clustering algorithms usually have worse quality in real-time diarization applications with streaming audio inputs. The key difference between our model and common clustering algorithms is that in our method, all speakers’ embeddings are modeled by a parameter-sharing recurrent neural network (RNN), and we distinguish different speakers using different RNN states, interleaved in the time domain.

To understand how this works, consider the example below in which there are four possible speakers: blue, yellow, pink and green (this number is arbitrary, and in fact there may be more; our model uses a Chinese restaurant process to accommodate the unknown number of speakers). Each speaker starts with its own RNN instance (with a common initial state shared among all speakers) and keeps updating the RNN state given the new embeddings from this speaker. In the example below, the blue speaker keeps updating its RNN state until a different speaker, yellow, comes in. If blue speaks again later, it resumes updating its RNN state. (This is just one of the possibilities for speech segment y₇ in the figure below. If a new speaker, green, enters instead, it will start with a new RNN instance.)

The generative process of our model. Colors indicate labels for speaker segments.
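The generative process described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the released implementation: `rnn_step` is a hypothetical stand-in for the shared RNN cell, and the speaker-change probability `change_prob`, the concentration parameter `alpha`, and the block-count weighting are choices made here only to show a Chinese-restaurant-process-style assignment.

```python
import math
import random


def sample_speaker_sequence(num_segments, alpha=1.0, change_prob=0.3, seed=0):
    """Sample speaker labels with a CRP-style rule (illustrative assumption)."""
    rng = random.Random(seed)
    labels = []
    block_counts = []  # block_counts[k]: contiguous blocks spoken by speaker k
    for t in range(num_segments):
        if t == 0:
            labels.append(0)
            block_counts.append(1)
            continue
        prev = labels[-1]
        if rng.random() >= change_prob:
            labels.append(prev)  # same speaker keeps talking
            continue
        # Speaker change: an existing speaker returns with weight equal to
        # its block count, or a new speaker enters with weight alpha.
        candidates = [k for k in range(len(block_counts)) if k != prev]
        weights = [block_counts[k] for k in candidates]
        new_id = len(block_counts)
        candidates.append(new_id)
        weights.append(alpha)
        choice = rng.choices(candidates, weights=weights, k=1)[0]
        if choice == new_id:
            block_counts.append(1)
        else:
            block_counts[choice] += 1
        labels.append(choice)
    return labels


def rnn_step(state, embedding, w_state=0.5, w_input=0.5):
    """Toy recurrent update; the weights are shared by every speaker."""
    return [math.tanh(w_state * s + w_input * x) for s, x in zip(state, embedding)]


def run_interleaved(embeddings, labels, dim):
    """Keep one RNN state per speaker, interleaved in time."""
    initial_state = [0.0] * dim  # common initial state for all speakers
    states = {}
    for embedding, speaker in zip(embeddings, labels):
        # A returning speaker resumes from its own last state;
        # a new speaker starts from the shared initial state.
        previous = states.get(speaker, initial_state)
        states[speaker] = rnn_step(previous, embedding)
    return states


labels = sample_speaker_sequence(10)
embeddings = [[0.1 * t, 1.0 - 0.1 * t] for t in range(10)]
print(labels)
print(run_interleaved(embeddings, labels, dim=2))
```

The point of the sketch is that the parameters in `rnn_step` are shared, while the dictionary of states is what distinguishes one speaker from another.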

Representing speakers as RNN states enables us to learn the high-level knowledge shared across different speakers and utterances through the RNN parameters, so the model can improve as more labeled data becomes available. In contrast, common clustering algorithms almost always work on each utterance independently, making it difficult for them to benefit from a large amount of labeled data.

The upshot of all this is that given time-stamped speaker labels (i.e. we know who spoke when), we can train the model with standard stochastic gradient descent algorithms. A trained model can be used for speaker diarization on new utterances from unheard speakers. Furthermore, the use of online decoding makes it more suitable for latency-sensitive applications.
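To make the control flow of online decoding concrete, here is a deliberately simplified greedy sketch. Everything in it is an assumption for illustration: the trained RNN is replaced by a moving average in the hypothetical `update_state`, each speaker's state doubles as its prediction of the next embedding, and `new_speaker_cost` is a fixed threshold standing in for the model's learned preference for opening a new speaker. With these substitutions the loop behaves much like simple online clustering, so it shows only how labels are emitted one segment at a time, not the quality of the actual model.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def update_state(state, embedding, rate=0.5):
    """Stand-in for the shared, trained RNN step (assumption: moving average)."""
    return [(1 - rate) * s + rate * x for s, x in zip(state, embedding)]


def decode_online(embeddings, new_speaker_cost=0.5):
    """Greedily assign a speaker label to each embedding as it arrives."""
    states = []  # one state per speaker discovered so far
    labels = []
    for embedding in embeddings:
        best_speaker, best_cost = None, new_speaker_cost
        for speaker, state in enumerate(states):
            cost = squared_distance(state, embedding)
            if cost < best_cost:
                best_speaker, best_cost = speaker, cost
        if best_speaker is None:
            # No existing speaker explains this segment well enough.
            states.append(list(embedding))
            labels.append(len(states) - 1)
        else:
            states[best_speaker] = update_state(states[best_speaker], embedding)
            labels.append(best_speaker)
    return labels


stream = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.0]]
print(decode_online(stream))  # prints [0, 0, 1, 1, 0]
```

Each label is produced as soon as its segment arrives, without looking at future audio, which is what makes this style of decoding suitable for streaming input.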

Future Work

Although we’ve already achieved impressive diarization performance with this system, there are still many exciting directions we are currently exploring. First, we are refining our model so it can easily integrate contextual information to perform offline decoding. Offline decoding will likely reduce the DER further and is well suited to latency-insensitive applications. Second, we would like to model acoustic features directly instead of using d-vectors. In this way, the entire speaker diarization system can be trained end to end.

To learn more about this work, please see our paper. To download the core algorithm of this system, please visit the GitHub page.

Acknowledgments

This work was done as a close collaboration between Google AI and Speech & Assistant teams. Contributors include Aonan Zhang (intern), Quan Wang, Zhengyao Zhu and Chong Wang.

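Unless otherwise noted, the content of this article is licensed under the Creative Commons Attribution 3.0 License, and code samples are licensed under the Apache 2.0 License. For details, see our terms of service.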
