Mozilla Common Voice Dataset – 7200 hrs, 54 languages

Category: IT Technology · Published: 4 years ago


More data, more languages, and introducing our first target segment!

We are halfway through 2020, and already it’s been an exciting year for Common Voice! Thanks to the enthusiasm and incredible engagement from our Common Voice communities, we are releasing an updated dataset with 7,226 total hours of contributed voice data. 5,591 of these hours have been confirmed valid by our diligent contributors. Dataset fun fact: this release comprises over 5.5 million clips*!

Not only is Common Voice growing, it’s continuing to diversify. This release includes voice recordings in 54 languages; 14 of these languages** are new to the platform and dataset. The platform is seeing more languages with over 5,000 unique speakers*** and an increase in languages with over 500 recorded hours****. With contributions from all over the globe, you are helping us follow through on our goal to create a voice dataset that is publicly available to anyone and represents the world we live in.

We are also proud to announce the release of our first ever dataset target segment! In May, Common Voice started collecting voice data for a specific purpose or use case. Now, we’re releasing the single word target segment, which includes the digits zero through nine, as well as the words yes, no, hey and Firefox. The released target segment is 120 total recorded hours, with 64 valid hours, across 18 languages. It was created in one month by over 11,000 unique contributor voices! This segment data will help Mozilla benchmark the accuracy of our open source voice recognition engine, Deep Speech, in multiple languages for a similar task and will enable more detailed feedback on how to continue improving the dataset.

From the whole Voice team at Mozilla: Thank you for your ongoing contributions, your support and your enthusiasm! Going into the second half of 2020, we look forward to continuing our mission to build a better, more open internet.

Cheers,

Megan + the Common Voice team

*Average clip duration is 4.7 seconds.

**14 new languages included with this release: Upper Sorbian, Romanian, Frisian, Czech, Greek, Romansh Vallader, Polish, Assamese, Ukrainian, Maltese, Georgian, Punjabi, Odia, and Vietnamese.

***Languages with over 5,000 unique speakers: English, German, French, Italian, Spanish

****Languages with over 500 recorded hours: English, German, French, Kabyle, Catalan, Spanish, Kinyarwanda
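
The headline figures in this release can be cross-checked against each other with a little arithmetic. The Python sketch below is only an illustration: the function and variable names are hypothetical, and because the post says "over 5.5 million clips", the clip count used here is a lower bound, so the resulting hour estimate is approximate rather than exact.

```python
def clips_to_hours(num_clips: float, avg_clip_seconds: float) -> float:
    """Convert a clip count and an average clip length into total hours."""
    return num_clips * avg_clip_seconds / 3600


# Figures quoted in the post.
total_hours = 7226          # total contributed hours
valid_hours = 5591          # hours confirmed valid
approx_clips = 5.5e6        # "over 5.5 million clips", so treated as a lower bound
avg_clip_seconds = 4.7      # average clip duration from the first footnote

segment_total_hours = 120   # single word target segment, total recorded hours
segment_valid_hours = 64    # single word target segment, valid hours

estimated_hours = clips_to_hours(approx_clips, avg_clip_seconds)
valid_share = valid_hours / total_hours
segment_valid_share = segment_valid_hours / segment_total_hours

print(f"Estimated hours from clip count: {estimated_hours:.0f}")
print(f"Validated share of full dataset: {valid_share:.1%}")
print(f"Validated share of target segment: {segment_valid_share:.1%}")
```

Under these assumptions the clip count and average duration imply roughly 7,180 hours, which is in line with the reported 7,226 total hours. About 77% of the full dataset and about 53% of the target segment have been validated so far.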

