Detecting Emotions from Voice Clips


Identifying Emotions from Voice using Transfer Learning

Training a neural net to recognize emotions from voice clips using transfer learning.

A spectrogram of the voice clip of a happy person
In the episode titled “The Emotion Detection Automation” from the iconic sitcom “The Big Bang Theory”, Howard procures a device that helps Sheldon (who has trouble reading emotional cues in others) understand the feelings of the people around him by pointing it at them…

Humans convey messages not just with the spoken word but also through tone, body language and expressions. The same message spoken in two different manners can have very different meanings. Keeping this in mind, I embarked on a project to recognize emotions from voice clips, using tone, loudness and various other factors to determine what the speaker is feeling.

This article is a brief but complete tutorial explaining how to train a neural network to predict the emotion a person is feeling. The process is divided into three steps:

  1. Understanding the data
  2. Pre-processing the data
  3. Training the neural network

We will require the following libraries:

1. fastai

2. numpy

3. matplotlib

4. librosa

5. pytorch

You will also need a Jupyter notebook environment.
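Assuming a standard Python environment, these can all be installed from PyPI (note that pytorch's package name on PyPI is `torch`; exact versions are up to you):

```shell
pip install fastai numpy matplotlib librosa torch
```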

Understanding The Data

We will be using two datasets together to train the neural network :

RAVDESS Dataset

The RAVDESS Dataset is a collection of audio and video clips of 24 actors speaking the same two lines with 8 different emotions (neutral, calm, happy, sad, angry, fearful, disgust, surprised). We will be using only the audio clips for this tutorial. You can obtain the dataset from here.

TESS Dataset

The TESS Dataset is a collection of audio clips of 2 women expressing 7 different emotions (anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral). You can obtain the dataset from here.

We will convert the sound clips into a graphical format and merge the two datasets into one, which we will then divide into 8 folders, one for each emotion listed in the RAVDESS section above (the “surprised” and “pleasant surprise” classes are merged into one).
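For the RAVDESS clips, the emotion can be read directly from the filename: names look like `03-01-06-01-02-01-12.wav`, and the third field is the emotion code. A small helper (the function name is my own) for sorting clips into the emotion folders could look like:

```python
# RAVDESS emotion codes, as documented by the dataset authors
RAVDESS_EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

def ravdess_emotion(filename):
    """Return the emotion label encoded in a RAVDESS filename.

    Filenames are dash-separated fields, e.g. 03-01-06-01-02-01-12.wav,
    where the third field is the emotion code.
    """
    code = filename.split("-")[2]
    return RAVDESS_EMOTIONS[code]
```

For example, `ravdess_emotion("03-01-06-01-02-01-12.wav")` returns `"fearful"`.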

Pre-processing the Data

We will convert the sound clips into graphical data so that they can be used to train the neural network. Check out the code below to do so:

The notebook above shows how to convert a sound file into graphical data that a neural network can interpret. We use the librosa library to convert the sound into a spectrogram on the mel scale, a scale designed to make the frequency content more perceptually interpretable. That code handles a single sound file; the notebook containing the whole pipeline for converting the initial datasets into the final input data can be found here.

After running the code above on each sound file and sorting the outputs as necessary, you should have 8 separate folders, each labelled with an emotion and containing the graphical outputs of all the sound clips expressing that emotion.

Training the Neural Network

We will now train the neural net to identify emotions by looking at the spectrograms generated from the sound clips. We will train it with the fastai library, starting from a pretrained CNN (resnet34) and fine-tuning it on our data.

What we will do is as follows:

1. Build a dataloader with appropriate data augmentation to feed the neural network. Each image is 432 by 288 pixels.

2. Take a neural net (resnet34) pretrained on the ImageNet dataset, reduce our images to 144 by 144 by cropping appropriately, and train the net on that data.

3. Train the neural net again on images of size 288 by 288.

4. Analyse the performance of the neural net on the validation set.

5. Voila! The training process will be complete and you will have a neural net which can identify emotions from sound clips.

Let’s start training!

In the above section we created a dataloader from our data. We applied appropriate transformations to the images to reduce overfitting and to crop them down to 144 by 144, split the data into training and validation sets, and labelled each image from its folder name. As you can see, the data has 8 classes, so this is now a simple image classification problem.

In the above section we took the pretrained neural net and trained it on images of size 144 by 144 to identify emotions. At the end of training we reached an accuracy of 80.1%.

We now have a neural net which is pretty good at identifying emotions from 144 by 144 images, so we will take the same net and train it to identify emotions from images of size 288 by 288 (which it should already be reasonably good at).

In the above section we trained the neural net (previously trained on the 144 by 144 images) on the 288 by 288 images.

And voila! It can now identify emotions from sound clips, irrespective of the content of the speech, with an accuracy of 83.1% on the validation set.

In the next section we will analyse the results of the neural net using a confusion matrix.

The above section contains the confusion matrix for our dataset.
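In fastai the matrix can be plotted directly with `ClassificationInterpretation.from_learner(learn).plot_confusion_matrix()`. Equivalently, with scikit-learn (an extra dependency not listed above, and with made-up labels purely for illustration) it can be computed as:

```python
from sklearn.metrics import confusion_matrix

emotions = ["neutral", "calm", "happy", "sad",
            "angry", "fearful", "disgust", "surprised"]

# Hypothetical validation results: true vs. predicted emotion per clip
y_true = ["happy", "happy", "sad", "angry", "neutral"]
y_pred = ["happy", "sad", "sad", "angry", "neutral"]

# cm[i][j] counts clips whose true label is emotions[i]
# and whose predicted label is emotions[j]
cm = confusion_matrix(y_true, y_pred, labels=emotions)
print(cm)
```

Off-diagonal entries reveal which emotions the net confuses, e.g. here one “happy” clip was misclassified as “sad”.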

The complete notebook for training can be found here and all the notebooks from preprocessing to training can be found here.

Thank you for reading this article; I hope you enjoyed it!

Citations

RAVDESS :

Livingstone SR, Russo FA (2018) The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5): e0196391. https://doi.org/10.1371/journal.pone.0196391

Toronto emotional speech set (TESS) :

Pichora-Fuller, M. Kathleen; Dupuis, Kate (2020) “Toronto emotional speech set (TESS)”, Scholars Portal Dataverse, V1. https://doi.org/10.5683/SP2/E8H2MF
