Identifying Emotions from Voice using Transfer Learning
Training a neural net to recognize emotions from voice clips using transfer learning.
In the episode titled “The Emotion Detection Automation” from the iconic sitcom “The Big Bang Theory”, Howard procures a device to aid Sheldon (who has trouble reading emotional cues in others) in understanding the feelings of the people around him, simply by pointing it at them …
Humans convey messages not just through the spoken word but also through tone, body language and expressions. The same message spoken in two different manners can have very different meanings. With this in mind, I embarked on a project to recognize emotions from voice clips, using tone, loudness and various other factors to determine what the speaker is feeling.
This article is a brief but complete tutorial explaining how to train a neural network to predict the emotions a person is feeling. The process is divided into three steps:
- Understanding the data
- Pre-processing the data
- Training the neural network
We will require the following libraries:
1. fastai
2. numpy
3. matplotlib
4. librosa
5. pytorch
You will also require a Jupyter notebook environment.
Understanding The Data
We will use two datasets together to train the neural network:
RAVDESS Dataset
The RAVDESS Dataset is a collection of audio and video clips of 24 actors speaking the same two lines with 8 different emotions (neutral, calm, happy, sad, angry, fearful, disgust, surprised). We will be using only the audio clips for this tutorial. You can obtain the dataset from here.
TESS Dataset
The TESS Dataset is a collection of audio clips of 2 women expressing 7 different emotions (anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral). You can obtain the dataset from here.
We will convert the sound clips into a graphical format and then merge the two datasets into one, which we will divide into 8 folders, one for each emotion listed in the RAVDESS section (the “surprised” and “pleasant surprise” classes will be merged into one).
Pre-processing the Data
We will convert the sound clips into graphical data so that they can be used to train the neural network. Check out the code below to do so:
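Here is a minimal sketch of the conversion, assuming librosa and matplotlib are installed; the input and output paths are hypothetical placeholders:

```python
# A sketch of the spectrogram conversion. The paths below are hypothetical.
import os

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load the clip; librosa resamples to 22050 Hz by default.
y, sr = librosa.load("audio/sample.wav")

# Compute a mel-scaled spectrogram and convert power to decibels,
# which makes quieter frequency bands visible in the image.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
mel_db = librosa.power_to_db(mel, ref=np.max)

# Render the spectrogram without axes so the network sees only the
# spectrogram itself, and save it as a PNG.
os.makedirs("spectrograms", exist_ok=True)
fig = plt.figure(figsize=(6, 4))
librosa.display.specshow(mel_db, sr=sr)
plt.axis("off")
fig.savefig("spectrograms/sample.png", bbox_inches="tight", pad_inches=0)
plt.close(fig)
```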
The code above shows how to convert a sound file into graphical data that a neural network can interpret. We use the librosa library to convert the sound data into a spectrogram on the mel scale, a perceptual scale of pitch designed to make the result more interpretable. This code applies to a single sound file; the notebook containing the full pipeline for converting the initial datasets into the final input data can be found here.
After running the code above on each sound file and sorting the outputs, you should have 8 separate folders, each labelled with an emotion and containing the spectrograms of all the sound clips expressing that emotion.
Training the Neural Network
We will now train the neural net to identify emotions by looking at the spectrograms generated from the sound clips, using the fastai library. We will take a CNN (resnet34) pretrained on ImageNet and fine-tune it on our data.
The plan is as follows:
1. Make a dataloader with appropriate data augmentation to feed the neural network. Each image is 432 by 288.
2. Take a neural net (resnet34) pretrained on the ImageNet dataset, crop our images down to 144 by 144, and train the network on them.
3. Train the neural net again on images of size 288 by 288.
4. Analyse the performance of the neural net on the validation set.
5. Voila! The training process will be complete and you will have a neural net which can identify emotions from sound clips.
Let’s start training!
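Here is a minimal sketch of the dataloader step, written against the fastai v2 API (the exact calls may differ between fastai versions); the data path is a hypothetical placeholder pointing at the 8 emotion folders:

```python
# A sketch of the dataloader, assuming one subfolder per emotion under
# data/spectrograms (hypothetical path). Labels come from folder names.
from fastai.vision.all import *

path = Path("data/spectrograms")

dls = ImageDataLoaders.from_folder(
    path,
    valid_pct=0.2,                # hold out 20% of images for validation
    item_tfms=Resize(144),        # reduce each 432x288 spectrogram to 144x144
    batch_tfms=aug_transforms(do_flip=False),  # flips would reverse time
)
dls.show_batch()  # sanity-check the images and their emotion labels
```

Note that horizontal flips are disabled here, since flipping a spectrogram would reverse the clip in time.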
In the section above we created a dataloader from our data. We applied transformations to the images to reduce overfitting and to resize them to 144 by 144, split the data into training and validation sets, and labelled each image with its folder name. As you can see, the data has 8 classes, so this is now a simple image classification problem.
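The first training stage might then look like the sketch below, continuing from the dataloader above; the epoch count is illustrative rather than the author's setting:

```python
# Fine-tune a resnet34 pretrained on ImageNet (in older fastai versions
# vision_learner was called cnn_learner).
learn = vision_learner(dls, resnet34, metrics=accuracy)
learn.fine_tune(10)  # illustrative epoch count
```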
In the section above we took a pretrained neural net and trained it on images of size 144 by 144 to identify emotions. At the end of training we reached an accuracy of 80.1%.
We now have a neural net that is fairly good at identifying emotions from 144 by 144 images. Next we will train the same network on images of size 288 by 288, a task it should already be reasonably good at.
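This second stage, a form of progressive resizing, might look like the sketch below, continuing from the previous one:

```python
# Rebuild the dataloaders at 288x288 and swap them into the existing
# learner so training continues from the 144x144 weights.
dls_288 = ImageDataLoaders.from_folder(
    path,
    valid_pct=0.2,
    item_tfms=Resize(288),
    batch_tfms=aug_transforms(do_flip=False),
)
learn.dls = dls_288
learn.fine_tune(5)  # illustrative epoch count
```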
In the section above we trained the neural net (previously trained on the 144 by 144 images) on the 288 by 288 images.
And voila! It can now identify emotions from sound clips, irrespective of the content of the speech, with an accuracy of 83.1% on the validation set.
In the next section we will analyse the results of the neural net using a confusion matrix.
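A minimal sketch of this analysis, using fastai's interpretation utilities on the learner trained above:

```python
# Build an interpretation object over the validation set, plot the
# confusion matrix, and list the emotion pairs confused most often.
interp = ClassificationInterpretation.from_learner(learn)
interp.plot_confusion_matrix(figsize=(8, 8))
interp.most_confused(min_val=2)
```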
The above section contains the confusion matrix for our dataset.
The complete notebook for training can be found here, and all the notebooks from preprocessing to training can be found here.
Thank you for reading this article. I hope you enjoyed it!
Citations
RAVDESS:
Livingstone SR, Russo FA (2018) The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5): e0196391. https://doi.org/10.1371/journal.pone.0196391
TESS:
Pichora-Fuller, M. Kathleen; Dupuis, Kate, 2020, “Toronto emotional speech set (TESS)”, https://doi.org/10.5683/SP2/E8H2MF, Scholars Portal Dataverse, V1