Detecting Emotions from Voice Clips



Identifying Emotions from Voice using Transfer Learning

Training a neural net to recognize emotions from voice clips using transfer learning.

A spectrogram of the voice clip of a happy person
In the episode titled “The Emotion Detection Automation” from the iconic sitcom “The Big Bang Theory”, Howard procures a device that helps Sheldon (who has trouble reading emotional cues in others) understand the feelings of the people around him, simply by pointing it at them.

Humans convey messages not just through the spoken word but also through tone, body language and expressions. The same message spoken in two different manners can have very different meanings. Keeping this in mind, I embarked on a project to recognize emotions from voice clips, using tone, loudness and various other factors to determine what the speaker is feeling.

This article is a brief but complete tutorial explaining how to train a neural network to predict the emotion a person is feeling. The process is divided into 3 steps:

  1. Understanding the data
  2. Pre-processing the data
  3. Training the neural network

We will require the following libraries:

  1. fastai
  2. numpy
  3. matplotlib
  4. librosa
  5. pytorch

You will also need a Jupyter notebook environment.

Understanding The Data

We will be using two datasets together to train the neural network :

RAVDESS Dataset

The RAVDESS Dataset is a collection of audio and video clips of 24 actors speaking the same two lines with 8 different emotions (neutral, calm, happy, sad, angry, fearful, disgust, surprised). We will be using only the audio clips for this tutorial. You can obtain the dataset from here.

TESS Dataset

The TESS Dataset is a collection of audio clips of 2 women expressing 7 different emotions (anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral). You can obtain the dataset from here.

We will convert the sound clips into a graphical format, merge the two datasets into one, and then divide the result into 8 folders, one for each emotion listed in the RAVDESS section above (the RAVDESS “surprised” and TESS “pleasant surprise” clips are merged into a single class).
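The merge can be scripted from each dataset's filename convention. The RAVDESS mapping below follows its published naming scheme (the third hyphen-separated field encodes the emotion); the TESS mapping assumes its files end with an emotion word such as `_angry` or `_ps` (pleasant surprise). A minimal sketch:

```python
from pathlib import Path

# RAVDESS filenames look like 03-01-06-01-02-01-12.wav;
# the third hyphen-separated field is the emotion code.
RAVDESS_EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}

# TESS filenames end with an emotion word, e.g. YAF_dog_angry.wav;
# "ps" (pleasant surprise) is folded into the "surprised" class.
TESS_EMOTIONS = {
    "neutral": "neutral", "happy": "happy", "sad": "sad",
    "angry": "angry", "fear": "fearful", "disgust": "disgust",
    "ps": "surprised",
}

def emotion_for(path: Path) -> str:
    """Return the target emotion folder for a RAVDESS or TESS clip."""
    stem = path.stem
    parts = stem.split("-")
    if len(parts) == 7:                    # RAVDESS naming scheme
        return RAVDESS_EMOTIONS[parts[2]]
    suffix = stem.split("_")[-1].lower()   # TESS naming scheme
    return TESS_EMOTIONS[suffix]

print(emotion_for(Path("03-01-06-01-02-01-12.wav")))  # fearful
print(emotion_for(Path("YAF_dog_ps.wav")))            # surprised
```

In practice you would walk both dataset directories and copy each clip's spectrogram into the folder returned by `emotion_for`.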

Pre-processing the Data

We will convert the sound clips into graphical data so that they can be used to train the neural network. Check out the code below to do so:

The notebook above shows how to convert the sound files into graphical data that a neural network can interpret, using the librosa library. We convert the sound data into a spectrogram on the mel scale, a scale designed to make the representation more interpretable. This notebook contains code to be applied to one sound file; the notebook containing the whole pipeline for converting the initial datasets into the final input data can be found here.

After running the code in the notebook above for every sound file and dividing the files accordingly, you should have 8 separate folders, each labelled with an emotion and containing the graphical outputs for all the sound clips expressing that emotion.

Training the Neural Network

We will now train the neural net to identify emotions by looking at the spectrograms generated from the sound clips, using the fastai library. We will take a pretrained CNN (resnet34) and then train it on our data.

What we will be doing is as follows:

  1. Make a dataloader with appropriate data augmentation to feed the neural network. The size of each image is 432 by 288.
  2. Take a neural net (resnet34) pretrained on the ImageNet dataset, reduce our images to 144 by 144 by cropping appropriately, and train the net on that data.
  3. Train the neural net again on images of size 288 by 288.
  4. Analyse the performance of the neural net on the validation set.
  5. Voila! The training process will be complete and you will have a neural net which can identify emotions from sound clips.

Let's start training!

In the above section we created a dataloader from our data. We applied appropriate transformations to the images to reduce overfitting and to reduce them to 144 by 144, split the data into training and validation sets, and labelled each example from its folder name. As you can see, the data has 8 classes, so this is now a simple image classification problem.

In the above section we took the pretrained neural net and trained it on images of size 144 by 144 to identify emotions. At the end of training we reached an accuracy of 80.1%.

We now have a neural net which is pretty good at identifying emotions from 144 by 144 images. Next we will train the same net to identify emotions from images of size 288 by 288 (which it should already be fairly good at).

In the above section we trained the neural net (previously trained on the 144 by 144 images) on the 288 by 288 images.

And voila! It can now identify emotions from sound clips, irrespective of the content of the speech, with an accuracy of 83.1% on the validation set.

In the next section we will analyse the results of the neural net using a confusion matrix.

The above section contains the confusion matrix for our dataset.

The complete notebook for training can be found here and all the notebooks from preprocessing to training can be found here.

Thank you for reading this article; I hope you enjoyed it!

Citations

RAVDESS :

Livingstone SR, Russo FA (2018). The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English. PLoS ONE 13(5): e0196391. https://doi.org/10.1371/journal.pone.0196391

Toronto emotional speech set (TESS) :

Pichora-Fuller, M. Kathleen; Dupuis, Kate (2020). “Toronto emotional speech set (TESS)”, Scholars Portal Dataverse, V1. https://doi.org/10.5683/SP2/E8H2MF
