The Multi-Channel Neural Network

Neural Networks are widely used across multiple domains, such as Computer Vision, Audio Classification and Natural Language Processing. In most cases, they are applied to each of these domains individually. In real-life settings, however, a single type of input is rarely the optimal configuration. It is much more common to have multiple channels, meaning several different types of input. Just as humans extract insights from a wide range of sensory inputs (audio, visual, etc.), Neural Networks can, and should, be trained on multiple inputs.

Let’s take, for example, the task of emotion recognition.

Humans do not use a single input to classify an interlocutor’s emotion. They do not rely only on facial expressions (visual), on the voice (audio), or on the meaning of the words (text), but on a mixture of them. Similarly, Neural Networks can be trained on multiple inputs, such as images, audio and text, each processed appropriately (through CNNs, NLP, etc.), to produce an effective prediction of the target emotion. By doing so, Neural Networks are better able to capture subtleties, such as sarcasm, that are difficult to detect from any single channel, much like humans do.

System Architecture

Considering the task of emotion recognition, which for simplicity we restrict to three classes (Positive, Negative and Neutral), we can think of the system as follows:

The system picks up the audio through a microphone, computes the MEL Spectrogram of the sound as an image, and transcribes the speech into a string of text. These two signals are then used as the inputs to the model, each fed to one of its branches. The Neural Network is, indeed, formed by two sections:

  • The left branch, performing Image Classification through a Convolutional Neural Network
  • The right branch, performing NLP on the text, using Embeddings.

Finally, the output of each branch is fed into a common set of Dense layers, the last of which has three neurons, one for each of the three classes (Positive, Neutral and Negative).

Setup

In this example, we will use the MELD dataset, consisting of short conversation clips with associated labels indicating whether the sentiment of each utterance is Positive, Neutral or Negative. The dataset can be found here.

You can also find the complete code of this article here.

Image Classification Branch

Right after the sound is collected, the system computes the Spectrogram of the audio signal. The most widely used features for audio classification are MEL Spectrograms and MEL Frequency Cepstral Coefficients (MFCCs). For this example, we will use the MEL Spectrogram.
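
For reference, both feature types can be computed directly with librosa. Below is a minimal, standalone sketch (not part of the pipeline that follows; the audio path is hypothetical):

import librosa
import numpy as np

# Load a clip at its native sampling rate (hypothetical file path)
clip, sample_rate = librosa.load('example.wav', sr=None)

# MEL Spectrogram, converted to decibels
mel = librosa.feature.melspectrogram(y=clip, sr=sample_rate)
mel_db = librosa.power_to_db(mel, ref=np.max)

# MEL Frequency Cepstral Coefficients (the alternative feature set)
mfcc = librosa.feature.mfcc(y=clip, sr=sample_rate, n_mfcc=13)

print(mel_db.shape, mfcc.shape)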

First of all, we load the data from the folders containing the audio files, split across Training, Validation and Test:

import gensim.models as gm
import glob as gb
import keras.applications as ka
import keras.layers as kl
import keras.models as km
import keras.optimizers as ko
import keras_preprocessing.image as ki
import keras_preprocessing.sequence as ks
import keras_preprocessing.text as kt
import numpy as np
import pandas as pd
import pickle as pk
import tensorflow as tf
import utils as ut


# Data
Data_dir = np.array(gb.glob('../Data/MELD.Raw/train_splits/*'))
Validation_dir = np.array(gb.glob('../Data/MELD.Raw/dev_splits_complete/*'))
Test_dir = np.array(gb.glob('../Data/MELD.Raw/output_repeated_splits_test/*'))

# Parameters
BATCH = 16
EMBEDDING_LENGTH = 32

We can then loop through each audio file in these three folders, compute the MEL Spectrogram and save it as an image into a new folder:

# Convert Audio to Spectrograms
for file in Data_dir:
    filename, name = file, file.split('/')[-1].split('.')[0]
    ut.create_spectrogram(filename, name)

for file in Validation_dir:
    filename, name = file, file.split('/')[-1].split('.')[0]
    ut.create_spectrogram_validation(filename, name)

for file in Test_dir:
    filename, name = file, file.split('/')[-1].split('.')[0]
    ut.create_spectrogram_test(filename, name)

To do so, we have created a function in the utilities script for each of Training, Validation and Test (only the Training version is shown; the other two are sketched right after it). The function uses the librosa package to load the audio file, compute the Spectrogram and save it as an image:

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np
import path as ph
import pydub as pb
import speech_recognition as sr
import warnings


def create_spectrogram(filename, name):
    plt.interactive(False)
    # Load the audio clip and compute its MEL Spectrogram
    clip, sample_rate = librosa.load(filename, sr=None)
    fig = plt.figure(figsize=[0.72, 0.72])
    ax = fig.add_subplot(111)
    ax.axes.get_xaxis().set_visible(False)
    ax.axes.get_yaxis().set_visible(False)
    ax.set_frame_on(False)
    S = librosa.feature.melspectrogram(y=clip, sr=sample_rate)
    librosa.display.specshow(librosa.power_to_db(S, ref=np.max))
    # Save the Spectrogram as an image, without axes or padding
    filename = '../Images/Train/' + name + '.jpg'
    plt.savefig(filename, dpi=400, bbox_inches='tight', pad_inches=0)
    plt.close()
    fig.clf()
    plt.close(fig)
    plt.close('all')
    del filename, name, clip, sample_rate, fig, ax, S
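
The create_spectrogram_validation and create_spectrogram_test functions are not shown in the original listing. A minimal sketch, assuming they mirror create_spectrogram and only change the output folder (the _spectrogram_to_folder helper and the Validation/Test image paths are assumptions, not part of the original utilities script):

def _spectrogram_to_folder(filename, name, folder):
    # Hypothetical shared helper: same steps as create_spectrogram,
    # parametrised on the destination folder
    plt.interactive(False)
    clip, sample_rate = librosa.load(filename, sr=None)
    fig = plt.figure(figsize=[0.72, 0.72])
    ax = fig.add_subplot(111)
    ax.axes.get_xaxis().set_visible(False)
    ax.axes.get_yaxis().set_visible(False)
    ax.set_frame_on(False)
    S = librosa.feature.melspectrogram(y=clip, sr=sample_rate)
    librosa.display.specshow(librosa.power_to_db(S, ref=np.max))
    plt.savefig(folder + name + '.jpg', dpi=400, bbox_inches='tight', pad_inches=0)
    plt.close(fig)
    plt.close('all')


def create_spectrogram_validation(filename, name):
    _spectrogram_to_folder(filename, name, '../Images/Validation/')


def create_spectrogram_test(filename, name):
    _spectrogram_to_folder(filename, name, '../Images/Test/')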

Once we have converted each audio signal into an image of its Spectrogram, we can load the dataset containing the label of each clip. To link each audio file to its Sentiment, we create an ID column containing the name of the corresponding image file:

# Data Loading
train = pd.read_csv('../Data/MELD.Raw/train_sent_emo.csv', dtype=str)
validation = pd.read_csv('../Data/MELD.Raw/dev_sent_emo.csv', dtype=str)
test = pd.read_csv('../Data/MELD.Raw/test_sent_emo.csv', dtype=str)

# Create mapping to identify audio files
train["ID"] = 'dia' + train["Dialogue_ID"] + '_utt' + train["Utterance_ID"] + '.jpg'
validation["ID"] = 'dia' + validation["Dialogue_ID"] + '_utt' + validation["Utterance_ID"] + '.jpg'
test["ID"] = 'dia' + test["Dialogue_ID"] + '_utt' + test["Utterance_ID"] + '.jpg'

Natural Language Processing Branch

At the same time, we need to take the text associated with each audio signal and process it with NLP techniques, so that it becomes a numeric vector the Neural Network can consume. Since the MELD dataset already provides the transcripts, we can use those directly. If they were not available, we could obtain them with a speech-to-text library such as the speech_recognition package (which can call Google's speech APIs, among others), as sketched below.
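
A minimal transcription sketch, assuming the clip has first been exported to WAV (for instance with pydub, which the utilities script already imports); the file path is hypothetical:

import speech_recognition as sr

recognizer = sr.Recognizer()
with sr.AudioFile('../Audio/dia0_utt0.wav') as source:  # hypothetical WAV export of the clip
    audio = recognizer.record(source)

# Uses the Google Web Speech API under the hood
transcript = recognizer.recognize_google(audio)

Since MELD already ships with the transcripts, we simply tokenise them and pad the resulting sequences: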

# Text Features
tokenizer = kt.Tokenizer(num_words=5000)
tokenizer.fit_on_texts(train['Utterance'])

vocab_size = len(tokenizer.word_index) + 1

train_tokens = tokenizer.texts_to_sequences(train['Utterance'])
text_features = pd.DataFrame(ks.pad_sequences(train_tokens, maxlen=200))

validation_tokens = tokenizer.texts_to_sequences(validation['Utterance'])
validation_features = pd.DataFrame(ks.pad_sequences(validation_tokens, maxlen=200))

We first fit a tokeniser on the training utterances and then convert each sentence into a numeric vector, padded to a fixed length of two hundred.
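
As a quick sanity check, this is how a single new sentence would flow through the same tokeniser (the example sentence is arbitrary):

sample = tokenizer.texts_to_sequences(["I cannot believe you did that!"])
sample = ks.pad_sequences(sample, maxlen=200)
print(sample.shape)  # (1, 200): one sample, padded to the fixed length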

Data Pipeline

One of the trickiest aspects of putting together multi-media inputs is the creation of a custom Data Generator. This is essentially a function that returns the next batch of inputs to the model every time it is called. Using Keras’ pre-made generators is relatively easy, but there is no built-in implementation that merges multiple inputs and guarantees that they are fed to the model side by side, in the same order and batch size.

The following code is general enough to be reused in other settings, not just in this example. It takes the folder where the images are located and a “normal” tabular dataset, with samples in the rows and features in the columns, and iteratively yields the next batch of images and text features, both of the same size (a Validation variant is sketched after the listing):

# Data Pipeline
def train_generator(features, batch):
    # Image Generator: streams the Spectrogram images listed in the train dataframe
    train_generator = ki.ImageDataGenerator(rescale=1. / 255.)
    train_generator = train_generator.flow_from_dataframe(
        dataframe=train,
        directory="../Images/Train/",
        x_col="ID",
        y_col="Sentiment",
        batch_size=batch,
        seed=0,
        shuffle=False,
        class_mode="categorical",
        target_size=(64, 64))

    # Text Generator: iterates the rows of the text features in the same order
    train_iterator = features.iterrows()
    j = 0
    i = 0
    while True:
        genX2 = pd.DataFrame(columns=features.columns)
        while i < batch:
            k, r = train_iterator.__next__()
            r = pd.DataFrame([r], columns=genX2.columns)
            genX2 = genX2.append(r)
            j += 1
            i += 1
            if j == train.shape[0]:
                # End of an epoch: yield the last (possibly partial) batch
                # and rebuild both generators so they stay aligned
                X1i = train_generator.next()
                train_generator = ki.ImageDataGenerator(rescale=1. / 255.)
                train_generator = train_generator.flow_from_dataframe(
                    dataframe=train,
                    directory="../Images/Train/",
                    x_col="ID",
                    y_col="Sentiment",
                    batch_size=batch,
                    seed=0,
                    shuffle=False,
                    class_mode="categorical",
                    target_size=(64, 64))
                train_iterator = features.iterrows()
                i = 0
                j = 0
                X2i = genX2
                genX2 = pd.DataFrame(columns=features.columns)
                yield [X1i[0], tf.convert_to_tensor(X2i.values, dtype=tf.float32)], X1i[1]
        # Regular case: one full batch of images and the matching batch of text features
        X1i = train_generator.next()
        X2i = genX2
        i = 0
        yield [X1i[0], tf.convert_to_tensor(X2i.values, dtype=tf.float32)], X1i[1]
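
The validation_generator used during training and evaluation below is not shown in the original listing. A minimal sketch, assuming the validation Spectrograms were saved to ../Images/Validation/ and keeping the image flow and the text features aligned by slicing the dataframe in the same, unshuffled order:

def validation_generator(features, batch):
    n = validation.shape[0]
    while True:
        # Rebuild the image flow at the start of every pass so both inputs stay aligned
        image_flow = ki.ImageDataGenerator(rescale=1. / 255.).flow_from_dataframe(
            dataframe=validation,
            directory="../Images/Validation/",
            x_col="ID",
            y_col="Sentiment",
            batch_size=batch,
            seed=0,
            shuffle=False,
            class_mode="categorical",
            target_size=(64, 64))
        for start in range(0, n, batch):
            X1i = image_flow.next()
            # Matching slice of text features (handles the final partial batch too)
            X2i = features.iloc[start:start + batch].values.astype('float32')
            yield [X1i[0], tf.convert_to_tensor(X2i)], X1i[1]

A generator for the Test split could be built in the same way, swapping in the test dataframe and its image folder.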

Neural Network Architecture

Finally, we can create the model that takes images (Spectrograms) and text (transcriptions) as inputs and processes them.

Inputs

The inputs consist of 64×64-pixel images with 3 channels (RGB) and numeric feature vectors of length 200 representing the encoded text:

# Model
# Inputs
images = kl.Input(shape=(64, 64, 3))
features = kl.Input(shape=(200,))

Image Classification (CNN) Branch

The Image Classification branch consists of an initial VGG19 network followed by a set of custom layers. Using VGG19 lets us take advantage of Transfer Learning: since the network has been pre-trained on the ImageNet dataset, its weights already start at values that are meaningful for image classification tasks. Its output is fed into a series of layers that learn the specific characteristics of Spectrogram images, before being passed on to the common layers:

# Transfer Learning Bases
vgg19 = ka.VGG19(weights='imagenet', include_top=False)
vgg19.trainable = False

# Image Classification Branch
x = vgg19(images)
x = kl.GlobalAveragePooling2D()(x)
x = kl.Dense(32, activation='relu')(x)
x = kl.Dropout(rate=0.25)(x)
x = km.Model(inputs=images, outputs=x)

Text Classification (NLP) Branch

The NLP branch uses an Embedding layer followed by a Long Short-Term Memory (LSTM) layer to process the text. Dropout layers are also added to prevent the model from overfitting, similar to what was done in the CNN branch:

# Text Classification Branch
y = kl.Embedding(vocab_size, EMBEDDING_LENGTH, input_length=200)(features)
y = kl.SpatialDropout1D(0.25)(y)
y = kl.LSTM(25, dropout=0.25, recurrent_dropout=0.25)(y)
y = kl.Dropout(0.25)(y)
y = km.Model(inputs=features, outputs=y)

Common Layers

We can then concatenate the two branch outputs and feed them into a series of Dense layers. These layers can capture information that only emerges when the audio and text signals are combined and is not identifiable from either input individually.

We use the Adam optimizer with a learning rate of 0.0001, a common and robust default for this kind of model:

combined = kl.concatenate([x.output, y.output])

z = kl.Dense(32, activation="relu")(combined)
z = kl.Dropout(rate=0.25)(z)
z = kl.Dense(32, activation="relu")(z)
z = kl.Dropout(rate=0.25)(z)
z = kl.Dense(3, activation="softmax")(z)

model = km.Model(inputs=[x.input, y.input], outputs=z)

model.compile(optimizer=ko.Adam(lr=0.0001), loss='categorical_crossentropy', metrics=['accuracy'])

model.summary()

Model Training

The model can then be trained using the Training and Validation generators we created earlier:

# Hyperparameters
EPOCHS = 13
TRAIN_STEPS = int(np.floor(train.shape[0] / BATCH))
VALIDATION_STEPS = int(np.floor(validation.shape[0] / BATCH))

# Model Training
model.fit_generator(generator=train_generator(text_features, BATCH),
                    steps_per_epoch=TRAIN_STEPS,
                    validation_data=validation_generator(validation_features, BATCH),
                    validation_steps=VALIDATION_STEPS,
                    epochs=EPOCHS)

Model Evaluation

Lastly, the model is evaluated on a held-out set, for instance the Validation set, and then saved to a file that can be loaded in “live” scenarios:

# Performance Evaluation
# Validation
model.evaluate_generator(generator=validation_generator(validation_features, BATCH),
                         steps=VALIDATION_STEPS)

# Save the Model and Labels
model.save('Model.h5')
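
For “live” use we also need the fitted tokenizer, otherwise new utterances cannot be encoded consistently. A minimal sketch, assuming pickle (already imported as pk) is used to persist it alongside the model; the file names and the example utterance are arbitrary:

# Persist the fitted tokenizer next to the model
with open('Tokenizer.pkl', 'wb') as f:
    pk.dump(tokenizer, f)

# Later, in the live application: reload both and score a new utterance
live_model = km.load_model('Model.h5')
with open('Tokenizer.pkl', 'rb') as f:
    live_tokenizer = pk.load(f)

new_text = ks.pad_sequences(live_tokenizer.texts_to_sequences(["How wonderful."]), maxlen=200)
# new_image would be the (1, 64, 64, 3) Spectrogram of the matching audio clip, rescaled to [0, 1]
# prediction = live_model.predict([new_image, new_text])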

Summary

Overall, we built a system able to take multiple types of inputs (images, text, etc.), preprocess them and feed them to a Neural Network with one branch per input. Each branch processes its input individually before converging into a common set of layers that predicts the final output.

The specific steps are:

  • Data Loading
  • Preprocessing Each Input Separately (Spectrograms, Tokenisation)
  • Creating a Custom Data Generator
  • Building the Model Architecture
  • Model Training
  • Performance Evaluation
