Classifying music genres with CNNs

Category: IT Technology · Published: 4 years ago


Photo Creds: https://unsplash.com/

Convolutional neural networks (CNNs) are one of the key technologies behind streaming services and leading music platforms. A neural network can identify similar compositions by genre, mood, and so on. All this makes the music world better and more enjoyable. But how? Let’s take a look at the essentials: how a CNN works, how it relates to music, and how to implement a music classifier in TensorFlow.

It is going to be an intense guide, but relax, it will be exciting and fun ;)

Convolutional Neural Networks basics: Quick overview

So, what is a convolutional neural network (CNN) in machine learning? In a nutshell, it’s an advanced type of neural network that is mainly used for processing images. A CNN has an input layer, an output layer, and various hidden layers. Some of these layers are convolutional: they apply a mathematical operation (convolution) to their input and pass the result on to successive layers. This loosely mimics some of the processing in the human visual cortex.

If you give the neural network hundreds or thousands of images of a particular subject, the CNN will process these photos in several layers. The first layers pick up simple features such as changes in brightness and edges, the next layers distinguish shapes and objects, and the more of these layers there are, the better the network identifies the object that dominates those images.

What does that mean? Say we feed the CNN many photos of elephants. After training, we can show it any image, and the CNN will tell us whether there is an elephant in it or not.

A CNN can also process video: as the pixels on the screen change from frame to frame, the network can examine the changing pattern and recognize the object. Of course, this is much more complicated than identifying an object in a single image. In both cases, the CNN serves as a model that learns to distinguish objects and to recognize them better.

What’s good about CNN?

I can single out a couple of things:

  • CNNs have many times fewer parameters than dense layers. (A dense layer feeds all outputs from the previous layer to all its neurons, each neuron providing one output to the next layer. It’s the most basic layer in neural networks.)
  • CNNs preserve the structure of the image data: pixels, edges, and shapes all keep their positions, because the CNN extracts these features together with where they occur.
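To make the first point concrete, here is a rough parameter count. The sizes are illustrative assumptions borrowed from the implementation later in this article (a 130×13 single-channel input, a 64-unit dense layer, a convolution with 32 filters of size 3×3); the helper names are made up for this sketch.

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of a fully connected layer."""
    return n_inputs * n_units + n_units


def conv2d_params(kernel_h, kernel_w, in_channels, n_filters):
    """Weights plus biases of a 2D convolutional layer."""
    return kernel_h * kernel_w * in_channels * n_filters + n_filters


# Dense layer applied directly to a flattened 130 x 13 x 1 input
dense = dense_params(130 * 13 * 1, 64)

# Convolutional layer applied to the same input: the kernel weights are
# shared across all positions, so the input size does not matter
conv = conv2d_params(3, 3, 1, 32)

print("dense layer parameters:", dense)
print("conv layer parameters: ", conv)
```

The convolutional layer needs only a few hundred parameters here because the same small kernel is reused at every position of the input.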

CNN + music

For music, things are a little different and, therefore, more complicated. Any audio recording is a signal whose amplitude changes over time, and it can also be processed as an image:

Stereophonic sound

It all sounds like magic, but it’s really just CNN and the simple (or not quite simple) principle behind it. What is the principle? Let’s figure it out.

What is convolution?

How does a CNN differ from a simple neural network? The answer is easier than you think: it is the C, or, put differently, convolution.

A first question to answer with CNNs is why they are called Convolutional in the first place.

Convolution is a mathematical concept used heavily in Digital Signal Processing when dealing with signals that take the form of a time series. Convolution is a mechanism to combine or “blend” two functions of time in a coherent manner. It can be mathematically described as follows:

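For a discrete domain of one variable:

$$(f * g)[n] = \sum_{m=-\infty}^{\infty} f[m]\, g[n - m]$$

For a discrete domain of two variables:

$$(f * g)[x, y] = \sum_{i=-\infty}^{\infty} \sum_{j=-\infty}^{\infty} f[i, j]\, g[x - i,\, y - j]$$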
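The one-variable case can be sketched in a few lines of NumPy. This assumes finite sequences that are zero outside their range and the standard definition, in which output sample n is the sum over m of f[m]·g[n − m]. The function name convolve_1d is made up for illustration, and the result is checked against NumPy's built-in np.convolve.

```python
import numpy as np


def convolve_1d(f, g):
    """Discrete convolution of two finite sequences (full output)."""
    f = np.asarray(f, dtype=float)
    g = np.asarray(g, dtype=float)
    out = np.zeros(len(f) + len(g) - 1)
    for n in range(len(out)):
        for m in range(len(f)):
            # g is treated as zero outside its valid index range
            if 0 <= n - m < len(g):
                out[n] += f[m] * g[n - m]
    return out


signal = [1, 2, 3]
kernel = [0, 1, 0.5]

result = convolve_1d(signal, kernel)
print(result)
print(np.allclose(result, np.convolve(signal, kernel)))
```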

The central point of a CNN is convolution, and to understand it we need to learn what a kernel is.

A kernel is a filter: a small grid of weights, for example a 3×3 matrix.

Say we have an image and we need to process it with a CNN. Our first step is to apply a kernel to it. How? Let’s see how a CNN works in a simple example.

Take a picture of a cat. Just like any other picture, it consists of a certain number of pixels with different shades and colors. The picture is black and white, so we deal with various shades between these two colors (more precisely, a single pixel can take a value in the range [0, 255]). If we assign each shade a number, say from 0 to 10, we can translate the picture into a grid in which every number represents a shade.

What will happen if we take a colored photo? Nothing special, RGB is our best friend:

Separate color channels (3 in the case of RGB images) introduce an additional ‘depth’ dimension to the data, making the input 3-dimensional. Hence, for a given RGB image of size, say, 255×255 (width × height) pixels, we’ll have 3 matrices associated with the image, one for each color channel. Thus the image in its entirety constitutes a 3-dimensional structure called the Input Volume (255×255×3).
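As a quick illustration of the input volume, the sketch below builds a random stand-in for a 255×255 RGB image (no real photo is loaded, and the fixed seed is an arbitrary assumption) and splits it into its three channel matrices. NumPy conventionally orders the axes as height, width, channels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random stand-in for an RGB image: every pixel value lies in [0, 255]
image = rng.integers(0, 256, size=(255, 255, 3), dtype=np.uint8)

# One 255 x 255 matrix per color channel
red, green, blue = (image[..., c] for c in range(3))

print("input volume:", image.shape)
print("single channel:", red.shape)
```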

The next step is what is called a convolution: we overlay the kernel on the input and apply a specific formula to transform the input data into an output.

A kernel is an operator applied to the entirety of the image such that it transforms the information encoded in the pixels. In practice, it is a small matrix that is slid across the image and multiplied with the input such that the output is enhanced in a certain desirable manner.

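Here is a minimal sketch of sliding a kernel across an image. The tiny image, the vertical-edge kernel, and the function name apply_kernel are assumptions chosen for illustration. Like most deep learning libraries, the sketch does not flip the kernel (strictly speaking this is cross-correlation), and it uses no padding, so the output is smaller than the input.

```python
import numpy as np


def apply_kernel(image, kernel):
    """Slide the kernel over the image; multiply and sum at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(window * kernel)
    return out


# 4 x 5 grayscale image: dark on the left, bright on the right
image = np.array([[0, 0, 0, 10, 10]] * 4, dtype=float)

# 3 x 3 kernel that responds to vertical edges
kernel = np.array([[-1, 0, 1]] * 3, dtype=float)

edges = apply_kernel(image, kernel)
print(edges)
```

The output is zero where the image is flat and large where the window overlaps the dark-to-bright transition, which is the sense in which a kernel extracts a feature.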

https://vision.unipv.it/CV/20200113%20-%20Computer%20Vision%20Applications.pdf

So, generally, we use kernels to extract features. In the first layers we extract low-level features; as we go deeper and add more layers, the CNN detects whole objects as well as the smallest details.

Kernels do the larger part of a CNN’s work, but other parts of the procedure matter too: the rectified linear unit (ReLU), the pooling layer, and the fully connected layer. Together they make up a modern CNN. I won’t dwell on each of these parts here, because it would take a lot of time. If you want to have the whole picture of this procedure in your mind, I recommend watching this video:

To learn more about this you can also follow these links:

CNN for Music Genre Classification


As mentioned before, to process audio and extract useful insight from it, we can process it as an image and conduct a standard CNN procedure. A popular method in the audio domain is to use a spectrogram (derived from the Fast Fourier Transform and/or other transformations) as an input to a CNN and to apply convolving filter kernels that extract patterns in 2D.
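A minimal NumPy sketch of the spectrogram step follows. It assumes a mono signal, a Hann window, and illustrative values for the sample rate, FFT size, and hop length (none of these come from the article). The helper name spectrogram is hypothetical; in practice an audio library would typically do this job.

```python
import numpy as np


def spectrogram(signal, n_fft=512, hop=256):
    """Magnitude spectrogram via a short-time Fourier transform."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([
        signal[i * hop:i * hop + n_fft] * window
        for i in range(n_frames)
    ])
    # rows: frequency bins, columns: time frames (a 2D "image")
    return np.abs(np.fft.rfft(frames, axis=1)).T


sr = 8000                              # assumed sample rate in Hz
t = np.arange(sr) / sr                 # one second of audio
tone = np.sin(2 * np.pi * 440 * t)     # pure 440 Hz test tone

spec = spectrogram(tone)
freqs = np.fft.rfftfreq(512, d=1 / sr)
peak_freq = freqs[spec.mean(axis=1).argmax()]

print("spectrogram shape:", spec.shape)
print("strongest frequency bin (Hz):", peak_freq)
```

The resulting 2D array can be fed to a CNN in the same way as a grayscale image, with frequency on one axis and time on the other.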

Everything seems quite logical, but here is the most interesting question: how do we classify different music genres? What principle is behind this task?

The classic workflow for classifying music genres with a CNN:

1. create train, validation and test sets

2. build the CNN net

3. compile the network

4. train the CNN

5. evaluate the CNN on the test set

6. make a prediction on a sample

Music genres are a set of descriptive keywords that convey high-level information about a music clip (jazz, classical, rock…). Genre classification is a task that aims to predict music genre using the audio signal.

Building this system requires extracting acoustic features that are good estimators of the type of genres we are interested in, followed by a single-label or multi-label classification stage or, in some cases, a regression stage. Conventionally, feature extraction relies on a signal processing front-end in order to compute relevant features from a time-domain or frequency-domain audio representation. The features are then used as input to the machine learning stage.

CNNs assume features that are in different levels of hierarchy and can be extracted by convolutional kernels. The hierarchical features are learned to achieve a given task during supervised training. For example, the features learned by a CNN that is trained for genre classification range from low-level features (e.g., onsets) to high-level features (e.g., percussive instrument patterns).

Classifying music genres with CNNs in TensorFlow


Here is a practical implementation of the music classification steps described above:

    import json
    import numpy as np
    from sklearn.model_selection import train_test_split
    import tensorflow.keras as keras
    import matplotlib.pyplot as plt

    DATA_PATH = "../13/data_10.json"


    def load_data(data_path):
        """Loads training dataset from json file.

        :param data_path (str): Path to json file containing data
        :return X (ndarray): Inputs
        :return y (ndarray): Targets
        """
        with open(data_path, "r") as fp:
            data = json.load(fp)

        X = np.array(data["mfcc"])
        y = np.array(data["labels"])
        return X, y


    def plot_history(history):
        """Plots accuracy/loss for training/validation set as a function of the epochs

        :param history: Training history of model
        :return:
        """
        fig, axs = plt.subplots(2)

        # create accuracy subplot
        axs[0].plot(history.history["accuracy"], label="train accuracy")
        axs[0].plot(history.history["val_accuracy"], label="test accuracy")
        axs[0].set_ylabel("Accuracy")
        axs[0].legend(loc="lower right")
        axs[0].set_title("Accuracy eval")

        # create error subplot
        axs[1].plot(history.history["loss"], label="train error")
        axs[1].plot(history.history["val_loss"], label="test error")
        axs[1].set_ylabel("Error")
        axs[1].set_xlabel("Epoch")
        axs[1].legend(loc="upper right")
        axs[1].set_title("Error eval")

        plt.show()


    def prepare_datasets(test_size, validation_size):
        """Loads data and splits it into train, validation and test sets.

        :param test_size (float): Value in [0, 1] indicating percentage of data set to allocate to test split
        :param validation_size (float): Value in [0, 1] indicating percentage of train set to allocate to validation split
        :return X_train (ndarray): Input training set
        :return X_validation (ndarray): Input validation set
        :return X_test (ndarray): Input test set
        :return y_train (ndarray): Target training set
        :return y_validation (ndarray): Target validation set
        :return y_test (ndarray): Target test set
        """
        # load data
        X, y = load_data(DATA_PATH)

        # create train, validation and test split
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)
        X_train, X_validation, y_train, y_validation = train_test_split(
            X_train, y_train, test_size=validation_size)

        # add an axis to input sets
        X_train = X_train[..., np.newaxis]
        X_validation = X_validation[..., np.newaxis]
        X_test = X_test[..., np.newaxis]

        return X_train, X_validation, X_test, y_train, y_validation, y_test


    def build_model(input_shape):
        """Generates CNN model

        :param input_shape (tuple): Shape of input set
        :return model: CNN model
        """
        # build network topology
        model = keras.Sequential()

        # 1st conv layer
        model.add(keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=input_shape))
        model.add(keras.layers.MaxPooling2D((3, 3), strides=(2, 2), padding='same'))
        model.add(keras.layers.BatchNormalization())

        # 2nd conv layer
        model.add(keras.layers.Conv2D(32, (3, 3), activation='relu'))
        model.add(keras.layers.MaxPooling2D((3, 3), strides=(2, 2), padding='same'))
        model.add(keras.layers.BatchNormalization())

        # 3rd conv layer
        model.add(keras.layers.Conv2D(32, (2, 2), activation='relu'))
        model.add(keras.layers.MaxPooling2D((2, 2), strides=(2, 2), padding='same'))
        model.add(keras.layers.BatchNormalization())

        # flatten output and feed it into dense layer
        model.add(keras.layers.Flatten())
        model.add(keras.layers.Dense(64, activation='relu'))
        model.add(keras.layers.Dropout(0.3))

        # output layer
        model.add(keras.layers.Dense(10, activation='softmax'))

        return model


    def predict(model, X, y):
        """Predict a single sample using the trained model

        :param model: Trained classifier
        :param X: Input data
        :param y (int): Target
        """
        # add a dimension to input data for sample: model.predict() expects a 4d array in this case
        X = X[np.newaxis, ...]  # array shape (1, 130, 13, 1)

        # perform prediction
        prediction = model.predict(X)

        # get index with max value
        predicted_index = np.argmax(prediction, axis=1)

        print("Target: {}, Predicted label: {}".format(y, predicted_index))


    if __name__ == "__main__":
        # get train, validation, test splits
        X_train, X_validation, X_test, y_train, y_validation, y_test = prepare_datasets(0.25, 0.2)

        # create network
        input_shape = (X_train.shape[1], X_train.shape[2], 1)
        model = build_model(input_shape)

        # compile model
        optimiser = keras.optimizers.Adam(learning_rate=0.0001)
        model.compile(optimizer=optimiser,
                      loss='sparse_categorical_crossentropy',
                      metrics=['accuracy'])

        model.summary()

        # train model
        history = model.fit(X_train, y_train, validation_data=(X_validation, y_validation),
                            batch_size=32, epochs=30)

        # plot accuracy/error for training and validation
        plot_history(history)

        # evaluate model on test set
        test_loss, test_acc = model.evaluate(X_test, y_test, verbose=2)
        print('\nTest accuracy:', test_acc)

        # pick a sample to predict from the test set
        X_to_predict = X_test[100]
        y_to_predict = y_test[100]

        # predict sample
        predict(model, X_to_predict, y_to_predict)
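To see how the feature maps shrink as they pass through build_model, the sketch below recomputes the layer output shapes by hand. It assumes the (130, 13, 1) input shape mentioned in the predict comment and the Keras defaults: Conv2D with stride 1 and no padding, and MaxPooling2D with padding 'same', where each output dimension is ceil(size / stride). The helper names are made up for illustration; BatchNormalization is skipped because it does not change the shape.

```python
import math


def conv_valid(shape, kernel, filters):
    """Output shape of a stride-1 convolution without padding."""
    h, w, _ = shape
    return (h - kernel[0] + 1, w - kernel[1] + 1, filters)


def pool_same(shape, strides):
    """Output shape of a pooling layer with padding='same'."""
    h, w, c = shape
    return (math.ceil(h / strides[0]), math.ceil(w / strides[1]), c)


shape = (130, 13, 1)   # assumed MFCC input: time frames x coefficients x channel
shapes = [shape]

# three conv + pooling stages, mirroring build_model
for kernel in [(3, 3), (3, 3), (2, 2)]:
    shape = conv_valid(shape, kernel, filters=32)
    shapes.append(shape)
    shape = pool_same(shape, strides=(2, 2))
    shapes.append(shape)

flattened = shape[0] * shape[1] * shape[2]

for s in shapes:
    print(s)
print("features after Flatten:", flattened)
```

Under these assumptions the dense layer receives a few hundred features rather than the full flattened input, and model.summary() can be used to confirm the shapes on real data.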


Thanks for stopping by. If you found this post interesting, feel free to follow me on Instagram, Medium, and LinkedIn.

Cheers!

