A CNN is one of the core technologies behind streaming services and leading music platforms. The neural network can identify similar compositions by genre, mood, and more. All this makes the music world better and more enjoyable. But how? Let’s take a look at the essentials: how a CNN works, how it relates to music, and how to implement a music classifier in TensorFlow.
It is going to be an intense guide, but relax, it will be exciting and fun ;)
Convolutional Neural Networks basics: Quick overview
So, what is a convolutional neural network (CNN) in machine learning? In a nutshell, it’s an advanced type of neural network that is mainly used for processing images. A CNN has an input layer, an output layer, and various hidden layers. Some of these layers are convolutional: they apply a mathematical operation to their input and pass the result on to successive layers. This simulates some of the processing that happens in the human visual cortex.
If you give the neural network hundreds or thousands of images of a particular subject, the CNN will process those photos through several layers. The first layers distinguish low-level details such as gray points and edges, the next layers differentiate shapes and objects, and the more of these layers there are, the better the network identifies the object that dominates the images.
What does that mean in practice? Say we feed the CNN many photos of elephants. We can then show it any image, and it will tell us whether there is an elephant in it or not.
A CNN can also process video: as the pixels on the screen appear and change, the network can examine the changing pattern and recognize the objects in it. Of course, this procedure is much more complicated than identifying a simple still image, but the same kind of model can learn to distinguish objects and recognize them better over time.
What’s good about CNN?
We can single out several advantages at once:
- CNNs share the same kernel weights across the entire input, so they need far fewer parameters than dense layers. (A dense layer feeds all outputs from the previous layer to all of its neurons, each neuron providing one output to the next layer; it’s the most basic layer in neural networks.)
- CNNs preserve the structure of image data: every pixel, edge, and shape keeps its position, because the CNN extracts features while respecting where they occur.
CNN + music
For music, things are a little different and, therefore, more complicated. Any audio recording is a signal of amplitudes over time, and that signal can also be rendered (for example, as a spectrogram) and processed as an image.
It all sounds like magic, but it’s really just a CNN and the simple (or not quite simple) principle behind it. What is the principle? Let’s figure it out.
What is convolution?
How does a CNN differ from a simple neural network? The answer is easier than you think: it is the C, or to put it differently, the convolution. So the first question to answer is why these networks are called convolutional in the first place.
Convolution is a mathematical concept used heavily in Digital Signal Processing when dealing with signals that take the form of a time series. Convolution is a mechanism to combine or “blend” two functions of time in a coherent manner. It can be mathematically described as follows:
For a discrete domain of one variable:

$$(f * g)[n] = \sum_{m=-\infty}^{\infty} f[m]\, g[n - m]$$

For a discrete domain of two variables:

$$(f * g)[n_1, n_2] = \sum_{m_1=-\infty}^{\infty} \sum_{m_2=-\infty}^{\infty} f[m_1, m_2]\, g[n_1 - m_1, n_2 - m_2]$$
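To make the formulas concrete, here is a minimal sketch in Python (the sample signal, image, and kernels are made up for illustration; NumPy and SciPy provide the ready-made convolution routines):

```python
import numpy as np
from scipy.signal import convolve2d

# 1D discrete convolution: each output sample is a weighted sum of the
# input, with the kernel flipped and slid along the signal
signal = np.array([0.0, 1.0, 2.0, 3.0, 2.0, 1.0, 0.0])
kernel_1d = np.array([0.25, 0.5, 0.25])  # simple smoothing kernel
print(np.convolve(signal, kernel_1d, mode="same"))

# 2D discrete convolution: the same idea over two indices,
# which is exactly the operation a CNN applies to an image
image = np.arange(25, dtype=float).reshape(5, 5)
kernel_2d = np.array([[0,  1, 0],
                      [1, -4, 1],
                      [0,  1, 0]])  # Laplacian edge-detection kernel
print(convolve2d(image, kernel_2d, mode="same"))
```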
The central point of a CNN is convolution, and to understand it we need to learn about the kernel.
A kernel is a filter: a small grid of weights, typically a 3×3 or 5×5 matrix.
Say we have an image and we need to process it with a CNN. Our first step is to apply a kernel to it. How? Let’s see how it works on a simple example.
Take a picture of a cat: just like any other picture, it consists of a certain number of pixels with different shades and colors. If the picture is black and white, we deal with various shades of gray (more precisely, each pixel can take a value in the range [0, 255]). By assigning each shade a numeric value, we can translate the picture into a grid of numbers.
What will happen if we take a colored photo? Nothing special, RGB is our best friend:
Separate color channels (3 in the case of RGB images) introduce an additional ‘depth’ dimension to the data, making the input 3-dimensional. Hence, for a given RGB image of size, say, 255×255 (width × height) pixels, we’ll have 3 matrices associated with the image, one for each of the color channels. Thus the image in its entirety constitutes a 3-dimensional structure called the input volume (255×255×3).
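As a quick sanity check, here is that input volume in NumPy (a sketch; the file name cat.jpg is a placeholder):

```python
import numpy as np
from PIL import Image

# load an image and force it into RGB mode
img = np.asarray(Image.open("cat.jpg").convert("RGB"))

print(img.shape)  # e.g. (255, 255, 3): height x width x color channels
print(img[0, 0])  # a single pixel: its [R, G, B] values, each in [0, 255]
```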
The next step is the convolution itself: we slide the kernel over the input data according to the formula above, transforming it into an output.
A kernel is an operator applied to the entirety of the image such that it transforms the information encoded in the pixels. In practice, it is a small matrix that is slid across the image and multiplied with the input so that the output is enhanced in a certain desirable manner.
So, generally, we use kernels to extract features. In the first layers we extract low-level features; as we go deeper and stack more layers, the network detects whole objects as well as their smallest details.
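To spell out that sliding mechanic, here is a deliberately naive sketch of the operation (real frameworks implement it far more efficiently):

```python
import numpy as np

def slide_kernel(image, kernel):
    """Naive 'valid' convolution: slide the kernel across the image,
    multiply element-wise at each position, and sum the products.
    Note: like most deep learning libraries, this skips the kernel flip
    of textbook convolution (strictly, it computes cross-correlation)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw]
            output[i, j] = np.sum(patch * kernel)
    return output

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.array([[1.0,  0.0],
                   [0.0, -1.0]])  # toy 2x2 kernel
print(slide_kernel(image, kernel))  # 3x3 feature map
```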
Kernels do a large part of a CNN’s work, but there are other important parts of the procedure: the rectified linear unit (ReLU), the pooling layer, and the fully connected layer. Together they make up a modern CNN. I will not stop on each of these parts here, since that would take a lot of time. If you want to get the whole picture in your mind, I recommend the following video and articles:
- https://www.youtube.com/watch?v=t3qWfUYJEYU&list=PL-wATfeyAMNrtbkCNsLcpoAyBBRJZVlnf&index=15
- https://blog.xrds.acm.org/2016/06/convolutional-neural-networks-cnns-illustrated-explanation/
CNN for Music Genre Classification
As mentioned before, to extract useful insight from audio, we can treat the audio as an image and apply a standard CNN procedure. A popular method in the audio domain is to use a spectrogram (derived from the Fast Fourier Transform and/or other transformations) as the input to a CNN and to apply convolutional filter kernels that extract patterns in 2D.
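As a sketch of that input pipeline (assuming the librosa library and a hypothetical file audio.wav), a spectrogram can be computed like this:

```python
import numpy as np
import librosa

# load the audio; the file name is a placeholder
signal, sr = librosa.load("audio.wav", sr=22050)

# Short-Time Fourier Transform -> magnitude spectrogram in decibels
stft = librosa.stft(signal, n_fft=2048, hop_length=512)
spectrogram = librosa.amplitude_to_db(np.abs(stft))

# a 2D array (frequency bins x time frames): an "image" a CNN can consume
print(spectrogram.shape)
```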
Everything seems quite logical, but here is the most interesting question: how do we classify different music genres? What principle lies behind this task?
Here is the classic algorithm for the task of classifying music genres with a CNN:
1. create train, validation and test sets
2. build the CNN net
3. compile the network
4. train the CNN
5. evaluate the CNN on the test set
6. make a prediction on a sample
Music genre labels are descriptive keywords that convey high-level information about a music clip (jazz, classical, rock…). Genre classification is a task that aims to predict these labels from the audio signal.
Building this system requires extracting acoustic features that are good estimators of the genres we are interested in, followed by a single-label or multi-label classification stage (or, in some cases, regression). Conventionally, feature extraction relies on a signal-processing front end that computes relevant features from a time-domain or frequency-domain audio representation. The features are then used as input to the machine learning stage.
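The implementation in the next section loads exactly this kind of precomputed feature set (MFCCs plus labels) from a JSON file. As a hedged sketch of how such a file could be produced (the clip file names and label mapping here are assumptions, not part of the original code):

```python
import json
import librosa

def extract_mfcc(file_path, n_mfcc=13, n_fft=2048, hop_length=512):
    """Compute an MFCC matrix (time frames x coefficients) for one clip."""
    signal, sr = librosa.load(file_path, sr=22050)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc,
                                n_fft=n_fft, hop_length=hop_length)
    return mfcc.T.tolist()

# hypothetical clips with integer genre labels (e.g. 0 = jazz, 1 = rock);
# in practice, tracks are split into equal-length segments so every MFCC
# matrix has the same number of frames (e.g. 130, as in the code below)
data = {
    "mfcc": [extract_mfcc("jazz_clip.wav"), extract_mfcc("rock_clip.wav")],
    "labels": [0, 1],
}

with open("data_10.json", "w") as fp:
    json.dump(data, fp)
```

The training script below then reads the "mfcc" and "labels" arrays straight out of this file.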
CNNs assume that features at different levels of a hierarchy can be extracted by convolutional kernels. The hierarchical features are learned during supervised training to achieve a given task. For example, features learned by a CNN trained for genre classification range from low-level ones (e.g., onsets) to high-level ones (e.g., percussive instrument patterns).
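One way to peek at those learned features is to read the kernels out of the first convolutional layer. A minimal sketch (using an untrained stand-in for the model built in the next section; after training, these kernels act as low-level feature detectors):

```python
import numpy as np
import tensorflow.keras as keras

# a small stand-in for the first layer of the model defined below
model = keras.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation="relu", input_shape=(130, 13, 1)),
])

kernels, biases = model.layers[0].get_weights()

# kernel tensor shape: (kernel_h, kernel_w, input_channels, num_filters)
print(kernels.shape)  # (3, 3, 1, 32)

# each slice kernels[:, :, 0, k] is one 3x3 filter
print(np.round(kernels[:, :, 0, 0], 3))
```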
Classifying music genres with CNNs in TensorFlow
Here is a practical implementation of the music classification steps described above:
```python
import json
import numpy as np
from sklearn.model_selection import train_test_split
import tensorflow.keras as keras
import matplotlib.pyplot as plt

DATA_PATH = "../13/data_10.json"


def load_data(data_path):
    """Loads training dataset from json file.

    :param data_path (str): Path to json file containing data
    :return X (ndarray): Inputs
    :return y (ndarray): Targets
    """
    with open(data_path, "r") as fp:
        data = json.load(fp)

    X = np.array(data["mfcc"])
    y = np.array(data["labels"])
    return X, y


def plot_history(history):
    """Plots accuracy/loss for training/validation set as a function of the epochs

    :param history: Training history of model
    """
    fig, axs = plt.subplots(2)

    # create accuracy subplot
    axs[0].plot(history.history["accuracy"], label="train accuracy")
    axs[0].plot(history.history["val_accuracy"], label="test accuracy")
    axs[0].set_ylabel("Accuracy")
    axs[0].legend(loc="lower right")
    axs[0].set_title("Accuracy eval")

    # create error subplot
    axs[1].plot(history.history["loss"], label="train error")
    axs[1].plot(history.history["val_loss"], label="test error")
    axs[1].set_ylabel("Error")
    axs[1].set_xlabel("Epoch")
    axs[1].legend(loc="upper right")
    axs[1].set_title("Error eval")

    plt.show()


def prepare_datasets(test_size, validation_size):
    """Loads data and splits it into train, validation and test sets.

    :param test_size (float): Value in [0, 1] indicating percentage of data set to allocate to test split
    :param validation_size (float): Value in [0, 1] indicating percentage of train set to allocate to validation split
    :return X_train (ndarray): Input training set
    :return X_validation (ndarray): Input validation set
    :return X_test (ndarray): Input test set
    :return y_train (ndarray): Target training set
    :return y_validation (ndarray): Target validation set
    :return y_test (ndarray): Target test set
    """
    # load data
    X, y = load_data(DATA_PATH)

    # create train, validation and test split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=test_size)
    X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, test_size=validation_size)

    # add an axis to input sets
    X_train = X_train[..., np.newaxis]
    X_validation = X_validation[..., np.newaxis]
    X_test = X_test[..., np.newaxis]

    return X_train, X_validation, X_test, y_train, y_validation, y_test


def build_model(input_shape):
    """Generates CNN model

    :param input_shape (tuple): Shape of input set
    :return model: CNN model
    """
    # build network topology
    model = keras.Sequential()

    # 1st conv layer
    model.add(keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=input_shape))
    model.add(keras.layers.MaxPooling2D((3, 3), strides=(2, 2), padding='same'))
    model.add(keras.layers.BatchNormalization())

    # 2nd conv layer
    model.add(keras.layers.Conv2D(32, (3, 3), activation='relu'))
    model.add(keras.layers.MaxPooling2D((3, 3), strides=(2, 2), padding='same'))
    model.add(keras.layers.BatchNormalization())

    # 3rd conv layer
    model.add(keras.layers.Conv2D(32, (2, 2), activation='relu'))
    model.add(keras.layers.MaxPooling2D((2, 2), strides=(2, 2), padding='same'))
    model.add(keras.layers.BatchNormalization())

    # flatten output and feed it into dense layer
    model.add(keras.layers.Flatten())
    model.add(keras.layers.Dense(64, activation='relu'))
    model.add(keras.layers.Dropout(0.3))

    # output layer
    model.add(keras.layers.Dense(10, activation='softmax'))

    return model


def predict(model, X, y):
    """Predict a single sample using the trained model

    :param model: Trained classifier
    :param X: Input data
    :param y (int): Target
    """
    # add a dimension to input data for sample - model.predict() expects a 4d array in this case
    X = X[np.newaxis, ...]  # array shape (1, 130, 13, 1)

    # perform prediction
    prediction = model.predict(X)

    # get index with max value
    predicted_index = np.argmax(prediction, axis=1)

    print("Target: {}, Predicted label: {}".format(y, predicted_index))


if __name__ == "__main__":
    # get train, validation, test splits
    X_train, X_validation, X_test, y_train, y_validation, y_test = prepare_datasets(0.25, 0.2)

    # create network
    input_shape = (X_train.shape[1], X_train.shape[2], 1)
    model = build_model(input_shape)

    # compile model
    optimiser = keras.optimizers.Adam(learning_rate=0.0001)
    model.compile(optimizer=optimiser,
                  loss='sparse_categorical_crossentropy',
                  metrics=['accuracy'])
    model.summary()

    # train model
    history = model.fit(X_train, y_train, validation_data=(X_validation, y_validation), batch_size=32, epochs=30)

    # plot accuracy/error for training and validation
    plot_history(history)

    # evaluate model on test set
    test_loss, test_acc = model.evaluate(X_test, y_test, verbose=2)
    print('\nTest accuracy:', test_acc)

    # pick a sample to predict from the test set
    X_to_predict = X_test[100]
    y_to_predict = y_test[100]

    # predict sample
    predict(model, X_to_predict, y_to_predict)
```
References
- Table of classification accuracies attained over MNIST. https://en.wikipedia.org/wiki/MNIST_database#Performance
- Tim Dettmers, “Understanding Convolution In Deep Learning” http://timdettmers.com/2015/03/26/convolution-deep-learning/
- TensorFlow Documentation: Convolution https://www.tensorflow.org/versions/r0.7/api_docs/python/nn.html#convolution
- Parallel Convolutional Neural Networks for Music Genre and Mood Classification https://publik.tuwien.ac.at/files/publik_256012.pdf
- T. Lidy, “Spectral convolutional neural network for music classification,” in Music Information Retrieval Evaluation eXchange (MIREX), Malaga, Spain, October 2015
- Music Genre Classification with Deep Learning by Albert Jiménez https://github.com/jsalbert/Music-Genre-Classification-with-Deep-Learning
- How to Implement a CNN for Music Genre Classification by Valerio Velardo https://www.youtube.com/watch?v=dOG-HxpbMSw&list=PL-wATfeyAMNrtbkCNsLcpoAyBBRJZVlnf&index=16
Thanks for stopping by. If this post was interesting for you, welcome to follow me on Instagram, Medium, and LinkedIn.
Cheers!