How to Implement Custom Data Generators for Enabling Dynamic Data Flow in a Keras Model
Jun 8 · 4 min read
Data generators are among the most useful features of the Keras API. Consider a scenario where you have so much data that you cannot fit all of it in RAM at once. What do you do? Purchasing more RAM obviously isn't an option.
Well, the solution is to load the mini-batches fed to the model dynamically. This is exactly what data generators do. They generate the model input on the fly, forming a pipeline from storage to RAM that loads data as and when it is required. Another advantage of this pipeline is that preprocessing routines can easily be applied to these mini-batches of data as they are prepared for the model.
In this article, we will see how to subclass the tf.keras.utils.Sequence class to implement custom data generators.
ImageDataGenerator
First things first, we will see how to use the ImageDataGenerator API for dynamic image pipelining, and then address why you might need to implement a custom one.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rescale=1./255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True
)
The ImageDataGenerator API can pipeline image data from directories as well as from paths listed in a dataframe. One may also include preprocessing steps like scaling and augmentation, which are applied to the images in real time.
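For instance, the generator above can be pointed at a directory of images. The directory name, target size, and class_mode below are illustrative assumptions, not values from a specific project:

train_generator = datagen.flow_from_directory(
    'data/train',            # hypothetical directory, one subfolder per class
    target_size=(150, 150),  # resize every image on the fly
    batch_size=32,
    class_mode='binary'      # assumes a two-class problem
)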
So, Why Custom Ones?
Model training is not limited to a single type of input and target. There are times when a model is fed with multiple types of inputs at once. For example, say you are working on a multi-modal classification problem where you need to process text and image data simultaneously. Here, you obviously cannot use ImageDataGenerator, and loading all the data at once isn't affordable. Hence, we tackle this issue by implementing a custom data generator.
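To make this concrete, a custom generator's __getitem__ (the method we implement below) can return whatever structure the model expects. A hypothetical multi-modal sketch, where the three helpers are assumed names and not a Keras API, might look like this:

def __getitem__(self, index):
    # Pair a text batch with an image batch for a two-input model
    X_text = self.load_text_batch(index)    # assumed helper
    X_image = self.load_image_batch(index)  # assumed helper
    y = self.load_labels(index)             # assumed helper
    return [X_text, X_image], y             # multi-input models accept a list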
Implementing a Custom Data Generator
We finally start with the implementation.
This will be a very generic implementation and hence can be directly copied. You just have to fill in the blanks, replacing certain variables with your own logic.
As mentioned earlier, we will subclass the tf.keras.utils.Sequence API.
import numpy as np
import tensorflow as tf

class CustomDataGenerator(tf.keras.utils.Sequence):  # any class name works

    def __init__(self, df, x_col, y_col=None, batch_size=32,
                 num_classes=None, shuffle=True):
        self.batch_size = batch_size
        self.df = df
        self.indices = self.df.index.tolist()
        self.num_classes = num_classes
        self.shuffle = shuffle
        self.x_col = x_col
        self.y_col = y_col
        self.on_epoch_end()
First, we define the constructor to initialize the configuration of the generator. Note that here, we assume the path to the data is in a dataframe column; hence, we define the x_col and y_col parameters. This could also be a directory name from which you load the data.
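Instantiating the generator might then look like this; the dataframe and column names are hypothetical:

train_gen = CustomDataGenerator(
    df=train_df,          # hypothetical dataframe with one row per sample
    x_col='image_path',   # hypothetical column holding paths to the inputs
    y_col='label',        # hypothetical column holding the targets
    batch_size=64,
    shuffle=True
)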
The on_epoch_end method is called by Keras after every epoch. We can add routines like shuffling here.
def on_epoch_end(self):
    # Re-generate and optionally shuffle the traversal order after each epoch
    self.index = np.arange(len(self.indices))
    if self.shuffle:
        np.random.shuffle(self.index)
Basically, this snippet reshuffles the order in which the dataframe rows are traversed.
Another utility method we have is __len__. It returns the number of batches (steps) in an epoch, computed from the number of samples and the batch size.
def __len__(self):
    # Denotes the number of batches per epoch
    return len(self.indices) // self.batch_size
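Note that floor division silently drops the last partial batch. If every sample should be seen each epoch, a common alternative (a sketch, not part of the original snippet) is to round up instead, in which case the data-loading logic must tolerate a final batch shorter than batch_size:

def __len__(self):
    # Round up so the final, smaller batch is still produced
    return int(np.ceil(len(self.indices) / self.batch_size))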
Next is the __getitem__ method, which is called with the batch number as an argument and returns the corresponding batch of data.
def __getitem__(self, index):
    # Generate one batch of data
    # Generate indices of the batch
    index = self.index[index * self.batch_size:(index + 1) * self.batch_size]
    # Find list of IDs
    batch = [self.indices[k] for k in index]
    # Generate data
    X, y = self.__get_data(batch)
    return X, y
Basically, we looked up the shuffled indices for the requested batch, delegated the actual loading to a helper method, and returned the result to the caller. The logic for generating the data could live here as well, but it is good practice to abstract it into a separate method.
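Because Keras indexes a Sequence exactly like a list, you can sanity-check a single batch yourself before training; train_gen here is the hypothetical instance created earlier:

X, y = train_gen[0]      # fetch the first batch
print(X.shape, y.shape)  # both should lead with the batch dimension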
Finally, we write the logic for our data generation in the __get_data method. Since this method is called only by our own code, we can name it anything. Moreover, there is no reason for it to be public, so we mark it private with the double-underscore prefix.
def __get_data(self, batch):
    # X.shape : (batch_size, *dim)
    # We can have multiple Xs and can return them as a list
    X = ...  # your logic to load the inputs from storage
    y = ...  # your logic for the target variables

    # Generate data
    for i, id in enumerate(batch):
        X[i,] = ...  # store one sample
        y[i] = ...   # store its label

    return X, y
Additionally, we can add preprocessing/augmentation routines here so that they run in real time. In the above piece of code, X and y are loaded from your data sources according to the batch indices passed to the method. This can be anything: images, texts, both simultaneously, or any other kind of data.
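As one possible way to fill in the blanks (an illustrative sketch, not the article's code), assume x_col holds image file paths, y_col holds integer class labels, and images are resized to an arbitrary 128x128:

# load_img and img_to_array come from tensorflow.keras.preprocessing.image
def __get_data(self, batch):
    X = np.empty((len(batch), 128, 128, 3), dtype=np.float32)
    y = np.empty((len(batch),), dtype=np.int64)
    for i, row_id in enumerate(batch):
        img = load_img(self.df.loc[row_id, self.x_col], target_size=(128, 128))
        X[i,] = img_to_array(img) / 255.0  # rescale pixels to [0, 1]
        y[i] = self.df.loc[row_id, self.y_col]
    return X, y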
After incorporating all the methods, the complete generator, reassembled here from the snippets above (the class name is an arbitrary choice, and the ... blanks still mark where your own loading logic goes), looks like this:
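import numpy as np
import tensorflow as tf

class CustomDataGenerator(tf.keras.utils.Sequence):

    def __init__(self, df, x_col, y_col=None, batch_size=32,
                 num_classes=None, shuffle=True):
        self.batch_size = batch_size
        self.df = df
        self.indices = self.df.index.tolist()
        self.num_classes = num_classes
        self.shuffle = shuffle
        self.x_col = x_col
        self.y_col = y_col
        self.on_epoch_end()

    def __len__(self):
        # Denotes the number of batches per epoch
        return len(self.indices) // self.batch_size

    def __getitem__(self, index):
        # Generate indices of the batch
        index = self.index[index * self.batch_size:(index + 1) * self.batch_size]
        # Find list of IDs
        batch = [self.indices[k] for k in index]
        X, y = self.__get_data(batch)
        return X, y

    def on_epoch_end(self):
        self.index = np.arange(len(self.indices))
        if self.shuffle:
            np.random.shuffle(self.index)

    def __get_data(self, batch):
        X = ...  # your logic to load the inputs from storage
        y = ...  # your logic for the targets
        for i, id in enumerate(batch):
            X[i,] = ...  # store one sample
            y[i] = ...   # store its label
        return X, y

Once the blanks are filled in, the generator can be passed straight to model.fit, which accepts Sequence objects; the model and epoch count here are placeholders:

model.fit(train_gen, epochs=10)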
Conclusion
In this article, we saw the usefulness of data generators while training models with a huge amount of data. We peeked at the ImageDataGenerator API to see what it is and to address the need for custom ones. Then, we finally learned how to implement a custom data generator by subclassing the tf.keras.utils.Sequence API.
Feel free to copy this code and add your own generator logic to it.