How to Implement Custom Data Generators for Enabling Dynamic Data Flow in a Keras Model
Jun 8 · 4 min read
Data generators are among the most useful features of the Keras API. Consider a scenario where you have so much data that you cannot fit all of it in RAM at once. What do you do? Purchasing more RAM obviously isn't an option.
Well, the solution is to load the mini-batches fed to the model dynamically. This is exactly what data generators do. They generate the model input on the fly, forming a pipeline from storage to RAM that loads data as and when it is required. Another advantage of this pipeline is that preprocessing routines can easily be applied to these mini-batches of data as they are prepared for the model.
In this article, we will see how to subclass the tf.keras.utils.Sequence class to implement custom data generators.
ImageDataGenerator
First things first, we will see how to use the ImageDataGenerator API for dynamic image pipelining, and then address why you might need to implement a custom one.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(
    rescale=1./255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True
)
The ImageDataGenerator API can pipeline image data from directories as well as from paths listed in a dataframe. One may also include preprocessing steps like scaling and augmentation, which are applied to the images in real time.
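For instance, the generator above can be pointed at a directory of images. The directory name, target size, and class_mode below are illustrative assumptions, not values from a specific project:

train_generator = datagen.flow_from_directory(
    'data/train',            # hypothetical directory, one subfolder per class
    target_size=(150, 150),  # resize every image on the fly
    batch_size=32,
    class_mode='binary'      # assumes a two-class problem
)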
So, Why Custom Ones?
Model training is not limited to a single type of input and target. There are times when a model is fed with multiple types of inputs at once. For example, say you are working on a multi-modal classification problem where you need to process text and image data simultaneously. Here, you obviously cannot use ImageDataGenerator, and loading all the data at once isn't affordable. Hence, we tackle this issue by implementing a custom data generator.
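To make this concrete, a custom generator's __getitem__ (the method we implement below) can return whatever structure the model expects. A hypothetical multi-modal sketch, where the three helpers are assumed names and not a Keras API, might look like this:

def __getitem__(self, index):
    # Pair a text batch with an image batch for a two-input model
    X_text = self.load_text_batch(index)    # assumed helper
    X_image = self.load_image_batch(index)  # assumed helper
    y = self.load_labels(index)             # assumed helper
    return [X_text, X_image], y             # multi-input models accept a list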
Implementing a Custom Data Generator
We finally start with the implementation.
This will be a very generic implementation and hence can be directly copied. You just have to fill in the blanks, replacing certain variables with your own logic.
As mentioned earlier, we will subclass the tf.keras.utils.Sequence API.
import numpy as np
import tensorflow as tf

class CustomDataGenerator(tf.keras.utils.Sequence):  # any class name works

    def __init__(self, df, x_col, y_col=None, batch_size=32,
                 num_classes=None, shuffle=True):
        self.batch_size = batch_size
        self.df = df
        self.indices = self.df.index.tolist()
        self.num_classes = num_classes
        self.shuffle = shuffle
        self.x_col = x_col
        self.y_col = y_col
        self.on_epoch_end()
First, we define the constructor to initialize the configuration of the generator. Note that here, we assume the path to the data is in a dataframe column; hence, we define the x_col and y_col parameters. This could also be a directory name from which you load the data.
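Instantiating the generator might then look like this; the dataframe and column names are hypothetical:

train_gen = CustomDataGenerator(
    df=train_df,          # hypothetical dataframe with one row per sample
    x_col='image_path',   # hypothetical column holding paths to the inputs
    y_col='label',        # hypothetical column holding the targets
    batch_size=64,
    shuffle=True
)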
The on_epoch_end method is called by Keras after every epoch. We can add routines like shuffling here.
def on_epoch_end(self):
    # Re-generate and optionally shuffle the traversal order after each epoch
    self.index = np.arange(len(self.indices))
    if self.shuffle:
        np.random.shuffle(self.index)
Basically, this snippet reshuffles the order in which the dataframe rows are traversed.
Another utility method we have is __len__. It returns the number of batches (steps) in an epoch, computed from the number of samples and the batch size.
def __len__(self):
    # Denotes the number of batches per epoch
    return len(self.indices) // self.batch_size
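Note that floor division silently drops the last partial batch. If every sample should be seen each epoch, a common alternative (a sketch, not part of the original snippet) is to round up instead, in which case the data-loading logic must tolerate a final batch shorter than batch_size:

def __len__(self):
    # Round up so the final, smaller batch is still produced
    return int(np.ceil(len(self.indices) / self.batch_size))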
Next is the __getitem__ method, which is called with the batch number as an argument and returns the corresponding batch of data.
def __getitem__(self, index):
    # Generate one batch of data
    # Generate indices of the batch
    index = self.index[index * self.batch_size:(index + 1) * self.batch_size]
    # Find list of IDs
    batch = [self.indices[k] for k in index]
    # Generate data
    X, y = self.__get_data(batch)
    return X, y
Basically, we looked up the shuffled indices for the requested batch, delegated the actual loading to a helper method, and returned the result to the caller. The logic for generating the data could live here as well, but it is good practice to abstract it into a separate method.
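Because Keras indexes a Sequence exactly like a list, you can sanity-check a single batch yourself before training; train_gen here is the hypothetical instance created earlier:

X, y = train_gen[0]      # fetch the first batch
print(X.shape, y.shape)  # both should lead with the batch dimension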
Finally, we write the logic for our data generation in the __get_data method. Since this method is called only by our own code, we can name it anything. Moreover, there is no reason for it to be public, so we mark it private with the double-underscore prefix.
def __get_data(self, batch):
    # X.shape : (batch_size, *dim)
    # We can have multiple Xs and can return them as a list
    X = ...  # your logic to load the inputs from storage
    y = ...  # your logic for the target variables

    # Generate data
    for i, id in enumerate(batch):
        X[i,] = ...  # store one sample
        y[i] = ...   # store its label

    return X, y
Additionally, we can add preprocessing/augmentation routines here so that they run in real time. In the above piece of code, X and y are loaded from your data sources according to the batch indices passed to the method. This can be anything: images, texts, both simultaneously, or any other kind of data.
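As one possible way to fill in the blanks (an illustrative sketch, not the article's code), assume x_col holds image file paths, y_col holds integer class labels, and images are resized to an arbitrary 128x128:

# load_img and img_to_array come from tensorflow.keras.preprocessing.image
def __get_data(self, batch):
    X = np.empty((len(batch), 128, 128, 3), dtype=np.float32)
    y = np.empty((len(batch),), dtype=np.int64)
    for i, row_id in enumerate(batch):
        img = load_img(self.df.loc[row_id, self.x_col], target_size=(128, 128))
        X[i,] = img_to_array(img) / 255.0  # rescale pixels to [0, 1]
        y[i] = self.df.loc[row_id, self.y_col]
    return X, y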
After incorporating all the methods, the complete generator, reassembled here from the snippets above (the class name is an arbitrary choice, and the ... blanks still mark where your own loading logic goes), looks like this:
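import numpy as np
import tensorflow as tf

class CustomDataGenerator(tf.keras.utils.Sequence):

    def __init__(self, df, x_col, y_col=None, batch_size=32,
                 num_classes=None, shuffle=True):
        self.batch_size = batch_size
        self.df = df
        self.indices = self.df.index.tolist()
        self.num_classes = num_classes
        self.shuffle = shuffle
        self.x_col = x_col
        self.y_col = y_col
        self.on_epoch_end()

    def __len__(self):
        # Denotes the number of batches per epoch
        return len(self.indices) // self.batch_size

    def __getitem__(self, index):
        # Generate indices of the batch
        index = self.index[index * self.batch_size:(index + 1) * self.batch_size]
        # Find list of IDs
        batch = [self.indices[k] for k in index]
        X, y = self.__get_data(batch)
        return X, y

    def on_epoch_end(self):
        self.index = np.arange(len(self.indices))
        if self.shuffle:
            np.random.shuffle(self.index)

    def __get_data(self, batch):
        X = ...  # your logic to load the inputs from storage
        y = ...  # your logic for the targets
        for i, id in enumerate(batch):
            X[i,] = ...  # store one sample
            y[i] = ...   # store its label
        return X, y

Once the blanks are filled in, the generator can be passed straight to model.fit, which accepts Sequence objects; the model and epoch count here are placeholders:

model.fit(train_gen, epochs=10)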
Conclusion
In this article, we saw the usefulness of data generators while training models with a huge amount of data. We peeked at the ImageDataGenerator API to see what it is and to address the need for custom ones. Then, we finally learned how to implement a custom data generator by subclassing the tf.keras.utils.Sequence API.
Feel free to copy this code and add your own generator logic to it.