NLP for learners – Interrupt and resume training (ModelCheckpoint)

The more data you train on, the longer training takes, and the harder it becomes to finish in a single run.

Therefore, we introduce a method to save the model at the end of each epoch and resume training later.

ModelCheckpoint

from keras.callbacks import ModelCheckpoint
cp_callback = ModelCheckpoint(
    filepath="model.h5",
    verbose=1,
    save_weights_only=False,
    save_freq="epoch")

ModelCheckpoint() is a callback that saves the model during training. With the settings above, it saves the whole model, including its weights, at the end of each epoch.

filepath is the path of the file to save to. Here we save the whole model, so we use the HDF5 format with the .h5 extension.

If verbose=1, a message is displayed each time the file is saved; if verbose=0, no message is displayed.

save_weights_only=True saves only the weights of the model, whereas save_weights_only=False saves the model itself.

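A saved model already contains its weights, so when you resume training later you only need to load the model; there is no separate weights file to load. The saved model also keeps the compile() settings and the optimizer state, which is what lets training pick up where it stopped.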

save_freq="epoch" saves the model at the end of each epoch. Because filepath is a fixed name here, each save overwrites the previous file. It is recommended that you do not change this value.
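If you would rather keep one file per epoch than overwrite a single file, the Keras documentation describes filepath as accepting named formatting fields such as {epoch:02d}, which the callback fills in before each save. The sketch below uses only Python's str.format to show how such a template expands. The name model_{epoch:02d}.h5 and the helper checkpoint_names are illustrative assumptions, not part of the example above.

```python
# Illustration only: plain str.format, no Keras required.
# The file names and the helper below are hypothetical examples.

def checkpoint_names(template, num_epochs):
    """Return the file name the template yields for epochs 1..num_epochs."""
    return [template.format(epoch=epoch) for epoch in range(1, num_epochs + 1)]

# A fixed name expands to the same file every epoch,
# so each save overwrites the previous one.
fixed = checkpoint_names("model.h5", 3)

# A template with an epoch field gives one file per epoch.
per_epoch = checkpoint_names("model_{epoch:02d}.h5", 3)

print(fixed)
print(per_epoch)
```

Keeping one file per epoch uses more disk space, but it lets you go back to an earlier epoch if a later one turns out worse.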

model.fit(
    ...... ,
    callbacks=[cp_callback])

Finally, pass the callback to fit() through its callbacks argument, which takes a list of callbacks. fit() then runs the checkpoint at the end of each epoch.

Resume training

from keras.models import load_model
model = load_model('model.h5')

load_model() reads the saved model.

If the model itself is loaded, there is no need to build it again. Sequential(), model.add(), and model.compile() are not necessary.

Remove or comment out that code and run model.fit().
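Instead of editing the script between runs, one option is to check whether the checkpoint file exists and decide at run time whether to load or build. The following is a minimal sketch of that idea. load_or_build is a hypothetical helper; load_fn and build_fn stand in for load_model and for your own function that wraps Sequential(), model.add() and model.compile().

```python
import os

def load_or_build(path, load_fn, build_fn):
    """Return (model, resumed).

    load_fn(path) is called if the checkpoint file exists;
    otherwise build_fn() is called to create a fresh model.
    """
    if os.path.exists(path):
        return load_fn(path), True
    return build_fn(), False

# Stand-in functions so the sketch runs without Keras.
# In a real script, pass load_model and your own model-building function.
def fake_load(path):
    return "loaded from " + path

def fake_build():
    return "freshly built"

# Assumes no file with this (hypothetical) name exists in the working directory.
model, resumed = load_or_build("no_such_checkpoint.h5", fake_load, fake_build)
print(model, resumed)
```

The resumed flag tells the rest of the script whether this is a fresh start or a continuation, which is useful when deciding what to pass to fit().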

model.fit(
    ...... ,
    initial_epoch=10,
    callbacks=[cp_callback])

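initial_epoch is the number of epochs already completed, that is, the point from which training resumes. For example, if the previous run finished at epoch 10 and you specify initial_epoch=10, training starts at epoch 11. Note that the epochs argument of fit() is the number of the final epoch, not the number of additional epochs, so it must be larger than initial_epoch.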
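The arithmetic can be checked with plain Python. The sketch below assumes the Keras convention that the epochs argument of fit() is the number of the last epoch to run, and that epochs are numbered from 1 in the progress output. epochs_shown is a hypothetical helper, not a Keras function.

```python
def epochs_shown(initial_epoch, epochs):
    """Return the epoch numbers (1-based, as displayed) that a run would train."""
    return list(range(initial_epoch + 1, epochs + 1))

first_run = epochs_shown(initial_epoch=0, epochs=10)     # a fresh run of 10 epochs
resumed_run = epochs_shown(initial_epoch=10, epochs=20)  # resume for 10 more
no_training = epochs_shown(initial_epoch=10, epochs=10)  # nothing left to run

print(first_run[0], first_run[-1])      # 1 10
print(resumed_run[0], resumed_run[-1])  # 11 20
print(no_training)                      # []
```

The last case is the one to watch for: resuming with epochs left at its old value trains nothing.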

If you don’t specify initial_epoch, the count starts again from epoch 1.

Specifying initial_epoch is not required, but it is better to do so; otherwise the epoch count restarts with every run and you lose track of how much training produced a given result.
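One way to always have the right value at hand is to record the number of completed epochs next to the checkpoint. This is a minimal sketch using only the standard library. The file name train_state.json and the helpers save_last_epoch and read_last_epoch are hypothetical; in a real script, save_last_epoch would be called at the end of each epoch, for example from a custom callback.

```python
import json
import os
import tempfile

def save_last_epoch(path, epoch):
    """Record the number of epochs completed so far."""
    with open(path, "w") as f:
        json.dump({"last_epoch": epoch}, f)

def read_last_epoch(path):
    """Return the recorded epoch count, or 0 if nothing has been recorded."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["last_epoch"]

# Demonstration in a temporary directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    state_file = os.path.join(tmp, "train_state.json")
    before = read_last_epoch(state_file)  # nothing recorded yet
    save_last_epoch(state_file, 10)       # pretend 10 epochs have finished
    after = read_last_epoch(state_file)   # value to pass as initial_epoch

print(before, after)  # 0 10
```

On resume, the value returned by read_last_epoch is what you would pass as initial_epoch.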