NLP for learners – Simple Introduction to pad_sequences() and TimeseriesGenerator()

In NLP for learners – Training and prediction with Keras and LSTM, we learned how to train a model with Keras and model.fit(). Here we will learn about pad_sequences() and TimeseriesGenerator().

pad_sequences()

In natural language processing, input sentences come in various lengths. However, to handle a large amount of data efficiently, it is better to keep the length of each sample constant.

For this reason, the sequences are brought to a uniform length by inserting zeros into the lists.

[[1 2 3]
 [1 2 3 4 5 6]
 [1 2 3 4]]

--->

[[1 2 3 0 0 0 0 0 0 0]
 [1 2 3 4 5 6 0 0 0 0]
 [1 2 3 4 0 0 0 0 0 0]]

pad_sequences() is used for padding.

For example, if you have a vectorized list texts of sentences of varying lengths, and you want each sentence to become a list of exactly 30 words, you can write the following:

from tensorflow.keras.preprocessing import sequence
texts = sequence.pad_sequences(texts, maxlen=30, padding="post", truncating="post")

This pads the vectorized list texts and stores the result back in texts.

texts =
[[1, 407, 813, 319, 432, 1697, 666, 4, 985, 5, 563, 3, 613, 5,
  735, 3, 2599, 1, 525, 4, 526, 3622, 1, 814, 986, 2],
 [1, 117, 338, 18, 2600, 3, 892, 2072, 815, 339, 2],
 [1, 339, 20, 893, 10, 5, 167, 36, 1, 319, 6, 3623, 36, 1, 67, 320, 2]

 ......

]

pad_sequences -->

texts =
[[   1  407  813  319  432 1697  666    4  985    5  563    3  613    5
   735    3 2599    1  525    4  526 3622    1  814  986    2    0    0
     0    0]
 [   1  117  338   18 2600    3  892 2072  815  339    2    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0]
 [   1  339   20  893   10    5  167   36    1  319    6 3623   36    1
    67  320    2    0    0    0    0    0    0    0    0    0    0    0
     0    0]

 ......

]

maxlen=30 sets the length of every sequence to 30. If the original sentence has more than 30 words, the excess words are cut off; truncating="post" cuts them from the end of the sentence, while "pre" would cut them from the beginning.

padding="post" inserts 0s after the data; "pre" inserts 0s before the data.

[1 2 3 4]

padding="pre"  --> [0 0 0 1 2 3 4]

padding="post" --> [1 2 3 4 0 0 0]

TimeseriesGenerator()

TimeseriesGenerator() makes it easy to prepare the time-series data you need for an LSTM.

from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

seq_length = 5
x, y = [], []  # the inputs and answers collected from every line
for line in range(len(texts)):
    dataset = TimeseriesGenerator(texts[line], texts[line], length=seq_length, batch_size=1)

len(texts) is the number of lines of text, and the for statement creates time-series data for each line.

texts[line] appears twice because the generator takes both the input data and the answer data as arguments; typically you extract both from the same sentence. seq_length=5 is the length of the time step. In this case the time step is 5 because we are trying to predict the next word from 5 consecutive words.

The time-series windows and their answers are stored in dataset.

texts = [1 2 3 4 5 6 7 8 9 10 ......]

X = [[1 2 3 4 5]      Y = [[6]
     [2 3 4 5 6]           [7]
     [3 4 5 6 7]           [8]
          ......]          ......]
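
The following minimal sketch, assuming TimeseriesGenerator is imported from tensorflow.keras.preprocessing.sequence as above, reproduces these windows for a short sequence:

from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
gen = TimeseriesGenerator(data, data, length=5, batch_size=1)
X, Y = gen[0]  # first (input, answer) pair
print(X)  # [[1 2 3 4 5]]
print(Y)  # [6]
X, Y = gen[1]
print(X)  # [[2 3 4 5 6]]
print(Y)  # [7]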

batch_size is the number of (input, answer) pairs returned in each batch. The window always advances one word at a time; batch_size only controls how many of these windows are grouped together:

texts = [1 2 3 4 5 6 7 8 9 10 ......]

batch_size=1  -->  one window per batch

[[1 2 3 4 5]]
[[2 3 4 5 6]]
[[3 4 5 6 7]] ......

batch_size=2  -->  two consecutive windows per batch

[[1 2 3 4 5]
 [2 3 4 5 6]]
[[3 4 5 6 7]
 [4 5 6 7 8]] ......

batch_size=3  -->  three consecutive windows per batch

[[1 2 3 4 5]
 [2 3 4 5 6]
 [3 4 5 6 7]]
[[4 5 6 7 8]
 [5 6 7 8 9]
 [6 7 8 9 10]] ......
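Continuing the sketch above, iterating the same data with batch_size=2 shows how the windows are grouped:

gen = TimeseriesGenerator(data, data, length=5, batch_size=2)
for X, Y in gen:
    print(X, Y)
# [[1 2 3 4 5]
#  [2 3 4 5 6]] [6 7]
# [[3 4 5 6 7]
#  [4 5 6 7 8]] [8 9]
# [[5 6 7 8 9]] [10]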
Back in the loop over the lines of texts, we read the windows out of dataset:

    for batch in dataset:
        X, Y = batch        # X: inputs of shape (1, 5), Y: answer of shape (1,)
        x.extend(X[0])      # append the single window as five plain values
        y.extend(Y)         # append the answer word

Using the for statement, we extract the input X and the answer Y from the dataset and add them one by one to the lists x and y.

Since X is a two-dimensional array, we add its only row, X[0], which is one-dimensional.

X = [[1 2 3 4 5]]
X[0]= [1 2 3 4 5]

The result is that x is a one-dimensional list of timesteps.

x = [1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 ......]

import numpy as np
x = np.reshape(x, (len(x) // 5, 5, 1))  # 5 = seq_length

The input to an LSTM takes the form (batch size, time steps, input dimension).

Use np.reshape to convert x to a third-order tensor.

x = [[[1]
      [2]
      [3]
      [4]
      [5]]
     [[2]
      [3]
      [4]
      [5]
      [6]]
     [[3]
      [4]
      [5]
      [6]
      [7]] ......]
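
To connect this back to the previous post's model.fit(), here is a minimal sketch of an LSTM whose input shape matches the reshaped x. The layer size and vocab_size are illustrative assumptions, not values from the original series:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

vocab_size = 5000  # assumed vocabulary size, for illustration only

model = Sequential()
model.add(LSTM(128, input_shape=(5, 1)))            # (time steps, input dimension)
model.add(Dense(vocab_size, activation="softmax"))  # predict the next word
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")
# model.fit(x, np.array(y), ...)  # x has shape (batch size, 5, 1)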