# NLP for learners – Simple Introduction to pad_sequences() and TimeseriesGenerator()

In NLP for learners – Training and prediction with Keras and LSTM, we learned how to train a model with Keras and model.fit(). Here we will look at pad_sequences() and TimeseriesGenerator().

In natural language processing, input sentences come in various lengths. To handle a large amount of data efficiently, however, it is better to keep the length of every sequence constant.

For this reason, each list is brought to a fixed size by inserting zeros.

[[1 2 3]
 [1 2 3 4 5 6]
 [1 2 3 4]]

--->

[[1 2 3 0 0 0 0 0 0 0]
 [1 2 3 4 5 6 0 0 0 0]
 [1 2 3 4 0 0 0 0 0 0]]

For example, if you have a vectorized list texts of sentences of varying length, and you want each sentence to become a list of exactly 30 words, you can write the following:

texts = sequence.pad_sequences(texts, maxlen=30, padding="post", truncating="post")

This pads the vectorized list texts and stores the result back in texts.

texts =
[[1, 407, 813, 319, 432, 1697, 666, 4, 985, 5, 563, 3, 613, 5,
  735, 3, 2599, 1, 525, 4, 526, 3622, 1, 814, 986, 2],
 [1, 117, 338, 18, 2600, 3, 892, 2072, 815, 339, 2],
 [1, 339, 20, 893, 10, 5, 167, 36, 1, 319, 6, 3623, 36, 1, 67, 320, 2]

......

]

texts =
[[   1  407  813  319  432 1697  666    4  985    5  563    3  613    5
   735    3 2599    1  525    4  526 3622    1  814  986    2    0    0
     0    0]
 [   1  117  338   18 2600    3  892 2072  815  339    2    0    0    0
     0    0    0    0    0    0    0    0    0    0    0    0    0    0
     0    0]
 [   1  339   20  893   10    5  167   36    1  319    6 3623   36    1
    67  320    2    0    0    0    0    0    0    0    0    0    0    0
     0    0]

......

]

maxlen=30 sets the length of each sequence to 30. If the original sentence contains more than 30 words, the excess words are cut off; truncating="post" cuts them from the end of the sentence ("pre" would cut them from the beginning).

padding="post" inserts 0s after the data; "pre" inserts 0s before the data.

[1 2 3 4]

padding="pre"  --> [0 0 0 1 2 3 4]

padding="post" --> [1 2 3 4 0 0 0]
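The behavior above can be sketched with a small hand-rolled function. This is only an illustrative stand-in for pad_sequences() with padding="post" and truncating="post" — the helper name pad_post is made up for this sketch, not part of Keras:

```python
# A hand-rolled sketch of pad_sequences(maxlen=..., padding="post",
# truncating="post"): cut long sequences at the end, zero-fill short ones
# at the end. (pad_post is a made-up name for illustration.)
def pad_post(seqs, maxlen):
    out = []
    for s in seqs:
        s = s[:maxlen]                           # truncating="post": drop the tail
        out.append(s + [0] * (maxlen - len(s)))  # padding="post": zeros at the end
    return out

texts = [[1, 2, 3], [1, 2, 3, 4, 5, 6], [1, 2, 3, 4]]
print(pad_post(texts, 10))
```

Every row of the result has exactly maxlen entries, which is what makes the data easy to feed to a model in bulk.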

## TimeseriesGenerator()

TimeseriesGenerator() makes it easy to prepare the time series data you need for LSTM.

seq_length = 5
for line in range(len(texts)):
    dataset = TimeseriesGenerator(texts[line], texts[line], length=seq_length, batch_size=1)

len(texts) is the number of lines of text, and the for statement creates time-series data for each line.

texts[line] is passed twice: the first is the input data and the second is the answer (target) data. Typically, you extract both the input and the answer from the same sentence. length=seq_length sets the length of the time step. Here the time step is 5 because we are trying to predict the next word from 5 consecutive words.

The time-series data and the answers are stored in dataset.

texts = [1 2 3 4 5 6 7 8 9 10 ......]

X = [[1 2 3 4 5]      Y = [[6]
     [2 3 4 5 6]           [7]
     [3 4 5 6 7]           [8]
     ......]               ......]
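The X/Y pairs above are produced by a sliding window over the sequence. As a rough, dependency-free sketch of what TimeseriesGenerator does with length=5 (make_windows is a made-up helper name, not the Keras API):

```python
# Sliding-window sketch: each X sample is `length` consecutive values,
# and the matching Y is the value that comes right after them.
def make_windows(series, length):
    X, Y = [], []
    for i in range(len(series) - length):
        X.append(series[i:i + length])
        Y.append(series[i + length])
    return X, Y

series = list(range(1, 11))   # [1, 2, ..., 10]
X, Y = make_windows(series, 5)
print(X[0], Y[0])   # → [1, 2, 3, 4, 5] 6
```

This is exactly the "predict the next word from 5 consecutive words" setup described above.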

batch_size is the number of samples grouped into each batch. It does not change the samples themselves, only how many of them are delivered at a time:

texts = [1 2 3 4 5 6 7 8 9 10 ......]

batch_size=1  -->  one sample per batch

[[1 2 3 4 5]]
[[2 3 4 5 6]]
[[3 4 5 6 7]] ......

batch_size=2  -->  two consecutive samples per batch

[[1 2 3 4 5]
 [2 3 4 5 6]]
[[3 4 5 6 7]
 [4 5 6 7 8]] ......

batch_size=3  -->  three consecutive samples per batch

[[1 2 3 4 5]
 [2 3 4 5 6]
 [3 4 5 6 7]]
[[4 5 6 7 8]
 [5 6 7 8 9]
 [6 7 8 9 10]] ......
for batch in dataset:
    X, Y = batch
    x.extend(X[0])
    y.extend(Y)

Using the for statement, we extract the input X and the answer Y from the dataset and append them one by one to the lists x and y (both start out as empty lists).

Since X is a two-dimensional list, we add a one-dimensional list as X[0].

X = [[1 2 3 4 5]]
X[0]= [1 2 3 4 5]

The result is that x is a one-dimensional list of timesteps.

x = [1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 ......]
x = np.reshape(x,(int(len(x)/5),5,1))

The input value of LSTM takes the form of (batch size, time step, input dimension).

Use np.reshape to convert x to a third-order tensor.

x = [[[1]
      [2]
      [3]
      [4]
      [5]]

     [[2]
      [3]
      [4]
      [5]
      [6]]

     [[3]
      [4]
      [5]
      [6]
      [7]] ......]
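Concretely, the reshape step looks like this with NumPy, using the first three windows from the example above:

```python
import numpy as np

# Flat list of timesteps (three windows of 5 values), reshaped to the
# (batch size, time step, input dimension) shape an LSTM expects.
seq_length = 5
x = [1, 2, 3, 4, 5,  2, 3, 4, 5, 6,  3, 4, 5, 6, 7]
x = np.reshape(x, (len(x) // seq_length, seq_length, 1))
print(x.shape)   # → (3, 5, 1)
```

Each of the 3 samples is a column of 5 one-dimensional values, matching the third-order tensor shown above.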
