NLP for learners – Simple Introduction to pad_sequences() and TimeseriesGenerator()
In NLP for learners – Training and prediction with Keras and LSTM, we learned how to train a model with Keras using model.fit(). Here we will look at pad_sequences() and TimeseriesGenerator().
pad_sequences()
In natural language processing, input sentences are of various lengths. However, to handle a large amount of data, it is more efficient to keep the length of the data constant.
For this reason, each list is brought to a fixed length by filling the remaining positions with zeros.
[[1 2 3]
[1 2 3 4 5 6]
[1 2 3 4]]
--->
[[1 2 3 0 0 0 0 0 0 0]
[1 2 3 4 5 6 0 0 0 0]
[1 2 3 4 0 0 0 0 0 0]]
pad_sequences() is used for padding.
For example, if you have a vectorized list texts of sentences of varying length and want to bring every sentence to a length of 30 words, you can write the following:
from tensorflow.keras.preprocessing import sequence
texts = sequence.pad_sequences(texts, maxlen=30, padding="post", truncating="post")
This pads the vectorized list texts and stores the result back in texts.
texts =
[[1, 407, 813, 319, 432, 1697, 666, 4, 985, 5, 563, 3, 613, 5,
735, 3, 2599, 1, 525, 4, 526, 3622, 1, 814, 986, 2],
[1, 117, 338, 18, 2600, 3, 892, 2072, 815, 339, 2],
[1, 339, 20, 893, 10, 5, 167, 36, 1, 319, 6, 3623, 36, 1, 67, 320, 2]
......
]
pad_sequences -->
texts =
[[ 1 407 813 319 432 1697 666 4 985 5 563 3 613 5
735 3 2599 1 525 4 526 3622 1 814 986 2 0 0
0 0]
[ 1 117 338 18 2600 3 892 2072 815 339 2 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0]
[ 1 339 20 893 10 5 167 36 1 319 6 3623 36 1
67 320 2 0 0 0 0 0 0 0 0 0 0 0
0 0]
......
]
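To see the same behaviour on a small, self-contained example, here is a minimal sketch; the toy lists and maxlen=10 are made up for illustration and are not the article's actual data.

from tensorflow.keras.preprocessing import sequence

toy = [[1, 2, 3],
       [1, 2, 3, 4, 5, 6],
       [1, 2, 3, 4]]

# Pad every list to length 10, appending zeros at the end.
padded = sequence.pad_sequences(toy, maxlen=10, padding="post", truncating="post")
print(padded)
# [[1 2 3 0 0 0 0 0 0 0]
#  [1 2 3 4 5 6 0 0 0 0]
#  [1 2 3 4 0 0 0 0 0 0]]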
maxlen=30 sets the length of each sequence to 30. If the original sentence has more than 30 words, the excess words are cut off; with truncating="post" they are cut from the end of the sentence ("pre", the default, would cut them from the beginning).
padding="post" inserts 0s after the data; "pre" inserts 0s before the data.
[1 2 3 4]
padding="pre" --> [0 0 0 1 2 3 4]
padding="post" --> [1 2 3 4 0 0 0]
TimeseriesGenerator()
TimeseriesGenerator() makes it easy to prepare the time series data you need for LSTM.
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

seq_length = 5
for line in range(len(texts)):
    dataset = TimeseriesGenerator(texts[line], texts[line], length=seq_length, batch_size=1)
len(texts) is the number of lines of text, and the for statement creates time-series data line by line.
texts[line] is passed twice: the first argument is the input data and the second is the answer (target) data. Typically, both the input and the answer are taken from the same sentence. seq_length=5 is the length of the time step; it is 5 here because we want to predict the next word from 5 consecutive words.
The time-series data and the answers are stored in dataset.
texts = [1 2 3 4 5 6 7 8 9 10 ......]
X = [[1 2 3 4 5]     Y = [[6]
     [2 3 4 5 6]          [7]
     [3 4 5 6 7]          [8]
     ......]              ......]
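As a quick check, here is a minimal sketch with a made-up sequence of numbers (the names data and gen are assumptions for the example):

from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
gen = TimeseriesGenerator(data, data, length=5, batch_size=1)

for X, Y in gen:
    print(X, Y)
# [[1 2 3 4 5]] [6]
# [[2 3 4 5 6]] [7]
# [[3 4 5 6 7]] [8]
# ......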
batch_size determines how many (input, answer) pairs are grouped into each batch:
texts = [1 2 3 4 5 6 7 8 9 10 ......]
batch_size=1 --> one sample per batch
[[1 2 3 4 5]], [[2 3 4 5 6]], [[3 4 5 6 7]], ......
batch_size=2 --> two samples per batch
[[1 2 3 4 5]      [[3 4 5 6 7]
 [2 3 4 5 6]],     [4 5 6 7 8]], ......
batch_size=3 --> three samples per batch
[[1 2 3 4 5]
 [2 3 4 5 6]
 [3 4 5 6 7]], ......
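Here is the same toy sequence with batch_size=2, again only an illustrative sketch; note that the last batch can be smaller when the samples do not divide evenly:

from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
gen = TimeseriesGenerator(data, data, length=5, batch_size=2)

for X, Y in gen:
    print(X, Y)
# [[1 2 3 4 5]
#  [2 3 4 5 6]] [6 7]
# [[3 4 5 6 7]
#  [4 5 6 7 8]] [8 9]
# [[5 6 7 8 9]] [10]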
# x and y are empty lists prepared beforehand (x, y = [], [])
for batch in dataset:
    X, Y = batch
    x.extend(X[0])   # X has shape (1, 5), so X[0] is the 5-step input sequence
    y.extend(Y)      # Y holds the single answer word
Using the for statement, we extract the input X and the answer Y from the dataset and add them one by one to the lists x and y.
Since X is a two-dimensional array (a batch containing one sample), we add the one-dimensional sample X[0].
X = [[1 2 3 4 5]]
X[0]= [1 2 3 4 5]
The result is that x is a one-dimensional list of timesteps.
x = [1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 ......]
import numpy as np
x = np.reshape(x, (int(len(x)/5), 5, 1))
The input to an LSTM takes the form (batch size, time step, input dimension), so np.reshape is used to convert x into a third-order tensor.
x = [[[1]
[2]
[3]
[4]
[5]]
[[2]
[3]
[4]
[5]
[6]]
[[3]
[4]
[5]
[6]
[7]] ......]
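Putting the last steps together, here is a minimal sketch of the reshape; the flat toy list stands in for the x collected above, and seq_length=5 matches the time step used earlier:

import numpy as np

seq_length = 5
x = [1, 2, 3, 4, 5,
     2, 3, 4, 5, 6,
     3, 4, 5, 6, 7]   # flattened time steps, as collected above

x = np.reshape(x, (len(x) // seq_length, seq_length, 1))
print(x.shape)   # (3, 5, 1) -> (batch size, time step, input dimension)
# A Keras LSTM layer would then be given input_shape=(seq_length, 1).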