NLP for learners – Simple Introduction to pad_sequences() and TimeseriesGenerator()
In NLP for learners – Training and prediction with keras and LSTM, we learned how to use Keras and model.fit(). Here we will learn about pad_sequences() and TimeseriesGenerator().
pad_sequences()
In natural language processing, input sentences are of various lengths. However, to handle a large amount of data, it is more efficient to keep the length of the data constant.
For this reason, each list is brought to a fixed length by inserting zeros (padding).
[[1 2 3]
[1 2 3 4 5 6]
[1 2 3 4]]
--->
[[1 2 3 0 0 0 0 0 0 0]
[1 2 3 4 5 6 0 0 0 0]
[1 2 3 4 0 0 0 0 0 0]]
pad_sequences() is used for this padding.
For example, if you have a vectorized list texts of sentences of varying length and you want every sentence to become a list of 30 words, you can write the following:
texts = sequence.pad_sequences(texts, maxlen=30, padding="post", truncating="post")
This pads the vectorized list texts and stores the result back in texts.
texts =
[[1, 407, 813, 319, 432, 1697, 666, 4, 985, 5, 563, 3, 613, 5,
735, 3, 2599, 1, 525, 4, 526, 3622, 1, 814, 986, 2],
[1, 117, 338, 18, 2600, 3, 892, 2072, 815, 339, 2],
[1, 339, 20, 893, 10, 5, 167, 36, 1, 319, 6, 3623, 36, 1, 67, 320, 2]
......
]
pad_sequences -->
texts =
[[ 1 407 813 319 432 1697 666 4 985 5 563 3 613 5
735 3 2599 1 525 4 526 3622 1 814 986 2 0 0
0 0]
[ 1 117 338 18 2600 3 892 2072 815 339 2 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0]
[ 1 339 20 893 10 5 167 36 1 319 6 3623 36 1
67 320 2 0 0 0 0 0 0 0 0 0 0 0
0 0]
......
]
maxlen=30 sets the length of each sequence to 30. If the original sentence has more than 30 words, the words beyond that point are cut off (truncating="post" cuts from the end of the sentence; truncating="pre" would cut from the beginning).
padding="post" appends zeros after the data; padding="pre" inserts zeros before the data.
[1 2 3 4]
padding="pre" --> [0 0 0 1 2 3 4]
padding="post" --> [1 2 3 4 0 0 0]
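As a minimal, hedged sketch (assuming the tensorflow.keras preprocessing module; the short lists are made up for illustration), the padding and truncating options behave like this:

from tensorflow.keras.preprocessing.sequence import pad_sequences

data = [[1, 2, 3],
        [1, 2, 3, 4, 5, 6],
        [1, 2, 3, 4]]

# Pad every row with zeros at the end so all rows have length 10
print(pad_sequences(data, maxlen=10, padding="post"))
# [[1 2 3 0 0 0 0 0 0 0]
#  [1 2 3 4 5 6 0 0 0 0]
#  [1 2 3 4 0 0 0 0 0 0]]

# truncating decides which side is cut when a row is longer than maxlen
print(pad_sequences(data, maxlen=3, truncating="post"))   # keeps the first 3 words
print(pad_sequences(data, maxlen=3, truncating="pre"))    # keeps the last 3 words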
TimeseriesGenerator()
TimeseriesGenerator() makes it easy to prepare the time-series data you need for an LSTM.
seq_length = 5
for line in range(len(texts)):
    dataset = TimeseriesGenerator(texts[line], texts[line], length=seq_length, batch_size=1)
len(texts) is the number of lines of text, and the for statement creates time-series data for each line.
texts[line] appears twice: the first is the input data and the second is the answer data. Typically, you extract both the input and the answer from the same sentence.
seq_length = 5 is the length of the time step. In this case the time step is 5 because we want to predict the next word from 5 consecutive words.
The time-series data and the answers are stored in dataset.
texts = [1 2 3 4 5 6 7 8 9 10 ......]
X = [[1 2 3 4 5]      Y = [[6]
     [2 3 4 5 6]           [7]
     [3 4 5 6 7]           [8]
     ......]               ......]
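A minimal sketch of this behaviour (assuming the tensorflow.keras preprocessing module; the sequence 1 to 10 stands in for one vectorized sentence):

import numpy as np
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

data = np.arange(1, 11)   # [1 2 3 ... 10]

# Same array as input and target: predict the next value from the previous 5
gen = TimeseriesGenerator(data, data, length=seq_length, batch_size=1)

for X, Y in gen:
    print(X, Y)
# [[1 2 3 4 5]] [6]
# [[2 3 4 5 6]] [7]
# [[3 4 5 6 7]] [8]
# ...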
batch_size controls how many (input, answer) samples the generator returns per batch; consecutive samples are grouped together:
texts = [1 2 3 4 5 6 7 8 9 10 ......]
batch_size=1 --> one sample per batch
[[1 2 3 4 5]]
[[2 3 4 5 6]]
[[3 4 5 6 7]] ......
batch_size=2 --> two samples per batch
[[1 2 3 4 5]
 [2 3 4 5 6]]
[[3 4 5 6 7]
 [4 5 6 7 8]] ......
batch_size=3 --> three samples per batch
[[1 2 3 4 5]
 [2 3 4 5 6]
 [3 4 5 6 7]]
[[4 5 6 7 8]
 [5 6 7 8 9]
 [6 7 8 9 10]] ......
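A quick sketch to check this (same assumptions as above):

import numpy as np
from tensorflow.keras.preprocessing.sequence import TimeseriesGenerator

data = np.arange(1, 11)
gen = TimeseriesGenerator(data, data, length=5, batch_size=2)

X, Y = gen[0]   # first batch: two consecutive samples and their targets
print(X)        # [[1 2 3 4 5]
                #  [2 3 4 5 6]]
print(Y)        # [6 7]

The rest of this tutorial uses batch_size=1, so each batch holds exactly one sample.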
x = []
y = []
for batch in dataset:
    X, Y = batch
    x.extend(X[0])   # X has shape (1, seq_length), so take the single sample
    y.extend(Y)
Using the for statement, we extract the input X and the answer Y from dataset and add them one by one to the lists x and y.
Since X is a two-dimensional list, we add the one-dimensional list X[0].
X = [[1 2 3 4 5]]
X[0]= [1 2 3 4 5]
The result is that x is a one-dimensional list of time steps.
x = [1 2 3 4 5 2 3 4 5 6 3 4 5 6 7 ......]
x = np.reshape(x,(int(len(x)/5),5,1))
The input to an LSTM takes the form (batch size, time steps, input dimension). Use np.reshape to convert x into a third-order tensor.
x = [[[1]
[2]
[3]
[4]
[5]]
[[2]
[3]
[4]
[5]
[6]]
[[3]
[4]
[5]
[6]
[7]] ......]
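As a small sanity check (a sketch using just the first three windows from the example above):

import numpy as np

x = [1, 2, 3, 4, 5,
     2, 3, 4, 5, 6,
     3, 4, 5, 6, 7]     # flat list of 5-word windows, as built above

x = np.reshape(x, (int(len(x) / 5), 5, 1))
print(x.shape)          # (3, 5, 1) -> (batch size, time steps, input dimension)
print(x[0].ravel())     # [1 2 3 4 5]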