[1., 0., 0., 0., 0., 0., 0., 0., 0.],
[1., 0., 0., 0., 0., 0., 0., 0., 0.]])
==Proposed Training Model==
Given the preprocessing described above, one way to train a model to detect all the markup in an HTML page is as follows.
(1) First, store all the tokens in the training data in two Python dictionaries for later lookups: the first maps indices to tokens, with format {'token_index':'token'}, and the second maps tokens to indices, with format {'token':'token_index'}.
(2) Decide on a fixed length for each training point. For example, if we set the max length to 20 lines, pages that contain fewer than 20 lines are padded with zeros: a file with 17 lines is padded with [0, 0, 0...,0] for the remaining 3 rows.
(3) Apply one-hot encoding, as described in the previous section, to the whole training dataset. Each data point (one DSL file) will then have shape [max length, number of unique tokens].
(4) Define y. We are trying to predict the next token given the previous tokens, so the label y is the one-hot representation of the token that follows the sequence in the training data.
(5) Use an LSTM, or possibly a bi-directional LSTM, to train on the data in mini-batches with the Adam optimizer. Each batch that goes into the model will have shape [batch size, max length, number of unique tokens].
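The data preparation in steps (1) to (4) can be sketched in NumPy as below. This is only a minimal illustration under stated assumptions: the toy corpus, the names MAX_LEN, encode, make_example and batches, and the choice to use the last token of each file as the label are all hypothetical and not taken from the original pipeline.

```python
import numpy as np

# Hypothetical toy corpus: each "file" is a list of DSL tokens, one per line.
files = [
    ["header", "{", "btn-active", "}"],
    ["row", "{", "single", "{", "text", "}", "}"],
]
MAX_LEN = 8  # assumed maximum sequence length

# Step 1: two lookup dictionaries (index -> token and token -> index).
vocab = sorted({tok for f in files for tok in f})
index_to_token = dict(enumerate(vocab))
token_to_index = {tok: i for i, tok in index_to_token.items()}


def encode(tokens, max_len, token_to_index):
    """Steps 2-3: one-hot encode into [max_len, n_tokens]; unused rows stay zero."""
    out = np.zeros((max_len, len(token_to_index)), dtype=np.float32)
    for row, tok in enumerate(tokens[:max_len]):
        out[row, token_to_index[tok]] = 1.0
    return out


def make_example(tokens, max_len, token_to_index):
    """Step 4: x is every token but the last, y is the one-hot of the last token."""
    tokens = tokens[:max_len + 1]  # assumption: longer files are simply truncated
    x = encode(tokens[:-1], max_len, token_to_index)
    y = np.zeros(len(token_to_index), dtype=np.float32)
    y[token_to_index[tokens[-1]]] = 1.0
    return x, y


def batches(X, Y, batch_size):
    """Yield consecutive mini-batches of shape [batch_size, max_len, n_tokens]."""
    for start in range(0, len(X), batch_size):
        yield X[start:start + batch_size], Y[start:start + batch_size]


pairs = [make_example(f, MAX_LEN, token_to_index) for f in files]
X = np.stack([x for x, _ in pairs])  # [n_files, MAX_LEN, n_tokens]
Y = np.stack([y for _, y in pairs])  # [n_files, n_tokens]
```

A real dataset would more likely produce several (x, y) pairs per file, for example with a sliding window; the single split per file is used here only to keep the sketch short.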
A sample LSTM cell in TensorFlow is as follows:
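import tensorflow as tf  # TensorFlow 1.x API (tf.contrib is not available in 2.x)

def lstm_cell(keep_prob):
    '''
    Define a single LSTM cell with dropout on its output.
    args:
        keep_prob: scalar tensor, probability of keeping each output unit
    Note: num_units (the hidden size) is assumed to be defined at module level.
    '''
    if tf.test.is_gpu_available():
        # Cudnn_LSTM_with_bias is assumed to be a helper defined elsewhere
        lstm = Cudnn_LSTM_with_bias(num_units)
    else:
        lstm = tf.nn.rnn_cell.LSTMCell(num_units, forget_bias=1.0)
    lstm = tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
    return lstm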
Then we unroll the cell over the input sequence with tf.nn.dynamic_rnn, which runs a tf.while_loop over the time steps internally, to build the network. The sample code is:
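def lstm_network(x, weight, bias, keep_prob):
    '''
    Run the LSTM cell over the sequence and predict the next token.
    args:
        x: data with shape [batch_size, max_len, len_unique_char]
        weight: output projection, shape [num_units, len_unique_char]
        bias: output bias, shape [len_unique_char]
        keep_prob: scalar tensor passed on to lstm_cell
    '''
    lstm = lstm_cell(keep_prob)
    outputs, states = tf.nn.dynamic_rnn(lstm, x, dtype=tf.float32)
    # states.h is the final hidden state, shape [batch_size, num_units]
    prediction = tf.add(tf.matmul(states.h, weight), bias, name='prediction')
    return prediction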
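To make the shapes in the final projection concrete, here is a small NumPy sketch of the same affine map followed by a softmax and a lookup of the most probable token. It is an illustration only: the vocabulary, the random values standing in for states.h, and the helper names softmax and decode are assumptions, not part of the model above.

```python
import numpy as np


def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


def decode(logits, index_to_token):
    """Map each row of logits to its most probable token."""
    probs = softmax(logits)
    return [index_to_token[int(i)] for i in probs.argmax(axis=1)]


# Hypothetical vocabulary and random values standing in for states.h.
index_to_token = {0: "{", 1: "}", 2: "row"}
n_tokens = len(index_to_token)
num_units = 4
rng = np.random.default_rng(0)

h = rng.normal(size=(2, num_units))              # [batch_size, num_units]
weight = rng.normal(size=(num_units, n_tokens))  # [num_units, n_tokens]
bias = np.zeros(n_tokens)                        # [n_tokens]

# Same affine map as tf.add(tf.matmul(states.h, weight), bias)
logits = h @ weight + bias                       # [batch_size, n_tokens]
tokens = decode(logits, index_to_token)
```

During training the logits would normally be passed to a softmax cross-entropy loss against y instead of being decoded; decoding as shown is what one would do at prediction time.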