Latest revision as of 12:47, 21 September 2020
DSL Encoding | |
---|---|
Project Information | |
Has title | DSL Encoding |
Has owner | Hiep Nguyen |
Has start date | 2019/04/26 |
Has deadline date | |
Has project status | Active |
Has sponsor | Kauffman Incubator Project |
Has project output | Tool |
Approach
Currently, I am planning to use one-hot vectors to encode the structure of a DSL page. The author of the pix2code project took the same approach. However, the paper does not discuss the preprocessing step in detail, and the source code is sparsely commented. This article gives more detailed instructions for the embedding method. For our project, we can ignore the image-preprocessing part and focus solely on the text processing. The associated GitHub page can be found here.
File and scripts
The current scripts, which I wrote by following the pix2code source code, live in
E:/projects/embedding
So far, I have been experimenting with only one simple DSL file, '00CDC9A8-3D73-4291-90EF-49178E408797.gui'. To see the current output (not yet one-hot), run
python convert_gui.py
Explanation and Implementation
One-hot encoding represents a word or token as a vector whose length equals the number of unique tokens in the DSL file. Every entry of the vector is zero except for a single 1 at the position assigned to that token. Let's look at a DSL file from pix2code as an example. The process is as follows:
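    tokens = []
    # Read the DSL file line by line; each cleaned line becomes one token.
    with open('00CDC9A8-3D73-4291-90EF-49178E408797.gui') as gui:
        for line in gui:
            # Drop the trailing newline, then any braces at either end of the line.
            line = line.strip('\n').strip('}').strip('{')
            tokens.append(line)
            print(line)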
Here we open a DSL file, go through every line, strip the newline and brace characters, and store the resulting tokens in a list. The tokens variable now looks something like this:
    tokens
    [
     'header ',
     'btn-inactive, btn-active, btn-inactive, btn-inactive, btn-inactive',
     ' ',
     'row ',
     'quadruple ',
     'small-title, text, btn-orange',
     ' ',
     'quadruple ',
     'small-title, text, btn-red',
     ' ',
     'quadruple ',
     'small-title, text, btn-green',
     ' ',
     'quadruple ',
     'small-title, text, btn-orange',
     ' ',
     ' ',
     'row ',
     'single ',
     'small-title, text, btn-green',
     ' ',
     ' '
    ]
Now, based on this list, we can collect the unique tokens (and, from the length of the result, count them) with
    chars = sorted(list(set(tokens)))
which results in
    [
     ' ',
     'btn-inactive, btn-active, btn-inactive, btn-inactive, btn-inactive',
     'header ',
     'quadruple ',
     'row ',
     'single ',
     'small-title, text, btn-green',
     'small-title, text, btn-orange',
     'small-title, text, btn-red'
    ]
As we can see, there are 9 unique elements in this example, so each one-hot vector has length 9. Next, we assign a number to each token; that number is the index at which the token's vector holds its 1.
    char_indices = dict((c, i) for i, c in enumerate(chars))
    indices_char = dict((i, c) for i, c in enumerate(chars))
This results in
    char_indices
    {' ': 0,
     'btn-inactive, btn-active, btn-inactive, btn-inactive, btn-inactive': 1,
     'header ': 2,
     'quadruple ': 3,
     'row ': 4,
     'single ': 5,
     'small-title, text, btn-green': 6,
     'small-title, text, btn-orange': 7,
     'small-title, text, btn-red': 8}
Hence, if we have a line with the token 'header ', its one-hot representation is [0,0,1,0,0,0,0,0,0]. The 1 sits at index 2 (the third position, since indices start at 0), which is the index assigned to 'header ' in char_indices.
Now, let's apply this embedding rule to our GUI file
    import numpy as np

    # One row per line of the DSL file, one column per unique token.
    one_hot_vector = np.zeros((len(tokens), len(chars)))
    for t, token in enumerate(tokens):
        # Set the column assigned to this token to 1; the rest stay 0.
        one_hot_vector[t, char_indices[token]] = 1
The matrix that represents our GUI file, with one row per line of the file, looks like this:
    array([[0., 0., 1., 0., 0., 0., 0., 0., 0.],
           [0., 1., 0., 0., 0., 0., 0., 0., 0.],
           [1., 0., 0., 0., 0., 0., 0., 0., 0.],
           [0., 0., 0., 0., 1., 0., 0., 0., 0.],
           [0., 0., 0., 1., 0., 0., 0., 0., 0.],
           [0., 0., 0., 0., 0., 0., 0., 1., 0.],
           [1., 0., 0., 0., 0., 0., 0., 0., 0.],
           [0., 0., 0., 1., 0., 0., 0., 0., 0.],
           [0., 0., 0., 0., 0., 0., 0., 0., 1.],
           [1., 0., 0., 0., 0., 0., 0., 0., 0.],
           [0., 0., 0., 1., 0., 0., 0., 0., 0.],
           [0., 0., 0., 0., 0., 0., 1., 0., 0.],
           [1., 0., 0., 0., 0., 0., 0., 0., 0.],
           [0., 0., 0., 1., 0., 0., 0., 0., 0.],
           [0., 0., 0., 0., 0., 0., 0., 1., 0.],
           [1., 0., 0., 0., 0., 0., 0., 0., 0.],
           [1., 0., 0., 0., 0., 0., 0., 0., 0.],
           [0., 0., 0., 0., 1., 0., 0., 0., 0.],
           [0., 0., 0., 0., 0., 1., 0., 0., 0.],
           [0., 0., 0., 0., 0., 0., 1., 0., 0.],
           [1., 0., 0., 0., 0., 0., 0., 0., 0.],
           [1., 0., 0., 0., 0., 0., 0., 0., 0.]])
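One way to sanity-check the encoding is to decode it back into tokens and confirm the round trip is lossless. The sketch below is illustrative only: it uses plain Python lists instead of NumPy so that it has no dependencies, the helper names one_hot_encode and one_hot_decode are hypothetical (they are not part of convert_gui.py), and the sample token list is a shortened stand-in for a real .gui file.

```python
def one_hot_encode(tokens):
    """Return (rows, char_indices, indices_char) for a list of tokens."""
    vocab = sorted(set(tokens))
    char_indices = {c: i for i, c in enumerate(vocab)}
    indices_char = {i: c for i, c in enumerate(vocab)}
    rows = []
    for token in tokens:
        row = [0.0] * len(vocab)          # all zeroes ...
        row[char_indices[token]] = 1.0    # ... except the token's own index
        rows.append(row)
    return rows, char_indices, indices_char


def one_hot_decode(rows, indices_char):
    """Map each one-hot row back to its token via the position of the 1."""
    return [indices_char[row.index(1.0)] for row in rows]


# Shortened, made-up token list in the same style as the example above.
sample = ['header ', 'btn-active', ' ', 'row ', 'single ', ' ']
encoded, char_indices, indices_char = one_hot_encode(sample)
decoded = one_hot_decode(encoded, indices_char)
```

If decoded equals sample, the two dictionaries are consistent with each other, which is worth checking before any training data is generated from them.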
Proposed Training Model
Given the preprocessing described above, one way we can train our model to detect all the markup in an HTML page is as follows.
(1) First, store all the tokens in our training data in two Python dictionaries for later use: one with the format {token_index: token} (indices_char above) and one with the format {token: token_index} (char_indices above).
(2) Decide on a proper length for each training point. For instance, we can set the max length of a training point to 20 lines, and pages that contain fewer than 20 lines will be padded with zeroes. With a max length of 20, a file with 17 lines is padded with [0, 0, 0...,0] for the remaining 3 rows.
(3) Apply one-hot encoding, as described in the previous section, to the whole training dataset. Each data point (one DSL file) will then have shape [max length, number of unique tokens].
(4) Define y. We are trying to predict the next token given the previous tokens, so the label y for each training point is the one-hot representation of the token that follows its sequence in the training data.
(5) Use an LSTM, or possibly a bidirectional LSTM, to train on the data in mini-batches with the Adam optimizer. Each batch that goes into the model has shape [batch size, max length, number of unique tokens].
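The padding in step (2) and the batch shape in step (5) can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: pad_rows is a hypothetical helper, plain lists stand in for NumPy arrays, and the 17-line file is synthetic.

```python
def pad_rows(rows, max_len, vocab_size):
    """Append all-zero rows until the data point has exactly max_len rows."""
    if len(rows) > max_len:
        raise ValueError("file has more lines than max_len")
    padding = [[0.0] * vocab_size for _ in range(max_len - len(rows))]
    return rows + padding


vocab_size = 9
# Synthetic 17-line file: every line is the token with index 0.
short_file = [[1.0] + [0.0] * (vocab_size - 1) for _ in range(17)]

padded = pad_rows(short_file, max_len=20, vocab_size=vocab_size)

# A mini-batch is a list of padded data points:
# [batch size, max length, number of unique tokens]
batch = [padded, padded]
batch_shape = (len(batch), len(batch[0]), len(batch[0][0]))
```

An all-zero padding row never coincides with the one-hot row of a real token, so padded positions stay distinguishable from lines that encode the token at index 0.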
A sample LSTM cell in TensorFlow is as follows:
    import tensorflow as tf  # TensorFlow 1.x API; tf.contrib was removed in 2.x

    def lstm_cell(keep_prob):
        '''
        Define one single LSTM cell
        args:
            keep_prob: tensor scalar, dropout keep probability for the cell output
        '''
        # num_units is the number of hidden units in the LSTM cell (see utils.py).
        if tf.test.is_gpu_available():
            lstm = tf.contrib.cudnn_rnn.CudnnCompatibleLSTMCell(num_units)
        else:
            lstm = tf.nn.rnn_cell.LSTMCell(num_units, forget_bias=1.0)
        lstm = tf.contrib.rnn.DropoutWrapper(lstm, output_keep_prob=keep_prob)
        return lstm
Then, we unroll the cell over the time steps to build our network. tf.nn.dynamic_rnn does this for us, using a tf.while_loop internally. The sample code is
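    def lstm_network(x, W, b, keep_prob):
        '''
        Define stacked cells and prediction
        x: data with shape [batch_size, max_len, len_unique_tokens]
        W: output weights with shape [num_units, len_unique_tokens]
        b: output bias with shape [len_unique_tokens]
        '''
        lstm = lstm_cell(keep_prob)
        outputs, states = tf.nn.dynamic_rnn(lstm, x, dtype=tf.float32)
        # states.h is the final hidden state, shape [batch_size, num_units].
        prediction = tf.add(tf.matmul(states.h, W), b, name='prediction')
        return prediction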
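If we want to stack multiple LSTM layers together, we can replace lstm=lstm_cell(keep_prob) with lstm= tf.contrib.rnn.MultiRNNCell([lstm_cell(keep_prob) for _ in range(num_layers)]) where num_layers is an integer representing the number of LSTM layers we want. In that case tf.nn.dynamic_rnn returns one state per layer, so the prediction should use the last layer's state, states[-1].h, instead of states.h.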
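The network returns one score per unique token for each item in the batch. The source does not say how these scores are turned back into a token, so the sketch below assumes the usual approach of applying a softmax and taking the most probable index. It is written in plain Python rather than TensorFlow so that it runs on its own; softmax and predict_token are hypothetical helper names, and the three-token vocabulary is made up for illustration.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def predict_token(logits, indices_char):
    """Return the most probable next token and its probability."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return indices_char[best], probs[best]


# Made-up scores for a made-up three-token vocabulary.
example_indices_char = {0: ' ', 1: 'header ', 2: 'row '}
token, prob = predict_token([0.1, 2.0, -1.0], example_indices_char)
```

Here the lookup table plays the role of indices_char from the preprocessing section: the index with the highest probability is mapped straight back to its DSL token.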
Sample training code lives in
E:\projects\embedding\Web_extractor_model\train_sample.py
In the utils.py file, there are a few hyperparameters to remember.
max_len: the length of each training point, in tokens
step: the number of tokens the window moves forward to generate the next training point
num_units: the number of hidden units in the LSTM; 128 is a safe choice
len_unique_chars: total number of unique tokens in all training data
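To make the roles of max_len and step concrete, here is one way training points and their next-token labels could be generated with a sliding window. This is a sketch under assumptions: make_training_points is a hypothetical helper, it works on token indices rather than one-hot rows to keep the example short, and the actual logic in utils.py may differ.

```python
def make_training_points(token_ids, max_len, step):
    """Slide a window of max_len tokens over the sequence, moving by step.

    Each window is one training point; its label is the token that
    immediately follows the window.
    """
    inputs, targets = [], []
    for start in range(0, len(token_ids) - max_len, step):
        inputs.append(token_ids[start:start + max_len])
        targets.append(token_ids[start + max_len])
    return inputs, targets


# Made-up sequence of ten token indices.
token_ids = list(range(10))
inputs, targets = make_training_points(token_ids, max_len=4, step=2)
# inputs  -> [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7]]
# targets -> [4, 6, 8]
```

A smaller step produces more, heavily overlapping training points from the same file; a larger step produces fewer points with less overlap.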