Revision as of 17:48, 3 May 2019
DSL Encoding | |
---|---|
Project Information | |
Has title | DSL Encoding |
Has owner | Hiep Nguyen |
Has start date | 2019/04/26 |
Has deadline date | |
Has project status | Active |
Copyright © 2019 edegan.com. All Rights Reserved. |
Approach
I am currently considering one-hot vectors to encode the structure of a DSL page. The author of the pix2code project took the same approach. However, the paper does not discuss the preprocessing in detail, and the source code is sparsely commented. This article gives more detailed instructions for the embedding method. For our project, we can ignore the image preprocessing and focus solely on the text processing. The associated GitHub page can be found here.
Files and scripts
The current scripts, which I wrote by following the pix2code source code, live in
E:/projects/embedding
So far, I have been experimenting with only one DSL file, '00CDC9A8-3D73-4291-90EF-49178E408797.gui'. To see the current output (not yet one-hot encoded), run
python convert_gui.py
Implementation
One-hot encoding represents a word or token as a vector that is all zeroes except for a single one at the position assigned to that token. The length of the vector equals the number of unique tokens (the vocabulary size) in the DSL file. Let's look at a concrete DSL file from pix2code as an example. The process is as follows:
gui = open('00CDC9A8-3D73-4291-90EF-49178E408797.gui')
tokens = []
for line in gui:
    line = line.strip('\n').strip('}').strip('{')
    tokens.append(line)
    print(line)
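What we just did was open a DSL file, go through every line, strip some symbols (the newline and the curly braces), and store the result in a list. Each entry is still a whole line, so a comma-separated line such as 'small-title, text, btn-orange' is one entry, and a line that contained only a closing brace becomes an empty string. The tokens variable now looks something like this: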
['header ',
 'btn-inactive, btn-active, btn-inactive, btn-inactive, btn-inactive',
 '',
 'row ',
 'quadruple ',
 'small-title, text, btn-orange',
 '',
 'quadruple ',
 'small-title, text, btn-red',
 '',
 'quadruple ',
 'small-title, text, btn-green',
 '',
 'quadruple ',
 'small-title, text, btn-orange',
 '',
 '',
 'row ',
 'single ',
 'small-title, text, btn-green',
 '',
 '']
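The list above still holds whole lines, so a possible next step is to split it into individual tokens and map each one to a one-hot vector. The following is only a minimal sketch of that idea, not the pix2code implementation or the contents of convert_gui.py. The helper names (split_tokens, build_vocabulary, one_hot) and the shortened lines list are made up for illustration.

```python
# Hypothetical sketch: turn the stripped lines into single tokens and
# one-hot encode them. The names below are illustrative assumptions.

# A shortened stand-in for the `tokens` list produced above.
lines = ['header ',
         'btn-inactive, btn-active, btn-inactive',
         '',
         'row ',
         'single ',
         'small-title, text, btn-green',
         '',
         '']


def split_tokens(lines):
    """Split comma-separated lines into single tokens, dropping empty entries."""
    result = []
    for line in lines:
        for part in line.split(','):
            part = part.strip()
            if part:
                result.append(part)
    return result


def build_vocabulary(tokens):
    """Assign each unique token an index, in order of first appearance."""
    vocab = {}
    for tok in tokens:
        if tok not in vocab:
            vocab[tok] = len(vocab)
    return vocab


def one_hot(token, vocab):
    """Return a list of zeroes with a single 1 at the token's index."""
    vec = [0] * len(vocab)
    vec[vocab[token]] = 1
    return vec


flat = split_tokens(lines)
vocab = build_vocabulary(flat)
encoded = [one_hot(tok, vocab) for tok in flat]

for tok, vec in zip(flat, encoded):
    print(tok, vec)
```

Every vector has length len(vocab) and contains exactly one 1. Because the braces were stripped earlier, this sketch discards the nesting of the DSL. If the structure matters, one option is to keep '{' and '}' as tokens of their own instead of stripping them.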