Structurally analyzing the HTML tree structure of a web page and expressing it in a DSL is the most innovative method of the three. It would require more than simply adapting off-the-shelf models. First, the DSL itself would need to be designed to optimize abstraction into the target domain, a web page. (See Domain Specific Language Research.) Then, the DSL would need to be integrated into the machine learning pipeline by encoding the DSL into an appropriately formatted input, such as a vector or matrix, for a neural network. Three proposed methods for this encoding are using an adjacency matrix, an edges to vertices approach, or utilizing DFS (depth-first search) algorithms.
==== DFS Encoding =====
Currently, we are leaning towards utilizing DFS algorithms. A depth-first search algorithm could traverse any given tree and record 1 when a new node is found, and 0 when that node is fully explored. This creates a numerical representation of that tree that can then be entered into a vector or matrix.
==== Supervised Learning Approach ====