For any given tree, we have n-1 (I'm assuming n = number of nodes) edges. For every edge, we can record the two ending vertices. This will result in a matrix of dimensions (n-1) x 2. This matrix approach has an algorithmic efficiency of O(n).
==== node2vec ====
==== Supervised Learning Approach (HTML to DSL) ====
*[https://arxiv.org/abs/1210.6113 Using the DOM Tree for Content Extraction (Lopez, Silva, Insa)]
: This paper presents a method of content extraction that analyzes the relationships between DOM elements based on a "chars-node ratio" that displays the relationship between text content and tags content in each node of the DOM tree. The authors of this paper implemented this technique in an open-source Firefox plugin.
*[https://dl.acm.org/citation.cfm?id=775182 DOM-based Content Extraction of HTML Documents (Gupta et al.)]
: The approach delineated in this article parses HTML documents into DOM trees using openXML. Then, a recursive content extractor process each DOM tree and removes any non-content nodes.
*[https://link.springer.com/chapter/10.1007/3-540-36901-5_42 Extracting Content Structure for Web Pages Based On Visual Representation (Cai et al.)]
: This paper proposes a Vision-based Page Segmentation (VIPS) algorithm that creates an extracted content structure, with each node of the structure corresponding to a block of coherent content on the page. Although this method focuses on semantically dividing the page, the algorithm uses page layout features to detect the page's structure.