* [https://openreview.net/pdf?id=BkSqjHqxg Skip-Graph: Learning Graph Embeddings with an Encoder-Decoder Model (Lee, Kong)]
: This paper applies the skip-gram model to learning node representations of graph-structured data. The proposed encoder-decoder model can be trained to generate representations for arbitrary random-walk sequences.
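: As a rough illustration of the random-walk pipeline this line of work builds on, the sketch below generates uniform random walks over a small graph and trains gensim's skip-gram Word2Vec on them. This is a simplified, DeepWalk-style baseline rather than the paper's encoder-decoder; the libraries, graph, and hyperparameters are assumptions, not the authors' setup.
<syntaxhighlight lang="python">
# Sketch: node embeddings from random walks + skip-gram (DeepWalk-style),
# illustrating the walk-sequence pipeline; not the Skip-Graph authors' code.
# Assumes networkx and gensim 4.x are installed.
import random
import networkx as nx
from gensim.models import Word2Vec

def random_walks(graph, walks_per_node=10, walk_length=20, seed=0):
    """Generate uniform random-walk node sequences from a graph."""
    rng = random.Random(seed)
    walks = []
    for _ in range(walks_per_node):
        for start in graph.nodes():
            walk = [start]
            while len(walk) < walk_length:
                neighbors = list(graph.neighbors(walk[-1]))
                if not neighbors:
                    break
                walk.append(rng.choice(neighbors))
            walks.append([str(node) for node in walk])  # gensim expects string tokens
    return walks

graph = nx.karate_club_graph()
walks = random_walks(graph)

# sg=1 selects the skip-gram objective; window sets the context size within a walk.
model = Word2Vec(walks, vector_size=64, window=5, min_count=0, sg=1, epochs=5)
print(model.wv["0"][:5])  # first few dimensions of node 0's embedding
</syntaxhighlight>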
* [https://www.kdd.org/kdd2018/files/deep-learning-day/DLDay18_paper_27.pdf Learning Graph Representations with Recurrent Neural Network Autoencoders (Taheri, Gimpel, Berger-Wolf)]
: In a similar process to other neural network encoders, this architecture first generates sequential data from graphs using BFS, shortest-path, and random-walk algorithms, then trains LSTM autoencoders to embed these graph sequences into a vector space.
* [https://arxiv.org/abs/1805.07683 Learning Graph Representations with Recurrent Neural Networks (Jin, JaJa)]
: This article describes another approach to learning graph-level representations, this time through a combination of supervised and unsupervised learning components.
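: The sketch below illustrates the sequence-autoencoder idea shared by these two entries: random walks are turned into sequences, an LSTM encoder-decoder is trained to reconstruct them, and the final encoder state serves as the embedding. It assumes PyTorch and NetworkX; the degree-sequence featurization and layer sizes are illustrative choices, not the authors' configurations.
<syntaxhighlight lang="python">
# Sketch: graph sequences -> LSTM autoencoder -> vector embedding.
# Assumes networkx and PyTorch are installed; featurization is illustrative.
import random
import networkx as nx
import torch
import torch.nn as nn

def walk_degree_sequence(graph, walk_length=20, seed=0):
    """One random walk, encoded as the degree of each visited node."""
    rng = random.Random(seed)
    node = rng.choice(list(graph.nodes()))
    seq = []
    for _ in range(walk_length):
        seq.append(float(graph.degree(node)))
        neighbors = list(graph.neighbors(node))
        if not neighbors:
            break
        node = rng.choice(neighbors)
    return torch.tensor(seq).view(1, -1, 1)  # (batch, time, feature)

class SeqAutoencoder(nn.Module):
    """LSTM encoder-decoder that reconstructs its input sequence; the final
    encoder hidden state is used as the sequence (graph) embedding."""
    def __init__(self, hidden=32):
        super().__init__()
        self.encoder = nn.LSTM(1, hidden, batch_first=True)
        self.decoder = nn.LSTM(1, hidden, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):
        _, (h, c) = self.encoder(x)             # h: (1, batch, hidden)
        zeros = torch.zeros_like(x)             # decoder is driven only by the state
        dec_out, _ = self.decoder(zeros, (h, c))
        return self.out(dec_out), h.squeeze(0)  # reconstruction, embedding

graph = nx.karate_club_graph()
model = SeqAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(200):
    x = walk_degree_sequence(graph, seed=step)
    recon, embedding = model(x)
    loss = loss_fn(recon, x)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(embedding.detach().numpy())  # vector representation of the last walk
</syntaxhighlight>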
* [http://cs229.stanford.edu/proj2013/YaoZuo-AMachineLearningApproachToWebpageContentExtraction.pdf A Machine Learning Approach to Webpage Content Extraction (Yao, Zuo)]
: This article describes methods to simplify noisy HTML pages, specifically by using machine learning to predict whether each block is content or non-content, which lets the classifier strip out boilerplate.
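: A minimal sketch of this block-level content/boilerplate classification idea follows, assuming BeautifulSoup and scikit-learn. The features (text length, link density) and the tiny hand-labeled training set are illustrative assumptions, not the paper's feature set or data.
<syntaxhighlight lang="python">
# Sketch: classify HTML blocks as content vs. boilerplate with simple features.
# Assumes beautifulsoup4 and scikit-learn are installed; labels are hypothetical.
from bs4 import BeautifulSoup
from sklearn.linear_model import LogisticRegression

def extract_blocks(html):
    """Treat <p> and <div> elements as candidate content blocks."""
    soup = BeautifulSoup(html, "html.parser")
    return soup.find_all(["p", "div"])

def block_features(tag):
    """Per-block features: total text length and link density."""
    text = tag.get_text(" ", strip=True)
    link_text = " ".join(a.get_text(" ", strip=True) for a in tag.find_all("a"))
    text_len = len(text)
    link_density = len(link_text) / text_len if text_len else 0.0
    return [text_len, link_density]

# Hypothetical labeled page fragment: 1 = content, 0 = boilerplate.
train_html = """
<div><a href="/">Home</a> <a href="/about">About</a></div>
<p>Researchers propose a new method for extracting article text from pages.</p>
<div>Copyright 2013 <a href="/terms">Terms</a></div>
<p>The approach trains a classifier on block-level features such as length.</p>
"""
blocks = extract_blocks(train_html)
X = [block_features(b) for b in blocks]
y = [0, 1, 0, 1]

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Keep only blocks the classifier predicts to be content.
for block, label in zip(blocks, clf.predict(X)):
    if label == 1:
        print(block.get_text(" ", strip=True))
</syntaxhighlight>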
* [https://ieeexplore.ieee.org/abstract/document/1683775]
* [https://dl.acm.org/citation.cfm?id=565137 A Brief Survey of Web Data Extraction Tools (Laender et al.)]
: This article classifies web data extraction techniques into five types of tools, plus one category of languages designed specifically for web extraction. Section 3.2 (HTML-aware Tools) describes several existing tools that parse HTML tree structures to build wrappers. Section 3.4 (NLP-based Tools) covers several text-analysis methods that may be relevant to this project.
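: For the HTML-aware tools of Section 3.2, a hand-written wrapper typically navigates the parsed HTML tree with structural rules to pull out records. The sketch below shows that general pattern with BeautifulSoup; the selectors and the example page are hypothetical and not taken from any tool cited in the survey.
<syntaxhighlight lang="python">
# Sketch: a hand-written HTML-aware wrapper that walks the parsed tree
# with structural (CSS-selector) rules. Assumes beautifulsoup4 is installed.
from bs4 import BeautifulSoup

html = """
<ul id="papers">
  <li><span class="title">Skip-Graph</span> <span class="year">2017</span></li>
  <li><span class="title">Graph RNN Autoencoders</span> <span class="year">2018</span></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")
records = []
for item in soup.select("#papers li"):
    records.append({
        "title": item.select_one(".title").get_text(strip=True),
        "year": item.select_one(".year").get_text(strip=True),
    })
print(records)  # structured records extracted from the HTML tree
</syntaxhighlight>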