Patent Validity Ideas for ML

From edegan.com
Revision as of 21:43, 13 July 2017 by OliverC (talk | contribs) (→‎Primer)
(diff) ← Older revision | Latest revision (diff) | Newer revision → (diff)
Jump to navigation Jump to search

Research Question: Can patent grants be predicted given

Input: patent application text

Output: patent was granted? (boolean)


Primer

See

SOA Natural Language Feature Extraction

In a machine learning setting, this problem is a supervised recommender system.

Recent machine learning + NLP research focuses on the idea of RNNs, Recurrent Neural Networks. For a primer, see Karpathy's blog post on The Unreasonable Effectiveness of RNNs. These techniques are generally sequence-to-sequence (seq2seq) which are the state-of-the-art in the machine translation of human languages. Crucially, RNNs have popularized LSTM (Long Short-Term Memory) sub-networks which handle relatively "far away" grammatical dependencies in sequences like English. Note that RNNs are extremely computationally expensive to train.


Data Preprocessing

Since the domain of patent data text is the English language, the input to the machine learning model must undergo dimensionality reduction, either explicitly as preprocessing or implicitly in layers of the model. Given a constrained input source like an XML document, it would be preferable to have no XML markup to boost contrast between different inputs.

Negative sample generation is how many image models achieve high accuracy. We can imagine some algebra over patent applications such that two rejected applications randomly combined will probably be rejected. And so on for the truth tables of AND and OR. This approach relies on the algebra operating on enough random bad and good combinations that a clear signal emerges that is predictive. I know this approach has been used in machine learning models.


Next Steps

So to begin, we could use a autoencoder to do dimensionality reduction and frame the problem as a classical binary classification problem to get a baseline? Then move on to a classical RNN model and then some linear combination of the two models like in Google's Wide & Deep Paper?