Revision as of 15:12, 26 September 2017
Industry Classifier - by Yang Zhang
Goal:
For each company, we want to classify its industry based on its description.
Approach:
Step 1: encode the text description into numerical values.
Step 2: build a deep neural network to learn to classify.
For step 1, a very naive way is to use the "bag of words" representation. The obvious drawbacks are that it ignores the correlations between words as well as their relative order. So, instead, we use "word2vec" (https://en.wikipedia.org/wiki/Word2vec), a method in which, in short, each word is mapped to a vector representing how likely the other words are to appear around that center word.
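To see the word-order drawback concretely, here is a minimal pure-Python sketch (the vocabulary and sentences are made-up illustrations, not from the actual dataset): two descriptions with the same words in a different order produce identical bag-of-words vectors, so any classifier built on top of them cannot tell the two apart.

```python
from collections import Counter

def bag_of_words(text, vocab):
    """Count how many times each vocabulary word occurs in the text."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

vocab = ["software", "company", "sells", "buys"]
a = bag_of_words("Company sells software", vocab)
b = bag_of_words("Software sells company", vocab)
# Both sentences map to the same vector [1, 1, 1, 0]: word order is lost.
print(a == b)  # True
```

word2vec avoids part of this problem by giving each word a dense vector learned from its context, so semantically related words end up with similar representations.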
For step 2, we have tried 1D/2D convolutional NNs (Neural Networks) and LSTM RNNs (Recurrent Neural Networks). All the models achieve 90+% training accuracy and around 60% testing accuracy. Note that this task is hard even for humans, and the baseline of random guessing is around 10%, so 60% is acceptable. Tuning the parameters doesn't help much, which suggests we may have reached the models' maximum capability.
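The 1D convolution at the heart of the convolutional model can be sketched in a few lines of pure Python (the toy word vectors and filter here are illustrative assumptions, not the trained weights): a filter of width k slides along the sequence of word vectors and produces one scalar per window, building up a feature map that a later layer classifies.

```python
def conv1d(seq, kernel):
    """Slide a (k x d) kernel over a sequence of d-dim word vectors.
    Returns one scalar per window (a single feature map, no padding)."""
    k = len(kernel)
    out = []
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        # Elementwise product of the kernel and the window, summed.
        out.append(sum(w * v
                       for krow, vrow in zip(kernel, window)
                       for w, v in zip(krow, vrow)))
    return out

# Toy 2-dim word vectors for a 4-word description.
seq = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
kernel = [[1.0, 1.0], [1.0, 1.0]]   # width-2 filter
print(conv1d(seq, kernel))  # [2.0, 3.0, 2.0]
```

In the real models a deep-learning library applies many such filters in parallel and learns the kernel weights by backpropagation; this sketch only shows the sliding-window operation itself.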
Next steps: Try longer descriptions and see if the additional information gives us better accuracy.