Difference between revisions of "Deep Text Classifier"

From edegan.com

Revision as of 14:12, 12 October 2017

Deep Text Classifier

Problem Description

We want to build a classifier for text input. For example, we may want to classify a company's industry area, or its IPO status, based on its description.

General Approach

We will build a deep neural network to solve this problem in a uniform way. The traditional approach is to hire a task-specific expert to manually design useful features, for example checking whether the text contains the words "Internet" and "High-tech" at the same time, and then to classify based on the observed features. Our approach, using a deep neural network, extracts the features automatically and, most importantly, can achieve high testing accuracy. However, the features used by the deep neural network are not human-interpretable.
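To make the contrast concrete, here is a minimal sketch of the kind of hand-designed feature described above. It encodes the keyword rule from the example. The function names and the label strings are hypothetical and used only for illustration; this is not part of the classifier's code.

```python
def has_internet_and_hightech(description):
    """Hand-designed feature: True when both keywords appear in the text."""
    text = description.lower()
    return "internet" in text and "high-tech" in text


def classify_by_rule(description):
    # Hypothetical labels, used only for illustration.
    return "technology" if has_internet_and_hightech(description) else "other"


print(classify_by_rule("A high-tech company selling Internet services"))
print(classify_by_rule("A family-owned bakery"))
```

Every such rule has to be written and maintained by hand for each task, which is the cost the deep neural network avoids.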

About the Deep Models

There are two broad categories of deep neural networks: convolutional neural networks (CNNs) and recurrent neural networks (RNNs). CNNs are better suited to image-based classification tasks. RNNs are, in general, used for classification tasks based on sequential information (e.g. language, video). I have tried both kinds of models and, as expected, the RNN is more robust across different text classification tasks.

Major Package Dependencies

How to Run the Code

The code contains two parts: Data Preprocessing and Model Training/Prediction.

Data Preprocessing (preprocessing.py) : this is where you convert a text-based "XXX.txt" input file into a numerical pickle file that the later part of the code can understand and use for training and prediction.

  • Step 1 : specify the target file name in "main()"
  # don't add ".txt" extension
  file_name = 'ThicketDefCodingTestProcessed'
  • Step 2 : specify the expected columns of your target file in "main()"
  # expected number of columns, in case we have "None" in the table
  expected_columns = 5
  • Step 3 : specify the indices of the text and the label in "prepare_imdb_structure(file_name, expected_columns)"
  # the index of the label in the tokens
  label_index = 1
  # the index of the text in the tokens
  content_index = 4
  • Step 4 : run the code
  python preprocessing.py 
  • Step 5 : give your pickle file a more reasonable name
Attention: by default, the name of the pickle file is the same as that of the original ".txt" file. However, it is highly likely that you will use the same text inputs to predict different things, so it is important to give your pickle file a more descriptive name each time you run the above script. For example, rename "longdescriptions.pkl" to "longdescriptions_indu.pkl" to indicate that we are predicting the industry areas, or to "longdescriptions_ipo.pkl" to indicate that we are predicting the IPO status. If you don't do this, pickle files generated later will overwrite the ones generated earlier.
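The renaming in Step 5 can also be scripted. The following is a minimal sketch, assuming the pickle file is written next to the input file with a ".pkl" extension; the helper name `rename_pickle` and the `label_tag` argument are hypothetical and not part of preprocessing.py.

```python
import os


def rename_pickle(file_name, label_tag):
    """Rename '<file_name>.pkl' to '<file_name>_<label_tag>.pkl'.

    file_name is given without the ".pkl" extension, mirroring how
    the ".txt" file name is specified in the preprocessing step.
    """
    source = file_name + ".pkl"
    target = "{}_{}.pkl".format(file_name, label_tag)
    # Refuse to clobber a pickle file from an earlier run.
    if os.path.exists(target):
        raise FileExistsError(target + " already exists; refusing to overwrite it")
    os.rename(source, target)
    return target
```

For example, `rename_pickle('longdescriptions', 'ipo')` would produce "longdescriptions_ipo.pkl", provided "longdescriptions.pkl" exists in the working directory.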

Model Training/Prediction (classification_MMM_LLL.py) : this is where the deep neural network is. "MMM" represents the model; currently I have "1DConvolution", "2DConvolution" and "LSTM". "LLL" represents the name of the label. Notice that, for the same text inputs, we can predict different things using exactly the same model. For example, "classification_LSTM_indu.py" is an LSTM model that predicts the industry based on the descriptions, and "classification_LSTM_ipo.py" is an LSTM model that predicts the IPO status based on the same descriptions. Again, you need to name your files properly! Different tasks will have different hyper-parameter configurations, even though the model and the inputs can be exactly the same. This Python file, whatever the model is, always loads a pickle file that you generated in the previous step and trains the neural network. At the end, the trained neural network predicts on your test examples (the examples not seen during training) and prints the accuracy.

  • Step 1 : specify the name of the pickle file
  with open('longdescription_ipo.pkl', 'rb') as file:
  • Step 2 : specify the total number of possible labels
  model.add(Dense(2, activation='softmax'))
  • Step 3 : run the code
  python classification_LSTM_ipo.py
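The number passed to the final softmax layer in Step 2 has to equal the number of distinct labels in your data. A minimal sketch for counting them is shown below; the helper name `count_labels` and the example label values are hypothetical, and how you obtain the list of labels depends on your own data.

```python
def count_labels(labels):
    """Return the number of distinct label values."""
    return len(set(labels))


# Hypothetical label list; in practice it would come from your own data.
example_labels = ["ipo", "no_ipo", "ipo", "no_ipo", "no_ipo"]
num_labels = count_labels(example_labels)
print(num_labels)
```

The resulting value is what would replace the `2` in `Dense(2, activation='softmax')` when the task has a different number of labels.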

Data Preprocessing

For data preprocessing, we adopt the same standard as in the IMDB dataset (http://ai.stanford.edu/~amaas/data/sentiment/).

  1. To general users: your input (usually a single ".txt" file containing many examples, one per row) will be split into a training set (80% by default) and a testing set (20% by default). The labels you want to predict will become the folder names. The content of each example (usually a block of text) will go into its own ".txt" file. To run the script, you need to specify the following:
 "File Name" : the input file name, without the ".txt" extension
 "Expected Columns" : the total number of columns in the input file
 "Content Index" : the column index of the content
 "Label Index" : the column index of the label

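The folder layout described above can be sketched as follows. This is a minimal illustration and not the code in preprocessing.py: the function name `build_imdb_style_folders`, the tab delimiter, the fixed shuffle seed, and the choice to skip rows with an unexpected number of columns are all assumptions. It writes one ".txt" file per example under `<output_dir>/<train|test>/<label>/`, with the label as the folder name, similar to the way the IMDB dataset is organised.

```python
import os
import random


def build_imdb_style_folders(input_path, output_dir, expected_columns,
                             content_index, label_index,
                             train_fraction=0.8, delimiter="\t", seed=0):
    """Split rows into train/test and write one ".txt" file per example.

    Layout written: <output_dir>/<train|test>/<label>/<n>.txt
    Returns (number of training examples, number of testing examples).
    """
    with open(input_path, encoding="utf-8") as handle:
        rows = [line.rstrip("\n").split(delimiter) for line in handle]

    # Assumption: rows without the expected number of columns are skipped.
    rows = [row for row in rows if len(row) == expected_columns]

    # Shuffle with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(rows)
    n_train = int(len(rows) * train_fraction)

    for position, row in enumerate(rows):
        subset = "train" if position < n_train else "test"
        label_dir = os.path.join(output_dir, subset, row[label_index])
        os.makedirs(label_dir, exist_ok=True)
        example_path = os.path.join(label_dir, "{}.txt".format(position))
        with open(example_path, "w", encoding="utf-8") as out:
            out.write(row[content_index])

    return n_train, len(rows) - n_train
```

With the settings shown in "How to Run the Code" (5 expected columns, content index 4, label index 1), a call would look like `build_imdb_style_folders('ThicketDefCodingTestProcessed.txt', 'out', 5, 4, 1)`, where the output folder name 'out' is a placeholder.
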
Model Training/Prediction

General Guidelines for Tuning the Hyper-Parameters