'''To advanced users:'''
1. An important step in data preprocessing is encoding words (strings) as integers. We do this by building a dictionary that maps each word to its index (for example, if "hello" is the 17th word in the dictionary, then "hello" -> 17). The dictionary is ordered by word frequency: the higher the frequency, the smaller the index, so you should expect very common words such as "the" and "a" to get the smallest indices: 2, 3, 4, .... Note that indices 0 and 1 are intentionally not assigned to any word. Any word that is not in the dictionary is encoded as index 1. The advantage of this scheme is that you can easily ignore common, uninformative words by keeping only words with index > 20, for example, and the same threshold also drops out-of-dictionary words.
2. Saving the preprocessed data to a pickle file makes it quick to reload, so you don't need to redo the preprocessing every time you want to run your classifier.
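
The two notes above can be sketched roughly as follows. This is only an illustration, not the actual preprocessing code: the function names (`build_dictionary`, `encode`, `save_dataset`, `load_dataset`), the `min_index` parameter, and the assumption that sentences are already tokenized into lists of words are all hypothetical.

```python
import pickle
from collections import Counter

UNK_INDEX = 1    # out-of-dictionary words are encoded as 1 (see note 1)
FIRST_INDEX = 2  # indices 0 and 1 are intentionally not assigned to any word


def build_dictionary(tokenized_sentences):
    """Map each word to an index; more frequent words get smaller indices."""
    counts = Counter(word for sentence in tokenized_sentences for word in sentence)
    # most_common() returns words sorted by descending frequency.
    return {word: rank + FIRST_INDEX
            for rank, (word, _) in enumerate(counts.most_common())}


def encode(sentence, dictionary, min_index=0):
    """Encode a tokenized sentence, keeping only indices >= min_index."""
    ids = [dictionary.get(word, UNK_INDEX) for word in sentence]
    return [i for i in ids if i >= min_index]


def save_dataset(path, data):
    """Pickle the preprocessed data so it can be reloaded without reprocessing."""
    with open(path, "wb") as f:
        pickle.dump(data, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_dataset(path):
    """Load data previously written by save_dataset."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

For example, `encode(sentence, dictionary, min_index=21)` would keep only words with index > 20, which drops both the most frequent words and any out-of-dictionary word (index 1) in one step.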