Sentiment analysis on Twitter posts part 1.

By Przemysław Przybyt


    Hi! I’m Przemysław from AI Development Company Profil Software. This is my second article about AI. If you want to find out how easily object classification can be fooled, have a look at my first article.

    This time I will try to present some hands-on examples of how to deal with simple NLP tasks like sentiment analysis. The solutions used in this article could be easily reused in other classification tasks. We will traverse the rough trails of AI starting with data cleaning and preprocessing, then move on to model definition and visualising the results. The solution was prepared using Google Colab and will be shared at the end. So are you ready? ;)


    When using Google Colab, no external packages need to be installed to run the solution. Libraries like scikit-learn are already included. When moving to another environment, you can list the installed dependencies and their versions by running the !pip freeze command in a notebook cell.


    In this article I will be using a Twitter dataset from a Kaggle competition. The dataset consists of 1.6M tweets written in English and extracted using the Twitter API. They are grouped into 2 classes (named targets):

    • negative (target 0 in csv)
    • positive (target 4 in csv)

    The dataset also contains other columns, such as the date and the user that posted the tweet. For the purpose of this article, we will be using the text and target columns only. The distribution of the data over the 2 classes can also be useful for further processing. The part of the code responsible for loading the dataset and counting the target distribution is presented below:
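
    A minimal sketch of such a loading step, not the original notebook code. The column names, the headerless layout and the latin-1 encoding are assumptions about the CSV file, and the two in-memory rows are made up so that the example runs without the real dataset:

```python
import io

import pandas as pd

# Assumed column layout of the headerless CSV; only "target" and "text" are kept.
COLUMNS = ["target", "id", "date", "flag", "user", "text"]


def load_tweets(source):
    """Read the raw CSV (path or file-like object) and keep target and text."""
    df = pd.read_csv(source, header=None, names=COLUMNS, encoding="latin-1")
    return df[["target", "text"]]


# Two made-up rows standing in for the real file, for illustration only.
sample_csv = io.StringIO(
    "0,1,Mon Apr 06 2009,NO_QUERY,user_a,so tired today\n"
    "4,2,Mon Apr 06 2009,NO_QUERY,user_b,loving this weather\n"
)
df = load_tweets(sample_csv)

# Count how many tweets fall into each target class.
target_counts = df["target"].value_counts()
print(target_counts)
```

    With the real dataset, the same function would be called with the path to the downloaded CSV file instead of the in-memory sample.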


    Sample rows of the raw dataset are displayed below:


    At first glance we can see that the data is a bit dirty and can be cleaned by removing noise such as links and mentions. A few pandas utilities were used to apply the following simple preprocessing steps:

    • remove all URLs, user mentions and hashtags
    • keep only letters and digits
    • remove extra spaces
    • convert everything to lowercase
    • rename target class 4 -> 1
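
    The steps above can be sketched with plain regular expressions. This is an illustration rather than the original code; the patterns and the helper names clean_tweet and remap_target are assumptions:

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_HASHTAG_RE = re.compile(r"[@#]\w+")
NON_ALNUM_RE = re.compile(r"[^A-Za-z0-9\s]")
SPACES_RE = re.compile(r"\s+")


def clean_tweet(text):
    """Apply the cleaning steps to a single tweet."""
    text = URL_RE.sub(" ", text)               # remove URLs
    text = MENTION_HASHTAG_RE.sub(" ", text)   # remove @mentions and #hashtags
    text = NON_ALNUM_RE.sub("", text)          # keep only letters, digits, spaces
    text = SPACES_RE.sub(" ", text).strip()    # collapse extra spaces
    return text.lower()                        # convert to lowercase


def remap_target(target):
    """Rename target class 4 to 1; leave the other class untouched."""
    return 1 if target == 4 else target


print(clean_tweet("@bob Check http://t.co/abc  I can't WAIT!! #fun"))
```

    On a pandas DataFrame the helpers could be applied with df["text"].map(clean_tweet) and df["target"].map(remap_target).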

    Sample output of preprocessed data (can be compared with previous image):


    Data split

    To be able to train our AI model we need to first split the data into train and test sets. The code below shows how to do it:

    I used the random_state option to ensure reproducibility between experiments, while the stratify option keeps the class distribution similar in both sets.
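
    A minimal sketch of such a split using train_test_split from scikit-learn. The toy texts, the test_size of 0.25 and the random_state value of 42 are assumptions chosen for illustration:

```python
from sklearn.model_selection import train_test_split

# Toy data: four negative (0) and four positive (1) examples.
texts = [
    "so tired today", "i hate mondays", "what a sad day", "not happy at all",
    "loving this weather", "what a great day", "so happy today", "i love it",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

X_train, X_test, y_train, y_test = train_test_split(
    texts,
    labels,
    test_size=0.25,     # hold out a quarter of the samples for testing
    random_state=42,    # fixed seed so the split is the same on every run
    stratify=labels,    # keep the class proportions equal in both sets
)

print(len(X_train), len(X_test))
print(sorted(y_test))
```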


    The first model that will be checked is the so-called bag-of-words model, a simplifying representation used in NLP. In this model, a text (such as a sentence or a document) is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. Its visualisation is presented below:
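
    As a text-only illustration of the idea, a bag of words can be sketched with a Counter from the Python standard library. The helper name bag_of_words and the example sentence are made up for this example:

```python
from collections import Counter


def bag_of_words(text):
    """Represent a text as a multiset of its whitespace-separated words."""
    return Counter(text.split())


bag = bag_of_words("the cat sat on the mat")
print(bag)

# Word order is lost, but the number of occurrences is kept.
same_bag = bag_of_words("the mat sat on the cat")
print(bag == same_bag)
```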


    I will implement it using CountVectorizer from sklearn which converts a collection of text documents to a matrix of token counts. This approach will then enable us to use preprocessed vectors as an input for the LogisticRegression model.

    As you can see, with just a few lines of code we can prepare a fully working model. In the above code, NGRAM_RANGE describes how many consecutive words will be analysed together as a single feature. MAX_ITER lets us stop the algorithm after a certain number of iterations; for such big datasets it is sometimes safer to limit that.
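
    A minimal sketch of what such a model could look like, chaining CountVectorizer and LogisticRegression in a scikit-learn pipeline. The values of NGRAM_RANGE and MAX_ITER and the toy training data are assumptions, not the settings of the original notebook:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

NGRAM_RANGE = (1, 2)  # assumed: use single words and two-word pairs as features
MAX_ITER = 1000       # assumed: upper limit on solver iterations

# Toy training data: 1 = positive, 0 = negative.
train_texts = [
    "i love this", "what a great day", "so happy today", "not bad at all",
    "i hate this", "what a sad day", "so tired today", "not happy at all",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]

# The vectorizer turns texts into token counts; the classifier is fit on them.
model = make_pipeline(
    CountVectorizer(ngram_range=NGRAM_RANGE),
    LogisticRegression(max_iter=MAX_ITER),
)
model.fit(train_texts, train_labels)

predictions = model.predict(["i love this", "i hate this"])
print(predictions)
```

    Wrapping both steps in one pipeline means the same vectorizer vocabulary is reused automatically when predicting on new texts.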


    And voilà, that was it, we did it. Now let’s see what results we receive.

    For sklearn models, we can use visualisation functions that help with ad hoc prototyping.

    • classification_report: takes the predictions and the true labels and prepares a printable report
    • plot_confusion_matrix: plots a heat-map of the classification results; for binary classification it consists of 4 squares
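
    A small sketch of both ideas on made-up labels. Note that plot_confusion_matrix was removed in scikit-learn 1.2 in favour of ConfusionMatrixDisplay, so this example uses the plain confusion_matrix function, which returns the same 4 numbers without plotting them:

```python
from sklearn.metrics import classification_report, confusion_matrix

# Made-up true labels and predictions: 0 = negative, 1 = positive.
y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0]

# Printable summary of precision, recall and F1-score per class.
report = classification_report(
    y_true, y_pred, target_names=["negative", "positive"]
)
print(report)

# Rows are true classes, columns are predicted classes.
matrix = confusion_matrix(y_true, y_pred)
print(matrix)
```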

    Outputs of both functions are presented below:


    As we can see, the model reaches above 80% accuracy on unseen samples, which is a great result given how little code was needed to achieve it.

    Feature importance

    The logistic regression model lets us combine its features (in our case single words or two-word pairs) with the coefficients calculated during fitting to find the features that are most influential for choosing a specific label. Below is a short snippet:

    Positive features: ['not sad', 'no problem', 'doesnt hurt', 'not bad', 'no problems', 'no prob', 'not problem', 'never too', 'no probs', 'cant miss']

    Negative features: ['clean me', 'not happy', 'sad', 'passed away', 'rip', 'not looking', 'funeral', 'headache', 'disappointing', 'upsetting']


    Here is the complete solution (

    Next steps

    In the next article I will try to implement a model that does the same task but is built with neural networks. I can’t wait for it and I hope that you can’t either!


    Thanks to Katarzyna Latarska. 
