What happens if you mix Twitter, the president of México, and natural language processing?

Martín Vega
Published in LCC-Unison
Dec 8, 2020 · 10 min read


Nowadays, there is a lot of information on Twitter and a bunch of people using this social network daily (which creates even more information), so… what happens if you grab a bunch of tweets containing words related to the politics of México and mainly, of course, Andrés Manuel López Obrador (the current president of México) and analyze them? In this post we will show you the analysis that we did in the Natural Language Processing and Pattern Recognition courses at the University of Sonora.

Natural Language Processing Teacher:

  • Olivia Carolina Gutú Ocampo

Pattern Recognition Teacher:

  • Julio Waissman Vilanova

This analysis was done by the students:

  • Jesús Armando Báez Camacho
  • Diana Laura Ballesteros Valenzuela
  • Martín José Vega Noriega

GitHub repository:

Spoiler alert!

Our main goal was to train a model to tell us if a text was for or against the government of AMLO.

The evaluation criterion of our project is that our model yields a percentage of how much a user-provided text is for or against the current government.

Test our final model

Here’s how we did it!

The first thing we did was the information gathering: using the Twitter API we collected as many tweets as we could (7985 tweets in total).
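We can't reproduce the whole collection script here, but a minimal sketch with tweepy (the credentials and the query are placeholders, not our actual values) looks like this:

import tweepy

auth = tweepy.OAuthHandler('CONSUMER_KEY', 'CONSUMER_SECRET')  # placeholder credentials
auth.set_access_token('ACCESS_TOKEN', 'ACCESS_SECRET')
api = tweepy.API(auth, wait_on_rate_limit=True)

# Collect recent Spanish tweets matching a politics-related query (hypothetical query)
tweets = [status._json for status in tweepy.Cursor(
    api.search, q='AMLO OR "López Obrador"', lang='es',
    tweet_mode='extended').items(1000)]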

import pandas as pd

df = pd.read_csv('../data/tweets_base_madre.csv', encoding='utf-8')

Then, we classified them manually, creating a new column in our data frame named “clasification”. In that column we put a number from 0 to 3, depending on the content of the tweet (a quick check of the resulting class balance is sketched after the list).

  • 0 — If the content of the tweet was in favor of the government.
  • 1 — If the content of the tweet was against the government.
  • 2 — If the content of the tweet was neutral.
  • 3 — If the content of the tweet wasn’t related to the subject.
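A one-line sanity check of the resulting class balance (a sketch; it just counts the labels we assigned):

print(df['clasification'].value_counts())  # tweets per class (0–3)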

Before going deeper with tokenization, classification, etc., we analyzed which hashtags and mentions (arrobas) people use when they mention AMLO, and also the number of likes the tweets get depending on their classification, using the column “full_text”.
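A sketch of how those counts can be pulled from the raw text (the “favorite_count” column is the standard Twitter API field for likes; treat its presence in our dataframe as an assumption):

import re
from collections import Counter

# Count hashtags and at-mentions across all tweets
hashtags = Counter(h.lower() for t in df['full_text'] for h in re.findall(r'#\w+', t))
mentions = Counter(m.lower() for t in df['full_text'] for m in re.findall(r'@\w+', t))
print(hashtags.most_common(10))
print(mentions.most_common(10))

# Average likes per class
print(df.groupby('clasification')['favorite_count'].mean())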

In general.

Hashtags.

Likes.

Now, let’s go deeper…

After finishing the analysis of the hashtags, mentions and likes, the next thing we did was create a new column named “CleanTweets”. To that column we applied some cleaning (with the help of regular expressions); the cleaning consisted of:

  • Lemmatize.
  • Transform all the text to lowercase.

Remove:

  • Accents.
  • Stop words.
  • At symbols (@).
  • Hashtag symbols (#).
  • Hyperlinks.
  • Suspension points.
  • Isolated characters.
  • Isolated numbers.
  • Dates.
  • Possible laughs (ja, jaja, ha, haha, je, jeje, etc.).
  • Certain symbols (like {, }, [, ], “, ”, …).

import re

def cleanTxt(text):
    text = text.lower()
    # Remove at and hashtag symbols
    text = re.sub(r'@', '', text)
    text = re.sub(r'#', '', text)
    # Remove hyperlinks
    text = re.sub(r'(?:(?:https?|ftp)://)?[\w/\-?=%.]+\.[\w/\-?=%.]+', '', text)
    # Remove certain symbols
    text = re.sub(r'[(){}\[\]|,;"“”‘’\'«»!¡?¿]', '', text)
    # Remove suspension points
    text = re.sub(r'\.\.+', '', text)
    # Remove isolated characters and isolated numbers
    # (replacing with a space so the surrounding words do not get glued together)
    text = re.sub(r'\s.\s', ' ', text)
    text = re.sub(r'\s[0-9]+\s', ' ', text)
    # Remove dates (check dd/dd/dddd first so the four-digit year leaves no residue)
    text = re.sub(r'\d\d/\d\d/\d\d\d\d|\d\d/\d\d/\d\d', '', text)
    # Replace possible laughs with a <risa> token
    text = re.sub(r'\s([jJ][AaEeIi])+\s', r' <risa> ', text)
    text = re.sub(r'(ja|je|ji|JA|JE|JI|\s a|\s e|\s i)([jJ][AaEeIi])(\w?)', r' <risa> ', text)
    text = re.sub(r'\s((ha|Ha|he|HE)[hH][AaEeIi])+\w', r' <risa> ', text)
    return text

df['CleanTweets'] = df['full_text'].apply(cleanTxt)
# Drop rows with missing cleaned text (dropna on the frame so the rows are really removed)
df.dropna(subset=['CleanTweets'], inplace=True)

Note: before taking the next step, we made sure that there were no repeated tweets (when there was more than one copy of the same tweet, we kept the first one and deleted the others).
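That de-duplication can be done in one line (a sketch):

# Keep the first occurrence of each cleaned tweet, delete the rest
df.drop_duplicates(subset='CleanTweets', keep='first', inplace=True)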

The next step was to create another column named “TokenizeTweetsTidy_text”; that column contains the same information as the “CleanTweets” column mentioned before BUT with tokenization applied.

import nltk

# Tokenize
df['TokenizeTweetsTidy_text'] = df.apply(lambda row: nltk.word_tokenize(row['CleanTweets']), axis=1)
# Put quotes around the words by turning the token lists back into strings
df['TokenizeTweetsTidy_text'] = df.TokenizeTweetsTidy_text.astype(str)

Next, we took (from the column “clasification”) those tweets that have clasification=0 and clasification=1 and created a new dataframe with those two together.

liberales = df[df["clasification"]==0] 
conservadores = df[df["clasification"]==1]
libCon = pd.concat([liberales,conservadores]).reset_index()

In the dataframe libCon there are 1639 tweets of class 0 (Liberales) and 3164 tweets of class 1 (Conservadores), giving us a total of 4803 tweets.

Then, we vectorized the tweets in the column “TokenizeTweetsTidy_text” with TF-IDF.

Unigrams, bigrams and trigrams are used; a word has to appear at least 5 times in the documents to be taken into account, and if a term appears in more than 90% of the documents, it is eliminated.
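A sketch of a TfidfVectorizer configured with exactly those settings (the variable names are ours):

from sklearn.feature_extraction.text import TfidfVectorizer

# Unigrams to trigrams; ignore terms in fewer than 5 or in more than 90% of documents
Tfidf_vect = TfidfVectorizer(ngram_range=(1, 3), min_df=5, max_df=0.9)
X_tfidf = Tfidf_vect.fit_transform(libCon['TokenizeTweetsTidy_text'])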

To visualize the TF-IDF data set we used TSNEVisualizer, which creates an inner transformer pipeline that applies a decomposition first (SVD with 50 components by default) and then performs the t-SNE embedding. The visualizer then plots the scatter plot, coloring by cluster or by class, or neither if a structural analysis is required.

import warnings
from sklearn import model_selection
from sklearn.preprocessing import LabelEncoder
from sklearn.feature_extraction.text import TfidfVectorizer
from yellowbrick.text import TSNEVisualizer

def GraficaData(df, nombreCol):
    with warnings.catch_warnings():
        # Ignore the warnings generated by the code below
        warnings.filterwarnings("ignore")
        Train_X, Test_X, Train_Y, Test_Y = model_selection.train_test_split(
            df[nombreCol], df['clasification'], test_size=1)
        y = Train_Y
        Encoder = LabelEncoder()
        Train_Y = Encoder.fit_transform(Train_Y)
        Test_Y = Encoder.fit_transform(Test_Y)
        Tfidf_vect = TfidfVectorizer(max_features=50000)
        Tfidf_vect.fit(df[nombreCol])
        Train_X_Tfidf = Tfidf_vect.transform(Train_X)
        Test_X_Tfidf = Tfidf_vect.transform(Test_X)
        # SVD decomposition + t-SNE embedding, colored by class
        tsne = TSNEVisualizer(cmap='PuOr')
        tsne.fit(Train_X_Tfidf, y)
        tsne.show()

GraficaData(libCon, 'TokenizeTweetsTidy_text')

Here starts the classification!

To classify, we used logistic regression and SVM to see which one classifies best.

Logistic regression.

Why did we choose the logistic regression model? Because it is the easiest model to train and it also gives a good (and fast) classification.

SVM (Support-Vector Machine)

Why did we choose the SVM model? Because it is a fast and dependable classification algorithm that performs very well with a limited amount of data.

Before going into the classification… using the GridSearchCV technique we obtained the best hyperparameter values for the two models (a sketch of such a search is shown after this list).

  • In the case of logistic regression the results were: ‘C’: 5, ‘max_iter’: 100, ‘solver’: ‘saga’.
  • For the support vector machine the results were: ‘C’: 100, ‘gamma’: 0.01, ‘kernel’: ‘rbf’.
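The search itself looks roughly like this (a sketch: the parameter grids and the scoring metric are illustrative assumptions; only the winning values above come from our actual runs):

from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, _, y, _, Tfidf_vect = separaData(libCon, 'TokenizeTweetsTidy_text', 1)

# Hypothetical grids around the values we report above
param_log = {'C': [0.1, 1, 5, 10], 'max_iter': [100, 500],
             'solver': ['lbfgs', 'liblinear', 'saga']}
param_svm = {'C': [1, 10, 100], 'gamma': [0.001, 0.01, 0.1],
             'kernel': ['rbf', 'linear']}

grid_log = GridSearchCV(LogisticRegression(), param_log, scoring='f1_weighted', cv=5, n_jobs=-1)
grid_log.fit(X, y)
print(grid_log.best_params_)  # {'C': 5, 'max_iter': 100, 'solver': 'saga'}

grid_svm = GridSearchCV(SVC(), param_svm, scoring='f1_weighted', cv=5, n_jobs=-1)
grid_svm.fit(X, y)
print(grid_svm.best_params_)  # {'C': 100, 'gamma': 0.01, 'kernel': 'rbf'}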

Next, what happens if we want to know the degree of success our chosen model achieves when given a specific percentage of the dataset?

We can find that out with the learning curve!

# Creating CV training and test scores for various training set sizes
import numpy as np
import plotly.graph_objects as go
from sklearn.model_selection import StratifiedKFold, learning_curve
from sklearn.linear_model import LogisticRegression

X, _, y, _, Tfidf_vect = separaData(libCon, 'TokenizeTweetsTidy_text', 1)
cv = StratifiedKFold(n_splits=10)
estimator = LogisticRegression(solver='saga', max_iter=100, C=5)
train_sizes, train_scores, test_scores = learning_curve(
    estimator, X, y, cv=cv, scoring='f1_weighted', n_jobs=-1,
    train_sizes=np.linspace(0.3, 1.0, 10))

# Means and standard deviations of training set scores
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)

# Means and standard deviations of test set scores
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

layout = go.Layout(title='Learning Curve Logistic Regression',
                   xaxis=dict(title='Training Set Size'),
                   yaxis=dict(title='F1 Score (weighted)'))

fig = go.Figure(layout=layout)

fig.add_trace(go.Scatter(
    x=train_sizes, y=train_mean,
    line=dict(color='rgb(230,171,2)', width=4, dash='dash'),
    name='Training score',
))

fig.add_trace(go.Scatter(
    x=train_sizes, y=test_mean,
    line=dict(color='rgb(117,112,179)', width=4),
    mode='lines+markers',
    name='Cross-validation score',
))

fig.update_traces(mode='lines')
fig.write_html("LearningCurveRegLogistic.html")
fig.show()
# Plotting the SVM learning curve
# Creating CV training and test scores for various training set sizes
from sklearn.svm import SVC

X, _, y, _, Tfidf_vect = separaData(libCon, 'TokenizeTweetsTidy_text', 1)
cv = StratifiedKFold(n_splits=10)
estimator = SVC(gamma=0.01, C=100, probability=True, kernel='rbf')
train_sizes, train_scores, test_scores = learning_curve(
    estimator, X, y, cv=cv, scoring='f1_weighted', n_jobs=-1,
    train_sizes=np.linspace(0.3, 1.0, 10))

# Means and standard deviations of training set scores
train_mean = np.mean(train_scores, axis=1)
train_std = np.std(train_scores, axis=1)

# Means and standard deviations of test set scores
test_mean = np.mean(test_scores, axis=1)
test_std = np.std(test_scores, axis=1)

layout = go.Layout(title='Learning Curve SVM',
                   xaxis=dict(title='Training Set Size'),
                   yaxis=dict(title='F1 Score (weighted)'))

fig = go.Figure(layout=layout)

fig.add_trace(go.Scatter(
    x=train_sizes, y=train_mean,
    line=dict(color='rgb(230,171,2)', width=4, dash='dash'),
    name='Training score',
))

fig.add_trace(go.Scatter(
    x=train_sizes, y=test_mean,
    line=dict(color='rgb(117,112,179)', width=4),
    mode='lines+markers',
    name='Cross-validation score',
))

fig.update_traces(mode='lines')
fig.write_html("LearningCurveSVM.html")
fig.show()

The learning curves of the two models tell us the same thing: more training data is needed to reduce the error.

This is the function that we are going to use to train:

from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

def ModeloClasificadorKFoldCV(modeloToTrain, df, nombre):
    X, _, y, _, Tfidf_vect = separaData(df, 'TokenizeTweetsTidy_text', 1)

    if modeloToTrain == LogisticRegression:
        myModelo = modeloToTrain(solver='saga', max_iter=100, C=5)
    elif modeloToTrain == SVC:
        myModelo = modeloToTrain(gamma=0.01, C=100, probability=True, kernel='rbf')

    cv = KFold(n_splits=10, random_state=None, shuffle=False)
    scores = []

    for train_index, test_index in cv.split(X):
        Train_X_Tfidf, Test_X_Tfidf = X[train_index], X[test_index]
        Train_Y, Test_Y = y[train_index], y[test_index]
        myModelo.fit(Train_X_Tfidf, Train_Y)
        scores.append(myModelo.score(Test_X_Tfidf, Test_Y))

    # Return the trained model and the per-fold scores
    return myModelo, scores

The first thing we did was define our k-fold, and after that we trained! Then we got the predictions, the confusion matrix and the ROC curve of the model (a sketch of that evaluation step is shown below).
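A sketch of that evaluation (the variable names follow the function above; myModelo here is the model from the last fold):

from sklearn.metrics import confusion_matrix, classification_report, roc_curve, auc

pred_Y = myModelo.predict(Test_X_Tfidf)
print(confusion_matrix(Test_Y, pred_Y))
print(classification_report(Test_Y, pred_Y))

# ROC curve from the predicted probabilities of class 1
probs = myModelo.predict_proba(Test_X_Tfidf)[:, 1]
fpr, tpr, _ = roc_curve(Test_Y, probs)
print('AUC = %.3f' % auc(fpr, tpr))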

Now we have two options: to test our data without stemming or with stemming.

First the results without stemming:

Logistic regression:

SVM:

Now we will see if changing to the best threshold improves the F1 score (a sketch of how that threshold is found follows the results):

Logistic regression:

Best Threshold=0.681871, G-Mean=0.708

SVM:

Best Threshold=0.691281, G-Mean=0.659
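For reference, those best thresholds come from maximizing the G-mean (the geometric mean of sensitivity and specificity) over the ROC curve; a sketch of that computation (our exact code may differ slightly):

import numpy as np
from sklearn.metrics import roc_curve

probs = myModelo.predict_proba(Test_X_Tfidf)[:, 1]
fpr, tpr, thresholds = roc_curve(Test_Y, probs)
gmeans = np.sqrt(tpr * (1 - fpr))   # G-mean at every candidate threshold
ix = np.argmax(gmeans)
print('Best Threshold=%f, G-Mean=%.3f' % (thresholds[ix], gmeans[ix]))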

Looking at the F1 score of the logistic regression, we can see that changing the threshold affects it negatively, and the same can be observed with the support vector machine. So we discarded the option of modifying the threshold.

Now the results using stemming:

Logistic regression:

SVM:

Comparing the results with stemming to the results without it, we can see that stemming gives us a small loss in the F1 score, both in the logistic regression and in the SVM; that is why we decided to train without stemming.
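For completeness, the stemmed variant can be produced like this (a sketch; we assume NLTK's Spanish Snowball stemmer, and any Spanish stemmer would play the same role):

import nltk
from nltk.stem import SnowballStemmer

stemmer = SnowballStemmer('spanish')

def stemTxt(text):
    # Stem every token and join them back into a string
    return ' '.join(stemmer.stem(w) for w in nltk.word_tokenize(text))

df['StemTweets'] = df['CleanTweets'].apply(stemTxt)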

As can be seen in the graphs, the logistic regression classifies conservative tweets very well but liberal tweets very poorly; this is observed in the confusion matrix and the F1 score. From the ROC curve we can say that the classification of the “liberal” class in particular is very bad. To address this we tried another model, the SVM, since we wanted to improve the F1 score of the “liberal” class.

The results of the SVM do not differ much from the logistic regression model; the good thing here is that it improved what we wanted: the F1 score of the “liberal” class.

So, after looking at the SVM and logistic regression results, we used all the available data to train a final model with SVM:

def ModeloClasificadorFinal(modeloToTrain, df, nombre):
    # Use all the available data: X is the full TF-IDF matrix, y the labels
    X, _, y, _, Tfidf_vect = separaData(df, 'TokenizeTweetsTidy_text', 1)

    if modeloToTrain == LogisticRegression:
        myModelo = modeloToTrain(solver='saga', max_iter=100, C=5)
    elif modeloToTrain == SVC:
        myModelo = modeloToTrain(gamma=0.01, C=100, probability=True, kernel='rbf')

    myModelo.fit(X, y)

    return myModelo

modelSVCFinal = ModeloClasificadorFinal(SVC, libCon, "Maquina de vector de soportes")
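And this is how the final model answers the question we set out with, yielding a percentage for a user's text (a sketch; the example text is hypothetical, and Tfidf_vect is the fitted vectorizer returned by separaData):

_, _, _, _, Tfidf_vect = separaData(libCon, 'TokenizeTweetsTidy_text', 1)

texto = "La cuarta transformación va muy bien"  # hypothetical user input
vec = Tfidf_vect.transform([cleanTxt(texto)])
proba = modelSVCFinal.predict_proba(vec)[0]
# Class 0 = in favor (liberal), class 1 = against (conservative)
print(f"{proba[0]*100:.1f}% in favor, {proba[1]*100:.1f}% against")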

Conclusion

After seeing the results of the two models, we can conclude that to improve our model we need more “liberal” tweets: in both models this class imbalance sharply decreased the F1 score of the “liberal” class, and it makes our classifiers not 100% reliable.
