In this post I will briefly describe a study we conducted to illustrate Twitter-based data mining procedures for sentiment analysis, in the context of the Turkish presidential election.
The objective of the study is to evaluate whether a subset of social media posts on a specific topic can be used to illustrate the preferences of the general population. To evaluate realistically how well the model performs, that is, how closely social media aligns with real preferences, we chose the presidential election in Turkey as the subject. This way we can compare our outcomes with the real outcomes from a real and clearly identified population. As the social media environment we chose Twitter, since its microblogging format makes it a great source for mining what people think and feel about a specific topic. To capture “the feeling” in tweets and to identify the direction of opinion, we used sentiment analysis. Throughout the study we followed the steps of the CRISP-DM methodology, but rather than the process, I will focus on the approach and the tools used (all free).
Collecting data
Tool: LINQ to Twitter - as they describe themselves, "LINQ to Twitter is an open source 3rd party LINQ Provider for the Twitter micro-blogging service."
You can download the LINQ to Twitter package from their website.
We used a test Twitter account throughout the study. After logging in to Twitter Apps with the test account, the access tokens should be copied into the “App.config” file in the LINQ to Twitter solution.
As with all types of web mining, "the web has everything", so you need to filter for the data you need. Because Twitter data is more structured than other web content, filtering is much easier in Twitter mining.
For our purpose, we defined keywords related to the presidential election (e.g. president, election, the date of the election, etc., all in Turkish!) and also used the trending hashtags on the topic.
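To illustrate the idea of keyword and hashtag filtering, here is a minimal Python sketch. It is not the LINQ to Twitter code used in the study; the keyword list, the hashtag and the sample tweets are hypothetical placeholders, since the actual Turkish keyword list is not reproduced here.

```python
# Hypothetical keyword and hashtag lists (assumptions for illustration only).
KEYWORDS = ["cumhurbaşkanı", "seçim"]
HASHTAGS = ["#secim2014"]


def is_in_scope(text, keywords=KEYWORDS, hashtags=HASHTAGS):
    """Return True if the tweet text contains any keyword or hashtag."""
    lowered = text.casefold()
    return any(term.casefold() in lowered for term in keywords + hashtags)


# Made-up sample tweets.
tweets = [
    "Cumhurbaşkanı seçimi pazar günü",
    "Bugün hava çok güzel",
    "Oy vermeyi unutmayın #Secim2014",
]

# Keep only the tweets that match the topic.
in_scope = [t for t in tweets if is_in_scope(t)]
```

Simple substring matching like this is cheap but coarse, which is one reason the keyword list needs the kind of iteration described in Note 1.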
Note 1: In our first try we used candidate names as well, but since one of the candidates was already the prime minister of the country, there were too many tweets about him that were related to other topics. This caused both highly skewed data between candidates and unrelated data, so we removed candidate names from our keyword list.
"It is a best practice to keep the collected data as close to scope as possible from the very first step of the data mining process. It makes the data cleaning step easier and shorter."
You can see the related LINQ to Twitter code below.
The other point to consider is which data to collect from Twitter. A field guide that lists and explains what can be retrieved from Twitter is available at the following link (https://dev.twitter.com/docs/platform-objects/tweets). You can get the tweet itself along with the user, time, location, friend and follower counts, retweet info, etc.
After some evaluation, and keeping our scope in mind, we decided to collect the tweet text, tweet creation date, user name, user id, favorite count of the user, follower count of the user and friend count of the user.
Although we experienced some connection cuts during the data collection phase, we gathered 17 days of on-topic tweets up to the morning of the election and ended up with around 126,000 tweets to work on. At the end of this step, we stored all gathered tweets in a .txt file.
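As a rough illustration of how the seven collected fields could be held and written to a text file, here is a small Python sketch. The record layout, the field names, the date format and the tab-separated line format are all assumptions; the actual layout of the .txt file from the study is not documented here.

```python
from dataclasses import dataclass, astuple


@dataclass
class TweetRecord:
    # Field names are illustrative, chosen to match the seven fields listed above.
    text: str
    created_at: str        # creation date as a string; the format is an assumption
    user_name: str
    user_id: int
    favourites_count: int
    followers_count: int
    friends_count: int


def to_line(rec):
    """Serialize a record as one tab-separated line, flattening tabs and newlines."""
    fields = [str(v).replace("\t", " ").replace("\n", " ") for v in astuple(rec)]
    return "\t".join(fields)


def from_line(line):
    """Parse a line written by to_line back into a TweetRecord."""
    text, created, name, uid, fav, fol, fri = line.rstrip("\n").split("\t")
    return TweetRecord(text, created, name, int(uid), int(fav), int(fol), int(fri))


# Made-up example record; the embedded tab is flattened on output.
rec = TweetRecord("Seçim\tyaklaşıyor", "2014-08-01 10:00:00", "test_user", 42, 3, 120, 80)
line = to_line(rec)
restored = from_line(line)
```

Flattening tabs and newlines inside the tweet text keeps one tweet per line, which makes a later import into a spreadsheet or database less error-prone.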
Holding data
Tool: MS SQL Server 2014
To be able to easily analyze, search and aggregate the data, we built a database on MS SQL Server 2014 for our project. We initially gave the data more structure in Excel and then imported the structured data into our dedicated database.
Sentiment Analysis
Tool: SentiStrength - as they describe themselves, "SentiStrength is a sentiment analysis (opinion mining) program"
We used SentiStrength as a free, simple, modifiable sentiment analysis tool. Our tweets are in Turkish, so we started by localizing the tool. To help the tool read Turkish better, we replaced "EnglishWordList" with "TurkishWordList" (which can be found on the Internet, or I can share it upon request ;)).
The next step is to teach the tool sentiment words in the local language. To do that, we updated "EmotionLookupTable" with Turkish sentiment words and their sentiment strengths (between -5 and +5). In addition to language localization, we also customized the tool's dictionary for context. Some words are neutral in general but have gained a negative or positive meaning in the context of Turkish politics, e.g. cat ("kedi" in Turkish) and transformer ("trafo" in Turkish), after a politician explained a power cut during vote counting in the previous local elections with "a cat entered the transformer". Can you imagine how humorous the political environment in Turkey is? :) We gathered such context-specific sentiment words as well and included them in our dictionary. After this step we were almost done with the tool, but not quite.
Note 2: Although it is Latin-based, the Turkish alphabet has some special characters (e.g. ı, ü, ç, ş, ğ). However, because they use English keyboards, some people do not type these letters when tweeting but use the closest characters in the standard Latin alphabet instead (e.g. i, u, c, s, g). To capture this kind of spelling, we duplicated both our "TurkishWordList" and "EmotionLookupTable" dictionaries and replaced the special characters with standard ones in the duplicated half of each dictionary.
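The duplication described in Note 2 can be sketched in a few lines of Python. This is only an illustration of the idea, not the procedure we ran on the SentiStrength files; the lexicon entries and strengths below are made up, and ö is included in the mapping as one more Turkish letter alongside those listed above.

```python
# Map Turkish-specific letters to their closest plain Latin letters.
FOLD = str.maketrans("ıİüÜçÇşŞğĞöÖ", "iIuUcCsSgGoO")


def fold_turkish(word):
    """Replace Turkish-specific letters with plain Latin ones."""
    return word.translate(FOLD)


def with_folded_variants(lexicon):
    """Return a copy of the lexicon that also contains the folded spellings."""
    expanded = dict(lexicon)
    for word, strength in lexicon.items():
        # setdefault keeps an existing entry if the folded form is already a word.
        expanded.setdefault(fold_turkish(word), strength)
    return expanded


# Hypothetical entries; the strengths are illustrative only.
lexicon = {"güzel": 3, "kötü": -3, "iyi": 2}
expanded = with_folded_variants(lexicon)
```

Here `expanded` contains both "güzel" and "guzel" with the same strength, so a tweet typed on an English keyboard is still matched. One caveat of this approach is that a folded spelling can collide with a different real word, which is why the sketch never overwrites an existing entry.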
"With all this preparatory work, we enhanced our tool to better analyze Turkish within the given context."
Data Preparation 1: Before running sentiment analysis on our tweets, we generated an incremental surrogate key for the tweets in the database so that every tweet had a unique id. This lets us track tweets throughout the study and match SentiStrength results with the other tweet data.
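As an illustration of the surrogate key idea, here is a small Python sketch using the built-in sqlite3 module as a stand-in for MS SQL Server (where an IDENTITY column plays a similar role). The table and column names and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The database assigns tweet_key automatically, giving each tweet a unique id.
conn.execute(
    """
    CREATE TABLE tweets (
        tweet_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- incremental surrogate key
        tweet_text TEXT NOT NULL,
        user_id INTEGER NOT NULL
    )
    """
)
conn.executemany(
    "INSERT INTO tweets (tweet_text, user_id) VALUES (?, ?)",
    [("first tweet", 10), ("second tweet", 11), ("third tweet", 10)],
)

# Hypothetical results table: sentiment scores carry the same key.
conn.execute(
    "CREATE TABLE sentiment (tweet_key INTEGER PRIMARY KEY, pos INTEGER, neg INTEGER)"
)
conn.executemany(
    "INSERT INTO sentiment VALUES (?, ?, ?)",
    [(1, 3, -1), (2, 1, -4), (3, 2, -2)],
)

# Match sentiment results back to the tweet data through the surrogate key.
rows = conn.execute(
    """
    SELECT t.tweet_key, t.user_id, s.pos, s.neg
    FROM tweets AS t
    JOIN sentiment AS s ON s.tweet_key = t.tweet_key
    ORDER BY t.tweet_key
    """
).fetchall()
```

Because the key is independent of the tweet text, the text can be cleaned or exported to another tool and the results can still be joined back reliably.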
Then we ran sentiment analysis on the full set of tweets, keeping the unique tweet id. SentiStrength mainly returns two values: a net positive sentiment strength and a net negative sentiment strength. It also breaks each strength down into the specific words that contributed, based on their values in "EmotionLookupTable". After getting the results, we imported them into our database and joined them with the existing tweet data.
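To give a feel for what a lexicon-based scorer with two output values does, here is a heavily simplified Python sketch. It is not SentiStrength's actual algorithm (which also handles boosters, negation and more); the lookup entries and strengths are made up, and the neutral baselines of 1 and -1 are an assumption of this sketch.

```python
# Hypothetical lookup table: word -> strength in the range -5..+5.
EMOTION_LOOKUP = {
    "harika": 4,
    "iyi": 2,
    "kötü": -3,
    "rezalet": -5,
    "kedi": -2,    # context-specific entries; values are illustrative
    "trafo": -2,
}


def score(text, lookup=EMOTION_LOOKUP):
    """Return (positive, negative) strengths for a text.

    Positive starts at 1 and negative at -1 (no sentiment); each takes the
    strongest matching word in its own direction.
    """
    positive, negative = 1, -1
    for token in text.lower().split():
        strength = lookup.get(token.strip(".,!?:;\"'"), 0)
        if strength > 0:
            positive = max(positive, strength)
        elif strength < 0:
            negative = min(negative, strength)
    return positive, negative


result = score("Sonuç harika ama sayım rezalet!")
net = result[0] + result[1]  # one possible way to combine the two values
```

For this made-up sentence the sketch returns a positive strength of 4 and a negative strength of -5. Keeping the two values separate preserves mixed feelings that a single score would hide.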
Data Classification
After having all related data in the database, it is time to work out which tweet talks about which candidate. In this phase we again used an iterative keyword approach. After defining the best and most accurate keywords, we tried to classify our data into three classes. Fully mutually exclusive sets might give more accurate or realistic results, but as you can guess, many tweets talk about more than one candidate. We manually reviewed these tweets and noticed that almost all tweets related to all three candidates were about survey results, so we excluded them from further analysis. For tweets that talk about two candidates, instead of eliminating them all, we adjusted our analysis queries to handle them together with the other tweets of the same user.
Data Preparation 2: After settling our classification criteria, we added 3 binary fields (one for each candidate) to our main tweet table and set each field to 1 when the tweet matched the classification criteria defined for that candidate.
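The flagging step can be sketched in Python as follows. The per-candidate keyword lists and the sample tweets are hypothetical (the real criteria were refined iteratively and were implemented in the database), and dropping tweets that match no candidate is an assumption of this sketch.

```python
# Hypothetical per-candidate keyword lists, keyed by the candidates' initials.
CANDIDATE_KEYWORDS = {
    "RTE": ["erdogan"],
    "EI": ["ihsanoglu"],
    "SD": ["demirtas"],
}


def classify(text, keywords=CANDIDATE_KEYWORDS):
    """Return a dict of 0/1 flags, one per candidate."""
    lowered = text.lower()
    return {c: int(any(k in lowered for k in kws)) for c, kws in keywords.items()}


def keep_for_analysis(flags):
    """Drop tweets that mention no candidate or all three (mostly survey results)."""
    return 0 < sum(flags.values()) < len(flags)


# Made-up sample tweets written without Turkish-specific letters.
tweets = [
    "Erdogan bugun miting yapti",
    "Anket: Erdogan, Ihsanoglu, Demirtas",
    "Ihsanoglu ve Demirtas ayni programda",
    "Bugun hava guzel",
]
flagged = [(t, classify(t)) for t in tweets]
kept = [t for t, f in flagged if keep_for_analysis(f)]
```

Independent binary flags, rather than a single class label, make it possible to keep tweets about two candidates and decide later how to treat them.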
Note 3: The common challenges of sentiment analysis apply to our study as well: our tool cannot correctly weight sarcastic comments, and it is difficult to identify direction in a tweet (who is the subject, what is the object) and which of them the sentiment word refers to. At this point we had to assume that these cases would even out over the full data set.
Data Aggregation & Evaluation
The main aim of this analysis is to estimate the vote share of each candidate based on the posts of Twitter users and to evaluate how well these users represent the viewpoints and preferences of the whole population. Each citizen who is eligible to vote has a single vote. Therefore, treating each Twitter user as a citizen, we need to aggregate tweet data at user level.
For this analysis, we aggregated all tweets of each user at user level by calculating, for each candidate, a weighted average sentiment strength based on tweet counts and net sentiment strengths. If one candidate distinctly had the most positive average sentiment strength, we assumed that the user would vote for that candidate. With this approach we predicted the votes of 10,164 Twitter users (compared with a real valid vote count of 40,566,232) and found the allocation below (RTE, EI and SD are the initials of the candidates):
The results, based on Twitter data, show that although 62% of users "vote for" RTE, 58% of them would "vote against" RTE if they had the chance. You can evaluate the results for the other candidates in the same way. The results are acceptable and can be verified against the political environment in the country.
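The user-level aggregation described above can be sketched in Python roughly as follows. This is an illustration under assumptions, not our actual SQL queries: the input layout, the rule "highest positive average wins" and the `margin` parameter are hypothetical simplifications, and the sample rows are made up.

```python
from collections import Counter, defaultdict


def predict_votes(rows, margin=0.0):
    """rows: iterable of (user_id, candidate, net_strength), one per tweet-candidate pair.

    Returns {user_id: candidate} for users with one clearly most positive candidate.
    """
    # user -> candidate -> [sum of net strengths, tweet count]
    totals = defaultdict(lambda: defaultdict(lambda: [0, 0]))
    for user_id, candidate, net in rows:
        cell = totals[user_id][candidate]
        cell[0] += net
        cell[1] += 1

    votes = {}
    for user_id, per_candidate in totals.items():
        # Average net strength per candidate, highest first.
        averages = sorted(
            ((s / n, c) for c, (s, n) in per_candidate.items()), reverse=True
        )
        best_avg, best = averages[0]
        runner_up = averages[1][0] if len(averages) > 1 else float("-inf")
        # Predict only when the best average is positive and clearly ahead.
        if best_avg > 0 and best_avg - runner_up > margin:
            votes[user_id] = best
    return votes


rows = [
    (1, "RTE", 2), (1, "RTE", 1), (1, "EI", -1),
    (2, "EI", 3), (2, "SD", 3),   # tie, so no prediction
    (3, "SD", -2),                # only negative, so no prediction
    (4, "EI", 1),
]
votes = predict_votes(rows)
shares = {c: n / len(votes) for c, n in Counter(votes.values()).items()}
```

Counting one predicted vote per user, rather than one per tweet, keeps very active accounts from dominating the estimated shares.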
Note 4: We treated all Twitter users as real citizens who are eligible to vote. However, many accounts do not belong to a person but to an organisation, an ideology, an affinity group, etc. These accounts distort our results. Our next analysis shows this impact more clearly.
As another analysis, we tried to find the best and worst influencers for each candidate. To illustrate total influence, we calculated a weighted average based on follower count and average net sentiment strength at user level.
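One simple way to combine follower count and average net sentiment into an influence score is sketched below in Python. The exact weighting used in the study is not spelled out here, so the formula (followers multiplied by average net strength) is an assumption, and the accounts and numbers are made up.

```python
def influence_scores(users):
    """users: iterable of (user_name, follower_count, average_net_strength).

    Returns (user_name, score) pairs sorted from most positive to most negative,
    where score = follower_count * average_net_strength.
    """
    scored = [(name, followers * avg_net) for name, followers, avg_net in users]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical accounts and numbers for a single candidate.
users = [
    ("news_account", 500_000, 0.5),
    ("regular_user", 200, 3.0),
    ("critic", 40_000, -2.0),
]
ranking = influence_scores(users)
best, worst = ranking[0], ranking[-1]
```

With a score like this, an account with many followers and mildly positive tweets outranks a small account with strongly positive tweets, which matches the intuition behind "influence".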
You can see some indicative results in the table below.
To make it clearer (especially for non-Turkish readers):
NTV: one of the mainstream news channels
TRT: Turkish Radio and Television Corporation, the public broadcaster
Sabah: one of the main newspapers
DHA: one of the news agencies in the country
YeniŞafak: a newspaper
Stargazete: a newspaper
AkitGazetem: a newspaper
AkşamGazetesi: a newspaper
CHP_online: account of the main opposition party
calıkosman: executive editor/chief of Samanyolu radio (which also has a TV channel known as being close to the Gülen community)
CHPvekilHaber: a news account of the main opposition party
Note that the sample above is not simply the top 11 accounts; accounts were selected separately for each candidate so that results are shown for each. There are no distinctive influencers for the last candidate compared with the other two. For the first two, note that the impact of the top influencer for RTE is more than 20 times that of the top influencer for EI (even though the EI influencer is the official account of the opposition party that supports EI). Considering the number of influencers and their total impact, much more could be said about the political environment in Turkey and about freedom of the media. I hope to share my subjective comments on this topic here soon. Coming back to our technical content, however: newspapers, other media institutions and dedicated supporter accounts, which are not actually voters, shift our results upward for RTE, the prime minister at the time of the election.
Further improvements
This study was completed in a very short time, but still
"it shows that, at a high level, Twitter data illustrates the preferences of the general population."
Further human interpretation makes the results more meaningful and accurate. The model and approaches used can be improved for more detailed analysis:
- more classification algorithms can be run on tweets to identify the exact subject candidate of tweets that could not be classified or were assigned to more than one candidate, so that the sample set can be extended
- more human intervention can improve the classification criteria
- more data cleaning can be applied to better analyze sentiment in tweets
- in this study only the word list and the emotion lookup table were updated for the local language; the other dictionaries (BoosterWordlist, NegatingWordList, QuestionWords, etc.) can also be updated
- accounts that belong to organizations rather than citizens can be excluded when estimating election results
- further analysis can be done, e.g. what the main characteristics of each candidate are, how popular they are, what the sensitive issues for Turkish voters are; it depends on the questions you ask
- much more ...
I have shared this study as an example for those who will conduct a similar analysis. I hope it gives you some ideas. If you have further questions or comments, feel free to comment or message me.
And a small tweet selection... (at least for Turkish speakers)
Project team: Fatma Cengiz, Begüm Göloğlu, Gizem Gündoğdu, Nihan Erkan, Tuğçe Sarıbay