Sunday, August 17, 2014

Tweet mining for sentiment analysis, Case: Turkish Presidential Election


In this post I will briefly explain a study conducted to give an idea about Twitter-based data mining procedures for sentiment analysis, within the context of the Turkish presidential election.




The objective of this study is to evaluate whether a subset of social media postings on a specific topic can be used to illustrate the preferences of the general population. To be able to realistically evaluate the performance of the model, that is, how much social media aligns with real preferences, we chose the presidential election in Turkey as the subject of this study. This way, we were able to measure our outcomes against real results from a real and clearly identified population. As the social media environment, we agreed on Twitter, since its microblogging concept makes it a great source to mine what people think and feel about a specific topic. In order to capture "the feeling" in tweets and to identify the direction of opinion, we used sentiment analysis. Throughout the study we followed the CRISP-DM methodology steps, but rather than the process, I will focus on the approach and the tools used (all free).

Collecting data


Tool: LINQ to Twitter. As they describe themselves, "LINQ to Twitter is an open source 3rd party LINQ Provider for the Twitter micro-blogging service."

You can download the LINQ to Twitter package from their website.
We used a test Twitter account throughout the study. After logging in to Twitter Apps (with the test account), the access tokens should be copied into the "App.config" file in the LINQ to Twitter solution.


As with all types of web mining, "the web has everything," so you need to filter out exactly the data you need. Since Twitter content is more structured than other web content, filtering data is much easier in Twitter mining.

For our purpose, we defined keywords related to the presidential election (e.g. president, election, the election date, etc., all in Turkish!) and also used the hot hashtags of the topic.

Note 1: In our first try, we used candidate names as well, but since one of them was already the prime minister of the country, there were too many tweets about him that related to other topics. This caused highly skewed data between candidates as well as unrelated data. Therefore, we eliminated candidate names from our keyword list.

"It is a best practice to get in scope data as much as possible starting from the first step of data mining process. It will make data cleaning step easier and shorter."

You can see a related code piece for LINQ to Twitter below.
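The sketch below is a minimal illustration rather than our exact code: it assumes the v3-style LINQ to Twitter API, and the credentials and keywords are placeholders (property names may vary between library versions).

// Minimal sketch: search for tweets matching the election keywords.
using System;
using System.Linq;
using System.Threading.Tasks;
using LinqToTwitter;

class TweetCollector
{
    static void Main() { RunAsync().Wait(); }

    static async Task RunAsync()
    {
        var auth = new SingleUserAuthorizer
        {
            CredentialStore = new SingleUserInMemoryCredentialStore
            {
                ConsumerKey = "...",        // in our setup these values sit in
                ConsumerSecret = "...",     // App.config, copied from Twitter Apps
                AccessToken = "...",
                AccessTokenSecret = "..."
            }
        };

        var ctx = new TwitterContext(auth);

        // Hypothetical Turkish keywords and hashtags combined with OR
        string query = "cumhurbaşkanı OR seçim OR #CumhurbaşkanıSeçimi";

        var searchResponse = await
            (from search in ctx.Search
             where search.Type == SearchType.Search &&
                   search.Query == query
             select search)
            .SingleOrDefaultAsync();

        foreach (var s in searchResponse.Statuses)
        {
            // The fields we kept: tweet text, creation date, user name/id,
            // favorite, follower and friend counts (appended to a .txt file)
            Console.WriteLine(string.Join("\t",
                s.Text, s.CreatedAt,
                s.User.ScreenNameResponse, s.User.UserIDResponse,
                s.User.FavoritesCount, s.User.FollowersCount, s.User.FriendsCount));
        }
    }
}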


The other point that needs to be considered is which data to collect from Twitter. You can find a reference field guide that lists and explains what can be retrieved from Twitter at the following link (https://dev.twitter.com/docs/platform-objects/tweets). You can get the tweet itself along with user, time, location, friend and follower counts, retweet info, etc.



After some evaluation and focusing on our scope, we decided to collect the "tweet text, tweet creation date, user name, user id, favorite count, follower count and friend count" info.

Although we experienced some connection cuts during the data collection phase, we gathered 17 days of on-topic tweets up to the morning of the election and ended up with around 126,000 tweets to work on. At the end of this step, we collected all gathered tweets in a .txt file.

Holding data

Tool: MS SQL Server 2014

To be able to easily analyze, search and aggregate the data, we built a database on MS SQL Server 2014 for our project. We initially structured our data further in Excel and then imported the structured data into our dedicated database.
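As an illustration only, the load could also be done programmatically as in the sketch below; the connection string, table and column names are hypothetical, and the SQL Server import wizard works just as well.

// Minimal sketch: bulk-load the structured tweet file into SQL Server.
using System;
using System.Data;
using System.Data.SqlClient;
using System.IO;

class TweetLoader
{
    static void Main()
    {
        // Hypothetical schema mirroring the collected fields
        var table = new DataTable();
        table.Columns.Add("TweetText", typeof(string));
        table.Columns.Add("CreatedAt", typeof(DateTime));
        table.Columns.Add("UserName", typeof(string));
        table.Columns.Add("UserId", typeof(long));
        table.Columns.Add("FavoriteCount", typeof(int));
        table.Columns.Add("FollowerCount", typeof(int));
        table.Columns.Add("FriendCount", typeof(int));

        foreach (var line in File.ReadLines("tweets.txt"))
        {
            var f = line.Split('\t');   // tab-separated file from the collection step
            table.Rows.Add(f[0], DateTime.Parse(f[1]), f[2], long.Parse(f[3]),
                           int.Parse(f[4]), int.Parse(f[5]), int.Parse(f[6]));
        }

        using (var conn = new SqlConnection("Server=.;Database=TweetMining;Integrated Security=true"))
        using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "dbo.Tweets" })
        {
            conn.Open();
            bulk.WriteToServer(table);  // one bulk round trip instead of row-by-row inserts
        }
    }
}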

Sentiment Analysis


Tool: SentiStrength. As they describe themselves, "SentiStrength is a sentiment analysis (opinion mining) program."

We used SentiStrength as a free, simple, modifiable sentiment analysis tool. Our tweets are in Turkish, so we started with the localization of the tool. In order to help the tool read Turkish better, we replaced the "EnglishWordList" with a "TurkishWordList" (which can be found on the Internet, or I can share it upon request ;)).

The next step is to teach the tool sentiment words in the local language. To do that, we updated the "EmotionLookupTable" with Turkish sentiment words and their strengths (between -5 and +5). In addition to language localisation, we also performed context customisation of the tool's dictionary. Some words are neutral in a general context but have gained a negative or positive meaning in the context of Turkish political issues (e.g. cat ("kedi" in Turkish) and transformer ("trafo" in Turkish), after a politician explained the power cuts during the vote counting of the previous local elections as "a cat entered a transformer". Can you imagine how humorous the political environment in Turkey is? :)). Anyway, we gathered such context-specific sentiment words as well and included them in our dictionary. After this step we are almost done with our tool, but not yet.

Note 2: Although it is Latin-based, the Turkish alphabet has some special characters (e.g. ı, ü, ç, ş, ğ). However, while tweeting, because of English keyboards, some people do not use these letters but the closest characters in the standard Latin alphabet (e.g. i, u, c, s, g). To be able to capture this type of wording, we duplicated both our "TurkishWordList" and "EmotionLookupTable" dictionaries and replaced these special characters with standard ones in the duplicated half of each dictionary.
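A minimal sketch of this duplication step, assuming the dictionaries are plain text files with one entry per line (the file names here are placeholders):

// Minimal sketch: append ASCII-folded duplicates to a SentiStrength dictionary.
using System.Collections.Generic;
using System.IO;
using System.Linq;

class DictionaryFolder
{
    // Turkish-specific characters mapped to their closest standard Latin ones
    static readonly Dictionary<char, char> Fold = new Dictionary<char, char>
    {
        {'ı','i'}, {'ü','u'}, {'ç','c'}, {'ş','s'}, {'ğ','g'}, {'ö','o'},
        {'İ','I'}, {'Ü','U'}, {'Ç','C'}, {'Ş','S'}, {'Ğ','G'}, {'Ö','O'}
    };

    static string ToAscii(string s)
    {
        return new string(s.Select(c => Fold.ContainsKey(c) ? Fold[c] : c).ToArray());
    }

    static void Main()
    {
        foreach (var file in new[] { "TurkishWordList.txt", "EmotionLookupTable.txt" })
        {
            var lines = File.ReadAllLines(file);
            // Duplicate only the entries that actually contain a special character
            var folded = lines.Where(l => l != ToAscii(l)).Select(ToAscii).ToArray();
            File.WriteAllLines(file, lines.Concat(folded).ToArray());
        }
    }
}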

"By doing all these pre-work we enhanced our tool to better analyze Turkish language within the given context."


Data Preparation 1: Before running sentiment analysis on our tweets, to be able to track them throughout the study and to match SentiStrength results with the other tweet data, we generated an incremental surrogate key for our tweets in the database, so every tweet had a unique id.

Then we ran our sentiment analysis on the full set of tweets, keeping the unique tweet id. SentiStrength mainly gives two values as a result: one net positive sentiment strength and one net negative sentiment strength. It also gives a breakdown of each strength to the specific words it matched, based on the values in the "EmotionLookupTable". After we got our results, we imported them into our database and joined them with the existing tweet data.
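A minimal sketch of that join, assuming the SentiStrength output was loaded into a hypothetical SentiResults table keyed by the same surrogate TweetId:

// Minimal sketch: copy the sentiment scores onto the main tweet table.
using System.Data.SqlClient;

class SentiJoin
{
    static void Main()
    {
        // Table and column names are hypothetical
        const string sql = @"
            UPDATE t
            SET    t.PosStrength = s.PosStrength,
                   t.NegStrength = s.NegStrength
            FROM   dbo.Tweets t
            JOIN   dbo.SentiResults s ON s.TweetId = t.TweetId;";

        using (var conn = new SqlConnection("Server=.;Database=TweetMining;Integrated Security=true"))
        using (var cmd = new SqlCommand(sql, conn))
        {
            conn.Open();
            cmd.ExecuteNonQuery();
        }
    }
}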

Data Classification


After having all related data in the database, it is time to understand which tweet is talking about which candidate. In this phase, we again used an iterative keyword approach. After defining the best and most accurate keywords, we tried to classify our data into three classes. Having totally exclusive sets of data might give more accurate or realistic results, but as you can guess, there are too many tweets talking about more than one candidate. We did some human-eye analysis on these tweets and noticed that almost all tweets related to all three candidates were about survey results, so we eliminated these tweets from further analysis. As for the tweets talking about two candidates, instead of eliminating them all, we improved our analysis queries to handle them together with the other tweets of the same user.

Data preparation 2: After settling on our classification criteria, we added 3 binary fields (one per candidate) to our main tweet table and set each candidate's field to 1 based on the classification criteria defined for that specific candidate (a sketch of this flagging step follows below).
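A minimal sketch of the flagging, with hypothetical per-candidate keyword lists and the binary columns described above (our real keyword lists were iterated on by hand):

// Minimal sketch: set the per-candidate binary flags by keyword match.
using System.Collections.Generic;
using System.Data.SqlClient;

class CandidateFlagger
{
    static void Main()
    {
        // Hypothetical keyword lists per candidate flag column
        var candidates = new Dictionary<string, string[]>
        {
            { "IsAboutRTE", new[] { "keywordA1", "keywordA2" } },
            { "IsAboutEI",  new[] { "keywordB1", "keywordB2" } },
            { "IsAboutSD",  new[] { "keywordC1", "keywordC2" } }
        };

        using (var conn = new SqlConnection("Server=.;Database=TweetMining;Integrated Security=true"))
        {
            conn.Open();
            foreach (var candidate in candidates)
            {
                foreach (var keyword in candidate.Value)
                {
                    // Flag column name comes from our own dictionary above, not user input
                    var sql = string.Format(
                        "UPDATE dbo.Tweets SET {0} = 1 WHERE TweetText LIKE @kw", candidate.Key);
                    using (var cmd = new SqlCommand(sql, conn))
                    {
                        cmd.Parameters.AddWithValue("@kw", "%" + keyword + "%");
                        cmd.ExecuteNonQuery();
                    }
                }
            }
        }
    }
}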

Note 3: Common challenges in sentiment analysis are also valid for our study: our tool cannot correctly weight sarcastic comments, and it is difficult to identify direction in a tweet (who is the subject, what is the object, which of them a sentiment word relates to, etc.). At this point we had to assume that all these cases would be normalised over the full data set.

Data Aggregation & Evaluation


The main aim of this analysis is to estimate the allocation of vote percentages to candidates based on the posts of Twitter users, and to evaluate how well they represent the viewpoints and preferences of the whole population. Each citizen who is eligible to vote has a single vote. Therefore, by considering each Twitter user as a citizen, we need to aggregate tweet data at the user level.

To make this analysis, we aggregated all tweets of each user at the user level by calculating a weighted average sentiment strength based on tweet counts and net sentiment strengths for each candidate. If one of the candidates distinctively has the most positive average sentiment strength, we assumed that this user would vote for that candidate. With this approach, we predicted the votes of 10,164 Twitter users (compared to the real valid vote count of 40,566,232) and found the allocation below (RTE, EI and SD are the initials of the candidates):
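In code form, the user-level prediction looks roughly like the sketch below; the type and the names are hypothetical, and our real version was a set of SQL queries over the flagged tweet table.

// Minimal sketch: predict one vote per user from per-candidate sentiment averages.
using System;
using System.Collections.Generic;
using System.Linq;

class VotePrediction
{
    class TweetRow
    {
        public long UserId;
        public bool AboutRTE, AboutEI, AboutSD;
        public int NetSenti;   // net score, e.g. +3 positive and -1 negative gives +2
    }

    static double Avg(IEnumerable<int> scores)
    {
        // A user with no tweet about a candidate should never win the comparison
        return scores.Any() ? scores.Average() : double.NegativeInfinity;
    }

    static void Main()
    {
        var tweets = new List<TweetRow>();   // loaded from the database in practice

        var votes =
            (from g in tweets.GroupBy(t => t.UserId)
             let rte = Avg(g.Where(t => t.AboutRTE).Select(t => t.NetSenti))
             let ei = Avg(g.Where(t => t.AboutEI).Select(t => t.NetSenti))
             let sd = Avg(g.Where(t => t.AboutSD).Select(t => t.NetSenti))
             let best = Math.Max(rte, Math.Max(ei, sd))
             // "distinctively most positive": require a unique maximum
             where best > double.NegativeInfinity &&
                   new[] { rte, ei, sd }.Count(v => v == best) == 1
             select best == rte ? "RTE" : best == ei ? "EI" : "SD")
            .ToList();

        // Allocation percentages over the users we could predict
        foreach (var c in votes.GroupBy(v => v))
            Console.WriteLine("{0}: {1:F1}%", c.Key, 100.0 * c.Count() / votes.Count);
    }
}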


Although the estimated results are similar to the real election results, there are accuracy problems, especially for the top 2 candidates. After getting these initial results, and considering the political environment in the country, we also calculated "voting against" percentages per candidate. This is not a measurable metric in the real scenario, but it helped us judge how well our results represent the situation in Turkey. Similar to the "vote for" evaluation, this time we used the distinctively most negative average sentiment strength to identify the "vote against" candidate for each Twitter user.


The results based on Twitter data show that although 62% of users would "vote for" RTE, given the chance 58% of them would "vote against" RTE. You can evaluate the results for the other candidates as well. The results are acceptable and verifiable considering the political environment in the country.

Note 4: We considered all Twitter users as real citizens who are eligible to vote. However, there are many accounts that belong not to a person but to an organisation, an ideology, an affinity group, etc. These also have a manipulating impact on our results. Our next analysis will clarify the impact better.

As another analysis, we tried to find the best and worst influencers for each candidate. To capture total influence, we calculated a weighted average based on follower count and average net sentiment strength at the user level.
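One possible scoring along these lines is sketched below; the weighting is a judgment call and the example numbers are made up.

// Minimal sketch: score an account's influence for a candidate as its
// average net sentiment weighted by reach (follower count).
using System;

class InfluenceScore
{
    static double Influence(int followerCount, double avgNetSenti)
    {
        // Sign gives direction (best vs. worst influencer), magnitude gives weight
        return followerCount * avgNetSenti;
    }

    static void Main()
    {
        Console.WriteLine(Influence(500000, 1.4));    // large, supportive account
        Console.WriteLine(Influence(120000, -2.1));   // large, critical account
    }
}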

You can see some indicative results in the table below.


To make it clearer (especially for non-Turkish readers):
NTV: one of the mainstream news channels
TRT: Turkish Radio and Television Corporation, the public broadcaster
Sabah: one of the main newspapers
DHA: one of the news agencies in the country
YeniŞafak: a newspaper
Stargazete: a newspaper
AkitGazetem: a newspaper
AkşamGazetesi: a newspaper
CHP_online: account of the main opposition party
calıkosman: editor-in-chief of Samanyolu radio (which also has a TV channel known to be close to the Gülen community)
CHPvekilHaber: a news account of the main opposition party

It is worth noting that the sample above is not simply the top 11; entries were selected separately for each candidate so that each gets results. There are no distinctive influencers for the last candidate compared to the other two. For the first two, note that the impact of RTE's top influencer is more than 20 times that of EI's top influencer (even though EI's top influencer is the official account of the opposition party supporting EI). Considering the number of influencers and their total impact, much more can be said about the political environment in Turkey and the freedom of the media; I will hopefully share my subjective comments on this topic here soon. Coming back to our technical content: newspapers, other media institutions, special supportive accounts and the like, which are not actually voters, shift our results upwards for RTE, the incumbent prime minister at the time of the election.

Further improvements


This study was completed in a very short time, but still

"it shows that, at a high level, Twitter data illustrates the preferences of the general population."


Further human interpretation makes the results more meaningful and accurate. The model and approaches used can be improved for more detailed analysis:

- more classification algorithms can be run on the tweets to identify the exact subject candidate of tweets that could not be classified or were assigned to more than one candidate, thus extending the sample set
- more human intervention can improve the classification criteria
- more data cleaning activities can be applied to better analyze sentiment in tweets
- in this study only the word list and emotion lookup tables were updated to understand the local language; the other dictionaries (BoosterWordList, NegatingWordList, QuestionWords, etc.) can also be updated
- account owners that are not citizens but organizations can be eliminated when estimating election results
- further analysis can be done: what are the main characteristics of each candidate, how popular are they, what are the sensitive issues for Turkish voters, etc.; it depends on your questions
- much more ...

I have shared this study as an example for those who will conduct a similar analysis. I hope it gives you some ideas. If you have further questions or comments, feel free to comment/message.

And a small tweet selection... (at least for Turkish speakers)




Project Team: Fatma Cengiz, Begüm Göloğlu, Gizem Gündoğdu, Nihan Erkan, Tuğçe Sarıbay




Tuesday, July 29, 2014

02: Overview of Datastage Designer Client

After creating your project, it is time to get to know the Designer client and start developing ETL.


There are different sections in the Designer screen. At the top, you see the menu bar (1) and the toolbar (2) (I will explain each tool when we go deeper into the development process).

(3) Repository: DS suggests a standard repository structure to organize repository objects. You can add new folders and organize them together with the existing ones.

(4) Palette: the palette keeps the building elements (called stages in parallel and server jobs, and activities in sequence jobs) used to develop ETL jobs. Stages are organized into different sections based on their functionalities.

(5) Log: to make your life easier during development, you can show the log view in the Designer client. If it does not appear on your screen, you can activate it from the View menu.

(6) Canvas: the place where you design your jobs. You can drag and drop any stage from the palette onto the canvas and link stages to build an end-to-end job.

The first DS Job :)


Basically, an ETL job requires three stages:
1- a DB connector/enterprise/file stage to connect to the data source and extract data
2- a Transformer stage to make all necessary transformations and map to output columns
3- a DB connector/enterprise/file stage to connect to the target and load data

Depending on your needs, you can use further processing stages like Lookup, Copy, Filter, etc.


Friday, July 25, 2014

01: How to start with Datastage?

Let's start with the clients of Datastage and describe the main activities you will be doing with each.


Administrator Client: 

With the Administrator client you can mainly add, delete and move Datastage projects, control user permissions and environment variables, and manage some other administrative tasks. You can choose to create a new project from scratch or copy an existing one and continue in a phased approach.

Designer Client:

The client of development! You will be designing and building parallel, server or sequence jobs to fulfill your data integration initiative within the Designer. The Designer client is a graphical user interface that includes many building blocks (called a "stage" in parallel and server jobs and an "activity" in sequence jobs) to help you incorporate functional capabilities into your job design.

Director Client: 

You have developed your jobs and are ready to run! Within the Director client you can validate, run, reset, schedule and monitor your jobs.

We will go into detail about each client in the following posts, but let's start with creating our project.

1-) How to create a new Datastage Project?

Log in to the Administrator client and go to the "Projects" tab.




You will see the list of projects in the left pane.
With the options on the right, you can add or delete projects and control project-level properties.






When you click the "Add" button, you can give your project a name, specify the path, and copy roles from an existing project. When you check the related box, the drop-down list is activated so you can choose the project from which you want to copy roles.


2-) Do I need to create a new project for every reason?

Deciding whether to create a new project or build your jobs within an existing project is a design option. However, to be able to keep boundaries between different environments, control user privileges and ensure easy maintenance, it is better to have different projects for different subjects, especially in production environments. For example, you can have one DS project for your data warehouse and another for your accounting system. On the other hand, within the same repository, you can benefit from generic batch jobs that can be reused for different purposes through parameters. So before deciding to create a new project or continue with the existing repository, you need to evaluate the similarities and the required boundaries between projects.

3-) How to copy/move a project in Datastage?

You might want to copy or move your projects for many reasons: moving your repository to a new path, going from development to production, keeping different incremental repositories for phase-based development projects, etc.

To move/copy a project, create a new project in the desired path by copying the roles from the existing project that you want to move/copy. Then export the required DS components (jobs, table definitions, parameter sets, routines, etc.) from the existing project and import them all into your new project.

To export DS components from your existing project, open the Designer client and click Export > DatastageComponents in the menu bar. In the opened screen, specify the path where you want to save the export file. Then click 'Add' to select any component that you want to export from the repository. When you have selected all components to export, click the 'Export' button.

Then open your new project in the Designer client to import all components. Select Import > DatastageComponents, select the export file path in the opened screen and then click 'Import'. Now you have all the components (jobs, table definitions, parameters, etc.) that you exported in your new project.

You also need to consider that 'Project Properties' are not copied automatically. So if you want to have the same environment variables in your new project, you need to export them from the Administrator client and then import them into your new project.

4-) How to delete a Datastage Project?

In the Administrator client, on the Projects tab, select the project that you want to delete and click the Delete button. You are done :)



Monday, February 24, 2014

ETL Technology in market

It is time to start tool-based posts with Datastage and to share knowledge.

Before going deeper into Datastage, it is better to define what ETL is, and what the current understanding and position of this technology in the market are, for those new to the topic.



ETL ("extract, transform, load") is used to define process or supporting tools that are used to pull data out of one source (database, file ...etc), make necessary transformations and load to another database as a target. As I stated in my previous post "Why data integration geting more important?" , there are main requirements that makes us to think on data integration. Mainly,
(-) Data is not a stand alone asset anymore for enterprises or organizations. 
(+) Data is a commodity moving around the enterprise and going in to and coming out from other processes, other systems, other enterprises...etc. 

To handle this commodity throughout your systems, your enterprise and even your whole environment, instead of just worrying about full compatibility between systems, you need to think about your data integration capability. Compatibility and the bundling of complementary tools were used as a marketing and sales strategy in the early era of information technologies, or mainly in computer science. However, with the explosion of knowledge and technological developments, it is almost impossible for one vendor to respond to and meet all expectations in the market. You have probably already noticed that vendors stick more strictly to standards, and that focused expertise together with a partnership approach is becoming more popular.

With the discussions around Big Data and NoSQL systems, there are two main ideas on whether ETL will still be in use or not. You can find the viewpoints, from different sides, of Phil Shelley (former CTO of Sears Holdings and CEO of Metascale, who also established his Big Data consulting firm NPP-Newton Park Partners in 2013) and James Markarian (CTO of Informatica) in an InformationWeek article.
It is impossible not to listen to Shelley's idea that "since Hadoop came to the enterprise, we are beginning to see the end of ETL as we know it". But there are points where I do not completely agree with Shelley. When he says ETL, he focuses on the technology as it is now, but we know that technology tends to evolve according to requirements in the market, and the result does not necessarily have to be called a new technology. Additionally, I do not see each stage of ETL as a non-value-added activity. Within the context of relational databases and structured data, with a good design and good performance, ETL lets you add value to your data and turn your stand-alone asset into a commodity that can be used for different purposes throughout your organisation. Shelley might be completely right for Hadoop, but I really doubt whether "Hadoop came to the enterprise"! And even if it has, is it possible to have Hadoop as the only system? He also states: "Some subsets of data do have to be moved out of Hadoop into other systems, for specific purposes. However, with a strong and coherent enterprise data architecture, this can be managed to be the exception."

I do not want to get you lost in different articles before starting to learn a tool, but I also strongly believe that it is better to understand the requirement and motivation behind any effort. It might also be good to have a look at "The State of ETL: Extract, Transform and Load Technology", an article written by Alan R. Earls in DataInformed.

Not just for ETL but for all technologies, it is better to take this as conceptual knowledge and make use of it to understand new technologies. It is likely that ETL will not exist for many more years in its traditional form, but the logic behind it will remain: to extract data (filter and read, not necessarily to load it into a different system), transform it if necessary, and load it (into a new system, into a modeling tool or just into a user interface) to make data available to serve specific purposes.

I hope you will find the Datastage posts helpful for your ongoing tasks and for building a vision that gets you ready for new technologies.
