The following abstract was submitted to the Tom Tom Founders Festival, which will be held in Charlottesville, VA, from April 9 to April 15, 2018. It is an excerpt of the research, and future posts will follow.
Association analysis is a machine learning technique that finds patterns in how items occur together in data collected over a period of time. Also known as market basket analysis or affinity analysis, it uses transaction records to develop rules about which items tend to appear together. For example, in market basket analysis a data scientist can review customers' point-of-sale transactions to see how often baby wipes are purchased with baby pacifiers and determine whether the association between those products is strong enough to justify placing them in the same aisle. The reasoning behind that decision is that, given past purchase patterns across many customers, keeping these items close together can increase their sales. In this project, association analysis is applied to Twitter follower retweets and quotes to understand patterns across hundreds or thousands of tweets: which tweets containing politically polarizing terms have the strongest associations, and whether the tweets with the highest associations carry negative or positive connotations.
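As a rough illustration of what "strong enough association" can mean, the sketch below computes the three measures commonly used in market basket analysis (support, confidence and lift) for the rule "baby wipes -> pacifier". The baskets are invented for illustration and the function names are hypothetical; this is not the code or data used in the project.

```python
# Hypothetical point-of-sale baskets, invented for illustration only.
baskets = [
    {"baby wipes", "pacifier", "diapers"},
    {"baby wipes", "pacifier"},
    {"baby wipes", "milk"},
    {"pacifier", "bread"},
    {"milk", "bread"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def rule_metrics(antecedent, consequent, transactions):
    """Support, confidence and lift for the rule antecedent -> consequent."""
    s_both = support(set(antecedent) | set(consequent), transactions)
    s_a = support(antecedent, transactions)
    s_c = support(consequent, transactions)
    # Confidence: how often the consequent appears when the antecedent does.
    confidence = s_both / s_a if s_a else 0.0
    # Lift: confidence relative to the consequent's baseline frequency.
    lift = confidence / s_c if s_c else 0.0
    return {"support": s_both, "confidence": confidence, "lift": lift}


metrics = rule_metrics({"baby wipes"}, {"pacifier"}, baskets)
print(metrics)
```

A lift above 1 suggests the two items appear together more often than their individual frequencies would predict, which is the kind of evidence that might support placing them in the same aisle.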
OBJECTIVES AND METHODS
Python, a powerful data analysis language, was used to connect to Twitter and stream tweet data. The retweeted or quoted text, user information, number of retweets, number of followers and number of quotes were downloaded into a NoSQL document database as JSON, an easy-to-read document format used by Twitter. Python and SAS tools were then used to build association rules based on a taxonomy of user ids and what was retweeted. Sequencing discovery analysis was applied to take into account the order of the associations among retweets and quotes and so strengthen the association. About thirty politically polarizing terms were used to mine tweets in the NoSQL database. New JSON documents were built from the tweets containing these terms and then analyzed for sentiment using a Python sentiment analysis program. The primary objective was to build a story of which terms are strongest in retweets and quotes, and how positive or negative the context of those tweets is. The secondary objective is to use visualization libraries in Python to show quantitatively how users respond to politically divisive words in tweets. Finally, a Twitter “bot” account was built to retweet the most strongly associated tweets and to compare the “bot” account's retweets with those from other accounts. In the future, this project will analyze how organic the retweets of the “bot” account have been. In other words, can a “bot” account that only retweets strong associations model a “human” account that retweets tweets with the same terms?
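The term-mining step described above could be sketched roughly as follows. Everything here is an assumption made for illustration: the placeholder terms (the roughly thirty terms used in the project are not listed here), the sample documents, the field names `text` and `retweet_count`, and the added `matched_terms` field. The NoSQL database is replaced by a plain list of JSON strings so the example is self-contained.

```python
import json
import re

# Placeholder terms; the actual terms used in the project are not given.
TERMS = ["term_a", "term_b"]

# Stand-in for documents read from the NoSQL store (hypothetical content).
raw_documents = [
    '{"id": 1, "text": "RT something about term_a today", "retweet_count": 12}',
    '{"id": 2, "text": "nothing relevant here", "retweet_count": 3}',
    '{"id": 3, "text": "Term_B and term_a together", "retweet_count": 40}',
]


def matched_terms(text, terms):
    """Return the terms that occur in text as whole words, ignoring case."""
    found = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(term)
    return found


def build_term_documents(raw_docs, terms):
    """Keep only tweets that mention a term and record which terms matched."""
    selected = []
    for raw in raw_docs:
        doc = json.loads(raw)
        hits = matched_terms(doc.get("text", ""), terms)
        if hits:
            # New JSON document: the original fields plus the matched terms.
            selected.append(json.dumps({**doc, "matched_terms": hits}))
    return selected


term_documents = build_term_documents(raw_documents, TERMS)
for document in term_documents:
    print(document)
```

The resulting documents could then be passed to whichever sentiment analysis program is in use; that step is left out here because the specific program is not named.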
RESULTS AND CONCLUSIONS
Market basket analysis is popular in data mining. It typically involves two data variables: a transaction and an item. In this research, the item is the retweet or quote, and the transaction is the user who retweeted or quoted a tweet. Association strength is also based on frequency; in this case, the numbers of retweets and quotes are part of the transaction. Among the discoveries made during this research, users with the highest number of followers had the most retweets, with negative tweets exceeding positive tweets by about 7-15% in association strength. In quotes, however, positive tweets exceeded negative tweets by about 23%. The sample data for the “bot” account had a standard error of about 25% relative to the original training data. This result may have been somewhat skewed by the small amount of data applied to the bot thus far.
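To make the transaction/item mapping concrete, the sketch below groups retweet records by user, so that each user becomes a transaction and each retweeted tweet an item, and then counts how often pairs of tweets are retweeted by the same user. The records, ids and function names are hypothetical, and simple pair counting stands in for the Python and SAS rule-building tools, whose exact configuration is not described here.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical (user_id, retweeted_tweet_id) records, invented for illustration.
records = [
    ("u1", "t1"), ("u1", "t2"),
    ("u2", "t1"), ("u2", "t2"), ("u2", "t3"),
    ("u3", "t1"), ("u3", "t2"),
    ("u4", "t2"), ("u4", "t3"),
]


def build_transactions(rows):
    """Group retweeted tweet ids by user: one transaction per user."""
    by_user = defaultdict(set)
    for user_id, tweet_id in rows:
        by_user[user_id].add(tweet_id)
    return dict(by_user)


def pair_counts(transactions):
    """Count how many users retweeted both tweets of each pair."""
    counts = Counter()
    for items in transactions.values():
        # Sorting makes each pair appear in one canonical order.
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return counts


transactions = build_transactions(records)
counts = pair_counts(transactions)
strongest_pair, strongest_count = counts.most_common(1)[0]
print(strongest_pair, strongest_count)
```

Dividing a pair's count by the number of users gives its support, which is one frequency-based way to rank which retweets are most strongly associated.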