How to get started with Text Analysis in Python (and analyzing Donald Trump's tweets).
4:55pm - 5:20pm on Friday, October 5 in PennTop South
Dave Klee
- Audience Level: All
- Slides: https://github.com/daveklee/textblob-pygotham2018
Overview
Python makes it pretty easy to deal with text, and there are a lot of great packages out there to help. This talk will show you how to get up and running from a true beginner’s perspective using NLTK and TextBlob to make a classifier that might tell you who’s writing Donald Trump’s tweets.
Description
There are some incredibly powerful tools available to analyze text with Python, so much so that Python is often the preferred choice for more advanced programmers and data scientists.
But if you’re not a data scientist or are just starting out with Python, it can be hard to know how to get going, especially when you start looking to the interesting world of Machine Learning for help. You can start with an introduction to text analysis and, before you know it, be buried in so many acronyms, equations, and so much jargon that it’s hard to know how to accomplish anything useful.
This talk will serve as a true beginner’s guide to analyzing text with Python, focusing on the TextBlob package, which does a nice job of making text operations simple while tapping into some of the power of NLTK, the pervasive Natural Language Toolkit.
We’ll go through a quick intro to TextBlob and show how you can start to break text down into pieces, classify sentences into categories, see if a computer thinks text is happy or sad, and do a variety of other basic Natural Language Processing (NLP) tasks with some pretty simple code, even building a basic text classifier with machine learning. And to keep it interesting, we’ll do it all using examples from one of the most fascinating text sets available today: Donald Trump’s tweets.
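As a rough illustration of the kind of code involved, here is a minimal sketch that uses TextBlob to split text into sentences and words and to score its sentiment. It is not taken from the talk's slides. The helper names `summarize` and `label_polarity`, and the 0.1 cutoff for "happy" versus "sad", are assumptions made up for this example. It also assumes TextBlob and its corpora are installed (`pip install textblob`, then `python -m textblob.download_corpora`).

```python
def label_polarity(polarity, threshold=0.1):
    """Turn a polarity score in [-1.0, 1.0] into a rough mood label.

    The threshold is an arbitrary assumption chosen for illustration.
    """
    if polarity > threshold:
        return "happy"
    if polarity < -threshold:
        return "sad"
    return "neutral"


def summarize(text):
    """Break text into sentences and words and estimate its sentiment."""
    # Imported lazily so this module still loads if TextBlob is not installed.
    from textblob import TextBlob

    blob = TextBlob(text)
    polarity = blob.sentiment.polarity  # -1.0 (negative) to 1.0 (positive)
    return {
        "sentences": [str(sentence) for sentence in blob.sentences],
        "words": list(blob.words),
        "polarity": polarity,
        "mood": label_polarity(polarity),
    }
```

Calling `summarize("Python makes text analysis fun. I love it!")` would return a dictionary holding the two sentences, the individual words, a polarity score, and the derived mood label.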
Donald Trump’s tweets are an interesting application of text classification because it’s been widely reported that multiple people tweet as @realdonaldtrump; in addition to the president, a few other key staffers are in the mix. In fact, for years a somewhat open secret (an Android phone) made it pretty easy to tell the difference between a tweet from staff and a tweet from Donald Trump himself. But that all changed in March 2017, when the president was forced to trade in his dated Samsung Galaxy S3 for an updated iPhone.
This provides a great opportunity to dig into applications of text classification with machine learning tools like TextBlob and Naive Bayes, to see how it might be possible to separate staff tweets from messages written directly by the president, and to train computers to help classify large amounts of human language into useful categories.
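To give a feel for what a Naive Bayes text classifier does under the hood, here is a small standard-library sketch of multinomial Naive Bayes with add-one smoothing. It is an illustration under stated assumptions, not the talk's actual code. The class `TinyNaiveBayes`, the labels `author_a` and `author_b`, and the training strings are all invented placeholders, not real tweets.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of examples
        self.vocab = set()                       # every word seen in training

    def train(self, examples):
        """Count words per label from (text, label) pairs."""
        for text, label in examples:
            self.doc_counts[label] += 1
            for word in tokenize(text):
                self.word_counts[label][word] += 1
                self.vocab.add(word)

    def log_scores(self, text):
        """Return log(prior) + sum of log(word likelihoods) for each label."""
        total_docs = sum(self.doc_counts.values())
        words = tokenize(text)
        scores = {}
        for label, n_docs in self.doc_counts.items():
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for word in words:
                # Add-one smoothing keeps unseen words from zeroing the score.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            scores[label] = score
        return scores

    def classify(self, text):
        """Pick the label with the highest log score."""
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Invented toy examples (placeholders, not real tweets).
TOY_EXAMPLES = [
    ("join us at the event tonight, tickets at the link", "author_a"),
    ("thank you to everyone who came out today", "author_a"),
    ("watch the full interview at the link below", "author_a"),
    ("so unfair, a total mess", "author_b"),
    ("very sad and very unfair", "author_b"),
    ("a total mess, sad", "author_b"),
]

classifier = TinyNaiveBayes()
classifier.train(TOY_EXAMPLES)
```

With this toy data, `classifier.classify("tickets at the link")` picks `author_a` and `classifier.classify("total mess and very unfair")` picks `author_b`. In practice, TextBlob provides `textblob.classifiers.NaiveBayesClassifier`, which is trained on a similar list of `(text, label)` pairs and exposes a `classify` method, so you would not normally write the counting by hand.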