Tuesday, September 9, 2014

Twitter Bots Update

I now have some generic Python code up on GitHub for hosting an app on Google App Engine that reads from a text file. I have used it to add another Twitter bot to the Jane Austen Twitter Bot Universe, @SS_Bot4. I'll probably continue to improve the code and try to get the other three texts up along the way.

Monday, August 25, 2014

Creating a Twitter bot: A Link Round-up

My goal was to write a Twitter bot or two and see if I could run them in the cloud for free. I am declaring myself (provisionally) successful and want to list the resources I used.

My two Twitter bots are @P_P_Bot and @Emma_Bot_1. They are tweeting from Jane Austen's Pride and Prejudice and Emma, respectively. They start at the beginning and tweet the next bit every 10 minutes. My inspiration was @UlyssesReader, which tweets from Joyce's Ulysses. Austen is a bit less exciting, since her works have long been out of copyright, whereas Ulysses went out of copyright only recently. Still, I love her books as much as I love Ulysses, which is a lot.


  • First, I downloaded the texts of Pride and Prejudice and Emma from Project Gutenberg.  I saved the .txt versions.  
  • Next, I broke up the texts into sentences. I used Python for this. First I used the Natural Language Toolkit module (nltk) to do an initial pass. The nltk tokenizer has been trained on English language data, so it knows to split on periods, but not when they occur in abbreviations like "Mr." or "Mrs." I used the Punkt tokenizer. If you do this, make sure you have the latest version of nltk installed. I then cleaned up the results in Python, for example by eliminating excess newlines and splitting off chapter headings.
  • I broke up the resulting sentences into chunks of at most 140 characters for Twitter. The sentences that were already under 140 characters I kept as is. The rest I tried to split up intuitively on colons, semicolons, dashes, and commas, in that order. I wrote my own Python code to do this. The function I wrote to split on commas is the most complicated, but it works on the Jane Austen texts. My goal was to split on commas while keeping each chunk as close to 140 characters as possible.
  • Next I learned how to write a simple Twitter bot that tweets from your desktop. The tutorial I used is here. Note: in order to choose the "read and write" option for your Twitter app, you have to enable your mobile phone on your new Twitter account. After you do this, you can't use your mobile on any other account. So, to make more than one Twitter bot, you need to enable your mobile phone, choose the "read and write" option for the app, and then disable your mobile phone so you can make the next bot.
  • I got the Twitter bots up and running from my desktop. My final goal was to put them in the cloud. I used this tutorial to make my move. I downloaded Bill the Lizard's code from GitHub to examine. Warning: even with a good tutorial, Google App Engine has a learning curve. First, when you are working on your desktop, make sure to check the log file! If there is something wrong with your code, you will not be able to view your app on localhost. I spent a futile hour or two trying to figure out what was wrong with localhost and my ports when the problem was in my code (doh). Second, I encourage you to look carefully at the different files that make up the app. Before you deploy your code to the cloud, make sure the correct application identifier is in the app.yaml file. Third, in order to use the tweepy module, you have to add it, and its dependencies, to your project. The dependencies I ended up adding are the requests module, the requests_oauthlib module, and oauthlib. I also had to modify the .yaml file as indicated in the comments on Bill's original tutorial.
  • In order to keep track of my place in each book, I needed to save an index to the cloud. This is when I learned that you cannot write to a file in Google App Engine. Instead, you have to save the value to one of their data objects (App Engine's datastore). There is a lot of documentation explaining how to do this. Also, you can inspect and even manually change your data objects from your application dashboard, which is nice.
  • If you are creating a cron job, which I did, first make sure you can send tweets from your online application before you add the cron job. Then you can monitor whether the cron job is working from the "Cron Jobs" link in your application's dashboard. If it is not working, it is probably because you have a fiddly syntax error like I did. Check your cron.yaml file, your app.yaml file, and your Python functions to make sure they match up in the same way as in Bill the Lizard's example.
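The chunking step in the list above can be sketched roughly as follows. This is not the code the bots actually run; it is a minimal illustration of the idea of trying colons, semicolons, dashes, and commas in that order and greedily packing the pieces up to 140 characters. The function names, the use of "--" for dashes (as Project Gutenberg plain-text files usually write them), and the fallback to splitting on spaces are all assumptions made for the example.

```python
LIMIT = 140

# Delimiters tried in the order described in the post. Treating a dash as "--"
# and falling back to spaces at the end are assumptions of this sketch.
DELIMITERS = [":", ";", "--", ",", " "]


def split_keep(text, delim):
    """Split text at delim, leaving the delimiter attached to the left piece."""
    parts = text.split(delim)
    pieces = [p + delim for p in parts[:-1]] + [parts[-1]]
    return [p for p in pieces if p.strip()]


def chunk(text, limit=LIMIT, delims=DELIMITERS):
    """Break text into chunks of at most `limit` characters."""
    text = text.strip()
    if len(text) <= limit:
        return [text] if text else []
    if not delims:
        # Nothing left to split on: fall back to a hard cut.
        return [text[i:i + limit] for i in range(0, len(text), limit)]
    head, rest = delims[0], delims[1:]
    if head not in text:
        return chunk(text, limit, rest)

    chunks, current = [], ""
    for piece in split_keep(text, head):
        candidate = current + piece
        if len(candidate.strip()) <= limit:
            # Greedy packing: keep adding pieces while the chunk still fits.
            current = candidate
            continue
        if current.strip():
            chunks.append(current.strip())
        if len(piece.strip()) <= limit:
            current = piece
        else:
            # This piece is too long on its own: try the next delimiter.
            current = ""
            chunks.extend(chunk(piece, limit, rest))
    if current.strip():
        chunks.append(current.strip())
    return chunks


if __name__ == "__main__":
    long_sentence = ", ".join(["a fairly long clause goes here"] * 8)
    for part in chunk(long_sentence):
        print(len(part), part)
```

A sketch this simple can leave a very short chunk when a delimiter falls near the start of a long sentence, which is one reason a real comma-splitting function ends up more complicated.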
At the moment, my Twitter bots are tweeting once every 10 minutes. The final question is whether they are going to exhaust Google App Engine's free quota of "frontend instance hours." Based on the current daily data, I think they are OK, but if not I will reset them to tweet once every 15 minutes.
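Each scheduled run only has to do one small thing: look up the saved position, tweet that chunk, and save the new position. Here is a minimal sketch of that step in plain Python. The InMemoryStore class is a hypothetical stand-in for the cloud data object, and post is any function that sends a tweet (for example, a thin wrapper around tweepy); neither is taken from the real bot code.

```python
class InMemoryStore:
    """Hypothetical stand-in for a cloud data object that holds the index."""

    def __init__(self):
        self._data = {}

    def get(self, key, default=None):
        return self._data.get(key, default)

    def put(self, key, value):
        self._data[key] = value


def tweet_next(chunks, store, post, key="position"):
    """Post the next chunk and advance the saved index.

    Returns the text that was posted, or None when the book is finished.
    """
    index = store.get(key, 0)
    if index >= len(chunks):
        return None
    post(chunks[index])
    # Save the new position only after the post succeeded, so a failed
    # tweet is retried on the next run instead of being skipped.
    store.put(key, index + 1)
    return chunks[index]


if __name__ == "__main__":
    store = InMemoryStore()
    book = ["First chunk.", "Second chunk."]
    while tweet_next(book, store, print) is not None:
        pass
```

Passing post in as an argument keeps the step easy to try on the desktop with print before wiring it up to a real Twitter client and a real data object.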

I'm pretty happy about this experiment. One thing it underscored for me was some of the advantages of Python over R. When it comes to statistical computing and ease of package/module installation, R is superior, in my opinion. But the nltk module is probably better than any equivalent in R; there is, as far as I know, no equivalent of tweepy for R; and Google App Engine doesn't support R!

Friday, August 22, 2014

This is my new blog for recording my attempt to go from Statistics M.S. graduate to data scientist. My plan is to record my journey as I:

  • Listen to the lectures from the Coursera Machine Learning course
  • Try the exercises in the book Agile Data Science
  • Decide how to deal with some older projects I have hanging around
I am also looking for a job in data science right now. I will talk about my experiences job-hunting in a general and discreet fashion :)