Twitter data has become the go-to dataset for all things Big Data, averaging around 500 million tweets per day. The Twitter Streaming API opens up a world of possibility: it gives developers real-time access to the garden hose of tweets flowing through the Twitterverse, and it gives sales and marketing professionals a tremendous amount of useful data and insight into what their customers are saying about their products and services.
Over the past few months, Cervello’s Analytics and Information Management (AIM) practice has been hard at work on a Big Data Analytics competition. Divided into three teams (Boston, New York, and Dallas), the AIM practice was charged with defining a Big Data dataset and implementing an integration and analytics solution using Big Data technologies. For the New York team’s entry, we sought to combine a real-time flow of tweets with some cool custom visualizations, including Google Maps.
Using a medium Amazon EC2 instance running Ubuntu Linux, we started by setting up a connection to the Twitter Streaming API with PHP and cURL. We specified a list of 35 keywords to filter on, including “Bitcoin,” “cryptocurrency,” and “exchange rates.” Pretty quickly we were pulling around 500 tweets per minute into a plain text file. Because we wanted to store and query the tweets quickly and easily, we chose Elasticsearch as our target database. Elasticsearch is a scalable, document-oriented database that is effective at handling JSON objects, the format in which tweets are delivered by the Streaming API. Then it was just a matter of directing the PHP output into Elasticsearch; Apache Flume is the tool for that. Apache Flume is a service for collecting and moving large amounts of streaming log data in real time.
Configuration of Apache Flume is handled conveniently inside one configuration file. This is where you set up all the Sources, Interceptors, Channels, and Sinks that your flow requires:
- Source: Output text file from PHP
- Interceptor: Java program that cleans and manipulates tweets mid-stream
- Channel: Holding area for tweets on their way to the Sink
- Sink: Elasticsearch database
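Our exact agent definition isn’t reproduced here, but a minimal one along these lines wires the four pieces together. The file path and the interceptor class name are placeholders (the interceptor stands in for our custom Java cleaner); the source, channel, and sink types are standard Flume components.

```properties
# One Flume agent: tail the PHP output file -> clean -> buffer -> Elasticsearch
agent.sources = tweetFile
agent.channels = memCh
agent.sinks = esSink

# Source: the plain text file the PHP process appends tweets to
agent.sources.tweetFile.type = exec
agent.sources.tweetFile.command = tail -F /var/data/tweets.txt
agent.sources.tweetFile.channels = memCh

# Interceptor: custom Java class that cleans tweets mid-stream (placeholder name)
agent.sources.tweetFile.interceptors = clean
agent.sources.tweetFile.interceptors.clean.type = com.example.TweetCleanInterceptor$Builder

# Channel: in-memory holding area for tweets on their way to the sink
agent.channels.memCh.type = memory
agent.channels.memCh.capacity = 10000

# Sink: Flume's Elasticsearch sink, writing to a local cluster
agent.sinks.esSink.type = org.apache.flume.sink.elasticsearch.ElasticSearchSink
agent.sinks.esSink.hostNames = 127.0.0.1:9300
agent.sinks.esSink.indexName = tweets
agent.sinks.esSink.channel = memCh
```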
After a few iterations of our Apache Flume configuration file, our flow into Elasticsearch was running smoothly. We could see a tweet arrive in our database only a second or two after it was created!
Our next task was to put some real-time visualizations on top of the data in Elasticsearch. Kibana is a tool that integrates automatically with Elasticsearch and is great for basic visualizations, like line graphs. However, we wanted to build a few visualizations of our own. Using jQuery on an HTML page, we built a framework for real-time querying of Elasticsearch. Every 10 seconds, the webpage retrieves the previous 10 seconds of tweets from Elasticsearch and queues them up to be visualized in the order they were created. We then used some awesome libraries, like D3.js and Canvas.js, to build fun visualizations of our keywords, as depicted below.
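The polling query itself is simple. Sketched here in Python rather than the jQuery we actually used, and assuming each indexed tweet carries a `timestamp` date field (a field name chosen for illustration), it is just an Elasticsearch range query over the last 10 seconds, sorted oldest-first so the tweets can be replayed in creation order:

```python
def last_window_query(window_seconds=10):
    """Build an Elasticsearch query body for tweets created in the last
    `window_seconds`, using ES date math ("now-10s"). Sorted ascending so
    the page can queue the hits in the order they were created."""
    return {
        "query": {
            "range": {
                "timestamp": {
                    "gte": "now-%ds" % window_seconds,
                    "lt": "now",
                }
            }
        },
        "sort": [{"timestamp": {"order": "asc"}}],
        "size": 500,
    }
```

Every 10 seconds, the page POSTs a body like this to the index’s `_search` endpoint and hands the hits to the visualization queue.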
In closing, this project illustrates three key points:
a) The Twitter Streaming API, coupled with a variety of data integration and analytics technologies, makes analyzing the Twitter stream possible with minimal effort.
b) Big Data, and Twitter data in particular, provides a significant business opportunity to sales, marketing, and even finance professionals, who can now rely on empirical data for critical decisions around prospecting, forecasting, and marketing spend.
c) Cervello consultants are an innovative team with a strong technical aptitude and a penchant for problem solving.
If you’ve transformed Big Data into a big opportunity for your business or need help doing so, we’d like to hear from you.
Interested in learning more? Join our mailing list!