In a previous post, titled, Real-Time Big Data Analysis of Twitter Stream: Part 1, we discussed a Big Data Analytics competition among our Analytics and Information Management (AIM) practice whereby the New York team analyzed Twitter data in real time using PHP, the Twitter Streaming API, and Apache Flume. While this met our needs for real-time flow visualization, we also wanted to store and display analytics about a large number of tweets using Big Data technologies.
Our goal was to collect and store millions of tweets in Amazon Redshift which would make those tweets available for querying by reporting tools, such as Birst or Oracle Business Intelligence. We chose Amazon Redshift because of its ability to count and aggregate large amounts of data quickly. We used Amazon Elastic MapReduce (EMR) with Hive and a JSON SerDe to turn the tweet JSON objects into columnar data and to process out duplicate tweets that snuck their way into the flow. We chose EMR because of its ability to process large amounts of data, particularly JSON objects, within Amazon S3. Finally, we set up a new Sink in Apache Flume, a File Roll, which is a set of files containing all of the tweets as they come in, rolled over every hour. With that, our flow from the Twitter Streaming API all the way through to Amazon Redshift was complete.
However, we needed a way to automate this whole process, and that’s how SONIC, Simply Our NY Integration Console, was born. SONIC monitors and controls every step of the flow. We started with the monitoring. We used Cron to execute a PHP script once every minute. The script checks that all systems are up, including the connection with the Streaming API, Apache Flume and Elasticsearch. It also checks that the File Roll output from Flume is growing with more tweets. If any of these systems is down or not working, an email is sent to the administrator.
The second function of SONIC is to control the flow. Once per day (or when an administrator initiates it), SONIC performs an incremental load of tweets into Amazon Redshift. The steps include:
1) Check that SONIC is not already running
2) Sync files from the File Roll on Amazon EC2 into Amazon S3 using the AWS Command Line Interface (CLI)
3) Kick off the EMR job using the AWS CLI
4) Wait until the EMR job completes
5) Kick off the Redshift load script using psql command
6) Wait until the data is loaded into Redshift
7) Check that the number of tweets in Redshift has increased
8) Archive the files that have been loaded
The final task was to create a front-end for SONIC where a user can see the status of all systems and processes and where an administrator can initiate an ad-hoc load. We used PHP and an HTML5 canvas to generate this page.
In summary, this project, along with the project described in Part 1 of this blog series, illustrates three key points:
a.) The Twitter Streaming API coupled with a variety of data integration and analytics technologies makes analyzing the Twitter stream possible with minimal effort.
b.) Big Data, and more specifically, Twitter, in this example, provides significant business opportunity to sales, marketing and even finance professionals who can now rely on empirical data for critical decision making around things such as prospecting, forecasting and marketing spend.
c.) Cervello consultants are an innovative team with a strong technical aptitude and a penchant for problem solving.
If you’ve transformed Big Data into a big opportunity for your business or need help doing so, we’d like to hear from you.
Interested in learning more? Join our mailing list!