Twitter Open Sources Data Analysis Code

Twitter announced that it has decided to open source some of the core code it relies on for its data analysis.

Twitter is releasing code that it uses with Hadoop and Pig to work with LZO compression and Protocol Buffers. “Use it, fork it, improve upon it,” says Kevin Weil of Twitter’s Analytics team.

Weil goes on to explain Hadoop, Pig, and HBase, and how exactly Twitter uses Hadoop. “Twitter has large data storage and processing requirements, and thus we have worked to implement a set of optimized data storage and workflow solutions within Hadoop,” he says. “In particular, we store all of our data LZO compressed, because the LZO compression turns out to strike a very good balance between compression ratio and speed for use in Hadoop. Hadoop jobs are generally IO-bound, and typical compression algorithms like gzip or bzip2 are so computationally intensive that jobs quickly become CPU-bound. LZO in contrast was built for speed, so you get 4-5x compression ratio while leaving the CPU available to do real work.”
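The tradeoff Weil describes, compression ratio against CPU time, can be measured roughly with a few lines of Python. This is only a sketch under stated assumptions: LZO is not part of the Python standard library, so zlib at its fastest setting stands in here for a speed-oriented codec, and the sample records are invented for illustration. The numbers it prints will vary by machine and say nothing about Twitter's actual data or results.

```python
import bz2
import time
import zlib


def measure(name, compress, data):
    """Return (name, compression ratio, elapsed seconds) for one codec."""
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    return name, len(data) / len(packed), elapsed


# Hypothetical sample data: repetitive, log-like records (not Twitter's format).
sample = b"".join(
    b"user_id=%d\taction=tweet\tstatus=ok\n" % (i % 1000)
    for i in range(200_000)
)

results = [
    # Assumption: zlib level 1 is a stand-in for a fast codec such as LZO.
    measure("zlib level 1 (fast)", lambda d: zlib.compress(d, 1), sample),
    measure("zlib level 9", lambda d: zlib.compress(d, 9), sample),
    measure("bz2 level 9", lambda d: bz2.compress(d, 9), sample),
]

for name, ratio, seconds in results:
    print(f"{name:22s} ratio={ratio:6.1f}x  time={seconds * 1000:8.1f} ms")
```

On most inputs the heavier settings buy a better ratio at the cost of more CPU time per byte, which is the cost Weil says pushes Hadoop jobs from IO-bound to CPU-bound.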

“Having code automatically generated from a simple data structure definition has helped us move very quickly and make fewer mistakes in our analysis infrastructure at Twitter,” he says. “You can even hook in and add your own code generators from within the framework. Please do, and submit back!”

For more of his explanation, read this post. Documentation is available here.

By Chris Crum

Chris Crum is a staff writer and content coordinator for iEntry.
