Thursday, September 8, 2011

Big Data: Do we need more use cases?

Today, we can comfortably say that Big Data is an accepted term and concept - most people who work in information technology have at least a working definition of it. Everyone understands that data is growing at an exponential rate; it will continue to do so for decades, and there is no reason for it to stop. New analytical needs that are not well suited to existing data warehouses, together with growing volumes of source data, are among the biggest drivers of this relatively new space. The exponential drop in the cost of bandwidth, storage and computing makes big data applications both possible and economical. The wider acceptance of cloud computing has made the business case for big data even stronger. And as companies move from process-driven to analytics-driven business, the interest in big data is only going to multiply in the coming years. So overall it's great news for all entrepreneurs, technologists and vendors who want to offer products and services in the big data space.

Almost every other day there is some interesting news in the big data world - last Tuesday Fujitsu announced that it will offer a cloud platform to leverage big data, and MapR, an Apache Hadoop distributor, has just secured twenty million dollars in funding led by Redpoint Ventures. Numerous others, such as 10gen, Cloudera, Neo Technology, Loggly and Hypertable, were funded earlier. Today, the market seems most receptive to Apache Hadoop-based distributions, which include offerings from companies like Cloudera, IBM and EMC.

But big data adoption has still not reached a tipping point in enterprises, even though you will find many pilot projects. The primary reason is that there is still confusion about which problems are big data problems. What are the right questions to ask? How do you start and structure a project? What is the scope of a project? Cloudera has done a great job of raising awareness, but the focus is still very technical.

The story is very different if you consider pioneers like Google and Yahoo and early adopters like LinkedIn and Facebook. Big data is a critical part of their business, and their needs are very different. Startups in the web space like Bitly, Foursquare and many others have similar needs - maybe not in terms of scale, but at least in terms of flavor. Also, not every company is going to have a petabyte problem - in fact, most companies have a terabyte problem. The ecosystem of tools in this space is maturing very quickly, though it is still immature compared to other fields.

The majority of the existing use cases are around click stream analysis, log analysis, marketing analytics, text processing and the like. For example, NTT Communications built a log analysis system for marketing using Hadoop, which explores internet users' interests in, or feedback about, specified products or themes from access logs and queries. CBS Interactive used Hadoop as its web analytics platform, processing one billion weblogs daily from hundreds of its web site properties. Right now, targeted advertising is the biggest market for big data applications, and there is a lot of interest in applying big data to social media analytics. While the problem of finding unexpected patterns in unstructured data makes sense, there is still a lot of work to be done for structured data, because not all problems in the enterprise are going to be around unstructured data, especially if you consider industries like financial services. The schemaless and flexible nature of the NoSQL technologies used to solve big data problems is beginning to be viewed as very attractive, specifically for ETL-style operations.
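To make the log analysis use case a little more concrete, here is a minimal sketch of the map/shuffle/reduce pattern that Hadoop-style jobs follow, written in plain Python with no Hadoop involved. The "user url" log format, the sample data and all function names are assumptions made up for illustration; they are not taken from the NTT or CBS systems mentioned above.

```python
from collections import defaultdict


def map_phase(line):
    """Emit (url, 1) for one log line.

    The two-field "user url" format is a hypothetical, simplified log layout.
    Lines that do not have at least two fields are skipped.
    """
    parts = line.split()
    if len(parts) >= 2:
        yield parts[1], 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Sum the counts collected for one URL."""
    return key, sum(values)


def count_page_views(lines):
    """Run map, shuffle and reduce in-process and return {url: view_count}."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


# Made-up sample data, including one malformed line.
SAMPLE_LOG = [
    "alice /products/tv",
    "bob /products/tv",
    "alice /products/phone",
    "malformed",
]

if __name__ == "__main__":
    print(count_page_views(SAMPLE_LOG))
```

On a real cluster the map and reduce functions would run in parallel across many machines and the framework would handle the shuffle; the sketch only shows the shape of the computation, which is the part an enterprise team has to learn to think in.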

In the end, the promise of big data is about building models, establishing cause-and-effect relationships and predicting outcomes. The expectations are high; it is even considered to be on the critical path to building personalized medicine in the future. But we need to take many baby steps before we get there. Not having the right skill set to focus on big data is another problem, and the steep learning curve of technologies like Hadoop/MapReduce doesn't make it any easier. Abstractions over MapReduce like Cascading, Pig and Hive are good approaches but are still relatively new. The market for use-case-driven applications is the least understood and remains untapped. Using the public cloud for big data seems like a no-brainer from a cost perspective, but there are practical problems, such as how to move large amounts of data to and from the cloud. The scalability problem is largely solved by technology like Hadoop, but building analytical applications is where the challenge is. Big data is here to stay and the space has become very interesting, but many unknowns remain before it becomes mainstream.
