Thursday, September 29, 2011

Data Without Borders: Data can be a burden if it is not set free!

According to Wikipedia, philanthropy etymologically means "the love of humanity": love in the sense of caring for, nourishing, developing, or enhancing. Historically, philanthropy has been associated with generous donations of money. That association will continue, but some professionals can make a bigger impact by donating their skills than their money. Yes, I am talking about the skill set of data science! Data science is a relatively new term; it probably originated with data geeks working on hard data problems at companies like LinkedIn, Facebook and other technology companies, which needed these experts to make sense of, and draw insights from, the vast amount of data being produced every day. Data Without Borders, a newly founded organization, seeks to match non-profits in need of data analysis with freelance and pro bono data scientists who can help them with data collection, analysis, visualization, or decision support. The concept is brilliant and makes sense!

There are various initiatives out there in which technology is being leveraged creatively to help non-profit organizations. The Bill and Melinda Gates Foundation has recently funded a new digital-media hub called ViewChange.org. The hub uses semantic technology to create a platform that combines the video sharing power of YouTube with the open information of Wikipedia and the mission of your favorite advocacy organization. I wrote about it in more detail in an earlier post titled "Philanthropy goes Semantic". Ushahidi, which started as a simple website to map reports of violence in Kenya, is another non-profit tech company that specializes in developing free and open source software for information collection, visualization and interactive mapping. To my knowledge, Hans Rosling, a medical doctor and statistician with decades of work studying outbreaks in Africa, is probably the first data science philanthropist. He co-founded the Gapminder Foundation, which developed the Trendalyzer software (later acquired by Google) that converts international statistics into moving, interactive graphics. His TED presentation, "The best stats you've ever seen", is worth watching.

The idea behind "Data Without Borders" is to match NGOs, which are sitting on lots of data that nobody looks at because of resource and budget constraints, with data scientists who have the energy, time and passion to make sense of this data. The timing of this initiative couldn't be better, because data scientists now have a common and noble cause to rally behind! It is the beginning of a powerful vision, but it will surely have its own challenges. Having some experience with an NGO myself, I can say that sustaining the enthusiasm and commitment of data scientists over the long term can be challenging. We are all aware that data scientists are going to be among the most sought-after, busiest and most highly paid professionals in the next decade! So, in this context, I would choose a good data scientist with more commitment over a rock star data scientist. Also, a weekend data hackathon probably won't be enough, because data science is an iterative process and will require ongoing engagement. It is still not clear to me why there are no initiatives like open government data for NGOs, which would make it possible to build powerful data mashups. I am aware of new standards like IATI, but it's more about aid spending by governments. In this context, I believe that too much data can be a burden if it is not set free and used effectively. Ideally, in the case of NGOs, open data shouldn't have political or privacy barriers. In the end, the co-founders of "Data Without Borders" will need all possible support, structure and maybe funding to be successful in their mission. Winston Churchill is often credited with saying, "We make a living by what we get, but we make a life by what we give."

Thursday, September 8, 2011

Big Data: Do we need more use cases?

Today, we can comfortably say that Big Data is an accepted term or concept: most people who work in information technology have at least some definition of it. Everyone understands that data is growing exponentially; it will continue to do so for decades and there is no reason for it to stop. New analytical needs that are not well suited to existing data warehouses, together with growing volumes of source data, are among the biggest drivers of this relatively new space. The exponential drop in the cost of bandwidth, storage and computing makes big data applications possible and economical. The wider acceptance of cloud computing has made the business case for big data even stronger. And as companies move from process-driven business to analytics-driven business, the interest in big data is only going to multiply in the coming years. So overall it's great news for all entrepreneurs, technologists and vendors who want to offer products and services in the big data space.

Almost every second day there is some interesting news in the big data world: last Tuesday Fujitsu announced that it will offer a cloud platform to leverage big data, and MapR, an Apache Hadoop distributor, has just secured twenty million dollars in funding led by Redpoint Ventures. There are numerous others, like couch.io, 10gen, Cloudera, Neo Technology, Loggly and Hypertable, who got funded earlier. Today, the market seems to be most receptive to Apache Hadoop-based distributions, which include offerings from companies like Cloudera, IBM and EMC.

But big data adoption has still not reached a tipping point in enterprises, even though you will find many pilot projects. The primary reason is that there is still confusion about which problems are big data problems. What are the right questions to ask? How do you start and structure a project? What is the scope of a project? Cloudera has done a great job of raising awareness, but the focus is still very technical. The story is very different if you consider pioneers like Google and Yahoo and early adopters like LinkedIn and Facebook. Big data is a critical part of their business, and their needs are very different. Startups in the web space like Bitly, Foursquare and many others have similar needs, maybe not in terms of scale but at least in terms of flavor. Also, not every company is going to have a petabyte problem; in fact, most companies have a terabyte problem. The ecosystem of tools in this space is maturing very quickly, though it is still immature compared to other fields. The majority of the existing use cases are around clickstream analysis, log analysis, marketing analytics, text processing and the like. For example, NTT Communications built a log analysis system for marketing using Hadoop, which explores internet users' interests in, or feedback about, specified products or themes from access logs and queries. CBS Interactive used Hadoop as its web analytics platform, processing one billion weblogs daily from hundreds of its web site properties. Right now, targeted advertising is the biggest market for big data applications, and there is a lot of interest in social media analytics built on big data. While the problem of finding unexpected patterns in unstructured data makes sense, there is a lot of work to be done for structured data, because not all problems in the enterprise are going to be about unstructured data, especially if you consider industries like financial services.
The schemaless and flexible nature of the NoSQL technologies used to solve big data problems is beginning to be viewed as very attractive, specifically for ETL-style operations.
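
To make the log analysis use case a little more concrete, here is a minimal sketch of the map/reduce pattern in plain Python, counting successful hits per URL. It is only an illustration of the idea, not how NTT Communications or CBS Interactive built their systems: the log format, the sample lines and all function names are made up for this example, and a real Hadoop job would distribute the map and reduce steps across many machines.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical access-log lines: "<ip> <url> <status>" (format assumed for illustration).
SAMPLE_LOG = [
    "10.0.0.1 /products/tv 200",
    "10.0.0.2 /products/tv 200",
    "10.0.0.1 /products/phone 404",
    "malformed line",
    "10.0.0.3 /products/phone 200",
]


def map_line(line):
    """Map step: emit (url, 1) for each successful request; skip anything else."""
    parts = line.split()
    if len(parts) != 3:
        return []  # ignore lines that do not match the assumed format
    _ip, url, status = parts
    return [(url, 1)] if status == "200" else []


def shuffle(pairs):
    """Group values by key, as a framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_counts(key, values):
    """Reduce step: sum the counts collected for one URL."""
    return key, sum(values)


def count_hits(lines):
    """Run map, shuffle and reduce in a single process."""
    mapped = chain.from_iterable(map_line(line) for line in lines)
    return dict(reduce_counts(k, v) for k, v in shuffle(mapped).items())


if __name__ == "__main__":
    print(count_hits(SAMPLE_LOG))
```

The point of the pattern is that the map and reduce functions know nothing about where the data lives, which is what lets a framework spread the same logic over terabytes of logs.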

In the end, the promise of big data is about building models, finding cause and effect relationships and predicting outcomes. The expectations are high, as it is even considered to be on the critical path to building personalized medicine in the future. But we need to take many baby steps before we get there. Not having the right skill set to focus on big data is another problem, and the steep learning curve of technologies like Hadoop/MapReduce doesn't make it any easier. Abstractions over MapReduce like Cascading, Pig and Hive are good approaches but are still relatively new. The market for use-case-driven applications is the least understood and remains untapped. Using the public cloud for big data seems like a no-brainer from a cost perspective, but there are practical problems, such as how to move large amounts of data to and from the cloud. The scalability problem is sorted out by technology like Hadoop, but building analytical applications is where the challenge is. Big data is here to stay and the space has become very interesting, but many unknowns remain before it becomes mainstream.