Monday, January 25, 2010

Semantic Search: Finding Stuff and Creating more Businesses in this Flat World!

"If you want more good jobs then spawn more Steve Jobs," says Thomas Friedman, the author of "The World Is Flat," in this new article in the NY Times. Not everyone is a genius like Steve Jobs, can put in 10,000 hours, or is part of the 1955 club, which were the key characteristics of these successful entrepreneurs as pointed out by Malcolm Gladwell in his interesting book "Outliers." Not every company becomes Apple either. What the US really needs, more than ever, is more small businesses, which are one of the driving forces behind its economy. Here are some facts about small businesses in the US:
  • Employ just over half of the country’s private sector workforce
  • Hire 40 percent of high tech workers, such as scientists, engineers and computer workers
  • Include 52 percent home-based businesses and two percent franchises
  • Represent 97.3 percent of all the exporters of goods
  • Represent 99.7 percent of all employer firms
  • Generate a majority of the innovations that come from United States companies





We need more entrepreneurs than ever who can identify opportunities worldwide and develop them into profitable ventures. Where do you start? How do you get the information? How do you find the gaps in the market for particular products and services? What should be the focus area? Has it been done before? Who can partner with you? There are so many more questions you can think of. I understand that you don't start every business just by searching on the web, as there are other important things like personal contacts, capital, your network, your own experience, trade associations, etc. But search on the web is increasingly becoming the major starting point for many of these activities. It is more relevant than ever to do the research because everything you want to do can be outsourced; can be imported; may already exist somewhere; or the demand is going to go away and you are blissfully unaware. Not that it is easy to figure this out, but at least we should have more resources than just the big three search engines - Google, Yahoo and Bing.

Try to find or research that information on these engines and you will see how hard it is. All three of them have done a good job over the last many years, but they can't continue to be all things to all people in all contexts. Too much emphasis has been placed on ranking as well. I personally like Google, but it definitely falls short as far as exploration and interactivity are concerned. Bing, which calls itself a decision engine, has shown some very good improvements in the last year. Also, the big three have somehow ended up promoting a marketing view of search on the web, and they thrive on the tension between SEO consultants/advertisers and themselves. It seems that the media and the analyst community are also too concerned with glorifying every percentage gain by Bing over Yahoo, as you can see in this news and many others. Maybe "it's not just about search, it's about business" is what Michael Corleone (from the movie The Godfather) would have said if he worked for Google in this era.

In general, I have seen a very narrow view of search on the web from a school of thought which believes that whatever can be done in web search will be confined to these three - they think that new entrants will make some noise initially and then go away quietly. I completely disagree. In my opinion, it is no different from the view in the eighteenth century, when there was a school of thought which believed that whatever human beings could think of or invent had already been done and there was no scope for anything new. That sounds ridiculous if you consider the progress mankind has made since then!

A lot has been written about the benefits of semantic search and how it is better than keyword-based search. I have also written in one of my previous articles about the "semantics" in semantic search. Recently, Seth Grimes also compiled a very good article about the types of semantic search. So I am not going to talk about what semantic search is, but more about the opportunities for the new breed of semantic search engines.

Occasionally, you do see articles like "semantic search engines which will change the world" which list the new breed of semantic search engines - some call them Google killers. I always wonder why these semantic search engines have not been able to make a measurable impact yet. Some of these engines have very good technology, though the list doesn't include many others which are out there. The internal details of most of these engines are still proprietary, and they combine natural-language processing with various flavours of semantics. One more semantic engine which I like is TipTop, which mines Twitter data and also does sentiment analysis. But there are more than six Twitter search-based products in the market, as you can see in this review - the product from TipTop is not even mentioned there, while the other products may not be doing semantic search. Now even Bing has a product which searches Twitter. So what can be the next step for these new semantic search engines?

In my opinion, the big issue is the lack of focus at some of these "semantic search" startups and their obsession with boiling the ocean. Many of them waste a lot of time comparing themselves with Google. Some of them are also trying to do very similar things. The market will continue to get crowded with semantic search engines in the next few years, and there is a risk that many companies with excellent technologies will get lost in the crowd. Very soon, every search engine will start calling itself a semantic search engine, the same way every SaaS offering is a Cloud offering nowadays. The other issue is a lack of understanding among businesses and users about what "semanticity" in search engines really means. Ideally, semantic search engines should combine natural-language understanding, context (with a focus on disambiguating queries), ontologies and reasoning. The hard part is always understanding how much work is required to customize the technology to incorporate all these aspects so that it is relevant for a particular domain.
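To give a feel for what the "ontology" ingredient adds over plain keyword matching, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the `NARROWER` table, the `expand` and `search` functions and the sample documents are all made up for this post, and no real engine is claimed to work this way.

```python
# Hypothetical toy ontology: each term maps to its narrower terms.
NARROWER = {
    "vehicle": ["car", "truck"],
    "car": ["sedan", "hatchback"],
}


def expand(term):
    """Return the term plus every narrower term reachable in the toy ontology."""
    seen = []
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.append(current)
            stack.extend(NARROWER.get(current, []))
    return seen


def search(query, documents):
    """Return documents containing the query term or any of its narrower terms."""
    terms = set(expand(query.lower()))
    return [doc for doc in documents if terms & set(doc.lower().split())]


# Made-up sample documents; none of them contains the word "vehicle".
documents = ["Used sedan for sale", "Fresh apples daily", "Truck rental rates"]
results = search("vehicle", documents)
```

A pure keyword search for "vehicle" finds nothing in these documents, while the expanded query reaches the sedan and truck listings. Building and maintaining a table like `NARROWER` for a real domain is exactly the customization work mentioned above.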

There is a great opportunity for this new breed of semantic search engines to rethink their strategy. They should not try to be all things to all people. They really need to carve out a space for themselves in specific segments. If they go after enterprises, they will face stiff competition from the big three in the enterprise - SharePoint/FAST, Autonomy and Endeca - who have hundreds of customers and have evolved over the years. Among them, Autonomy has done a great job in the e-discovery space by following a vertical strategy. Even companies like MarkLogic, with its powerful XML server, can solve many search-related problems for unstructured content. In my opinion, semantic search startups can always continue to tweak their algorithms and enhance semantics, but simultaneously the focus should be on verticalization, branding, strategy, positioning, etc. An application-centric or vertical strategy will be better for them than a platform-centric strategy. They can also think about merging what is on the web with enterprise data/content to give extended BI insights, which is still a new area in which to develop powerful applications. I can count a few small companies in this area already, and even big ones like Business Objects after its acquisition of Inxight. Still, there is ample opportunity to think creatively and develop useful analytics-based applications for the enterprise.

Recently, the Financial Times launched Newssift (still in beta), a semantic search engine for business news which indexes thousands of news sources worldwide. They have used Endeca technology for faceted search, and sentiment analysis is provided by Lexalytics. It can be a useful tool, but more can be done in this area. If any of this new breed of semantic search engines can correlate historical data with data acquired from multiple live sources to identify patterns and flag important events, it can be a killer application on Wall Street. Unstructured data is already being leveraged in electronic trading strategies, but the adoption is not so fast. Generating alpha from a stream of unstructured data is not an easy task, but it is a great opportunity.

Another set of innovative companies I want to mention in this context are Bintro, TrialX and Echo Nest. Bintro matches you to what you are looking for, such as employment, partnerships, investment and joint ventures. TrialX is a free service that matches participants to relevant clinical trials based on their personal health information. TrialX uses a comprehensive database of 25,000+ clinical trials approved by the Food and Drug Administration (FDA) in the United States. Echo Nest helps you find your audience with targeted music promotion. It claims that it can understand every music writer on the web (bloggers, review sites, etc.) and helps you find the writer most likely to review your music. Again, a very useful way to leverage semantic technology to solve problems in a particular domain!

In the end, I believe that there is scope for hundreds of similar applications of semantic search engines and that they can happily coexist with Google, Yahoo and Bing. We will also see very interesting changes once the data web evolves and semantic markup becomes more prevalent.
 


Tuesday, January 12, 2010

Christmas Bomber: Connecting The Dots!

It is now official that it was the fault of software which almost got 289 people killed in the bungled Christmas Day bombing. A couple of facts have emerged:
  • The suspect, Umar Farouk Abdulmutallab, was added to a catch-all terrorism-related database when his father reported concerns about his son's radicalization and associations. His name, however, was not on flight watchlists.
  • A misspelling of Mr. Abdulmutallab's name initially resulted in the State Department believing he did not have a valid U.S. visa. It seems that his visa could not be revoked earlier because of it.
  • According to remarks by President Obama:
    • The intelligence community did not aggressively follow up on and prioritize the stream of intelligence related to a possible attack.
    • There was a failure to connect the dots of intelligence that existed across our intelligence community and which, together, could have revealed that Abdulmutallab was planning an attack.
    • In sum, the U.S. government had the information -- scattered throughout the system -- to potentially uncover this plot and disrupt the attack. Rather than a failure to collect or share intelligence, this was a failure to connect and understand the intelligence that we already had.


Many corrective steps will be taken after this, but the key point is that the suspect went through the same screening as other passengers, and a metal detector can't detect the kind of explosives that were sewn into his clothes. All of us know that billions had been spent on homeland security and aviation security before this happened, and that this has been one of the top priorities since President Bush's days. It seems that it wasn't enough and we need to take a step back and reassess our strategy.

It is useless to play the blame game at this stage, as written in this Economist article, but what it means is that we have to rely more on the intelligence of the software before a terrorist, with all valid documents, tries to board the plane. Yes, there are ways to detect the device at the airport, as explained in the Scientific American article - adding unpredictable or layered security screening in the future - but that will always be a very costly solution if we have to implement it in all the airports in the world. I am still not clear about the report released by the National Research Council in 2008 which says that data mining is not the most effective way to smoke out terrorists. Yes, there can be issues of false positives, where a non-match is declared a match, but that is not always the case, as is evident in the Christmas bomber's case.

To me, what really stands out is how and why we failed to "connect the dots," as the suspect's name was in an international database indicating "a significant terrorist connection". It is clear that there is a strong need for superior knowledge discovery, database integration, cross-database search and the ability to correlate biographic information with terrorism-related information. I can't imagine doing any of these things without taking semantic technology into consideration. In fact, it should be one of the biggest drivers for any new initiatives in this context!
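To make the cross-database point concrete, here is a minimal sketch in Python of how a misspelled name can slip past an exact lookup but still be caught by approximate matching. Everything in it is an assumption for illustration: the names, record layout, the `find_matches` helper and the 0.85 threshold are invented, and simple string similarity from the standard library stands in for the far richer matching a real system would need.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lower-case the name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def similarity(a, b):
    """Similarity ratio between two names, from 0.0 (different) to 1.0 (identical)."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_matches(query, records, threshold=0.85):
    """Return (record, score) pairs whose name is close to the query, best first."""
    matches = []
    for record in records:
        score = similarity(query, record["name"])
        if score >= threshold:
            matches.append((record, score))
    return sorted(matches, key=lambda pair: pair[1], reverse=True)


# Hypothetical visa records; one name is stored with a spelling variation.
visa_records = [
    {"id": "V-001", "name": "Alexandr Petrovich Ivanov"},
    {"id": "V-002", "name": "Maria Fernanda Lopez"},
]

query = "Alexander Petrovich Ivanov"
exact = [r for r in visa_records if r["name"] == query]   # exact lookup misses
fuzzy = find_matches(query, visa_records)                  # fuzzy lookup does not
```

The threshold is a judgment call: set it too low and the false positives discussed above pile up, set it too high and spelling variants are missed again.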

You might have heard the story of David Headley (whose earlier name was Daood Gilani) - he is named as the key architect behind the Mumbai, India attacks in November 2008, in which 173 people died and 308 were injured. The residents of Mumbai were not as lucky as the passengers on the flight with Umar Farouk Abdulmutallab. David Headley is an American with a Pakistani father who also served as an agent for the Drug Enforcement Administration after being caught twice dealing drugs. He was an operative of the Pakistan-based terror group Lashkar-e-Taiba. After his arrest by U.S. authorities, Indian officials discovered that he had been given a long-term business visa for India. It is also alleged that he was already on a watch list which Indian authorities were not aware of. Indian authorities also say Headley traveled seamlessly across borders and stayed in various hotels in the same city while scouting for targets. What is more shocking is that he came back to India after the Mumbai attacks! This is another case of a big failure to connect the dots! There can be many more of these in the future!

Semantic technologies can really help in connecting these dots because OWL/RDF can help build views in a more natural data graph format that is highly expressive and strongly deterministic. It is also well suited to scenarios like this one, which place a premium on adaptiveness, agility, flexibility and a grounded, unambiguous level of truth. It is very useful when you really care to see the end-to-end picture of how things are logically connected. Consistency can still be maintained while changing and asserting new facts! Inferencing is also a powerful capability which can unearth many new facts.
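Here is a minimal sketch of that inferencing idea, written in plain Python rather than with a real OWL/RDF toolkit. Facts are subject-predicate-object triples, and one hand-written rule treats "sameAs" as symmetric and copies facts between records declared to be the same person. The identifiers, predicates and both helper functions are hypothetical, and the rule is a deliberately small stand-in for what a proper reasoner does.

```python
def infer(facts):
    """Forward-chain until no new triples appear.

    Rule: if (x, "sameAs", y) holds, then (y, "sameAs", x) holds too, and
    every other fact known about x also holds for y.
    """
    facts = set(facts)
    while True:
        new = set()
        for (s, p, o) in facts:
            if p == "sameAs":
                new.add((o, "sameAs", s))
                for (s2, p2, o2) in facts:
                    if s2 == s and p2 != "sameAs":
                        new.add((o, p2, o2))
        if new <= facts:
            return facts
        facts |= new


def flagged_visa_holders(facts):
    """Return (person, visa) pairs where the person is also on some list."""
    listed = {s for (s, p, o) in facts if p == "listedIn"}
    return sorted((s, o) for (s, p, o) in facts if p == "holdsVisa" and s in listed)


# Hypothetical facts held in two separate databases, plus one link between them.
facts = {
    ("intel:person42", "listedIn", "db:terrorism-related"),
    ("consular:applicant7", "holdsVisa", "visa:X123"),
    ("intel:person42", "sameAs", "consular:applicant7"),
}

before = flagged_visa_holders(facts)        # nothing connects the two records yet
after = flagged_visa_holders(infer(facts))  # the inferred facts connect the dots
```

Neither database alone says that a listed person holds a visa; that fact only appears once the link between the two records is asserted and the rule is applied.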

It is understandable that there are very complex protocols and policies involved in sharing databases between various agencies around the world, but not having the right technology shouldn't be an excuse, because semantic technology can be a very good solution to this problem.

I would really like to know your opinion about this. If you have new ideas/thoughts or you are aware of existing work being done in this area then please comment or write directly to me.








Thursday, January 7, 2010

"Pull" by David Siegel: Book Review

I just finished the final pages of "Pull" while watching the People's Choice Awards on TV. Who could have thought a few years back that there would be a category for most popular "Web Celebrity," and that it would be won by Ashton Kutcher for his more than one million followers on Twitter? Such is the power of web technology! Are you curious about the power of Semantic Web technology? Read "Pull" by David Siegel.


First of all, I would like to acknowledge the courage of David Siegel in writing a business book on a difficult topic like this. The book is about the power of the Semantic Web to transform your business and is meant for business managers and entrepreneurs. It is not easy to write a "business book" on a topic like the Semantic Web, which has more sceptics than believers. In general, I have found that most good business books are about analyzing the past, and very few are visionary or predict the future. In this complex world, it is so hard to see the future even five years from now! So don't expect perfection! David makes a very good attempt in this direction.

Other than explaining the benefits of the pull approach over the push approach practiced in most businesses, the book offers some very useful information and statistics. It is conceptual in nature, talks about recent developments in the world of the Semantic Web, and also takes you to the decade from 2020 to 2030. Some of you might question his timing, if you are in a mood to question everything, but that is not that important from my perspective. Eventually market forces dictate everything, so why should we worry about it? You can always argue that he has become over-enthusiastic about certain topics and is almost Utopian in his approach in some places, but keep in mind that many of the concepts in this book are about a distant future. You may wish for more details or that he had covered more domains, but he had to draw a line somewhere to make it readable for everyone. Overall, he has done a good job of gap analysis between the present state and the future state across various verticals, but don't expect to get the perfect technology and process roadmap to achieve the future state - the book is not about making incremental improvements but about finding a completely new learning curve.

In the end, this is a thought-provoking book which you should read with an open mind. It will definitely make you think! Without giving away too much, I just want you to know that I really enjoyed reading it. It is a passionate work by someone who has put two years of his life into writing about a subject he believes in. The effort he has put into research also shows. Apart from that, at $18.45 on Amazon.com, this book is value for money. A real bargain! I recommend it to all of you.