Thursday, June 30, 2005

The Natural Evolution of Internet Search

Increasingly, I feel that the internet search industry will fragment into specialized niche players. We are already starting to see this with services like Grokker (by Groxis), which takes search information from Yahoo and groups it in a visual presentation, allowing users to quickly sift through the vast amounts of search results. Additionally, it is important to note that there are only two major suppliers of query information: Google and Yahoo. Search Engine Watch has a chart that illustrates how Google and Yahoo power almost all the other major search players.

Why do these second-tier searches exist? The answer is simply that they find new ways to present the query information that is supplied by Yahoo and Google. In fact, within internet search we have two different but dependent technologies. It all starts with spidering and indexing webpages. While the spider reads a page’s content, it breaks it down into usable parts and neatly stores them away for future use. The index acts like a raw material that is then processed into the final product. In internet search, that processing step is ranking: each page is given a score for a particular search query; it all amounts to a number. The spidering and indexing is a rather standard process, whereas the algorithm that assigns a number to each page for a given query is unique to each search service. Yahoo’s ranking algorithm is similar to Google’s but not identical. MSN takes Yahoo’s index and processes it with its own algorithm.
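To make the two layers concrete, here is a minimal sketch in Python. It is a toy under stated assumptions, not how any real search service works: the page texts, `build_index`, `rank`, and both scoring functions are invented for illustration. The point is only that a single shared index can feed two different ranking algorithms and produce two different orderings.

```python
from collections import defaultdict

def build_index(pages):
    """Indexing layer: map each lowercase word to {page_id: occurrence count}."""
    index = defaultdict(dict)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word][page_id] = index[word].get(page_id, 0) + 1
    return index

def rank(index, query, score_fn):
    """Ranking layer: score every page containing a query word, best first."""
    words = query.lower().split()
    candidates = set()
    for word in words:
        candidates.update(index.get(word, {}))
    scored = [(score_fn(index, words, page_id), page_id) for page_id in candidates]
    # Sort by descending score; break ties by page id so the order is stable.
    return [page_id for _, page_id in sorted(scored, key=lambda s: (-s[0], s[1]))]

def count_score(index, words, page_id):
    """Toy algorithm A: total occurrences of the query words on the page."""
    return sum(index.get(w, {}).get(page_id, 0) for w in words)

def coverage_score(index, words, page_id):
    """Toy algorithm B: how many distinct query words appear on the page."""
    return sum(1 for w in words if page_id in index.get(w, {}))

# Hypothetical pages, made up for this example.
PAGES = {
    "a": "panther panther panther sightings",
    "b": "panther os review",
    "c": "apple os update",
}
INDEX = build_index(PAGES)

print(rank(INDEX, "panther os", count_score))
print(rank(INDEX, "panther os", coverage_score))
```

The same index goes into both calls; only the function that turns it into a number per page changes, which is the split between the indexing layer and the ranking layer described above.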

On top of this, a new layer is starting to form. It is not yet altogether separate from the ranking algorithm layer, although we are starting to see greater division with services like Grokker, Teoma, and Yahoo’s Mindset. What distinguishes these services from ones like Google, Yahoo, and MSN is their focus on presenting and sorting given query results. It is a well-known fact that about 80% of internet search users never make it past the first ten results and the number drops off exponentially as we go deeper into the search results. This essentially means that only the top twenty results matter with first-generation search services like Google and Yahoo. However, others like Teoma and Grokker have found that although people are not willing to sort through large numbers of search results, they are likely to sort through them if they are presented in a few digestible pieces that allow them to drill down to the right answer. The popularity of these searches shows that many people are unsatisfied with standard search results and are willing to take the time to find the answer for which they are really looking; however, they are unwilling to spend a lot of time or energy to do so.

Another indication that internet search is starting to fragment into specialized niches is the explosion of vertical-specific search services. The most noticeable manifestation of this is blog-related searches like Technorati, Bloglines, and Feedster. It was very apparent that this was going to happen given that Google and Yahoo were nowhere to be found when it came to searching for blogs.

Another huge trend in second-generation search engine technology is clustering or grouping, pioneered by Ask Jeeves and adopted by services like Teoma and Clusty. When we type ‘beatles’ into a first-generation search engine like Google, we get all sorts of results related to ‘beatles’ but they are just raw, unrefined results. Google is too dumb to know which ‘beatles’ you are searching for. That is where clustering comes in. It allows you to drill down to the results that you really want by clicking on the clusters that are of interest to you. It is essentially a way for back-and-forth interaction between the search service and the user. By displaying clusters related to ‘beatles’ the search engine is asking you, ‘which one do you mean?’
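A rough sketch of what clustering could look like, in Python. This is an assumption-heavy toy, not the method used by Teoma, Clusty, or anyone else: the result titles are made up, and `cluster_results` simply labels each result with its most widely shared non-query word.

```python
from collections import Counter, defaultdict

# A tiny stopword list, chosen for this example only.
STOPWORDS = {"the", "of", "and", "a", "in", "for"}

def cluster_results(query, titles):
    """Group result titles under the most widely shared word that is not in the query."""
    query_words = set(query.lower().split())

    def terms(title):
        return {w for w in title.lower().split()
                if w not in query_words and w not in STOPWORDS}

    # Count in how many titles each candidate label appears.
    doc_freq = Counter()
    for title in titles:
        doc_freq.update(terms(title))

    clusters = defaultdict(list)
    for title in titles:
        candidates = terms(title)
        if candidates:
            # Most common term wins; ties are broken alphabetically.
            label = min(candidates, key=lambda w: (-doc_freq[w], w))
        else:
            label = "other"
        clusters[label].append(title)
    return dict(clusters)

# Hypothetical result titles for the query 'beatles'.
TITLES = [
    "The Beatles albums ranked",
    "Beatles albums discography",
    "Beatles lyrics archive",
    "Beatles song lyrics",
    "Beatles tribute band tickets",
]

for label, group in cluster_results("beatles", TITLES).items():
    print(label, group)
```

Each cluster label plays the role of the ‘which one do you mean?’ question: the user picks a label and only sees the results grouped under it.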

Grokker takes clustering or grouping further by making the interaction visually appealing and more powerful at the same time by showing you what sub-groups fit within which larger groups. This gives you a stronger understanding of their true nature and meaning allowing you to drill down with more certainty.

Yahoo’s Mindset addresses the most important distinction among search results by grouping them as commercial and informative. That is often the first thing that users want to resolve by separating the two. For example, if you search for ‘panther’ in the first-generation searches you get many commercial results within the top 20; dumb search engines don’t know if you want the commercial results for Apple’s Panther OS or want to learn about the animal. Yahoo Mindset allows you to resolve this.

Finally, the newest trend is making personalized search engines for everyone. In essence, the user still uses the same old first-generation search services, but they are supercharged with your personal information. I am not at all convinced that these will actually help you find what you need because my interests change on a moment-by-moment basis. For example, I studied physics and economics as an undergraduate. How will this help dumb search determine if I am interested in the animal panther or the OS? It seems to me that this is rather a ploy to help Yahoo and Google sell us products. Yahoo has already revealed that its advertising service will match its ads to user profiles rather than simple keywords.

How does this all tie into the main point of this article? There are countless ways to analyze and present information found, processed and stored on servers. That is why Grokker uses Yahoo’s information and Teoma uses Google’s information. Companies will continue to come up with pioneering ways to search, sort and present data about webpages on the internet, and larger companies like Yahoo and Google won’t always be on the winning end. The only thing that lies between smaller services like Grokker and having a complete search product is the underlying crawling, processing and storage infrastructure that costs a great deal of money to own and operate. That is why it makes perfect sense to have companies that specialize in the infrastructure sphere and others in the production sphere. If companies arise that specialize in the infrastructure and open their services to any client, this will mean low barriers to entry for many companies that want to focus on the more architecture-driven side of the equation; that is where the most interesting engineering remains unexplored.
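The infrastructure/production split can be sketched as one supplier feeding several front ends. Everything here is hypothetical: `stub_supplier` stands in for whatever a real index supplier would return, the URLs are placeholder example.com addresses, and the two front ends are simply the plainest possible presentations.

```python
from collections import defaultdict
from urllib.parse import urlparse

def stub_supplier(query):
    """Stand-in for an index supplier; returns canned (url, title) pairs."""
    return [
        ("http://example.com/a", f"{query} overview"),
        ("http://example.org/b", f"{query} history"),
        ("http://example.com/c", f"{query} shop"),
    ]

def flat_front_end(supplier, query, limit=2):
    """First-generation presentation: just the top few titles, in order."""
    return [title for _, title in supplier(query)[:limit]]

def grouped_front_end(supplier, query):
    """Alternative presentation: the same raw results, grouped by host."""
    groups = defaultdict(list)
    for url, title in supplier(query):
        groups[urlparse(url).netloc].append(title)
    return dict(groups)

print(flat_front_end(stub_supplier, "panther"))
print(grouped_front_end(stub_supplier, "panther"))
```

Neither front end knows or cares how the supplier crawled and stored its pages, so a small company could swap in any supplier that returns the same shape of data.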

Thursday, June 23, 2005

Sample Referring Link Query on Top 3

I started doing an analysis on referring links in the top three internet search services: Google, Yahoo, and MSN. For my alma mater, the scores were:

Google - 5,990
Yahoo - 372,000
MSN - 57,790

I was spinning my wheels a little bit because I got the search syntax wrong for all three initially. After some quick investigation I found the right syntax. For everyone's reference they are:

Google - ""
Yahoo - "link:"
MSN - ""

I am well on my way to analysing the top 100 as ranked by . I will post the results as they come up.

Grokker Extends Offerings with Grokker Research

Groxis (parent company of Grokker) announced on Monday, June 20 that it has started a pilot program called Grokker Research that will be available to corporations and academic institutions. Grokker Research is a web-based platform that will allow for a visual presentation of "deep Web" search results. It will allow corporations like Sun Microsystems (and I bet universities like Stanford) to search a vast collection of sources like library databases, internet search engines (Yahoo) and subscription services. This service will be free for a trial period to qualifying institutions, and will be a premium service thereafter.

Now this is great news because what we really need is an ability to sort out commercial webpages from informational webpages and to further make sense of them. Grokker in its present form and as Grokker Research will be a quantum leap forward in this pursuit. I am very optimistic about these evolving services along with ones like Teoma, Ask Jeeves and Yahoo’s Mindset.

I am beginning to see how things are shaping up in the next generation of search. Google, Yahoo, MSN Search and the like will basically become suppliers of raw materials to companies that will fabricate different things from them. I think that this is a natural evolution of things. Each search service does not need its own website repository. They can buy that from Yahoo or Google. With these resources they can find various ways to use them, whether it is a vertical search, a visual search, a search that allows you to choose between commercial and research interest, or any other interesting way to utilize these resources. This makes perfect economic sense and follows every other industry throughout history.

Perhaps I’m sticking my neck out with this daydream, but it really falls in line with how homo sapiens utilize resources through specialization. Specialization is an underlying principle that has shaped every industry and it makes sense that the internet search industry would follow suit. A car maker does not mine the steel, etc. IBM and Apple no longer make all their components. In sum, I’m looking forward to playing around with Grokker Research. I really enjoyed the original Grokker and now turn to it for more complicated searches.

Google Indexes Less

John Battelle again writes about Tristan's discovery that Google does a vastly poorer job indexing blogs than Yahoo and Technorati.

This is no news but rather a general fact. Google almost always indexes fewer referring pages for any given URL vis a vis Yahoo or MSN. In fact, it seems that MSN does the best job of indexing, closely followed by Yahoo. I was astounded to discover that this is the case, but it really does hold true. Tristan just uncovered the blogging part of a greater truth--Google does an inferior job indexing pages as compared to Yahoo and MSN.

Why does Google do it this way? Google is not dumb. They realize that 80% of people only look at the first page of results, thus they weight it to the top to reduce the computational and storage loads for their server clusters. The real question is why Google publicly claims otherwise. Either Yahoo and MSN are understating their reach or Google is overstating its. I am not sure where the truth lies, but I'd be interested to find out.

Bookmark and check back next week for an analysis of this topic. I'll work on it this weekend.

Wednesday, June 22, 2005

Expanding Scope

As it turns out, there is not much buzz around search. Of course, this is a relative statement. The important thing is that there is not enough about search to warrant daily posts. On the other hand, there are a lot of topics that are closely related to search engines and how we find information on the net, whether it is for commercial interest or informational interest. To that end, I am slightly expanding the scope of this blog to include some of these issues. The main emphasis will, of course, continue to be search engine technology.

Liar, Liar and Pipe Dreamer

John Battelle wrote the following regarding Google's plans to launch a pay service, an ad listing service, and a media player:

I recently sent a note to the folks at Google PR. It went something like this:

So, in the last week, it's been

1. Google is starting a Paypal killer.
2. Google is starting a craigslist killer.
3. Google is starting an iTunes killer.

So, any thoughts about all of this?

To which Eric Schmidt replied:

"We believe that ecommerce can be improved and we are working on ways to improve the user experience. We are working on things in ecommerce."

It's the typical political rhetoric from Google. I am sitting at the edge of my seat waiting to see just how Google will improve internet commerce taking into account the stellar job they did with Froogle. I mean, Froogle was a real quantum leap in internet commerce.

I have heard rumors on the net that Google was trying to acquire Craigslist. They either failed and will try to go at it alone, or are still trying to buy it. My two cents is that if they go at it alone, they will fail because they won't be able to get the locally-focused following that Craigslist got. Why would anyone switch? They will probably bury it deep behind the main page in a beta receiving whatever drip-through traffic they usually get to their 'other' products.

Regarding iTunes, Money please! There are a ton of music players out there. This is probably meant to lead to GMusic Store or something like that. There are already many players in that game too. Of course, they will make marginal money from these ventures because they have a captive audience from their internet search, but I don't expect that they will make a killing like they did with AdSense.

On the other hand, the internet is not a new market free from traditional economics. In fact, it is probably one of the purest forms of economics in real life due to a low barrier to entry, seemingly endless resources, etc. As competition increases (it will quickly) their margins will be squeezed. As we saw with American car makers, to be 3rd best in an industry now means bleeding red. And that is the car industry!

The internet is still the wild west: unexplored and sparsely populated. Look at any other industry a decade or two after it began. Look at television, or radio; things will really become interesting in the upcoming months and years.

Take into account also that Google does not hold to its word. If you look at how many linked entries they have to any site, it is vastly fewer than either Yahoo or MSN. Their claim that they search X number of pages is simply absurd. (Try it: 'link:' on Google vs. 'link:' on Yahoo.) This pattern holds true for nearly all entries that I have tried. If every site has fewer links, that logically means that Google's index is smaller. Simply put, it's a sham.

Thursday, June 02, 2005

Google's Secret Search Lab

There has been a lot of buzz in search related forums about a secret effort at Google to control the quality of search results. The kicker is that they are not doing this via significant changes in the ranking algorithm or spidering technology, but rather production-line-style human quality control! Are you kidding me? This reminds me of a spaghetti horror flick where the monster is running amuck. Has Google lost its marbles? Why is the world's premier technology company resorting to turn-of-last-century production line efforts?

It seems that we are moving backwards in time to more primitive technologies. Google is now doing what Yahoo! was doing while a small start-up at the dawn of the internet age, when it hired people to actually add new sites to its directory by hand. Luckily, Yahoo soon realized that this was an insurmountable task. I really can't believe that Eric Schmidt is spending money on this rather than investing in new, paradigm-shifting search technologies.

Not only that, but is Google becoming a self-proclaimed internet censor? Well, why not? They are already doing this with AdSense by denying sites with certain content access to the program. (Check out the AdSense Program Policies.) When will they realize that it's not the web, it's the sucky search algorithms that are so easily fooled by spammers? Stop buying every digital company under the sun and invest in your bread and butter business; you are not yet Microsoft.

Now onto Henk van Ess's blog about this "secret Google lab." He doesn't seem like the most trustworthy guy, especially since at the same time as he's breaking a big search-related story about Google's secret search quality lab, he is showing Google ads on his webpage. However, even though I approached his site with a great deal of skepticism, I could not deny that the flash animation of the internal Google tool for this supposed search quality team seems quite authentic. The screen shot that he has of the "Rater Hub" also seems incredibly Google'esque. There is also that cache on Google's own search of . Check out also this query for more remnants of .

(Now I know that Google has a tendency to cover its tracks, so if it removes this entry from the index, email me; I saved a screen shot.)

Then there is ample evidence that Google is and has been looking for contractors to do QA work:
QUALITY RATER - (SPANISH, DUTCH, ITALIAN, FRENCH)

This is a temporary role offered through Kelly Services. Google Inc. is recruiting part-time, temporary, home-based workers to help with work on a search quality evaluation on a project basis. You would work at your own pace, and the time and length of any particular work session would be up to you. Candidates will evaluate search results and rate their relevance. Thus, all candidates must be web-savvy and analytical, have excellent web research skills and a broad range of interests. Specific areas of expertise are highly desirable. We are looking for smart people who read voraciously and have a wide variety of interests.

Raters should have all the following qualifications:

Native-level fluency in Dutch, Italian, Spanish, or French.
In-depth, up-to-date familiarity with the web culture of at least one predominantly Dutch, Italian, Spanish, or French-speaking country.
Excellent web research skills and analytical abilities.
A high-speed internet connection.
Legal eligibility to work in the Netherlands, Italy, Spain or France.
Moderate ability to read and write in English. Perfect English is not necessary; however, you must be able to read and write English well enough to use software with an English interface, understand fairly complicated instructions written in English, and make yourself understood in informal written communication.

The job involves frequent written communication with fellow Quality Raters. For immediate consideration, please send an ENGLISH text (ASCII) or HTML version of your resume to

Important: The subject field of your email must include Quality Rater - TEMPORARY.

I'm not going to recount all of the contents of Henk's blog, but it does seem like this is fact despite my most heart-felt hope that this is a farce. I am praying for innovation and instead we're getting an old-fashioned assembly line. The search services still suck, are getting more tainted with ads, and more spammed. I hope that we start thinking about taking matters into our own hands like we did with Firefox. Grokker, Teoma, Mindset are steps in the right direction, but we really ought to be at least jogging to make up for lost time.

Wow. I was checking my stats and noticed that I've been getting a lot of traffic from Technorati for the search term Grokker. I am not sure how I got on there so quickly, but it gives me even more impetus to hold a regular writing schedule. So, please enjoy Dumb Search. To celebrate, I have a great story.

Wednesday, June 01, 2005

Yahoo Mindset

Check out Yahoo Mindset. I think that this is a great start. The number one vector to divide is the commercial vs. informational query results since 80% of queries are strictly informational in nature. Mindset allows the user to give more weight to either commercial results or informational results using a slider. I liked it a lot, and although they have to perfect their algorithm, it is going a long way to ushering in a new generation of internet search services.
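Here is one way such a slider could work, sketched in Python. This is purely a guess for illustration and not Yahoo's algorithm: the result titles, the relevance numbers, the "commercial" scores and the blending formula in `rerank` are all assumptions.

```python
def rerank(results, slider):
    """Order results for a slider between 0.0 (informational) and 1.0 (commercial).

    Each result is (title, relevance, commercial), where both numbers lie
    between 0 and 1 and 'commercial' says how sales-oriented the page is.
    """
    if not 0.0 <= slider <= 1.0:
        raise ValueError("slider must be between 0.0 and 1.0")

    def adjusted(result):
        _, relevance, commercial = result
        # Blend: at slider=1 reward commercial pages, at slider=0 reward the opposite.
        preference = slider * commercial + (1.0 - slider) * (1.0 - commercial)
        return relevance * preference

    return [r[0] for r in sorted(results, key=adjusted, reverse=True)]

# Hypothetical results for the query 'panther'.
RESULTS = [
    ("Buy Mac OS X Panther", 0.9, 0.9),
    ("Panther habitat and diet", 0.8, 0.1),
]

print(rerank(RESULTS, 0.0))  # slider pushed to the informational end
print(rerank(RESULTS, 1.0))  # slider pushed to the commercial end
```

With the slider in the middle, the preference term is the same for every page and the order falls back to plain relevance.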

My feedback to Yahoo re: Mindset submitted on their site:

First, congratulations on beating Google to the punch one more time. I have never been satisfied with Yahoo query results, but after using Mindset I can say that I will probably switch to the Mindset beta in lieu of Google (at least 50/50) because for the first time a search puts the power in the users' hands. I rarely search for products using SE's and have often been frustrated by the numerous irrelevant results for commercial sites that my queries produced. I have tried services like Teoma and Grokker to name just a few. I haven't been fully satisfied with any of them. Grokker, a Yahoo partner, goes a long way to providing results in a digestible manner since only 20% of users go beyond the first page of a classical SERP. However, Grokker is not robust enough. At the most granular level it has one result instead of 20. Teoma's groupings are pretty good, but I think that the algorithm behind it isn't very strong and it has a way to go. All that said, the most important vector to split is the commercial vs. informational results vector. Since 80% of queries are for information-only purposes, it makes sense that we separate the commercial results for the majority of queries. As far as I know, you guys are the only ones to do that effectively to date. I love it. It goes a long way to refining my results.

However, I think that the more power you put in the users' hands, the more you will kick Google's behind and the more you will profit. Since Google is wasting its time with fruitless conquests, do more with the search, Yahoos! For one, put all the advanced functionality on the main UI instead of behind a link. The reason why there is a small click through to the advanced features is not only that people wouldn't use them but also because it is more tedious to click that damn link. Also, make a better graphical interface for the functionality; most users aren't programmers and shy away from Boolean logic. I think that another great improvement that any search engine can make is adding the ability to add weights to your search terms. For example, if I am looking for treatments for poison oak, I don't want to give "treatments" as much weight as "poison oak" because I will be getting results for non-relevant treatments for things like a hangover.

I think that the slider is a great tool. Give users a few fields for words and phrases and give those fields sliders to allow users to determine the weight on each of those words. Think about similar implementations in photo editing SW where you can adjust (give weight) to contrast, darkness, color, shades, etc.
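To show what per-term weights might do, here is a small Python sketch using the poison oak example. It is a toy with made-up page texts and an assumed scoring rule (weight times number of occurrences), not a description of any existing search engine.

```python
def weighted_rank(documents, weights):
    """Rank documents by the sum of (term weight x occurrences of the term).

    'weights' maps each word or phrase to the weight the user picked,
    for instance with one slider per search field.
    """
    def score(text):
        lowered = text.lower()
        return sum(weight * lowered.count(term.lower())
                   for term, weight in weights.items())

    return sorted(documents, key=score, reverse=True)

# Hypothetical page texts.
DOCS = [
    "Poison oak rash: treatments that work",
    "Hangover treatments and more treatments",
    "Identifying poison oak and poison oak lookalikes",
]

# "poison oak" matters a lot, "treatments" only a little.
for doc in weighted_rank(DOCS, {"poison oak": 1.0, "treatments": 0.3}):
    print(doc)
```

With equal weights on both terms, all three pages in this toy set score the same, so the hangover page is indistinguishable from the poison oak ones; lowering the weight on "treatments" pushes it to the bottom.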

Additionally, it would be great to see a Grokker-like implementation of Yahoo search. I think that it is brilliant how Grokker groups the results. It is simple to use, intuitive, and powerful. The only drawbacks are speed and robustness as mentioned above. However, with increasing connection speeds the quickness becomes less of an issue. The robustness can be very easily solved; just list the top 20 results for the most granular level. Also, it would be nice if clicking on the larger categories gave a compiled list. Buy Grokker and expand it; they are really onto something.

From a user's perspective, I don't care that a search takes a millisecond if I then have to spend a couple of minutes sifting through the jumbled results. Give users improved methods to filter and digest the search queries (like Mindset) and you will win a huge market. Everyone is praying for Search 2.0. Please take us out of the dark ages!

As a former Google employee, it takes a lot to break my loyalties and say "Good job Yahoo!" but you guys deserve it. Competition and innovation are great for customers and good for business.