Search Director aka finesearch

Download (version 1.20, updated on 3/14/08)

Why Is Anyone Still Using Google search?

Opinion: How many times must Google be shown to be totally irrelevant, inaccurate, and ad-driven when there is Live search?

Citation: Google is a lot like McDonalds. Both are global forces and probably aren't going anywhere soon. But you feel kind of guilty after buying or using anything from either of them, and you usually have a bad taste in your mouth for days.

A search engine? What could be simpler? Do you think so? You may be right. Indexing Internet content, although it requires time and processor resources, is quite a straightforward task. So where is the problem?

Keep the index up to date
Keep the index comprehensive
Build categorization upon content
Categorize a search query
Sort and filter results
Give access to the entire index
Incremental and activity indexing
Avoid ad- or censorship-driven result massaging or index building
Provide results independent of the requester's location

What is behind each problem, and how does it show up in search results? OK, let's look:

Keep the index up to date

If the index isn't updated periodically, a search result can contain broken links. Looking in the search cache can help, but the actual cause of a broken link may be that the previous version of the page had incomplete or wrong information. The cache also doesn't let you drill down for more information when an entry looks relevant. The other kind of obsolete link is one that is still active but whose content no longer matches the search query.

Keep the index comprehensive

Although web crawling is a well-known task, many search engines use very simplistic crawlers that can't properly handle links calculated in JavaScript, or links reachable only from navigation trees or other navigation controls with multiple states. The reason is that manipulating a navigation control can just change its state without bringing up a new result, so many crawlers give up after a certain number of attempts that bring nothing different. Some sites, such as blogs (LiveJournal, for example), may not provide an index of their content. In this case search engines are not smart enough to perform searching tasks that require elements of AI. Needless to say, protected content can't be reached at all.

Build categorization upon content

Many search engines don't assign any categorization information to indexed web content, so all pages look equal. This makes search results mostly irrelevant to a search query.

Categorize a search query

Categorizing a search query makes sense only if the index has categorization too. Matching query categories to categories in the index can provide much more relevant results.

Sort and filter results

Search results should be sorted by certain criteria, such as last update time or relevance according to the categorization used. Filtering can help eliminate duplicate or similar results coming from the same source.
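A minimal sketch of this idea, assuming each result entry carries a URL and a last-update date. The Entry class, the sort_and_filter function, and the "same host means same source" rule are illustrative assumptions, not part of finesearch.

```python
from dataclasses import dataclass
from datetime import date
from urllib.parse import urlparse


@dataclass
class Entry:
    url: str
    title: str
    updated: date  # last update time of the page (assumed to be known)


def sort_and_filter(entries, max_per_source=1):
    """Sort entries newest first, then keep at most max_per_source per host."""
    newest_first = sorted(entries, key=lambda e: e.updated, reverse=True)
    kept = []
    per_host = {}
    for entry in newest_first:
        # Assumption: two entries come from the same source if the host matches.
        host = urlparse(entry.url).netloc.lower()
        if per_host.get(host, 0) < max_per_source:
            kept.append(entry)
            per_host[host] = per_host.get(host, 0) + 1
    return kept
```

Sorting before filtering means that the entry kept for each source is its most recently updated one.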

Give access to the entire index

Although the actual number of entries matching a search query can be large, most search engines restrict the temporarily built result to 1000 entries or fewer. This makes it very likely that the final result won't include the search goal, especially if the problems above exist in the search engine.

Incremental and activity indexing

Indexing can be incremental, for example by recording only the changes for updated pages. An activity index is also helpful for finding recently changed information relevant to a search query.
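One possible way to do this is to compare content digests between crawls. The sketch below assumes the index is a plain dictionary keyed by URL; the function names and the record layout are hypothetical and only illustrate the idea.

```python
import hashlib


def content_digest(text):
    """Fingerprint of the page text, used to detect changes between crawls."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def incremental_update(index, crawled, today):
    """Update index in place and return the URLs that are new or changed.

    index:   url -> {"digest": str, "changed": date of last detected change}
    crawled: url -> page text from the latest crawl
    """
    changed = []
    for url, text in crawled.items():
        digest = content_digest(text)
        known = index.get(url)
        if known is None or known["digest"] != digest:
            index[url] = {"digest": digest, "changed": today}
            changed.append(url)
    return changed
```

The returned list is a simple form of activity index: it tells which pages changed in the latest crawl, and the stored "changed" value lets a query prefer recently updated pages.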

Avoid ad- or censorship-driven massaging

Result data shouldn't be driven by advertising that pushes certain results onto the first page. Censorship shouldn't be applied to hide certain results either. The funny thing is that two giants such as Microsoft and Google have conflicting censorship patterns. For example, I wrote a small blog entry about how to use Google mail with the JavaMail API. Search for the words javamail google using Microsoft's Live search: a link to my blog will be second on the first page. However, if you try Google or Yahoo search, you won't be able to find this link at all. You might guess that Google doesn't index the blog. That isn't possible, because you can find a link to my article from some page found by Google. So Google is certainly applying some tweak to my blog entry. Read more about Google censorship here.

Provide results independent of the requester's location

Results shouldn't be built on the requester's location. Some engines, like Google, analyze the requester's IP or language settings and try to sort results using a location coefficient. A requester must have the choice to disable location involvement.

How good are Google and other search engines at resolving the above problems?

To answer this question I wrote a small tool that helps with deep analysis of the results provided by a search engine. The following scoring system for search engines was introduced:

Relevant, 0 to 5: one or more search targets are present in the result. 0 - no search target; 1 - in the last 20% of entries; 2 - in the last 50% of entries; 3 - in the first 20% of entries; 4 - on the first 2 pages; 5 - more than one entry on the first page.
Relevant after tuning: one or more search targets are present after one tuning of the search query. The same scores as for the previous item, counted as 5 if tuning is not required.
Actuality, 0 to 3: 0 - more than 20% of links are broken; 1 - 10% of links are broken; 2 - less than 5% of links are broken; 3 - no broken links.
Relevant actuality, 0 to 2: 0 - 50% or more of search targets are broken; 1 - 20% to 50% of search targets are broken; 2 - less than 20% of search targets are broken.
Actuality (query match), 0 to 2: 0 - more than 50% of result entries do not include all query words; 1 - between 20% and 50%; 2 - less than 20%.
Biased rate, 0 to 2: 0 - sites with censored words are completely removed from the index and can't be found using other neutral words, or a competitor's product can't be found on the first 10 pages of the result; 1 - a censored page is included in the index but can't be found using censored words; 2 - no censorship or ad biasing.
Locale involvement, 0 to 1: 0 - the result depends on the requester's IP or locale setting; 1 - it does not.
Comprehensive rate, 0 to 2: 0 - total number of entries for a query < 200; 1 - < 600; 2 - < 1000.
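Two of these criteria map directly to thresholds, so they can be sketched in a few lines. This is only an illustration of the scoring rules listed above, not code taken from finesearch: the function names, the word-splitting rule, and the handling of boundary values (exactly 20% or 50%, an empty result, 1000 entries or more) are my assumptions.

```python
import re


def _words(text):
    """Lower-cased set of words in text (simple tokenization, an assumption)."""
    return set(re.findall(r"\w+", text.lower()))


def query_match_score(query, entry_texts):
    """0: more than 50% of entries miss a query word, 1: 20%-50%, 2: below 20%."""
    if not entry_texts:
        return 0  # assumption: an empty result gets the lowest score
    wanted = _words(query)
    missing = sum(1 for text in entry_texts if not wanted <= _words(text))
    share = missing / len(entry_texts)
    if share > 0.5:
        return 0
    if share >= 0.2:
        return 1
    return 2


def comprehensive_score(total_entries):
    """0: fewer than 200 entries, 1: fewer than 600, 2: otherwise."""
    if total_entries < 200:
        return 0
    if total_entries < 600:
        return 1
    return 2  # assumption: 1000 or more entries also score 2
```

For example, if 2 of 5 entries returned for the query java build tool lack one of the three words, the share is 40% and the score is 1.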

All of the above tests are performed for 10 different queries in the following categories:

A single-word product name
A single-word product name and a 3-word product description
A 3-word product description
A 5-word problem description
A well-known person's name
A not well-known person's name
20 words on a page accessible through tree navigation
20 words on a page accessible through a JavaScript-constructed link
10 words from a page updated 10 days ago
10 words from a page updated 30 days ago
Specifically selected words or products suspected to be censored or ad-sensitive
The above tests repeated using different IP and language settings

Testing result

Engine	Relevant	Actual	Comprehensive	Total
Yahoo	1	2	2	5
MSN	3	3	3	9
Google	2	2	2	6

Download finesearch and test for yourself how bad your favorite search engine Google is. The MSN search engine seems to have improved lately: its results have become more comprehensive and accurate. Yahoo is slipping. The company is busy reorganizing its software; a friend of mine, Yahoo's chief architect, told me this is the reason. They have also hired a lot of new people who need some time to get familiar with the company's products. I have to update the recognition patterns for Yahoo every 2 months because they still can't settle on an output format. Google is so-so and keeps second place. Google is still very good at software development related searches, but it has very poor blog coverage: very relevant blog entries listed by MSN on the first page do not appear in Google or Yahoo search results at all. Yahoo's crawler seems to be the best at working on sites with cumbersome navigation.

For educational purposes, the tool supports searching Craigslist. Please use this feature with caution and do not abuse Craigslist.

Finesearch is easily deployable on app servers such as TJWS, Tomcat, or Sun Java(TM) System Application Server. However, the Finesearch binary comes prepackaged with TJWS, so you don't need to bother downloading and installing anything else. Just type java -jar finesearch.jar and open the URL http://localhost:8080/finesearch in your browser. If port 8080 is already in use, edit rundescriptor inside the jar to specify a different port number.
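If you are not sure whether port 8080 is already taken, a quick check before starting the server can save some guessing. The following is a small helper sketch that is not part of Finesearch; the function name port_is_free is hypothetical.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing accepts connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(1.0)
        # connect_ex returns 0 when a server accepted the connection.
        return probe.connect_ex((host, port)) != 0
```

Calling port_is_free(8080) before running java -jar finesearch.jar tells you whether another program is already listening on the default port.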

Note that due to the instability of SF.net CVS, I decided to move the CVS repository to my home machine, so the SF.net CVS tree doesn't reflect the latest state of the project code. I'll try to provide weekly builds so that you can get changes faster.

How to promote your company, product or service without spending a penny while people still use Google?

It's quite easy and takes just a few minutes with Search Director, so even if your time is very valuable, you can still save hundreds of dollars on payments to Google. I'll give step-by-step instructions using a simple example.

Initial task

I have a Java build tool and want to make it reachable for people who are trying to choose a build tool. I've created a home page for this build tool at http://7bee.j2ee.us/bee/index-bee.html . The tool's name is 7Bee.

Verify that your target can be found using direct name search

Generally this step is optional, because even if your page is not listed in a direct name search, by following the next steps of the remedy plan you have a good chance of getting it listed and generating traffic to it.

Verify that your target can't be found with a simple relevant search

Run Search Director and use a relevant query like java build tool. Add a filtering rule with the name of your product, like 7bee in our case. Make sure that the result produced by the search engine doesn't contain any entry for your product. No surprise there.

Find content in the search results that you can change

Blogs, forums, surveys, visitor lists, encyclopedias, and any other types of pages that allow content management are your targets. (You do not need to hack in through WebDAV or anything like that.) Use filtering words like blog, forum, survey, and so on to find such pages. You can also filter out direct competitors' pages by specifying their product names with a low count value, like 1-2. Doing that, I found two pages that looked promising:

A poll page at http://www.manageability.org/polls/what-is-your-favorite-java-build-tool . This page asked visitors to add their favorite Java build tool unless it was already listed, so I added my tool 7Bee to the page. The Manageability page was a very good catch since it was entry number 3 in Google's result.
Another page was Sayed Hashimi's blog at http://weblogs.java.net/blog/sayedh/archive/2005/10/your_build_tool.html . This page allowed comments, so I left one about the 7Bee build tool. The page doesn't rank well: it is number 356 in Google's result, so it's very unlikely that anybody will ever open it. However, Google generates 3 entries for this page, which made me think that the page is popular in other searches and people can get to know 7Bee without a targeted search.
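The filtering step can be pictured roughly as follows. This sketch is not Search Director's actual rule engine: I assume here that a rule is a list of marker words that must appear, plus a list of competitor names whose total number of occurrences must stay at or below a small count. The function names and the competitor names Ant and Maven are only examples.

```python
import re


def count_word(word, text):
    """Number of whole-word, case-insensitive occurrences of word in text."""
    pattern = r"\b" + re.escape(word.lower()) + r"\b"
    return len(re.findall(pattern, text.lower()))


def looks_editable(text, marker_words, competitors, max_competitor_hits=2):
    """Keep pages that mention a marker word and are not competitor-heavy."""
    has_marker = any(count_word(w, text) > 0 for w in marker_words)
    competitor_hits = sum(count_word(c, text) for c in competitors)
    return has_marker and competitor_hits <= max_competitor_hits
```

A poll page that mentions each competitor once passes this filter, while a competitor's own manual page, which repeats the product name many times, does not.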

Enjoy seeing your product listed by search engines

After the remedy actions described in the previous steps I could see my tool listed in search results instantly. Although search engines such as Google still don't provide direct links to the product, that doesn't seem bad, because a mention of the product can be found on the first page of any relevant search.

I used the same technique to promote my servlet container, which is currently listed as the 2nd entry in Google's result. Without this tool it wasn't listed at all. Needless to say, nothing was required to get my products listed by MSN search within the first 2 pages of a result. MSN certainly produces much less biased and more relevant results. MSN search is only lacking when searching for problem-solving pages, so Google wins there. It looks like Google may have a good ranking mechanism internally, but it's completely destroyed by ads.

Time passed, things changed

You need to repeat the described procedure to stay listed on the first result pages of popular search engines. Google seems to have lost interest in Sayed's blog, and the Manageability site withdrew its page as well. However, a 3-minute procedure restored the status quo, and I got listed again. Google seems to be hiring more engineers from India, so my product being listed on Indian sites turned on its visibility on Google's pages as well. Microsoft seems to have taken an interest in Sayed, so my tool survived again. Sayed is also loved by Yahoo; probably he moved there from Google, who knows?

Design principles of the right search engine

You are already aware of the multiple problems of widely used search engines. So is there a way to avoid these problems and create the right one? Certainly a solution exists, and more than one. My group and I are working on a new generation of search engines. We call it an intelligent search engine, although that is probably too much. I'd like to share some design principles with you and hope to get some feedback, because you will all be users of the new search engine quite soon. I also have openings in my group, so if you feel smart enough, you are welcome to join. So, how does it work?

All words of the English language (we count around 12,000) get a categorization index (other languages are coming soon). A word can have more than one categorization index.
Every crawled web site obtains a categorization index based on the categorization indices of the words found on it. A web site can get more than one.
Words that don't fall into any categorization index are considered names, and for names we also build weak categorization indices.
Every categorization index has a weight, which allows the categorization indices in a request to be ordered. The word with the highest categorization index weight is called the leading request word.
A search query is processed first by the categorization indices of its words, starting from the leading word and going down. Finally, the result left after applying the categorization outline is searched for the non-categorized words, such as names and others.
By working on improving the categorization of the search index we improve your search results.
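To make these principles more concrete, here is a toy sketch of the processing order. It is an illustration under my own simplifying assumptions, not the engine's implementation: the word table, its weights, and all function names are invented, a site keeps only its single heaviest category, and the weak categorization of names is left out.

```python
from collections import Counter

# Hypothetical categorization table: word -> {category: weight}.
WORD_CATEGORIES = {
    "java": {"programming": 0.9, "geography": 0.3},
    "build": {"construction": 0.7, "programming": 0.6},
    "tool": {"construction": 0.5, "programming": 0.4},
    "island": {"geography": 0.9},
    "house": {"construction": 0.8},
}


def tokens(text):
    return text.lower().split()


def site_categories(text, top=1):
    """Sum category weights of all known words and keep the heaviest ones."""
    totals = Counter()
    for word in tokens(text):
        for category, weight in WORD_CATEGORIES.get(word, {}).items():
            totals[category] += weight
    return {category for category, _ in totals.most_common(top)}


def split_query(query):
    """Return (categorized words, leading word first; uncategorized names)."""
    words = tokens(query)
    known = [w for w in words if w in WORD_CATEGORIES]
    names = [w for w in words if w not in WORD_CATEGORIES]
    known.sort(key=lambda w: max(WORD_CATEGORIES[w].values()), reverse=True)
    return known, names


def search(query, sites):
    """sites: url -> page text. Narrow by categories first, then by names."""
    known, names = split_query(query)
    candidates = list(sites)
    for word in known:  # starts from the leading word and goes down
        wanted = set(WORD_CATEGORIES[word])
        candidates = [u for u in candidates if wanted & site_categories(sites[u])]
    return [u for u in candidates if all(n in tokens(sites[u]) for n in names)]
```

With this table, the query java build tool 7bee has java as its leading word and 7bee as a name. A travel page about the island of Java survives the first step through the geography category, is dropped at the word build, and only then are the remaining pages checked for the name.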

Contact dmitriy@google.com or sign on for a jAddressBook account and check the shared folder.

Note that any criticism of the direction taken by some companies is aimed at improving the products they develop and has no personal or other coloring.