Copyright 2006 Michael Lowe (michael.j.lowe AT kalio DOT info)
This document describes the evaluation and comparison of a number of desktop search engine products. It lists the criteria used to compare the products, describes the process used to perform the comparison, and summarises the results.
To select the search engines used in this comparison, a survey of the desktop search engine products currently available was performed to gain an understanding of the marketplace. The products chosen for evaluation were the market leaders as ranked by Google for the search query “desktop search” in early 2006: Google Desktop, Copernic Desktop Search, Yahoo! Desktop Search, and Windows Desktop Search. These products cater for the consumer end of the marketplace and are available for free, but each (except Copernic Desktop Search) has sibling products targeted at enterprise use that provide additional deployment and administration features. An additional product, ISYS:Desktop, was chosen since it provides a contrasting set of capabilities targeted at government, law enforcement and legal environments. In addition, the default file search utility provided in the Windows XP operating system, Windows Search Companion, was included in the comparison as a baseline case, even though it does not make use of indexing by default.
| Search Engine | Version | Abbreviation used in this document |
|---|---|---|
| Copernic Desktop Search | 1.63.910 | CDS |
| Google Desktop | 4.2006.306.1208-en | GDS |
| ISYS:Desktop | 7.0.2 | ID |
| Windows Desktop Search | 2.06.0000.2083 | WDS |
| Windows XP Search Companion | 5.1.2600 | WSC |
| Yahoo! Desktop Search | 1.2.1852je | YDS |

Table 1: Desktop search applications evaluated
There are many aspects of a desktop search application that can be evaluated. According to Gartner, the following aspects should be considered when making a purchasing decision about search applications:
The focus of this evaluation was on assessing the crawling, indexing and search query functionality and performance of desktop search engines using measurable and repeatable metrics. The following list of metrics was initially defined:
Crawling
Indexing
Search
The list of metrics was refined in a number of ways to make them more practical to evaluate, namely:
The CACM corpus, which contains 3,204 documents and 64 queries, was used for the evaluation. A program was written to parse the CACM corpus and generate a Microsoft Word document for each corpus document. The Title, Author and Keyword fields from the corpus were stored as metadata properties of the document.
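The parsing step can be sketched as follows. This is a minimal illustration, not the program used in the evaluation: it assumes the corpus is in the commonly distributed SMART-style layout, where each document starts with a `.I <id>` line and fields are introduced by single-letter markers such as `.T` (title), `.A` (authors) and `.K` (keywords). The function name `parse_cacm` is hypothetical, and the sketch stops short of writing Microsoft Word files.

```python
import re
from collections import defaultdict

# Assumption: field markers are a dot plus one capital letter on their own
# line, with the document id following ".I" on the same line.
FIELD_MARKER = re.compile(r"^\.([A-Z])(?:\s+(.*))?$")


def parse_cacm(text):
    """Split SMART-style corpus text into a list of per-document dicts."""
    documents = []
    current = None
    field = None
    for line in text.splitlines():
        marker = FIELD_MARKER.match(line)
        if marker:
            tag, rest = marker.group(1), marker.group(2)
            if tag == "I":
                # A new document begins; fields map marker letter -> lines.
                current = {"id": int(rest), "fields": defaultdict(list)}
                documents.append(current)
                field = None
            else:
                field = tag
            continue
        if current is not None and field is not None:
            current["fields"][field].append(line.strip())
    return documents
```

Each returned dictionary could then be handed to whatever document-generation library is available, with the `T`, `A` and `K` fields written as the Title, Author and Keyword properties.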
Each desktop search engine was installed within its own VMware virtual machine, running Windows XP with 512MB RAM and a 20GB hard disk. The physical machine used for the evaluation was a Dell Latitude D820 with a 1.8GHz Intel Centrino Duo processor, 2GB RAM and a 5400 RPM hard disk.
Before indexing the CACM corpus, a baseline index of the clean system was performed, and then a number of physical attributes were noted – the number of items contained in the index, the number of files and folders on disk used for the index and the index size. Indexing was paused, and the CACM documents copied into the “My Documents” folder of the virtual machine. Indexing was resumed and when complete, the time taken to index the documents and the new index physical attributes were noted. During indexing, the program FileMon was used to log the file system activity of the indexing process. A program was written to read the FileMon log file and tabulate summary results – a total of each operation type, and the total bytes read and written.
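The log summarisation described above might look something like the sketch below. It is an assumption-laden illustration rather than the original program: it assumes the saved FileMon log is tab-separated with the columns sequence, time, process, request, path, result and other, and that read and write sizes appear in the last column as `Offset: N Length: N`. The function name `summarise_filemon` and the process name used in the usage note are hypothetical.

```python
import re
from collections import Counter

# Assumed layout of a saved FileMon log line (tab-separated):
#   sequence, time, process, request, path, result, other
# with read/write sizes in the last column as "Offset: N Length: N".
LENGTH = re.compile(r"Length:\s*(\d+)")


def summarise_filemon(log_text, process_prefix):
    """Count operations and total bytes read/written for one process."""
    operations = Counter()
    total_bytes = {"READ": 0, "WRITE": 0}
    for line in log_text.splitlines():
        columns = line.split("\t")
        if len(columns) < 7:
            continue  # skip blank or malformed lines
        process, request, other = columns[2], columns[3], columns[6]
        if not process.lower().startswith(process_prefix.lower()):
            continue  # activity from an unrelated process
        operations[request] += 1
        size = LENGTH.search(other)
        if size is None:
            continue
        for kind in total_bytes:
            if kind in request.upper():
                total_bytes[kind] += int(size.group(1))
    return operations, total_bytes
```

Calling `summarise_filemon(log_text, "indexer.exe")` returns a per-operation count and the byte totals for the named indexing process, which correspond to the two kinds of summary figure mentioned in the text.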
Each of the 64 test queries for the CACM corpus was executed and the response time recorded using a stopwatch. The search results were ordered by relevance for those search engines which provide this feature, or otherwise left in their default order, and the results recorded. The IDS application provides a number of alternative query style options, and for this evaluation two query styles were tested: a “web style” syntax and the “natural language” query. After completing all queries, the data was analysed using TREC Eval to generate average recall-precision results for all queries as well as for each individual query.
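To make the recall-precision analysis concrete, the sketch below computes an 11-point interpolated precision curve for one ranked result list and averages curves across queries. It illustrates the standard technique and is not a substitute for the evaluation tool itself; the function names are hypothetical, and the inputs are assumed to be a ranked list of document ids plus the set of ids judged relevant for the query.

```python
def interpolated_precision(ranked_ids, relevant_ids, levels=11):
    """Interpolated precision at evenly spaced recall levels from 0.0 to 1.0."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("query has no relevant documents")
    points = []  # (recall, precision) measured at each relevant hit
    hits = 0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    curve = []
    for i in range(levels):
        level = i / (levels - 1)
        # Interpolation: best precision at any recall at or above this level.
        candidates = [p for r, p in points if r >= level]
        curve.append(max(candidates, default=0.0))
    return curve


def average_curve(curves):
    """Average several per-query curves level by level."""
    return [sum(column) / len(column) for column in zip(*curves)]
```

A query that fails or returns no results contributes a curve of zeros under this scheme, which is one way that failed queries pull the averaged curve down.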
To generate indexing scalability metrics, the VMware virtual machine for each desktop search application was restored to its initial pre-index “fresh installation” state. Indexing was paused and the CACM corpus documents were replicated so that there were 10 copies of each document. Indexing was then resumed and the total time taken to index the documents was measured with a stopwatch. To avoid the substantial effort of re-executing every query, a subset of 9 of the 64 CACM queries was executed and the response times recorded.
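The replication step could be scripted along the following lines. This is only a sketch: the source does not say how the copies were made or named, so the `_copyNN` naming scheme and the function name `replicate_corpus` are assumptions for illustration.

```python
import shutil
from pathlib import Path


def replicate_corpus(source_dir, target_dir, copies=10):
    """Copy every file in source_dir into target_dir `copies` times."""
    source = Path(source_dir)
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    created = 0
    for document in sorted(source.iterdir()):
        if not document.is_file():
            continue
        for n in range(1, copies + 1):
            # Hypothetical naming scheme: original stem plus a copy suffix.
            name = f"{document.stem}_copy{n:02d}{document.suffix}"
            shutil.copy2(document, target / name)
            created += 1
    return created
```

For the 3,204-document corpus, ten copies of each document would yield 32,040 files in the target folder.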
The desktop search applications were further evaluated across a range of functional and non-functional attributes, covering indexing, search query functionality and search result presentation. The attributes evaluated are listed in Table 3, Table 4, Table 5 and Table 6. The evaluation of each attribute involved a combination of research of the published information available for each desktop search application (specifications, help files, and other material available on the web) and testing the capabilities of the applications. All published information was validated by testing the attributes of the applications where feasible. For example, to evaluate the list of file formats indexed by each application and the list of meta-data attributes indexed for each file type, experimental files containing dummy metadata and content were created and indexed for each of the file types. Queries were then executed to determine the file types and metadata properties successfully indexed.
A comparison of the indexing capabilities of the evaluated desktop search tools is shown in Table 3. Notable results of this comparison include:
The range of metadata properties indexed for the various file types was found to vary greatly between the desktop search applications. A more detailed comparison of three common file types (a JPEG picture, a Microsoft Word document, and an MP3 audio file) was performed, and the results are shown in Table 6. It is notable that WDS has the most comprehensive indexing of metadata, while GDS and YDS do not index important properties of Word documents including Title, Author, Keywords, Subject, Comments, Manager and Company. YDS does not index many of the key JPEG metadata properties, and GDS is also missing a number of these properties.
The results for the comparison of indexing time between desktop search applications are shown in Figure 1. The fastest indexing of the CACM corpus, by an order of magnitude compared to the next fastest application, was performed by IDS. The slowest indexing was performed by GDS. The indexing by IDS was around 40 times faster than indexing by GDS.

Figure 1: Comparison of indexing time for desktop search applications
The results for the comparison of index size used for the CACM corpus are shown in Figure 2. The chart is shown with a logarithmic scale since there are over two orders of magnitude difference between the smallest and largest index sizes. The smallest index is YDS while the largest is GDS. The relative ranking of applications for the indexing time metric and the index size metric is the same, except for IDS and YDS, which are swapped.

Figure 2: Comparison of index space used for CACM corpus by desktop search applications (logarithmic size scale)
The other size-related metrics captured include the number of files and folders used for the index before and after the indexing of the CACM corpus, and also the number of items added to the index during indexing. The CACM documents were contained within a subdirectory, so this number is 3,205 for WDS and YDS, which are able to index the directory details, and 3,204 otherwise.
The results showing how indexing time scales with a larger corpus size (10 CACM corpuses) are shown in Figure 3. No result could be obtained for the larger corpus size from GDS due to the difficulty of manually triggering a reindex operation: GDS does not provide a manual reindex operation, and the workaround described previously did not work. The results show that WDS has the largest increase in indexing time, while YDS has the smallest increase. The relative ranking of indexing time for the applications remained the same. All applications except IDS were proportionally more efficient, relative to the number of documents indexed, for the larger corpus size.
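One way to express the "proportionally more efficient" observation is as the ratio of the measured time for the larger corpus to the time that linear scaling would predict. The helper below is a small sketch of that arithmetic; the function name is hypothetical and the numbers in the usage note are illustrative values, not measurements from this evaluation.

```python
def scaling_ratio(base_seconds, scaled_seconds, corpus_multiple=10):
    """Measured time for the larger corpus divided by the linear prediction.

    A value below 1.0 means the application took less time per document on
    the larger corpus; a value above 1.0 means it took more.
    """
    return scaled_seconds / (corpus_multiple * base_seconds)
```

For example, an application that took 120 seconds for one corpus and 900 seconds for ten would have a ratio of 0.75, while one that went from 60 seconds to 720 seconds would have a ratio of 1.2. The same calculation applies to mean query times.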

Figure 3: Indexing time versus corpus size
A comparison of query capabilities of the evaluated desktop search tools is shown in Table 4.
A comparison of the mean query time for the 64 CACM queries is shown in Figure 4. The chart is shown with a logarithmic scale since there are over three orders of magnitude difference between the smallest and largest mean query times. The fastest mean query time is with YDS, while the slowest is with WSC due to its lack of indexing.
The IDS Natural and IDS Web results are the slowest of the indexed desktop search applications, though further analysis of the results reveals that the time taken for each subsequent query has a non-linear growth factor. This may be due either to a bug in the software or to an intentional feature to cripple the trial version of the software which was evaluated.

Figure 4: Comparison of mean query time for desktop search applications (logarithmic time scale)
The average recall-precision results are shown in Figure 5. The highest recall and precision curve belongs to IDS Natural. This is likely due to two factors:
The lowest recall-precision curve belongs to WSC, which had a consistent precision of zero. The next lowest curves belong to GDS and YDS, for the reasons mentioned above. The curves for CDS, IDS Web, and WDS were similar.

Figure 5: Average recall-precision results for all CACM corpus queries
It is notable that the precision numbers for all desktop search engines are low due to the specific nature of the queries. The results are particularly affected by the number of failed queries and of successful queries which returned no results, as shown in Table 7. The queries failed for three reasons: some queries were too long to be handled by the applications, query 29 consistently caused an exception in IDS Web, and specific reserved words caused IDS Web to return errors.
Because most CACM queries are specified in natural language, it is interesting to view the precision-recall curve in Figure 6 for query number 19, a more keyword-based query: “Parallel algorithms”. The precision curves are higher for all desktop search applications except WSC. The IDS Natural query still performs very well, as do IDS Web and WDS. However, the lowest recall-precision curves still belong to GDS and YDS.

Figure 6: Recall-precision results for CACM corpus query 19
The results showing how search query time scales with a larger corpus size (10 CACM corpuses) are shown in Figure 7. No result could be obtained for the larger corpus size from GDS due to the reindexing problems mentioned earlier. Only one query was executed for the WSC application due to the extremely long response time involved: almost 2 hours per query. The results for the IDS Natural and IDS Web queries were unexpectedly faster than for the single CACM corpus queries. This is most likely due to the non-linear growth factor mentioned above: fewer queries were executed for the larger corpus than for the smaller one, and hence the growth factor was smaller. The results otherwise show that YDS has the smallest increase in query time, while WSC has the largest increase. The largest increase among the desktop search applications was for WDS. The relative ranking of query time for the applications remained the same, excluding the results for GDS, IDS Web and IDS Natural. All applications except WSC were proportionally more efficient, relative to the number of documents indexed, for the larger corpus size.

Figure 7: Mean query time versus corpus size
A functional evaluation of the user interface of the various desktop search engines was performed, and the results are presented in Table 5. All applications provide a native Windows-based user interface, except GDS, which provides a web-based interface that is integrated with its web search portal.
Only GDS, WDS and IDS provide relevance ranking for the search results. All search applications provide some form of document preview in the search results, although for GDS this is limited to images and web pages. The search result sorting options vary across applications and are most limited for GDS.
This document has described the evaluation and comparison of a number of the leading desktop search engine products. The functional and non-functional criteria for comparing the products were listed, covering the indexing, search query functionality and search result presentation aspects of these products. The process used to perform the comparison was described, and the results of the comparison were summarised. The results provide an insight into the strengths and weaknesses of the existing desktop search products on the market.
| Product | Company | URL |
|---|---|---|
| AOL Desktop Search | AOL | |
| Ask Desktop Search | IAC Search & Media (Ask.com) | |
| Beagle | - | |
| Beetext Find Desktop | Beetext | |
| Blinkx Pico | Blinkx | |
| Copernic Desktop Search | Copernic Technologies | |
| DTSearch Desktop | dtSearch Corp | |
| EasyReach Find / EasyReach Workspace | EasyReach | |
| Fast Search & Transfer | FAST PSP | |
| Filehand Search | Filehand | |
| Google Desktop | Google | |
| HotBot Desktop Search | Lycos | |
| IDOL | Autonomy | |
| ISYS:Desktop | ISYS Search Software | |
| KAT Desktop Search | - | |
| Omea Pro | JetBrains | |
| one:desktop | exalead | http://corporate.exalead.com/enterprise/l=en?p=produits_exalead-desktop_index |
| Spotlight | Apple | |
| Svizzer | G10 Software | |
| The Sleuthhound! | iSleuthHound Technologies | |
| Windows Desktop Search | Microsoft | |
| X1 | X1 Technologies | |
| Yahoo! Desktop Search | Yahoo! | |

Table 2: Currently available desktop search products
| Index Attribute | CDS | GDS | WDS | YDS | ID | WSC |
|---|---|---|---|---|---|---|
| Maximum Index Size | “No limit” | 4GB | ? | ? | 24GB; 64 million documents | - |
| Maximum size of content indexed per document | 50MB (configurable) | 10,000 words | 1MB | 30MB (configurable) | 2 billion words; 65,535 paragraphs | - |
| Documented plugin architecture for indexing additional types | ✓ | ✓ | ✓ | - | - | - |
| Indexing triggered by file system changes | ✓ | ✓ | ✓ | - | ✓ | - |
| Indexing can be scheduled | ✓ | - | - | ✓ | ✓ | - |
| Indexing can be manually paused | ✓ | ✓ | ✓ | ✓ | ✓ | - |
| Indexing can be initiated manually | ✓ | - | ✓ | ✓ | ✓ | - |
| Indexing of network drives | ✓ | ✓ | ✓ | - | ✓ | - |
| General Files | | | | | | |
| Microsoft Word documents | MC | MC | MC | MC | MC | - |
| Microsoft PowerPoint presentations | MC | MC | MC | MC | MC | - |
| Microsoft Excel spreadsheets | MC | MC | MC | MC | MC | - |
| WordPerfect documents | MC | - | - | MC† | MC | - |
| OpenOffice.org/StarOffice documents | MC | - | - | MC† | MC | - |
| Adobe Acrobat documents | MC | MC | - | MC | MC | - |
| HTML pages | MC | MC | MC | MC | MC | - |
| Text files | MC | MC | MC | MC | MC | - |
| XML files | MC | MC | MC | M | MC | - |
| RTF files | MC | MC | MC | MC | MC | - |
| Zip files | M | MC | M | M | MC | - |
| Folders | C | C | MC | MC | C | - |
| Windows Shortcut files | - | - | M | M | - | - |
| Communications | | | | | | |
| Microsoft Outlook email | MC | MC | MC | MC | MC | - |
| Microsoft Outlook attachments | MC | - | MC | M | MC | - |
| Microsoft Outlook contacts | MC | MC | MC | MC | - | - |
| Microsoft Outlook calendar | - | MC | MC | - | - | - |
| Microsoft Outlook tasks | - | MC | MC | - | - | - |
| Microsoft Outlook notes | - | MC | MC | - | - | - |
| Microsoft Outlook journal | - | MC | MC | - | - | - |
| Microsoft Outlook Express email | MC | MC | MC | MC | MC | - |
| Microsoft Outlook Express contacts | MC | - | - | - | - | - |
| GMail email | - | MC | - | - | - | - |
| Netscape Mail email | - | MC | - | - | MC | - |
| Mozilla Thunderbird email | MC | MC | - | MC | MC | - |
| Mozilla Thunderbird contacts | MC | - | - | - | - | - |
| Mozilla Mail email | - | MC | - | - | MC | - |
| Eudora email | MC | - | - | - | MC | - |
| Internet Explorer browser history | M | MC | - | - | - | - |
| Mozilla Firefox browser history | M | MC | - | - | - | - |
| Netscape browser history | M | MC | - | - | - | - |
| Mozilla browser history | M | MC | - | - | - | - |
| Internet Explorer Favourite files | M | - | M | - | - | - |
| Mozilla Firefox bookmarks | M | - | - | - | - | - |
| Mozilla bookmarks | M | - | - | - | - | - |
| Netscape bookmarks | M | - | - | - | - | - |
| MSN Messenger Chats | - | MC | MC | - | - | - |
| AOL Instant Messenger Chats | - | MC | - | - | - | - |
| Yahoo Instant Messenger Chats | - | - | - | MC | - | - |
| Google Talk Chats | - | MC | - | - | - | - |
| Picture Files | | | | | | |
| BMP | M | M | M | M | M | - |
| GIF | M | M | M | M | M | - |
| JPEG-EXIF | M | M | M | M | M | - |
| PNG | M | M | M | M | M | - |
| Audio Files | | | | | | |
| AAC | M | M | M | M | M | - |
| iTunes (M4A & M4P) | M | - | - | M | - | - |
| MID | M | M | M | M | M | - |
| MP3 | M | M | M | M | M | - |
| WAV | M | M | M | M | M | - |
| Windows Media (WMA) | M | M | M | M | M | - |
| Video Files | | | | | | |
| AVI | M | M | M | M | M | - |
| MPEG | M | M | M | M | M | - |
| QuickTime (MOV) | M | M | M | M | M | - |
| Windows Media (WMV) | M | M | M | M | M | - |

Table 3: Comparison of indexing between search engines. Note: M=metadata, C=content, †=Available with optional expansion pack
| Query Functionality | CDS | GDS | WDS | YDS | ID | WSC |
|---|---|---|---|---|---|---|
| AND | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| OR | ✓ | - | ✓ | ✓ | ✓ | ✓ |
| NOT | ✓ | ✓ | ✓ | ✓ | ✓ | - |
| XOR | - | - | - | - | ✓ | - |
| * wildcard | - | - | ✓ | - | ✓ | ✓ |
| ? wildcard | - | - | - | - | ✓ | ✓ |
| Sub-queries | ✓ | - | - | ✓ | ✓ | - |
| Phrase | ✓ | ✓ | ✓ | ✓ | ✓ | - |
| Word proximity | - | - | - | ✓ | ✓ | - |
| Paragraph proximity | - | - | - | - | ✓ | - |
| Number range | - | - | - | - | ✓ | - |
| Date range | - | - | - | - | ✓ | - |
| Synonym lookup | - | - | - | - | ✓ | - |
| Natural language query | - | - | - | - | ✓ | - |
| Case sensitive | no | no | no | no | no | User defined |
| Meta-data Query | | | | | | |
| Author | - | - | ✓ | - | - | - |
|
Subject |
||||||