佣金中国's Archiver

BEFREE posted on 2005-5-31 22:55

Hilltop: A Search Engine based on Expert Documents

Abstract:
In response to a query, a search engine returns a ranked list of documents. If the query is broad (i.e., it matches many documents), then the returned list is usually too long to view fully. Studies show that users usually look at only the top 10 to 20 results. In this paper, we propose a novel ranking scheme for broad queries that places the most authoritative pages on the query topic at the top of the ranking. Our algorithm operates on a special index of "expert documents." These are a subset of the pages on the WWW identified as directories of links to non-affiliated sources on specific topics. Results are ranked based on the match between the query and relevant descriptive text for hyperlinks on expert pages pointing to a given result page. We present a prototype search engine that implements our ranking scheme and discuss its performance. With a relatively small (2.5 million page) expert index, our algorithm was able to perform comparably to the best of the mainstream search engines on broad queries.
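To make the ranking idea in the abstract concrete, here is a minimal Python sketch. It is a toy simplification and not the paper's actual scoring formula: the expert index contents, the example URLs, the function name `rank_targets`, and the term-overlap score are all assumptions made up for illustration. It only shows the general shape: a result page is scored by how well the query matches the descriptive link text on expert pages that point to it.

```python
from collections import defaultdict

# Hypothetical toy "expert index": each expert page maps to a list of
# (target URL, descriptive link text) pairs. All names are made up.
EXPERT_INDEX = {
    "expert-a.example/links": [
        ("site-one.example", "classic car restoration guide"),
        ("site-two.example", "car insurance quotes"),
    ],
    "expert-b.example/directory": [
        ("site-one.example", "car restoration tips"),
        ("site-three.example", "gardening basics"),
    ],
}


def rank_targets(query, expert_index):
    """Rank target pages by query/link-text overlap summed over expert pages.

    Assumed scoring (not from the paper): each link contributes the
    fraction of query terms that appear in its descriptive text.
    """
    terms = set(query.lower().split())
    scores = defaultdict(float)
    if not terms:
        return []
    for links in expert_index.values():
        for target, text in links:
            overlap = len(terms & set(text.lower().split()))
            if overlap:
                scores[target] += overlap / len(terms)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    for url, score in rank_targets("car restoration", EXPERT_INDEX):
        print(url, score)
```

In this sketch a page that several expert pages describe with the query terms rises to the top, while a page whose own content merely repeats the keywords gets no credit unless an expert page links to it.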
1 Introduction
When searching the WWW, broad queries tend to produce a large result set. This set is hard to rank based on content alone, since the quality and "authoritativeness" of a page (namely, a measure of how authoritative the page is on the subject) cannot be assessed solely by analyzing its content. In traditional information retrieval we make the assumption that the articles in the corpus originate from a reputable source and that all words found in an article were intended for the reader. These assumptions do not hold on the WWW, since content is authored by sources of varying quality and words are often added indiscriminately to boost the page's ranking. For example, some pages are created purposely to mislead search engines, and are popularly known as "spam" pages. The most virulent of spam techniques involves deliberately returning someone else's popular page to search engine robots instead of the actual page, to steal their traffic. Even when there is no intention to mislead search engines, the WWW tends to be crowded with information on topics popular with users. Consequently, for broad queries, keyword matching seems inadequate.
Prior approaches that have used content analysis to rank broad queries on the WWW cannot distinguish between authoritative and non-authoritative pages (e.g., they fail to detect spam pages). Hence, the ranking tends to be poor, and search services have turned to other sources of information besides content to rank results. We next describe some of these ranking strategies, followed by our new approach to authoritative ranking, which we call Hilltop.

MORE: http://www.cs.toronto.edu/~georgem/hilltop/

tufubob posted on 2005-6-2 09:35

What is this?
