Search Engine People Blog

Block-level link analysis

Microsoft has published a research paper outlining what may be the algorithm behind its upcoming search engine. You can find the paper at http://research.microsoft.com/research/pubs/view.aspx?tr_id=754. The basic premise is that the algorithm breaks a web page down into smaller sections (or blocks) and then evaluates the importance of the information and links within each block. If, for example, a page has a block of sponsored links at the bottom, that block would not be given as much weight as an informational block at the top. This could greatly reduce the power of bought and sold text links, which often sit in exactly that kind of low-weight block. Here are some excerpts from the paper.

"...based on our previous discussions, different blocks in a page have different importance. Therefore, those links in blocks with high importance value should be more important than those in blocks with low importance value. In other words, a user might prefer to follow those links in important blocks."

"Block Level PageRank (BLPR) is similar to the original PageRank algorithm in spirit. The key difference between them is that, traditional PageRank algorithm models web structure in the page level while BLPR models web structure in the block level."

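To make the block-level idea concrete, here is a minimal Python sketch of how a block-weighted PageRank could work. It is not the paper's exact formulation: the function name `block_level_pagerank`, the hand-assigned block importance values, the damping factor of 0.85 and the three sample pages are all assumptions made for illustration. The only idea taken from the excerpts is that a link's vote is scaled by the importance of the block it sits in.

```python
def block_level_pagerank(pages, damping=0.85, iterations=50):
    """Toy block-weighted PageRank (illustrative, not the paper's formula).

    pages maps a page name to a list of blocks; each block is a tuple
    (importance, [target_page, ...]). Importance values are assumed inputs.
    """
    names = sorted(pages)
    n = len(names)

    # Turn each page's blocks into page-to-page transition probabilities.
    transitions = {}
    for page in names:
        # Keep only links to known pages, and only blocks that still vote.
        blocks = [(w, [t for t in links if t in pages]) for w, links in pages[page]]
        blocks = [(w, links) for w, links in blocks if links and w > 0]
        total = sum(w for w, _ in blocks)
        row = {}
        for w, links in blocks:
            # A block's share of the page's vote is split evenly over its links.
            share = w / total / len(links)
            for target in links:
                row[target] = row.get(target, 0.0) + share
        transitions[page] = row

    rank = {p: 1.0 / n for p in names}
    for _ in range(iterations):
        # Pages with no usable links spread their rank over every page.
        dangling = sum(rank[p] for p in names if not transitions[p])
        new_rank = {p: (1 - damping) / n + damping * dangling / n for p in names}
        for p in names:
            for target, prob in transitions[p].items():
                new_rank[target] += damping * rank[p] * prob
        rank = new_rank
    return rank


# Hypothetical three-page web: "a" links to "b" from a content block
# and to "c" from a low-importance sponsored-links block.
web = {
    "a": [(0.9, ["b"]), (0.1, ["c"])],
    "b": [(1.0, ["a"])],
    "c": [(1.0, ["a"])],
}
ranks = block_level_pagerank(web)
```

In this toy web, page "a" passes 90% of its vote to "b" through its content block and only 10% to "c" through its sponsored block, so "b" ends up ranked above "c" even though each receives exactly one link.
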
"Different from PageRank which assigns only one value to each page, HITS assigns two values to each page (authority value and hub value). Hubs and authorities exhibit a mutually reinforcing relationship. As we discussed before, there are always multiple semantic regions in one page. Some hyperlinks such as banners, navigation panels, and advertisements in a page do not convey human endorsement. Thus equally mutually reinforcing all the links in a page might not be suitable. Based on our block level graph of the web presented in previous section, we proposed a Block Level HITS (BLHITS) algorithm. In BLHITS, the authority hub reinforcing idea is the same as the original HITS. The main difference is that in BLHITS, a page will have only authority score and a block will have only hub score, rather than top ranked pages in HITS. When a query is submitted to our system, we first retrieve the top ranked pages. The top ranked blocks are then extracted from these top ranked pages. In this step, those noisy blocks (such as advertisement block) are excluded. In our system, all the pages are pre-indexed at block level, so we can directly get the top ranked blocks without any extra computation. When expanding the root set, we only consider the out-links contained in top ranked blocks. HITS expands all the links in the pages, which inevitably introduce noisy pages into the base set. Similarly, we only add those blocks which contain links link to the pages in the root set rather than the whole pages to the root set."
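
The hub and authority split described in that excerpt can be sketched in a few lines of Python. This is only a rough illustration under stated assumptions: the function name `block_level_hits`, the block identifiers, the `noisy` set and the sample links are hypothetical, and the root-set retrieval and expansion steps from the paper are left out. What it does show is blocks carrying hub scores, pages carrying authority scores, and noisy blocks being dropped before any scores are reinforced.

```python
import math


def block_level_hits(block_links, noisy=(), iterations=30):
    """Toy block-level HITS: blocks get hub scores, pages get authority scores.

    block_links maps a block id to the pages it links to. Blocks listed in
    noisy (for example ad blocks) are excluded before scoring.
    """
    edges = {
        block: set(targets)
        for block, targets in block_links.items()
        if block not in noisy and targets
    }
    pages = {page for targets in edges.values() for page in targets}

    hub = {block: 1.0 for block in edges}
    authority = {page: 1.0 for page in pages}

    def normalize(scores):
        # Scale to unit length so the scores stay bounded across iterations.
        norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
        return {key: v / norm for key, v in scores.items()}

    for _ in range(iterations):
        # A page's authority is the sum of the hub scores of blocks linking to it.
        authority = {page: 0.0 for page in pages}
        for block, targets in edges.items():
            for page in targets:
                authority[page] += hub[block]
        authority = normalize(authority)

        # A block's hub score is the sum of the authority of the pages it links to.
        hub = normalize(
            {block: sum(authority[p] for p in targets) for block, targets in edges.items()}
        )
    return authority, hub


# Hypothetical blocks: two content blocks and one advertisement block.
links = {
    "a#content": ["x", "y"],
    "b#content": ["x"],
    "a#ads": ["spam"],
}
authority, hub = block_level_hits(links, noisy={"a#ads"})
```

Because the advertisement block is excluded up front, the page it links to never enters the scoring at all, while "x" (linked from two content blocks) earns a higher authority score than "y" (linked from one).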

"PR and BLPR T are calculated off-line, and stored for combination with relevance rank."

"HITS algorithm is query dependant, so we can not calculate a unique rank off-line."

Whether or not this will actually be implemented is unknown, but it definitely gives search engine optimizers plenty to think about.