Search Engine People Blog

SES Toronto 2010: Introduction to Information Retrieval on the Web

Otherwise known as "How Search Engines Work", this session by Mike Grehan looks like it's going to be an interesting hour of great information on the "tubes and dials" of how search engines do the magic that they do.

Moderating: Jonathan Allen, Director, SearchEngineWatch
Presenting: Mike Grehan, Global VP Content, SES/Search Engine Watch/ClickZ - solo presentation

Off we go!

Mike tells us that there is a paper under one person's chair, and that person should stand up and tell us why they should be #1.

Then he says that he's kidding - funny, Mike!

Moving on...

Sometimes what you ask Google for is not what you are looking for.

In 1945, Vannevar Bush said that as humans stopped turning to war, they would start focussing on making information more accessible.

Similar to Google's mission statement.

Tim Berners-Lee invented the WWW. It wasn't his life's work - he actually invented it at lunchtime. The Internet is not the same as the WWW.

Search is failing. The WWW is based on graph theory.

All search engines use broadly the same system of Information Retrieval. They keep the words on the page attached to the document, like the index in the back of a book - the word points to the page it's mentioned on.
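The book-index idea can be sketched as an inverted index. This is a minimal illustration, not how any particular engine stores its data; the function name `build_inverted_index` and the toy documents are made up for the example.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}

# Hypothetical toy documents, not from the session.
docs = {
    "page1": "beethoven fifth symphony",
    "page2": "andre previn conducts beethoven",
}
index = build_inverted_index(docs)
print(index["beethoven"])  # ['page1', 'page2']
print(index["previn"])     # ['page2']
```

Looking up a word is then a single dictionary access, which is why the lookup stays fast no matter how many pages are indexed.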

Search engines can weight the term on the page.
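One classic textbook way to weight a term is TF-IDF. The session did not say which weighting scheme any engine uses, so treat this as an assumed stand-in; `tf_idf` and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tf_idf(term, doc_id, docs):
    """Score how characteristic `term` is of one document (classic TF-IDF)."""
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    words = tokenized[doc_id]
    if not words:
        return 0.0
    # Term frequency: share of the page's words that are this term.
    tf = Counter(words)[term] / len(words)
    # Document frequency: how many pages mention the term at all.
    df = sum(1 for w in tokenized.values() if term in w)
    if df == 0:
        return 0.0
    # Terms that appear on fewer pages get a higher inverse document frequency.
    return tf * math.log(len(docs) / df)

# Hypothetical toy documents, not from the session.
docs = {
    "a": "search engine ranking search",
    "b": "engine repair manual",
    "c": "cooking recipes",
}
search_score = tf_idf("search", "a", docs)
engine_score = tf_idf("engine", "a", docs)
print(search_score > engine_score)  # True
```

On page "a", "search" outweighs "engine" because it appears more often there and on fewer pages overall.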

In 2004, Mike sat down with engineers from the major engines to find out "what does the perfect page look like", and found out that Google is trying to read the page like a human being does.

But they noticed that a page written about Beethoven's Fifth Symphony might have the same keyword density as a page by Andre Previn. That's not ideal.

People noticed that AltaVista didn't appear in its own results for "search engine".

So search engines began using citation analysis, treating how pages link together as a signal for ranking. It matters more what people say about you than what you say about you. Some links are more equal than others.
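The best-known published form of this kind of link analysis is PageRank. Here is a simplified sketch of it, assuming a tiny made-up link graph; the damping value and iteration count are conventional illustrative defaults, not anything quoted in the session, and real ranking uses many more signals.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration over a dict of page -> outlinks."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its vote evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no outlinks spreads its vote over every page.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank

# Hypothetical link graph: a, b and d all link to c; c links back to a.
links = {"a": ["c"], "b": ["c"], "c": ["a"], "d": ["c"]}
rank = pagerank(links)
print(max(rank, key=rank.get))  # c
```

Page "a" ends up above "b" and "d" even though each has one outlink, because its single inbound link comes from the best-linked page. That is the "some links are more equal than others" point in miniature.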

You can't reverse engineer the algorithm.

If one page links to another, that's a recommendation. Two pages that are recommended by similar other pages might be related to each other.
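The "recommended by similar other pages" idea is usually called co-citation. A rough sketch, assuming a made-up set of linking pages, just counts how often two pages are linked from the same source:

```python
from collections import Counter
from itertools import combinations

def cocitation_counts(links):
    """Count, for each pair of pages, how many sources link to both."""
    counts = Counter()
    for targets in links.values():
        for a, b in combinations(sorted(set(targets)), 2):
            counts[(a, b)] += 1
    return counts

# Hypothetical linking pages and the pages they recommend.
links = {
    "hub1": ["x", "y"],
    "hub2": ["x", "y", "z"],
    "hub3": ["y", "z"],
}
counts = cocitation_counts(links)
print(counts[("x", "y")])  # 2
print(counts[("x", "z")])  # 1
```

Pairs with a high count are candidates for being about the same topic, even if they never link to each other.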

Hub sites are used like human editors. But now we know the hubs and authorities, and also the subject matter & communities of each.
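Hubs and authorities are the vocabulary of the HITS algorithm: a good hub links to good authorities, and a good authority is linked from good hubs. A minimal sketch of that mutual scoring, assuming a tiny invented graph (the page names are placeholders):

```python
import math

def hits(links, iterations=20):
    """Return (hub, authority) scores for a dict of page -> outlinks."""
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority score: sum of the hub scores of pages linking in.
        auth = {p: 0.0 for p in pages}
        for source, targets in links.items():
            for t in targets:
                auth[t] += hub[source]
        # Hub score: sum of the authority scores of the pages linked to.
        hub = {p: sum(auth[t] for t in links.get(p, [])) for p in pages}
        # Normalize so the scores stay bounded between rounds.
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth

# Hypothetical graph: a directory and a blog both point at some sites.
links = {
    "directory": ["siteA", "siteB", "siteC"],
    "blog": ["siteA", "siteB"],
}
hub_scores, auth_scores = hits(links)
print(max(hub_scores, key=hub_scores.get))  # directory
```

Here siteA and siteB come out as stronger authorities than siteC because two hubs vouch for them instead of one.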

Don't think about buying links - instead think about business development - get links from your community.

Two Japanese researchers did a study and found 100k communities in 40 million websites. So identify your communities and get links there.

"10 Blue Links" is a pretty dull result now. Your eyes immediately go to the pictures in the results. No compelling title tag in the world will compel someone to click on a blue text link instead of an attractive image link.

Types of Queries

Informational:
If someone types in "low haemoglobin", they are looking for specific information about a medical condition.

Navigational:
I know where I'm going.

Transactional:
Doesn't necessarily mean that they want to buy something. It means they are looking to transact some kind of engagement with you, like downloading a whitepaper.

Query Chains
When you can't find what you're looking for, you tend to use a different query. When this happens thousands and millions of times, you can see patterns. If Google sees that query chain thousands of times, it starts showing results for the final query that's usually seen in the chain.
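The pattern-spotting described here can be sketched by counting where chains that start with the same query usually end up. This is only an illustration: the session logs are invented, and `final_query_suggestions` is a hypothetical helper, not a description of Google's actual method.

```python
from collections import Counter, defaultdict

def final_query_suggestions(chains):
    """For each starting query, return the final query it most often leads to."""
    endings = defaultdict(Counter)
    for chain in chains:
        # Only chains where the searcher actually reformulated are counted.
        if len(chain) >= 2:
            endings[chain[0]][chain[-1]] += 1
    return {start: counts.most_common(1)[0][0] for start, counts in endings.items()}

# Hypothetical search sessions, each a list of queries in the order typed.
chains = [
    ["jaguar", "jaguar car", "jaguar xf price"],
    ["jaguar", "jaguar xf price"],
    ["jaguar", "jaguar animal"],
    ["low haemoglobin"],
]
suggestions = final_query_suggestions(chains)
print(suggestions)  # {'jaguar': 'jaguar xf price'}
```

With enough sessions, the most common ending is a reasonable guess at what people typing the first query really wanted.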

User Trails

What are the strongest signals?

Strongest signal for now is the Google Toolbar. It tells Google what happens next after you click on the link - the information they never had before. There's a lot of data going back to Google through this.

Talking recently to some security people about Google signals. Question: what about Chrome? That's even more powerful.

If you type something that has a clear commercial intent into Google, it tends to show commercial results. Showing a screenshot of a commercial page where, if you click the [+] sign next to the ads, you get a full page of paid listings - no organic at all.

User generated content beats mediated content 5:1. It's very difficult for the engine to keep up with the amount of data. Use user generated content.

Showing the Naver search engine. The best information on the web is verifiable information. That's why social search is becoming more and more important.

Connected Marketing
Local, mobile, social, multimedia convergence - the different ways that we look for and consume information. There are different ways we can satisfy our information needs. If the Search Engine is smart enough, why do you need to ask?

Google does not have the entire WWW in their database. They have 1 trillion URLs in their database, but they can't crawl it all before it becomes irrelevant.

Many years ago, your grandparents sat in front of a brown wooden box listening to Roosevelt. Your family sits around an HD-TV. It's the same.

Questions

Question: Do you know anything about optimizing images?
Mike: Get people to look at it, because the more people who click on it and look at it, the more relevant it must be.

Question: Clarify: Google can't handle all of the information - will it diverge into applications?
Mike: The WWW is great for searching and banking, but it's not adequate for current informational needs. It can't scale, because if you crawl faster, you can slow down the web.

Remember that Chrome is now an Operating System - think about that in terms of convergence. It can mine information in real time.

Question: Is DMOZ still worth it?
Mike: Sure, if you don't die of old age before you get the link. Other links might be easier and more effective.

Question: (me): Does Google Analytics also count as a ranking signal?
Mike: The data is separate. You'd have to be really important for Google to want to look at your data. Look at the privacy issue. It's about the user, not about the site - Google wants to know what you like so they can personalize, and that's what the toolbar is for.

Question: The all-ads page you showed in Google - does that go against Google's "don't be evil" motto?
Mike: Google is probably going to regret the "don't be evil" motto for a long time. From a personal point of view as a user I actually found the result useful and got what I wanted quickly. That goes back to Google's idea of the more they can personalize, the more useful the results are.

Question: If people only look at the top of the screen, what will happen when everyone is optimizing & paying for the same keywords?
Mike: That's a long one. The keyword prices will go up. You're going to see a lot less of a commercial element in the organic results, because when someone wants to buy something, the commercial results will be more visible.
Jonathan: remember that 50% of queries on Google are still unique, even today, so it might not get that crowded.

Question: With the rapid growth of social media, is there such a thing as SMO, or is that something that's coming?
Jonathan: It depends on if social media comes up with its own protocol.
Mike: Temporal analysis will probably become more important. You can be #1 in Google even if they haven't crawled your page. So if Google has your URL but hasn't yet crawled you, then if there are enough links, they will display the URL anyway.
Jonathan: You can rank without even having a website. So made a video that went viral, and gave the embed code to a blogger who had great ranking. So many people picked it up that the Facebook group where it was became #1 for that phrase. So you can be #1 even without a domain name. Google ignores the Facebook meta tags. But Bing might not ignore them.

Question: (Jonathan): Where are we going to get new sources of data? Wolfram-Alpha?
Mike: I use a lot of apps on my iPhone, like OpenTable. Google can't get into that, but I'd like to see that in Google. So we might see more partnering with Google to get that information in there. OpenTable also knows a lot about you.

Question: How soon do you think it will end?
Mike: If the query has a commercial intent, the results should be commercial.

Question: (Jonathan): How many of you know that Google just put its whole index on a new platform called Caffeine?
Mike: Google is a public company, and you can see that they are not spending most of their money on indexing. They are going to change their model for sure.