Ruud Questions: Marie-Claire Jenkins

Ruud Hein

15 years ago

I think I met MCJ via Fantomaster online. Or maybe I saw her involved in a discussion with Dave theGypsy -- I'm not sure anymore. Either way, soon after becoming aware of her I started to follow her, because what she has to share with us is of a different quality than the usual SEO c.... stuff we hear. I started to read her blog posts, which are tremendously informative.

She's smart. She knows the things you would want to know -- and even though she's likely the most knowledgeable SEO in the field at the moment, she's as cool and hip about them as the next person.

For a couple of weeks now, her popular TGIF post has appeared on SEO Scoop.

Information retrieval, algorithms, patterns. The mind's eye sees a blackboard filled to the very edges with formulas, the mad scientist continuing to scribble on the wall.....

Is any of this material accessible if you're not a mathlete?

I started off as a linguist. My first degree was in translating and interpreting French and German. My true passion was and still is finding patterns in language, as well as philosophy. I decided to take a Masters in Computer Science to research Machine Translation. After that I was offered a PhD by my University in IR and NLP related things. So you see, I am not, at base, a mathematician. I am a word person. In fact my turning into a computer scientist surprised everyone, including myself. I thought I was going to be some cool literary type.

How did I make the transition? Well I put my fears aside and started at the beginning. I learnt about a lot of things that appeared supremely complex, that I had never come across, that I barely understood, and many things that I actually had no idea about at all. Some things took me a couple of years to digest and fully comprehend. Writing equations was a steep learning curve, and coding proper languages (not web programming) was quite a challenge too. I discovered however that I was blessed with a knack for finding creative solutions and that my linguistics background gave me an edge as did philosophy.

Edison said "Genius is one percent inspiration and ninety-nine percent perspiration." Unfortunately there's no way around that. If you want to be a computer scientist, you have to have passion and not be afraid of hard graft. You can grasp the basics though, and they help for things like SEO, and that's usually enough. But if you're talking about a full, in-depth understanding, it takes time.

Dividing search into information retrieval and ranking algorithms, it seems one is very basic (collect information, spit it back out) and the other is hidden (who knows *what* they're using to rank!).

What can I learn from which part?

IR is all about finding the right information in the context of a query, simply put. It is very far from basic; there are so many complexities to deal with. Before you can decide whether a document is relevant, you have to read it, or at least scan read it and understand it. Then your brain links it up to the topic around that query and then you make a decision. The same is true for a machine. The thing is that machines don't cope with that sort of thing very well, so we have to devise all sorts of weird and wonderful things to make that happen. Even humans aren't 100% correct in retrieving information because it becomes very subjective. We are trying to get a machine to do what we do, but even better. This is a tall order.

IR understanding tells you all about how to work with copy, and how computers process it and make sense of it. It helps you work out how they go about picking particular documents above others. This information is not secret. In fact the science community is very open so you can easily find all of the papers and methods that you need.
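As a rough illustration of how a machine might decide that one document fits a query better than another, here is a minimal sketch of TF-IDF scoring, a classic textbook IR weighting. It is not a description of how any particular search engine works; the function names `tokenize` and `rank` and the three sample documents are made up for this example.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; real systems do far more."""
    return text.lower().split()

def rank(query, docs):
    """Return (score, doc_index) pairs sorted from best to worst match."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    results = []
    for i, tokens in enumerate(tokenized):
        tf = Counter(tokens)
        score = 0.0
        for term in tokenize(query):
            if term in tf:
                # Rare terms (low df) carry more weight than common ones.
                idf = math.log(n_docs / df[term]) + 1.0
                score += (tf[term] / len(tokens)) * idf
        results.append((score, i))
    return sorted(results, key=lambda pair: pair[0], reverse=True)

# Hypothetical mini-collection, invented for illustration.
docs = [
    "surf report for the coast this weekend",
    "yoga class timetable for beginners",
    "weekend yoga and surf retreat on the coast",
]
ranking = rank("yoga class", docs)
```

The document that contains both query terms comes out on top, the one that contains only "yoga" comes second, and the one with neither scores zero. Everything that makes real IR hard (ambiguity, synonyms, context) is exactly what this sketch leaves out.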

Ranking algorithms are very complex and the topic around those is called "Learning to rank" because we use machine learning algorithms for that. In order to rank anything at all you have to create some kind of scale and then place everything in the right position. Humans can't really do this. The data has so many dimensions that it needs to be processed and analysed and taken apart with complex maths to establish any kind of ranking order. Do you always agree with the Google ranking? I don't, and their system is very efficient in comparison to a lot of other ones. It still doesn't work to the level required though.

Here you learn about how machines go about deciding how to sort content. This is also the area of classification and clustering. Again, all of the information on the shiny new methods is available.
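To give a flavour of what "learning to rank" means, here is a toy sketch of one simple approach: a pairwise perceptron that is shown pairs of documents (one judged better than the other) and adjusts its weights until it scores the better one higher. The two features and all the numbers are hypothetical, and production systems use far richer models and data.

```python
def train_pairwise(pairs, n_features, epochs=20, lr=0.1):
    """Learn weights so that score(better) > score(worse) for each training pair."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            if margin <= 0:  # pair is ordered wrongly (or tied): nudge the weights
                w = [wi + lr * di for wi, di in zip(w, diff)]
    return w

def score(w, features):
    """Linear score: weighted sum of a document's features."""
    return sum(wi * fi for wi, fi in zip(w, features))

# Hypothetical features per document: [term match, scaled link count].
# Each pair is (features of the better document, features of the worse one).
pairs = [
    ([0.9, 0.2], [0.3, 0.8]),
    ([0.8, 0.1], [0.2, 0.3]),
]
w = train_pairwise(pairs, n_features=2)
ordered_correctly = all(score(w, b) > score(w, c) for b, c in pairs)
```

With only two dimensions a person could eyeball the answer; the point of the machine learning is that the same procedure still works when there are hundreds of features.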

Remember IR is a hammer and every problem is a nail. Things like machine translation are scalpels and microscopes.

New ranking algos nobody has talked about yet

Also issues with IR evaluation

When we hear "natural language processing" our mind conjures up the image of a captain aboard the deck of a space ship, talking to a machine and receiving intelligent responses. Or we imagine typing a question into a search engine and getting an answer back that is not "keywords on page" based.

What do you see when you think about NLP?

Natural language processing is often misunderstood as a discipline. It feeds into all systems that work with language. NLP is used to manipulate text, physically. This means that it does things like stemming, parsing, stopword removal, pattern matching...that sort of thing. It's a bit like the admin side of computational linguistics if you like. The space ship thing is natural language understanding and generation, which is quite different. It uses NLP but also relies heavily on artificial intelligence. Typing a natural language query into a search engine is the same thing. In fact I work in that area at the moment. It requires a lot of different methods, of which NLP is one, but in my own work for example, it isn't the most exciting thing under the hood.
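For readers who have never seen this "admin side" in action, here is a minimal sketch of three of the steps mentioned: pattern matching to pull out words, stopword removal, and stemming. The stopword list and the suffix rules are tiny made-up stand-ins, not a real stemmer such as Porter's, and parsing is left out entirely.

```python
import re

# Tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "of", "for", "in", "is", "and", "to"}
# Crude suffix rules for illustration only, not a real stemming algorithm.
SUFFIXES = ("ing", "es", "s")

def crude_stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    tokens = re.findall(r"[a-z]+", text.lower())      # pattern matching
    kept = [t for t in tokens if t not in STOPWORDS]  # stopword removal
    return [crude_stem(t) for t in kept]              # stemming

terms = preprocess("Searching for the yoga classes in Sydney")
# terms == ["search", "yoga", "class", "sydney"]
```

None of this involves understanding what the sentence means; it only tidies the text so that later stages have something consistent to work with.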

This might be useful

What should we be thinking about when Google states it's working on AI projects?

Google has always been working on AI projects. AI was first described by John McCarthy in 1956 as "the science and engineering of making intelligent machines". In fact you were mentioning dialogue earlier, and Alan Turing is the man who started this conversation off in 1950 with "Computing Machinery and Intelligence". AI goes back a bit, so I would be very surprised to see any IR project not make use of AI techniques. They can be as simple as the genetic algorithm for example but they can also be horribly complex! Google probably mean that they're working on systems that use intelligent agents, which is what AI is about. It means that systems capable of organising themselves and taking actions based on their own decisions are likely being worked upon. Again I wouldn't see this as anything strange; coming from Google it is a given in my mind.

What is AI?

This might be useful

AI for marketing

Why the emphasis on personalization? What are they trying to personalize -- and what does "personalization" mean anyway?

We have reached a point where there is only so much more that can come out of processing a query in isolation. IR in search engines is hard because a query is always in isolation and there are few words to work on. In natural language you have a grammatical structure, more information, and you can derive something a lot more accurate (as long as you can disambiguate effectively). Personalisation means that additional information that isn't usually accessible to search engines can be acquired. It is all about making the system more efficient for users and putting their queries into context. Personalisation is the art of making something tailor-made to a user. I can guarantee that my search history is different to a lot of other people's. If the search engine is "aware" (for want of a better word - this is a choice of word that leads to big arguments!) of the fact that I read the coastal news for my area, that I often look for surfing related things, sport events, computing stuff, particular books...when I enter "Yoga class" it's pretty safe to assume that I'm not looking for one in Memphis. I'm in Sydney.
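The "Yoga class" example can be sketched in a few lines. This is only an illustration of the general idea of putting a query into context, under the assumption that the engine holds a set of profile terms for the user; the function name `personalise`, the base scores and the profile are all invented here.

```python
def personalise(results, profile, weight=0.5):
    """Re-rank (title, base_score) pairs by overlap with the user's profile terms."""
    rescored = []
    for title, base_score in results:
        words = set(title.lower().split())
        overlap = len(words & profile)  # how many profile terms the result mentions
        rescored.append((title, base_score + weight * overlap))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

# Hypothetical results for the query "Yoga class", with made-up base scores.
results = [
    ("Yoga class Memphis", 1.0),
    ("Yoga class Sydney", 0.9),
]
# Hypothetical terms mined from this user's search history.
profile = {"sydney", "surfing", "coastal"}
reranked = personalise(results, profile)
```

Without the profile the Memphis result wins on base score alone; with it, the Sydney result moves to the top, which is the whole point of the extra context.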

A lot of people worry about giving information out to the engines, and privacy and such things. Online you are lucky to have any privacy for a start. Google yourself and there's your proof. Data from a single source is not interesting because it doesn't tell you anything about your performance as a whole, where the issues with your system are, what queries are common to which demographics...that is the sort of thing you want to look at. Who cares if Trevor Smith has looked at seashells from Papua New Guinea? We might care however if 2,000 people interested in the same things as Trevor looked at the same thing. I welcome the time when I can have more personalised results because they will save me time for one thing.

How does it work?

For SEO

There is a common sense element to expecting search engines to somehow make sense and use of the tremendous amount of information available in social networking data.

Is overlaying social network signals simply another layer of ranking calculations?

Social networking data is something I've been looking at along with a bunch of other computer scientists. Having it incorporated into a search engine can lead to quite noisy data. The quality of the stuff that comes through Twitter is usually not great anyway in my experience, and so I don't think it belongs in the "normal" rankings. Certainly there should be a way of getting through it all and finding threads that you find interesting. The information on Twitter for example is great because it's short, but it's not easy to work out who is an authority source or to find a full conversation instead of bits, and processing full natural language isn't the easiest, especially not in real time. An interesting area of research is sentiment extraction, something which Chris Rines is working on - watch this space. I'm not sure social media stuff belongs in a regular search engine. If it is to be included there are some issues that need to be addressed first.

Twitter in an IR system

Google proudly boasts of taking into account over 200 factors when ranking results. Meanwhile pragmatic SEOs think "yup, and a good and a bunch of links accounts for 198 of those in most cases".

If everything from domain age to historical spam signals is taken into account, why is it still so relatively easy to rank?

The 200 factors may range from things for organising data to extracting small variables from it, for example. I don't know what those 200 factors are, but to me as an SEO professional they're not very interesting. As a computer scientist, they are. While I am on a crusade to educate and share scientific and technical information on how search engines work (along with David Harry, to name but one), I do believe that SEOs do not need to know how to build and run a neural network, for example. Knowing what one is is important, but that's about it. The reason for this is that it gives some kind of understanding of how search engines function. This knowledge enables people to understand what newfangled algorithms are about and how plausible the story is. Changing your whole SEO strategy based on what someone said in a blog post is dangerous and unnecessary. Read around it and do some tests.

My favourite quote is "If you wish to make an apple pie from scratch, you must first invent the universe" by Carl Sagan. If you want to understand something, you have to learn everything around it, and put all of your beliefs into question. For example if you don't know what a neural network is, how will you understand a paper that describes a method where one is used?

It's not easy for everyone to rank. Some sites are up against some interesting issues, such as an author who sells their books on their own site. Amazon and numerous bookshops also sell the book online. The author obviously wants to show at #1 for their own work and name, and sometimes this can be a challenge. Other sites are much more straightforward and yes, ranking can be relatively easy. I would say that it does depend on what topic area you're going for as well. "Sea shells from Papua New Guinea" might indeed be quite easy. This is probably because there isn't much data in that topic area. There will be in the "hotels" category though. This is a simplistic view of it but you get the picture.

Actually, if it were really easy to rank, it would get harder and harder, because not everyone can be at #1.

Another type of ranking

The single most effective thing to do for your web site is .... ?

To love it. But don't be afraid to leave it alone and get outside once in a while. Even websites need space sometimes.