S&T | The Pros and Cons of Using a Machine Learned vs. Score-based Model for Search Ranking

IndraStra Global, Sunday, March 13, 2016

By Nikhil Dandekar

Traditional information retrieval started with score-based ranking of search results, using scores such as TF-IDF ^[1] or BM25 ^[2]. As search engines matured, these scoring functions grew more sophisticated, and many big search engines use a heuristic, score-based model for search ranking. Most famously, as recently as 2011, Google used a heuristic model for search ranking ^[3] despite having strong in-house expertise in Machine Learning (ML).
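As a rough illustration of what such a score looks like, the sketch below computes Okapi BM25 over a toy corpus. The function name `bm25_scores`, the whitespace tokenization, the sample documents, and the parameter values `k1=1.5` and `b=0.75` are assumptions made for this example, and the IDF uses a common smoothed variant rather than the formula of any particular search engine.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: number of documents that contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF; the "1 +" inside the log keeps it non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document-length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus (illustrative only).
docs = [
    "the quick brown fox".split(),
    "the lazy dog".split(),
    "quick quick fox jumps over the lazy dog".split(),
]
print(bm25_scores(["quick", "fox"], docs))
```

The two knobs, `k1` and `b`, are exactly the kind of hand-tuned parameters that a score-based ranker exposes.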

Ranking search results using machine-learned models has been explored for at least a couple of decades. It has gained even more prominence with the popularity of Learning to Rank ^[4] techniques over the last decade or so. For example, Bing has been using Learning to Rank techniques to rank its search results since at least 2009 ^[5].

This is a choice a lot of new and existing search engines have to make: should they go for hand-tuned, score-based models, or should they use machine learning to rank search results?

Here are some of the factors that matter and should go into your decision-making. Note that most of these points are generic enough to apply to any prediction or ranking problem and are not restricted to search.

1. Explainability

For most ML algorithms, especially the ones currently in fashion such as ensembles or neural nets, ranking is essentially a black box. You can control the inputs, but it’s really hard to explain exactly what effect specific inputs have on the output. The final model, thus, is not very explainable.

A score-based model, especially one where the score is thoughtfully constructed, is usually easier to reason about and explain.
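To make the explainability point concrete, here is a minimal sketch of a hand-tuned score built as a weighted sum of features. The feature names, the weights, and the helper `explain_score` are all hypothetical and chosen only for illustration; a real score would typically be more involved.

```python
# Hypothetical hand-tuned weights; every name and value here is an assumption.
WEIGHTS = {"text_match": 3.0, "freshness": 1.0, "popularity": 0.5}

def explain_score(features, weights=WEIGHTS):
    """Return the total score and each feature's contribution to it."""
    contributions = {
        name: weight * features.get(name, 0.0) for name, weight in weights.items()
    }
    return sum(contributions.values()), contributions

total, parts = explain_score(
    {"text_match": 0.8, "freshness": 0.2, "popularity": 0.6}
)
# List the contributions from largest to smallest to see what drove the score.
for name, value in sorted(parts.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {value:.2f}")
print(f"total: {total:.2f}")
```

Because each contribution is visible, answering "why did this result rank here?" or adjusting one weight for an A/B test is a matter of reading or editing a single line.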

2. Implementation time

It usually takes a non-trivial amount of time to build a new version of an ML model. You need to run through multiple iterations of the “gather/clean data -> train -> validate -> test” loop before your model is ready for A/B testing.

Updating a score-based model can be as simple as tweaking the scores, so a new version can be ready for A/B testing in a very short time.

3. Optimization metric

For most search engines, it’s hard to come up with an objective metric to optimize for. This is the metric that tells you that your results for a particular search query are good and the search was successful. This metric changes based on what product you are building and what constitutes “success” for your search. You might be tempted to start off optimizing for user clicks, but if you use clicks blindly, you may train a model that favors bad, “click-baity” results. Big search engines spend a lot of money on building human relevance systems ^[6], where trained human raters use well-defined guidelines to generate an objective “success rating” for each search result. The training data generated by these systems can then be used to train the ML models that rank search results. Smaller search engines might not have the same resources as the big players and might not be able to afford building such systems.

This optimization metric is important for both ML and score-based systems. However, ML models suffer more if you don’t have a good optimization metric, since you can end up with a well-trained model that optimizes for completely the wrong metric. Score-based systems suffer a little less in comparison, given that the score is constructed using reason and intuition in combination with the metric you are trying to optimize.
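As one possible way to turn human ratings into a number you can optimize, the sketch below computes NDCG (normalized discounted cumulative gain) for a ranked list. The text above does not name a specific metric, so the choice of NDCG, the 0 to 3 rating scale, and the function names are assumptions made for illustration.

```python
import math

def dcg(ratings):
    """Discounted cumulative gain for ratings listed in ranked order."""
    return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(ratings))

def ndcg(ratings):
    """DCG divided by the DCG of the ideal (best possible) ordering."""
    ideal = dcg(sorted(ratings, reverse=True))
    return dcg(ratings) / ideal if ideal > 0 else 0.0

# Hypothetical rater labels (0 = bad, 3 = perfect) for the top four results.
print(ndcg([3, 2, 1, 0]))  # results already in the ideal order
print(ndcg([0, 1, 2, 3]))  # best result buried at the bottom
```

A metric like this rewards putting highly rated results near the top, which is harder to game than raw click counts.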

4. Result relevance

If you can get your optimization metric right, this is where the ML model can give you huge dividends. Learning directly from data usually trumps any intuition you can encode in your score-based model. If relevance matters more to you than any of the other factors, using an ML model is usually the way to go.
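For a sense of what “learning directly from data” can mean, here is a deliberately tiny sketch that learns the weights of a linear scoring function from preference pairs, using a RankNet-style pairwise logistic loss and plain gradient descent. The function name, the two made-up features, and the training pairs are assumptions for illustration; production Learning to Rank systems use far richer models and data.

```python
import math

def train_pairwise(pairs, n_features, lr=0.1, epochs=200):
    """Learn linear weights so preferred documents score higher than the others.

    Each pair is (better_features, worse_features). The per-pair loss is
    log(1 + exp(-(s_better - s_worse))), a pairwise logistic loss.
    """
    w = [0.0] * n_features
    for _ in range(epochs):
        for better, worse in pairs:
            diff = [b - c for b, c in zip(better, worse)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            # Derivative of the loss with respect to the margin.
            grad = -1.0 / (1.0 + math.exp(margin))
            w = [wi - lr * grad * di for wi, di in zip(w, diff)]
    return w

# Hypothetical features per document: [text_match, staleness].
pairs = [
    ([1.0, 0.2], [0.0, 0.9]),
    ([0.8, 0.5], [0.1, 0.4]),
    ([0.9, 0.1], [0.3, 0.8]),
]
weights = train_pairwise(pairs, n_features=2)
print(weights)
```

Nobody chose these weights by hand: they fall out of the labeled pairs, which is why the quality of the training signal matters so much.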

5. Flexibility

It's hard to make spot-fixes in an ML model. The best way to fix issues is through better or more training data, better feature engineering, or hyperparameter tuning, all of which are time-consuming.

It's much easier to fix issues quickly in a score-based model. Given a bug, you can tweak the model easily and have the fix out for users in no time.

6. Engineering ramp-up time

If you are using well-known ML models, it's relatively easy for a good Machine Learning engineer to ramp up on your system. The model learns from the data, and while you need some time to understand the overall system, you don't need to understand all the details of what happens inside the model before you start making changes to it.

A score-based model is hand-tuned, and you need to understand all the intuitions and trade-offs baked into the model before you can work with it effectively. For a fairly complex hand-tuned model, even a good engineer might take months to build enough context to understand all of the intuitions baked in over years of work on the model. This problem usually gets worse the older a model gets.

Hybrid ML/hand-tuned systems

As you can see, ML models can be better for relevance but have some other shortcomings. To overcome these shortcomings, most search engines that use an ML model use a hybrid ML/hand-tuned system. In this case, even though your main ranking model is an ML-trained one, you still have hand-tuned levers such as blacklists, constraints, or forced rankings to quickly fix egregious mistakes. Note that if you go this way, it's important that the hand-tuned components remain very simple and easy to use, or else you might end up having to maintain both a fairly complex ML model and a fairly complex hand-tuned system.
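A minimal sketch of such hand-tuned levers, layered on top of whatever order the ML model produced, might look like the following. The function `apply_overrides`, the document IDs, and the shape of the `pinned` mapping are hypothetical and only meant to show how small this layer can stay.

```python
def apply_overrides(query, ranked_ids, blacklist=frozenset(), pinned=None):
    """Post-process an ML ranking with simple hand-tuned levers.

    ranked_ids: document IDs in the order produced by the ML model.
    blacklist:  IDs that must never be shown.
    pinned:     mapping of query -> IDs forced to the top, in order.
    """
    pinned = pinned or {}
    forced = [d for d in pinned.get(query, []) if d not in blacklist]
    rest = [d for d in ranked_ids if d not in blacklist and d not in forced]
    return forced + rest

result = apply_overrides(
    "jaguar",
    ["a", "b", "c", "d"],       # order from the ML model
    blacklist={"b"},            # egregious result to suppress
    pinned={"jaguar": ["d"]},   # forced ranking for this query
)
print(result)  # ['d', 'a', 'c']
```

Keeping the override layer to a filter and a pin list is one way to stop it from growing into a second ranking model.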

Advice for new search engines

For new search engines, given the various factors above, a good rule-of-thumb would be:

Start with a hand-tuned model. Such models are simpler to build up front and let you hit the ground running.

Get your initial users. Get them to use your search for a while, so that you generate good training data for your future ML model.

When you reach a scale where incremental gains in relevance are more important than the rest of the factors, consider moving to an ML model. But do make sure you have a good answer to the “optimization metric” problem before you start working on an ML model.

Endnotes:

[1] tf–idf

[2] Okapi BM25

[3] Why is machine learning used heavily for Google's ad ranking and less for their search ranking?

[4] Nikhil Dandekar's answer to What is the intuitive explanation of Learning to Rank and algorithms like RankNet, LambdaRank and LambdaMART?

[5] User Needs, Features and the Science behind Bing

[6] Nikhil Dandekar's answer to How does Google measure the quality of their search results?

Source: Quora