[SOUND] So average precision is computer for just one. one query. But we generally experiment with many different queries and this is to avoid the variance across queries. Depending on the queries you use you might make different conclusions. Right, so it's better then using more queries. If you use more queries then, you will also have to take the average of the average precision over all these queries. So how can we do that? Well, you can naturally. Think of just doing arithmetic mean as we always tend to, to think in, in this way. So, this would give us what's called a "Mean Average Position", or MAP. In this case, we take arithmetic mean of all the average precisions over several queries or topics. But as I just mentioned in another lecture, is this good? We call that. We talked about the different ways of combining precision and recall. And we conclude that the arithmetic mean is not as good as the MAP measure. But here it's the same. We can also think about the alternative ways of aggregating the numbers. Don't just automatically assume that, though. Let's just also take the arithmetic mean of the average position over these queries. Let's think about what's the best way of aggregating them. If you think about the different ways, naturally you will, probably be able to think about another way, which is geometric mean. And we call this kind of average a gMAP. This is another way. So now, once you think about the two different ways. Of doing the same thing. The natural question to ask is, which one is better? So. So, do you use MAP or gMAP? Again, that's important question. Imagine you are again testing a new algorithm in, by comparing the ways your old algorithms made the search engine. Now you tested multiple topics. Now you've got the average precision for these topics. Now you are thinking of looking at the overall performance. You have to take the average. But which, which strategy would you use? Now first, you should also think about the question, well did it make a difference? Can you think of scenarios where using one of them would make a difference? That is they would give different rankings of those methods. And that also means depending on the way you average or detect the. Average of these average positions. You will get different conclusions. This makes the question becoming even more important. Right? So, which one would you use? Well again, if you look at the difference between these. Different ways of aggregating the average position. You'll realize in arithmetic mean, the sum is dominating by large values. So what does large value here mean? It means the query is relatively easy. You can have a high pres, average position. Whereas gMAP tends to be affected more by low values. And those are the queries that don't have good performance. The average precision is low. So if you think about the, improving the search engine for those difficult queries, then gMAP would be preferred, right? On the other hand, if you just want to. Have improved a lot. Over all the kinds of queries or particular popular queries that might be easy and you want to make the perfect and maybe MAP would be then preferred. So again, the answer depends on your users, your users tasks and their pref, their preferences. So the point that here is to think about the multiple ways to solve the same problem, and then compare them, and think carefully about the differences. And which one makes more sense. Often, when one of them might make sense in one situation and another might make more sense in a different situation. So it's important to pick out under what situations one is preferred. As a special case of the mean average position, we can also think about the case where there was precisely one rank in the document. And this happens often, for example, in what's called a known item search. Where you know a target page, let's say you have to find Amazon, homepage. You have one relevant document there, and you hope to find it. That's call a "known item search". In that case, there's precisely one relevant document. Or in another application, like a question and answering, maybe there's only one answer. Are there. So if you rank the answers, then your goal is to rank that one particular answer on top, right? So in this case, you can easily verify the average position, will basically boil down to reciprocal rank. That is, 1 over r where r is the rank position of that single relevant document. So if that document is ranked on the very top or is 1, and then it's 1 for reciprocal rank. If it's ranked at the, the second, then it's 1 over 2. Et cetera. And then we can also take a, a average of all these average precision or reciprocal rank over a set of topics, and that would give us something called a mean reciprocal rank. It's a very popular measure. For no item search or, you know, an problem where you have just one relevant item. Now again here, you can see this r actually is meaningful here. And this r is basically indicating how much effort a user would have to make in order to find that relevant document. If it's ranked on the top it's low effort that you have to make, or little effort. But if it's ranked at 100 then you actually have to, read presumably 100 documents in order to find it. So, in this sense r is also a meaningful measure and the reciprocal rank will take the reciprocal of r, instead of using r directly. So my natural question here is why not simply using r? I imagine if you were to design a ratio to, measure the performance of a random system, when there is only one relevant item. You might have thought about using r directly as the measure. After all, that measures the user's effort, right? But, think about if you take a average of this over a large number of topics. Again it would make a difference. Right, for one single topic, using r or using 1 over r wouldn't make any difference. It's the same. Larger r with corresponds to a small 1 over r, right? But the difference would only show when, show up when you have many topics. So again, think about the average of Mean Reciprocal Rank versus average of just r. What's the difference? Do you see any difference? And would, would this difference change the oath of systems. In our conclusion. And this, it turns out that, there is actually a big difference, and if you think about it, if you want to think about it and then, yourself, then pause the video. Basically, the difference is, if you take some of our directory, then. Again it will be dominated by large values of r. So what are those values? Those are basically large values that indicate that lower ranked results. That means the relevant items rank very low down on the list. And the sum that's also the average that would then be dominated by. Where those relevant documents are ranked in, in ,in, in the lower portion of the ranked. But from a users perspective we care more about the highly ranked documents. So by taking this transformation by using reciprocal rank. Here we emphasize more on the difference on the top. You know, think about the difference between 1 and the 2, it would make a big difference, in 1 over r, but think about the 100, and 1, and where and when won't make much difference if you use this. But if you use this there will be a big difference in 100 and let's say 1,000, right. So this is not the desirable. On the other hand, a 1 and 2 won't make much difference. So this is yet another case where there may be multiple choices of doing the same thing and then you need to figure out which one makes more sense. So to summarize, we showed that the precision-recall curve. Can characterize the overall accuracy of a ranked list. And we emphasized that the actual utility of a ranked list depends on how many top ranked results a user would actually examine. Some users will examine more. Than others. An average person uses a standard measure for comparing two ranking methods. It combines precision and recall and it's sensitive to the rank of every random document. [MUSIC]