A Human-centric Approach for Evaluating Visual Search Models

Category: IT Technology · Published: 6 years ago

Image search is one of our central products. The user experience has shifted from textual search to visual search, where a user can either capture a live image or select an image from their device.

Using convolutional neural networks, we process images and find similar live eBay listings to suggest to the user. These suggested listings are ranked by their similarity and then displayed. As we train these models, we face the challenge of evaluating their performance: how can we compare several visual search models and say which of them works better?

Since our main objective is to create compelling customer experiences, this article describes a method that tackles this problem directly through the eyes of our users.

Preparing the data set for the evaluation

We use a fixed set of n randomly selected user-uploaded images as the query images for both models during this evaluation.

These images were not part of the training set, which consists of eBay’s active listings, but they reflect the images our buyers actually use to search for eBay products. For each query (i.e., anchor image), we call a model and obtain its top 10 results, which gives us 10 × n images per model for our evaluation dataset.
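The collection step described above can be sketched as follows. This is a minimal illustration, not the actual pipeline: `build_evaluation_pairs`, the row layout, and the `fake_model` stand-in are hypothetical names, and a real model would be called through whatever search API is available.

```python
from typing import Callable, Dict, List

TOP_K = 10  # the article keeps the top 10 results per anchor image


def build_evaluation_pairs(
    query_images: List[str],
    models: Dict[str, Callable[[str, int], List[str]]],
    top_k: int = TOP_K,
) -> List[dict]:
    """Return one row per (model, anchor, rank) to send out for labeling."""
    rows = []
    for model_name, search in models.items():
        for anchor in query_images:
            # Each model is queried with the same fixed set of anchors.
            results = search(anchor, top_k)[:top_k]
            for rank, listing in enumerate(results, start=1):
                rows.append(
                    {"model": model_name, "anchor": anchor,
                     "rank": rank, "result": listing}
                )
    return rows


def fake_model(prefix: str) -> Callable[[str, int], List[str]]:
    """Hypothetical stand-in for a visual search model (assumption)."""
    def search(anchor: str, k: int) -> List[str]:
        return [f"{prefix}-{anchor}-{i}" for i in range(k)]
    return search


queries = ["q1.jpg", "q2.jpg", "q3.jpg"]  # n = 3 anchors for the demo
pairs = build_evaluation_pairs(
    queries, {"model_a": fake_model("a"), "model_b": fake_model("b")}
)
print(len(pairs))  # 2 models * 3 anchors * 10 results
```

Keeping the rank in every row matters later, because a rank-discounted metric needs to know the position at which each labeled result was shown.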

Adding the human to the loop

Once we have the evaluation dataset, we upload these images to FigureEight (formerly CrowdFlower), a crowd tagging platform that we use to collect responses on how well each result returned by a model matches the given anchor image (see Figure 1).

Figure 1. FigureEight demo.

Because judging image similarity is highly subjective, we use dynamic judgments to establish a confidence score for every pair of images in question. We start by asking three people the same question and reviewing their responses. If they all give the same answer, we keep it. If their answers differ, we ask two more people (for a total of five) to ensure high confidence in the response.
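The three-then-five rule can be illustrated with a short sketch. The function name `collect_judgments` and the `ask` callback are assumptions made for this example; on a real crowd platform the extra judgments would be requested through the platform's own job settings.

```python
from typing import Callable, List

INITIAL_JUDGMENTS = 3  # first round described in the article
MAX_JUDGMENTS = 5      # upper bound after the extra round


def collect_judgments(ask: Callable[[], str]) -> List[str]:
    """Ask three labelers; if they disagree, ask two more."""
    answers = [ask() for _ in range(INITIAL_JUDGMENTS)]
    if len(set(answers)) > 1:  # not unanimous
        answers += [ask() for _ in range(MAX_JUDGMENTS - INITIAL_JUDGMENTS)]
    return answers


# Scripted answers stand in for real labelers (assumption for the demo).
unanimous = iter(["Good Match"] * 3)
print(collect_judgments(lambda: next(unanimous)))

split = iter(["Good Match", "Fair Match", "Good Match",
              "Good Match", "Fair Match"])
print(collect_judgments(lambda: next(split)))
```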

Our evaluators are also tested while answering these questions. Every evaluator must pass a set of test questions, handpicked by our team, to qualify as a valid labeler. Their accuracy on these test questions is linked to their trust score. They must score at least 70% on the test to be accepted for the task. In addition to the pre-test, test questions are distributed throughout the task; if a labeler's trust score falls below our designated threshold of 0.7, that labeler is removed from the task.
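A minimal sketch of this gating logic is shown below. It assumes that the trust score is simply the labeler's running accuracy on all test questions seen so far; the article only says the two are linked, so the exact formula, the `Labeler` class, and its field names are assumptions.

```python
from dataclasses import dataclass

TRUST_THRESHOLD = 0.7  # threshold stated in the article


@dataclass
class Labeler:
    name: str
    correct: int = 0
    answered: int = 0

    def record(self, was_correct: bool) -> None:
        """Record the outcome of one test question."""
        self.answered += 1
        self.correct += int(was_correct)

    @property
    def trust(self) -> float:
        # Assumption: trust = accuracy on test questions seen so far.
        return self.correct / self.answered if self.answered else 0.0

    @property
    def is_trusted(self) -> bool:
        return self.trust >= TRUST_THRESHOLD


labeler = Labeler("labeler_1")
for outcome in [True] * 8 + [False] * 2:  # pre-test: 8 of 10 correct
    labeler.record(outcome)
passed_pretest = labeler.is_trusted

labeler.record(False)  # two hidden test questions missed during the task
labeler.record(False)
still_trusted = labeler.is_trusted
print(passed_pretest, still_trusted)
```

In this example the labeler qualifies with 8 of 10 correct, then drops to 8 of 12 after missing two hidden test questions and would be removed from the task.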

The overall confidence score for each answer is calculated from the level of agreement between labelers and their assigned trust scores.

For example, if labelers selected two different answers for the same question, we take the answer with the higher overall confidence score. Only questions with a confidence of at least 70% are evaluated (see Figure 2).

Figure 2. Confidence score
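The formula in Figure 2 is not reproduced in this text. One common trust-weighted agreement score, assumed here purely for illustration, divides the summed trust of the labelers who chose an answer by the summed trust of everyone who answered the question. The helper name `aggregate` and the sample trust values are hypothetical.

```python
from collections import defaultdict
from typing import List, Tuple

CONFIDENCE_THRESHOLD = 0.7  # threshold stated in the article


def aggregate(judgments: List[Tuple[str, float]]) -> Tuple[str, float]:
    """Return (winning answer, confidence) for one question.

    Each judgment is (answer, labeler trust score). Assumed formula:
    confidence = trust behind the answer / total trust on the question.
    """
    if not judgments:
        raise ValueError("at least one judgment is required")
    totals = defaultdict(float)
    for answer, trust in judgments:
        totals[answer] += trust
    total_trust = sum(totals.values())
    best = max(totals, key=totals.get)
    return best, totals[best] / total_trust


clear = [("Good Match", 0.9), ("Good Match", 0.8), ("Good Match", 0.95),
         ("Good Match", 0.85), ("Fair Match", 0.75)]
contested = [("Good Match", 0.9), ("Good Match", 0.8), ("Fair Match", 0.75),
             ("Good Match", 0.95), ("Fair Match", 0.7)]

for name, judgments in [("clear", clear), ("contested", contested)]:
    answer, confidence = aggregate(judgments)
    kept = confidence >= CONFIDENCE_THRESHOLD
    print(name, answer, round(confidence, 3), "kept" if kept else "dropped")
```

Under this assumed formula, a 4-to-1 split among similarly trusted labelers clears the 70% bar, while a 3-to-2 split does not, so the contested question would be excluded from scoring.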

Calculating the total score per model

We do this to obtain a score for each model we are evaluating, so that we can compare them fairly and decide which one users might prefer. We use DCG (Discounted Cumulative Gain), a standard metric for ranked results (see Figure 3).

Figure 3. Discounted cumulative gain.
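Figure 3 is not reproduced in this text. For reference, a commonly used form of DCG for a single query with ten ranked results is given below, where w_n denotes the weight of the labeled response for the result at rank n. This is the standard textbook variant and the exact form used in the figure may differ.

```latex
\mathrm{DCG}_{10} = \sum_{n=1}^{10} \frac{w_n}{\log_2(n + 1)}
```

The logarithmic discount means a good match at rank 1 contributes its full weight, while the same match at rank 3 contributes only half of it.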

The following table lists the weights we use.

Response	Description	Weight (w_n)
Good Match	Exact match or a very good substitute	1
Fair Match	Same product type, but with slight variation (e.g., color)	0.8
Bad Match	Same product type, but significant differences	0.3
Very Bad Match	Different product type	0

Once we have all the answers from the crowd, we plug the corresponding weights into the formula and accumulate the total score for each model. A higher score means the model produced more relevant search results over this 10 × n evaluation set, so that model is the one we choose.
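The scoring step can be sketched as follows, using the weights from the table and the standard log2-discounted form of DCG. Two choices here are assumptions rather than details from the article: results whose question was dropped for low confidence are represented as `None` and skipped while keeping their rank position, and per-query DCG values are summed to get the model total. The function names and sample labels are hypothetical.

```python
import math
from typing import Dict, List, Optional

# Weights from the table in the article.
WEIGHTS = {
    "Good Match": 1.0,
    "Fair Match": 0.8,
    "Bad Match": 0.3,
    "Very Bad Match": 0.0,
}


def dcg(labels: List[Optional[str]]) -> float:
    """DCG for one anchor image; labels are ordered by rank (1 = top)."""
    return sum(
        WEIGHTS[label] / math.log2(rank + 1)
        for rank, label in enumerate(labels, start=1)
        if label is not None  # assumption: skip low-confidence answers
    )


def model_score(labels_per_query: List[List[Optional[str]]]) -> float:
    """Accumulate DCG over all anchor images for one model."""
    return sum(dcg(labels) for labels in labels_per_query)


# Tiny made-up example with one anchor and three results per model.
crowd_labels: Dict[str, List[List[Optional[str]]]] = {
    "model_a": [["Good Match", "Fair Match", "Bad Match"]],
    "model_b": [["Bad Match", "Fair Match", "Good Match"]],
}
scores = {name: model_score(labels) for name, labels in crowd_labels.items()}
best_model = max(scores, key=scores.get)
print(scores, best_model)
```

Both models return the same three labels, but the one that places the good match first scores higher, which is the behavior a rank-discounted metric is meant to reward.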
