A Human-centric Approach for Evaluating Visual Search Models

Category: IT Technology · Published: 6 years ago

Image search is one of our central products. The user experience has shifted from textual search to visual search, where a user can either capture a live image or select an image from their device.

Using convolutional neural networks, we process a query image and find similar live eBay listings to suggest to the user. These suggested listings are ranked by their similarity to the query and then displayed. As we train these models, we face the challenge of evaluating their performance. How can we compare several visual search models and say which of them works better?

Since our main objective is to create compelling customer experiences, this article describes a method that tackles this problem directly through the eyes of the users.

Preparing the data set for the evaluation

We use a fixed set of n randomly selected user-loaded images as the query images for every model in this evaluation.

These images were not part of the training set, which consists of eBay’s active listings, but they reflect the images our buyers actually use to search for eBay products. For each query (i.e. anchor image) we call each model and keep its top 10 results, which gives 10 × n result images per model for our evaluation dataset.
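The collection step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `SearchModel` callable signature, the `build_evaluation_set` helper, and the use of string ids for images are all hypothetical and not taken from eBay's pipeline.

```python
from typing import Callable, Dict, List, Tuple

TOP_K = 10  # the article keeps the top 10 results per anchor image

# Hypothetical interface: a model maps an anchor image id to listing ids,
# already ranked from most to least similar.
SearchModel = Callable[[str], List[str]]


def build_evaluation_set(
    anchors: List[str],
    models: Dict[str, SearchModel],
    top_k: int = TOP_K,
) -> List[Tuple[str, str, int, str]]:
    """Return (model_name, anchor, rank, result) rows.

    With n anchors, each model contributes at most top_k * n rows.
    """
    rows = []
    for model_name, search in models.items():
        for anchor in anchors:
            # Keep only the first top_k ranked results; ranks start at 1.
            for rank, result in enumerate(search(anchor)[:top_k], start=1):
                rows.append((model_name, anchor, rank, result))
    return rows
```

Keeping the rank in every row matters later, because the scoring step discounts results by their position.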

Adding the human to the loop

Once we have the evaluation dataset, we upload these images to Figure Eight (formerly CrowdFlower), a crowd tagging platform that we use to collect responses on how well each model result matches the given anchor image (see Figure 1).

Figure 1. Figure Eight demo.

Since judging image similarity is highly subjective, we decided to use dynamic judgments in order to establish a confidence score for every pair of images in question. We start by asking three people the same question and reviewing their responses. If they all give the same answer, we keep it. If their answers differ, we ask two more people (for a total of up to five) to ensure high confidence in the response.
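The dynamic-judgment rule described above can be sketched as follows. This is a minimal illustration, not the platform's implementation: `collect_judgments` and the `ask` callback (which stands in for sending the question to one more labeler) are hypothetical names.

```python
from typing import Callable, List


def collect_judgments(
    ask: Callable[[], str],
    initial: int = 3,
    extra: int = 2,
) -> List[str]:
    """Ask `initial` labelers; if they are not unanimous, ask `extra` more."""
    answers = [ask() for _ in range(initial)]
    if len(set(answers)) > 1:  # at least two different answers were given
        answers += [ask() for _ in range(extra)]
    return answers
```

A unanimous first round stops at three judgments, so the extra cost is only paid for ambiguous image pairs.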

Our evaluators are also tested while they answer these questions. Every evaluator must pass a set of test questions, handpicked by our team, in order to qualify as a valid labeler. Their accuracy on these test questions determines their trust score. They must score at least 70% on this pre-test to be accepted for the task. In addition to the pre-test, test questions are distributed throughout the task; if a labeler's trust score falls below our designated threshold of 0.7, that labeler is removed from the task.
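As a rough sketch of this gating rule, the trust score can be modeled as plain accuracy on the test questions seen so far. That definition is an assumption made for illustration (the article only says that accuracy is "linked to" the trust score), and `trust_score` and `is_trusted` are hypothetical helper names.

```python
TRUST_THRESHOLD = 0.7  # threshold stated in the article


def trust_score(correct: int, answered: int) -> float:
    """Assumed definition: fraction of test questions answered correctly."""
    return correct / answered if answered else 0.0


def is_trusted(correct: int, answered: int,
               threshold: float = TRUST_THRESHOLD) -> bool:
    """A labeler stays in the task while the trust score is at least the threshold."""
    return trust_score(correct, answered) >= threshold
```

The same check covers both stages: it is applied once after the pre-test and again each time a hidden test question is answered during the task.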

The overall confidence score for each answer is calculated from the level of agreement between labelers and their assigned trust scores.

For example, if two different answers were selected for the same question, we take the answer with the higher overall confidence score. Only questions with a confidence of at least 70% are evaluated (see Figure 2).

Figure 2. Confidence score
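Since the figure is not reproduced here, the following Python sketch shows one common way to combine agreement and trust. It assumes that the confidence of an answer is the summed trust of the labelers who chose it divided by the summed trust of all labelers who answered the question. That formula is an assumption rather than the article's confirmed definition, and `aggregate` is a hypothetical helper.

```python
from collections import defaultdict
from typing import List, Optional, Tuple

MIN_CONFIDENCE = 0.7  # questions below this confidence are not evaluated


def aggregate(
    judgments: List[Tuple[str, float]],
    min_confidence: float = MIN_CONFIDENCE,
) -> Optional[Tuple[str, float]]:
    """judgments: (answer, labeler_trust) pairs for one question.

    Returns (winning_answer, confidence), or None if the question is
    dropped because its confidence is below min_confidence.
    """
    trust_per_answer = defaultdict(float)
    for answer, trust in judgments:
        trust_per_answer[answer] += trust
    total_trust = sum(trust_per_answer.values())
    if total_trust == 0:
        return None
    # Assumed formula: trust-weighted share of labelers behind the top answer.
    best = max(trust_per_answer, key=trust_per_answer.get)
    confidence = trust_per_answer[best] / total_trust
    return (best, confidence) if confidence >= min_confidence else None
```

Under this assumption, three labelers who agree always yield a confidence of 1, while a five-way split between answers usually falls below the 70% cutoff and the question is excluded.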

Calculating the total score per model

We do this to obtain a score for each model we are evaluating, so that we can compare them fairly and decide which one users might prefer. We use DCG (Discounted Cumulative Gain), a standard metric for ranked results (see Figure 3).

Figure 3. Discounted cumulative gain.

The weights we are using are described in the following table.

| Response | Description | Weight (w_n) |
| --- | --- | --- |
| Good Match | Exact match or a very good substitute | 1 |
| Fair Match | Same product type, but with slight variation (e.g. color) | 0.8 |
| Bad Match | Same product type, but significant differences | 0.3 |
| Very Bad Match | Different product type | 0 |

Once we have all the answers from the crowd, we plug the corresponding weights into the formula and accumulate the total score for each model. A higher score means the model produced more relevant search results on this 10 × n evaluation set, so that model is the one we choose.
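The scoring step can be sketched in Python using the weights from the table above. The sketch assumes the common form of DCG, in which the weight w_n of the result at rank n is divided by log2(n + 1); the article's exact formula is in a figure that is not reproduced here. It also assumes that results dropped for low confidence are simply skipped. `dcg` and `model_score` are hypothetical helper names.

```python
import math
from typing import List, Optional

# Weights from the table in the article.
WEIGHTS = {
    "Good Match": 1.0,
    "Fair Match": 0.8,
    "Bad Match": 0.3,
    "Very Bad Match": 0.0,
}


def dcg(labels: List[Optional[str]]) -> float:
    """DCG for one anchor image.

    labels[i] is the crowd answer for the result at rank i + 1, or None
    if that result was dropped for low confidence (assumed handling).
    """
    return sum(
        WEIGHTS[label] / math.log2(rank + 1)
        for rank, label in enumerate(labels, start=1)
        if label is not None
    )


def model_score(per_anchor_labels: List[List[Optional[str]]]) -> float:
    """Total score for a model: the sum of DCG over all anchor images."""
    return sum(dcg(labels) for labels in per_anchor_labels)
```

Because of the logarithmic discount, a good match at rank 1 contributes more than the same match at rank 10, so two models that return the same items in a different order receive different scores.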

