Getting started with Computer Vision Datasets: a 5-step primer

The why/when/what/where/which of CV datasets in the age of AI

Dylan Seychell

May 21 · 11 min read

Just like we need material such as textbooks/blogs/videos to learn new skills and test our knowledge, machine learning algorithms need datasets to do the same thing.

Aren’t postcards our version of real-life datasets to learn how to recognise a place? (Image by Hector Rivas on Unsplash)

The choice of a dataset is crucial. It’s precisely what stands between an outstanding machine learning model and just another experiment.

There are plenty of excellent articles about text-based datasets. Over the past years of lecturing on computer vision topics, I have noticed students struggling to get their heads around the why/when/what/which/where of computer vision datasets.

So here’s the primer I usually give to those getting started:

Why do we need a dataset?
When do we need a dataset?
What do we measure?
Which datasets are available?
Where do we find datasets?

Let’s start.

1- Why do we need a dataset?

By definition, a dataset is a collection of related examples that are used to train and test a model. This can be a selection of examples belonging to a particular topic or domain, and a dataset generally aims to cater to one or more applications. A dataset may be labelled and therefore ideal for training and testing supervised models. However, there are also unlabelled datasets that are used to train unsupervised models.

Train and Test

From a machine learning perspective, we need datasets to train models and subsequently test them. This process requires us to choose a part of the dataset (e.g. 70% of it) and ‘show’ it to the machine learning algorithm for learning purposes. We then take the remaining unseen examples in the dataset (e.g. the remaining 30%) and use them to test how well the model has learnt. It is crucial that we don’t test with examples that were already used for training, since the model would be predicting something it already knows. The resulting score would be inflated and would hide ‘overfitting’, where a model memorises its training examples instead of learning to generalise. This is something we want to avoid because such a model is likely to fail once it is used on different data. There are various methods for organising the train-test set, and you can take a look at these examples.
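As a rough illustration of the 70/30 idea, here is a minimal sketch using only the Python standard library. The function name, the seed and the file names are made up for this example; in practice a library helper would usually do this job.

```python
import random


def train_test_split(examples, train_fraction=0.7, seed=0):
    """Shuffle a copy of `examples` and split it into two disjoint lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical dataset: ten image file names.
dataset = [f"img_{i:03d}.jpg" for i in range(10)]
train_set, test_set = train_test_split(dataset)

print(len(train_set), "training examples,", len(test_set), "test examples")
```

Because both parts are slices of the same shuffled list, no example can land in both the training set and the test set.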

Benchmarking

Datasets also serve as a measurement tool when it comes to the performance of machine learning techniques. A selection of models performing the same task needs to be compared fairly. This is carried out by running the different methods on the same range of datasets. The performance measurements of the methods are then directly comparable, which allows for a neat comparison of results.

Ali Borji carried out and published an outstanding set of benchmarking exercises on Saliency techniques. These are some of his papers that I recommend to my students:

Salient object detection: A survey (2019)
Revisiting Video Saliency: A Large-scale Benchmark and a New Model (2018)
Salient Object Detection: A Benchmark (2015)

Sidenote: Understand Bias

Bias is a vast topic in itself. There are some critical matters that we need to keep in mind.

Just like any other source of information, datasets carry within them an inherent level of bias.

This might not necessarily have negative implications, especially if you want your model to survive the test of relevance in an already biased world. However, it’s very important that we are aware of any bias and measure any implications.

2- When do we need a dataset?

The aim of this article is not to focus on specific computer vision techniques. However, I’ll quickly walk you through a selection of topics and highlight the need for a dataset.

Object Detection and Recognition

Object Detection deals with identifying and locating objects of certain classes in an image. The object’s location can be expressed in various ways. A commonly used approach in dataset annotation is to draw a bounding box or polygon around the object, as discussed below. Such an annotation allows the dataset to be used for detection. The same dataset can then be used for recognition if every annotation is accompanied by a label. Once objects are located, the annotations can also be extended to mark every pixel in the image that belongs to the object (segmentation).

Object Segmentation

Segmentation is the process of partitioning an image into multiple segments (sets of pixels) that correspond to a specific region or object. This can be applied to objects using thresholding techniques such as Otsu’s method.
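To make thresholding concrete, here is a minimal pure-Python sketch of Otsu’s method. It assumes the image is given as a flat list of 8-bit grayscale values, and the sample pixel values are invented for illustration. The method picks the threshold that maximises the variance between the two resulting classes.

```python
def otsu_threshold(pixels):
    """Return the 8-bit threshold that maximises the between-class variance."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for level in range(256):
        weight_bg += histogram[level]        # pixels with value <= level
        sum_bg += level * histogram[level]
        weight_fg = total - weight_bg        # pixels with value > level
        if weight_bg == 0 or weight_fg == 0:
            continue                         # one class is empty, nothing to compare
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_threshold, best_variance = level, variance
    return best_threshold


# Made-up image: four dark pixels followed by four bright ones.
pixels = [10, 12, 11, 13, 200, 205, 210, 198]
threshold = otsu_threshold(pixels)
mask = [value > threshold for value in pixels]  # True marks the bright segment
print(threshold, mask)
```

For real images one would normally call an image-processing library instead; the loop above only shows what such a call computes.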

Segmentation can also make use of features. Modern approaches make use of deep learning methods, where models are trained on datasets containing thousands of pixel-level annotations. These approaches include Semantic Segmentation (region selection accompanied by a label) and Instance Segmentation (semantic segmentation that identifies multiple separate objects per class).

Visual Saliency

Visual saliency is a less popular area in computer vision that answers the following question: Which part of the image attracts more attention? Saliency detection techniques receive a colour image as input and return an 8-bit saliency map in which a brighter pixel value (up to 255) indicates a more salient pixel. Visual saliency is used in different applications ranging from data compression to product placement and image manipulation. Datasets such as the MSRA10K featured below provide a binary image as ground truth that indicates which pixels are salient and which are not.
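One simple way to compare an 8-bit saliency map with a binary ground-truth mask is to binarise the map and count the agreements. The sketch below assumes both images are flat lists of equal length. The threshold of 128 and the sample values are arbitrary choices for illustration, not part of any particular benchmark.

```python
def precision_recall(saliency_map, ground_truth, threshold=128):
    """Binarise an 8-bit saliency map and score it against a binary mask."""
    predicted = [value >= threshold for value in saliency_map]
    truth = [bool(value) for value in ground_truth]

    true_pos = sum(p and t for p, t in zip(predicted, truth))
    false_pos = sum(p and not t for p, t in zip(predicted, truth))
    false_neg = sum(t and not p for p, t in zip(predicted, truth))

    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


# Made-up six-pixel example: saliency values and the matching ground truth.
saliency_map = [250, 200, 130, 90, 40, 10]
ground_truth = [1, 1, 0, 1, 0, 0]
precision, recall = precision_recall(saliency_map, ground_truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```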

3- What do we measure?

The type and quality of annotations available in a dataset are crucial to its relevance. In this section, I’ll quickly walk you through the main types of annotations. Credit goes to @jiayin_Supahands for her neat outline of this aspect, and I encourage you to read her article. Here, I’m only giving an overview of the most commonly used annotations and their relation to the topic.

Bounding Boxes

The bounding box is the simplest type of annotation and involves drawing a box around an object of interest. It is generally defined by a pair of coordinates and a corresponding width and height. The bounding box definition often needs to be accompanied by a label if used for classification or recognition. The main drawback of using a bounding box is that it labels any background pixels caught inside the box in the same way as target object pixels. From an error metric perspective, it can be helpful for tracking recall but is weak for precision, which creates the need for something more specific.
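The drawback can be made tangible with a small sketch. It assumes a binary object mask stored as a list of rows and an (x, y, width, height) box convention; the helper names and the tiny plus-shaped object are made up for illustration.

```python
def bounding_box(mask):
    """Return (x, y, width, height) of the tightest box around the object pixels."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    x, y = cols[0], rows[0]
    return x, y, cols[-1] - x + 1, rows[-1] - y + 1


def background_fraction(mask, box):
    """Fraction of pixels inside `box` that do not belong to the object."""
    x, y, width, height = box
    inside = [mask[r][c] for r in range(y, y + height) for c in range(x, x + width)]
    return 1 - sum(inside) / len(inside)


# Made-up 5x5 mask containing a plus-shaped object (1 = object pixel).
mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
box = bounding_box(mask)
wasted = background_fraction(mask, box)
print(box, f"{wasted:.0%} of the box is background")
```

Even for this small shape, a sizeable share of the box is background that ends up carrying the object’s label.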

Polygons

The limitation of bounding boxes brings along the need for something more precise: polygon annotation. The idea of polygon annotation is similar to the bounding box but allows for better pixel precision in labelling by reducing the number of background pixels being mislabelled. A tool such as LabelMe is required for such an annotation. LabelMe is an open-source online annotation tool to build image databases for computer vision research. It also offers its own datasets.
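A quick way to see the gain is to compare the area covered by a polygon with the area of the box that encloses it. This sketch assumes the polygon is a list of (x, y) vertices in drawing order and uses the shoelace formula; the triangle is a made-up annotation.

```python
def polygon_area(vertices):
    """Area of a simple polygon via the shoelace formula."""
    twice_area = 0.0
    for (x1, y1), (x2, y2) in zip(vertices, vertices[1:] + vertices[:1]):
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2


def enclosing_box_area(vertices):
    """Area of the axis-aligned box that encloses all vertices."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


# Hypothetical triangular annotation.
triangle = [(0, 0), (4, 0), (0, 3)]
poly_area = polygon_area(triangle)
box_area = enclosing_box_area(triangle)
print(f"polygon covers {poly_area / box_area:.0%} of its bounding box")
```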

Line Annotations

As the name implies, this approach uses lines to annotate specific regions in an image. Lines can be useful in situations where a bounding box would cover a substantial area of irrelevant pixels. Lane detection is a natural use case for such an annotation. It can also be used for monitoring queues and in quality control situations.

Point Annotations

These annotations are a specification of groups of keypoints on an image, often carrying a semantic connotation. This approach is very commonly used for pose estimation and facial recognition. The geometrical properties between different points are used as features, and machine learning algorithms are trained using these features. This approach was used in our recent work titled “Detecting abnormal human behaviour through a video generated model”, published in 2019.
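As a generic illustration of turning keypoints into geometric features (not the method from the paper mentioned above), the sketch below computes a joint angle and a limb length from three hypothetical keypoints given as (x, y) coordinates.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at keypoint `b`, between the segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding errors
    return math.degrees(math.acos(cosine))


# Hypothetical keypoints of one arm, in image coordinates.
keypoints = {"shoulder": (0.0, 0.0), "elbow": (1.0, 0.0), "wrist": (1.0, 1.0)}

elbow_angle = joint_angle(keypoints["shoulder"], keypoints["elbow"], keypoints["wrist"])
forearm_length = math.hypot(
    keypoints["wrist"][0] - keypoints["elbow"][0],
    keypoints["wrist"][1] - keypoints["elbow"][1],
)
features = [elbow_angle, forearm_length]  # a tiny feature vector for a classifier
print(features)
```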

4- Which datasets are available?

Well, plenty :)

There are dozens of remarkable computer vision datasets that were crucial to the development of models that are changing the world. In this section, I am focusing on a selection of landmark datasets that every computer vision professional should know about.

Image-Net

Official Website : http://www.image-net.org/

Image-Net is the legendary computer vision dataset that contributed to the rise of deep learning. It is an image database organised according to the WordNet hierarchy, where each meaningful concept in WordNet, possibly described by multiple words, is called a “synonym set” or “synset”. Image-Net is generally used for object classification/recognition. This dataset contains a total of 14,197,122 images, of which 1,034,908 have bounding box annotations.

This dataset gained its popularity through the Image-Net competition, which is also where deep learning gained its traction after AlexNet won in 2012. Image-Net was founded by Fei-Fei Li, and she shared the remarkable journey behind this dataset in the TED talk I’m featuring below:

No matter how experienced you think/feel you are in computer vision, I strongly advise you to invest some time in listening to this inspirational talk. Even though techniques have advanced since its release in 2015, the mindset and humility presented in this video are still highly relevant.

MNIST

Original Numbers MNIST: http://yann.lecun.com/exdb/mnist/
Fashion MNIST: https://github.com/zalandoresearch/fashion-mnist

The original MNIST dataset, led by Yann LeCun, consists of a large volume of images of handwritten digits. It served the vital role of providing a much-needed, easy-access benchmark for early convolutional neural networks. By 2017, CNNs consistently achieved outstanding accuracy (over 99%) on MNIST, and the need for a more challenging benchmark dataset arose. This served as a motivation for the Fashion MNIST dataset. The latter includes a training set of 60,000 examples and a test set of 10,000 examples, where every example is a 28x28px grayscale image of a fashion item from one of 10 different classes.

CIFAR-10

Official Website : https://www.cs.toronto.edu/~kriz/cifar.html

This dataset was released by the Canadian Institute For Advanced Research (CIFAR) and probably gained some of its popularity through the involvement of Geoffrey Hinton and his associates. The CIFAR-10 dataset contains 60,000 32x32px colour images in 10 different classes. It is used for training and testing object recognition models.

COCO

Official Website : http://cocodataset.org/

The Common Objects in Context (COCO) dataset is an object detection, segmentation, and captioning dataset. COCO 2017 has a training and validation collection of 123,287 images containing a total of 886,284 instances. These instances are spread over 80 object categories.
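To give a feel for how instances and categories relate, here is a small sketch that counts instances per category in a COCO-style annotation dictionary. The field names follow the COCO annotation format as I understand it (with `bbox` stored as [x, y, width, height]); the three annotations are invented sample data, not taken from the dataset.

```python
from collections import Counter

# Invented sample in the style of a COCO annotation file.
coco = {
    "images": [{"id": 1, "file_name": "000001.jpg", "width": 640, "height": 480}],
    "categories": [{"id": 1, "name": "person"}, {"id": 18, "name": "dog"}],
    "annotations": [
        {"id": 101, "image_id": 1, "category_id": 1, "bbox": [10, 20, 100, 200]},
        {"id": 102, "image_id": 1, "category_id": 1, "bbox": [300, 40, 90, 210]},
        {"id": 103, "image_id": 1, "category_id": 18, "bbox": [150, 250, 120, 80]},
    ],
}

# Map category ids to readable names, then count one instance per annotation.
category_names = {category["id"]: category["name"] for category in coco["categories"]}
instances_per_category = Counter(
    category_names[annotation["category_id"]] for annotation in coco["annotations"]
)
print(dict(instances_per_category))
```

With a real annotation file, the dictionary would typically come from `json.load` on the downloaded file instead of being written inline.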

Face2Text

Official Website : https://rival.research.um.edu.mt/

There is a significant number of datasets covering different sorts of facial data. Here, I chose to feature a new and innovative dataset compiled by my colleagues at the University of Malta. Unlike other facial detection or recognition datasets, this one is annotated using descriptive text. This allows machine learning models to be trained to return a textual description of a face given just an image. The full details of the publication introducing this dataset can be found here, and the dataset itself may be acquired by filling in the contact form on the official website of this project.

MSRA10K

Official Website : https://mmcheng.net/msra10k/

This is a Salient Object Image Database. Every image in this dataset has a mask for the most salient region in the image. The MSRA10K dataset gained its relevance from the volume of images it contains. It consists of 10,000 colour images with a corresponding binary image mask for the salient object.

MSR 3D

Official Website : https://www.microsoft.com/en-us/download/details.aspx?id=52358

The Microsoft Research Dataset (MSR) includes a sequence of 100 images (colour and depth) captured from 8 cameras showing breakdancing and ballet scenes. The dataset contains frames for each scene, and every frame has a colour image and a high-quality grayscale depth image, captured by an infrared camera.

COTS

Official Website : www.cotsdataset.info

This is a dataset I carefully designed and built last year to evaluate image manipulation techniques. One such application is inpainting, where an object is removed from an image. Inpainting techniques are usually evaluated using a subjective or opinion-based approach because datasets lack adequate ground truth. This served as the motivation behind this dataset, which has a series of progressive scenes as demonstrated below. Further details about this dataset and the experience behind its construction will be shared in separate work.

5- Where do we find datasets?

In academia, you’ll typically come across datasets in peer-reviewed publications about your topic of interest. However, you sometimes just need to browse your options, and for that, you need a good platform. Here follow my 4 favourite sources:

Google Dataset Search

datasetsearch.research.google.com

Pros : Very extensive
Cons : Easy to get lost when comparing different datasets.

VisualData

www.visualdata.io

Pros : Focused on computer vision datasets, excellent interface, easy to use and quick to get to direct repositories.
Cons : Still limited in terms of selection of available datasets.

Kaggle

www.kaggle.com

Pros : Variety of datasets across different domains, active community, competitions.
Cons : Can take longer to see what each dataset offers.

TensorFlow

www.tensorflow.org

Pros : An extensive selection of straight to the point pages for every dataset. Every dataset is also accompanied by excellent usage resources.
Cons : For completeness, I need to squeeze out a drawback. In this case, the disadvantage is that (obviously) this website only provides TensorFlow resources.
