Getting started with Computer Vision Datasets: a 5-step primer



The why/when/what/where/which of CV datasets in the age of AI

Just like we need material such as textbooks/blogs/videos to learn new skills and test our knowledge, machine learning algorithms need datasets to do the same thing.

Aren’t postcards our version of real-life datasets to learn how to recognise a place? (Image by Hector Rivas on Unsplash)

The choice of a dataset is crucial. It’s precisely what stands between an outstanding machine learning model and just another experiment.

There are plenty of excellent articles about text-based datasets. Over the past years of lecturing on computer vision topics, I have noticed students struggling to get their heads around the what/when/where/how of computer vision datasets.

So here’s the primer I usually give to those getting started:

  1. Why do we need a dataset?
  2. When do we need a dataset?
  3. What do we measure?
  4. Which datasets are available?
  5. Where do we find datasets?

Let’s start.

1- Why do we need a dataset?

By definition, a dataset is a collection of related examples that are used to train and test a model. This can be a selection of examples belonging to a particular topic or domain, and a dataset generally aims to cater to one or more applications. A dataset may be labelled and, therefore, ideal for training and testing supervised models. However, there are also unlabelled datasets that are used to train unsupervised models.

Train and Test

From a machine learning perspective, we need datasets to train models and subsequently test them. This process requires us to choose a part of the dataset (e.g. 70% of it) and ‘show’ it to the machine learning algorithm so that it can learn. We then take the remaining unseen examples in the dataset (e.g. the remaining 30%) and use them to test how well the model has learnt. It is crucial that we don’t test with examples that were already used for training, since the model would then be predicting something it already knows. The resulting score would be overly optimistic and would hide ‘overfitting’, where a model memorises its training examples instead of learning patterns that generalise. We want to avoid this because such a model is likely to fail once it is used on different data. There are various methods for organising the train-test split, and you can take a look at these examples.
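As a minimal sketch of the 70/30 split described above, the snippet below shuffles a small made-up dataset and cuts it into a training and a test portion. The function name `train_test_split`, the file names and the fixed seed are assumptions for illustration only; libraries such as scikit-learn ship their own, more complete, splitting utilities.

```python
import random


def train_test_split(examples, train_fraction=0.7, seed=42):
    """Shuffle a copy of `examples` and split it into train and test lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is repeatable
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical dataset: 10 (image_name, label) pairs.
dataset = [(f"img_{i:02d}.jpg", i % 2) for i in range(10)]
train_set, test_set = train_test_split(dataset)

print(len(train_set), len(test_set))
# No example should appear in both sets.
print(set(train_set).isdisjoint(test_set))
```

Shuffling before cutting matters whenever the dataset is stored in some order (for example, grouped by class), since otherwise the test portion may contain classes the model never saw.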


Datasets also serve as a measurement tool for the performance of machine learning techniques. A selection of models performing the same task needs to be compared fairly. This is done by running the different methods on the same range of datasets, so that the performance measurements of each method are directly comparable.

Ali Borji carried out and published an outstanding set of benchmarking exercises on Saliency techniques. These are some of his papers that I recommend to my students:

  1. Salient object detection: A survey (2019)
  2. Revisiting Video Saliency: A Large-scale Benchmark and a New Model (2018)
  3. Salient Object Detection: A Benchmark (2015)

Sidenote: Understand Bias

Bias is a vast topic in itself. There are some critical matters that we need to keep in mind.

Just like any other source of information, datasets carry within them an inherent level of bias.

This might not necessarily have negative implications, especially if you want your model to survive the test of relevance in an already biased world. However, it’s very important that we are aware of any bias and measure its implications.

2- When do we need a dataset?

The aim of this article is not to focus on specific computer vision techniques. However, I’ll quickly walk you through a selection of topics and highlight the need for a dataset.

Object Detection and Recognition

Object Detection deals with identifying and locating objects of certain classes in an image. Object localisation can be represented in various ways. A commonly used approach in dataset annotation is to draw a bounding box or polygon around the object, as discussed below. Such an annotation allows the dataset to be used for detection. The same dataset can then be used for recognition if every annotation is accompanied by a label. Once objects are outlined, the annotations can also be used to mark every pixel in the image that belongs to the object (segmentation).

Object Segmentation

Segmentation is the process of partitioning an image into multiple segments (sets of pixels) that correspond to a specific region or object. A classical way of doing this is thresholding, for example with Otsu’s method, which picks the intensity threshold that best separates the pixels into two classes.
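To make thresholding concrete, here is a minimal pure-Python sketch of Otsu’s method: it tries every intensity level and keeps the one that maximises the between-class variance. The flattened list of pixel values is a made-up example, and a real pipeline would more likely call an existing implementation from an image library.

```python
def otsu_threshold(pixels, levels=256):
    """Return the threshold t that maximises the between-class variance.

    Pixels with value <= t form one class, pixels with value > t the other.
    """
    hist = [0] * levels
    for value in pixels:
        hist[value] += 1

    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_variance = 0, -1.0
    weight0 = 0  # number of pixels with value <= t
    sum0 = 0     # sum of the intensities of those pixels
    for t in range(levels):
        weight0 += hist[t]
        sum0 += t * hist[t]
        weight1 = total - weight0
        if weight0 == 0 or weight1 == 0:
            continue  # one class is empty, so there is nothing to separate
        mean0 = sum0 / weight0
        mean1 = (sum_all - sum0) / weight1
        variance = weight0 * weight1 * (mean0 - mean1) ** 2
        if variance > best_variance:
            best_variance, best_t = variance, t
    return best_t


# Hypothetical flattened grayscale image: dark background, bright object.
pixels = [10, 12, 11, 10, 200, 210, 205, 200]
threshold = otsu_threshold(pixels)
mask = [1 if value > threshold else 0 for value in pixels]
print(threshold, mask)
```

The resulting `mask` is a binary segmentation: 1 for pixels assigned to the bright object, 0 for the background.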

Segmentation can also make use of features. Modern approaches rely on deep learning methods where models are trained on datasets containing thousands of pixel-level annotated labels. These approaches include Semantic Segmentation (region selection accompanied by a label) and Instance Segmentation (semantic segmentation that identifies multiple separate objects per class).

Visual Saliency

Visual saliency is a less popular area in computer vision that answers the following question: which part of the image attracts the most attention? Saliency detection techniques receive a colour image as input and return an 8-bit saliency map in which the brighter the pixel (up to a maximum of 255), the more salient it is. Visual saliency is used in applications ranging from data compression to product placement and image manipulation. Datasets such as MSRA10K, featured below, provide a binary image as ground truth that indicates which pixels are salient.
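The sketch below shows one simple way a binary ground-truth mask can be used to score an 8-bit saliency map: binarise the map and count how the predicted salient pixels line up with the annotated ones. The fixed threshold of 128, the function name and the tiny flattened images are assumptions for illustration; published benchmarks typically sweep over many thresholds.

```python
def evaluate_saliency(saliency_map, ground_truth, threshold=128):
    """Binarise an 8-bit saliency map and score it against a binary mask."""
    tp = fp = fn = 0
    for saliency, truth in zip(saliency_map, ground_truth):
        predicted = saliency >= threshold
        if predicted and truth:
            tp += 1  # salient pixel correctly found
        elif predicted and not truth:
            fp += 1  # background pixel wrongly marked salient
        elif truth:
            fn += 1  # salient pixel that was missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical flattened 8-bit saliency map and its binary ground truth.
saliency_map = [250, 200, 130, 40, 10, 180]
ground_truth = [1, 1, 0, 0, 0, 1]

precision, recall = evaluate_saliency(saliency_map, ground_truth)
print(precision, recall)
```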

3- What do we measure?

The type and quality of annotations available in a dataset are crucial to its relevance. In this section, I’ll quickly walk you through the main types of annotations. Credit goes to @jiayin_Supahands for her neat outline of this aspect, and I encourage you to read her article. Here, I’m only giving an overview of the most commonly used annotations and their relation to the topic.

Bounding Boxes

The bounding box is the simplest type of annotation and naturally involves drawing a box around an object of interest. It is generally defined by a pair of coordinates and the corresponding width and height. The bounding box definition often needs to be accompanied by a label if used for classification or recognition. The main drawback of using a bounding box is that it labels any background pixels caught inside the box in the same way as the target object’s pixels. From an error metric perspective, it can be helpful for tracking recall, but it is weak on precision, which creates the need for something more specific.
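The recall-versus-precision point can be checked on a toy example. The sketch below fits the tightest box, stored as (x, y, width, height), around a made-up plus-shaped object and counts pixels; the object, the helper name `box_pixels` and the pixel-level definition of precision and recall are assumptions chosen for illustration.

```python
def box_pixels(x, y, width, height):
    """Return the set of (column, row) pixels covered by a bounding box."""
    return {(col, row)
            for row in range(y, y + height)
            for col in range(x, x + width)}


# Hypothetical plus-shaped object, given as (column, row) pixel coordinates.
object_pixels = {(2, 1), (1, 2), (2, 2), (3, 2), (2, 3)}

# Tightest bounding box around the object: top-left corner plus size.
xs = [col for col, _ in object_pixels]
ys = [row for _, row in object_pixels]
x, y = min(xs), min(ys)
width, height = max(xs) - x + 1, max(ys) - y + 1

box = box_pixels(x, y, width, height)
true_positives = len(box & object_pixels)

recall = true_positives / len(object_pixels)  # object pixels inside the box
precision = true_positives / len(box)         # box pixels that are object
print((x, y, width, height), recall, precision)
```

Every object pixel falls inside the box, so recall is perfect, while the four corner pixels of the box are background and pull precision down.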

An easy example of using a bounding box annotation (Source: Jiayin)


The limitation of bounding boxes brings along the need for something more precise: polygon annotation. The idea of polygon annotation is similar to the bounding box, but it allows for better pixel precision in labelling by reducing the number of background pixels being mislabelled. A tool such as LabelMe is required for such an annotation. LabelMe is an open-source online annotation tool to build image databases for computer vision research. It also offers its own datasets.
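As a rough illustration of how much background a polygon can save, the sketch below computes the area of a made-up triangular annotation with the shoelace formula and compares it with the area of its bounding box. It works with continuous areas rather than counting pixels, and the vertex list is purely hypothetical.

```python
def polygon_area(vertices):
    """Area of a simple polygon from its (x, y) vertices (shoelace formula)."""
    twice_area = 0.0
    count = len(vertices)
    for i in range(count):
        x1, y1 = vertices[i]
        x2, y2 = vertices[(i + 1) % count]  # wrap around to the first vertex
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def bounding_box_area(vertices):
    """Area of the axis-aligned box that encloses all vertices."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


# Hypothetical triangular object outline.
polygon = [(0, 0), (4, 0), (0, 4)]

poly_area = polygon_area(polygon)
box_area = bounding_box_area(polygon)
background_fraction = 1 - poly_area / box_area
print(poly_area, box_area, background_fraction)
```

For this triangle, half of the bounding box would be background that a box annotation labels as object.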

Examples of Polygon Annotation from the official LabelMe Website

Line Annotations

As the name implies, this approach uses lines to annotate specific regions in an image. Lines can be useful in situations where a bounding box would cover a substantial area of irrelevant pixels. Lane detection is a straightforward use case for such an annotation. It can also be used for monitoring queues and in quality control situations.

Line annotation being used for lane detection (Source: Jiayin)

Point Annotations

These annotations specify groups of keypoints on an image, often carrying a semantic connotation. This approach is very commonly used for pose estimation and facial recognition. The geometrical properties between different points are used as features, and machine learning algorithms are trained using these features. This approach was used in our recent work titled “Detecting abnormal human behaviour through a video generated model”, published in 2019.
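To give a feel for what such geometrical features can look like, the sketch below derives a limb length and a joint angle from three keypoints. The keypoint names and coordinates are made up, and this is only an illustrative pair of features, not the feature set used in the cited work.

```python
import math


def joint_angle(a, b, c):
    """Angle in degrees at keypoint b, formed by the segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    cosine = max(-1.0, min(1.0, dot / norm))  # guard against rounding error
    return math.degrees(math.acos(cosine))


# Hypothetical (x, y) keypoints for one arm.
keypoints = {"shoulder": (0.0, 0.0), "elbow": (1.0, 0.0), "wrist": (1.0, 1.0)}

upper_arm_length = math.dist(keypoints["shoulder"], keypoints["elbow"])
elbow_angle = joint_angle(keypoints["shoulder"],
                          keypoints["elbow"],
                          keypoints["wrist"])

# A feature vector like this could be fed to a classifier.
features = [upper_arm_length, elbow_angle]
print(features)
```

Angles and ratios of distances are popular choices because they stay the same when the person appears larger or smaller in the image.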

Do you feel like something practical? Check this excellent pose estimation TensorFlow tutorial, which is the source of this image.

4- Which datasets are available?

Well, plenty :)

There are dozens of remarkable computer vision datasets that were crucial to the development of models that are changing the world. In this section, I am focusing on a selection of landmark datasets that every computer vision professional should know about.


Official Website :

ImageNet is the legendary computer vision dataset that contributed to the rise of deep learning. It is an image database organised according to the WordNet hierarchy, where each meaningful concept in WordNet, possibly described by multiple words, is called a “synonym set” or “synset”. ImageNet is generally used for object classification/recognition. This dataset contains a total of 14,197,122 images, of which 1,034,908 have bounding box annotations.

This dataset gained its popularity through the ImageNet competition, which is also where deep learning gained traction after AlexNet won in 2012. It was founded by Fei-Fei Li, who shared the remarkable journey behind this dataset in a TED talk.

No matter how experienced you think/feel you are in computer vision, I strongly advise you to invest some time in listening to this inspirational talk. Even though techniques have advanced since its release in 2015, the mindset and humility presented in this video are still highly relevant.


The original MNIST dataset, led by Yann LeCun, consisted of a large volume of images of handwritten digits. It served the vital role of providing a much-needed, easily accessible benchmark for early convolutional neural networks. By 2017, CNNs were consistently achieving outstanding accuracy (over 99%) on MNIST, and the need for a more challenging benchmark dataset arose. This served as a motivation for the Fashion MNIST dataset. The latter includes a training set of 60,000 examples and a test set of 10,000 examples, where every example is a 28x28px grayscale image of a fashion item from 10 different classes.

This is a cool visualisation of the Fashion MNIST database from the official GitHub repository.


Official Website :

This dataset was released by the Canadian Institute For Advanced Research (CIFAR) and probably gained some of its popularity through the involvement of Geoffrey Hinton and his associates. The CIFAR-10 dataset contains 60,000 32x32px colour images in 10 different classes. It is used for training and testing object recognition models.

A selection of images from the 10 classes of the CIFAR-10 dataset (Source: CIFAR website)


Official Website :

The Common Objects in Context (COCO) dataset is an object detection, segmentation, and captioning dataset. The COCO 2017 has a training and validation collection of 123,287 images containing a total of 886,284 instances. These instances are spread over 80 object categories.
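To show what “instances spread over categories” looks like in practice, the sketch below parses a tiny hand-written annotation snippet laid out in the style of COCO’s JSON format (`images`, `categories` and `annotations`, with each `bbox` stored as [x, y, width, height]) and counts instances per category. The file name, ids and box values are made up, and the real annotation files carry more fields than shown here.

```python
import json
from collections import Counter

# Hypothetical, heavily trimmed annotation file in COCO-style layout.
coco_json = """
{
  "images": [
    {"id": 1, "file_name": "example.jpg", "width": 640, "height": 480}
  ],
  "categories": [
    {"id": 1, "name": "person"},
    {"id": 18, "name": "dog"}
  ],
  "annotations": [
    {"id": 101, "image_id": 1, "category_id": 1, "bbox": [50, 60, 120, 300]},
    {"id": 102, "image_id": 1, "category_id": 1, "bbox": [300, 80, 100, 280]},
    {"id": 103, "image_id": 1, "category_id": 18, "bbox": [200, 350, 90, 70]}
  ]
}
"""

data = json.loads(coco_json)

# Map numeric category ids to readable names.
category_names = {cat["id"]: cat["name"] for cat in data["categories"]}

# Count how many annotated instances each category has.
instances_per_category = Counter(
    category_names[ann["category_id"]] for ann in data["annotations"]
)
print(dict(instances_per_category))
```

For the full dataset, the official `pycocotools` package is the usual way to query annotations rather than parsing the JSON by hand.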

This is a screen capture of one of the images in the COCO dataset.


Official Website :

There is a significant number of datasets covering different sorts of facial data. Here, I chose to feature a new and innovative dataset compiled by my colleagues at the University of Malta. Unlike other facial detection or recognition datasets, this one is annotated using descriptive text. This allows machine learning models to be trained to return a textual description of a face given just an image. The full details of the publication introducing this dataset can be found here, and the dataset itself may be acquired by filling in the contact form on the official website of this project.

A sample of the dataset extracted from the official publication.


Official Website :

This is a Salient Object Image Database. Every image in this dataset has a mask for the most salient region in the image. The MSRA10K dataset gained its relevance from the volume of images it contains. It consists of 10,000 colour images with a corresponding binary image mask for the salient object.

What’s most salient in the image? Every coloured image in the MSRA10K dataset is accompanied by a binary image that serves as ground truth.


Official Website :

The Microsoft Research (MSR) dataset includes sequences of 100 images (colour and depth) captured from 8 cameras, showing a breakdancing scene and a ballet scene. Every frame has a colour image and a high-quality grayscale depth image, captured by an infrared camera.

This is a sample frame from the “Breakdance” sequence in the MSR3D dataset, filmed through a linear setup of 8 cameras. There is also another sequence, the “Ballet” sequence that was filmed through a circular configuration of 8 cameras.


Official Website :

This is a dataset I carefully designed and built last year to evaluate image manipulation techniques. One such application is inpainting, where an object is removed from an image. Inpainting techniques are usually evaluated using subjective or opinion-based approaches because datasets lack adequate ground truth. This motivated the creation of this dataset, which has a series of progressive scenes as demonstrated below. Further details about this dataset and the experience behind its construction will be shared in separate work.

A sample from the COTS dataset showing the progressive nature of scenes, where a new object is introduced at every instance. This means that inpainting can be applied to the nth instance, with instance n-1 serving as ground truth.
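With a ground-truth image available, an inpainting result can be scored objectively rather than by opinion. The sketch below compares a made-up inpainted instance n against instance n-1 using mean squared error and PSNR; these two metrics are an assumption chosen for illustration, not necessarily the evaluation protocol used with the COTS dataset, and the flattened pixel lists are hypothetical.

```python
import math


def mean_squared_error(result, ground_truth):
    """Average squared difference between two equally sized images."""
    squared = [(r - g) ** 2 for r, g in zip(result, ground_truth)]
    return sum(squared) / len(squared)


def psnr(result, ground_truth, max_value=255):
    """Peak signal-to-noise ratio in dB (infinite for identical images)."""
    mse = mean_squared_error(result, ground_truth)
    if mse == 0:
        return math.inf
    return 10 * math.log10(max_value ** 2 / mse)


# Hypothetical flattened 8-bit images.
instance_n_minus_1 = [10, 20, 30, 40]  # scene before the object was added
inpainted_n = [10, 22, 30, 36]         # instance n with the object removed

mse = mean_squared_error(inpainted_n, instance_n_minus_1)
score = psnr(inpainted_n, instance_n_minus_1)
print(mse, round(score, 2))
```

A higher PSNR means the inpainted image is closer to the scene as it looked before the object was introduced.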

5- Where do we find datasets?

In academia, you’ll typically come across datasets in peer-reviewed publications about your topic of interest. However, you sometimes just need to browse your options, and for that, you need a good platform. Here follow my 4 favourite sources:

Google Dataset Search

  • Pros : Very extensive
  • Cons : Easy to get lost when comparing different datasets.


  • Pros : Focused on Computer Vision datasets, Excellent interface, easy to use and quick to get to direct repositories.
  • Cons : Still limited in terms of selection of available datasets.


  • Pros : Variety of datasets across different domains, active community, competitions.
  • Cons : Can take longer to see what each dataset offers.


  • Pros : An extensive selection of straight to the point pages for every dataset. Every dataset is also accompanied by excellent usage resources.
  • Cons : For completeness, I need to squeeze out a drawback. In this case, the disadvantage is that (obviously) this website only provides TensorFlow resources.
