CNN Object Detection Models


Brief Description:

Figure 1 illustrates a comparison between image classification, object detection, and instance segmentation.

Detailed Description:

Figure 1 illustrates a comparison between image classification, object detection, and instance segmentation. When a single object is in an image, the classification model 102 may be utilized to identify what is in the image. For instance, the classification model 102 identifies that a cat is in the image. In addition to the classification model 102, a classification and localization model 104 may be utilized to classify the cat and identify its location within the image with a bounding box 106. When multiple objects are present within an image, an object detection model 108 may be utilized. The object detection model 108 utilizes bounding boxes to classify and locate the different objects within the image. An instance segmentation model 110 detects each object in an image, its location, and its precise pixel-level segmentation with a segmentation region 112.

Image classification models classify images into a single category, usually corresponding to the most salient object. Photos and videos, however, are usually complex and contain multiple objects, so assigning a single label with an image classification model may become difficult and uncertain. Object detection models are therefore more appropriate for identifying multiple relevant objects in a single image. A second significant advantage of object detection models over image classification models is that the location of each object may be provided.

Some of the models that may be utilized to perform image classification, object detection, and instance segmentation include, but are not limited to, Region-based Convolutional Network (R-CNN), Fast Region-based Convolutional Network (Fast R-CNN), Faster Region-based Convolutional Network (Faster R-CNN), Region-based Fully Convolutional Network (R-FCN), You Only Look Once (YOLO), Single-Shot Detector (SSD), Neural Architecture Search Net (NASNet), and Mask Region-based Convolutional Network (Mask R-CNN).

These models may utilize a variety of training datasets that include, but are not limited to, the PASCAL Visual Object Classes (PASCAL VOC) and Common Objects in COntext (COCO) datasets.

The PASCAL Visual Object Classes (PASCAL VOC) dataset is a well-known dataset for object detection, classification, and segmentation, among other tasks. It provides around 10,000 images for training and validation, annotated with bounding boxes around objects. Although the PASCAL VOC dataset contains only 20 categories, it is still considered a reference dataset for the object detection problem.

ImageNet has provided an object detection dataset with bounding boxes since 2013. Its training set is composed of around 500,000 images covering 200 categories.

The Common Objects in COntext (COCO) dataset was developed by Microsoft. It is used for caption generation, object detection, key point detection, and object segmentation. The COCO object detection task consists of localizing the objects in an image with bounding boxes and assigning each of them to one of 80 categories.

Brief Description:

Figure 2 illustrates a Region-based Convolutional Network 200.

Detailed Description:

Figure 2 illustrates an example of a Region-based Convolutional Network 200 (R-CNN). Each region proposal feeds a convolutional neural network (CNN) to extract a feature vector; possible objects are detected using multiple SVM classifiers, and a linear regressor modifies the coordinates of the bounding box. Regions of interest (ROI 202) are extracted from the input image 204. Each ROI 202 is resized/warped, creating a warped image region 206, which is forwarded to the convolutional neural network 208. The resulting features are fed to the support vector machines 212 and the bounding box linear regressors 210.

In R-CNN, the selective search method is used as an alternative to an exhaustive search of the image for object locations. It initializes small regions in the image and merges them through hierarchical grouping, so that the final group is a box containing the entire image. The detected regions are merged according to a variety of color spaces and similarity metrics. The output is a small number of region proposals, formed by merging small regions, each of which could contain an object.

The R-CNN model combines the selective search method, to detect region proposals, with deep learning, to identify the object in each region. Each region proposal is resized to match the input of a CNN, from which the method extracts a 4096-dimensional feature vector. The feature vector is fed into multiple classifiers to produce the probability of belonging to each class. Each class has its own support vector machine (SVM) classifier (support vector machines 212), trained to infer the probability that the object is present for a given feature vector. The feature vector also feeds a linear regressor that adapts the shape of the bounding box for the region proposal and thus reduces localization errors.
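The bounding box adjustment described above may be sketched as follows. This is a minimal illustration only: the center/size box format and the offset parameterization (a shift scaled by the box size and a log-scale change of width and height) are assumptions commonly used with R-CNN-style regressors and are not stated in this description. The function name `apply_box_deltas` is hypothetical.

```python
import math


def apply_box_deltas(box, deltas):
    """Adjust a region proposal with regressor outputs.

    box    -- (cx, cy, w, h): center coordinates, width, and height (assumed format)
    deltas -- (tx, ty, tw, th): offsets predicted by the linear regressor (assumed
              parameterization: center shift relative to box size, log-scale resize)
    """
    cx, cy, w, h = box
    tx, ty, tw, th = deltas
    new_cx = cx + w * tx          # shift the center horizontally by a fraction of the width
    new_cy = cy + h * ty          # shift the center vertically by a fraction of the height
    new_w = w * math.exp(tw)      # rescale the width
    new_h = h * math.exp(th)      # rescale the height
    return (new_cx, new_cy, new_w, new_h)
```

With all offsets equal to zero the proposal is returned unchanged, which is the behavior expected when the region proposal already matches the object.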

The CNN model described is first trained on the ImageNet dataset. It is then fine-tuned using the region proposals whose Intersection over Union (IoU) with the ground-truth boxes is greater than 0.5. Two versions are produced: one uses the PASCAL VOC dataset and the other uses the ImageNet dataset with bounding boxes. The SVM classifiers are also trained for each class of each dataset.
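The IoU criterion used to choose region proposals for fine-tuning may be sketched as follows. The corner-based box format `(x1, y1, x2, y2)` and the helper names `iou` and `select_positive_proposals` are assumptions made for illustration; only the 0.5 threshold comes from the description above.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle, if any.
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_positive_proposals(proposals, ground_truth, threshold=0.5):
    """Keep proposals whose IoU with at least one ground-truth box exceeds the threshold."""
    return [p for p in proposals
            if any(iou(p, g) > threshold for g in ground_truth)]
```

For example, two identical boxes have an IoU of 1, and two disjoint boxes have an IoU of 0, so only proposals that substantially overlap a ground-truth box are retained.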

Brief Description:

Figure 3 illustrates a Fast Region-based Convolutional Network 300.

Detailed Description:

Figure 3 illustrates an example of a Fast Region-based Convolutional Network 300 (Fast R-CNN). The entire image (input image 306) feeds a CNN model (convolutional neural network 302) to detect regions of interest (ROI 304) on the feature maps 310. Each region is separated using a RoI pooling layer (ROI pooling layer 308) and fed to fully connected layers 312. The resulting vector is used by a softmax classifier 314 to detect the object and by bounding box linear regressors 316 to modify the coordinates of the bounding box. The purpose of the Fast R-CNN is to reduce the time consumed by the large number of model evaluations needed to analyze all region proposals.

A main CNN with multiple convolutional layers takes the entire image as input, instead of applying a CNN to each region proposal as in R-CNN. Regions of Interest (RoIs) are detected with the selective search method applied to the produced feature maps. Formally, the size of each region of the feature maps is reduced using a RoI pooling layer to obtain valid Regions of Interest with a fixed height and width, which are hyperparameters. The output of the RoI pooling layer feeds fully connected layers, creating a feature vector. This vector is used to predict the observed object with a softmax classifier and to adapt the bounding box localization with a linear regressor.
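The RoI pooling step may be sketched as follows for a single-channel feature map. This is a simplified illustration, assuming max pooling over a grid of bins, a feature map stored as a list of rows, and an RoI given as `(row_start, col_start, row_end, col_end)` with exclusive ends; the names `roi_max_pool` and `ceil_div` are hypothetical.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)


def roi_max_pool(feature_map, roi, out_h, out_w):
    """Max-pool one region of a 2D feature map to a fixed out_h x out_w grid.

    feature_map -- list of rows of numbers (single channel, assumed layout)
    roi         -- (row_start, col_start, row_end, col_end), ends exclusive
    out_h/out_w -- fixed output height and width (the hyperparameters)
    """
    r0, c0, r1, c1 = roi
    height, width = r1 - r0, c1 - c0
    pooled = []
    for i in range(out_h):
        # Bin boundaries: floor for the start, ceiling for the end,
        # so every bin is non-empty and the bins cover the whole region.
        rs = r0 + (i * height) // out_h
        re = r0 + ceil_div((i + 1) * height, out_h)
        row = []
        for j in range(out_w):
            cs = c0 + (j * width) // out_w
            ce = c0 + ceil_div((j + 1) * width, out_w)
            row.append(max(feature_map[r][c]
                           for r in range(rs, re)
                           for c in range(cs, ce)))
        pooled.append(row)
    return pooled
```

Whatever the size of the region, the output always has `out_h` rows and `out_w` columns, which is what allows regions of different shapes to feed the same fully connected layers.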

Brief Description:

Figure 4 illustrates a Faster Region-based Convolutional Network 400.

Detailed Description:

Figure 4 illustrates an example of a Faster Region-based Convolutional Network 400 (Faster R-CNN).

Region proposals detected with the selective search method were still necessary in the previous model, which is computationally expensive. The Region Proposal Network (RPN) was introduced to directly generate region proposals, predict bounding boxes, and detect objects. The Faster R-CNN is a combination of the RPN and the Fast R-CNN model.

A CNN model takes the entire image as input and produces feature map 410. A window of size 3×3 (sliding window 402) slides over all the feature maps and outputs a feature vector (intermediate layer 404) linked to two fully connected layers, one for box regression and one for box classification. Multiple region proposals are predicted by the fully connected layers. A maximum of k regions is fixed, so the output of the box regression layer 408 has a size of 4k (the coordinates of the boxes, their height, and their width) and the output of the box classification layer 406 has a size of 2k (“objectness” scores indicating whether or not the box contains an object). The k region proposals detected by the sliding window are called anchors.
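The relationship between k and the two output sizes may be sketched as follows. The choice of three scales and three aspect ratios (giving k = 9) is an assumption made for illustration and is not fixed by this description; the function names `make_anchors` and `rpn_output_sizes` are hypothetical.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return k = len(scales) * len(ratios) anchors centered on one window position.

    Each anchor is (cx, cy, w, h). For a given scale s and ratio r = h / w,
    the anchor keeps an area of s * s. Scales and ratios are assumed values.
    """
    anchors = []
    for s in scales:
        for r in ratios:
            w = s / math.sqrt(r)
            h = s * math.sqrt(r)
            anchors.append((cx, cy, w, h))
    return anchors


def rpn_output_sizes(k):
    """Output sizes of the two sibling layers for k anchors per window position."""
    return {
        "box_regression": 4 * k,      # 4 values per anchor: coordinates, height, width
        "box_classification": 2 * k,  # 2 objectness scores per anchor: object / not object
    }
```

Under these assumed values, `make_anchors` produces nine anchors per sliding-window position, so the box regression layer outputs 36 values and the box classification layer outputs 18 values at each position.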

Once the anchor boxes 412 are detected, a threshold is applied to the “objectness” score to keep only the relevant boxes. These anchor boxes and the feature maps computed by the initial CNN model feed a Fast R-CNN model.
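The selection of anchor boxes by objectness score may be sketched as follows. The threshold value of 0.7 and the function name `filter_by_objectness` are hypothetical and chosen only for illustration; the description does not specify a threshold.

```python
def filter_by_objectness(anchor_boxes, objectness_scores, threshold=0.7):
    """Keep only the anchor boxes whose objectness score reaches the threshold.

    anchor_boxes      -- list of boxes (any box representation)
    objectness_scores -- list of scores in [0, 1], one per box
    threshold         -- assumed cut-off; not specified by the description
    """
    if len(anchor_boxes) != len(objectness_scores):
        raise ValueError("each anchor box needs exactly one objectness score")
    return [(box, score)
            for box, score in zip(anchor_boxes, objectness_scores)
            if score >= threshold]
```

The boxes that survive this filter, together with their scores, are the region proposals that would be passed on with the feature maps to the Fast R-CNN stage.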

The entire image feeds a CNN model to produce anchor boxes as region proposals, each with a confidence that it contains an object. A Fast R-CNN is then used, taking the feature maps and the region proposals as inputs. For each box, it produces the probability of each object class and a correction to the location of the box.

Faster R-CNN uses the RPN to avoid the selective search method, which accelerates the training and testing processes and improves performance. The RPN uses a model pre-trained on the ImageNet dataset for classification, and it is fine-tuned on the PASCAL VOC dataset. The generated region proposals with anchor boxes are then used to train the Fast R-CNN. This process is iterative.

Parts List

102 classification model
104 classification and localization model
106 bounding box
108 object detection model
110 instance segmentation model
112 segmentation region
200 Region-based Convolutional Network
202 ROI
204 input image
206 warped image region
208 convolutional neural network
210 bounding box linear regressors
212 support vector machines
300 Fast Region-based Convolutional Network
302 convolutional neural network
304 ROI
306 input image
308 ROI pooling layer
310 feature maps
312 fully connected layers
314 softmax classifier
316 bounding box linear regressors
400 Faster Region-based Convolutional Network
402 sliding window
404 intermediate layer
406 box classification layer
408 box regression layer
410 feature map
412 anchor boxes