Intro to object detection
Humans detect and identify objects in an image with ease. The human visual system is fast and accurate and can perform complex tasks, such as detecting multiple objects and identifying obstacles, with little conscious thought. With the availability of large amounts of data, faster GPUs, and better algorithms, we can now train computers to detect and classify multiple objects within an image with high accuracy. In this blog post by AICoreSpot we will look at object detection, object localization, the loss function for object detection and localization, and finally an object detection algorithm known as “You only look once” (YOLO).
An image classification (or image recognition) model simply predicts the probability that an object is present in an image. Object localization, in contrast, refers to identifying the location of an object within the image. An object localization algorithm outputs the coordinates of an object's location relative to the image. In computer vision, the most common way to localize an object in an image is to represent its location with a bounding box. The image below shows an example of a bounding box.
A bounding box can be defined using the following parameters:
- bx, by: coordinates of the centre of the bounding box
- bw: width of the bounding box relative to the image width
- bh: height of the bounding box relative to the image height
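As a rough illustration of these parameters, the sketch below converts a box given as (bx, by, bw, bh) into pixel corner coordinates. It assumes that bx and by are also expressed as fractions of the image size, which the list above does not state explicitly; the function name `box_to_corners` is made up for this example.

```python
def box_to_corners(bx, by, bw, bh, img_w, img_h):
    """Convert a normalized centre/size box to pixel corner coordinates.

    Assumption: bx, by, bw and bh are all fractions of the image size.
    """
    x_min = (bx - bw / 2) * img_w
    y_min = (by - bh / 2) * img_h
    x_max = (bx + bw / 2) * img_w
    y_max = (by + bh / 2) * img_h
    return x_min, y_min, x_max, y_max


# A box centred in a 256 x 256 image, a quarter as wide and half as tall.
corners = box_to_corners(0.5, 0.5, 0.25, 0.5, img_w=256, img_h=256)
print(corners)  # (96.0, 64.0, 160.0, 192.0)
```

Storing the box relative to the image size keeps the target values in the same range regardless of the image resolution.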
Definition of the target variable
The target variable for a multi-class image classification problem is defined as:
ci = probability of the ith class.
For example, if there are four classes, the target variable is defined as y = [c1, c2, c3, c4].
We can extend this approach to define the target variable for object localization. The target variable is made up of:
pc = probability (confidence) that an object of one of the classes (here, the four classes) is present in the bounding box.
bx, by, bh, bw = bounding box coordinates.
ci = probability that the object belongs to the ith class.
For instance, the four classes can be ‘truck’, ‘car’, ‘bike’ and ‘pedestrian’, with their probabilities denoted c1, c2, c3, c4. So, y = [pc, bx, by, bh, bw, c1, c2, c3, c4].
Let the values of the target variable y be denoted y1, y2, …, y9.
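The target variable described above can be sketched in a few lines of NumPy. This is only an illustration: the helper name `make_target`, the class order, and the choice to leave every entry at zero when no object is present are assumptions made for the example.

```python
import numpy as np

# Class order is an assumption; it fixes which of c1..c4 each label maps to.
CLASSES = ["truck", "car", "bike", "pedestrian"]


def make_target(label=None, box=None):
    """Build y = [pc, bx, by, bh, bw, c1, c2, c3, c4] for one image."""
    y = np.zeros(5 + len(CLASSES))
    if label is None:
        return y  # pc = 0: no object, the remaining entries are not used
    bx, by, bh, bw = box
    y[0] = 1.0                          # pc: an object is present
    y[1:5] = [bx, by, bh, bw]           # bounding box
    y[5 + CLASSES.index(label)] = 1.0   # one-hot class entry
    return y


print(make_target("car", (0.5, 0.6, 0.3, 0.4)))
print(make_target())  # background image
```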
The loss function for object localization will be defined as:
In practice, we can use a log loss on the softmax output for the predicted classes (c1, c2, c3, c4), a squared error for the bounding box coordinates, and a logistic regression loss for pc (the object confidence).
Now that we have defined the target variable and the loss function, we can use neural networks to both classify and localize objects.
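One possible way to combine these three terms is sketched below. It is not a definitive formulation: the function names, the equal weighting of the terms, and the decision to ignore the box and class terms when no object is present (pc = 0) are assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def softmax(z):
    e = np.exp(z - np.max(z))  # subtract the max for numerical stability
    return e / e.sum()


def localization_loss(y_true, pc_logit, box_pred, class_logits, eps=1e-12):
    """Loss for one example with y_true = [pc, bx, by, bh, bw, c1..c4]."""
    pc_true = y_true[0]
    pc_hat = sigmoid(pc_logit)
    # Logistic regression loss on the object confidence pc.
    loss = -(pc_true * np.log(pc_hat + eps)
             + (1.0 - pc_true) * np.log(1.0 - pc_hat + eps))
    if pc_true == 1:  # assumption: box and class terms only count for objects
        # Squared error on the bounding box coordinates.
        loss += np.sum((box_pred - y_true[1:5]) ** 2)
        # Log loss on the softmax output for the classes.
        loss += -np.sum(y_true[5:] * np.log(softmax(class_logits) + eps))
    return float(loss)


y_car = np.array([1.0, 0.5, 0.5, 0.2, 0.3, 0.0, 1.0, 0.0, 0.0])
print(localization_loss(y_car, 4.0, y_car[1:5], np.array([0.0, 8.0, 0.0, 0.0])))
```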
One approach to building an object detection system is to first build a classifier that can classify closely cropped images of an object. Fig. 2 shows an example of such a model, which is trained on a dataset of closely cropped images of cars and predicts the probability that an image is a car.
We can now use this model to detect cars with a sliding window mechanism. We slide a window over the image, much like the filters in a convolutional network, and crop a portion of the image at each step. The size of the crop is the same as the size of the sliding window. Each cropped image is then passed through a ConvNet model, like the one shown in the figure below, which predicts the probability that the crop is a car.
After running the sliding window over the entire image, we resize the window and run it over the image again. This process is repeated several times. Because we crop many images and pass each one through the ConvNet, this approach is computationally expensive and time-consuming, making the whole process very slow. A convolutional implementation of the sliding window helps solve this problem.
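To get a feel for the cost, the sketch below enumerates the crops a naive sliding window would produce. The image size, window sizes and stride are illustrative assumptions, and `sliding_windows` is a hypothetical helper, not part of any library.

```python
import numpy as np


def sliding_windows(image, window, stride):
    """Yield (top, left, crop) for every window position in the image."""
    height, width = image.shape[:2]
    for top in range(0, height - window + 1, stride):
        for left in range(0, width - window + 1, stride):
            yield top, left, image[top:top + window, left:left + window]


image = np.zeros((256, 256, 3))
n_crops = sum(
    1
    for size in (32, 64, 128)  # the window is resized and run again
    for _ in sliding_windows(image, size, stride=8)
)
print(n_crops)  # 1755
```

Each of these crops would need its own forward pass through the classifier, which is where the time goes.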
Convolutional implementation of sliding windows
Before discussing the implementation of the sliding window using ConvNets, let’s look at how we can convert the fully connected layers of the network into convolutional layers. The following figure shows a simple convolutional network with two fully connected layers, each of shape (400, ).
A fully connected layer can be converted into a convolutional layer whose output has a width and height of one and whose number of filters equals the number of units in the fully connected layer (400 here). For the first such layer, each filter spans the whole incoming feature map; the layers after it use 1 x 1 filters. An example of this is shown in the figure below:
We can apply this idea to the model by replacing each fully connected layer with a convolutional layer whose number of filters equals the number of units in the fully connected layer. This representation is shown in the figure below. The softmax output layer also becomes a convolutional layer of shape (1, 1, 4), where 4 is the number of classes to predict.
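The equivalence between a fully connected layer and a convolutional layer can be checked numerically. The sketch below assumes the feature map entering the first fully connected layer has shape 5 x 5 x 16; that shape is an assumption for illustration, since the text only gives the 400 units. The weights are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.standard_normal((5, 5, 16))        # assumed pooled feature map
weights = rng.standard_normal((5 * 5 * 16, 400))  # fully connected weights
bias = rng.standard_normal(400)

# Fully connected layer: flatten the feature map, then one matrix product.
fc_out = features.reshape(-1) @ weights + bias    # shape (400,)

# Same weights viewed as 400 filters of size 5 x 5 x 16. With no padding the
# filter fits the feature map in exactly one position, so the output is 1 x 1 x 400.
filters = weights.reshape(5, 5, 16, 400)
conv_out = np.tensordot(features, filters, axes=([0, 1, 2], [0, 1, 2])) + bias
conv_out = conv_out.reshape(1, 1, 400)

print(conv_out.shape)                              # (1, 1, 400)
print(np.allclose(fc_out, conv_out.reshape(-1)))   # True
```

Because the two layers compute the same numbers, a trained fully connected layer can be reused as a convolutional layer without retraining.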
Now, let’s extend the approach above to implement a convolutional version of the sliding window. First, consider the ConvNet that we have trained in the following representation (no fully connected layers).
Assume the input image is of size 16 x 16 x 3. With a sliding window approach, we would pass this image through the ConvNet above four times, each time cropping a portion of size 14 x 14 x 3 from the input image and passing it through the ConvNet. Instead, we feed the full image of size 16 x 16 x 3 directly into the trained ConvNet.
This results in an output matrix of shape 2 x 2 x 4. Each cell in the output matrix represents the result of one possible crop and the classification of that crop. For example, the left cell of the output (the green one) in the following figure represents the result of the first sliding window. The other cells represent the results of the remaining sliding window operations.
Note that the stride of the sliding window is determined by the window size (and stride) of the Max Pool layer. In the example above, the Max Pool layer has a size of two, so the sliding window moves with a stride of two, resulting in four possible outputs. The main advantage of this approach is that the values for all sliding windows are computed at the same time, so it is very fast. A drawback is that the position of the bounding boxes is not very accurate.
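The output sizes quoted above can be reproduced with simple shape arithmetic. The layer sizes used here (a 5 x 5 convolution without padding, a 2 x 2 max pool with stride two, and a 5 x 5 convolution standing in for the first fully connected layer) are assumptions chosen so that a 14 x 14 input gives a 1 x 1 output; the figures in the original post may use different values.

```python
def output_grid(input_size, conv_kernel=5, pool=2, fc_kernel=5):
    """Spatial size of the output for a square input (assumed architecture)."""
    after_conv = input_size - conv_kernel + 1   # convolution, no padding
    after_pool = after_conv // pool             # max pool, stride = pool size
    return after_pool - fc_kernel + 1           # converted fully connected layer


def window_for_cell(row, col, window=14, pool=2):
    """Crop of the input that an output cell corresponds to: (top, left, size)."""
    return row * pool, col * pool, window


for size in (14, 16, 28):
    n = output_grid(size)
    print(f"{size} x {size} input -> {n} x {n} x 4 output")

print(window_for_cell(1, 1))  # (2, 2, 14): bottom-right crop of a 16 x 16 image
```

Under these assumptions a 28 x 28 image yields an 8 x 8 grid of predictions from a single forward pass, instead of 64 separate crops.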
The YOLO (You Only Look Once) Algorithm
A better algorithm that tackles the problem of predicting accurate bounding boxes while using the convolutional sliding window technique is the YOLO algorithm. YOLO stands for “you only look once” and was introduced in 2015 by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. It is popular because it achieves high accuracy while running in real time. The algorithm is so named because it needs only one forward propagation pass through the network to make its predictions.
The algorithm divides the image into a grid and runs the image classification and localization algorithm (discussed under object localization) on each grid cell. For example, suppose we have an input image of size 256 x 256. We place a 3 x 3 grid on the image.
Next, we apply the image classification and localization algorithm to each grid cell. For each grid cell, the target variable is the same nine-element vector used for object localization: (pc, bx, by, bh, bw, c1, c2, c3, c4).
All grid cells are processed in a single pass using the convolutional sliding window. Since the shape of the target variable for each grid cell is 1 x 9 and there are 9 (3 x 3) grid cells, the final output of the model has shape 3 x 3 x 9.
The advantages of the YOLO algorithm are that it is very fast and predicts much more accurate bounding boxes. In practice, to get more accurate predictions, we use a much finer grid, say 19 x 19, in which case the target output has shape 19 x 19 x 9.
This brings us to the end of our introduction to object detection. We now have a better understanding of how to localize objects while classifying them in an image. We also learned how to combine classification and localization with the convolutional implementation of the sliding window to build an object detection system.
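The grid-shaped target can be sketched as follows. Several details here are assumptions rather than something the text specifies: each object is assigned to the grid cell that contains its centre, the centre is stored relative to that cell, the box height and width stay relative to the whole image, and each cell holds at most one object. `encode_targets` is a hypothetical helper name.

```python
import numpy as np

CLASSES = ["truck", "car", "bike", "pedestrian"]  # assumed class order


def encode_targets(objects, grid=3):
    """Build a grid x grid x 9 target.

    objects: list of (label, bx, by, bh, bw), all as fractions of the image.
    """
    y = np.zeros((grid, grid, 5 + len(CLASSES)))
    for label, bx, by, bh, bw in objects:
        col = min(int(bx * grid), grid - 1)   # cell containing the centre
        row = min(int(by * grid), grid - 1)
        y[row, col, 0] = 1.0                                  # pc
        y[row, col, 1:5] = [bx * grid - col,                  # centre x within the cell
                            by * grid - row,                  # centre y within the cell
                            bh, bw]                           # size relative to the image
        y[row, col, 5 + CLASSES.index(label)] = 1.0           # one-hot class
    return y


target = encode_targets([("car", 0.5, 0.8, 0.2, 0.3)])
print(target.shape)      # (3, 3, 9)
print(target[2, 1])      # the only cell with pc = 1

print(encode_targets([], grid=19).shape)  # (19, 19, 9)
```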