Object detection for self-driving vehicles
In our previous post, Introduction to Object Detection, we covered the fundamentals of object detection and took a bird’s-eye view of the YOLO (You Only Look Once) algorithm. In this post, we build on that and dig deeper into YOLO. We will cover intersection over union (IoU), non-max suppression, multiple object detection, anchor boxes, and more. Finally, we will build an object detection system for an autonomous vehicle using YOLO, training the model on the Berkeley driving dataset.
Before building the components of the object detection pipeline, we need a few preprocessing steps: resizing the images to the input shape the model accepts, and converting the box coordinates into the form the model expects. Since we are building an object detector for an autonomous vehicle, we will detect and localize eight classes: ‘car’, ‘bike’, ‘bus’, ‘person’, ‘motor’, ‘train’, ‘rider’, and ‘truck’. Our target variable is therefore defined as:
pc: Probability/confidence of an object being present in the bounding box.
bx, by: Coordinates of the centre of the bounding box.
bw: Width of the bounding box w.r.t. the image width.
bh: Height of the bounding box w.r.t. the image height.
ci: Probability of the ith class.
However, the box coordinates in the dataset are given as xmin, ymin, xmax, and ymax, so we need to convert them to match the target variable defined above. This can be implemented as follows, where:
W: Width of the original image
H: Height of the original image
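Written out, and assuming the centre and size are normalized by the image dimensions (which is what the code does when it divides by the original image size), the conversion is:

```latex
\begin{aligned}
b_x &= \frac{x_{min} + x_{max}}{2W}, & b_y &= \frac{y_{min} + y_{max}}{2H}, \\
b_w &= \frac{x_{max} - x_{min}}{W}, & b_h &= \frac{y_{max} - y_{min}}{H}.
\end{aligned}
```

All four values then fall between 0 and 1, regardless of the original image resolution.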
import numpy as np
import PIL.Image

def process_data(images, boxes=None):
    """Resize images to 416x416, scale pixels to [0, 1], and normalize boxes."""
    images = [PIL.Image.fromarray(i) for i in images]
    # Assumes every image has the same original size as the first one
    orig_size = np.array([images[0].width, images[0].height])
    orig_size = np.expand_dims(orig_size, axis=0)

    # Image preprocessing
    processed_images = [i.resize((416, 416), PIL.Image.BICUBIC) for i in images]
    # np.float was removed from recent NumPy releases; the built-in float is equivalent
    processed_images = [np.array(image, dtype=float) for image in processed_images]
    processed_images = [image / 255. for image in processed_images]

    if boxes is not None:
        # Box preprocessing
        # Original boxes stored as a 1D list of class, x_min, y_min, x_max, y_max
        boxes = [box.reshape((-1, 5)) for box in boxes]
        # Get extents as y_min, x_min, y_max, x_max, class for comparison with
        # model output (not used further in this function)
        box_extents = [box[:, [2, 1, 4, 3, 0]] for box in boxes]
        # Get box parameters as x_center, y_center, box_width, box_height, class
        boxes_xy = [0.5 * (box[:, 3:5] + box[:, 1:3]) for box in boxes]
        boxes_wh = [box[:, 3:5] - box[:, 1:3] for box in boxes]
        boxes_xy = [box_xy / orig_size for box_xy in boxes_xy]
        boxes_wh = [box_wh / orig_size for box_wh in boxes_wh]
        boxes = [np.concatenate((boxes_xy[i], boxes_wh[i], box[:, 0:1]), axis=-1) for i, box in enumerate(boxes)]

        # Find the max number of boxes in any image
        max_boxes = 0
        for boxz in boxes:
            if boxz.shape[0] > max_boxes:
                max_boxes = boxz.shape[0]

        # Zero-pad so every image has the same number of box rows for training
        for i, boxz in enumerate(boxes):
            if boxz.shape[0] < max_boxes:
                zero_padding = np.zeros((max_boxes - boxz.shape[0], 5), dtype=np.float32)
                boxes[i] = np.vstack((boxz, zero_padding))

        return np.array(processed_images), np.array(boxes)
    # No boxes supplied: return only the processed images
    return np.array(processed_images)
Intersection over Union (IoU) is an evaluation metric used to measure the accuracy of an object detection algorithm. In general, IoU measures the overlap between two bounding boxes. To calculate it, we need:
- The ground truth bounding boxes (that is, the manually labelled bounding boxes)
- The predicted bounding boxes from the model
Intersection over Union is the ratio of the area of intersection to the area of union of the ground truth bounding box and the predicted bounding box.
Now that we have a better understanding of the metric, let’s code it:
def IoU(box1, box2):
    """
    Returns the Intersection over Union (IoU) between box1 and box2.

    box1: coordinates (x1, y1, x2, y2)
    box2: coordinates (x1, y1, x2, y2)
    """
    # Calculate the intersection area of the two boxes
    xi1 = max(box1[0], box2[0])
    yi1 = max(box1[1], box2[1])
    xi2 = min(box1[2], box2[2])
    yi2 = min(box1[3], box2[3])
    # Clamp at zero so that non-overlapping boxes get zero intersection
    area_of_intersection = max(xi2 - xi1, 0) * max(yi2 - yi1, 0)

    # Calculate the union area of the two boxes
    # A U B = A + B - A ∩ B
    A = (box1[2] - box1[0]) * (box1[3] - box1[1])
    B = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union_area = A + B - area_of_intersection

    intersection_over_union = area_of_intersection / union_area
    return intersection_over_union
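As a quick sanity check of the metric, here is a small self-contained sketch with three easy cases: partial overlap, identical boxes, and no overlap. The helper name `iou_xyxy` and the example boxes are made up for illustration; boxes are assumed to be in (x1, y1, x2, y2) corner format.

```python
def iou_xyxy(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    # Width and height of the overlap, clamped at zero for disjoint boxes
    inter_w = max(min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]), 0)
    inter_h = max(min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]), 0)
    intersection = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return intersection / (area_a + area_b - intersection)

ground_truth = (1, 1, 3, 3)   # 2 x 2 box, area 4
prediction = (2, 2, 4, 4)     # same size, shifted by one unit, area 4

overlap = iou_xyxy(ground_truth, prediction)      # intersection 1, union 7
perfect = iou_xyxy(ground_truth, ground_truth)    # identical boxes
disjoint = iou_xyxy(ground_truth, (5, 5, 6, 6))   # boxes do not touch
```

The partially overlapping pair gives 1/7 (about 0.14), identical boxes give 1, and disjoint boxes give 0, so the metric behaves as the definition suggests.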
Defining the model
Rather than building the model from scratch, we will take a pre-trained network and apply transfer learning to arrive at our final model. You Only Look Once (YOLO) is a state-of-the-art, real-time object detection system with a mAP of 78.6% on VOC 2007 and a mAP of 48.1% on COCO test-dev. YOLO applies a single neural network to the full image. The network divides the image into regions and predicts bounding boxes and probabilities for each region. These bounding boxes are weighted by the predicted probabilities.
One advantage of YOLO is that it looks at the whole image at test time, so its predictions are informed by the global context of the image. Unlike R-CNN, which needs thousands of network evaluations for a single image, YOLO makes its predictions with a single network evaluation. This makes it extremely fast: more than 1000x faster than R-CNN and 100x faster than Fast R-CNN.
If the target variable y is defined as:
The loss function for object localization is defined as:
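One common way to write both is sketched below. It assumes the target layout from the preprocessing section (confidence first, then the four box values, then the eight class probabilities) and a plain squared-error loss; the symbol $\hat{y}$ for the model's prediction is notation introduced for this sketch.

```latex
y = \begin{bmatrix} p_c & b_x & b_y & b_w & b_h & c_1 & \cdots & c_8 \end{bmatrix}^{\top}

L(\hat{y}, y) =
\begin{cases}
\sum_{k=1}^{13} (\hat{y}_k - y_k)^2 & \text{if } p_c = 1 \\
(\hat{y}_1 - y_1)^2 & \text{if } p_c = 0
\end{cases}
```

When no object is present ($p_c = 0$), only the confidence term is penalized, because the remaining entries of $y$ carry no meaning in that case.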
The loss function for the YOLO algorithm is calculated using the following steps:
- Identify the predicted bounding boxes with the highest IoU with the ground truth bounding boxes
- Calculate the confidence loss (the probability of an object being present within the bounding box)
- Calculate the classification loss (the probability of each class within the bounding box)
- Calculate the coordinate loss for the matching detected boxes
- The total loss is the sum of the confidence loss, classification loss, and coordinate loss
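The first step, picking the predictor with the highest IoU for a ground truth box, can be sketched as follows. This is a minimal illustration rather than the training code: the function names `iou_one_to_many` and `responsible_predictor` are hypothetical, and boxes are assumed to be in (x1, y1, x2, y2) corner format.

```python
import numpy as np

def iou_one_to_many(gt_box, pred_boxes):
    """IoU between one ground truth box and an (N, 4) array of predicted boxes."""
    gt_box = np.asarray(gt_box, dtype=float)
    pred_boxes = np.asarray(pred_boxes, dtype=float)
    # Corners of each intersection rectangle
    x1 = np.maximum(gt_box[0], pred_boxes[:, 0])
    y1 = np.maximum(gt_box[1], pred_boxes[:, 1])
    x2 = np.minimum(gt_box[2], pred_boxes[:, 2])
    y2 = np.minimum(gt_box[3], pred_boxes[:, 3])
    # Clamp at zero so disjoint boxes contribute no intersection
    intersection = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    gt_area = (gt_box[2] - gt_box[0]) * (gt_box[3] - gt_box[1])
    pred_area = (pred_boxes[:, 2] - pred_boxes[:, 0]) * (pred_boxes[:, 3] - pred_boxes[:, 1])
    return intersection / (gt_area + pred_area - intersection)

def responsible_predictor(gt_box, pred_boxes):
    """Index of the predicted box that overlaps the ground truth box the most."""
    return int(np.argmax(iou_one_to_many(gt_box, pred_boxes)))

gt = [2, 2, 6, 6]
candidates = [[0, 0, 3, 3],   # small overlap with gt
              [2, 3, 6, 7],   # large overlap with gt
              [7, 7, 9, 9]]   # no overlap with gt
best = responsible_predictor(gt, candidates)
```

Here the second candidate (index 1) is selected, so only that predictor would contribute coordinate and classification loss for this ground truth box.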
Using the steps above, let’s calculate the loss function for the YOLO algorithm.
In general, the target variable can be defined as:
Ci: Confidence that an object is present in the bounding box.
xi, yi: Coordinates of the center of the bounding box.
wi: Width of the bounding box w.r.t. the image width.
hi: Height of the bounding box w.r.t. the image height.
pi(c): Probability of class c.
Then the corresponding loss function is calculated as:
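For reference, the loss from the original YOLO paper (Redmon et al., 2016) can be written as below. It is given in the paper's own symbols, which is an assumption about the notation intended here: $S^2$ grid cells, $B$ predictors per cell, $C_i$ for the box confidence, $p_i(c)$ for the probability of class $c$, hats for predicted values, and an indicator $\mathbb{1}_{ij}^{obj}$ that equals 1 when predictor $j$ in cell $i$ is responsible for an object.

```latex
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
  \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
  \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left(C_i - \hat{C}_i\right)^2 \\
&+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left(C_i - \hat{C}_i\right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in classes} \left(p_i(c) - \hat{p}_i(c)\right)^2
\end{aligned}
```

The first two lines are the coordinate loss, the next two are the confidence loss (with and without an object), and the last line is the classification loss.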
The equation above is the YOLO loss function. It may look intimidating at first, but on closer inspection it is simply the sum of the coordinate loss, the classification loss, and the confidence loss. We use the sum of squared errors because it is easy to optimize. However, it weights localization error equally with classification error, which may not be ideal. To remedy this, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that don’t contain objects. We use two parameters, λcoord and λnoobj, to accomplish this.
Note that the loss function only penalizes classification error if an object is present in that grid cell. Likewise, it only penalizes bounding box coordinate error if that predictor is responsible for the ground truth box, that is, if it has the highest IoU of any predictor in that grid cell.
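To make the weighting concrete, here is a minimal NumPy sketch of a YOLO-style sum-of-squared-errors loss. It is an illustration under several assumptions, not the training code used for the model. The function name `yolo_style_loss` and the array layout (one row per predictor: confidence, x, y, w, h, then class probabilities) are hypothetical. The `responsible` mask is assumed to have been computed beforehand by IoU matching. The defaults λcoord = 5 and λnoobj = 0.5, and the square roots on width and height, follow the original YOLO paper. Class probabilities are stored per predictor for simplicity, whereas the paper predicts them once per grid cell.

```python
import numpy as np

def yolo_style_loss(pred, target, responsible, lambda_coord=5.0, lambda_noobj=0.5):
    """Sum-of-squared-errors loss over N predictors.

    pred, target: arrays of shape (N, 5 + num_classes) laid out as
                  [confidence, x, y, w, h, class probabilities...]
    responsible:  boolean array of shape (N,), True where the predictor is
                  matched to a ground truth box
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    obj = np.asarray(responsible, dtype=bool)
    noobj = ~obj

    # Coordinate loss: only for responsible predictors, up-weighted by lambda_coord
    xy_loss = np.sum((pred[obj, 1:3] - target[obj, 1:3]) ** 2)
    wh_loss = np.sum((np.sqrt(pred[obj, 3:5]) - np.sqrt(target[obj, 3:5])) ** 2)
    coord_loss = lambda_coord * (xy_loss + wh_loss)

    # Confidence loss: full weight with an object, down-weighted without one
    conf_sq_err = (pred[:, 0] - target[:, 0]) ** 2
    conf_loss = np.sum(conf_sq_err[obj]) + lambda_noobj * np.sum(conf_sq_err[noobj])

    # Classification loss: only where an object is present
    class_loss = np.sum((pred[obj, 5:] - target[obj, 5:]) ** 2)

    return coord_loss + conf_loss + class_loss

# Two predictors, two classes; only the first one is matched to an object
target = np.array([[1.0, 0.5, 0.5, 0.25, 0.25, 1.0, 0.0],
                   [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
pred = np.array([[0.8, 0.5, 0.5, 0.25, 0.25, 1.0, 0.0],
                 [0.2, 0.3, 0.3, 0.1, 0.1, 0.5, 0.5]])
responsible = np.array([True, False])
loss = yolo_style_loss(pred, target, responsible)
```

In this example the box and class values of the unmatched predictor are ignored entirely, so only the two confidence errors contribute: 0.04 for the matched predictor plus 0.5 × 0.04 for the unmatched one, giving a loss of about 0.06.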