
An intro to the promise of Deep Learning for Computer Vision

The promise of deep learning in the field of computer vision is better-performing models that may require more data but less digital signal processing expertise to train and operate.

There is a lot of hype and many large claims around deep learning methods, but beyond the hype, deep learning methods are achieving state-of-the-art results on challenging problems, notably on computer vision tasks such as image classification, object recognition, and face detection.

In this post, you will discover the specific promises that deep learning methods hold for tackling computer vision problems.

After reading this post, you will know:

  • The promises of deep learning for computer vision.
  • Examples of where deep learning has delivered or is delivering on its promises.
  • Key deep learning methods and applications for computer vision.

Tutorial Overview

This tutorial is divided into three parts:

1. Promises of Deep Learning

2. Variants of Deep Learning Network Models

3. Variants of Computer Vision Problems

Promises of Deep Learning 

Deep learning methods are popular, primarily because they are delivering on their promise.

That is not to say that there is no hype around the technology, but that the hype is based on very real results that are being demonstrated across a range of very challenging artificial intelligence problems in computer vision and natural language processing.

Some of the first large demonstrations of the power of deep learning were in computer vision, specifically image recognition, and more recently in object detection and face recognition.

In this post, we will look at five specific promises of deep learning methods in the field of computer vision. In summary, they are:

  • The Promise of Automatic Feature Extraction: Features can be automatically learned and extracted from raw image data.
  • The Promise of End-to-End Models: Single end-to-end models can replace pipelines of specialized models.
  • The Promise of Model Reuse: Learned features and even entire models can be reused across tasks.
  • The Promise of Superior Performance: Deep learning methods demonstrate better skill on challenging tasks.
  • The Promise of General Method: A single general method can be used on a range of related tasks.

There are other promises of deep learning for computer vision; these are just the five that we chose to highlight. We will now take a closer look at each.

Promise 1: Automatic Feature Extraction 

A major focus of study in the field of computer vision is on techniques to detect and extract features from digital images.

Extracted features provide the context for inference about an image, and often the richer the features, the better the inference.

Sophisticated hand-designed features such as the scale-invariant feature transform (SIFT), Gabor filters, and the histogram of oriented gradients (HOG) have been the focus of feature extraction in computer vision for some time, and have seen good success.
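To make the idea of a hand-designed feature concrete, the sketch below computes a gradient orientation histogram for a single grayscale image cell, in the spirit of HOG. It is a simplified illustration only: the function name, the 9-bin default, and the toy 8x8 cell are assumptions made for this example, and a full HOG descriptor would also divide the image into many cells and normalize over blocks.

```python
import numpy as np

def orientation_histogram(cell, n_bins=9):
    """Histogram of unsigned gradient orientations for one grayscale cell.

    Simplified, HOG-like feature: every rule here is fixed by hand,
    nothing is learned from data.
    """
    cell = np.asarray(cell, dtype=float)
    # np.gradient returns derivatives along rows (y) first, then columns (x)
    gy, gx = np.gradient(cell)
    magnitude = np.hypot(gx, gy)
    # Unsigned orientation in [0, 180) degrees
    angle = np.degrees(np.arctan2(gy, gx)) % 180.0
    bin_width = 180.0 / n_bins
    bins = np.minimum((angle / bin_width).astype(int), n_bins - 1)
    # Each pixel votes for its orientation bin, weighted by gradient strength
    return np.bincount(bins.ravel(), weights=magnitude.ravel(), minlength=n_bins)

# Toy cell whose brightness increases from left to right (horizontal gradient)
cell = np.tile(np.arange(8.0), (8, 1))
hist = orientation_histogram(cell)
```

All of the gradient energy for this toy cell lands in the first bin, because the designer decided in advance that orientation is what matters. A deep network instead learns which patterns to respond to from the training images.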

The promise of deep learning is that complex and useful features can be automatically learned directly from large image datasets. More specifically, a deep hierarchy of rich features can be learned and automatically extracted from images, provided by the multiple deep layers of neural network models.

Deep models have deeper architectures with the capacity to learn more complex features than shallow ones. In addition, their expressivity and robust training algorithms make it possible to learn informative object representations without the need to design features manually.

Deep neural network models are delivering on this promise, most notably demonstrated by the transition away from sophisticated hand-crafted feature detection methods such as SIFT toward deep convolutional neural networks on standard computer vision benchmark datasets and competitions, such as the ImageNet Large Scale Visual Recognition Challenge (ILSVRC).

Over the past five years, ILSVRC has paved the way for several breakthroughs in computer vision. The field of categorical object recognition has evolved dramatically, starting from coded SIFT features and evolving to large-scale convolutional neural networks that dominate all three tasks of image classification, single-object localization, and object detection.

Promise 2: End-to-End Models

Addressing computer vision tasks traditionally involved using a system of modular models.

Each model was designed for a specific task, such as feature extraction, image alignment, or classification. The models are used in a pipeline with a raw image at one end and an outcome, such as a prediction, at the other end.

This pipeline approach can be, and still is, used with deep learning models, where a feature detector model can be replaced with a deep neural network.

Alternatively, deep neural networks allow a single model to subsume two or more traditional models, such as feature extraction and classification, and there has been a trend toward replacing pipelines with a single deep neural network model that is trained end-to-end directly.

With the availability of large amounts of training data (along with an efficient algorithmic implementation and GPU computing resources), it became possible to learn neural networks directly from the image data, without needing to create multi-stage hand-tuned pipelines of extracted features and discriminative classifiers.

A good example of this is in object detection and face recognition, where initially superior performance was achieved using a deep convolutional neural network for feature extraction only, whereas more recently, end-to-end models are trained directly using multiple-output models (for example, class and bounding boxes) and/or new loss functions (for example, contrastive or triplet loss functions).
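As an illustration of the kind of loss function that lets a face recognition model be trained end-to-end, here is a minimal sketch of a triplet loss in NumPy. It assumes one common formulation (squared Euclidean distances between embedding vectors and a fixed margin); the function name and the margin value of 0.2 are illustrative choices, not details taken from this article.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Mean triplet loss over a batch of embeddings (one embedding per row).

    Encourages each anchor to be closer to its positive (same identity)
    than to its negative (different identity) by at least `margin`.
    The margin of 0.2 is an assumed, illustrative value.
    """
    anchor = np.asarray(anchor, dtype=float)
    positive = np.asarray(positive, dtype=float)
    negative = np.asarray(negative, dtype=float)
    d_pos = np.sum((anchor - positive) ** 2, axis=1)  # squared distance to same identity
    d_neg = np.sum((anchor - negative) ** 2, axis=1)  # squared distance to other identity
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))

# A well-separated triplet contributes no loss; a badly ordered one is penalized.
good = triplet_loss([[0.0, 0.0]], [[0.0, 0.0]], [[1.0, 0.0]])
bad = triplet_loss([[0.0, 0.0]], [[1.0, 0.0]], [[0.0, 0.0]])
```

Because the loss is defined directly on the embeddings produced by the network, its gradient can flow back through every layer, which is what makes training the whole model in one step possible.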

Promise 3: Model Reuse 

Traditionally, the feature detectors prepared for a dataset are highly specific to that dataset.

This makes sense, as the more domain information that you can build into the model, the better the model is likely to perform in the domain.

Deep neural networks are typically trained on datasets that are much larger than traditional datasets, for example, millions or billions of images. This allows the models to learn features and hierarchies of features that are general across photographs, which is itself remarkable.

If this original dataset is large enough and general enough, then the spatial hierarchy of features learned by the pretrained network can effectively act as a generic model of the visual world, and hence its features can prove useful for many different computer vision problems, even though these new problems may involve completely different classes than those of the original task.

For example, it is common to use deep models trained on the large ImageNet dataset, or a subset of this dataset, directly or as a starting point on a range of computer vision tasks.

This is called transfer learning, and the use of pretrained models that can take days and sometimes weeks to train has become standard practice.

The pretrained models can be used to extract useful general features from digital images and can also be fine-tuned, that is, tailored to the specifics of the new task. This can save a lot of time and resources and produce very good results almost immediately.

A common and highly effective approach to deep learning on small image datasets is to use a pretrained network.
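The mechanics of reusing a model as a fixed feature extractor can be sketched in a few lines of NumPy. This is a toy illustration under loud assumptions: a fixed random layer (`W_FROZEN`) stands in for a real pretrained network, the 40 flattened 8x8 "images" are synthetic, and all function names are hypothetical. What it does show faithfully is the division of labor: the reused layer is never updated, and only a small new classifier head is trained on the new task.

```python
import numpy as np

rng = np.random.default_rng(0)

# ASSUMPTION: a fixed random layer stands in for a pretrained network.
# In practice these weights would come from a model trained on a large
# dataset such as ImageNet.
W_FROZEN = rng.normal(size=(64, 16))

def extract_features(images):
    """Frozen feature extractor: ReLU projection, L2-normalized per image."""
    feats = np.maximum(np.asarray(images, dtype=float) @ W_FROZEN, 0.0)
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    return feats / np.maximum(norms, 1e-12)

def log_loss(feats, labels, w, b):
    """Binary cross-entropy of the classifier head."""
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
    p = np.clip(p, 1e-12, 1.0 - 1e-12)
    return float(-np.mean(labels * np.log(p) + (1.0 - labels) * np.log(1.0 - p)))

def train_head(feats, labels, lr=0.5, epochs=300):
    """Train only the new head (logistic regression) by gradient descent."""
    w = np.zeros(feats.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))
        error = p - labels
        w -= lr * (feats.T @ error) / len(labels)
        b -= lr * error.mean()
    return w, b

# Hypothetical small dataset: 40 flattened 8x8 images in two classes.
images = rng.random((40, 64))
labels = np.repeat([0.0, 1.0], 20)
images[labels == 1.0, :32] += 0.5  # class 1 is brighter in half of the pixels

features = extract_features(images)  # reused layer, weights untouched
w, b = train_head(features, labels)  # only the head is fitted
final_loss = log_loss(features, labels, w, b)
```

Fine-tuning would go one step further and also update some or all of the reused weights, usually with a small learning rate, once the new head has been trained.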

Promise 4: Improved Performance 

An important promise of deep neural networks in computer vision is better performance.

It is the dramatically better performance of deep neural networks that has been a catalyst for the growth of, and interest in, the field of deep learning. Although the techniques have been around for decades, the spark was the outstanding performance by Alex Krizhevsky, et al. in 2012 on image classification.

The current intensity of commercial interest in deep learning began when Krizhevsky et al. (2012) won the ImageNet object recognition challenge.

Their deep convolutional neural network model, at the time called SuperVision and later referred to as AlexNet, resulted in a leap in classification accuracy.

The authors report that they entered a variant of this model in the ILSVRC-2012 competition and achieved a winning top-5 test error rate of 15.3%, compared to 26.2% achieved by the second-best entry.
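The top-5 error rate quoted above counts a prediction as correct if the true label appears anywhere among the model's five highest-scoring classes. A minimal sketch of that calculation follows; the function name and the toy scores are assumptions made for illustration.

```python
import numpy as np

def top_k_error(scores, true_labels, k=5):
    """Fraction of examples whose true label is not among the k highest scores.

    scores: array of shape (n_examples, n_classes)
    true_labels: integer class index for each example
    """
    scores = np.asarray(scores, dtype=float)
    true_labels = np.asarray(true_labels)
    # Indices of the k largest scores in each row (their order does not matter)
    top_k = np.argsort(scores, axis=1)[:, -k:]
    hits = np.any(top_k == true_labels[:, None], axis=1)
    return float(1.0 - hits.mean())

# Two examples, six classes; class 0 has the lowest score in both rows.
scores = np.array([[0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
                   [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]])
error = top_k_error(scores, [0, 5], k=5)  # first example misses, second hits
```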

The technique was then adopted for a range of very challenging computer vision tasks, including object detection, which also saw a large leap in model performance over the then state-of-the-art traditional methods.

The first breakthrough in object detection was R-CNN, which resulted in an improvement of nearly 30% over the previous state-of-the-art.

This trend of improvement has continued year over year on a range of computer vision tasks.

The improvement in performance has been so dramatic that tasks previously thought not easily addressable by computers, and used as CAPTCHAs to prevent spam (such as predicting whether a photo is of a dog or a cat), are effectively “solved”, and models on problems such as face recognition achieve better-than-human performance.

Significant improvement in performance (mean average precision) can be observed since deep learning entered the scene in 2012. The performance of the best detector has been steadily increasing by a significant amount on a yearly basis.

Promise 5: General Method 

Perhaps the most important promise of deep learning is that the top-performing models are all developed from the same basic components.

The impressive results have come from one type of network, called the convolutional neural network, composed of convolutional and pooling layers. It was specifically designed for image data and can be trained on pixel data directly (with some minor scaling).

Convolutional networks provide a way to specialize neural networks to work with data that has a clear grid-structured topology and to scale such models to very large size. This approach has been the most successful on a two-dimensional image topology.
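The two building blocks named above can be sketched directly in NumPy. The example below scales raw pixel values to the range [0, 1], slides a small kernel over the image grid, and then downsamples with max pooling. It is a forward pass only, written for clarity rather than speed; the function names, the 6x6 toy image, and the fixed kernel values are assumptions for illustration, since in a real convolutional network the kernel weights are learned during training.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation, the operation deep learning libraries call convolution."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Weighted sum of the image patch under the kernel
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    h = feature_map.shape[0] // size * size
    w = feature_map.shape[1] // size * size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

# Hypothetical 6x6 grayscale image with raw 8-bit pixel values
pixels = np.arange(36, dtype=np.uint8).reshape(6, 6)
image = pixels / 255.0                      # the minor scaling: pixels mapped to [0, 1]
kernel = np.array([[-1.0, 0.0, 1.0]] * 3)   # fixed here; learned in a real network
feature_map = conv2d(image, kernel)         # shape (4, 4)
pooled = max_pool2d(feature_map)            # shape (2, 2)
```

The same small kernel is applied at every position of the grid, which is how the model exploits the grid-structured topology of image data.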

This is different from the broader field, which may have previously required specialized feature detection methods developed for handwriting recognition, character recognition, face recognition, object detection, and so on. Instead, a single general class of model can be configured and used across each computer vision task directly.

This is the promise of machine learning in general. It is impressive that such a versatile technique has been discovered and demonstrated for computer vision.

Further, the model is relatively straightforward to understand and to train, although it may require modern GPU hardware to train efficiently on a large dataset and may require hyperparameter tuning to achieve state-of-the-art performance.

Variants of Deep Learning Network Models

Deep learning is a large field of study, and not all of it is relevant to computer vision.

It is easy to get bogged down in specific optimization methods or extensions to model types intended to lift performance.

From a high level, there is one method from deep learning that deserves the most attention for application in computer vision. It is:

  • Convolutional Neural Networks (CNNs) 

The reason that CNNs are the focus of attention for deep learning models is that they were specifically designed for image data.

In addition, the following two network types may be useful for interpreting or developing inference models from the features learned and extracted by CNNs:

  • Multilayer Perceptron (MLP)
  • Recurrent Neural Networks (RNNs)

MLPs, or fully connected neural network layers, are useful for developing models that make predictions given the learned features extracted by CNNs. RNNs, such as LSTMs, may be helpful when working with sequences of images over time, such as with video.
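As a rough sketch of how fully connected layers turn extracted features into predictions, the NumPy example below passes a batch of feature vectors through one hidden layer and a softmax output. The feature vectors and weights are random placeholders (in a real model the features would come from the convolutional layers and the weights would be learned), and the layer sizes and function names are assumptions chosen for illustration.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax, shifted for numerical stability."""
    z = z - np.max(z, axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mlp_head(features, w1, b1, w2, b2):
    """Two fully connected layers mapping feature vectors to class probabilities."""
    hidden = np.maximum(features @ w1 + b1, 0.0)  # fully connected layer + ReLU
    return softmax(hidden @ w2 + b2)              # one probability per class

rng = np.random.default_rng(0)
features = rng.random((4, 32))  # placeholder: 4 images, 32 extracted features each
w1, b1 = rng.normal(scale=0.1, size=(32, 16)), np.zeros(16)
w2, b2 = rng.normal(scale=0.1, size=(16, 3)), np.zeros(3)
probs = mlp_head(features, w1, b1, w2, b2)  # shape (4, 3), each row sums to 1
```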

Variants of Computer Vision Problems 

Deep learning will not solve computer vision or artificial intelligence. 

To date, deep learning methods have been evaluated on a broader suite of computer vision problems and have achieved success on a small set, where success means performance or capability at or above what was previously possible with other methods.

Importantly, the areas where deep learning methods are showing the greatest success are some of the more end-user facing, challenging, and perhaps more interesting problems.

Five examples include:

  • Optical Character Recognition 
  • Image Classification 
  • Object Detection 
  • Face Detection 
  • Face Recognition 

All five tasks are related under the umbrella of “object recognition”, which refers to tasks that involve identifying, localizing, and/or extracting specific content from digital photographs.

Most deep learning for computer vision is used for object recognition or detection of some form, whether this means reporting which object is present in an image, annotating an image with bounding boxes around each object, transcribing a sequence of symbols from an image, or labeling each pixel in an image with the identity of the object it belongs to.

Further Reading 

This section provides more resources on the topic if you are looking to go deeper.


Deep Learning, 2016.

Deep Learning with Python, 2017.


ImageNet Large Scale Visual Recognition Challenge, 2015.

ImageNet Classification with Deep Convolutional Neural Networks, 2012.

Object Detection with Deep Learning: A Review, 2018.

A Survey of Modern Object Detection Literature using Deep Learning, 2018.

Deep Learning for Generic Object Detection: A Survey, 2018.


Summary

In this post, you discovered the specific promises that deep learning methods hold for tackling computer vision problems.

Specifically, you learned:

  • The promises of deep learning for computer vision.
  • Examples of where deep learning has delivered or is delivering on its promises.
  • Key deep learning methods and applications for computer vision.