“Feature Integration and Object Representations along the Dorsal Stream Visual Hierarchy”
August 2014 Frontiers in Computational Neuroscience 8:84, 22 October 2014
- Carolyn J Perry, Department of Biomedical and Molecular Science, Queen's University, Kingston, Ontario, Canada
- Mazyar Fallah, Centre for Vision Research, York University, Toronto, Ontario, Canada
“Bio-Inspired Computer Vision: Towards a Synergistic Approach of Artificial and Biological Vision”, April 2016
Computer Vision and Image Understanding 150, DOI: 10.1016/j.cviu.2016.04.009
- N V Kartheek Medathati, National Institute for Research in Computer Science and Control, Le Chesnay, France
- Heiko Neumann
- Guillaume S Masson, Aix-Marseille Université, Marseille, France
- Pierre Kornprobst, National Institute for Research in Computer Science and Control, Le Chesnay, France
“Why vision is not both hierarchical and feedforward”
- Michael H. Herzog and Aaron M. Clarke,
Laboratory of Psychophysics, Brain, Mind Institute, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland

Deep learning, hierarchical learning:

deep neural networks
deep belief networks
recurrent neural networks

Applications include:

computer vision and visual object recognition, based on images (CNN)
speech recognition, natural language processing and machine translation, based on spectrograms (RNN, long short-term memory (LSTM))
bioinformatics, based on microarray data

A CNN achieves translation, rotation and distortion invariance by:

local receptive fields
shared weights (weight replication)
subsampling (pooling)

Unlike conventional (shallow) neural networks, which depend on a set of hand-selected features, a CNN works directly on the raw data, such as images for visual recognition or spectrograms for sound recognition, and the network extracts the features automatically.

Application in speech recognition: spectrogram

A convolutional neural network (CNN, or ConvNet) is a class of multilayer, feed-forward artificial neural network that has been successfully applied to image analysis and computer vision, in particular to object recognition in images.

Convolutional networks were inspired by biological processes in the brain. The connectivity pattern between neurons resembles the organization of the visual cortex. Individual cortical neurons respond to stimuli only in a restricted region of the visual field known as the receptive field.

CNNs use a variation of multilayer perceptrons designed to require minimal preprocessing. They are also known as shift invariant or space invariant artificial neural networks (SIANN), based on their shared-weights architecture and translation invariance characteristics.

CNNs use relatively little pre-processing compared to conventional image classification algorithms. The network learns the filters that in traditional algorithms were hand-engineered. This independence from prior knowledge and human effort in feature design is a major advantage.

3D volumes of neurons. The layers of a CNN have neurons arranged in 3 dimensions: width, height and depth. The neurons inside a layer are connected to only a small region of the layer before it, called a receptive field. Distinct types of layers, both locally and completely connected, are stacked to form a CNN architecture.

Local connectivity: following the concept of receptive fields, CNNs exploit spatial locality by enforcing a local connectivity pattern between neurons of adjacent layers. The architecture thus ensures that the learnt "filters" produce the strongest response to a spatially local input pattern. Stacking many such layers leads to non-linear filters that become increasingly global (i.e. responsive to a larger region of pixel space) so that the network first creates representations of small parts of the input, then from them assembles representations of larger areas.
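The growth of the receptive field with depth can be made concrete with a little arithmetic. The sketch below is illustrative only: the function name `receptive_field_sizes` and the example stack of layers are assumptions, not part of these notes. It tracks how many input pixels along one axis a single unit can see after each layer, given each layer's kernel size and stride.

```python
def receptive_field_sizes(layers):
    """Return the receptive field size (in input pixels, along one axis)
    of a unit after each layer.

    `layers` is a list of (kernel_size, stride) pairs, first layer first.
    """
    size = 1  # a single input pixel covers only itself
    jump = 1  # spacing, in input pixels, between neighboring units of the current layer
    sizes = []
    for kernel_size, stride in layers:
        # Each extra kernel tap reaches `jump` input pixels further.
        size += (kernel_size - 1) * jump
        # Striding spreads the units of the next layer further apart.
        jump *= stride
        sizes.append(size)
    return sizes


# Hypothetical stack: three 3x3 convolutions (stride 1),
# each followed by 2x2 pooling (stride 2).
stack = [(3, 1), (2, 2), (3, 1), (2, 2), (3, 1), (2, 2)]
print(receptive_field_sizes(stack))  # [3, 4, 8, 10, 18, 22]
```

In this example a unit in the first layer sees 3 pixels, while a unit after the last pooling step sees 22, which is the sense in which deeper filters become "increasingly global".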

Shared weights: In CNNs, each filter is replicated across the entire visual field. These replicated units share the same parameterization (weight vector and bias) and form a feature map. This means that all the neurons in a given feature map respond to the same feature, each within its own receptive field. Replicating units in this way allows features to be detected regardless of their position in the visual field, which gives the network its translation invariance property.
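Weight sharing can be illustrated with a minimal NumPy sketch. The helper `feature_map`, the two-tap kernel and the test image are all assumptions chosen for illustration. One kernel is slid over the whole image, so the same pattern is detected wherever it occurs: shifting the input simply shifts the response in the feature map.

```python
import numpy as np


def feature_map(image, kernel, bias=0.0):
    """Slide one shared kernel over a 2D image (valid positions, stride 1)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Every output unit uses the same weights and bias.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel) + bias
    return out


# Hypothetical filter: responds where intensity drops from left to right.
kernel = np.array([[1.0, -1.0]])

image = np.zeros((5, 8))
image[:, 2] = 1.0                    # vertical bar at column 2
shifted = np.roll(image, 3, axis=1)  # the same bar, moved to column 5

response = feature_map(image, kernel)
response_shifted = feature_map(shifted, kernel)

# The strongest response follows the bar from column 2 to column 5.
print(response[0].argmax(), response_shifted[0].argmax())
```

The whole feature map here is described by three parameters (two weights and one bias), however large the image is. A layer with separate weights for every position would need that many parameters per output unit.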

neurons with limited receptive field

Hierarchical structure of multiple layers
Receptive field: the receptive fields of different neurons partially overlap such that together they cover the entire visual field. In other words, a neuron is connected only to the subset of neurons in the previous layer that lie inside its receptive field, instead of being fully connected to all neurons.

layers of different functions

Image input: $N\times N\times 3$ pixels in the image of three planes for red, green and blue (RGB).
Neurons in the convolution layers are locally connected to the neurons inside their receptive fields in the previous layer. In particular, each neuron in the first layer takes as input the pixel values inside its receptive field, a subregion of the image. The weights of each neuron form a kernel, and the activation of the neuron is the weighted sum of all pixel values inside the receptive field; computing this sum at every position is called convolution. Each such neuron functions as a filter that extracts one of a set of different features, such as edges and lines, describing different aspects of the visual objects of interest. All neurons in a column along the depth dimension respond to the same local spatial region in the visual field.
An activation function, such as the sigmoid or ReLU function, is applied to each weighted sum.
A pooling layer performs down-sampling by taking either the maximum or the average of the output values from a local region of the previous layer. The distance between the receptive field centers of neighboring neurons is called the stride. Down-sampling serves two purposes: (a) invariance to small local shifts and rotations, and (b) reduced computation.
Dropout
Neurons in the fully connected (FC) layers are fully connected to all neurons in the previous layer. Neurons in this highest layer are responsible for the final recognition of various visual objects.
The convolution layers can be regarded as feature extractors, while the fully connected layers carry out the final recognition based on the features extracted by the convolution layers.
The weights of all neurons at all layers are iteratively updated based on backpropagation.
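The layer types listed above can be put together in a single forward pass. This is a minimal NumPy sketch under stated assumptions, not a reference implementation: the function names, the 8×8×3 random input, the four 3×3 kernels and the ten output units are all hypothetical, the weights are random rather than trained, and dropout and backpropagation are omitted.

```python
import numpy as np


def conv_layer(image, kernels, biases):
    """Valid, stride-1 convolution layer.

    image:   (H, W, C) input volume
    kernels: (K, kh, kw, C), one kernel per output feature map
    biases:  (K,)
    returns: (H - kh + 1, W - kw + 1, K) output volume
    """
    n_kernels, kh, kw, _ = kernels.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.empty((out_h, out_w, n_kernels))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i:i + kh, j:j + kw, :]  # receptive field
            # Weighted sum of the patch for every kernel at once.
            # No kernel flip is applied (strictly a cross-correlation).
            out[i, j, :] = np.tensordot(
                kernels, patch, axes=([1, 2, 3], [0, 1, 2])) + biases
    return out


def relu(x):
    """ReLU activation: negative responses are set to zero."""
    return np.maximum(x, 0.0)


def max_pool(volume, size=2, stride=2):
    """Down-sample each feature map by taking local maxima."""
    h, w, d = volume.shape
    out_h = (h - size) // stride + 1
    out_w = (w - size) // stride + 1
    out = np.empty((out_h, out_w, d))
    for i in range(out_h):
        for j in range(out_w):
            window = volume[i * stride:i * stride + size,
                            j * stride:j * stride + size, :]
            out[i, j, :] = window.max(axis=(0, 1))
    return out


def fully_connected(volume, weights, biases):
    """Every output unit is connected to every unit of the input volume."""
    return weights @ volume.ravel() + biases


rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))                # N x N x 3 RGB input with N = 8
kernels = rng.standard_normal((4, 3, 3, 3))  # four 3x3 filters over 3 channels
conv_biases = np.zeros(4)

features = max_pool(relu(conv_layer(image, kernels, conv_biases)))
fc_weights = rng.standard_normal((10, features.size))
fc_biases = np.zeros(10)
scores = fully_connected(features, fc_weights, fc_biases)

print(features.shape, scores.shape)  # (3, 3, 4) (10,)
```

The 8×8×3 input becomes a 6×6×4 volume after convolution, 3×3×4 after pooling, and finally ten scores, one per hypothetical object class. During training, backpropagation would adjust `kernels`, `conv_biases`, `fc_weights` and `fc_biases`.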

ImageNet

AlexNet

An example

A CNN course at Stanford

Convolutional Neural Networks (CNNs)