Convolutional Neural Networks (CNN)
Definition:
Convolutional Neural Networks (CNNs) are a class of deep neural networks specifically designed to process data with a grid-like structure, such as images. CNNs are highly effective in tasks like image classification, object detection, and recognition due to their ability to capture spatial hierarchies in data through convolutional layers.
Characteristics:
- Convolutional Layers: CNNs use convolutional layers to automatically detect spatial features (e.g., edges, textures) in the input. Because each filter's weights are shared across all positions, these layers need far fewer parameters and less computation than fully connected layers on the same input.
- Local Connectivity: Each neuron in a convolutional layer is connected only to a small, local region of the input, so it captures localized patterns that are useful for tasks like image and video analysis.
- Hierarchical Feature Learning: CNNs learn features hierarchically, starting with low-level features (edges, textures) and progressing to more complex patterns (shapes, objects).
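The parameter savings from local connectivity and shared filter weights can be checked with simple arithmetic. The sketch below is illustrative only: the layer sizes (a 32x32 RGB input, 64 dense units, 64 filters of size 3x3) are assumptions chosen for the example, and the two layers are not equivalent in what they output (64 numbers versus 64 feature maps).

```python
def dense_params(in_h, in_w, in_c, out_units):
    """Weights plus biases for a fully connected layer on a flattened input."""
    return in_h * in_w * in_c * out_units + out_units


def conv_params(kernel_h, kernel_w, in_c, out_c):
    """Weights plus biases for a convolutional layer.

    The count does not depend on the image size, because the same
    filter weights are reused at every spatial position.
    """
    return kernel_h * kernel_w * in_c * out_c + out_c


# Hypothetical sizes, chosen only for illustration.
dense = dense_params(32, 32, 3, 64)  # every unit sees every input value
conv = conv_params(3, 3, 3, 64)      # every filter sees a 3x3x3 patch

print("dense layer parameters:", dense)
print("conv layer parameters: ", conv)
```

Under these assumptions the convolutional layer has roughly a hundred times fewer parameters than the dense layer.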
Components of CNN:
- Convolutional Layer: This layer applies a set of filters (or kernels) to the input, producing one feature map per filter. Sliding a filter over the input detects the same local pattern at every position, while different filters detect different patterns.
- Activation Function (ReLU): After convolution, a non-linear activation function, typically ReLU (Rectified Linear Unit, which replaces negative values with zero), is applied so the network can learn complex, non-linear patterns.
- Pooling Layer (Subsampling): The pooling layer reduces the spatial dimensions of the feature maps, summarizing the most important features and making the network more computationally efficient. Common pooling methods include Max Pooling and Average Pooling.
- Fully Connected Layer: After the convolutional and pooling layers, the feature maps are flattened into a vector and passed through fully connected layers. These layers combine the learned features to make predictions.
- Softmax Layer (for Classification): In classification tasks, the output of the last fully connected layer is passed through a softmax function to produce a probability distribution over the classes.
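The components above can be chained into a single forward pass. The following is a minimal pure-Python sketch for one single-channel image and one filter, not an efficient or complete implementation. The input image, the edge-detecting kernel, and the dense-layer weights are all hand-picked, hypothetical values; in a real network the kernel and weights are learned during training.

```python
import math


def conv2d(image, kernel):
    """Slide a 2D kernel over a 2D image (stride 1, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[i + a][j + b] * kernel[a][b]
                for a in range(kh) for b in range(kw))
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


def relu(fmap):
    """Replace negative activations with zero."""
    return [[max(0.0, v) for v in row] for row in fmap]


def max_pool(fmap, size=2):
    """Non-overlapping max pooling; incomplete border windows are dropped."""
    return [
        [
            max(fmap[i + a][j + b] for a in range(size) for b in range(size))
            for j in range(0, len(fmap[0]) - size + 1, size)
        ]
        for i in range(0, len(fmap) - size + 1, size)
    ]


def flatten(fmap):
    return [v for row in fmap for v in row]


def dense(x, weights, biases):
    """One fully connected layer: a weighted sum per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    """Convert scores to probabilities (max subtracted for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical 6x6 image: dark left half, bright right half.
image = [[0.0, 0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(6)]
# Hand-picked vertical-edge kernel (an assumption, not a learned filter).
kernel = [[-1.0, 0.0, 1.0] for _ in range(3)]

feature_map = relu(conv2d(image, kernel))  # 4x4 map, large at the edge
pooled = max_pool(feature_map, size=2)     # 2x2 map
features = flatten(pooled)                 # vector of 4 values

# Hypothetical weights for a 2-class output ("edge" vs "no edge").
weights = [[0.5, 0.5, 0.5, 0.5], [-0.5, -0.5, -0.5, -0.5]]
biases = [0.0, 0.0]
probs = softmax(dense(features, weights, biases))
print(pooled, probs)
```

The feature map responds only where the window straddles the dark-to-bright boundary, which is the kind of local pattern detection the convolutional layer is described as performing.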
CNN Architecture:
- Input Layer: The input can be an image (e.g., 32x32 pixels with 3 color channels: RGB). CNNs can also handle other grid-like data, such as time series or audio spectrograms.
- Convolutional Layer(s): Multiple convolutional layers are stacked, each extracting progressively more abstract features from the output of the previous layer.
- Pooling Layer(s): Pooling layers (e.g., Max Pooling) are interleaved between convolutional layers to reduce the spatial dimensions of the feature maps while retaining important information.
- Fully Connected Layer(s): After several convolutional and pooling layers, the feature maps are flattened and passed through fully connected layers, which serve as a classifier or regressor depending on the task.
- Output Layer: The final layer outputs the class probabilities (for classification, typically via softmax) or the predicted value (for regression).
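To make the flow through these stages concrete, the sketch below traces how the shape of a 32x32 RGB input changes layer by layer. The specific stack (two 3x3 convolutions with padding 1, two 2x2 pooling layers, a 10-unit output) is a hypothetical example, not an architecture taken from the text, and `window_out` and `trace_shapes` are illustrative helper names.

```python
def window_out(size, kernel, stride=1, padding=0):
    """Output length along one spatial axis for a sliding window (conv or pool)."""
    return (size + 2 * padding - kernel) // stride + 1


def trace_shapes(shape, layers):
    """Return (layer kind, (height, width, channels)) after each layer.

    Flattened and dense outputs are written as (1, 1, n) so that every
    entry has the same three-part form.
    """
    h, w, c = shape
    trace = [("input", (h, w, c))]
    for layer in layers:
        kind = layer["kind"]
        if kind == "conv":
            k = layer["kernel"]
            s = layer.get("stride", 1)
            p = layer.get("padding", 0)
            h, w, c = window_out(h, k, s, p), window_out(w, k, s, p), layer["filters"]
        elif kind == "pool":
            k = layer["size"]
            h, w = window_out(h, k, stride=k), window_out(w, k, stride=k)
        elif kind == "flatten":
            h, w, c = 1, 1, h * w * c
        elif kind == "dense":
            h, w, c = 1, 1, layer["units"]
        else:
            raise ValueError(f"unknown layer kind: {kind}")
        trace.append((kind, (h, w, c)))
    return trace


# Hypothetical stack for a 10-class problem.
layers = [
    {"kind": "conv", "kernel": 3, "padding": 1, "filters": 16},
    {"kind": "pool", "size": 2},
    {"kind": "conv", "kernel": 3, "padding": 1, "filters": 32},
    {"kind": "pool", "size": 2},
    {"kind": "flatten"},
    {"kind": "dense", "units": 10},
]
trace = trace_shapes((32, 32, 3), layers)
for name, shape in trace:
    print(name, shape)
```

In this example each pooling layer halves the height and width, while the number of channels grows with the number of filters in each convolutional layer.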
Types of Convolutions:
- Standard Convolution: Each kernel spans all input channels and slides over the spatial positions of the image, producing one feature map per kernel.
- Depthwise Convolution: A separate filter is applied to each input channel (e.g., each of the R, G, and B channels) with no mixing across channels, which reduces computational cost.
- Dilated Convolution: The filter elements are spaced apart by gaps, giving a larger receptive field without adding parameters.
- Transposed Convolution: Used in tasks like image generation, transposed convolutions map in the opposite direction to a standard convolution, increasing the spatial dimensions of the input. They reverse the change in shape, not the values, so they are not a true inverse of convolution.
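The differences between these variants show up in a few small formulas. The sketch below is a rough comparison under stated assumptions: square kernels, biases ignored, a depthwise convolution with one filter per input channel and no following 1x1 convolution, and a transposed convolution with no output padding or dilation. The function names are illustrative.

```python
def standard_weights(kernel, in_channels, out_channels):
    """Each of the out_channels kernels spans all input channels."""
    return kernel * kernel * in_channels * out_channels


def depthwise_weights(kernel, in_channels):
    """One kernel per input channel, with no mixing across channels."""
    return kernel * kernel * in_channels


def dilated_extent(kernel, dilation):
    """Span covered by a kernel whose elements are `dilation` pixels apart."""
    return kernel + (kernel - 1) * (dilation - 1)


def transposed_out(size, kernel, stride=1, padding=0):
    """Output length of a transposed convolution along one axis."""
    return (size - 1) * stride - 2 * padding + kernel


# Hypothetical layer sizes, chosen only for illustration.
print(standard_weights(3, 32, 64))        # 3x3 kernels, 32 -> 64 channels
print(depthwise_weights(3, 32))           # 3x3 kernels, 32 channels
print(dilated_extent(3, 2))               # 3x3 kernel with dilation 2 spans 5 pixels
print(transposed_out(16, 2, stride=2))    # 16 -> 32 upsampling
```

With dilation 2, a 3x3 kernel covers a 5x5 region while still holding only 9 weights, which is the sense in which the receptive field grows without extra parameters.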
Problem Statement:
Given an image dataset, the goal of a CNN is to classify the images into different categories (e.g., classifying digits in the MNIST dataset or identifying objects in CIFAR-10). CNNs are also used in segmentation, detection, and generation tasks in computer vision.
Key Concepts:
- Filters (Kernels): Filters are small matrices that slide over the input data to detect features like edges, corners, or textures. Multiple filters are used so that each can detect a different feature.
- Stride: The number of pixels by which the filter moves at each step. A larger stride reduces the spatial dimensions of the feature map.
- Padding: Adding pixels (typically zeros) around the border of the input. Padding can be used to keep the feature maps from shrinking after convolution.
- Receptive Field: The region of the input image that influences a particular feature in the output. Deeper layers in a CNN have a larger receptive field and can detect more complex features.
- Pooling: Pooling layers downsample the feature maps by summarizing regions of the data. Max pooling selects the maximum value in each region, while average pooling computes the average.
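How the receptive field grows with depth, and how much padding keeps the output size unchanged, can both be computed directly. The sketch below assumes square kernels, no dilation, and (for the padding helper) stride 1 with an odd kernel size. The function names and the example layer stacks are illustrative, not taken from a particular library or network.

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of one output value.

    `layers` is a list of (kernel_size, stride) pairs ordered from the
    input to the output; pooling layers count as windows too.
    """
    field, jump = 1, 1  # jump: distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride
    return field


def same_padding(kernel):
    """Padding per side that preserves size for stride 1 and an odd kernel."""
    return (kernel - 1) // 2


# Three stacked 3x3 convolutions with stride 1.
print(receptive_field([(3, 1), (3, 1), (3, 1)]))

# Hypothetical stack: conv 3x3, pool 2x2 (stride 2), conv 3x3, pool 2x2 (stride 2).
print(receptive_field([(3, 1), (2, 2), (3, 1), (2, 2)]))

# A 3x3 kernel needs 1 pixel of padding per side to keep a 32-pixel axis at 32.
p = same_padding(3)
print(p, 32 + 2 * p - 3 + 1)
```

Strided layers increase the jump between neighbouring outputs, so every later layer widens the receptive field faster. This is one reason deeper layers can respond to larger structures in the image.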