
With the rise of deep learning, Convolutional Neural Networks (CNNs) have become a key technology, transforming image and video processing. These specialized neural networks are inspired by the organization of the human visual cortex, allowing machines to analyze images by automatically extracting relevant visual features.
Since their introduction in 1989 for handwritten digit recognition, CNNs have evolved significantly, becoming essential in various fields that require image analysis and complex data processing. Their effectiveness comes from their ability to detect visual patterns with remarkable accuracy, often outperforming traditional computer vision techniques.
Today, CNNs play a vital role in many applications, including image classification, object detection, facial recognition, medical image analysis, and autonomous driving.
In this section, we will examine the structure of CNNs and their fundamental components.
A Convolutional Neural Network (CNN) is a specialized neural architecture designed to efficiently process visual data. Its operation relies on a series of layers that extract, transform, and interpret the features of an image to deduce a classification or prediction.
Thanks to this hierarchical architecture, a CNN can capture simple patterns (edges, textures) in the early layers and more complex structures (shapes, objects) in the deeper layers.
A CNN consists of several types of layers, each playing a specific role in image analysis: convolutional layers extract local features, activation functions introduce non-linearity, pooling layers reduce spatial resolution, and fully connected layers produce the final prediction, as sketched in the example below.
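To make this concrete, here is a minimal sketch of such a layer stack using PyTorch. The input size (a single 28×28 grayscale image), channel counts, and number of classes are illustrative assumptions, not values from the text:

```python
import torch
import torch.nn as nn

# A minimal CNN: convolution -> non-linearity -> pooling -> classification.
model = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3, padding=1),  # feature extraction
    nn.ReLU(),                    # non-linearity
    nn.MaxPool2d(kernel_size=2),  # spatial down-sampling: 28x28 -> 14x14
    nn.Flatten(),                 # flatten feature maps into one vector
    nn.Linear(8 * 14 * 14, 10),   # fully connected layer -> 10 class scores
)

x = torch.randn(1, 1, 28, 28)     # one fake grayscale image
print(model(x).shape)             # torch.Size([1, 10])
```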
An image can be represented as a matrix of pixels, where each pixel contains a light intensity (for a grayscale image) or several values (Red, Green, Blue — RGB) for a color image.
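As a quick illustration (a NumPy sketch with made-up pixel values), a grayscale image is a 2-D array while a color image adds a third dimension for the RGB channels:

```python
import numpy as np

# Grayscale: one intensity value per pixel (0 = black, 255 = white).
gray = np.array([[  0, 128, 255],
                 [ 64, 192,  32],
                 [255,   0, 128]], dtype=np.uint8)   # shape: (3, 3)

# Color: three values (R, G, B) per pixel.
rgb = np.zeros((3, 3, 3), dtype=np.uint8)            # shape: (3, 3, 3)
rgb[0, 0] = [255, 0, 0]                              # top-left pixel is pure red

print(gray.shape, rgb.shape)   # (3, 3) (3, 3, 3)
```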
But how can a machine identify shapes, textures, or objects from this raw data? This is where convolution comes in! Convolution is a mathematical operation that extracts characteristic patterns from an image and reveals its essential structures. It applies filters to the image to detect specific patterns, such as edges, corners, and textures.
Each filter acts like a lens, highlighting certain aspects of the image, making it easier for machines to automatically recognize visual elements.
🔍 A Simple Analogy: The Magnifying Glass
Imagine looking at an image through a magnifying glass. As you move it across different parts of the image, you can observe specific details more clearly, such as the edges of an object or a unique texture. Convolution does the same thing with a filter, but in a mathematical and systematic way.
Steps of convolution
1️⃣ Choosing the Filter (Convolution Kernel)
A filter is a small matrix of numbers (often 3×3 or 5×5) that is slid over an image to highlight specific features. Different types of filters serve distinct purposes:

- Edge detection filter (e.g., Sobel): highlights sharp changes in intensity, revealing the contours of objects.
- Blurring filter (Gaussian Blur): applies a weighted average to smooth an image and reduce noise.
- Sharpening filter: enhances edges and improves image sharpness.
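As an illustration, here are these classic 3×3 kernels written out as NumPy arrays (a sketch; the exact coefficients vary between implementations):

```python
import numpy as np

# Sobel kernel: approximates the horizontal gradient,
# highlighting vertical edges.
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]], dtype=np.float32)

# Gaussian blur: a weighted average of the 3x3 neighborhood
# (weights sum to 1), which smooths the image.
gaussian = np.array([[1, 2, 1],
                     [2, 4, 2],
                     [1, 2, 1]], dtype=np.float32) / 16.0

# Sharpen: boosts the center pixel relative to its neighbors.
sharpen = np.array([[ 0, -1,  0],
                    [-1,  5, -1],
                    [ 0, -1,  0]], dtype=np.float32)
```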
2️⃣ Applying the Filter to the Image (Convolution Operation) 
To apply the filter, it is slid across the entire image; at each position, the pixel values it covers are multiplied by the filter's coefficients and summed into a single output value. Each resulting value therefore depends on the applied filter: an edge detection filter will highlight the edges, while a blurring filter will smooth out fine details. The image obtained after convolution is called the Feature Map. It highlights essential information while eliminating irrelevant details.
Mathematical formula for convolution
The convolution of an image $I$ with a filter $K$ is expressed as:

$$S(i, j) = (I * K)(i, j) = \sum_{m} \sum_{n} I(i + m,\, j + n)\, K(m, n)$$

where $S(i, j)$ is the value of the feature map at position $(i, j)$, and the sums run over the rows $m$ and columns $n$ of the filter $K$.
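A direct, unoptimized translation of this formula into NumPy might look like the following sketch (single-channel image, no padding or stride):

```python
import numpy as np

def convolve2d(image: np.ndarray, kernel: np.ndarray) -> np.ndarray:
    """Apply the formula above at every valid position of the image."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    feature_map = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Sum of element-wise products between the kernel
            # and the image patch it currently covers.
            feature_map[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return feature_map

image = np.random.rand(5, 5)
edge_kernel = np.array([[-1, 0, 1],
                        [-2, 0, 2],
                        [-1, 0, 1]])
print(convolve2d(image, edge_kernel).shape)  # (3, 3)
```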
After the convolution stage, where filters are applied to extract specific features from the image, the network generates a feature map. This map highlights relevant elements while eliminating unnecessary details. However, convolution alone has limitations. It can detect simple features, like edges, textures, or patterns, but it cannot model complex relationships or understand non-linearities in the data.
To address this limitation, an activation function is applied to each value in the feature map. This function introduces non-linearity into the model, which is essential for enabling the network to learn complex relationships and perform tasks like classification or object detection.
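For example, one widely used activation function, ReLU, simply replaces every negative value in the feature map with zero. A sketch with made-up values:

```python
import numpy as np

def relu(x: np.ndarray) -> np.ndarray:
    # ReLU keeps positive values and zeroes out negative ones:
    # a simple, widely used source of non-linearity.
    return np.maximum(0, x)

feature_map = np.array([[-2.0, 1.5],
                        [ 0.3, -0.7]])
print(relu(feature_map))
# [[0.  1.5]
#  [0.3 0. ]]
```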
Convolution is a powerful operation for feature extraction, but it is inherently linear. A linear operation $f$ satisfies two properties: additivity, $f(x + y) = f(x) + f(y)$, and homogeneity, $f(\alpha x) = \alpha f(x)$.
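We can check this numerically (a small assumed experiment on 1-D signals): additivity holds for convolution but fails for an activation such as ReLU.

```python
import numpy as np

kernel = np.array([1.0, -1.0])               # a tiny 1-D "edge" filter
a = np.array([-1.0, 2.0, 0.5])
b = np.array([ 3.0, -4.0, 1.0])

# Additivity holds for convolution: filtering a + b equals
# filtering each signal separately and adding the results.
lhs = np.convolve(a + b, kernel, mode="valid")
rhs = np.convolve(a, kernel, mode="valid") + np.convolve(b, kernel, mode="valid")
print(np.allclose(lhs, rhs))                 # True

# It fails for ReLU, which is why ReLU adds genuine non-linearity.
relu = lambda x: np.maximum(0, x)
print(np.allclose(relu(a + b), relu(a) + relu(b)))  # False
```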
While convolution can detect simple relationships, such as brightness differences between pixels, it cannot handle complex or non-linear relationships, which are necessary for tasks like recognizing objects in varying conditions.
Consider the example of recognizing a cat in an image. Convolution alone can detect basic features, such as edges (like the cat’s ears) or repetitive patterns (such as fur texture). However, these features are insufficient for recognizing a cat, especially in more challenging scenarios, such as when the cat appears at a different angle, under different lighting, or partially hidden behind another object.
In these cases, convolution struggles to combine simple features and understand complex relationships, such as “these contours form a cat’s face.” This is where activation functions come in.
Activation functions introduce non-linearity into the network, enabling it to model complex relationships. This is critical for tasks that involve more than simple linear transformations, such as recognizing objects that change in size, orientation, brightness, or position.
In a convolutional network, each layer progressively learns to detect more abstract features: the first layers respond to simple patterns such as edges and textures, intermediate layers combine them into shapes and object parts, and the deepest layers recognize entire objects.
Without non-linearity, this hierarchy of abstraction would not be possible, as each layer would merely perform a linear combination of data from the previous layer.
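This collapse is easy to verify: composing two linear layers (here, plain matrix multiplications in NumPy) is always equivalent to a single linear layer, so stacking them adds no expressive power without an activation in between. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8))   # first linear layer
W2 = rng.normal(size=(3, 4))   # second linear layer
x = rng.normal(size=8)         # input vector

# Two stacked linear layers...
deep = W2 @ (W1 @ x)
# ...are exactly one linear layer with weights W2 @ W1.
shallow = (W2 @ W1) @ x
print(np.allclose(deep, shallow))  # True: depth alone adds nothing
```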
Now, let’s talk about the different types of activation functions in Part 2! 👉