A convolutional neural network, or CNN, is a neural network that looks for small local patterns and then combines them into larger ones. In images, early layers often detect edges or corners, middle layers detect textures or parts, and deeper layers use those signals to support a final prediction.
The key idea is weight sharing. Instead of learning a separate weight for every pixel-location pair, a CNN reuses the same small filter across many positions. That makes it much cheaper than a dense layer on the raw image and helps it detect the same kind of pattern in more than one place.
What a convolutional neural network does
In a fully connected layer, each output can depend on every input value at once. A CNN is more structured. It uses small kernels, often called filters, that look at one local patch at a time.
For a single-channel input $X$ and a $k \times k$ kernel $K$, one output entry can be written as

$$Y[i, j] = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} X[i+m,\ j+n] \cdot K[m, n]$$
This is the local weighted-sum idea behind a convolutional layer. In many machine learning libraries, the implemented operation is technically cross-correlation rather than a flipped mathematical convolution, but the practical intuition is the same: the kernel scans across the input and produces a feature map.
The feature map tells you where the learned pattern appears strongly.
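As a concrete illustration, here is a minimal NumPy sketch of that valid-mode scan. The helper name `cross_correlate2d` is ours, not a library function:

```python
import numpy as np

def cross_correlate2d(x: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Valid-mode 2D cross-correlation: slide k over x with stride 1, no padding."""
    kh, kw = k.shape
    out_h = x.shape[0] - kh + 1
    out_w = x.shape[1] - kw + 1
    y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Local weighted sum over one patch -- the formula above.
            y[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return y
```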
Why shared filters help
If the same vertical edge appears near the top-left corner of an image or near the center, we usually want the model to notice it either way. A CNN supports that by reusing the same filter parameters across positions.
This has two practical effects:
- It reduces the number of learned parameters compared with a dense layer on the raw image.
- It encourages the network to detect recurring local patterns rather than memorize one fixed location.
That reuse is one reason CNNs became effective for image tasks.
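To make the parameter saving concrete, here is a quick back-of-the-envelope comparison. The image size, filter count, and kernel size are illustrative assumptions, not values from the text above:

```python
# Dense layer on a flattened 32x32 grayscale image with 100 output units:
dense_params = (32 * 32) * 100 + 100   # weights + biases = 102,500

# Conv layer with 100 filters of size 3x3 on the same single-channel image:
conv_params = (3 * 3 * 1) * 100 + 100  # weights + biases = 1,000

print(dense_params, conv_params)       # 102500 1000
```

The dense layer's cost scales with the full image area, while the conv layer's cost scales only with the kernel size and filter count.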
What a basic CNN architecture looks like
A basic CNN often follows this pattern:
- convolution layer
- activation such as ReLU
- optional pooling or downsampling
- more convolution blocks
- final prediction layer
Early layers usually capture simple local structure. Deeper layers combine those responses into larger, more task-specific features.
Pooling is not mandatory, but when it is used, it shrinks the spatial dimensions so later layers can work with a more compact representation. A common example is max pooling, which keeps the largest value in each small region.
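One possible sketch of that pattern in PyTorch, with illustrative layer sizes rather than a recommended architecture:

```python
import torch.nn as nn

# convolution -> activation -> pooling, repeated, then a final prediction layer
model = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=8, kernel_size=3),  # local filters
    nn.ReLU(),                                                # activation
    nn.MaxPool2d(kernel_size=2),                              # downsampling
    nn.Conv2d(8, 16, kernel_size=3),                          # deeper block
    nn.ReLU(),
    nn.Flatten(),
    nn.LazyLinear(10),                                        # final prediction layer
)
```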
If stride is $s$ and padding is $p$, then an $n \times n$ input with a $k \times k$ kernel produces an output of size

$$\left\lfloor \frac{n + 2p - k}{s} \right\rfloor + 1$$

along each spatial dimension. That size rule is useful when you check whether a worked example makes sense.
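The rule is easy to turn into a one-line checker (the helper name is ours):

```python
def conv_output_size(n: int, k: int, s: int = 1, p: int = 0) -> int:
    """Output side length for an n x n input, k x k kernel, stride s, padding p."""
    return (n + 2 * p - k) // s + 1

print(conv_output_size(4, 2))             # 3, matching the worked example below
print(conv_output_size(32, 3, s=2, p=1))  # 16
```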
Worked example: how a CNN feature map is created
Take this $4 \times 4$ input image:

$$X = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \end{bmatrix}$$

Use this $2 \times 2$ kernel:

$$K = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}$$

Assume stride $1$ and no padding. Since the input is $4 \times 4$ and the kernel is $2 \times 2$, the output must be $3 \times 3$. Each output entry is the sum of one $2 \times 2$ patch because every kernel entry equals $1$.
The top-left output value is

$$Y[0, 0] = 1 + 1 + 1 + 1 = 4$$

The patch one step to the right is

$$\begin{bmatrix} 1 & 0 \\ 1 & 0 \end{bmatrix}$$

so

$$Y[0, 1] = 1 + 0 + 1 + 0 = 2$$

Working through all valid positions gives

$$Y = \begin{bmatrix} 4 & 2 & 0 \\ 2 & 2 & 2 \\ 0 & 2 & 4 \end{bmatrix}$$
This output is the feature map. Large values show where the kernel found a strong match. Here the filter responds most strongly where a full bright block appears.
If you now apply ReLU, nothing changes because all entries are already nonnegative. If you then use $2 \times 2$ max pooling with stride $1$, the pooled output becomes

$$\begin{bmatrix} 4 & 2 \\ 2 & 4 \end{bmatrix}$$
That does not create new information. It keeps the strongest nearby responses and reduces the spatial grid.
This example is simple, but it shows the core mechanism clearly: a filter slides, computes local weighted sums, and creates a map of where a pattern appears.
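If you would rather not check the arithmetic by hand, SciPy's `correlate2d` performs the same valid-mode scan. Here is a short verification sketch (SciPy is an extra dependency we are assuming, not something the example requires):

```python
import numpy as np
from scipy.signal import correlate2d

x = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])
k = np.ones((2, 2), dtype=int)

y = correlate2d(x, k, mode="valid")  # the 3x3 feature map
print(y)                             # [[4 2 0] [2 2 2] [0 2 4]]

relu = np.maximum(y, 0)              # unchanged: everything is already nonnegative

# 2x2 max pooling with stride 1 over the feature map
pooled = np.array([[relu[i:i + 2, j:j + 2].max() for j in range(2)]
                   for i in range(2)])
print(pooled)                        # [[4 2] [2 4]]
```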
What a CNN learns during training
The kernel above was chosen by hand, but in a real CNN the filter values are learned from data. Training adjusts those values so the resulting feature maps become useful for the task.
If the task is image classification, the network learns filters that help separate classes. If the task is segmentation or detection, the later layers are trained for those outputs instead. The basic mechanism is the same: forward pass, loss, backpropagation, parameter update.
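A minimal PyTorch sketch of one training step shows those four stages in order. The model, shapes, and data here are made-up assumptions for illustration:

```python
import torch
import torch.nn as nn

# Tiny model and fake data; all shapes are illustrative assumptions.
model = nn.Sequential(
    nn.Conv2d(1, 8, kernel_size=3),  # 1x28x28 -> 8x26x26
    nn.ReLU(),
    nn.Flatten(),
    nn.Linear(8 * 26 * 26, 10),      # 10 classes
)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

images = torch.randn(16, 1, 28, 28)   # fake batch of 16 grayscale images
labels = torch.randint(0, 10, (16,))  # fake integer class labels

logits = model(images)          # forward pass
loss = loss_fn(logits, labels)  # loss
optimizer.zero_grad()
loss.backward()                 # backpropagation
optimizer.step()                # parameter update
```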
Common mistakes when learning CNNs
Thinking a CNN just means "image classifier"
Images are the standard example, but CNNs are really about local structure and shared filters. If nearby values matter, the same idea can be useful beyond images.
Assuming pooling is always required
It is common, not universal. Some architectures reduce spatial size with strided convolutions instead, and some keep more spatial detail for longer.
Ignoring stride and padding
Feature-map size depends on these choices. If you change stride or padding, you change not just the shape of the output, but also which local neighborhoods each unit can see.
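A quick check in PyTorch makes this concrete; the same $8 \times 8$ input (an assumed size) produces three different output shapes as stride and padding change:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 1, 8, 8)  # one 8x8 single-channel image

print(nn.Conv2d(1, 1, 3)(x).shape)                       # [1, 1, 6, 6]
print(nn.Conv2d(1, 1, 3, padding=1)(x).shape)            # [1, 1, 8, 8]
print(nn.Conv2d(1, 1, 3, stride=2, padding=1)(x).shape)  # [1, 1, 4, 4]
```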
Treating the layer as only a formula
The convolution formula matters, but the architecture matters too. A CNN works because convolution, activation, stacking, and training all work together.
When convolutional neural networks are useful
CNNs are widely used in computer vision tasks such as image classification, object detection, and segmentation. They also appear in some signal-processing and sequence settings where local patterns are meaningful.
They are especially useful when the input has an obvious grid or ordered neighborhood structure. If that condition is weak, a different architecture may be a better fit.
A mental model that makes CNNs easier to understand
Think of a CNN as a pattern detector that starts small and grows more abstract with depth. One layer asks, "Does this small pattern appear here?" Later layers ask, "Do these simpler patterns combine into something more meaningful?"
That is why CNNs are easier to understand when you focus on feature maps, not just on the word "convolution."
Try your own version
Keep the same input, but change the kernel to a signed one, such as

$$K' = \begin{bmatrix} 1 & -1 \\ 1 & -1 \end{bmatrix}$$
Recompute the feature map and see which regions now produce large positive or negative responses. That small change makes it much clearer how different filters detect different patterns.
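One way to recompute it, again leaning on SciPy's `correlate2d` for the valid-mode scan:

```python
import numpy as np
from scipy.signal import correlate2d

x = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]])
k = np.array([[1, -1],
              [1, -1]])

print(correlate2d(x, k, mode="valid"))
# [[ 0  2  0]
#  [ 0  0  0]
#  [ 0 -2  0]]
```

The positive response marks a bright-to-dark vertical transition and the negative response marks the opposite, which is exactly the signed-edge behavior the kernel encodes.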