A convolutional neural network, or CNN, is a neural network that looks for small local patterns and then combines them into larger ones. In images, early layers often detect edges or corners, middle layers detect textures or parts, and deeper layers use those signals to support a final prediction.

The key idea is weight sharing. Instead of learning a separate weight for every pixel-location pair, a CNN reuses the same small filter across many positions. That makes it much cheaper than a dense layer on the raw image and helps it detect the same kind of pattern in more than one place.

What a convolutional neural network does

In a fully connected layer, each output can depend on every input value at once. A CNN is more structured. It uses small kernels, often called filters, that look at one local patch at a time.

For a single-channel input $x$ and a $k \times k$ kernel $K$, one output entry can be written as

$$y_{i,j} = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} K_{m,n} x_{i+m,j+n}.$$

This is the local weighted-sum idea behind a convolutional layer. In many machine learning libraries, the implemented operation is technically cross-correlation rather than a flipped mathematical convolution, but the practical intuition is the same: the kernel scans across the input and produces a feature map.

The feature map tells you where the learned pattern appears strongly.
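The scan described above can be written out directly. Here is a minimal NumPy sketch of the operation; the function name `cross_correlate` and the sample input are illustrative, not taken from any particular library:

```python
import numpy as np

def cross_correlate(x, k):
    """Slide kernel k over single-channel input x (stride 1, no padding)
    and return the feature map of local weighted sums."""
    rows = x.shape[0] - k.shape[0] + 1
    cols = x.shape[1] - k.shape[1] + 1
    y = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            # One output entry: the k x k patch weighted by the kernel
            y[i, j] = np.sum(k * x[i:i + k.shape[0], j:j + k.shape[1]])
    return y

# A kernel that responds to bright-left / dark-right patches
x = np.array([[5., 5., 0., 0.],
              [5., 5., 0., 0.],
              [5., 5., 0., 0.]])
k = np.array([[1., -1.],
              [1., -1.]])
print(cross_correlate(x, k))
# [[ 0. 10.  0.]
#  [ 0. 10.  0.]]
```

The large values sit exactly where the bright-to-dark transition occurs, which is the feature-map idea in miniature.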

Why shared filters help

If the same vertical edge appears near the top-left corner of an image or near the center, we usually want the model to notice it either way. A CNN supports that by reusing the same filter parameters across positions.

This has two practical effects:

  • It reduces the number of learned parameters compared with a dense layer on the raw image.
  • It encourages the network to detect recurring local patterns rather than memorize one fixed location.

That reuse is one reason CNNs became effective for image tasks.
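The parameter saving is easy to quantify. The sizes below are illustrative assumptions (a $32 \times 32$ grayscale input, a dense layer producing an equally sized output, and a single $3 \times 3$ filter):

```python
# Parameter counts for a 32x32 grayscale image (illustrative sizes).
height, width = 32, 32

# Dense layer mapping the flattened image to an equally sized output:
# one weight per (input pixel, output unit) pair, plus one bias per output.
dense_params = (height * width) * (height * width) + (height * width)

# One 3x3 convolutional filter shared across every position, plus one bias.
conv_params = 3 * 3 + 1

print(dense_params)  # 1049600
print(conv_params)   # 10
```

A real convolutional layer uses many filters across many channels, but the per-filter cost stays tiny because the same nine weights are reused at every position.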

What a basic CNN architecture looks like

A basic CNN often follows this pattern:

  1. convolution layer
  2. activation such as ReLU
  3. optional pooling or downsampling
  4. more convolution blocks
  5. final prediction layer

Early layers usually capture simple local structure. Deeper layers combine those responses into larger, more task-specific features.

Pooling is not mandatory, but when it is used, it shrinks the spatial dimensions so later layers can work with a more compact representation. A common example is max pooling, which keeps the largest value in each small region.

If stride is $1$ and padding is $0$, then an $n \times n$ input with a $k \times k$ kernel produces an $(n-k+1) \times (n-k+1)$ output. That size rule is useful when you check whether a worked example makes sense.
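The rule is easy to encode; a tiny helper (name illustrative) makes the check mechanical:

```python
def valid_output_size(n, k):
    """Output size for an n x n input and k x k kernel, stride 1, no padding."""
    return n - k + 1

print(valid_output_size(4, 2))   # 3
print(valid_output_size(32, 5))  # 28
```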

Worked example: how a CNN feature map is created

Take this $4 \times 4$ input image:

X=[3300330000330033]X = \begin{bmatrix} 3 & 3 & 0 & 0 \\ 3 & 3 & 0 & 0 \\ 0 & 0 & 3 & 3 \\ 0 & 0 & 3 & 3 \end{bmatrix}

Use this $2 \times 2$ kernel:

$$K = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}$$

Assume stride $1$ and no padding. Since the input is $4 \times 4$ and the kernel is $2 \times 2$, the output must be $3 \times 3$. Each output entry is the sum of one $2 \times 2$ patch because every kernel entry equals $1$.

The top-left output value is

$$y_{1,1} = 3(1) + 3(1) + 3(1) + 3(1) = 12.$$

The patch one step to the right is

$$\begin{bmatrix} 3 & 0 \\ 3 & 0 \end{bmatrix},$$

so

$$y_{1,2} = 3 + 0 + 3 + 0 = 6.$$

Working through all valid positions gives

$$Y = \begin{bmatrix} 12 & 6 & 0 \\ 6 & 6 & 6 \\ 0 & 6 & 12 \end{bmatrix}.$$

This output is the feature map. Large values show where the kernel found a strong match. Here the filter responds most strongly where a full $2 \times 2$ bright block appears.

If you now apply ReLU, nothing changes because all entries are already nonnegative. If you then use $2 \times 2$ max pooling with stride $1$, the pooled output becomes

$$\begin{bmatrix} 12 & 6 \\ 6 & 12 \end{bmatrix}.$$

That does not create new information. It keeps the strongest nearby responses and reduces the spatial grid.
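The whole worked example, including the ReLU and pooling steps, can be checked with a short NumPy sketch; the helper names are illustrative:

```python
import numpy as np

def cross_correlate(x, k):
    """Stride-1, no-padding sliding weighted sum (the conv-layer operation)."""
    rows = x.shape[0] - k.shape[0] + 1
    cols = x.shape[1] - k.shape[1] + 1
    y = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            y[i, j] = np.sum(k * x[i:i + k.shape[0], j:j + k.shape[1]])
    return y

def max_pool(y, size=2, stride=1):
    """Keep the largest value in each size x size window."""
    rows = (y.shape[0] - size) // stride + 1
    cols = (y.shape[1] - size) // stride + 1
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            out[i, j] = np.max(y[i * stride:i * stride + size,
                                 j * stride:j * stride + size])
    return out

X = np.array([[3, 3, 0, 0],
              [3, 3, 0, 0],
              [0, 0, 3, 3],
              [0, 0, 3, 3]], dtype=float)
K = np.ones((2, 2))

Y = cross_correlate(X, K)
relu = np.maximum(Y, 0)  # no change here: Y is already nonnegative
pooled = max_pool(relu)

print(Y)       # the 3x3 feature map from the worked example
print(pooled)  # the 2x2 pooled output
```

Running this reproduces the $3 \times 3$ feature map and the $2 \times 2$ pooled grid computed by hand above.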

This example is simple, but it shows the core mechanism clearly: a filter slides, computes local weighted sums, and creates a map of where a pattern appears.

What a CNN learns during training

The kernel above was chosen by hand, but in a real CNN the filter values are learned from data. Training adjusts those values so the resulting feature maps become useful for the task.

If the task is image classification, the network learns filters that help separate classes. If the task is segmentation or detection, the later layers are trained for those outputs instead. The basic mechanism is the same: forward pass, loss, backpropagation, parameter update.

Common mistakes when learning CNNs

Thinking a CNN just means "image classifier"

Images are the standard example, but CNNs are really about local structure and shared filters. If nearby values matter, the same idea can be useful beyond images.

Assuming pooling is always required

It is common, not universal. Some architectures reduce spatial size with strided convolutions instead, and some keep more spatial detail for longer.

Ignoring stride and padding

Feature-map size depends on these choices. If you change stride or padding, you change not just the shape of the output, but also which local neighborhoods each unit can see.
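The general size rule covers all three choices at once. A small helper (name illustrative) shows how stride and padding interact, assuming a $28 \times 28$ input and a $3 \times 3$ kernel as example values:

```python
def conv_output_size(n, k, stride=1, padding=0):
    """General spatial-size rule: floor((n + 2*padding - k) / stride) + 1."""
    return (n + 2 * padding - k) // stride + 1

# Same 28x28 input and 3x3 kernel, different stride/padding choices:
print(conv_output_size(28, 3))                       # 26 (shrinks)
print(conv_output_size(28, 3, padding=1))            # 28 ("same" size)
print(conv_output_size(28, 3, stride=2, padding=1))  # 14 (downsamples)
```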

Treating the layer as only a formula

The convolution formula matters, but the architecture matters too. A CNN works because convolution, activation, stacking, and training all work together.

When convolutional neural networks are useful

CNNs are widely used in computer vision tasks such as image classification, object detection, and segmentation. They also appear in some signal-processing and sequence settings where local patterns are meaningful.

They are especially useful when the input has an obvious grid or ordered neighborhood structure. If that condition is weak, a different architecture may be a better fit.

A mental model that makes CNNs easier to understand

Think of a CNN as a pattern detector that starts small and grows more abstract with depth. One layer asks, "Does this small pattern appear here?" Later layers ask, "Do these simpler patterns combine into something more meaningful?"

That is why CNNs are easier to understand when you focus on feature maps, not just on the word "convolution."

Try your own version

Keep the same input, but change the kernel to

$$\begin{bmatrix} 1 & -1 \\ 1 & -1 \end{bmatrix}.$$

Recompute the feature map and see which regions now produce large positive or negative responses. That small change makes it much clearer how different filters detect different patterns.
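Once you have worked it out by hand, a few lines of NumPy will check your answer. This sketch prints the feature map at runtime, so compute yours first:

```python
import numpy as np

X = np.array([[3, 3, 0, 0],
              [3, 3, 0, 0],
              [0, 0, 3, 3],
              [0, 0, 3, 3]], dtype=float)
K = np.array([[1, -1],
              [1, -1]], dtype=float)

# Stride 1, no padding: slide the 2x2 kernel over every valid position.
rows = X.shape[0] - K.shape[0] + 1
cols = X.shape[1] - K.shape[1] + 1
Y = np.array([[np.sum(K * X[i:i + 2, j:j + 2]) for j in range(cols)]
              for i in range(rows)])

# Compare this against your hand computation:
print(Y)
```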
