A convolutional neural network, or CNN, is a neural network that looks for small local patterns and then combines them into larger ones. The right time to reach for one is when your input has an obvious grid or ordered-neighborhood structure, such as the pixels in an image, where nearby values clearly belong together.

In images, early layers often detect edges or corners, middle layers detect textures or parts, and deeper layers use those signals to support a final prediction. The key idea is weight sharing. Instead of learning a separate weight for every pairing of input pixel and output position, as a fully connected layer would, a CNN reuses the same small filter across many positions.

The Procedure: From Local Patches To A Prediction

A CNN is best understood as a fixed sequence of moves applied to structured input.

  1. Start with local structure. Use an input where nearby values matter, such as pixels in an image.
  2. Slide learned filters. Apply small kernels across the input so each output value becomes a local weighted sum.
  3. Add nonlinearity. Pass the feature maps through an activation such as ReLU so stacked layers can learn richer patterns.
  4. Reduce spatial size when useful. Use pooling or stride if you want later layers to focus on larger patterns with fewer positions.
  5. Predict from learned features. Feed the deeper features into a final layer that outputs a class, score, or other prediction.

The sliding-filter step is the heart of it. For a single-channel input $x$ and a $k \times k$ kernel $K$, one output entry can be written as

$$y_{i,j} = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} K_{m,n} \, x_{i+m,j+n}.$$

In many machine learning libraries, the implemented operation is technically cross-correlation rather than a flipped mathematical convolution, but the practical intuition is the same: the kernel scans across the input and produces a feature map that tells you where the learned pattern appears strongly.
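To make the distinction concrete, here is a minimal pure-Python sketch of both operations for stride 1 and no padding. The function names `cross_correlate2d` and `convolve2d` and the small test input are illustrative choices, not part of any particular library.

```python
def cross_correlate2d(x, kernel):
    """Slide the kernel over x (stride 1, no padding) and take weighted sums."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(x) - k_rows + 1
    out_cols = len(x[0]) - k_cols + 1
    return [
        [
            sum(
                kernel[m][n] * x[i + m][j + n]
                for m in range(k_rows)
                for n in range(k_cols)
            )
            for j in range(out_cols)
        ]
        for i in range(out_rows)
    ]


def convolve2d(x, kernel):
    """Mathematical convolution: flip the kernel on both axes, then correlate."""
    flipped = [row[::-1] for row in kernel[::-1]]
    return cross_correlate2d(x, flipped)


# Hypothetical input and an asymmetric kernel, so the flip is visible.
x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]

corr = cross_correlate2d(x, kernel)
conv = convolve2d(x, kernel)
print(corr)  # [[-4, -4], [-4, -4]]
print(conv)  # [[4, 4], [4, 4]]
```

For a symmetric kernel the two results coincide. Because the filter values are learned, the network can absorb the flip either way.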

If the same vertical edge appears near the top-left corner or near the center, we usually want the model to notice it either way. Reusing the same filter parameters across positions reduces the number of learned parameters compared with a dense layer, and encourages the network to detect recurring local patterns rather than memorize one fixed location.
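The parameter saving is easy to check with arithmetic. The sketch below counts weights only, assuming a single input channel, a single output channel, and no bias terms; the helper names are hypothetical.

```python
def dense_weight_count(in_h, in_w, out_h, out_w):
    """A fully connected layer needs one weight per (input pixel, output unit) pair."""
    return (in_h * in_w) * (out_h * out_w)


def conv_weight_count(k_h, k_w):
    """A single shared filter needs one weight per kernel entry, whatever the input size."""
    return k_h * k_w


# Mapping a 4x4 input to a 3x3 output:
dense_weights = dense_weight_count(4, 4, 3, 3)
conv_weights = conv_weight_count(2, 2)
print(dense_weights)  # 144
print(conv_weights)   # 4
```

The dense count grows with the image size, while the convolutional count depends only on the kernel size.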

A common stacking pattern is: convolution, ReLU, optional pooling, more convolution blocks, then a final prediction layer. Pooling is not mandatory, but when used it shrinks the spatial dimensions; a common example is max pooling, which keeps the largest value in each small region.

Running The Whole Procedure On One Example

Take this $4 \times 4$ input image:

$$X = \begin{bmatrix} 3 & 3 & 0 & 0 \\ 3 & 3 & 0 & 0 \\ 0 & 0 & 3 & 3 \\ 0 & 0 & 3 & 3 \end{bmatrix}$$

Use this $2 \times 2$ kernel:

$$K = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix}$$

Assume stride $1$ and no padding. Under those settings, an $n \times n$ input with a $k \times k$ kernel produces an $(n-k+1) \times (n-k+1)$ output. Since the input is $4 \times 4$ and the kernel is $2 \times 2$, the output must be $3 \times 3$. Each output entry is the sum of one $2 \times 2$ patch because every kernel entry equals $1$.

The top-left output value is

$$y_{1,1} = 3(1) + 3(1) + 3(1) + 3(1) = 12.$$

The patch one step to the right is

$$\begin{bmatrix} 3 & 0 \\ 3 & 0 \end{bmatrix},$$

so

$$y_{1,2} = 3 + 0 + 3 + 0 = 6.$$

Working through all valid positions gives

$$Y = \begin{bmatrix} 12 & 6 & 0 \\ 6 & 6 & 6 \\ 0 & 6 & 12 \end{bmatrix}.$$
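The hand computation can be checked with a few lines of pure Python. This is a sketch under the same assumptions as the worked example (stride 1, no padding, single channel); the helper name `cross_correlate2d` is an illustrative choice.

```python
def cross_correlate2d(x, kernel):
    """Slide the kernel over x (stride 1, no padding) and take weighted sums."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(x) - k_rows + 1
    out_cols = len(x[0]) - k_cols + 1
    return [
        [
            sum(
                kernel[m][n] * x[i + m][j + n]
                for m in range(k_rows)
                for n in range(k_cols)
            )
            for j in range(out_cols)
        ]
        for i in range(out_rows)
    ]


X = [[3, 3, 0, 0],
     [3, 3, 0, 0],
     [0, 0, 3, 3],
     [0, 0, 3, 3]]
K = [[1, 1],
     [1, 1]]

Y = cross_correlate2d(X, K)
for row in Y:
    print(row)
# [12, 6, 0]
# [6, 6, 6]
# [0, 6, 12]
```

Note that Python indexes from 0, so the entry written as the top-left output value in the text is `Y[0][0]` here.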

This output is the feature map. Large values show where the kernel found a strong match, here a full $2 \times 2$ bright block.

If you now apply ReLU, nothing changes because all entries are already nonnegative. If you then use $2 \times 2$ max pooling with stride $1$, the pooled output becomes

$$\begin{bmatrix} 12 & 6 \\ 6 & 12 \end{bmatrix}.$$
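The activation and pooling steps can be sketched the same way. The functions `relu` and `max_pool2d` below are hypothetical helpers written for this example, starting from the feature map computed above.

```python
def relu(feature_map):
    """Replace every negative entry with 0; leave the rest unchanged."""
    return [[max(0, value) for value in row] for row in feature_map]


def max_pool2d(feature_map, size=2, stride=1):
    """Keep the largest value in each size x size window."""
    rows = (len(feature_map) - size) // stride + 1
    cols = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(
                feature_map[i * stride + m][j * stride + n]
                for m in range(size)
                for n in range(size)
            )
            for j in range(cols)
        ]
        for i in range(rows)
    ]


Y = [[12, 6, 0],
     [6, 6, 6],
     [0, 6, 12]]

activated = relu(Y)                                # unchanged: no negative entries
pooled = max_pool2d(activated, size=2, stride=1)
print(pooled)  # [[12, 6], [6, 12]]
```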

Pooling does not create new information. It keeps the strongest nearby responses and reduces the spatial grid. The kernel above was chosen by hand, but in a real CNN the filter values are learned from data: training repeats a cycle of forward pass, loss computation, backpropagation, and parameter update, adjusting the filters so the feature maps become useful for the task.
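The training cycle can be illustrated on a toy problem: start from an all-zero kernel and let gradient descent recover the all-ones kernel from the worked example. Everything specific here is an assumption made for illustration: the mean squared error loss, the learning rate, the step count, and the use of the hand-computed feature map as the training target. A real CNN would rely on a framework's automatic differentiation rather than a hand-derived gradient.

```python
def cross_correlate2d(x, kernel):
    """Slide the kernel over x (stride 1, no padding) and take weighted sums."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(x) - k_rows + 1
    out_cols = len(x[0]) - k_cols + 1
    return [
        [
            sum(
                kernel[m][n] * x[i + m][j + n]
                for m in range(k_rows)
                for n in range(k_cols)
            )
            for j in range(out_cols)
        ]
        for i in range(out_rows)
    ]


X = [[3, 3, 0, 0],
     [3, 3, 0, 0],
     [0, 0, 3, 3],
     [0, 0, 3, 3]]
target = [[12, 6, 0],
          [6, 6, 6],
          [0, 6, 12]]

kernel = [[0.0, 0.0], [0.0, 0.0]]   # start with no knowledge of the pattern
learning_rate = 0.05                # assumed value, small enough to be stable here
num_outputs = 9                     # entries in the 3x3 feature map

for step in range(500):
    # Forward pass: compute the feature map with the current kernel.
    pred = cross_correlate2d(X, kernel)

    # Loss: mean squared error against the target feature map.
    errors = [[pred[i][j] - target[i][j] for j in range(3)] for i in range(3)]
    loss = sum(e * e for row in errors for e in row) / num_outputs

    # Backpropagation: dL/dK[m][n] = (2/N) * sum over (i, j) of error * input.
    grads = [
        [
            2.0 / num_outputs * sum(
                errors[i][j] * X[i + m][j + n]
                for i in range(3)
                for j in range(3)
            )
            for n in range(2)
        ]
        for m in range(2)
    ]

    # Parameter update: step against the gradient.
    kernel = [
        [kernel[m][n] - learning_rate * grads[m][n] for n in range(2)]
        for m in range(2)
    ]

print([[round(w, 4) for w in row] for row in kernel])  # [[1.0, 1.0], [1.0, 1.0]]
```

Because every output position shares the same four weights, each weight's gradient sums contributions from all nine positions.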

Where Students Get Stuck, And How To Check Each Step

"A CNN just means image classifier"

Images are the standard example, but CNNs are really about local structure and shared filters. Self-check: if nearby values in your input carry meaning, the same idea can apply beyond images.

Assuming pooling is always required

It is common, not universal. Some architectures reduce spatial size with strided convolutions instead. Self-check: ask whether you actually need a smaller grid before adding pooling.

Ignoring stride and padding

Feature-map size depends on these choices. Self-check: before computing, predict the output shape and confirm your worked numbers match it. With stride $1$ and no padding, use the $(n-k+1)$ rule; any other stride or padding changes the result.
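For settings other than stride 1 and no padding, the output length along one dimension is commonly computed as $\lfloor (n + 2p - k)/s \rfloor + 1$ for stride $s$ and padding $p$ on each side. The sketch below assumes that convention, with symmetric padding and no dilation; `conv_output_size` is a hypothetical helper name.

```python
def conv_output_size(n, k, stride=1, padding=0):
    """Output length along one dimension for input length n and kernel length k."""
    return (n + 2 * padding - k) // stride + 1


print(conv_output_size(4, 2))              # 3  (matches the n - k + 1 rule)
print(conv_output_size(4, 2, stride=2))    # 2  (larger stride, fewer positions)
print(conv_output_size(4, 3, padding=1))   # 4  (padding preserves the size)
```

The same function covers pooling windows: a window of 2 with stride 1 on a length-3 feature map gives `conv_output_size(3, 2)`, which is 2.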

Treating the layer as only a formula

The convolution formula matters, but a CNN works because convolution, activation, stacking, and training all work together. Self-check: make sure you can name what each of the five procedure steps contributes.

When Convolutional Neural Networks Are Useful

CNNs are widely used in computer vision tasks such as image classification, object detection, and segmentation. They also appear in some signal-processing and sequence settings where local patterns are meaningful. They are especially useful when the input has an obvious grid or ordered neighborhood structure; if that condition is weak, a different architecture may be a better fit.

To feel the mechanism, keep the same $4 \times 4$ input but switch the kernel to

$$\begin{bmatrix} 1 & -1 \\ 1 & -1 \end{bmatrix},$$

then recompute the feature map. Watching which regions now produce large positive or negative responses makes it concrete that different filters detect different patterns.
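Once you have worked the exercise by hand, a short script can confirm your answer. This is a sketch assuming stride 1 and no padding, as in the earlier example; the helper name `cross_correlate2d` is illustrative, and the output is deliberately not shown so the exercise stays intact.

```python
def cross_correlate2d(x, kernel):
    """Slide the kernel over x (stride 1, no padding) and take weighted sums."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(x) - k_rows + 1
    out_cols = len(x[0]) - k_cols + 1
    return [
        [
            sum(
                kernel[m][n] * x[i + m][j + n]
                for m in range(k_rows)
                for n in range(k_cols)
            )
            for j in range(out_cols)
        ]
        for i in range(out_rows)
    ]


X = [[3, 3, 0, 0],
     [3, 3, 0, 0],
     [0, 0, 3, 3],
     [0, 0, 3, 3]]
edge_kernel = [[1, -1],
               [1, -1]]   # left column minus right column

edge_map = cross_correlate2d(X, edge_kernel)
for row in edge_map:
    print(row)
```

Positive entries mark positions where the left column of the patch is brighter than the right, negative entries mark the reverse, and zeros mark patches whose two columns balance.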

Frequently Asked Questions

Is a CNN only for images?
No. CNNs are common for images, but the same local-filter idea can also be used on signals, audio, time series, and some text settings when nearby patterns matter.

Does every CNN need pooling?
No. Many CNNs use pooling or strided convolutions to reduce spatial size, but pooling is not a requirement in every architecture.
