In our quest to build an accurate image emotion detection system, we explored various Convolutional Neural Network (CNN) architectures. After careful consideration, we opted for ResNet18, a powerful yet efficient model that stood out for its ability to tackle the challenges of image-based emotion recognition. This article delves into the inner workings of ResNet18, explores the reasons behind our choice, and sheds light on its advantages over other architectures.
Understanding Convolutional Neural Networks (CNNs) for Image Emotion Detection
CNNs are a class of deep learning models specifically designed for image recognition tasks. They excel at extracting features from images, making them ideal for applications like emotion detection from facial expressions. However, training deep CNNs often encounters the vanishing gradient problem, where gradients used to update model weights become infinitesimally small as they backpropagate through the network, hindering effective learning.
Introducing ResNet18: Overcoming the Vanishing Gradient Problem
ResNet (Residual Network) architectures were introduced to make very deep networks trainable, and the vanishing gradient problem is one of the main obstacles they address. ResNet18, the 18-layer variant of ResNet, is built around skip connections. Each connection bypasses a small group of layers (two convolutional layers in ResNet18) and adds the group's input directly to its output. Because the addition passes the gradient through unchanged, it creates a shortcut path along which the gradient can flow during backpropagation, so it retains sufficient magnitude for effective learning even in deeper networks.
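The effect of a skip connection on gradient magnitude can be illustrated with a deliberately simplified scalar model. The sketch below is not a real network: it assumes every layer is a single sigmoid unit evaluated at the same pre-activation, and the helper names `sigmoid_grad` and `chain_gradient` are hypothetical. It only shows why multiplying many small local gradients shrinks the signal, and why the "+1" contributed by the identity path prevents that.

```python
import math


def sigmoid_grad(z: float) -> float:
    """Derivative of the logistic sigmoid at z; its maximum is 0.25, at z = 0."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)


def chain_gradient(depth: int, weight: float = 1.0, z: float = 0.0,
                   residual: bool = False) -> float:
    """Gradient of the output with respect to the input of a scalar layer chain.

    Plain layer:    h_out = sigmoid(weight * h_in)
                    local gradient = weight * sigmoid'(z)
    Residual layer: h_out = h_in + sigmoid(weight * h_in)
                    local gradient = 1 + weight * sigmoid'(z)
    Simplifying assumption: every layer sees the same pre-activation z,
    so the chain rule reduces to the local gradient raised to `depth`.
    """
    local = weight * sigmoid_grad(z)
    if residual:
        local += 1.0  # contribution of the identity (skip) path
    return local ** depth


print(f"plain chain, 18 layers:    {chain_gradient(18):.3e}")                 # about 1.46e-11
print(f"residual chain, 18 layers: {chain_gradient(18, residual=True):.1f}")  # about 55.5
```

In this toy setting the residual gradient grows rather than merely surviving; in an actual ResNet, batch normalization and learned weights keep activations and gradients in a workable range. The point of the comparison is only that the identity path gives each layer a local gradient of at least 1 instead of at most 0.25.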
ResNet18 Architecture Breakdown

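The core building block of ResNet18 is the residual block, often called the basic block. It consists of two 3×3 convolutional layers, each followed by a batch normalization layer; a ReLU (Rectified Linear Unit) activation is applied after the first batch normalization and again after the addition. The input to the block is added to the output of the second batch normalization through a skip connection. When a block changes the spatial resolution or the number of channels, the input first passes through a 1×1 convolution so that the two shapes match. The three-convolution "bottleneck" block appears only in deeper variants such as ResNet50. This architecture allows the network to learn residual functions: each block learns a correction F(x) to its input x and outputs F(x) + x, rather than attempting to learn the entire mapping from scratch. ResNet18 stacks eight such blocks in four stages of two, between an initial 7×7 convolution and a final fully connected layer, which gives its 18 weighted layers.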
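A two-convolution residual block of the kind ResNet18 uses can be sketched in PyTorch as follows. This is a minimal illustration written for this article, not the code from our project or the torchvision source; the class name `BasicBlock` and the attribute name `shortcut` are our own choices, and the input size in the usage lines is arbitrary.

```python
import torch
from torch import nn


class BasicBlock(nn.Module):
    """Residual block with two 3x3 convolutions and a skip connection."""

    def __init__(self, in_channels: int, out_channels: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, out_channels, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(out_channels)
        self.conv2 = nn.Conv2d(out_channels, out_channels, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU()

        # Identity shortcut when shapes already match; otherwise a 1x1
        # convolution projects the input to the output shape.
        self.shortcut = nn.Identity()
        if stride != 1 or in_channels != out_channels:
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_channels, out_channels, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm2d(out_channels),
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + self.shortcut(x)  # skip connection: F(x) + x
        return self.relu(out)


# Usage: a block that halves the resolution and doubles the channels.
block = BasicBlock(64, 128, stride=2)
features = torch.randn(1, 64, 56, 56)
print(block(features).shape)  # torch.Size([1, 128, 28, 28])
```

Note that the final ReLU is applied after the addition, so the skip path itself carries the input through without a nonlinearity.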
Why We Chose ResNet18 for Image Emotion Detection
Several factors influenced our decision to utilize ResNet18 for our image emotion detection project:
- Addresses Vanishing Gradient Problem: As discussed earlier, ResNet18’s skip connections effectively mitigate the vanishing gradient problem, enabling successful training of deeper networks. This is crucial for capturing the intricate details of facial expressions that convey emotions.
- Balance Between Accuracy and Efficiency: Compared to deeper ResNet variants like ResNet50 or ResNet101, ResNet18 offers a good balance between accuracy and computational efficiency. In their standard ImageNet form, ResNet18 has about 11.7 million parameters, against roughly 25.6 million for ResNet50 and 44.5 million for ResNet101. This is particularly advantageous for real-world deployments where memory or compute is constrained.
- Transfer Learning Potential: Pre-trained ResNet18 models are readily available, allowing us to leverage their learned features for our emotion detection task. This approach significantly reduces training time and improves the model’s ability to generalize to unseen data.
ResNet18 vs. Other Architectures
While ResNet18 proved to be a compelling choice for our project, it’s essential to acknowledge other prominent CNN architectures:
- VGG Networks: VGG architectures, like VGG16, achieve high accuracy but require considerably more memory and computation. VGG16 is not deeper than ResNet18; the cost comes from its roughly 138 million parameters, most of them in its three fully connected layers. (You can find more details on VGG16 in this paper: Very Deep Convolutional Networks for Large-Scale Image Recognition by Karen Simonyan and Andrew Zisserman, ICLR 2015: https://ui.adsabs.harvard.edu/abs/2014arXiv1409.1556S/abstract)
- Inception Networks: Inception networks, such as InceptionV3, apply several filter sizes in parallel within each module and factorize large convolutions into smaller ones, but their architecture can be more complex to implement compared to ResNet. (You can learn more about InceptionV3 in this paper: Rethinking the Inception Architecture for Computer Vision by Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna, CVPR 2016: https://ieeexplore.ieee.org/document/7780677)
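The difference in model size between ResNet18 and VGG16 can be checked with plain arithmetic. The sketch below counts learnable parameters from the published layer configurations of the standard 1000-class ImageNet versions; the function names are hypothetical helpers written for this article, and batch normalization running statistics are not counted because they are not learned.

```python
def conv_params(c_in: int, c_out: int, k: int, bias: bool) -> int:
    """Weights (and optional biases) of a k x k convolution."""
    return c_in * c_out * k * k + (c_out if bias else 0)


def bn_params(channels: int) -> int:
    """Learnable scale and shift of a batch normalization layer."""
    return 2 * channels


def linear_params(n_in: int, n_out: int) -> int:
    """Weights and biases of a fully connected layer."""
    return n_in * n_out + n_out


def resnet18_params(num_classes: int = 1000) -> int:
    total = conv_params(3, 64, 7, bias=False) + bn_params(64)  # 7x7 stem
    c_in = 64
    for c_out in (64, 128, 256, 512):  # four stages of two basic blocks
        for _ in range(2):
            total += conv_params(c_in, c_out, 3, bias=False) + bn_params(c_out)
            total += conv_params(c_out, c_out, 3, bias=False) + bn_params(c_out)
            if c_in != c_out:  # 1x1 projection on the skip connection
                total += conv_params(c_in, c_out, 1, bias=False) + bn_params(c_out)
            c_in = c_out
    return total + linear_params(512, num_classes)


def vgg16_params(num_classes: int = 1000) -> int:
    widths = [3, 64, 64, 128, 128, 256, 256, 256,
              512, 512, 512, 512, 512, 512]  # 13 conv layers, all 3x3
    total = sum(conv_params(a, b, 3, bias=True)
                for a, b in zip(widths, widths[1:]))
    total += linear_params(512 * 7 * 7, 4096)
    total += linear_params(4096, 4096)
    total += linear_params(4096, num_classes)
    return total


print(f"ResNet18: {resnet18_params():,}")  # ResNet18: 11,689,512
print(f"VGG16:    {vgg16_params():,}")     # VGG16:    138,357,544
```

VGG16 carries nearly twelve times as many parameters as ResNet18, and its first fully connected layer alone (25088 to 4096) accounts for about 103 million of them. ResNet18 avoids this by applying global average pooling before a single small classifier layer.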
Conclusion
ResNet18’s ability to overcome the vanishing gradient problem, coupled with its efficient architecture and transfer learning capabilities, made it the ideal choice for our image emotion detection project. By understanding its inner workings and the advantages it offers over other architectures, we were able to leverage its strengths to achieve promising results in recognizing emotions from images.





