Automatic Image Captioning Using Convolutional Neural Network and Long Short-Term Memory Techniques
Abstract
Image captioning is the task of automatically generating natural language descriptions of image content, with applications in social media, e-commerce, and content creation. Recent approaches rely on reinforcement learning and multimodal transformers, but these demand large datasets and substantial computational resources and struggle with complex scenes. To address these challenges, this work adopts a hybrid approach that employs CNNs for visual feature extraction and LSTMs for sequential caption generation. More specifically, rich image features were extracted using a pre-trained Inception V3 model, while an LSTM was used to synthesize image captions. Captions were generated using both greedy and beam search strategies. Finally, BLEU scores and visualizations demonstrate that this blend of computer vision and NLP capabilities produces accurate and coherent image captions.
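The greedy and beam search decoding strategies mentioned above can be illustrated with a minimal sketch. The `step_logprobs` function here is a hypothetical stand-in for the trained LSTM decoder (which in the actual system would condition on Inception V3 image features); the toy four-token vocabulary and probabilities are invented for illustration only.

```python
import math

# Hypothetical stand-in for the LSTM decoder: maps a partial token
# sequence to log-probabilities over the next token.
# Toy vocabulary: 0=<start>, 1="a", 2="dog", 3=<end>.
def step_logprobs(seq):
    table = {
        (0,):      {1: math.log(0.6), 2: math.log(0.3), 3: math.log(0.1)},
        (0, 1):    {2: math.log(0.9), 3: math.log(0.1)},
        (0, 1, 2): {3: math.log(1.0)},
        (0, 2):    {3: math.log(1.0)},
    }
    return table.get(tuple(seq), {3: 0.0})

def greedy_decode(max_len=5):
    """Pick the single most probable token at each step."""
    seq = [0]
    for _ in range(max_len):
        probs = step_logprobs(seq)
        nxt = max(probs, key=probs.get)
        seq.append(nxt)
        if nxt == 3:  # <end> token terminates the caption
            break
    return seq

def beam_decode(beam_width=2, max_len=5):
    """Keep the `beam_width` highest-scoring partial captions at each step."""
    beams = [([0], 0.0)]          # (sequence, cumulative log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for tok, lp in step_logprobs(seq).items():
                candidates.append((seq + [tok], score + lp))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates:
            if seq[-1] == 3:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
            if len(beams) >= beam_width:
                break
        if not beams:
            break
    finished.extend(beams)
    return max(finished, key=lambda c: c[1])[0]
```

On this toy distribution both strategies agree, but beam search can recover higher-probability captions that greedy decoding misses when an early low-probability token leads to a better overall sequence.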
Copyright (c) 2026 Dhamayandhi K, T Jayamalar, N. Krishnaveni, Lavanya C

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

