NEURAL COMPUTING AND APPLICATIONS, vol.38, pp.1-22, 2026 (Scopus)
The primary aim of this study is to develop a model that achieves high accuracy on multi-label classification problems by training, testing, and analyzing deep learning models that use both image and text data. We propose a novel hybrid model that jointly considers textual and visual data for movie genre prediction, a representative multi-label classification problem. Textual features are extracted from movie summaries with the DistilBERT model, and visual features are extracted from movie posters with the ConvNeXt deep learning model. The features from the two modalities are then combined using the XGBoost machine learning algorithm to perform genre prediction. By integrating information from different modalities through late fusion, the approach aims for higher accuracy and better generalizability in movie genre classification. The ConvNeXt architecture was adapted to the problem using transfer learning and fine-tuning. For DistilBERT, the token length and threshold hyperparameters were optimized, and the configuration with a token length of 256 and a threshold of 0.5 was used in the hybrid model. The hybrid model was further tuned with the Grid Search algorithm to maximize its overall performance. All three models (text-based, image-based, and hybrid) were trained and tested on a dataset obtained from the IMDB website and evaluated using Hamming loss, precision, and F1 score. Experimental results showed that the text-based model outperformed the image-based model overall, and that the hybrid model outperformed both individual models, indicating that textual and visual features complement each other and improve overall performance.
This paper presents an original and effective study of multi-label movie genre classification that combines computer vision, natural language processing, and machine learning.
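The 0.5 threshold and the evaluation metrics named in the abstract can be illustrated with a minimal sketch. It assumes the model outputs one probability per genre, that a genre is predicted when its probability is at least the threshold, and that precision and F1 are micro-averaged; the abstract does not state the averaging scheme, and the function names and toy data below are hypothetical.

```python
def binarize(probs, threshold=0.5):
    """Turn per-genre probabilities into 0/1 predictions (assumed rule: p >= threshold)."""
    return [[1 if p >= threshold else 0 for p in row] for row in probs]


def hamming_loss(y_true, y_pred):
    """Fraction of (movie, genre) label slots that are predicted incorrectly."""
    total = sum(len(row) for row in y_true)
    wrong = sum(
        t != p
        for true_row, pred_row in zip(y_true, y_pred)
        for t, p in zip(true_row, pred_row)
    )
    return wrong / total


def micro_precision_f1(y_true, y_pred):
    """Micro-averaged precision and F1 (averaging scheme is an assumption)."""
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1
            elif t == 0 and p == 1:
                fp += 1
            elif t == 1 and p == 0:
                fn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, f1


# Toy data: two movies, three genres (values are illustrative only).
y_true = [[1, 0, 1], [0, 1, 0]]
probs = [[0.9, 0.6, 0.4], [0.2, 0.7, 0.1]]

y_pred = binarize(probs, threshold=0.5)
loss = hamming_loss(y_true, y_pred)
precision, f1 = micro_precision_f1(y_true, y_pred)
```

Lower Hamming loss is better, while higher precision and F1 are better, so a comparison of the three models should read the metrics in opposite directions.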