Multimodal Image and Audio Recognition Model

Project Link: https://github.com/Jaron-U/Multimodal-Image-and-Audio-Recognition-Model

This project develops a multimodal digit recognition model that combines a Convolutional Neural Network (CNN) with a Transformer-based decoder to process audio and image data through a multi-head attention mechanism. The CNN analyzes 28x28-pixel images, and the Transformer is trained on audio vectors of length 507. The model trains faster and converges more quickly than traditional CNN and RNN methods, and it achieved an accuracy of 99.5% in the Kaggle competition.
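
The architecture described above could be sketched roughly as follows in PyTorch. This is an illustrative sketch only, not the repository's actual code: the class name `MultimodalDigitClassifier`, the layer sizes, the choice to split the 507-value audio vector into 13 tokens of 39 values, and the assumption of 10 output classes are all hypothetical. The image features from the CNN serve as the decoder's memory, so each audio token attends to image regions through multi-head cross-attention.

```python
import torch
import torch.nn as nn


class MultimodalDigitClassifier(nn.Module):
    """Hypothetical sketch: CNN image branch + Transformer decoder over audio tokens."""

    def __init__(self, d_model=64, n_heads=4, n_classes=10, audio_len=507, patch=39):
        super().__init__()
        # Assumption: the flat audio vector is cut into equal-sized tokens.
        assert audio_len % patch == 0, "audio_len must be divisible by patch"
        self.patch = patch
        n_tokens = audio_len // patch  # 507 / 39 = 13 tokens

        # Image branch: two conv blocks reduce 28x28 to a 7x7 feature map.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, d_model, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )

        # Audio branch: project each token to d_model and add a learned position.
        self.audio_proj = nn.Linear(patch, d_model)
        self.audio_pos = nn.Parameter(torch.zeros(1, n_tokens, d_model))

        # Decoder layers apply self-attention over audio tokens and
        # multi-head cross-attention from audio tokens to image features.
        layer = nn.TransformerDecoderLayer(
            d_model, n_heads, dim_feedforward=128, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, image, audio):
        feat = self.cnn(image)                      # (B, d_model, 7, 7)
        memory = feat.flatten(2).transpose(1, 2)    # (B, 49, d_model)
        tokens = audio.reshape(audio.size(0), -1, self.patch)  # (B, 13, 39)
        tgt = self.audio_proj(tokens) + self.audio_pos         # (B, 13, d_model)
        fused = self.decoder(tgt, memory)           # (B, 13, d_model)
        return self.head(fused.mean(dim=1))         # (B, n_classes) logits


if __name__ == "__main__":
    model = MultimodalDigitClassifier()
    logits = model(torch.randn(2, 1, 28, 28), torch.randn(2, 507))
    print(logits.shape)
```

Using the image features as decoder memory is one plausible way to combine the two modalities; concatenating pooled features from each branch would be a simpler alternative.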

Results

I experimented with several model architectures and hyperparameter settings to find the best-performing configuration. The final model reached 99.5% accuracy in the Kaggle competition.