Multimodal Image and Audio Recognition Model
Project Link: https://github.com/Jaron-U/Multimodal-Image-and-Audio-Recognition-Model
This project develops a multimodal digit recognition model that combines a Convolutional Neural Network (CNN) with a Transformer-based decoder and uses a multi-head attention mechanism to process digit audio and image data together. The CNN analyzes 28x28-pixel images, and the Transformer is trained on audio vectors of length 507. Compared with traditional CNN and RNN methods, the model trains faster and converges better. It achieved an accuracy of 99.5% in the Kaggle competition.
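The architecture described above can be sketched in PyTorch roughly as follows. This is a minimal illustration, not the project's actual code: the class name `MultimodalDigitClassifier`, the layer sizes, the split of the 507-value audio vector into 39 tokens of 13 values, and the choice to let audio tokens attend to CNN image features through the decoder's cross-attention are all assumptions made for the example.

```python
import torch
import torch.nn as nn


class MultimodalDigitClassifier(nn.Module):
    """Hypothetical sketch: CNN image branch + Transformer decoder over audio tokens."""

    def __init__(self, audio_len=507, patch=13, d_model=64, n_heads=4, n_classes=10):
        super().__init__()
        # Assumption: the audio vector is cut into equal-size tokens (507 = 39 * 13).
        assert audio_len % patch == 0, "audio_len must be divisible by patch"
        self.patch = patch

        # Image branch: (B, 1, 28, 28) -> (B, 32, 7, 7) feature map.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.img_proj = nn.Linear(32, d_model)

        # Audio branch: project each token to d_model and add a learned position embedding.
        self.audio_proj = nn.Linear(patch, d_model)
        self.pos = nn.Parameter(torch.zeros(1, audio_len // patch, d_model))

        # Transformer decoder: self-attention over audio tokens,
        # multi-head cross-attention from audio tokens to image features.
        layer = nn.TransformerDecoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=128, batch_first=True
        )
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, image, audio):
        feat = self.cnn(image)                                    # (B, 32, 7, 7)
        memory = self.img_proj(feat.flatten(2).transpose(1, 2))   # (B, 49, d_model)
        tokens = audio.reshape(audio.size(0), -1, self.patch)     # (B, 39, 13)
        tgt = self.audio_proj(tokens) + self.pos                  # (B, 39, d_model)
        fused = self.decoder(tgt, memory)                         # (B, 39, d_model)
        return self.head(fused.mean(dim=1))                       # (B, n_classes) logits
```

Calling the model with a batch of images shaped `(B, 1, 28, 28)` and audio vectors shaped `(B, 507)` returns one row of 10 class logits per sample. Mean-pooling the fused tokens before the classifier is one simple choice among several; a dedicated classification token would work as well.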
Result
I experimented with various model architectures and hyperparameters to find the best-performing configuration. The final model achieved an accuracy of 99.5% in the Kaggle competition.