Explainable Artificial Intelligence Model for Pneumonia Detection: A Hybrid CNN-ViT with Grad-CAM
Abstract
Pneumonia detection through medical imaging presents a significant challenge, particularly in regions with limited access to healthcare professionals. This study presents an explainable artificial intelligence (XAI) model that integrates a convolutional neural network (CNN) and a vision transformer (ViT) to enhance the accuracy of pneumonia diagnosis using chest X-ray images. The proposed research further improves interpretability by providing explanations through gradient-weighted class activation mapping (Grad-CAM) visualization. The methodology includes image preprocessing, local feature extraction via the CNN, and global spatial relationship modelling using the ViT. The model was trained on a preprocessed chest X-ray dataset for pneumonia detection and evaluated using standard performance metrics: accuracy, precision, recall, and F1 score. The experimental results demonstrated that the model achieved an accuracy of 96.5%, precision of 96%, recall of 96%, and F1 score of 94%. These results indicate that the integration of CNN and ViT effectively enhances classification performance and provides a reliable tool for medical image analysis. Furthermore, Grad-CAM visualizations highlight the critical regions in the images that influence the model's predictions, thereby enhancing interpretability. Compared to conventional models, this approach offers improved transparency in AI-driven diagnostics. Consequently, the proposed model represents a promising and reliable diagnostic tool, particularly beneficial in underserved or remote areas with limited medical infrastructure. Additionally, this research opens opportunities for the development of transparent, XAI-based diagnostic systems.
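The Grad-CAM visualization mentioned above combines the feature maps of the last convolutional block with the gradients of the class score. As an illustrative sketch only (not the authors' implementation, and independent of the CNN-ViT architecture itself): each channel's weight is the spatial average of its gradient, and the heatmap is the ReLU of the weighted sum of feature maps.

```python
# Illustrative Grad-CAM combination step (assumed formulation, stdlib only).
# feature_maps and gradients are each a list of K channels,
# where every channel is an H x W list of lists of floats.

def grad_cam(feature_maps, gradients):
    """Return the Grad-CAM heatmap for one class.

    alpha_k = mean over spatial positions of dY/dA_k  (channel weight)
    heatmap = ReLU( sum_k alpha_k * A_k )
    """
    K = len(feature_maps)
    H, W = len(feature_maps[0]), len(feature_maps[0][0])
    # Channel weights: global-average-pooled gradients.
    alphas = [sum(sum(row) for row in g) / (H * W) for g in gradients]
    # Weighted sum over channels, clipped at zero (ReLU) so only
    # regions that positively support the class remain highlighted.
    return [[max(0.0, sum(alphas[k] * feature_maps[k][i][j] for k in range(K)))
             for j in range(W)]
            for i in range(H)]


# Tiny 2x2 example with K = 2 channels: the second channel's gradients
# are negative, so its contribution is suppressed by the ReLU.
heatmap = grad_cam(
    feature_maps=[[[1.0, 0.0], [0.0, 1.0]],
                  [[0.0, 2.0], [2.0, 0.0]]],
    gradients=[[[1.0, 1.0], [1.0, 1.0]],
               [[-1.0, -1.0], [-1.0, -1.0]]],
)
# → [[1.0, 0.0], [0.0, 1.0]]
```

In practice the heatmap would be upsampled to the input resolution and overlaid on the chest X-ray, which is what produces the region highlights the abstract describes.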
© Jurnal Nasional Teknik Elektro dan Teknologi Informasi, under the terms of the Creative Commons Attribution-ShareAlike 4.0 International License.


