CoAtNet for Chest X-Ray Report Generation with Bi-LSTM and Multi-Head Attention
The chest X-ray (CXR) is the most widely used diagnostic imaging examination in clinical practice, and its findings are communicated through written medical reports. Manual report preparation, however, is time-consuming, highly dependent on radiologist expertise, and prone to errors under heavy workloads and limited expert staffing. An automated, artificial-intelligence-based system is therefore needed to ease radiologists' workload while improving consistency. This study develops an automated medical report generation system with a balanced data distribution, a reliable visual encoder, and bidirectional contextual understanding of the report text. Its main contributions are: an undersampling strategy applied to majority captions followed by oversampling of minority labels, while preserving the relative proportion of higher-frequency labels; a Bi-LSTM decoder with Multi-Head Attention (MHA) to strengthen textual context modeling; and CoAtNet as the visual encoder, combining the strengths of CNNs and Transformers. The methodology comprises image preprocessing with gamma correction for contrast enhancement, data selection, balancing through the combined undersampling and oversampling scheme, and a CoAtNet encoder paired with a Bi-LSTM and MHA decoder. Experiments were conducted on the IU X-ray dataset and evaluated with BLEU and ROUGE-L metrics. The CoAtNet configuration with Bi-LSTM and MHA, combined with the undersampling-oversampling strategy, achieved the best performance with a cumulative score of 1.642: BLEU-1 through BLEU-4 of 0.480, 0.329, 0.245, and 0.183, and ROUGE-L of 0.405. These results show that combining the data balancing strategy with CoAtNet and Bi-LSTM produces more accurate automated medical reports and reduces bias toward the majority label.
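To make the described architecture concrete, the following is a minimal PyTorch sketch of the decoder outlined above: a Bi-LSTM over report tokens whose hidden states attend to CoAtNet patch features through multi-head attention before predicting the next word. All layer widths, the visual-projection layer, and the class and variable names are illustrative assumptions, not the configuration reported in the paper; the visual features are assumed to come from any pretrained CoAtNet backbone.

    # Hypothetical sketch (not the authors' exact model): Bi-LSTM + MHA decoder
    # attending over CoAtNet visual features.
    import torch
    import torch.nn as nn

    class BiLSTMMHADecoder(nn.Module):
        def __init__(self, vocab_size, embed_dim=256, hidden_dim=256,
                     visual_dim=768, num_heads=8):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                                  bidirectional=True)
            # Project CoAtNet patch features to the Bi-LSTM output width (2H).
            self.visual_proj = nn.Linear(visual_dim, 2 * hidden_dim)
            self.mha = nn.MultiheadAttention(embed_dim=2 * hidden_dim,
                                             num_heads=num_heads,
                                             batch_first=True)
            self.out = nn.Linear(2 * hidden_dim, vocab_size)

        def forward(self, tokens, visual_feats):
            # tokens: (B, T) word indices; visual_feats: (B, N, visual_dim)
            # patch features assumed to come from a pretrained CoAtNet encoder.
            text, _ = self.bilstm(self.embed(tokens))       # (B, T, 2H)
            vis = self.visual_proj(visual_feats)             # (B, N, 2H)
            attended, _ = self.mha(query=text, key=vis, value=vis)
            return self.out(attended)                        # (B, T, vocab)

    # Usage with dummy shapes:
    decoder = BiLSTMMHADecoder(vocab_size=1000)
    logits = decoder(torch.randint(0, 1000, (2, 20)), torch.randn(2, 49, 768))
    print(logits.shape)  # torch.Size([2, 20, 1000])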
Copyright (c) 2025 Rafy Aulia Akbar, Ricky Eka Putra, Wiyli Yustanti (Author)

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.





