CoAtNet for Chest X-Ray Report Generation with Bi-LSTM and Multi-Head Attention
The chest X-ray (CXR) is the most widely used diagnostic imaging examination in clinical practice, and its findings are communicated through written medical reports. Manual report preparation, however, is time-consuming, highly dependent on radiologist expertise, and prone to errors under heavy workloads and limited expert staffing. An automated, artificial-intelligence-based system is therefore needed to ease radiologists' workload while improving consistency. This study develops an automated medical report generation system with a balanced data distribution, a reliable visual encoder, and bidirectional contextual understanding of the report text. Its main contributions are: an undersampling strategy applied to majority captions followed by oversampling of minority labels, while preserving the relative proportion of higher-frequency labels; a Bi-LSTM decoder with Multi-Head Attention (MHA) to strengthen textual context modeling; and CoAtNet as the visual encoder, combining the strengths of CNNs and Transformers. The methodology comprises image preprocessing with gamma correction for contrast enhancement, data selection, balancing through the combined undersampling and oversampling scheme, and a CoAtNet encoder paired with a Bi-LSTM and MHA decoder. Experiments were conducted on the IU X-ray dataset and evaluated with BLEU and ROUGE-L metrics. The CoAtNet configuration with Bi-LSTM and MHA, combined with the undersampling-oversampling strategy, achieved the best performance with a cumulative score of 1.642: BLEU-1 through BLEU-4 of 0.480, 0.329, 0.245, and 0.183, and ROUGE-L of 0.405. These results show that combining the data balancing strategy with CoAtNet and Bi-LSTM produces more accurate automated medical reports and reduces bias toward the majority label.
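To make the described architecture concrete, the following is a minimal PyTorch sketch of the decoder outlined above: a Bi-LSTM over report tokens whose hidden states attend to CoAtNet patch features through multi-head attention before predicting the next word. All layer widths, the visual-projection layer, and the class and variable names are illustrative assumptions, not the configuration reported in the paper; the visual features are assumed to come from any pretrained CoAtNet backbone.

    # Hypothetical sketch (not the authors' exact model): Bi-LSTM + MHA decoder
    # attending over CoAtNet visual features.
    import torch
    import torch.nn as nn

    class BiLSTMMHADecoder(nn.Module):
        def __init__(self, vocab_size, embed_dim=256, hidden_dim=256,
                     visual_dim=768, num_heads=8):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                                  bidirectional=True)
            # Project CoAtNet patch features to the Bi-LSTM output width (2H).
            self.visual_proj = nn.Linear(visual_dim, 2 * hidden_dim)
            self.mha = nn.MultiheadAttention(embed_dim=2 * hidden_dim,
                                             num_heads=num_heads,
                                             batch_first=True)
            self.out = nn.Linear(2 * hidden_dim, vocab_size)

        def forward(self, tokens, visual_feats):
            # tokens: (B, T) word indices; visual_feats: (B, N, visual_dim)
            # patch features assumed to come from a pretrained CoAtNet encoder.
            text, _ = self.bilstm(self.embed(tokens))       # (B, T, 2H)
            vis = self.visual_proj(visual_feats)             # (B, N, 2H)
            attended, _ = self.mha(query=text, key=vis, value=vis)
            return self.out(attended)                        # (B, T, vocab)

    # Usage with dummy shapes:
    decoder = BiLSTMMHADecoder(vocab_size=1000)
    logits = decoder(torch.randint(0, 1000, (2, 20)), torch.randn(2, 49, 768))
    print(logits.shape)  # torch.Size([2, 20, 1000])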
Copyright (c) 2025 Rafy Aulia Akbar, Ricky Eka Putra, Wiyli Yustanti (Author)

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.





