A multimodal framework for explainable chest X-ray report generation
Indonesian Journal of Electrical Engineering and Computer Science
Abstract
Chest X-ray (CXR) interpretation remains a challenging task due to overlapping anatomical structures, variability in disease presentation, and increasing clinical workload. Existing automated report-generation models show promising results but often lack explicit interpretability, exhibit limited clinical alignment, and have not been compared sufficiently against established baselines. This study proposes an explainable multimodal framework that combines a dual CNN encoder (ResNet-50 and EfficientNet-B0) with the Gemma-3 1B language model fine-tuned using low-rank adaptation (LoRA). Visual explanations are produced through Gradient-weighted Class Activation Mapping (Grad-CAM) to make the decision process more transparent. Unlike prior image-to-text pipelines, our approach follows a findings-guided paradigm and integrates both visual and textual cues during generation. Experiments on public datasets demonstrate consistent improvements over representative vision-language baselines reported in recent literature, with notable gains in BLEU, ROUGE, METEOR, and BERTScore. Generated reports show improved factual completeness and clinically relevant region-level attention. Limitations include the absence of evaluation against emerging foundation models and the lack of anatomical-level explainability metrics. Future work will extend benchmarking to models such as M2-Transformer, MedCLIP-GPT, and R2Gen, and will explore clinical validation in real-world workflows.
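The abstract does not include implementation details, but the low-rank adaptation (LoRA) technique it names can be sketched in a few lines. The snippet below is an illustrative toy (plain NumPy, not the paper's actual code or the Gemma-3 setup): instead of updating a full weight matrix during fine-tuning, LoRA trains two small low-rank factors B and A so the adapted weight is W + (alpha / r) * B @ A. The dimensions, rank, and scaling factor here are arbitrary assumptions for illustration.

```python
import numpy as np

# Toy sketch of the LoRA idea (illustrative only, not the paper's code):
# keep the pretrained weight W_frozen fixed and train only two low-rank
# factors B (d_out x r) and A (r x d_in), so the effective weight is
#     W_adapted = W_frozen + (alpha / r) * B @ A
rng = np.random.default_rng(0)

d_in, d_out, r, alpha = 512, 512, 8, 16  # arbitrary example sizes
W_frozen = rng.standard_normal((d_out, d_in))

# Standard LoRA initialization: A is small random, B starts at zero,
# so fine-tuning begins exactly from the frozen model's behaviour.
A = rng.standard_normal((r, d_in)) * 0.01
B = np.zeros((d_out, r))

def lora_forward(x):
    """Forward pass through the frozen layer plus the low-rank update."""
    return W_frozen @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B = 0 the adapted layer matches the frozen layer exactly.
assert np.allclose(lora_forward(x), W_frozen @ x)

full_params = d_out * d_in          # what a full fine-tune would train
lora_params = r * (d_in + d_out)    # what LoRA trains instead
print(f"trainable params: {lora_params} vs full fine-tune: {full_params}")
```

For these example sizes the trainable parameter count drops from 262,144 to 8,192, which is why LoRA makes fine-tuning a billion-parameter model such as Gemma-3 1B tractable on modest hardware.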