1. INTRODUCTION
In recent years, the rapid advancement of artificial intelligence (AI) technology has sparked growing interest in the integration of large-scale models within the field of medical imaging. Large models, often denoting neural network models with a profusion of parameters, intricate architecture, and abundant neurons, have gained prominence in the realm of deep learning. Large models consistently exhibit exceptional performance and robust generalization capabilities, rendering them versatile tools in various domains, including medical image analysis and natural language processing (NLP) [1]. The roots of large models can be traced back to the foundational concept of neuron modeling proposed by Warren McCulloch and Walter Pitts in 1943. However, the field of deep learning, where these expansive models reside, grappled with technical constraints for an extended period [2]. It was not until 2012, when Alex Krizhevsky and colleagues introduced the AlexNet model, that a crucial turning point was reached. Their victory in the ImageNet image classification competition underscored the considerable advantages of large deep models in computer vision tasks, marking the dawn of an era in which substantial models flourished [3]. Subsequent developments introduced models such as VGG, GoogLeNet, and the Residual Neural Network (ResNet), all of which contributed significantly to enhanced model performance [4–6]. The advent of the Transformer model pioneered the concept of self-attention, providing the foundation for large-scale language modeling [7]. Building upon this foundation, researchers unveiled the Bidirectional Encoder Representations from Transformers (BERT) model, bringing about substantial improvements in the performance of large models on NLP tasks [8]. Today, OpenAI’s Generative Pre-trained Transformer (GPT) series of models, boasting billions of parameters, exemplifies remarkable capabilities [9].
The evolution of large models, as witnessed today, owes much to the contributions of computational resources. In this context, computational resources encompass the hardware devices employed for training deep learning models, the computational time required for this training, and the energy consumption necessary to sustain these devices. Since the release of the AlexNet model in 2012, there has been an exponential surge in the computational resources used by researchers for model training. The deployment of large-scale computational resources has significantly enhanced model performance [10]. However, the extensive use of large-scale computational resources has also given rise to a set of challenges, including substantial economic investments, heightened energy demands, increased carbon emissions, and concerns related to research inequality [11].
Despite the persistent nature of these challenges, the pivotal role played by large models in the realm of medical image analysis remains undisputed. The hierarchical architecture inherent in deep neural networks within large models facilitates a systematic process for the identification and accentuation of crucial features within input images, while concurrently eliminating superfluous elements. This process reveals the intrinsic characteristics latent within the original images [12]. This remarkable capability empowers large models to conduct medical image analysis with enhanced efficiency and precision. Moreover, the integration of large models in the field of medical imaging has catalyzed innovative research directions in medical image analysis, encompassing automated image segmentation and the automated generation of comprehensive medical image analysis reports [13].
The applications of large models in the field of medical imaging encompass several distinct areas. The first of these is image classification and segmentation, a critical task in medical image analysis that finds wide utility in assisted diagnosis and lesion localization. Large models can autonomously discern salient features within original medical images, offering precise classification outcomes and the ability to delineate and segment various tissues and organs within the image with exceptional accuracy [14, 15]. The second area focuses on the detection and prediction of anomalies within medical images. Large models exhibit the capability to identify diverse pathologic anomalies, including viral infections, exemplified by the capacity to detect early-stage COVID-19 infections through medical images [16]. Furthermore, large models can effectively forecast disease onset or future progression, as evidenced by the accurate predictions in the context of glaucoma onset and progression [17]. The third domain pertains to multimodal medical image analysis, which addresses the multifaceted nature of contemporary medical image data. Large models adeptly combine multiple types of medical images, extracting common features across all modalities to effectively analyze target images [18]. Lastly, large models play an invaluable complementary role for radiologists. Large model applications, whether focused on image segmentation or anomaly detection, significantly streamline the work of radiologists, enhancing the accuracy and efficiency of their tasks within the realm of medical image analysis.
2. METHODS
In this section, the fundamental architecture of large models, along with the training strategies and optimization techniques, will be introduced. The goal is to foster a deeper comprehension of large models.
2.1 Basic architecture
2.1.1 Categories of models
The landscape of large models has evolved significantly over time. Currently, these models can be broadly classified into the following three groups based on their foundational architectural structures:
Convolutional Neural Networks (CNNs)
CNNs typically comprise three distinct types of layers (convolutional, pooling, and fully connected). Within the convolutional layer, a pivotal element is the convolutional kernel, often referred to as the filter. These filters empower CNNs to adeptly discern salient features within input images, enhancing the efficiency of image processing. However, traditional CNNs grapple with certain limitations, including the challenge of vanishing gradients, which hinders the capacity of the model to grasp intricate features [19]. In response, researchers have devised innovative models building upon the CNN architecture to surmount these challenges. Noteworthy exemplars encompass the AlexNet, VGGNet, and ResNet models. AlexNet leverages multiple convolutional layers, ReLU activation functions, max pooling, and normalization to optimize model accuracy. VGGNet enhances AlexNet by introducing a sequence of convolutional layers with smaller convolutional kernels, thereby improving feature recognition. The ResNet model (Figure 1) introduces the concept of residual learning, effectively addressing the vanishing gradient problem and enabling the model to grasp more intricate features [20]. In the realm of medical image analysis, these models predominantly find application in image classification and segmentation. A multitude of studies have demonstrated the adeptness of this class of models in accurately classifying and segmenting medical images [21–23].
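To make the residual learning idea concrete, the following is a minimal sketch of a residual block, assuming PyTorch; the channel count and layer arrangement are illustrative rather than drawn from any specific published model.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """A basic residual block: output = ReLU(F(x) + x)."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x                       # shortcut connection
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        out = out + identity               # residual addition eases gradient flow
        return self.relu(out)

# Example: a 64-channel feature map passes through the block with unchanged shape.
x = torch.randn(1, 64, 32, 32)
print(ResidualBlock(64)(x).shape)          # torch.Size([1, 64, 32, 32])
```

The shortcut connection is what allows gradients to flow directly to earlier layers, which is the mechanism by which residual learning mitigates vanishing gradients.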
Recurrent Neural Networks (RNNs)
RNNs (Figure 2) process data through recurrent connections, in which the hidden state at each step depends on both the current input and the preceding hidden state. This architectural design facilitates the propagation and sharing of information across the steps of a sequence, enabling RNNs to handle sequential and hierarchical data efficiently, particularly in natural language processing (NLP) and sequential data analysis [24]. By leveraging these functional characteristics, the application of RNNs in medical imaging extends to automating the generation of medical image reports and processing medical sequence data, such as time-series images or video data [25, 26].
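As a minimal sketch of this recurrent processing, assuming PyTorch, the snippet below runs an LSTM over a hypothetical sequence of per-frame image feature vectors (as might be produced by a CNN for a video or time-series study); the dimensions and the per-frame prediction head are illustrative only.

```python
import torch
import torch.nn as nn

# Hypothetical setting: 10 time steps (e.g., frames of an imaging sequence),
# each already encoded as a 256-dimensional feature vector by a CNN.
features = torch.randn(1, 10, 256)            # (batch, sequence, feature)

rnn = nn.LSTM(input_size=256, hidden_size=128, batch_first=True)
head = nn.Linear(128, 2)                      # e.g., a per-frame binary label

hidden_states, _ = rnn(features)              # hidden state carries context forward in time
per_step_logits = head(hidden_states)         # shape: (1, 10, 2)
print(per_step_logits.shape)
```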
Transformer Model
The Transformer model (Figure 3) introduces a self-attention mechanism, dynamically adjusting the focus of the model on different segments of the input based on task-specific features and input data characteristics, thereby enhancing model performance and resilience. This self-attention mechanism, when combined with feed-forward neural networks, enables global context modeling and maintains parallel processing capabilities, especially for extended sequences [7]. Building upon the foundations of the Transformer model, BERT inherits and expands its attributes. BERT introduces bidirectional training, addressing the unidirectional processing constraint of Transformers, while also featuring pre-training and fine-tuning capabilities [8]. The emergence of the GPT series of models has further elevated neural language models based on the Transformer architecture. GPT-3, equipped with 175 billion parameters, boasts an innovative few-shot learning feature, allowing GPT-3 to proficiently handle various NLP tasks with minimal examples or task descriptions [9].
Figure 1. ResNet architecture: shortcut connections enable residual learning, addressing the vanishing gradient problem in deep neural networks.
Figure 2. Recurrent neural network: internal looping structures handle sequential information, suitable for tasks such as language processing and time-series analysis in medical imaging.
Figure 3. Transformer model: the self-attention mechanism dynamically tunes its focus on input segments, enhancing performance and adaptability in processing sequential data.
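To make the self-attention mechanism described above concrete, the following is a minimal sketch of scaled dot-product self-attention, assuming PyTorch; the token count and dimensions are illustrative, and a full Transformer layer additionally uses multiple heads, residual connections, and feed-forward sublayers.

```python
import math
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    """Scaled dot-product self-attention over a sequence x of shape (seq_len, d_model)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v          # project inputs to queries/keys/values
    scores = q @ k.T / math.sqrt(k.shape[-1])    # pairwise similarity, scaled
    weights = F.softmax(scores, dim=-1)          # attention weights over the sequence
    return weights @ v                           # weighted sum of values

d_model = 16
x = torch.randn(8, d_model)                      # 8 tokens (e.g., image patches)
w_q, w_k, w_v = (torch.randn(d_model, d_model) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)    # torch.Size([8, 16])
```

Because every token attends to every other token in a single matrix operation, the mechanism captures global context while remaining highly parallelizable.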
Hence, the significance of applying Transformer, BERT, and GPT-3 in the field of medical image analysis cannot be overstated. Numerous scholars have authored reviews highlighting the potential of Transformer models for medical image segmentation. BERT models have exhibited exceptional performance in automatic medical image report generation. The exploration of GPT series models for aiding clinical decision-making in the realm of radiology further underlines their promising utility [27–29].
2.1.2 Model size and complexity
To date, all three categories of large models exhibit considerable scale. Kaplan et al. conducted a comprehensive study of the relationship between model performance and model scale, reporting that performance is strongly contingent on the number of model parameters, the extent of the dataset, and the computational resources employed for training. It was evident that judicious expansion of model size markedly enhances performance. Consequently, large models have demonstrated remarkable efficiency in the analysis of medical images, achieving commendable accuracy in tasks such as tumor detection, image segmentation, and disease discrimination [30]. However, several challenges persist in the application of large models to medical image analysis. A notable hurdle is the predominantly limited size of medical image datasets, which often fails to meet the demands of training large-scale models [31].
2.1.3 Pre-training and transfer learning
The advent of pre-training and transfer learning methodologies has effectively addressed the quandary of limited data sizes. Pre-training involves the preliminary training of large models on extensive datasets, enabling them to glean generic features and structural knowledge from the data. Transfer learning then allows these models to apply the generalized features acquired during pre-training to the specific dataset under scrutiny [32]. Notably, Hopson et al. [33] investigated the use of pre-trained CNN models with transfer learning for assessing the quality of clinical PET images, demonstrating that pre-training significantly enhances CNN performance in automated PET image quality prediction.
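A common way to realize this pre-training and transfer-learning workflow is to start from a network pre-trained on a large natural-image dataset and fine-tune only a task-specific head. The sketch below assumes PyTorch and a recent torchvision; the two-class medical task is hypothetical.

```python
import torch.nn as nn
from torchvision import models

# Load a CNN pre-trained on ImageNet (generic features learned during pre-training).
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Freeze the pre-trained layers so only the new head is updated (transfer learning).
for param in backbone.parameters():
    param.requires_grad = False

# Replace the final classifier with a head for a hypothetical two-class medical task.
backbone.fc = nn.Linear(backbone.fc.in_features, 2)

# Only the new head's parameters would be passed to the optimizer during fine-tuning.
trainable = [p for p in backbone.parameters() if p.requires_grad]
```

Depending on dataset size, some or all of the frozen layers can later be unfrozen for full fine-tuning.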
2.2 Training strategies
2.2.1 Data preparation
The quality of data significantly influences the performance of large models, so it is imperative to prepare medical image data for model training with meticulous care. The process begins with a comprehensive survey of the medical image data relevant to the study, followed by a rigorous assessment of data reliability. Subsequent steps involve data cleansing to eliminate non-compliant entries, standardization to ensure image consistency, and annotation tailored to the study requirements so that the model can learn and comprehend the data effectively [34].
2.2.2 Data augmentation
Beyond the above steps, data augmentation is also essential. Data augmentation generates new data from existing sources through techniques such as rotation, translation, flipping, and cropping. Enlarging the original dataset in this way expands its effective size and enhances the generalization capabilities of the model. Given that collected data may fall short of the demands of training a large model in practical scenarios, data augmentation is an indispensable facet of the preparation process [35].
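As an illustration of the augmentation operations listed above, the following is a minimal sketch assuming torchvision; the specific transforms and their parameters are illustrative and would need to be tuned to the imaging modality (for example, flips are not anatomically valid for all body regions).

```python
from torchvision import transforms

# Illustrative augmentation pipeline: each epoch sees a slightly different
# version of every image, effectively enlarging the training set.
augment = transforms.Compose([
    transforms.RandomRotation(degrees=10),                        # rotation
    transforms.RandomAffine(degrees=0, translate=(0.05, 0.05)),   # translation
    transforms.RandomHorizontalFlip(p=0.5),                       # flipping
    transforms.RandomResizedCrop(size=224, scale=(0.9, 1.0)),     # cropping
    transforms.ToTensor(),
])
# `augment` would then be passed to a dataset, e.g. ImageFolder(root, transform=augment).
```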
2.2.3 Loss functions and optimization objectives
In the training regimen of large models, the meticulous selection of appropriate loss functions emerges as a pivotal factor in augmenting model performance. Loss functions serve as metrics to gauge the deviation between the model predictions and actual values, elucidating how closely the model aligns with ground truth. The optimization objective is to minimize this deviation, with a smaller value of the loss function signifying superior model performance [36]. Commonly utilized loss functions in medical image segmentation tasks include Cross-Entropy Loss and Dice Loss. The judicious choice of a loss function in practical research hinges upon the specific data characteristics, research objectives, and the intended applications of large models [37].
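To make the segmentation losses named above concrete, the following is a minimal sketch of a soft Dice loss and its common combination with cross-entropy, assuming PyTorch and binary masks; the smoothing constant and weighting are illustrative choices.

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, targets, eps=1e-6):
    """Soft Dice loss for binary segmentation.

    logits:  raw model outputs, shape (batch, H, W)
    targets: ground-truth masks in {0, 1}, same shape
    """
    probs = torch.sigmoid(logits)
    intersection = (probs * targets).sum(dim=(1, 2))
    union = probs.sum(dim=(1, 2)) + targets.sum(dim=(1, 2))
    dice = (2 * intersection + eps) / (union + eps)
    return 1 - dice.mean()                         # minimize 1 - Dice coefficient

def combined_loss(logits, targets, alpha=0.5):
    """Weighted combination of cross-entropy and Dice, a common pairing in practice."""
    bce = F.binary_cross_entropy_with_logits(logits, targets.float())
    return alpha * bce + (1 - alpha) * dice_loss(logits, targets)
```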
2.3 Optimization techniques
2.3.1 Training optimization algorithms
Following data preprocessing, augmentation, and the selection of appropriate loss functions, a critical facet of training large models revolves around optimization algorithms. These algorithms aim to identify model parameters that minimize the loss function, thereby enabling optimal model performance. Stochastic Gradient Descent (SGD) and Adaptive Moment Estimation (Adam) stand out as commonly used optimization algorithms [38]. SGD is a foundational optimization algorithm that computes the gradient for one randomly selected training sample (or a small batch) at a time and updates the model parameters accordingly. This sampling introduces randomness, which aids in escaping local optima and exploring the parameter space more comprehensively. Adam combines momentum with per-parameter adaptive learning rates, providing greater stability than SGD and facilitating faster convergence [39].
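In practice, choosing between the two optimizers described above amounts to a single line of configuration; the sketch below, assuming PyTorch, shows both alongside one generic training step (the model, data, and hyperparameters are placeholders).

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)                          # placeholder model
criterion = nn.CrossEntropyLoss()

# Either optimizer can drive the same training loop.
sgd = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
adam = torch.optim.Adam(model.parameters(), lr=1e-3)

def training_step(optimizer, inputs, labels):
    optimizer.zero_grad()                         # clear previous gradients
    loss = criterion(model(inputs), labels)       # forward pass and loss
    loss.backward()                               # backpropagate gradients
    optimizer.step()                              # update parameters
    return loss.item()

x, y = torch.randn(4, 10), torch.randint(0, 2, (4,))
print(training_step(adam, x, y))
```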
2.3.2 Regularization and overfitting control
As model training progresses, the challenge of overfitting becomes increasingly pronounced. Overfitting manifests when the model excels on training data but performs poorly on unfamiliar data. To mitigate this issue, constraints or penalty terms can be incorporated into the loss function to reduce model complexity, an approach known as regularization. Typically, dropout and L2 regularization serve as effective means of model regularization. Dropout entails the random deactivation of a fraction of neurons during each training iteration, which prevents the model from relying too heavily on specific neurons and thereby enhances generalization capabilities. In contrast, L2 regularization adds a penalty term to the loss function, encouraging optimization algorithms to favor smaller weight values during parameter selection and consequently diminishing the risk of overfitting [40].
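Both regularization techniques described above map directly onto standard framework features; the sketch below, assuming PyTorch, inserts dropout into a small network and applies L2 regularization through the optimizer's weight_decay term (the architecture and values are illustrative).

```python
import torch
import torch.nn as nn

# Dropout randomly zeroes activations during training only.
model = nn.Sequential(
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),        # each forward pass drops roughly 50% of these activations
    nn.Linear(128, 2),
)

# weight_decay adds an L2 penalty on the weights to the optimization objective.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)

model.train()   # dropout active during training
model.eval()    # dropout disabled at inference time
```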
2.3.3 Model compression and acceleration
Efficiently reducing the time and cost associated with training and inference for large models stands as a crucial facet in the training continuum. Presently, model compression and acceleration primarily rely on methods, such as model pruning, quantization, and knowledge distillation. Model pruning involves judiciously trimming redundant weights, neurons, filters, and layers within large models based on CNN architecture. This process mitigates model storage requirements and expedites the inference phase. Quantization entails converting floating-point representations of model parameters and intermediate activation values into lower-precision integers or fixed-point numbers. Quantization not only reduces model size but also enhances inference efficiency. Knowledge distillation adopts the following two-step approach: initially training a large model, known as the teacher model; and subsequently constructing a smaller model, referred to as the student model, for the same task. Transferring knowledge from the teacher model to the student model results in a more streamlined architecture that demands fewer computational resources, thereby improving overall inference efficiency [41].
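Of the three compression strategies above, knowledge distillation is the most self-contained to illustrate; the following is a minimal sketch of a distillation loss, assuming PyTorch, with the temperature and weighting chosen only for illustration.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend the usual hard-label loss with a soft-label loss from the teacher."""
    # Soft targets: the teacher's softened probability distribution.
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                    # standard temperature scaling
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss

# Illustrative usage with random logits for a three-class task.
s, t = torch.randn(8, 3), torch.randn(8, 3)
y = torch.randint(0, 3, (8,))
print(distillation_loss(s, t, y))
```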
2.3.4 Distributed and parallel training
Additionally, distributed and parallel training assume a pivotal role in expediting the training of large models and processing extensive datasets in medical imaging. Distributed training partitions the parameters and training data of large models into multiple segments, each assigned to one of several computers or compute nodes. Each node independently computes updates to the model parameters and shares these updates, allowing the model to be trained simultaneously across multiple nodes and resulting in expedited training speeds [42]. In contrast, parallel training, distinct from distributed training (Table 1), runs on a single computer or compute node. This method leverages multiple processing units within the computer to concurrently process various segments of the training task, thereby augmenting training speed. When applied to medical imaging data, distributed and parallel training can notably accelerate the delivery of patient health information to healthcare professionals [43]. A minimal code sketch of the distributed pattern is given after Table 1.
Table 1. Comparison of distributed and parallel training.
Aspect | Distributed training | Parallel training |
---|---|---|
Concept | Utilizes a network of interconnected computers for distributed tasks. | Employs multiple processors within a single computer for concurrent tasks. |
Primary Goal | To manage and expedite training with large datasets across several machines. | To optimize and expedite training within the constraints of a single machine. |
Resource Requirements | Multiple interconnected computers or nodes; network bandwidth and latency are critical. | A computer with multi-core processors; dependent on the quality and number of cores. |
Data Handling | Implements data or model parallelism across nodes, splitting tasks among multiple machines. | Executes simultaneous training on different parts or subsets of data or model within the same machine. |
Communication Overhead | Higher due to the need for node synchronization and data exchange across the network. | Lower, as all processes occur within the same physical system, minimizing data exchange time. |
Scalability Potential | Highly scalable with the ability to add more nodes; influenced by network architecture and data strategies. | Limited to the physical and technical specifications of the single computer; can be extended by upgrading hardware. |
Operational Complexity | More complex due to coordination, network configuration, and data distribution across multiple machines. | Relatively simpler in setup but may require sophisticated parallel algorithms to fully utilize all cores efficiently. |
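As a rough illustration of the distributed pattern contrasted in Table 1, the sketch below assumes PyTorch's DistributedDataParallel and a launcher such as torchrun starting one process per GPU or node; the tiny model and data are placeholders, and real pipelines would additionally shard the dataset with a distributed sampler.

```python
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # One process per GPU/node; the launcher (e.g., torchrun) sets rank and world size.
    dist.init_process_group(backend="nccl" if torch.cuda.is_available() else "gloo")
    rank = dist.get_rank()
    device = torch.device(f"cuda:{rank % torch.cuda.device_count()}"
                          if torch.cuda.is_available() else "cpu")

    model = nn.Linear(32, 2).to(device)            # placeholder model
    ddp_model = DDP(model)                         # gradients are averaged across processes
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)

    # Each process would normally draw a distinct shard of the data
    # (e.g., via torch.utils.data.distributed.DistributedSampler).
    x = torch.randn(16, 32).to(device)
    y = torch.randint(0, 2, (16,)).to(device)
    loss = nn.functional.cross_entropy(ddp_model(x), y)
    loss.backward()                                # all-reduce synchronizes gradients here
    optimizer.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()   # e.g., launched with `torchrun --nproc_per_node=4 train.py`
```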
The foregoing information provides a foundational understanding of large model architecture, training methodologies, and optimization techniques. Subsequent sections will delve into the research advances and practical applications of large models within the domain of medical imaging.
3. EXPLORATION OF LARGE MODELS IN MEDICAL IMAGE ANALYSIS
3.1 Application examples
3.1.1 Precision in image classification and segmentation
The diligent efforts of researchers have yielded significant strides in the analysis of medical images through the integration of large models. Notably, Jin et al. introduced the RA-UNet model, a sophisticated architecture amalgamating CNNs, residual learning, and attention mechanisms. This model adeptly achieves precise segmentation of the liver and tumors within three-dimensional computed tomography (CT) images. Leveraging datasets, such as Liver Tumor Segmentation Challenge (LiTS) and 3DIRCADb for model training and evaluation, the study used metrics, including the Dice coefficient and Jaccard index, to gauge segmentation quality. In liver segmentation, RA-UNet attained Dice coefficients of 0.961 and 0.977, along with Jaccard indices of 0.926 and 0.977 on the two datasets. Furthermore, RA-UNet demonstrated robust performance in tumor segmentation across both datasets. A noteworthy innovation in this study was the pioneering use of an attention-residual mechanism for tumor segmentation in three-dimensional medical images. The integration of residual modules within the model enables adaptive adjustments in attention-aware features, thereby amplifying overall model performance [44].
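For reference, the two overlap metrics cited in these results can be computed directly from binary masks; the sketch below, assuming PyTorch tensors, is a generic implementation rather than the evaluation code used in the cited study.

```python
import torch

def dice_coefficient(pred, target, eps=1e-6):
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks."""
    pred, target = pred.bool(), target.bool()
    intersection = (pred & target).sum().float()
    return (2 * intersection + eps) / (pred.sum() + target.sum() + eps)

def jaccard_index(pred, target, eps=1e-6):
    """Jaccard (intersection over union) = |A ∩ B| / |A ∪ B| for binary masks."""
    pred, target = pred.bool(), target.bool()
    intersection = (pred & target).sum().float()
    union = (pred | target).sum().float()
    return (intersection + eps) / (union + eps)

# Illustrative masks: a predicted and a ground-truth segmentation.
pred = torch.zeros(64, 64, dtype=torch.bool)
gt = torch.zeros(64, 64, dtype=torch.bool)
pred[10:40, 10:40] = True
gt[12:42, 12:42] = True
print(dice_coefficient(pred, gt).item(), jaccard_index(pred, gt).item())
```

Note that for the same segmentation the Jaccard index is never larger than the Dice coefficient, which is why studies typically report both.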
3.1.2 Advances in anomaly detection and prediction
Large models have showcased remarkable progress in the domains of medical image anomaly detection and disease prediction. A notable example is the work of Brown et al., who harnessed deep CNNs for the automated diagnosis of “plus lesions” within retinal images of premature infants, a distinctive characteristic of retinopathy of prematurity (ROP). Given the critical importance of early plus lesion detection for effective ROP management, and considering the inherent low accuracy in clinical diagnosis, this research achieved remarkable precision and reproducibility in plus lesion diagnosis [45].
Furthermore, Jiang et al. introduced the “S-net,” a tailored deep neural network model designed for extracting image features from preoperative CT scans of gastric cancer patients to construct predictive models. These models not only forecast disease-free survival and overall survival in gastric cancer patients but also identify individuals likely to benefit from postoperative adjuvant therapy. The study unveiled a unique image feature termed “DeLIS,” which enables accurate prognostication of patient outcomes when integrated with clinical factors [46].
3.1.3 Computer-aided diagnosis systems and automated report generation
Moreover, the integration of large models has propelled the radiology field forward by assisting radiologists in disease diagnosis and automating the generation of medical imaging reports. Jiang et al. utilized a transformer-based image classification model employing optical coherence tomography (OCT) images to discern between age-related macular degeneration (AMD) and diabetic macular edema (DME), contributing significantly to the diagnosis of retinal diseases. The trained Transformer model achieved an impressive recognition accuracy of 90.9% when classifying normal, AMD, and DME OCT images, underscoring the potential of Transformer models in computer-aided diagnosis [47].
Furthermore, Yang et al. introduced an Adaptive Multimodal Attention network (AMAnet) designed for generating high-quality medical imaging reports, as evidenced by experiments conducted on a dataset of breast ultrasound images. The outcomes revealed that the AMAnet model autonomously produces semantically coherent and high-quality medical image reports, accurately portraying essential local features [48].
3.2 Technical challenges and solutions
3.2.1 Data scarcity and data bias
While large models have demonstrated substantial advantages in medical image analysis, several challenges persist. Primarily, concerns arise regarding the availability and quality of datasets, specifically related to issues of data scarcity and data bias. Large models demand considerable volumes of data for effective training, yet numerous research studies currently rely on medical imaging datasets that are relatively small in scale, falling short of the requirements for large model training. Additionally, some diseases exhibit an imbalanced data distribution, potentially leading to biased model outcomes. Furthermore, medical imaging data stems from diverse sources, posing challenges in ensuring data consistency. However, techniques, such as data augmentation, are presently used to alleviate these challenges, at least in part [49].
3.2.2 Model interpretability
Another crucial consideration is the interpretability of the model. The primary objective of using large models in medical image analysis is to support clinical decision-making, necessitating a transparent rationale behind every clinical decision. However, elucidating the decision-making process in large models is often challenging, potentially resulting in an inability to rectify errors, posing challenges for healthcare professionals and patients.
To address this concern, various techniques exist for model interpretation, such as Local Interpretable Model-Agnostic Explanations (LIME) and Shapley Additive Explanations (SHAP). LIME generates perturbed samples around a given input, obtains the original model's predictions for them, and then trains an interpretable surrogate model on these samples to approximate the decision-making process of the original model locally. SHAP utilizes game theory concepts to consider diverse combinations of features, calculating the contribution of each feature to the final prediction. This approach aids in comprehending how the model arrives at decisions. While both methods offer a degree of interpretability, rigorous research is imperative to ensure the accuracy and reliability of the results [50].
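To illustrate the perturbation-and-surrogate idea behind LIME without relying on any particular explainability library, the sketch below (assuming NumPy and scikit-learn) perturbs a single input, queries a black-box model, and fits a locally weighted linear surrogate whose coefficients indicate local feature importance; it is a simplified illustration, not the full LIME algorithm.

```python
import numpy as np
from sklearn.linear_model import Ridge

def lime_style_explanation(black_box, x, n_samples=500, scale=0.1, seed=0):
    """Fit a local linear surrogate around input x to approximate black_box."""
    rng = np.random.default_rng(seed)
    perturbations = x + rng.normal(0.0, scale, size=(n_samples, x.shape[0]))
    predictions = black_box(perturbations)                 # query the original model
    # Weight samples by proximity to x so that the surrogate stays local.
    distances = np.linalg.norm(perturbations - x, axis=1)
    weights = np.exp(-(distances ** 2) / (2 * scale ** 2))
    surrogate = Ridge(alpha=1.0)
    surrogate.fit(perturbations, predictions, sample_weight=weights)
    return surrogate.coef_                                 # local feature importances

# Hypothetical black-box model over 5 features; feature 2 dominates locally.
black_box = lambda X: 3.0 * X[:, 2] + 0.5 * X[:, 0]
x = np.array([0.1, -0.3, 0.7, 0.0, 0.2])
print(lime_style_explanation(black_box, x).round(2))
```

For images, LIME typically perturbs superpixels rather than raw feature vectors, but the underlying surrogate-fitting logic is the same.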
3.2.3 Computational resources and efficiency
Mitigating the demand for computational resources and minimizing energy consumption in large models constitutes a significant technical challenge. The training of large models necessitates substantial computational resources, and the escalating demand for computational resources concurrently amplifies the energy consumption associated with training large models.
Current techniques for model compression and acceleration, such as model pruning and quantization, can partially alleviate the strain on computational resources and energy consumption in large models. However, addressing this challenge comprehensively requires sustained research efforts [10].
4. FUTURE DIRECTIONS
Given the substantial potential of large models in the field of medical image analysis, ongoing research on their application in this domain is continually advancing. In this section, I will delineate the future directions of large models in medical image analysis, encompassing, but not limited to, the following aspects:
4.1 Model performance optimization
As the trend toward increasing model scale persists, the complexity of large models rises, necessitating greater computational resources and energy. The training and deployment of large models encounter challenges related to inadequate computational resources and heightened energy consumption. Identifying model optimization and acceleration techniques that diminish computational resource requirements and energy consumption is imperative to propel the development and application of large models.
4.2 Enhancing model interpretability
While the application of large models in medical image analysis brings convenience to physicians and patients, the rigorous and specific nature of medical treatment mandates a clear rationale for treatment decisions. Enhancing the interpretability of large models is essential, enabling physicians and clinicians to comprehend the decision-making process of the models. This improvement provides a reliable foundation for large model-assisted clinical decision-making. These future directions underscore the importance of addressing challenges related to model scale, resource utilization, and interpretability to unlock the full potential of large models in advancing medical image analysis.
4.3 Multimodal medical image analysis
Medical imaging data is diverse, presenting in various formats, and large models exhibit the capability to seamlessly integrate information from multiple types of medical imaging data. This integration fosters information fusion and complementarity between distinct imaging modalities, ultimately enhancing diagnostic accuracy.
4.4 Self-supervised and few-shot learning
Advancing the application of self-supervised and few-shot learning in medical image analysis is crucial for mitigating the challenges posed by limited annotated data.
4.5 Automated medical report generation
Automated medical report generation remains a paramount focus. The ongoing evolution of large models will continue to propel the automation of medical image report generation, thereby alleviating the workload of radiologists.
4.6 Real-time medical image analysis
Real-time analysis and monitoring of medical images constitute a pivotal frontier for the future development of large models. Exploring the application of large models in real-time medical image analysis and monitoring is anticipated to provide substantial support to healthcare professionals and patients alike.
4.7 Privacy protection
The widespread integration of large models in medical image analysis brings forth ethical and regulatory considerations that demand attention. The future trajectory of large models necessitates stringent privacy protection measures in accordance with ethical guidelines and regulatory requirements.
5. CONCLUSION
In conclusion, large models have demonstrated substantial advantages in the analysis of medical images, offering the potential to enhance the precision of disease diagnosis and introduce innovative possibilities to the field of medical image analysis. However, the application of these sophisticated models encounters several challenges, including insufficient data, limited model interpretability, and high computational resource demands. Researchers have proposed addressing these challenges through techniques such as LIME explanatory modeling and model compression and acceleration, which mitigate these issues, at least in part.
The evolving landscape of large models continues to witness advances, with ongoing efforts focused on optimization, acceleration, and the augmentation of interpretability. Additionally, addressing challenges related to the analysis of multimodal medical image data, refining diagnostic accuracy, automating the generation of medical reports, and other dimensions signify the principal developmental trajectories for large models in the foreseeable future.
In summary, large models possess the robust capacity to conduct accurate and in-depth analysis of medical images, introducing unprecedented possibilities for their application. The existing challenges encountered by large models are serving as catalysts for their further refinement. Looking ahead, these models are poised to exhibit heightened performance in the realm of medical image analysis, experiencing deeper integration and continually charting new developmental pathways for the field.