Limitations of Medical Machine Translation: A Systematic Review with SAIMSARA



saimsara.com

Review Stats
- Generated: 2025-09-22 00:36:22 CEST
- Plan: Premium (Europe PMC enabled; PubMed optional)
- Source: Europe PMC
- Keyword gate: Fuzzy (≥60% of required terms, minimum 2 terms matched in title/abstract)
- Retrieved: 36301
- Abstracts analyzed: 36306
- Included originals: 257
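The fuzzy keyword gate described above can be sketched as a simple screening predicate. The 60% threshold and two-term minimum come from the stats block; the term list and the simple substring matching are illustrative assumptions, not the actual SAIMSARA implementation:

```python
def passes_keyword_gate(title_abstract: str, required_terms: list[str],
                        min_fraction: float = 0.60, min_matched: int = 2) -> bool:
    """Fuzzy gate: a record passes if at least 60% of the required terms
    appear in its title/abstract AND at least 2 terms match."""
    text = title_abstract.lower()
    matched = sum(1 for term in required_terms if term.lower() in text)
    return matched >= min_matched and matched / len(required_terms) >= min_fraction

# Illustrative term list (hypothetical, not the actual SAIMSARA query)
terms = ["machine translation", "medical", "limitations"]
passes_keyword_gate("Limitations of medical machine translation systems", terms)  # True
passes_keyword_gate("A cohort study of cardiac imaging", terms)                   # False
```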

1. Introduction
The integration of artificial intelligence (AI), particularly machine translation (MT) and large language models (LLMs), into healthcare is accelerating, promising significant efficiency gains in clinical workflows and research [10, 20]. These technologies offer the potential to bridge language divides in multilingual settings, automate the translation of healthcare data, and support tasks ranging from generating discharge summaries to analyzing medical literature [35, 60, 81]. However, their deployment in the high-stakes medical domain is fraught with challenges. Concerns persist regarding the accuracy, reliability, and ethical implications of using automated systems for tasks that directly impact patient care and safety [4, 60, 65].

Early research has highlighted that while models like ChatGPT and DeepL show potential, they also demonstrate significant limitations, including incomplete coverage of complex medical terminology, the generation of clinically significant errors, and the perpetuation of systemic biases [1, 3, 11]. The performance of MT systems is often not uniform across languages, with notable shortcomings for less-resourced languages, which can exacerbate existing health disparities [11, 135]. Furthermore, the nuanced nature of clinical communication, which involves specific idioms, complex syntax, and emotional context, presents a formidable challenge for current AI [2, 10]. These limitations underscore the critical need for human supervision and a comprehensive understanding of the risks before MT can be safely integrated into routine clinical practice [4, 43, 68]. This systematic review synthesizes evidence from recent studies to provide a structured overview of the documented limitations of medical machine translation.

2. Aim
The aim of this systematic review is to identify, synthesize, and categorize the principal limitations of machine translation technologies when applied within the medical domain, based on an analysis of original research studies.

3. Methods

3.1 Eligibility criteria
This review included original studies of any design that evaluated or discussed the limitations of machine translation or related AI/ML applications in a medical or biomedical context. Reviews, case reports, and articles not presenting original empirical data were excluded from the primary synthesis.

3.2 Study selection
The included articles were identified from a larger pool using a predefined keyword gate applied during an upstream screening process. This review is based exclusively on the structured data extracted from that selection.

3.3 Data items
Data were extracted from each included study on the following items: study design, directionality (e.g., prospective, retrospective), population or setting, sample size (N), follow-up duration, main results, and key statistics relevant to model performance and limitations.

3.4 Risk of bias
No formal risk of bias tool was applied. Instead, potential sources of bias were inferred qualitatively from the available data fields. Common concerns identified across the studies include the use of small or non-diverse datasets, a lack of external validation, reliance on single imaging or data modalities, and inconsistencies in reporting and evaluation methodologies, all of which may limit the generalizability and certainty of the findings [8, 67, 70, 115].

3.5 Synthesis methods
A narrative synthesis was conducted to collate and summarize the findings. Study results are cited using their recorded article numbers. Simple descriptive summaries of quantitative data, such as the median and range, were computed only when performance metrics shared the exact same definition, unit, and timepoint. Due to significant heterogeneity in endpoints and metrics, no formal meta-analysis or calculation of new confidence intervals was performed.
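The pooling rule above (descriptive summaries only within groups sharing the exact same metric definition, unit, and timepoint) can be illustrated with a small sketch; the record fields and values are hypothetical:

```python
from collections import defaultdict
from statistics import median

def summarize_metrics(records):
    """Group performance metrics by (definition, unit, timepoint) and
    report median and range only within strictly comparable groups."""
    groups = defaultdict(list)
    for r in records:
        groups[(r["definition"], r["unit"], r["timepoint"])].append(r["value"])
    return {key: {"n": len(vals),
                  "median": median(vals),
                  "range": (min(vals), max(vals))}
            for key, vals in groups.items()}

# Hypothetical extracted records; a BLEU score is never pooled with a
# rater-judged acceptability proportion, even though both are "quality" metrics.
records = [
    {"definition": "acceptable-phrase proportion", "unit": "fraction", "timepoint": "single", "value": 0.36},
    {"definition": "acceptable-phrase proportion", "unit": "fraction", "timepoint": "single", "value": 0.84},
    {"definition": "BLEU", "unit": "score", "timepoint": "single", "value": 31.2},
]
summarize_metrics(records)
```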

4. Results

4.1 Study characteristics
The synthesized studies primarily consisted of mixed-methods designs, retrospective cohorts, and cross-sectional analyses [1, 2, 6, 11]. Populations were diverse, ranging from specific patient cohorts, such as those with diabetes or cancer, to broader settings like medical imaging departments, multilingual clinical encounters, and analyses of large electronic health record databases [4, 43, 79, 109]. Follow-up was generally not applicable or not reported in most studies evaluating translation performance.

4.2 Main numerical result aligned to the query
No single, comparable numeric outcome for translation error could be synthesized across the studies due to significant heterogeneity in evaluation metrics, languages, and contexts. However, multiple studies reported quantitative evidence of limitations. For instance, in translations into Haitian Creole, ChatGPT and Google Translate produced higher proportions of potentially clinically significant errors (33.3% and 23.3%, respectively) than professional translations (8.3%) [11]. In a study of two-way clinical communication, the proportion of MT-interpreted phrases rated acceptable ranged from 0.36 to 0.84 depending on the language pair and failed to meet non-inferiority thresholds against professional interpreters [79]. Physician validation of a medical text generation toolkit yielded comparatively low scores for accuracy (3.90/5) and completeness (3.31/5), further highlighting performance gaps [13].

4.3 Topic synthesis
The analysis of the included studies revealed several cross-cutting themes regarding the limitations of medical machine translation.

* Accuracy and Clinical Significance of Errors: A primary limitation is the generation of factual inaccuracies, with some errors being potentially clinically significant [4, 11]. Studies report that MT can produce outputs that could lead to delayed patient care or jeopardize diagnostic decisions [127, 215]. Physician evaluations confirm these concerns, with generated texts scoring lower on accuracy (3.90/5) and completeness (3.31/5) [13].
* Handling of Nuance and Complexity: MT models consistently struggle with linguistic complexity. This includes challenges with grammatical and syntactic nuances, domain-specific idioms, and complex or specialized medical terminology [2, 29, 50]. The quality of translation often degrades as the conceptual difficulty of the source text increases [215].
* Data Dependencies and Poor Generalizability: The performance of MT is highly dependent on the data used for training. Key limitations include small dataset sizes, a scarcity of multilingual medical corpora, and a lack of external validation across different clinical settings or populations [8, 53, 168, 243]. This often leads to poor model generalizability, with performance decreasing when models are applied to new geographic locations or patient demographics [28, 203].
* Performance Disparities Across Languages: Translation quality is not uniform across all languages. Models that perform comparably to professional services for high-resource languages like Spanish and Portuguese show significant shortcomings for less-resourced languages such as Haitian Creole [11]. This asymmetry between English and non-English models can perpetuate health disparities [135].
* Systemic Biases and Ethical Concerns: MT systems can reflect and amplify biases present in their training data. Gender bias has been identified as a persistent challenge with no simple technical solution [3]. Broader concerns include data privacy, patient confidentiality, and the risk of generating hallucinatory or misleading information, all of which require robust ethical and legal frameworks [4, 43, 60].
* Necessity of Human Oversight: Given the prevalence of errors and the lack of clinical context, a consistent finding is the indispensable need for human supervision [4, 10, 65]. Studies show that human post-editors outperform automated systems on most quality metrics, and physician oversight is deemed essential for safe implementation [2, 46].
* Technical and Functional Deficiencies: Beyond linguistic accuracy, current models have functional limitations. These include an inability to process visual or multimodal data (e.g., diagrams in patient instructions), a lack of emotional intelligence crucial for empathetic communication, and difficulty with highly specialized or calculation-based questions [10, 103].

5. Discussion

5.1 Principal finding
The principal finding of this review is that while medical machine translation shows promise for improving efficiency, it is constrained by significant limitations in accuracy, nuance, and generalizability. The evidence consistently highlights that MT models produce errors, a notable portion of which can be clinically significant, particularly when applied to less-resourced languages or complex medical content [11, 79, 215].

5.2 Clinical implications
* Risk to Patient Safety: The use of unverified MT in clinical communication, especially for languages with poorer model performance, poses a direct risk to patient safety through misunderstanding of symptoms, diagnoses, or treatment instructions [11, 127, 199].
* Mandatory Human Verification: Clinical workflows incorporating MT must include a mandatory human verification step by a qualified professional before information is used for decision-making. Relying solely on MT output is not supported by current evidence [4, 65, 68].
* Exacerbation of Health Disparities: The performance gap between high-resource and low-resource languages means that deploying current MT systems could worsen health inequities for patients with limited English proficiency from certain linguistic backgrounds [11, 135].
* Limited Utility in Complex Scenarios: MT tools may be unsuitable for nuanced or emotionally charged conversations, such as end-of-life discussions or complex consent procedures, where misinterpretation of tone or specific terminology could have severe consequences [2, 10, 167].

5.3 Research implications / key gaps
* Standardized Evaluation Metrics: There is a critical need to develop and validate standardized metrics for evaluating medical MT that go beyond technical scores (e.g., BLEU) to incorporate clinical relevance and patient safety impact [13, 25, 119].
* Performance in Low-Resource Languages: Further research is required to investigate and improve MT performance for a wider range of low-resource languages and dialects commonly encountered in diverse healthcare settings [11, 168].
* Impact on Clinical Outcomes: Prospective studies are needed to assess the real-world impact of MT-assisted communication on clinical outcomes, patient satisfaction, and health disparities compared to professional human interpreters [79, 211].
* Human-in-the-Loop Workflows: Research should focus on optimizing human-in-the-loop models, comparing the effectiveness and efficiency of different strategies (e.g., pre-editing source text vs. post-editing translated text) in mitigating MT errors [2, 222].
* Multimodal Translation: A key gap exists in the ability of MT to handle multimodal medical information, such as translating text embedded within diagrams, charts, or patient-facing instructional videos [10, 103].
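To illustrate why technical scores such as BLEU cannot substitute for clinical-relevance metrics, consider a minimal sketch of clipped unigram precision, the first component of BLEU (full BLEU also uses higher-order n-grams and a brevity penalty). The example sentences are hypothetical: two candidate translations receive the identical score even though only one of their errors is clinically dangerous.

```python
from collections import Counter

def unigram_precision(candidate: str, reference: str) -> float:
    """Clipped unigram precision: each candidate word counts only up to
    the number of times it appears in the reference."""
    cand = candidate.lower().split()
    ref_counts = Counter(reference.lower().split())
    clipped = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    return clipped / len(cand)

reference = "take two tablets after meals"
# Both candidates differ from the reference by exactly one word,
# but only the second error is a dangerous dosage change.
a = unigram_precision("take two tablets after dinner", reference)  # benign deviation
b = unigram_precision("take ten tablets after meals", reference)   # dosage error
a == b  # identical surface-overlap score despite very different clinical risk
```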

5.4 Limitations
* Heterogeneity of Metrics — The included studies used a wide array of metrics (e.g., BLEU scores, ROUGE scores, human-rated accuracy, error categorization) to evaluate translation quality, which prevented a quantitative meta-analysis and required a narrative synthesis of the findings.
* Lack of Clinical Outcome Data — The majority of studies focused on technical performance or user perceptions of MT systems. There is a significant lack of research directly measuring the impact of MT errors on patient health outcomes, diagnostic accuracy, or treatment adherence.
* Rapid Technological Evolution — The findings are based on specific versions of AI models (e.g., GPT-3.5, GPT-4o) that are rapidly evolving. Consequently, specific performance benchmarks may become outdated, though the fundamental types of limitations discussed are likely to persist.
* Focus on Textual Data — This review is primarily based on studies of text-to-text translation. Limitations related to the translation of multimodal content, which includes images and diagrams, were identified as a gap but are less thoroughly documented in the synthesized literature.
* Potential Publication Bias — The synthesis relies on published literature, which may be biased toward studies demonstrating the potential of MT rather than those documenting outright failures. This could lead to an underestimation of the full spectrum and severity of MT limitations.

5.5 Future directions
* Standardized Benchmarking Corpora — A critical next step is the development of large, publicly available, and multilingual parallel corpora specifically for the medical domain to enable standardized and reproducible benchmarking of MT models, with a focus on including low-resource languages [168, 243].
* Prospective Clinical Trials — Future research should include prospective, randomized controlled trials that compare clinical workflows integrated with MT against those using professional human interpreters, using patient safety incidents and communication efficacy as primary endpoints [79, 211].
* Multimodal Translation Models — Efforts should be directed toward developing and validating MT systems capable of interpreting and accurately translating mixed-modality medical documents, such as patient education pamphlets that combine text with instructional diagrams [10, 103].
* Bias Auditing Frameworks — Researchers should create and implement systematic frameworks for auditing medical MT models for social biases (e.g., gender, race, language) before they are considered for deployment in clinical settings to prevent the amplification of health disparities [3, 38].
* Explainable AI for MT — The integration of eXplainable AI (XAI) methods into MT systems could help clinicians understand why a particular translation was generated, allowing them to better identify outputs that are at high risk for error and require more thorough verification [100, 173].

6. Conclusion
This systematic review found that while medical machine translation offers potential benefits, its application is constrained by significant and varied limitations. No single comparable numeric outcome for translation error could be synthesized, but studies consistently reported issues with accuracy, the handling of linguistic nuance, and generalizability, with some errors being clinically significant [11, 79]. These limitations are particularly pronounced for less-resourced languages and complex medical content, posing risks to patient safety and potentially widening health disparities. The certainty of these findings is most affected by the heterogeneity of evaluation metrics across studies. A crucial next step is to conduct prospective clinical trials that directly compare MT-integrated workflows with professional human interpretation, focusing on patient safety outcomes and communication effectiveness.


References
SAIMSARA Session Index


Figure 1. Publication-year distribution of included originals

Figure 2. Study-design distribution of included originals

Figure 3. Study-type (directionality) distribution of included originals

Figure 4. Main extracted research topics

Figure 5. Limitations of current studies (topics)

Figure 6. Future research directions (topics)