SAIMSARA Journal

Machine Generated Science • ISSN 3054-3991

AI-Generated Voice, Synthetic Speech, and Voice Cloning: Scoping Review with ☸️SAIMSARA.


Digital Health

Issue 3, Volume 1, 2026

DOI: 10.62487/saimsara3635922a

Editorial note
• Last update: 2026-05-09 08:35:32
What is this paper about?
AI-generated voice is now useful enough for education, healthcare, accessibility, media, and commerce — but realistic enough to expose a dangerous gap between human perception and synthetic-voice deception. This review compresses 226 original studies into a structured human- and machine-readable evidence map, showing where voice cloning, synthetic speech, detection, authentication, and provenance are already working — and where they remain unsafe, fragile, or poorly validated.
Human-verified editorial review: verified by World ID proof-of-human. This editorial layer was submitted from a SAIMSARA account verified as a unique human.

Evidence preview
Realistic scene of a patient speaking with a medical voice robot.

Clinical / practical impact

Useful voice interfaces

AI-generated voice is already being tested in education, healthcare, accessibility, media, and commercial interaction.

Healthcare workflow signal

Voice-enabled AI can support patient education, clinical documentation, virtual patients, and medical simulations, but needs oversight.

Accessibility and self-voice

Personalized voices may help users with visual, physical, hearing, or speech impairments preserve identity and communicate more naturally.

Realistic scene of an AI engineer testing voice patterns on a large monitor.

Evidence / detection frontier

Humans are unreliable detectors

Listener studies showed weak or inconsistent detection, including very low accuracy in vishing-style synthetic voice clips.

Automated detectors can excel

Dataset-specific systems reported very high accuracy using spectrograms, acoustic features, CNNs, transformers, and ensemble models.

Voice realism has acoustic fingerprints

Prosody, pitch, timbre, spectral artifacts, vowel-level cues, and time-frequency anomalies remain important signals for detection.
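The cues above can be illustrated with a minimal sketch: a log-magnitude spectrogram computed with NumPy, plus per-frame spectral flatness, one simple time-frequency statistic of the kind detection systems build on. This is a toy illustration under invented parameters (frame size, hop, test signals), not a method from any of the reviewed studies.

```python
import numpy as np

def log_spectrogram(x, frame=512, hop=256):
    """Frame a signal, apply a Hann window, and return log-magnitude spectra."""
    n = 1 + (len(x) - frame) // hop
    win = np.hanning(frame)
    frames = np.stack([x[i * hop : i * hop + frame] * win for i in range(n)])
    mag = np.abs(np.fft.rfft(frames, axis=1))
    return np.log(mag + 1e-10)

def spectral_flatness(log_mag):
    """Per-frame flatness in (0, 1]: geometric mean / arithmetic mean of magnitude."""
    mag = np.exp(log_mag)
    geo = np.exp(np.mean(np.log(mag + 1e-10), axis=1))
    arith = np.mean(mag, axis=1)
    return geo / arith

sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 220 * t)                     # harmonic, voice-like carrier
noise = np.random.default_rng(0).standard_normal(sr)   # hiss-like artifact

flat_tone = spectral_flatness(log_spectrogram(tone)).mean()
flat_noise = spectral_flatness(log_spectrogram(noise)).mean()
print(flat_tone < flat_noise)  # tonal frames are far less "flat" than noise
```

Real detectors feed far richer features (mel spectrograms, prosodic contours, learned embeddings) into CNNs, transformers, or ensembles; the point here is only that synthetic-speech artifacts are, in principle, measurable in the time-frequency plane.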

Realistic scene of a mobile phone warning that an incoming call may use AI-generated voice.

Translation gaps / governance

Consent and identity risk

Voice cloning raises practical questions about ownership, impersonation, misinformation, child-facing use, and posthumous or clinical identity replication.

Layered safeguards needed

Safe deployment depends on provenance, watermarking, authentication, explainable detection, and human review rather than one metric alone.

Benchmarks remain fragile

Generalizability is limited by small human studies, heterogeneous datasets, multilingual gaps, adversarial attacks, and real-time deployment constraints.
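One safeguard named above, watermarking, can be sketched in miniature: a keyed, low-amplitude pseudorandom pattern is added to the audio and later detected by correlation. Everything here (function names, strength, threshold) is invented for illustration; production provenance schemes are far more robust to compression and editing.

```python
import numpy as np

def watermark(audio, key, strength=0.01):
    """Embed a keyed, low-amplitude pseudorandom pattern into the signal."""
    pattern = np.random.default_rng(key).choice([-1.0, 1.0], size=len(audio))
    return audio + strength * pattern

def detect(audio, key, threshold=0.005):
    """Correlate against the keyed pattern; watermarked audio scores high."""
    pattern = np.random.default_rng(key).choice([-1.0, 1.0], size=len(audio))
    score = np.dot(audio, pattern) / len(audio)
    return bool(score > threshold)

rng = np.random.default_rng(1)
speech = rng.standard_normal(16000) * 0.1   # stand-in for one second of audio
marked = watermark(speech, key=42)

print(detect(marked, key=42))   # correct key recovers the mark
print(detect(speech, key=42))   # clean audio scores near zero
```

The design choice worth noting is layering: the correlation score is one signal among several (provenance metadata, detector output, human review), consistent with the review's conclusion that no single metric suffices.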


Abstract: To map the original research literature on AI-generated voice, identify the most query-relevant recurring finding, and synthesize major research topics, practical implications, limitations, and future directions across technical, human-centered, clinical, educational, security, and societal domains. The review synthesizes 226 original studies with 3,297,311 total participants (topic-deduplicated ΣN). This scoping review suggests that AI-generated voice has reached a level of realism and social utility sufficient to support meaningful applications across education, healthcare, and accessibility, while simultaneously outpacing unaided human ability to distinguish synthetic from authentic speech, with listener accuracy reported as low as 37.5% in vishing-style clips. The dominant signal is a widening gap between human perceptual limits and the demonstrated, though dataset-specific, capability of automated detectors reaching above 99% accuracy in constrained settings. This convergence highlights that safe deployment depends less on any single performance metric than on layered safeguards combining provenance, explainable detection, and authentication. Generalizability remains constrained by heterogeneous benchmarks and small human studies. Future research should prioritize standardized multilingual, adversarial, real-time evaluation alongside enforceable consent and provenance frameworks for voice cloning.

Keywords: AI-generated voice; Synthetic speech; Voice cloning; Deepfake detection; Text-to-speech; Voice conversion; Speaker verification; Acoustic features; Mel spectrograms; Human perception

Review Stats


Unlock the full evidence map

The full evidence review, including the Introduction, Methods, Results, Discussion, Conclusion, figures, and complete reference index, opens after purchase or sign-in. The Evidence Object JSON is a separate machine-readable evidence product: a concentrated synthesis of results, topic-level evidence, and discussion across original and non-original studies. It can be fed directly into your LLM, agent, or RAG workflow.

Reference Index (170)