SAIMSARA Journal

Machine-Readable Science • ISSN 3054-3991

AI-Generated Voice, Synthetic Speech, and Voice Cloning: Scoping Review with ☸️SAIMSARA.


Digital Health & Biotech

Issue 3, Volume 1, 2026

DOI: 10.62487/saimsara3635922a

Editorial note
• Last update: 2026-05-09 08:35:32
What is this paper about
AI-generated voice is now useful enough for education, healthcare, accessibility, media, and commerce — but realistic enough to expose a dangerous gap between human perception and synthetic-voice deception. This review compresses 226 original studies into a structured human- and machine-readable evidence map, showing where voice cloning, synthetic speech, detection, authentication, and provenance are already working — and where they remain unsafe, fragile, or poorly validated.
Human-verified editorial review: Verified by World ID proof-of-human. This editorial layer was submitted from a SAIMSARA account verified as a unique human.

Evidence preview
Realistic scene of a patient speaking with a medical voice robot.

Clinical / practical impact

Useful voice interfaces

AI-generated voice is already being tested in education, healthcare, accessibility, media, and commercial interaction.

Healthcare workflow signal

Voice-enabled AI can support patient education, clinical documentation, virtual patients, and medical simulations, but needs oversight.

Accessibility and self-voice

Personalized voices may help users with visual, physical, hearing, or speech impairments preserve identity and communicate more naturally.

Realistic scene of an AI engineer testing voice patterns on a large monitor.

Evidence / detection frontier

Humans are unreliable detectors

Listener studies showed weak or inconsistent detection, including very low accuracy in vishing-style synthetic voice clips.

Automated detectors can excel

Dataset-specific systems reported very high accuracy using spectrograms, acoustic features, CNNs, transformers, and ensemble models.

Voice realism has acoustic fingerprints

Prosody, pitch, timbre, spectral artifacts, vowel-level cues, and time-frequency anomalies remain important signals for detection.

Realistic scene of a mobile phone warning that an incoming call may use AI-generated voice.

Translation gaps / governance

Consent and identity risk

Voice cloning raises practical questions about ownership, impersonation, misinformation, child-facing use, and posthumous or clinical identity replication.

Layered safeguards needed

Safe deployment depends on provenance, watermarking, authentication, explainable detection, and human review rather than one metric alone.

Benchmarks remain fragile

Generalizability is limited by small human studies, heterogeneous datasets, multilingual gaps, adversarial attacks, and real-time deployment constraints.
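
The layered-safeguard idea above can be illustrated with a minimal sketch. It is not a method from the reviewed studies: the signal names (`provenance_verified`, `detector_score`, `speaker_authenticated`), the 0.5 threshold, and the accept / human_review / block policy are all hypothetical, chosen only to show how several independent checks might be combined so that no single metric decides the outcome.

```python
from dataclasses import dataclass


@dataclass
class CallSignals:
    """Hypothetical per-call signals; none of these names come from the review."""
    provenance_verified: bool    # assumed: a provenance or watermark check passed
    detector_score: float        # assumed: detector's estimate (0..1) that the audio is synthetic
    speaker_authenticated: bool  # assumed: an independent authentication step passed


def triage(signals: CallSignals, synthetic_threshold: float = 0.5):
    """Combine independent safeguards instead of trusting one score.

    Returns (decision, flags). The policy is illustrative only:
    no flags -> accept, one flag -> human review, two or more -> block.
    """
    if not 0.0 <= signals.detector_score <= 1.0:
        raise ValueError("detector_score must be in [0, 1]")

    flags = []
    if not signals.provenance_verified:
        flags.append("no_provenance")
    if signals.detector_score >= synthetic_threshold:
        flags.append("detector_flag")
    if not signals.speaker_authenticated:
        flags.append("unauthenticated")

    if not flags:
        return "accept", flags
    if len(flags) == 1:
        return "human_review", flags
    return "block", flags


if __name__ == "__main__":
    # A call with valid provenance and authentication but a high detector score
    # is routed to a human reviewer rather than blocked outright.
    print(triage(CallSignals(True, 0.9, True)))
```

In this sketch a single disagreeing signal escalates to human review rather than triggering an automatic block, reflecting the point that dataset-specific detector accuracy alone may not generalize.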


Abstract: This scoping review aims to map the original research literature on AI-generated voice, identify the most query-relevant recurring finding, and synthesize major research topics, practical implications, limitations, and future directions across technical, human-centered, clinical, educational, security, and societal domains. It draws on 226 original studies with 3,297,311 total participants (topic-deduplicated ΣN). The review suggests that AI-generated voice has reached a level of realism and social utility sufficient to support meaningful applications across education, healthcare, and accessibility, while simultaneously outpacing unaided human ability to distinguish synthetic from authentic speech, with listener accuracy reported as low as 37.5% in vishing-style clips. The dominant signal is a widening gap between human perceptual limits and the demonstrated, though dataset-specific, capability of automated detectors, which are reported to exceed 99% accuracy in constrained settings. This convergence highlights that safe deployment depends less on any single performance metric than on layered safeguards combining provenance, explainable detection, and authentication. Generalizability remains constrained by heterogeneous benchmarks and small human studies. Future research should prioritize standardized multilingual, adversarial, real-time evaluation alongside enforceable consent and provenance frameworks for voice cloning.

Keywords: AI-generated voice; Synthetic speech; Voice cloning; Deepfake detection; Text-to-speech; Voice conversion; Speaker verification; Acoustic features; Mel spectrograms; Human perception

