The Rapid Evolution of AI Voice Cloning and its Implications for Cybersecurity
TL;DR
AI voice cloning technology has rapidly advanced, allowing realistic voice replicas to be created with minimal audio input. While initially used for benign purposes like content creation, the technology poses significant cybersecurity risks. Malicious actors can exploit voice cloning for real-time impersonation, potentially leading to security breaches by mimicking trusted voices.
To better understand these new threats, NetSPI developed a voice cloner tool that allows us to demonstrate this risk to our customers by using cloned voices in social engineering tests. To meet the challenges voice cloners pose, organizations must implement robust verification processes and conduct employee training to better position themselves against modern threats.
Introduction
The pace of AI development is accelerating – often moving faster than the implementation of necessary safety measures and regulations meant to harness it. This gap can create opportunities for malicious actors to exploit emerging technologies.
A striking example is the progress in voice cloning. Not long ago, cloning a user’s voice required hours of pristine audio, and the results were often imperfect. Today, companies like ElevenLabs have transformed the landscape by enabling the cloning of any voice using just one or two minutes of audio. These advanced models introduce natural cadence to produce highly realistic voice replicas.
Initially, the primary applications of this technology were benign, mostly aimed at enhancing content creation and reducing the cost and time associated with voiceover production. However, the potential for malicious use is significant and growing.
Threat Landscape of Real-Time Voice Cloning
Consider the implications of an attacker using this sophisticated voice cloning technology to create a tool for real-time audio impersonation. This goes beyond traditional text-to-speech (TTS) applications, enabling attackers to mimic any voice live during a conversation.
Imagine the potential consequences:
- An attacker could impersonate a manager and instruct an employee to perform actions that compromise security.
- More alarmingly, an attacker could clone a CEO’s voice from public interviews to announce false information, such as company layoffs, causing internal chaos.
- On a broader scale, the ability to impersonate high-profile figures like military officials or political leaders could lead to mass panic or unauthorized access to sensitive information, posing severe national security threats.
Introducing NetSPI’s Advanced Voice Cloning Tool
To better understand what we’re up against, NetSPI developed a cutting-edge tool capable of generating cloned voices in real time. The tool uses short audio samples, which can be sourced publicly or through brief social engineering interactions, to accurately replicate any voice. The cloned voice can be deployed for TTS or real-time voice impersonation, equipping threat actors for a sophisticated vishing attack.
This development builds on NetSPI’s previous work in adversary simulation and deepfake technology, specifically research on using deepfakes to bypass voice biometrics. This earlier work demonstrated how deepfake technology could be leveraged to overcome voice authentication systems, highlighting the potential security risks posed by AI-driven voice manipulation.
How NetSPI Used AI Voice Cloning for Real-World Social Engineering Testing
In a recent engagement, The NetSPI Agents proved the power and potential risks of voice cloning technology. The objective was to demonstrate how easily cybercriminals could exploit this technology to deceive employees and gain unauthorized access to a system.
Setup
For the purpose of the demonstration, a manager from the client’s help desk consented to participate. NetSPI recorded a short conversation with this manager to create a clone of his voice using advanced voice cloning software. The goal was to leverage the familiarity and trust associated with the manager’s voice to deceive other employees into divulging their login credentials.
Execution
Using the cloned voice, NetSPI crafted a voicemail message that sounded alarmingly authentic. The message went something like this:
“This is [impersonated employee] with the help desk. Our security team received an alert for your workstation this morning. When you get a chance, please review the ticket by signing in at [example.com]. Thank you.”
This message was then sent to several employees, who were likely to recognize the manager’s voice and trust the request.
Outcome
The operation was a success. One of the employees listened to the voicemail and followed the instructions, visiting the provided link and logging in with their credentials. Unbeknownst to them, the link directed them to a phishing site, allowing us to capture their login information.
Key Learnings
This test demonstrated the alarming ease with which cybercriminals can exploit voice cloning technology for malicious purposes. Here are two implications of AI voice cloning for cybersecurity:
- Trust Exploitation: Employees are more likely to follow instructions from voices they recognize and trust, making voice cloning a potent tool for vishing attacks.
- Awareness and Training: Organizations must invest in regular cybersecurity training to educate employees about the risks of advanced social engineering attacks, including voice cloning.
NetSPI’s use of voice cloning in this engagement underscores the evolving nature of cyber threats. By understanding these risks and taking proactive measures, organizations can better protect themselves from sophisticated attacks.
Foundational Guidelines for Safely Using AI Voice Cloners
In exploring the capabilities of voice cloning technology, it’s essential to address the ethical implications and practices that guide its use. At NetSPI, we uphold a strict policy regarding cloned content, ensuring that we only proceed with the explicit consent of the individual whose voice is being utilized. This commitment to ethical standards extends to our pretexts, which are crafted to avoid controversial or harmful scenarios.
For instance, we steer clear of high-stress narratives such as threats of job loss or emergencies involving family members. Instead, our intent is to create interactions that are as unmemorable as possible — blending into the background rather than drawing attention.
This approach reduces the potential for unnecessary stress among employees, aligning our practices with both ethical communication and responsible technology usage.
Proactive Security Can Counter Social Engineering Risks
The technology behind voice cloning is still emerging, but its impact on cybersecurity is significant. As major players like OpenAI continue to advance voice generation models, the accessibility and speed of these technologies will increase. Future models may only require seconds of audio to create accurate voice clones.
Strategies for Social Engineering Prevention Against AI Voice Cloning
To mitigate these risks, organizations must implement rigorous verification processes for voice-based interactions. Below are a few effective strategies to enhance security against voice cloners:
- Issue a one-time passcode to the caller’s device.
- Rotate codes or phrases available on a shared intranet site. The caller provides the secret word when the call starts. The secret word refreshes every two to three minutes to prevent reuse.
- Perform secondary verification through internal chat. The caller confirms a Microsoft Teams message with the verification code/phrase to make sure the request is legitimate.
- Conduct a live video call where the caller can present proof of identification, such as a government-issued ID.
Without stringent safeguards, the misuse of voice cloning technology is poised to become a prevalent method for social engineering attacks. Such attacks are among the most straightforward and effective tactics for compromising systems, as demonstrated by recent high-profile breaches like the one MGM experienced in 2023, resulting in an estimated loss of up to $100 million.
See NetSPI’s Voice Cloner in Action
While the advancements in AI voice cloning offer transformative potential, they also demand a proactive approach to security. Ensuring collaboration between technologists, policymakers, and cybersecurity professionals is essential to mitigate the risks and harness the benefits of this powerful technology.
See how The NetSPI Agents use our voice cloner tool to enhance clients’ social engineering detection and prevention capabilities. Prepare for the future by equipping your team against hyper-targeted vishing campaigns. Request a demo today.