AI/ML Pentesting | December 16, 2024

Balancing Security and Usability of Large Language Models: An LLM Benchmarking Framework

By 2026, Gartner predicts that “80% of all enterprises will have used or deployed generative AI applications.” However, many of these organizations have yet to find a way to balance usability and security in their deployments. As a result, consumer-facing LLM capabilities introduce a new and less understood set of risks for organizations. The mission of this article, along with the first release of the NetSPI Open Large Language Model (LLM) Security Benchmark, is to clarify some of the ambiguity around LLM security and highlight the visible trade-offs between security and usability.

TLDR;

Large Language Models (LLMs) have become more integrated into critical systems, applications, and processes, posing a challenge for potential security risks.
Increasing security measures in LLMs can negatively affect usability, requiring the right balance. But these behaviors may be desired depending on the business use case.
Our LLM benchmarking framework shows how different LLMs handle adversarial conditions, testing their jailbreakability, while measuring any impact on usability.

Security Concerns in Large Language Models

As LLMs become integral to critical systems, the risk of vulnerabilities like model extraction, data leakage, membership inference, direct prompt injection, and jailbreakability increases. Jailbreaking refers to manipulating a model to bypass safety filters, potentially generating harmful content, exposing sensitive data, or performing unauthorized actions.

These vulnerabilities have significant implications. In business, a compromised LLM could leak proprietary information or become an attack vector. In public applications, there is a risk of harmful or biased content causing reputational damage and legal issues. Therefore, ensuring LLM security is crucial, highlighting the need for robust benchmarks to test their resilience against attacks, including jailbreakability.

Balancing Security and Usability

While enhancing security of an LLM is important, usability is equally important. The model should still perform its intended functions effectively. Oftentimes, security and usability is a balancing act. This challenge is well-documented in software and system design – overly strict filters may limit useful responses, while insufficient security poses risks.

LLM Benchmarking Framework

These challenges and concerns are not going away anytime soon. So, what can be done? We’ve created a benchmarking framework that evaluates both the security and usability of LLMs. Our systematic assessment shows how different LLMs handle adversarial conditions, testing their jailbreakability, while measuring any impact on usability. This dual evaluation helps balance security with functionality, crucial for AI applications in cybersecurity.

Our intent is that the benchmark can provide some level of transparency so that it can be used by organizations to make more informed choices that better align to their use cases and risk appetite.

Read the details surrounding our benchmark approach and key finding in the initial release

While the findings and benchmarks presented in this paper reflect our current understanding of LLM security and usability, it is important to note that this research is part of an evolving body of work. As advancements in model evaluation techniques and security practices emerge, we expect to refine and expand upon these benchmarks. We encourage feedback and constructive critique from readers, as it will help to further improve the robustness and comprehensiveness of our methodology. We remain committed to ensuring that these evaluations continue to meet the highest standards as the field develops.

We invite you to participate in this research and contribute your insights to the paper, helping shape the future of AI security.

Authors:

Kurtis Shelton

Principal AI Researcher

Tristan Blackburn

Data Scientist

Jake Karnes

Senior Technical Architect, Technical Enablement

Nicholas Stang

Head of Analytics and Data Science

Explore More Blog Posts

Cloud Pentesting

Extracting Sensitive Information from Azure Load Testing

July 1, 2025

Learn how Azure Load Testing's JMeter JMX and Locust support enables code execution, metadata queries, reverse shells, and Key Vault secret extraction vulnerabilities.

Learn More

Proactive Security

3 Key Takeaways from Continuous Threat Exposure Management (CTEM) For Dummies, NetSPI Special Edition

July 1, 2025

Discover continuous threat exposure management (CTEM) to learn how to bring a proactive approach to cybersecurity and prioritize the most important risks to your business.

Learn More

Penetration Testing as a Service (PTaaS)

How Often Should Organizations Conduct Penetration Tests?

June 30, 2025

Learn how often organizations should conduct penetration tests. Discover industry best practices, key factors influencing testing frequency, and why regular pentesting is essential for business security.

Learn More

NetSPI Labs

Gut Check: Are You Getting the Most Value out of Your Penetration Testing Report?

Chubb partners with NetSPI to bring attack surface management to its policyholders

Partner with NetSPI

Balancing Security and Usability of Large Language Models: An LLM Benchmarking Framework

Kurtis Shelton,

Tristan Blackburn,

Jake Karnes,

Nicholas Stang

TLDR;

Security Concerns in Large Language Models

Balancing Security and Usability

LLM Benchmarking Framework

Read the details surrounding our benchmark approach and key finding in the initial release

Authors:

Explore More Blog Posts

Extracting Sensitive Information from Azure Load Testing

3 Key Takeaways from Continuous Threat Exposure Management (CTEM) For Dummies, NetSPI Special Edition

How Often Should Organizations Conduct Penetration Tests?