Beyond the Hype: What Regulated Industries Need to Know Before Trusting AI Security Tooling

Anthropic recently borrowed a page from OpenAI’s playbook by announcing that their new model, Mythos, was “too dangerous” for general release. OpenAI ran that same play with GPT-2 way back in 2019, and it generated headlines, hand-wringing, and a bump in valuation, so the result this time was just as predictable. C-suites and boards turned to their Application and Network Security teams and demanded that they get “Mythos-ready,” a phrase that anyone with a software engineering background would instinctively follow with “I don’t think that word means what you think it means.” Those teams were then tasked with evaluating whether AI-centric, autonomous tools could identify meaningful vulnerabilities, map attack paths across their environments, and (possibly) replace the security tools that they already have in place. In nearly every case, their answer was the same: trust, context, and cost make this a conditional ‘yes’ at best, but not a broadly usable capability today. Mythos and other frontier models make attacks that were previously performed by advanced adversaries available to less advanced attackers. True, new zero-day vulnerabilities need to be addressed, but they’re not the most common way that attackers get into an environment or capture sensitive data.

What Enterprise Security Teams Actually Need

At NetSPI, we have the privilege of working with clients that are leaders in their sectors (finance, telecommunications, manufacturing, healthcare, big tech), and what we hear consistently from their security teams is that AI’s capability to build an attack is not the problem. Instead, what they need is consistency, reliability, and auditability in identifying and addressing risks. Their CISOs tell us that they also need a predictable return on investment and the confidence that AI-enabled tooling will not become a liability that erodes customer trust. That combination (accuracy, reliability, and ROI) is not yet common in the AI security market.

The Infrastructure Gap No One Is Talking About

Despite advances in reasoning and the large context windows available today, most LLM-enabled systems lack the infrastructure and controls to provide the level of determinism, auditability, error handling, and well-defined trust boundaries that enterprise clients require and that software engineering has treated as standard capabilities for over sixty years. When results from these tools vary across identical inputs, when false positives crowd out actionable findings, and when their accuracy rarely exceeds 80%, the result is a performance gap that enterprise security teams cannot operationally absorb and that responsible C-suite executives hesitate to bet on.

Also, consider that from a governance perspective, the selection of an AI model is important, but it represents roughly 10% of the engineering focus. The remaining 90% belongs to the systems surrounding it: the orchestration, memory controls, contextual constraints, and tooling that constitute the AI’s ‘harness’. That harness is what manages the broad, dynamic, and non-deterministic behavior of LLMs and agents in production environments. Frontier model providers and some start-ups are building these harnesses because they recognize that this is where the levers to adjust risk, cost, and reliability live, and it is precisely where solutions built on AI tooling alone fall short.
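To make the idea of a harness concrete, here is a minimal Python sketch of the kind of wrapper that sits around a model call. It is an illustration under stated assumptions, not a description of any vendor's product or of NetSPI's tooling: the Harness class, the allowed_targets scope list, and the two-field finding schema are all hypothetical names chosen for this example, and the model is any callable that takes a prompt and returns text.

```python
import hashlib
import json
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set

# Assumed minimal schema for a finding; a real schema would be richer.
REQUIRED_FIELDS = {"title", "severity"}


@dataclass
class Harness:
    """Hypothetical wrapper that adds scope checks, validation, and an audit trail."""

    model: Callable[[str], str]   # any callable mapping a prompt to text
    allowed_targets: Set[str]     # contextual constraint: in-scope assets only
    audit_log: List[Dict] = field(default_factory=list)

    def run(self, target: str, task: str) -> Optional[Dict]:
        # Trust boundary: refuse out-of-scope targets before the model is called.
        if target not in self.allowed_targets:
            self._record(target, task, "blocked", "")
            return None

        raw = self.model(f"{task}\nTarget: {target}")

        # Error handling: accept only JSON objects that carry the required fields.
        try:
            finding = json.loads(raw)
        except json.JSONDecodeError:
            finding = None
        valid = isinstance(finding, dict) and REQUIRED_FIELDS.issubset(finding)

        self._record(target, task, "accepted" if valid else "rejected", raw)
        return finding if valid else None

    def _record(self, target: str, task: str, status: str, raw: str) -> None:
        # Auditability: every call leaves a record, including blocked and rejected ones.
        self.audit_log.append({
            "time": time.time(),
            "target": target,
            "task": task,
            "status": status,
            "output_sha256": hashlib.sha256(raw.encode("utf-8")).hexdigest(),
        })
```

Even in a toy version like this, most of the code concerns scope, validation, and record keeping rather than the model itself, which is the point of the 10% versus 90% split described above.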

What our own use of AI automation demonstrates is that AI security tools, when integrated with clear operational guardrails and a realistic understanding of their limitations, can function well as productivity enhancers for skilled practitioners, but not as replacements for them. Our philosophy is to augment our subject matter experts with AI but not substitute it for their judgment or creativity. We find that for our clients in regulated industries, the exposure created by offloading that expertise to a fully automated AI tool is not a theoretical risk but a real threat to the client’s business with significant consequences.

The Cost and Trust Problem

Our clients tell us that costs are where boardroom patience with AI security tooling runs out the fastest. When Mythos was released to selected organizations, it included $100 million in free tokens, a figure that highlights both the commercial stakes and the opacity of budgeting for AI model deployments. As AI-enabled security tools move from the prototype phase to operational use, token budget forecasts fail, funding allocations shift to cover the gap, and pricing structures change yet again (often on the vendor’s timelines). The result is what finance teams are starting to call a “pull-the-rug” effect, where AI budgets that were approved at one cost projection are suddenly exposed to new ones and the delta between them is rarely small.

Trusting the output of AI tools is another barrier, and in many ways the hardest to solve. NetSPI’s offensive security practitioners have been testing machine learning systems, LLMs, and agentic platforms for years, and what they observe is that without precise instructions, guardrails, orchestration harnesses, auditing layers, and accuracy filters, AI-generated findings in enterprise deployments degrade in reliability over time. This is not rare; it is a structural characteristic of how these systems behave at scale. That is why, when we integrate AI into our testing methodologies, we do so alongside practitioners with years of experience identifying novel vulnerabilities and developing proof-of-concept exploits that can be used reliably in the real world. For us, this isn’t a marketing position but an operational requirement.

Finally, regulators and auditors are still finding their footing on how to evaluate AI development processes, security controls, and outputs from LLM-enabled and agentic applications because of the non-deterministic nature of these tools. When you can ask the same question five times and get a different answer on two of those attempts, the evidence becomes questionable. This is why AI-only security testing often doesn’t meet the bar, and auditors, regulators, and cybersecurity vendors are raising those concerns now. Simply put, for most companies, enterprise security is too sensitive to be left exclusively in the hands of automation without human oversight, validation, or added intuition.
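One way to put a number on the "ask five times" problem is to replay the same prompt and measure how often the tool agrees with itself. The Python sketch below is a minimal illustration, not an established audit method: agreement_rate and scripted_tool are hypothetical names, the canned answers are made up for the example, and a real check would need a more careful notion of "same answer" than a lowercased string match.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Iterable


def agreement_rate(ask: Callable[[str], str], prompt: str, runs: int = 5) -> float:
    """Return the fraction of runs that produced the most common answer."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    # Normalize lightly so trivial casing or whitespace differences do not count.
    answers = [ask(prompt).strip().lower() for _ in range(runs)]
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / runs


def scripted_tool(answers: Iterable[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a real tool: replays canned answers in order."""
    replay = cycle(answers)
    return lambda _prompt: next(replay)


if __name__ == "__main__":
    # Made-up answers: three runs agree, two differ.
    tool = scripted_tool(["High", "High", "Medium", "High", "Low"])
    rate = agreement_rate(tool, "Rate the severity of this finding", runs=5)
    print(f"agreement: {rate:.0%}")
```

With the made-up answers above, three of five runs agree, so the rate is 60%. Whatever threshold an organization or its auditors consider acceptable is a policy decision that the sketch does not attempt to answer.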

Why Human Expertise Still Matters

Enterprise applications and environments are highly complex, and there are often implications for securing a system or assigning a severity to a vulnerability that exist outside of code, threat models, or architecture diagrams. These information sources, while useful as a baseline, lack the deep, systemic knowledge that lives within your engineering, security, and risk teams. The scale and scope of AI deployments demand that teams address risk holistically, across organizational boundaries that have traditionally operated in isolation. Understanding “the whole animal” requires a systems-thinking perspective that autonomous AI tools have not yet developed, and it is one where human expertise proves invaluable.

This kind of systems-level thinking works best when human expertise is grounded in structured methods like failure mode analysis, dependency mapping, and risk quantification, all applied together rather than in isolation. Our practitioners see the whole animal, not just the part that is in front of them, and our research indicates that this is the “secret sauce” that makes AI-enabled security testing viable.

Three Questions to Cut Through the Hype

If your organization is evaluating AI security tools, use these few questions to cut through the capability claims:

Do we understand how much this will cost over time?
    Is pricing token-based, and could it become much more expensive as compute costs rise?
What regulatory requirements does our organization have, and will auditors accept AI-only security assessments of our environment?
How confident are we in the capabilities and consistency of a given tool?
    How prepared are we to triage output that may not be calibrated for our organization specifically?

Phil Morris
