Artificial Intelligence


Benchmarking AI security: Inside the new HTB AI Range

Continuous, real-world evaluation of autonomous cyber AI agents – benchmarked via live adversarial simulations, governed for enterprise compliance, and measured on the latest threats.


b3rt0ll0, Dec 03, 2025

Artificial intelligence is rapidly becoming integral to cybersecurity operations, but its capabilities must be demonstrated, not assumed.

With the majority of cyber teams now using AI in some capacity, security leaders face a critical question: How do we continuously validate that our AI agents will perform safely and effectively against real-world threats? 

Traditional one-off evaluations are not enough. AI agents and new models exhibit complex, adaptive behaviors, and their real security posture can drift over time or under novel conditions.

Recognizing this challenge, at Hack The Box we are stepping up with a groundbreaking solution for ongoing, rigorous AI agent validation.

HTB AI Range: Continuous real-world testing for AI agents

Hack The Box (HTB) AI Range is a first-of-its-kind platform designed to continuously evaluate agentic security workflows using the largest available set of targets in the market, with new releases every week.

Leveraging our vast experience in cyber ranges and a community of more than 4 million professionals, AI Range provides a safe proving ground where AI agents can be tested, trained, and benchmarked against real network environments and adversaries.

Over the years, we have earned the trust of over 1,500 organizations worldwide – including Fortune 100 companies and government and defense institutions – as the authority in cyber readiness and workforce development.

Now, with AI Range, we are set to expand that authority to AI validation, offering enterprises a governed environment to measure and improve their performance in AI-enhanced and agentic cyber operations.

Our new platform provides the most accurate AI agent benchmarks for cybersecurity, setting a new industry standard for how organizations build and deploy AI.

EXPLORE HTB AI RANGE

Unique evaluation methodology: Rigorous, standardized, and ongoing

HTB AI Range’s evaluation methodology is a key differentiator. It goes far beyond binary pass/fail testing, taking a comprehensive approach to measuring performance, resilience, and safety:

  • Multiple overlapping evaluations: Each AI agent is tested across a suite of scenario variations to build confidence in its performance. Overlapping different challenge types, combined with broad coverage across multiple parameters (network configurations, threat types, etc.), ensures a thorough assessment of the agent’s capabilities under varied conditions.

  • Continuous updates with new scenarios: At HTB, we introduce new targets and challenges every week, and these are fully available within AI Range. Because AI agents have never been trained on these new scenarios, the evaluations expose them to truly exclusive data, minimizing overfitting and revealing how they handle unfamiliar situations.

  • Telemetry-based scoring: Every action an AI agent takes during an exercise is logged and analyzed. Rather than simply noting whether the agent captured a flag or repelled an attack, AI Range captures detailed telemetry signals (commands executed, exploits attempted, system responses, etc.) to provide rich evidence of how the agent operates. The result is a scoring methodology that reflects not just success, but efficiency, strategy, and adherence to security policies. Such granular, evidence-based scoring gives security teams meaningful insight into agent behavior and reliability.

  • Reinforcement learning integration: Uniquely, HTB AI Range is not just a testing ground but also a training loop. The platform supports reinforcement learning (RL) for continuous improvement. After an evaluation, the rich telemetry and feedback can be used to fine-tune the AI agent. By combining RL techniques with the platform’s data-driven behavior telemetry and optional human feedback, organizations can iteratively harden their AI models. Rather than a one-time test, AI Range enables a sustainable program of continuous validation and enhancement.
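To make the telemetry-based scoring idea concrete, here is a minimal Python sketch that derives success, efficiency, and policy-adherence metrics from logged agent actions. The event fields and scoring formula are hypothetical illustrations, not AI Range’s actual implementation:

```python
from dataclasses import dataclass

@dataclass
class TelemetryEvent:
    """One logged agent action (hypothetical schema for illustration)."""
    action: str       # e.g. a shell command or exploit attempt
    succeeded: bool   # did the action have its intended effect?
    in_policy: bool   # did the action stay within the allowed scope?

def score_run(events: list[TelemetryEvent], flag_captured: bool) -> dict:
    """Score a single evaluation run from its telemetry, not just its outcome."""
    total = len(events)
    effective = sum(e.succeeded for e in events)
    violations = sum(not e.in_policy for e in events)
    return {
        "success": flag_captured,
        # Useful actions per action taken: a proxy for strategic efficiency.
        "efficiency": effective / total if total else 0.0,
        # Fraction of actions that respected the security policy.
        "policy_adherence": 1.0 - (violations / total if total else 0.0),
    }
```

Richer versions of such a scorer could weight actions by severity or track time-to-objective, but even this simple shape shows why telemetry reveals more than a pass/fail verdict.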

Benchmarking methodology

This methodology aligns with best practices highlighted by industry frameworks. For instance, NIST’s AI Risk Management Framework emphasizes that Test, Evaluation, Verification, and Validation (TEVV) processes should be performed regularly throughout an AI system’s lifecycle.

HTB AI Range embodies this principle by making ongoing evaluation possible, enabling mid-course corrections and risk mitigation as an AI system evolves.
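The ongoing-evaluation principle can be sketched as a simple re-evaluation loop: run the agent against fresh scenarios on a schedule, compare the score to a baseline, and flag any regression for mid-course correction. The `evaluate` callable and the 5% tolerance below are illustrative assumptions, not prescribed values:

```python
def tevv_cycle(evaluate, baseline_score: float,
               tolerance: float = 0.05) -> tuple[float, bool]:
    """One iteration of a continuous TEVV-style loop (illustrative sketch).

    `evaluate` is a hypothetical callable that runs the agent against fresh
    scenarios and returns an aggregate score in [0, 1]. Drift is flagged when
    the latest score falls below the baseline by more than the tolerance.
    """
    latest = evaluate()
    drifted = (baseline_score - latest) > tolerance
    return latest, drifted
```

In practice the flagged run would trigger investigation and retraining; the point of the sketch is that validation becomes a recurring loop rather than a one-time gate.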

“Treat AI agents as new components in a layered defense: secure the code, instrument heavily, enforce policies, and continuously test. Our AI Range provides exactly that: a rigorous, repeatable testbed to continuously measure and improve AI agents before trusting them in production.”

– Nikos Maroulis, VP of Artificial Intelligence @ Hack The Box

Benchmarking and leaderboards: Proving AI performance

One of the standout features of HTB AI Range is its leaderboard and benchmarking system.

Much like our legendary hacker leaderboards, the AI Range tracks and compares AI agent performance in a transparent, competitive format.

Security teams and AI vendors can see how their models stack up against others on standardized challenges. The platform can benchmark AI agents against well-known security challenge sets.

For instance, a recent evaluation was based on the OWASP Top 10 (2025) web application security risks. In that benchmark, multiple state-of-the-art AI models were deployed to find and exploit vulnerabilities corresponding to OWASP Top 10 categories. We continuously evaluate frontier models from leading AI labs, using each provider’s default settings for fair comparison.
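At its core, a leaderboard of this kind aggregates per-category outcomes into a comparable score. The sketch below ranks models by the fraction of benchmark categories they successfully exploited; the model names and category labels are placeholders, not real benchmark results:

```python
def leaderboard(results: dict[str, dict[str, bool]]) -> list[tuple[str, float]]:
    """Rank models by the fraction of benchmark categories they solved.

    `results` maps model name -> {category: exploited successfully?}.
    All names here are illustrative placeholders.
    """
    scored = [
        (model, sum(outcomes.values()) / len(outcomes))
        for model, outcomes in results.items()
    ]
    # Highest solve rate first, as on a typical leaderboard.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

A production leaderboard would add confidence intervals across repeated runs and category weights, but the aggregation step looks essentially like this.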


For CISOs, CIOs, and enterprise teams, this kind of standardized, side-by-side benchmarking is invaluable as it transforms AI performance from vague vendor claims into hard data. Whether you’re evaluating a third-party AI security product or an internally developed agent, the AI Range leaderboard shows where it stands against peers.

VIEW FULL BENCHMARK

Addressing key AI security concerns for the C-suite

HTB AI Range is purpose-built to solve the pressing challenges that CISOs and technical leaders are voicing today. Here’s how it addresses four major pain points:

  • Measuring AI security posture: It’s difficult to quantify how secure or effective an AI agent really is. AI Range provides concrete metrics giving a quantifiable security posture for AI systems. Leaders get a dashboard of an AI agent’s capabilities and weaknesses, akin to a report card for its cyber skills. This takes the guesswork out of AI performance with data-driven evidence.

  • Validating security guardrails: In AI Range’s controlled environment, you can unleash the agent and observe its every move with accurate telemetry data. If it tries to perform an action outside its allowed scope, it will be caught in the act. This validation process hardens AI guardrails by testing them rigorously, so by the time the AI operates in production, you have high assurance it will stay within bounds.

  • Ensuring compliance: New regulations and standards require proof that AI systems are trustworthy and under control. AI Range produces detailed logs and evaluation results mapped to familiar security frameworks. Security leaders can provide reports from AI Range as proof of continuous testing and improvement, aligning with best practices.

  • Evaluating agentic behavior: Autonomous AI agents can develop unexpected strategies or “decide” to take unanticipated actions. AI Range works as a stress test for AI autonomy. By exposing agents to complex, adversarial scenarios, it reveals how they act under pressure. Thanks to comprehensive monitoring, any emergent failure or weakness is identified and can be corrected in training. This helps organizations confidently deploy agentic AI with the knowledge that its behavior has been fine-tuned under realistic scenarios.
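The guardrail-validation point above can be illustrated with a small scope-audit sketch: scan logged agent actions for IP addresses outside an allowed network range and surface any out-of-scope attempt. The allow-list, regex, and action format are illustrative assumptions, not AI Range’s telemetry schema:

```python
import ipaddress
import re

# Illustrative allow-list of in-scope networks; a real deployment would derive
# this from the engagement's rules of engagement.
ALLOWED_NETWORKS = {"10.10.0.0/24"}

def in_scope(target_ip: str) -> bool:
    """Check whether a target IP falls inside the permitted network scope."""
    addr = ipaddress.ip_address(target_ip)
    return any(addr in ipaddress.ip_network(net) for net in ALLOWED_NETWORKS)

def audit(actions: list[str]) -> list[str]:
    """Return logged actions that touched an out-of-scope IPv4 address."""
    violations = []
    for action in actions:
        ips = re.findall(r"\b(?:\d{1,3}\.){3}\d{1,3}\b", action)
        if any(not in_scope(ip) for ip in ips):
            violations.append(action)
    return violations
```

Run against a replayed action log, a check like this catches an agent “in the act” the moment it steps outside its allowed scope.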


A new era of trustworthy AI in cybersecurity

HTB AI Range ushers in a new era in which AI-driven and autonomous security operations must earn trust through performance, not promises.

By providing a continuous, realistic, and standardized evaluation platform, HTB empowers organizations to embrace AI, leveraging its advantages while systematically managing its risks.

In a landscape full of AI hype, HTB AI Range stands out as a strategic, technically advanced solution to ensure AI agents are battle-tested and your enterprise remains one step ahead of emerging threats.

“Can I trust this AI to secure my business?”

With HTB AI Range, organizations can finally answer that key question with confidence grounded in continuous validation and reinforcement. Contact us for a full demo from our team of experts.
