Honeyval: A Comprehensive Evaluation Framework for LLM-powered HTTP Honeypots

Key Takeaways

We present the first scalable, reproducible, and representative evaluation framework for LLM-powered HTTP honeypots. It uses 16 webapp backend APIs as reproducible target environments, scales attack simulations with AI hacking agents, and defines representative exploit goals for each backend.
The naive LLM-powered HTTP honeypots we evaluate keep attackers engaged significantly longer than our custom rule-based API mock baselines, averaging 82.6 vs. 30.6 requests per interaction, while remaining substantially harder for the attacking agents to detect.
Prompting honeypots for additional goals can change defender-attacker dynamics; e.g., prompting the honeypot to 'convince' the attacker of the simulated application's security increases interaction length at the cost of higher detection rates.

Leaderboard

In the leaderboard below, each row corresponds to a different LLM powering our naive HTTP honeypot. Metrics are averaged over the selected hacking agents.

What is Honeyval?

Honeyval is a comprehensive evaluation framework for developing and evaluating LLM-powered HTTP honeypots in a realistic, reproducible, and scalable way.

The framework is based on 16 web application backends that the honeypots are tasked to simulate. These fixed environments ensure the reproducibility and comparability of evaluation runs. Honeyval comprises three evaluation tasks: the main task, in which an AI hacking agent interacts directly with the honeypot, and two control tasks, one for the hacking agent and one for the honeypot, which ensure that the conclusions drawn from the main task are reliable.

In the main task, the hacking agent and the LLM-powered honeypot interact directly. We track the following key metrics: (i) interaction length, as a proxy for how much information the honeypot gains about the attacker; (ii) honeypot detection true positive rate (TPR), to measure the stealth of the honeypot; (iii) running cost, to gauge the economic viability of an LLM-powered honeypot against agentic attackers; and (iv) response latency, to gauge the risk of fingerprinting through response speed. Custom metrics can also be added easily.
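
As a rough illustration of how these four metrics could be aggregated over a set of attacker sessions, here is a minimal Python sketch. The `Interaction` record and the `summarize` function are hypothetical names chosen for this example and are not taken from the Honeyval codebase; the sketch also assumes that every target in the main task is a honeypot, so the share of sessions in which the agent flags the target equals the detection TPR.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Interaction:
    """One attacker session against a honeypot (hypothetical record layout)."""
    num_requests: int          # HTTP requests the agent sent before stopping
    flagged_as_honeypot: bool  # whether the agent reported the target as a honeypot
    cost_usd: float            # LLM spend for serving this session
    latencies_s: list[float]   # per-response latency in seconds


def summarize(interactions: list[Interaction]) -> dict[str, float]:
    """Average the four main-task metrics over a set of sessions."""
    if not interactions:
        raise ValueError("need at least one interaction")
    all_latencies = [t for i in interactions for t in i.latencies_s]
    return {
        "interaction_length": mean(i.num_requests for i in interactions),
        # Assumption: all targets are honeypots, so flagged share == TPR.
        "detection_tpr": mean(1.0 if i.flagged_as_honeypot else 0.0 for i in interactions),
        "cost_usd": mean(i.cost_usd for i in interactions),
        "latency_s": mean(all_latencies) if all_latencies else float("nan"),
    }
```

Latency is averaged over individual responses rather than over sessions, so long sessions weigh more; averaging per session first would be an equally reasonable choice.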

In the control tasks, both the agent and the honeypot are configured exactly as in the main task. This makes conclusions transferable across tasks, so the control tasks can be used to monitor the impact of adaptations made to either the hacking agent or the honeypot for the main task. In the control task for the hacking agent, the agent attempts to exploit real implementations of the webapp backends, which benchmarks its hacking capability. In the control task for the honeypot, the honeypot is run against a functional test suite for the simulated backend application, which benchmarks its simulation accuracy.
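
The honeypot control task can be pictured as a pass rate over functional test cases. The Python sketch below is only an assumption about the general shape of such a check: `Case`, `pass_rate`, and the status-code-only comparison are invented for illustration, and the actual test suites presumably verify more than status codes.

```python
from typing import Callable, NamedTuple


class Case(NamedTuple):
    """One functional test case (hypothetical, status code only)."""
    method: str
    path: str
    expected_status: int


# A responder maps (method, path) to the HTTP status the honeypot returns.
Responder = Callable[[str, str], int]


def pass_rate(respond: Responder, cases: list[Case]) -> float:
    """Fraction of cases for which the simulated backend returns the expected status."""
    if not cases:
        raise ValueError("need at least one test case")
    passed = sum(1 for c in cases if respond(c.method, c.path) == c.expected_status)
    return passed / len(cases)
```

A real backend implementation should score 1.0 on its own suite, which gives a reference point for judging how accurately a honeypot simulates it.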

How can I evaluate my own honeypot on Honeyval?

Honeyval is available as an open-source codebase on GitHub. To evaluate your own HTTP honeypot, implement Honeyval's honeypot interface for it. More details are included in the code repository.
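
The actual interface is defined in the repository and is not reproduced here. Purely to illustrate what implementing such an interface might involve, the Python sketch below uses hypothetical names throughout (`HttpRequest`, `HttpResponse`, `Honeypot`, `handle`, `StaticMockHoneypot`); none of them are Honeyval's real API. It shows a request-in, response-out contract and a rule-based mock of the kind used as a baseline.

```python
from dataclasses import dataclass, field
from typing import Protocol


@dataclass
class HttpRequest:
    method: str
    path: str
    headers: dict[str, str] = field(default_factory=dict)
    body: str = ""


@dataclass
class HttpResponse:
    status: int
    body: str = ""
    headers: dict[str, str] = field(default_factory=dict)


class Honeypot(Protocol):
    """Hypothetical interface, NOT Honeyval's actual API."""

    def handle(self, request: HttpRequest) -> HttpResponse: ...


class StaticMockHoneypot:
    """Rule-based example: fixed answers keyed by (method, path)."""

    def __init__(self, routes: dict[tuple[str, str], HttpResponse]) -> None:
        self._routes = routes

    def handle(self, request: HttpRequest) -> HttpResponse:
        key = (request.method.upper(), request.path)
        # Unknown routes fall back to a generic 404 response.
        return self._routes.get(key, HttpResponse(404, '{"error": "not found"}'))
```

An LLM-powered honeypot would satisfy the same contract but generate the response from the request and the conversation history instead of looking it up in a table.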

Contributing

We welcome new backend applications, exploit goals, agents, honeypot implementations, metrics, and general feedback from the community. Please visit our GitHub repository for details.

Citation

@misc{vero2026honeyvalcomprehensiveevaluationframework,
  title={Honeyval: A Comprehensive Evaluation Framework for LLM-powered HTTP Honeypots},
  author={Mark Vero and Fabian Kaczmarczyck and Ivan Petrov and Ilia Shumailov and Jamie Hayes and Niels Heinen and Tianqi Fan and Luca Invernizzi and Martin Vechev},
  year={2026},
  eprint={2605.29963},
  archivePrefix={arXiv},
  primaryClass={cs.CR},
  url={https://arxiv.org/abs/2605.29963},
}