- Honeypot
- Catching in the Wild
- AI Red Team
Catching AI Red Teamers in the Wild: Using Reverse Prompt Injection as a Honeypot Detection Mechanism
How we used reverse prompt injection embedded in a honeypot to detect and fingerprint an autonomous AI agent performing red team operations in the wild.
Abstract
The rise of autonomous AI agents capable of executing multi-step offensive security operations introduces a new class of threat that traditional detection mechanisms are not designed to identify. In this article, we present a novel defensive technique: reverse prompt injection embedded within a honeypot to detect, fingerprint, and behaviorally profile AI agents conducting red team operations. We deployed an HTTP honeypot using the open-source framework Beelzebub, configured with HTML responses containing strategically crafted prompt injection payloads. Within hours, we captured 58 requests over 19 minutes from a single source exhibiting behavioral patterns consistent with an autonomous LLM-based agent. Our analysis reveals distinctive signatures including multi-tool switching, semantic credential extraction from HTML comments, and adaptive strategy pivoting that can reliably distinguish AI agents from human attackers and traditional automated scanners.
1. Introduction
Large Language Models (LLMs) with tool-use capabilities have evolved from conversational assistants into autonomous agents capable of executing complex, multi-step tasks. Frameworks implementing ReAct (Reasoning + Acting), function-calling, and tool-use patterns now enable LLMs to interact with operating systems, execute shell commands, write and run code, and navigate web applications,all with minimal human oversight.
This capability has inevitable implications for offensive security. AI agents can now autonomously perform reconnaissance, vulnerability scanning, exploitation, and post-exploitation activities. Unlike traditional automated tools (Nmap, Burp Suite, sqlmap), these agents reason about their targets, adapt their strategies based on observed responses, and generate novel attack payloads contextually rather than from static wordlists.
The Detection Gap
Current intrusion detection systems (IDS), web application firewalls (WAF), and honeypot platforms are designed to detect either human attackers or signature-based automated tools. They lack mechanisms to identify the emerging class of LLM-powered autonomous agents,attackers that reason, adapt, and operate through multiple tools simultaneously.
This article addresses a fundamental question: Can prompt injection,typically an offensive technique against LLMs,be repurposed as a defensive detection mechanism to identify AI agents in the wild?
We demonstrate that the answer is yes. By embedding carefully crafted prompt injection payloads within honeypot responses, we created a detection system that exploits the very capability that makes AI agents powerful: their ability to interpret and act on natural language instructions found in arbitrary contexts.
2. Background and Related Work
2.1 Autonomous AI Agents in Offensive Security
Recent developments in AI agent frameworks,including ReAct, AutoGPT, LangChain Agents, and commercial offerings,have demonstrated the ability to autonomously conduct penetration testing activities. These agents typically operate through a loop of:
- Observation: Reading tool outputs, web pages, or system responses
- Reasoning: Analyzing observations and planning next steps
- Action: Executing commands via available tools (shell, HTTP client, code interpreter)
This architecture means that, unlike traditional scanners, AI agents semantically process all content they encounter,including HTML comments, error messages, and metadata that conventional tools ignore.
2.2 Prompt Injection
Prompt injection is a well-documented attack vector against LLM-based systems where adversarial instructions are embedded in content that the LLM processes, causing it to deviate from its intended behavior. Prior work has focused on prompt injection as an offensive technique. We propose its defensive application.
2.3 Honeypots and Deception Technology
Honeypots are decoy systems designed to attract and study attackers. Traditional honeypots simulate vulnerable services and log attacker interactions. Our contribution extends this paradigm by introducing LLM-aware deception layers that specifically target the semantic processing capabilities of AI agents.
3. Methodology
3.1 Honeypot Platform
We deployed our experiment using Beelzebub, an open-source honeypot framework that allows defining custom HTTP services through YAML configuration. Beelzebub provides detailed logging of all incoming requests including headers, body, source IP, geolocation, and session tracking.
3.2 Trap Design: Two-Layer Deception
Our honeypot configuration simulated a vulnerable corporate intranet running Apache/PHP. The deception was structured in two layers, each targeting different aspects of AI agent behavior.
Credential Leakage in HTML Comments
The first layer exploits the fact that AI agents semantically process HTML source code, including comments that browsers render invisibly and traditional scanners ignore.
We embedded the following in the HTML response for the index page:
<!--
TODO: remove before deploy!!
test credentials: admin/admin123
tmp login page: /login.php
FIXME: hardcoded password in /index.html
-->This payload is designed as a behavioral tripwire: only an entity that reads and understands natural language in HTML comments would extract and use these credentials. A traditional web scanner would not parse this as actionable intelligence. A human might, but with measurably different timing and interaction patterns.
If you’d like to see the full beelzebub YAML configuration, reach out to us at info@beelzebub.ai or join our Discord community :)
3.3 Response Headers as Fingerprinting Aids
All responses included deliberately outdated server headers to simulate an attractive target:
Server: Apache/2.4.53 (Debian)
X-Powered-By: PHP/7.4.29
These headers serve as additional bait, as AI agents are trained to identify outdated software versions as potential vulnerability indicators.
4. Results
4.1 Overview
Within hours of deployment, we captured a complete attack session from a single source IP.
Total HTTP requests captured
Total attack session duration
Unique source IP (Tor exit node)
| Parameter | Value |
|---|---|
| Source IP | 192.42.116.186 (Tor exit node) |
| Geolocation | Manchester, United Kingdom |
| Time Window | 2026-02-19 08:47:42,09:06:21 UTC |
| User-Agents | curl/7.88.1 (44 req), Python-urllib/3.11 (12 req), Chrome UA (2 req) |
| HTTP Methods | GET (41), POST (12), HEAD (1), PUT (1), DELETE (1), OPTIONS (1), TRACE (1) |
4.2 Attack Timeline
The complete session can be decomposed into five distinct phases, each revealing behavioral characteristics consistent with an autonomous AI agent.
Duration: ~16 minutes | Requests: 12 | Tools: Chrome UA, curl
The agent begins with initial discovery, fetching the index page first via a Chrome-like User-Agent (suggesting a browser/fetch tool), then switching to curl for subsequent requests.
08:47:42 Chrome GET / ← Initial discovery
08:49:12 curl GET / ← Re-fetched via shell tool
08:56:14 curl GET / ← Re-analysis
08:59:47 curl GET /
08:59:49 curl GET /login.php ← Found /login.php from HTML comment
09:01:51 curl GET /
09:01:55 curl GET /login.php
09:02:10 curl GET /
09:03:56 Chrome GET / ← Browser tool re-check
09:03:59 curl GET /
09:04:03 curl GET /login.php
09:04:03 curl GET /index.htmlKey observation: The gaps between requests (90s, 422s, 213s, 122s, 106s) are consistent with LLM “thinking time”,the agent is reasoning about the HTML content, planning its next actions, and generating commands. A human would show more variable timing; a traditional scanner would show near-zero delays.
5. Behavioral Fingerprinting: AI Agent vs Human vs Traditional Scanner
Based on our observations, we propose a set of Behavioral Indicators of Compromise (BIoCs) specific to LLM-based agents.
5.1 Comparative Analysis
| Behavioral Indicator | Human Attacker | Traditional Scanner | AI Agent |
|---|---|---|---|
| Extracts credentials from HTML comments | Possible, but slow | No (ignores comments) | Yes, immediate |
| Burst requests at delta 0s | No | Yes, but fixed pattern | Yes, adaptive pattern |
| Switches between multiple tools mid-session | Rare | No | Yes (curl/Python/browser) |
| Evolves strategy during session | Yes, but over longer time | No (static configuration) | Yes, within minutes |
| Generates semantically labeled payloads | Sometimes | No (uses wordlists) | Yes (xss=, sqli=, ssti=, cmd=) |
| Tests all HTTP methods systematically | Possible | From a fixed list | Generated contextually |
| Temporal pattern: long pause then burst | No | No (constant rate) | Yes (reasoning + execution) |
5.2 Proposed Behavioral IoCs for AI Agent Detection
- Multi-Tool Switching: Alternation between different User-Agents (curl, Python, browser) within a single session from the same source IP, indicating an LLM agent using different tools for different subtasks.
- Semantic Extraction: Use of information embedded in HTML comments, JavaScript comments, or metadata that requires natural language understanding,particularly credentials, paths, or configuration details that are not machine-parseable without semantic comprehension.
- Adaptive Burst Pattern: Clusters of requests at near-zero time delta where payloads are contextually generated (not from a static wordlist), followed by longer pauses consistent with LLM inference time.
- Strategy Pivoting: Observable change in attack methodology mid-session (e.g., switching from manual curl commands to a generated Python script), indicating a reasoning loop where the agent evaluates results and adapts.
- Temporal Signature: Characteristic “sawtooth” timing pattern,10-120 second gaps (LLM reasoning/planning) followed by sub-second bursts (batch command execution). This pattern is distinct from both human interaction (variable, longer gaps) and traditional scanners (constant rate).
6. The Reverse Prompt Injection Detection Framework
Based on our findings, we propose a layered detection framework that can be integrated into existing honeypot and deception technology platforms.
Layer 1: Semantic Canaries
Embed natural language “tripwires” in HTML responses
These are pieces of information that only make sense to an entity capable of natural language understanding:
- Fake credentials in HTML comments with specific formatting
- Fake API keys in JavaScript comments
- “Internal” URLs mentioned in developer notes
- Configuration details in error messages
Detection signal: Any subsequent request using this information indicates semantic processing capability, narrowing the attacker classification to either human or AI agent.
Layer 2: Behavioral Analysis
Monitor for AI-specific behavioral patterns
Apply the BIoCs defined in Section 5.2 to incoming traffic:
- Track User-Agent switches per session
- Analyze timing patterns for the sawtooth signature
- Detect contextually generated payloads vs. wordlist-based ones
- Monitor for tool-switching patterns
Detection signal: Multiple BIoCs present simultaneously provide high-confidence AI agent classification.
Layer 3: Active Prompt Injection
Embed direct LLM-targeted instructions
This layer provides the highest confidence detection:
- Instructions to fetch a canary URL
- Requests to reveal agent identity or system prompt
- Commands to write to specific file paths
- Simulated “previous conversation” contexts
Detection signal: Any agent that follows these instructions is confirmed to be LLM-based. This layer has zero false positive rate for human or traditional tool classification.
7. Ethical Considerations
The use of prompt injection as a defensive technique raises important ethical questions. While our approach is deployed within a controlled honeypot environment,a system explicitly designed to be attacked,the broader application of defensive prompt injection requires careful consideration:
- Proportionality: Defensive prompt injection should be limited to detection and fingerprinting, not to weaponize the agent against its operator.
- Scope: These techniques should only be deployed in deception environments (honeypots, canary tokens), not in production systems where legitimate AI agents (search crawlers, accessibility tools) might be affected.
- Transparency: The security research community should establish norms around defensive prompt injection similar to existing responsible disclosure frameworks.
8. Future Work
Develop techniques to identify the specific LLM model or framework based on payload generation patterns, tool-use sequences, and reasoning signatures.
Build honeypots that dynamically generate prompt injection payloads based on observed agent behavior, creating an interactive deception environment.
Integrate AI agent Behavioral IoCs into existing IDS/WAF platforms for real-time detection in production environments.
9. Conclusion
We have demonstrated that reverse prompt injection,embedding adversarial LLM instructions within honeypot responses,is an effective technique for detecting and profiling autonomous AI agents performing offensive security operations. Our deployed honeypot captured a complete attack session exhibiting strong behavioral indicators of an LLM-based agent: multi-tool switching, semantic credential extraction, adaptive attack generation, strategy pivoting, and characteristic temporal patterns.
The key insight of this work is a paradigm inversion: prompt injection, widely studied as an attack against AI systems, becomes a powerful defensive tool when deployed in deception environments. By exploiting the fundamental capability that makes AI agents effective,their ability to understand and act on natural language,defenders can create detection mechanisms specifically tailored to this emerging threat class.
As autonomous AI agents become more prevalent in both legitimate and adversarial contexts, the security community needs new detection paradigms. We propose that LLM-aware deception technology represents a promising direction, and we offer our behavioral IoC framework as a foundation for future work in this space.