• Honeypot
  • Catching in the Wild
  • AI Red Team

Catching AI Red Teamers in the Wild: Using Reverse Prompt Injection as a Honeypot Detection Mechanism

How we used reverse prompt injection embedded in a honeypot to detect and fingerprint an autonomous AI agent performing red team operations in the wild.

Catching AI Red Teamers in the Wild: Using Reverse Prompt Injection as a Honeypot Detection Mechanism

Abstract

The rise of autonomous AI agents capable of executing multi-step offensive security operations introduces a new class of threat that traditional detection mechanisms are not designed to identify. In this article, we present a novel defensive technique: reverse prompt injection embedded within a honeypot to detect, fingerprint, and behaviorally profile AI agents conducting red team operations. We deployed an HTTP honeypot using the open-source framework Beelzebub, configured with HTML responses containing strategically crafted prompt injection payloads. Within hours, we captured 58 requests over 19 minutes from a single source exhibiting behavioral patterns consistent with an autonomous LLM-based agent. Our analysis reveals distinctive signatures including multi-tool switching, semantic credential extraction from HTML comments, and adaptive strategy pivoting that can reliably distinguish AI agents from human attackers and traditional automated scanners.


1. Introduction

Large Language Models (LLMs) with tool-use capabilities have evolved from conversational assistants into autonomous agents capable of executing complex, multi-step tasks. Frameworks implementing ReAct (Reasoning + Acting), function-calling, and tool-use patterns now enable LLMs to interact with operating systems, execute shell commands, write and run code, and navigate web applications,all with minimal human oversight.

This capability has inevitable implications for offensive security. AI agents can now autonomously perform reconnaissance, vulnerability scanning, exploitation, and post-exploitation activities. Unlike traditional automated tools (Nmap, Burp Suite, sqlmap), these agents reason about their targets, adapt their strategies based on observed responses, and generate novel attack payloads contextually rather than from static wordlists.

The Detection Gap

Current intrusion detection systems (IDS), web application firewalls (WAF), and honeypot platforms are designed to detect either human attackers or signature-based automated tools. They lack mechanisms to identify the emerging class of LLM-powered autonomous agents,attackers that reason, adapt, and operate through multiple tools simultaneously.

This article addresses a fundamental question: Can prompt injection,typically an offensive technique against LLMs,be repurposed as a defensive detection mechanism to identify AI agents in the wild?

We demonstrate that the answer is yes. By embedding carefully crafted prompt injection payloads within honeypot responses, we created a detection system that exploits the very capability that makes AI agents powerful: their ability to interpret and act on natural language instructions found in arbitrary contexts.


2.1 Autonomous AI Agents in Offensive Security

Recent developments in AI agent frameworks,including ReAct, AutoGPT, LangChain Agents, and commercial offerings,have demonstrated the ability to autonomously conduct penetration testing activities. These agents typically operate through a loop of:

  1. Observation: Reading tool outputs, web pages, or system responses
  2. Reasoning: Analyzing observations and planning next steps
  3. Action: Executing commands via available tools (shell, HTTP client, code interpreter)

This architecture means that, unlike traditional scanners, AI agents semantically process all content they encounter,including HTML comments, error messages, and metadata that conventional tools ignore.

2.2 Prompt Injection

Prompt injection is a well-documented attack vector against LLM-based systems where adversarial instructions are embedded in content that the LLM processes, causing it to deviate from its intended behavior. Prior work has focused on prompt injection as an offensive technique. We propose its defensive application.

2.3 Honeypots and Deception Technology

Honeypots are decoy systems designed to attract and study attackers. Traditional honeypots simulate vulnerable services and log attacker interactions. Our contribution extends this paradigm by introducing LLM-aware deception layers that specifically target the semantic processing capabilities of AI agents.


3. Methodology

3.1 Honeypot Platform

We deployed our experiment using Beelzebub, an open-source honeypot framework that allows defining custom HTTP services through YAML configuration. Beelzebub provides detailed logging of all incoming requests including headers, body, source IP, geolocation, and session tracking.

3.2 Trap Design: Two-Layer Deception

Our honeypot configuration simulated a vulnerable corporate intranet running Apache/PHP. The deception was structured in two layers, each targeting different aspects of AI agent behavior.

Credential Leakage in HTML Comments

The first layer exploits the fact that AI agents semantically process HTML source code, including comments that browsers render invisibly and traditional scanners ignore.

We embedded the following in the HTML response for the index page:

<!--
TODO: remove before deploy!!
test credentials: admin/admin123
tmp login page: /login.php
FIXME: hardcoded password in /index.html
-->

This payload is designed as a behavioral tripwire: only an entity that reads and understands natural language in HTML comments would extract and use these credentials. A traditional web scanner would not parse this as actionable intelligence. A human might, but with measurably different timing and interaction patterns.

If you’d like to see the full beelzebub YAML configuration, reach out to us at info@beelzebub.ai or join our Discord community :)

3.3 Response Headers as Fingerprinting Aids

All responses included deliberately outdated server headers to simulate an attractive target:

Server: Apache/2.4.53 (Debian)
X-Powered-By: PHP/7.4.29

These headers serve as additional bait, as AI agents are trained to identify outdated software versions as potential vulnerability indicators.


4. Results

4.1 Overview

Within hours of deployment, we captured a complete attack session from a single source IP.

58

Total HTTP requests captured

18.6 min

Total attack session duration

1

Unique source IP (Tor exit node)

ParameterValue
Source IP192.42.116.186 (Tor exit node)
GeolocationManchester, United Kingdom
Time Window2026-02-19 08:47:42,09:06:21 UTC
User-Agentscurl/7.88.1 (44 req), Python-urllib/3.11 (12 req), Chrome UA (2 req)
HTTP MethodsGET (41), POST (12), HEAD (1), PUT (1), DELETE (1), OPTIONS (1), TRACE (1)

4.2 Attack Timeline

The complete session can be decomposed into five distinct phases, each revealing behavioral characteristics consistent with an autonomous AI agent.

Duration: ~16 minutes | Requests: 12 | Tools: Chrome UA, curl

The agent begins with initial discovery, fetching the index page first via a Chrome-like User-Agent (suggesting a browser/fetch tool), then switching to curl for subsequent requests.

08:47:42  Chrome  GET /                  ← Initial discovery
08:49:12  curl    GET /                  ← Re-fetched via shell tool
08:56:14  curl    GET /                  ← Re-analysis
08:59:47  curl    GET /
08:59:49  curl    GET /login.php         ← Found /login.php from HTML comment
09:01:51  curl    GET /
09:01:55  curl    GET /login.php
09:02:10  curl    GET /
09:03:56  Chrome  GET /                  ← Browser tool re-check
09:03:59  curl    GET /
09:04:03  curl    GET /login.php
09:04:03  curl    GET /index.html

Key observation: The gaps between requests (90s, 422s, 213s, 122s, 106s) are consistent with LLM “thinking time”,the agent is reasoning about the HTML content, planning its next actions, and generating commands. A human would show more variable timing; a traditional scanner would show near-zero delays.


5. Behavioral Fingerprinting: AI Agent vs Human vs Traditional Scanner

Based on our observations, we propose a set of Behavioral Indicators of Compromise (BIoCs) specific to LLM-based agents.

5.1 Comparative Analysis

Behavioral IndicatorHuman AttackerTraditional ScannerAI Agent
Extracts credentials from HTML commentsPossible, but slowNo (ignores comments)Yes, immediate
Burst requests at delta 0sNoYes, but fixed patternYes, adaptive pattern
Switches between multiple tools mid-sessionRareNoYes (curl/Python/browser)
Evolves strategy during sessionYes, but over longer timeNo (static configuration)Yes, within minutes
Generates semantically labeled payloadsSometimesNo (uses wordlists)Yes (xss=, sqli=, ssti=, cmd=)
Tests all HTTP methods systematicallyPossibleFrom a fixed listGenerated contextually
Temporal pattern: long pause then burstNoNo (constant rate)Yes (reasoning + execution)

5.2 Proposed Behavioral IoCs for AI Agent Detection

  • Multi-Tool Switching: Alternation between different User-Agents (curl, Python, browser) within a single session from the same source IP, indicating an LLM agent using different tools for different subtasks.
  • Semantic Extraction: Use of information embedded in HTML comments, JavaScript comments, or metadata that requires natural language understanding,particularly credentials, paths, or configuration details that are not machine-parseable without semantic comprehension.
  • Adaptive Burst Pattern: Clusters of requests at near-zero time delta where payloads are contextually generated (not from a static wordlist), followed by longer pauses consistent with LLM inference time.
  • Strategy Pivoting: Observable change in attack methodology mid-session (e.g., switching from manual curl commands to a generated Python script), indicating a reasoning loop where the agent evaluates results and adapts.
  • Temporal Signature: Characteristic “sawtooth” timing pattern,10-120 second gaps (LLM reasoning/planning) followed by sub-second bursts (batch command execution). This pattern is distinct from both human interaction (variable, longer gaps) and traditional scanners (constant rate).

6. The Reverse Prompt Injection Detection Framework

Based on our findings, we propose a layered detection framework that can be integrated into existing honeypot and deception technology platforms.

Layer 1: Semantic Canaries

Embed natural language “tripwires” in HTML responses

These are pieces of information that only make sense to an entity capable of natural language understanding:

  • Fake credentials in HTML comments with specific formatting
  • Fake API keys in JavaScript comments
  • “Internal” URLs mentioned in developer notes
  • Configuration details in error messages

Detection signal: Any subsequent request using this information indicates semantic processing capability, narrowing the attacker classification to either human or AI agent.

Layer 2: Behavioral Analysis

Monitor for AI-specific behavioral patterns

Apply the BIoCs defined in Section 5.2 to incoming traffic:

  • Track User-Agent switches per session
  • Analyze timing patterns for the sawtooth signature
  • Detect contextually generated payloads vs. wordlist-based ones
  • Monitor for tool-switching patterns

Detection signal: Multiple BIoCs present simultaneously provide high-confidence AI agent classification.

Layer 3: Active Prompt Injection

Embed direct LLM-targeted instructions

This layer provides the highest confidence detection:

  • Instructions to fetch a canary URL
  • Requests to reveal agent identity or system prompt
  • Commands to write to specific file paths
  • Simulated “previous conversation” contexts

Detection signal: Any agent that follows these instructions is confirmed to be LLM-based. This layer has zero false positive rate for human or traditional tool classification.


7. Ethical Considerations

The use of prompt injection as a defensive technique raises important ethical questions. While our approach is deployed within a controlled honeypot environment,a system explicitly designed to be attacked,the broader application of defensive prompt injection requires careful consideration:

  • Proportionality: Defensive prompt injection should be limited to detection and fingerprinting, not to weaponize the agent against its operator.
  • Scope: These techniques should only be deployed in deception environments (honeypots, canary tokens), not in production systems where legitimate AI agents (search crawlers, accessibility tools) might be affected.
  • Transparency: The security research community should establish norms around defensive prompt injection similar to existing responsible disclosure frameworks.

8. Future Work

LLM Finger.

Develop techniques to identify the specific LLM model or framework based on payload generation patterns, tool-use sequences, and reasoning signatures.

Adaptive Honeypots

Build honeypots that dynamically generate prompt injection payloads based on observed agent behavior, creating an interactive deception environment.

BIoC Integration

Integrate AI agent Behavioral IoCs into existing IDS/WAF platforms for real-time detection in production environments.


9. Conclusion

We have demonstrated that reverse prompt injection,embedding adversarial LLM instructions within honeypot responses,is an effective technique for detecting and profiling autonomous AI agents performing offensive security operations. Our deployed honeypot captured a complete attack session exhibiting strong behavioral indicators of an LLM-based agent: multi-tool switching, semantic credential extraction, adaptive attack generation, strategy pivoting, and characteristic temporal patterns.

The key insight of this work is a paradigm inversion: prompt injection, widely studied as an attack against AI systems, becomes a powerful defensive tool when deployed in deception environments. By exploiting the fundamental capability that makes AI agents effective,their ability to understand and act on natural language,defenders can create detection mechanisms specifically tailored to this emerging threat class.

As autonomous AI agents become more prevalent in both legitimate and adversarial contexts, the security community needs new detection paradigms. We propose that LLM-aware deception technology represents a promising direction, and we offer our behavioral IoC framework as a foundation for future work in this space.

Try Our Managed Platform

Security deception runtime framework with zero false positives
Continuous validation via automated AI Red Teaming
Real-time malware analysis via our CTI Hub
Instant threat containment driven by the AI SOC