Telhawk Research Brief

The State of AI Security

Why AI-generated findings are creating a new remediation and validation burden, and how structured evidence can improve LLM-based security review.

AI security analysis · LLM review quality · Software remediation · 2026

Abstract

Executive summary

LLM-based code and security review is changing the economics of software security. AI systems can now surface findings at a scale that traditional application security teams were not designed to absorb. The constraint is shifting from discovering findings to verifying, remediating, validating, and reporting on them.

The quality of AI security review also depends on the quality of the evidence given to the model. LLMs can miss real issues when they lack security context, and they can create noise when findings are not tied to provable code paths, data flows, permission boundaries, or agent behavior. Telhawk's Galen© engine helps improve AI-assisted review by structuring the evidence before findings are evaluated and preserving proof after findings are produced.

For software teams, the result is a new operational problem: AI can create more security work than existing teams can process. Each finding still has to be reviewed, proven, prioritized, assigned, fixed, retested, validated, and documented. In larger organizations, this can translate into months of developer and security-team time and, in some cases, millions of dollars in follow-up remediation effort.

Telhawk is designed for this post-AI-audit reality. Galen© helps convert raw AI findings and security inputs into proof-backed, prioritized, remediation-ready outcomes that can be validated and documented.

Section 01

AI has changed finding volume

Traditional security review was constrained by human time. Teams could only inspect so much code, test so many paths, and review so many possible failure conditions. LLM-based systems change that constraint. They can generate findings across large codebases, APIs, agent workflows, access paths, and AI-generated software at a speed and scale that was not previously practical.

That increase in finding volume is valuable, but it also creates a downstream burden. A finding is not the same as a confirmed security risk. AI output still has to be evaluated against real code paths, data flows, authorization boundaries, business logic, deployment context, and remediation status. Without that evidence, organizations may be left with a large inventory of possible issues but no clear path to validated security improvement.

Section 02

LLM findings improve when the model has evidence

LLMs can identify security patterns across code, APIs, agents, and AI-generated software, but their output depends heavily on the quality of the context they receive. When an LLM is reviewing disconnected files, incomplete snippets, raw scanner output, or vague prompts, it may miss real issues or generate findings that are speculative, duplicated, overstated, or unsupported by the actual system behavior.

This is one of the central problems in AI security review: the same model can produce very different results depending on whether it is reasoning from raw text or from structured security evidence. Missing context can cause real issues to go undetected. Weak context can increase noise and hallucinated findings.

Telhawk's Galen© engine is designed to improve that evidence layer. Galen© helps organize security-relevant context such as affected code paths, data flows, authorization boundaries, API routes, missing guards, permission models, agent tool access, remediation context, and validation status. That structured evidence helps AI-assisted review focus on the issues that are more likely to be real, relevant, and actionable.
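To make the idea of an evidence layer concrete, the following is a minimal sketch of how one finding's context could be held as a structured record and rendered into text for a review prompt. The class name, field names, and sample values are assumptions chosen to mirror the categories listed above; they are not Galen©'s actual schema or API.

```python
from dataclasses import dataclass, field
from typing import List


# Hypothetical schema for illustration; not Galen's actual data model.
@dataclass
class EvidenceRecord:
    finding_id: str
    code_path: List[str]               # ordered call chain from entry point to sink
    data_flow: List[str]               # source-to-sink hops for the data involved
    api_route: str = ""
    authorization_boundary: str = ""   # e.g. "tenant" or "role:admin"
    missing_guards: List[str] = field(default_factory=list)
    agent_tools: List[str] = field(default_factory=list)
    validation_status: str = "unvalidated"


def render_context(record: EvidenceRecord) -> str:
    """Serialize one record into a compact text block for a review prompt."""
    lines = [
        f"Finding: {record.finding_id}",
        f"API route: {record.api_route or 'n/a'}",
        "Code path: " + " -> ".join(record.code_path),
        "Data flow: " + " -> ".join(record.data_flow),
        f"Authorization boundary: {record.authorization_boundary or 'unknown'}",
        "Missing guards: " + (", ".join(record.missing_guards) or "none recorded"),
        "Agent tool access: " + (", ".join(record.agent_tools) or "none recorded"),
        f"Validation status: {record.validation_status}",
    ]
    return "\n".join(lines)


# Made-up example values, for illustration only.
record = EvidenceRecord(
    finding_id="F-001",
    code_path=["GET /invoices/{id}", "get_invoice", "db.fetch_invoice"],
    data_flow=["path parameter id", "SQL lookup", "JSON response"],
    api_route="GET /invoices/{id}",
    authorization_boundary="tenant",
    missing_guards=["tenant ownership check"],
)
print(render_context(record))
```

The point of the sketch is that the model receives the path, boundary, and guard information together, instead of having to infer them from disconnected files.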

Better context can improve both discovery and precision

For AI security review, quality depends on two outcomes: finding more real risks and reducing unsupported noise. A raw LLM review may miss issues that require multi-file reasoning, authorization context, data-flow understanding, tenant-boundary awareness, or agent-tool permission analysis. It may also produce findings that sound plausible but are not supported by the actual code path or deployment context. Galen© helps address both sides of the problem by structuring the evidence before the review and preserving proof after the finding.

Raw LLM review vs. Galen©-supported LLM review

Raw LLM review
  • May review incomplete or disconnected context
  • May miss risks that require path, permission, or data-flow evidence
  • May produce speculative or duplicated findings
  • May require heavy manual validation
  • Findings may be difficult to prioritize

Galen©-supported LLM review
  • Uses structured security evidence
  • Helps expose real risks across code paths, APIs, agents, and permissions
  • Helps reduce unsupported or hallucinated findings
  • Produces clearer proof and remediation context
  • Supports validation and report-ready documentation

Section 03

The real cost is after the audit

The largest cost of AI security review may not be the audit itself but the work that follows. When an AI audit produces hundreds or thousands of findings, companies still need people to determine which issues are real, which are duplicates, which are overstated, which are exploitable, and which require engineering action.
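One part of that triage, collapsing duplicates, can be sketched in a few lines. The example below groups findings by a fingerprint built from the rule, file, and sink. The field names and sample findings are hypothetical, and lowercasing every field (including the file path) is a simplification. Grouping only reduces how many items need review; deciding whether each group is real or exploitable still requires evidence.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List

Finding = Dict[str, str]


def fingerprint(finding: Finding) -> str:
    """Build a stable key from the fields that identify the underlying issue."""
    key = "|".join(
        finding[name].strip().lower() for name in ("rule", "file", "sink")
    )
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:12]


def group_duplicates(findings: List[Finding]) -> Dict[str, List[Finding]]:
    """Group findings that share a fingerprint so each group is reviewed once."""
    groups: Dict[str, List[Finding]] = defaultdict(list)
    for finding in findings:
        groups[fingerprint(finding)].append(finding)
    return dict(groups)


# Made-up findings: the first two describe the same issue in different words.
findings = [
    {"rule": "missing-authz", "file": "api/invoices.py", "sink": "get_invoice",
     "title": "No tenant check"},
    {"rule": "Missing-Authz", "file": "api/invoices.py", "sink": "get_invoice",
     "title": "IDOR on invoices"},
    {"rule": "sql-injection", "file": "api/search.py", "sink": "run_query",
     "title": "Unsanitized query"},
]
groups = group_duplicates(findings)
print(f"{len(findings)} raw findings -> {len(groups)} review items")
# prints "3 raw findings -> 2 review items"
```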

That work usually falls on developers, AppSec teams, security engineers, and project managers. Findings have to be triaged, assigned, investigated, fixed, retested, tracked, and documented. For larger organizations, this can create a substantial internal workload. The business may need to allocate existing engineering capacity or hire additional developers and security personnel just to process the AI-generated backlog.

This is where AI security changes from a discovery problem into an execution problem. The organization may have more visibility, but it also has more remediation work to manage.

Section 04

Validation becomes the bottleneck

A suggested fix is not the same as a validated fix. Even when AI systems identify a possible issue and propose a correction, the organization still needs evidence that the underlying exposure was actually removed. That means confirming whether the vulnerable path, missing guard, unsafe data flow, permission issue, insecure API route, or risky agent behavior was corrected.

Without validation, teams may close tickets based on intent rather than proof. This creates residual risk. A ticket may show as completed, while the original exposure remains reachable through another path, role, tenant boundary, tool call, or workflow.
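The difference between closing on intent and closing on proof can be expressed as a small rule: a fix counts as validated only when every recorded path to the exposure has been retested and found blocked. The sketch below assumes a hypothetical `PathCheck` record and made-up status labels; it illustrates the rule and does not describe any particular product's workflow.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PathCheck:
    path: str        # route, role, tenant, or tool call that was retested
    reachable: bool  # True if the exposure could still be reached after the fix


def validation_status(checks: List[PathCheck]) -> str:
    """Classify a fix from retest evidence rather than from ticket state."""
    if not checks:
        return "unvalidated"      # no retest evidence exists yet
    if any(check.reachable for check in checks):
        return "residual_risk"    # at least one path still reaches the exposure
    return "validated"            # every recorded path was retested and blocked


# Made-up example: the API route is fixed, but an agent tool still reaches the data.
checks = [
    PathCheck("role:viewer -> GET /export", reachable=False),
    PathCheck("agent tool: export_report", reachable=True),
]
print(validation_status(checks))  # prints "residual_risk"
```

Note that the rule is only as complete as the list of paths it is given, which is why enumerating alternative routes, roles, and tool calls matters.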

Validation is therefore one of the most important parts of the AI security workflow. It is also one of the most time-consuming.

Section 05

Why companies may need more developers

AI-generated findings create a staffing issue because most remediation work still requires engineering judgment. Developers must understand the affected code, determine the safest correction, avoid breaking production behavior, update tests, and support retesting. Security teams must confirm the risk and validate the correction. Project managers may be needed to track ownership, status, deadlines, and reporting.

As AI audits scale, the number of findings can grow faster than the organization's ability to process them. This can force companies to choose between delaying remediation, redirecting developers away from product work, or adding more engineering and security capacity. For large enterprises, the cost of proving, fixing, validating, and documenting AI findings can become material.
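The capacity argument is simple arithmetic, sketched below. The intake and throughput numbers are made up for illustration and are not measurements; the function names are likewise hypothetical. When weekly intake exceeds weekly throughput, the backlog grows without bound, and it can only be cleared once throughput exceeds intake.

```python
import math
from typing import List, Optional


def backlog_over_time(new_per_week: int, closed_per_week: int,
                      weeks: int, start: int = 0) -> List[int]:
    """Return the open-finding backlog at the end of each week."""
    backlog = start
    history = []
    for _ in range(weeks):
        backlog = max(0, backlog + new_per_week - closed_per_week)
        history.append(backlog)
    return history


def weeks_to_clear(backlog: int, new_per_week: int,
                   closed_per_week: int) -> Optional[int]:
    """Weeks needed to clear the backlog, or None if it never clears."""
    net = closed_per_week - new_per_week
    if net <= 0:
        return None  # intake meets or exceeds throughput
    return math.ceil(backlog / net)


# Illustrative numbers only: 200 new findings per week, 120 closed per week.
print(backlog_over_time(200, 120, 4))   # prints [80, 160, 240, 320]
print(weeks_to_clear(320, 200, 120))    # prints None
print(weeks_to_clear(320, 200, 240))    # prints 8
```

In this toy model, the three choices named above map to the three levers: lowering effective intake by deferring findings, raising throughput by redirecting developers, or raising it by adding capacity.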

Section 06

Where Telhawk fits

Telhawk is built for the post-finding workflow, but its value begins earlier than remediation. Galen© helps structure the evidence that AI-assisted security review needs in order to produce better findings. By organizing affected code paths, data flows, authorization boundaries, API routes, missing guards, permission models, agent tool access, remediation context, and validation status, Galen© helps LLMs reason from security evidence rather than disconnected raw input.

This can improve both sides of AI security review. It can help surface real, security-relevant findings that might otherwise be missed, while also helping reduce speculative findings, duplicates, unsupported conclusions, and hallucinated issues. The result is not simply more findings but a clearer path from AI-assisted analysis to proof-backed security work.

Telhawk helps move security teams from raw AI findings to proof-backed outcomes: evidence, priority, remediation direction, validation status, and report-ready documentation. In this model, LLMs become more useful because they are paired with structured proof.

Figure

Figure 1 — Where AI audit cost accumulates

Stage              Illustrative intensity
Raw AI findings    18
Human review       72
Remediation        88
Retesting          46
Validation         78
Documentation      64

Illustrative workload intensity. Bars are directional, not measured. Cost and team time tend to concentrate in human review, remediation, validation, and documentation rather than in finding generation itself.

Figure

Figure 2 — How structured evidence improves LLM security review

  1. Raw inputs
  2. Galen© evidence layer
  3. LLM security reasoning
  4. Proof-backed findings
  5. Remediation + validation

Structured evidence helps LLMs reason from security context rather than disconnected raw input.

Closing

Conclusion

AI security will increase the number of findings available to software teams. But finding volume alone does not reduce risk. Risk is reduced when findings are real, proven, prioritized, remediated, validated, and documented.

LLMs will play a major role in software security, but their results depend on context. When AI review is supported by structured evidence, it can help surface more real risk while reducing unsupported noise. That is the role Telhawk is designed to serve.

The future of AI security is not more alerts. The future is better findings, stronger proof, remediation, validation, and trust.