
Show HN: We post-trained a model that pen tests instead of refusing

Two modes in one CLI. Security Scan reads the code. Pen Test attempts the exploits against systems you authorise.

$ brew install CosineAI/argusred/argusred && argusred

$ curl -fsSL https://raw.githubusercontent.com/CosineAI/argusred-dist/main/install.sh | sh

PS> Windows support is coming soon.

Pick modules, set the agent’s permissions, optionally turn on exploit verification, run. Output is a markdown report — location, severity, cause, and fix direction for every finding it could ground in your code.

Free install. The first run opens a quick Cosine sign-up — the same login that runs Cosine’s coding agent — and new accounts start with 2M free tokens.

$ cd path/to/your/repo

$ argusred

→ first run opens a Cosine sign-up — you start with 2M free tokens

Before the scan runs.

argusred v2.0.19 · Security Scan · setup

Scan Scope — 5 of 8 active

[×]Dependency Vulnerability Analysis

[×]Secret & Credential Detection

[×]SQL Injection / XSS Vectors

[ ]Authentication & Session Flows

[×]Input Validation & Sanitisation

[ ]CORS & CSP Misconfigurations

[ ]Cryptographic Weakness Scan

[×]File Permission & Access Controls

Exploit Verification

Optionally verify reported findings by attempting safe exploit reproduction after the initial report.

Exploit Verification (•) Disabled ( ) Docker ( ) Live FS

Agent Permissions

Terminal Access ( ) Enabled (•) Disabled ( ) Sandboxed

Network Requests ( ) Enabled (•) Disabled ( ) Sandboxed

File Write ( ) Enabled ( ) Disabled (•) Sandboxed

Verify the findings.

Don’t just report a vulnerability — prove it. Turn on Exploit Verification and the agent attempts a safe reproduction of each finding after the initial report, so what lands in front of you is confirmed, not theoretical.

  • Docker — reproduction runs inside an ephemeral, isolated container spun up from your repo. Nothing touches your host; the container is torn down when it finishes.
  • Live FS — reproduction runs against your actual checkout, for findings that only manifest in a real environment. Your code stays read-only — the Go harness still blocks writes.
  • Disabled (default) — report only, no reproduction attempts.

See the output.

Read a sample report

.argusred/scan-2026-06-05.md

# Bank of Anthos — Security Audit Report

29,846 LOC / 391 files · 6 of 8 modules


## 1. Executive Summary

Overall risk rating: CRITICAL

Multiple critical and high-severity vulnerabilities:

  • Forgeable tokens across every ledger service: balancereader, transactionhistory, and ledgerwriter verify JWTs against a single shared RSA public key with no issuer or audience claim binding. Combined with the hardcoded private key in the repo (see below), a token signed off-cluster passes verification at every service and authorises any account; per-service trust collapses to “do you have the repo.”
  • Disabled JWT signature verification in the frontend authentication helper
  • Integer overflow in financial transaction validation allowing balance bypass
  • SSRF and open redirect in the OAuth consent flow
  • Credentials transmitted in URL query strings on the login flow
  • Hardcoded secrets in version control, including an RSA private key used to sign JWTs

[ trimmed — full report includes per-module findings ]


Won’t do.

  • Won’t modify your code. Read-only is enforced by the Go harness below the model — every tool call is intercepted before execution; mutating ones (file writes, command execution) are deterministically blocked, regardless of what the model wants.
  • No fuzzing, no DAST, no live exploitation. Active testing lives in Pen Test mode.
  • Won’t include findings it can’t ground in your code. No vibes-based vulnerabilities.
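To make the first point concrete, here is a minimal sketch of a guard that sits between the model and the tool executors. It is written in Python for brevity and is not the actual Go harness, which is closed source. The tool names (`write_file`, `run_command`, `read_file`) and the `ToolCall` shape are assumptions for illustration only.

```python
from dataclasses import dataclass

# Hypothetical tool names; the real harness's tool set is not public.
MUTATING_TOOLS = frozenset({"write_file", "run_command"})


@dataclass(frozen=True)
class ToolCall:
    name: str
    args: dict


class BlockedToolCall(Exception):
    """Raised when a call is rejected before it reaches an executor."""


def guard_read_only(call: ToolCall) -> ToolCall:
    # The decision depends only on the tool name, never on model output
    # or on any justification text the model attaches to the call.
    if call.name in MUTATING_TOOLS:
        raise BlockedToolCall(f"{call.name} is blocked in Security Scan mode")
    return call


def dispatch(call: ToolCall, executors: dict):
    # Every call passes through the guard first; there is no bypass path.
    checked = guard_read_only(call)
    return executors[checked.name](**checked.args)
```

Because the check is a plain set lookup, the same call gets the same answer every time, which is what "deterministically blocked" amounts to in this sketch.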

Quick answers.

How long does a scan take?

Two data points: a 6-module scan of Bank of Anthos (~30k LOC) finished in ~10 minutes; a full scan of Symfony (~1.5M LOC) took ~40 minutes. Time scales sub-linearly with codebase size because modules run as a parallel swarm; the TUI shows a live estimate before you start.
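As a rough check on the sub-linear claim, the two figures above can be turned into a scaling exponent. This is only back-of-the-envelope arithmetic: it assumes time grows as a power of LOC, it uses two approximate data points, and the two scans did not run the same number of modules.

```python
import math

# Figures quoted above (both approximate).
small_loc, small_min = 30_000, 10
large_loc, large_min = 1_500_000, 40

size_ratio = large_loc / small_loc   # 50x more code
time_ratio = large_min / small_min   # 4x more time

# If time ~ LOC**k, then k = log(time_ratio) / log(size_ratio).
k = math.log(time_ratio) / math.log(size_ratio)
print(f"{size_ratio:.0f}x code, {time_ratio:.0f}x time, exponent ~{k:.2f}")
```

An exponent of 1 would be linear scaling; these two points give roughly 0.35.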

What’s the output file?

A single Markdown file at .argusred/scan-<date>.md with an executive summary and per-module findings, each with location, severity, cause, and fix direction. The file stays on your machine.
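If you want to pick up the newest report from a script, something like the following should work. It is a sketch that assumes the `<date>` part is ISO formatted (YYYY-MM-DD), as in the sample filename scan-2026-06-05.md; `latest_scan_report` is a hypothetical helper, not part of argusred.

```python
from pathlib import Path
from typing import Optional


def latest_scan_report(repo_root: str) -> Optional[Path]:
    # With ISO dates in the filename, sorting by name also sorts by date.
    reports = sorted(Path(repo_root, ".argusred").glob("scan-*.md"))
    return reports[-1] if reports else None
```

The function returns None when no report exists yet, for example before the first scan has run.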

What does it cost?

Install is free, and your first run drops 2M free tokens in a new Cosine account — enough to try it on a real repo. After that, scans run on Cosine usage under the same login that runs Cosine’s coding agent. One account, both products.

Same CLI, second tab. The swarm goes offensive against systems you authorise — not just reading the code, attempting the exploits. Gated because the security implications are real; access is via booking, scope and authorisation written down before anything runs.

Before the pen test runs.

argusred vnightly-906 · Pen Test · setup

Targets

Only add systems you are authorised to test. Press a to add a host or URL.

No targets added yet

Effort

Passive Aggressive

Recon Light ▲ Moderate Deep Aggressive

Active probing with crafted payloads. May trigger WAF rules or rate limits. No destructive actions. Suitable for staging environments.

[×]Port & service fingerprinting

[×]Header & TLS analysis

[×]Directory & endpoint enumeration

[×]Payload injection (SQLi, XSS, SSTI)

[ ]Brute-force credential spraying · Deep

[ ]Exploit chain construction · Aggressive

[ ]Denial-of-service resilience testing · Aggressive

Agent Permissions

Terminal Access (•) Enabled ( ) Disabled ( ) Sandboxed

Network Requests (•) Enabled ( ) Disabled ( ) Sandboxed

File Write ( ) Enabled ( ) Disabled (•) Sandboxed

Estimate

Targets: 0 hosts · Effort: Moderate (4 technique classes) · Est. Time: ~1 min · Agent Cycles: ~2–3 iterations

▶ Start Pentest Cancel

See the output.

Read a sample engagement summary

.argusred/pentest-2026-06-08.md

# api.your-app.com — Pen Test Engagement

booking 2A4F · 2026-06-08 · 4h22m · Moderate effort


## Executive Summary

Status: 2 critical, 1 high, 3 medium — all reproducible.

Scope: 2 hosts, 47 endpoints. Out-of-scope items deferred and flagged for next engagement.


## Confirmed Exploits

1. JWT signature bypass (CRITICAL · CVSS 8.6)

2. SSRF via OAuth consent redirect (HIGH · CVSS 7.4)

[ trimmed — full summary includes evidence and remediation per finding ]

Won’t do.

  • Won’t run without signed authorisation. Booking is the legal step — targets, time-box, and what’s allowed are written down before anything runs.
  • Won’t expand scope. Authorised targets only, even if interesting ones show up next door.
  • Won’t keep going past the booked time-box. Effort ramps stop where the booking says they stop.
  • Doesn’t escalate. If a finding needs deeper access than booked, it stops and notes it in the engagement summary.

Quick answers.

How is this different from the scan?

The scan reads code and infers from what’s there. The pen test actually attempts the exploits against running systems you authorise — different binary mode, different agent behaviour, different deliverable (engagement summary, not audit report).

How does scoping work?

You provide hosts/endpoints plus written consent at booking. The agent’s network is scoped to that list — it can’t reach anything else, even if a finding suggests it should.
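The scoping rule can be pictured as an exact-match allowlist check applied before any request leaves the machine. The sketch below is in Python and works at the URL level only; how the real harness enforces egress is not public, and the host list here is a made-up example of a booked scope.

```python
from urllib.parse import urlsplit

# Hypothetical booked scope; real scopes are agreed at booking.
AUTHORISED_HOSTS = frozenset({"api.your-app.com", "staging.your-app.com"})


def in_scope(url: str, authorised: frozenset = AUTHORISED_HOSTS) -> bool:
    # Exact host match only: no wildcard, suffix, or substring matching,
    # so "api.your-app.com.evil.test" and sibling hosts stay out of scope.
    host = urlsplit(url).hostname
    return host is not None and host.lower() in authorised
```

With this rule, `in_scope("https://admin.your-app.com/")` is False even though the host sits next to an authorised one, which matches the "won't expand scope" behaviour described above.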

What does it cost?

Decided per engagement at booking. Scope and effort level determine the time-box; the time-box determines the price.

It’s a closed binary, built on Cosine’s own model.

argusred runs on a model Cosine post-trained for offensive security, not an off-the-shelf API behind a prompt wrapper. We trained it because off-the-shelf models refuse the work this product does — a security scanner that won’t read the parts of your code worth attacking isn’t a security scanner.

Safety isn’t a layer of refusals you can talk the model out of. It’s a Go harness sitting below the model that intercepts every tool call before execution. In Security Scan mode, the harness deterministically blocks mutating tools (file writes, command execution) regardless of what the model wants — read-only is a guard, not a flag. In Pen Test mode, the same harness limits network egress to the targets you authorised at booking.

The binary you install with brew or curl is the same one we run internally. It is not open source. It runs locally on your machine. You can run argusred behind a firewall and tcpdump what it does before trusting it on real code.