I Used AI to Triage 500 Alerts Last Week — Here's What Actually Happened
Last Monday, I made a decision that my SOC manager probably wouldn't have approved if I'd asked permission first. I decided to run every single alert that came through our SIEM — all 500+ of them over five business days — through an AI assistant before I touched them manually. Not as a replacement for my workflow. As a parallel track. I wanted to know: can AI actually help with alert triage, or is this all conference-talk vapor?
The short answer is yes, with a pile of caveats that no vendor demo will ever show you.
The Setup
I used two AI tools: Claude (via API) and GPT-4 (through the ChatGPT interface, since our org doesn't have API access yet). I fed each alert's raw data — source IP, destination, event type, timestamp, log excerpt — into a structured prompt and asked for three things:
- Classification: true positive, false positive, or needs investigation
- Severity estimate: critical, high, medium, low, informational
- Recommended next step: one sentence
I kept a spreadsheet tracking the AI's call versus my eventual manual determination. Five days, 512 alerts total. Here's what happened.
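For the curious, here's roughly what the prompt scaffolding looked like. This is a minimal sketch, not my exact prompt; the field names and wording are illustrative.

```python
# Sketch of the structured triage prompt. Field names and wording are
# illustrative, not the exact prompt used in the test.

def build_triage_prompt(alert: dict) -> str:
    """Render one alert's raw fields into a structured triage prompt."""
    return (
        "You are assisting with SOC alert triage. Given the alert below, "
        "respond with exactly three lines:\n"
        "Classification: true positive | false positive | needs investigation\n"
        "Severity: critical | high | medium | low | informational\n"
        "Next step: <one sentence>\n\n"
        f"Source IP: {alert['src_ip']}\n"
        f"Destination: {alert['dst']}\n"
        f"Event type: {alert['event_type']}\n"
        f"Timestamp: {alert['timestamp']}\n"
        f"Log excerpt: {alert['log_excerpt']}\n"
    )

prompt = build_triage_prompt({
    "src_ip": "10.20.30.40",
    "dst": "203.0.113.7:443",
    "event_type": "outbound_connection",
    "timestamp": "2024-03-04T02:13:55Z",
    "log_excerpt": "TCP connect 10.20.30.40 -> 203.0.113.7:443",
})
```

Forcing a fixed three-line answer format is what made the spreadsheet comparison tractable: every response parses the same way.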
What Worked: Pattern Matching on Known IOCs
The single best use case was matching against known-bad indicators. When an alert contained a hash, IP, or domain that mapped to a well-documented threat, AI nailed it. Out of 74 alerts involving known IOCs, Claude correctly classified 71 of them (96%). ChatGPT got 68 right (92%). That's genuinely useful — not because I couldn't have done it, but because AI did it in seconds instead of the 2-3 minutes per alert I'd normally spend cross-referencing threat intel feeds.
Over a week, that pattern-matching alone saved me roughly 3 hours. Not transformational, but real.
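The lookup itself is mechanically simple, which is exactly why it automates well. Here's a hypothetical sketch of the cross-referencing step; the indicator values are made up (the hash is the EICAR test file's MD5), and a real feed would be far larger.

```python
# Hypothetical local IOC lookup -- the cross-referencing the AI collapsed
# from minutes to seconds. Indicator values here are made up / test values.

KNOWN_BAD = {
    "hashes": {"44d88612fea8a8f36de82e1278abb02f"},  # EICAR test file MD5
    "ips": {"198.51.100.23"},
    "domains": {"bad.example.net"},
}

def match_iocs(alert: dict) -> list[str]:
    """Return which indicator types in the alert hit the known-bad sets."""
    hits = []
    if alert.get("file_hash") in KNOWN_BAD["hashes"]:
        hits.append("hash")
    if alert.get("src_ip") in KNOWN_BAD["ips"] or alert.get("dst_ip") in KNOWN_BAD["ips"]:
        hits.append("ip")
    if alert.get("domain") in KNOWN_BAD["domains"]:
        hits.append("domain")
    return hits
```

So `match_iocs({"dst_ip": "198.51.100.23"})` comes back as `["ip"]`. The AI's edge wasn't the set membership test; it was pulling the indicators out of messy log excerpts without me writing the extraction logic.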
What Worked: Log Cluster Summarization
The second win was unexpected. I started pasting clusters of related alerts — say, 15 firewall denies from the same source over 20 minutes — and asking for a summary narrative. AI was excellent at this. It could synthesize a sequence of events into a readable paragraph faster than I could scan the raw logs. This didn't replace analysis, but it accelerated my ability to get context before diving in.
I'd estimate this saved another 2 hours over the week, mostly on the Monday and Thursday spikes when alert volume was highest.
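The grouping step before handing clusters to the model is trivial to script. A sketch, assuming a simplified alert shape with just a source IP and timestamp:

```python
# Sketch of the clustering step: group time-sorted alerts from the same
# source that fall within a 20-minute window, then hand each cluster to
# the model for a summary narrative. Alert shape is illustrative.
from datetime import datetime, timedelta

def cluster_by_source(alerts, window=timedelta(minutes=20)):
    """Group time-sorted alerts into per-source bursts."""
    clusters = []
    for alert in sorted(alerts, key=lambda a: a["ts"]):
        for cluster in clusters:
            # Attach to an existing burst if same source and close in time.
            if (cluster[-1]["src_ip"] == alert["src_ip"]
                    and alert["ts"] - cluster[-1]["ts"] <= window):
                cluster.append(alert)
                break
        else:
            clusters.append([alert])
    return clusters
```

Each cluster then becomes one "summarize this sequence of events" request instead of fifteen separate alerts in the queue.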
What Didn't Work: False Positive Judgment
Here's where it gets uncomfortable for the AI hype crowd. Of 512 alerts, roughly 340 were ultimately false positives (welcome to SOC life). AI's accuracy on false positive identification was only about 62%. That means it flagged 129 alerts as needing investigation that turned out to be noise. In a real workflow where I trusted the AI's output, that's 129 wasted investigations — about two and a half hours of chasing noise, nearly wiping out what the IOC matching saved me.
The core problem: false positive determination requires context that isn't in the alert. You need to know that the finance team runs a batch job every Tuesday at 2 AM that looks like data exfiltration. You need to know that the new developer in building 3 is testing against a staging server that triggers geo-impossible-travel alerts. AI doesn't know your environment. It can't.
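That environmental context can be encoded, but only by a human who has it. A sketch of what a small suppression-rule table might look like, using the two examples above; the rule fields and hostnames are invented for illustration:

```python
# Sketch of encoding environmental context the model can't know: a small
# rule table checked before anything reaches the AI or the queue. The
# rules mirror the two examples in the text; field names and hostnames
# are invented.

CONTEXT_RULES = [
    ("finance batch job, Tuesdays ~02:00",
     lambda a: a["event_type"] == "bulk_transfer"
               and a["src_host"].startswith("fin-")
               and a["weekday"] == "Tue" and a["hour"] == 2),
    ("staging tests tripping impossible-travel",
     lambda a: a["event_type"] == "impossible_travel"
               and a["dst_host"] == "staging-03"),
]

def is_known_benign(alert: dict):
    """Return the matching rule's description, or None if no rule fires."""
    for desc, predicate in CONTEXT_RULES:
        if predicate(alert):
            return desc
    return None
```

None of this is sophisticated. The point is that the knowledge lives in analysts' heads, and no prompt engineering conjures it out of a log excerpt.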
What Was Actively Dangerous: Hallucinated Severity
This is the part I want to be loud about. On 23 occasions, AI assigned a severity level that was completely wrong — and not in the "conservative, erring on the side of caution" direction. Eleven times, it downgraded what I'd classify as high-severity to medium or low. The reasoning looked plausible in the output. It cited factors that sounded legitimate. But it was wrong.
One specific example: a lateral movement alert from a service account got classified as "low — routine service account activity." The AI's reasoning referenced common patterns for service accounts. But this particular service account had been dormant for six months. That context matters. AI didn't have it and couldn't have known to ask.
If I had been a junior analyst trusting AI severity ratings to prioritize my queue, I would have missed a legitimate incident. That's not a theoretical risk. That's what happened in my test.
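One cheap guardrail would have caught every one of those downgrades: let the AI raise severity, but never let it lower an alert below a rule-derived floor. A minimal sketch, with an illustrative severity scale:

```python
# Guardrail sketch: the AI may escalate severity but never de-escalate
# below a rule-based floor. Scale ordering is illustrative.

SEVERITY_ORDER = ["informational", "low", "medium", "high", "critical"]

def apply_severity_floor(ai_severity: str, floor: str) -> str:
    """Keep whichever is higher: the AI's rating or the rule-based floor."""
    rank = SEVERITY_ORDER.index
    return ai_severity if rank(ai_severity) >= rank(floor) else floor
```

In the dormant-service-account case, a rule saying "lateral movement floors at high" means `apply_severity_floor("low", "high")` stays `"high"` no matter how plausible the model's downgrade reasoning sounds.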
The Numbers, Summarized
- Total alerts processed: 512
- AI correct on true positive identification: ~78%
- AI correct on false positive identification: ~62%
- AI correct on severity rating: ~71%
- Time saved (estimated): 5 hours over 5 days
- Time wasted chasing AI-recommended investigations: ~2.5 hours
- Net time saved: ~2.5 hours
- Dangerous misclassifications: 11 (severity downgraded incorrectly)
My Takeaway
Net time saved over five days: about two and a half hours. Net dangerous misclassifications: eleven. That ratio tells you everything about where AI triage stands right now.
Use it for IOC matching and log summarization — it genuinely earns its keep there. But the moment you let an AI severity rating skip the human queue, you are going to miss an incident. I did, in a controlled test. In production, that miss has a cost.
When a vendor shows you their triage demo, ask one question: what's your false positive rate on alerts that aren't in your training data? Then watch their face.