Frontier Lab SOC Benchmark

Inspired by https://lnkd.in/gH8VBSjV and https://lnkd.in/gVeqCatZ, I realized we're missing realistic benchmarks for frontier-lab AI in the SOC, something to use as a baseline for any other SOC+AI work.

So how do we go about that? We need two core things:

  • A generic set of capabilities: a neutral environment anyone can reproduce
  • The frontier models put directly to work, with no hidden black box and the most minimal harness possible

I didn’t want to spend a week wrangling OSS solutions together into a base environment (too much of a moving target), so (surprise surprise) I used LC as the foundation. I think that’s fair: it’s a core set of capabilities anyone can access and replicate (using the community edition), and there isn’t a hint of hidden black-box capabilities. It also means the AI can interface through a single CLI.

Then, on the model side, I used the CLI from the three leading frontier labs.

What you end up with is a set of benchmarks for common SOC activities: GitHub - refractionPOINT/asw-bench (Agentic SecOps Workspace Benchmarks).

I wanted to start with a single solid end-to-end scenario and expand later, so I began with: “here is a detection, investigate it and report on it.”
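To make the shape of that scenario concrete, here is a minimal sketch of how a runner might hand the same detection task to each model CLI through one uniform prompt. The detection fields, CLI names, and flags below are illustrative placeholders I made up, not the benchmark’s actual format or commands:

```python
import json

# Placeholder detection, standing in for whatever the environment emits.
DETECTION = {
    "rule": "suspicious-powershell-download",
    "host": "win10-finance-03",
    "summary": "powershell.exe spawned with an encoded download cradle",
}

def build_prompt(detection: dict) -> str:
    """Wrap a detection in the same fixed task framing for every model."""
    return (
        "Here is a detection from the SOC. Investigate it using the "
        "available CLI tools and write a short incident report.\n\n"
        + json.dumps(detection, indent=2)
    )

def build_command(cli: str, prompt: str) -> list[str]:
    """Assemble a (hypothetical) CLI invocation for one model agent."""
    return [cli, "--prompt", prompt]

prompt = build_prompt(DETECTION)
for cli in ["model-cli-a", "model-cli-b", "model-cli-c"]:
    cmd = build_command(cli, prompt)
    # A real runner would launch the agent here (e.g. subprocess.run(cmd))
    # and capture its transcript and report for scoring; omitted in this sketch.
```

The point of the fixed framing is comparability: every model gets an identical task and the same set of capabilities, so differences in the reports reflect the models, not the harness.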

For me, the conclusion is just how good these top models are at running security operations out of the box. No, I don’t think they’re at the “fully autonomous” level yet.

To me it shows how far you can get with just a good set of capabilities and the frontier models: no secret sauce, no custom models, no hidden harnesses.

I would love feedback, suggestions for future benchmarks, etc.
I also think this could benefit greatly from a more realistic test environment; if there are attacker-simulation companies that make transparent, easily available scenarios, I would love to partner.
