Case Study

NodeBench improved real code and scanner precision in the same loop

By Michael K Onyekwere

This was not just a maintainer fix story. It was a correction-loop story. AgentScore flagged real shell-exec risk in nodebench-mcp. The maintainer then separated the real findings from the false positives, and the scanner gained new mitigators the same day.

Real shell-exec sites refactored
3
False-positive classes closed
2
Score change
55 → 85

Where it started

AgentScore first scanned nodebench-mcp@3.2.0 on April 18, 2026. The package scored 55 out of 100 with risk level ELEVATED. Three findings were reported, and the two HIGH findings drove the score:

  • HIGH command_injection: shell execution with template literal input
  • HIGH unsafe_eval: regex hit on code that looked like dynamic evaluation
  • LOW no_provenance: package was not published with npm provenance

We opened HomenShum/nodebench-ai#8 on April 21 with the report and the source locations.

What was real

The maintainer review did not dismiss the scan. It split the findings into two categories. Three shell-exec sites were real enough to refactor. Those paths moved toward argv-based process execution rather than shell-interpreted strings, which is exactly the kind of improvement a static scan should provoke.

That matters strategically. The correction loop was not a defensive maintainer arguing with a score. It was a maintainer doing real hardening work while also helping the scanner get more precise.

What the scanner got wrong

Two false-positive classes emerged from the issue:

  • db.exec(`SQL`) in a database context matched the shell-exec detector because regex alone could not tell a database member expression from child_process.exec.
  • The literal word eval inside a recommendation-string path matched the unsafe_eval detector even though the string was explanatory text, not executable code.
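The first class is easy to reproduce with a toy detector. This is an illustrative regex in the spirit of the description above, not AgentScore's actual rule:

```javascript
// Toy shell-exec detector: flags any exec( call that takes a template
// literal. It sees only the token, not the receiver.
const shellExecPattern = /\bexec\s*\(\s*`/;

const dbCall = "db.exec(`CREATE TABLE users (id INTEGER)`)"; // database API
const shellCall = "exec(`rm -rf ${dir}`)"; // child_process-style call

// Both lines match, because regex alone cannot distinguish a database
// member expression from child_process.exec.
shellExecPattern.test(dbCall);    // false positive
shellExecPattern.test(shellCall); // true positive
```

The same token-level blindness explains the second class: the word "eval" inside an ordinary string literal matches just as readily as a real call.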

Those are real precision gaps, not cosmetic complaints. They were the right input for the scanner to learn from.

What shipped the same day

On April 26, the public issue produced two durable scanner improvements that landed in the ruleset:

  • sanitizer mitigators for database-shaped .exec() usage and SQL-first strings
  • test-fixture mitigators for message-array pushes and eval-in-string contexts
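A minimal sketch of what the first mitigator could look like, assuming a rule that inspects the call's receiver and the first argument (the patterns and function name here are hypothetical, not drawn from the published ruleset):

```javascript
// Hypothetical sanitizer mitigator: downgrade an .exec() finding when the
// receiver looks like a database handle, or the argument starts with SQL.
const DB_RECEIVER = /\b(db|database|sqlite|conn(ection)?)\s*\.\s*exec\b/;
const SQL_FIRST = /^\s*(SELECT|INSERT|UPDATE|DELETE|CREATE|DROP|PRAGMA)\b/i;

function isDatabaseShaped(callSource, firstArg) {
  return DB_RECEIVER.test(callSource) || SQL_FIRST.test(firstArg);
}

isDatabaseShaped("db.exec(sql)", "CREATE TABLE t (id INTEGER)"); // downgrade
isDatabaseShaped("exec(cmd)", "rm -rf /tmp/x");                  // keep HIGH
```

The point of the real mitigators is the same: add context the bare token match lacks, so the HIGH severity survives only where a shell could actually be involved.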

The full lineage is public at /scanner/precision. The scanner did not silently get better. The cause, the fix, and the resulting ruleset are all inspectable.

What the score did next

By April 30, monitored rescans of nodebench-mcp@3.2.1 were returning 85 out of 100 with risk level LOW under ruleset 3185eb87b4ce.

  • command_injection: HIGH to LOW, downgraded by the new sanitizer mitigators
  • unsafe_eval: HIGH to LOW, downgraded by the new test-fixture mitigators
  • no_provenance: remained LOW

The result is a cleaner package and a better scanner. That is a stronger outcome than a one-sided victory story.

Why this is an Iroko artifact

Redis showed that a consumer will act on MCP dependency risk. Agions showed that a maintainer will ship fixes when the report is concrete. NodeBench shows the third leg: a maintainer can also improve the scanner itself. That is how a trust layer grows roots. Public critique turns into public hardening and then into a stronger record for the next person who arrives.

Follow the full correction loop

Read the live package report, inspect the precision lineage, or browse the rest of the case-study set.