Case Study
NodeBench improved real code and scanner precision in the same loop
By Michael K Onyekwere
This was not just a maintainer fix story. It was a correction-loop story. AgentScore flagged real shell-exec risk in nodebench-mcp. The maintainer then separated the real findings from the false positives, and the scanner gained new mitigators the same day.
- Real shell-exec sites refactored: 3
- False-positive classes closed: 2
- Score change: 55 → 85
Where it started
AgentScore first scanned nodebench-mcp@3.2.0 on April 18, 2026. The package scored 55 out of 100 with risk level ELEVATED. Two HIGH findings drove that score:
- command_injection: shell execution with template literal input (HIGH)
- unsafe_eval: regex hit on code that looked like dynamic evaluation (HIGH)
- no_provenance: package was not published with npm provenance (LOW)
We opened HomenShum/nodebench-ai#8 on April 21 with the report and the source locations.
What was real
The maintainer review did not dismiss the scan. It split the findings into two categories. Three shell-exec sites were real enough to refactor. Those paths moved toward argv-based process execution rather than shell-interpreted strings, which is exactly the kind of improvement a static scan should provoke.
That matters strategically. The correction loop was not a defensive maintainer arguing with a score. It was a maintainer doing real hardening work while also helping the scanner get more precise.
What the scanner got wrong
Two false-positive classes emerged from the issue:
- `db.exec(\`SQL\`)` in a database context matched the shell-exec detector, because regex alone could not tell a database member expression from `child_process.exec`.
- The literal word `eval` inside a recommendation-string path matched the `unsafe_eval` detector even though the string was explanatory text, not executable code.
Those are real precision gaps, not cosmetic complaints. They were the right input for the scanner to learn from.
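The first precision gap above can be made concrete with a toy detector. This is a hedged sketch, not AgentScore's implementation: the function name and the receiver list are assumptions chosen to illustrate why a bare `.exec(` regex over-matches and how receiver context narrows it.

```javascript
// Naive detector: flags any member call named .exec(
const naiveExecCall = /\.exec\s*\(/;

// Context-aware sketch: suppress database-shaped receivers, which are a
// common false-positive class; shell risk concentrates on child_process.
// (Receiver names here are illustrative, not a real ruleset.)
const dbShapedReceiver = /\b(db|database|sqlite|conn(ection)?)\s*\.exec\s*\(/;

function looksLikeShellExec(line) {
  if (!naiveExecCall.test(line)) return false;
  if (dbShapedReceiver.test(line)) return false;
  return true;
}

looksLikeShellExec("child_process.exec(`ls ${dir}`)"); // flagged
looksLikeShellExec("db.exec(`CREATE TABLE t (id)`)");  // suppressed
```

A real scanner would resolve the receiver through imports rather than match its name, but the sketch shows why line-level regex alone cannot make the distinction.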
What shipped the same day
On April 26, the public issue produced two durable scanner improvements that landed in the ruleset:
- sanitizer mitigators for database-shaped `.exec()` usage and SQL-first strings
- test-fixture mitigators for message-array pushes and eval-in-string contexts
The full lineage is public at /scanner/precision. The scanner did not silently get better. The cause, the fix, and the resulting ruleset are all inspectable.
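One way a "SQL-first string" mitigator could work is to inspect the leading token of the string passed to `.exec()`. The function name and keyword list below are assumptions for illustration; the actual ruleset is the one published at /scanner/precision.

```javascript
// Hypothetical mitigator sketch: if the argument to .exec() opens with a
// SQL keyword, the call is database-shaped and the shell-exec finding
// can be downgraded rather than reported HIGH.
const SQL_FIRST = /^\s*(SELECT|INSERT|UPDATE|DELETE|CREATE|DROP|PRAGMA)\b/i;

function isSqlFirstString(arg) {
  return SQL_FIRST.test(arg);
}

isSqlFirstString("CREATE TABLE runs (id INTEGER)"); // database-shaped
isSqlFirstString("rm -rf /tmp/scratch");            // not SQL, keep the finding
```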
What the score did next
By April 30, monitored rescans of nodebench-mcp@3.2.1 were returning 85 out of 100 with risk level LOW under ruleset 3185eb87b4ce.
- command_injection: HIGH → LOW, downgraded by the new sanitizer mitigators
- unsafe_eval: HIGH → LOW, downgraded by the new test-fixture mitigators
- no_provenance: remained LOW
The result is a cleaner package and a better scanner. That is a stronger outcome than a one-sided victory story.
Why this is an Iroko artifact
Redis showed that a consumer will act on MCP dependency risk. Agions showed that a maintainer will ship fixes when the report is concrete. NodeBench shows the third leg: a maintainer can also improve the scanner itself. That is how a trust layer grows roots. Public critique turns into public hardening and then into a stronger record for the next person who arrives.
Follow the full correction loop
Read the live package report, inspect the precision lineage, or browse the rest of the case-study set.