AI does bad code review

AI alone is not enough for reliable code review. Thoughts on how to combine deterministic checks with AI to build a consistent, scalable code quality pipeline.

Code review is boring, it takes time, and even worse — if it's not done properly you risk introducing bugs or anti-patterns that make the codebase a nightmare to maintain long-term.

Why going full AI is wrong

Most of the time AI can help. For example, you can prompt it to review your branch changes for security concerns or bad practices. Honestly, this works for small or critical changes, but it doesn't fully solve the problem. Let me explain.

The context problem

Every new chat you open is like talking to a stranger. Imagine hiring a software engineer, a very good one, who doesn't know how your team works and, worse, can't read your whole codebase, because for a mature project that's simply not feasible token-wise.

The common solution is to document everything: architectural decisions, conventions, do's and don'ts, the reasoning behind how things are structured. Essentially, you're building a written mental model of your codebase so the AI has something to work from — and honestly, this is a good practice regardless of AI. It forces your team to be explicit about decisions that would otherwise live only in someone's head.

Now the key for AI is not to dump everything into a single context file. Your main AI context — whether that's a CLAUDE.md, AGENTS.md, or a base prompt — should stay minimal: a high-level index that tells the agent what documentation exists and where to find it. From there, the agent reads only what's relevant to the task at hand. Touching the API layer? Read api-integration.md. Working on a new UI component? Read design-system.md.

docs/
    api-integration.md
    architecture.md
    design-system.md
    i18n.md
    contributions.md
    (...)
src/
    features/
    (...)

The structure above is just an example — it will look different from company to company.
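One way to make the "read only what's relevant" step explicit is a small routing table from changed paths to documentation files. The sketch below is only an illustration: the `DOC_ROUTES` table, the path prefixes, and the `docsForChanges` helper are all hypothetical names, not part of any agent tooling.

```typescript
// Hypothetical routing table: path prefix -> docs the agent should read first.
// Prefixes and file names are illustrative and will differ per codebase.
const DOC_ROUTES: ReadonlyArray<{ prefix: string; docs: string[] }> = [
  { prefix: "src/api/", docs: ["docs/api-integration.md"] },
  { prefix: "src/components/", docs: ["docs/design-system.md", "docs/i18n.md"] },
  { prefix: "src/features/", docs: ["docs/architecture.md"] },
];

// Returns the de-duplicated list of docs relevant to a set of changed files,
// in the order they are first matched.
function docsForChanges(changedFiles: string[]): string[] {
  const selected = new Set<string>();
  for (const file of changedFiles) {
    for (const route of DOC_ROUTES) {
      if (file.startsWith(route.prefix)) {
        route.docs.forEach((doc) => selected.add(doc));
      }
    }
  }
  return Array.from(selected);
}
```

Calling `docsForChanges(["src/api/users.ts"])` would return only `docs/api-integration.md`, so a change to the API layer never pulls the design-system docs into context.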

So let's say you've done that. You have a well-documented codebase. But documentation alone doesn't dictate how code review should be conducted or what bad practices to catch. I don't want `any` in TypeScript. I don't want useEffect unless it's truly necessary (see "You Might Not Need an Effect" in the React docs). And I don't want unnecessary comments like:

// Check if user is validated
const isValidated = user.status === UserStatus.Validated

I bring this up because as code review requirements grow, you end up with a long list of conditions — a lot of things AI needs to keep in mind, consuming tokens fast. And sometimes, if the model isn't strong enough, it quietly overlooks some of them.

Here's what a real custom code review skill ends up looking like after a few months of iteration:

## Code Review

Review the staged changes and check for the following:

- Do not allow `any` in TypeScript unless explicitly justified with a comment
- Avoid `useEffect` unless there is no other option; prefer derived state, event handlers, or data fetching patterns
- Do not write comments that restate what the code does; only comment on *why*
- Service layer functions must not contain business logic; delegate to domain layer
- Domain layer must not import from application or infrastructure layer
- Feature modules must not import from each other; use shared/ for common logic
- API calls must go through the service layer, never called directly from components
- Avoid default exports; use named exports
- Do not use `console.log` in production code
- All user-facing strings must use the i18n helper, never hardcoded
- Do not use `setTimeout` or `setInterval` without cleanup
- React components must not exceed 200 lines; suggest splitting if they do
- Avoid prop drilling beyond 2 levels; suggest context or state management
- All async functions must handle errors explicitly
- Do not introduce new dependencies without a comment justifying the addition

And this list is still short compared to what a mature codebase accumulates. Every rule is valid, but feeding all of them into an AI on every review is expensive and inconsistent, and most of them are better enforced by a linter that never forgets.

Also, AI output is not deterministic. You can't reliably test whether your code review guidelines will actually be followed. The model might hit contradictory requirements and produce inconsistent results, or silently skip part of your instructions.

We shouldn't be offloading everything to AI. There are many steps in code review that can be automated deterministically before any AI pass:

Formatting: Probably the first thing that comes to mind, and the simplest win. Formatters like Prettier or Biome in the JS ecosystem, and equivalent tools in other languages, quickly enforce code conventions: indentation, quote style, trailing commas, line length, import ordering, bracket spacing, and more.

Linting: Similar to formatting, but worth separating out because this is where more complex, opinionated checks live — the kind AI shouldn't need to bother with. For example, enforcing hexagonal architecture by preventing imports from the application layer into the domain layer. Biome, oxlint, or ESLint with custom rules can cover this.
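As a rough illustration, a few of the rules from the review prompt can be expressed with ESLint core rules alone. This is a sketch, not a complete setup: it assumes a TypeScript parser is configured elsewhere, the `src/domain`, `application`, and `infrastructure` paths are hypothetical, and the import patterns may need tuning for path aliases.

```typescript
// Sketch of an ESLint flat config, written as a plain array of config objects.
// Only core ESLint rules are used; paths and messages are illustrative.
const reviewRules = [
  {
    files: ["src/**/*.{ts,tsx}"],
    rules: {
      // "Do not use console.log in production code"
      "no-console": "error",
      // "Avoid default exports; use named exports"
      "no-restricted-syntax": [
        "error",
        { selector: "ExportDefaultDeclaration", message: "Use named exports." },
      ],
    },
  },
  {
    // "Domain layer must not import from application or infrastructure layer"
    files: ["src/domain/**/*.ts"],
    rules: {
      "no-restricted-imports": [
        "error",
        {
          patterns: [
            {
              group: ["**/application/**", "**/infrastructure/**"],
              message: "The domain layer must not depend on outer layers.",
            },
          ],
        },
      ],
    },
  },
];
```

In an actual `eslint.config.js` this array would be the module's default export. Each rule that moves here is one line that can leave the AI prompt.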

Architectural tests: Arguably better than custom lint rules, these act like regular tests but validate codebase structure rather than behavior:

  • Frontend cannot import backend internals
  • Domain layer cannot depend on infrastructure
  • Feature modules cannot cross-import
  • Shared UI components cannot import business logic

In JS, ts-arch is a solid option, but the tool doesn't matter as long as it solves your problem.
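To show the idea without depending on a specific library, here is a minimal sketch of the check such a test performs. It is not ts-arch's API: the `LAYERS` order, the `src/<layer>/` layout, and the `findLayerViolations` helper are assumptions made for illustration, and the import edges would normally come from a tool that parses your project.

```typescript
// Layers ordered from innermost to outermost. A layer may import from itself
// or from layers listed before it. Layer names are hypothetical.
const LAYERS = ["domain", "application", "infrastructure"] as const;
type Layer = (typeof LAYERS)[number];

interface ImportEdge {
  from: string; // importing file, e.g. "src/domain/user.ts"
  to: string; // imported file, e.g. "src/infrastructure/db.ts"
}

// Finds the layer a path belongs to, assuming a src/<layer>/... layout.
function layerOf(path: string): Layer | undefined {
  return LAYERS.find((layer) => path.startsWith(`src/${layer}/`));
}

// Returns every edge where an inner layer imports from an outer one.
// Files outside the known layers are ignored.
function findLayerViolations(edges: ImportEdge[]): ImportEdge[] {
  return edges.filter(({ from, to }) => {
    const source = layerOf(from);
    const target = layerOf(to);
    if (source === undefined || target === undefined) return false;
    return LAYERS.indexOf(target) > LAYERS.indexOf(source);
  });
}
```

Inside a regular test runner, the assertion is simply that `findLayerViolations(edges)` is empty, which makes a broken boundary fail CI the same way a broken unit test does.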

Dead code detection: Unused exports, dead feature flags, unused dependencies, unused i18n keys — I've seen all of these pile up in codebases that had no guardrails. knip in JS is a great option to keep your codebase free of dead code.
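Some of these cases, such as unused i18n keys, may need a small custom check depending on your tooling. The sketch below is a naive version under stated assumptions: keys are referenced as string literals passed to a helper named `t(...)`, which is a hypothetical name, and dynamically built keys would be reported as false positives. It is not how knip works internally.

```typescript
// Returns the translation keys that never appear as a literal argument to
// t(...) in any of the given source texts. The helper name t is an assumption.
function findUnusedI18nKeys(definedKeys: string[], sources: string[]): string[] {
  const used = new Set<string>();
  // Matches t("key"), t('key') and t(`key`) and captures the key.
  const callPattern = /\bt\(\s*["'`]([^"'`]+)["'`]/g;
  for (const source of sources) {
    let match: RegExpExecArray | null;
    // exec() resets lastIndex to 0 after a failed match, so the same
    // regex can be reused safely for the next source.
    while ((match = callPattern.exec(source)) !== null) {
      used.add(match[1]);
    }
  }
  return definedKeys.filter((key) => !used.has(key));
}
```

Run in CI against the translation file and the source tree, a check like this turns "unused keys pile up" into a failing build instead of a review comment.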


Now think about what it would take to encode all of this into a single AI prompt. That's a lot of work, a lot of context, and a lot of tokens — for checks that can be done deterministically, for free. Once these are in place, AI can use their output as part of its feedback loop and focus on what's actually hard to automate: logic errors, design tradeoffs, subtle security issues.
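The ordering described above can be sketched as a small planning step: deterministic results come first, and the AI pass is only planned once they are clean. Everything here is hypothetical (the `CheckResult` shape, the `planReview` function, the prompt wording); blocking on failures rather than forwarding them is one possible design choice, and the returned reasons can be fed back to an agent to fix.

```typescript
interface CheckResult {
  tool: string; // e.g. "lint", "arch-tests", "dead-code"
  passed: boolean;
  findings: string[]; // human-readable messages from the tool
}

type ReviewPlan =
  | { kind: "blocked"; reasons: string[] }
  | { kind: "ai-review"; prompt: string };

// Deterministic gates run first. The AI step is planned only when they pass,
// and its prompt carries just the diff and the hard-to-automate questions.
function planReview(results: CheckResult[], diff: string): ReviewPlan {
  const reasons: string[] = [];
  for (const result of results) {
    if (result.passed) continue;
    if (result.findings.length === 0) {
      reasons.push(`${result.tool}: failed without details`);
    }
    for (const finding of result.findings) {
      reasons.push(`${result.tool}: ${finding}`);
    }
  }
  if (reasons.length > 0) return { kind: "blocked", reasons };

  const prompt = [
    "Formatting, lint, architecture and dead-code checks already passed.",
    "Review only for logic errors, design tradeoffs and subtle security issues.",
    "",
    diff,
  ].join("\n");
  return { kind: "ai-review", prompt };
}
```

The resulting prompt stays a few lines long no matter how many rules the linters enforce, which is the point of moving those rules out of the prompt.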

You can extend the pipeline with security checks, performance tests, or bundle size analysis. The idea is to reduce the AI's surface area so it acts as an orchestrator and reasons only about the hard stuff. And the good part: you can even use AI to help set up these steps in the first place.