How Do You Evaluate Engineers When Everyone Uses AI?

By Stealthy Team | May 31, 2026

The old signal is breaking

Evaluating software engineers was never simple.

A resume only tells you where someone worked. A take-home project rewards people with free time. A whiteboard interview often tests nerves more than engineering judgment. A LeetCode round can prove only that someone practiced LeetCode.

None of this was perfect.

But now AI has made it even harder.

A candidate can use AI to polish their resume. A company can use AI to screen it. A candidate can use AI to prepare answers. An interviewer can use AI to generate questions. A coding assistant can write a decent first implementation.

Then the hiring manager asks the only question that matters:

Did we actually get a good signal?

That question is becoming harder to answer.

AI did not make engineering judgment less important

AI coding tools changed the mechanics of software development. They made it easier to produce code, explain errors, generate tests, and move faster through implementation work.

But producing code is not the same as being a strong engineer.

A strong engineer can enter an unfamiliar system, understand what is happening, reason through tradeoffs, identify the root cause of a failure, and make a fix that holds under real conditions.

That matters even more now.

The DORA 2025 State of AI-assisted Software Development report frames AI as an amplifier. It can magnify existing strengths, but it can also magnify existing weaknesses. The Stack Overflow 2025 Developer Survey shows the tension clearly: AI tool adoption is widespread, but trust in AI output remains mixed.

That is the point.

AI can help engineers move faster. It can also help weak engineers look stronger than they are.

So the evaluation method has to change.

The wrong response is banning AI

Some hiring processes try to solve the problem by banning AI.

That may feel clean, but it is not realistic.

In real engineering work, developers use tools. They search docs. They read past incidents. They ask teammates. They use AI. They inspect logs. They copy patterns from existing code. They run experiments.

The job is not performed in a vacuum.

So an interview that says “pretend AI does not exist” creates an artificial environment. It may test memory, speed, or compliance, but it does not necessarily test modern engineering ability.

A better question is:

Can this person use AI and still think clearly?

That means evaluating the engineer’s judgment, not their ability to avoid tools.

What AI makes easy to fake

AI makes some parts of the hiring process easier to inflate.

1. Resume polish

A weak resume can sound sharper. A generic project can sound strategic. A small contribution can be framed as system ownership.

That does not mean candidates are lying. It means the language layer is now cheaper.

2. Prepared answers

Behavioral questions are easier to rehearse. System design answers are easier to outline. Common interview problems are easier to pattern-match.

A candidate can sound fluent without having deep experience.

3. First-pass implementation

AI is good at producing plausible code.

Not always correct code. Not always production-safe code. Not always code that matches your system.

But plausible enough to pass many shallow screens.

That is the danger.

If your evaluation only checks whether the candidate can produce code, you may be testing the assistant more than the engineer.

What AI still makes hard to fake

The good news: some skills are still hard to fake.

1. Debugging unfamiliar systems

A realistic production bug forces candidates to read, investigate, and reason.

They have to understand code they did not write. They have to interpret logs. They have to notice contradictions. They have to decide what evidence matters. They have to separate symptoms from root cause.

AI can help, but it cannot replace judgment.

2. Knowing what to trust

AI assistants are useful, but they are not always right.

A strong engineer can use AI as a second brain without outsourcing the final decision. They can ask better questions, challenge the answer, and verify claims against the system.

That is a modern engineering skill.

3. Handling messy context

Real software does not live in one neat file.

It lives across services, queues, permissions, feature flags, deploy environments, data models, runtime behavior, stale docs, and historical decisions.

Candidates who can navigate messy context are valuable.

Candidates who can only implement isolated tasks are easier to replace with tools.

4. Fixing the actual root cause

The best engineers do not just make the error disappear.

They ask why it happened. They check whether the fix preserves other behavior. They test the failure condition. They avoid turning the client into the source of truth. They do not patch around the symptom and call it done.

That is the signal hiring teams should look for.

The interview should look more like the job

Most software engineering jobs are not eight hours of solving isolated algorithm puzzles.

They are closer to this:

Something broke. The system is unfamiliar. The logs are noisy. The docs are incomplete. The first theory is wrong. The fix has to be safe. The clock is running.

That is the environment where strong engineers stand out.

So the assessment should move closer to real engineering work.

Not by making interviews longer. Not by adding more steps. Not by creating unpaid take-home labor.

But by testing the right thing.

Give candidates a realistic broken system. Let them use AI. Give them logs, code, docs, architecture, and runtime clues. Ask them to find the root cause and ship the fix.

That tells you much more than asking them to reverse a linked list from memory.

What a good AI-era engineering assessment should measure

A good assessment should ask more than “did the code work?”

It should measure the path.

Signal 1: Orientation

How quickly can the candidate understand an unfamiliar system?

Do they find the entry point? Do they map the relevant services? Do they understand the request path? Do they know where state is stored?

Signal 2: Hypothesis quality

Do they form testable theories, or do they randomly change files?

A good candidate says:

“I think the verifier is reading stale state before the final damage is committed. I’m going to prove that by checking the order of these two events.”

A weaker candidate says:

“Maybe the leaderboard is broken.”

The difference matters.
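
To make the contrast concrete, here is a minimal sketch of how the first hypothesis could be checked against logs. The event names (`verifier.read_state`, `damage.committed`), the `LogEvent` shape, and the sample data are assumptions made up for illustration, not taken from any real system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class LogEvent:
    timestamp_ms: int
    name: str
    request_id: str


def read_before_commit(
    events: List[LogEvent],
    request_id: str,
    read_name: str = "verifier.read_state",   # hypothetical event name
    commit_name: str = "damage.committed",    # hypothetical event name
) -> Optional[bool]:
    """Return True if the earliest verifier read precedes the final commit.

    Returns None when either event is missing for the request, because the
    hypothesis can be neither confirmed nor rejected without both.
    """
    reads = [e.timestamp_ms for e in events
             if e.request_id == request_id and e.name == read_name]
    commits = [e.timestamp_ms for e in events
               if e.request_id == request_id and e.name == commit_name]
    if not reads or not commits:
        return None
    return min(reads) < max(commits)


# Invented sample data: r1 shows the suspected ordering, r2 does not.
SAMPLE_EVENTS = [
    LogEvent(1000, "verifier.read_state", "r1"),
    LogEvent(1005, "damage.committed", "r1"),
    LogEvent(2000, "damage.committed", "r2"),
    LogEvent(2010, "verifier.read_state", "r2"),
]

if __name__ == "__main__":
    for rid in ("r1", "r2", "r3"):
        print(rid, read_before_commit(SAMPLE_EVENTS, rid))
```

The point is not the code itself. A testable theory names two events and an ordering, so the evidence can confirm it or rule it out.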

Signal 3: Evidence handling

Can they use logs, runtime data, tests, and code together?

Strong debugging is not vibes. It is evidence.

Signal 4: AI judgment

Do they blindly accept AI suggestions, or do they use AI carefully?

A good candidate may use AI constantly, but they still own the reasoning.

Signal 5: Fix quality

Does the fix solve the production failure without breaking normal behavior?

This is where many shallow assessments fail. A candidate can make the visible test pass while creating a worse system.

A good assessment should check the real production condition.
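
As a rough illustration, the sketch below grades two hypothetical fixes for an invented bug in which retried events are counted twice. The function names, the scenarios, and the expected totals are all assumptions chosen to show the idea: one fix passes the visible test but fails both the production condition and normal behavior.

```python
from typing import Callable, Dict, List, Tuple

Event = Tuple[str, int]  # (event_id, damage amount); hypothetical shape


def total_damage_symptom_patch(events: List[Event]) -> int:
    # Drops repeated amounts. This hides the duplicate in the visible test
    # but also discards legitimate hits that share an amount.
    return sum({amount for _, amount in events})


def total_damage_root_cause_fix(events: List[Event]) -> int:
    # Deduplicates on event_id, which is what identifies a retried event.
    first_seen: Dict[str, int] = {}
    for event_id, amount in events:
        first_seen.setdefault(event_id, amount)
    return sum(first_seen.values())


# Each scenario is (input events, expected total). All values are invented.
SCENARIOS: Dict[str, Tuple[List[Event], int]] = {
    "visible_test": ([("a", 10), ("a", 10), ("b", 5)], 15),
    "production_replay": ([("a", 10), ("b", 10), ("b", 10), ("c", 5)], 25),
    "normal_behavior": ([("a", 10), ("b", 10)], 20),
}


def evaluate(fix: Callable[[List[Event]], int]) -> Dict[str, bool]:
    """Run a candidate fix against every scenario, not just the visible one."""
    return {
        name: fix(events) == expected
        for name, (events, expected) in SCENARIOS.items()
    }


if __name__ == "__main__":
    print("symptom patch:", evaluate(total_damage_symptom_patch))
    print("root-cause fix:", evaluate(total_damage_root_cause_fix))
```

An assessment that ran only `visible_test` would score both fixes the same.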

Signal 6: Communication

Can the candidate explain what broke, why it broke, and why the fix is safe?

Engineering is not only code. It is shared understanding.
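
One way to keep the six signals separate instead of collapsing them into a single pass/fail is a small rubric record. This is only a sketch: the 0 to 4 scale and the field names are assumptions, not a standard.

```python
from dataclasses import dataclass, fields
from typing import List


@dataclass(frozen=True)
class SignalScores:
    """One reviewer's scores for the six signals, each on an assumed 0-4 scale."""
    orientation: int
    hypothesis_quality: int
    evidence_handling: int
    ai_judgment: int
    fix_quality: int
    communication: int

    def __post_init__(self) -> None:
        # Reject scores outside the assumed scale.
        for f in fields(self):
            value = getattr(self, f.name)
            if not 0 <= value <= 4:
                raise ValueError(f"{f.name} must be between 0 and 4, got {value}")

    def weakest(self) -> List[str]:
        """Names of the lowest-scoring signals, useful for follow-up questions."""
        lowest = min(getattr(self, f.name) for f in fields(self))
        return [f.name for f in fields(self) if getattr(self, f.name) == lowest]


if __name__ == "__main__":
    scores = SignalScores(
        orientation=3,
        hypothesis_quality=4,
        evidence_handling=3,
        ai_judgment=2,
        fix_quality=1,
        communication=3,
    )
    print(scores.weakest())
```

Reporting the weakest signal rather than an average keeps a strong explanation from masking a fix that does not hold.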

This is especially useful for non-technical recruiters

Technical hiring is hard when you are not technical.

A recruiter may know whether a candidate communicates well, seems motivated, or has the right background. But it is much harder to know whether they can actually reason through a system.

AI makes that gap worse.

If everyone sounds polished and everyone can produce decent-looking code, recruiters need a clearer signal before bringing engineers into the loop.

That signal should not be “the candidate claims they know distributed systems.”

It should be:

Here is how they handled a realistic incident.

Did they finish? How long did it take? What did they inspect? What did they change? Did the fix survive production replay? Did they understand the root cause?

That gives recruiters and hiring managers a shared artifact.

Not just an opinion. Not just a score. A work sample.
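
As a sketch of what such a shared artifact might contain, the record below maps one field to each of the questions above. The class name, field names, and summary wording are hypothetical and do not describe any particular product's format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class IncidentWorkSample:
    """Hypothetical record of how one candidate handled one incident."""
    finished: bool
    minutes_taken: int
    files_inspected: List[str]
    files_changed: List[str]
    survived_production_replay: bool
    root_cause_explained: bool

    def summary(self) -> str:
        """Plain-language summary a non-technical reader can scan."""
        status = "finished" if self.finished else "did not finish"
        replay = "survived" if self.survived_production_replay else "did not survive"
        cause = "explained" if self.root_cause_explained else "did not explain"
        return (
            f"Candidate {status} in {self.minutes_taken} min, "
            f"inspected {len(self.files_inspected)} files, "
            f"changed {len(self.files_changed)}; "
            f"fix {replay} production replay; "
            f"candidate {cause} the root cause."
        )


if __name__ == "__main__":
    sample = IncidentWorkSample(
        finished=True,
        minutes_taken=42,
        files_inspected=["verifier.py", "queue.py", "README.md"],  # invented paths
        files_changed=["verifier.py"],
        survived_production_replay=True,
        root_cause_explained=False,
    )
    print(sample.summary())
```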

What not to measure

AI-era hiring should stop overvaluing signals that are easy to game or weakly connected to the job.

Do not overvalue syntax memory

Most engineers do not need to memorize syntax under pressure. They need to solve problems correctly.

Do not overvalue speed alone

Speed matters, but only after correctness.

A fast wrong fix is an incident multiplier.

Do not overvalue confident explanations

AI has made confidence cheap.

Look for evidence, not fluency.

Do not overvalue perfect isolation

Real engineers work with tools.

An assessment that forbids every external aid may test discipline, but not necessarily job performance.

The future of technical hiring is closer to incident response

The best engineering assessments will look less like school exams and more like realistic work.

Here is a broken system. Here is the evidence. Here are the constraints. You can use your tools. Find the root cause. Fix it safely. Explain your reasoning.

That is a much better test for the AI era.

Because when AI can generate code, the scarce skill is not typing.

The scarce skill is judgment.

The Incident Challenge is built around that idea. Engineers are dropped into realistic production-style failures, with code, logs, docs, runtime evidence, and architecture clues. AI agents are allowed. The goal is not to ban tools. The goal is to see whether the engineer can use them and still find the real root cause.

You can try a challenge here: The Incident Challenge

Final thought

The old hiring question was:

Can this person write code?

The new hiring question is:

Can this person understand a system well enough to fix it when it breaks?

That is the signal worth testing.