Seeker Research

Original analysis based on aggregate career intelligence data collected through Seeker.

Inside Seeker | June 28, 2026

How We Built a Résumé Parser That Explains Its Own Mistakes

By Seeker Research

A résumé parser has one job that sounds simple and isn't: turn a document a human wrote for other humans into structured data — roles, companies, dates, skills. The trouble is that résumés don't follow rules. Dates are missing, sections have creative names, two jobs overlap, a degree program looks exactly like a job. Any single extraction strategy is wrong some of the time.

So, like most teams, we ended up with two.

Two parsers, one résumé, two answers

One parser is section-anchored: it finds the "Experience" heading and reads roles out of the blocks beneath it. It's precise when a résumé is conventional. The other is date-anchored: it scans for date ranges and works outward to find the role attached to each one. It's the safety net for the messy résumés — the ones with no clear sections, or roles written as single inline lines.
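
As a rough illustration of the date-anchored idea, here is a minimal sketch in Python. It assumes date ranges look like "2017 - 2019" or "2019–Present" and that the role sits on the same line; the function name `date_anchored_roles` and the output fields are hypothetical, not the production parser.

```python
import re
from typing import Dict, List

# Assumed format: four-digit years joined by a hyphen or en dash, or "Present".
DATE_RANGE = re.compile(r"(\d{4})\s*[-\u2013]\s*(\d{4}|Present)")


def date_anchored_roles(text: str) -> List[Dict[str, str]]:
    """Scan every line for a date range and attach the rest of the line to it."""
    roles = []
    for line in text.splitlines():
        match = DATE_RANGE.search(line)
        if match is None:
            continue
        # Work outward from the date range: whatever else is on the line is
        # treated as the role description (a deliberate simplification).
        remainder = line[: match.start()] + line[match.end():]
        roles.append(
            {
                "start": match.group(1),
                "end": match.group(2),
                "text": remainder.strip(" ,|-\u2013"),
            }
        )
    return roles
```

Lines without a date range are ignored, which is why this style of parser needs no section headings at all.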

Most of the time they agree. The interesting question is what happens when they don't, and our production code had an answer: it merged them. If the section parser found nothing but the date parser recovered three dated roles, we'd backfill those roles so the résumé didn't show up as "0 jobs." Reasonable.
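
The backfill rule described above can be sketched in a few lines. This is an illustration only: `merge_roles` and the role dictionaries are hypothetical names, and only the "section parser found nothing" case is modeled.

```python
from typing import Dict, List

Role = Dict[str, str]


def merge_roles(section_roles: List[Role], date_roles: List[Role]) -> List[Role]:
    """Prefer the section-anchored result; backfill from the date-anchored one."""
    if section_roles:
        return list(section_roles)
    # The section parser found nothing, so fall back to the dated roles
    # rather than reporting "0 jobs".
    return list(date_roles)


recovered = merge_roles(
    section_roles=[],
    date_roles=[
        {"title": "Engineer", "start": "2019", "end": "2021"},
        {"title": "Analyst", "start": "2017", "end": "2019"},
        {"title": "Intern", "start": "2016", "end": "2017"},
    ],
)
```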

The problem wasn't the merge. The problem was that the merge logic lived in one giant function, and everything else that needed to answer "what's in this résumé" (our diagnostics, our quality monitors, our debugging tools) re-ran only the first, section-anchored parser and then re-derived the answer, each slightly differently. So the system quietly held two truths at once: what we showed the user, and what our own tooling believed we showed the user. They drifted apart. A résumé could look healthy on one dashboard and broken on another, and both were "right" according to the code that produced them.

This is a specific, recurring bug class: the same fact computed in two places, which then drift. It is worth naming because the fix is always the same shape.

The fix is honesty, not intelligence

The tempting fix is to make the parser smarter — more rules, more heuristics, a model. That usually adds confident new ways to be wrong.

The actual fix is to stop throwing information away. Our parser already knew when it was uncertain — a role with no extractable title, a confidence score that got floored when no sections were found. It just collapsed all of that into a single confident-looking number on the way out. So the work wasn't "compute new things." It was: derive the uncertainty once, in one place, and let everyone read the same answer.

Concretely, that meant one canonical entry point: one function that runs both parsers, merges them with one shared implementation, and returns the result plus the reasons (which reconciliation steps fired, where the two parsers disagreed, what got recovered, and how confident we are after recovery). Production, diagnostics, replay tools, and monitors all call that one function now. There is exactly one definition of "what's in this résumé."
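
One possible shape for such an entry point, sketched under assumptions: `parse_resume`, `ParseResult`, and the reason strings are hypothetical names, the two parsers are passed in as plain callables, and the confidence value is a placeholder rather than a real calibration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Role = Dict[str, str]
Parser = Callable[[str], List[Role]]


@dataclass
class ParseResult:
    roles: List[Role]
    reasons: List[str] = field(default_factory=list)        # reconciliation steps that fired
    disagreements: List[str] = field(default_factory=list)  # where the parsers differed
    confidence: float = 1.0


def parse_resume(text: str, section_parser: Parser, date_parser: Parser) -> ParseResult:
    """Single entry point: run both parsers, merge once, and report why."""
    section_roles = section_parser(text)
    date_roles = date_parser(text)
    result = ParseResult(roles=list(section_roles))

    if len(section_roles) != len(date_roles):
        result.disagreements.append(
            f"role_count: section={len(section_roles)} date={len(date_roles)}"
        )
    if not section_roles and date_roles:
        # Backfill so the résumé does not show up with zero jobs, and say so.
        result.roles = list(date_roles)
        result.reasons.append("backfilled_from_date_parser")
        result.confidence = 0.5  # placeholder value for "lower after recovery"
    return result
```

The point of the return type is that callers never need to re-derive anything: a dashboard that wants to know whether a backfill happened reads `reasons` instead of re-running a parser.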

The hard part: proving the new code matched the old

Here's the part that's actually interesting, and the part teams usually skip.

When you extract a 200-line tangle into a clean shared function, how do you know you didn't change its behavior? "The tests pass" is not enough — the old code had no tests, which is part of why it was a tangle.

So before deleting a single line of the old path, we built a shadow comparison. The plan: run the old code and the new code side by side on real traffic, compare their outputs résumé-by-résumé, and only delete the old path once they were provably identical (or different in ways we fully understood).
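
A shadow comparison of this kind might look like the sketch below. The function name `shadow_compare` and the report layout are assumptions, and plain equality stands in for the stricter notion of "the same" discussed next.

```python
from typing import Any, Callable, Dict, Iterable, List


def shadow_compare(
    resumes: Iterable[str],
    old_path: Callable[[str], Any],
    new_path: Callable[[str], Any],
) -> Dict[str, Any]:
    """Run both code paths on each résumé and record where they differ."""
    total = 0
    mismatches: List[Dict[str, Any]] = []
    for index, text in enumerate(resumes):
        total += 1
        old_result = old_path(text)  # the path whose behavior must be preserved
        new_result = new_path(text)  # computed only for comparison
        if old_result != new_result:
            mismatches.append({"index": index, "old": old_result, "new": new_result})
    return {"total": total, "mismatches": mismatches}
```

The old path is deleted only once `mismatches` is empty, or every remaining entry is understood.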

That comparison needs one thing to be trustworthy: a single, dumb definition of "are these two results the same?" Dumb is the point. It explains how two results differ — this company field changed, that title changed, the role count dropped — and nothing more. It has no opinion about which answer is better, whether the parser was right, or what should have happened. The moment a comparator starts having opinions, you've reinvented the exact drift you're trying to kill.

Getting "the same" right is fiddlier than it sounds. Two results can be genuinely identical but look different because a list of skills came out in a different order, or a number carries floating-point noise, or one was generated a second later and has a different timestamp. So the comparator normalizes first: it sorts the things that are sets (skills are unordered), but preserves the order of things that aren't — your job history is a chronology, and a comparator that "helpfully" sorts it would hide a real bug. It rounds away float noise. It strips timestamps. Then it hashes what's left. Same hash, done. Different hash, it tells you exactly which fields moved.
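
The normalize-then-hash steps above can be sketched as follows. The field names (`skills`, `generated_at`, `timestamp`), the six-digit rounding, and the function names are all assumptions made for illustration; SHA-256 over canonical JSON is one reasonable choice, not necessarily the one used in production.

```python
import hashlib
import json
from typing import Any, Dict, List

UNORDERED_KEYS = {"skills"}                    # sets: order carries no meaning
VOLATILE_KEYS = {"generated_at", "timestamp"}  # stripped before comparing
FLOAT_DIGITS = 6                               # assumed tolerance for float noise


def normalize(value: Any, key: str = "") -> Any:
    if isinstance(value, dict):
        return {k: normalize(v, k) for k, v in value.items() if k not in VOLATILE_KEYS}
    if isinstance(value, list):
        items = [normalize(v) for v in value]
        if key in UNORDERED_KEYS:
            return sorted(items, key=lambda item: json.dumps(item, sort_keys=True))
        return items  # e.g. job history: the order is a chronology, so keep it
    if isinstance(value, float):
        return round(value, FLOAT_DIGITS)
    return value


def fingerprint(result: Dict[str, Any]) -> str:
    canonical = json.dumps(normalize(result), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_paths(a: Any, b: Any, path: str = "") -> List[str]:
    """List the paths at which two normalized values differ, with no opinions."""
    if isinstance(a, dict) and isinstance(b, dict):
        changed: List[str] = []
        for k in sorted(set(a) | set(b)):
            child = f"{path}.{k}" if path else k
            if k not in a or k not in b:
                changed.append(child)
            else:
                changed.extend(diff_paths(a[k], b[k], child))
        return changed
    if isinstance(a, list) and isinstance(b, list):
        if len(a) != len(b):
            return [f"{path}.length"]
        changed = []
        for i, (x, y) in enumerate(zip(a, b)):
            changed.extend(diff_paths(x, y, f"{path}[{i}]"))
        return changed
    return [] if a == b else [path]
```

Two results that differ only in skill order, float noise, or timestamp get the same fingerprint; swapping two roles does not, because the role list is never sorted.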

We fuzz-tested that normalizer with thousands of random structures, because it's now infrastructure, and infrastructure earns adversarial testing. We versioned it, because the rules will change and old comparisons shouldn't be silently mixed with new ones.
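
A minimal sketch of what fuzzing and versioning a normalizer can look like, using only the standard library. The compact `normalize` here is a stand-in for the real one, and `NORMALIZER_VERSION`, `random_result`, `fuzz`, and `comparable` are hypothetical names; the three properties checked are assumptions about what "correct" means.

```python
import random
from typing import Any

NORMALIZER_VERSION = 1  # bump when the rules change; store it with every comparison
UNORDERED_KEYS = {"skills"}


def normalize(value: Any, key: str = "") -> Any:
    """Compact stand-in for the normalizer under test."""
    if isinstance(value, dict):
        return {k: normalize(v, k) for k, v in value.items()}
    if isinstance(value, list):
        items = [normalize(v) for v in value]
        return sorted(items) if key in UNORDERED_KEYS else items
    if isinstance(value, float):
        return round(value, 6)
    return value


def random_result(rng: random.Random) -> dict:
    """Generate a random parse-like structure."""
    vocabulary = ["python", "sql", "go", "excel", "rust"]
    return {
        "skills": rng.sample(vocabulary, rng.randint(0, len(vocabulary))),
        "roles": [
            {"title": rng.choice(["Engineer", "Analyst", ""]), "score": rng.uniform(0, 1)}
            for _ in range(rng.randint(0, 4))
        ],
    }


def fuzz(iterations: int = 2000, seed: int = 0) -> int:
    rng = random.Random(seed)
    for _ in range(iterations):
        original = random_result(rng)
        once = normalize(original)
        # Property 1: normalizing twice changes nothing.
        assert normalize(once) == once
        # Property 2: shuffling an unordered field does not change the result.
        skills = original["skills"]
        shuffled = dict(original, skills=rng.sample(skills, len(skills)))
        assert normalize(shuffled) == once
        # Property 3: ordered fields keep their order.
        assert [r["title"] for r in once["roles"]] == [r["title"] for r in original["roles"]]
    return iterations


def comparable(record_a: dict, record_b: dict) -> bool:
    """Only compare fingerprints produced under the same normalizer version."""
    return record_a["normalizer_version"] == record_b["normalizer_version"]
```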

What the machine is for

The payoff isn't just a clean migration. Once you can mechanically compare two parses, you can run the new parser against a corpus, cluster every difference by how it differs, and get a report that groups them — "all of these résumés changed the same way, and here's a representative example" — instead of a pile of raw diffs.
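
Clustering "by how it differs" can be as simple as grouping résumés by the set of fields that changed. The sketch below assumes each résumé already has a list of changed field names; `cluster_diffs`, the résumé ids, and the field names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def cluster_diffs(diffs: Dict[str, List[str]]) -> List[dict]:
    """Group résumés by the set of fields that changed, largest cluster first."""
    groups: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for resume_id, changed_fields in diffs.items():
        if changed_fields:  # unchanged résumés are not part of the report
            signature = tuple(sorted(set(changed_fields)))
            groups[signature].append(resume_id)
    report = [
        {"signature": signature, "count": len(ids), "example": sorted(ids)[0]}
        for signature, ids in groups.items()
    ]
    return sorted(report, key=lambda cluster: (-cluster["count"], cluster["signature"]))
```

Each cluster carries a count and one representative id, so a reviewer opens one example per cluster instead of every raw diff.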

That's where we think AI actually belongs in this system — not writing the parser, but reading those clusters and explaining them. "All of these are really the same bug" is a sentence a human is bad at producing from thousands of raw diffs, and a model is surprisingly good at. The loop is: collect evidence, normalize it, cluster it, let the machine explain it, and let a human decide what to fix.

Notice what's missing from that loop. The machine doesn't change production. It explains production. The seductive version — point a model at the system and let it write the fix — feels faster and has a habit of generating fascinating new bugs while confidently explaining why they're impossible.

A parser that produces an answer is table stakes. A parser that can show you, on demand, exactly where today's answer differs from yesterday's — and why — is the thing that stops the same class of bug from coming back wearing a different disguise. That's the system we're building toward: not one that's never wrong, but one that can always tell you how it's wrong.
