We reviewed feedback from 312 Sea Wolf attempts. Here are the 5 patterns that separated passers from rejects.
I'm Tom. Over the past 14 months I've personally read prep notes, score reports, and post-test debriefs from more than 1,200 candidates preparing for McKinsey Solve. The Sea Wolf game generates more support emails than any other part of the assessment — by a long margin.
What follows is the honest synthesis of 312 of those Sea Wolf attempts where the candidate later shared an outcome. It's not a controlled study. But the patterns are consistent enough that I think they're worth publishing in full.
TL;DR — the 5 patterns
- ≈68% of failing reports describe blowing time on sites 1–2.
- ≈54% included a microbe with the undesirable trait, invalidating the site.
- ≈47% re-solved each site from scratch instead of templating.
- ≈31% optimised cleaning score instead of just hitting the target.
- ≈29% panic-clicked in the final 5 minutes and broke a valid site.
How often each pattern showed up
% of failing self-reports (n=312) mentioning each pattern. Patterns are not mutually exclusive.
Methodology & limitations
This is qualitative synthesis, not a controlled study. The 312 attempts were drawn from three sources between January 2025 and April 2026: our support inbox (n≈190), a post-purchase survey sent 7 days after delivery (n≈85), and refund-request interviews (n≈37).
Percentages reflect how often each pattern appeared in self-reported debriefs from candidates who described their result as a failure or below their target band. Outcomes are self-reported and unverified. All quotes have been paraphrased and stripped of identifying detail for privacy.
Treat the numbers as directional, not statistical. The goal of this study is to surface recurring failure modes, not to prove population-level rates.
Which patterns to fix first
Frequency (how often) × estimated impact (band-points lost when present). Top-right = highest priority. Impact is a qualitative 1–5 rating I assigned by reviewing the debriefs — directional only.
Where the minutes actually go
Self-reported median minutes per site, rounded. Failing reports (n=312) vs top scorers (n≈48). The front-loading pattern is what P1 (time blown on sites 1–2) is really about.
The 5 patterns, in order of frequency
Time blown on sites 1 and 2
The single most common failure mode. Candidates spend 12–15 minutes on each of the first two sites, usually re-checking the toxicity sum, then have ~7 of the 35 minutes left for the remaining three. Sites 4 and 5 get clicked through almost randomly.
“I felt great after site 2. Then I looked at the clock and realised I had 8 minutes for three sites. I basically guessed the last one.” Representative candidate comment, paraphrased for privacy.
Set a hard 6-minute cap per site for the first three sites. If you're not done, lock in the best legal combo you have and move on. You lose more points by leaving a site unanswered than by submitting a suboptimal one.
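The arithmetic behind the cap is worth seeing once. The sketch below assumes the 35-minute total and five sites described in this article; the constant and function names are made up for illustration.

```python
TOTAL_MINUTES = 35   # game length as described in this article
SITE_COUNT = 5       # sites per attempt as described in this article
CAP_MINUTES = 6      # the hard cap suggested above
CAPPED_SITES = 3     # the cap applies to the first three sites


def minutes_left_after_caps() -> int:
    """Worst case: every capped site uses its full cap."""
    return TOTAL_MINUTES - CAP_MINUTES * CAPPED_SITES


def should_move_on(minutes_on_site: float) -> bool:
    """True once the cap is reached, whether or not the site feels finished."""
    return minutes_on_site >= CAP_MINUTES


left = minutes_left_after_caps()
print(left)                                # 17
print(left / (SITE_COUNT - CAPPED_SITES))  # 8.5
```

Even if all three capped sites run to the limit, that leaves 17 minutes, or 8.5 minutes each, for the last two sites.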
Treating the 'avoid undesirable trait' rule as soft
Most candidates treat the undesirable-trait constraint as a tiebreaker rather than a hard filter. It isn't. A single microbe carrying the undesirable trait invalidates the entire site — even if every other number is perfect.
“My toxicity and cost were both inside the range. I was confident. The site still scored zero — I'd included a microbe that had the flagged trait.” Representative candidate comment, paraphrased for privacy.
Before evaluating numbers at all, drop every microbe in the pool that carries the undesirable trait. Solve from the filtered pool only. This change alone moves most candidates up a band.
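As a rough illustration of "filter first, then solve", here is a minimal sketch. The microbe records, field names, and trait names are all invented for the example; the real game presents this information on screen, not as data.

```python
def drop_undesirable(pool, undesirable_trait):
    """Keep only microbes that do not carry the flagged trait."""
    return [m for m in pool if undesirable_trait not in m["traits"]]


# Hypothetical pool: names, traits, and numbers are illustrative only.
pool = [
    {"name": "A", "traits": {"aerobic"}, "toxicity": 2, "cost": 3},
    {"name": "B", "traits": {"aerobic", "heat-sensitive"}, "toxicity": 1, "cost": 1},
    {"name": "C", "traits": {"motile"}, "toxicity": 3, "cost": 2},
]

safe_pool = drop_undesirable(pool, "heat-sensitive")
print([m["name"] for m in safe_pool])  # ['A', 'C']
```

Microbe B has the most attractive numbers in this pool, which is exactly why it has to be removed before any numbers are compared.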
Re-solving from scratch on every site
Each site reuses the same underlying logic: filter by required/undesirable traits, then find a 3-microbe combination inside the numeric ranges that hits the cleaning target. Candidates who treat each site as a new puzzle burn 2–3 minutes per site just re-orienting.
“By site 4 my brain was mush. I kept re-reading the rules even though they hadn't changed.” Representative candidate comment, paraphrased for privacy.
Build a 4-step template before the test: (1) filter undesirable, (2) check required trait, (3) sort by cost, (4) test the cheapest combo that hits target. Run the same sequence on every site. The mechanics don't change — only the numbers.
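The four steps can be sketched as a single routine. This is an illustration under stated assumptions, not a description of how the game scores anything: it assumes cost, toxicity, and cleaning each add up across the three microbes, that every chosen microbe must carry the required trait, and that ranges are inclusive. All field names and values are hypothetical.

```python
from itertools import combinations


def total(combo, key):
    """Sum one numeric field across a combination (assumed additive)."""
    return sum(m[key] for m in combo)


def solve_site(pool, undesirable, required, cost_range, tox_range, target):
    # (1) Filter out every microbe carrying the undesirable trait.
    candidates = [m for m in pool if undesirable not in m["traits"]]
    # (2) Keep only microbes with the required trait (assumed rule).
    candidates = [m for m in candidates if required in m["traits"]]
    # (3) Sort by cost so cheap microbes are considered first.
    candidates.sort(key=lambda m: m["cost"])
    # (4) Test 3-microbe combos from the cheapest total cost upward.
    ordered = sorted(combinations(candidates, 3), key=lambda c: total(c, "cost"))
    for combo in ordered:
        cost_ok = cost_range[0] <= total(combo, "cost") <= cost_range[1]
        tox_ok = tox_range[0] <= total(combo, "toxicity") <= tox_range[1]
        hits_target = total(combo, "cleaning") >= target
        if cost_ok and tox_ok and hits_target:
            return list(combo)
    return None  # no legal combination in this pool


# Hypothetical site: every value below is made up for the example.
pool = [
    {"name": "A", "traits": {"aerobic"}, "cost": 3, "toxicity": 2, "cleaning": 4},
    {"name": "B", "traits": {"aerobic"}, "cost": 2, "toxicity": 3, "cleaning": 3},
    {"name": "C", "traits": {"aerobic", "heat-sensitive"}, "cost": 1, "toxicity": 1, "cleaning": 9},
    {"name": "D", "traits": {"aerobic"}, "cost": 4, "toxicity": 1, "cleaning": 5},
    {"name": "E", "traits": {"motile"}, "cost": 1, "toxicity": 1, "cleaning": 5},
    {"name": "F", "traits": {"aerobic"}, "cost": 5, "toxicity": 2, "cleaning": 6},
]

answer = solve_site(pool, "heat-sensitive", "aerobic",
                    cost_range=(8, 12), tox_range=(3, 8), target=13)
print(sorted(m["name"] for m in answer))  # ['A', 'B', 'F']
```

The cheapest trio (A, B, D) misses the cleaning target by one point, so the routine moves to the next cheapest and stops there. The same call works for every site; only the arguments change.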
Misreading the cleaning target as 'maximise'
The cleaning target is a threshold to hit, not a number to maximise. Overshooting it wastes cost budget that could be spent on staying inside other ranges. Candidates who optimise for the highest possible cleaning score consistently break the cost constraint on later sites.
“I thought more cleaning = more points. Turns out I was just spending budget I needed.” Representative candidate comment, paraphrased for privacy.
Hit the target by the smallest margin you can. Treat any cleaning above target as wasted spend.
Panic-clicking in the final 5 minutes
Once the timer crosses 5 minutes remaining, decision quality collapses. Candidates start swapping microbes without re-checking constraints, often turning a valid site into an invalid one in the last 30 seconds.
“I changed my mind on site 5 with a minute left and ended up submitting something I knew was wrong.” Representative candidate comment, paraphrased for privacy.
Treat 'submitted and legal' as a higher-value state than 'unsubmitted and optimal'. Once you have a legal combination, do not touch it unless you can prove the swap is strictly better against every constraint.
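The "don't touch a legal answer" rule can be written as a guard. In this sketch, "strictly better" is narrowed to "still legal and strictly cheaper", which is my simplifying assumption; the additive totals, inclusive ranges, and all field names and values are likewise hypothetical.

```python
def is_legal(combo, undesirable, cost_range, tox_range, target):
    """Check a combination against every constraint (assumed additive totals)."""
    if any(undesirable in m["traits"] for m in combo):
        return False
    cost = sum(m["cost"] for m in combo)
    tox = sum(m["toxicity"] for m in combo)
    cleaning = sum(m["cleaning"] for m in combo)
    return (cost_range[0] <= cost <= cost_range[1]
            and tox_range[0] <= tox <= tox_range[1]
            and cleaning >= target)


def keep_or_swap(current, candidate, **rules):
    """Return the combination to submit; a legal answer wins by default."""
    if not is_legal(candidate, **rules):
        return current
    if not is_legal(current, **rules):
        return candidate
    cheaper = sum(m["cost"] for m in candidate) < sum(m["cost"] for m in current)
    return candidate if cheaper else current


# Hypothetical microbes and rules, for illustration only.
A = {"name": "A", "traits": {"aerobic"}, "cost": 3, "toxicity": 2, "cleaning": 4}
B = {"name": "B", "traits": {"aerobic"}, "cost": 2, "toxicity": 3, "cleaning": 3}
C = {"name": "C", "traits": {"heat-sensitive"}, "cost": 1, "toxicity": 1, "cleaning": 9}
F = {"name": "F", "traits": {"aerobic"}, "cost": 5, "toxicity": 2, "cleaning": 6}
G = {"name": "G", "traits": {"aerobic"}, "cost": 1, "toxicity": 2, "cleaning": 6}
rules = dict(undesirable="heat-sensitive", cost_range=(5, 12),
             tox_range=(3, 8), target=13)

current = [A, B, F]
print(keep_or_swap(current, [A, B, C], **rules) is current)  # True
print(keep_or_swap(current, [A, B, G], **rules) is current)  # False
```

The first swap brings in a flagged microbe and is rejected even though it is cheaper and cleans more. The second passes every check at a lower cost, so it replaces the current answer.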
All five patterns at a glance
| # | Pattern | Frequency | Impact (1–5) | One-line fix |
|---|---|---|---|---|
| P1 | Time blown on sites 1–2 | ≈68% | 5/5 | Hard 6-min cap per site. |
| P2 | Undesirable trait included | ≈54% | 5/5 | Filter undesirable before solving. |
| P3 | Re-solving from scratch | ≈47% | 3/5 | Run the same 4-step template per site. |
| P4 | Optimising cleaning above target | ≈31% | 3/5 | Hit target by smallest margin. |
| P5 | Panic-clicking in final 5 min | ≈29% | 4/5 | Lock legal answers. Don't touch. |
n=312 self-reported attempts · Jan 2025 – Apr 2026 · Impact ratings are qualitative.
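One rough way to collapse the table into a single fix-first ordering is to multiply frequency by impact. Treating the product as a priority score is an assumption made for this sketch; the inputs are the approximate figures from the table and remain directional only.

```python
# (frequency %, impact 1-5) copied from the table above.
patterns = {
    "P1": (68, 5),
    "P2": (54, 5),
    "P3": (47, 3),
    "P4": (31, 3),
    "P5": (29, 4),
}

# Illustrative priority score: frequency multiplied by impact.
scores = {name: freq * impact for name, (freq, impact) in patterns.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

print(scores)   # {'P1': 340, 'P2': 270, 'P3': 141, 'P4': 93, 'P5': 116}
print(ranking)  # ['P1', 'P2', 'P3', 'P5', 'P4']
```

On this scoring, panic-clicking (P5) moves ahead of over-optimising cleaning (P4) despite being slightly less frequent, because its impact rating is higher.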
What the top scorers did differently
A smaller subset (n≈48) reported scoring at the top of their band or receiving an interview invite. Three habits showed up in almost every one of those debriefs:
1. Spent the first 60 seconds reading constraints, not microbes. Filtering before solving was the single most repeated habit.
2. Used a fixed timer per site (≈6 minutes) and stopped iterating the moment they had a legal answer, even when they suspected something better existed.
3. Treated cost and toxicity as binary: 'inside the range' or 'invalid'. No partial credit reasoning, no 'close enough'.
I started SolvePrep after watching strong candidates lose offers to a 35-minute game with no real prep market. Since 2024 we've delivered prep tools to candidates across 40+ countries. I read the support inbox personally, and these studies are what comes out of it.
Want the full Sea Wolf mechanics breakdown?
The patterns above are about behaviour. If you also want the exact game mechanics and how each constraint is scored, our Sea Wolf guide goes deeper — and the free trial lets you feel the timer for yourself.