
Methodology & Data Sources

Every output on this site is grounded in publicly available NRMP data. This page documents what data the model uses, how it's calibrated, and how its outputs are validated against held-out statistics that the calibration code never sees.

Disclaimer

MatchStudy is an experimental research tool. It is not medical, career, or professional advice. No representations or warranties are made about the accuracy, completeness, or fitness for any particular purpose of any output produced by this tool.

Outputs are model predictions from a Gale-Shapley simulation over a synthetic cohort whose feature distributions are calibrated to real NRMP statistics. They reflect what the model says about a hypothetical applicant population — not deterministic guidance about any individual applicant's real prospects.

Use at your own discretion. No clinical claims are made.

Data sources

  • NRMP Charting Outcomes in the Match (2024 + 2022) — per-specialty feature distributions for matched and unmatched applicants (Step 1, Step 2 CK, AOA %, publications, research, work, volunteer experiences). Separate reports cover MD seniors, DO seniors, and IMGs.
  • NRMP Results & Data 2024 — total positions, applicants by type, per-specialty match rates.
  • NRMP Main Match Program Results 2022–2026 — every individual residency program with its real 2024 quota and historical fill rates (5,000+ programs across the 22 canonical specialties).
  • NRMP Program Director Survey 2020 — per-specialty factor importance ratings (38 factors × 22 specialties × interview/ranking stages), source for the calibrated Wp matrix.

Calibration / holdout split

The NRMP statistics are partitioned into two disjoint sets:

  • Calibration set — per-specialty marginals on matched applicants (means of Step, AOA %, publications, etc.). Used to fit generator parameters.
  • Holdout set — per-specialty match rates and per-specialty unmatched-applicant means. Used only to validate emergent simulation outcomes after the model is fit.

matchstudy/calibration.py may not import or read holdout.json. This is enforced at runtime via a path guard that raises PermissionError, and statically by tests/test_calibration_isolation.py (5 isolation tests).

Without this discipline, the model would be fit and graded on the same numbers — “circular validation” — and any pass rate would be meaningless.
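
A minimal sketch of what such a runtime guard can look like. The helper name load_stats and the constant HOLDOUT_NAME are illustrative assumptions; the actual structure of matchstudy/calibration.py may differ:

```python
import json
from pathlib import Path

HOLDOUT_NAME = "holdout.json"  # the file calibration code must never read

def load_stats(path):
    """Load calibration statistics, refusing to touch the holdout file."""
    resolved = Path(path).resolve()
    if resolved.name == HOLDOUT_NAME:
        # Fail loudly: calibration must never see the validation targets.
        raise PermissionError(f"calibration may not read {resolved}")
    with open(resolved) as f:
        return json.load(f)
```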

Validation scorecard

Looking for charts and per-pool calibration? See the full validation page →


On clustering & circularity

Specialty clustering (real finding): We cluster the 22 NRMP specialties by their published applicant feature averages (Step, AOA %, publications, research, etc. — taken directly from real NRMP statistics). K-means with k = 2 partitions them stably, with a mean Adjusted Rand Index of 0.85 across bootstrap resamples. The split: 5 highly competitive specialties (Dermatology, Neurological Surgery, Orthopaedic Surgery, Otolaryngology, Plastic Surgery) vs. the other 17. That this discovered split exactly matches what medical students independently call “the competitive ones” is the kind of external validation a non-circular finding should have. Surfaced on /applicant as “Specialty Cluster (real NRMP data)”.
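
A minimal sketch of one common recipe for this kind of stability check, assuming a features array of published per-specialty means. The placeholder data and variable names are illustrative; the production analysis may differ:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# features: (22, n_features) matrix of published per-specialty means
# (Step, AOA %, publications, research, ...). Placeholder values here.
features = np.random.default_rng(0).normal(size=(22, 5))
X = StandardScaler().fit_transform(features)

base = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Bootstrap stability: re-cluster resampled specialties and compare the
# base labels (restricted to the resampled indices) with the bootstrap
# labels via the Adjusted Rand Index.
rng = np.random.default_rng(1)
aris = []
for _ in range(200):
    idx = rng.choice(len(X), size=len(X), replace=True)
    boot = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X[idx])
    aris.append(adjusted_rand_score(base[idx], boot))
print(f"mean bootstrap ARI: {np.mean(aris):.2f}")
```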

An earlier release also ran K-means on synthetic student feature vectors as a generator-fidelity sanity check. That analysis was removed in the Phase 7.7 cleanup along with the underlying synthetic-archetype data files; the result was a tautology by construction (K-means recovers the hand-coded archetypes used to generate the data) and ought never to have been surfaced alongside the real-data cluster.

What the model captures

  • Full-cohort Gale-Shapley at build time. A 44,838-applicant synthetic cohort — matching real 2024 NRMP pool counts (MD seniors 19,755; MD grads 1,662; DO seniors 8,033; DO grads 616; US IMGs 4,751; Non-US IMGs 10,021) — is drawn from per-pool × per-specialty NRMP feature distributions, then run through the actual deferred-acceptance algorithm (see the first sketch after this list) against ~5,978 PGY-1 programs with their real capacities. Each program ends up with an equilibrium acceptance threshold — the lowest-scoring applicant it accepted. Per-program thresholds + score distributions are baked into a static JSON shipped to your browser. At runtime, your profile is scored against those thresholds; no per-bootstrap GS runs happen in the browser.
  • Pool membership lives as feature columns, not multipliers. The applicant feature matrix S is 12-dimensional: 7 stat features (Step, research, grades, AOA, Gold Humanism, LoR, X-Factor) + 5 pool indicators (is_DO, is_DO_grad, is_MD_grad, is_US_IMG, is_NonUS_IMG; MD seniors are the reference). Programs assign weights to those indicator columns in Wp the same way they weight any other feature (see the second sketch after this list). Pool weights are calibrated once at build time so simulated per-pool match rates land within ~5 pp of real NRMP rates. No POOL_PRIORITY scalar is applied to anyone's score.
  • Per-specialty calibrated marginals for Step 1, Step 2 CK, AOA %, research, publications — sourced directly from NRMP Charting Outcomes 2024 across 4 applicant pools (MD seniors, DO seniors, US IMGs, Non-US IMGs) plus synthesized previous-grad pools (drawn from the senior-unmatched distribution, since NRMP doesn't publish per-feature data on re-applicants). Both matched and unmatched profiles are used for cohort sampling.
  • Per-specialty applicant-pool mixture from NRMP Tables 12A/12B for all 22 specialties.
  • Hard Step cutoffs at top-selectivity programs. Programs with 5-year fill rate ≥ 99% implicitly reject applicants below the specialty-25th-percentile Step (≈ matched-cohort mean − 1 SD).
  • Audition rotation boost. Per-program toggle: when set, the program-side score for that user is multiplied by 1.5 (the literature midpoint for surgical specialties).
  • Per-program reputation via a Bayesian Beta-Binomial selectivity posterior on 5-year fill data (specialty-level prior, up to 5 years of evidence; see the third sketch after this list). Top-tier programs are more selective in the GS competition.
  • Signal lift calibrated against published AAMC/Academic Radiology data: 3.5× gold, 2.0× silver. ERAS allotments verified against AAMC's official 2025 list.
  • Couples structural penalty: 7% reduction on “both match” outcomes after Roth-Peranson runs, reflecting documented logistical/strategic factors.
  • Structural unmatch floor + ROL length effect: Even strong applicants carry a baseline ~1.5% unmatch risk; short rank lists add up to 30 percentage points more via 0.015 + 0.45 / (1 + 0.5 × k), where k is the number of programs ranked (see the fourth sketch after this list).
  • Imputed Letters / Clinical Grades / Humanism from correlated observed features. NRMP doesn't publish these per applicant; we use AOA → Clinical Grades and overall-strength → Letters of Rec correlations.
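
First, a condensed sketch of the build-time deferred-acceptance pass and the threshold extraction it yields. This is not the actual matchstudy build pipeline: prefs, scores, and capacity are illustrative inputs, and the real programs would score applicants via their Wp rows.

```python
import heapq

def deferred_acceptance(prefs, scores, capacity):
    """Applicant-proposing deferred acceptance with program quotas.
    prefs[a]     : applicant a's ranked program ids, best first
    scores[p][a] : program p's score for applicant a (higher = better)
    capacity[p]  : program p's quota
    Returns (match, thresholds), where thresholds[p] is the lowest
    accepted score (the program's equilibrium acceptance threshold)."""
    next_choice = {a: 0 for a in prefs}
    held = {p: [] for p in capacity}        # min-heaps of (score, applicant)
    free = [a for a in prefs if prefs[a]]
    while free:
        a = free.pop()
        while next_choice[a] < len(prefs[a]):
            p = prefs[a][next_choice[a]]
            next_choice[a] += 1
            s = scores[p].get(a)
            if s is None:                    # program does not rank applicant a
                continue
            if len(held[p]) < capacity[p]:   # open seat: hold tentatively
                heapq.heappush(held[p], (s, a))
                break
            if held[p][0][0] < s:            # displace the weakest held applicant
                _, bumped = heapq.heapreplace(held[p], (s, a))
                free.append(bumped)
                break
        # an applicant whose list is exhausted stays unmatched
    match = {a: p for p, heap in held.items() for _, a in heap}
    thresholds = {p: heap[0][0] for p, heap in held.items() if heap}
    return match, thresholds
```

Only the thresholds dictionary needs to ship to the browser: a runtime profile clears a program's bar roughly when its program-side score meets or exceeds that equilibrium threshold.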
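
Second, a toy illustration of pool indicators as ordinary feature columns. All numeric values are made up; only the column layout reflects the description above.

```python
import numpy as np

# Feature order: 7 stat features followed by 5 pool-indicator columns.
FEATURES = ["step", "research", "grades", "aoa", "gold_humanism", "lor",
            "x_factor", "is_DO", "is_DO_grad", "is_MD_grad",
            "is_US_IMG", "is_NonUS_IMG"]

# A DO senior: stat features (made-up values) plus a 1 in the is_DO column.
s_do_senior = np.array([0.6, 0.4, 0.7, 0.0, 0.0, 0.8, 0.5,
                        1.0, 0.0, 0.0, 0.0, 0.0])

# One program's row of Wp: stat weights, then (often negative) pool weights.
wp_program = np.array([0.9, 0.3, 0.6, 0.4, 0.2, 0.7, 0.1,
                       -0.5, -0.8, -0.4, -1.0, -1.5])

# The pool effect enters through the same dot product as every other feature;
# there is no separate POOL_PRIORITY multiplier on the final score.
score = wp_program @ s_do_senior
```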
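
Third, the selectivity posterior reduces to a one-line Beta-Binomial update; the parameter names here are illustrative:

```python
def selectivity_posterior(filled: int, offered: int,
                          prior_alpha: float, prior_beta: float) -> float:
    """Posterior mean fill probability: Beta(prior_alpha, prior_beta) prior
    at the specialty level, updated with up to 5 years of program evidence."""
    return (prior_alpha + filled) / (prior_alpha + prior_beta + offered)

# e.g. a specialty prior Beta(8, 2) plus a program that filled 48 of 50
# offered seats over 5 years -> (8 + 48) / (8 + 2 + 50) ≈ 0.93
```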
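
Finally, the unmatch-floor formula from the list above, written out as a function:

```python
def unmatch_risk(k: int) -> float:
    """Structural unmatch floor plus rank-order-list length effect;
    k = number of programs ranked."""
    return 0.015 + 0.45 / (1 + 0.5 * k)

# unmatch_risk(1)  ≈ 0.315  (the ~30-point short-list penalty)
# unmatch_risk(20) ≈ 0.056  (decaying toward the ~1.5% structural floor)
```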

Known limitations

  • Single-pool simulation in the validation runs. Match rates above 95% in the most competitive specialties (Plastic Surgery, Dermatology, Neurological Surgery) reflect MD-only competition; real NRMP rates of ~70-80% include IMG and DO competition. The multi-pool architecture is in place but not yet wired into the default simulation.
  • Step 1 reporting is sparse. Only ~13% of MD seniors self-reported numeric Step 1 scores after the pass/fail transition, so per-specialty Step 1 distribution statistics use only the reported subset.
  • PD Survey weights are from 2020. The 2022 and 2024 PD Surveys deliberately omitted the factor-importance items, so the calibrated Wp uses the most recent available data (2020), which may not reflect current practice shifts.
  • Program features are uncalibrated. Per-program Prestige, Research Funding, Lifestyle, etc. are not directly published by NRMP at scale. Phase 10 will swap in residency-db's FREIDA + CMS data once the scrape completes (program salaries, work hours, demographics, hospital ratings).
  • SOAP modeling is approximate. Real SOAP runs as four discrete offer rounds; the model approximates this as a single Gale-Shapley round on (unmatched applicants, unfilled programs).