Research  /  May 2026 · Signal test

Cluster buying: tested, reduced, and replaced with something better

Published 2026-05-08 · Test window 2025-04-01 → 2026-04-01 · N = 3,537 filings

TL;DR

We tested whether multiple insiders buying the same stock in the same week — historically called a “cluster buy” and weighted heavily in our scoring model — actually predicts forward returns. The answer surprised us. Cluster buys (3–4 insiders) didn’t outperform solo buys at any window or any horizon. And clusters of 5+ insiders significantly underperformed — by nearly 7 percentage points at 90 days, with p < 0.0001, robust across every cut of the data. Production scoring has been updated.

This is the second research piece we’ve published. The first (new-exec signal) found nothing. This one found something — just not what we expected.

Why we ran this test

Our production scoring model gave +25 points to every insider in a 3+ insider cluster, and +10 each to paired buys (2 insiders). That weighting came from conventional wisdom: multiple insiders agreeing is a stronger signal than a lone buyer. It’s the kind of thing every insider-trading writeup repeats.

We’d already reduced cluster weights once before — from +40 down to +25 — after a backfill analysis suggested the original weight was too aggressive. But “too aggressive” isn’t the same as “wrong direction.” This test was supposed to validate that the current weights were calibrated correctly. We pre-registered the methodology and decision rules before pulling any data.

The decision rule was binary: if bucketed cluster vs. solo spread at 90 days was ≥ +3pp at p < 0.05, keep the weights. If it came in below that, reduce or kill.

It came in at −3.72pp at p = 0.034. That’s the wrong sign. The signal isn’t underpowered — it’s pointing the other way.
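The decision rule is simple enough to state in code. A minimal sketch (function and argument names are ours, not from the repo):

```python
def decide(spread_pp: float, p_value: float,
           min_spread_pp: float = 3.0, alpha: float = 0.05) -> str:
    """Pre-registered binary call: keep the cluster weights only if the
    90-day cluster-vs-solo spread clears both thresholds."""
    if spread_pp >= min_spread_pp and p_value < alpha:
        return "keep"
    return "reduce_or_kill"

print(decide(-3.72, 0.034))  # prints "reduce_or_kill"
```

Note that a significant result with the wrong sign still fails the spread threshold, which is exactly what happened here.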

Methodology

The pre-registration document, decision rules, and full results are committed to the repo and timestamped before the analysis ran.

Cohort sizes

Window    Solo    Paired  Cluster  Mega-cluster
3 days    2,194   610     389      344
7 days    1,981   649     507      400
14 days   1,824   692     587      434
30 days   1,701   687     678      471

Total filings analyzed: 3,537. Cluster cohort cleared the pre-registered N=50 sanity floor at every window.
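The cohort assignment itself is mechanical: count distinct CIKs buying the same ticker inside the window, then bucket. A simplified sketch, assuming a symmetric ±N-day window around each filing (the production windowing may differ):

```python
from collections import defaultdict
from datetime import date

def bucket(n_insiders: int) -> str:
    # Cohort cut-offs from the writeup: 1 / 2 / 3-4 / 5+ distinct insiders.
    if n_insiders >= 5:
        return "mega_cluster"
    if n_insiders >= 3:
        return "cluster"
    return "paired" if n_insiders == 2 else "solo"

def label_filings(filings, window_days=7):
    """filings: list of (ticker, filing_date, cik) tuples. Each filing is
    labeled by how many distinct CIKs bought the same ticker within
    +/- window_days of it."""
    by_ticker = defaultdict(list)
    for ticker, day, cik in filings:
        by_ticker[ticker].append((day, cik))
    labels = []
    for ticker, day, cik in filings:
        nearby = {c for d, c in by_ticker[ticker]
                  if abs((d - day).days) <= window_days}
        labels.append(bucket(len(nearby)))
    return labels

filings = [("ACME", date(2025, 6, 2), "cik-1"),
           ("ACME", date(2025, 6, 5), "cik-2"),
           ("ZED",  date(2025, 6, 2), "cik-9")]
print(label_filings(filings))  # prints ['paired', 'paired', 'solo']
```

This also makes the cohort-size pattern in the table intuitive: longer windows absorb filings out of the solo cohort into the paired and cluster cohorts.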

The headline result

At the production-relevant window (N=7 days), 90-day excess return vs. SPY:

Cohort        n      Mean excess  Median excess  Win rate  p vs. solo
Solo          1,355  +1.27pp      −4.77pp        39.9%     —
Paired        460    +4.72pp      −7.32pp        35.0%     0.21
Cluster       341    −2.29pp      −3.79pp        46.6%     0.11
Mega-cluster  311    −5.67pp      −6.64pp        38.9%     <0.0001

Three things to notice.

First, paired buys (exactly 2 insiders) are the best cohort by mean excess return. They’re not statistically significant at the 5% level, but the direction is consistent across every window tested (+2.87 to +4.27pp at d90 depending on N). That matches the intuition that “two people independently agreeing” is a real signal.

Second, the relationship between cluster size and forward returns reverses past 2 insiders. It’s not noisy — it’s monotonic. Adding more insiders past two makes returns worse, not better:

paired       +3.46pp
solo          0.00pp  (baseline)
cluster      -3.55pp
mega_cluster -6.94pp

Third, mega-cluster is robustly negative. Every window, every horizon, every test we ran agrees. At d90: −6.54pp at N=3 (p<0.001), −6.94pp at N=7 (p<0.0001), −5.81pp at N=14 (p=0.001), −6.38pp at N=30 (p<0.001). It isn't an artifact of any one window definition.
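The writeup doesn't specify which test produced the per-cohort p-values; one plausible choice for comparing a cohort's excess returns against the solo baseline is Welch's unequal-variance t-test, sketched here:

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t-statistic plus Welch-Satterthwaite degrees of freedom
    for two samples with unequal variances (e.g. mega-cluster vs. solo
    90-day excess returns)."""
    na, nb = len(a), len(b)
    va, vb = variance(a), variance(b)  # sample variances
    se2 = va / na + vb / nb
    t = (mean(a) - mean(b)) / sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df
```

A permutation test would be a reasonable alternative given the heavy-tailed return distributions; either way, the t-statistic's sign carries the finding.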

The bucketed comparison

The pre-registered primary test was bucketed cluster vs. solo at d90, controlling for market cap and buy value. Sixteen cells, requiring ≥ 10 observations per cohort to be usable.

The two cells doing most of the work:

Market cap  Buy value             n_solo  n_cluster  mean_solo  mean_cluster  spread    p
Micro       Small ($10K–$100K)    392     115        +5.03pp    −3.92pp       −8.95pp   0.012
Small       Medium ($100K–$500K)  110     58         +0.78pp    −10.13pp      −10.91pp  0.009

The micro-cap small-buy cell is significant on its own. So is the small-cap medium-buy cell. Both push in the same direction.

There’s one cell pushing the other way (large × small, +19.70pp, p=0.023, n=120) — but it’s a large-cap cell where the solo arm itself returned −21.52pp, dragged down by the broader large-cap underperformance our scorecard already documents. Hard to read that as cluster signal.

Pooled across all usable cells, weighted by sample size: −3.72pp at p=0.034.
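The pooling step is a sample-size-weighted average of per-cell spreads. A sketch (the weighting scheme is inferred from the text, not taken from the repo):

```python
def pooled_spread(cells):
    """Sample-size-weighted mean of per-cell cluster-vs-solo spreads.
    cells: list of (spread_pp, n_obs) pairs."""
    total = sum(n for _, n in cells)
    return sum(s * n for s, n in cells) / total

# Pooling only the two dominant cells from the table (n = n_solo + n_cluster):
two_cells = [(-8.95, 392 + 115), (-10.91, 110 + 58)]
print(round(pooled_spread(two_cells), 2))  # prints -9.44
```

Pooling just these two cells gives roughly −9.4pp; the published −3.72pp figure pools all usable cells, including the positive large-cap cell, which pulls the weighted average back toward zero.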

The regression

To address the worry that bucketing is too coarse and might be missing within-cell variation, we ran a regression on the same data:

excess_return_d90 = β₀ + β₁·paired + β₂·cluster + β₃·mega_cluster
                    + β₄·log(market_cap) + β₅·log(buy_value)
                    + β₆·micro_cap_dummy + ε
Term             Coef    t-stat  p-value
(Intercept)      +0.397  2.53    0.011
paired           +0.028  0.98    0.32
cluster          −0.032  −1.41   0.16
mega_cluster     −0.095  −5.34   <0.0001
log(market_cap)  −0.027  −4.45   <0.0001
log(buy_value)   +0.018  1.83    0.067
micro_cap_dummy  −0.058  −2.02   0.043

n = 2,425. Adjusted R² is small (1.8%) — we don’t expect a regression of insider-trade-level returns to explain a lot of variance, and that’s fine. What matters here is the sign and significance of the cohort coefficients.

The mega-cluster coefficient survives controls: an ~9.5pp underperformance after controlling for buy value, market cap, and the small-cap effect. The cluster coefficient is negative but not significant on its own. The paired coefficient is positive but not significant.
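For readers who want to reproduce the shape of this fit, the regression reduces to ordinary least squares on a design matrix of cohort dummies plus the controls. A toy sketch with numpy (not the production estimator, which also reports t-stats):

```python
import numpy as np

def fit_ols(X, y):
    """Plain OLS via least squares. In the real regression the columns
    would be: intercept, paired, cluster, mega_cluster, log(market_cap),
    log(buy_value), micro_cap_dummy."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

# Synthetic demo: intercept, cluster dummy, mega_cluster dummy.
X = np.array([[1, 0, 0],
              [1, 1, 0],
              [1, 0, 1],
              [1, 1, 1]], dtype=float)
y = np.array([0.01, -0.02, -0.08, -0.11])
print(fit_ols(X, y))
```

With dummy coding, each cohort coefficient reads directly as that cohort's return gap versus the solo baseline after the controls are held fixed.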

Composition: who’s in the cluster matters

For non-solo cohorts, we tagged each cluster's role composition into three categories: officer-only, director-only, and mixed. At N=7, d90:

Composition    n    Mean excess  Median   Win rate
Officer-only   104  +1.22pp      −7.42pp  47.1%
Mixed          419  −2.49pp      −4.24pp  43.7%
Director-only  129  −12.62pp     −7.86pp  37.2%

Director-only clusters are the worst-performing sub-cohort in the dataset. Officer-only clusters are roughly flat. Mixed clusters fall in between.

We’re not yet ready to claim “director-only clusters are a contrary indicator” as a standalone production rule — the sample is smaller and we haven’t isolated it from sector or other confounds. But it’s the next thing to look at.
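The composition tagging is a simple set comparison over member roles. A sketch, assuming a two-role simplification (real Form 4 role flags are richer: 10% owners, specific officer titles):

```python
def composition(roles):
    """Classify a cluster by its members' roles.
    roles: iterable of 'officer' / 'director' strings."""
    kinds = set(roles)
    if kinds == {"officer"}:
        return "officer_only"
    if kinds == {"director"}:
        return "director_only"
    return "mixed"

print(composition(["director", "director", "director"]))  # prints director_only
```

Members holding both roles would need an explicit rule (we'd count them toward "mixed"), which is one of the confounds a dedicated director-only test would have to pin down.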


What we changed in production

Effective with the next pipeline run after this writeup is published:

  1. Cluster bonus (3–4 insiders) — removed. No score change. The signal is still surfaced in the editorial layer (“multiple insiders bought”) but doesn’t move the score.
  2. Mega-cluster (5+ insiders) — added as a penalty. −20 per filing in the cluster. This is the first time we’ve added a negative conviction signal of this kind. It deserves a contrary-indicator label in the editorial copy.
  3. Paired (2 insiders) — held at +10. Direction-positive but not significant. We’re holding the weight pending an independent retest on a second year of data.

Picks scored before the deploy date used prior weights. Picks scored after use updated weights. The scorecard will show both regimes.
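The weight change itself is small. A sketch of the before/after scoring component (point values are from this writeup; the dict and function names are ours, not the production pipeline's):

```python
# Old regime: every insider in a 3+ cluster got the +25 bonus.
OLD_WEIGHTS = {"paired": 10, "cluster": 25, "mega_cluster": 25}
# New regime: cluster bonus removed, mega-cluster penalized.
NEW_WEIGHTS = {"paired": 10, "cluster": 0, "mega_cluster": -20}

def cluster_component(cohort: str, weights: dict) -> int:
    """Per-filing score contribution from cluster membership."""
    return weights.get(cohort, 0)  # solo contributes nothing

print(cluster_component("mega_cluster", NEW_WEIGHTS))  # prints -20
```

Keeping both weight dicts around makes the two scoring regimes on the scorecard easy to reproduce.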

What this doesn’t tell us

A few honest limitations.

One year, one regime. 2025-04 to 2026-04 covers a single market environment. The mega-cluster contrarian effect could be regime-specific. We’re queueing an independent retest on a second year of data when it’s available.

Cluster definition is mechanical. Our cohorts count distinct CIKs, nothing else. We don’t weight by seniority, dollar amount, or whether the insiders all bought on the same day vs. spread across the window. A more nuanced definition might produce different results.

Mega-cluster mechanism is unclear. We don’t have a clean explanation for why 5+ insiders piling in within a week predicts underperformance. Plausible stories: capitulation buying near the bottom of a longer decline, signaling activity around a corporate event the market then re-prices, or simple “everyone read the same memo” herding. We’re treating the result as descriptive, not yet causal.

The director-only finding needs its own test. The −12.62pp result is striking but uncontrolled. Could be sector-driven, could be specific to a few large clusters that dominate the cohort, could be real. Separate research.

What we’d do differently

The pre-registered decision rule worked exactly as it was supposed to: it forced a binary call before we saw the data, and it produced a finding we wouldn’t have arrived at by squinting at the numbers. The thing we didn’t anticipate was that the most interesting result wouldn’t be in the primary test (cluster vs. solo) but in the cohort we treated as exploratory (mega-cluster). Worth remembering for the next test.


Get the signals daily

Free daily email that scores SEC Form 4 insider trading filings and surfaces the highest-conviction signals every morning. Scoring model is published, methodology is open, every pick is tracked in public on the scorecard.

Subscribe free