I Tried to Beat Prediction Markets

It was a Tuesday afternoon. The MLB regular season had been running long enough that my data pipeline was full. I’d been backtesting on Kalshi’s binary contracts, the ones where you bet YES or NO on outcomes like the Yankees beating the Royals tonight. Then it hit me. Slight favorites, teams that the Kalshi market priced between 55 and 60 percent likely to win, were overperforming the market’s implied probability by 17 percentage points. The line said they should win 55.3% of the time. They were actually winning 71.5% of the time. That’s an enormous gap. T-statistic of 5.4. p-value of 1.4 × 10⁻⁷. The kind of number you triple-check because it can’t possibly be right.

It was right, in theory.

I sat there for a moment and let myself feel the obvious feeling: holy shit, I found it. This is what every quant fantasizes about, a robust, well-defined inefficiency in a real, liquid market, sitting in plain sight, exploitable with simple rules. Buy YES on every Kalshi MLB market that opens between 0.50 and 0.60. Hold to settlement. Collect the 17 percentage points minus fees.

The math: at Kalshi’s roughly 4 cent fee per round-trip plus a 1-2 cent bid-ask spread, that’s about 6 cents of friction. 17 percentage points of edge translates to roughly 17 cents per contract in expected value. Even after friction, you’d net about 11 cents per contract. On Kalshi’s MLB book, daily volume could support maybe $1,000 of expression at this size. 11 cents on, say, 50 contracts a day is $5.50 a day, every day, on a venue where the strategy didn’t even require predicting anything I didn’t already get from public information.

I started drafting the deploy plan: position sizing, walk-forward validation, and, most importantly, the kill switch.

And then I made the one decision that mattered: before I funded the live account, I ran the candidate through the validation harness I’d been building. The harness has 7 gates. It killed the strategy in about 8 minutes.

The 17 percentage point edge wasn’t real. It wasn’t even close to real. Once I controlled for candle-coverage survivorship bias, a technical artifact in how Kalshi’s API serves historical price data, the gap dwindled to -2.6 percentage points, statistically indistinguishable from zero, on a much larger sample. The signal I had been certain about was a measurement artifact in my pipeline. A measurement artifact that, if I had skipped validation and deployed real money, would have cost me roughly the entire account in 12 weeks.

I want to tell you exactly what that artifact was, exactly how the harness caught it, exactly what other candidates died the same way, and exactly what is left of the prediction-market trading dream after 7 such kills in a row.

The answer, at the end, is that I tried for 6 weeks to beat Kalshi and lost $1.57. The $1.57 is the worst part, but I’d have preferred to make $1,000 and have a story. Instead, I got a precise understanding of why retail prediction-market trading is structurally impossible at this scale and on this venue.

This is the story.

Why I started

Prediction markets like Kalshi and Polymarket are the financial product that ought to work. They have a clear structure, binary outcomes, and well-defined settlement. They have a thick academic literature [Wolfers & Zitzewitz, 2004; Manski, 2006] showing that their aggregate prices are well-calibrated forecasters of real-world events. They have low-tens-of-dollars minimum bet sizes, retail-accessible APIs, and regulated US venues. By 2026, they had attracted a non-trivial body of gray-literature claims that retail traders were making consistent money on them.

I’m the sort of person who reads finance Twitter and notices when a particular kind of edge claim keeps appearing in my feed. By spring 2026, “I’ve been making $200/day market-making on Kalshi” had been appearing in my feed about once a week for 6 months. The claim was specific enough to pique my interest. The venue was small enough that institutional money (hopefully) hadn’t optimized it away, and the math (buy at the bid, sell at the ask, collect a few cents of spread on every round trip) was simple enough that I could just try it.

I’m also the sort of person who, when I decide to try something, builds infrastructure first. The first 3 weeks of this project were spent building:

A point-in-time data pipeline that pulled every settled Kalshi market into a local Postgres database, with both legs of every binary contract (the YES side and the NO side), every minute-level candle of price activity, every fee structure update, and every order-book snapshot I could capture. Final dataset: 71,316 settled markets, 790,317 price candles, spanning August 2021 to June 2026.
A backtest harness that scored a candidate strategy at Kalshi’s exact fee schedule: 0.07 × p × (1 − p) × N dollars per order, rounded up to the next cent, where p is the YES price between 0 and 1 and N is the contract count. Maker fee is 25% of the taker fee. No assumptions about flat rates.
A validation framework. I knew, from reading the asset-pricing literature, that the universe of plausible retail edges I’d be testing was big, probably 20 or 30 pre-registered hypotheses by the end. Bailey and López de Prado’s 2014 paper on the Deflated Sharpe Ratio, and Harvey, Liu, and Zhu’s 2016 paper on the cross-section of expected returns, both make the same point: the conventional t > 2.0 threshold for “significant” is catastrophically inadequate when you’re testing many candidates. The proper threshold under reasonable multiple-testing assumptions is t > 3.0, and a Bonferroni correction across 20 pre-registered hypotheses lands in the same place, at roughly t > 3.0.
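The fee schedule in the list above can be sketched in a few lines. This is a minimal illustration, not the harness itself: it assumes the formula yields a dollar amount that is rounded up to the next cent, and that the maker fee is rounded up the same way. The function names are hypothetical.

```python
import math

TAKER_RATE = 0.07       # coefficient quoted in the text
MAKER_FRACTION = 0.25   # maker fee as a fraction of the taker fee, per the text


def taker_fee_cents(price: float, contracts: int) -> int:
    """Taker fee for one order in whole cents, rounded up (assumed rounding rule)."""
    if not 0.0 < price < 1.0:
        raise ValueError("price must be strictly between 0 and 1")
    raw_cents = TAKER_RATE * price * (1.0 - price) * contracts * 100
    # round() first so float noise like 175.00000000000003 does not ceil to 176
    return math.ceil(round(raw_cents, 6))


def maker_fee_cents(price: float, contracts: int) -> int:
    """Maker fee at 25% of the unrounded taker fee, rounded up (assumption)."""
    if not 0.0 < price < 1.0:
        raise ValueError("price must be strictly between 0 and 1")
    raw_cents = MAKER_FRACTION * TAKER_RATE * price * (1.0 - price) * contracts * 100
    return math.ceil(round(raw_cents, 6))


if __name__ == "__main__":
    print(taker_fee_cents(0.50, 100))  # fee on 100 contracts at 50 cents
    print(maker_fee_cents(0.50, 100))
```

Because the fee scales with p × (1 − p), it peaks at 50 cents and shrinks toward the tails, which is why a flat-rate assumption misprices both coin-flip and longshot strategies.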

That last paragraph is the foundation everything else in this story rests on. It’s also the reason my MLB slight-favorite signal didn’t survive: a t-statistic of 5.4 sounds impressive, but I had 19 other hypotheses queued up at the time, and Bonferroni-adjusted significance plus a few selection-bias controls destroyed the whole thing.
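For a sense of where a Bonferroni threshold comes from, here is a rough sketch using the normal approximation from Python’s standard library. It is an illustration under stated assumptions (a 5% family-wise error rate, large samples so that the t and normal distributions nearly coincide), not the harness’s actual gate; the function name is made up for this example.

```python
from statistics import NormalDist


def bonferroni_z_threshold(n_tests: int, alpha: float = 0.05, two_sided: bool = True) -> float:
    """Critical z-value after splitting alpha evenly across n_tests hypotheses."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    per_test_alpha = alpha / n_tests
    tail = per_test_alpha / 2 if two_sided else per_test_alpha
    return NormalDist().inv_cdf(1.0 - tail)


if __name__ == "__main__":
    for n in (1, 20, 30):
        print(n, round(bonferroni_z_threshold(n), 3))
```

With small samples the t-distribution has fatter tails, so the real cutoff sits slightly above the normal-approximation value.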

So that was the setup: the data, the costs modeled correctly, and the harness. Then I started testing edges.

Predictive Models

The most obvious thing to try, and the thing every retail quant tries first, is to build a better model than the market. If Kalshi’s NBA contracts price the Knicks at 58% to beat the Cavaliers tonight, and your hand-rolled model says 64%, you buy YES, hold to settlement, and collect 6 cents of edge per contract.

I built 6 predictive models over the first 2 weeks:

Weather model. Kalshi runs contracts on tomorrow’s high temperature in various US cities. I built a fairly serious one: I pulled NWS hourly forecasts, ran a deterministic numerical-weather-prediction stack, and scored its calibration against settled outcomes. The oracle backtest (perfect forward weather knowledge, pretending I had a crystal ball about tomorrow’s temperature) won 73% of trades. With the real production NWS forecast, it lost 0.07 cents per contract net of fees.

Why? Because the Kalshi market is already pricing an ensemble of weather forecasts, ECMWF, GFS, NWS, that dominates any single source. My single-source model was strictly worse than the consensus the market was already integrating. Even if my model had been right, the market was a tighter combination of better information.

A second insight: most weather contracts on Kalshi have almost zero trading volume. There was no venue at which to express the trade at any meaningful size. So even if I had found a real edge, there was no money on the other side to take it from.

NBA team-strength model. I built an Elo-rating model on NBA team performance, calibrated it on 3 years of regular-season data, scored it against Kalshi’s pre-game win-probability contracts. The model produced a Brier score of 0.218 against the market’s 0.224. Roughly a 3 percent calibration improvement. Sounds promising.

When I dug into the disagreements, the cases where my Elo model and Kalshi materially differed, the model was correct 35% of the time. That is worse than chance. Kalshi was tracking sharp sportsbook lines closely enough that my Elo model was effectively just noise relative to the true probability. The disagreement segment was a pure measurement artifact of my model being underspecified.

I tried a multi-horizon test. At 6 hours before tipoff, my Elo model had a Brier score of 0.235; Kalshi’s was 0.224. At 1 hour before tip, 0.222 to 0.220, Kalshi pulled even. At 15 minutes before tip, 0.218 to 0.232, Kalshi beat me. The market was continuously absorbing sharp information up until tip-off, and my model wasn’t. The “edge” I thought I’d found was the early-game phase where my static model happened to be roughly equally bad as the market; the moment the market started actually working, my edge inverted.
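The two measurements used in this comparison, the Brier score and the hit rate on disagreements, can be sketched as follows. The definition of a “correct” disagreement here (the outcome landed on the side of the market price where the model leaned) is my assumption about what was measured, and the function names and the 5-point gap are illustrative.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes (lower is better)."""
    if not probs or len(probs) != len(outcomes):
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def disagreement_hit_rate(
    model: Sequence[float],
    market: Sequence[float],
    outcomes: Sequence[int],
    min_gap: float = 0.05,
) -> float:
    """Share of material disagreements where the outcome fell on the model's side."""
    hits = 0
    total = 0
    for m, k, o in zip(model, market, outcomes):
        if abs(m - k) < min_gap:
            continue  # not a material disagreement
        total += 1
        model_leans_yes = m > k
        if model_leans_yes == bool(o):
            hits += 1
    if total == 0:
        raise ValueError("no disagreements at this gap")
    return hits / total
```

A model can post a slightly better aggregate Brier score and still lose on the disagreement subset, which is the only subset where it would ever place a trade.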

MLB pitcher model. This was the most serious modeling effort. I built starting-pitcher-aware models with same-day lineup data, sportsbook line ingestion, the whole nine yards. Brier score on a powered sample of 1,287 games: 0.250 model versus 0.225 market. My model was worse than the market. Even on the narrow subset of extreme favorites (p > 0.85) where the model and market disagreed, the model lost 4 cents per contract on 39 trades. Sub-threshold, sub-edge, no signal worth investigating.

ATP and WTA tennis models. 2 tries here. The first found a “7 percent edge” buying favorites in low-liquidity ATP tournaments. The second found a similar gap in WTA underdogs. Both, when validated against the full population with proper survivorship-bias controls, evaporated. ATP favorites inverted to a 24.3 percentage-point loss against the both-legs-covered subset. WTA was a different artifact, volume cherry-picking plus closing-line drift, but same outcome: 0 capacity, 0 edge.

Crypto tail model. Kalshi runs daily and monthly above/below contracts on Bitcoin, Ethereum, Solana, etc. I built a Black-Scholes-style tail-probability model using stale spot prices and implied volatility surfaces from Deribit. The model produced reasonable tail probabilities. The market priced them at a 16-percentage-point premium to the model on average, mostly in the deep out-of-the-money strikes. That premium is real, it’s the volatility risk premium documented since Bates 2003, but Kalshi’s bid-ask spread on the affected strikes was wider than the premium, so you couldn’t execute against it. Spreads ate the edge for breakfast.
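A Black-Scholes-style tail probability of the kind described above reduces to N(d2), the risk-neutral probability that the price finishes above the strike. The sketch below is a simplification: it assumes lognormal prices and a single flat volatility rather than a full Deribit surface, and the function name and defaults are hypothetical.

```python
import math
from statistics import NormalDist


def prob_above(spot: float, strike: float, vol: float, years: float, rate: float = 0.0) -> float:
    """Risk-neutral P(S_T > strike) under lognormal dynamics, i.e. N(d2)."""
    if min(spot, strike, vol, years) <= 0:
        raise ValueError("spot, strike, vol and years must all be positive")
    d2 = (math.log(spot / strike) + (rate - 0.5 * vol ** 2) * years) / (vol * math.sqrt(years))
    return NormalDist().cdf(d2)


if __name__ == "__main__":
    # Illustrative inputs only: one-day horizon, 50% annualized vol.
    for strike in (95_000, 100_000, 105_000):
        print(strike, round(prob_above(100_000, strike, 0.50, 1 / 365), 4))
```

Comparing this number with the contract’s ask (not its mid) is what decides whether a measured premium is tradable.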

Gas-price model. This was the most painful one to kill, because the model was good. R² of 0.54 on daily Brent crude moves, against publicly available AAA daily-average data. The catch: the market knows about AAA’s daily-average data and prices it within minutes of release. Every threshold I tested netted negative 7 cents per contract after fees. 22 effective independent events in the dataset. No power, no edge, dead.

6 models. 0 survived the harness. I had spent 2 weeks of evenings on this. The honest takeaway from the first wave was: Kalshi’s prices are sharp. The venue isn’t priced by amateurs anymore (if it ever was); it’s priced by professional arbitrageurs running against the sharpest available sportsbook and information feeds. A retail account with a hand-rolled model is, on average, worse than the market.

That was a hard insight to internalize. I had genuinely believed, going in, that “build a better model” was the canonical retail-trading approach. It was, instead, the easiest avenue to disprove.

Structural Arbitrage

If you can’t out-predict the market, maybe you can find logical inconsistencies in its prices. This is the second-most-obvious retail-trading approach: scan the prediction-market venue for situations where the prices don’t add up correctly, and execute risk-free.

I tested 4 structural angles.

Within-event arbitrage. Kalshi often runs multiple binary contracts on the same underlying event, a 0-3, 3-5, 5-10, 10+ inches of snow market, where the 4 buckets must sum to 100% probability. If they don’t sum correctly, you can buy the under-priced subset and sell the over-priced subset and lock in risk-free profit.

I scanned approximately 45,000 settled markets for logical violations. The largest violation I found netted negative 4 cents per contract after fees and spread. The violations exist, but they live entirely inside the bid-ask spread. To capture them, you have to cross both legs simultaneously, paying the spread on each, and the spread eats the gap. 0 executable locks across a 4-year dataset.
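To make “the violations live inside the spread” concrete, here is a minimal sketch of the check for a set of mutually exclusive, exhaustive buckets. It covers only the two simplest locks (buy YES on every bucket, or sell YES on every bucket) and uses a flat per-leg fee as a stand-in for the real schedule; all names are hypothetical.

```python
from typing import Dict, Sequence


def bucket_lock_edge_cents(
    yes_bids: Sequence[int],
    yes_asks: Sequence[int],
    fee_per_leg_cents: float = 0.0,
) -> Dict[str, float]:
    """Locked profit per contract set, in cents, for the two all-bucket trades.

    buy_all:  pay every YES ask; exactly one bucket pays out 100.
    sell_all: collect every YES bid; exactly one bucket costs 100 at settlement.
    """
    if not yes_bids or len(yes_bids) != len(yes_asks):
        raise ValueError("need matching, non-empty bid and ask lists")
    fees = fee_per_leg_cents * len(yes_asks)
    return {
        "buy_all": 100 - sum(yes_asks) - fees,
        "sell_all": sum(yes_bids) - 100 - fees,
    }


if __name__ == "__main__":
    # Mids sum to 96, an apparent 4-cent violation, yet neither lock is positive.
    print(bucket_lock_edge_cents([20, 30, 25, 15], [23, 33, 28, 18]))
```

A violation measured at mid prices only becomes a trade if one of these two numbers is positive after fees.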

Cross-market monotonicity. Same idea, slightly more sophisticated. If Kalshi prices a “S&P 500 closes above $5,200 today” market at 60% and a “S&P 500 closes above $5,180 today” market at 55%, that’s a logical violation: the former implies the latter, so the second probability cannot be lower than the first. I built a scanner. Found violations all the time. None were executable. Same reason as above: the bid-ask spread always exceeded the implied violation.
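A scanner of this kind can be sketched as below. It flags pairs of “closes above strike” contracts whose mids are out of order, then reports what the lock would actually pay at executable prices: buy YES on the lower strike at its ask and sell YES on the higher strike at its bid. The `Quote` type, field names, and flat fee parameter are assumptions for illustration.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Quote:
    strike: float
    bid: int  # best YES bid, in cents
    ask: int  # best YES ask, in cents


def monotonicity_violations(
    quotes: Sequence[Quote], fees_cents: float = 0.0
) -> List[Tuple[float, float, float, float]]:
    """Return (low_strike, high_strike, mid_gap, locked_cents) for each mid-price violation.

    locked_cents is the worst-case profit from buying the low strike at its ask
    and selling the high strike at its bid; it is positive only if executable.
    """
    ordered = sorted(quotes, key=lambda q: q.strike)
    found = []
    for low, high in combinations(ordered, 2):
        mid_gap = (high.bid + high.ask) / 2 - (low.bid + low.ask) / 2
        if mid_gap > 0:  # higher strike priced above lower strike
            locked = high.bid - low.ask - fees_cents
            found.append((low.strike, high.strike, mid_gap, locked))
    return found


if __name__ == "__main__":
    book = [Quote(5180, 52, 58), Quote(5200, 57, 63)]
    print(monotonicity_violations(book, fees_cents=4))
```

In the example the mids are 5 cents out of order, but the executable lock is negative before fees are even applied.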

Cross-venue arbitrage with Polymarket. This was the most exciting one. Kalshi is the US-regulated venue; Polymarket is the global crypto-collateralized one. The 2 often run almost-identical contracts, same FOMC decision, same election, same Trump statement. If their prices diverge by more than transaction costs, you can buy on the cheap venue and sell on the expensive one.

I built a cross-venue diff scanner. It found exactly 1 cleanly matched pair over 6 weeks of scanning: the FOMC rate decisions. The gap on those: 2.5 percentage points mean absolute, sitting inside the combined transaction-cost floor of both venues. The other ostensible overlaps, the most tempting being Bitcoin price contracts, turned out to be settlement-rule mirages. Kalshi settles the Bitcoin contracts against CF Benchmarks BRTI at 5pm Eastern. Polymarket settles against Binance volume-weighted at noon UTC. Those are different bets. The arbitrage doesn’t exist; the venues are complementary, not substitutable.

Parlay correlation arbitrage. If Kalshi runs both single-game markets and parlay (multi-game) markets, and the parlay is priced higher than the product of the individual probabilities, you can sell the parlay and buy the individual components. I found 59,675 plausible combinations. 0 had executable liquidity on the combo side. The combo books are buy-only, priced above the independence product because of vig, and have no contra-side sellers. The math says you should sell the over-priced combo, but you cannot transact.

4 structural angles. 0 exploitable. The pattern across all 4 was the same: the inefficiencies exist, in the sense that the prices don’t always perfectly cohere, but the bid-ask spread is calibrated tightly enough to the size of the typical inefficiency that you cannot trade through it. Kalshi’s market makers, professionals running tight, low-latency books, have priced the spread to exactly absorb the structural noise.

This was the second wave, and the second kill.

The Favorite-Longshot Bias

There is one persistent empirical regularity in prediction-market pricing that has survived decades of academic scrutiny: the favorite-longshot bias. Bets on outcomes with very low implied probabilities, “longshots,” typically below 15%, lose money on average against their realized win rates. Bets on heavy favorites, typically above 85%, make money on average. The bias was first documented by Griffith (1949) on horse racing, replicated by Snowberg and Wolfers (2010) on horse-racing data, and shows up in nearly every prediction-market dataset that has ever been analyzed.

So: fade longshots, buy heavy favorites, collect the bias.

I tested this in 3 flavors.

Weather longshot fade. Buy NO on Kalshi weather markets where YES is priced at 15% or lower. Sample: 252 trades from 671 settled weather markets, VWAP-priced. Reported result: +4.4 cents per contract, t-statistic of 9, almost 100% win rate.

I almost deployed this one. The numbers were good. The win-rate-vs-implied-probability calibration looked dead on. The only thing that gave me pause was that another agent in my project had run a similar test on the same data and got a much more modest +0.6 cents per contract with a t-statistic of 0.44, no edge. 2 adjacent backtests, same hypothesis, same data, wildly different results.

I dug in. The +4.4 cent result was pseudo-replicated: it was treating each weather contract as an independent trade when in reality dozens of contracts on the same city-day are essentially the same bet. After event-clustering, effective N dropped from 252 to about 28. The +0.6 result came from VWAP-versus-real-ask substitution: VWAP includes mid prices that you cannot actually execute against. At the real ask, the bias was +0.24 cents with t-statistic 0.27. After Bonferroni, indistinguishable from zero. And 87% of the trades had occurred in a single warm-anomaly regime, January through March 2026, when the model had a structural advantage that would not generalize.
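The pseudo-replication problem is easy to reproduce. One simple way to event-cluster, shown here as a sketch rather than as the harness’s exact method, is to collapse every trade on the same event to a single mean P&L and compute the t-statistic on those event means. Names and the toy data are hypothetical.

```python
import math
from collections import defaultdict
from statistics import mean, stdev
from typing import Iterable, Sequence, Tuple


def t_stat(values: Sequence[float]) -> float:
    """One-sample t-statistic of the mean against zero."""
    if len(values) < 2:
        raise ValueError("need at least 2 observations")
    spread = stdev(values)
    if spread == 0:
        raise ValueError("zero variance; t-statistic undefined")
    return mean(values) / (spread / math.sqrt(len(values)))


def event_clustered_t_stat(trades: Iterable[Tuple[str, float]]) -> Tuple[float, int]:
    """Collapse (event_id, pnl_cents) trades to one mean per event, then test the means."""
    by_event = defaultdict(list)
    for event_id, pnl in trades:
        by_event[event_id].append(pnl)
    event_means = [mean(pnls) for pnls in by_event.values()]
    return t_stat(event_means), len(event_means)


if __name__ == "__main__":
    # 40 trades that are really 4 independent bets, 10 near-identical contracts each.
    trades = [(e, p) for e, p in (("A", 4), ("B", 6), ("C", -5), ("D", 3)) for _ in range(10)]
    print(round(t_stat([p for _, p in trades]), 2))   # naive, treats 40 trades as independent
    print(event_clustered_t_stat(trades))             # clustered, 4 effective observations
```

The mean P&L is identical in both calculations; only the effective sample size, and therefore the standard error, changes.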

The weather longshot fade was a textbook artifact: pseudo-replication + non-executable price substitution + single-regime selection + settlement-leakage. Each of those alone would have hidden a real edge if it existed; combined, they manufactured an edge that didn’t.

MLB longshot fade. I ran the same play on MLB markets where I had better volume. Sample: 513 trades, both legs covered, event-clustered, Bonferroni-corrected. Fade result: +2.3 cents per contract, t = 0.37, 95% CI bottom = −10 cents per contract.

The bias is real here: NO contracts at YES ≤ 25% won at about 91% against an implied 80%, a bias of more than 10 percentage points and qualitatively the largest in the literature. But the spread on those markets is calibrated to the bias. The market makers know about the favorite-longshot bias too, and they post bid-ask spreads sized to consume the predictable edge. Net EV: zero. Capacity: roughly $1,500 of daily volume, which would be fine if the EV weren’t zero.

This is the same finding the academic literature [Snowberg & Wolfers, 2010] reports in horse racing and sports betting: the bias is empirically real, but its execution cost is also empirically calibrated to it. You can document the bias. You cannot trade it.

Fade-slight-favorite. This is the candidate I led with: the Tuesday-afternoon +17 percentage point MLB-favorite signal that I almost deployed. It died as I described in the cold open: candle-coverage survivorship plus pre-resolution leakage plus single-regime stacked to a fake +17 that collapsed to −2.6 on both-legs-covered data. The harness caught it before it ended up doing damage.

3 flavors, all dead. The favorite-longshot bias passes academic scrutiny because the right comparison is “implied versus realized probability,” not “after-cost trading return.” When you change the comparison to the one that matters operationally (do I make money trading this?), the answer is no.

Candle-Coverage Survivorship Bias

At this point I want to slow down and explain the single most important thing I learned from this project.

Kalshi’s API has a candlestick endpoint that returns 1-minute and 1-hour price candles for each market. The candle is created when at least 1 trade occurs in the time window. If no trades happen, if the order book is quiet, no candle is written.

Now consider what this means for a binary contract that resolves to YES.

In the hours and minutes before resolution, traders continue to quote and trade the winning leg. Why? Because the winning leg’s value is converging to $1. There is liquidity, there is interest, there are people buying and selling on small fluctuations near the settlement value. Candles get written.

But the losing leg? Once the outcome becomes statistically obvious, no one trades the losing leg anymore. Its order book thins, then evacuates, then is empty for the final minutes before resolution. No trades. No candles.

The result, in raw data form: the winning leg of every settled binary contract is more likely to have candle coverage in the final hour before resolution than the losing leg. In my powered dataset across sports, tennis, and macro categories, the gap was 7 to 8 percentage points. Winning legs were 7-8pp more likely to have at least 1 settled candle in the final pre-resolution hour.

Now consider what this does to a backtest.

If you filter your trades on “markets with at least 1 settled candle in the last hour before close”, which is a completely reasonable data-quality filter, the sort of thing every reasonable backtest does, you have inadvertently filtered on the outcome. The trades you analyze are 7-8pp more likely to be on the winning side.

For a strategy like buy slight favorites on MLB, this looks like the favorites winning at 71.5% when they should win at 55.3%. That’s the 16-percentage-point illusion I saw on my Tuesday afternoon. The filter, “show me markets with active price history”, is acting as a survivorship filter, selecting the winning leg into my sample.

The control is simple to state, but hard to implement on a typical retail data pipeline. It requires both legs of every binary contract to satisfy your candle-coverage criterion. If the YES leg has a candle but the NO leg doesn’t, the event is excluded. This is the both-legs-covered control, and it is the single most important methodological correction in my entire project.
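Stated as code, the control is a one-line filter; the hard part is having candle counts for both legs in the first place. The sketch below contrasts the naive single-leg filter with the both-legs-covered one on a toy sample. The `SettledEvent` record and its field names are hypothetical stand-ins for whatever your pipeline stores.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class SettledEvent:
    event_id: str
    yes_candles_last_hour: int
    no_candles_last_hour: int
    yes_won: bool


def yes_leg_covered(events: Sequence[SettledEvent], min_candles: int = 1) -> List[SettledEvent]:
    """Naive data-quality filter: keep events whose YES leg has recent candles."""
    return [e for e in events if e.yes_candles_last_hour >= min_candles]


def both_legs_covered(events: Sequence[SettledEvent], min_candles: int = 1) -> List[SettledEvent]:
    """Survivorship control: keep events only if BOTH legs meet the coverage criterion."""
    return [
        e for e in events
        if e.yes_candles_last_hour >= min_candles and e.no_candles_last_hour >= min_candles
    ]


def yes_win_rate(events: Sequence[SettledEvent]) -> float:
    if not events:
        raise ValueError("no events left after filtering")
    return sum(e.yes_won for e in events) / len(events)


SAMPLE = [
    SettledEvent("a", 3, 0, True),   # YES won; losing NO leg went quiet
    SettledEvent("b", 2, 1, True),
    SettledEvent("c", 0, 4, False),  # NO won; losing YES leg went quiet
    SettledEvent("d", 1, 2, False),
]

if __name__ == "__main__":
    print(yes_win_rate(yes_leg_covered(SAMPLE)))    # inflated by the outcome-correlated filter
    print(yes_win_rate(both_legs_covered(SAMPLE)))  # after the control
```

In the toy sample the true YES win rate is 50%; the single-leg filter drops the event where YES lost quietly and reports two wins out of three.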

Once I implemented both-legs-covered control:

The MLB favorite signal collapsed from +17pp to −2.6pp.
The ATP favorite signal inverted from +7pp to −24.3pp.
The weather longshot fade collapsed from +4.4 cents to +0.24 cents.
6 other candidates I was actively investigating evaporated similarly.

This bias accounts for, by my estimate, somewhere between 60% and 100% of the apparent edge in 7 of the 9 candidates I evaluated. It is the dominant artifact in retail prediction-market backtests.

To my knowledge, and I have looked, this specific mechanism is not documented in the academic prediction-markets literature. Brown, Goetzmann, Ibbotson, and Ross documented mutual-fund survivorship in 1992. Elton, Gruber, and Blake replicated it for fund performance in 1996. But the venue-specific mechanism, winning-leg trade activity continues right up to resolution while losing-leg activity ceases, asymmetrically populating the candle endpoint, is something I had to discover the way I often learn: the hard way. I documented it formally in a separate paper (titled “An Adversarial Validation Harness for Retail Trading-Edge Claims in Binary Prediction Markets”) which I’ll happily send you if you email me at austin[at]lutztalk[dot]com.

The honest implication: a non-trivial fraction of published retail-edge claims on prediction-market venues such as Kalshi and Polymarket are very probably manifestations of the same artifact. If you have ever read a Twitter thread that says “I backtested this strategy on Kalshi and it makes 10% per month,” and the author did not explicitly describe both-legs-covered control, the strategy is most likely candle-coverage survivorship.

This is the part of the project that I consider the actual contribution. The losses are not exciting; how I got here is.

The Pivot to Market Making

By week 3 I had accepted that predictive alpha was dead. 6 models, 0 edges. Whatever advantage existed in the venue was not accessible to me as a model-builder.

But there was a second category of retail-trading claims I had not addressed: be the market maker, not the model-builder. Don’t try to predict where the price is going. Post liquidity at the bid and ask, collect the spread on every round-trip, manage adverse selection.

Kalshi’s fee structure makes this superficially attractive. The maker fee is 25% of the taker fee, a 75% discount. The venue runs a Volume Incentive Program (VIP, a cashback scheme) and a Liquidity Incentive Program (LIP) that pays daily reward pools to accounts that rest size on the book. The numbers I’d seen on Twitter, $200/day, $500/day market-making, generally referenced these incentive programs.

I built a full market-making harness. The strategy was:

Subscribe to the live WebSocket order book.
Identify “viable” markets: a narrow shortlist of deep, low-adverse-selection books (crypto strikes far from the money, deep sports championship books).
Post symmetric quotes at the touch on both YES buy and YES sell sides.
Re-quote when the touch moves.
Manage inventory with a flatten rule (close any unwanted position as the spread comes to you).
Pull/widen on adverse signals: large recent moves (momentum ≥ 2 cents), wide spreads (≥ 3 cents), proximity to event resolution.
Apply a regime filter: only quote in crypto markets when mid is between 8% and 92% YES and more than 30 minutes from expiry.
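The pull rules and regime filter in the steps above combine into a single gate that runs before every quote. The sketch below uses the thresholds from the list; applying the 8%-92% and 30-minute regime filter to every market rather than only to crypto is a simplification, and the `BookState` type and its fields are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BookState:
    bid_cents: int
    ask_cents: int
    momentum_cents: float      # signed recent move in the mid
    minutes_to_expiry: float


def should_quote(
    state: BookState,
    momentum_limit: float = 2.0,  # pull when |momentum| >= 2 cents
    spread_limit: int = 3,        # pull when spread >= 3 cents
    min_minutes: float = 30.0,    # quote only with more than 30 minutes left
    mid_low: float = 8.0,         # regime filter: mid between 8 and 92 cents
    mid_high: float = 92.0,
) -> bool:
    """Return True only if no adverse signal or regime exclusion applies."""
    spread = state.ask_cents - state.bid_cents
    mid = (state.bid_cents + state.ask_cents) / 2
    if abs(state.momentum_cents) >= momentum_limit:
        return False
    if spread >= spread_limit:
        return False
    if state.minutes_to_expiry <= min_minutes:
        return False
    if not mid_low <= mid <= mid_high:
        return False
    return True
```

Note that the momentum check is symmetric: it pulls on large moves in either direction, which matches the idea that variance, not direction, is the signal.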

The thesis was that, on a narrow shortlist of viable markets, symmetric quoting would clear break-even after measured adverse selection. The honest economics the agents reported:

Net edge: −0.1 to +1.85 cents per round-trip, depending on product.
Realistic total: ~$100/day, saturating at ~$10k of capital. The binding constraint was not capital but the scarce supply of low-adverse-selection books.
LIP rebate is a shared pool, so a small account gets a sliver, counted as a garnish, not the meal.

This was the most promising thing I’d found. I should pause and emphasize: the team of agents working on the project at this point was not just me typing into a backtest framework. I had quants, a Bayesian statistician, a hedge-fund-style risk manager, and microstructure specialists working on different angles. The “$100/day on $10k” number was the consensus output of 3 independent specialist agents working from different angles. It survived the early-stage validation gates.

So I started building the live market-maker. Real WebSocket. Real Kalshi authentication. Real order placement code path, though crucially, behind the PaperOnlyGuard and DisabledRestOrderPlacer safety chains that made it impossible to actually transmit orders without a series of annoying confirmation windows. The bot would simulate; it would not execute.

The Market-Making Reality Check

Once I had the paper-only market-maker running, I ran a historical backtest of the actual strategy code over real Kalshi data from the past 18 months. Not a model of the strategy. The literal code that was about to be deployed, replayed against tick data.

The numbers came back. Across 1,555 market-days, 861 markets, and 6,983 simulated fills:

Spread capture, net of the real maker fee: +0.20 cents per round-trip. Per-segment numbers matched the agents’ earlier estimates. Crypto-intraday at +1.85 cents, deep sports at +0.21 cents, the rest break-even or negative.
Inventory carried to settlement: −$1,321 of P&L drag. Every time a quote got filled and the resulting position was held to settlement against the wrong outcome, the loss exceeded the spread captured on the way in.
Full net: −$1,017 without VIP rebate. +$537 with the assumed full VIP rebate. Sports specifically: −$2.9 per day without VIP, +$3.4 per day with VIP. Inside the noise floor of $34 daily standard deviation. 48% win rate.

So the headline thesis, “$100/day at $10k capital”, was off by 1 to 2 orders of magnitude. The most charitable interpretation, with full VIP rebate which I would not in practice receive at retail volume, was about $3.4 per day on the sports book. Without VIP it was negative.

What had the original agents missed? 2 things:

They had focused on spread capture (the +0.2 cents) and not on inventory carry to settlement (the −$1,321 drag). The spread you collect on a fill is only the gross. If you cannot exit the unwanted inventory before resolution, the position is held to outcome, and outcomes are adversely correlated with the prices at which your quote got hit. The math is exactly what every market-maker has known forever: post a YES sell at 52 cents on a market that resolves YES, and you’ve lost 48 cents per contract doing it. The +0.2 cent spread is round-off error against the 48 cent settlement loss.
They had assumed retail-tier access to the full VIP rebate. Kalshi’s VIP program is a shared pool: your share of the daily payout is proportional to your share of the daily venue volume. At retail volume (tens of contracts per day), the share is a fraction of a percent. The assumed rebate of $0.005 per contract is the cap, not the realized payout for a small account. Realized rebate for me was somewhere closer to $0.0001 per contract, roughly 50 times below the cap.
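The carry arithmetic in the first point is worth writing down, because it shows how many clean round-trips one bad settlement erases. This is a toy calculation using the numbers from the text; the function names are hypothetical and fees are ignored.

```python
import math


def settlement_pnl_cents(side: str, entry_cents: int, resolved_yes: bool, contracts: int = 1) -> int:
    """P&L of a YES position held to settlement, ignoring fees."""
    settle = 100 if resolved_yes else 0
    if side == "buy_yes":
        per_contract = settle - entry_cents
    elif side == "sell_yes":
        per_contract = entry_cents - settle
    else:
        raise ValueError("side must be 'buy_yes' or 'sell_yes'")
    return per_contract * contracts


def round_trips_to_recover(loss_cents: float, edge_per_round_trip_cents: float) -> int:
    """Number of clean round-trips needed to earn back one settlement loss."""
    if edge_per_round_trip_cents <= 0:
        raise ValueError("edge must be positive")
    # round() guards against float noise before taking the ceiling
    return math.ceil(round(loss_cents / edge_per_round_trip_cents, 9))


if __name__ == "__main__":
    loss = -settlement_pnl_cents("sell_yes", 52, resolved_yes=True)
    print(loss, round_trips_to_recover(loss, 0.2))
```

At +0.2 cents per round-trip, one contract carried to the wrong outcome from 52 cents takes 240 clean round-trips to earn back.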

I tried every lever I could think of to rescue the strategy.

Cloud co-location for latency. Maybe I was getting picked off by faster participants. I ran the sweep: latency 4ms versus 25ms versus 80ms versus 173ms. Net P&L was flat to the cent across all 4. Going from 173ms to 4ms recovered $0. The decomposition was clear: of the −$1,321 inventory drag, +$411 was latency-sensitive (good fills lost to slow reactions), but the remaining −$1,732 was structural carry, slow one-sided accumulation in trending markets, immune to any speed-up I could implement. Even at perfect zero-millisecond latency, the structural carry remained, and the +$411 of “good” latency-sensitive fills disappeared too, because they only fired in scenarios where the slow reaction was the cause of the otherwise-good fill.

Inventory management, hard caps, cross-to-flatten, regime avoidance. I rebuilt the strategy with aggressive inventory controls. Net result: $78 worse. The honest cross-to-flatten cost roughly equaled the inventory carry it saved. The hard caps were redundant with the existing regime filter. The tempting “+$1,704” I got at one point was yet another artifact: when you replay candle data, intra-candle paths are interpolated, and the interpolation lets your flatten quotes “fill” at prices the real trending market would never have offered you.

Adverse-selection control. This was the make-or-break. The hypothesis was that variance, not direction, was predictable, and a symmetric pull rule could cut the adverse cost. The thesis tested positive on deep sports: a |momentum| ≥ 2c or spread ≥ 3c or near-event filter cut adverse cost by 87% in deep sports books, flipping −0.31 cents per decision to +0.13 cents per decision. So at the segment level, deep sports only, with adverse-selection skill, the strategy could clear break-even. But near-expiry/hot crypto stayed negative; the regime filter just excluded it.

The honest aggregate: with all levers pulled, the strategy was empirically break-even. Not profitable. Not losing money, but not profitable.

Market-making on Kalshi at retail scale, as a strategy, was dead. The +$100/day model was empirically wrong by an order of magnitude. The closest you could honestly call it was that the strategy might clear break-even on the deep sports books if you were extremely disciplined about regime filtering and adverse-selection control.

This was the lowest point of the project. I had spent 3 weeks building infrastructure for a strategy that, after honest measurement, didn’t work. The team converged on the verdict in a single afternoon: no durable retail trading edge exists on Kalshi, predictive, structural, or market-making, and the only reliable positive is the 3.25-3.75% APY that Kalshi pays on deposited capital. Most HYSAs offer that…with no risk.

I sat with that for a day. Then I started looking at the APY.

Structural mechanisms

If active trading is dead, what about passive structural mechanisms? Kalshi has 2 of them:

Deposit APY. Kalshi pays approximately 3.25-3.75% annualized on your portfolio net value, covering both idle cash and the collateral backing open positions. For US members with a $250 minimum, this is essentially risk-free. On a $50,000 deposit, that’s about $1,750/year at the 3.5% midpoint, uncorrelated with trading activity.

Liquidity Incentive Program (LIP). A daily reward pool that Kalshi pays to accounts that rest size on designated incentivized markets. The mechanism is structural, Kalshi pays for liquidity provision by program design, not predictive. The trader doesn’t need to be right about anything; they need to be present, with size, at the touch.

What I did not yet know was that the LIP rebate at retail capital is too small to cover adverse-selection costs on the same books. But the concept was attractive: a structural source of yield that didn’t depend on outsmarting the market.

This was the project’s turn from “can we find a predictive edge?” to “can we mechanically extract a rebate?”

I spent the next 2 weeks building the LIP harvester.

The LIP harvester

I built the LIP harvester in a progression I came to call Paths A through E. Each path solved a real problem and surfaced 2 more. The progression is, in retrospect, the most back-assward part of the story.

Path A, the one-shot CLI. The first version was a simple lip-post command that you ran by hand to post a target quote pair on one of 2 whitelisted markets. Every safety guard was in place: a hard-coded ticker whitelist (only 2 markets, both political contracts with known LIP activity), a maximum contract count of 100, a maximum exposure of $500, dual-arming (you need both the --live CLI flag and the KALSHI_LIVE_ORDERS_ARMED=true environment variable), a kill switch file, and a full audit log of every order attempt. The first real order I ever placed via this code was 100 contracts of KXTRUMPPHOTO-26JUN07 YES at 7 cents, $7 of working capital. The next attempt was for the YES sell side at 8 cents.

My first real money on the venue. $6 of inventory exposure, all directional, in a market I had no intention of being directional in. Lesson learned: a one-shot CLI is not a market-making strategy. It’s a way to accumulate adverse positions.
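For anyone wiring up something similar, the Path A guards reduce to one pre-flight check. The sketch below is an illustration under assumptions rather than my actual CLI: the function name, the refusal strings, the kill switch path, and the definition of exposure as contracts times price are all hypothetical. The limits and the KALSHI_LIVE_ORDERS_ARMED variable are the ones described above.

```python
import os
from pathlib import Path

# Only one of the 2 whitelisted tickers is named in this post.
ALLOWED_TICKERS = frozenset({"KXTRUMPPHOTO-26JUN07"})
MAX_CONTRACTS = 100
MAX_EXPOSURE_CENTS = 500 * 100           # $500 expressed in cents
KILL_SWITCH_PATH = Path("KILL_SWITCH")   # hypothetical file location


def refusal_reason(ticker, count, price_cents, live_flag,
                   env=None, kill_switch=KILL_SWITCH_PATH):
    """Return None if the order may be sent, else a short refusal code."""
    env = os.environ if env is None else env
    if kill_switch.exists():
        return "kill_switch_present"
    # Dual-arming: the CLI flag and the environment variable must both be set.
    if not live_flag:
        return "live_flag_missing"
    if env.get("KALSHI_LIVE_ORDERS_ARMED") != "true":
        return "env_not_armed"
    if ticker not in ALLOWED_TICKERS:
        return "ticker_not_whitelisted"
    if not 1 <= count <= MAX_CONTRACTS:
        return "count_out_of_range"
    # Assumed exposure definition: contracts times price for a buy.
    if count * price_cents > MAX_EXPOSURE_CENTS:
        return "exposure_too_large"
    return None
```

Returning a reason code instead of a bare boolean makes each refusal easy to write to the audit log. Note that every check here passes for a one-sided order, which is exactly how a guarded one-shot CLI can still leave you holding directional inventory.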

Path B, the continuous-mode harvester. The next iteration was a daemon that ran continuously, subscribed to the WebSocket book, posted at-touch quotes, cancelled and re-quoted when the touch moved, and flattened inventory on fill. Architecture, yay. Then I tried to deploy it as a 24/7 operation. Here’s what went wrong:

The auto-sizer divided cash by per-pair cost across the whole whitelist, including markets whose order books were empty. With 1 ticker quiet, the safe-size computation returned None, and the daemon halted itself. Fix: handle partial-touch sizing, size off the markets we have data for, skip the others.
The placer’s MAX_ORDERS_PER_INVOCATION = 4 was scoped per daemon lifetime. After the initial 4-order quote cycle, every subsequent re-quote was refused. The daemon ran for 30 minutes generating 2,638 PLACE_REFUSED audit rows before I caught it. Fix: instantiate a fresh placer per call, so the per-invocation cap protects each quote cycle, not the daemon lifetime.
On startup the daemon posted at --quote-size 100 before the auto-sizer had a chance to scale down to my actual cash. Kalshi rejected with insufficient_balance and the daemon hard-halted. Across about 30 minutes I watched it cycle through 3 launches and 3 crashes before I figured out what was going on. Fix: wait for the first auto-size computation to succeed before posting anything.
The whitelist was static; markets that dropped below the LIP eligibility threshold remained whitelisted. Fix: continuous per-market eligibility re-check every 5 minutes; pause quoting on markets that fall below their target size, resume when they come back.
The eligibility re-check tried to filter Kalshi’s /incentive_programs endpoint by ticker, but the endpoint doesn’t support that filter, and Kalshi has 1,690 active programs spread across 9 paginated pages. The daemon was only ever seeing page 1. Every whitelisted ticker whose program lived past page 1 returned no_active_program and got paused. Fix: paginate the full list, filter locally.
The Kalshi WebSocket feed dropped fill notifications. The daemon’s fill detection was based entirely on WebSocket events, so the daemon was blind to fills. Real fills happened (Kalshi’s /portfolio/fills REST endpoint showed 7 of them overnight), but the daemon’s audit log showed zero FILL events. Without fill detection, the daemon never flattened the resulting inventory, and adverse-selection losses accumulated silently. Fix: poll /portfolio/fills via REST every 30 seconds as a defensive fallback, regardless of what the WebSocket says.
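The last fix in that list, the REST fallback for fills, can be sketched as a small reconciler that remembers which fills it has already handled. Everything about the shape here is an assumption: fetch_fills stands in for whatever client call hits /portfolio/fills, and the "fill_id" key is a hypothetical name for a unique identifier on each fill.

```python
import time


class FillReconciler:
    """Deduplicating poller; a fill is reported at most once."""

    def __init__(self, fetch_fills):
        # fetch_fills: callable returning an iterable of fill dicts
        # (hypothetical stand-in for the REST call).
        self._fetch_fills = fetch_fills
        self._seen = set()

    def mark_seen(self, fill_id):
        """Call this from the WebSocket path so REST does not double-count."""
        self._seen.add(fill_id)

    def poll_once(self):
        """Return only fills that no earlier poll or WebSocket event reported."""
        fresh = []
        for fill in self._fetch_fills():
            fill_id = fill["fill_id"]  # assumed unique key
            if fill_id not in self._seen:
                self._seen.add(fill_id)
                fresh.append(fill)
        return fresh


def run_poll_loop(reconciler, on_fill, interval_seconds=30.0,
                  max_polls=None, sleep=time.sleep):
    """Poll on a fixed interval; max_polls=None means run until stopped."""
    polls = 0
    while max_polls is None or polls < max_polls:
        for fill in reconciler.poll_once():
            on_fill(fill)  # e.g. write a FILL audit row, then flatten
        polls += 1
        if max_polls is None or polls < max_polls:
            sleep(interval_seconds)
```

The point of the design is that the REST poll is the source of truth and the WebSocket is only a fast path. If the feed drops a notification, the next poll still surfaces the fill within one interval.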

Each of these was an honest bug. Each surfaced only in production with real money on the line. The unit tests covered the bug-free path; the bugs lived in the integration with Kalshi’s live API, which was the part the unit tests couldn’t reach. This is one of those tough lessons: systems that look correct in development can and will fail in production. The cost of these particular failures was about $0.50 of realized losses across the night they happened, plus the embarrassment of waking up to find positions I had never heard of.

Path C, strict eligibility curation. The original 2-ticker whitelist closed in early June. I built a scan script that paginated Kalshi’s full LIP program list, applied a quality filter (spread ≤ 10c, daily pool ≥ $1, depth ≥ 50 contracts at the touch, ≥72h to close, two-sided book, active program, not paid out), and selected the top 15 by risk-adjusted expected income. I expanded the structural ALLOWED_TICKERS to those 15 markets. The model predicted ~$14/day at $5k of working capital across this whitelist.
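The scan script boils down to two steps: walk every page of the program list, then filter and rank locally. The sketch below assumes a cursor-style fetch_page(cursor) callable that returns a list of items and the next cursor; that callable, the Candidate fields, and the expected_income score are hypothetical stand-ins. The thresholds are the ones listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical per-market record; field names are illustrative."""
    ticker: str
    spread_cents: int
    daily_pool_dollars: float
    touch_depth: int         # contracts resting at the touch
    hours_to_close: float
    two_sided: bool
    program_active: bool
    paid_out: bool
    expected_income: float   # risk-adjusted score from a separate model


def iter_all_pages(fetch_page):
    """Yield items from every page, not just the first one."""
    cursor = None
    while True:
        items, cursor = fetch_page(cursor)
        yield from items
        if not cursor:
            return


def passes_quality(c):
    """Quality filter with the thresholds from the text."""
    return (
        c.spread_cents <= 10
        and c.daily_pool_dollars >= 1.0
        and c.touch_depth >= 50
        and c.hours_to_close >= 72
        and c.two_sided
        and c.program_active
        and not c.paid_out
    )


def curate(candidates, top_n=15):
    """Keep eligible markets, ranked by expected income, best first."""
    eligible = [c for c in candidates if passes_quality(c)]
    eligible.sort(key=lambda c: c.expected_income, reverse=True)
    return eligible[:top_n]
```

Filtering locally after exhausting the pages is what avoids the page-1-only bug from Path B: nothing is judged missing until the whole list has been read.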

Path D, stricter eligibility. I then audited which of those 15 markets were currently paying LIP, meaning the market’s total resting volume at the touch was actually meeting the program’s target_size minimum. The honest answer was: only 3 of 15 markets were currently above their target. The remaining 12 were below, meaning the LIP pool wasn’t paying out anything for snapshots taken on them. I tightened the filter to thinner_side_depth >= target_size and re-curated. The honest top-N from the stricter filter was 4 markets, not 15, with a cumulative risk-adjusted expected income of $2.83/day at full $1,500 of deployed capital.

The model’s expected income had fallen by roughly a factor of 5 between Path C and Path D, and the reason was that Path C had not been honest about which markets were actively paying. Stricter discipline → smaller honest expectation.

Path E, runtime continuous eligibility. Markets drop below target during the day, then come back, then drop again. The whitelist is static between re-curations, so a market that fails the gate at noon would still receive quotes until the weekly re-curation. The fix was per-market eligibility re-checking every 5 minutes, with automatic pause-on-fail and resume-on-recover. The whitelist remained the structural safety net; eligibility was the runtime decision maker.
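A rough sketch of that pause-and-resume logic follows. It is an illustration under assumptions: the class name, the snapshot format (ticker mapped to a thinner-side depth and a target size), and the event labels are hypothetical. The eligibility test itself is the Path D rule, thinner_side_depth >= target_size, and the 5-minute cadence is from the text.

```python
from dataclasses import dataclass, field

RECHECK_INTERVAL_SECONDS = 5 * 60  # re-check cadence from the text


@dataclass
class EligibilityGate:
    """Whitelist is the static safety net; paused is the runtime decision."""
    whitelist: frozenset
    paused: set = field(default_factory=set)

    def recheck(self, snapshot):
        """Update pause state from a snapshot and return what changed.

        snapshot maps ticker -> (thinner_side_depth, target_size).
        A ticker missing from the snapshot is treated as ineligible.
        """
        events = []
        for ticker in sorted(self.whitelist):
            observed = snapshot.get(ticker)
            eligible = observed is not None and observed[0] >= observed[1]
            if not eligible and ticker not in self.paused:
                self.paused.add(ticker)
                events.append((ticker, "PAUSE"))
            elif eligible and ticker in self.paused:
                self.paused.discard(ticker)
                events.append((ticker, "RESUME"))
        return events

    def may_quote(self, ticker):
        """A ticker must be whitelisted and not currently paused."""
        return ticker in self.whitelist and ticker not in self.paused
```

Keeping the two layers separate means a runtime check can never add a market. It can only pause or resume one that the structural whitelist already allows.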

By the end of Path E, the architecture was complete. The bot ran 24/7, posted only on currently-eligible markets, auto-sized to cash, auto-flattened on fill, halted gracefully on insufficient balance, and respected the kill switch. And it didn’t make money.

The Truth

One Thursday morning I woke up to find Kalshi’s UI said I was down overnight. My own portfolio-value calculation, from the API, said I was up $9.45. I was not.

The portfolio value as Kalshi’s API reports it includes locked collateral at face value, treating a YES sell I had posted at 14 cents as if it were guaranteed to settle at 14 cents. Kalshi’s daily P&L was the realized plus mark-to-market change, accounting for the fact that my YES sells had been adversely filled overnight and were now underwater relative to where the touch had moved.

I dug in.

The audit log showed 0 FILL events from the night. The daemon had not detected any fills. But Kalshi’s /portfolio/fills endpoint showed 7 fills overnight, including 3 on KXTRUMPPHOTO-26JUN07 (1 YES buy, 1 YES sell, 1 YES buy), 1 on KXEOWEEK, 1 on KXTRUMPENDORSEMENTS, and 2 on KXBTCMAXMON, the last 2 of which weren’t even on the current whitelist. They were orphan fills on markets I had dropped from the whitelist in the Path D re-curation; the resting orders had been left in place, and overnight, somebody else’s order had hit them.

The realized P&L line on the affected events: −$0.49 on KXTRUMPPHOTO, −$0.07 on KXVOTEHUBTRUMPUPDOWN. Net realized: −$0.56 overnight. Open positions were additionally underwater on mark-to-market.

I had built the bot to flatten inventory on fill. The bot had not detected the fills. Therefore the bot had not flattened. Therefore I was directionally short on Trump-related contracts when the market moved against me overnight.
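The orphan fills point at a second gap besides fill detection: re-curation changed the whitelist but never cancelled what was already resting. A minimal sketch of the missing step is below, assuming resting orders arrive as dicts with hypothetical "order_id" and "ticker" keys and that cancel_order is whatever client call cancels a single order.

```python
def find_orphan_orders(resting_orders, whitelist):
    """Return ids of resting orders whose ticker is no longer whitelisted."""
    allowed = set(whitelist)
    return [
        order["order_id"]
        for order in resting_orders
        if order["ticker"] not in allowed
    ]


def cancel_orphans(resting_orders, new_whitelist, cancel_order):
    """Cancel every orphan and return the ids that were cancelled."""
    orphans = find_orphan_orders(resting_orders, new_whitelist)
    for order_id in orphans:
        cancel_order(order_id)  # hypothetical client call
    return orphans
```

Running this as part of every re-curation, and again on daemon startup, would have kept a dropped market from quietly filling overnight.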

I patched the fill detection that evening, but the lesson here is this: at $40 of working capital, an adverse-selection event costing $0.56 is roughly 3 days of expected LIP rebate incinerated in a single night. The strategy’s economics are too thin to absorb even small operational errors.

I added up the all-time numbers from my prediction-markets experiment:

All-time realized P&L: −$1.57.
All-time fees paid: $5.93.
Net result against the venue: −$7.50.

The LIP rebate that was supposed to offset all of this never showed up: 0 verified LIP credits in my cash balance after weeks of operation. It’s possible the rebate had landed and Kalshi’s API just didn’t expose it on the endpoint I could query, but one thing was clear: the rebate was not landing at a rate that exceeded the adverse-selection cost of the harvest operation.

What I have ACTUALLY learned

These are the facts.

1. Kalshi is efficiently priced for retail. Across 9 plausible candidate edges (predictive, structural, behavioral, market-making, latency), none survived validation. The combined effort was approximately 30 specialist agents working full-time, on a powered dataset of 71,316 settled markets, with proper statistical controls. 0 of 9 cleared the gate. The venue is, by my measurement, an efficient market for a retail participant. This does not preclude inefficiencies available to institutional participants with different cost structures, low-latency access, or non-public information. But for a retail account: the venue is efficient.

2. Candle-coverage survivorship bias is the dominant retail-backtest artifact. I have argued, and the academic paper I separately drafted argues in more detail, that asymmetric coverage of winning versus losing legs of a binary contract in trade-driven candle endpoints accounts for the majority of the artifact magnitude in retail prediction-market backtests. Anyone publishing a strategy on Kalshi or Polymarket who has not applied the both-legs-covered control is likely looking at this artifact rather than at a real edge. Please prove me wrong.

3. The fee wall is structural and binding. Kalshi’s fee schedule is 0.07 × p × (1-p) × N dollars per order, rounded up to the next cent, where p is the contract price in dollars and N is the number of contracts. At p = 0.5, this is 1.75 cents per contract per leg, and 3.5 cents per round-trip. Combined with the 1-3 cent typical spread, the round-trip friction on any active enter-and-exit strategy is 5-8 cents per contract. No retail-detectable predictive edge of mine has been larger than 5 cents per contract. The fee wall is not, on its own, an argument against retail trading. But combined with the venue’s already-sharp pricing, it is the second necessary kill.

4. Market-making is empirically break-even, not modestly profitable. The Twitter claims of “$100-$500/day market-making Kalshi” are, on my measurement, off by an order of magnitude in the modest case and qualitatively wrong in the aggressive case. With every lever pulled (adverse-selection control, regime filtering, inventory management, simulated zero-latency), the honest result was inside the noise of daily P&L. Specific exception: deep sports championship books with disciplined regime filtering may clear a few dollars per day at low-tens-of-thousands of capital, but the EV-to-variance ratio is poor. Not a good return on tying up that much capital.

5. LIP rebate is real but capacity-capped to a small absolute number. Kalshi’s Liquidity Incentive Program does pay, mechanically, to accounts that rest size on designated markets. The per-market pools are typically $0.50-$21 per day. After Path D’s strict eligibility curation, my model said ~$2.83/day at the structural ceiling of $1,500 of deployable capital. The realization at $40 of working capital was strictly negative once adverse-selection costs were accounted for. The mechanics are real, but the number on a retail account is too small to be operationally interesting.

6. The 3.25-3.75% APY is the only reliable positive. On a $50,000 deposit, this is approximately $1,750/year, fully passive, uncorrelated with trading activity.

7. Discretionary edges may exist; systematic edges do not. The harness I built can only catch systematic strategies, rules-based approaches that reduce to a backtest. It cannot evaluate “you, the human, having a faster or sharper interpretation of breaking news than the market in the 30 seconds before automated quoters react.” That class of edge, discretionary human edge, was not the subject of this project, and I make no claim about it one way or the other.
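The fee arithmetic in point 3 is worth making concrete. The sketch below is my reading of the formula, done in integer cents to avoid floating-point surprises at the rounding step. The function names are hypothetical, and the round-trip helper assumes you enter and exit at the same price and pay the full spread once.

```python
def order_fee_cents(price_cents: int, contracts: int) -> int:
    """Fee for one order: 0.07 * p * (1 - p) * N dollars, rounded up to a cent.

    With p = price_cents / 100, the fee in cents is
    7 * price_cents * (100 - price_cents) * contracts / 10_000.
    """
    if not 1 <= price_cents <= 99:
        raise ValueError("price_cents must be between 1 and 99")
    if contracts < 1:
        raise ValueError("contracts must be at least 1")
    numerator = 7 * price_cents * (100 - price_cents) * contracts
    return -(-numerator // 10_000)  # ceiling division on integers


def round_trip_friction_cents(price_cents: int, contracts: int,
                              spread_cents: float) -> float:
    """Per-contract friction: two fee legs plus one spread crossing (assumed)."""
    fee_per_contract = 2 * order_fee_cents(price_cents, contracts) / contracts
    return fee_per_contract + spread_cents


print(order_fee_cents(50, 100))                 # 175 cents, 1.75 per contract
print(order_fee_cents(50, 1))                   # 2 cents: the round-up bites
print(round_trip_friction_cents(50, 100, 3.0))  # 6.5 cents per contract
```

Because the round-up applies per order, small orders pay proportionally more: a single contract at 50 cents pays 2 cents per leg rather than 1.75. Any edge has to clear that plus the spread before it is worth anything.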

Wrap-Up

I went into this project hoping to find a retail trading edge on prediction markets. Instead, I built a system that disproved the premise. The verdict? The systematic-edge search is empirically complete and negative.

I’d hoped to end this blog post with “and here’s how I make $300/day on Kalshi.” Far cooler than “$1.57 of realized losses, $5.93 of fees paid, and 0 verified LIP credits.”

If you are a retail trader reading this and considering Kalshi: you are very probably going to lose money trying to trade actively. The venue’s pricing is sharper than a hand-rolled model can beat. The fee schedule is rigid enough to eat predictable edges, and the most-discussed retail strategies do not survive validation. The structural alternative, fund the account, earn the APY, is real, modest, and uncorrelated with skill.

If you are a quantitative researcher reading this: the candle-coverage survivorship-bias mechanism is novel and worth your attention. Any published retail-edge claim on a trade-driven candle endpoint deserves a re-examination with the both-legs-covered control applied. I would not be surprised if a non-trivial fraction of the existing gray literature on these venues is the same artifact.

If you are someone who has been telling yourself “I just haven’t found the strategy yet”: I held that belief for the entire 6 weeks of this project. Every dead candidate, before it was killed, felt like maybe this is the one. The MLB favorite signal at t=5.4. The fade-slight-favorite +8.7 cents. The weather longshot +4.4 cents per contract with t=9. Each looked, at the moment I saw it, like the moment when this project becomes a money machine. Each was a measurement artifact. The discipline that ended the project was not “stop looking”; it was “take the harness’s verdicts seriously,” even when they were verdicts I wanted not to be true.

The harness was the most valuable thing I built, the bot the most expensive, and the verdict the most useful.

I did not beat prediction markets. I think I learned why nobody at my scale does. Knowledge costs something to acquire. The market is more efficient than you think.