The Harness Is Where the Value Lives
"Harness engineering" is being talked about as a new discipline. The term is new. The problem isn't.
In early 2026, an OpenAI publication on internal agent infrastructure popularized a simple formula: Agent = Model + Harness. Mitchell Hashimoto, Martin Fowler, and others gave the concept shape. Harness is a better word than "glue code." But it carries an implication worth examining: that this is something the agent era created. It isn't. ML engineers have been building harnesses for a decade. We called it application code, heuristics, serving logic, business rules, the post-processing step that keeps the model from returning something the product can't use. In 2015, Google's "Hidden Technical Debt in Machine Learning Systems" made the same point with a diagram: a small black box for the model, surrounded by a vast infrastructure of data collection, feature engineering, serving, monitoring, and configuration. Real-world ML systems are mostly not ML code. We named the debt. We didn't name the responsibility.
How harness ownership drifted
The division of labor made sense at the time. ML teams focused on model quality: data, features, training, evaluation. Application engineers owned the product and wired model outputs into the UX. The boundary was clean, and it worked when models were advisory: a recommendation score, a ranking signal, a classification a human might override. There was a second reason. Early model quality was the bottleneck: if the model was mediocre, the integration didn't matter much. Improving the model produced the most visible improvements in product behavior, so ML teams optimized for what moved the metric. The harness was handled by whoever was available.
Then models got better. The fallback logic that compensated for earlier weaknesses got quietly removed. Confidence thresholds were tuned upward. Rules that said "if confidence is below X, show the heuristic" were refactored out because the model was now accurate enough. Each step was reasonable.
The cumulative effect was that the harness drifted further from ML ownership and became less visible at the same time. By the time agent-era systems started failing in visible, expensive ways, most teams had no clear owner for the integration layer.
What the harness actually is
Not serving infrastructure. Not the deployment pipeline. Not monitoring dashboards. The harness is the product integration layer: what triggers the model, what gets done with its output before a user or system sees it, how edge cases and low-confidence predictions are handled, what happens when the model is unavailable. It's the logic between the model's prediction and the behavior you actually want. A model that outputs a score doesn't produce a feature. The feature is what happens when that score is interpreted, thresholded, combined with other signals, and surfaced in a way that changes user behavior. The harness is the gap between the score and the behavior change.
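A minimal sketch of that integration layer, to make the shape concrete. Every name, threshold, and rule here is hypothetical; the point is only that the shipped behavior is a function of harness rules, not of the score alone.

```python
from dataclasses import dataclass

@dataclass
class HarnessDecision:
    action: str           # "show_model_result", "show_heuristic", or "suppress"
    score: float | None   # raw model output, kept for instrumentation
    reason: str           # which harness rule fired

def apply_harness(score: float | None, signals: dict) -> HarnessDecision:
    """Turn a raw model score into the behavior the product actually ships."""
    if score is None:                          # model unavailable or timed out
        return HarnessDecision("show_heuristic", None, "model_unavailable")
    if signals.get("user_opted_out"):          # business rule the model knows nothing about
        return HarnessDecision("suppress", score, "user_opted_out")
    if score < 0.7:                            # low confidence: fall back rather than guess
        return HarnessDecision("show_heuristic", score, "below_confidence_threshold")
    return HarnessDecision("show_model_result", score, "confident_prediction")
```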
A useful analogy: in pharma, an active ingredient isn't a drug. The drug is the active ingredient plus a formulation that gets it to the right tissue, at the right concentration, at the right time, without interacting badly with everything else in the body. Shipping a model without owning its harness is shipping the active ingredient and hoping someone else figures out the formulation.
Why owning the harness matters
There are five specific ways ceding the harness costs you.
The feedback loop. Training data is shaped by harness decisions. If the application engineer thresholds aggressively, filtering out low-confidence outputs because they don't trust them, the hard cases stop generating user feedback. The next model trains on a distribution the harness already filtered. It evaluates well. It gets worse in production on exactly the cases it needed to improve. The degradation looks like a model problem. It's a harness problem from eighteen months ago.
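A toy simulation of that loop, with invented numbers, just to show the shape of the problem: if hard cases tend to score low and only high-scoring outputs are surfaced (and therefore collect feedback), the next training set under-represents exactly the cases the model needed.

```python
import random

random.seed(0)

def example():
    hard = random.random() < 0.3
    # Hard cases tend to get lower scores -- exactly the cases the next model needs.
    score = random.uniform(0.2, 0.8) if hard else random.uniform(0.5, 1.0)
    return {"score": score, "hard": hard}

traffic = [example() for _ in range(10_000)]

# Harness decision made upstream of the ML team: only surface confident outputs.
surfaced = [ex for ex in traffic if ex["score"] >= 0.7]

# User feedback, and therefore future labels, only exists for what was surfaced.
hard_in_traffic = sum(ex["hard"] for ex in traffic) / len(traffic)
hard_in_next_training = sum(ex["hard"] for ex in surfaced) / len(surfaced)
print(f"hard cases in production traffic: {hard_in_traffic:.0%}")      # ~30%
print(f"hard cases in next training set:  {hard_in_next_training:.0%}") # ~11%
```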
Interpretive context. To define the harness correctly, you have to understand how the application generates the data the model consumes. A timestamp in an event log means something specific about user behavior: whether it's a session-start event or an idle-timeout event changes the interpretation entirely. Teams separated from application engineers build harnesses from data dictionaries. Data dictionaries don't capture that context. You end up thresholding on signals that mean something different than you thought they did.
The fallback gap. When ML teams don't own fallback logic, fallbacks get removed without a conscious decision, or they never get designed in the first place. When the model was 88% accurate and the fallback handled the rest, the product worked. When someone removes the fallback because the model is now 92%, the remaining 8% fails in production instead of being caught. That's not a model failure. It's a harness decision nobody made explicitly.
Drift diagnosis. Features degrade in ways that can look like a model problem or an integration problem. If you don't own the harness, you can't tell the difference. You investigate the model, find nothing, and the degradation persists. The people who understand the model don't own the surface where its behavior becomes observable.
Value realization. The model's potential value is not the product's realized value. The gap between them is the harness. If the harness is owned by someone who doesn't know the model, that gap gets filled by intuition. The feature ships. It works, mostly. The value the model was capable of delivering never fully materializes.
How to start owning the harness
This doesn't require a reorganization. It requires a scope expansion. Write a harness spec before the model spec. For any new feature, define the integration layer before model work begins: what triggers the model, how its output gets transformed, what happens on low confidence, what the fallback is, how edge cases are handled. The spec belongs to the ML team. Writing it first forces the right conversations with application engineers before the harness gets built without you.
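A sketch of what such a spec might cover if it lives in the repo as data. The fields and values below are illustrative, not a prescribed template.

```python
# harness_spec.py -- owned by the ML team, written before model work begins,
# and reviewed with the application engineers who will build against it.
# Every field and value here is illustrative.
HARNESS_SPEC = {
    "trigger": "user opens checkout with at least one item in the cart",
    "model_input": "session events from the last 30 minutes, as emitted by the web client",
    "output_transform": "raw score -> show/hide via confidence threshold",
    "confidence_threshold": 0.7,
    "low_confidence_behavior": "fall back to the rules-based recommender",
    "model_unavailable_behavior": "fall back to the rules-based recommender and log the outage",
    "edge_cases": [
        "empty session history: skip the model entirely",
        "opted-out users: never call the model",
    ],
}
```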
Put ML engineers in application code reviews. Any code that transforms, thresholds, or routes model outputs should have an ML engineer on the PR. Not to own the whole service, but to own the integration decisions. It's the lowest-friction way to maintain interpretive context without a reorganization.
Instrument the harness separately from the model. Log what the harness did, not just what the model predicted. Which inputs triggered the fallback? Which outputs were filtered? Which thresholds were hit? Without this, you cannot trace production behavior back to its cause.
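One way to do that, shown here as a hypothetical structured-log wrapper. The field names are assumptions; the point is that every harness decision becomes a traceable event distinct from the model's prediction.

```python
import json
import logging
import time

log = logging.getLogger("harness")

def log_harness_decision(request_id: str, score: float | None,
                         action: str, reason: str, threshold: float = 0.7) -> None:
    """Emit one structured event per harness decision, not just per model call."""
    log.info(json.dumps({
        "request_id": request_id,     # join key back to the model's own prediction log
        "model_score": score,
        "harness_action": action,     # e.g. "show_model_result", "show_heuristic"
        "harness_reason": reason,     # e.g. "below_confidence_threshold"
        "threshold_in_effect": threshold,
        "ts": time.time(),
    }))
```

With something like this in place, "which inputs triggered the fallback" is a log query instead of an investigation.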
Version the harness alongside the model. When you retrain, ask what the harness did to training data during the last cycle. Aggressive filters? Fallbacks that replaced model outputs with heuristics? The harness shapes the signal the model learns from. Treating them as independent artifacts misses the dependency.
Document threshold and fallback decisions, not just values. "Threshold is 0.7 because we saw a 12% false positive rate above that on the eval set" is a decision. "Threshold is 0.7" is a number someone put in a config file. When the distribution shifts, the first is revisable. The second is a number nobody wants to touch.
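One low-ceremony way to do both is a small versioned config with the rationale inline. The numbers below echo the example in the text; everything else is made up.

```python
# harness_config.py -- versioned alongside the model artifact it was tuned against.
MODEL_VERSION = "recsys-v12"   # the threshold below was tuned on this model's score distribution

CONFIDENCE_THRESHOLD = 0.70
# Decision: above 0.70 we measured ~12% false positives on the eval set, which
# product accepted. Revisit when the eval set is refreshed or the input
# distribution shifts; a retrain invalidates this number until re-evaluated.

FALLBACK = "rules_based_recommender_v2"
# Decision: the heuristic still beats the model on very short sessions, so it
# stays even though headline accuracy improved. Removing it is a harness
# decision, not a cleanup task.
```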
The "harness engineering" discourse is a useful prompt. But if it's new vocabulary for an old responsibility, the question isn't whether to build harnesses? ML teams always did. The question is whether the ML team is in the room when those decisions get made, or whether they're happening without them and the consequences are being mistaken for model problems.
The harness was always there. The question is who owns it.