Sentiment is not one signal
What we found when we asked Claude to decompose financial news into three information types and tested it across nine sectors
Tommi Johnsen and Sveta Shasharina
Most sentiment research treats the question as one thing. Take a corpus of news articles. Classify each as positive, negative, or neutral. Aggregate to the daily level. Correlate with next-day returns. Report a coefficient. Argue about which classifier is best.
This is tidy. It is also wrong, in a specific way that took us a while to see clearly.
The articles being classified are not the same kind of thing. An earnings beat, a $5 billion acquisition, an analyst upgrade, and a Seeking Alpha opinion piece all enter the same regression and contribute equally to the same daily ratio. The classifier averages over all of them. The literature averages over them. We averaged over them too, for a long time.
However, they are not equivalent. Each represents a different mechanism by which information gets into prices, and the academic finance literature has been studying these mechanisms separately for forty years. There is a literature on earnings response coefficients. There is a literature on event studies. There is a literature on analyst forecasts and recommendations. They have different theoretical models, different empirical methods, different findings about magnitudes and timing. When we collapse them into one sentiment label, we are throwing away most of what makes the news informative.
This post is about what happens when you stop throwing it away.
The classification prompt
We asked Claude to do something more structured than label sentiment. We asked it to first identify what kind of information the headline contains, before labeling direction. The prompt returns four fields:
{ "criterion": "quantitative" | "event" | "analyst" | "none",
  "material": true | false,
  "label": "positive" | "negative" | "neutral",
  "reasoning": "one sentence" }
Quantitative means a specific reported result, like earnings, guidance, revenue, prescriptions filled, vehicles delivered. The number is the news. Event means a binary corporate action such as M&A, FDA decision, contract award, lawsuit, executive change. Something discrete has happened. Analyst means a price target or rating change with stated direction. None means everything else: opinion pieces, sector roundups, generic market commentary, multi-company features.
These three categories aren’t arbitrary. They are the three dominant mechanisms in the academic literature for how company-specific news affects prices. Together they account for the substantial majority of the company-specific news in our corpus. “None” is the catch-all, important because most financial news flow really is sector commentary and opinion, not catalysts.
The prompt enforces that the four outputs are consistent: if criterion is none, material must be false; if material is false, label must be neutral. Direction can only attach to a criterion-matched, material headline. The reasoning field is for auditing. We don’t aggregate it, but we read it when something looks strange.
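The two consistency rules are simple enough to check mechanically on every response. The following is a minimal sketch of such a check, not the pipeline's actual validator; the function name consistency_errors and the choice to return a list of messages are assumptions made for illustration.

```python
from typing import Any

CRITERIA = {"quantitative", "event", "analyst", "none"}
LABELS = {"positive", "negative", "neutral"}


def consistency_errors(record: dict[str, Any]) -> list[str]:
    """Return rule violations for one classifier output (empty list if valid).

    Hypothetical helper: field names follow the four-field schema above.
    """
    errors = []
    criterion = record.get("criterion")
    material = record.get("material")
    label = record.get("label")

    # Basic domain checks on each field.
    if criterion not in CRITERIA:
        errors.append(f"unknown criterion: {criterion!r}")
    if not isinstance(material, bool):
        errors.append(f"material must be a boolean, got {material!r}")
    if label not in LABELS:
        errors.append(f"unknown label: {label!r}")

    # Rule 1: a headline that matches no criterion cannot be material.
    if criterion == "none" and material is not False:
        errors.append("criterion is 'none' but material is not false")
    # Rule 2: an immaterial headline cannot carry a direction.
    if material is False and label != "neutral":
        errors.append("material is false but label is not neutral")
    return errors


# A consistent record and an inconsistent one (both invented for illustration).
ok = {"criterion": "event", "material": True, "label": "positive", "reasoning": "M&A"}
bad = {"criterion": "none", "material": False, "label": "positive", "reasoning": "opinion"}
print(consistency_errors(ok))
print(consistency_errors(bad))
```

Returning messages rather than raising makes it easy to count violations across a whole corpus and to read the offending records alongside their reasoning field.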
Across 22,261 articles in nine sectors over sixteen days in April 2026, we got 71% “none,” 14% event, 10% quantitative, 5% analyst. The constraints held in 22,260 of 22,261 cases. So the prompt is doing what we asked it to do, reliably.
Here are four worked examples, one per criterion:
“Eli Lilly acquires Kelonia Therapeutics for $1.3B” → event, positive. M&A with a stated dollar amount.
“Tesla reports Q1 vehicle deliveries of 336,681, missing 390,000 estimate” → quantitative, negative. Reported number versus consensus.
“BofA raises NVDA price target to $300, reiterates Buy” → analyst, positive. Sell-side action with explicit direction.
“5 AI Stocks Worth Watching This Quarter” → none, neutral. Multi-company commentary; not about any individual ticker.
Where the criteria disagree
The most interesting cases in our corpus are the ones where Claude flags multiple criterion-matched articles for the same ticker on the same day, pointing in opposite directions. There are 222 such cases across the nine sectors. They are the clearest evidence that financial news isn’t one signal but several channels that say different things about the same company at the same time.
Three examples:
Amazon, April 27, 2026
event + “Amazon Backs X-Energy IPO To Support AI Power Needs And Valuation”
quant − “Amazon Q1: $200B In FY26 CapEx For A $15B Run-Rate Story”
Both headlines describe Amazon’s AI investment strategy. The event channel reads it as a positive corporate move; the quantitative channel reads it as poor capital efficiency. A single sentiment label would average these to neutral. Both are correct readings of the underlying news.
Booking Holdings, April 22, 2026
quant + “Booking Holdings Up 5.4% After Blockbuster Stock Split, Beat-and-Raise Quarter And Bigger Buybacks”
analyst − “Benchmark cuts Booking Holdings stock price target on travel trends”
A reported beat against an analyst price-target cut on the same day. Quantitative says the company outperformed; analyst says forward-looking concerns warrant a lower target. Two channels saying different things about the same company at the same moment.
Carvana, April 21, 2026
analyst + “BofA raises target for eBay, Carvana as e-commerce rebounds despite macro concerns”
quant − “Cost Issues Hurt Carvana (CVNA) in Q1”
Sell-side optimism on the macro setup against reported cost pressure in the quarter. The analyst is bullish on what comes next; the quantitative report is bearish on what just happened.
In each case the two signals are not in error. They are answering different questions. The literature, when it averages headlines into one sentiment ratio per day, treats them as if they were answering the same question.
They are not.
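Finding these disagreements is a grouping exercise. The sketch below shows one way to do it, assuming each classified article is a dict with ticker, date, criterion, and label keys (hypothetical field names). It also assumes that a disagreement requires the opposing labels to come from different criteria, which matches the three examples above but is an interpretation rather than a stated rule.

```python
from collections import defaultdict


def find_conflicts(articles):
    """Return {(ticker, date): [(positive_criterion, negative_criterion), ...]}
    for ticker-days where different criteria point in opposite directions."""
    by_key = defaultdict(lambda: {"positive": set(), "negative": set()})
    for a in articles:
        # Only criterion-matched, directional articles can disagree.
        if a["criterion"] == "none" or a["label"] == "neutral":
            continue
        by_key[(a["ticker"], a["date"])][a["label"]].add(a["criterion"])

    conflicts = {}
    for key, sides in by_key.items():
        pairs = [
            (pos, neg)
            for pos in sides["positive"]
            for neg in sides["negative"]
            if pos != neg  # assumption: same-criterion clashes are not counted
        ]
        if pairs:
            conflicts[key] = sorted(pairs)
    return conflicts


# The Carvana example from the text, plus a hypothetical ticker with no conflict.
articles = [
    {"ticker": "CVNA", "date": "2026-04-21", "criterion": "analyst", "label": "positive"},
    {"ticker": "CVNA", "date": "2026-04-21", "criterion": "quantitative", "label": "negative"},
    {"ticker": "XYZ", "date": "2026-04-21", "criterion": "event", "label": "positive"},
]
print(find_conflicts(articles))
```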
Cross-sector results
If the criteria mean different things, that is, if a quantitative headline and an event headline are doing different work, then we should expect their predictive power to vary by sector. Different industries have different mixes of news flow, different relationships between news and price, different speeds of repricing.
We computed, for each sector and each criterion, the spread between average next-day returns on positive-labeled days and negative-labeled days. Returns are adjusted for sector-wide moves: each daily return is reduced by the mean return across the sector that day, so a sector that happened to be selling off doesn’t look spuriously informative.
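The spread calculation can be sketched in a few lines. This is an illustration under stated assumptions, not the code behind the matrix: the field names are hypothetical, returns are taken to be in percentage points, and the sector-day mean return is passed in as a precomputed lookup (sector_mean) because the post does not specify exactly which set of stocks that mean is taken over.

```python
from collections import defaultdict
from statistics import mean


def excess_return_spreads(rows, sector_mean):
    """Compute mean excess return on positive days minus negative days.

    rows: dicts with sector, date, criterion, label, next_day_return.
    sector_mean: {(sector, date): mean next-day return across the sector}.
    Returns {(sector, criterion): spread}, only where both labels occur.
    """
    cells = defaultdict(list)
    for r in rows:
        if r["label"] == "neutral":
            continue  # only directional labels enter the spread
        # Subtract the sector-wide move for that day.
        excess = r["next_day_return"] - sector_mean[(r["sector"], r["date"])]
        cells[(r["sector"], r["criterion"], r["label"])].append(excess)

    spreads = {}
    for sector, criterion in {(s, c) for (s, c, _) in cells}:
        pos = cells.get((sector, criterion, "positive"))
        neg = cells.get((sector, criterion, "negative"))
        if pos and neg:
            spreads[(sector, criterion)] = mean(pos) - mean(neg)
    return spreads


# Invented numbers, purely to show the mechanics.
rows = [
    {"sector": "Tech", "date": "d1", "criterion": "quantitative",
     "label": "positive", "next_day_return": 2.0},
    {"sector": "Tech", "date": "d1", "criterion": "quantitative",
     "label": "negative", "next_day_return": -1.0},
]
print(excess_return_spreads(rows, {("Tech", "d1"): 0.5}))
```

Cells with no negative (or no positive) observations are dropped rather than reported as zero, which keeps thin cells from masquerading as a null result.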
Excess-return spread by sector and criterion, in percentage points. Green: positive-labeled days out-return negative-labeled days by 0.5pp or more. Red: the reverse. Sample sizes vary; the most extreme cells in Materials and Industrials sit on smaller negative samples and should be read as suggestive rather than precise.
There is no universal criterion. Different sectors transmit signal through different channels. Quantitative is the strongest channel in Information Technology and Materials, where reported results carry information the market hasn’t already priced.
The quantitative channel actively inverts in Consumer Staples, where the spread is −2.4 percentage points: positive-labeled days are followed by lower excess returns than negative-labeled days. One reading is that by the time the headline runs, the news is already fully reflected in price and any further enthusiasm is a contrarian signal.
Event spread is weaker than we expected going in. The hypothesis was that events would be the cleanest channel, since discrete corporate actions are exactly what event-study research has built itself on for decades. But the event spread is positive in only four of nine sectors, and strongly negative in two. The most plausible reason: events vary too much in size for a single label to capture them. A $50M contract and a $5B acquisition both get labeled “event-positive,” and the market response differs by orders of magnitude. The label captures direction; magnitude is lost.
The biggest surprise was the analyst column. We had been operating on a prior, based on earlier work in this series on a single sector, that analyst commentary was the weakest channel everywhere. Sell-side calls move to clients before they reach Google News. Aggregator coverage is low-information by construction.
We were confident in this.
Cross-sector data does not support that confidence. Analyst spread is positive in six of nine sectors. The strongest analyst signals come from Industrials (+2.4pp), Energy (+1.3pp), and Materials (+0.9pp): sectors where the underlying business is cyclical and rate-sensitive, and where sell-side commentary tends to incorporate macro context the headline-level news flow doesn’t carry. An analyst raising a target on a steel producer is communicating something dense about steel demand, capacity, and pricing power. The price-target headline summarizes that view in a way the rest of the news flow doesn’t replicate. In sectors where analyst coverage is more crowded and more retail-attention-driven, as in Health Care and Information Technology, the same kind of headline is one of many and tends to be pre-priced. So the channel appears to work where the analysts themselves are doing useful synthesis, and not where they’re merely chasing momentum. That is a more interesting story than “analyst is noise.”
Consumer Staples is the only sector where every channel is weak or inverted. The rest of the time, every sector has at least one channel doing real work, but they don’t agree on which one.
What signal looks like, up close
Before anyone gets excited about specific spread magnitudes, it’s worth showing what the underlying distributions actually look like. The cross-sector matrix above shows means. The means are real, but they live inside very wide distributions.
Box plots of next-day excess returns by criterion-label combination, pooled across all nine sectors. Boxes show interquartile range; whiskers extend to the 5th and 95th percentiles; black diamonds mark means. Medians sit on or near zero in every panel.
Two things are visible here that the matrix hides.
First, the medians sit at or near zero everywhere. The spreads we report are driven by means, which are pulled by tail observations: a handful of days when the news really moved the stock and the label was right. Most days, even on positive-labeled days, the next-day return is essentially zero. The classifier is not mostly right; it is occasionally very right, and the tail observations carry the result.
Second, the distributions for positive and negative labels overlap heavily. The boxes are nearly the same shape. For a single article, knowing the label barely tells you anything about the next day. The signal is not a per-article forecast. It is a slight shift in the mean of a distribution that mostly looks like noise.
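A toy calculation makes the mean-versus-median point concrete. The returns below are invented for illustration and are not drawn from our data; the helper name tail_dependence is likewise hypothetical. The sketch compares the mean, the median, and the mean after dropping the single largest-magnitude observation.

```python
from statistics import mean, median


def tail_dependence(returns, k=1):
    """Compare the mean with the median and with the mean after dropping
    the k observations of largest absolute size."""
    # Sort by magnitude so the biggest moves (either direction) come last.
    trimmed = sorted(returns, key=abs)[:-k] if k > 0 else list(returns)
    return {
        "mean": mean(returns),
        "median": median(returns),
        "mean_without_tails": mean(trimmed),
    }


# Ten hypothetical next-day excess returns (pp) on positive-labeled days:
# nine small moves centered on zero, and one day when the news mattered.
returns = [-0.3, -0.2, -0.1, 0.0, 0.0, 0.0, 0.0, 0.1, 0.2, 6.0]
print(tail_dependence(returns))
```

In this toy sample the median is zero and the mean is positive only because of one observation; remove it and the mean turns slightly negative. That is the shape the box plots describe.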
This matters for how the result should be used. A trading strategy that bets on individual headline signals is fighting variance that swamps any plausible expected return. A pricing or risk-management strategy that uses sector-aggregated criterion-specific sentiment as one signal among several is closer to the right shape.
What this means for the literature
If financial sentiment decomposes the way the matrix above suggests, several common claims need adjustment.
Comparisons of classifier accuracy on benchmarks like Financial PhraseBank are measuring agreement on a particular implicit weighting across information types. A classifier that excels at quantitative-heavy text will report high accuracy but will not necessarily generalize to event-heavy or analyst-heavy corpora. The benchmark numbers in the literature are not lies, but they are not what readers think they are either.
Reporting a single sentiment-return correlation across a heterogeneous corpus is averaging across cells whose magnitudes range from −2.4pp to +4.4pp. The aggregate number is real, but it is hiding the structure that actually does the work.
Sentiment-based strategies that operate at the universe level are competing with strategies that select on the cells where signal concentrates. The latter have substantially more statistical power, even before asking whether anything is exploitable after costs.
None of this means classifier work is wasted, or that sentiment doesn’t matter. It means that sentiment is not one thing. It is a family of related signals, and progress requires acknowledging the family structure rather than averaging over it.
Limitations
Sixteen days is short.
The cross-sector pattern is suggestive of structural sectoral differences but does not yet support strong claims about magnitudes. We are running the same classification on quarterly data and will report whether the matrix holds.
Several criterion-by-sector cells have small negative-observation samples.
Materials event has nine negative observations, Energy analyst has nine, and Industrials event is somewhat thin. We have flagged these in the matrix as suggestive.
The classification is itself a feature of our pipeline.
Articles “Claude classified as quantitative” are not necessarily “objectively quantitative articles.” A different prompt would partition the corpus differently. We have not exhaustively explored that space.
The corpus is Google News RSS, which over-represents aggregator coverage and under-represents primary sources. In sectors where this matters most, such as Health Care clinical-trial coverage or Energy commodity reporting, the criterion distribution may differ from what a primary-source-weighted corpus would show.
What will we do next?
We will be testing whether the cross-sector pattern replicates on a 90-day window.
If the matrix on quarterly data resembles the matrix here, the structural interpretation strengthens. If it differs substantially, the patterns above are window-specific and the right question becomes which features of a window drive criterion-level performance.
Within a criterion, what features predict spread strength?
A $50M contract and a $5B acquisition are both labeled event-positive. Magnitude is the obvious next axis to extract. There is more to do.
We will also test whether sector-aware prompting helps. The current classifier doesn’t know which sector a ticker belongs to. Given how much sector context determines which channel transmits signal, telling the classifier the sector and adjusting the materiality threshold accordingly may produce cleaner per-sector signals than treating all sectors with a uniform prompt.
A note about the data underlying this post: it lives in a single window in April 2026.
Markets are not static, and any specific spread number could move materially in a different window. What we are most confident in is the heterogeneity itself: no single criterion works everywhere, and the cells that are strong tend to differ across sectors in ways that map onto real differences in those sectors’ news flow. The specific numbers are starting points.
The pattern is what to take away.





