Target-aware Financial Sentiment: Why Structure Beats Confidence with LLMs

Jan 29, 2026

A close-up of a brain

AI-generated content may be incorrect.

By Sveta Shasharina and Tommi Johnsen

Abstract

Sentiment analysis in finance typically treats sentiment as a property of text as a whole, but what matters to investors is sentiment about specific entities. This paper investigates target-level sentiment attribution in financial news headlines, demonstrating that widely used sentiment models, including domain-specific financial models, systematically fail to attribute sentiment correctly when multiple signals coexist. Critically, classifier confidence cannot detect these failures; the most severe attribution errors occur when confidence is highest. We propose a hybrid architecture that (1) automatically accepts sentiment only in unambiguous cases, (2) explicitly detects structural ambiguity, and (3) escalates ambiguous cases to a reasoning-based language model instructed to judge sentiment about a specific target entity only. The system mirrors how human investors read financial news: fast heuristics for simple cases, and deliberate reasoning when ambiguity demands it.

1. Introduction

Sentiment analysis is widely used in finance to summarize news, guide trading strategies, and study market behavior. Most sentiment systems, however, implicitly treat sentiment as a property of a text as a whole. In financial contexts, this assumption is frequently invalid. What matters to investors is not whether a piece of text is positive or negative in general, but what entity the sentiment is about.

This distinction becomes critical in multi-entity or multi-clause financial news, where sentiment about markets, peers, macroeconomic conditions, or investor mood often coexists with sentiment about a specific company. In such cases, generic sentiment models frequently produce confident but misleading outputs when applied naïvely.

This paper investigates target-level sentiment attribution in financial news, with a particular focus on headlines. We show that widely used sentiment models, including domain-specific financial models, systematically fail to attribute sentiment correctly when multiple signals are present. These failures are not random. They arise from structural properties of language, such as contrastive framing, causal clauses, and macro-contextual references.

Crucially, we show that classifier confidence cannot be relied upon to detect these failures. In fact, the most severe attribution errors occur precisely when confidence and margin are high.

Based on these findings, we propose a hybrid architecture that:

Automatically accepts sentiment judgments only in narrowly defined, unambiguous cases.
Detects structural ambiguity explicitly.
Escalates ambiguous cases to a reasoning-based language model that is instructed to judge sentiment about a specific target entity only.

The resulting system mirrors how human investors read financial news: fast heuristics for simple cases, deliberate reasoning when ambiguity demands it.

2. Related Work

2.1 Sentiment Analysis in Finance

A substantial literature applies sentiment analysis to financial text, including earnings reports, news articles, and social media. Domain-specific models such as FinBERT, alongside general purpose models such as RoBERTa and SieBERT, are widely used and often report strong aggregate performance.

However, these models are trained to estimate overall emotional tone, not sentiment attributed to a specific entity. Their outputs answer the question: “How does this text feel?” rather than “What does this text say about a particular company?” See our previous studies at https://tommijohnsen.substack.com/p/what-investors-should-know-about.

2.2 Headline-Article Agreement

Prior work (Bhagwat, Vineet and Cookson, J. Anthony and Dim, Chukwuma and Niessner, Marina, The Market’s Mirror: Revealing Investor Disagreement with LLMs (January 25, 2026). FEB-RN Research Paper No. 107/2025, Available at SSRN: https://ssrn.com/abstract=5375473 or http://dx.doi.org/10.2139/ssrn.5375473) has shown that headline sentiment and article sentiment disagree in a small fraction of cases (e.g., fewer than 4%). This result is sometimes interpreted as evidence that headlines are reliable summaries of articles.

However, such agreement measures capture tone consistency, not attribution correctness. Headline and article sentiment can agree while both are incorrect with respect to a specific target company, particularly when macroeconomic or sector-level sentiment dominates.

Our work complements this literature by focusing explicitly on target attribution, rather than aggregate sentiment agreement.

2.3 LLMs and Financial Text Interpretation

Recent work has explored large language models (LLMs) for interpreting financial text, measuring disagreement, or simulating investor reasoning. Rather than treating LLMs as replacements for sentiment classifiers, we position them as conditional reasoning agents, invoked only when structural cues indicate that attribution cannot be resolved by simple classification.

3. Empirical Characterization of Sentiment Leakage

To understand how sentiment models behave under attribution ambiguity, we conducted controlled experiments on synthetic and real financial texts.

3.1 Multi-Sentiment Coexistence

Across multiple widely used models including FinBERT, Twitter-RoBERTa, SieBERT, and NLPTown, we observe consistent behavior:

All models estimate overall emotional tone, not sentiment attributed to a specific entity.
When multiple sentiments coexist, sentiment leakage is systematic, not random.
Strong emotional language in irrelevant segments frequently dominates the final output.
Neutral target sentiment is especially fragile. Unrelated sentiment often converts it into a confident positive or negative classification.

Model-specific differences largely reflect how aggressively emotional cues are amplified:

SieBERT strongly latches onto emotionally loaded words anywhere in the text.
Twitter-RoBERTa is more conservative but still fails under attribution ambiguity.
NLPTown tends to average sentiment, reducing extremes but diluting real signals.
FinBERT often defaults to negative polarity on short or mixed inputs.

These behaviors are internally consistent with model objectives. The models are not “wrong”; they are answering a different question than the one required for target-level financial inference.

This observation motivated an attempt to enforce attribution explicitly.

4. Why Sentence Filtering by Target Name Fails

A natural first attempt at enforcing attribution is to remove sentences that do not explicitly mention the target company or ticker, and then re-run sentiment analysis on the remaining text. In practice, this approach fails in predictable and fundamental ways.

4.1 Coreference Breaks Attribution

Financial writing frequently relies on coreference rather than repetition.

Example: “NVIDIA reported quarterly results after the close. The guidance was weaker than expected.”

Filtering by explicit mentions removes the evaluative sentence, producing a neutral fragment despite clear negative sentiment about the company.

4.2 Sentiment Is Distributed Across Sentences

Sentiment often emerges compositionally.

Example: “NVIDIA unveiled its next-generation AI chips on Tuesday. Analysts said demand remains strong, but margins could come under pressure.”

Filtering either sentence destroys the relationship between context and evaluation.

4.3 Multi-Entity Articles Become Brittle

Comparative and contrastive sentiment is common.

Example: “AMD warned that data-center demand is slowing. NVIDIA, by contrast, said orders remain robust.”

Keyword filtering disrupts relational meaning, leading to misattribution or loss of sentiment.

4.4 Pronouns and Generic References Evade Filtering

Generic descriptors replace repeated company names.

Example: “NVIDIA posted record revenue last quarter. The chipmaker said supply constraints are easing.”

Filtering removes relevant evaluative content tied to the target.

4.5 Short or Neutral Fragments Become Unstable

Filtering often produces degenerate inputs.

Example: “NVIDIA held its annual developer conference this week.”

Such fragments are poorly matched to model training regimes and yield unstable or biased outputs.

4.6 Synthesis

Sentence-level filtering fails not because sentiment models are weak, but because attribution in financial text is rarely local. Sentiment is distributed across clauses, relies on reference resolution, and is often expressed through contrast or implication. Removing sentences that do not explicitly name the target frequently removes the very information required for correct attribution.

This failure motivates a reframing. Instead of enforcing attribution inside long articles, we turn to a regime where attribution is already performed by the author, financial headlines.

5. Headlines as a Controlled Attribution Regime

The failure of sentence-level attribution suggests that enforcing sentiment attribution inside full articles is fundamentally brittle. However, financial news provides a natural alternative: headlines.

Financial headlines are not merely summaries. They are written to perform interpretive compression for human readers. Their implicit purpose is to answer a specific question:

“What should the reader think about this company right now?”

This makes headlines a special regime in which attribution is often already resolved by the author.

5.1 Why Headlines Are Different

Several properties distinguish headlines from longer financial text:

Attribution is explicit by design: Headlines typically name the relevant company directly, especially when sentiment is intended to be company-specific.
Interpretive selectivity: Headline writers choose which aspect of a story to emphasize. This choice implicitly resolves many ambiguities present in the full article.
Investor-oriented framing: Financial headlines are written for readers who need to make decisions about specific securities.
Structural simplicity: Headlines have constrained syntax and limited clause nesting, making structural analysis more reliable.

These properties make headlines an ideal testing ground for target-aware sentiment analysis.

5.2 Evaluation Methodology

Our evaluation uses synthetic headlines designed to systematically test the failure modes identified through initial work with FinBERT. These synthetic headlines were constructed to cover specific categories of attribution ambiguity:

Single-entity headlines with unambiguous sentiment (baseline cases)
Multi-entity headlines with contrastive framing
Headlines with macro-economic context alongside company mentions
Headlines with sector-level sentiment combined with company-specific information
Administrative or event-based headlines with neutral or non-financial content
Headlines with investor mood terms that could be misattributed to the target company

This controlled approach allows us to isolate specific structural patterns that cause sentiment leakage and test whether our hybrid architecture correctly handles each case.

6. The Confidence Trap: When High Confidence Masks Attribution Failure

A critical finding emerged when we examined the relationship between model confidence and attribution correctness: high confidence did not indicate attribution accuracy. In fact, some of the most severe attribution failures occur precisely when the model is most confident.

6.1 The Naive Gating Strategy

A common approach to sentiment analysis is to trust model outputs when confidence exceeds some threshold (e.g., 0.8 or 0.9). The intuition is appealing. If the model is highly confident, the judgment is likely correct.

This intuition fails catastrophically for target-level sentiment attribution.

6.2 Illustrative Example

Consider the headline: “Tech stocks flat, investors cautious, Nvidia shares rally”

A typical FinBERT classification produces:

Sentiment: Negative
High confidence
Large margin between top class and alternatives

The model is highly confident. By conventional gating logic, this judgment would be automatically accepted. Yet the attribution is completely wrong—the headline is positive about Nvidia specifically.

This pattern repeats systematically. The model’s certainty reflects its ability to detect emotional tone, not its ability to attribute that tone correctly.

6.3 Why Confidence Misleads

Sentiment models are trained to predict emotional polarity. When strong emotional signals are present, even if irrelevant to the target, the model detects them clearly and responds with high confidence. The confidence reflects signal clarity, not attribution correctness.

This creates a silent failure mode: confident misattribution passes through automated systems undetected, corrupting downstream applications such as portfolio signals or market analysis.

6.4 Implications

Classifier confidence cannot serve as a safety mechanism for attribution. Any system that relies on confidence-based gating will systematically accept incorrect attributions in multi-signal contexts.

This finding motivates a fundamental shift: instead of trusting confidence, we must examine structure.

7. Structural Gating Logic: Detecting Ambiguity Explicitly

Having established that confidence-based gating fails, we turn to an alternative: detect structural properties of headlines that indicate attribution ambiguity, and route these cases for deeper reasoning.

7.1 Ambiguity Indicators

In Table 1, we identify several linguistic structures that reliably signal attribution ambiguity in financial headlines:

7.2 Gating Logic Implementation

The structural gating system applies a two-stage filter:

Stage 1 - Auto-accept: Headlines with a single sentiment signal, no contrastive markers, and high classifier confidence are accepted without escalation.
Stage 2 - Escalate: Headlines containing any ambiguity indicator are routed to the LLM reasoning module, regardless of classifier confidence.

Critically, Stage 2 escalation is triggered by structure, not uncertainty. Even when the classifier is highly confident, structural markers indicate that attribution may be incorrect.

7.3 Auto-Accept Criteria

A headline is auto-accepted only when:

The target company is explicitly named
No ambiguity indicators are present
Classifier confidence exceeds a defined threshold and no structural ambiguity indicators are present
Only one clear sentiment signal exists

This conservative threshold ensures that only genuinely unambiguous cases bypass deeper reasoning.

7.4 Role of the Classifier After Reframing

Under this architecture, the sentiment classifier is no longer treated as an oracle. Instead, it serves as:

A fast path for unambiguous cases
A diagnostic signal for polarity and margin
A feature extractor whose outputs are trusted only within a restricted structural regime

This reframing eliminates the silent failure mode identified in Section 6.

8. LLM Adjudication and Investor-Oriented Interpretation

Structural gating routes ambiguous headlines to a large language model (LLM). The LLM is not used as a generic sentiment classifier. Instead, it functions as a target-aware reasoning agent, invoked only when attribution cannot be resolved through simple classification.

8.1 Role of the LLM

The LLM is tasked with answering a narrowly scoped question:

What is the financial sentiment about the target company, and only the target company?

This separation of roles is deliberate. Probabilistic classifiers handle straightforward cases efficiently, while the LLM is reserved for cases requiring reasoning about:

Attribution
Dominance between clauses
Investor-relevant implications

For all experiments, we use LLaMA 3.1 (8B) via the Ollama runtime. This model offers a practical balance between reasoning capability, latency, and deployability. At 8B parameters, it is lightweight enough for local inference while still capable of structured financial reasoning under explicit constraints.

The system is implemented entirely in Python and is modular. Both the classifier and LLM components can be replaced without altering the structural logic.

8.2 Prompt Design as Explicit Attribution Policy

A central advantage of this architecture is that attribution rules are made explicit, rather than embedded implicitly in model weights.

The LLM is prompted to act as a strict financial text judge under the following constraints:

Judge only the sentiment about the target company
If sentiment concerns the macroeconomy, sector, or another entity, the target sentiment should be neutral
If the headline concerns atmosphere, attendance, weather, or non-economic context, then is_financial = false and target_sentiment = neutral
If the headline is mixed or contrastive across clauses, default to neutral unless the target evaluation is explicit
If a concern is clearly macroeconomic and not attributed to the target, return neutral

These rules reflect policy decisions rather than empirical claims. Different users may reasonably prefer more conservative or more risk-weighted interpretations. The key point is that such choices are transparent and adjustable.

The LLM does not recover a ground-truth sentiment. It produces a policy-consistent interpretation under explicitly defined attribution rules.

9. Evaluation and Illustrative Cases

Evaluation focuses on routing correctness and attribution fidelity rather than aggregate accuracy, reflecting the goal of eliminating silent failures rather than optimizing a single scalar metric.

9.1 Routing Behavior

Across a curated test set of headline categories, the system exhibits the following behavior:

Simple, single-signal headlines are auto-accepted
Structurally ambiguous headlines are consistently escalated
No mixed-signal headline is silently accepted

This directly addresses the failure mode identified in Section 6.

9.2 Illustrative “Brag Cases”

The following examples highlight cases where FinBERT fails confidently, and the proposed architecture succeeds.

Case 1: Macro Sentiment Dominance

Headline: “Tech stocks flat, investors cautious, Nvidia shares rally”

FinBERT: Confident negative
Correct target sentiment: Positive
Our approach: Structural detection → LLM escalation → Positive

Case 2: Contrastive Framing

Headline: “Nvidia shares rise, but investors worry about valuations”

FinBERT: Negative
Correct target sentiment: Negative (risk-weighted investor interpretation)
Our approach: Contrast detected → Escalation → Negative (as per instructions to LLM)

Case 3: Sector Leakage

Headline: “Semiconductor stocks slide as yields rise, Nvidia announces major partnership”

FinBERT: Negative
Correct target sentiment: Positive
Our approach: Structural ambiguity → Escalation → Positive

Case 4: Administrative Event

Headline: “Nvidia hosts developer conference in sunny San Jose”

FinBERT: Neutral
Correct target sentiment: Neutral, non-financial
Our approach: Administrative framing → Escalation → Neutral, non-financial

Case 5: Misattributed Concern

Headline: “At Nvidia conference, investors worry about inflation”

FinBERT: Strong negative
Correct target sentiment: Neutral (concern not about Nvidia)
Our approach: Escalation → Neutral

These examples illustrate not just correctness, but control: attribution behavior is governed by explicit structure and policy, not implicit bias.

10. Computational Considerations

A natural question is why not route all headlines directly to an LLM.

The answer is both practical and epistemic:

Most financial headlines are unambiguous
Generic classifiers are fast, cheap, and stable in those cases
LLM inference is substantially more expensive

Structural gating increases runtime relative to a classifier-only baseline, while remaining far more efficient than an LLM-only approach.

The resulting hybrid system achieves a favorable tradeoff:

High precision where ambiguity exists
High efficiency where it does not

11. Discussion

This work highlights a broader lesson for applied NLP in finance: many failures attributed to model weakness are in fact question mismatches.

Sentiment models are not broken. They are answering the question they were trained to answer. Problems arise when emotional tone is treated as a proxy for target-level inference.

Our results suggest:

Sentiment is not a scalar property of text
Attribution is a structural and relational problem
Confidence is not safety when assumptions are violated

A notable advantage of this architecture is that attribution policy is not frozen into model weights. Interpretive choices can be revised without retraining, allowing systems to adapt as definitions of relevance evolve.

12. Conclusion

Target-level financial sentiment is fundamentally a structural attribution problem, not a probabilistic calibration problem.

Generic sentiment models perform well when a headline expresses a single, clearly attributed sentiment. When multiple sentiments coexist, these models consistently default to the most negative signal with high confidence, leading to silent and systematic failures under naïve gating.

The original approach failed because it treated confidence as safety.

The corrected approach treats structure as authority.

This insight motivates a hybrid pipeline in which:

Auto-acceptance is narrow and conservative
Structural ambiguity mandates escalation
LLMs are used selectively for reasoning, not brute-force classification

The resulting system mirrors how investors read financial news: trusting simple headlines when they are simple, and engaging deeper reasoning only when ambiguity demands it.

Tommi Johnsen, PhD

Discussion about this post

Ready for more?