
The Human Attention Model.

Soma — Simulated Observation of Media Attention — is a neural model of where, when, and why human visual attention rewards content. This paper outlines the architecture, the cognitive foundation, and how Soma predicts retention before publish.

Whitepaper · v1.0 · Noreason Research

Abstract

For a decade, the creator economy has optimised content for algorithms, not for the brain that watches it. The result: trillions of frames published blind, billions in wasted media spend, and a feedback loop that resolves only after the algorithm has already decided. This paper introduces Soma, a multimodal neural model that predicts viewer attention frame-by-frame — collapsing the post-hoc analytics loop into a pre-publish forecast. Soma surfaces cognitive load, pattern-interrupt density, emotional arousal, and archetype match across the full timeline of a video, returning a retention curve and editable cut points before the file leaves the editor.

01 — Problem

The retroactive trap.

Every video published today is judged twice: once by a recommendation algorithm and once by a human nervous system. Both judgements happen after publish — long after the only window in which the creator could have changed anything.

The current state of practice is post-hoc analytics: ship the video, watch the retention curve, then guess at what to do next time. By the time a creator has enough data to learn the lesson, the algorithm has already deprioritised the post and the production cycle has moved on. The result is structural: an industry that optimises for the algorithm because the algorithm is the only thing it can measure in time.

02 — Foundation

Three things the brain reliably does.

Soma rests on three findings from cognitive neuroscience that hold across language, genre, and culture.

  1. Bounded attention windows. Working memory tolerates ~3–7 seconds of unstructured stimulus before disengagement. Most "first-second" hooks miss because they violate this window, not because they're boring.
  2. Pattern interrupts compound. Disengagement is not gradual — it's punctuated. Each interrupt either resets the window or drops the viewer through it. Density and timing matter more than novelty.
  3. Cognitive load is two-tailed. Both overwhelm (too dense) and boredom (too sparse) trigger drop-off. There is a measurable "optimal load" zone, narrow but stable across viewers.
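The first two findings can be sketched as a toy model in which a viewer stays as long as every gap between pattern interrupts fits inside the attention window. The function name, the 5-second default window (picked from the ~3–7 second range above), and the rule that every interrupt fully resets the window are illustrative assumptions, not a description of Soma's internals.

```python
from typing import Optional, Sequence


def first_disengagement(
    interrupt_times: Sequence[float],
    duration: float,
    window: float = 5.0,
) -> Optional[float]:
    """Return the time (s) at which the attention window first expires, or None.

    Toy rule (an assumption): each interrupt fully resets the window, and the
    viewer disengages once `window` seconds pass without an interrupt.
    """
    last_reset = 0.0
    for t in sorted(interrupt_times):
        if t > duration:
            break
        if t - last_reset > window:
            return last_reset + window
        last_reset = t
    # Check the tail of the video after the final interrupt.
    if duration - last_reset > window:
        return last_reset + window
    return None


# Same 20-second video, two hypothetical interrupt schedules.
dense = first_disengagement([4.0, 8.0, 12.5, 16.0], duration=20.0)
sparse = first_disengagement([4.0, 12.0], duration=20.0)
print(dense, sparse)
```

In this toy setting the dense schedule never lets the window expire, while the sparse one loses the viewer in the gap after the first interrupt, which is the sense in which density and timing matter more than novelty.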

03 — Architecture

Soma in three stages.

A video enters Soma as three parallel modalities — pixels, audio, transcript — and exits as five attention signals plus a recombined retention curve.

[Figure 1: architecture diagram. Inputs: video frames (RGB, 30 fps), audio (spectral, prosody), transcript (multilingual). Soma core: multimodal encoder with attention head. Outputs: cognitive load (per-frame, 0–100), emotional arousal (sympathetic peaks), hook strength (first-window forecast), drop-off timestamps (ms-precise), archetype match (channel-relative).]
Figure 1. Three modalities enter the Soma core; five signals exit, each scored at frame resolution.

Each modality is encoded into a shared attention space before late-fusion. Frame-level outputs are recombined into a single retention curve via a learned aggregator, calibrated against held-out viewer panels.
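As a rough illustration of the data flow described above, the sketch below projects three per-frame feature streams into a shared space, fuses them late by concatenation, and turns a per-frame score into a retention curve. Everything concrete here is an assumption made for the example: the feature sizes, the random untrained weights, and the hazard-to-survival aggregator standing in for the learned, panel-calibrated one.

```python
import numpy as np

rng = np.random.default_rng(0)
n_frames, d_shared = 90, 16  # 3 s of video at 30 fps; shared width is arbitrary

# Stand-ins for per-frame features from each modality encoder (hypothetical sizes).
features = {
    "video": rng.normal(size=(n_frames, 64)),
    "audio": rng.normal(size=(n_frames, 32)),
    "transcript": rng.normal(size=(n_frames, 24)),
}

# One linear projection per modality into the shared space (random, untrained).
projections = {
    name: rng.normal(scale=0.1, size=(x.shape[1], d_shared))
    for name, x in features.items()
}
shared = {name: features[name] @ projections[name] for name in features}

# Late fusion: concatenate the already-encoded modalities frame by frame.
fused = np.concatenate(
    [shared[m] for m in ("video", "audio", "transcript")], axis=1
)

# Stand-in aggregator (an assumption): a linear head gives a per-frame hazard,
# i.e. the chance of leaving at that frame, and retention is cumulative survival.
w = rng.normal(scale=0.1, size=fused.shape[1])
hazard = 0.02 / (1.0 + np.exp(-(fused @ w)))  # strictly between 0 and 0.02
retention = np.cumprod(1.0 - hazard)

print(fused.shape, retention[0], retention[-1])
```

Modelling retention as cumulative survival keeps the curve non-increasing by construction; a production aggregator may make a different choice.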

04 — Inference

The retention curve, returned millisecond-resolved.

Soma's headline output is a per-frame retention forecast. The shape of the curve — not just the area under it — is the signal that informs every downstream decision.

[Figure 2: sample retention curve. Vertical axis: predicted retention, 0% to 100%. Horizontal axis: video time, 0s to 60s, with ticks at 5s, 22s, and 45s. Annotations: 5-second cliff, mid-video slump, two pattern interrupts.]
Figure 2. A typical Soma retention forecast: the cliff, the slump, and the recovery moments are localised before publish.

The curve is segmented into bands: the 5-second cliff (where ~40% of all uploads lose >50% of viewers), the mid-video slump (the 18–28s zone where pacing fatigue compounds), and the recovery shoulders where pattern interrupts can rescue an otherwise descending trajectory. Editors can ask the model to suggest cuts that flatten any of these bands, with no retraining required.
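A minimal sketch of how the bands above could be read off a per-frame curve. The function name `band_report`, the 30 fps default, and the synthetic exponential curve are assumptions for illustration; the 50% threshold at 5 seconds and the 18–28s zone come from the text.

```python
import numpy as np


def band_report(retention: np.ndarray, fps: float = 30.0) -> dict:
    """Summarise a per-frame retention curve (values in 0..1) by band."""
    t = np.arange(len(retention)) / fps
    # Retention at the 5-second mark (or at the last frame of a shorter clip).
    at_5s = float(retention[min(int(5 * fps), len(retention) - 1)])
    slump = (t >= 18.0) & (t <= 28.0)
    if slump.any():
        slump_drop = float(retention[slump][0] - retention[slump][-1])
    else:
        slump_drop = 0.0
    return {
        "retention_at_5s": at_5s,
        "cliff": at_5s < 0.5,      # more than half the viewers gone by 5 s
        "slump_drop": slump_drop,  # retention lost across the 18-28 s zone
    }


# Synthetic 60-second curves, not real forecasts.
t = np.arange(60 * 30) / 30.0
steep = band_report(np.exp(-t / 6.0))
flat = band_report(np.full(60 * 30, 0.9))
print(steep["cliff"], flat["cliff"])
```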

05 — Signals

Eight signal categories.

Each Soma inference produces eight per-frame signal channels. They are independent — no single channel collapses into another — and they are interpretable.

  1. Pattern interrupt. Density and timing of cuts, jumps, contradictions.
  2. Cognitive load. Per-frame information density vs working memory.
  3. Emotional arousal. Sympathetic-system spikes: surprise, awe, fear.
  4. Hook strength. First-window forecast: predicted 5-second survival.
  5. Pacing rhythm. Cut cadence vs the channel's archetype baseline.
  6. Payoff timing. Distance from claim to resolution; promise debt.
  7. Audio resonance. Prosody, music drop, dialogue rhythm, as separate channels.
  8. Archetype match. Similarity to the channel's historic top quartile.
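One way to picture the eight channels is as a container holding one array per channel, one value per frame. The class name `FrameSignals`, the field names, and the random data are hypothetical; audio resonance is collapsed into a single array here for brevity, although the text describes its components as separate channels.

```python
from dataclasses import dataclass, fields

import numpy as np


@dataclass
class FrameSignals:
    """Eight per-frame channels; every array holds one value per frame."""

    pattern_interrupt: np.ndarray
    cognitive_load: np.ndarray
    emotional_arousal: np.ndarray
    hook_strength: np.ndarray
    pacing_rhythm: np.ndarray
    payoff_timing: np.ndarray
    audio_resonance: np.ndarray
    archetype_match: np.ndarray

    def as_matrix(self) -> np.ndarray:
        """Stack the channels into a (frames, 8) array."""
        return np.stack([getattr(self, f.name) for f in fields(self)], axis=1)


# Random placeholder data for 300 frames (10 s at 30 fps).
rng = np.random.default_rng(1)
signals = FrameSignals(*(rng.random(300) for _ in range(8)))
matrix = signals.as_matrix()

# Pairwise correlation between channels: a rough check that no channel
# simply duplicates another.
corr = np.corrcoef(matrix, rowvar=False)
print(matrix.shape, corr.shape)
```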

06 — Cognitive load

The two-tailed problem.

Cognitive load is the most misunderstood signal in video. Editors instinctively maximise stimulus density; viewers leave when stimulus exceeds working-memory capacity. Soma scores this on a normalised 0–100 scale per frame, with five interpretable bands.

[Figure 3: cognitive-load spectrum, 0–100. Bands: Boredom 0–14, Drift 15–37, Optimal 38–67, High load 68–84, Overwhelm 85–100. Target marker at 62.]
Figure 3. The cognitive-load spectrum. Disengagement spikes at both ends; the optimal band is narrow and stable.

The most common editing failure is parking an entire video in the high-load band. Soma surfaces these stretches as red bars on the timeline, with cut suggestions that redistribute load, usually by extending pauses rather than removing content.
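The band boundaries in Figure 3 translate directly into a lookup, and sustained high-load stretches can then be found by scanning the per-frame scores. This is a sketch under assumptions: the function names, the rounding of scores to integers, the 30 fps default, and the one-second minimum stretch are all choices made for the example.

```python
from typing import List, Sequence, Tuple

# Band boundaries as listed in Figure 3 (inclusive integer ranges).
BANDS = [
    (0, 14, "boredom"),
    (15, 37, "drift"),
    (38, 67, "optimal"),
    (68, 84, "high load"),
    (85, 100, "overwhelm"),
]


def band_of(score: float) -> str:
    """Map a 0-100 cognitive-load score to its band name."""
    s = round(score)
    for lo, hi, name in BANDS:
        if lo <= s <= hi:
            return name
    raise ValueError(f"score out of range: {score}")


def stretches(
    scores: Sequence[float],
    fps: float = 30.0,
    band: str = "high load",
    min_seconds: float = 1.0,
) -> List[Tuple[float, float]]:
    """Return (start_s, end_s) spans where consecutive frames stay in `band`."""
    spans: List[Tuple[float, float]] = []
    start = None
    labels = [band_of(s) for s in scores] + [None]  # sentinel closes a final run
    for i, label in enumerate(labels):
        if label == band and start is None:
            start = i
        elif label != band and start is not None:
            if (i - start) / fps >= min_seconds:
                spans.append((start / fps, i / fps))
            start = None
    return spans


# One second optimal, two seconds high load, one second optimal.
demo_scores = [50] * 30 + [75] * 60 + [50] * 30
print(band_of(62), stretches(demo_scores))
```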

07 — Tiers

Soma Lite vs Soma Pro.

Two model tiers ship today: a fast, single-channel cognitive scorer (Lite) and a full multimodal architecture (Pro).

Capability | Soma Lite | Soma Pro
Cognitive load curve
Hook strength score
Emotional arousal trace
Pattern-interrupt density
Archetype extraction
ms-precise cut suggestions
EDL / XML export
Multilingual transcript fusion

08 — Validation

What we measured.

Soma is benchmarked against held-out human viewer panels on a 12,000-video corpus across six verticals, plus continuous shadow-evaluation on creator opt-in production data.

  1. Curve correlation: 0.87. Pearson r between Soma forecast and observed retention on the held-out set.
  2. 5-second survival error: ±2.4%. Mean absolute error on the first-window forecast across genres.
  3. Validation videos: 12k. Held-out set, six verticals, multilingual, length 15s–45m.
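For readers who want the two metrics spelled out, the sketch below computes a Pearson r between a forecast and an observed curve, and a mean absolute error on first-window survival. The function names and all numbers are made up for illustration and are not drawn from the validation corpus; treating the error as percentage points is also an assumption.

```python
import numpy as np


def curve_correlation(forecast, observed) -> float:
    """Pearson r between two retention curves sampled at the same times."""
    return float(np.corrcoef(forecast, observed)[0, 1])


def survival_mae(predicted, actual) -> float:
    """Mean absolute error across videos, in the units of the inputs."""
    p = np.asarray(predicted, dtype=float)
    a = np.asarray(actual, dtype=float)
    return float(np.mean(np.abs(p - a)))


# Illustrative retention curves for one video, in percent.
forecast = np.array([100, 80, 65, 55, 50, 46], dtype=float)
observed = np.array([100, 76, 66, 58, 49, 44], dtype=float)
r = curve_correlation(forecast, observed)

# Illustrative 5-second survival for three videos, in percent.
mae = survival_mae([62.0, 48.0, 71.0], [60.0, 51.0, 70.0])
print(round(r, 3), mae)
```

Pearson r rewards agreement in the shape of the curve, while the error on 5-second survival measures how far off the level is, so the two numbers answer different questions.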

09 — Limitations

What Soma does not do.

Soma forecasts attention; it does not forecast intent. A high-retention video may still under-convert, and a polished Soma score does not guarantee distribution under any specific platform's recommendation policy. Soma does not replace editorial judgement on taste, brand fit, or legal review. It also does not predict virality in the strict sense — virality is a network-propagation phenomenon that depends on factors outside the video itself (timing, social graph, distribution).

Soma's signal degrades on extreme out-of-distribution content (non-narrative interfaces, generative-only abstract video, screen recordings without spoken language). For these surfaces, Pro returns confidence-weighted scores instead of point predictions.

References

Foundational sources cited in this paper.

  1. Miller, G. A. (1956). The magical number seven, plus or minus two: some limits on our capacity for processing information. Psychological Review, 63(2), 81–97. — bounded working memory; foundational for the §2 attention-window claim.
  2. Sweller, J. (1988). Cognitive load during problem solving: effects on learning. Cognitive Science, 12(2), 257–285. — cognitive load theory; backs the two-tailed model in §6.
  3. Itti, L., & Koch, C. (2001). Computational modelling of visual attention. Nature Reviews Neuroscience, 2(3), 194–203. — bottom-up saliency models; informs the multimodal encoder in §3.
  4. Lavie, N. (2005). Distracted and confused?: Selective attention under load. Trends in Cognitive Sciences, 9(2), 75–82. — perceptual load theory; supports the cognitive-load spectrum in §6.
  5. Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux. — System 1 / System 2 framing; underlies the pattern-interrupt mechanics in §2.
  6. Borji, A., & Itti, L. (2013). State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 185–207. — survey of attention modeling architectures predating Soma.

Citations are foundational works in cognitive science and computational attention modeling. Soma is an applied production system; the underlying theory traces to the literature above.
