The Human Attention Model.
Soma (Simulated Observation of Media Attention) is a neural model of where, when, and why human visual attention rewards content. This paper outlines the architecture, the cognitive foundation, and how Soma predicts retention before a video is published.
Abstract
For a decade, the creator economy has optimised content for algorithms, not for the brain that watches it. The result: trillions of frames published blind, billions in wasted media spend, and a feedback loop that resolves only after the algorithm has already decided. This paper introduces Soma, a multimodal neural model that predicts viewer attention frame-by-frame — collapsing the post-hoc analytics loop into a pre-publish forecast. Soma surfaces cognitive load, pattern-interrupt density, emotional arousal, and archetype match across the full timeline of a video, returning a retention curve and editable cut points before the file leaves the editor.
The retroactive trap.
Every video published today is judged twice: once by a recommendation algorithm and once by a human nervous system. Both judgements happen after publish — long after the only window in which the creator could have changed anything.
The current state of practice is post-hoc analytics: ship the video, watch the retention curve, then guess at what to do next time. By the time a creator has enough data to learn the lesson, the algorithm has already deprioritised the post and the production cycle has moved on. The result is structural: an industry that optimises for the algorithm because the algorithm is the only thing it can measure in time.
Three things the brain reliably does.
Soma rests on three findings from cognitive neuroscience that hold across language, genre, and culture.
- Bounded attention windows. Working memory tolerates ~3–7 seconds of unstructured stimulus before disengagement. Most "first-second" hooks miss because they violate this window, not because they're boring.
- Pattern interrupts compound. Disengagement is not gradual — it's punctuated. Each interrupt either resets the window or drops the viewer through it. Density and timing matter more than novelty.
- Cognitive load is two-tailed. Both overwhelm (too dense) and boredom (too sparse) trigger drop-off. There is a measurable "optimal load" zone, narrow but stable across viewers.
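The first two findings can be read as a simple timing rule. The sketch below illustrates that rule under stated assumptions and is not Soma's implementation: the function name `first_lapse` and the 5-second `window_s` default (picked from inside the 3–7 second range) are hypothetical, and every interrupt is assumed to reset the window fully.

```python
def first_lapse(interrupts_s, duration_s, window_s=5.0):
    """Return the time (seconds) at which the attention window first expires.

    Returns None if every gap between resets fits inside the window.
    Assumption: each interrupt fully resets the window. The effect described
    in the text is likely probabilistic, so this is a deliberate simplification.
    """
    last_reset = 0.0  # the video start counts as the first reset
    for t in sorted(interrupts_s):
        if t > duration_s:
            break  # ignore interrupts placed after the end of the video
        if t - last_reset > window_s:
            return last_reset + window_s  # window ran out before this interrupt
        last_reset = t
    # Check the tail between the last interrupt and the end of the video.
    if duration_s - last_reset > window_s:
        return last_reset + window_s
    return None


print(first_lapse([2.0, 6.0, 14.0], duration_s=20.0))
```

For interrupts at 2, 6, and 14 seconds in a 20-second video, the rule flags a lapse at 11 seconds: the 8-second gap after the 6-second interrupt outlasts the assumed 5-second window.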
Soma in three stages.
A video enters Soma as three parallel modalities (pixels, audio, transcript) and exits as eight per-frame attention signals plus a recombined retention curve.
Each modality is encoded into a shared attention space before late fusion. Frame-level outputs are recombined into a single retention curve via a learned aggregator, calibrated against held-out viewer panels.
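A toy version of the fuse-then-aggregate flow may make the data path easier to picture. Everything here is assumed for illustration: the function names, the fixed `weights` and `bias` (Soma's aggregator is described as learned, and its form is not given in this paper), and the choice to treat each fused frame score as a per-frame "stay" probability whose running product forms the curve.

```python
import math


def fuse_frame(visual, audio, text, weights=(0.5, 0.3, 0.2), bias=0.0):
    """Late fusion for one frame: combine per-modality scores into one value.

    Each modality is scored separately upstream; only the scores are mixed.
    Assumption: a weighted sum squashed by a sigmoid into a stay probability.
    """
    z = weights[0] * visual + weights[1] * audio + weights[2] * text + bias
    return 1.0 / (1.0 + math.exp(-z))


def retention_curve(frames, weights=(0.5, 0.3, 0.2), bias=0.0):
    """Aggregate per-frame stay probabilities into a non-increasing curve.

    `frames` is a sequence of (visual, audio, text) score triples.
    The curve value at frame i is the product of stay probabilities up to i.
    """
    curve, alive = [], 1.0
    for visual, audio, text in frames:
        alive *= fuse_frame(visual, audio, text, weights, bias)
        curve.append(alive)
    return curve


# Hypothetical per-frame modality scores, not real Soma outputs.
demo = retention_curve([(4.0, 4.0, 4.0), (3.0, 2.0, 4.0), (0.0, 0.0, 0.0)])
print([round(v, 3) for v in demo])
```

The running product guarantees the curve never rises, which matches how a retention curve behaves; a learned aggregator could replace the fixed weights without changing this interface.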
The retention curve, returned millisecond-resolved.
Soma's headline output is a per-frame retention forecast. The shape of the curve — not just the area under it — is the signal that informs every downstream decision.
The curve is segmented into bands: the 5-second cliff (where ~40% of all uploads lose >50% of viewers), the mid-video slump (the 18–28s zone where pacing fatigue compounds), and the recovery shoulders, where pattern interrupts can rescue an otherwise descending trajectory. Editors can ask the model for cut suggestions that flatten any of these bands, with no retraining required.
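The first two bands can be read off a retention curve with a few lines of code. This is a minimal sketch, assuming the curve is a list of retained fractions sampled once per frame; `band_summary` and its output keys are hypothetical names, and the recovery shoulders are left out because the text does not define them numerically.

```python
def band_summary(curve, fps):
    """Summarise the 5-second cliff and the 18-28 s slump of a retention curve.

    curve[i] is the fraction of viewers still watching at frame i.
    """
    def at(seconds):
        # Clamp to the last frame so short videos do not raise IndexError.
        idx = min(int(seconds * fps), len(curve) - 1)
        return curve[idx]

    survival = at(5.0)
    summary = {
        "survival_5s": survival,
        "cliff": survival < 0.5,  # more than half the viewers gone by 5 s
    }
    duration_s = len(curve) / fps
    if duration_s > 28.0:
        # Retention lost across the 18-28 s zone (positive means a drop).
        summary["slump_drop"] = at(18.0) - at(28.0)
    return summary


# Toy curve: linear decay over 40 seconds, sampled at 1 frame per second.
toy = [1.0 - i / 40 for i in range(40)]
print(band_summary(toy, fps=1))
```

On the toy curve, 87.5% of viewers survive the first five seconds, so no cliff is flagged, and a quarter of the starting audience is lost across the 18–28 second zone.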
Eight signal categories.
Each Soma inference produces eight per-frame signal channels. They are independent — no single channel collapses into another — and they are interpretable.
- Pattern interrupt. Density and timing of cuts, jumps, contradictions.
- Cognitive load. Per-frame information density vs working memory.
- Emotional arousal. Sympathetic-system spikes: surprise, awe, fear.
- Hook strength. First-window forecast: predicted 5-second survival.
- Pacing rhythm. Cut cadence vs the channel's archetype baseline.
- Payoff timing. Distance from claim to resolution; promise debt.
- Audio resonance. Prosody, music drop, dialogue rhythm, as separate channels.
- Archetype match. Similarity to the channel's historic top quartile.
The two-tailed problem.
Cognitive load is the most misunderstood signal in video. Editors instinctively maximise stimulus density; viewers leave when stimulus exceeds working-memory capacity. Soma scores this on a normalised 0–100 scale per frame, with five interpretable bands.
The most common editing failure is parking an entire video in the high-load band. Soma surfaces these stretches as red bars on the timeline, with cut suggestions that redistribute load, usually by extending pauses rather than removing content.
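Flagging sustained high-load stretches on a per-frame 0–100 trace is a short run-length scan. The sketch below is an assumption-heavy illustration: the band boundaries are not stated in this paper, so the `threshold` of 80 and the 2-second `min_seconds` are placeholders, and `high_load_stretches` is a hypothetical name.

```python
def high_load_stretches(load, fps, threshold=80.0, min_seconds=2.0):
    """Return (start_s, end_s) spans where load stays at or above threshold.

    load[i] is the 0-100 cognitive-load score for frame i.
    Spans shorter than min_seconds are ignored as momentary spikes.
    """
    stretches, start = [], None
    for i, value in enumerate(load):
        if value >= threshold and start is None:
            start = i  # a high-load run begins
        elif value < threshold and start is not None:
            if (i - start) / fps >= min_seconds:
                stretches.append((start / fps, i / fps))
            start = None  # run ended
    # Close a run that is still open at the end of the video.
    if start is not None and (len(load) - start) / fps >= min_seconds:
        stretches.append((start / fps, len(load) / fps))
    return stretches


# Toy trace at 2 frames per second: one 3 s run and one 1 s spike.
trace = [50] * 4 + [90] * 6 + [50] * 2 + [95] * 2 + [40] * 2
print(high_load_stretches(trace, fps=2))
```

Only the 3-second run between 2.0 s and 5.0 s is reported; the 1-second spike falls under the assumed minimum length and is dropped.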
Soma Lite vs Soma Pro.
Two model tiers ship today: a fast, single-channel cognitive scorer (Lite) and a full multimodal architecture (Pro).
| Capability | Soma Lite | Soma Pro |
|---|---|---|
| Cognitive load curve | ✓ | ✓ |
| Hook strength score | ✓ | ✓ |
| Emotional arousal trace | — | ✓ |
| Pattern-interrupt density | — | ✓ |
| Archetype extraction | — | ✓ |
| ms-precise cut suggestions | — | ✓ |
| EDL / XML export | — | ✓ |
| Multilingual transcript fusion | ✓ | ✓ |
What we measured.
Soma is benchmarked against held-out human viewer panels on a 12,000-video corpus across six verticals, plus continuous shadow evaluation on production data from creators who have opted in.
| Metric | Value | Definition |
|---|---|---|
| Curve correlation | 0.87 | Pearson r between Soma forecast and observed retention on the held-out set. |
| 5-sec survival error | ±2.4% | Mean absolute error on the first-window forecast across genres. |
| Validation videos | 12k | Held-out set, six verticals, multilingual, length 15s–45m. |
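Both headline metrics are standard statistics, and a reader can reproduce them on their own forecast and observation pairs. The sketch below uses made-up toy numbers, not Soma benchmark data; the function names are illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def mean_absolute_error(predicted, observed):
    """Average absolute gap between forecast and observation."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(predicted)


# Toy retention curves for one video (fractions of viewers retained).
forecast = [1.00, 0.80, 0.65, 0.55, 0.50]
observed = [1.00, 0.75, 0.60, 0.58, 0.45]
print(round(pearson_r(forecast, observed), 3))

# Toy 5-second survival values across four videos.
print(mean_absolute_error([0.50, 0.60, 0.70, 0.40], [0.40, 0.80, 0.70, 0.50]))
```

Correlation captures whether the forecast curve has the right shape, while mean absolute error captures how far off the first-window number is in absolute terms; the two can disagree, which is why both are reported.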
What Soma does not do.
Soma forecasts attention; it does not forecast intent. A high-retention video may still under-convert, and a polished Soma score does not guarantee distribution under any specific platform's recommendation policy. Soma does not replace editorial judgement on taste, brand fit, or legal review. It also does not predict virality in the strict sense — virality is a network-propagation phenomenon that depends on factors outside the video itself (timing, social graph, distribution).
Soma's signal degrades on extreme out-of-distribution content (non-narrative interfaces, generative-only abstract video, screen recordings without spoken language). For these surfaces, Pro returns confidence-weighted scores instead of point predictions.
Foundational sources cited in this paper.
- Miller, G. A. (1956). The magical number seven, plus or minus two: some limits on our capacity for processing information. Psychological Review, 63(2), 81–97. — bounded working memory; foundational for the §2 attention-window claim.
- Sweller, J. (1988). Cognitive load during problem solving: effects on learning. Cognitive Science, 12(2), 257–285. — cognitive load theory; backs the two-tailed model in §6.
- Itti, L., & Koch, C. (2001). Computational modelling of visual attention. Nature Reviews Neuroscience, 2(3), 194–203. — bottom-up saliency models; informs the multimodal encoder in §3.
- Lavie, N. (2005). Distracted and confused?: Selective attention under load. Trends in Cognitive Sciences, 9(2), 75–82. — perceptual load theory; supports the cognitive-load spectrum in §6.
- Kahneman, D. (2011). Thinking, Fast and Slow. Farrar, Straus and Giroux. — System 1 / System 2 framing; underlies the pattern-interrupt mechanics in §2.
- Borji, A., & Itti, L. (2013). State-of-the-art in visual attention modeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 185–207. — survey of attention modeling architectures predating Soma.
Citations are foundational works in cognitive science and computational attention modeling. Soma is an applied production system; the underlying theory traces to the literature above.
Run Soma on a real video.
Drop in any cut. We'll return the retention curve, the cognitive load spectrum, and ms-precise cut suggestions.