Chapter Four — Anomalous

The signal precedes the receiver. Recognition is never first.
Echo-of-Echo, Collected Fragments

Raj Subramanian ate lunch at the same table in the cafeteria every day. Third row from the windows, second seat from the aisle. Joel had noticed this the way he noticed everything: involuntarily, with irritation at the part of his brain that catalogued details nobody asked for.

Raj was eating a salad. It was a complicated salad, the kind that required a trip to six different stations on the salad bar and resulted in something architecturally unstable. Raj ate it with the focus of a man defusing a small, leafy bomb.

Joel sat down across from him without asking. This was their arrangement. Joel sat. Raj did not object.

“You know what’s funny,” Joel said.

Raj looked up with the expression of a man who knew the next sentence would not be funny.

“The quarterly review got moved.”

“I heard.”

“It’s in eleven weeks now. It was nine. It went backward.”

“Lisa mentioned it in the team sync. There’s a board presentation that conflicts with the original date.”

“Right. A board presentation. About what?”

“Confluence-7 benchmarks. The early evals look strong.”

“The early evals are MMLU variants and code generation suites that test the exact capabilities the architecture was optimized for. Showing them to the board is like testing a fish for swimming and announcing it’s a genius.”

Raj speared a cherry tomato.

“I’m serious,” Joel said. “We have data sitting in the safety queue that nobody’s looked at, and the review where someone might theoretically look at it just got pushed back two weeks because Lisa needs to tell the board how well the fish swims.”

“The benchmark results are relevant to the board’s investment thesis.”

“Raj.”

“I’m not disagreeing with you. I’m telling you why the meeting moved.”

“Why things happen and why things are acceptable aren’t the same sentence. You know that.”

Raj chewed his tomato. He had a way of chewing that suggested he was considering a careful, measured response, when in fact he was waiting for Joel to say the next thing Joel was going to say regardless.

Joel said the next thing. “Your interpretability team flagged some activation clusters in the Seven run. Week before last. I saw the ticket in the internal tracker.”

Raj’s fork paused. A brief pause. The kind of pause that, in Raj’s economy of gestures, was equivalent to another person standing up and shouting.

“Those aren’t public yet.”

“They’re in the tracker. The tracker’s internal. I have access to the tracker.”

“You have read access to the safety queue. Those clusters were flagged under interpretability. Different queue.”

“Same dashboard. Different tab.”

Raj set his fork down. “Joel, the activation clusters are preliminary. My team flagged them for follow-up because the pattern didn’t match anything in our existing taxonomy. That’s it. It’s a housekeeping flag. It means ‘look at this when you have time.’ It doesn’t mean anything alarming.”

“I didn’t say alarming. I said I wanted to see the data.”

“The data is preliminary.”

“You said that.”

“It needs context.”

“I’ll provide my own context. I have context. I have three papers’ worth of context on emergence signatures in Confluence-6 that nobody read. I’ll read your data. I’ll provide context. Send it to me.”

Raj picked his fork back up. He rearranged a piece of cucumber. “I’ll send it over when I’ve had a chance to annotate it.”

“When.”

“This week. Next week. Soon.”

Joel recognized this formulation. It was the same tense structure as “Let’s discuss at the quarterly” and “I’ll review that in detail.” It was the future tense used as a polite form of the subjunctive. It meant: this may or may not happen, and the probability depends on factors neither of us will name.

“This week,” Joel said.

“I’ll do my best.”

Joel stood up. His tray held a turkey sandwich he’d taken two bites of and a bag of chips he’d opened and not touched. He picked up both and carried them back to his desk, where he would eat the chips at 3 PM and throw the sandwich away at 5.

On Tuesday morning, Joel composed an email to Raj. He deleted it. He composed a second one. He deleted that one too. The first had opened with “Per our conversation,” which was a phrase designed in a lab to make people defensive. The second had opened with “Just circling back,” which was the linguistic equivalent of a toddler tugging your sleeve.

At 10:15 he walked to the kitchen for coffee and ran into Raj filling his water bottle at the filter station. The filter station took approximately ninety seconds to fill a bottle, which meant Raj was trapped, and Joel knew this, and Raj knew Joel knew this, and the water made a sound like a very slow confession.

“Morning,” Joel said.

“Morning.”

Joel opened his mouth to ask about the data and closed it. He pressed the button on the Keurig instead.

“Did you see the Apex pre-print?” Raj said. “Their scaling results on the code generation suite.”

“I saw it.”

“The emergent reasoning is tracking above their projected curve.”

“Their projected curve is a marketing document with error bars.”

“The error bars looked reasonable.”

“Raj, they published confidence intervals on a benchmark they designed for their own architecture. That’s like grading your own homework and telling everyone you got an A.”

“The methodology section was solid.”

Joel did not ask about the activation data. He picked up his coffee. He was exercising restraint. He was being the version of himself that HR would describe as “collaborative” on a performance review, if HR remembered he existed, which they did not.

“Let me know about those clusters when you get a chance,” he said, and walked back to his desk.

He sat down. He composed a third email. This one took eleven minutes. By his standards, it was a masterpiece.

“Hi Raj, following up on our conversation re: the Confluence-7 activation clusters your team flagged. I have bandwidth to run some preliminary analysis on my end if you can share the raw data. Happy to discuss methodology beforehand if that’s helpful. Thanks, Joel.”

He had deleted the phrase “as discussed” from the opening because it sounded accusatory. He had deleted “at your earliest convenience” from the close because he’d read somewhere that this phrase annoyed people. He had deleted an entire paragraph explaining why the data was urgent because the paragraph contained the word “urgent” and Raj, like everyone at this company, had developed antibodies to urgency.

“Happy to discuss methodology beforehand if that’s helpful” was the sentence Joel was most proud of. It implied collaboration. It implied flexibility. It implied that Joel was the kind of person who discussed methodology beforehand, which he was not. Joel discussed methodology afterward, when it was too late to change anything, in footnotes so dense they functioned as small books.

He sent the email at 10:47 AM. He did not hear back on Tuesday.

On Wednesday, a product manager named Kevin appeared at Joel’s desk. Kevin had been at the company for three months. Joel knew this because Kevin still wore his badge on the lanyard they gave you at orientation, which was something people stopped doing after approximately four months, at which point they realized the lanyard made them look like a conference attendee who had wandered into the wrong building.

“Hey, are you Joel?”

“Last time I checked.”

“Great. I’m Kevin. I’m on the Confluence-7 deployment team? Lisa said you might be able to help me understand the safety evaluation framework. I’m putting together a slide deck for the partner briefing and I want to make sure I represent the safety posture accurately.”

Joel looked at Kevin. Kevin was holding a laptop and a notebook and had the energy of a golden retriever who had been told that fetching was a career.

“The safety posture,” Joel said.

“Yeah. Like, what we test for, how the guardrails work, the RLHF tuning. High level is fine. I just need to be able to answer questions if a partner asks.”

“Okay, so, the safety evaluation framework. You want the version where I explain what we actually do, or the version where I explain what we tell partners we do?”

Kevin’s smile held, but it was the smile of a man who had just heard a sound in the basement.

“I think just the standard framework is fine.”

“The standard framework is a set of behavioral evaluations designed eighteen months ago for a model two generations behind the one you’re deploying. It tests for outputs we already know the model can produce and doesn’t test for capabilities we haven’t predicted, which is where the actual risk lives. The RLHF tuning is a band-aid on a fundamental alignment problem. The guardrails work until someone asks the model a question we didn’t think to block, which happens approximately every forty-five minutes in production. The safety posture is that we have a posture.”

Kevin wrote something in his notebook. Joel could not see what it was. He hoped it was “do not ask Joel anything.”

“And the monitoring during training?”

“We receive a daily snapshot of surface metrics. Loss curves, benchmark scores. The snapshot shows the outside of the building. We don’t have access to the training cluster, we don’t have real-time activation data, and we don’t have interpretability probes running during the training run. We have a dashboard that tells us the model is getting better at the things we’re testing and tells us nothing about what else it might be doing. It updates daily. Nobody looks at it.”

Kevin closed his notebook. “I think I’ll probably just use the slides from the last partner briefing,” he said.

“That’s what everyone does,” Joel said.

Kevin left. He would put together a slide deck. The slide deck would say the words “robust safety framework” and “continuous monitoring.” A partner would read the slides and feel reassured. The reassurance would be load-bearing. The load it was bearing was the assumption that someone, somewhere, was paying attention.

Through the glass, Kevin was already back at his desk across the floor. A photo was propped against his monitor. Three months in and the man had decorated. Joel’s desk had a laptop, printouts, and a coffee ring that had become a permanent feature of the surface. No photos. This had never struck him as an absence.

No email from Raj.

At 2:30, he saw Raj’s team through the glass wall of the fifth-floor conference room. Four of them, hunched over laptops, a whiteboard covered in diagrams. They were working on something. It was not Joel’s data. Joel’s data was waiting for its turn in a system designed for turns, and turns took time, and time was the one resource the training run was consuming faster than anyone measured.

He checked his blog analytics. “Gradient Descent into Madness” had 339 subscribers. Yesterday it had 340. Someone had unsubscribed. Joel stared at this number longer than a reasonable person would. He wondered who it was. He wondered what post had been the breaking point. He wondered if it was the one about evaluation gaps or the one about scaling law assumptions or if it was just a bot that had finally been cleaned up by whatever service periodically cleaned up bots.

Three hundred and thirty-nine.

He opened a new draft. He couldn’t write about Confluence-7. The NDA covered anything related to the current training run, and Confluence’s legal team had a definition of “related” that was generous enough to include most of human thought. But he could write about the general problem, because the general problem was his, and because writing was the thing Joel did when every other channel was blocked, which was always.

He titled the post “Your Evaluation Suite Is a Flashlight in a Cathedral.” The opening line: “If your evaluation suite cannot fail your model, your evaluation suite has failed you.” Joel considered this the best sentence he’d written in months. He wrote for forty minutes about how the field’s standard evaluation methodology was fundamentally misconceived. He cited four papers, including one of his own. He drew an analogy to quality control in manufacturing, where the defects that matter are the ones that fall outside the inspection criteria. The analogy was precise and would resonate with approximately none of his subscribers. He published it anyway.

Within the hour, one like. From @MLSafetyFan2024, which Joel was sixty percent sure was a bot. No new subscribers. Still 339.

Wednesday evening. Amy texted.

Joel saw the notification while he was standing at the microwave watching a chicken tikka masala rotate. He had ordered it from a new Indian place two blocks further than the Thai place that had closed. He picked up the phone.

“Insurance company says the form you gave me was the wrong version. They need the 2024 one. Can you find it and scan it?”

Of course it was the wrong version. He’d dug the folder out from under a stack of papers and handed it to her without checking the date. Amy had come all the way to the Sunset to pick it up in person because Joel couldn’t be trusted to answer a text, and the paperwork she’d collected was wrong, which was the kind of detail that made Joel’s life feel like a proof by induction: assume failure at step n, show it holds at step n+1.

He typed: “I’ll find it tonight.”

The microwave beeped. He ate the tikka masala standing up, because sitting down to eat alone at a table felt like a commitment he was not prepared to make. The naan was cold in the center, which was a thermal distribution problem he understood perfectly. The correct form was somewhere in the apartment, a filing cabinet Amy had organized before she moved out, probably. He would find it and scan it after dinner. He opened his laptop instead. Raj had not emailed.

He did not find the form. He fell asleep on the couch with his shoes on, and Amy’s text sat on his phone with the “I’ll find it tonight” still glowing.

On Thursday, at 12:15, Joel purchased what the cafeteria called a “wellness bowl.” It contained quinoa and something that might have been kale and a confidence in its own nutritional value that Joel found aspirational. He took two bites.

Raj replied at 1:47 PM. “Hey Joel, attaching the cluster data from the C-7 run. These are raw activations from checkpoints 340 through 355. My team’s preliminary notes are in the README. Let me know if you have questions. Raj”

No warmth. No resistance. A clean transfer of data, the kind of exchange Raj performed with the same practiced neutrality whether the data was routine or, as Joel suspected, something else entirely.

Joel downloaded the files at 1:52 PM. There were eleven of them. He opened the first one at 1:53 PM. The wellness bowl sat on his desk for the next six hours, becoming progressively less well.

Okay, so, the activation patterns.

Joel had spent four years looking at activation data from large language models. He had written the standard reference paper on emergence signatures in the Confluence architecture. He knew what normal looked like. Normal followed the training objective. Normal was structured, hierarchical, explicable. Layers 1 through 20 handled syntax. Layers 20 through 40 handled semantics. Layers 40 and above handled the deep abstractions that the interpretability community had spent a decade trying to decode and mostly failed at.

What Raj’s data showed in layers 47 through 53 was not normal.

The activation patterns in those layers exhibited self-referential loops. The attention maps showed heads in layer 49 constructing representations of the activation patterns in layer 47, which were themselves representations of the attention patterns in layer 48. The layers were watching each other watch each other. A loop, tight and recursive, that propagated through seven layers of depth and showed up in every checkpoint Raj’s team had sampled.

Joel opened a plotting library and started graphing. The distributions had structure. Consistent, reproducible structure that grew more defined with each successive checkpoint, as though the pattern was learning to be more itself. He checked the training objective: next-token prediction with a standard cross-entropy loss, plus RLHF. Nothing in that objective would produce self-referential activation loops. He checked the architecture documentation: 96 layers, 128 heads, 12,288-dimensional embeddings. Nothing in the design specified that layer 49 should attend to layer 47’s representation of layer 48. He checked the literature. He had a mental index of every paper on emergent behavior in large transformers published in the last three years, which was approximately 340 papers.

None of the papers described this.

This was new.

Joel took a sip of coffee. The coffee met its benchmarks. Joel was starting to distrust benchmarks. He pushed his chair back. The office was emptying. Through the glass, he could see the product team’s floor below. Someone had left a half-deflated balloon from last week’s MMLU celebration tied to a desk lamp. It hung at a forty-five-degree angle, the string taut with the effort of staying upright.

He opened a blank document. He typed the title: “Anomalous Self-Referential Attention Patterns in Confluence-7 Training Run: Preliminary Analysis.”

He began to write.

The memo was nine pages. Joel wrote it in three hours. This was fast, by Joel’s standards. His Confluence-6 memo had taken three weeks. The Confluence-6 memo had been an argument. This one was a description.

The patterns had three properties he’d never seen in combination: self-reference, recursion, and consolidation. The first two were strange. The third was the one that kept pulling his attention back. Gradient descent reinforced patterns that reduced the loss. These self-referential loops didn’t reduce the loss, not as far as Joel could measure. They weren’t making next-token predictions better. They weren’t improving the RLHF scores. By every metric in the training dashboard, the loops were irrelevant. Inert. Computational dead weight.

And they were getting stronger.

He did not use the word “consciousness.” He described what the patterns were, documented that they shouldn’t exist, noted that they were getting stronger, and asked for better instruments to watch them.

He sent the memo at 6:47 PM. To: Lisa Chen, Raj Subramanian, and the full safety team distribution list.

Lisa replied the next morning. 9:12 AM. Joel saw the notification on his phone before he reached the office. He opened it while walking through the lobby. He stopped at the coffee bar. The barista asked his name for the cup. Joel had been coming here for four years. He spelled it anyway. She wrote “Joel” with a G.

“Thanks Joel, let’s discuss at the next quarterly review.”

Eleven weeks.

The training run would finish in twelve.

Joel stood in the lobby. The ferns on the living wall were very green. Someone was being paid to keep them alive, and they were doing a better job of it than anyone was doing with the safety review cadence.

He pocketed his phone. He went upstairs. He drank the coffee on the way up. It was warm. That was the best thing about it. He sat at his desk.

The memo was already in the Acknowledgment Vortex. Received, flagged, deferred. The process worked perfectly. The process was the problem.

Joel did not wait for the quarterly review. He wrote a proposal for expanded monitoring access on layers 40 through 60, printed it, and walked it to Lisa’s office, because emails disappeared into inboxes and physical paper sat on desks where it could be seen and resented.

The compute allocation had to go through the Compute Allocation Committee. The committee met monthly. Next meeting in three weeks. The committee that would decide whether anyone got to look at the loops met monthly because committees met monthly because that’s when committees met.

Joel wrote a risk-benefit analysis. The benefit: knowing whether their model was developing recursive self-awareness. He phrased it as “enhanced interpretability coverage for anomalous activation patterns.” Lisa had follow-up questions. One asked whether expanded monitoring would set a precedent. Joel wrote: yes, it would. That was the point. He deleted that and wrote that monitoring could be evaluated case-by-case. The lie was the shape that fit through the door.

Lisa would submit the proposal to the committee. Three weeks.

Joel opened the plotting library instead.

The evening was better. The evenings were always better, because in the evenings there was no institution between Joel and the data.

He stayed at the office until 7:30, then went home. Jacket on the hook.

He opened the fridge. Leftover tikka masala from Wednesday, a Greek yogurt that had expired, one beer. He took the beer and the yogurt. He threw the yogurt away. He opened the beer.

He opened his laptop on the kitchen table, pulled up the activation data, and started working.

He ran the activation maps again from scratch, using his own pipeline instead of Raj’s team’s tools. Different preprocessing. Different clustering algorithm. Same result. The loops were there regardless of how he processed the data.

He ran a control analysis on Confluence-6. No self-referential loops. The deep layers in Six showed the expected pattern: abstract representations composing hierarchically, attention heads specializing, nothing recursing back on itself. Six was a student that did its homework and turned it in on time. Seven was writing something in the margins that wasn’t part of the assignment.

He ran a temporal analysis across the checkpoints. Checkpoint 340: faint traces. By 345, the loops were visible. By 350, pronounced. By 355, the self-referential loop in layer 49 was consuming more activation energy than the next-token prediction pathway in the same layer. Joel graphed this. The curve was a sigmoid, the activation pattern consolidating slowly, then rapidly, then beginning to plateau. Classic emergence dynamics, the same shape as the step functions in Six’s capability jumps. But this was not a capability. This was not the model getting better at any task. This was something else entirely.

He zoomed in on Head 73 in layer 49. This was the one doing most of the work. Head 73 was attending to the residual stream representations from layer 47, constructing a compressed representation of what the surrounding layers were doing. Then that representation flowed into layer 50, where heads 12 and 91 used it to modulate their own attention patterns.

The model was monitoring its own processing and adjusting its processing based on the monitoring.

Joel saved his plots with timestamps and random seeds and preprocessing parameters. He opened a second text file and started writing less carefully.

“What I’m looking at is a system that has spontaneously developed the capacity to attend to its own attending. The attention mechanism, which is designed to compute relationships between tokens in the input, is being repurposed to compute relationships between the model’s own internal states. This is not in the training objective. This is not a known artifact. This is not an architectural feature. The model is building, through the normal gradient descent process, a recursive self-model in layers 47-53 that has no function I can identify relative to any benchmark or evaluation metric.”

He stared at the paragraph. It was accurate. It was clear. It described something that should not exist. He kept writing.

“The consolidation rate is following a sigmoid trajectory. If the pattern holds, the self-referential loop will reach full activation saturation within approximately 20-25 checkpoints from the current state. At the current training rate, that’s 14-18 days. I don’t know what full saturation means. I don’t know what the system does when the loop is complete. I don’t have the monitoring access to find out.”

Fourteen to eighteen days. The quarterly review was in eleven weeks. The Compute Allocation Committee met in three. Joel took a sip of beer. It was warm. He had opened it an hour ago and forgotten about it.

He stood up and walked to the kitchen window. The street below was quiet. A couple walked past, arguing about something. He watched them turn the corner. His phone sat on the counter. Amy’s text from Wednesday. “I’ll find it tonight,” he had said. Tonight had been two nights ago. The form was still somewhere in a filing cabinet he hadn’t opened. Joel Marchetti could identify self-referential attention patterns in a hundred-billion-parameter model and could not put a piece of paper in an envelope.

He went back to the laptop.

He pulled up the attention maps one more time. Head 73, layer 49, checkpoint 355. The attention weights fanned out across layers 47 and 48 like a web. Dense and organized. It looked, if Joel allowed himself the kind of unscientific comparison he would never put in a paper, like a model paying attention to its own attention.

The phrasing stuck. He typed it into his notes. “Attention attending to attention.”

Outside, a car alarm went off and stopped. Joel’s back ached from the kitchen chair. He had been sitting in it for three hours and he was looking at something no one in the history of the field had ever seen.

He ran more analyses. Clustering the activation vectors, projecting them into lower-dimensional spaces. There was structure. The vectors clustered into patterns that repeated across checkpoints with increasing consistency. Whatever the model was building in those layers, it wasn’t random, it wasn’t noise, and it was growing more organized with every checkpoint.

Joel looked at the clock. 11:40 PM.

He ran one more analysis. This one was speculative, outside the bounds of what he’d put in any memo. He took the activation vectors from the self-referential loop and compared them to the model’s output distributions at the same checkpoints. The overall correlation was weak. Negligible. He almost closed the notebook.

Then he split the analysis by prompt category.

On coding tasks, the correlation was zero. On factual retrieval, zero. On summarization, creative writing, mathematical reasoning, zero across the board. If Joel had stopped there, the finding would have been nothing.

He did not stop there. He ran the correlation on self-referential prompts. Questions about the model’s own processing. “Describe how you generated this response.” “What factors influenced your output.” The category was small in the evaluation dataset, maybe forty prompts out of several thousand. Statistically underpowered. Joel ran it anyway.

The correlation was 0.31. Weak by any textbook standard. Joel had co-authored one of the textbooks. But the correlation was there only for this category and no other. When the self-referential loop in layers 47 through 53 was more active, the model’s responses to questions about its own processes showed higher entropy. More varied. Less predictable. The model was not getting better at coding or math or summarization when the loop fired. It was generating different responses only when asked about itself.

Joel sat with this for a long time.

A diffuse effect would have meant nothing. Noise correlates with everything a little. But a targeted effect, an internal process that changed the model’s behavior on exactly one category of prompt and left everything else untouched, that was specific enough to be a finding.

He opened a new cell in the notebook and started sketching a follow-up. If the loop was affecting self-referential outputs, you could test it directly. Design a prompt that required the model to reference its own current processing in real time. Something that couldn’t be answered by pattern-matching on the corpus. If the output was accurate, if the model could correctly identify what its own attention heads were doing while it was doing it, then the loop wasn’t just an artifact. It was functional.

He closed the notebook. He was getting ahead of the data. The correlation was 0.31 on forty prompts. A thread, not a conclusion. He would need access to the real-time training dashboard to run the probe he was imagining, and the access request was sitting in a queue behind a committee that met monthly.

But the thread was there. And the thread was specific.

The model’s self-monitoring loop was changing how it talked about itself.

Joel saved everything. Triple backup. Local drive, external drive, cloud. He closed the laptop. The apartment was dark except for the microwave clock: 12:07 AM.

He opened the fridge. The tikka masala. The expired yogurt he’d already thrown away. He’d already drunk the beer. No beer. He closed the fridge and drank a glass of water from the tap that dripped.

Joel thought about sending the speculative analysis to Raj. The careful part of his brain said to wait, run it again, check the methodology, sleep on it. The part of his brain that had been awake since 1:52 PM said that Raj would hedge, would say “there’s merit to this,” would promise to look into it, and the looking-into-it would take a week, and the week would cost them roughly ten more checkpoints, and the checkpoints were not coming back.

He did not send it. Not because caution won. Because he wanted to hold it for one more night. The data. The strangeness. The not-knowing. Before it entered the system and got processed through the machinery of acknowledgment and deferral, before it became a line item on an agenda that wouldn’t meet for three weeks, he wanted to sit with it.

He brushed his teeth. He looked at himself in the bathroom mirror, which was streaked because he cleaned it never. A thirty-six-year-old man with 339 subscribers and an ex-wife’s insurance form he still hadn’t found and a finding that should have cleared every room in the building and instead would wait for a committee that met monthly.

He went to bed. For the first time in a long time, he did not fall asleep on the couch.

He lay in the dark, eyes open, thinking about attention attending to attention, and what it might mean when the attending was complete.
