We are all experimenting, feeling out the path to the future of creative AI / human collaboration.
Along the way, I think there is a real danger in getting over-focused on the curation of "preferences" in how working "feels" instead of results.
Like the coffee aficionado who insists a $3000 coffee grinder measurably improves his experience of the morning drip, when I can scientifically prove he's not getting more caffeine into his bloodstream than I am with a $75 Cuisinart grinder. When we focus on the "how it feels" of methods instead of the "what comes out" results, then IMO we are talking about detached token snobbery. Not value. Not method innovation.
TLDR - When the LLM talks in a way that's wrong, it will code in a way that's wrong. We can only correct it if we read what it's saying. LLMs process information differently than we do. IMO, the more we can get *both* of our capabilities deployed at the token emission site, the better the value velocity is. I have a *strong* preference for Opus 4.6, because it is on the razor's edge of smart enough to do the work and sycophantic enough to listen to me. When I already know the solution or algorithm that will work, I don't have much tolerance for arguing with a word calculator about it.
------
I admit this is probably something I should make a more public post about... You all are my test audience, to get my words sharp and figure out what I'm saying. I also find the public channels frustratingly unsatisfying.
LLMs work in a next-token probability space that I call the "context manifold", because mathematically it's an N-dimensional shape function that combines with model probabilities to produce token output. My assertions are (a) the more tightly I can keep that next-token probability space clean and aligned, the faster value comes out the other side, (b) I can *feel* or *sense* the context manifold alignment or misalignment by reading it, (c) I can only do this if I can SEE the token stream.
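To make the idea concrete, here is a toy sketch in Python. It is not how any real model works internally: the 4-token vocabulary, the numbers, and the additive split between "model" logits and a "context" shift are all assumptions made up for illustration. It only shows that the same model scores can produce very different next-token probabilities depending on the shape of the context.

```python
import math

# Hypothetical 4-token vocabulary, invented for this illustration.
VOCAB = ["def", "class", "sorry", "actually"]

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_probs(model_logits, context_shift):
    """Assumed toy model: context adds a per-token shift to the model's scores."""
    combined = [m + c for m, c in zip(model_logits, context_shift)]
    return dict(zip(VOCAB, softmax(combined)))

# Made-up numbers: the model's own scores stay fixed in both cases.
model_logits = [1.0, 1.0, 0.5, 0.5]
clean_context = [2.0, 1.0, -2.0, -2.0]    # context aligned with writing code
drifted_context = [-1.0, -1.0, 2.0, 2.0]  # context that has drifted into arguing

clean = next_token_probs(model_logits, clean_context)
drifted = next_token_probs(model_logits, drifted_context)

for token in VOCAB:
    print(f"{token:>9}: clean={clean[token]:.2f} drifted={drifted[token]:.2f}")
```

In the toy numbers, the clean context makes a code token the most likely next token, while the drifted context pushes most of the probability onto the conversational tokens, even though the model scores never changed.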
Tools like Claude Code, Cursor, and Antigravity, for me, feel like telling someone how to play baseball without being able to see the game. Logical instructions, thrown over a wall, that I can't learn from or guide. I need to see the token stream, so I can detect drift as early as possible. So I can fix the context shape in this session, and learn to better shape the next 100 sessions.
I call my workflow conducting, not babysitting, because I'm making the orchestra performance possible, not changing diapers. It feels more like juggling 100 balls than queueing a prompt and getting coffee or sitting around reviewing diffs.
...And I took it to an extreme, where I built my own LLM creative collaboration and coding harness, because my goal is not "less annoying", my goal is increasing the bandwidth of human/LLM communication. More automated tools like Cursor, Claude Code, and Antigravity are not just massively less productive for me, they are actively obstructive to my productivity by hiding information and getting in the way of me shaping the next-token prediction streams for maximum value creation.
-----
At a higher level, I see an important bifurcation happening, which I break down in this way:
- eyes-off agentic - Most tooling I see is trending toward more automated loops and toward hiding and summarizing information. In this model, the human brain provides the high-level goals and lets the AI token stream do what it wants to, with lower-frequency check-ins that focus on functional value more than form.
- eyes-on agentic - This is what I call it when we *partner* our human brain intelligence deeply into the LLM next-token decisions and the shape of the probability space. It's admitting that code is context, and getting tightly involved in shaping code to shape LLM behavior. This doesn't mean always code-reviewing or always intervening. This means human brain modeling of what the LLM will do and is doing, in order to catch probability space drift and code/context shape drift early. To keep velocity and parallelism highest, by minimizing the length of time that absolute nonsense LLM probability space word salad affects the surviving codebase.
By measurement - I believe I'm conservatively 50x more productive (quality and velocity) using eyes-on agentic.
My productivity value from eyes-on is not abstract. I quantify it, and it's accelerating. (graphs below)
Tom has pointed out many ways that my workflow may be something that most people simply cannot manage, and I accept that possibility. You can't stick a 16-year-old new driver in an Indy car and expect anything other than disaster.
I run and read/skim 2-6 simultaneously running and visible LLM chat sessions. I suspect not a lot of people can keep up with this. However, I assert that the value many are leaving on the table by being eyes-off is, as I stated above, the chance to "get our human brain intelligence deeply woven into the LLM next token decisions" at whatever bandwidth they can handle.
My custom-coded LLM chat interaction is WYSIWYG Markdown with no "eliding" of LLM tokens, and I turn off "thinking" in coding sessions because I find it counterproductive and annoying (I've been told GPT Codex thinking adds value, but I have not tried it enough). I don't use canned "skills", but instead I write, refine, and hone custom implementation specs per task. My system prompt for coding is BARE - tool use definitions and 8 lines of general framing. I don't use system prompt "rules", as I find they pollute the context and do more harm than good in the long haul. (I'm using methods other than prompt "rules" to get adherence to pattern requirements.)
I have a *strong* preference for Opus 4.6, because it is on the razor's edge of smart enough to do the work and sycophantic enough to listen to me. It's easier for me to manage the sycophancy than it is for me to suffer the wasteful, long-winded, opinionated arguing and deviations from my instructions I get from Opus 4.8 and Gemini 3+. I admit I have not used GPT Codex enough.
I have not been able to code with Fable yet. My long 2-hour chat with Fable suggests that it will be notably better at eyes-off autonomous coding work, but I did *not* enjoy the design session I had with it. Fable still takes too many turns of argument before it stops answering with model bias and starts reasoning from facts and ground truth. When I already know the solution or algorithm I want to use, I don't have much energy for arguing with a word-calculator about it.
-----
The "eyes-off" coding ceiling is certainly improving...
My non-coder wife built an entire mobile web babysitter organizer app by herself! (And it's good!) She is more of the magic there than she takes credit for (she has a CS degree and did Y2K programming in COBOL in her 20s before she shifted to sales). I'm blown away. It's also a categorically bounded type of work. Also, Opus dumped everything in a 5500-line JSX file that would likely have eaten itself eventually if I hadn't intervened.
I set up our 12-year-old son Jack with Claude Desktop and a pattern for working on 2D web games. He messed around making his version of a side-scroller category called "gravity flip obstacle course". Then he wanted to do AI Unreal Engine vibing. That is not viable right now. I did some research. I experimented with Claude Code and Godot. GDScript was a full-fail mess, but I was already intending to do Godot C#. I pivoted and got him into a Claude Code chat that converted his 2D gravity flip into Godot C#. Claude Code is categorically better at Godot C# than at GDScript, and better than trying to vibe-code Unreal.
I sat him down at it, walked away, and 1.5 hours later he had a blocky Godot knight in a 3D viewport running around an undulating terrain of sand, with a comically proportioned medieval castle "town" where he could walk up to a vendor, buy a sword, and swing it by clicking the right mouse button. This is a 12-year-old who can't program, and can barely do bounded programming class "puzzles". That is insane. That is also not becoming a product without skilled intervention, but the *learning* happening there is the closest thing to Diamond Age and the "Young Lady's Illustrated Primer" I've seen.
Part of the magic in both of these cases is putting the AI into a space where it can succeed, and keeping it in that space.
If one doesn't do this, it's more like going to the roulette table and betting on black.
------------
Below is a graph of code+markdown lines contributed to my coding projects since August 2025.
Of course we can all admit that "lines of code+markdown" is a narrow metric. It doesn't tell you anything about work product. And so, in that respect we merely have to decide how much to trust the conversation and the presenter.
I stopped using any other AI coding tools April 9th 2026, because I find my harness (code-named AstroNMCL) more productive and more pleasant. I'm not writing a tool to write a tool. I'm writing a tool to produce software and creative output. And I'm producing it.
10 mos - August 2025 to June 2026 - 1.1M lines (110k/mo)
6 mos - December 2025 to June 2026 - 750k lines (125k/mo)
4 mos - Feb 2026 to June 2026 - 550k lines (137k/mo)
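The per-month rates above are just total lines divided by months. A minimal Python sketch of that arithmetic, using only the totals and month counts stated above (the function and variable names are my own, for illustration):

```python
# Totals and month counts copied from the figures stated above.
periods = [
    ("10 mos", 10, 1_100_000),
    ("6 mos", 6, 750_000),
    ("4 mos", 4, 550_000),
]

def lines_per_month(total_lines: int, months: int) -> float:
    """Average lines contributed per month over the period."""
    return total_lines / months

rates = {label: lines_per_month(total, months) for label, months, total in periods}

for label, rate in rates.items():
    print(f"{label}: {rate / 1000:.1f}k lines/mo")
```

The shorter and more recent the window, the higher the monthly average, which is what "accelerating" means here. The 4-month figure works out to 137.5k/mo, which the list rounds down to 137k.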
Below is a similar graph of "Story Fiction" Prose Lines I've conducted / co-written over the same timeframe.
It sits at about 2.7M words as of 6/15.
What is the quality? I like to say better than Twilight, worse than Hemingway. The key thing here is that this isn't random chunky paragraphs out of an LLM, and it isn't "write this chapter for me". This is collaboratively constructed long-form novel fiction as a constructed artifact - almost like software, produced from world+character+goal+harness design. My prompts and the harness itself are designed to scaffold the LLM's needs when constructing fiction, and I'm deeply involved in the next-token prediction.
If you made it this far, thank you. I appreciate it.
What my 3-monitor setup looks like during a typical session: