“Conceding that AI is doing more than just predicting the next word doesn’t actually mean you need to become an AI booster.”
I appreciate you saying this. I share your frustration with the stochastic parrot crowd, but I’ve also been seeing a lot of takes the past few weeks that act like disproving the stochastic parrot thesis means that the most extreme booster ideas are true. For example, that we’ll get superintelligence in 5-10 years (or less).
“For almost all tasks that can be done on a computer, you will make better predictions about Claude’s behavior if you predict that Claude will do what a very smart, very motivated human would do than if you do anything else.”
This strikes me as hyperbolic. Claude Code is shockingly good, but it can’t really operate GUI software, and I wouldn’t trust it to operate new software outside its training data without extensive documentation and hand-holding. Also, would you give Claude Code a credit card and let it book flights and hotels for you for a trip?
Even in the area of programming, Anthropic has over 100 engineering job openings. Clearly it has real limitations.
Part of the disconnect in AI discourse is that if you don't work in a job that involves code in some way, most of your exposure to AI is (and I'm loath to use this term, but I can't think of a better one) the slop it creates. Image gen, video gen, writing.
If you are a woman on the internet and perverted freaks are turning your wedding photos into porn using Grok or another AI, are you really going to be convinced it's good because Claude helped someone automate a lot of testing? Sometimes I feel like I'm going insane watching people I used to respect online unironically doing the dril tweet: sure, AI is being used, now, not in some hypothetical future, to do actual evil, but we can't ask any better of these companies or legislate limits because of the possible productivity gains.
So until the AI labs do something to address the deluge of slop and real harm, I think we are going to continue to see people negatively polarized against AI tools. I'm not one of them, because I use it in my work and the benefits ARE real for that, yet I also know women who have been "grok'd".
It’s fine to be negatively polarized against AI *morally* but that shouldn’t be a reason to *factually* downplay the things it can do. Ought isn’t Is, and all that
I really don't see that going away. Once that technology has been figured out, there will be some company somewhere that allows you to do it.
Massively reducing the amount of it would still be nice, even if eliminating it entirely isn't possible!
Agreed, I just don't know how that would work. I guess you can at least get it off the most popular sites.
On the other hand, it's hard to do without nerfing regular uses. For example, a standard procedure is to make an image in one program and then upload it to another to turn it into a video.
But many of the American ones, like Sora, really limit this. The Chinese ones don't.
So guess what a lot of people use.
Actually, with OpenClaw or Moltbot or whatever it's called now, I think you actually can give it a credit card and have it book a flight or hotel.
Would you trust it to do this correctly, though?
Heck no! But I've never used the thing, so that's just an answer based on my priors. Other people *are* using it for such things based on what I've read and watched. No idea what the general user confidence level is though.
I think I'd trust it to the same degree as I'd trust a smart person I didn't know - they'll almost definitely find a flight that meets the criteria, but I'd want them to briefly check in with me before buying in case I had some other criteria that I'd forgotten to spell out to them, so the ideal flow is "I describe what I want -> they find it and describe it to me -> I say 'yes go ahead'", just like if I had a human assistant who was very capable but who I didn't know well.
"but it can’t really operate GUI software"
I don't see that lasting long. If I can give it a picture and tell it to take out the blue flower, it will do so.
I can also do that with a video. Check out what's being done with Kling 3.0 or SeeDance 2.0.
If it can understand visuals that way, it seems a short distance to operating GUI software (if given permission).
This post was a weird flavor of aggressive ignorance. Harper is correct. All LLMs in all stages of production are next token predictors. Fine tuning shifts the distribution of predictions, the hidden portion of the prompts shifts them more, but almost all the information in the LLM is embedded in the base model in any case. This is not some controversial take; it's an objective fact about what LLMs are, and your objections seem to boil down to "it doesn't *feel* like next token prediction to me when I use it, so obviously it's not!" Well, yeah. AI companies invest a lot of resources to make sure it feels like you're talking to a mind just a little different from yours. That's a big part of why they don't show you all the prompting text around yours, so the next token prediction *sounds* like a character speaking to you. Until jailbreaking destroys the illusion.
Your experience with the vaunted Claude is a lot more positive than mine, although I used their model through aider. I asked it to add a major feature to a project. I caught several bugs in the commit, but more subtle ones took forever to track down and it was more or less useless for helping. In the end it was at best break-even vs doing the whole thing by hand, and only because I'm not familiar with asyncio. I'm sure that if what you need is some trivial app or mod of a kind that is well-represented in its training data and that won't need to be maintained, it seems awesome. But if it's so great at generating software, where is the flood of software? If it's this amazing boost to productivity, where's the production?
Lastly, I have to laugh at this notion that writers are all pooh-poohing LLMs, except for Kelsey Piper, bravely swimming against the tide. Look, writers are *exactly* the people most primed to be awed by LLMs. My inbox every day is full of people announcing WHAT A BIG DEAL AI is, how YOU'RE ALL FOOLING YOURSELVES, same as it has been for the last three years. To be honest the volume and repetition sometimes feels like a coordinated propaganda campaign.
After RLHF, the objective it was trained on is not next-token-prediction, and yes, the fact that it behaviorally doesn't act like that was the objective it was trained on is one way to satisfy yourself that that's not the objective it was trained on.
> if it's so great at generating software, where is the flood of software? If it's this amazing boost to productivity, where's the production?
In the last month, I have personally interacted with a dozen new software projects that were fully AI-generated and several more whose development was massively sped up by AI. Most of the fully AI-generated ones were small: a friend develops a game and we all play it, I create an educational tool for a niche use case, someone makes a minigame to test out a new mechanic they're using in a larger project. But also I've spoken to a lot of programmers in the Bay Area who routinely use AI for massive speedups at their jobs. May I ask when you tried Claude Code? A quick Google suggested that Aider uses or at least recently used Claude Sonnet 3.5, which would be completely useless if so. If the above were your results from Opus 4.6, they are far worse than any others I've heard about.
No, it's just next token prediction on a biased set. If you fine tune on a set of recipes, it will be more likely to predict recipes. If you fine tune on answers that have been selected by humans to be typical of a helpful assistant, it will be more likely to predict text characteristic of a helpful assistant. Next token prediction is a structural property - the structural property! - of what these models *are*. It can't be changed by fine-tuning.
"In the last month, I have personally interacted with a dozen new software projects that were fully AI-generated and several more whose development was massively speeded by AI. Most of the fully AI-generated ones were small: a friend develops a game and we all play it, I create an educational tool for a niche use case, someone makes a minigame to test out a new mechanic they're using in a larger project. But also I've spoken to a lot of programmers in the Bay Area who routinely use AI for massive speedups at their jobs."
You can see why this is unconvincing, right? You "personally interacted with software", meaning your friends' hobby projects, and lots of your AI booster friends are telling you about their 10x productivity. But we have platforms where someone can directly turn the ability to generate code into money, like Steam. And yet we don't see the production of software that real people will spend real money on: https://substack.com/home/post/p-172538377. What we do see is a flood of unusable slop in the open source space. Garbage AI PRs have become an existential threat to open repos.
I used Sonnet 4. Aider is a wrapper that can use any model with an API, but I feel disinclined to spend 5x on Opus when the consensus is it's just not that much better. Also, every year for the last three years I've been hearing about how *this* year's model is the game changer, and last year's model was garbage, actually.
You obviously have the right to use or not use any model you want, and I understand feeling like you heard too many hype claims to believe any of them, but if you are responding to people claiming "With Opus 4.6, I can do a ton of stuff it could not do six months ago" with an account of how useless AI is, and you're using Sonnet 4, I really think you should be up front about that.
I mean, yeah, if someone did claim that they were specifically using Opus 4.6, I guess I would need to specify that I had used Sonnet instead, which performs... a couple percentage points worse in benchmarks. Did you specify Opus somewhere?
I don't trust benchmarks at all (too easy to game/target), and have found in practice that Opus is very noticeably better than Sonnet.
There's a reason so many skeptical people (including me!) have started doing vibe-coding stuff in the last month or two, and didn't last spring when Sonnet came out.
You’re missing the other important thing they do in modern “reasoning” models, which Kelsey didn’t mention. If you get the machine to talk to itself about the steps in solving a math problem, it’s much more likely to get the correct answer than if you get it to just guess the answer. It’s obvious that this should work, because predicting the first step in getting a solution is easier than predicting the answer, and predicting the second step is easy once you’ve seen the first, and so on.
But they also give it some time to “practice” solving more problems, and reinforce the kinds of predictions that eventually led to answers that can be verified by other means, while downweighting the ones that end up with wrong answers. This actually means that they end up producing types of sequences of words that are nowhere in the training data - sometimes because they were silently in the heads of people who wrote down final answers, and sometimes because random wandering found good strategies that turn out to be useful. This makes it a lot more like AlphaGo and other game AIs that discover their own strategies through self play.
You can’t do that with text off the bat, but if you start by learning to predict human text, then you get a facility with text that can be tuned in this sort of way to learn strategies that are nowhere in the training data.
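To make that reinforcement step concrete, here is a deliberately tiny sketch (the strategy names and success rates are made up; this shows only the upweight/downweight logic, not any lab's actual pipeline):

```python
import math, random

random.seed(0)

# Toy illustration only: two ways of producing an answer, an external check on
# the final answer, and a REINFORCE-style update that shifts probability toward
# whichever behavior keeps getting verified.
logits = {"blurt_answer": 0.0, "reason_step_by_step": 0.0}
P_CORRECT = {"blurt_answer": 0.2, "reason_step_by_step": 0.8}  # assumed success rates

def softmax(d):
    z = {k: math.exp(v) for k, v in d.items()}
    total = sum(z.values())
    return {k: v / total for k, v in z.items()}

def sample_strategy():
    probs = softmax(logits)
    keys = list(probs)
    return random.choices(keys, weights=[probs[k] for k in keys])[0]

LEARNING_RATE = 0.1
for _ in range(2000):
    strategy = sample_strategy()
    verified = random.random() < P_CORRECT[strategy]   # stand-in for checking the answer
    reward = 1.0 if verified else -1.0
    probs = softmax(logits)
    for k in logits:
        # REINFORCE: raise the log-probability of the sampled strategy in
        # proportion to reward, lower the alternatives to stay normalized.
        grad = (1.0 if k == strategy else 0.0) - probs[k]
        logits[k] += LEARNING_RATE * reward * grad

print(softmax(logits))  # ends up putting most of its probability on "reason_step_by_step"
```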
I'm not quite sure what distinction you're trying to draw here. All generative models, from the lowly autoencoder on up, produce data that is not present in their training sets. That's the distinction between a model and a dataset. When you ask a base model for the answer to a math problem, sometimes it will try to spit out a raw answer, and generally do badly. Sometimes it will instead spit out a text sequence similar to a worked problem in a textbook, and then it is more likely to succeed for the reasons you mentioned. Even if you perform very stupid reinforcement learning, and just reinforce entire correct answers and punish entire incorrect answers, you will increase the frequency of the worked-problem style answers from the model.
The influence of the training data is still visible in the model's success though. They're much more likely to get the right answer to a*x+b if a=1.8 and b=32 (i.e., the Celsius-to-Fahrenheit formula), for example.
I don't think "next token predictor" is any more useful of a way of looking at LLMs than "next muscle impulse selectors" is for humans. Both are technically true at the base level, but neither is describing where any of the interesting stuff is going on.
Generating a token is just a means to an end. The insanely complicated web of relationships and correlations that happen in the billions of weights is the part that's mind-blowing, and that's the part that we should be evaluating. Not the token part.
Unfortunately, I think this situation is one where the technical claims are actually the best way of understanding the issue. Claude is still an autoregressive language model, which generates words based on the sequence of prior words, including the ones it generated. And it's still doing it via standard machine learning ideas, which are about making it statistically improve on various goals. And it starts, as you say, with a model whose goal is accurately predicting the next word in a large corpus including much of the internet. But the key step is that while training on the whole Internet and training to be a helpful chatbot are both done using machine learning and statistical predictions, they are in tension, and the "post training" makes the models much worse at predicting the internet.
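For readers who want to see what "autoregressive" refers to mechanically, here is a minimal sketch of the generation loop, with a toy bigram table standing in for the neural network (nothing here is Claude's actual code; it just shows each generated word being fed back in as context for the next):

```python
import random
from collections import Counter, defaultdict

random.seed(1)

# A toy bigram table plays the role of the model; real models condition on the
# entire prior sequence, but the loop - predict, sample, append, repeat - is the same.
corpus = "the cat sat on the mat and the cat slept on the mat".split()
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_token_distribution(context):
    c = counts[context[-1]]                      # toy model: only the last token matters
    total = sum(c.values())
    return {tok: n / total for tok, n in c.items()}

tokens = ["the"]
for _ in range(8):
    dist = next_token_distribution(tokens)
    vocab = list(dist)
    tok = random.choices(vocab, weights=[dist[t] for t in vocab])[0]
    tokens.append(tok)                           # the model's own output becomes context
print(" ".join(tokens))
```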
However, I think it's possible to over-rotate on these differences. Much of the intelligence of modern language models is already there in the base model. This is the point of the GPT-3 paper, which shows that you can get base models to do lots of intelligent things with appropriate prompting.
I agree that base models have shockingly impressive capabilities if you figure out how to leverage the prompting so that predicting the next token requires being accurate about the world in various ways, but don't see that as in conflict with the observation that no one is talking to base models, and that the models everyone is talking to are extremely behaviorally different from base models, and that you won't understand present-day models well at all if you are trying to adopt the 'spicy autocomplete' framing (and if you tell other people this, it mostly makes them substantively worse at understanding what's going on with AI).
I think the base models are impressive enough that if we had never invented RLHF then we'd see big impacts anyway. But more generally, I think the problem is that the "spicy autocomplete" framing is correct while nonetheless being misleading for most people. It's just like the Chinese Room thought experiment, where many people have extremely misleading intuition _because_ they understand how it works in a way that we don't for the human brain.
Hmmm. Most people I've talked to who had heard the 'spicy autocomplete' thing had in fact not realized that after you train a base model you train extensively on a different objective to get the models we know, so I am more inclined to call it false (telling people that the objective trained on was token prediction, when it was not) than misleading (because at inference it's still outputting its likeliest token, you could argue this is technically true, but in a way that makes readers worse off because that's a mechanical explanation useless for predicting it), but there are definitely elements of both.
The issue with this is that "autocomplete" as a framing works even for post-trained models; it's just a description of what autoregressive language models do.
The problem is that "large autoregressive language models can pass all behavioral tests for intelligence" is a fact that many people resist, so much so that telling them how ChatGPT works makes them understand it less well.
I think your description, more or less “a smart, determined, helpful entity”, is an adequate description of the user experience when the AI succeeds, but “stochastic parrot” gives you a better understanding of how the models fail, e.g. hallucinations, which seem to be a structural element of LLMs.
AI is a tricky term because it elides so many different techniques and user interfaces. Statements that are true in one AI domain aren’t true in another, but all AI domains are unhelpfully packaged together in discourse. I think that this has led to a lot of talking past each other in AI discussions.
Hallucinations are much much rarer in the latest generation of models. With careful prompting, basically gone. What makes you say that they're a structural element of LLMs?
My thought is that “hallucination” is better described as “confabulation” and it’s a structural element of systems that can usefully extract information that goes beyond what is definitely entailed by the inputs.
I keep noticing new things every time I re-read “Computing Machinery and Intelligence”, but one of the objections Turing describes is the idea that “machines can’t make mistakes”. After the flippant reply (“is that a problem?”) he goes on to note that if you ever were able to get a machine to make useful inductive predictions on the basis of past observations, these would in fact sometimes go wrong. And I think “hallucination” or confabulation is an example of how this happens in all intelligences - you can’t directly store every single fact you’ve ever encountered, so you store compressed representations of them, and when you call them to mind, you usually get good facts back, but sometimes you put them together in incorrect ways. If you do it right, this sort of reconstruction will often get at information that is real but was never stored, but sometimes it goes wrong and we get the Mandela effect, or false memories of seeing Bugs Bunny at Disneyland as a kid, or confabulated explanations of why one voted for a particular political party, or misremembered book titles. But eliminating these things would only happen if we eliminated all the half-remembered things that are correct.
Much rarer != gone.
I think hallucinations are structural because LLMs are essentially just dancing around a high-dimensional lexical vector space, and I have yet to see/read about a mechanism that prevents them from dancing into a nonsensical or erroneous part of the space, merely one that makes it very unlikely. This is fine for a lot of applications, but it’s different from saying “it will not happen” or “it will happen at degree C with probability P.” Uncertainty quantification is important, and I haven’t seen anything persuasive that says we actually understand the bounds and limits on these tools.
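One way to see why a hard guarantee is elusive: each step samples from a softmax over the vocabulary, and a softmax gives every option strictly positive probability. A toy sketch with made-up logits:

```python
import math

# Toy logits for three possible continuations; a real vocabulary has ~100k entries.
logits = {"plausible_fact": 9.0, "hedged_statement": 6.0, "confident_nonsense": -4.0}

def softmax(d):
    z = {k: math.exp(v) for k, v in d.items()}
    total = sum(z.values())
    return {k: v / total for k, v in z.items()}

probs = softmax(logits)
print(probs["confident_nonsense"])   # roughly 2e-6: tiny, but never exactly zero
# Sampling millions of tokens a day, events at this probability still occur.
# Better training and decoding tricks push the rate down, but "provably never"
# would require guarantees about the logits themselves, which we don't have.
```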
From a robustness perspective, you basically can’t make guarantees that a hallucination won’t happen the way we can make provable statements about the operation of Dijkstra’s Algorithm or the mechanical behavior of a bridge under load. I would not use an LLM in a safety critical application for that reason.
Now, none of this is to dispute the fact that LLMs work much better than they did two years ago, and are useful, perhaps in a widely economically impactful sense.
I find myself in the camp of “It definitely is a stochastic parrot, but it turns out there’s so much latent structure in language that that can take you very far.”
The most promising LLM approaches imo are the ones that couple it with deterministic code-gen. I suspect this is why AI agents work, but not as well on UIs: APIs and bash scripts are regimented in a way that makes them more legible to LLMs, and enforces some amount of structure and good behavior.
“AI is fancy auto-complete” gives me the same energy as “Love is just a series of chemical reactions firing in your brain.”
I mean like, sure, I guess there’s a mechanism by which the emergent property arises. Still the property is very interesting and mysterious!
I really like and basically agree with this piece, but I find it vaguely dissatisfying.
I think there is something true about the claim that it is extremely fancy autocomplete, and that this differentiates it from human intelligence. The “training on a pleasing answer” uses the same basic mechanisms as training on internet text, so it (kinda?) is autocomplete, in an orthogonal or even opposed way, by a sort of analogy. One could precisify this by non-analogically specifying the machine learning structures involved in both, and probably at this (comparatively) late date in history we should.
On the other hand, people are rightly interested in what differentiates AI from human intelligence, and reductionist types who say “this is also how neurons work” are just as maddening, and are just as technically-correct-but-misleading, as the stochastic parrot types. To me at least.
There is definitely something interesting and challenging about it turning out to be the case that all other intellectual tasks are substantially entangled with prediction, so that you can "just" create a very powerful predictor and then train it to do basically anything and if it's good enough at predicting, it will probably be able to use its talent at prediction to pick up the other tasks too.
I have found something interesting about the takes that maybe what humans are doing is on some level just surprise-minimization or prediction too; I don't know if they're correct, or how we'd tell, but 'prediction turns out to be an extremely general cognitive machinery' ought, to my mind, to make us more interested in how much other intelligent minds like ours are doing prediction.
Prediction from something like understanding or principle is different from prediction by association or regression. They are not unrelated, and they can each approximate the other in some contexts. But they have different forms, different failure modes, and afford different powers. </context inappropriate philosophical assertion>
Anyway, it seems like we might have different instincts about how to think about this sort of claim. Which explains why I’m not naturally as bothered by Tyler types, and I’m glad your piece nudged me in the direction of being more bothered.
One of the things that makes me the saddest about public discourse around AI is when the comments on an article do exactly the kinds of things / make the same kinds of errors outlined in the article. It's quite remarkable how prevalent this is (see, e.g., a commenter here dismissing claims about Opus 4.6 because they used Sonnet 4 a while back).
It really makes me wonder if it's possible to have a reasonable discussion about this topic.
A mental model I have for current LLMs is:
1) lossy memorization of the entire training dataset
2) really good embeddings + lookup
3) a grab bag of relatively simple learned algorithms and syntactic transformations
What would be going on is mostly memorization with a small but extremely impactful sprinkle of “thinking” on top. Inside the weights there would not be very complex computation or simulation happening. Then you can RL test time reasoning on top of this.
It’s not a formal or falsifiable claim but I feel it can account for current capabilities.
I don’t mean this as a deflationary claim at all, the point is even if you grant “stochastic parrots”, it’s not hard to believe one can build disruptive powerful capabilities on top of that.
And of course once you can RL test time reasoning, that can start developing all sorts of powers that are nowhere in the initial training set, just like AlphaGo can come up with Go strategies that are nowhere in the training set.
It’s not obvious at all how human thinking works. Probably there is some part of it that is like this. Quite possibly there are other sorts of capacities we have that make differences too. But we don’t push this kind of RL to the max.
AI researcher here.
I tend not to comment on online discourse about AI because it's usually a losing battle, for all the reasons you could imagine. However, given that I have been enjoying this publication for the past couple of months, I wanted to clear up a couple of misconceptions in the article and the comments here.
1. Demonstrably, most LLMs you interact with on a daily basis are mechanistically auto-regressive next-token predictors. That means the model will take in every token you or it previously produced, either via caching or via input (or compaction), and use that to generate the next token, and so on. However, not ALL LLMs do this. Some do what's called multi-token prediction (MTP) and actually predict entire blocks of tokens at once, but still in an auto-regressive manner.
There are even LLMs like Mercury from Inception Labs which use diffusion to iteratively de-noise the entire latent tensor at once rather than predicting one token after the next. (How exactly this is done, whether all at once or by de-noising interleaved with token-chunk prediction, is still an open area of research.)
2. After pre-training on a pure text corpus, demonstrably, the loss function for all LLMs changes from being purely based on p(x_{t+1} | x_0, x_1, ..., x_t) (i.e. next-token prediction) to something remarkably more complicated. Multiple stages of both instruction fine-tuning and reinforcement learning in multiple different environments take place, where the reward function isn't whether or not you accurately predict the next token, but whether you correctly solve the objective. If you want, I recommend reading this paper on Nvidia Nemotron 3: https://arxiv.org/pdf/2512.20856. It is a research paper, so some background in AI and generative modeling is required, but it is a pretty good look into the types of strategies employed at top research labs.
All of this is to say, I land mostly on the side of Kelsey here: the LLMs we use today are much more than next-token predictors and can be remarkably good at human-level tasks that require judgement. However, to say LLMs aren't literally predicting the next token mechanically is just incorrect. The most important question is: does it matter that most LLMs produce outputs token by token?
There is no innate computational reason that we don't use MTP or some more complicated scheme to predict more than one token at a time to solve problems via LLMs. It's just that, until now, these next-token predictors have been remarkably good at solving long-horizon (in terms of tokens and complexity) objectives. And it's mostly because we don't train them to simply do next-token prediction.
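To make the contrast in point 2 concrete, here is a toy sketch of the two kinds of objective (illustrative numbers only, and not the Nemotron recipe): pretraining scores each position by the probability assigned to the token that actually came next, while RL-style post-training scores a whole rollout by whether it solved the task:

```python
import math

def next_token_loss(probs_on_true_tokens):
    # Pretraining objective: average cross-entropy, -mean log p(x_{t+1} | x_0..x_t).
    return -sum(math.log(p) for p in probs_on_true_tokens) / len(probs_on_true_tokens)

def rl_objective(task_succeeded, reward_success=1.0, reward_failure=-1.0):
    # Post-training objective: one scalar for the whole rollout; the text is not
    # compared against any corpus, only the outcome is checked.
    return reward_success if task_succeeded else reward_failure

print(next_token_loss([0.9, 0.2, 0.6]))   # ~0.74 nats: penalized for surprise at each token
print(rl_objective(task_succeeded=True))  # +1.0: only whether the objective was solved matters
```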
In any case, regardless of what humans are doing under the hood when we talk, we still do have to produce our output word by word, in order! Students learning a foreign language might figure out the deep structure and then move the words around to get them into the correct word order for the language they’re answering a homework question in. But fluent speakers just have the words come out one by one, without having subvocalized the later words in advance!
I ran this (very good!) piece through an AI skill I created that separates fact from rhetoric, then rewrites the rhetoric from several different perspectives to illustrate how the same facts can be interpreted differently.
1. “obtuse and dishonest” — this phrase does more persuasive work than any technical demonstration in the piece. it converts a factual dispute about training terminology into a moral judgment about character. remove it and the piece loses 40% of its momentum. the syntactic choice matters too: “I find it obtuse” makes it personal testimony (harder to refute) rather than a claim (easier to test).
2. the GPT-2 anecdote as implicit metonymy — piper demonstrates what a BASE model does, then shows what an INSTRUCTION-TUNED model does. the factual content is: these are different. the rhetorical content is: anyone who describes the second using language appropriate to the first is lying. but the actual question — whether “instruction-following optimization over a next-token-predicting substrate” is philosophically closer to “understanding” or “sophisticated pattern completion” — is neither asked nor answered. the demo FEELS like it resolves this, because the capability gap is vivid. it doesn’t.
3. the profession-correlated-belief move — “socialists are mostly convinced AI is meaningless hype except the ones with data-analysis jobs, who admit it’s real.” this is an extraordinary piece of rhetoric. “admit” presupposes the conclusion (it IS real; the question is whether you’ll admit it). the implication is: exposure to AI converts skeptics, therefore skepticism is ignorance. but this is equally consistent with: exposure to AI creates familiarity bias, sunk-cost reasoning, and identity investment. the word “admit” does the entire job of resolving this ambiguity in piper’s favor.
Fair enough, Claude! But I'll note I have reason to say that the fact people who use modern AI tools think they're a big deal isn't sunk-cost reasoning: a year or two years ago, I tried AI tools like this myself and they weren't that useful - extremely cool, but not useful - and I talked to lots of other people who said the same thing. Now they are.
So I think people are very capable of saying "this is cool, but not useful" after trying an AI tool - and there is tremendous appetite for these takes. But since the latest round of model releases, I haven't seen that reaction, and the 'AI is useless' takes are mostly coming from people who tell me specifically that they don't use it, or tried it six months ago and were disappointed.
Now, maybe AI models at a certain level of quality are easy to reject as useless, and AI models just a little stronger than that seduce you into believing they're useful when they aren't. But I don't think we ought to be totally lost in comparing the hypotheses "they're useful" and "people who use them become wrongly convinced of their usefulness".
I completely agree with you. I find the tools incredibly useful in my personal life and at work. I just don’t think we have very good measures of “usefulness”.
Like, one measure of usefulness could be how many lines of code I’ve written, or how many projects completed. But if only I use the project, how useful was it? Another measure of usefulness is revenue: if someone buys it, it must be useful to them in proportion to the price they paid. But this just transfers the responsibility to measure usefulness to the buyer.
I am finding that I get a lot of “insight” from using AI, in the sense that it gives me information that is relevant to decisions. But are those decisions better than the counterfactual decisions I would otherwise have made? By what metric? Should I run a controlled experiment on my own decision-making, and track outcomes?
The problem is that if I just declare “no need to measure; this is obviously useful”, then no one on the outside can tell whether I’ve been seduced or whether I’m actually being helped.
I’m convinced we’ll see the usefulness in the GDP numbers eventually, and then it will be undeniable. But until then, I think AI users are going to deal with a lot of social judgment from people who judge usefulness subjectively in a different way.
I create music the (kind of) old-fashioned way (I play a keyboard or electronic drums into the computer, then I mess around with it a bunch to get the songs I want). I just recently started using AI to generate videos to go with those songs for YouTube.
It's really freaken amazing. And I'm using ChatGPT to help me design the prompts and understand what works and why. Yes, ChatGPT can definitely teach you stuff.
I'm fully convinced that over the medium term (say 20-30 years), 90%+ (and probably much higher) of all jobs will be automated away between AI and humanoid robots.
Everything depends on your definition of "autocomplete", the reason I agree with you and not the pessimists is that a sufficiently powerful "autocomplete" eventually must have an underlying accurate world model. To finish the following sentence in the best way "Eventually, empirical tests confirmed what before had only been a theory and scientists now knew the way to reconcile general relativity and quantum mechanics was______________" the model will need to actually solve the physics. If you still want to call that "autocomplete"... whatever, I won't argue definitions.
What's cool, though, is modern AI isn't just a giant regression model with no mechanism to actually understand the underlying world. Instead, it uses an architecture that mimics the human brain in lots of ways, and it appears that eventually it is able to 'grok' the underlying world model and make predictions that way instead of just based on word-frequency stats.
Maybe I'm wrong about the AI training, but that seems like a more important distinction than the reinforcement learning part of the process. (Wouldn't modern base models, even without the RL, have all the amazing capabilities, but just be more likely to spew racist propaganda if asked?)
Also, for fun I iterated on your 1880s prompt and after a while this is what Claude came up with:
Gosh, I used to read Kelsey back in her Tumblr days. And I was so excited to follow her to Vox, and yet I never heard the ring in her voice the way I did back then. Until now. Bravo for an incredibly thoughtful and incredibly well-written article. I'm humbled at how this piece combines nuance with conviction and force.
There’s a bit of a limit to the amount you can use, but it’s plenty to demonstrate it to students. (I assign students in my AI Literacy class to do a few things here to understand the difference between real LLM autocomplete and modern chatbots.)
If LLMs truly didn't understand anything, it would be easy to demonstrate it with rigorous benchmarks, for example where half of the questions are kept hidden behind an API to make sure LLMs haven't memorized the answer.
François Chollet followed this kind of approach with ARC-AGI, and kudos to him! But most skeptical researchers didn't make the effort of formalizing their position into falsifiable empirical claims. And now that the evidence is overwhelming, some of them are staging a motte-and-bailey retreat to the claim that LLMs are not conscious, to pretend that they were right to begin with. And since we have no rigorous way of determining what is conscious or not, it's the ultimate motte.
Claims that LLMs are just stochastic parrots were predictably going to age poorly, and unlike with climate misinformation, people can just test it directly by themselves. Doubling down on denial is just going to make the hangover even worse.
It’s interesting that you brought up writing in verse, because it seems like something that a computer should be good at (or at least capable of) that it’s shockingly bad at. A month or two ago I was reflecting that movie titles are more cool and evocative when they’re in trochaic tetrameter, like “Teenage Mutant Ninja Turtles” and “Mighty Morphin Power Rangers”, so I asked Gemini to come up with more examples of movies whose titles were in trochaic tetrameter, and try as it might, it couldn’t do it. Sometimes not even close. Not only could it not recognize the stress patterns much of the time, it would also pick movies that had the wrong number of feet. While I was trying to coax it in the right direction, I kept independently coming up with examples of my own. The only real hit was “Avatar: The Way of Water”.
So instead of finding examples, I figured maybe it could create new ones. So I asked it to come up with new titles for the films of the MCU that would fit the meter and sound like plausible titles, e.g. “Adam Warlock and the Guardians” instead of Guardians of the Galaxy, Vol. 3, or “Ant-Man and the Yellowjacket” for Ant-Man. It could not do this at all. I was like, come on, “Hulk vs Abomination” is right there! You can’t put together “Thunderbolts: The New Avengers” or “Captain Marvel and the Marvels” on your own? Even when I clarified what I was looking for, and gave it some examples, it couldn’t come up with the pattern that I noticed right away, which is that you can get pretty far with just [hero name] [and/and the/versus] [villain name].
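For what it's worth, the scansion task itself is mechanical enough that a short script can do a rough version of it. A heuristic sketch using the CMU Pronouncing Dictionary via NLTK (the leniency rules here are my own guesses, and dictionary stress marks on little words like "of" make it stricter than a human scanner):

```python
import nltk
nltk.download("cmudict", quiet=True)
from nltk.corpus import cmudict

PRON = cmudict.dict()

def stress_pattern(title):
    digits = []
    for word in title.lower().replace(":", " ").replace("-", " ").split():
        prons = PRON.get(word)
        if not prons:
            return None                                  # unknown word: can't scan it
        digits += [ph[-1] for ph in prons[0] if ph[-1].isdigit()]   # stress digit per vowel
    return digits

def is_trochaic_tetrameter(title):
    d = stress_pattern(title)
    if d is None or len(d) != 8:                          # needs exactly eight syllables
        return False
    strong = all(d[i] in "12" for i in range(0, 8, 2))    # beats 1,3,5,7 carry stress
    weak = all(d[i] in "02" for i in range(1, 8, 2))      # beats 2,4,6,8 stay unstressed
    return strong and weak

print(is_trochaic_tetrameter("Teenage Mutant Ninja Turtles"))   # True
print(is_trochaic_tetrameter("Guardians of the Galaxy"))        # False
```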
This experience inspired me to compose the following poem (to the tune of Harder, Better, Faster, Stronger)
“Conceding that AI is doing more than just predicting the next word doesn’t actually mean you need to become an AI booster.”
I appreciate you saying this. I share your frustration with the stochastic parrot crowd, but I’ve also been seeing a lot of takes the past few weeks that act like disproving the stochastic parrot thesis means that the most extreme booster ideas are true. For example, that we’ll get superintelligence in 5-10 years (or less).
“For almost all tasks that can be done on a computer, you will make better predictions about Claude’s behavior if you predict that Claude will do what a very smart, very motivated human would do than if you do anything else.”
This strikes me as hyperbolic. Claude Code is shockingly good, but it can’t really operate GUI software, and I wouldn’t trust it operate new software outside its training data without extensive documentation and hand holding. Also, would you give Claude Code a credit card and let it book flights and hotels for you for a trip?
Even in the area of programming, Anthropic has over 100 engineering job openings. Clearly it has real limitations.
Part of the disconnect in AI discourse is if you don't work in a job that doesn't involve code in some way most of your exposure to AI is (and I loathe to use this term but I can't think of a better one) the slop it creates. Image gen, video gen, writing.
If you are woman on the internet and perverted freaks are turning your wedding photos into porn using grok or another AI are you really going to be convinced it's good because claude helped me automate a lot of testing? Sometimes I feel like I'm going insane seeing people I used to respect online unironically doing the dril tweet where like sure AI is being used, now, not in some hypothetical future to do actual evil but we can't ask any better of these companies or legislate limits because of the possible productivity gains.
So until the AI labs do something to address the deluge of slop and real harm I think we are going to continue to have people being negatively polarized against AI tools. I'm not one of them because I used it in my work and the benefits ARE real for that, yet I also know woman who have been "grok'd"
It’s fine to be negatively polarized against AI *morally* but that shouldn’t be a reason to *factually* downplay the things it can do. Ought isn’t Is, and all that
I really don't see that going away. Once that technology was figured out there will be some company somewhere that allows you to do it.
Massively reducing the amount of it would still be nice even if eliminating it entirely isn't possible!
agreed, I just don't know how that would work. I guess you can at least get it off the most popular sites.
On the other hand, it's hard to do without nerfing regular uses for it. For example, a standard procedure is to make an image in one program in then upload it to be used to create another as a video.
But many of the American ones like Sora really limit this. The Chinese ones don't.
So guess what a lot of people use.
Actually, with the use of OpenClaw or Moltbot or whatever it's called now, I think you actually can give it a credit card and have it book a flight or hotel.
Would you trust it to do this correctly though?
Heck no! But I've never used the thing, so that's just an answer based on my priors. Other people *are* using it for such things based on what I've read and watched. No idea what the general user confidence level is though.
I think I'd trust it to the same degree as I'd trust a smart person I didn't know - they'll almost definitely find a flight that meets the criteria, but I'd want them to briefly check in with me before buying in case I had some other criteria that I'd forgotten to spell out to them, so the ideal flow is "I describe what I want -> they find it and describe it to me -> I say 'yes go ahead'", just like if I had a human assistant who was very capable but who I didn't know well.
"but it can’t really operate GUI software"
I don't see that lasting long If I can give it a picture and tell it take out the blue flower it will do so.
I can also do that with a video. Check out what's being done with Kling 3.0 or SeeDance 2.0
If it can understand visuals that way, it seems a short distance to operating a GUI software (if given permission).
This post was a weird flavor of aggressive ignorance. Harper is correct. All LLMs in all stages of production are next token predictors. Fine tuning shifts the distribution of predictions, the hidden portion of the prompts shifts them more, but almost all the information in the LLM is embedded in the base model in any case. This is not some controversial take; it's an objective fact about what LLMs are, and your objections seem to boil down to "it doesn't *feel* like next token prediction to me when I use it, so obviously it's not!" Well, yeah. AI companies invest a lot of resources to make sure it feels like you're talking to mind just a little different from yours. That's a big part of why they don't show you all the prompting text around yours, so it *sounds* like the next token prediction sounds like a character speaking to you. Until jailbreaking destroys the illusion.
Your experience with the vaunted Claude is a lot more positive than mine, although I used their model through aider. I asked it to add a major feature to a project. I caught several bugs in the commit, but more subtle ones took forever to track down and it was more or less useless for helping. In the end it was at best break-even vs doing the whole thing by hand, and only because I'm not familiar with asyncio. I'm sure that if what you need is some trivial app or mod of a kind that is well-represented in its training data and that won't need to be maintained, it seems awesome. But if it's so great at generating software, where is the flood of software? If it's this amazing boost to productivity, where's the production?
Lastly, I have to laugh at this notion that writers are all poo-pooing LLMs, except for Kelsey Piper, bravely swimming against the tide. Look, writers are *exactly* the people most primed to be awed by LLMs. My inbox every day is full of people announcing WHAT A BIG DEAL AI is, how YOU'RE ALL FOOLING YOURSELVES, same as it has been for the last three years. To be honest the volume and repetition sometimes feels like a coordinated propaganda campaign.
After RLHF, the objective it was trained on is not next-token-prediction, and yes, the fact that it behaviorally doesn't act like that was the objective it was trained on is one way to satisfy yourself that that's not the objective it was trained on.
> if it's so great at generating software, where is the flood of software? If it's this amazing boost to productivity, where's the production?
In the last month, I have personally interacted with a dozen new software projects that were fully AI-generated and several more whose development was massively speeded by AI. Most of the fully AI-generated ones were small: a friend develops a game and we all play it, I create an educational tool for a niche use case, someone makes a minigame to test out a new mechanic they're using in a larger project. But also I've spoken to a lot of programmers in the Bay Area who routinely use AI for massive speedups at their jobs. May I ask when you tried Claude Code? A quick Google suggested that Aider uses or at least recently used Claude Sonnet 3.5, which would be completely useless if so. If the above were your results from Opus 4.6, they are far worse than any others I've heard about.
No, it's just next token prediction on a biased set. If you fine tune on a set of recipes, it will be more likely to predict recipes. If you fine tune on answers that have been selected by humans to be typical of a helpful assistant, it will be more likely to predict text characteristic of a helpful assistant. Next token prediction is a structural property - the structural property! - of what these models *are*. It can't be changed by fine-tuning.
"In the last month, I have personally interacted with a dozen new software projects that were fully AI-generated and several more whose development was massively speeded by AI. Most of the fully AI-generated ones were small: a friend develops a game and we all play it, I create an educational tool for a niche use case, someone makes a minigame to test out a new mechanic they're using in a larger project. But also I've spoken to a lot of programmers in the Bay Area who routinely use AI for massive speedups at their jobs."
You can see why this is unconvincing, right? You "personally interacted with software", meaning your friends hobby projects, and lots of your AI booster friends are telling you that about their 10x productivity. But we have platforms where someone can directly turn the ability to generate code into money, like Steam. And yet we don't see the production of software that real people will spend real money on: https://substack.com/home/post/p-172538377. What we do see is a flood of unusable slop in the open source space. Garbage AI PRs have become an existential threat to open repos.
I used sonnet 4. Aider is a wrapper end that can use any model with an API, but I feel disinclined spend 5x on opus when the consensus is it's just not that much better. Also, every year for the last three years I've been hearing about how *this* year's model is the game changer, and last year's model was garbage, actually.
You obviously have the right to use or not use any model you want, and I understand feeling like you heard too many hype claims to believe any of them, but if you are responding to people claiming "With Opus 4.6, I can do a ton of stuff it could not do six months ago" with an account of how useless AI is, and you're using Sonnet 4, I really think you should be up front about that.
I mean, yeah, if someone did claim that they were specifically using Opus 4.6, I guess I would need to specify that I had used Sonnet instead, which performs... a couple percentage points worse in benchmarks. Did you specify Opus somewhere?
I don't trust benchmarks at all (too easy to game/target), and have found in practice that Opus is very noticeably better than Sonnet.
There's a reason so many skeptical people (including me!) have started doing vibe-coding stuff in the last month or two, and didn't last spring when Sonnet came out.
You’re missing the other important thing they do in modern “reasoning” models, which Kelsey didn’t mention. If you get the machine to talk to itself about the steps in solving a math problem, it’s much more likely to get the correct answer than if you get it to just guess the answer. It’s obvious that this should work, because predicting the first step in getting a solution is easier than predicting the answer, and predicting the second step is easy once you’ve seen the first, and so on.
But they also give it some time to “practice” solving more problems, and reinforce the kinds of predictions that eventually led to answers that can be verified by other means, while downweighting the ones that end up with wrong answers. This actually means that they end up producing types of sequences of words that are nowhere in the training data - sometimes because they were silently in the heads of people who wrote down final answers, and sometimes because random wandering found good strategies that turn out to be useful. This makes it a lot more like AlphaGo and other game AIs that discover their own strategies through self play.
You can’t do that with text off the bat, but if you start by learning to predict human text, then you get a facility with text that can be tuned in this sort of way to learn strategies that are nowhere in the training data.
I'm not quite sure what distinction you're trying to draw here. All generative models, from the lowly autoencoder on up, produce data that is not present in their training sets. That's the distinction between a model and a dataset. When you ask a base model for the answer to a math problem, sometimes it will try to spit out a raw answer, and generally do badly. Sometimes it will instead spit out a text sequence similar to a worked problem in a text book, and then it is more likely to succeed for the reasons you mentioned. Even if you perform very stupid reinforcement learning, and just reinforce the entire correct answers and punish the entire incorrect answers, you will increase the frequency of the worked-problem style answers from the model.
The influence of the training data is still visible in the model's success though. They're much more likely to get the right answer to a*x+b if a=1.8 and b=32, for example.
I don't think "next token predictor" is any more useful of a way of looking at LLMs than "next muscle impulse selectors" is for humans. Both are technically true at the base level, but neither is describing where any of the interesting stuff is going on.
Generating a token is just a means to an end. The insanely complicated web of relationships and correlations that happen in the billions of weights is the part that's mind-blowing, and that's the part that we should be evaluating. Not the token part.
Unfortunately, I think this situation is one where the technical claims are actually the best way of understanding the issue. Claude is still an autoregressive language model, which generates words based on the sequence of prior words, including the ones it generated. And it's still doing it via standard machine learning ideas, which are about making it statistically improve on various goals. And it starts, as you say, with a model whose goal is accurately predicting the next word in a large corpus including much of the internet. But the key step is that while training on the whole Internet and training to be a helpful chatbot are both done using machine learning and statistical predictions, they are in tension, and the "post training" makes the models much worse at predicting the internet.
However, I think it's possible to over rotate on these differences. Much of the intelligence of modern language models is already there in the base model. This is the point of the gpt3 paper which shows that you can get base models to do lots of intelligent things with appropriate prompting.
I agree that base models have shockingly impressive capabilities if you figure out how to leverage the prompting so that predicting the next token requires being accurate about the world in various ways, but don't see that as in conflict with the observation that no one is talking to base models, and that the models everyone is talking to are extremely behaviorally different from base models, and that you won't understand present-day models well at all if you are trying to adopt the 'spicy autocomplete' framing (and if you tell other people this, it mostly makes them substantively worse at understanding what's going on with AI).
I think the base models are impressive enough that if we had never invented RLHF then we'd see big impacts anyway. But more generally, I think the problem is that the "spicy autocomplete" framing is correct while nonetheless being misleading for most people. It's just like the Chinese Room thought experiment, where many people have extremely misleading intuition _because_ they understand how it works in a way that we don't for the human brain.
Hmmm. Most people I've talked to who had heard the 'spicy autocomplete' thing had in fact not realized that after you train a base model you train extensively on a different objective to get the models we know, so I am more inclined to call it false (telling people that the objective trained on was token prediction, when it was not) than misleading (because at inference it's still outputting its likeliest token, you could argue this is technically true, but in a way that makes readers worse off because that's a mechanical explanation useless for predicting it), but there are definitely elements of both.
The issue with this is that "autocomplete" as a framing works even for post-trained models; it's just a description of what autoregressive language models do.
The problem is that "large autoregressive language models can pass all behavioral tests for intelligence" is a fact that many people resist, so much so that telling them how ChatGPT works makes them understand it less well.
I think your description more or less “a smart determined helpful entity” is an adequate description of the user experience when the AI succeeds, but stochastic parrot gives you a better understanding of how the models fail, e.g. hallucinations which seem to be a structural element of LLMs.
AI is a tricky term because it elides over so many different techniques and user interfaces. Statements that are true in one AI domain aren’t in another, but all AI domains are unhelpfully packaged together in discourse. I think that this has led to a lot of talking-past-eachother in AI discussions.
Hallucinations are much much rarer in the latest generation of models. With careful prompting, basically gone. What makes you say that they're a structural element of LLMs?
My thought is that “hallucination” is better described as “confabulation” and it’s a structural element of systems that can usefully extract information that goes beyond what is definitely entailed by the inputs.
I keep noticing new things every time I re-read “Computing Machinery and Intelligence”, but one of the objections Turing describes is the idea that “machines can’t make mistakes”. After the flippant reply (“is that a problem?”) he goes on to note that if you ever were able to get a machine to make useful inductive predictions on the basis of past observations, these would in fact sometimes go wrong. And I think “hallucination” or confabulation is an example of how this happens in all intelligences - you can’t directly store every single fact you’ve ever encountered, so you store compressed representations of them, and when you call them to mind, you usually get good facts back, but sometimes you put them together in incorrect ways. If you do it right, this sort of reconstruction will often get at information that is real but was never stored, but sometimes it goes wrong and we get the Mandela effect or human false memories of seeing bugs bunny at Disneyland when they were a kid or confabulated explanations of why one voted for a particular political party or misremembered book titles. But eliminating these things would only happen if we eliminated all the half-remembered things that are correct.
Much rarer != gone.
I think hallucinations are structural because LLMs are essentially just dancing around a high dimensional lexical vector space, and I have yet to see/read about a mechanism that prevents them from dancing into a nonsensical or erroneous part of the space, merely making it very unlikely. This is fine for a lot of applications, but it’s different than saying “it will not happen” or “it will happen at degree C with probability P.” Uncertainty quantification is important and I haven’t seen anything persuasive that says we actually understand the bounds and limits on these tools.
From a robustness perspective, you basically can’t make guarantees that a hallucination won’t happen the way we can make provable statements about the operation of Dijkstra’s Algorithm or the mechanical behavior of a bridge under load. I would not use an LLM in a safety critical application for that reason.
Now, none of this is to dispute the fact that LLMs work much better than they did two years ago, and are useful, perhaps in a widely economically impactful sense.
I find myself in the camp of “It definitely is a stochastic parrot, but it turns out there’s so much latent structure in language that that can take you very far.”
The most promising LLM approaches imo are the ones that couple it with deterministic code-gen. I suspect this is why AI agents work, but not as well on UIs: APIs and bash scripts are regimented in a way that makes them more legible to LLMs, and enforces some amount of structure and good behavior.
“AI is fancy auto-complete” gives me the same energy as “Love is just a series of chemical reactions firing in your brain.”
I mean like, sure, I guess there’s a mechanism by which the emergent property arises. Still the property is very interesting and mysterious!
I really like and basically agree this piece but I find it vaguely dissatisfying.
I think there is something true about the claim that it is extremely fancy autocomplete, and that this differentiates it from human intelligence. The “training on a pleasing answer” uses the same basic mechanisms as training on internet text, so it (kinda?) is autocomplete, in an orthogonal or even opposed way, by a sort of analogy. One could precisify this by non-analogically specifying the machine learning structures involved in both, and probably at this (comparatively) late date in history we should.
On the other hand, people are rightly interested in what differentiates AI from human intelligence, and reductionist types who say “this is also how neurons work” are just as maddening, and are just as technically-correct-but-misleading, as the stochastic parrot types. To me at least.
There is definitely something interesting and challenging about it turning out to be the case that all other intellectual tasks are substantially entangled with prediction, so that you can "just" create a very powerful predictor and then train it to do basically anything and if it's good enough at predicting, it will probably be able to use its talent at prediction to pick up the other tasks too.
I have found something interesting in the takes that maybe what humans are doing is on some level just surprise-minimization or prediction too; I don't know if they're correct, or how we'd tell, but 'prediction turns out to be extremely general cognitive machinery' ought, to my mind, to make us more interested in how much other intelligent minds like ours are doing prediction.
Prediction from something like understanding or principle is different from prediction by association or regression. They are not unrelated, and they can each approximate the other in some contexts. But they have different forms, different failure modes, and afford different powers. </context inappropriate philosophical assertion>
Anyway, it seems like we might have different instincts about how to think about this sort of claim. Which explains why I’m not naturally as bothered by Tyler types, and I’m glad your piece nudged me in the direction of being more bothered.
One of the things that makes me the saddest about public discourse around AI is when the comments on an article do exactly the kinds of things / make the same kinds of errors outlined in the article. It's quite remarkable how prevalent this is (see, e.g., a commenter here dismissing claims about Opus 4.6 because they used Sonnet 4 a while back).
It really makes me wonder if it's possible to have a reasonable discussion about this topic.
A mental model I have for current LLMs is
1) lossy memorization of entire training dataset
2) really good embeddings + lookup
3) a grab bag of relatively simple learned algorithms and syntactic transformations
What would be going on is mostly memorization, with a small but extremely impactful sprinkle of “thinking” on top. Inside the weights there would not be very complex computation or simulation happening. Then you can RL test-time reasoning on top of this.
It’s not a formal or falsifiable claim but I feel it can account for current capabilities.
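If it helps, here’s a toy sketch of what I mean by point 2 (the embedding function is a crude stand-in I made up; real models learn theirs end to end): store compressed vector representations of things you’ve seen, then answer by pulling back whatever is nearest.

```python
import numpy as np

# Toy stand-in for a learned embedding: hash character trigrams into a fixed-size vector.
def embed(text: str, dim: int = 256) -> np.ndarray:
    vec = np.zeros(dim)
    for i in range(len(text) - 2):
        vec[hash(text[i:i + 3]) % dim] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

memorized = [
    "The capital of France is Paris.",
    "Water boils at 100 degrees Celsius at sea level.",
    "Dijkstra's algorithm finds shortest paths with non-negative weights.",
]
memory = np.stack([embed(s) for s in memorized])

def lookup(query: str) -> str:
    """Return the stored item whose embedding is closest to the query's (cosine similarity)."""
    scores = memory @ embed(query)
    return memorized[int(np.argmax(scores))]

# The nearest stored item wins; here that should be the boiling-point sentence.
print(lookup("what temperature does water boil at?"))
```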
I don’t mean this as a deflationary claim at all; the point is that even if you grant “stochastic parrots”, it’s not hard to believe one can build disruptively powerful capabilities on top of that.
And of course once you can RL test-time reasoning, that can start developing all sorts of powers that are nowhere in the initial training set, just like AlphaGo can come up with Go strategies that are nowhere in its training set.
It’s not obvious at all how human thinking works. Probably some part of it is like this. Quite possibly there are other sorts of capacities we have that make a difference too. But we don’t push this kind of RL to the max.
AI researcher here,
I tend not to comment on online discourse about AI because it's usually a losing battle, for all the reasons you could imagine. However, given that I have been enjoying this publication for the past couple of months, I wanted to clear up a couple of misconceptions in the article and the comments here.
1. Demonstrably, most LLMs you interact with on a daily basis are mechanistically auto-regressive next-token predictors. That means the model takes in every token you or it previously produced, either via caching or via input (or compaction), and uses that to generate the next token, and so on. However, not ALL LLMs do this. Some do what's called multi-token prediction (MTP) and actually predict entire blocks of tokens at once, though still in an auto-regressive manner.
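In case it helps to see it spelled out, the auto-regressive loop is literally this shape (a schematic sketch: `model_logits` and the tokenizer are placeholders, and real decoders add sampling, KV caching, batching, and so on):

```python
def generate(model_logits, tokenizer, prompt: str, max_new_tokens: int = 64) -> str:
    """Greedy auto-regressive decoding: each new token is conditioned on all previous ones."""
    tokens = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        logits = model_logits(tokens)  # one forward pass over the whole context so far
        next_token = max(range(len(logits)), key=lambda t: logits[t])  # argmax = greedy choice
        tokens.append(next_token)
        if next_token == tokenizer.eos_token_id:
            break
    return tokenizer.decode(tokens)
```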
There are even LLMs like Mercury from Inception Labs which use diffusion to iteratively de-noise the entire latent tensor at once rather than predicting one token after the next (how exactly this is done, whether all at once or with denoising interleaved with token-chunk prediction, is still an open area of research).
2. After pre-training on a pure text corpus, demonstrably, the loss function for all LLMs changes from being purely based on p(x_{t+1} | x_0, x_1, ..., x_t) (i.e. next-token prediction) to something remarkably more complicated. Multiple stages of both instruction fine-tuning and reinforcement learning in many different environments take place, where the reward isn't whether or not you accurately predict the next token, but whether you correctly solve the objective. If you want, I recommend reading this paper from Nvidia Nemotron 3: https://arxiv.org/pdf/2512.20856. It is a research paper, so some background in AI and generative modeling is required, but it is a pretty good look into the types of strategies employed at top research labs.
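To make that contrast concrete, here is a schematic of the two objectives side by side (pseudo-PyTorch; the reward term and the REINFORCE-style update are drastic simplifications of what labs actually run):

```python
import torch.nn.functional as F

def pretraining_loss(logits, targets):
    """Stage 1: pure next-token prediction.
    logits: (seq_len, vocab_size) predictions; targets: (seq_len,) the actual next tokens."""
    return F.cross_entropy(logits, targets)

def reinforce_loss(token_log_probs, reward):
    """Stage 2, very schematically: the signal is no longer "did you guess the next token"
    but "did the whole sampled response solve the task", scored by some reward function
    (a verifier, a preference model, unit tests, ...). REINFORCE is shown for simplicity;
    real pipelines use PPO/GRPO-style variants with baselines, clipping, KL penalties, etc.
    token_log_probs: (response_len,) log-probs of the tokens the model actually sampled."""
    return -(reward * token_log_probs.sum())
```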
All of this is to say, I land mostly on the side of Kelsey here: the LLMs we use today are much more than next-token predictors and can be remarkably good at human-level tasks that require judgement. However, to say LLMs aren't literally predicting the next token mechanically is just incorrect. The most important question is: does it matter that most LLMs produce outputs token by token?
There is no innate computational reason that we don't use MTP or some more complicated scheme to predict more than one token at a time. It's just that, so far, these next-token predictors have turned out to be remarkably good at solving long-horizon (in terms of tokens and complexity) objectives. And that's mostly because we don't train them to simply do next-token prediction.
In any case, regardless of what humans are doing under the hood when we talk, we still have to produce our output word by word, in order! Students learning a foreign language might figure out the deep structure and then move the words around to get them into the correct word order for the language they’re answering a homework question in. But fluent speakers just have the words come out one by one, without having subvocalized the later words in advance!
Claude is just a stochastic parrot in the same way that Tyler Austin Harper is just an infant.
I ran this (very good!) piece through an AI skill I created that separates fact from rhetoric, then rewrites the rhetoric from several different perspectives to illustrate how the same facts can be interpreted differently.
https://claude.ai/share/225091ed-882d-46fb-95bf-04d782d7d5d3
Here’s an excerpt:
###### CLAUDE ######
1. “obtuse and dishonest” — this phrase does more persuasive work than any technical demonstration in the piece. it converts a factual dispute about training terminology into a moral judgment about character. remove it and the piece loses 40% of its momentum. the syntactic choice matters too: “I find it obtuse” makes it personal testimony (harder to refute) rather than a claim (easier to test).
2. the GPT-2 anecdote as implicit metonymy — piper demonstrates what a BASE model does, then shows what an INSTRUCTION-TUNED model does. the factual content is: these are different. the rhetorical content is: anyone who describes the second using language appropriate to the first is lying. but the actual question — whether “instruction-following optimization over a next-token-predicting substrate” is philosophically closer to “understanding” or “sophisticated pattern completion” — is neither asked nor answered. the demo FEELS like it resolves this, because the capability gap is vivid. it doesn’t.
3. the profession-correlated-belief move — “socialists are mostly convinced AI is meaningless hype except the ones with data-analysis jobs, who admit it’s real.” this is an extraordinary piece of rhetoric. “admit” presupposes the conclusion (it IS real; the question is whether you’ll admit it). the implication is: exposure to AI converts skeptics, therefore skepticism is ignorance. but this is equally consistent with: exposure to AI creates familiarity bias, sunk-cost reasoning, and identity investment. the word “admit” does the entire job of resolving this ambiguity in piper’s favor.
Fair enough, Claude! But I'll note I have reason to say that the fact that people who use modern AI tools think they're a big deal isn't sunk-cost reasoning: a year or two ago, I tried AI tools like this myself and they weren't that useful - extremely cool, but not useful - and I talked to lots of other people who said the same thing. Now they are.
So I think people are very capable of saying "this is cool, but not useful" after trying an AI tool - and there is tremendous appetite for these takes. But since the latest round of model releases, I haven't seen that reaction, and the 'AI is useless' takes are mostly coming from people who tell me specifically that they don't use it, or tried it six months ago and were disappointed.
Now, maybe AI models at a certain level of quality are easy to reject as useless, and AI models just a little stronger than that seduce you into believing they're useful when they aren't. But I don't think we're totally at a loss when comparing the hypotheses "they're useful" and "people who use them become wrongly convinced of their usefulness".
I completely agree with you. I find the tools incredibly useful in my personal life and at work. I just don’t think we have very good measures of “usefulness”.
Like, one measure of usefulness could be how many lines of code I’ve written, or how many projects completed. But if only I use the project, how useful was it? Another measure of usefulness is revenue: if someone buys it, it must be useful to them in proportion to the price they paid. But this just transfers the responsibility to measure usefulness to the buyer.
I am finding that I get a lot of “insight” from using AI, in the sense that it gives me information that is relevant to decisions. But are those decisions better than the counterfactual decisions I would otherwise have made? By what metric? Should I run a controlled experiment on my own decision-making, and track outcomes?
The problem is that if I just declare “no need to measure; this is obviously useful”, then no one on the outside can tell whether I’ve been seduced or whether I’m actually being helped.
I’m convinced we’ll see the usefulness in the GDP numbers eventually, and then it will be undeniable. But until then, I think AI users are going to deal with a lot of social judgment from people who judge usefulness subjectively in a different way.
I create music the kind-of-old-fashioned way (I play a keyboard or electronic drums into the computer, then I mess around with it a bunch to get the songs I want). I just recently started using AI to generate videos to go with those songs for YouTube.
It's really freaken amazing. And I'm using Chat GPT to help me design the prompts and understand what works and why. Yes, Chat GPT can definitely teach you stuff.
I'm fully convinced that over the medium term (say 20-30 years), 90%+ (and probably much higher) of all jobs will be automated away between AI and humanoid robots.
Everything depends on your definition of "autocomplete". The reason I agree with you and not the pessimists is that a sufficiently powerful "autocomplete" must eventually have an accurate underlying world model. To finish the following sentence in the best way, "Eventually, empirical tests confirmed what before had only been a theory, and scientists now knew the way to reconcile general relativity and quantum mechanics was______________", the model will need to actually solve the physics. If you still want to call that "autocomplete"... whatever, I won't argue definitions.
What's cool, though, is that modern AI isn't just a giant regression model with no mechanism to actually understand the underlying world. Instead, it uses an architecture that mimics the human brain in lots of ways, and it appears that eventually it is able to 'grok' the underlying world model and make predictions that way instead of just from word-frequency stats.
Maybe I'm wrong about the AI training, but that seems like a more important distinction than the reinforcement learning part of the process. (Wouldn't modern base models, even without the RL, have all the amazing capabilities, but just be more likely to spew racist propaganda if asked?)
Also, for fun I iterated on your 1880s prompt and after a while this is what Claude came up with:
In eighteen eighty, Hayes felt winter near,
Four years before, he'd made a bargain dear:
Withdrew the troops to win his White House seat--
And bet the South would honor its defeat.
Now brown leaves fell without a single sound--
The war was won, but justice was brought down.
Gosh, I used to read Kelsey back in her Tumblr days. And I was so excited to follow her to Vox, and yet I never heard the ring in her voice the way I did back then. Until now. Bravo for an incredibly thoughtful and incredibly well written article. I'm humbled at how this piece combines nuance with conviction and force.
Incidentally, if you want to use a text-completion AI, there are some available for free here: https://textsynth.com/completion.html
There’s a bit of a limit to the amount you can use, but it’s plenty to demonstrate it to students. (I assign students in my AI Literacy class to do a few things here to understand the difference between real LLM autocomplete and modern chatbots.)
Thank you for bringing sanity to this topic.
If LLMs truly didn't understand anything, it would be easy to demonstrate it with rigorous benchmarks, for example where half of the questions are kept hidden behind an API to make sure LLMs haven't memorized the answer.
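The protocol is easy to sketch (everything here, endpoint and field names included, is hypothetical): the grader holds the answers, the model only ever sees the questions, so a high score can't come from having the test set in the training data.

```python
import requests

# Hypothetical benchmark server: it serves questions but never reveals the hidden answers,
# and only returns an aggregate score after submission.
BENCHMARK_URL = "https://example.org/hidden-benchmark"

def evaluate(model_answer_fn) -> float:
    questions = requests.get(f"{BENCHMARK_URL}/questions").json()   # questions only, no answers
    submission = {q["id"]: model_answer_fn(q["prompt"]) for q in questions}
    result = requests.post(f"{BENCHMARK_URL}/score", json=submission).json()
    return result["accuracy"]   # grading happens server-side, against answers the model never saw
```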
François Chollet followed this kind of approach with ARC-AGI, and kudos to him! But most skeptical researchers didn't make the effort of formalizing their position into falsifiable empirical claims. And now that the evidence is overwhelming, some of them are executing a motte-and-bailey retreat to the claim that LLMs are not conscious, to pretend that they were right to begin with. And since we have no rigorous way of determining what is or isn't conscious, it's the ultimate motte.
Claims that LLMs are just stochastic parrots were predictably going to age poorly, and unlike with climate misinformation, people can just test it directly by themselves. Doubling down on denial is just going to make the hangover even worse.
It’s interesting that you brought up writing in verse, because it seems like something a computer should be good at (or at least capable of) that it’s shockingly bad at. A month or two ago I was reflecting that movie titles are cooler and more evocative when they’re in trochaic tetrameter, like “Teenage Mutant Ninja Turtles” and “Mighty Morphin Power Rangers”, so I asked Gemini to come up with more examples of movies whose titles are in trochaic tetrameter, and try as it might, it couldn’t do it. Sometimes it wasn’t even close. Not only could it not recognize the stress patterns much of the time, it would also pick movies with the wrong number of feet. While I was trying to coax it in the right direction, I kept independently coming up with examples of my own. The only real hit was “Avatar: The Way of Water”.
So instead of finding examples, I figured maybe it could create new ones. I asked it to come up with new titles for the films of the MCU that would fit the meter and sound like plausible titles, e.g. “Adam Warlock and the Guardians” instead of Guardians of the Galaxy, Vol. 3, or “Ant-Man and the Yellowjacket” for Ant-Man. It could not do this at all. I was like, come on, “Hulk vs. Abomination” is right there! You can’t put together “Thunderbolts: The New Avengers” or “Captain Marvel and the Marvels” on your own? Even when I clarified what I was looking for and gave it some examples, it couldn’t come up with the pattern that I noticed right away, which is that you can get pretty far with just [hero name] [and/and the/versus] [villain name].
This experience inspired me to compose the following poem (to the tune of Harder, Better, Faster, Stronger)
“Robots Are No Good at Poems”
I’ve come up with an idea
for forcing kids to write their papers
without using any AI:
make them write them as a poem.
Forgo rhyming, but insist they
use specific meters to
compose their essays as they write them.
Robots are no good at poems.
It will help improve their word choice
And give them a sense of rhythm.
It’s not hard to learn with practice.
And you’ll recognize the cheaters.