You can be so far ahead of someone that their input (if you act on it) can only make things worse. That's it. If a human 'teams up' with a chess AI today and does anything other than agree with its moves, it will just drag things down.
These human-in-the-loop systems basically list possible moves with their likelihood of winning, no?
So how would the human be a demerit? It'd mean that the human for some reason decided to always pick the option the AI wouldn't take, but how would that make sense? The AI would already have listed the "correct" move with a higher likelihood of winning.
The point of this strategy was to mitigate traps, but this would now have to become inverted: the opponent AI would have to be able to gaslight the human into thinking he's stopping his AI from falling into a trap. While that might work in a few cases, the human would quickly learn that his ability to overrule the optimal choice is flawed, thus reverting it back to baseline where the human is essentially a non-factor and not a demerit.
>So how would the human be a demerit? It'd mean that the human for some reason decided to always pick the option the AI wouldn't take, but how would that make sense? The AI would already have listed the "correct" move with a higher likelihood of winning.
The human will be a demerit any time they're not picking the choice the model would have made.
>While that might work in a few cases, the human would quickly learn that his ability to overrule the optimal choice is flawed, thus reverting it back to baseline where the human is essentially a non-factor and not a demerit
Sure, but it's not a Centaur game if the human is doing literally nothing every time. The only way for a human+AI team to not be outright worse than the AI alone is for the human to do nothing at all, and that's not a team. You've just delayed the response of the computer for no good reason.
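A back-of-the-envelope version of the argument (all move names and numbers below are made up, and it assumes the engine's win-probability estimates are accurate): any override that differs from the engine's top pick can only pull the team's expected score down.

    # Toy illustration, not real engine output: hypothetical win probability per move.
    engine_eval = {"Nf3": 0.62, "d4": 0.58, "h4": 0.31}

    def expected_win(override_rate, human_pick):
        """Team expectation when the human overrides the engine some fraction of the time."""
        best = max(engine_eval.values())
        return (1 - override_rate) * best + override_rate * engine_eval[human_pick]

    print(round(expected_win(0.0, "d4"), 3))  # 0.62  -- human never overrides: pure engine play
    print(round(expected_win(0.2, "d4"), 3))  # 0.612 -- even a "reasonable" override costs a little
    print(round(expected_win(0.2, "h4"), 3))  # 0.558 -- a bad override costs a lot

As long as the engine's evaluations are better than the human's, the best override rate is zero.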
Exactly what part of your brain can you point to and say, "This is it. This understands Chinese"? Your brain is every bit as much a Chinese Room as a Large Language Model. That's the flaw.
And unless you believe in a metaphysical reality to the body, your point about substrate independence cuts just as well for the brain.
Why are you commenting if you can't even take a few minutes to read this? It's quite bizarre. There's a quote and a repo for Cheeseman, and a paper for Biomni.
There is only one quote in the entire article, though:
> Cheeseman finds Claude consistently catches things he missed. “Every time I go through I’m like, I didn’t notice that one! And in each case, these are discoveries that we can understand and verify,” he says.
Pretty vague and not really quantifiable. You would think an article making a bold claim would contain more than a single, hand-wavy quote from an actual scientist.
>Pretty vague and not really quantifiable. You would think an article making a bold claim would contain more than a single, hand-wavy quote from an actual scientist.
Why? What would more quotes accomplish that a paper with numbers and code doesn't? It just seems like nitpicking here. The article could have gone without a single quote (or had several more) and it wouldn't really change anything. And that quote is not really vague in the context of the article.
Do you think LLMs sidestep cause and effect somehow? There's an explanation there too, we just don't know it. But that's the case for many natural phenomena.
Science often advances by accumulation, and it’s true that multiple people frequently converge on similar ideas once the surrounding toolkit exists. But “it becomes obvious” is doing a lot of work here, and the history around relativity (special and general) is a pretty good demonstration that it often doesn’t become obvious at all, even to very smart people with front-row seats.
Take Michelson in 1894: after doing (and inspiring) the kind of precision work that should have set off alarm bells, he’s still talking like the fundamentals are basically done and progress is just “sixth decimal place” refinement.
"While it is never safe to affirm that the future of Physical Science has no marvels in store even more astonishing than those of the past, it seems probable that most of the grand underlying principles have been firmly established and that further advances are to be sought chiefly in the rigorous application of these principles to all the phenomena which come under our notice. It is here that the science of measurement shows its importance — where quantitative work is more to be desired than qualitative work. An eminent physicist remarked that the future truths of physical science are to be looked for in the sixth place of decimals." - Michelson 1894
The Michelson-Morley experiments weren't obscure; they were famous, discussed widely, and their null result was well-known. Yet for nearly two decades, the greatest physicists of the era proposed increasingly baroque modifications to existing theory rather than question the foundational assumption of absolute time. These weren't failures of data availability or technical skill; they were failures of imagination constrained by what seemed obviously true about the nature of time itself.
Einstein's insight wasn't just "connecting dots" here, it was recognizing that a dot everyone thought was fixed (the absoluteness of simultaneity) could be moved, and that doing so made everything else fall into place.
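To be concrete about which dot moved: in the standard Lorentz transformation, the time separation between two events in a moving frame depends on their spatial separation,

    \Delta t' = \gamma \left( \Delta t - \frac{v\,\Delta x}{c^2} \right),
    \qquad \gamma = \frac{1}{\sqrt{1 - v^2/c^2}},

so two events that are simultaneous but spatially separated in one frame (\Delta t = 0, \Delta x \neq 0) come out with \Delta t' = -\gamma v \Delta x / c^2 \neq 0 in another. Simultaneity itself is frame-dependent, and that is exactly the assumption everyone before 1905 treated as fixed.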
People scorn the 'Great Man Hypothesis' so much that they sometimes swing too far in the other direction. The 'multiple discovery' pattern you cite is real but often overstated. For Special Relativity, Poincaré came close but didn't make the full conceptual break. Lorentz had the mathematics but retained the aether. The gap between 'almost there' and 'there' can be enormous when it requires abandoning what seems like common sense itself.
>Unfortunately, none of that has anything to do with what LLMs are doing. The LLM is not thinking about concepts and then translating that into language. It is imitating what it looks like to read people doing so and nothing more.
'Language' only enters at the initial and final layers of a Large Language Model. Manipulating concepts is exactly what they do, and it's unfortunate that the most obstinate seem to be the most ignorant.
They do not manipulate concepts. There is no representation of a concept for them to manipulate.
It may, however, turn out that in doing what they do, they are effectively manipulating concepts, and this is what I was alluding to: by building the model, even though your approach was through tokenization and whatever term you want to use for the network, you end up accidentally building something that implicitly manipulates concepts. Moreover, it might turn out that we ourselves do more of this than we perhaps like to think.
Nevertheless "manipulating concepts is exactly what they do" seems almost willfully ignorant of how these systems work, unless you believe that "find the next most probable sequence of tokens of some length" is all there is to "manipulating concepts".
>They do not manipulate concepts. There is no representation of a concept for them to manipulate.
Yes, they do. And of course there is. And there's plenty of research on the matter.
>It may, however, turn out that in doing what they do, they are effectively manipulating concepts
There is no "effectively" here. Text is what goes in and what comes out, but it's by no means what they manipulate internally.
>Nevertheless "manipulating concepts is exactly what they do" seems almost willfully ignorant of how these systems work, unless you believe that "find the next most probable sequence of tokens of some length" is all there is to "manipulating concepts".
"Find the next probable token" is the goal, not the process. It is what models are tasked to do yes, but it says nothing about what they do internally to achieve it.
Please pass on a link to a solid research paper that supports the idea that to "find the next probable token", LLMs manipulate concepts ... just one will do.
Thanks for that. I've read the two Lindsey papers before. I think these are all interesting, but they are also what used to be called "just-so stories". That is, they describe a way of understanding what the LLM is doing, but do not actually describe what the LLM is doing.
And this is OK and still quite interesting - we do it to ourselves all the time. Often it's the only way we have of understanding the world (or ourselves).
However, in the case of LLMs, which are tools that we have created from scratch, I think we can require a higher standard.
I don't personally think that any of these papers suggest that LLMs manipulate concepts. They do suggest that the internal representation after training is highly complex (superposition, in particular), and that when inputs are presented, it isn't unreasonable to talk about the observable behavior as if it involved represented concepts. It is a useful stance to take, similar to Dennett's intentional stance.
However, while this may turn out to be how a lot of human cognition works, I don't think it is the significant part of what is happening when we actively reason. Nor do I think it corresponds to what most people mean by "manipulate concepts".
The LLM, despite the presence of "features" that may correspond to human concepts, is relentlessly forward-driving: given these inputs, what is my output? Look at the description in the 3rd paper of the arithmetic example. This is not "manipulating concepts" - it's a trick that often gets to the right answer (just like many human tricks used for arithmetic, only somewhat less reliable). It is extremely different, however, from "rigorous" arithmetic - the stuff you learned somewhere between ages 5 and 12, perhaps - that always gives the right answer and involves no pattern matching, no inference, no approximations. The same thing can be said, I think, about every other example in all 4 papers, to some degree or another.
What I do think is true (and very interesting) is that it seems somewhere between possible and likely that a lot more human cognition than we've previously suspected uses similar mechanisms as these papers are uncovering/describing.
>That is, they describe a way of understanding what the LLM is doing, but do not actually describe what the LLM is doing.
I’m not sure what distinction you’re drawing here. A lot of mechanistic interpretability work is explicitly trying to describe what the model is doing in the most literal sense we have access to: identifying internal features/circuits and showing that intervening on them predictably changes behavior. That’s not “as-if” gloss; it’s a causal claim about internals.
If your standard is higher than “we can locate internal variables that track X and show they causally affect outputs in X-consistent ways,” what would count as “actually describing what it’s doing”?
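To make that concrete, here's a minimal toy sketch of the locate-and-intervene pattern (a throwaway two-layer network, not a real LLM; the "feature" direction is random rather than learned, and all names are illustrative): pick a direction in a hidden layer, ablate it during the forward pass, and check whether the output moves.

    # Toy sketch of directional ablation: find a direction in a hidden layer,
    # remove it mid-forward-pass, and see whether the output changes.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
    x = torch.randn(1, 8)

    direction = torch.randn(16)              # stand-in for a learned feature direction
    direction = direction / direction.norm()

    def ablate(module, inputs, output):
        # Project out the component of the hidden state along `direction`.
        coeff = output @ direction                      # shape (1,)
        return output - coeff.unsqueeze(-1) * direction

    baseline = model(x)
    handle = model[1].register_forward_hook(ablate)     # hook the ReLU's output
    patched = model(x)
    handle.remove()

    print(baseline)   # logits with the direction intact
    print(patched)    # logits with the direction ablated; any difference is a causal effect

In real interpretability work the direction comes from a trained probe or a sparse-autoencoder feature rather than random noise, but the logic is the same: the claim isn't "as-if", it's "change this internal variable and the behavior changes in the predicted way".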
>However, in the case of LLMs, which are tools that we have created from scratch, I think we can require a higher standard.
This is backwards. We don’t “create them from scratch” in the sense relevant to interpretability. We specify an architecture template and a training objective, then we let gradient descent discover a huge, distributed program. The “program” is not something we wrote or understand. In that sense, we’re in a similar epistemic position as neuroscience: we can observe behavior, probe internals, and build causal/mechanistic models, without having full transparency.
So what does “higher standard” mean here, concretely? If you mean “we should be able to fully enumerate a clean symbolic algorithm,” that’s not a standard we can meet even for many human cognitive skills, and it’s not obvious why that should be the bar for “concept manipulation.”
>I don't personally think that any of these papers suggest that LLMs manipulate concepts. They do suggest that the internal representation after training is highly complex (superposition, in particular), and that when inputs are presented, it isn't unreasonable to talk about the observable behavior as if it involved represented concepts. It is a useful stance to take, similar to Dennett's intentional stance.
You start with “there is no representation of a concept,” but then concede “features that may correspond to human concepts.” If those features are (a) reliably present across contexts, (b) abstract over surface tokens, and (c) participate causally in producing downstream behavior, then that is a representation in the sense most people mean in cognitive science. One of the most frustrating things about these sorts of discussions is the meaningless semantic games and goalpost shifting.
>The LLM, despite the presence of "features" that may correspond to human concepts, is relentlessly forward-driving: given these inputs, what is my output?
Again, that’s a description of the objective, not the internal computation. The fact that the training loss is next-token prediction doesn’t imply the internal machinery is only “token-ish.” Models can and do learn latent structure that’s useful for prediction: compressed variables, abstractions, world regularities, etc. Saying “it’s just next-token prediction” is like saying “humans are just maximizing inclusive genetic fitness,” therefore no real concepts. Goal ≠ mechanism.
> Look at the description in the 3rd paper of the arithmetic example. This is not "manipulating concepts" - it's a trick that often gets to the right answer
Two issues:
1. “Heuristic / approximate” doesn’t mean “not conceptual.” Humans use heuristics constantly, including in arithmetic. Concept manipulation doesn’t require perfect guarantees; it requires that internal variables encode and transform abstractions in ways that generalize.
2. Even if a model is using a “trick,” it can still be doing so by operating over internal representations that correspond to quantities, relations, carry-like states, etc. “Not a clean grade-school algorithm” is not the same as “no concepts.”
>Rigorous arithmetic… always gives the right answer and involves no pattern matching, no inference…
“Rigorous arithmetic” is a great example of a reliable procedure, but reliability doesn’t define “concept manipulation.” It’s perfectly possible to manipulate concepts using approximate, distributed representations, and it’s also possible to follow a rigid procedure with near-zero understanding (e.g., executing steps mechanically without grasping place value).
So if the claim is “LLMs don’t manipulate concepts because they don’t implement the grade-school algorithm,” that’s just conflating one particular human-taught algorithm with the broader notion of representing and transforming abstractions.
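For what it's worth, here's the kind of contrast I mean, as a toy sketch (my own illustration, nothing to do with the papers' actual circuits): the grade-school procedure and an estimate-and-adjust heuristic for the same multiplication.

    def long_multiply(a, b):
        # Digit-by-digit grade-school multiplication: reliable, no estimation.
        total = 0
        for i, d in enumerate(reversed(str(b))):
            total += a * int(d) * 10**i
        return total

    def heuristic(a, b):
        # Round to the nearest hundred, multiply, then patch up with two
        # correction terms -- usually close, but the cross term is dropped,
        # so it's not guaranteed to be exact.
        ra, rb = round(a, -2), round(b, -2)
        return ra * rb + (a - ra) * rb + (b - rb) * ra

    print(long_multiply(297, 1345))  # 399465
    print(heuristic(297, 1345))      # 399600 -- off by 135, the dropped (a-ra)*(b-rb) term

The point is only that "approximate" vs. "guaranteed" is orthogonal to whether quantities and relations are being represented at all.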
> You start with “there is no representation of a concept,” but then concede “features that may correspond to human concepts.” If those features are (a) reliably present across contexts, (b) abstract over surface tokens, and (c) participate causally in producing downstream behavior, then that is a representation in the sense most people mean in cognitive science. One of the most frustrating things about these sorts of discussions is the meaningless semantic games and goalpost shifting.
I'll see if I can try to explain what I mean here, because I absolutely don't believe this is shifting the goal posts.
There are a couple of levels of human cognition that are particularly interesting in this context. One is the question of just how the brain does anything at all, whether that's homeostasis, neuromuscular control or speech generation. Another is how humans engage in conscious, reasoned thought that leads to (or appears to lead to) novel concepts. The first one is a huge area, better understood than the second though still characterized more by what we don't know than what we do. Nevertheless, it is there that the most obvious parallels with e.g. the Lindsey papers can be found. Neural networks, activation networks and waves, signalling etc. etc. The brain receives (lots of) inputs, generates responses including but not limited to speech generation. It seems entirely reasonable to suggest that maybe our brains, given a somewhat analogous architecture at some physical level to the one used for LLMs, might use similar mechanisms as the latter.
However, nobody would say that most of what the brain does involves manipulating concepts. When you run from danger, when you reach up to grab something from a shelf, when you do almost anything except actual conscious reasoning, most accounts of how that behavior arises from brain activity do not involve manipulating concepts. Instead, we have explanations more similar to those being offered for LLMs - linked patterns of activations across time and space.
Nobody serious is going to argue that conscious reasoning is not built on the same substrate as unconscious behavior, but I think that most people tend to feel that it doesn't make sense to try to shoehorn it into the same category. Just as it doesn't make much sense to talk about what a text editor is doing in terms of P and N semiconductor gates, or even just logic circuits, it doesn't make much sense to talk about conscious reasoning in terms of patterns of neuronal activation, despite the fact that in both cases, one set of behavior is absolutely predicated on the other.
My claim/belief is that there is nothing inside an LLM that corresponds even a tiny bit to what happens when you are asked "What is 297 x 1345?" or "will the moon be visible at 8pm tonight?" or "how does writer X tackle subject Y differently than writer Z?". They can produce answers, certainly. Sometimes the answers even make significant sense or better. But when they do, we have an understanding of how that is happening that does not require any sense of the LLM engaging in reasoning or manipulating concepts. And because of that, I consider attempts like Lindsey's to justify the idea that LLMs are manipulating concepts to be misplaced - the structures Lindsey et al. are describing are much more similar to the ones that let you navigate, move, touch, lift without much if any conscious thought. They are not, I believe, similar to what is going on in the brain when you are asked "do you think this poem would have been better if it was a haiku?" and whatever that thing is, that is what I mean by manipulating concepts.
> Saying “it’s just next-token prediction” is like saying “humans are just maximizing inclusive genetic fitness,” therefore no real concepts. Goal ≠ mechanism.
No. There's a huge difference between behavior and design. Humans are likely just maximizing genetic fitness (even though that's really a concept, but that detail is not worth arguing about here), but that describes, as you note, a goal not a mechanism. Along the way, they manifest huge numbers of sub-goal directed behaviors (or, one could argue quite convincingly, goal-agnostic behaviors) that are, broadly speaking, not governed by the top level goal. LLMs don't do this. If you want to posit that the inner mechanisms contain all sorts of "behavior" that isn't directly linked to the externally visible behavior, be my guest, but I just don't see this as equivalent. What humans visibly, mechanistically do covers a huge range of things; LLMs do token prediction.
>Nobody would say that most of what the brain does involves manipulating concepts. When you run from danger, when you reach up to grab something from a shelf, when you do almost anything except actual conscious reasoning, most accounts of how that behavior arises from brain activity do not involve manipulating concepts.
This framing assumes "concept manipulation" requires conscious, deliberate reasoning. But that's not how cognitive science typically uses the term. When you reach for a shelf, your brain absolutely manipulates concepts - spatial relationships, object permanence, distance estimation, tool affordances. These are abstract representations that generalize across contexts. The fact that they're unconscious doesn't make them less conceptual.
>My claim/belief is that there is nothing inside an LLM that corresponds even a tiny bit to what happens when you are asked "What is 297 x 1345?" or "will the moon be visible at 8pm tonight?"
This is precisely what the mechanistic interpretability work challenges. When you ask "will the moon be visible tonight," the model demonstrably activates internal features corresponding to: time, celestial mechanics, geographic location, lunar phases, etc. It combines these representations to generate an answer.
>But when they do, we have an understanding of how that is happening that does not require any sense of the LLM engaging in reasoning or manipulating concepts.
Do we? The whole point of the interpretability research is that we don't have a complete understanding. We're discovering that these models build rich internal world models, causal representations, and abstract features that weren't explicitly programmed. If your claim is "we can in principle reduce it to matrix multiplications," sure, but we can in principle reduce human cognition to neuronal firing patterns too.
>They are not, I believe, similar to what is going on in the brain when you are asked "do you think this poem would have been better if it was a haiku?" and whatever that thing is, that is what I mean by manipulating concepts.
Here's my core objection: you're defining "manipulating concepts" as "whatever special thing happens during conscious human reasoning that feels different from 'pattern matching.'" But this is circular and unfalsifiable. How would we ever know if an LLM (or another human, for that matter) is doing this "special thing"? You've defined it purely in terms of subjective experience rather than functional or mechanistic criteria.
>Humans are likely just maximizing genetic fitness... but that describes, as you note, a goal not a mechanism. Along the way, they manifest huge numbers of sub-goal directed behaviors... that are, broadly speaking, not governed by the top level goal. LLMs don't do this.
LLMs absolutely do this, it's exactly what the interpretability research reveals. LLMs trained on "token prediction" develop huge numbers of sub-goal directed internal behaviors (spatial reasoning, causal modeling, logical inference) that are instrumentally useful but not explicitly specified, precisely the phenomenon you claim only humans exhibit. And 'token prediction' is not about text. The most significant advances in robotics in decades are off the back of LLM transformers. 'Token prediction' is just the goal, and I'm tired of saying this for the thousandth time.
HN comment threads are really not the right place for discussions like this.
> Here's my core objection: you're defining "manipulating concepts" as "whatever special thing happens during conscious human reasoning that feels different from 'pattern matching.'" But this is circular and unfalsifiable. How would we ever know if an LLM (or another human, for that matter) is doing this "special thing"? You've defined it purely in terms of subjective experience rather than functional or mechanistic criteria.
I think your core objection is well aligned to my own POV. I am not claiming that the subjective experience is the critical element here, but I am claiming that whatever is going on when we have the subjective experience of "reasoning" is likely to be different (or more specifically, more usefully described in different ways) than what is happening in LLMs and our minds when doing something else.
How would we ever know? Well the obvious answer is more research into what is happening in human brains when we reason and comparing that to brain behavior at other times.
I don't think it's likely to be productive to continue this exchange on HN, but if you would like to continue, my email address is in my profile.
His point is that we can't train a Gemini 3- or Claude 4.5-class model, because we don't have the data to match the training scale of those models. There aren't trillions of tokens of digitized pre-1900s text.