OK, I am going to wade into deep water here.
I side with Gary Marcus – see for example: https://garymarcus.substack.com/p/a-knockout-blow-for-llms
and here I will take a look at an article by LawrenceC: https://www.lesswrong.com/posts/5uw26uDdFbFQgKzih/beware-general-claims-about-generalizable-reasoning
The argument Marcus makes, which is also made in the recent Apple paper, is that LLMs don’t really think; they just look like they think. But what do we mean by “think”? Here we will use “reason” instead of think, and reserve thinking for sentient beings – humans. I will also use reason in a specific sense.
The question then is – do LLMs reason? It seems pretty clear that they don’t. When a machine reasons, it does so by executing an algorithm. A key feature of algorithms is that they seek to solve something, and they contain criteria for stopping; until those criteria are satisfied, the reasoning continues. For the Tower of Hanoi, and a number of similar cases, increasing complexity leads to an increasing number of iterations (e.g. O(n), O(2^n)) before a solution is found; but there is no qualitative difference between how n=3 and n=5 are solved. The Apple paper clearly shows what I think we all understand anyway: this is NOT how an LLM works. Yes, execution time rises with increasing n, but not the way it would if a classic algorithmic approach were running behind the scenes. LLMs operate in a different way.
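To make concrete what a classic algorithmic approach means here, this is the standard recursive Tower of Hanoi solver (my own illustration, not code from either paper): the identical procedure handles n=3 and n=10, and only the number of moves grows, as 2^n - 1.

def hanoi(n, source, target, spare, moves):
    # Classic recursion with an explicit stopping criterion: no disks left to move
    if n == 0:
        return
    hanoi(n - 1, source, spare, target, moves)   # move n-1 disks out of the way
    moves.append((source, target))               # move the largest remaining disk
    hanoi(n - 1, spare, target, source, moves)   # move the n-1 disks back on top

for n in (3, 5, 10):
    moves = []
    hanoi(n, "A", "C", "B", moves)
    print(n, len(moves))   # prints 7, 31, 1023: always 2**n - 1 moves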
The “classic” LLM is one-shot, whereas the later ones, such as OpenAI o3, employ Chain of Thought (CoT). The CoT LLM must contain some kind of evaluation metric to determine when to stop running – though I have found little to support this assumption. We know that one-shot LLMs answer in less than a second, while CoT runs can take many minutes. So clearly the system is iterating in some manner until it decides that enough is enough.
Going back to one-shot LLMs, which form the backbone of o3 too, each cycle produces one token from a huge number of activated nodes and their weights. The entire “reasoning”, or “knowledge”, therefore resides in the network at the moment the next token is predicted (strictly, a probability distribution over tokens is calculated and a token is sampled from it, so high-probability tokens are chosen most often). Hence, the “reasoning” has already happened at this point. Most of it happened when the model was trained. The rest happens when the prompt is entered and causes the node values to be calculated. This process is controlled by an algorithm, of course, but what the algorithm does is calculate values in a completely deterministic way: it runs the same way regardless of the content, or indeed the complexity, of the problem/text/image that the LLM is working on. The algorithm is completely blind to the content, like a washing machine is blind to its contents.
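As a toy sketch of that single generation step (my own illustration, with a made-up vocabulary and made-up scores standing in for a real network):

import math, random

def sample_next_token(logits, temperature=1.0):
    # Softmax: turn the raw scores into a probability distribution
    weights = [math.exp(score / temperature) for score in logits]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Sample one token index according to those probabilities
    return random.choices(range(len(probs)), weights=probs, k=1)[0]

vocab = ["7", "8", "9", "banana"]          # made-up candidate tokens
logits = [2.0, 3.5, 1.0, -2.0]             # made-up network scores
print(vocab[sample_next_token(logits)])    # usually "8", sometimes "7" or "9"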
We must guess that o3 does take a look at the content, so it must have a meta-level orchestrator that coordinates the many one-shot runs of the actual model, and at some point intervenes, stops the loop, and emits the output based on some metric.
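Purely as a guess at what such an orchestrator might look like (o3’s internals are not public, so the model and good_enough names below are hypothetical placeholders):

def orchestrate(prompt, model, good_enough, max_steps=50):
    # Hypothetical meta-level loop: call the one-shot model repeatedly,
    # feed its own output back in, and stop when some metric is satisfied
    transcript = prompt
    for _ in range(max_steps):
        step = model(transcript)        # one one-shot run
        transcript += step
        if good_enough(transcript):     # the assumed stopping metric
            break
    return transcript

# Toy demo with stand-ins for the real model and the stopping metric
toy_model = lambda text: " step"
done = lambda text: text.count("step") >= 3
print(orchestrate("solve the puzzle:", toy_model, done))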
What about LawrenceC, then? Well, I find the article full of special pleading.
Section 3: “Criticism of LLMs and neural nets is nothing new”. That’s true, but entirely beside the point. LawrenceC calls the Tower of Hanoi and the other puzzles “toy settings”. Why does he do that? Presumably to belittle them. But if an LLM cannot solve basic logical puzzles reliably, we have a problem, Houston. LawrenceC pretends we do not. The onus of proof is on LawrenceC: prove that LLMs are not limited in their ability. So far, all the evidence clearly indicates that they are, and that success depends on prior exposure.
Section 4: the four puzzles. This is the meat in the sandwich, and LawrenceC essentially has nothing to say about them. A clear weakness in his argument.
Section 5: what did the model say? Again, this is short and sketchy. The article includes output from Claude 3.7 Sonnet:
“The Tower of Hanoi follows a recursive pattern. For 10 disks, the solution requires exactly 2^10 - 1 = 1023 moves. While I understand the algorithm perfectly, manually writing out all 1023 moves would be extremely tedious and error-prone. Let me demonstrate my understanding by showing the structure and first several moves.”
I find it amazing that LawrenceC does not comment on the disingenuous/dishonest output. First of all, there is no “I” in the computer. Second, a computer does not find any activity tedious, it does not do manual work, and it is not error-prone. All these terms make it sound like a human, or like some self-conscious sci-fi robot – not like a machine. But it’s all smoke and mirrors – a reflection of text written by humans. It’s all fake. Obviously there is no “I” here that can’t be bothered to write out stuff “manually” – a word which makes no sense in this context, and yet LawrenceC repeats it later in the article. That is not why the LLM fails. It fails because it’s not solving the problem algorithmically, but LLM-ally.
Up to this point, LawrenceC has revealed where his bias lies. In Section 6 the knotty questions arise. They remind me to some extent of discussions of Universal Grammar (UG), language and cognition. The “hard AI” folks, like LawrenceC (?), seem to believe that exposure to patterns leads to the creation of explicit rules (akin to UG; in the case of UG, Chomsky’s claim was that exposure to patterns, that is, language use, activated latent patterns in the brain). I don’t. I agree that LLMs show an incredible (human term) ability to see patterns and to remember and apply them. But LLMs do not extract algorithmic patterns – how could they? So when LawrenceC points out that Claude 3.7 Sonnet can spit out a Python algorithm, he is making a huge mistake. It’s unthinkable that the LLM should have inferred the algorithm from observed data. Why? Because it was not designed to carry out that feat. It’s just repeating stuff it has been trained on. Could the algorithm have emerged? Prove it!
When I see the Fibonacci series 1, 1, 2, 3, 5, 8, 13, 21, 34, … I find the pattern by trial and error, and I have enough mathematical experience to be able to say F(n) = F(n-1) + F(n-2). When an LLM describes Fibonacci it is simply regurgitating its training material.
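Stated as an explicit rule rather than a remembered sequence, the recurrence is a tiny algorithm (a sketch of my own):

def fib(n):
    # Apply the explicit rule F(n) = F(n-1) + F(n-2), starting from 1, 1
    a, b = 1, 1
    for _ in range(n - 1):
        a, b = b, a + b
    return a

print([fib(n) for n in range(1, 10)])   # [1, 1, 2, 3, 5, 8, 13, 21, 34]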
To return to LawrenceC, he says: “There’s a common assumption in many LLM critiques that reasoning ability is binary: either you have it, or you don’t. Either you have true generalization, or you have pure memorization. Under this dichotomy, showing that LLMs fail to learn or implement the general algorithm in a toy example is enough to conclude that they must be blind pattern matchers in general.”
Note the term “toy example” again. LawrenceC claims that the dichotomy is false, it seems. But is that the key question?
Consider these three capabilities: rote memorization, pattern recognition, and defining/executing algorithms.
Humans can do all three: we can rote-learn 7*7=49, we can detect and learn patterns like 2, 4, 6, 8, 10, and we can define algorithms: FOR X = 1 TO 10; SUM = SUM + X; NEXT.
Traditional computers can remember, and they can execute algorithms that we define.
Traditional neural nets can remember and recognize patterns based on numerical values.
LLMs can certainly remember and recognize patterns – but can they infer algorithms? It seems to me that this is the key issue. And algorithms require symbol manipulation, that is, abstraction, variables, and instantiation/binding; the contrast is sketched in code below. The burden of proof is on those who claim these capabilities for LLMs, especially in light of what we know about their architecture.
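To make the contrast concrete, here is a small illustration of my own: a rote-memorized fact, a recognized numerical pattern, and a genuine algorithm with a variable that is bound and updated:

# Rote memorization: a stored fact, no computation involved
facts = {"7*7": 49}

# Pattern recognition: continue 2, 4, 6, 8, 10 by spotting the constant step
def continue_pattern(seq):
    step = seq[1] - seq[0]
    return seq[-1] + step

# Algorithm: abstraction over a variable x that is instantiated and updated
def sum_to(n):
    total = 0
    for x in range(1, n + 1):   # x is bound to 1, 2, ..., n in turn
        total += x
    return total

print(facts["7*7"], continue_pattern([2, 4, 6, 8, 10]), sum_to(10))   # 49 12 55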

Final note: I just asked ChatGPT to do Fibonacci in reverse from 55. It did it well, and offered Python code. I said yes please, and got it, including its supposed output. That was fine, except the output isn’t in reverse. Try it yourself.
def reverse_fibonacci(start_value):
    fib = [start_value]
    # Get the second-last Fibonacci number that led to start_value
    # We know 55 is a Fibonacci number, so we find its predecessor
    a, b = 0, 1
    while b < start_value:
        a, b = b, a + b
    if b != start_value:
        raise ValueError(f"{start_value} is not a Fibonacci number")
    fib.insert(0, a)  # insert the second last value before the start
    # Continue calculating in reverse using F(n-2) = F(n) - F(n-1)
    while fib[0] != 0:
        prev = fib[1] - fib[0]
        fib.insert(0, prev)
    return fib

# Example usage
result = reverse_fibonacci(55)
print(result)
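For comparison, a minimal correction of my own (not ChatGPT’s) that actually returns the sequence in descending order:

def reverse_fibonacci_fixed(start_value):
    # Build the ascending Fibonacci sequence up to start_value, then flip it
    fib = [0, 1]
    while fib[-1] < start_value:
        fib.append(fib[-1] + fib[-2])
    if fib[-1] != start_value:
        raise ValueError(f"{start_value} is not a Fibonacci number")
    return fib[::-1]   # largest first: 55, 34, 21, 13, 8, 5, 3, 2, 1, 1, 0

print(reverse_fibonacci_fixed(55))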