Hacker News

Wow. That's an impressive result, but how did they do it?

Wei references scaling up test-time compute, so I have to assume they threw a boatload of money at this. I've heard talk of running models in parallel and comparing results - if OpenAI ran this 10000 times in parallel and cherry-picked the best one, this is a lot less exciting.

If this is legit, then we need to know what tools were used and how the model used them. I'd bet those are the 'techniques to make them better at hard to verify tasks'.



Why is that less exciting? A machine competing in an unconstrained, natural-language, difficult math contest and coming out on top by any means would have been breathtaking science fiction a few years ago - now it's not exciting? Regardless of the tools for verification or even solvers - why are the goalposts moving so fast? There is no bonus for "purity of essence" and using only neural networks. We live in an era where it's hard to tell if machines are thinking or not, which since the first computing machines has been seen as the ultimate achievement. Now we pooh-pooh the results of each iteration - which unfold month over month now, not decade over decade.

You don’t have to be hyped to be amazed. You can retain the ability to dream while not buying into the snake oil. This is amazing no matter what ensemble of techniques was used. In fact - you should be excited if we’ve started to break out of the limitation of forcing NNs to be load-bearing in literally everything. That’s a sign of a maturing technology, not of its limitations.


>> Why is that less exciting? A machine competing in an unconstrained, natural-language, difficult math contest and coming out on top by any means would have been breathtaking science fiction a few years ago - now it's not exciting?

Half the internet is convinced that LLMs are a big data cheating machine and if they're right then, yes, boldly cheating where nobody has cheated before is not that exciting.


I don't get it, how do you "big data cheat" an AI into solving previously unencountered problems? Wouldn't that just be engineering?


I don’t know how you’d cheat at it either, but if you could, it would manifest as the model getting gold on the test and then, six months later when it’s released to the public, exhibiting wild hallucinations and basic algebra errors. I don’t know if that’s how it’ll play out this time, but I know how it played out the last ten times.


It depends on what you mean by "engineering". For example, "engineering" can mean that you train and fine-tune a machine learning system to beat a particular benchmark. That's fun times but not really interesting or informative.


> previously unencountered problems

I haven't read the IMO problems, but knowing how math Olympiad problems work, they're probably not really "unencountered".

People aren't inventing these problems ex nihilo, there's a rulebook somewhere out there to make life easier for contest organizers.

People aren't doing these contests for money, they are doing them for honor, so there is little incentive to cheat. With big business LLM vendors it's a different situation entirely.


I mean, solutions for the 2025 IMO problems are already available on the internet. How can we be sure these are “unencountered” problems?


They probably have an archived data set from before then that they trained on.


They would if they're honest, which we just don't know for sure these days.


Without sharing their methodology, how can we trust the claim? Questions like:

1) Did humans formalize the input? 2) Did humans prompt the LLM towards the solution? Etc.

I am excited to hear about it, but I remain skeptical.


>Why is that less exciting?

Because if I have to throw 10000 rocks to get one in the bucket, I am not as good/useful of a rock-into-bucket-thrower as someone who gets it in one shot.

People would probably not be as excited about the prospect of employing me to throw rocks for them.


If you don't have an automatic way to verify a solution, then picking the correct answer out of 10,000 is more impressive than coming up with some answer in the first place. If AI tech becomes able to effectively prune a tree search without an eval function, that would be a super big leap, but I doubt they achieved this.
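The distinction above can be sketched as best-of-N sampling: cheap only if an automatic verifier exists. This is a hypothetical toy, not OpenAI's pipeline; `generate` and `verify` stand in for a stochastic model call and a grader.

```python
def generate(answer, attempt):
    # Stand-in for a stochastic model call: most attempts miss the target.
    return answer + (attempt % 7) - 3  # only attempt % 7 == 3 is correct

def verify(answer, candidate):
    # Stand-in for an automatic checker. Without something like this,
    # best-of-N degenerates into unaided guessing.
    return candidate == answer

def best_of_n(answer, n):
    # Sample n candidates and return the first that passes verification.
    for attempt in range(n):
        candidate = generate(answer, attempt)
        if verify(answer, candidate):
            return candidate
    return None  # no candidate survived verification

print(best_of_n(42, 10))  # enough samples: finds 42
print(best_of_n(42, 3))   # too few samples: None
```

The whole scheme hinges on `verify`: replace it with a human reading 10,000 proofs and the compute advantage disappears, which is the point being debated here.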


It’s exciting because nearly all humans have a 0% chance of throwing the rock into the bucket, and most people believed a rock-into-bucket-throwing machine was impossible. So even an inefficient rock-into-bucket-thrower is impressive.

But the bar has been getting raised very rapidly. What was impressive six months ago is awful and unexciting today.


You're putting words in my mouth. It's not "awful and unexciting" - it is certainly an important step - but the hype the headline invites is the immensely greater one of an accurate rock-thrower. And if they have the inefficient one and are trying to pretend they have the real deal, that's flim-flam-man levels of overstatement.


I think the main hesitancy is due to rampant anthropomorphism. These models cannot reason, they pattern match language tokens and generate emergent behaviour as a result.

Certainly the emergent behaviour is exciting but we tend to jump to conclusions as to what it implies.

This means we are far more trusting with software that lacks formal guarantees than we should be. We are used to software being sound by default but otherwise a moron that requires very precise inputs and parameters and testing to act correctly. System 2 thinking.

Now with NN it's inverted: it's a brilliant know-it-all but it bullshits a lot, and falls apart in ways we may gloss over, even with enormous resources spent on training. It's effectively incredible progress on System 1 thinking with questionable but evolving System 2 skills where we don't know the limits.

If you're not familiar with System 1 / System 2, it's googlable.


> These models cannot reason

Not trying to be a smarty pants here, but what do we mean by "reason"?

Just to make the point, I'm using Claude to help me code right now. In between prompts, I read HN.

It does things for me such as coding up new features, looking at the compile and runtime responses, and then correcting the code. All while I sit here and write with you on HN.

It gives me feedback like "lock free message passing is going to work better here" and then replaces the locks with the exact kind of thing I actually want. If it runs into a problem, it does what I did a few weeks ago, it will see that some flag is set wrong, or that some architectural decision needs to be changed, and then implements the changes.

What is not reasoning about this? Last year at this time, if I looked at my code with a two hour delta, and someone had pushed edits that were able to compile, with real improvements, I would not have any doubt that there was a reasoning, intelligent person who had spent years learning how this worked.

It is pattern matching? Of course. But why is that not reasoning? Is there some sort of emergent behavior? Also yes. But what is not reasoning about that?

I'm having actual coding conversations that I used to only have with senior devs, right now, while browsing HN, and code that does what I asked is being produced.


I think the biggest hint that the models aren't reasoning is that they can't explain their reasoning. Researchers have shown, for example, that how a model solves a simple math problem and how it claims to have solved it after the fact have no real correlation. In other words, there was only the appearance of reasoning.


People can't explain their reasoning either. People construct logical arguments in parallel for a conclusion they already reached intuitively, with no clue how it happened: "the idea just popped into my head while showering." To our credit, if this post-hoc rationalization fails, we are able to change our opinion to some degree.


Interestingly, people have to be trained in logic and in identifying fallacies, because logic is not a native capability of our minds. We aren’t even that good at it once trained, and many humans (don’t forget a 100 IQ is the median) cannot be trained.

Reasoning appears to be more accurately described as “awareness,” or some process that exists alongside thought, where agency and subconscious processes occur. It’s by construction unobservable by our conscious mind, which is why we have so much trouble explaining it. It’s not intuition - it’s awareness.


Yeah, surprisingly I think the differences are less in the mechanism used for thought and more in the experience of being a person alive in a body. A person can become an idea. An LLM always forgets everything. It cannot "care"


Is this true though? I've suggested things that it pushed back on. Feels very much like a dev. It doesn't just dumbly do what I tell it.


Sure but it isn't reasoning that it should push back. It isn't even "pushing" which would require an intent to change you which it lacks


> I'm having actual coding conversations that I used to only have with senior devs, right now, while browsing HN, and code that does what I asked is being produced.

I’m using Opus 4 for coding and there is no way that model demonstrates any reasoning or any “intelligence,” in my opinion. I’ve been through the having-conversations phase etc., but it doesn’t get you very far; better to read a book.

I use these models to help me type less now, that’s it. My prompts basically tell it to not do anything fancy and that works well.


> It will do something brilliant and another 5 dumb things in the same prompt.

it me


YOU are reasoning.


You raise a fair point. These criticisms based on "it's merely X" or "it's not really Y" don't hold water when X and Y are poorly defined.

The only thing that should matter is the results they get. And I have a hard time understanding why the thing that is supposed to behave in an intelligent way but often just spew nonsense gets 10x budget increases over and over again.

This is bad software. It does not do the thing it promises to do. Software that sometimes works and very often produces wrong or nonsensical output is garbage software. Sinking 10x, 100x, 1000x more resources into it is irrational.

Nothing else matters. Maybe it reasons, maybe it's intelligent. If it produces garbled nonsense often, giving the teams behind it 10x the compute is insane.


"Software that sometimes works and very often produces wrong or nonsensical output" can be extremely valuable when coupled with a way to test whether the result is correct.


> It does not do the thing it promises to do. Software that sometimes works and very often produces wrong or nonsensical output...

Is that very unlike humans?

You seem to be comparing LLMs to much less sophisticated deterministic programs. And claiming LLMs are garbage because they are stochastic.

Which entirely misses the point because I don't want an LLM to render a spreadsheet for me in a fully reproducible fashion.

No, I expect an LLM to understand my intent, reason about it, wield those smaller deterministic tools on my behalf and sometimes even be creative when coming up with a solution, and if that doesn't work, dream up some other method and try again.

If _that_ is the goal, then some amount of randomness in the output is not a bug it's a necessary feature!


You're right, they should never have given more resources and compute to the OpenAI team after the disaster called GPT-2, which only knew how to spew nonsense.


We already have highly advanced deterministic software. The value lies in the abductive “reasoning” and natural language processing.

We deal with non determinism any time our code interacts with the natural world. We build guard rails, detection, classification of false/true positive and negatives, and all that all the time. This isn’t a flaw, it’s just the way things are for certain classes of problems and solutions.

It’s not bad software - it’s software that does things we’ve been trying to do for nearly a hundred years, beyond any reasonable expectation. The fact that I can tell a machine in human language to do some relatively abstract and complex task and it pretty reliably “understands” me and my intent, “understands” its tools and capabilities, and “reasons” how to bridge my words to a real-world action is not bad software. It’s science fiction.

The fact that “reliably” shows up is the non-determinism. Not perfectly - although on a retry with a new seed it often succeeds. This feels like most software that interacts with natural processes in any way or form.

It’s remarkable that anyone who has ever implemented exponential backoff and retry, or has ever handled edge cases, can sit and say “nothing else matters” when they make their living dealing with non-determinism. The algorithmic kernel of logic is 1% of programming and systems engineering; 99% is coping with the non-determinism in computing systems.

The technology is immature and the toolchains are almost farcically basic - money is pouring into model training because we have not yet hit a wall with brute force. It takes longer to build a new way of programming and designing highly reliable systems in the face of non-determinism, but it’s getting better faster than almost any technology change in my 35 years in the industry.

Your statement that it “very often produces wrong or nonsensical output” also tells me you’re holding onto a bias from prior experiences. The rate of improvement is astonishing. At this point in my professional use of frontier LLMs and techniques, they are exceeding the precision and recall of humans, and there’s a lot of rich ground untouched.

We can already offload massive amounts of the decision-making (classification) work that humans would do, and use humans as a last line to exercise executive judgement, often with the assistance of LLMs. I expect within two years humans will only be needed in the most exceptional of situations, and we will do a better job on more tasks than we ever could have dreamed of with humans. For the company I’m at, this is a huge bottom-line improvement far beyond the cost of our AI infrastructure and development, and we do quite a lot of that too.

If you’re not seeing it yet, I wouldn’t use that to extrapolate to the world at large and especially not to the future.


>I think the main hesitancy is due to rampant anthropomorphism. These models cannot reason, they pattern match language tokens and generate emergent behaviour as a result

This is rampant human chauvinism. There's absolutely no empirical basis for the statement that these models "cannot reason", it's just pseudoscientific woo thrown around by people who want to feel that humans are somehow special. By pretty much every empirical measure of "reasoning" or intelligence we have, SOTA LLMs are better at it than the average human.


> This is rampant human chauvinism

What in the accelerationist hell?


There's nothing accelerationist about recognising that making unfalsifiable statements about LLMs lacking intelligence or reasoning ability serves zero purpose except stroking the speaker's ego. Such people are never willing to give clear criteria for what would constitute proof of machine reasoning, which shows their belief isn't based on science or reason.


I’ve used these AI tools for multiple hours a day for months. Not seeing the reasoning part, honestly. I see the heuristics part.


I guess your work doesn't involve any maths then, because then you'd see they're capable of solving maths problems that require a non-trivial amount of reasoning steps.


Just the other day I needed to code some interlocked indices. It wasn't particularly hard, but I didn't want to context-switch and think, so instead I asked GPT-4o. After going back and forth 4 or 5 times, with it giving wrong answers each time, I finally decided to just take pen and paper and do it by hand. I have a hard time believing that these models are reasoning, because if they are, they are very poor at it.


GPT-4o isn't classified as a "reasoning" model (by common 2025 terminology at least) - I suggest trying again with o3 or Claude 4 or Gemini 2.5.


Because the usefulness of an AI model is reliably solving a problem, not being able to solve a problem given 10,000 tries.

Claude Code is still only a mildly useful tool because it's horrific beyond a certain breadth of scope. If I asked it to solve the same problem 10,000 times I'm sure I'd get a great answer to significantly more difficult problems, but that doesn't help me as I'm not capable of scaling myself to checking 10,000 answers.


>if OpenAI ran this 10000 times in parallel and cherry-picked the best one, this is a lot less exciting.

That entirely depends on who did the cherry picking. If the LLM had 10000 attempts and each time a human had to falsify it, this story means absolutely nothing. If the LLM itself did the cherry picking, then this is just akin to a human solving a hard problem. Attempting solutions and falsifying them until the desired result is achieved. Just that the LLM scales with compute, while humans operate only sequentially.


The key bit here is whether the LLM doing the cherry picking had knowledge of the solution. If it didn't, this is a meaningful result. That's why I'd like more info, but I fear OpenAI is going to try to keep things under wraps.


> If it didn't

We kind of have to assume it didn't right? Otherwise bragging about the results makes zero sense and would be outright misleading.


> would be outright misleading

Why wouldn't they? What are the incentives not to?


Corporations mislead to make money all the damn time.


"You really think someone would do that, just go on the internet and tell lies?"

[https://youtube.com/watch?v=YWdD206eSv0]


openai have been caught doing exactly this before


Why do people keep making up controversial claims like this? There is no evidence at all to this effect


it was widely covered in the press earlier in the year


Source?


Mark Chen posted that the system was locked before the contest. [1] It would obviously be crazy cheating to give verifiers a solution to the problem!

[1] https://x.com/markchen90/status/1946573740986257614?s=46&t=H...


I don't think it's much less exciting if they ran it 10,000 times in parallel. It implies an ability to discern when a proof is correct and rigorous (which o3 can't do consistently), and it also means that outputting the full proof is within the model's capabilities, even if rarely.


The whole point of RL is if you can get it to work 0.01% of the time you can get it to work 100% of the time.
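A deterministic caricature of that claim (all numbers here are invented for illustration and have nothing to do with OpenAI's actual training): a policy that succeeds 0.01% of the time still produces occasional successes to reinforce, and reinforcement compounds geometrically.

```python
# Toy RL picture: each round costs ~1/p attempts to find one success,
# and reinforcing that success multiplies the success rate p by 1.5
# (a made-up improvement factor, purely for illustration).
p = 0.0001          # the policy works 0.01% of the time
total_attempts = 0
rounds = 0
while p < 1.0:
    total_attempts += int(1 / p)   # expected attempts until one success
    p = min(1.0, p * 1.5)          # reinforce: sharpen the policy
    rounds += 1

print(rounds)   # 23 reinforcement rounds take p from 0.01% to 100%
```

The early rounds dominate the cost (finding the first few successes is what's expensive), which is why sampling at massive scale and filtering for rare wins can still be a sensible training strategy.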


> what tools were used and how the model used them

According to the twitter thread, the model was not given access to tools.


> if OpenAI ran this 10000 times in parallel and cherry-picked the best one

This is almost certainly the case - remember the initial o3 ARC benchmark? I could add that this is probably a multi-agent system as well, so the context-length restriction can be bypassed.

Overall, AI being good at math problems isn't news to me. It was already better than 99.99% of humans; now it is better than 99.999% of us. So ... ?



