They counted multiple hallucinations in a single paper toward the 100, and explicitly call out one paper with 13 incorrect citations that are claimed (reasonably, IMO) to be hallucinated.
>GPTZero's analysis of 4841 papers accepted by NeurIPS 2025 shows there are at least 100 with confirmed hallucinations
Is not true. [Edit: that sounds a bit harsh, making it seem like you are accusing them; it's more that this is a logical conclusion of your (IMO reasonable) interpretation.]
I spot-checked one of the flagged papers (from Google, co-authored by a colleague of mine)
The paper was https://openreview.net/forum?id=0ZnXGzLcOg and the problem flagged was "Two authors are omitted and one (Kyle Richardson) is added. This paper was published at ICLR 2024." I.e., for one cited paper, the author list was off and the venue was wrong. And this citation was mentioned in the background section of the paper, and not fundamental to the validity of the paper. So the citation was not fabricated, but it was incorrectly attributed (perhaps via use of an AI autocomplete).
I think there are some egregious papers in their dataset, and this error does make me pause to wonder how much of the rest of the paper used AI assistance. That said, the "single error" papers in the dataset seem similar to the one I checked: relatively harmless and minor errors (which would be immediately caught by a DOI checker), and so I have to assume some of these were included in the dataset mainly to amplify the author's product pitch. It succeeded.
>this error does make me pause to wonder how much of the rest of the paper used AI assistance
And this is what's operative here. The error spotted, the entire class of error spotted, is easily checked/verified by a non-domain expert. These are the errors we can confirm readily, with an obvious and unmistakable signature of hallucination.
If these are the only errors, we are not troubled. However: we do not know if these are the only errors, they are merely a signature that the paper was submitted without being thoroughly checked for hallucinations. They are a signature that some LLM was used to generate parts of the paper and the responsible authors used this LLM without care.
Checking the rest of the paper requires domain expertise, perhaps requires an attempt at reproducing the authors' results. That the rest of the paper is now in doubt, and that this problem is so widespread, threatens the validity of the fundamental activity these papers represent: research.
> If these are the only errors, we are not troubled. However: we do not know if these are the only errors, they are merely a signature that the paper was submitted without being thoroughly checked for hallucinations. They are a signature that some LLM was used to generate parts of the paper and the responsible authors used this LLM without care.
I am troubled by people using an LLM at all to write academic research papers.
It's a shoddy, irresponsible way to work. And also plagiarism, when you claim authorship of it.
I'd see a failure of the 'author' to catch hallucinations as more like a failure to hide evidence of misconduct.
If academic venues are saying that using an LLM to write your papers is OK ("so long as you look it over for hallucinations"?), then those academic venues deserve every bit of operational pain and damaged reputation that will result.
I would argue that an LLM is a perfectly sensible tool for structure-preserving machine translation from another language to English. (Where by "another language", you could also substitute "very poor/non-fluent English." Though IMHO that's a bit silly, even though it's possible; there's little sense in writing in a language you only half know, when you'd get a less-lossy result from just writing in your native tongue and then having the LLM translate from that.)
Google Translate et al were never good enough at this task to actually allow people to use the results for anything professional. Previous tools were limited to getting a rough gloss of what words in another language mean.
But LLMs can be used in this way, and are being used in this way; and this is increasingly allowing non-English-fluent academics to publish papers in English-language journals (thus engaging with the English-language academic community), where previously those academics may have felt "stuck" publishing in what few journals exist for their discipline in their own language.
Would you call the use of LLMs for translation "shoddy" or "irresponsible"? To me, it'd be no more and no less "shoddy" or "irresponsible" than it would be to hire a freelance human translator to translate the paper for you. (In fact, the human translator might be a worse idea, as LLMs are more likely to understand how to translate the specific academic jargon of your discipline than a randomly-selected human translator would be.)
Autotranslating technical texts is very hard. After the translation, you must check that all the technical words were translated correctly, rather than replaced with a fancy synonym that does not make sense.
(A friend has an old book translated a long time ago (by a human) from Russian to Spanish. Instead of "complex numbers", the book calls them "complicated numbers". :) )
I remember one time when I had written a bunch of user facing text for an imaging app and was reviewing our French translation. I don't speak French but I was pretty sure "plane" (as in geometry) shouldn't be translated as "avion". And this was human translated!
You'd be surprised how shoddy human translations can be, and it's not necessarily because of the translators themselves.
Typically what happens is that translators are given an Excel sheet with the original text in a column, and the translated text must be put into the next column. Because there's no context, it's not necessarily clear to the translator whether the translation for plane should be avion (airplane) or plan (geometric plane). The translator might not ever see the actual software with their translated text.
The convenient thing in this case (verification of translation of academic papers from the speaker's native language to English) is that the authors of the paper likely already 1. can read English to some degree, and 2. are highly likely to be familiar specifically with the jargon terms of their field in both their own language and in English.
This is because, even in countries with a different primary spoken language, many academic subjects, especially at a graduate level (masters/PhD programs — i.e. when publishing starts to matter), are still taught at universities at least partly in English. The best textbooks are usually written in English (with acceptably-faithful translations of these texts being rarer than you'd think); all the seminal papers one might reference are likely to be in English; etc. For many programs, the ability to read English to some degree is a requirement for attendance.
And yet these same programs are also likely to provide lectures (and TA assistance) in the country's own native language, with the native-language versions of the jargon terms used. And any collaborative work is likely to also occur in the native language. So attendees of such programs end up exposed to both the native-language and English-language terms within their field.
This means that academics in these places often have very little trouble in verifying the fidelity of translation of the jargon in their papers. It's usually all the other stuff in the translation that they aren't sure is correct. But this can be cheaply verified by handing the paper to any fluently-multilingual non-academic and asking them to check the translation, with the instruction to just ignore the jargon terms because they were already verified.
To that point, I think it's lovely how LLMs democratize science. At ICLR a few years ago I spoke with a few Korean researchers who were delighted that their relative inability to write in English was no longer being held against them during the review process. I think until then I had underestimated how pivotal this technology is in lowering the barrier to entry for the non-English-speaking scientific community.
If they can write a whole draft in their first language, they can easily read the translated English version and correct it. The errors described by gp/op arise when authors directly ask the LLM to generate a full paragraph of text. Look at my terrible English; I really have been through the full process from draft to English version before :)
We still do not have a standardized way to represent machine learning concepts. For example, in vision models, I see lots of papers confuse "skip connections" and "residual connections": when they concatenate channels they call it a "residual connection", which shows that they haven't understood why we call them "residual" in the first place. In my humble opinion, each conference, or better yet a confederation of conferences, should work together to provide a glossary, a technical guideline, and also a special machine translation tool, to correct unclear, grammatical-error-riddled English like mine!
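For what it's worth, the distinction is easy to show in code. Here is a minimal, purely illustrative PyTorch sketch (the layer sizes are arbitrary, not from any particular paper): an additive residual connection learns a residual F(x) on top of the identity, while a concatenation-style skip connection just stacks channels, so there is nothing "residual" about it.

    # Minimal sketch: residual (additive) vs. skip-by-concatenation.
    import torch
    import torch.nn as nn

    class ResidualBlock(nn.Module):
        """Residual connection: the block learns a residual F(x) and adds it to x."""
        def __init__(self, channels: int):
            super().__init__()
            self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

        def forward(self, x):
            return x + self.conv(x)          # output = x + F(x), same shape as x

    class ConcatSkipBlock(nn.Module):
        """Skip connection by concatenation (U-Net/DenseNet style): features are
        stacked along the channel dimension, nothing is added to the identity."""
        def __init__(self, channels: int):
            super().__init__()
            self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

        def forward(self, x):
            return torch.cat([x, self.conv(x)], dim=1)   # channel count doubles

    x = torch.randn(1, 8, 32, 32)
    print(ResidualBlock(8)(x).shape)    # torch.Size([1, 8, 32, 32])
    print(ConcatSkipBlock(8)(x).shape)  # torch.Size([1, 16, 32, 32])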
I'm surprised by these results. I agree that LLMs are a great tool for offsetting the English-speaking world's advantage. I would have expected non-Anglo-American universities to rank at the top of the list. One of the most valuable features of LLMs from the beginning has been their ability to improve written language.
Why is their use more intense in English-speaking universities?
Good point. There may be a place for LLMs for science writing translation (hopefully not adding nor subtracting anything) when you're not fluent in the language of a venue.
You need a way to validate the correctness of the translation, and to be able to stand behind whatever the translation says. And the translation should be disclosed on the paper.
There are legitimate, non-cheating ways to use LLMs for writing. I often use the wrong verb forms ("They synthesizes the ..."), write "though" when it should be "although", and forget to comma-separate clauses. LLMs are perfect for that. Generating text from scratch, however, is wrong.
I agree, but I don't think any of the broadly acceptable uses would result in easily identifiable flaws like those in the post, especially hallucinated URLs.
>I am troubled by people using an LLM at all to write academic research papers.
I'm an outsider to the academic system. I have cool projects that I feel push some niche application to SOTA in my tiny little domain, which is publishable based on many of the papers I've read.
If I can build a system that does a thing, and I can benchmark it and prove it's better than previous work, my main blocker is getting all my work and information into the "Arxiv PDF" format and tone. Seems like a good use of LLMs to me.
> And also plagiarism, when you claim authorship of it.
I don't actually mind putting Claude as a co-author on my github commits.
But for papers there are usually so many tools involved. It would be crowded to include each of Claude, Gemini, Codex, Mathematica, Grammarly, Translate etc. as co-authors, even though I used all of them for some parts.
Maybe just having a "tools used" section could work?
> It's a shoddy, irresponsible way to work. And also plagiarism, when you claim authorship of it.
It reminds me of kids these days and their fancy calculators! Those new fangled doohickeys just aren't reliable, and the kids never realize that they won't always have a calculator on them! Everyone should just do it the good old fashioned way with slide rules!
Or these darn kids and their unreliable sources like Wikipedia! Everyone knows that you need a nice solid reliable source that's made out of dead trees and fact checked by up to 3 paid professionals!
I doubt that it's common for anyone to read a research paper and then question whether the researcher's calculator was working reliably.
Sure, maybe someday LLMs will be able to report facts in a mostly reliable fashion (like a typical calculator), but we're definitely not even close to that yet, so until we are the skepticism is very much warranted. Especially when the details really do matter, as in scientific research.
> I doubt that it's common for anyone to read a research paper and then question whether the researcher's calculator was working reliably
Reproducibility and repeatability in the sciences?
Replication crisis > Causes > Problems with the publication system in science > Mathematical errors; Causes > Questionable research practices > In AI research; Remedies > [..., open science, reproducible workflows, disclosure, ...]
https://en.wikipedia.org/wiki/Replication_crisis#Mathematica...
Already, machine-verifiable proofs run to impossibly many pages for human review.
There are "verify each premise" and "verify the logical form of the argument" (P therefore Q) steps that the model still doesn't do for the user.
For your domain, how insufficient is the output given a process prompt like:
Identify hallucinations from models prior to (date in the future)
Check each sentence of this: ```{...}```
Research ScholarlyArticles (and then their Datasets) which support and which reject your conclusions. Critically review findings and controls.
Suggest code to write to apply data science principles to proving correlative and causative relations given already-collected observations.
Design experiment(s) given the scientific method to statistically prove causative (and also correlative) relations
Identify a meta-analytic workflow (process, tools, schema, and maybe code) for proving what is suggested by this chat
> whether the researcher's calculator was working reliably.
LLMs do not work reliably; that's not their purpose.
If you use them that way it's akin to using a butter knife as a screwdriver. You might get away with it once or twice, but then you slip and stab yourself. Better to go find a screwdriver if you need reliable.
I'm really not moved by this argument; it seems a false equivalence. It's not merely a spell checker or removing some tedium.
As a professional mathematician I used Wikipedia all the time to look up quick facts before verifying them myself or elsewhere. A calculator? Well, I can use an actual programming language.
Up until this point, neither of those tools was advertised or used by people to entirely replace human input.
There are some interesting possibilities for LLMs in math, especially in terms of generating machine-checked proofs using languages like Lean. But this is a supplement to the actual result, where the LLM would actually be adding a more rigorous version of a human's argument with all the boring steps included.
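For readers who haven't seen what "machine-checked" means in practice, here is a toy Lean 4 example (the theorem is trivial and purely illustrative): once the file compiles, the kernel has verified every step, no matter whether a human or an LLM wrote the proof term.

    -- A toy machine-checked statement: every natural number is at most its successor.
    -- If this file compiles, Lean's kernel has verified the proof.
    theorem toy_le_succ (n : Nat) : n ≤ n.succ :=
      Nat.le_succ n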
In a few cases, I see Terence Tao has pointed out examples of LLMs actually finding proofs of open problems unassisted. Not necessarily problems anyone cared deeply about. But there's still the fact that if the proof holds, then it's valid no matter who or what came up with it.
AI People: "AI is a completely unprecedented technology where its introduction is unlike the introduction of any other transformative technology in history! We must treat it totally differently!"
Also AI People: "You're worried about nothing, this is just like when people were worried about the internet."
The internet analogy is apt because it was in fact a massive bubble, but that bubble popping didn't mean the tech went away. Same will happen again, which is a point both extremes miss. One would have you believe there is no bubble and you should dump all your money into this industry, while the other would have us believe that once the bubble pops all this AI stuff will be debunked and discarded as useless scamware.
Well, the internet has definitely changed things; but it also wasn't initially controlled by a bunch of megacorps with the level of power and centralisation we see today.
> Those new fangled doohickeys just aren't reliable
Except they are (unlike a chatbot, a calculator is perfectly deterministic), and the unreliability of LLMs is one of their most, if not the most, widespread target of criticism.
Low effort doesn't even begin to describe your comment.
> Except they are (unlike a chatbot, a calculator is perfectly deterministic)
LLMs are supposed to be stochastic. That is not a bug; I can see why you find that disappointing, but it's just the reality of the tool.
However, as I mentioned elsewhere calculators also have bugs and those bugs make their way into scientific research all the time. Floating point errors are particularly common, as are order of operations problems because physical devices get it wrong frequently and are hard to patch. Worse, they are not SUPPOSED TO BE stochastic so when they fail nobody notices until it's far too late. [0 - PDF]
Further, spreadsheets are no better: for example, a scan of ~3,600 genomics papers found that about 1 in 5 had gene‑name errors (e.g., SEPT2 → “2‑Sep”) because that's how Excel likes to format things.[1] Again, this is much worse than a stochastic machine doing its stochastic job... because it's not SUPPOSED to be random, it's just broken, and on a truly massive scale.
That’s a strange argument. There are plenty of stochastic processes that have perfectly acceptable guarantees. A good example is Karger’s min-cut algorithm. You might not know what you get on any given single run, but you know EXACTLY what you’re going to get when you crank up the number of trials.
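To make that concrete, here is a minimal, illustrative sketch of Karger's contraction idea (random edge order plus union-find, which gives the same distribution as contracting a uniformly random edge each step). A single run finds the min cut with probability at least 2/(n(n-1)); repeating roughly C(n,2)·ln n times and taking the minimum drives the failure probability down to about 1/n. That is the kind of guarantee you get from a well-understood stochastic process.

    # Minimal sketch of Karger's randomized min-cut with trial amplification.
    import math
    import random

    def contract_once(n, edges):
        """One random-contraction run on nodes 0..n-1 (union-find version)."""
        parent = list(range(n))

        def find(x):
            while parent[x] != x:
                parent[x] = parent[parent[x]]  # path halving
                x = parent[x]
            return x

        order = edges[:]
        random.shuffle(order)              # random edge order ~ random contractions
        remaining = n
        for u, v in order:
            if remaining == 2:
                break
            ru, rv = find(u), find(v)
            if ru != rv:
                parent[ru] = rv
                remaining -= 1
        # size of the cut between the two surviving super-vertices
        return sum(1 for u, v in edges if find(u) != find(v))

    def karger_min_cut(n, edges):
        trials = int(math.comb(n, 2) * math.log(n)) + 1   # ~n^2 log n runs
        return min(contract_once(n, edges) for _ in range(trials))

    # Toy graph: two triangles joined by a single bridge edge -> min cut is 1.
    edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    print(karger_min_cut(6, edges))  # prints 1 with high probability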
Nobody can tell you what you are going to get when you run an LLM once. Nobody can tell you what you’re going to get when you run it N times. There are, in fact, no guarantees at all. Nobody even really knows why it can solve some problems and why it can’t solve others, except maybe that it memorized the answer at some point. But this is not how they are marketed.
They are marketed as wondrous inventions that can SOLVE EVERYTHING. This is obviously not true. You can verify it yourself, with a simple deterministic problem: generate an arithmetic expression of length N. As you increase N, the probability that an LLM can solve it drops to zero.
Ok, fine. This kind of problem is not a good fit for an LLM. But which is? And after you’ve found a problem that seems like a good fit, how do you know? Did you test it systematically? The big LLM vendors are fudging the numbers. They’re testing on the training set, they’re using ad hoc measurements and so on. But don’t take my word for it. There’s lots of great literature out there that probes the eccentricities of these models; for some reason this work rarely makes its way into the HN echo chamber.
Now I’m not saying these things are broken and useless. Far from it. I use them every day. But I don’t trust anything they produce, because there are no guarantees, and I have been burned many times. If you have not been burned, you’re either exceptionally lucky, you are asking it to solve homework assignments, or you are ignoring the pain.
Excel bugs are not the same thing. Most of those problems can be found trivially. You can find them because Excel is a language with clear rules (just not clear to those particular users). The problem with Excel is that people aren’t looking for bugs.
> But I don’t trust anything they produce, because there are no guarantees
> Did you test it systematically?
Yes! That is exactly the right way to use them. For example, when I'm vibe coding I don't ask it to write code. I ask it to write unit tests. THEN I verify that the test is actually testing for the right things with my own eyeballs. THEN I ask it to write code that passes the unit tests.
Same with even text formatting. Sometimes I ask it to write a pydantic script to validate text inputs of "x" format. Often writing the text to specify the format is itself a major undertaking. Then once the script is working I ask for the text, and tell it to use the script to validate it. After that I can know that I can expect deterministic results, though it often takes a few tries for it to pass the validator.
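As a concrete (hypothetical) example of the kind of validator described above, assuming a simple "name: value" line format, the script is roughly this; once it exists, pass/fail is deterministic no matter how the text was produced.

    # Sketch of a deterministic validator for a made-up "name: value" line format.
    from pydantic import BaseModel, field_validator

    class Entry(BaseModel):
        name: str
        value: int

        @field_validator("name")
        @classmethod
        def name_is_lowercase(cls, v: str) -> str:
            if not v.islower():
                raise ValueError("name must be lowercase")
            return v

    def validate_lines(text: str) -> list[Entry]:
        """Parse 'name: value' lines, raising on any malformed entry."""
        entries = []
        for line in text.strip().splitlines():
            name, _, value = line.partition(":")
            entries.append(Entry(name=name.strip(), value=int(value)))
        return entries

    print(validate_lines("alpha: 1\nbeta: 2"))  # ok; "Beta: x" would raise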
You CAN get deterministic results, you just have to adapt your expectations to match what the tool is capable of instead of expecting your hammer to magically be a great screwdriver.
I do agree that the SOLVE EVERYTHING crowd are severely misguided, but so are the SOLVE NOTHING crowd. It's a tool, just use it properly and all will be well.
One issue with this analogy is that calculators really are precise when used correctly. LLMs are not.
I do think they can be used in research but not without careful checking. In my own work I’ve found them most useful as search aids and brainstorming sounding boards.
> I do think they can be used in research but not without careful checking.
Of course you are right. It is the same with all tools, calculators included, if you use them improperly you get poor results.
In this case they're stochastic, which isn't something people are used to happening with computers yet. You have to understand that and learn how to use them or you will get poor results.
> One issue with this analogy is that calculators really are precise when used correctly. LLMs are not.
I made this a separate comment, because it's wildly off topic, but... they actually aren't. Especially for very large numbers or for high precision. When's the last time you did a firmware update on yours?
It's fairly trivial to find lists of calculator flaws and then identify them in research papers. I recall reading a research paper about it in the 00's.
One issue with this analogy is that paper encyclopedias really are precise when used correctly. Wikipedia is not.
I do think it can be used in research but not without careful checking. In my own work I've found it most useful as a search aid and for brainstorming.
Paper encyclopedias were neither precise nor accurate. You could count on them to give you ballpark figures most of the time, but certainly not precise answers. And that's assuming the set was new, but most encyclopedias people actually encountered were several years old at least. I remember the encyclopedia set I had access to in the 90s was written before the USSR fell.
> I do think it can be used in research but not without careful checking.
This is really just restating what I already said in this thread, but you're right. That's because wikipedia isn't a primary source and was never, ever meant to be. You are SUPPOSED to go read it then click through to the primary sources and cite those.
Lots of people use it incorrectly and get bad results because they still haven't realized this... all these years later.
Same thing with treating stochastic LLMs like sources of truth and knowledge. Those folks are just doing it wrong.
I don't necessarily disagree, but researchers are not required to be good communicators. An academic can lead their field and be a terrible lecturer. A specialist can let a generalist help explain concepts for them.
They should still review the final result though. There is no excuse for not doing that.
I disagree here. A good researcher has to be a good communicator. I am not saying that it is necessarily the case that you don't understand the topic if you cannot explain it well enough to someone new, but it is essential to communicate to have a good exchange of ideas with others, and consequently, become a better researcher. This is one of the skills you learn in a PhD program.
To me, this is a reminder of how much of a specific minority this forum is.
Nobody I know in real life, personally or at work, has expressed this belief.
I have literally only ever encountered this anti-AI extremism (extremism in the non-pejorative sense) in places like reddit and here.
Clearly, the authors in NeurIPS don't agree that using an LLM to help write is "plagiarism", and I would trust their opinions far more than some random redditor.
> Nobody I know in real life, personally or at work, has expressed this belief.
TBF, most people in real life don't even know how AI works to any degree, so using that as an argument that parent's opinion is extreme is kind of circular reasoning.
> I have literally only ever encountered this anti-AI extremism (extremism in the non-pejorative sense) in places like reddit and here.
I don't see parent's opinions as anti-AI. It's more an argument about what AI is currently, and what research is supposed to be. AI is existing ideas. Research is supposed to be new ideas. If much of your research paper can be written by AI, I call into question whether or not it represents actual research.
> Research is supposed to be new ideas. If much of your research paper can be written by AI, I call into question whether or not it represents actual research.
One would hope the authors are forming a hypothesis, performing an experiment, gathering and analysing results, and only then passing it to the AI to convert it into a paper.
If I have a theory that, IDK, laser welds in a sine wave pattern are stronger than laser welds in a zigzag pattern - I've still got to design the exact experimental details, obtain all the equipment and consumables, cut a few dozen test coupons, weld them, strength test them, and record all the measurements.
Obviously if I skipped the experimentation and just had an AI fabricate the results table, that's academic misconduct of the clearest form.
I am not an academic, so correct me if I am wrong, but in your example, the actual writing would probably only represent a small fraction of the time spent. Is it even worth using AI for anything other than spelling and grammar correction at that point? I think using an LLM to generate a paper from high level points wouldn't save much, if any, time if it was then reviewed the way that would require.
My brother in law is a professor, and he has a pretty bad opinion of colleagues that use LLMs to write papers, as his field (economics) doesn't involve much experimentation, and instead relies on data analysis, simulation, and reasoning. It seemed to me like the LLM assisted papers that he's seen have mostly been pretty low impact filler papers.
> I am not an academic, so correct me if I am wrong, but in your example, the actual writing would probably only represent a small fraction of the time spent. Is it even worth using AI for anything other than spelling and grammar correction at that point? I think using an LLM to generate a paper from high level points wouldn't save much, if any, time if it was then reviewed the way that would require.
It's understandable that you believe that, but it's absolutely true that writing in academia is a huge time sink. Think about it: the first thing your reviewers are going to notice is not the results but how well it is written.
If it's written terribly you have lost, and it doesn't matter how good your results are at that point. It's common to spend days with your PI writing a paper to perfection, and then spend months back and forth with reviewers updating and improving the text. This is even more true the higher up you go in journal prestige.
Who knows? Does NeurIPS have a pedigree of original, well-sourced research dating back to before the advent of LLMs? We're at the point where both of the terms "AI" and "Experts" are so blurred it's almost impossible to trust or distrust anything without spending more time on due diligence than most subjects deserve.
As the wise woman once said "Ain't nobody got time for that".
"If much of your research paper can be written by AI, I call into question whether or not it represents actual research" And what happens to this statement if next year or later this year the papers that can be autonomously written passes median human paper mark?
What does it mean to cross the median human paper mark? How is that measured?
It seems to me like most of the LLM benchmarks wind up being gamed. So, even if there were a good benchmark there, which I do not believe there is, the validity of the benchmark would likely diminish pretty quickly.
I find that hard to believe. Every creative professional that I know shares this sentiment. That’s several graphic designers at big tech companies, one person in print media, and one visual effects artist in the film industry. And once you include many of their professional colleagues that becomes a decent sample size.
> Plagiarism is using someone else's words, ideas, or work as your own without proper credit, a serious breach of ethics leading to academic failure, job loss, or legal issues, and can range from copying text (direct) to paraphrasing without citation (mosaic), often detected by software and best avoided by meticulous citation, quoting, and paraphrasing to show original thought and attribution.
Higher education is not free. People pay a shit ton of money to attend and also governments (taxpayers) invest a lot. Imagine offloading your research to an AI bot...
Where does this bizarre impulse to dogmatically defend LLM output come from? I don’t understand it.
If AI is a reliable and quality tool, that will become evident without the need to defend it - it’s got billions (trillions?) of dollars backstopping it. The skeptical pushback is WAY more important right now than the optimistic embrace.
The fact that there is absurd AI hype right now doesn't mean that we should let equally absurd bullshit pass on the other side of the spectrum. Having a reasonable and accurate discussion about the benefits, drawbacks, side effects, etc. is WAY more important right now than being flagrantly incorrect in either direction.
Meanwhile this entire comment thread is about what appears to be, as fumi2026 points out in their comment, a predatory marketing play by a startup hoping to capitalize on the exact sort of anti AI sentiment that you seem to think is important... just because there is pro AI sentiment?
Naming and shaming everyday researchers based on the idea that they have let hallucinations slip into their paper, all because your own AI model has decided that it was AI, so you can signal-boost your product, seems pretty shitty and exploitative to me, and is only viable as a product and marketing strategy because of the visceral anti-AI sentiment in some places.
No that’s a straw man, sorry. Skepticism is not the same thing as irrational rejection. It means that I don’t believe you until you’ve proven with evidence that what you’re saying is true.
The efficacy and reliability of LLMs require proof. AI companies are pouring extraordinary, unprecedented amounts of money into promoting the idea that their products are intelligent and trustworthy. That marketing push absolutely dwarfs the skeptical voices, and that's what makes those voices more important at the moment. If the researchers named have claims made against them that aren't true, that should be a pretty easy thing for them to refute.
The cat is out of the bag, though. AI does have provably crazy value. Certainly not the AGI hype that marketing spews, and who knows how economically viable it would be without VC money.
However, I think anyone who is still skeptical of the real efficacy is willfully ignorant. This is not a moral endorsement of how it was made or whether it is moral to use, but god damn, it is a game changer across vast domains.
There was a front page post just a couple of days ago where the article claimed LLMs have not improved in any way in over a year - an obviously absurd statement. A year before Opus 4.5, I couldn't get models to spit out a one-shot Tampermonkey script to add chapter turns to my arrow keys. Now I can one-shot small personal projects in Claude Code.
If you are saying that people are not making irrational and intellectually dishonest arguments about AI, I can't believe that we're reading the same articles and same comments.
Isn’t that the whole point of publishing? This happened plenty before AI too, and the claims are easily verified by checking the claimed hallucinations.
Don’t publish things that aren’t verified and you won’t have a problem, same as before but perhaps now it’s easier to verify, which is a good thing.
We see this problem in many areas, last week it was a criminal case where a made up law was referenced, luckily the judge knew to call it out.
We can’t just blindly trust things in this era, and calling it out is the only way to bring it up to the surface.
No, obviously not. You're confusing a marketing post by people with a product to sell with an actual review of the work by the relevant community, or even review by interested laypeople.
This is a marketing post where they provide no evidence that any of these are hallucinations beyond their own AI tool telling them so - and how do we know it isn't hallucinating? Are there hallucinations in there? Almost certainly. Would the authors deserve being called out by people reviewing their work? Sure.
But what people don't deserve is an unrelated VC funded tech company jumping in and claiming all of their errors are LLM hallucinations when they have no actual proof, painting them all a certain way so they can sell their product.
> Don’t publish things that aren’t verified and you won’t have a problem
If we were holding this company to the same standard, this blog wouldn't be posted either. They have not and can not verify their claims - they can't even say that their claims are based on their own investigations.
Most research is funded by someone with a product to sell, not all but a frightening amount of it. VC to sell, VC to review.
The burden of proof is always on the one publishing and it can be a very frustrating experience, but that is how it is, the one making the claim needs to defend themselves, from people (who can be a very big hit or miss) or machines alike. The good thing is that if this product is crap then it will quickly disappear.
That's still different from a bunch of researchers being specifically put in a negative light purely to sell a product. They weren't criticized so that they could do better, be it in their own error checking if it was a human-induced issue, or not relying on LLMs to do the work they should have been. They were put on blast to sell a product.
That's quite a bit different than a study being funded by someone with a product to sell.
Yup, and no matter how flimsy an anti-ai article is, it will skyrocket to the top of HN because of it. It makes sense though, HN users are the most likely to feel threatened by LLMs, and therefore are more likely to be anxious about them.
> Clearly, the authors in NeurIPS don't agree that using an LLM to help write is "plagiarism",
Or they didn't consider that it arguably fell within academia's definition of plagiarism.
Or they thought they could get away with it.
Why is someone behaving questionably the authority on whether that's OK?
> Nobody I know in real life, personally or at work, has expressed this belief. I have literally only ever encountered this anti-AI extremism (extremism in the non-pejorative sense) in places like reddit and here.
It's not "anti-AI extremism".
If no one you know has said, "Hey, wait a minute, if I'm copy&pasting this text I didn't write, and putting my name on it, without credit or attribution, isn't that like... no... what am I missing?" then maybe they are focused on other angles.
That doesn't mean that people who consider different angles than your friends do are "extremist".
They're only "extremist" in the way that anyone critical at all of 'crypto' was "extremist", to the bros pumping it. Not coincidentally, there's some overlap in bros between the two.
How is that relevant? Companies care very little about plagiarism, at least in the ethical sense (they do care if they think it's a legal risk, but that has turned out to not be the case with AI, so far at least).
What do you mean, how is that relevant? It's a vast majority opinion in society that using AI to help you write is fine. Calling it "plagiarism" is a tiny minority online opinion.
First of all, the very fact that companies need to encourage it shows that it is not already a majority opinion in society, it is a majority opinion among company management, which is often extremely unethical.
Secondly, even if it is true that it is a majority opinion in society doesn't mean it's right. Society at large often misunderstands how technology works and what risks it brings and what are its inevitable downstream effects. It was a majority opinion in society for decades or centuries that smoking is neutral to your health - that doesn't mean they were right.
> Secondly, even if it is true that it is a majority opinion in society doesn't mean it's right. Society at large often misunderstands how technology works and what risks it brings and what are its inevitable downstream effects. It was a majority opinion in society for decades or centuries that smoking is neutral to your health - that doesn't mean they were right.
That it's a majority opinion instead of a tiny minority opinion is a strong signal that it's more likely to be correct. For example, it's a majority opinion that murder is bad; this has held true for millennia.
Here's a simpler explanation: toaster frickers tend to seek out other toaster frickers online in niche communities. Occam's razor.
This seems like finding spelling errors and using them to cast the entire paper into doubt.
I am unconvinced that the particular error mentioned above is a hallucination, and even less convinced that it is a sign of some kind of rampant use of AI.
I hope to find better examples later in the comment section.
I actually believe it was an AI hallucination, but I agree with you that it seems the problem is far more concentrated to a few select papers (e.g., one paper made up more than 10% of the detected errors).
I can see that either way. It could also be a placeholder until the actual author list is inserted. This could happen if you know the title, but not the authors and insert a temporary reference entry.
The first Doe and Smith example I could grant that explanation (the title is real and the arXiv ID they give is "arXiv:2401.00001", which is definitely a placeholder), but the second one doesn't match any title and has a fake URL/DOI that doesn't actually go anywhere. There are a few that are unambiguously placeholders, but they really should have been caught in review for a conference this high up.
How does a "placeholder citation" even happens? Either enter the citation properly now, or do it properly later. What role does a "placeholder citation" serve, besides giving you something to forget about and fuck up?
I do not believe the placeholder citation theory at all.
Google Scholar and the vagaries of copy/paste errors have mangled BibTeX ever since it became a thing; a single citation with these sorts of errors may not even be AI, just “normal” mistakes.
Agreed, I don't find this evidence of AI. It often happens that authors change, there are multiple venues, or I'm using an old version of the paper. We also need to see the denominator: was this Google paper's one bad citation out of 20, or out of 60?
Also everyone I know has been relying on google scholar for 10+ years. Is that AI-ish? There are definitely errors on there. If you would extrapolate from citation issues to the content in the age of LLMs, were you doing so then as well?
It's the age-old debate about spelling/grammar issues in technical work. In my experience it rarely gets to the point that these errors eg from non-native speakers affect my interpretation. Others claim to infer shoddy content.
> However: we do not know if these are the only errors, they are merely a signature that the paper was submitted without being thoroughly checked for hallucinations
Given how stupidly tedious and error-prone citations are, I have no trouble believing that the citation error could be the only major problem with the paper, and that it's not a sign of low quality by itself. It would be another matter entirely if we were talking about something actually important to the ideas presented in the paper, but it isn't.
What I find more interesting is how easy these errors are to introduce and how unlikely they are to be caught. As you point out, a DOI checker would immediately flag this. But citation verification isn’t a first-class part of the submission or review workflow today.
We’re still treating citations as narrative text rather than verifiable objects. That implicit trust model worked when volumes were lower, but it doesn’t seem to scale anymore
There’s a project I’m working on at Duke University, where we are building a system that tries to address exactly this gap by making references and review labor explicit and machine verifiable at the infrastructure level. There’s a short explainer here that lays out what we mean, if useful context helps: https://liberata.info/
Citation checks are a workflow problem, not a model problem. Treat every reference as a dependency that must resolve and be reproducible. If the checker cannot fetch and validate it, it does not ship.
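To sketch what "resolve the dependency" could look like in practice, here is a minimal, illustrative checker against the public Crossref REST API (the /works/{doi} endpoint is real; the shape of the `citation` dict, the 0.9 similarity threshold, and the last-name matching are my assumptions, not anyone's production pipeline).

    # Sketch: treat a citation as a dependency that must resolve on Crossref.
    import requests
    from difflib import SequenceMatcher

    def check_citation(citation: dict) -> list[str]:
        """citation = {"doi": ..., "title": ..., "authors": [...]} (assumed shape)."""
        problems = []
        resp = requests.get(f"https://api.crossref.org/works/{citation['doi']}")
        if resp.status_code != 200:
            return [f"DOI does not resolve: {citation['doi']}"]
        record = resp.json()["message"]

        # Compare the claimed title against the registered one.
        registered_title = (record.get("title") or [""])[0]
        if SequenceMatcher(None, citation["title"].lower(),
                           registered_title.lower()).ratio() < 0.9:
            problems.append(f"title mismatch: '{registered_title}'")

        # Check that every claimed author appears on the registered record.
        registered_authors = {a.get("family", "").lower()
                              for a in record.get("author", [])}
        for name in citation["authors"]:
            if name.split()[-1].lower() not in registered_authors:
                problems.append(f"author not on record: {name}")
        return problems

A submission hook that runs something like this over the bibliography and blocks on any non-empty result would have flagged the swapped-author/wrong-venue case discussed upthread.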
The thing is, when you copy paste a bibliography entry from the publisher or from Google Scholar, the authors won't be wrong. In this case, it is. If I were to write a paper with AI, I would at least manage the bibliography by hand, conscious of hallucinations. The fact that the hallucination is in the bibliography is a pretty strong indicator that the paper was written entirely with AI.
Google Scholar provides imperfect citations - very often wrong article type (eg article versus conference paper), but up to and including missing authors, in my experience.
I've had the same experience. Also papers will often have multiple entries in Google Scholar, with small differences between them (enough that Scholar didn't merge them into one entry).
I'm not sure I agree... while I don't ever see myself writing papers with AI, I hate wrangling a bibtex bibliography.
I wouldn't trust today's GPT-5-with-web-search to turn a bullet point list of papers into proper citations without checking myself, but maybe I will trust GPT-X-plus-agent to do this.
Reference managers have existed for decades now and they work deterministically. I paid for one when writing my doctoral thesis because it would have been horrific to do by hand. Any of the major tools like Zotero or Mendeley (I used Papers) will export a bibtex file for you, and they will accept a RIS or similar format that most journals export.
This seems solvable today if you treat it as an architecture problem rather than relying on the model's weights. I'm using LangGraph to force function calls to Crossref or OpenAlex for a similar workflow. As long as you keep the flow rigid and only use the LLM for orchestration and formatting, the hallucinations pretty much disappear.
I see your point, but I don’t see where the author makes any claims about the specifics of the hallucinations, or their impact on the papers’ broader validity. Indeed, I would have found the removal of supposed “innocuous” examples to be far more deceptive than simply calling a spade a spade, and allowing the data to speak for itself.
The author calls the mistakes "confirmed hallucinations" without proof (just more or less evidence). The data never "speak for itself." The author curates the data and crafts a story about it. This story presented here is very suggestive (even using the term "hallucination" is suggestive). But calling it "100 suspected hallucinations", or "25 very likely hallucinations" does less for the author's end goal: selling their service.
BibTeX entries are often also incorrectly generated. E.g., Google Scholar sometimes puts the names of the editors instead of the authors into the BibTeX entry.
That's not happening for a similar reason people do not bug-check every single line of every single third-party library in their code. It's a chore that costs valuable time that you could instead spend on getting the actual stuff done. What's really important is that the scientific contribution is 100% correct and solid. For the references, the "good enough" paradigm applies. They mustn't be completely bogus, like the referenced work not existing at all, which would indicate that the authors didn't even look at the reference. But minor issues like typos or rare issues with wrong authors can happen.
To be honest, validating bibliographies does not cost valuable time. Every research group will have their own bibtex file to which every paper the group ever cited is added.
Typically when you add it you get the info from another paper or copy the bibtex entry from Google scholar, but it's really at most 10 minutes work, more likely 2-5. Every paper might have 5-10 new entries in the bibliography, so that's 1 hour or less of work?
I don't think the original comment was saying this isn't a problem but that flagging it as a hallucination from an LLM is a much more serious allegation. In this case, it also seems like it was done to market a paid product which makes the collateral damage less tolerable in my opinion.
> Papers should be carefully crafted, not churned out.
I think you can say the same thing for code and yet, even with code review, bugs slip by. People aren't perfect and problems happen. Trying to prevent 100% of problems is usually a bad cost/benefit trade-off.
What's the benefit to society of making sure that academics waste even more of their valuable hours verifying that Google Scholar did not include extraneous authors in some citation which is barely even relevant to their work? With search engines being as good as they are, it's not like we can't easily find that paper anyway.
The entire idea of super-detailed citations is itself quite outdated in my view. Sure, citing the work you rely on is important, but that could be done just as well via hyperlinks. It's not like anybody (exclusively) relies on printed versions any more.
You want the content of the paper to be carefully crafted. Bibtex entries are the sort of thing you want people to copy and paste from a trusted source, as they can be difficult to do consistently correctly.
The rate here (about 1% of papers) just doesn't seem that bad, especially if many of the errors are minor and don't affect the validity of the results. In other fields, over half of high-impact studies don't replicate.
There are people who just want to punish academics for the sake of punishing academics. Look at all the people downthread salivating over blacklisting or even criminally charging people who make errors like this with felony fraud. It's the perfect brew of anti-AI and anti-academia sentiment.
Also, in my field (economics), by far the biggest source of finding old papers invalid (or less valid, most papers state multiple results) is good old fashioned coding bugs. I'd like to see the software engineers on this site say with a straight face that writing bugs should lead to jail time.
And research codebases (in AI and otherwise) are usually of extremely bad quality. It's usually a bunch of extremely poorly-written scripts, with no indication which order to run them in, how inputs and outputs should flow between them, and which specific files the scripts were run on to calculate the statistics presented in the paper.
If there were real consequences, we wouldn't be forced to churn out buggy nonsense by our employers. So we'd be able to take the time to do the right thing. Bug-free software is possible; the world just says it's not worth it today.
Getting all possible software correct is impossible, clearly. Getting all the software you release correct is more possible, because you can choose not to release the software that is too hard to prove correct.
Not that the suggestion is practical or likely, but your assertion that it is impossible is incorrect.
If you want to be pedantic I’m pretty sure every single general purpose OS (and thus also the programs running under it) falls into the category of not provably correct so it’s a distinction without a difference in real life.
> Between 2020 and 2025, submissions to NeurIPS increased more than 220% from 9,467 to 21,575. In response, organizers have had to recruit ever greater numbers of reviewers, resulting in issues of oversight, expertise alignment, negligence, and even fraud.
I don’t think the point being made is “errors didn’t happen pre-GPT”, rather the tasks of detecting errors have become increasingly difficult because of the associated effects of GPT.
> rather the tasks of detecting errors have become increasingly difficult because of the associated effects of GPT.
Did the increase to submissions to NeurIPS from 2020 to 2025 happen because ChatGPT came out in November of 2022? Or was AI getting hotter and hotter during this period, thereby naturally increasing submissions to ... an AI conference?
I was an area chair on the NeurIPS program committee in 1997. I just looked and it seems that we had 1280 submissions. At that time, we were ultimately capped by the book size that MIT Press was willing to put out - 150 8-page articles. Back in 1997 we were all pretty sure we were on to something big.
I'm sure people made mistakes on their bibliographies at that time as well!
And did we all really dig up and read Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller (1953)?
I cited Watson and Crick '53 in my PhD thesis and I did go dig it up and read it.
I had to go to the basement of the library, use some sort of weird rotating knob to move a heavy stack of journals over, find some large bound book of the year's journals, and navigate to the paper. When I got to the page, it had been cut out by somebody previous and replaced with a photocopied version.
(I also invested a HUGE amount of my time into my bibliography in every paper I've written as first author, curating a database and writing scripts to format in the various journal formats. This involved multiple independent checks from several sources, repeated several times.)
Totally! If you haven't burrowed in the stacks as a grad student, you missed out.
The real challenges there aren't the "biggies" above, though, it's the ones in obscure journals you have to get copies of by inter-library agreements. My PhD was in applied probability and I was always happy if there were enough equations so that I could parse out the French or Russian-language explanation nearby.
> And did we all really dig up and read Metropolis, Rosenbluth, Rosenbluth, Teller, and Teller (1953)?
If you didn't, you are lying. Full stop.
If you cite something, yes, I expect that you, at least, went back and read the original citation.
The whole damn point of a citation is to provide a link for the reader. If you didn't find it worth the minimal amount of time to go read, then why would your reader? And why did you inflict it on them?
I meant this more as a rueful acknowledgment of an academic truism - not all citations are read by those citing. But I have touched a nerve, so let me explain at least part of the nuance I see here.
In mathematics/applied math consider cited papers claimed to establish a certain result, but where that was not quite what was shown. Or, there is in effect no earthly way to verify that it does.
Or even: the community agrees it was shown there, but perhaps has lost intimate contact with the details — I’m thinking about things like Laplace’s CLT (published in French), or the original form of the Glivenko-Cantelli theorem (published in Italian). These citations happen a lot, and we should not pretend otherwise.
Here’s the example that crystallized that for me. “VC dimension” is a much-cited combinatorial concept/lemma. It’s typical for a very hard paper of Saharon Shelah (https://projecteuclid.org/journalArticle/Download?urlId=pjm%...) to be cited, along with an easier paper of Norbert Sauer. There are currently 800 citations of Shelah’s paper.
I read a monograph by noted mathematician David Pollard covering this work. Pollard, no stranger to doing the hard work, wrote (probably in an endnote) that Shelah’s paper was often cited, but he could not verify that it established the result at all. I was charmed by the candor.
This was the first acknowledgement I had seen that something was fishy with all those citations.
By this time, I had probably seen Shelah’s paper cited 50 times. Let’s just say that there is no way all 50 of those citing authors (now grown to 800) were working their way through a dense paper on transfinite cardinals to verify this had anything to do with VC dimension.
Of course, people were wanting to give credit. So their intentions were perhaps generous. But in no meaningful sense had they “read” this paper.
So I guess the short answer to your question is: citations serve more uses than telling readers to literally read the cited work, and by extension, should not always be taken to mean that the cited work was indeed read.
As with anything, it is about trusting your tools. Who is culpable for such errors? In the days of human authors, the person writing the text is responsible for not making these errors. When AI does the writing, the person whose name is on the paper should still be responsible—but do they know that? Do they realize the responsibility they are shouldering when they use these AI tools? I think many times they do not; we implicitly trust the outputs of these tools, and the dangers of that are not made clear.
I'm not going to bat for GPTZero, but I think it's clearly possible to identify some AI-written prose. Scroll through LinkedIn or Twitter replies and there are clear giveaways in tone, phrasing and repeated structures (it's not just X it's Y).
Not to say that you could ever feasibly detect all AI-generated text, but if it's possible for people to develop a sense for the tropes of LLM content then there's no reason you couldn't detect it algorithmically.
> there's no reason you couldn't detect it algorithmically
For any real world classifier there is a precision/recall tradeoff. Do you care more about false positives or false negatives? If you choose to truly minimize false positives you should simply always predict negative.
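A toy, purely illustrative sweep makes the tradeoff concrete (the scores and labels below are invented, standing in for a hypothetical "AI-written?" classifier): pushing the threshold high enough that nothing is flagged gives perfect precision and zero recall.

    # Toy precision/recall sweep for an imaginary "AI-written?" classifier.
    def precision_recall(scores, labels, threshold):
        preds = [s >= threshold for s in scores]
        tp = sum(p and y for p, y in zip(preds, labels))
        fp = sum(p and not y for p, y in zip(preds, labels))
        fn = sum((not p) and y for p, y in zip(preds, labels))
        precision = tp / (tp + fp) if (tp + fp) else 1.0  # no positive predictions
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        return precision, recall

    scores = [0.1, 0.4, 0.55, 0.8, 0.9, 0.95]
    labels = [0,   0,   1,    0,   1,   1]      # 1 = actually AI-written
    for t in (0.5, 0.85, 1.01):                 # 1.01 ~ "always predict negative"
        print(t, precision_recall(scores, labels, t))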
For your example “it’s not just X it’s Y” I agree it’s a red flag. But the origin of the pattern is from human text which the LLM picked up on. So some people did (and likely still do) use that construction.
Sorry, but blaming it on "AI autocomplete" is the dumbest excuse ever. Author lists come from BibTeX entries, and while they often contain errors since they can come from many sources, they do not contain completely made-up authors. I don't share your view that hallucinated citations are less damaging in the background section. Background, related works, and introduction are the sections where citations most often show up. These sections are meant to be read, and generating them with AI is plain cheating.
Trust is damaged. I cannot verify that the evidence is correct only that the conclusions follow from the evidence. I have to rely on the authors to truthfully present their evidence. If they for whatever reason add hallucinated citations to their background that trust is 100% gone.
They are not harmless. These hallucinated references are ingested by Google Scholar, Scopus, etc., and with enough time they will poison those wells. It is also plain academic malpractice, no matter how "minor" the reference is.
If the mistake is one error of author and venue in a citation, I find it fairly disingenuous to call that a hallucination. At least, it doesn't meet the threshold for me.
I have seen this kind of mistake made long before LLMs were even a thing. We used to call them just that: mistakes.
:) I actually printed a lot, so the unit price was cheap and I could sell them for $5. I sold them until I recouped the printing cost and donated the rest to schools.
I am thinking of doing a reprint, but tbh shipping is so expensive now, and there are also the USA's tariffs, etc.
I would pay triple to not have to worry about how to print and box these all neatly as you did. Please take my money for this and all your other games.
Make a social app whose goal is to get people off their phone as quickly as possible. There used to be a slew of apps where you press a button to indicate "I'm bored/free, who wants to hang out?" and then you get matched with your contact list and anyone else who pressed the button at the same time. But for whatever reason they all flamed out and died.
I think the solution is for the app to advertise public social events where people can make connections and exchange contact information in person.
Allowing random people to message each other without meeting in person is a mistake. The nonverbal cues people get from in-person interactions are helpful for discovering shared interests and personality compatibility.
One issue is that if you leave it to the free market, you will just get more issues. For monopolies, it's more profitable to keep someone unhealthy (and depressed) than to make someone healthy once and forever.
Privacy reasons. You don’t want a VC funded startup to know your contact list and your location (because hanging out in real life requires physical proximity).
It starts from the premise that the author finds LLMs are good for limited, simple tasks with small contexts and clearly defined guidelines, and specifically not good for vibe-coding.
And the author literally mentions that they aren't making universal claims about LLMs, but just speaking from personal experience.
Huh? Where did I say that's what I like? I'm just trying to discuss for discussion's sake. Personally, I want a world that rewards the people who put their thought, care, and craftsmanship into something more than those that don't. In order to live in that world, I think we need to discuss the parts (maybe the whole) that don't and why that might be.
How would one set this sort of test up? I surely have example domains where LLMs routinely do poorly (for example, custom bazel rules and workspaces), but what would constitute a "showcase" here?
To change my mind I'll be satisfied with a thorough description of the domain and ideally a theory on why it does poorly in that domain. But we're not talking LLMs here, we're talking Opus 4.5 specifically.
A theory besides... not enough training data? Is it even possible to formulate a coherent theory about this? I'm talking about customizing a widely-used build system, not exactly state-of-the-art cryptography. What could I possibly say that you wouldn't counter with "skill issue" (which goes back to the author's point)?
If you say it's demonstrable that someone can't be made more productive with Opus 4.5, then it should probably be up to you to demonstrate that impossibility.
The author claims it's not just that one evangelizes it, but that they become hostile when someone claims to not have the same experience in response. I don't recall either Willison or Antirez scaring people by saying they will be left behind or that they are just afraid of becoming irrelevant. Instead they just talk about their positive experiences using it. Willison and Antirez seem to be fine to live and let live (maybe Antirez a bit less, but they're still not mean about it).
My gut says that is not a property of LLM evangelists, but a property of current internet culture in general. People with strong, divisive, and engaging opinions seem to do well (by some definition of well) online.
It's weird how some people seem to treat using an LLM as part of their personality in a borderline cult like way. So someone saying they don't use it or don't find it useful triggers an anger response in them.
That is not novel - see language/framework choice, OS (or even distro) preferences, editor wars, indentation. People develop strong opinions about tools, technology, and techniques regardless of domain. LLM maximalists just have the unfortunate capability to generate infinite content about their specific shiny thing.
This. For every absurd LLM cheerleader, there’s a corresponding LLM minimalist who trots out the “stochastic parrot” line at every possible occasion along with the fact that they do CrossFit and don’t own a TV.
I think the actual problem is everyone tries to assert how capable or not coding agents currently are, but how useful they are depends so much on what you are trying to get them to do and also on your communication with the model. And often it's hard to tell whether you're just prompting it wrong or if they're incapable of doing it.
By now we at least agree that stochastical parrots can be useful.
It would be nice if the debate now was less polarized so we could focus on what makes them work better for some and worse for others other than just expectations.
Who knows. Maybe all the AI people will have their skills atrophy, and by the time the AI crash happens and none of the models can be run, they won't be employable any more. I'm happy to take that gamble if it saves my conscience.
And yeah, as I laid out in the article (that of course, very few people actually read, even though it was short...), I really don't mind how people make code. It's those that try so hard to convince the rest of us I find very suspect.
In my case I don’t even mind if these evangelists try so hard to convince other developers. What I do mind is that they seem to be quite successful in convincing our bosses. So we get things like mandatory LLM usage, minimum number of Claude API calls per day, every commit must be co-authored by Claude, etc.
I do wish I could have a good, in-depth tutorial on how to set this up myself. Along with (pipe dream) an explanation of how it would interact with my local utility. I worry that due to some silly technicality, I won't be able to export to my local utility, or else I won't be able to run off-grid when there's an outage.
I will do a write-up in a couple of days. It's all relatively simple, you just have to expect terrible documentation and do a bit of reverse engineering and serial sniffing. I expected the battery to be complicated, but it turned out that the inverter was.
You'll encounter stuff like: manual says use RS485 port on Battery for GroWatt inverter → need to use CAN port on Battery. Meter Port (RS485 [serial] over RJ45) wiring on GroWatt is unknown (A: white orange / B: white blue, cross them over). Dinky RS485 serial → USB converter needs a 120ohm resistor between pins for line termination. Growatt meter port expects a SDM630 meter, not a DTSU666 (hardcoded), so vibe code another emulator. DIP switches for RS232 connection need to be both on the ON position (undocumented). CH340 USB→serial converter for RS232 does not work, but one with a Prolific chip does. Etc. etc. etc :)
Oh, and the biggest one... I was expecting to be able to just send a command, 'charge at 500watts', now... 'discharge at 2000watts'. But no. You have to emulate a power meter and the inverter will try to bring the net power to 0. Fun! :)
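A rough sketch of that control trick, with the caveat that the sign convention and names below are my assumptions, not GroWatt's documented behavior: since the inverter regulates the *reported* meter power toward zero, you steer it by adding an offset to the real reading.

    # Sketch of steering a zero-export control loop by offsetting the emulated
    # meter reading. Assumed convention: positive watts = importing from the grid.
    # Reporting (real - 500) makes the loop settle where the house really imports
    # ~500 W more than it otherwise would, i.e. ~500 W of extra battery charging,
    # while the emulated meter reads 0.
    def emulated_meter_watts(real_grid_watts: float, extra_charge_watts: float) -> float:
        return real_grid_watts - extra_charge_watts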