In AI, hallucinations occur when an LLM produces outputs that are factually incorrect, nonsensical, or completely unrelated to the task it was assigned. Hallucinations have long dogged LLMs because they degrade their usefulness and trustworthiness and have proven vexingly difficult to predict and remedy. In a paper scheduled to be presented at the 2025 USENIX Security Symposium, they have dubbed the phenomenon “package hallucination.”
For the study, the researchers ran 30 tests, 16 in the Python programming language and 14 in JavaScript, that generated 19,200 code samples per test, for a total of 576,000 code samples. Of the 2.23 million package references contained in those samples, 440,445, or 19.7 percent, pointed to packages that didn’t exist. Among these 440,445 package hallucinations, 205,474 had unique package names.
One of the things that makes package hallucinations potentially useful in supply-chain attacks is that 43 percent of package hallucinations were repeated over 10 queries. “In addition,” the researchers wrote, “58 percent of the time, a hallucinated package is repeated more than once in 10 iterations, which shows that the majority of hallucinations are not simply random errors, but a repeatable phenomenon that persists across multiple iterations. This is significant because a persistent hallucination is more valuable for malicious actors looking to exploit this vulnerability and makes the hallucination attack vector a more viable threat.”
In other words, many package hallucinations aren’t random one-off errors. Rather, specific names of non-existent packages are repeated over and over. Attackers could seize on the pattern by identifying nonexistent packages that are repeatedly hallucinated. The attackers would then publish malware using those names and wait for them to be accessed by large numbers of developers.
The study uncovered disparities in the LLMs and programming languages that produced the most package hallucinations. The average percentage of package hallucinations produced by open source LLMs such as CodeLlama and DeepSeek was nearly 22 percent, compared with a little more than 5 percent by commercial models. Code written in Python resulted in fewer hallucinations than JavaScript code, with an average of almost 16 percent compared with a little over 21 percent for JavaScript. Asked what caused the differences, Spracklen wrote:
0 تعليق