
People are resistant because:

1. There's this huge misconception that LLMs are literally just memorizing stuff and repeating patterns from their training data

2. People glamorize math and feel like advancements in it would "be AGI"

They don't realize that having it generate "new math" is not much harder than having it generate "new programs." Instead of writing something in Python, it's writing something in Lean.
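To make that parallel concrete, here is a toy sketch (my own illustration, not anything from a real system): a "new" result in Lean is just a file the proof checker accepts, the same way a new program in Python is a file the interpreter accepts.

```lean
-- Toy example only: a trivial statement and proof, standing in for the kind
-- of artifact an LLM emits when it "writes new math" in Lean rather than
-- "new programs" in Python. The checker accepting it plays the role the
-- interpreter plays for a Python script.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```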



> 1. There's this huge misconception that LLMs are literally just memorizing stuff and repeating patterns from their training data

So then, what are they doing?

I'm seeing people creating full apps with GPT-5-pro, but nothing is novel.

Just discussed the "impressiveness" of it creating a Game Boy emulator from scratch.

(There are over 3,500 Game Boy emulators on GitHub. I would be surprised if it failed to produce a solution with that much training data.)

Where are the novel breakthroughs?

As it stands today, I'm sure it can produce a new SSL implementation or whatever else it has been trained on, but to what benefit?


>1. There's this huge misconception that LLMs are literally just memorizing stuff and repeating patterns from their training data

For a lay person, what are they actually doing instead?


They can learn to generalize patterns during training and develop some model of the world. So for example, if you were to train an LLM on chess games, it would likely develop an internal model of the chess board. Then when someone plays chess with it and gives a move like Nf3, it can use that internal model to help it reason about its next move.

Or if you ask it, "what is the capital of the state that has the city Dallas?", it understands the relations and can internally reason through the two step process of Dallas is in Texas -> the capital of Texas is Austin. A simple n-gram model may occasionally get questions like that right by a lucky guess (though usually not) while we can see experimentally the LLM is actually applying the proper reasoning to the question.
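As a toy contrast (a sketch of my own, not the commenter's experiment), this is roughly all a bare n-gram model amounts to: counting which word followed which. Nothing in it can chain "Dallas is in Texas" with "the capital of Texas is Austin" into a two-step answer.

```python
# Minimal bigram "language model" over a made-up toy corpus: it can only
# echo surface co-occurrence, so multi-hop questions are out of reach.
from collections import Counter, defaultdict

corpus = [
    "dallas is in texas",
    "the capital of texas is austin",
]

follows = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for a, b in zip(words, words[1:]):
        follows[a][b] += 1

def predict_next(word):
    """Return the word most often seen right after `word`, or None."""
    counts = follows.get(word)
    return counts.most_common(1)[0][0] if counts else None

print(predict_next("texas"))  # 'is' -- whatever happened to follow "texas"
```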

You can say this is all just advanced applications of memorizing and predicting patterns, but you would have to use a broad definition of "predicting patterns" that would likely include human learning. People who declare LLMs are just glorified auto-complete are usually trying to imply they are unable to "truly" reason at all.


I don't think anyone really knows, but I also don't think it's quite an either/or. To me a more interesting way to put the question is to ask what it would mean to say that GPT-5 is just applying patterns from its training data when it finds bugs in 1000 lines of new Rust code that were missed by multiple human reviewers. "Applying a memorized pattern" seems well-defined because it is an everyday concept but I don't think it really is well-defined. If the bug "fits a pattern" but is expressed in a different programming language, with different variable names, different context, etc., recognizing that and applying the pattern doesn't seem to me like a merely mechanical process.

Kant has an argument in the Critique of Pure Reason that reason cannot be reducible to the application of rules, because in order to apply rule A to a situation, you would need a rule B to follow for applying rule A, and a rule C for applying rule B, and this is an infinite regress. I think the same is true here: any reasonable characterization of "applying a pattern" that would succeed at reducing what LLMs do to something mechanical is vulnerable to the regress argument.

In short: even if you want to say it's pattern matching, retrieving a pattern and applying it requires something a lot closer to intelligence than the phrase makes it sound.


First: while it's not technically incorrect to say that they're learning "patterns" in the training data, the word "pattern" here is extremely deep and hides a ton of detail. These aren't simple n-grams like "if the last N tokens were ___, then ___ follows." To generate fluent conversation, new code, or poetry, the model must learn highly abstract structures that start to resemble reasoning, inference, and world-modeling. You can't predict tokens well without starting to build these higher-level capabilities on some level.

Second: Generative AI is about approximating an unknown data distribution. Every dataset - text, images, video - is treated as a sample from such a distribution. Success depends entirely on the model's ability to generalize outside the training set. For example, "This Person Does Not Exist" (https://this-person-does-not-exist.com/en) was trained on a data set of 1024x1024 RGB images. Each image can be thought of as a vector in a 1024x1024x3 = 3145728-dimensional space, and since all coefficients are in [0,1], these vectors all lie inside a 3145728-dimensional hypercube. But almost all points in that hypercube are random noise that doesn't look like a person. The ones that do will be on a lower-dimensional manifold embedded in the hypercube. The goal of these models is to infer this manifold from the training data and generate a random point on it.
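A minimal sketch of the "almost all of the hypercube is noise" point, assuming numpy and Pillow are installed (my illustration, not the model's actual code): sample one uniformly random point in that 3145728-dimensional cube and look at it as an image.

```python
import numpy as np
from PIL import Image

# One uniformly random point in [0,1]^(1024*1024*3), viewed as an RGB image.
# The result is static, not a face: real images live on a tiny,
# lower-dimensional region of this space.
rng = np.random.default_rng(0)
point = rng.random((1024, 1024, 3))
Image.fromarray((point * 255).astype(np.uint8)).save("random_point.png")
```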

Third: Models do what they're trained to do. Next-token prediction is one of those things, but not the whole story. A model that literally did just memorize exact fragments would not be able to zero-shot new code examples at all; in that case the transformer would have learned some nonlinear transformation that is only good at repeating exact fragments. Instead, a ton of training time goes into getting it to generalize to new things, and it learns whatever other nonlinear transformation makes it good at doing that instead.


The definition of a language model is literally a probability distribution over the next token given the preceding text. When OP says "memorizing patterns and repeating stuff", it's a strawman of a basic n-gram model; obviously modern language models are more advanced because we use techniques like vector tokenization, but at its core it's still just probability that's limited to the corpus it was trained on.

Or, at its core: if you give it a question it's never seen, it will give you whatever the most likely reply is. But that doesn't mean there is an internal world model or anything; it ultimately comes down to whether you think language is sufficient to model reality, which I think it probably is not. It would obviously be very convincing, but not necessarily correct.


This isn't true at all. LLMs absolutely do build world models, and researchers have shown this many times on smaller language models.

> techniques like vector tokenization

(I assume you're talking about the input embedding.) This is really not an important part of what gives LLMs their power. The core is that you have a large-scale artificial neural net. This is very different from an n-gram model and is probably capable of figuring out anything a human can figure out, given sufficient scale and the right weights. We don't have that yet in practice, but it's not due to a theoretical limitation of ANNs.

> probability distribution of the most likely next token given a preceding text.

What you're talking about is an autoregressive model. That's more of an implementation detail. There are other kinds of LLMs.

I think talking about how it's just predicting the next token is misleading. It's implying it's not reasoning, not world-modeling, or is somehow limited. Reasoning is predicting, and predicting well requires world-modeling.


>This is really not an important part of what gives LLMs their power. The core is that you have a large scale artificial neural net.

What separates transformers from LSTMs is their ability to process the entire input in parallel rather than in sequence, and the inclusion of the more efficient "attention" mechanism that lets them pick up long-range dependencies across a language. We don't actually understand the full nature of the latter, but I suspect it is the basis behind the more "intelligent" actions of the LLM. There's quite a general range of problems that long-range dependencies can encompass, but that's still ultimately limited by language itself.

But if you're saying this is fundamentally a probability-distribution model, I stand by that, because that's literally the mathematical model (a softmax over the output logits) being used in transformers here. It very much is generating a probability distribution over the vocabulary and just picking the highest-probability token (or using beam search) as the next output.
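A minimal sketch of that last step, with made-up logits rather than a real model's output: one logit per vocabulary entry, a softmax to turn them into a distribution, and a greedy argmax pick.

```python
import numpy as np

vocab = ["the", "cat", "sat", "austin", "texas"]
logits = np.array([2.0, 0.5, -1.0, 3.2, 1.1])  # hypothetical model output

probs = np.exp(logits - logits.max())          # numerically stable softmax
probs /= probs.sum()

next_token = vocab[int(np.argmax(probs))]      # greedy pick; beam search would
print(next_token, probs.round(3))              # instead keep several candidates
```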

>The LLMs absolutely world model and researchers have shown this many times on smaller language models.

We don't have a formal semantic definition of a "world model". I would take a lot of what these researchers are writing with a grain of salt, because something like that crosses more into philosophy (especially the limits of language and logic) than the hard engineering these researchers are trained in.


This question becomes difficult whenever a system becomes sufficiently complex. Take any chaotic system, like a double pendulum, and press play at step 100,000. You ask 'what is it doing?' Well, it's just applying its rule, step to step.

Zoom out and look at its trajectory over those 100,000 steps and ask again.

The answer is something alien. Probabilistically it is certain the description of its behavior is not going to exist in a space we as humans can understand. Maybe if we were god beings we could say 'No no, you see the behavior of the double pendulum isn't seemingly random, you just have to look at it like this'. Encryption is a decent analogy here.
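A minimal sketch of that point, using the logistic map instead of a double pendulum so the whole rule fits on one line (same chaotic-system idea, my own illustration):

```python
# Each step is trivially "just applying its rule", yet after 100,000 steps
# two nearly identical starting points end up completely uncorrelated.
def run(x, steps=100_000, r=3.99):
    for _ in range(steps):
        x = r * x * (1 - x)  # the entire rule, applied step to step
    return x

print(run(0.4))          # some value in (0, 1)
print(run(0.4 + 1e-12))  # a thoroughly different value in (0, 1)
```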

We're fooled into thinking we can understand these systems because we forced them to speak English. Under the hood is a different story.


1) They absolutely do sometimes repeat training data verbatim.[0]

2) That's not even the point. The point is being trained on stolen data without permission, pretending that the resulting model of the training data is not a derived work of the training data, and that the output of the model plus a prompt is not a derived work of the training data.

Point 1 is just an extreme edge case which is a symptom of point 2 and yet people still have trouble accepting it.

The GPL was about user freedom, and now, if "derived work" no longer applies as long as you run code through a sufficiently complex plagiarism automator, plagiarism is unprovable and the GPL is broken. Great, we lost another freedom.

[0]: I recall a study or court document with 100 examples of plagiarising multiple whole paragraphs from the New York Times, don't have time to look for it now


> I recall a study or court document with 100 examples of plagiarising multiple whole paragraphs from the New York Times, don't have time to look for it now

Convenient. Well then, I recall two studies that said the opposite. Unfortunately pressed for time as well.


https://en.lmgtfy2.com/query/?q=ONE+HUNDRED+EXAMPLES+OF+GPT-...

You didn't have to be rudely dismissive and lie, you chose to.

I would happily respond politely to a polite request.

Please be mindful of your behavior next time.

---

Link for everyone else: https://nytco-assets.nytimes.com/2023/12/Lawsuit-Document-dk...


Not very convincing. If you prompt GPT-4 (nobody uses it) with a huge chunk of an article (nobody does this), sometimes it'll output another chunk of said article. Conveniently omitted: how many attempts did not result in this behavior, and how much of the articles was not repeated (you can see they cut off mid-answer).


> trained on stolen data without permission

My sympathies to academic publishers ;)


This all seems totally orthogonal to the statement: "I don't get why people are so resistant to the idea that AI can prove new mathematical theorems."

I don't necessarily disagree about the copyleft stuff.

Transformers do sometimes overfit to exact token sequences from training data, but that isn't really what the architecture does in general.


When you say new mathematical theorems, they absolutely can. So can infinite monkeys on typewriters, though LLMs have a much better heuristic for arriving at valid theorems.

The same applies to valid new programs.

The issue I have with this is pretending that the word "new" is sufficient justification for giving all the credit/attribution and subsequent reward (reputational, financial, etc.) to the person who wrote the prompt instead of distributing it to the people in the whole chain of work according to how much work and what quality of work they did.

How many man-hours did it take to create the training data? How many to create the LLM training algorithm and the electricity to run it? How many to write the prompts?

The most work by many, many orders of magnitude was put in by the first group. They often did it with altruistic goals in mind and released their work under permissive or copyleft licenses.

And now somebody found a way to monetize this effort without giving them anything in return. In fact, they will have to pay to access the LLMs which are based on their own work.

Copyright or plagiarism are perhaps the wrong terms to use when talking about it. I think copyright should absolutely apply but it was designed to protect creative works, not code in the first place.

Either way it's a form of industrialized exploitation and we should use all available tools to defend against it.


You're completely correct in your two points, however people _do_ regularly assert that LLMs cannot possibly generate anything novel: "they are just regurgitating and recombining the original".

I mean, sure. But so am I (in what is likely a far more advanced manner, but still). I also find it somewhat funny that I am also partially trained on stolen data without permission. I also jaywalk occasionally (perhaps I am trivializing the topic too much, but show me a researcher who hasn't _once_ downloaded a paper they really needed, in less than perfectly legal ways).


Human time is valuable, LLM time is not. If you spend hundreds of hours creating something, nobody should have the right to copy it (verbatim or with automatic modifications) unless you allow them.

Human rights are valuable. LLMs allow laundering GPL code (removing both attribution and users' rights to inspect and modify the code). Free software cannot compete against proprietary in a world where making a copy is trivial but proving it's a copy is nearly impossible.



