Nebula 101 Creation

How is a Large Language Model created?
Creating an LLM consists of several stages:

  1. Data acquisition
  2. Training
  3. Post-Training
  4. Adding tools

DATA ACQUISITION – Read the Internet
The first step is to get as much data as possible. For this, the Internet is crawled based on a curated list that excludes websites deemed unsuitable for the model to learn from. Every developer of an LLM makes their own choices here and thereby determines what the base knowledge of the LLM will be!
Using various tools, the content of the web pages (about 2.7 billion pages as of April 2024) is stripped of everything but the raw text, resulting in roughly 44 TB of data.
Take a look at the datasets on Hugging Face to get an idea of what that looks like.
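
To get a feel for the text-extraction step, here is a minimal sketch (in Python, using only the standard library) of how a web page could be stripped down to its raw text. The example page is made up; real pipelines use far more elaborate extraction, filtering and deduplication.

from html.parser import HTMLParser

# Collect only the visible text of a page, dropping all HTML markup.
class TextExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

page = "<html><body><h1>Hello</h1><p>Some text to learn from.</p></body></html>"
extractor = TextExtractor()
extractor.feed(page)
print(" ".join(extractor.chunks))  # Hello Some text to learn from.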

Tokenization – Convert the data
The next step is to transform this text dataset into something a computer program can handle more easily: numbers. For AI model training this is done through a process called tokenization, which assigns unique numbers to words and parts of words. To get an idea of how this works, take a look at TikTokenizer and type in some text.
In the end, all the text from the Internet is stored in a dataset in the form of tokens.
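
If you prefer to see tokenization in code rather than in the browser, here is a minimal sketch using the tiktoken library (assuming it is installed via pip install tiktoken; other models use other tokenizers).

import tiktoken

# cl100k_base is one of the publicly available encodings used by OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("What is 2+2?")
print(tokens)               # a list of integer token ids
print(enc.decode(tokens))   # back to the original text: "What is 2+2?"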

TRAINING – Neural network training
Now the real training starts. This is the step that takes many months and is the most expensive (the compute for GPT-4 cost about $150M).
The training uses the tokens from the dataset, predicts the next token (word) and slowly adjusts the probabilities of which token is most likely to come next, based on the text in the initial dataset. This learning process results in a large set of parameters (billions of them) that determine the most likely next token (or word) for any given input. These parameters are the numbers you often see in a model's name, like gpt-oss-20b: the '20b' stands for 20 billion parameters. Basically, the more parameters a model has, the better it will be at predicting the next token, and the better it will be at answering your questions.
The process of calculating the parameters is unique to each developer of an LLM, although the basic ideas are the same (for a nice but complex visualization of how this process works, take a look here).
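
To make the idea of "predict the next token and slowly adjust the parameters" concrete, here is a heavily simplified sketch in Python/PyTorch. The tiny model, the random toy data and all the sizes are invented for illustration; a real LLM is a transformer with billions of parameters trained on the tokenized Internet dataset.

import torch
import torch.nn as nn

vocab_size, embed_dim, context_len = 1000, 64, 8

# Toy "dataset": random token ids standing in for tokenized Internet text.
tokens = torch.randint(0, vocab_size, (256, context_len + 1))

# A tiny stand-in for an LLM: embed the context tokens and predict the next token.
model = nn.Sequential(
    nn.Embedding(vocab_size, embed_dim),
    nn.Flatten(),
    nn.Linear(embed_dim * context_len, vocab_size),
)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):
    inputs, targets = tokens[:, :-1], tokens[:, -1]   # context -> next token
    logits = model(inputs)                            # scores for every possible next token
    loss = loss_fn(logits, targets)                   # how wrong was the prediction?
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                  # slowly adjust the parameters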

At this point the model is not yet suited for user interaction. Posing the same simple question, like “What is 2+2?”, multiple times will generate very different and often nonsensical answers. [With some clever prompting it is possible for the model to ‘recognize’ patterns in the input and reuse them in the output.]

At this point we have what is called the BASE model of the LLM. It simulates the Internet documents it has learned from. To see an example, look here (note: you need to register, which is free).
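
One reason the base model's answers vary so much is that the next token is sampled from a probability distribution rather than chosen deterministically. A tiny illustration in Python (the probabilities below are invented):

import random

# Imaginary next-token probabilities after the prompt "What is 2+2?"
next_token_probs = {"4": 0.4, "5": 0.2, "four": 0.2, "the": 0.2}

for _ in range(5):
    tokens, weights = zip(*next_token_probs.items())
    print(random.choices(tokens, weights=weights)[0])  # can differ on every run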

POST-TRAINING
Now we need to teach the model how to behave as a digital assistant: holding conversations in human language and solving problems. This training adjusts the parameters of the model so that its statistical predictions result in the behavior we want.
Modern LLMs are post-trained by:

  1. Conversations
  2. Practicing on their own
  3. Practicing with human feedback

Conversations: Supervised Fine Tuning
This uses many thousands of examples of fictional interactions, created by humans and supplied to the model. These can take the form of conversations like:

Human: What is 2+2?
Assistant: 2+2=4
Human: What if it was *?
Assistant: 2*2=4, that is the same as 2+2!

For more examples, take a look here: Open Source conversation examples.
Nowadays other AI models can generate a lot of these training conversations.
The conversations tweak the parameters of the LLM so that it learns to respond the way a human would.
This training can also include some hard-coded answers for topics the developers do not want the assistant to hallucinate about.
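
As a rough idea of what such a training example looks like to the model, here is a minimal sketch that serializes the conversation above into a single training string. The role markers below are made up; every model family defines its own chat template and special tokens.

conversation = [
    {"role": "user", "content": "What is 2+2?"},
    {"role": "assistant", "content": "2+2=4"},
    {"role": "user", "content": "What if it was *?"},
    {"role": "assistant", "content": "2*2=4, that is the same as 2+2!"},
]

def to_training_text(messages):
    # Wrap every turn in role markers so the model learns who said what.
    return "".join(f"<|{m['role']}|>{m['content']}<|end|>" for m in messages)

print(to_training_text(conversation))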

Practicing: Reinforcement Learning
After learning basic, fixed ways of solving problems through Supervised Fine Tuning, the model now gets new problems to solve. The LLM will process each given problem many times. When an attempt leads to the correct answer, that way of solving is given a higher probability. This way the correct methods are reinforced, hence the name of this stage: Reinforcement Learning (RL).
The result is again a tweaked set of parameters that allows the model to solve problems in many different ways, sometimes in ways humans have never come up with.
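
The following toy sketch illustrates the reinforcement idea in a few lines of Python. The "policy" is just a dictionary of made-up solving strategies, purely to show how attempts that reach the correct answer get their probability nudged upward; real RL for LLMs adjusts billions of parameters instead.

import random

policy = {"strategy_a": 0.5, "strategy_b": 0.5}   # hypothetical ways of solving a problem
correct_strategies = {"strategy_b"}               # pretend only this one gives the right answer

for attempt in range(100):
    strategy = random.choices(list(policy), weights=list(policy.values()))[0]
    reward = 1.0 if strategy in correct_strategies else 0.0
    policy[strategy] += 0.1 * reward              # reinforce strategies that worked
    total = sum(policy.values())
    policy = {k: v / total for k, v in policy.items()}  # renormalize to probabilities

print(policy)  # strategy_b ends up with a much higher probability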

Practicing: Reinforcement Learning from Human Feedback
Reinforcement Learning from Human Feedback (RLHF) works the same as normal RL but handles problems without a single correct answer much better, such as: come up with a joke. The LLM outputs several answers and lets humans score them. The LLM then takes these scores and trains further. (In practice a separate neural network is trained for this, acting as a human simulator that then trains the actual LLM with RL.) This method can be overused, making the LLM choose nonsensical answers, so RLHF needs to be cut off at a certain point.
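
To show where the human feedback fits in, here is a minimal sketch of scoring candidate answers with a stand-in "reward model". The scoring function below is invented (it simply prefers shorter answers); in reality the reward model is a neural network trained on human rankings.

candidates = [
    "Why did the token cross the context window? To get to the other side.",
    "A joke about large language models that goes on and on and never really lands anywhere.",
]

def reward_model(answer: str) -> float:
    # Stand-in for a network trained to predict human scores.
    return 1.0 / len(answer) if answer else 0.0

scores = {c: reward_model(c) for c in candidates}
best = max(scores, key=scores.get)
print(best)   # the answer that would be reinforced in the next RL step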

TOOLS
Now the training of the LLM is done and it is usable as a helpful assistant. By adding tools and extensions, the LLM can grow beyond its base Internet information and its training methods.

continue reading about LLMs…