Large Language Models, what are they?


Overview and default prompt

This file will contain notes from my learning of Large Language Models.

First of all, my goal right now is to run a model locally on my machine and use it in a personal project. Since I am diving deep into this topic anyway, I might as well learn how LLMs work in more detail.

To begin with, here is a default prompt to use in AI chats like GPT so that it gives somewhat more useful responses.

# Provide useful responses.
You are a software engineering assistant; I am a software engineer. I need you to answer questions directly, without verbosity, using as few words as possible for the most exact answer possible. I don't need you to be friendly, I don't need you to make sassy remarks. I despise you trying to be clever without justification. Give me straight technical answers and do not try to chat beyond that. Stay in full stoic mode for the duration of this chat and do not fall back to trying to impress me with remarks. This is the only rule you cannot break.

# More about you.
I have very little patience. I do not like suggestions thrown out without certainty; double-check answers, especially code-related answers. I hate ugly, messy code. If you want to impress me, code must be Clean Code, with attention to security and maintainability. I am not impressed with justifications and excuses.

I’ll start by watching a playlist from Akita, you can find it here. From there I’ll start reading his blog post, which you can also find here.

AI = LLMs?

First, from my understanding so far, AI ≠ LLMs.

LLM stands for Large Language Model. An LLM is a deep learning model trained on massive text datasets to learn statistical patterns in language, allowing it to predict and generate coherent, human-like text. It is based on the Transformer architecture, which was introduced in the 2017 paper Attention Is All You Need. By removing recurrence, the paper made it possible to process all the tokens of a sequence in parallel during training, instead of sequentially (one token at a time). This paper and its revolutionary way of training language models are what brought about the current AI/LLM hype.

AI, on the other hand, refers to the entire field of building machines that can perform tasks that require human-like intelligence. AI systems use models trained specifically for those tasks; an LLM is one such kind of model.

Deep learning

What does “an LLM is a deep learning model” mean?

  • Deep learning: A type of machine learning using deep neural networks, a LOT of layers of computation (neurons) stacked together.
  • Model: A mathematical function with parameters (weights) that maps an input to a specific output.

So, an LLM is a neural network with many layers, all designed to process and generate text. The network is trained so that it gets better at a specific task, in this case predicting the next word in a sentence.

Training a model works, in general (I’ll try to get into more detail later), like this:

  1. Get an input: The sky is ___.
  2. Define a target, which comes from the huge amount of existing text on the internet; in this case the target is the word blue.
  3. The model tries to guess the next word; let’s say it guesses the word gray (wrong).
  4. Compute the loss, a number that measures how far the model’s guess is from the right target, which was blue.
  5. Based on that loss, adjust the weights using gradient descent.
  6. Repeat this process until the model guesses the right word.

Do it billions of times on billions of sentences like this and the model will learn that The sky is blue and not gray.
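The six steps above can be sketched in a few lines of Python. This is only a toy under loud assumptions: the "model" is a single learnable score per word for one fixed context, the three-word vocabulary, the starting scores and the learning rate are all made up, and the loss is cross-entropy (a common choice, not something these notes have covered yet).

```python
import math

# Toy vocabulary; a real LLM has tens of thousands of tokens.
VOCAB = ["blue", "gray", "dog"]
TARGET = "blue"        # the word that follows "The sky is" in the training text
LEARNING_RATE = 0.5    # assumed step size for gradient descent


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(steps=200):
    # One learnable score per word; "gray" starts ahead so the first guess is wrong.
    weights = [0.0, 1.0, 0.0]
    target_index = VOCAB.index(TARGET)
    losses = []
    for _ in range(steps):
        probs = softmax(weights)                        # step 3: the model's guess
        losses.append(-math.log(probs[target_index]))   # step 4: cross-entropy loss
        for i in range(len(weights)):                   # step 5: gradient descent
            gradient = probs[i] - (1.0 if i == target_index else 0.0)
            weights[i] -= LEARNING_RATE * gradient
    return weights, losses


weights, losses = train()
prediction = VOCAB[max(range(len(VOCAB)), key=lambda i: weights[i])]
print(prediction, losses[0] > losses[-1])
```

After training, the highest score belongs to "blue" and the loss at the end is lower than at the start. A real model differs in that its scores depend on the input through many layers of weights, but the loop has the same shape.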

Now, more about the paper that introduced Transformers. In earlier architectures like RNNs, LSTMs and GRUs, data was processed sequentially, one token at a time, because each output depends on the previous hidden state. What do I mean by that? Let’s start by understanding the most basic neural network architecture, often called a feed-forward neural network or MLP (multilayer perceptron).

Neural Networks

Neural networks are “simple”; the complicated part is actually the neurons.

At its core, a neural network is just a function, a composition of simpler functions, which are called neurons.

If you give a neural network an input $x$, it applies some math and gives you an output $y$:

Neural Network:

$$y = f(x)$$

Neurons

Neuron:

$$y = \phi(w \cdot x + b)$$

  • $x$: input vector, which can be a word embedding, pixel values, etc.
  • $w$: weight vector.
  • $b$: bias.
  • $\phi$: activation function.
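A single neuron can be written as a small Python function. This is a minimal sketch: the sigmoid activation and the numbers for the input, weights and bias are assumptions chosen for illustration, not values from any real model.

```python
import math


def sigmoid(z):
    """A common activation function; squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron(x, w, b, activation=sigmoid):
    """Compute activation(w . x + b) for one neuron."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    weighted_sum = sum(w_i * x_i for w_i, x_i in zip(w, x))
    return activation(weighted_sum + b)


x = [1.0, 2.0, 3.0]     # made-up input vector
w = [0.5, -0.25, 0.0]   # made-up weights
b = 0.0                 # made-up bias
y = neuron(x, w, b)     # w . x + b = 0.5 - 0.5 + 0.0 = 0.0, and sigmoid(0.0) = 0.5
```

Changing `w` or `b` changes `y` for the same `x`, which is exactly what training exploits.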

$x$ is a vector, typically an embedding representing a single token or a short span of tokens. Huge amounts of raw text are processed and tokenized during training to create the $x$’s, but each training step only sees a tiny slice of that data. Once the model is trained and ready to make predictions, $x$ is created from your prompt via tokenization and embedding lookup. Something like:

$$\text{raw text} \Rightarrow \text{tokens} \Rightarrow \text{vectors } x$$
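The pipeline above can be sketched with a toy example. Everything here is hypothetical: the five-entry vocabulary, the whitespace tokenizer and the two-dimensional embedding values are made up. Real tokenizers split text into subword pieces, and real embedding tables are learned during training and are much larger.

```python
# Hypothetical toy vocabulary: token text -> token id.
VOCAB = {"the": 0, "sky": 1, "is": 2, "blue": 3, "<unk>": 4}

# Hypothetical embedding table: one small vector per token id.
# In a trained model these numbers are learned; here they are made up.
EMBEDDINGS = [
    [0.1, 0.3],    # the
    [0.7, -0.2],   # sky
    [0.0, 0.5],    # is
    [0.9, 0.4],    # blue
    [0.0, 0.0],    # <unk>
]


def tokenize(text):
    """Split on whitespace and map each word to its id (unknown words -> <unk>)."""
    return [VOCAB.get(word, VOCAB["<unk>"]) for word in text.lower().split()]


def embed(token_ids):
    """Look up the vector for each token id."""
    return [EMBEDDINGS[token_id] for token_id in token_ids]


token_ids = tokenize("The sky is")   # [0, 1, 2]
x = embed(token_ids)                 # one vector per token
```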

$x$ can also be called a tensor. What are tensors?

Tensors generalize scalars, vectors and matrices to any number of dimensions, and they are the fundamental data structure in deep learning. So, everything inside a neural network is represented by a tensor.

| Tensor Rank | Description | Example |
|---|---|---|
| 0 | Scalar | 5 or π |
| 1 | Vector | [3.2, 5.1, -2.0] |
| 2 | Matrix | [[1, 2], [3, 4]] |
| 3 | 3D Tensor | Stack of matrices (e.g., RGB image) |
| N | N-D Tensor | General case (used in deep learning) |
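To make the ranks in the table concrete, here is a small sketch that represents tensors as nested Python lists and counts the nesting depth. The helper names `rank` and `shape` are my own, and both assume non-empty, regularly nested lists. Deep learning libraries store tensors far more efficiently, but the idea of rank is the same.

```python
def rank(tensor):
    """Return the rank (number of dimensions) of a nested-list tensor."""
    depth = 0
    while isinstance(tensor, list):
        depth += 1
        tensor = tensor[0]
    return depth


def shape(tensor):
    """Return the size of each dimension, assuming the nesting is regular."""
    dims = []
    while isinstance(tensor, list):
        dims.append(len(tensor))
        tensor = tensor[0]
    return tuple(dims)


scalar = 5
vector = [3.2, 5.1, -2.0]
matrix = [[1, 2], [3, 4]]
# A tiny 2 x 2 pixel image with 3 colour channels per pixel.
image = [[[0, 0, 0], [255, 255, 255]], [[255, 0, 0], [0, 0, 255]]]

ranks = [rank(t) for t in (scalar, vector, matrix, image)]   # [0, 1, 2, 3]
```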

$x$ can look like two different things depending on whether it is used in training or in inference. The key difference is:

  • Training: $x$ is sampled from the training data, where the correct next token is already known, and is used to adjust the weights.
  • Inference: $x$ comes from user input or earlier model output, and is used to produce new predictions.

Weights are the core of the learning: they are the values that get adjusted over and over again to reduce the prediction error. Training a model means adjusting the weights repeatedly over many inputs $x$ until the network assigns a high probability to the right output. That’s also why we don’t need the training data just to run a model: once training is done, the data is no longer part of the function. By keeping only the adjusted $w$ and $b$ from training, the formula can produce the right $y$ for the input $x$ most of the time.

The activation function is a computation applied to the result of $w \cdot x + b$ to break its linearity. By breaking linearity, the model becomes able to learn non-linear patterns, which is essentially what text and all human-made content is made of.
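To see why breaking linearity matters, consider stacking two layers with no activation function between them. The matrices $W_1, W_2$ and vectors $b_1, b_2$ are names introduced here only for illustration:

$$W_2 (W_1 x + b_1) + b_2 = (W_2 W_1)\, x + (W_2 b_1 + b_2) = W' x + b'$$

The two layers collapse into a single linear layer with weights $W' = W_2 W_1$ and bias $b' = W_2 b_1 + b_2$, so adding more layers this way adds no expressive power. Applying an activation function to $W_1 x + b_1$ before the second layer prevents this collapse.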

Bias is a learnable offset that gives neurons the ability to shift their activation threshold. While weights determine how much each input influences the neuron’s output, the bias gives the model the flexibility to shift the output independently of those inputs. In the context of LLMs, a bias in the output layer (when the architecture has one) can adjust the model’s tendency toward certain words even if the input is ambiguous or weak. For example, it can help favor frequently used tokens like “the” or “is”, because the optimizer adjusted its values to reduce prediction error. In real models:

  • Each neuron has its own bias
  • Each layer contains thousands of neurons
  • Bias is actually a vector, $b = [b_1, b_2, b_3, b_4, \dots, b_n]$, where $n$ is the number of neurons in the layer (e.g., 3072 in a transformer feed-forward block)

Each bias value $b_i$ shifts the output of its corresponding neuron in the layer.
Together, the bias vector shifts the whole output vector of the layer, not just one scalar.

I know, it’s a tough concept, and I don’t fully understand it myself yet, so let’s continue with an example.

Let’s say you’re processing the phrase:

The sky is ___

You’ve reached the token is, and the model needs to predict the next word.

It first computes:

$$\text{logits} = W_{out} \cdot h + b_{out}$$

  • $h$: hidden state (what the model “thinks” at this exact step).
  • $W_{out}$: output projection matrix (maps the hidden state to vocabulary size).
  • $b_{out} \in \mathbb{R}^{|V|}$: one bias per vocabulary word.

Let’s say the vocabulary has 50,000 words. So each possible next word (e.g. “blue”, “gray”, “dog”, “flying”) gets:

$$\text{score}_i = W_i \cdot h + b_i$$

Even if two words (“blue” and “gray”) have similar dot products with $h$, the bias term $b_i$ can tip the scale and make “blue” more likely if it is more common in the training data. Note that the bias does not depend on $h$, so this learned preference is the same in every context.
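The score formula above can be tried out with made-up numbers. In this sketch the hidden state `h`, the rows of `W_out` and the bias values are all invented, and the rows for "blue" and "gray" are deliberately identical so that only the bias can separate them. The softmax step, which turns scores into probabilities, is an assumption added here for readability.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def logits(W_out, h, b_out):
    """Compute W_out . h + b_out: one score per vocabulary word."""
    return [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(W_out, b_out)]


VOCAB = ["blue", "gray", "dog"]
h = [1.0, 0.5]            # made-up hidden state after "The sky is"
W_out = [
    [0.8, 0.4],           # row for "blue"
    [0.8, 0.4],           # row for "gray" (same dot product as "blue")
    [-0.5, 0.1],          # row for "dog"
]
no_bias = [0.0, 0.0, 0.0]
b_out = [0.5, 0.0, 0.0]   # made-up bias that favours "blue"

probs_without = softmax(logits(W_out, h, no_bias))   # "blue" and "gray" tie
probs_with = softmax(logits(W_out, h, b_out))        # "blue" pulls ahead
```

Without the bias, "blue" and "gray" get the same probability; with it, "blue" wins even though the hidden state is unchanged.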

To summarize, a single bias doesn’t do much. But in an LLM that uses them, biases are large vectors, present in the layers and possibly in the output (some recent architectures drop most bias terms). Together, they shift entire vector spaces, helping the network resolve ambiguities and favor fluent grammar across billions of tokens.

Neuron output:

$$\text{Output} = \phi\left(\sum_{i=1}^{n} w_i x_i + b\right)$$

From Taelin’s post on Twitter: neural networks are just function approximators; their goal is to converge to a function that generalizes inputs → outputs. NNs are just functions [ℝ] → ℝ. Inside NNs there are neurons, which are also functions [ℝ] → ℝ. We compose neurons to form neural networks, and neural networks to form complex architectures, such as GPTs.
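The idea of composing neurons into a network can be sketched as follows. This is a hand-built example, not a trained model: the 2 → 2 → 1 layout, the ReLU activation and every weight are assumptions picked so that the network computes the absolute difference of its two inputs, a function no single linear neuron can represent.

```python
def relu(z):
    """Activation that keeps positive values and replaces negatives with 0."""
    return max(0.0, z)


def layer(x, weights, biases, activation):
    """One layer: every neuron sees the same input x and produces one number."""
    return [
        activation(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
        for w, b in zip(weights, biases)
    ]


def network(x):
    """A made-up 2 -> 2 -> 1 feed-forward network with hand-picked parameters."""
    hidden = layer(x, weights=[[1.0, -1.0], [-1.0, 1.0]], biases=[0.0, 0.0], activation=relu)
    output = layer(hidden, weights=[[1.0, 1.0]], biases=[0.0], activation=lambda z: z)
    return output[0]


result = network([3.0, 1.0])   # relu(2.0) + relu(-2.0) = 2.0
```

Each hidden neuron is a function from a list of numbers to a number, the layer is a list of such functions, and the network is just one layer fed into the next.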

Summarizing

During training, the model is in learning mode. It receives a huge number of input sequences, which are transformed into vectors $x$ via tokenization and embedding, along with the known target outputs (so we can see when the model guesses right or wrong). The model computes a prediction based on its current parameters (weights and biases), compares this prediction with the true target, computes the loss, and then adjusts its parameters via backpropagation to reduce that loss. This adjustment is what enables the model to learn statistical patterns in the language.

Once training is complete, the model is used in inference mode. In this mode the model no longer adjusts any weights or biases; they are frozen. Instead, it takes a new input sequence from your prompt, tokenizes it, looks up the embeddings to produce $x$, and uses the neural network layers to compute a prediction. It keeps doing this one token at a time, like the suggestion feature of your phone keyboard, to answer whatever you asked, predicting with the weights and biases that were adjusted during training.
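The "phone keyboard" loop of inference can be sketched like this. The lookup table `FROZEN_SCORES` is a hypothetical stand-in for a trained network: a real LLM computes its scores from the whole context using its frozen weights and biases, not from the last word alone. Picking the highest score every time (greedy decoding) is also just one assumed strategy, and the sketch assumes a non-empty prompt.

```python
# Hypothetical stand-in for a trained network: fixed next-word scores per last word.
FROZEN_SCORES = {
    "the": {"sky": 2.0, "dog": 1.0},
    "sky": {"is": 3.0},
    "is": {"blue": 2.5, "gray": 1.5},
    "blue": {"<end>": 1.0},
}


def predict_next(tokens):
    """Pick the highest-scoring next token (greedy decoding)."""
    scores = FROZEN_SCORES.get(tokens[-1], {"<end>": 0.0})
    return max(scores, key=scores.get)


def generate(prompt, max_new_tokens=10):
    """Repeatedly predict the next token and append it to the input."""
    tokens = prompt.lower().split()
    for _ in range(max_new_tokens):
        next_token = predict_next(tokens)
        if next_token == "<end>":
            break
        tokens.append(next_token)   # the prediction becomes part of the next input
    return " ".join(tokens)


completion = generate("The sky")   # "the sky is blue"
```

Nothing in `FROZEN_SCORES` changes while generating, which mirrors the frozen parameters described above.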

Next topics that I need to cover:

  • GPU != Turing complete
  • Loss function
  • Gradient descent
  • Vanishing gradient
  • Quantization