RéeLLM
Ahan R
You shall know a word by the company it keeps.
Being Boring
Firth's quote has become something like scripture over the past few months. Large Language Models have shown enormous potential, and it's encouraged certain philosophers to ascribe 'meaning' or 'understanding' to them. The terms themselves are always poorly defined, as with most philosophical debate. Abstraction allows argumentation, since if you perfectly clarify your words, you're left with nothing. I feel this has been my experience with most philosophical 'problems'.
Arguing against AI is easy. To an extent, academia rewards being boring. Philosophical reasoning is meant to be rock solid. Deviation from this pattern usually emerges only among overconfident undergraduates, an attitude that is soon corrected. Scandals appear only to be watered down in order to become drinkable, bland. Descartes is an obvious example. There is a habit of reducing "things" (philosophers, schools of thought, events) to a couple of sentences before discarding them wholly. For ChatGPT (and Large Language Models in general) this was the term "stochastic parrot": not entirely wrong, but engendering a certain dismissal. No longer is AI 'understanding' taken seriously as a topic of inquiry. There have been attempts, though, notably Søgaard in his paper "Understanding models understanding language", which argues that certain LLMs can map sentence semantics or text pragmatics.
Søgaard argues that, by modeling relationships between symbols, LLMs employ a certain strain of semantics: inferential semantics, a type of semantics separated from concerns about the intentionality of the speaker, specifically what they're referring to when they utter something. Internalist semantics is an example of this, where "meanings are instructions for how to build concepts of a special sort". This quote is taken from Pietroski's book "Conjoining Meanings: Semantics Without Truth Values".
Generally, internalist semantics represents a break from classical semantics by denying that "mind-world reference relations should play any role in semantic theorizing". The meaning of an expression helps (instructs) in forming a certain mental representation. The consequence of this is that only concepts are granted extensions; expressions are not. Inferential semantics focuses more on the systems and axioms on which propositions are formed. In this view, embodied experience is unnecessary in order to capture "semantic meaning". This means that AI, by capturing certain correlations between words and sentences, can perform this type of semantics. An interesting trick, but still a trick. Landgrebe and Smith, in their paper "Why machines do not understand", respond well to Søgaard.
Wrapping and Unwrapping
I was already familiar with word2vec and the other embedding techniques relevant to LLMs. However, this section was a good refresher. Text is preprocessed in a variety of ways in order to allow its transformation into a multidimensional vector. To accomplish this:
- Text is first cleaned. Redundant whitespace is removed and punctuation is split off so that each mark occupies its own position, separated from its word.
import re

preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
The output looks something like this:
['In', 'November', '1910', ',', 'a', 'Jewish', 'engineer', 'at', 'Victoria', 'University']
- Tokens are assigned a unique ID.
Black = 0; Paper = 1; Water = 2;
and so on. This gives us a vocabulary list. Since our text is only around 6,000 words long, our vocabulary will be quite limited. These are my first 25 token IDs: [818, 3389, 31953, 11, 257, 5582, 11949, 379, 12313, 2059, 287, 9502, 6492, 257, 12701, 329, 257, 649, 1611, 286, 9551, 261, 37073, 44408, 6051]. To account for the limited vocabulary we introduce an unknown token, which allows the program to handle words that aren't in its training data.
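As a rough sketch (my own reconstruction rather than the book's exact code, assuming preprocessed is the token list produced above), building that vocabulary looks like this:
# every unique token gets an integer ID, assigned in sorted order
# (special tokens such as <|unk|> and <|endoftext|> are appended later)
all_tokens = sorted(set(preprocessed))
vocab = {token: integer for integer, token in enumerate(all_tokens)}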
In preprocessing and tokenizing my data there was an abundance of years and names. These are the first entries in my vocabulary list:
('(', 0)
(')', 1)
(',', 2)
('-', 3)
('.', 4)
('1066', 5)
('1798', 6)
('1813', 7)
('1831', 8)
('1838', 9)
('1844', 10)
('1845', 11)
('1859', 12)
('1860', 13)
('1910', 14)
('1919', 15)
('1920', 16)
('1922', 17)
('1928', 18)
('1929', 19)
('1940s', 20)
('1951', 21)
('1996', 22)
('2001', 23)
('2009', 24)
('21', 25)
('22', 26)
('25', 27)
('62', 28)
(':', 29)
(';', 30)
('?', 31)
('A', 32)
('After', 33)
('All', 34)
('Alps', 35)
('America', 36)
('An', 37)
('Annie’s', 38)
('Another', 39)
('As', 40)
('Asperger’s', 41)
('Australia', 42)
('Austria', 43)
('Austrian', 44)
('Autobiography', 45)
('Ayer', 46)
('BBC', 47)
('Bay', 48)
('Beagle', 49)
('Beagle’', 50)
I suppose this is to be expected for two historical essays. What I've described here is an encoder; the decoder performs the same process in reverse. Special context tokens can also be added. I used <|endoftext|> and <|unk|> to handle the end of a text and an unknown word respectively. Other (better) tokenizers also employ [beginning of sequence], [end of sequence], and [padding] tokens.
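As a minimal sketch of that encoder/decoder pair (my own reconstruction, assuming vocab is the {token: ID} dictionary built earlier with <|unk|> and <|endoftext|> appended), it might look like this:
import re

class SimpleTokenizer:
    def __init__(self, vocab):
        self.str_to_int = vocab                              # token -> ID (encoding direction)
        self.int_to_str = {i: s for s, i in vocab.items()}   # ID -> token (decoding direction)

    def encode(self, text):
        tokens = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        tokens = [t.strip() for t in tokens if t.strip()]
        # unknown words fall back to the <|unk|> token
        tokens = [t if t in self.str_to_int else "<|unk|>" for t in tokens]
        return [self.str_to_int[t] for t in tokens]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        # reattach punctuation to the preceding word
        return re.sub(r'\s+([,.:;?!"()\'])', r'\1', text)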
Byte Pair Encoding
So far we've been encoding whole words and certain typographical symbols. Byte Pair Encoding also encodes sub-words and individual characters. A simple explanation is as follows:
- A vocabulary is generated so that it initially contains only the individual bytes or characters found in the text. For example, given the corpus {“cat”, “bat”, “rat”, “bats”}, the initial Byte Pair vocabulary will contain {“a”, “b”, “c”, “r”, “s”, “t”}.
- The frequency of each adjacent pair is counted and the most frequent pair is merged into a single token.
Here the pair counts are: "ca" appears 1 time, "at" appears 4 times, "ba" appears 2 times, "ra" appears 1 time, "ts" appears 1 time. The most frequent adjacent pair is "at", so we merge it into a single token "at" and update the corpus and vocabulary:
"cat" becomes "c at", "bat" becomes "b at", "rat" becomes "r at", "bats" becomes "b at s". Updated vocabulary: {"a", "b", "c", "r", "s", "t", "at"}
- This process is repeated until the vocabulary reaches a desired size.
The book suggests using the tiktoken Byte Pair encoder. A consequence of breaking words down into subwords is that there's no longer any need for the special token '<|unk|>'. A side effect of using a Byte Pair encoder is that the model can output words it was never trained on, since frequently co-occurring subword pairs can be recombined into new words.

Now we have to implement a Dataset and a DataLoader. The Dataset's job is to tokenize the entire text and break it into chunks (subsamples). Chunks are split into input-target pairs and stored as tensors, where the target is simply the input shifted one position to the right. The stride determines how far we slide the input window between chunks: a stride of 1 produces heavily overlapping chunks, while a stride equal to the chunk length, 4 in the examples here, moves the window a whole chunk at a time so the chunks don't overlap (a sketch of the Dataset and DataLoader follows the example tensors below). For example:
Input: "On the edge of" Target: "The river stood an". For example:
Input tensor x (shown with words for illustration; the real tensors hold token IDs):
x = torch.tensor([
["On", "the", "edge", "of"],
["the", "river", "stood", "an"],
["ancient", "bridge", "under", "the"]
])
Output tensor y:
y = torch.tensor([
["the", "edge", "of", "the"],
["river", "stood", "an", "ancient"],
["bridge", "under", "the", "moon"]
])
The DataLoader actually returns token IDs, so the real output looks something like this:
Input Tensor:
inputs = torch.tensor([
[101, 2054, 2003, 1996],
[2060, 1011, 3203, 1998],
[2129, 2515, 2111, 2424],
[102, 2024, 2045, 1998],
[2285, 2568, 2023, 2391],
[1996, 2831, 1012, 102],
[2054, 102, 2129, 3203],
[1011, 2391, 2831, 1999]
])
Target Tensor:
targets = torch.tensor([
[2054, 2003, 1996, 2060],
[1011, 3203, 1998, 2129],
[2515, 2111, 2424, 102],
[2024, 2045, 1998, 2285],
[2568, 2023, 2391, 1996],
[2831, 1012, 102, 2054],
[102, 2129, 3203, 1011],
[2391, 2831, 1999, 2129]
])
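Here's a minimal sketch of that Dataset/DataLoader setup using the tiktoken BPE encoder (my own reconstruction, assuming raw_text holds the full text; the names and parameters are illustrative):
import torch
import tiktoken
from torch.utils.data import Dataset, DataLoader

class GPTDataset(Dataset):
    def __init__(self, text, tokenizer, max_length, stride):
        token_ids = tokenizer.encode(text)
        self.inputs, self.targets = [], []
        # slide a window over the token IDs; the target is the input shifted by one position
        for i in range(0, len(token_ids) - max_length, stride):
            self.inputs.append(torch.tensor(token_ids[i:i + max_length]))
            self.targets.append(torch.tensor(token_ids[i + 1:i + max_length + 1]))

    def __len__(self):
        return len(self.inputs)

    def __getitem__(self, idx):
        return self.inputs[idx], self.targets[idx]

tokenizer = tiktoken.get_encoding("gpt2")      # GPT-2 byte pair encoder
dataset = GPTDataset(raw_text, tokenizer, max_length=4, stride=4)
dataloader = DataLoader(dataset, batch_size=8, shuffle=False)
inputs, targets = next(iter(dataloader))       # two 8 x 4 tensors of token IDs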
Embedding Tokens
We embed our tokens using an embedding layer. Given a certain embedding size (let's say 3), the embedding layer holds a weight matrix with dimensions (number of tokens) x (embedding size). So if we have 8 tokens and an embedding size of 3, we get a tensor that looks like this:
tensor([[-0.2365, -0.0142, 0.4671],
[ 0.0844, 0.2479, 0.3541],
[-0.2084, -0.2362, -0.3989],
[-0.1478, -0.1628, 0.2216],
[ 0.3945, 0.2482, -0.2355],
[-0.4514, 0.3583, -0.0440],
[ 0.4975, 0.1994, -0.3351],
[-0.2396, -0.4771, 0.1130]])
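A weight matrix like this comes straight out of a PyTorch embedding layer. A quick sketch (the seed is arbitrary, so the exact values won't match the ones above):
import torch

torch.manual_seed(123)
embedding_layer = torch.nn.Embedding(num_embeddings=8, embedding_dim=3)
print(embedding_layer.weight)                 # the 8 x 3 weight matrix
print(embedding_layer(torch.tensor([2])))     # lookup: returns the third row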
The numbers within this weight tensor are random; they are the model's (initially untrained) weights. Each token ID corresponds to a row of the tensor: for example, token 2 is embedded in the third row (because we start counting at 0). This lets the embedding layer perform simple lookup operations. The problem with this approach is that the same token always yields the same embedding vector, so its relative position within the input sequence is lost. For example, in the sentence "I had had a shower before bed", the two "had"s would be mapped to exactly the same values despite occupying different positions. In order to overcome this, we need to add some sort of positional tracking mechanism.
We simply add a unique positional vector to each token in the sequence. For example, given the tokens "The", "cat", and "sits", we construct token embeddings, create absolute positional embeddings, and add the two together.
Token Embeddings:
"The" → [1, 1, 1, 1]
"cat" → [1, 1, 1, 1]
"sits" → [1, 1, 1, 1]
Positional Embeddings:
Position 1 → [1.1, 1.2, 1.3, 1.4]
Position 2 → [2.1, 2.2, 2.3, 2.4]
Position 3 → [3.1, 3.2, 3.3, 3.4]
Adding Together:
"The" (position 1): [1, 1, 1, 1] + [1.1, 1.2, 1.3, 1.4] = [2.1, 2.2, 2.3, 2.4]
"cat" (position 2): [1, 1, 1, 1] + [2.1, 2.2, 2.3, 2.4] = [3.1, 3.2, 3.3, 3.4]
"sits" (position 3): [1, 1, 1, 1] + [3.1, 3.2, 3.3, 3.4] = [4.1, 4.2, 4.3, 4.4]
So far we've been focusing on relatively small embedding sizes (3 or 4); real-world LLMs use far more. Here our tokens will have 256 dimensions, and each batch will contain 8 text samples of 4 tokens each, so the embedded batch will be an 8 x 4 x 256 tensor. The token IDs for one batch look like this:
Token IDs:
tensor([[ 818, 3389, 31953, 11],
[ 257, 5582, 11949, 379],
[12313, 2059, 287, 9502],
[ 6492, 257, 12701, 329],
[ 257, 649, 1611, 286],
[ 9551, 261, 37073, 44408],
[ 6051, 13, 679, 373],
[ 655, 2310, 11, 290]])
torch.Size([8, 4])
Embedding our tokens isn't enough; we also need to find a way to embed positional information. The easy solution is to create another embedding layer with the same number of dimensions. Since our tokens appear in sequences that are at most 4 tokens long, this positional tensor has size 4 x 256. We can then add these positional embeddings to our token embeddings.
context_length = max_length
pos_embedding_layer = torch.nn.Embedding(context_length, output_dim)
# one positional vector per position in the sequence (shape: 4 x 256)
pos_embeddings = pos_embedding_layer(torch.arange(context_length))
print(pos_embeddings.shape)
# add positional embeddings to token embeddings (broadcast across the batch)
input_embeddings = token_embeddings + pos_embeddings
Our final tensor size is "torch.Size([8, 4, 256])".
To recap:
- Input text is cleaned by removing redundant whitespace and separating conjoined punctuation into its own tokens. This is called tokenization.
- Each token is given a unique numerical ID for identification when performing encoding or decoding processes.
- Tokens are separated into chunks and embedded within an n-dimensional tensor.
- Positional embeddings are created and added to the token embeddings to create our input embeddings.
Attention
- Self-Attention
- Causal Attention
- Multi-Head Attention
Self-Attention
If you've ever tried to translate something, then you understand the pitfalls of picking up a dictionary and going word by word (part of the reason I was always so uncomfortable with Searle's Chinese Room argument). The relationships that words have with each other, and their ability to affect the meaning of a sentence, are an unavoidable phenomenon. Words are polysemantic: "bow" can refer to the weapon and to the act of showing respect. The self-attention mechanism is a way for models to learn how to process and calculate the polysemy and sentential meaning of words. We start by building a simplified attention mechanism.
This is done with the help of a context vector, which can be interpreted as an enriched embedding vector. An element's context vector is computed as a combination of all input vectors, weighted with respect to that element's embedding vector. The result is a value that represents how a vector relates to all the other vectors within its input data.
Our first step is to calculate an "attention score" (w), which acts as an intermediary value between tokens. To calculate it, we take the dot product between the embedded query token and every other token. These scores are labelled w(x, y), where w represents the attention score between query x and input y. For example, w(2, 1) is the attention score between embedded token x(2) and input x(1). We then use a softmax function to normalize the scores:
\[ \text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^n e^{z_j}} \]
We arrive at our context vector by multiplying each input vector by its normalized attention weight and summing the results. This gives a tensor that looks something like this:
tensor([0.2219, 0.1513, 0.8389])
To summarize:
- Compute Attention Scores
- Calculated as the dot product of the query and all other tokens in the input. Output looks like:
tensor([0.9544, 1.4950, 1.4754, 0.8434, 0.7070, 1.0865])
- Compute Attention Weights
- Normalize attention scores through a softmax function
- Compute Context Vectors
- Multiply input tokens with corresponding attention weights and sum the resulting vectors
- This results in the context vector z(x), where x is the query token.
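Putting the three steps together for a single query token, a toy sketch (the input values are arbitrary; only the shapes matter):
import torch

torch.manual_seed(123)
inputs = torch.rand(6, 3)                                    # six embedded tokens, 3 dimensions each
query = inputs[1]                                            # pick x(2) as the query token
attention_scores = inputs @ query                            # dot product with every input token
attention_weights = torch.softmax(attention_scores, dim=0)   # normalize so the weights sum to 1
context_vector = attention_weights @ inputs                  # weighted sum: the context vector z(2)
print(context_vector)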
In order to properly implement self-attention we require Query (Q), Key (K), and Value (V) weight matrices, which are used to derive the q, k, and v vectors. These matrices contain weights that we adjust through training. The vectors are obtained through the following matrix multiplications:
\[ q(i) = W_q x(i) \quad \text{for} \quad i \in [1, T] \]
\[ k(i) = W_k x(i) \quad \text{for} \quad i \in [1, T] \]
\[ v(i) = W_v x(i) \quad \text{for} \quad i \in [1, T] \]
Queries, keys, and values are calculated by multiplying the input vectors by the (initially random) W_q, W_k, and W_v matrices. A token's query vector multiplied by another token's key vector gives the attention score. We softmax these scores to obtain the attention weights, then multiply each weight by its respective value vector and sum the results to produce the context vector. Using this information we can produce a simple self-attention mechanism:
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        # trainable query, key, and value weight matrices
        self.W_query = nn.Parameter(torch.rand(input_dim, output_dim))
        self.W_key = nn.Parameter(torch.rand(input_dim, output_dim))
        self.W_value = nn.Parameter(torch.rand(input_dim, output_dim))

    def forward(self, x):
        keys = x @ self.W_key
        queries = x @ self.W_query
        values = x @ self.W_value
        attention_scores = queries @ keys.T  # omega
        attention_weights = torch.softmax(
            attention_scores / keys.shape[-1] ** 0.5, dim=-1
        )
        context_vector = attention_weights @ values
        return context_vector
__init__ generates random values for the Q, K, and V weight matrices. The forward function calculates the individual query, key, and value vectors, multiplies the queries by the keys to get attention scores, normalizes them, and applies the resulting weights to the values in order to work out the context vectors. This is the basis of the self-attention mechanism.
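A quick usage check of the class above (the sizes here are arbitrary):
torch.manual_seed(123)
sa = SelfAttention(input_dim=3, output_dim=2)
inputs = torch.rand(6, 3)        # six embedded tokens with 3 dimensions each
print(sa(inputs).shape)          # torch.Size([6, 2]): one context vector per token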
Causal Attention
Causal attention forces the model to consider only the tokens that appear before the current position when predicting the next word. To achieve this we black out (mask) future tokens and re-normalize the weights of the tokens already known in the sequence. To implement this we apply a "causal attention mask", concealing the tokens that appear above the diagonal of the attention-weight matrix. The example sentence used is "Your journey starts with one step".
BEFORE:
+--------+-------+---------+--------+-------+-------+-------+
| | Your | journey | starts | with | one | step |
+--------+-------+---------+--------+-------+-------+-------+
| Your | 0.19 | 0.16 | 0.15 | 0.17 | 0.15 | |
| journey| 0.20 | 0.16 | 0.14 | 0.16 | 0.14 | |
| starts | 0.20 | 0.16 | 0.14 | 0.16 | 0.14 | |
| with | 0.18 | 0.16 | 0.15 | 0.16 | 0.15 | |
| one | 0.18 | 0.16 | 0.15 | 0.16 | 0.15 | |
| step | 0.19 | 0.16 | 0.15 | 0.16 | 0.15 | |
+--------+-------+---------+--------+-------+-------+-------+
AFTER:
+--------+-------+---------+--------+-------+-------+-------+
| | Your | journey | starts | with | one | step |
+--------+-------+---------+--------+-------+-------+-------+
| Your | 1.00 | | | | | |
| journey| 0.55 | 0.44 | | | | |
| starts | 0.38 | 0.30 | 0.31 | | | |
| with | 0.27 | 0.24 | 0.24 | 0.23 | | |
| one | 0.21 | 0.19 | 0.19 | 0.18 | 0.19 | |
| step | 0.19 | 0.16 | 0.16 | 0.15 | 0.15 | 0.15 |
+--------+-------+---------+--------+-------+-------+-------+
To implement this we use PyTorch's triu function to mark the positions above the diagonal (tril gives the complementary lower-triangular mask). To make sure the masked values don't influence the softmax normalization, we convert them into -∞ values. Because softmax converts our values into a probability distribution, and e^(-∞) approaches 0, the function treats those positions as 0.
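In isolation, that masking step looks roughly like this (a toy 6 x 6 score matrix, not the values from the tables above):
import torch

attn_scores = torch.rand(6, 6)                            # toy attention scores
mask = torch.triu(torch.ones(6, 6), diagonal=1).bool()    # True above the diagonal (future tokens)
masked = attn_scores.masked_fill(mask, -torch.inf)
attn_weights = torch.softmax(masked, dim=-1)              # masked positions become 0; each row sums to 1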
To prevent overfitting we use a method called dropout. In attention, this is usually applied either after the attention weights are calculated or after the attention weights are applied to the value vectors. It works by creating a dropout mask and applying it to the weights.
Dropout Mask:
+--------+-------+---------+--------+-------+-------+-------+
| | Your | journey | starts | with | one | step |
+--------+-------+---------+--------+-------+-------+-------+
| Your | 1 | X | X | X | X | X |
| journey| X | 0.44 | X | X | X | X |
| starts | 0.38 | X | 0.31 | X | X | X |
| with | X | 0.24 | X | 0.23 | X | X |
| one | X | X | 0.19 | X | 0.19 | X |
| step | 0.19 | X | X | 0.15 | X | 0.15 |
+--------+-------+---------+--------+-------+-------+-------+
Weights after Dropout Mask:
+--------+-------+---------+--------+-------+-------+-------+
| | Your | journey | starts | with | one | step |
+--------+-------+---------+--------+-------+-------+-------+
| Your | 1.00 | | | | | |
| journey| | 0.44 | | | | |
| starts | 0.38 | | 0.31 | | | |
| with | | 0.24 | | 0.23 | | |
| one | | | 0.19 | | 0.19 | |
| step | 0.19 | | | 0.15 | | 0.15 |
+--------+-------+---------+--------+-------+-------+-------+
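As a quick illustration of how PyTorch's dropout behaves on its own (a toy example, not taken from the book):
import torch

torch.manual_seed(123)
dropout = torch.nn.Dropout(0.5)      # 50% dropout rate
example = torch.ones(6, 6)
print(dropout(example))              # roughly half the entries are zeroed; survivors are scaled to 2.0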
If we use a dropout rate of 50%, then half of all the attention weights are masked out (PyTorch's nn.Dropout also rescales the surviving weights by 1/(1 - p) to compensate). Note that the exact dropout output can change depending on your operating system. We can now implement both causal attention and dropout in our attention class.
import torch
import torch.nn as nn

class CausalAttention(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, qkv_bias=False):
        super().__init__()
        self.d_out = d_out
        self.W_query = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_key = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.W_value = nn.Linear(d_in, d_out, bias=qkv_bias)
        self.dropout = nn.Dropout(dropout)
        # mask of the positions above the diagonal (future tokens)
        self.register_buffer(
            'mask',
            torch.triu(torch.ones(context_length, context_length),
                       diagonal=1)
        )

    def forward(self, x):
        b, num_tokens, d_in = x.shape
        keys = self.W_key(x)
        queries = self.W_query(x)
        values = self.W_value(x)
        attn_scores = queries @ keys.transpose(1, 2)
        attn_scores.masked_fill_(
            self.mask.bool()[:num_tokens, :num_tokens], -torch.inf)
        attn_weights = torch.softmax(
            attn_scores / keys.shape[-1] ** 0.5, dim=-1
        )
        attn_weights = self.dropout(attn_weights)
        context_vec = attn_weights @ values
        return context_vec
The resulting context vector is a three-dimensional tensor in which each token has a two-dimensional embedding. Causal attention as implemented here is a "single-head" attention module, because there is only one set of attention weights processing the input.
Multi-Head Attention
Multi-head attention just means we have more than one set of attention weights processing the input, so we can capture more information within the data. This requires multiple instances of the self-attention mechanism: instead of a single W_v, W_q, and W_k matrix, we have several (e.g. W_v1 and W_v2). The result is multiple context vectors, which are then combined into one.
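A minimal sketch of this wrapper idea, stacking several of the CausalAttention modules from above and concatenating their outputs (the class name and parameters here are my own, not necessarily the book's):
import torch
import torch.nn as nn

class MultiHeadAttentionWrapper(nn.Module):
    def __init__(self, d_in, d_out, context_length, dropout, num_heads, qkv_bias=False):
        super().__init__()
        # one independent causal-attention head per set of W_q, W_k, W_v matrices
        self.heads = nn.ModuleList([
            CausalAttention(d_in, d_out, context_length, dropout, qkv_bias)
            for _ in range(num_heads)
        ])

    def forward(self, x):
        # each head produces its own context vectors; concatenate them along the embedding dimension
        return torch.cat([head(x) for head in self.heads], dim=-1)

With two heads and d_out = 2, for instance, each token ends up with a 4-dimensional combined context vector.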