AI or ain't: LLMs

Previously we covered early chatbots, bots talking gibberish, and self-taught number crunchers.

But what we’ve got so far is still boring. AI was promised to overthrow the world order, not just classify arrays of floats. Can we have a chat?

GPT

As unimpressive as it sounds, neural networks take arrays of numbers and return arrays of numbers, just like our brain takes electric signals and emits electric signals. It’s only a matter of how we encode such inputs and outputs. In a human body the brain is taught how to “sense” pictures, temperature and touch. It learns how to move legs and walk, how to write and speak. It takes years for our brain to learn such rudimentary signal encodings.

A neural network can also be “connected” to some webcam and treat pixels as arrays of numbers. With enough training data and time a network can learn how to recognise images, how to read handwriting or how to tell a Chihuahua from a muffin. By encoding inputs as MIDI notes we can teach a network to compose music. By encoding inputs as words we can teach it a language and some common knowledge written in that language.

GPT models, currently popularised by OpenAI’s ChatGPT, are generative pre-trained transformers (that’s the network architecture) and are used to generate text from a given input. We’ll start with the GPT-2 models published by OpenAI a while ago, since they are fairly small and easy to work with. Code examples will be in Go.

Tokens

In all the previous parts we simply split a sentence by whitespace. Despite its simplicity, this approach is not the best: it treats “hello” and “Hello!” as two different words, assumes that the words “greet” and “greeting” are completely unrelated, and so on.

A more advanced solution would be to store a list of known tokens that could represent a complete word or a certain part of a word. Our tokenisation process then has to find the longest possible token for each part of the sentence. In a list of tokens each token has a unique index, so tokenisation essentially converts a string of words into an array of integers.

Our GPT-2 model knows ~50K tokens. They are stored in the tokens.dat file as a sequence of null-terminated strings.

// load tokens from a file
tokens := []string{}
b, _ := ioutil.ReadFile("tokens.dat")
for len(b) > 0 {
  i := bytes.IndexByte(b, 0)
  tokens = append(tokens, string(b[:i]))
  b = b[i+1:]
}

// findToken returns the index of the known token that shares the longest
// common prefix with s, and the length of that shared prefix
func findToken(s string) (index, overlap int) {
  for i, t := range tokens {
    j := 0
    for ; j < len(s) && j < len(t) && s[j] == t[j]; j++ {
    }
    if j > overlap || (j == overlap && j == len(t)) {
      overlap, index = j, i
    }
  }
  return index, overlap
}

// tokenise greedily converts a string into a sequence of token indices
func tokenise(s string) (context []int) {
  for len(s) > 0 {
    t, n := findToken(s)
    if n == 0 {
      return context // no known token matches the rest of the string
    }
    context = append(context, t)
    s = s[n:]
  }
  return context
}

tokenise("Paris is the capital of") // [40313 318 262 3139 286]
tokenise("The capital of Germany is") // [464 3139 286 4486 318]

Note that “capital” is token #3139, “is” is #318 and “of” is #286. “The” and “the” are two different tokens, since the capitalised form may indicate the start of a sentence and thus carry a different meaning.

Word vectors

However, it would be very difficult to train a network to understand what words mean if they were represented by token indices alone. For example, “Paris” and “capital” are semantically very close to each other, but their token IDs are far apart: 40313 and 3139. Judging by IDs, the word “Germany” (4486) is closer to “capital” than the actual capital city is.

This is why the GPT model has another layer of indirection that converts a token index into a vector of numbers representing the word’s “meaning”. Many years ago the similar Word2vec algorithm was invented: it assigned a vector to each word, and the more related two words are, the smaller the difference between their vectors. In fact, arithmetic operations on such vectors could find associated words, e.g. “King - Man + Woman ≈ Queen”.

The GPT-2 model comes with a similar “word token embedding” matrix (WTE), in our case stored in the wte.dat file. This file was created as a result of the GPT-2 training phase performed by OpenAI. We’re using its contents without thinking too much about how it was obtained:

// read a file into a slice of 32-bit floats
func read(filename string) []float32 {
  b, _ := ioutil.ReadFile(filename)
  return unsafe.Slice((*float32)(unsafe.Pointer(&b[0])), len(b)/4)
}

// get a vector of WordVecSize floats (768 for the 124M model) for the given token in the WTE table
func wordvec(wte []float32, token int) []float32 {
  return wte[WordVecSize*token : WordVecSize*(token+1)]
}

Let’s pick the words “king”, “monarch” and “lettuce”. These correspond to tokens 5822, 26464 and 39406 in tokens.dat. Judging by token IDs alone, an AI might get the impression that a “king” is more of a salad plant than a monarch. But if we take the word vectors for each token and calculate the distance between them, we get 11.1 for “king”/“monarch” and 21.9 for “king”/“lettuce”. Similarly, “tea”/“biscuit” gives us 16 points and “tea”/“coffee” 9, but “tea”/“Hare” is 21, despite all of Carroll’s work.
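
Here is a minimal sketch of how such a comparison could be done, reusing the read and wordvec helpers from above and assuming a plain Euclidean distance between the vectors (the exact metric behind the numbers above is an assumption):

// distance between two word vectors (Euclidean, as an assumption)
func dist(a, b []float32) float64 {
  sum := 0.0
  for i := range a {
    d := float64(a[i] - b[i])
    sum += d * d
  }
  return math.Sqrt(sum)
}

wte := read("wte.dat")
king, monarch, lettuce := wordvec(wte, 5822), wordvec(wte, 26464), wordvec(wte, 39406)
fmt.Println(dist(king, monarch)) // small: "king" and "monarch" are related
fmt.Println(dist(king, lettuce)) // roughly twice as large: "lettuce" is not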

Similarly, you can take a word, find its vector and look up the most closely related vectors (a sketch of how to produce such a ranking follows the list). You may see that for the input token “absurd” the closest matches would be:

 absurd 0
 ridiculous 3.4613547
 ludicrous 3.7072947
 outrageous 6.004274
 ridiculously 6.71082
 nonsensical 6.7171283
 bizarre 6.8727217
 laughable 7.240555
 outlandish 7.274483
 silly 7.331988
 insanely 7.7595344
 astonishing 7.8167872
 grotesque 7.818375
 absurdity 7.8630023
 insane 8.12205
 incredible 8.283726
 monstrous 8.381887
 weird 8.558824
 astounding 8.615417
 stupid 8.698734
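
A list like this can be produced by scoring every known token against the chosen one and sorting by distance. A rough sketch, reusing the dist helper from the previous snippet and assuming that “absurd” tokenises into a single token:

// rank all known tokens by distance to a reference word (sketch)
ref := wordvec(wte, tokenise("absurd")[0])
type match struct {
  word string
  d    float64
}
matches := make([]match, len(tokens))
for i, t := range tokens {
  matches[i] = match{t, dist(ref, wordvec(wte, i))}
}
sort.Slice(matches, func(i, j int) bool { return matches[i].d < matches[j].d })
for _, m := range matches[:20] {
  fmt.Println(m.word, m.d)
}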

This ranking makes total sense! But processing one word at a time, without contextual knowledge about the other words, is unlikely to give good results for natural language (consider “working hardly” vs “hardly working”).

This is where another large matrix, WPE (word position embeddings), helps. It encodes the position of a word in the input context (sentence) and how that position adjusts the word vector. We’ll see how it’s used in the network later.
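
Roughly, the idea is that the input fed into the first layer is the word vector of a token plus the position vector of its slot, both taken from the embedding matrices. Here is a sketch of the combination, where token and pos stand for the token index and its slot in the context (the real usage appears in Run further below):

// combine "what the token means" (WTE) with "where it stands" (WPE)
wpe := read("wpe.dat")
x := make([]float32, WordVecSize)
wv := wordvec(wte, token)
for i := range x {
  x[i] = wv[i] + wpe[pos*WordVecSize+i]
}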

Layers

The GPT-2 model is represented as a list of layers. In the previous part we represented each neuron as an object with scalar parameters, which was convenient for training. Here we only focus on forward propagation and can speed up our network significantly if we treat all the weights of all the neurons in a single layer as one vector. This batches the arithmetic operations and greatly reduces the number of “for” loops.

Here is all the layer data in the decomposed GPT-2 model:

4.0K    h1_attn_cproj_b.dat
4.0K    h1_ln1_b.dat
4.0K    h1_ln1_g.dat
4.0K    h1_ln2_b.dat
4.0K    h1_ln2_g.dat
4.0K    h1_mlp_cproj_b.dat
 12K    h1_attn_cattn_b.dat
 12K    h1_mlp_cfc_b.dat
2.3M    h1_attn_cproj_w.dat
6.8M    h1_attn_cattn_w.dat
9.0M    h1_mlp_cfc_w.dat
9.0M    h1_mlp_cproj_w.dat
...
4.0K    h2_...
...
4.0K    lnf_b.dat
4.0K    lnf_g.dat
364K    tokens.dat
3.0M    wpe.dat
147M    wte.dat

GPT-2 is a multi-layer neural network, just like the toy network we used to calculate XOR or classify moon-shaped points. Except that it is much, much larger. In its smallest variant (“124M”) it comes with 12 layers and 12 attention heads, word vectors of 768 elements and a context of up to 1024 tokens (these become the NumLayers, NumHeads, WordVecSize and ContextSize constants in the code below). Each layer is in fact a combination of a few smaller sub-layers with their own weights and biases.

Each vector or matrix within a layer is stored in an individual file, extracted from the original GPT-2 model. Simply reading each file as a sequence of float32 numbers fills in the parameter data for each layer.

Here’s the complete code for loading a model:

type Model struct {
  dir    string
  lnf_g  []float32
  lnf_b  []float32
  wte    []float32 // word token embeddings
  wpe    []float32 // word position embeddings
  layers []Layer
}

type Layer struct {
  ln1_b        []float32
  ln1_g        []float32
  ln2_b        []float32
  ln2_g        []float32
  mlp_cfc_b    []float32
  mlp_cfc_w    []float32
  mlp_cproj_b  []float32
  mlp_cproj_w  []float32
  attn_cattn_b []float32
  attn_cattn_w []float32
  attn_cproj_b []float32
  attn_cproj_w []float32
  k            []float32 // key cache: one row of WordVecSize per context slot
  v            []float32 // value cache: stored transposed, one row of ContextSize per dimension
}

func LoadModel(dir string) (m Model) {
  m.dir = dir
  m.lnf_g = m.read("lnf_g.dat")
  m.lnf_b = m.read("lnf_b.dat")
  m.wte = m.read("wte.dat")
  m.wpe = m.read("wpe.dat")
  m.layers = make([]Layer, NumLayers)
  for i := range m.layers {
    l := &m.layers[i]
    l.ln1_g = m.read(fmt.Sprintf("h%d_ln1_g.dat", i))
    l.ln1_b = m.read(fmt.Sprintf("h%d_ln1_b.dat", i))
    l.ln2_g = m.read(fmt.Sprintf("h%d_ln2_g.dat", i))
    l.ln2_b = m.read(fmt.Sprintf("h%d_ln2_b.dat", i))
    l.mlp_cfc_w = m.read(fmt.Sprintf("h%d_mlp_cfc_w.t", i))
    l.mlp_cfc_b = m.read(fmt.Sprintf("h%d_mlp_cfc_b.dat", i))
    l.mlp_cproj_w = m.read(fmt.Sprintf("h%d_mlp_cproj_w.t", i))
    l.mlp_cproj_b = m.read(fmt.Sprintf("h%d_mlp_cproj_b.dat", i))
    l.attn_cproj_w = m.read(fmt.Sprintf("h%d_attn_cproj_w.t", i))
    l.attn_cproj_b = m.read(fmt.Sprintf("h%d_attn_cproj_b.dat", i))
    l.attn_cattn_w = m.read(fmt.Sprintf("h%d_attn_cattn_w.t", i))
    l.attn_cattn_b = m.read(fmt.Sprintf("h%d_attn_cattn_b.dat", i))
    l.k = make([]float32, ContextSize*WordVecSize)
    l.v = make([]float32, ContextSize*WordVecSize)
  }
  return m
}

Now we have lots and lots of numbers. But what kind of operations should we perform to make it compose a sentence?

Some arrays end with “w” and some with “b”: those are weights and biases, so we will probably end up doing the usual multiply-and-add math a lot here. All blocks having “proj” in their names perform this operation (also known as a projection from one vector space into another).

There are also “g” and “b” pairs. These stand for “gamma” and “beta” from the layer normalisation process (described below). They scale the normalised values by multiplying them by “g” and adding “b”.

Self-attention

So far we’ve figured out the roles of most of these blocks: the “ln” gamma/beta pairs perform layer normalisation, the “proj” and “fc” weight/bias pairs perform linear projections, and WTE/WPE turn tokens and their positions into vectors. The only unknowns are k, v and the “cattn” attention block. But what is attention anyway?

Just as it does for us human beings, attention helps the model focus on certain words in the sentence rather than others. Self-attention means that it only attends to words from the same input context (sentence).

The main components of the self-attention layer are a query q, which represents the current word; keys k, which represent the other words in the current context; and values v, which are added to the current word’s data if the associated key looks relevant to the query.

Multiplying the query vector by each of the key vectors gives us a score of how relevant the word behind that key is to the query. Multiplying the corresponding values by their scores and summing them up results in a self-attention vector that gives the model more context about the input data.

Imagine a team meeting, where every team member is talking about their own work. However, other team members adjust their attention depending on how relevant each topic is. At the end, the team collectively focuses on the most important topics for the whole team. Now, replace “team members” with individual words/tokens in the current sentence (context). Self-attention helps to score individual tokens based on their importance to the whole context.
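
In code the idea could be sketched like this for a single head, using the lin and softmax helpers defined in the “Do the math” section below (a toy illustration, not the actual model code):

// toy single-head self-attention: score each key against the query,
// turn the scores into weights and blend the values with those weights
func attend(q []float32, keys, values [][]float32) []float32 {
  scores := make([]float32, len(keys))
  for i, k := range keys {
    scores[i] = lin(q, k, 0) // dot product: relevance of key i to the query
  }
  scores = softmax(scores)
  out := make([]float32, len(q))
  for i, v := range values {
    for j := range out {
      out[j] += scores[i] * v[j]
    }
  }
  return out
}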

Do the math

At this point we’ve figured out how to load the model data into float32 slices, how to read tokens and how to tokenise the input prompt. What’s missing are a few helper math functions that simplify the layer operations.

First, we’ll need the good old linear X·W+b function:

func lin(x, w []float32, b float32) float32 {
  for i := range x {
    b += x[i] * w[i]
  }
  return b
}

We will also need a slightly different activation function. Instead of ReLU, which we’ve used so far, the GPT-2 model requires the GELU function, which is similar in shape but has a more complex implementation:

func gelu(x float32) float32 {
  return 0.5 * x * (1 + float32(math.Tanh(0.7978845676080871*float64(x+0.044715*x*x*x))))
}

To handle normalisation within layers we also need a separate function. It subtracts the mean from every element of X, divides by the standard deviation, then multiplies by gamma and adds beta:

func norm(x, beta, gamma []float32) []float32 {
  mean, sqmean := float32(0.0), float32(0.0)
  for i := range x {
    mean += x[i]
  }
  mean = mean / float32(len(x))
  for _, xi := range x {
    sqmean += (xi - mean) * (xi - mean)
  }
  sqmean = float32(math.Max(float64(sqmean/float32(len(x))), 0.0000001))
  m := float32(math.Sqrt(1.0 / float64(sqmean)))
  out := make([]float32, len(x))
  for i, xi := range x {
    out[i] = (xi-mean)*m*gamma[i] + beta[i]
  }
  return out
}

Finally, we will need “softmax”, which converts a vector X into a probability distribution, so that each value is in the range [0..1] and they all sum to 1:

func softmax(x []float32) []float32 {
  out := make([]float32, len(x))
  max, sum := float32(math.Inf(-1)), float32(0)
  for i := range x {
    if x[i] > max {
      max = x[i]
    }
  }
  for i := range x {
    x[i] = float32(math.Exp(float64(x[i] - max)))
    sum += x[i]
  }
  for i := range x {
    out[i] = x[i] / sum
  }
  return out
}

That’s all the math we need to run a GPT-2 model.

Running a single layer

A single layer takes a vector as an input, as well as a slot index. First of all it handles the self-attention. It normalises the input vector and, using the attention weights and biases (lin(xn, cattn_w, cattn_b)), calculates the query q, key k and value v vectors. A softmax(q·k) then results in a vector of scores for each key.

In GPT-2 self-attention happens simultaneously a number of times, and each calculation happens more or less independently from the others. The smallest GPT-2 model has 12 “heads”, and each head has its own query, key and value vectors, resulting in its own set of scores. Concatenating the results from all the heads gives a vector that represents the multi-head attention of the whole layer.

However, simply forwarding this vector to the next layer would not give good results. We need another operation that would project the self-attention results into a more suitable vector. This part is called “projecting” and is implemented as another matrix multiplication lin(attn, cproj_w, cproj_b).

Finally there are two fully-connected dense sub-layers: one four times larger than the word vector length, and another reducing it back to the word vector length (768 for the smallest GPT-2). Why two of them, and why are they different? The more neurons the network has, the more “knowledge” it can contain, but all this knowledge needs to be compressed back to fit the dimensionality of the following layer.

Perhaps code speaks better than words; here’s how a single layer operates:

func (m Model) runLayer(x []float32, layer, slot int) {
	l := m.layers[layer]
	// self-attention: normalise the input, compute the query q for this token
	// and store its key and value in this slot of the layer's k/v cache
	xn := norm(x, l.ln1_b, l.ln1_g)
	q := make([]float32, WordVecSize)
	for i := 0; i < WordVecSize*3; i++ {
		a := lin(xn, l.attn_cattn_w[WordVecSize*i:WordVecSize*(i+1)], l.attn_cattn_b[i])
		if i < WordVecSize {
			q[i] = a
		} else if i < WordVecSize*2 {
			l.k[slot*WordVecSize+(i-WordVecSize)] = a
		} else {
			l.v[(i-WordVecSize*2)*ContextSize+slot] = a
		}
	}

	// each head scores the previous slots against its part of the query
	// and blends the corresponding values using those scores as weights
	const headSize = 64
	tmp := make([]float32, WordVecSize)
	for h := 0; h < NumHeads; h++ {
		att := make([]float32, slot+1)
		for i := 0; i <= slot; i++ {
			att[i] = lin(q[h*headSize:(h+1)*headSize], l.k[i*WordVecSize+h*headSize:], 0) / 8 // scaled by √headSize = 8
		}
		att = softmax(att)
		for j := 0; j < headSize; j++ {
			tmp[h*headSize+j] = lin(att, l.v[(j+h*headSize)*ContextSize:], 0)
		}
	}
	// project the concatenated heads and add the result to x (residual connection)
	for i := 0; i < WordVecSize; i++ {
		x[i] += lin(tmp, l.attn_cproj_w[WordVecSize*i:], l.attn_cproj_b[i])
	}
	// feed-forward: expand to 4×WordVecSize with GELU, project back and add to x
	xn = norm(x, l.ln2_b, l.ln2_g)
	mlp := make([]float32, WordVecSize*4)
	for i := 0; i < WordVecSize*4; i++ {
		mlp[i] = gelu(lin(xn, l.mlp_cfc_w[WordVecSize*i:], l.mlp_cfc_b[i]))
	}
	for i := 0; i < WordVecSize; i++ {
		x[i] += lin(mlp, l.mlp_cproj_w[WordVecSize*4*i:], l.mlp_cproj_b[i])
	}
}

This was the most complicated part of GPT-2, the rest is just calling runLayer in a loop for every layer:

// Run feeds the token at the given slot through all the layers and
// returns the final normalised output vector
func (m Model) Run(context []int, slot int) []float32 {
	x := make([]float32, WordVecSize)
	wv := wordvec(m.wte, context[slot])
	for i := range x {
		x[i] = m.wpe[i+WordVecSize*slot]
		x[i] += wv[i]
	}
	for i := range m.layers {
		m.runLayer(x, i, slot)
	}
	return norm(x, m.lnf_b, m.lnf_g)
}

You might wonder why we need a slot parameter. Self-attention would normally consider the complete input context, treating all of its “slots” equally. But as we feed the input token by token into the network, the slots are filled one after the other and no slots after the current one have any meaning yet. So we give the network a hint that it should consider the previous slots but ignore the following ones. This is called “masked self-attention”.

Decoding

The final layer of the network returns another 768-element vector of numbers. How can we translate it into a word?

For every known token from tokens.dat we multiply the output vector by that token’s word vector. The result is a single number indicating how suitable the token is to become the next one in the sentence. We choose a subset of the most suitable candidates and randomly pick one of them. The resulting token is added to the context, and the whole context is fed into the network again. The output of the network yields candidates for the following token, and so on. The process continues for as long as needed, or until the network returns a special “end of text” token, meaning that it has rested its case.
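
Putting it together, a generation loop could look roughly like this, where m is the Model returned by LoadModel and pickTopK is a hypothetical helper that samples one of the highest-scoring candidates:

// feed the prompt token by token, then keep sampling new tokens (sketch)
context := tokenise("Paris is the capital of")
for slot := 0; slot < ContextSize; slot++ {
  out := m.Run(context, slot) // fills this slot's k/v caches in every layer
  if slot < len(context)-1 {
    continue // still consuming the prompt
  }
  // score every known token: dot product of the output with its word vector
  logits := make([]float32, len(tokens))
  for i := range logits {
    logits[i] = lin(out, wordvec(m.wte, i), 0)
  }
  next := pickTopK(logits, 5) // hypothetical: sample one of the 5 best candidates
  context = append(context, next)
  fmt.Print(tokens[next])
}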

Time to test our network. GPT-2 comes in different model sizes; the one with 124M parameters is in the repo, and the rest can be converted from the publicly available GPT-2 weights.

We can ask the network to continue phrases and see how it copes with the task:

"We finish each other's..."
> sentences
> work
> meals

"Berlin is a..."
> small, but important city.
> major center.
> major transit destination

"Politicians are..."
> divided at their view of Russia
> now calling a referendum for May
> already looking forward, saying they want better services, more choice
> not opposed to abortion rights, nor can their views be a...
> concerned about China's growing political sophistication

"To be or not to be?"
> Are al your lives in vain?
> This isn't the final question we got
> And yet, it is the greatest of virtues

The whole code is available on GitHub and it’s all under 300 LOC!

AI?

Clearly, the model is hallucinating. Berlin is not a small city at all, nor is there a referendum in May. But for an absolutely tiny model with 124M parameters (about the size of a mouse brain) it produces rather meaningful text. If you print a list of top candidates for each token, you might be surprised to see how much knowledge the model has about the world.

The final decision on whether such models are “intelligent” or not is left to the reader, but the results suggest that large networks trained on large data sets are the way to go for artificial intelligence today. Or are they?

Next: TinyStories (Coming soon!)

I hope you’ve enjoyed this article. You can follow – and contribute to – this project on GitHub, Mastodon, Twitter or subscribe via RSS.

Jan 04, 2024

See also: AI or ain't: Neural Networks and more.