AI or ain't: Neural Networks

Previously we explored early “intelligent” software based on rules and Markov chains, but the results were not too convincing. Modern AI is all about neural networks, so let’s fast-forward to the time when neural networks were invented, which is… 1944? Correct, the idea of an artificial neuron dates back many decades, and it’s a very simple one, really.

Neural network in 6 lines of code

A neural network is nothing but a few multiplications and additions. Maybe with an occasional if to add some flavour.

Here’s a bite-sized neural network in JavaScript – an input layer, one hidden layer and one output layer. It can predict the value of the binary XOR function of its two inputs:

const relu = x => x < 0 ? 0 : x;
// Run the network for the given input array X
const nn = X => {
  // Define the parameters for the network
  const W1 = [[-1, -1], [-0.7, -0.7]], W2 = [-1.8, 1.3];
  const b1 = [1, 1.3], b2 = 0;
  // Calculate the output
  const hidden = b1.map((b, i) => relu(X.reduce((a, x, j) => a+x*W1[i][j], 0) + b));
  return relu(hidden.reduce((a, x, i) => a+x*W2[i], 0) + b2);
};

// Predict XOR function:
console.log(nn([0, 0])); // 0^0 => 0
console.log(nn([0, 1])); // 0^1 => 0.78
console.log(nn([1, 0])); // 1^0 => 0.78
console.log(nn([1, 1])); // 1^1 => 0

In six lines of code we have an “AI”. Maybe not the flashiest in town, but it gets the job done.

Now, for the uninitiated, things might seem a bit muddled. What’s with this “relu”, what are W1, W2, b1, b2 and why do we do these map/reduces?

W are the weight arrays. All inputs in a network are multiplied by some weight coefficients, then summed up. Optionally, a “bias” number (b) is added to that.

ReLU is a “rectified linear unit”, but practically it’s just max(x, 0), added to give the network some non-linearity. Without it our network would be stuck with linear functions like ax+b, no matter how we shuffled things. The ReLU function makes a “broken line”: for negative numbers it returns zero. Combining linear functions with ReLU and growing the size of the W/b arrays can suddenly produce very complex shapes and functions and solve really complex problems.
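To illustrate, the absolute value function – a classic “broken line” – can be assembled from just two ReLUs (a toy example, not part of our network):

// |x| built from two linear pieces joined by ReLU:
const abs = x => relu(x) + relu(-x);
console.log(abs(-3)); // 3
console.log(abs(2));  // 2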

Functions like ReLU are known as “activation functions”. You could swap ReLU for sin(), tanh(), exp() or a Gaussian, but we’ll keep it straightforward and stick with ReLU.
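For the curious, a couple of those alternatives could look like this (just a sketch – we won’t use them here):

const tanh = x => Math.tanh(x);         // smooth S-curve between -1 and 1
const gaussian = x => Math.exp(-x * x); // bell-shaped bump around zero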

Now there’s one last bit of magic left in those 6 lines of code implementing a network. Who gets to decide the values for W and b?

Training a network

This is where neural networks shine. They can be automatically trained to pick the right weights to solve the given problem. Humans only need to feed them inputs and expected outputs, and give them some time to grok it.

For each iteration of the training phase the network predicts an output, compares it with the expected one, calculates the difference and adjusts the weights.

Basically, we have a rather long chain of functions like relu(X*W+b) and our network tries to pick the lock by finding the right W and b to match the expected results.

Imagine a guessing game. We have a formula X+b. I have picked some b that you have to guess. For every guess of yours I will name some X and the expected output of X+b. You will have to correct your value of b according to my clues.

Say, your guess is b=5. I say: actually, for X=8 I expected the output to be 17, while with your b=5 the output is 13. Now you know you are off by 4 and should increase your b by 4 as well, getting b=5+4 or b=9.

Now, we have a formula X•W. You’ve picked some random W=3. I say that for X=5 the result should be 25, not 15. Now you know that you are off by 10, and since X was 5, you need to adjust W by 10÷5=2, making your next guess W=5 the right one.
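Both guessing games fit in a couple of lines of code (a toy illustration of the idea):

// The X+b game: adjust b by the amount we missed the target
let b = 5;
b += 17 - (8 + b); // off by 4 => b = 9
// The X*W game: the adjustment depends on the multiplier
let W = 3;
W += (25 - 5 * W) / 5; // off by 10, X = 5 => adjust by 2 => W = 5
console.log(b, W); // 9 5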

Such adjustments are known as gradients. These are very basic examples, but as our network becomes larger and the formulas become more complex, finding the right gradients becomes not so trivial.

Autograd

Some machine learning libraries have a concept of autograd: an automatic differentiation engine that calculates gradients. How does it work?

Let’s start with a scalar type for all our numbers:

const Scalar = (val) => {
  let x = {
    val,
    add: y => Scalar(x.val + y.val),
    mul: y => Scalar(x.val * y.val),
  };
  return x;
};

const a = Scalar(5), b = Scalar(3), c = Scalar(2);
const d = a.add(b).mul(c); // d.val == 16

So far it’s only used to abstract away all arithmetic operations. Most ML frameworks do that to build a graph of operations and execute that graph later on GPU or somewhere in the cloud.

As we build the operation graph we can additionally store references to other scalars (that were inputs to this operation) and how much they would impact the result of this operation (gradients):

const Scalar = (val, refs = [], gradrefs = []) => {
  let x = {
    val, refs, gradrefs, grad: 0,
    add: (y) => Scalar(x.val + y.val, [x, y], [1, 1]),
    mul: (y) => Scalar(x.val * y.val, [x, y], [y.val, x.val]),
  };
  return x;
};
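For example, multiplying two scalars records both inputs along with their gradients – each input’s gradient is simply the other input’s value:

const a = Scalar(5), b = Scalar(3);
const c = a.mul(b);
console.log(c.val);      // 15
console.log(c.gradrefs); // [3, 5]: c grows by 3 per unit of a, by 5 per unit of b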

From the guessing game above we’ve learned that for the addition operation we should adjust the parameter linearly (e.g. if we missed the target by 10, we adjust the parameter by 10). But for multiplication it depends on the multiplier value (e.g. if we are off by 12 and the other multiplier was 4, we should adjust by 12/4=3). We can add more operations: division is the inverse of multiplication and subtraction is the inverse of addition:

const Scalar = ... {
    sub: (y) => Scalar(x.val - y.val, [x, y], [1, -1]),
    div: (y) => Scalar(x.val / y.val, [x, y], [1/y.val, -x.val/(y.val*y.val)]),
    pow: (n) => Scalar(Math.pow(x.val, n), [x], [n*Math.pow(x.val, n-1)]),
    relu: () => Scalar(Math.max(x.val, 0), [x], [+(x.val > 0)]),
    ...
}

Note that pow() only supports a constant n as an input (not a Scalar), to simplify the derivative calculation.

Now, having a sufficiently complex graph of operations, how do we propagate the gradient value through it? We need to iterate recursively from the result of the operation graph back to its inputs, processing each node once and only once. We could keep a flag inside Scalar, or we can put all nodes into an ordered set as we scan the graph to avoid repetitions:

const Scalar = ... {
    ...
    backward: () => {
      // Traverse the graph and order the nodes: inputs first, result node last
      const visited = [], nodes = [];
      const buildGraph = (n) => {
        if (visited.indexOf(n) < 0) {
          visited.push(n);
          n.refs.forEach(buildGraph);
          nodes.push(n);
        }
      };
      buildGraph(x);
      // Start from the result node and push gradients back towards the inputs
      x.grad = 1;
      nodes.reverse();
      nodes.forEach(n => n.refs.forEach((ref, i) =>
        ref.grad += n.grad * n.gradrefs[i]));
    }
}

In this way the value of grad=1 is propagated back to the leaves of the operation tree, to the very inputs. The resulting gradients there indicate how much influence each of the inputs has on the resulting value of the formula, allowing us to adjust those parameters accordingly.
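Let’s check it on the example from before. Since d = (a + b) * c, the gradients should be c for both a and b, and a + b for c:

const a = Scalar(5), b = Scalar(3), c = Scalar(2);
const d = a.add(b).mul(c);
d.backward();
console.log(a.grad, b.grad, c.grad); // 2 2 8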

Building a smarter network

Having a Scalar, we can use it as a basis for our multi-layer network, which will be able to train itself.

It’s probably worth mentioning that real ML frameworks use “tensors” instead of scalars, performing operations on whole arrays and matrices at once to make the most of the available processing power. They also tend to use operator overloading to hide this operation graph behind the scenes. We, on the other hand, are building a very slow and inefficient, but very explicit and simple network.

The smallest element of a neural network is a neuron. It’s a virtual unit that in our case performs a single relu(W*X+b) operation, where W is an array of weights, X is an input array and b is a bias value.

const Neuron = (numInputs) => {
  let n = {
    w: Array(numInputs).fill(0).map((_, i) => Scalar(Math.random()*2-1)),
    b: Scalar(0),
    params: () => [...n.w, n.b],
    eval: (x) => x.map((xi, i) => xi.mul(n.w[i])).reduce((a, xwi) => a.add(xwi), n.b).relu(),
  };
  return n;
};
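Note that eval() operates on Scalars rather than plain numbers (a quick check with random weights):

const n = Neuron(2);
const out = n.eval([Scalar(1), Scalar(0)]); // out is a Scalar
console.log(out.val); // some value >= 0, depending on the random weights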

Neurons are combined into layers. The most common layer type is the “dense” layer, meaning that all outputs from the previous layer serve as inputs for every neuron in the current layer. Here nout is the number of layer outputs (equal to the number of neurons in the layer) and nin is the number of inputs of each neuron in this layer:

const Layer = (nin, nout) => {
  let l = {
    neurons: Array(nout).fill(0).map(() => Neuron(nin)),
    params: () => l.neurons.reduce((p, n) => [...p, ...n.params()], []),
    eval: x => l.neurons.map(n => n.eval(x)),
  };
  return l;
};

Finally, the network (or “multi-layer perceptron”) is an array of dense layers, where outputs from one layer are passed as inputs to the following one:

const MLP = (N) => {
  let mlp = {
    layers: N.slice(1).map((n, i) => Layer(N[i], N[i+1])),
    eval: x => mlp.layers.reduce((a, l) => l.eval(a), x),
    params: () => mlp.layers.reduce((p, l) => [...p, ...l.params()], []),
    train: (Xset, Yset, rate = 0.1) => {
      const out = Xset.map(xs => mlp.eval(xs.map(x => Scalar(x))));
      const err = Yset.map((ys, i) => ys.map((ysi, j) => Scalar(ysi).sub(out[i][j]).pow(2)));
      const totalErr = err.reduce((e, es) => e.add(es.reduce((a, e) => a.add(e), Scalar(0))), Scalar(0)).div(Scalar(err.length));
      mlp.params().forEach(p => p.grad = 0);
      totalErr.backward();
      mlp.params().forEach(p => p.val -= p.grad * rate);
      return totalErr;
    },
  };
  return mlp;
};

The most exciting part here is the “train” method. It takes an array of inputs (Xset) and an array of expected outputs (Yset). It calculates the actual outputs (out) by evaluating every input vector from Xset. Then it calculates the total error ∑(out[i]-Y[i])², averaged over the number of samples, between the expected outputs and the actual ones.

Total error is the final result of the calculation graph of our network. So we reset the network gradients and run the autograd backward() method to propagate the gradients back to each and every parameter according to its influence.

Finally, we adjust those parameters in a loop. Note that we don’t adjust them by the full gradient. Instead we use a much smaller portion of it, defined by the learning rate. The higher the learning rate, the more radical the changes to the network, which may sometimes cause it to miss the goal. Smaller steps take more time, of course, but are usually more reliable.

Can we teach it to solve the XOR function?

const X = [[0, 0], [0, 1], [1, 0], [1, 1]];
const Y = [[0], [1], [1], [0]];
// A network with 2 inputs, 2 hidden neurons and 1 output (9 parameters total)
const nn = MLP([2, 2, 1]);
for (let i = 0; i < 100; i++) {
  nn.train(X, Y, 0.05);
}
nn.params().forEach(p => console.log(p.val));
// Params: 1, 1, -1, 1, 1, 0, -1.8, 0.9, 0

These params are not even close to the ones we used at the beginning, but nevertheless the new network predicts the XOR function just fine.
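We can double-check it; keep in mind that eval() expects an array of Scalars and returns an array of Scalars:

const predict = x => nn.eval(x.map(v => Scalar(v)))[0].val;
console.log(predict([0, 0]), predict([1, 1])); // both close to 0
console.log(predict([0, 1]), predict([1, 0])); // both close to 1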

What about more complex problems? There is the well-known “two moons” problem, where 2D points shaped like two almost-overlapping crescent moons need to be classified. A network has to learn to tell, from a point’s coordinates (x, y), which moon it belongs to.

Test data for the two moons problem can be generated like this:

// Test data generator for the "two moons" problem
const moons = (n) => {
  const unif = (low, high) => Math.random() * (high - low) + low;
  const data = [], labels = [];
  for (let i = 0; i < Math.PI; i += (Math.PI * 2) / n) {
    data.push([Math.cos(i) + unif(-0.1, 0.1) - 0.5, Math.sin(i) + unif(-0.1, 0.1) - 0.3]);
    labels.push([0]);
    data.push([0.5 - Math.cos(i) + unif(-0.1, 0.1), 0.2 - Math.sin(i) + unif(-0.1, 0.1)]);
    labels.push([1]);
  }
  return [data, labels];
};

const [X, Y] = moons(200);

We can now define the structure of our network and train it:

const nn = MLP([2, 8, 4, 1]);

for (let i = 0; i < 100; i++) {
  const e = nn.train(X, Y, 0.2);
  console.log(i, e.val);
}

In the first few iterations the error drops rapidly to only ~5%, and then the network keeps reducing it slowly and steadily. Finally, after 100 iterations, it manages to classify the points belonging to the two moons perfectly:

[Figure: the two moons dataset, correctly classified by the trained network]
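To double-check without plotting, we can count how many points end up on the right side of the 0.5 threshold (a quick sanity check):

const correct = X.filter((xs, i) => {
  const out = nn.eval(xs.map(v => Scalar(v)))[0].val;
  return (out > 0.5) === (Y[i][0] === 1);
}).length;
console.log(correct / X.length); // ≈1 after successful training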

Full code is available on GitHub.

Of course, training a real practical network is much harder. It’s a very slow process that would probably require a GPU or a TPU. You would need a proper framework like PyTorch or TensorFlow or Keras (although even tinygrad works surprisingly well for small cases and is much easier to learn and understand). You would have to be aware of proper weight initialisation, optimisers, learning rate management, regularisation, dropout layers, batching and many other topics that turn machine learning into its own branch of science.

AI?

But even our toy neural network, despite its slowness, is a real machine learning algorithm. Is it intelligent, though? It can’t speak yet, but it can be trained automatically and gain knowledge about problems it has never seen before. What if, instead of numbers, such a network could take words as an input and produce sentences as an output?

Next part: Large Language Models.

I hope you’ve enjoyed this article. You can follow – and contribute to – this blog on GitHub, Mastodon, Twitter, or subscribe via RSS.

Jan 03, 2024

See also: AI or ain't: Markov Chains and more.