Intro
There's a lot. Like, a LOT. Of math, and history, that we can go into with regards to neural networks. For the sake of trying to make the article more approachable (if you're using my articles as a reference paper for your classes you REALLY shouldn't), I'm going to gloss over anything that's not essentially high school math. Let it be known that I will not be directly referencing many of the founding papers of neural networks.
If you're interested in the specifics of neural models with regards to LLMs, try reading this paper by Bengio et al. And for those who want to approach the math of backprop in more detail, read this blog post by Karpathy!
Learning Basics
Neural networks can easily be seen as a black box that takes in some input and spits out some output. Okay, great. For the sake of introduction, suppose that it is, in fact, a black box. Let's create a basic 1-node network that will learn to give us y=x.
The node starts off by operating on one input (a number), and gives one output (also a number). The node has a weight, which is a number that it multiplies the input by. The node also has a bias, which is a number that it adds to the output. These are both randomly initialized.
Using python, our node looks like this:
```python
import random

class Node:
    def __init__(self):
        self.weight = random.random()
        self.bias = random.random()

    def forward(self, x):
        return x * self.weight + self.bias
```
Note that we are using the `forward` method to compute the output of the node. This is because we will later add more methods to the node, and we want to keep the code organized.
Adjusting for Loss
Now, we want it to give us y=x, the "target". We will do this by using a loss function, which will figure out how far away the output is from the "target", and then update the weights and biases accordingly.
```python
class Node:
    def __init__(self):
        self.weight = random.random()
        self.bias = random.random()

    def forward(self, x):
        return x * self.weight + self.bias

    def loss(self, x, y):
        output = self.forward(x)
        # Note that we square this here to make sure that the loss is always positive.
        # This is called the Mean Squared Error (MSE) loss.
        return (output - y) ** 2
```
Now that we have this loss, how do we update the weights and biases? We will use a method called gradient descent. This is a method that will update the weights and biases in the direction that will minimize the loss. This goes into the math of derivatives, but the essence is that we figure out how much the loss will change if we change the weight or bias by a small amount, and then update the weight or bias in the direction that will minimize the loss.
We calculate the gradient as

$$\frac{\partial L}{\partial w} = \frac{2}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)\, x_i$$

where $L$ is the loss, $w$ is the weight, $n$ is the number of samples, $y$ is the target, $\hat{y}$ is the output, and $x$ is the input.
This is, specifically, the gradient of the MSE loss function. MSE is commonly used for regression problems (like this one), because it is easy to compute and differentiable.
Much more generally, the gradient can be calculated as

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial w}$$

where $\hat{y}$ is the output of the model, and $w$ is the weight. This is called the chain rule, and it allows us to compute the gradient of any loss function with respect to any weight.
Then, we update the weights and biases as

$$w \leftarrow w - \alpha \frac{\partial L}{\partial w}$$

where $\alpha$ is the learning rate, which is a hyperparameter that controls how much we update the weights and biases. Hyperparameters are parameters that are not learned (like weights and biases are, for us) but are set before training.
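To make this concrete, here's one gradient-descent step worked out by hand on a single sample. The starting values ($w = 0.5$, $b = 0$, $\alpha = 0.1$) are just made up for illustration:

```python
# One gradient-descent step for a single node on one sample (illustrative values).
w, b = 0.5, 0.0      # current weight and bias
x, y = 2.0, 2.0      # input and target (we want y = x)
alpha = 0.1          # learning rate

output = w * x + b            # forward pass: 1.0
loss = (output - y) ** 2      # squared error: 1.0

# Gradients of the squared-error loss via the chain rule
dL_dout = 2 * (output - y)    # -2.0
dL_dw = dL_dout * x           # -4.0
dL_db = dL_dout * 1           # -2.0

# Update step: move against the gradient
w -= alpha * dL_dw            # 0.5 - 0.1 * (-4.0) = 0.9
b -= alpha * dL_db            # 0.0 - 0.1 * (-2.0) = 0.2

print(w, b)  # 0.9 0.2
```

Notice that after just one step, the node's output for this sample is already $0.9 \cdot 2 + 0.2 = 2.0$, exactly the target.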
Generalizing this again, we could imagine that a node with many weights and a bias can be computed as

$$y = f\left(\sum_{i} w_i x_i + b\right)$$

where $f$ is some function (we're going to go over this later, but if you're familiar with activation functions, this is that), and $w_i$ and $b$ are the weights and the bias. For now note we aren't using any activation function, so $f$ is just the identity function.
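A node with many inputs can be sketched in plain Python like this (the `MultiNode` name is my own, not standard):

```python
import random

class MultiNode:
    """A node with one weight per input and a single bias (f = identity for now)."""
    def __init__(self, n_inputs):
        self.weights = [random.random() for _ in range(n_inputs)]
        self.bias = random.random()

    def forward(self, xs):
        # f(sum_i w_i * x_i + b), with f as the identity function
        return sum(w * x for w, x in zip(self.weights, xs)) + self.bias

node = MultiNode(3)
print(node.forward([1.0, 2.0, 3.0]))  # some number, since weights start random
```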
Training
Now that we have all the pieces, we can "train" our node by giving it a bunch of inputs and the corresponding targets (we call this the "dataset"), and then updating the weights and biases using gradient descent.
Essentially, since we already know y=x, we can just give it a bunch of input/output pairs, and if we define the gradient descent correctly it will learn to give us y=x.
```python
import random

class Node:
    def __init__(self):
        self.weight = random.random()
        self.bias = random.random()

    def forward(self, x):
        return x * self.weight + self.bias

    def loss(self, x, y):
        output = self.forward(x)
        return (output - y) ** 2

    def gradient(self, x, y):
        output = self.forward(x)
        dL_dy = 2 * (output - y)
        # The weight's gradient is scaled by the input;
        # the bias's gradient is just dL_dy itself.
        return dL_dy * x, dL_dy

    def update(self, x, y, alpha):
        grad_w, grad_b = self.gradient(x, y)
        self.weight -= alpha * grad_w
        self.bias -= alpha * grad_b

    def train(self, dataset, alpha, epochs=100):
        # Several passes over the dataset so the node has time to converge.
        for _ in range(epochs):
            for x, y in dataset:
                self.update(x, y, alpha)

node = Node()
# Inputs are kept small (0 to 1) so the gradient steps stay stable
# at this learning rate.
dataset = [(i / 100, i / 100) for i in range(100)]
node.train(dataset, 0.01)
```
Note that we never actually defined the function y=x anywhere in here! We just gave it a bunch of numbers and it was able to "correct" itself to learn the function.
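If you want to sanity-check this end to end, here's a compact, self-contained version of the same loop; the seed, the normalized inputs, and the epoch count are my own choices for the demo, nothing canonical:

```python
import random

random.seed(0)  # reproducible demo
w, b = random.random(), random.random()
alpha = 0.01

# Inputs in [0, 1) keep the gradient steps stable at this learning rate.
dataset = [(i / 100, i / 100) for i in range(100)]

for _ in range(200):          # epochs: repeated passes over the dataset
    for x, y in dataset:
        output = w * x + b
        dL_dy = 2 * (output - y)
        w -= alpha * dL_dy * x   # weight step
        b -= alpha * dL_dy       # bias step

print(round(w, 3), round(b, 3))  # close to 1.0 and 0.0
```

After training, the weight ends up near 1 and the bias near 0, which is exactly y=x.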
Imagine you're blind, and you have a ball in your hand. You want to throw the ball to your friend, but you can't see them. First, you throw the ball. It likely won't hit your friend, but you have an "oracle" that tells you how far off you were. You then adjust your throw based on the feedback, and throw again. You keep doing this until you hit your friend. This is essentially what we did here.
Neural Networks
Well, neural networks are just a bunch of nodes. Imagine that instead of one we had, let's say, 10 nodes, 100 nodes, or (as many LLMs do nowadays) tens of billions of parameters.
Structure
How do we connect these nodes? Well, we can connect them in a few ways. The most common way is to have a "layer" of nodes, where each node is connected to every node in the next layer. This is called a fully connected layer.
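As a sketch, a fully connected layer is just one weight per (node, input) pair plus one bias per node. This `Layer` class is my own illustration, not from any library:

```python
import random

class Layer:
    """A fully connected layer: every output node sees every input."""
    def __init__(self, n_inputs, n_nodes):
        # One weight per (node, input) pair, one bias per node.
        self.weights = [[random.random() for _ in range(n_inputs)]
                        for _ in range(n_nodes)]
        self.biases = [random.random() for _ in range(n_nodes)]

    def forward(self, xs):
        # Each node computes sum_i w_i * x_i + b over the SAME inputs.
        return [sum(w * x for w, x in zip(ws, xs)) + b
                for ws, b in zip(self.weights, self.biases)]

layer = Layer(n_inputs=3, n_nodes=2)
print(layer.forward([1.0, 2.0, 3.0]))  # two outputs, one per node
```

Stacking several of these, with each layer's outputs feeding the next layer's inputs, is the basic feed-forward network.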
There are also other types of layers, such as convolutional layers, which are used for image processing, and recurrent layers, which are used for sequence processing.
We can also have "skip connections", where we connect nodes from one layer to nodes in a later layer. This is called a residual connection.
Other types of model architectures include transformers, which are used for sequence processing, and graph neural networks, which are used for graph processing.
Many of these are not very relevant for LLMs and language modelling, so we won't go too hard into them in this series. However, I highly suggest going over them if you're interested in the field of machine learning!
Activation Functions
We were talking about that $f$ function earlier. This is called an activation function, and it is used to introduce non-linearity into the model.
Essentially, if you didn't have any activation functions, then the nodes (or neurons) could just "collapse". To illustrate, suppose we had 2 nodes as simple as the one earlier, in series. Then,

$$y_1 = w_1 x + b_1$$

and

$$y_2 = w_2 y_1 + b_2$$

Then, you could expand this to

$$y_2 = w_2 (w_1 x + b_1) + b_2 = (w_2 w_1) x + (w_2 b_1 + b_2)$$

Note that this is just a linear function $y = w x + b$ where $w = w_2 w_1$ and $b = w_2 b_1 + b_2$.

This means that if you had a network of 1000 nodes in series, you could just collapse it into one node with a weight of $w_{1000} \cdots w_2 w_1$ and a bias of $b_{1000} + w_{1000} b_{999} + w_{1000} w_{999} b_{998} + \cdots$.
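You can check this collapse numerically: two linear nodes in series match one node with the combined weight and bias, exactly. The numbers below are arbitrary:

```python
# Two linear nodes in series vs. one collapsed node (arbitrary values).
w1, b1 = 0.7, 0.3
w2, b2 = -1.2, 0.5

def two_nodes(x):
    y1 = w1 * x + b1      # first node
    return w2 * y1 + b2   # second node

# Collapsed equivalents: w = w2 * w1, b = w2 * b1 + b2
w, b = w2 * w1, w2 * b1 + b2

for x in [-2.0, 0.0, 3.5]:
    assert abs(two_nodes(x) - (w * x + b)) < 1e-12
print("two linear nodes == one linear node")
```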
Thus, we want "nonlinearities" in between in order to allow for complex behavior.
There are many activation functions, but the most common ones are:
- ReLU: $f(x) = \max(0, x)$
- Sigmoid: $f(x) = \frac{1}{1 + e^{-x}}$
- Tanh: $f(x) = \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$
- Softmax: $f(x_i) = \frac{e^{x_i}}{\sum_j e^{x_j}}$ (applied over a whole vector of inputs)
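For concreteness, here's each of these as a small plain-Python function (my own minimal implementations, not from any library):

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    return math.tanh(x)

def softmax(xs):
    # Subtracting the max first is a standard trick for numerical stability;
    # it doesn't change the result.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

print(relu(-2.0), sigmoid(0.0), tanh(0.0))   # 0.0 0.5 0.0
print(softmax([1.0, 1.0]))                   # [0.5, 0.5]
```

Note that softmax is different from the others: it takes the whole vector of outputs and turns them into values that sum to 1, which is why it shows up in classification output layers.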
There's a whole Wikipedia article with a table of other ones if you're interested!
The most common one is the ReLU, which is used in most modern neural networks. The sigmoid and tanh are used in older neural networks, and the softmax is used in the output layer of classification models.
There's a lot of reasons why you might use one over the other (and softmax in particular is only really used in the output layer of non-regression models), but it actually turns out that with enough compute and data you can pretty much use any activation function and get similar results.
Fin
Well, that's it for this one, just a quickie. This basic setup of nodes allows you to "learn" and regress to very basic functions, like simple polynomial equations.
In the next one we'll go over a bit of intermediate knowledge and then jump to RNNs, which is where LLMs really started to become powerful.