From Systems of Equations to Neural Networks: A Linear Algebra Journey

AI is all about prediction. Have you ever wondered how AI predicts? The answer is really simple: maths. AI uses maths to turn sets of inputs into better predictions. Generally, the inputs are a lot of data, which is a good thing for prediction. We could describe those inputs with linear equations, but how far can that go? A million equations solved one by one? Obviously not. So, to make models predict better, we need something better: better algorithms, and in particular, matrices.

One of the algorithms AI uses is linear regression, in which we solve a simple line equation and optimize its weights (how important each piece of data is) and its bias for better prediction. We need a lot of inputs (equations) to predict well, and here comes the role of matrices. We gather all the variables into matrices and then apply a few rules, which saves time and gives better results. So understanding matrices can help us. It's really simple: we can try to visualize it with lines. For the sake of understanding, assume we have a lot of data inputs (features) to feed in, such as wind speed, temperature, pressure, humidity, and so on, for better prediction.
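
As a minimal sketch (not the article's own code), here is linear regression written in matrix form with NumPy; the weather features and target values are invented purely for illustration.

```python
import numpy as np

# Hypothetical weather features: wind speed, temperature, pressure, humidity.
# Each row of X is one observation, each column one feature.
X = np.array([
    [12.0, 25.0, 1012.0, 0.60],
    [ 8.0, 22.0, 1008.0, 0.72],
    [15.0, 28.0, 1015.0, 0.55],
    [10.0, 20.0, 1010.0, 0.80],
    [14.0, 27.0, 1013.0, 0.58],
    [ 9.0, 21.0, 1009.0, 0.75],
])
y = np.array([30.0, 24.0, 35.0, 21.0, 33.0, 23.0])  # made-up target values

# Add a column of ones so the bias is learned as just another weight.
X_b = np.hstack([X, np.ones((X.shape[0], 1))])

# Solve the least-squares problem y ≈ X_b @ w in one matrix operation.
w, *_ = np.linalg.lstsq(X_b, y, rcond=None)

predictions = X_b @ w          # predictions for all observations at once
print(w)                       # weights followed by the bias
print(predictions)
```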

I know it’s not as good as seeing pictures of valleys, but it can equally help us move into new dimensions where we engage better with information. Information is really important for better prediction, but that does not mean we simply need more data or more equations. We don’t need data; we need information. For this topic, it helps to think of information as the solutions of a system.

So we learn that a system of equations generally gives one of three results: a unique solution, infinitely many solutions, or no solution. And for a model to predict better, we need more information. But what if the rows of a matrix are linearly dependent? Then part of the system is wasted. A couple of equations help to see why: x + y = 2 and 2x + 2y = 4 look like two pieces of information, but the second is just the first multiplied by 2, so it adds nothing new.
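
As a small sketch (the matrices are my own example, not from the article), NumPy can tell us which of the three cases we are in by comparing the rank of the coefficient matrix with the rank of the augmented matrix:

```python
import numpy as np

def solution_type(A, b):
    """Classify a linear system A x = b by comparing matrix ranks."""
    rank_A = np.linalg.matrix_rank(A)
    rank_Ab = np.linalg.matrix_rank(np.column_stack([A, b]))
    n_unknowns = A.shape[1]

    if rank_A < rank_Ab:
        return "no solution"            # the equations contradict each other
    if rank_A == n_unknowns:
        return "unique solution"        # as many independent equations as unknowns
    return "infinitely many solutions"  # dependent equations leave freedom

A = np.array([[1.0, 1.0],
              [2.0, 2.0]])              # second row is just twice the first
b = np.array([2.0, 4.0])
print(solution_type(A, b))              # -> infinitely many solutions
```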

The example above makes it clear that some equations bring the same information, which we don’t need. That is also computationally inefficient and expensive, so we have a toolbox to deal with it: row reduction, Gaussian elimination, rank, and row echelon form, all for the sake of better results. A good example is an image, where we can reduce the amount of stored data (its rank) while keeping most of the quality.
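
Here is a rough sketch of that idea, using a smooth synthetic matrix in place of a real photo; the rank cut-off k is an arbitrary choice:

```python
import numpy as np

# Stand-in for a smooth grayscale image: a gradient plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)
image = np.add.outer(x, x) + 0.01 * rng.standard_normal((100, 100))

# SVD splits the image into rank-1 pieces ordered by importance.
U, s, Vt = np.linalg.svd(image, full_matrices=False)

k = 5                                            # keep only the 5 strongest pieces
approx = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

stored = U[:, :k].size + k + Vt[:k, :].size      # numbers we actually keep
error = np.linalg.norm(image - approx) / np.linalg.norm(image)
print(image.size, stored, round(error, 4))       # 10000 numbers vs ~1005, small error
```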

Rank really helps us: we can preserve more information with less data. The takeaway is that a system of equations is non-singular if it carries as many independent pieces of information as it has equations, meaning we get the maximum amount of information while giving less load to the computer. For the sake of understanding, think of the data as points, and then try to visualize the system of equations as a linear transformation, as we were doing before with matrices; it makes things easier to grasp.
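
A tiny sketch of that mental picture (the matrix and points are arbitrary choices): a non-singular matrix keeps the points spread out, while a singular one squashes them onto a single line:

```python
import numpy as np

points = np.array([[1.0, 0.0],
                   [0.0, 1.0],
                   [1.0, 1.0]]).T          # three 2-D points as columns

non_singular = np.array([[2.0, 1.0],
                         [0.0, 1.0]])      # rank 2: keeps both directions
singular = np.array([[1.0, 2.0],
                     [2.0, 4.0]])          # rank 1: second row is twice the first

print(non_singular @ points)               # points are stretched/sheared, info kept
print(singular @ points)                   # every point lands on the line y = 2x
print(np.linalg.matrix_rank(non_singular), np.linalg.matrix_rank(singular))
```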

Now, moving to biology. Below is a picture of a neuron, which receives input signals in our brain and generates an output.

A lot of biology, isn’t it? So now, coming back, let’s see how neural networks and matrices work together. Take a simple NLP example in which an AI model has to tell whether an email is spam or not. We first get some structured data, and to train the model we set a check function. From there, it is simple matrix multiplication that trains our model to predict better results, following the same structure our neurons use.

Now think of a neuron and try to compare the above data tables with it. Yeah, you got the intuition. Now try matching the neuron to the image below.

Here, U is an activation function, a check that only outputs 0 or 1, or in other words, whether the email is spam or not.
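
Here is a minimal sketch of that artificial neuron as one matrix multiplication followed by the check U; the email features, weights, and bias are invented for illustration:

```python
import numpy as np

# Hypothetical email features: [number of exclamation marks, contains "free offer",
# sender is in contacts] for three emails (one row per email).
X = np.array([[5.0, 1.0, 0.0],
              [0.0, 0.0, 1.0],
              [3.0, 1.0, 1.0]])

w = np.array([0.4, 1.0, -1.5])    # weights: how much each feature matters
b = -0.5                          # bias

def U(z):
    """Step activation: the 'check' that outputs 1 (spam) or 0 (not spam)."""
    return (z > 0).astype(int)

z = X @ w + b                     # one matrix multiplication scores all emails
print(U(z))                       # -> [1 0 1] for these made-up numbers
```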

Now, let’s enhance our artificial neuron further using Principal Component Analysis (PCA). Before diving into PCA, we need to understand eigendecomposition.

Just as integers can be broken down into prime factors, matrices can be decomposed into eigenvalues and eigenvectors:

  • Eigenvalues represent how strongly the matrix stretches or shrinks things along certain directions.

  • Eigenvectors indicate those directions: the matrix only scales them, it does not rotate them.

This decomposition breaks matrices into meaningful components, making them easier to analyze, compress, and use for dimensionality reduction. This concept is used in Singular Value Decomposition (SVD).
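
A short sketch with NumPy (the matrix is an arbitrary example) shows the decomposition and its defining property, A v = λ v:

```python
import numpy as np

A = np.array([[4.0, 2.0],
              [1.0, 3.0]])

# Eigendecomposition: eigenvalues and the matching eigenvectors (as columns).
eigenvalues, eigenvectors = np.linalg.eig(A)

# Check the defining property A v = lambda * v for the first pair.
v = eigenvectors[:, 0]
lam = eigenvalues[0]
print(np.allclose(A @ v, lam * v))        # -> True

# Reconstruct A from its components: A = V diag(lambda) V^-1.
V = eigenvectors
reconstructed = V @ np.diag(eigenvalues) @ np.linalg.inv(V)
print(np.allclose(A, reconstructed))      # -> True
```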

PCA, a simple machine learning algorithm, applies lossy compression to data. It stores data in a way that requires less memory but may lose some precision. PCA is computationally efficient and helps improve AI performance. By leveraging eigendecomposition, PCA reduces data dimensions by finding directions of maximum variance, focusing on the most important features (inputs). This process relies on the covariance matrix derived from the data.
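
To close, here is a rough sketch of PCA built directly from the covariance matrix and its eigendecomposition; the data is randomly generated for illustration:

```python
import numpy as np

# Made-up data: 200 samples with 4 correlated features.
rng = np.random.default_rng(1)
base = rng.standard_normal((200, 2))
X = np.hstack([base, base @ rng.standard_normal((2, 2))])   # 4 features, mostly rank 2

# 1. Center the data.
X_centered = X - X.mean(axis=0)

# 2. Covariance matrix of the features.
cov = np.cov(X_centered, rowvar=False)

# 3. Eigendecomposition of the covariance matrix (eigh, since it is symmetric).
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# 4. Sort directions by variance (largest eigenvalue first) and keep the top 2.
order = np.argsort(eigenvalues)[::-1]
components = eigenvectors[:, order[:2]]

# 5. Project the data onto the principal components: 4 features -> 2.
X_reduced = X_centered @ components
print(X.shape, "->", X_reduced.shape)      # (200, 4) -> (200, 2)
```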