Technology and Art

We continue looking at the **Transformer** architecture from where we left off in Part 1. When we stopped, we had set up the Encoder stack, but had not yet added positional encoding or started work on the Decoder stack. In this post, we will focus on setting up the training cycle.

This is part of a series of posts breaking down the paper Plenoxels: Radiance Fields without Neural Networks, and providing (hopefully) well-annotated source code to aid in understanding.


It may seem strange that I’m jumping from implementing a simple neural network into **Transformers**. I will return to building up the foundations of neural networks soon enough: for the moment, let’s build a **Transformer** using PyTorch.

Programming guides are probably the first posts to become obsolete, as APIs are updated. Regardless, we will look at building simple neural networks in **PyTorch**. We won’t be starting from models with a million parameters, however. We will proceed from the basics, starting with a single neuron, talk a little about tensor notation and how it relates to our usual mathematical convention of representing everything with column vectors, and scale up from there.

In this article, we will build up our mathematical understanding of **Gaussian Processes**. We will understand the conditioning operation a bit more, since that is the backbone of inferring the **posterior distribution**. We will also look at how the **covariance matrix** evolves as training points are added.

In this article, we will build up our intuition of Gaussian Processes, and try to understand how it models uncertainty about data it has not encountered yet, while still being useful for regression. We will also see why the **Covariance Matrix** (and consequently, the **Kernel**) is a fundamental building block of our assumptions around the data we are trying to model.

This article builds upon the previous material on **kernels** and **Support Vector Machines** to introduce some simple examples of **Reproducing Kernels**, including a simplified version of the frequently-used **Radial Basis Function kernel**. Beyond that, we finally look at the actual application of kernels and the so-called **Kernel Trick** to avoid expensive computation of projections of data points into higher-dimensional space, when working with **Support Vector Machines**.
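As a quick numeric sketch of the **Kernel Trick** (the vectors and the homogeneous quadratic kernel here are hypothetical examples, not taken from the article): the kernel value \(k(x, y) = (x \cdot y)^2\) computed directly in 2D equals an inner product of explicit feature maps into 3D, so the projection never needs to be computed.

```python
import numpy as np

# Explicit degree-2 feature map for a 2D vector (hypothetical example).
def phi(v):
    x1, x2 = v
    return np.array([x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

x = np.array([1.0, 2.0])
y = np.array([3.0, 1.0])

kernel_value = np.dot(x, y) ** 2           # computed without projecting
explicit = np.dot(phi(x), phi(y))          # computed in 3D feature space

assert np.isclose(kernel_value, explicit)  # both are 25.0
```

The left-hand side costs one dot product in the original space; the right-hand side grows with the dimension of the feature space, which is the cost the kernel trick avoids.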

This article uses the previous mathematical groundwork to discuss the construction of **Reproducing Kernel Hilbert Spaces**. We’ll make several assumptions that have been proved and discussed in those articles. There are multiple ways of discussing Kernel Functions, like the **Moore–Aronszajn Theorem** and **Mercer’s Theorem**. We may discuss some of those approaches in the future, but here we will focus on the constructive approach to characterise **Kernel Functions**.

This article lays the groundwork for an important construction called **Reproducing Kernel Hilbert Spaces**, which allows a certain class of functions (called **Kernel Functions**) to be a valid representation of an **inner product** in (potentially) higher-dimensional space. This construction will allow us to perform the necessary higher-dimensional computations, without projecting every point in our data set into higher dimensions, explicitly, in the case of **Non-Linear Support Vector Machines**, which will be discussed in the upcoming article.

This article discusses a set of two useful (and closely related) factorisations for **positive-definite matrices**: the **Cholesky** and the **\(LDL^T\)** factorisations. Both of them find various uses: the Cholesky factorisation particularly is used when **solving large systems of linear equations**.
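As a small numeric sketch (the matrix here is a hypothetical positive-definite example), both factorisations can be checked with NumPy; the \(LDL^T\) factors can even be recovered from the Cholesky factor by pulling the diagonal out of \(L\):

```python
import numpy as np

# A symmetric positive-definite matrix (hypothetical example).
A = np.array([[4.0, 2.0],
              [2.0, 3.0]])

# Cholesky: A = L L^T with L lower-triangular.
L = np.linalg.cholesky(A)
assert np.allclose(L @ L.T, A)

# LDL^T recovered from Cholesky: divide each column of L by its pivot.
d = np.diag(L)      # square roots of the diagonal of D
Lu = L / d          # unit lower-triangular factor
D = np.diag(d ** 2)
assert np.allclose(Lu @ D @ Lu.T, A)
```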

We discuss an important factorisation of a matrix, which allows us to convert a linearly independent but non-orthogonal basis to a **linearly independent orthonormal basis**. This uses a procedure which iteratively extracts vectors which are orthonormal to the previously-extracted vectors, to ultimately define the orthonormal basis. This is called the **Gram-Schmidt Orthogonalisation**, and we will also show a proof for this.
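The iterative procedure can be sketched in a few lines (the input vectors here are hypothetical examples, and the code assumes they are linearly independent): subtract from each vector its projections onto the vectors already extracted, then normalise what remains.

```python
import numpy as np

def gram_schmidt(vectors):
    """Orthonormalise a list of linearly independent vectors."""
    basis = []
    for v in vectors:
        # Remove the components along previously extracted directions...
        w = v - sum(np.dot(v, q) * q for q in basis)
        # ...and normalise the remainder.
        basis.append(w / np.linalg.norm(w))
    return np.array(basis)

Q = gram_schmidt([np.array([1.0, 1.0, 0.0]),
                  np.array([1.0, 0.0, 1.0])])

# The rows of Q are orthonormal, so Q Q^T is the identity.
assert np.allclose(Q @ Q.T, np.eye(2))
```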

We have looked at **Lagrangian Multipliers** and how they help build constraints into the function that we wish to optimise. Their relevance to **Support Vector Machines** lies in how the constraints on the classifier margin (i.e., the supporting hyperplanes) are incorporated into the search for the **optimal hyperplane**.

This article concludes the (very abbreviated) theoretical background required to understand **Quadratic Optimisation**. Here, we extend the **Lagrangian Multipliers** approach, which in its current form, admits only equality constraints. We will extend it to allow constraints which can be expressed as inequalities.

We consider the more frequently utilised viewpoints of **matrix multiplication**, and relate each to one or more applications where that viewpoint is particularly useful. These are the viewpoints we will consider.

We pick up from where we left off in Quadratic Optimisation using Principal Component Analysis as Motivation: Part One. We treated **Principal Component Analysis** as an optimisation, and took a detour to build our geometric intuition behind **Lagrange Multipliers**, wading through its proof to some level.

In this article, we finally put all our understanding of **Vector Calculus** to use by showing why and how **Lagrange Multipliers** work. We will be focusing on several important ideas, but the most important one is around the **linearisation of spaces at a local level**, which might not be smooth globally. The **Implicit Function Theorem** will provide a strong statement around the conditions necessary to satisfy this.

In this article, we take a detour to understand the mathematical intuition behind **Constrained Optimisation**, and more specifically the method of **Lagrangian multipliers**. We have been discussing **Linear Algebra**, specifically matrices, for quite a bit now. **Optimisation theory**, and **Quadratic Optimisation** as well, relies heavily on **Vector Calculus** for many of its results and proofs.

This series of articles presents the intuition behind the **Quadratic Form of a Matrix**, as well as its optimisation counterpart, **Quadratic Optimisation**, motivated by the example of **Principal Components Analysis**. PCA is presented here, not in its own right, but as an application of these two concepts. PCA proper will be presented in another article where we will discuss **eigendecomposition**, **eigenvalues**, and **eigenvectors**.

This article aims to start the road towards a theoretical intuition behind **Gaussian Processes**, another Machine Learning technique based on **Bayes’ Rule**. However, there is a raft of material that I needed to understand and relearn before fully appreciating some of the underpinnings of this technique.

We will derive the intuition behind **Support Vector Machines** from first principles. This will involve deriving some basic vector algebra proofs, including exploring some intuitions behind hyperplanes. Then we’ll continue by adding the concepts behind quadratic optimisation to our understanding.

The **dot product of two vectors** is geometrically simple: the product of the magnitudes of these vectors multiplied by the cosine of the angle between them. What is not immediately obvious is the algebraic interpretation of the dot product.
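As a quick numeric check (the two vectors below are hypothetical examples), the algebraic form — the sum of componentwise products — agrees with the geometric definition \(\lVert a \rVert \, \lVert b \rVert \cos\theta\):

```python
import math

a = (3.0, 0.0)
b = (2.0, 2.0)

# Algebraic interpretation: sum of componentwise products.
algebraic = a[0] * b[0] + a[1] * b[1]  # 3*2 + 0*2 = 6

# Geometric interpretation: |a| |b| cos(theta).
norm_a = math.hypot(*a)
norm_b = math.hypot(*b)
theta = math.atan2(b[1], b[0]) - math.atan2(a[1], a[0])
geometric = norm_a * norm_b * math.cos(theta)

assert math.isclose(algebraic, geometric)  # both equal 6.0
```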

Let’s look at **Linear Regression**. The “linear” term refers to the fact that the output variable is a **linear combination** of the input variables.
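A minimal sketch of that idea, assuming synthetic, noise-free data generated as \(y = 2x_1 + 3x_2 + 1\): because the output is a linear combination of the inputs (plus an intercept), least squares recovers the weights exactly.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 2 * X[:, 0] + 3 * X[:, 1] + 1  # noise-free linear combination

# Augment with a column of ones so the intercept is learned too.
Xb = np.hstack([X, np.ones((X.shape[0], 1))])

# Solve the least-squares problem (the normal equations, done stably).
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)

print(np.round(w, 3))  # ≈ [2. 3. 1.]
```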

I’d like to introduce some basic results about the rank of a matrix. Simply put, the rank of a matrix is the number of linearly independent vectors in it. Note that I didn’t say whether these are column vectors or row vectors; the following section will narrow down the specific cases (we will also prove that these two numbers are equal for any matrix).
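As a quick numeric illustration (the matrix below is a hypothetical example whose third row is the sum of the first two, so only two rows — and, equivalently, two columns — are independent):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [5, 7, 9]])  # row 3 = row 1 + row 2

rank_rows = np.linalg.matrix_rank(A)    # 2
rank_cols = np.linalg.matrix_rank(A.T)  # 2: row rank equals column rank
assert rank_rows == rank_cols == 2
```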

Some of these points about matrices are worth noting down, as aids to intuition. I might expand on some of these points into their own posts.

We will discuss the Column-into-Rows computation technique for matrix outer products. This will lead us to one of the important factorisations (the **LU Decomposition**) that is used computationally when solving systems of equations, or computing matrix inverses.
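A minimal sketch of the **LU Decomposition** itself (Doolittle form, no pivoting, on a hypothetical matrix with nonzero pivots): each Gaussian-elimination row operation on \(U\) is recorded as a multiplier in \(L\).

```python
import numpy as np

def lu_decompose(A):
    """Doolittle LU decomposition (no pivoting; assumes nonzero pivots)."""
    n = A.shape[0]
    L = np.eye(n)
    U = A.astype(float).copy()
    for k in range(n):
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]     # multiplier that zeroes U[i, k]
            U[i, k:] -= L[i, k] * U[k, k:]  # the row operation, recorded in L
    return L, U

A = np.array([[2.0, 1.0],
              [4.0, 3.0]])
L, U = lu_decompose(A)

# L is unit lower-triangular, U is upper-triangular, and L U = A.
assert np.allclose(L @ U, A)
```

Once \(A = LU\) is available, a system \(Ax = b\) reduces to one forward substitution and one back substitution, which is what makes the factorisation useful computationally.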

This is the easiest way I’ve been able to explain to myself the orthogonality of matrix spaces. The argument is essentially based on the geometry of planes, which extends naturally to hyperplanes.

We will discuss the value-wise computation technique for matrix outer products. This will lead us to a simple sketch of the proof of reversal of order for transposed outer products.
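The reversal-of-order rule \((AB)^T = B^T A^T\) is easy to check numerically (the random matrix shapes here are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(42)
A = rng.normal(size=(3, 4))
B = rng.normal(size=(4, 2))

# Transposing a product reverses the order of the factors.
assert np.allclose((A @ B).T, B.T @ A.T)
```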

Matrix multiplication (outer product) is a fundamental operation in almost any Machine Learning proof, statement, or computation. Much insight may be gleaned by looking at matrix multiplication in different ways. In this post, we will look at one (and possibly the most important) interpretation: namely, the **linear combination of vectors**.
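That interpretation can be checked in a couple of lines (the matrix and vector below are hypothetical examples): \(Ax\) is the linear combination of the columns of \(A\), weighted by the entries of \(x\).

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
x = np.array([5, 6])

# A @ x = 5 * (first column of A) + 6 * (second column of A)
combo = 5 * A[:, 0] + 6 * A[:, 1]

assert np.array_equal(A @ x, combo)  # both are [17, 39]
```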

Linear Algebra is often said to deal with matrices. But that misses the point, because the more fundamental component of a matrix is what will allow us to build our intuition on this subject. This component is the vector, and in this post, I will introduce vectors, along with common notations for expressing them.

I’ve always been fascinated by Machine Learning. This began in the seventh standard when I discovered a second-hand book on Neural Networks for my ZX Spectrum.