The Geek Diver | The Geek Diver

Implement Convolutional Neural Network from scratch

April 25, 2024 · 12 min read

The pro diver

Introduction

In this article, we will cover various theoretical topics with the aim of implementing a neural network in C# that is capable of recognizing a digit present in an image.

The article is divided into three parts:

Presenting the key principles of a neural network.
Explaining the architecture of a Convolutional Neural Network.
C# source code.

If you are familiar with the concept of neural networks, feel free to skip this chapter and proceed to the next.

Neural Network Architecture

A neural network is composed of multiple interconnected layers, with neurons forming connections between them.

There are three types of layers.

1. Input Layer

The input layer consists of neurons representing the features of the input data. Each neuron corresponds to a feature, and its value represents the value of that feature.

2. Hidden Layer

Between the input layer and the output layer, there can be multiple hidden layers.

Generally, each neuron in a hidden layer is connected to all neurons in the preceding layer, a configuration known as a fully connected or dense layer type.

A hidden layer has two parameters:

weights : Each connection between two neurons has a weight, which is a parameter that influences the amount of information transmitted between them.
bias : Constant assigned to a layer.

These two parameters are part of the calculation formula for the WeightedSum :

SumWeighted = bias + \sum_{i=1}^{n}xi *weighti

These two parameters are adjusted during the learning process, specifically during the back propagation step.

An activation function can then be applied to the calculated parameter $SumWeighted$ . Generally, the same type of function is chosen for all hidden layers. The choice depends on the nature of the neural network; for example, a CNN or MLP network typically uses ReLU.

There are several types of functions, listed as follows.

output = f(SumWeighted)

Activation	Network type
Rectified Linear Activation (ReLU)	MLP,CNN
Logistic (Sigmoid)	RNN
Hyperbolic Tangent (Tanh)	RNN

danger

There are layers where the parameters weights, bias and the activation function are not used.

3. Output Layer

The final layer of the neural network is the output layer. It shares the same structure as a hidden layer as it comprises multiple neurons and may have the parameters weights and bias.

This layer receives information from the last hidden layer and conducts computations to predict or classify the received data.

Depending on the nature of the problem the neural network aims to solve, you can select one of these layers:

Problem	Alg
Classification of more than two classes	Softmax
Binary classification	Sigmoid
Regression	Linear regression

Neural network training algorithm

Now that you have an overview of the components that constitute a neural network.

We will explain the various steps to train a neural network, outlined as follows:

Initialize the parameters of all layers : weights ou bias.
Execute the following steps N times.

2.1 For each layer, execute the forward propagation step.

2.2 Calculate the error.

2.3 For each layer, execute the backward propagation step.

1. Initialize the parameters

The learning parameters of the layers, such as weights and bias, must be initialized with random values.

According to the Xavier Initialization method, this step is more crucial than one might think because if the initial values are too large or too small, then the training of the neural network becomes inefficient.

Depending on the nature of the layer, it may not be necessary to initialize these parameters.

2. Forward propagation

The Forward propagation step is executed when a layer receives data from another. It consists of the following steps:

For each neuron, calculate the sum weighted : $sm = \sum_{i=1}^{n}xi *wi$ .

1.1. $wi$ is the value of the weight that the neuron has with the i-th neuron.

1.2. $xi$ is the value of the i-th neuron.
Add the sum-weighted to the bias parameter : $t = b + \sum_{i=1}^{n}xi *wi$ .
Call the activation function: $f(b + \sum_{i=1}^{n}xi *wi)$ .

3. Calculate the error.

When data is received from the last layer, the result can be compared with the expected outcome. The difference indicates the error that the network made during its prediction.

There are various methods to calculate the error; once again, the choice depends on the nature of the problem.

Problem	Loss function
Classification of more than two classes	Cross Entropy Loss
Binary classification	Logistic loss

4. Backpropagation

The error calculated by the loss function is then propagated through the various layers.

The learning process begins with Backpropagation aiming to reduce the cost of the loss function by adjusting the parameters weights and bias of the different layers.

The degree of adjustment of the variables weights and bias is calculated by gradients. They help understand how a variable like the weight can influence the total result Loss.

\frac{\delta Loss}{\delta weight} , \frac{\delta Loss}{\delta bias}

Convolutional Neural Network (CNN)

The neural network tailored for image recognition is of the Convolutional type.

The typical architecture of a CNN consists of four layers."

Input layer : Extract the data from an image in one of the two formats:
- Three two-dimensional RGB matrices.
- A two-dimensional matrix with grayscale values.
First hidden layer : Convolutional layer.
Second hidden layer : Max pooling layer.
Output layer : softmax layer.

Parameter	Value
Number of filters	3
Kernel size	3
Pool size	2
Number of classes	3

1. Convolutional Layer

Before describing the structure of this layer, it is important to understand the concept of a convolution matrix.

A convolution matrix, also known as a kernel, is a two-dimensional matrix. When applied to an image, it enables various effects, as demonstrated in the example below. For a comprehensive list of filters, please refer to the Wikipedia website.

Operation	Kernel	Result
Edge detection

Assuming the image has been extracted into a grayscale matrix of size $(imw * imh)$ , the convolution algorithm consists of the following steps:

Create a convolution matrix/kernel of size $(kw * kh)$ . By default, the matrix will have a size of 3x3.
Create an output matrix of size $(imw - kw + 1) * (imh - kh + 1)$ .
Retrieve all windows from the input image. The center of the kernel matrix is used as a pointer, and the algorithm moves the pointer over each column and each row of the image. The area covered by the kernel, also called a window, is then stored in a list called windows.
For each element in the windows list:

4.1. Multiply the value by the kernel.

4.2. Calculate the average and store the result in the output matrix.

To summarize, here is the mathematical equation to calculate each element of the output matrix.

V= \frac{\sum_{m}^{imgsize}\sum_{k}^{ksize}kernel(kx;ky) * img(mx,my)}{F}

$kernel(kx;ky)$ : the coefficient of the convolution kernel at position kx,ky.
$img(mx,my)$ : the data of the pixel that corresponds to img(mx,my).
$F$ : Sum of the coefficients of the kernel.

By applying this algorithm multiple times, there is a risk of losing pixels on the edges of the image. To address this issue, the padding parameter has been introduced.

Padding parameter

This parameter indicates the number of pixels to be added to the edges of the image.

Assuming that the padding parameter is defined by the variable p, the size of the output matrix will be calculated as follows:

(imw - kw + pw + 1) * (imh - kh + ph + 1)

In many cases, the padding value is set to $(kw-1,kh-1)$ so that the output image has the same size as the input image.

Stride parameter

For each pixel of the image, a convolution window is applied. This operation can be resource-intensive, especially when the algorithm is applied to a large image.

To decrease the number of iterations, the stride parameter has been introduced. It defines the number of elements to be skipped in the width and height of the image. By default, this parameter is set to (1,1).

Assuming that the stride parameter is defined by the variable s, the size of the output matrix will be calculated as follows:

((imw - kw + pw + sw) / sw) * ((imh - kh + ph + sh) / sh)

Forward propagation

A convolutional layer consists of 1 or more neurons, and each neuron has its own convolution window.

During the forward propagation phase, each neuron executes the convolution algorithm on a window of the image.

Here are the main steps of the forward propagation algorithm:

Retrieve all windows of the input image.
For each input window, execute the convolution algorithm of each neuron and store the result in a list.
Store the list in the output matrix.

Here is the architecture of a neuron in a convolutional layer.

Backward propagation

Given that in the article, the convolutional layer does not have bias parameters and activation functions, the backward propagation algorithm amounts to computing this formula :

\frac{\delta Loss}{\delta filter(x,y))} = \sum_{i}^{}\sum_{j}^{}\frac{\delta Loss}{\delta conv(i,j)}\frac{\delta conv(i,j)}{\delta filter(x,y)}

2. Max Pooling layer

This layer consists of no neurons and does not have any bias or weights parameters. It reduces the size of the input matrix by applying the max operation to each window of the input matrix.

Its objective is twofold:

Reduce dimensionality.
For each region, find the maximum.

Forward propagation

The forward propagation algorithm consists of the following steps:

Define the size of the pooling window $(pw * ph)$ .
Create an output matrix of size $(iw / pw) * (ih / ph)$ .
Divide the input matrix into several windows of size $(pw * ph)$ .
For each window of the matrix, find the maximum value and assign this value to the cell of the output matrix.

Backward propagation

Given that the Pooling layer has no learning parameters, weights, or bias.

The backward propagation algorithm simply reconstructs a matrix of the same size as the last matrix received by the layer. Here are the main steps of the algorithm:

Create an output matrix of the same size as the input matrix.
Divide the matrix into several windows of size $(pw * ph)$ .
For each cell of each window, if the element is not the maximum, set it to 0; otherwise, take the gradient of the error.

3. Softmax layer

The activation dense layer softmax consists of one or more neurons and has the learning parameters weights and bias. The number of neurons is equal to the number of classes to predict.

The softmax function is used to calculate the probabilities of belonging to each class.

Here is the architecture of a softmax neuron.

Forward propagation

The forward propagation algorithm consists of the following steps:

Define the number of classes in a variable n.
For each class, create a neuron and initialize its weight parameter.
Initialize the layer's bias parameter.
For each neuron, multiply the input matrix by its weight and store the result in a variable $Weights = \sum_{i=1}^{n}xi *weighti$ .
Calculate the sum with the bias and store the result in a variable $SumWeighted = bias + Weights$ .
Finally, execute the activation function on each class $softmax(ci) = \frac{exp(ci)}{\sum_{i}^{n}exp(i)}$ .

Backward propagation

The softmax layer has two learning parameters that need to be updated, and a variable learningRate which weighs the importance of the partial derivatives to update these parameters.

The algorithm must be able to compute the partial derivatives for these two parameters.

\frac{\delta Loss}{\delta weight} = \frac{\delta Loss}{\delta out} * \frac{\delta out}{\delta net} * \frac{\delta net}{\delta weight}

\frac{\delta Loss}{\delta bias} = \frac{\delta Loss}{\delta out} * \frac{\delta out}{\delta bias}

The derivative $\frac{\delta Loss}{\delta out}$ is quite simple to calculate.

If the predicted class is different from the expected one:

\frac{\delta Loss}{\delta out(i)}=0

If the predicted class is the same as the expected one:

\frac{\delta Loss}{\delta out(i)}=\frac{-1}{pi}

The derivative $\frac{\delta net}{\delta weight}$ is equals to :

net = bias + \sum_{i=1}^{n}x(i)*weight(i) net = input

The derivative $\frac{\delta out}{\delta bias}$ is equals to 1.

The last derivative, $\frac{\delta out}{\delta net}$ , is more complex to calculate. If you desire a complete demonstration, I invite you to read the article published on medium.

Once all these partial derivatives are calculated, the parameters can be updated as follows:

weight = weight - learningRate * \frac{\delta Loss}{\delta weight}

bias = bias - learningRate * \frac{\delta Loss}{\delta bias}

C# implementation

The source code of the C# project can be found here.

The project uses a MINST dataset from the kaggle website to train the Convolutional Neural Network.

After 3 iterations performed on a dataset of 5000 entries, the achieved performance is quite satisfactory.

The accuracy is approximately 85%!

Conclusion

Congratulations! You've reached the end of this article. 😉

The C# implementation is not optimal and can still be improved.

For production use, I recommend using the excellent library keras.

Resources

https://observablehq.com/@lemonnish/cross-correlation-of-2-matrices, Cross-correlation of 2 matrices
https://www.ibm.com/topics/convolutional-neural-networks, What are convolutional neural networks?
https://developer.nvidia.com/discover/convolution#:~:text=The%20convolution%20algorithm%20is%20often,convolution%20operation%20is%20called%20deconvolution., Convolution
https://web.pdx.edu/~jduh/courses/Archive/geog481w07/Students/Ludwig_ImageConvolution.pdf, Image Convolution
https://d2l.ai/chapter_convolutional-neural-networks/conv-layer.html#fig-correlation, Convolutions for image
https://towardsdatascience.com/a-comprehensive-guide-to-convolutional-neural-networks-the-eli5-way-3bd2b1164a53, A Comprehensive Guide to Convolutional Neural Networks — the ELI5 way
https://victorzhou.com/blog/softmax/, Softmax
https://www.sciencedirect.com/topics/computer-science/convolutional-layer, Convolutional Layer
https://medium.com/@nerdjock/deep-learning-course-lesson-5-forward-and-backward-propagation-ec8e4e6a8b92, Forward and Backward Propagation

Part 2 - Optimization of the SGD algorithm

April 18, 2024 · 8 min read

The Geek Diver

The pro diver

Introduction

The naive implementation of the SGD algorithm presented in the first part is not efficient and can be improved by addressing several points.

The paper Parallelized Stochastic Gradient Descent suggests executing the algorithm on multiple machines in parallel.
Utilizing the GPU for matrix computations.

Supporting Parallelism

The main idea is to divide the dataset into subsets/mini-batches and distribute them across multiple machines (learners/processing units).

This method was initially proposed by Google's DistBelief. It uses a central parameter server (PS) that updates the learning model with gradients calculated by the learning nodes.

Architecture

There are two modes for updating the learning model:

Synchrone : The parameter server waits to receive gradients from all learning nodes before updating the model parameters. This approach takes longer and is riskier, as it requires the completion of execution of a portion or all of the nodes.
Asynchrone : The parameter server does not wait for the learning nodes. There is a risk that learning nodes execute the algorithm on an outdated version of the model.

The synchronous configuration consists of three methods:

Full Sync SGD : The parameter server (PS) waits to receive the gradient calculation from all learning nodes.

K-Sync SGD : The parameter server (PS) waits to receive K gradient calculation results, where each result comes from a different node. When received, all ongoing calculations on the nodes are canceled. Here, for each iteration, a mini-batch is distributed to each learning node.

K-batch-sync SGD : The parameter server (PS) waits to receive K gradient calculation results, where one or more results can come from the same node.

The asynchronous configuration consists of three methods:

Async SGD : As soon as the parameter server (PS) receives the gradient calculation from a learning node, it updates the learning model.

K-async SGD : The parameter server (PS) waits to receive K gradient calculation results, where each result comes from a different node. Here, for each iteration, a mini-batch is distributed to each learning node.

K-batch async SGD : The parameter server (PS) waits to receive K gradient calculation results, where one or more results can come from the same node.

Unlinke the synchronous mode, the asynchronous mode accepts all results from learning nodes and does not cancel ongoing executions.

At each iteration, if more than one gradient has been received by the PS server, we calculate the average and use this value to update the learning model parameters. Here is the formula for updating the parameter:

w_{j+1} = w_{j} - \frac{\eta}{K} * \sum_{l=1}^{K} g(w_{\tau(l,j)},\xi_{l,j})

$l$ : denotes the index of the K learners that contribute to the update at the corresponding iteration.
$\tau_{l,j}$ : is one mini-batch of m samples used by the l-th learner at the j-th iteration.
$\xi_{l,j}$ : denotes the iteration index when the l-th learner last read from the central PS.
$g(w_{\tau(l,j)},\xi_{l,j}) = \frac{1}{n} * \sum_{}^{} \nabla f(w_{\tau(l,j)},\varepsilon_{l,j})$ : average gradient of the loss function evaluated over the mini-batch $\tau_{l,j}$ .

Here is the K-batch async SGD algorithm.

Algorithm for the PS server to update the parameters w and itp of a complex-shaped linear function.

Define the number of learning nodes N.
Define the number of mini-batch K required to update the model parameters.
Define the number of iterations epoch.
Define the learning rate η. This determines the step size at each iteration while moving toward a minimum of a loss function.
Initialize the vector w for weight.
Initialize the variable itp for intercept.
for i = 1 .... epoch

7.1 for n .... N do in parallel

7.1.1 Randomly shuffle samples in the training set and store the result in the tx and ty variables.

7.1.2 Send the variables tx, ty, w and itp to node k

7.2 Wait to receive K mini-batches and store the results in two variables nwi and nitpi .

7.3 $w_{j+1} = w_{j} - \frac{\eta}{K} \sum_{l=1}^{K} (nwi)$

7.4 $itp = itp - \eta * \sum_{\eta}^{K}(nitpi)$

Algorithm for a learning node to calculate the gradient of a complex-shaped linear function.

Receive the mini-batch tx and ty.
Receive the parameter w.
Receive the parameter itp.
Calculate the number of observations N.
Calculate the gradient $g = \frac{1}{N} \sum_{i}^{N} -2 * wT * tx * (ty - (itp + w*tx))$
Calculate the intercept $itp = \frac{1}{N} * \sum_{i}^{N} -2 * itp * (ty - (itp + w*tx))$
Send the variables g and itp to the PS server.

GPU

As you may have noticed in the previous chapter, both the parameter server and learning nodes perform computations on matrices.

It is recommended to carry out such calculations on the GPU, especially when dealing with large matrices.

If you are not familiar with the key concepts of a GPU, I encourage you to read the next chapter; otherwise, proceed to the CUDA chapter.

CUDA

In this article, we use CUDA C++, an extension of C++ that enables developers to define kernels, which, when invoked, are executed N times in parallel by N different CUDA threads.

The thread identifier is calculated as follows: idx = threadIdx.x + blockIdx.x * blockDim.x. The parameters for this formula are as follows:

BlockDim : Dimension of a block, corresponding to the number of threads.
BlockIdx : Index of the block in a grid.
ThreadIdx : Index of the thread in a block.

CUDA introduces the following entities:

Thread : An abstract entity representing the execution of the kernel.
Block : Composed of multiple wraps, and each wrap has 32 threads.
Grid : Blocks are combined to form a grid.

During their execution, CUDA threads have access to different memory spaces, including but not limited to:

Register : Each thread has its own private memory.
Shared memory : Each Block has memory shared among all its threads. This memory has the same lifespan as the block.
Global GPU memory : All threads have access to global memory.

If you want more information on the subject, I recommend reading the excellent programming guide https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programming-model.

Now that you have an overview of all the involved entities when developing with CUDA, you can write the algorithm for calculating the gradient of a function on the GPU, considering the following points:

If the matrix is too large, it can be divided into several tiles. The CUDA documentation explains how this works.
Reduce the number of interactions between a thread and the Global GPU Memory; the thread should, whenever possible, use shared memory.

You are now ready to write your first matrix calculation algorithm on a GPU.

Performing Matrix Operations

The matrix operation most commonly used in the SGD algorithm is the multiplication of a matrix with a vector.

This article on GPU matrix-vector product (gemv) explains clearly the various ways to implement this algorithm.

In our case, we will not implement the algorithm from scratch or others. Instead, we will use the BLAS library provided by CUDA, which is optimized for performing this type of calculation.

It offers several useful functions for writing an optimized version of the SGD algorithm:

Fonction	Description	URL
gemv	Multiply a matrix by a vector	https://docs.nvidia.com/cuda/cublas/index.html#cublas-t-gemv
scal	Multiply a vector by a variable	https://docs.nvidia.com/cuda/cublas/index.html#cublas-t-scal
axpy	Multiply a vector `x` by `alpha` and add the result to vector `y`	https://docs.nvidia.com/cuda/cublas/index.html#cublas-t-axpy

We used the ILGPU library to write our algorithm; you can find the source code here https://github.com/simpleidserver/DiveIt/Projects/SGDOptimization/GPUMathUtils.cs.

Performance

We conducted several tests to compare performance.

All tests were run on the same dataset.

Paramètre	Valeur
Number of test data	1500
Nubmer of parameters	15000
Number of iterations (epoch)	10
Number of machines (k)	2

Sequential SGD - CPU (6837 MS)

SGD-seq-CPU

Sequential SGD - GPU (3243 MS)

SGD-seq-GPU

Parallel SGD - CPU (4602 MS)

SGD-parallel-CPU

Parallel SGD - GPU (2911 MS)

SGD-parallel-GPU

As you can see, the performance is much better when the GPU is used to compute matrices. Even though the project does not distribute the computational load of the function gradient across multiple nodes but across multiple threads, the performance remains entirely acceptable.

Conclusion

The proposed algorithm is not ready for use in a .NET library for production. However, understanding it can help you choose a library for machine learning.

To choose a library, you should check if it supports the following points. They are crucial as they will help reduce the training time of an ML model.

Capable of supporting parallelism.
Capable of supporting execution on one or multiple GPUs.

The source code of the project can be found here https://github.com/simpleidserver/DiveIt/tree/main/Projects/SGDOptimization.

Ressources

Link
https://eurocc.cyi.ac.cy/wp-content/uploads/2ndIntermediate-GPUCUDA.pdf, GPU programming using CUDA
https://wlandau.github.io/gpu/lectures/cublas-cula/cublas-cula.pdf, The CUBLAS and CULA librarie
https://www.bealto.com/gpu-gemv_v2.html, GPU matrix-vector product
https://github.com/ecrc/kblas-gpu/blob/8af76dc862c74cbe880569ff2ccf6e5e54245430/src/blas_l2/sgemv.cu, KBLAS is a high performance CUDA library for subset of BLAS and LAPACK routines optimized for NVIDIA GPUs.
https://docs.nvidia.com/cuda/cublas/index.html, CUDA.

Part 1 - Stochastic Gradient Descent (SGD) from scratch

April 11, 2024 · 9 min read

The Geek Diver

The pro diver

Introduction

In this article, we will implement the Stochastic Gradient Descent Regression algorithm from scratch in C#. This algorithm is well-known and widely used in the field of Machine Learning. It will be utilized for performing Linear Regression on both simple and complex forms.

If you are not familiar with the concept of a Regression function, please read the following chapter. Otherwise, feel free to skip directly to the Gradient Descent chapter.

Regression Versus Classification problems

Variables of a function can be characterized as either quantitative or qualitative (also known as categorical). Quantitative variables take on numerical values. Examples include a person's age, height, or income. In contrast, qualitative variables take on values in one of K different classes, or categories. Examples of qualitative variables include a person's gender (male or female), the brand of product purchased (branch A, B or C).

We tend to refer to problems with a quantitative response as regression problem, while those involving a qualitative response are often referred to as classification problems.

Least squares linear regression is used with a quantitative response, whereas Logistic regression is used with a qualitative (two-class, or binary) response.

Gradient Descent

A Gradient Descent function is an iterative algorithm designed to find the minimum of a loss function. This algorithm can be applied to any differentiable function.

Here are the steps of the algorithm:

Initialize the vector `w` for all parameters. These parameters are computed by the algorithm.
Define the number of iterations `epoch`.
Define the learning rate `η`. It determines the step size at each iteration while moving toward a minimum of a loss function.
for i = 1 .... epoch do

1 $w = w - η ∇ Qi(w)$

Parameter	Desription
$Qi(w)$	Loss function
$∇$	Gradient of the loss function

We have chosen to apply this algorithm to both Simple and Complex forms of Linear expressions.

Gradient Descent on Simple Linear Expression

A simple linear function has this form.

y = itp + w*xi

The best loss function to choose for evaluating the performance of the Gradient model is the Mean Squared Error (MSE).

MSE = \frac{1}{N} \sum_{i}^{n} (yi - f(xi))^{2}

MSE = \frac{1}{N} \sum_{i}^{n} (yi - (itp + w*xi))^{2}

f(xi) is the prediction that the function f provides for the ith observation.

To calculate the gradient of the $MSE$ / $Qi(itp,w)$ function, it is necessary to determine the partial derivatives with respect to the parameters $\frac{\delta}{\delta itp}$ and $\frac{\delta}{\delta w}$ . The chain rule is used to compute them.

For the function f(g(x)), its derivative is f'(g(x)) * g'(x).

Here are the details of the partial derivatives calculations:

\frac{\delta}{\delta itp} = \frac{1}{N}(\sum_{i}^{n} (yi - (itp + w*xi))^{2})* (yi - (itp + w*xi))

\frac{\delta}{\delta itp} = \frac{1}{N}(\sum_{i}^{n} 2*(yi - (itp + w*xi)))* -itp

\frac{\delta}{\delta itp} = \frac{1}{N}(\sum_{i}^{n} -2*(yi - (itp + w*xi)))

\frac{\delta}{\delta w} = \frac{1}{N}(\sum_{i}^{n} (yi - (itp + w*xi))^{2})* (yi - (itp + w*xi))

\frac{\delta}{\delta w} = \frac{1}{N}(\sum_{i}^{n} 2*(yi - (itp + w*xi)))*- w*xi

\frac{\delta}{\delta w} = \frac{1}{N}(\sum_{i}^{n} -2*xi*(yi - (itp + w*xi)))

Here is the updated gradient algorithm to minimize the loss function of a simple linear function.

Initialize the vector x with values.
Initialize the vector y with values.
Initialize the variable w for weight.
Initialize the variable ipt for intercept.
Assign random values to these two variables.
Define the number of iterations epoch.
Define the learning rate η.
for i = 1 .... epoch do

8.1 $w = w - \eta * (\frac{1}{N} \sum_{i}^{n} -2 * xi * (yi - (itp + w * xi)))$

8.2 $itp = itp - \eta * (\frac{1}{N} \sum_{i}^{n} -2 * (yi - (itp + w * xi)))$

A simple-shaped function may not be sufficient for you, as you have more than 2 parameters to calculate.

In this situation, you need to use the Gradient Descent algorithm to calculate all the parameters of your complex linear function.

Gradient Decent on complex linear expression

A complex linear function has this form:

y = itp + w*xi1 + w*xi2 + w*xi3 + ... + w*xi4

Here, the values of the variables xi can be represented in the form of a matrix of size (O, V), where O represents the number of observations, and V represents the number of variables. The variable w can be represented as a vector of size V.

The algorithm is very similar to that applied for a simple linear function but contains a few differences.

Initialize the vector x with values.
Initialize the vector y with values.
Initialize the variable w for weight.
Initialize the variable ipt for intercept.
Assign random values to these two variables.
Define the number of iterations epoch.
Define the learning rate η.
for i = 1 .... epoch do

8.1 $w = w - \eta * ( \frac{1}{N} * \sum_{i}^{n} -2 * wT * xi * (yi - (itp + w * xi)))$

8.2 $itp = itp - \eta * ( \frac{1}{N} * \sum_{i}^{n} 2 * itp * (yi - (itp + w*xi)))$

At step 8.1, the matrix w must be transposed. For more information on this operation, refer to this site https://en.wikipedia.org/wiki/Transpose.

At first glance, this algorithm works well on small-sized matrices but is not suitable for large matrices due to a risk of memory loss. To address this issue, the Stochastic Gradient Descent algorithm has been introduced.

Stochastic Gradient Descent

The idea of the algorithm is to take a random dataset from the variables x and y and apply the Gradient Descent algorithm to it. This algorithm diverges slightly from the previous one:

Initialize the vector x with values.
Initialize the vector y with values.
Initialize the variable w for weight.
Initialize the variable ipt for intercept.
Assign random values to these two variables.
Define the number of iterations epoch.
Define the learning rate η.
for i = 1 .... epoch do

8.1 Randomly shuffle samples in the training set and store the result in the tx and ty variables.

8.1 $w = w - \eta * ( \frac{1}{N} * \sum_{i}^{n} -2 * wT * txi * (tyi - (itp + w * txi)))$

8.2 $itp = itp - \eta * ( \frac{1}{N} * \sum_{i}^{n} 2 * itp * (tyi - (itp + w*txi)))$

Now that you have an overview of the Gradient Descent algorithm, we will develop a naive implementation in C# of the Stochastic Gradient Descent algorithm.

C# implementation

To be able to write the Stochastic Gradient Descent algorithm, we use two classes, CPUMathUtils and MathUtils. These classes facilitate matrix and vector operations, and you can find their source code on GitHub : https://github.com/simpleidserver/DiveIt/Projects/SGD.

Create a class GradientDescentRegression with a static function Compute.

Parameter	Description
epoch	Number of iterations
x	matrix where each cell corresponds to the value of the parameter
y	vector where each value corresponds to the value of an observation
isStochastic	Enable or disable the stochastic algorithm. If true, then the dataset used to execute the GD algorithm will be randomly selected.
batchSize	Number of observations that the algorithm will randomly select on each iteration. This parameter is used only if `isStochastic` is true.

public static void Compute(
    double learningRate,
    int epoch,
    double[,] x,
    double[] y,
    bool isStochastic = false,
    int batchSize = 0)

Initialize the variables, such as intercept and an empty vector for w.

int nbSamples = x.GetLength(0);
var nbWeight = x.GetLength(1);
var w = MathUtils.InitializeVector(nbWeight);
double itp = 0;
var lst = new List<double>();

Create a loop where the number of iterations is equal to epoch.

for (var step = 0; step < epoch; step++)
{

}

In this loop, add the logic to randomly draw a dataset of size batchSize if isStochastic is true.

var tmpX = x;
var tmpY = y;
if(isStochastic)
{
    var shuffle = MathUtils.Shuffle(x, y, nbSamples, nbWeight, batchSize);
    tmpX = shuffle.a;
    tmpY = shuffle.b;
}

Calculate the new value of y (itp + w*txi)

var newY = CPUMathUtils.Add(
  CPUMathUtils.Multiply(tmpX, w),
  itp);          

Calculate the lose factor (losefactor = tyi - (itp + w*txi))

var loseFactor = CPUMathUtils.Substract(tmpY, newY);

Update the w vector (w = w - η * (1/N Σ(i/n) -2 * wT*txi * (loseFactor)))

var tmp = CPUMathUtils.Multiply(
    CPUMathUtils.Divide(
        CPUMathUtils.Multiply(
            CPUMathUtils.Multiply(
                CPUMathUtils.Transpose(tmpX),
                loseFactor
            ),
            -2
        ),
        nbSamples
    ),
    learningRate);
// 4. Update weight
w = CPUMathUtils.Substract(w, tmp);

Update the intercept (itp = itp - η * (1/N Σ(i/n) 2 * itp * (loseFactor)))

itp -= learningRate * (CPUMathUtils.Multiply(loseFactor, -2).Sum() / nbSamples);

Conclusion

Congratulations! 🥳

If you have reached the end of the article, it may mean that you have successfully implemented the Stochastic Gradient Descent algorithm for a complex linear function. This algorithm is not intended for use in a library as it is not optimized and cannot work on non-convex functions. Therefore, I suggest using a library such as sklearn for linear regression.

In future articles, we will explain how to improve the performance of the algorithm.

Source code

You will find the source code at this link: https://github.com/simpleidserver/DiveIt/tree/main/Projects/SGD.

Resources

Link
https://en.wikipedia.org/wiki/Stochastic_gradient_descent, Stochastic gradient descent
https://sebastianraschka.com/faq/docs/gradient-optimization.html, What are gradient descent and stochastic gradient descent?
https://towardsdatascience.com/stochastic-gradient-descent-explained-in-real-life-predicting-your-pizzas-cooking-time-b7639d5e6a32, Stochastic Gradient Descent explained in real life
https://optimization.cbe.cornell.edu/index.php?title=Stochastic_gradient_descent, Stochastic gradient descent
https://docs.nvidia.com/deeplearning/performance/dl-performance-matrix-multiplication/index.html, Matrix Multiplication Background User's Guide

Part 1 - The magic behind the JIT compiler

April 4, 2024 · 6 min read

The Geek Diver

The pro diver

Introduction

The goal of this tutorial is to give you an overview of the Dotnet Runtime project. It is intended for developers at all experience levels.

We will explain the various steps that enable the transformation of C# code into native code, executable on any type of machine architecture, such as OSX, Unix-x64, etc.

The transformation process is constructed as follows:

This process can be broken down into two distinct parts :

The first part involves compiling the C# code into intermediate code.
The second part involves compiling this intermediate code into native code on-the-fly during program execution.

We will explain each part in the following chapters.

Compile the C# code into intermediate code

Imagine you have the following C# file. The initial step is to convert the C# code into intermediate code. To achieve this, you must execute the dotnet build command.

Console.WriteLine("Hello world!");
Console.ReadLine();

This command is a part of the dotnet SDK project: https://github.com/dotnet/sdk, and it performs the following actions:

Resolves the path of the MSBuild.dll file, which must be present in the root directory of your DOTNET SDK.
Executes the command MSBuild <projectName> -t:Rebuild -consoleloggerparameters:Summary.
The MSBUILD project, whose source code is located at this address https://github.com/dotnet/msbuild, uses the Roslyn library to compile the C# code into intermediate code (IL).

The compilation process consists of several steps:

Tokenizer : In the context of a compiler, tokenization (also known as lexing or lexical analysis) is the process of converting the source code into a sequence of strings with assigned meanings (tokens). The program that performs tokenization is called a tokenizer or lexer.
Parser : In the parsing phase, output from the tokenization phase is used to build a data structure named an abstract syntax tree, giving a structural representation of the input while checking for correct syntax.
Declaration : Source and imported metadata are analyzed to form named symbols.
Binding : Match identifiers in the code to symbols.
IL Emission : Intermediate Language is emitted with all the information.

Now that the intermediate code has been generated, we can proceed with the program's execution.

Compilation of the intermediate code on-the-fly

Execute the command dotnet run to launch the execution of your application.

Virtual Machine

Upon its launch, the Virtual Machine must first be initialized, and it performs a sequence of actions:

Set up the infrastructure that needs to be in place before anything else can run.
Initialize the core, low-level components.
Start up the low-level components, i.e., error handling, profiling API, debugging.
Start the main components, i.e., Garbage Collector (GC), AppDomains, Security.
Finalize the setup and then notify other components that the Virtual Machine has started.

Once the Virtual Machine is initialized, it passes the intermediate code to the Just-In-Time compiler. There are two contracts between the Just-In-Time compiler and the virtual machine, allowing for bilateral communication between these two components

The ICorJitCompiler interface (https://github.com/dotnet/runtime/blob/4765dd1b9f1aa58f16d6922438bcc6cb01b4a666/src/coreclr/inc/corjit.h) serves as the entry point for the Virtual Machine and includes several important methods, such as:

compileMethod: The Virtual Machine passes an object ICorJitInfo that holds the IL code.

The ICorJitInfo interface (https://github.com/dotnet/runtime/blob/b8e1296c832858e7f4b0eb2949afa73147a91981/src/coreclr/inc/corinfo.h) is an interface implemented by the Virtual Machine. It comprises a large number of methods and is used by the Just-In-Time compiler to look up metadata tokens, traverse type signatures, compute field and vtable offsets, and find method entry points.

In our case, the compileMethod method is crucial as it contains the logic to compile the intermediate code into native code.

Just-In-Time compiler

Here are the most important steps:

Importation : For each Statement instruction, we will create the HIR form. The HIR form of int Foo(int x) => x * 2 is:

STMT 0
*         RETURN          INT
\--*      MUL             INT
   +--*   LCL_VAR         INT  arg0
   \--*   CNS_INT         INT  2

Clean and optimize

2.1 Morphing : Including a certain number of transformations, for example, to improve performance. The operation x * 2 can be replaced by a left bit shift operation. The HIR form then becomes :
```
STMT 0
*         RETURN          INT
\--*      LSH             INT
   +--*   LCL_VAR         INT  arg0
   \--*   CNS_INT         INT  1
```
2.2 Inlining : Determines whether a function call is likely to be a good candidate for inlining by estimating the size of the native code of the function. If the situation is advantageous, a Compiler object is created, and the nodes are copied into this object after the importation phase has been executed.

2.3 Optimize layout : The goal here is to eliminate redundant Basic Blocks that will never be executed. For instance, in the following code, the instruction block throw new Exception("Sid"); is disregarded because the condition is never fulfilled.
```
if ('B' == 'C')
    throw new Exception("Sid");
```
2.4 Building SSA form : The goal here is to remove unnecessary variables.
Rationalization : Eliminating the parental context in trees results in a transition from a hierarchical model (HIR) to a linear model (LIR).
Emitter : For each block, we generate the function prologue, body, and epilogue, and then emit the result into memory. The code is located in the emitarm.cpp class (https://github.com/dotnet/runtime/blob/b8e1296c832858e7f4b0eb2949afa73147a91981/src/coreclr/jit/emitarm.cpp).

Now that the native code has been emitted, the compileMethod function returns a pointer to it, which can then be executed by the virtual machine.

Conclusion

Now that you have a comprehensive understanding of the various components involved in just-in-time compilation, we will describe in the upcoming articles different sections of the DOTNET RUNTIME library that seem interesting to me. Feel free to leave a comment if you have any remarks 😀

Resources

Link
https://github.com/dotnet/roslyn/blob/main/docs/wiki/Roslyn-Overview.md, .NET Compiler Platform ("Roslyn") Overview
https://github.com/dotnet/coreclr/blob/master/Documentation/botr/ryujit-overview.md#pre-import, JIT Compiler Structure
https://mattwarren.org/2017/03/23/Hitchhikers-Guide-to-the-CoreCLR-Source-Code/, A Hitchhikers Guide to the CoreCLR Source Code
https://mattwarren.org/2017/02/07/The-68-things-the-CLR-does-before-executing-a-single-line-of-your-code/, The 68 things the CLR does before executing a single line of your code
https://mattwarren.org/2018/07/05/.NET-JIT-and-CLR-Joined-at-the-Hip/, .NET JIT and CLR - Joined at the Hip
https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/if-conversion-within-dotnet-part-1, If Conversion Within .NET - Part 1
https://medium.com/@sruthk/cracking-assembly-function-prolog-and-epilog-in-x86-cb3c3461bcd3, Cracking Assembly — Function Prolog and Epilog in x86

Introduction​

Neural Network Architecture​

1. Input Layer​

2. Hidden Layer​

3. Output Layer​

Neural network training algorithm​

1. Initialize the parameters​

2. Forward propagation​

3. Calculate the error.​

4. Backpropagation​

Convolutional Neural Network (CNN)​

1. Convolutional Layer​

Padding parameter​

Stride parameter​

Forward propagation​

Backward propagation​

2. Max Pooling layer​

Forward propagation​

Backward propagation​

3. Softmax layer​

Forward propagation​

Backward propagation​

C# implementation​

Conclusion​

Resources​

Introduction​

Supporting Parallelism​

GPU​

CUDA​

Performing Matrix Operations​

Performance​

Conclusion​

Ressources​

Introduction​

Regression Versus Classification problems​

Gradient Descent​

Gradient Descent on Simple Linear Expression​

Gradient Decent on complex linear expression​

Stochastic Gradient Descent​

C# implementation​

Conclusion​

Source code​

Resources​

Introduction​

Compile the C# code into intermediate code​

Compilation of the intermediate code on-the-fly​

Virtual Machine​

Just-In-Time compiler​

Conclusion​

Resources​

Introduction

Neural Network Architecture

1. Input Layer

2. Hidden Layer

3. Output Layer

Neural network training algorithm

1. Initialize the parameters

2. Forward propagation

3. Calculate the error.

4. Backpropagation

Convolutional Neural Network (CNN)

1. Convolutional Layer

Padding parameter

Stride parameter

Forward propagation

Backward propagation

2. Max Pooling layer

Forward propagation

Backward propagation

3. Softmax layer

Forward propagation

Backward propagation

C# implementation

Conclusion

Resources

Introduction

Supporting Parallelism

GPU

CUDA

Performing Matrix Operations

Performance

Conclusion

Ressources

Introduction

Regression Versus Classification problems

Gradient Descent

Gradient Descent on Simple Linear Expression

Gradient Decent on complex linear expression

Stochastic Gradient Descent

C# implementation

Conclusion

Source code

Resources

Introduction

Compile the C# code into intermediate code

Compilation of the intermediate code on-the-fly

Virtual Machine

Just-In-Time compiler

Conclusion

Resources