Switch Transformer - Sparse Routed Networks/MoEs

In this short post, I am going to talk about the Switch Transformer paper. Background and Architecture: To begin, in dense models, every parameter/neuron of the network is activated during the forward pass. Hence, deriving more performance by increasing the model size requires more computation; implicitly, more model parameters => more computation performed per token. Related to this, the work of Kaplan et al. (2020) becomes especially relevant, as they identify training larger models on comparatively smaller amounts of data as the compute-optimal approach. ...
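
To make the dense-vs-sparse contrast concrete, here is a minimal sketch of top-1 (Switch-style) routing, where each token is processed by only one expert FFN, so per-token compute stays roughly fixed as experts are added. This is a rough illustration rather than the paper's implementation; the class name `SwitchRouter` and the toy dimensions are placeholders of mine.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchRouter(nn.Module):
    """Top-1 routing sketch: each token is sent to exactly one expert FFN,
    so adding experts grows parameters without growing per-token compute."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # routing logits per token
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)   # routing probabilities
        gate, expert_idx = probs.max(dim=-1)        # top-1 expert per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = expert_idx == i                  # tokens routed to expert i
            if mask.any():
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(8, 16)                          # 8 tokens, d_model = 16
layer = SwitchRouter(d_model=16, d_ff=32, num_experts=4)
print(layer(tokens).shape)                           # torch.Size([8, 16])
```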

June 17, 2024

[WIP] Learning CUDA Programming: A Primer

This post covers the basic foundation needed to learn GPU programming and/or parallel programming on CPUs. I will cover the architectural details of two of the several processors that power the modern-day computer: the CPU and the GPU. By the end of this post, one should have a good understanding of the following terms (in no particular order): chips, processors, microprocessors, cores, latency devices, throughput devices, clock speed, threads, processes, instructions, memory bandwidth, and the memory system. ...

March 31, 2024

An Introduction to Differential Privacy

I think differential privacy is beautiful! Why are we here? Protecting the privacy of data is important and far from trivial. To help make sense of things, the Fundamental Law of Information Recovery is useful; it states: overly accurate estimates of too many statistics can completely destroy (data) privacy. Another good incentive for why privacy matters is the ability of LLMs to memorize training data, an undesirable outcome as it risks leaking PII. ...
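
The Fundamental Law above is exactly the tension that differentially private mechanisms manage: answers are released with calibrated noise so that no single record can be pinned down. As a minimal sketch of that idea (not code from the post; the function name `laplace_mean`, the clipping bounds, and the toy data are assumptions of mine), a Laplace-mechanism mean might look like:

```python
import numpy as np

def laplace_mean(values: np.ndarray, lower: float, upper: float, epsilon: float) -> float:
    """Release the mean of `values` with epsilon-differential privacy via the
    Laplace mechanism. Clipping to [lower, upper] bounds how much one record
    can move the mean: the sensitivity is (upper - lower) / n."""
    n = len(values)
    clipped = np.clip(values, lower, upper)
    sensitivity = (upper - lower) / n
    noise = np.random.laplace(loc=0.0, scale=sensitivity / epsilon)
    return clipped.mean() + noise

# Smaller epsilon => more noise => less any individual record can be recovered.
ages = np.array([23, 35, 41, 29, 52, 61, 30, 45])
print(laplace_mean(ages, lower=0, upper=100, epsilon=0.5))
```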

March 25, 2024

PyTorch Datasets and Dataloaders

This piece of mine was featured on the Weights & Biases Fully Connected blog. It can be found here.

August 5, 2022