Blogs
Switch Transformer - Sparse Routed Networks/MoEs
In this short post, I discuss the Switch Transformer paper. Background and Architecture: in dense models, every parameter/neuron of the network is activated during the forward pass. Hence, deriving more performance by increasing the model size requires more computation; implicitly, more model parameters => more computation performed per token. Related to this, work by Kaplan et al....
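For readers new to the idea, here is a minimal sketch (not the paper's implementation; the class name `SwitchFFN` and all dimensions are illustrative) of the top-1 "switch" routing that decouples parameter count from per-token compute: each token is dispatched to exactly one expert, so adding experts grows the parameters without growing the work done per token.

```python
# Minimal top-1 ("switch") routing sketch in PyTorch.
import torch
import torch.nn as nn

class SwitchFFN(nn.Module):
    def __init__(self, d_model: int, d_ff: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # routing logits per token
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (num_tokens, d_model) -- batch and sequence dims flattened
        probs = torch.softmax(self.router(x), dim=-1)   # (num_tokens, num_experts)
        gate, expert_idx = probs.max(dim=-1)            # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():
                # Scale the expert output by its gate probability so the
                # router still receives gradient through the selection.
                out[mask] = gate[mask].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(8, 16)                         # 8 tokens, d_model = 16
layer = SwitchFFN(d_model=16, d_ff=32, num_experts=4)
print(layer(tokens).shape)                          # torch.Size([8, 16])
```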
[WIP] Learning CUDA Programming: A Primer
I want this post to be an entry point for anyone who wants to learn GPU programming without a formal background in compsci/comp arch (I belong to this category myself, so apologies in advance for any errors I might make; please email me so I can correct them). I will cover the architectural details of two of the several processors that power modern-day computers: the CPU and the GPU....
An Introduction to Differential Privacy
I think differential privacy is beautiful! Why are we here? Protecting the privacy of data is important and far from trivial. To help make sense of things, the Fundamental Law of Information Recovery is useful; it states: overly accurate estimates of too many statistics can completely destroy (data) privacy. Another good incentive for caring about privacy is the ability of LLMs to memorize data, an undesirable outcome since it risks leaking PII....
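For context (the full post presumably develops this), the standard definition due to Dwork et al.: a randomized mechanism $\mathcal{M}$ is $\varepsilon$-differentially private if, for all datasets $D$ and $D'$ differing in a single record and all sets of outputs $S$,

$$\Pr[\mathcal{M}(D) \in S] \le e^{\varepsilon} \, \Pr[\mathcal{M}(D') \in S].$$

Intuitively, no single individual's record can change the mechanism's output distribution by more than a factor of $e^{\varepsilon}$.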
PyTorch Datasets and Dataloaders
This piece of mine was featured on the Weights & Biases Fully Connected blog. It can be found here.