Switch Transformer - Sparse Routed Networks/MoEs

In this short post, I am going to talk about the Switch Transformer paper. To begin: in dense models, every parameter/neuron of the network is activated during the forward pass. Hence, deriving more performance by increasing the model size requires more computation; implicitly, more model parameters => more computation performed per token. Tangential to this, work by Kaplan et al....
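To make the contrast concrete, here is a minimal sketch (not the paper's implementation) of the Switch idea: a top-1 router sends each token to exactly one expert, so per-token compute stays roughly constant even as the expert count, and hence the parameter count, grows. The class and parameter names are my own, and the sketch omits the capacity factor and load-balancing loss from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwitchLayer(nn.Module):
    """Sketch of a Switch-style MoE layer: top-1 routing per token."""

    def __init__(self, d_model: int, num_experts: int):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, 4 * d_model),
                nn.ReLU(),
                nn.Linear(4 * d_model, d_model),
            )
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (tokens, d_model)
        probs = F.softmax(self.router(x), dim=-1)
        gate, idx = probs.max(dim=-1)  # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e
            if mask.any():
                # Scale by the gate value so the routing decision
                # remains differentiable end to end.
                out[mask] = gate[mask, None] * expert(x[mask])
        return out
```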

June 17, 2024

PyTorch Datasets and Dataloaders

This piece of mine was featured on the Weights & Biases Fully Connected blog. It can be found here.

August 5, 2022