Switch Transformer - Sparse Routed Networks/MoEs
In this short post, I am going to talk about the Switch Transformer paper.

Background and Architecture
To begin: in dense models, every parameter/neuron of the network is activated during the forward pass. Hence, deriving more performance by increasing the model size also requires more computation; implicitly, more model parameters => more computation performed per token. A sparse-routed sketch of this contrast follows below. Tangential to this, work by Kaplan et al....
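To make the contrast concrete, here is a minimal PyTorch sketch (my own illustrative code, not the paper's implementation; the class names `DenseFFN` and `SwitchFFN` are hypothetical). In the dense block every parameter participates for every token, while the switch-style block routes each token to a single expert, so adding experts grows the parameter count without growing per-token compute. The real Switch Transformer additionally uses a capacity factor and a load-balancing auxiliary loss, which this sketch omits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseFFN(nn.Module):
    """Dense feed-forward block: every parameter touches every token."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ff)
        self.w_out = nn.Linear(d_ff, d_model)

    def forward(self, x):
        # Per-token FLOPs grow linearly with d_ff, i.e. with parameter count.
        return self.w_out(F.relu(self.w_in(x)))

class SwitchFFN(nn.Module):
    """Switch-style top-1 routed FFN: each token uses only one expert."""
    def __init__(self, d_model, d_ff, num_experts):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            DenseFFN(d_model, d_ff) for _ in range(num_experts)
        )

    def forward(self, x):
        # x: (tokens, d_model). The router picks one expert per token,
        # so more experts means more parameters but not more per-token compute.
        gate = F.softmax(self.router(x), dim=-1)   # (tokens, num_experts)
        gate_val, expert_idx = gate.max(dim=-1)    # top-1 routing decision
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():
                # Scale by the gate probability so routing stays differentiable.
                out[mask] = gate_val[mask, None] * expert(x[mask])
        return out

tokens = torch.randn(8, 64)                        # 8 tokens, d_model = 64
print(DenseFFN(64, 256)(tokens).shape)             # torch.Size([8, 64])
print(SwitchFFN(64, 256, num_experts=4)(tokens).shape)
```

With this framing, the question the paper tackles is how to keep growing parameters (more experts) while holding the compute done per token roughly fixed.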