Switch Transformer - Sparse Routed Networks/MoEs
In this short post, I am going to talk about the Switch Transformer paper.

Background and Architecture
To begin: in dense models, every parameter/neuron of the network is activated during the forward pass. Hence, deriving more performance by increasing the model size also requires more computation; implicitly, more model parameters => more computation performed per token. A sparse-routed sketch of this contrast follows below. Tangential to this, work by Kaplan et al....
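To make the contrast concrete, here is a minimal PyTorch sketch (my own illustrative code, not the paper's implementation; the class names `DenseFFN` and `SwitchFFN` are hypothetical). In the dense block every parameter participates for every token, while the switch-style block routes each token to a single expert, so adding experts grows the parameter count without growing per-token compute. The real Switch Transformer additionally uses a capacity factor and a load-balancing auxiliary loss, which this sketch omits.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseFFN(nn.Module):
    """Dense feed-forward block: every parameter touches every token."""
    def __init__(self, d_model, d_ff):
        super().__init__()
        self.w_in = nn.Linear(d_model, d_ff)
        self.w_out = nn.Linear(d_ff, d_model)

    def forward(self, x):
        # Per-token FLOPs grow linearly with d_ff, i.e. with parameter count.
        return self.w_out(F.relu(self.w_in(x)))

class SwitchFFN(nn.Module):
    """Switch-style top-1 routed FFN: each token uses only one expert."""
    def __init__(self, d_model, d_ff, num_experts):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            DenseFFN(d_model, d_ff) for _ in range(num_experts)
        )

    def forward(self, x):
        # x: (tokens, d_model). The router picks one expert per token,
        # so more experts means more parameters but not more per-token compute.
        gate = F.softmax(self.router(x), dim=-1)   # (tokens, num_experts)
        gate_val, expert_idx = gate.max(dim=-1)    # top-1 routing decision
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = expert_idx == e
            if mask.any():
                # Scale by the gate probability so routing stays differentiable.
                out[mask] = gate_val[mask, None] * expert(x[mask])
        return out

tokens = torch.randn(8, 64)                        # 8 tokens, d_model = 64
print(DenseFFN(64, 256)(tokens).shape)             # torch.Size([8, 64])
print(SwitchFFN(64, 256, num_experts=4)(tokens).shape)
```

With this framing, the question the paper tackles is how to keep growing parameters (more experts) while holding the compute done per token roughly fixed.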