[WIP] Optimizing matmul on CPU
Introduction and Setup

This blog starts with a naive implementation of matmul in C and optimizes it one step at a time. I am using the following machine with a 10-core CPU (4 Performance cores + 6 Efficiency cores).

Food for thought: Is matrix multiplication on CPU compute bound or memory bound? Think about it…

Algorithmic complexity of Matrix Multiplication: Calculating the FLOPs required

Matrix multiplication is ubiquitous in many areas of computer science. As the matrices grow in size, the number of FLOPs required to compute the matmul grows cubically[1]. Let's see how. ...
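To make the cubic growth concrete, here is a minimal sketch of a triple-loop matmul that counts its own floating-point operations. It is an illustration only and not necessarily the implementation this post builds on: the names `matmul_naive` and `matmul_flops`, the row-major layout, and the `float` element type are all assumptions. It also follows the common convention of counting each multiply and each add as one FLOP, including the first add into a zeroed accumulator.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: C (m x n) = A (m x k) * B (k x n), all row-major.
 * Returns the number of floating-point operations it performed. */
unsigned long long matmul_naive(const float *a, const float *b, float *c,
                                size_t m, size_t k, size_t n)
{
    assert(a && b && c);
    unsigned long long flops = 0;
    for (size_t i = 0; i < m; i++) {          /* rows of A and C */
        for (size_t j = 0; j < n; j++) {      /* columns of B and C */
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++) {  /* shared dimension */
                acc += a[i * k + p] * b[p * n + j]; /* 1 multiply + 1 add */
                flops += 2;
            }
            c[i * n + j] = acc;
        }
    }
    return flops;
}

/* Closed form for the same count under this convention: 2 * m * k * n. */
unsigned long long matmul_flops(size_t m, size_t k, size_t n)
{
    return 2ULL * m * k * n;
}
```

For square matrices with m = k = n = N the count is 2N³, so doubling N multiplies the work by 8: going from N = 1024 to N = 2048 takes the total from about 2.1 GFLOP to about 17.2 GFLOP.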