General Rules of Matrix Differentiation
Consider two vectors \(x\in \mathbb{R}^{d_{I}}\) and \(y\in \mathbb{R}^{d_{O}}\) related by \(y=Wx\), where \(W\in \mathbb{R}^{d_{O}\times d_{I}}\).
\begin{equation} \begin{aligned} &\frac{dy}{dx}\in\mathbb{R}^{d_{O}\times d_{I}}\\ &\frac{dy}{dx}=W \end{aligned} \end{equation} \begin{equation} \begin{aligned} &\frac{dy}{dW}\in\mathbb{R}^{d_{O}\times d_{O}\times d_{I}}\\ &\frac{dy}{dW}=I_{d_{O}}\otimes x^{T} \end{aligned} \end{equation}If we factor \(W=MN\), where \(M\in \mathbb{R}^{d_{O}\times r}\) and \(N\in \mathbb{R}^{r\times d_{I}}\), we have
\begin{equation} \begin{aligned} &\frac{dy}{dM}\in\mathbb{R}^{d_{O}\times d_{O}\times r}\\ &\frac{dy}{dM}=I_{d_{O}}\otimes (Nx)^{T}=I_{d_{O}}\otimes x^{T}N^{T} \end{aligned} \end{equation}and
\begin{equation} \begin{aligned} &\frac{dy}{dN}\in\mathbb{R}^{d_{O}\times r\times d_{I}}\\ &\frac{dy}{dN}=\frac{dy}{d(Nx)}\cdot \frac{d(Nx)}{dN}=M\cdot \left(I_{r}\otimes x^{T}\right) \end{aligned} \end{equation}The gradient is not the same object as the derivative: it can be viewed as the transpose of the derivative.
\begin{equation} \label{eq:3} \nabla_{x}y=\left(\frac{\partial y}{\partial x}\right)^{T}. \end{equation}For derivatives, the chain rule reads:
\begin{equation} \frac{dL}{d \theta}=\frac{dL}{df}\cdot \frac{df}{d\theta}, \end{equation}with shape \(\frac{dL}{d \theta}\in \mathbb{R}^{1\times N_{\theta}}\) for a scalar \(L\), whereas for gradients the factors appear in the reverse order:
\begin{equation} \begin{aligned} &\nabla_{\theta}L\in \mathbb{R}^{N_{\theta}\times 1}\\ &\nabla_{\theta}L=\nabla_{\theta}f\cdot\nabla_{f}L \end{aligned} \end{equation}
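The formulas above can be checked numerically. The following is a minimal NumPy sketch, not a definitive implementation. It assumes the index convention \((dy/dW)_{ijk}=\partial y_{i}/\partial W_{jk}\) and a row-major flattening of the last two indices when comparing against the Kronecker-product form. The sizes, the helper `jacobian_of_linear_map`, and the loss \(L=\tfrac{1}{2}\lVert y\rVert^{2}\) with \(\theta=x\) are illustrative choices that do not come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
d_O, d_I, r = 3, 4, 2                 # illustrative sizes (assumed)
M = rng.normal(size=(d_O, r))
N = rng.normal(size=(r, d_I))
x = rng.normal(size=d_I)
W = M @ N
y = W @ x


def jacobian_of_linear_map(f, shape):
    """Hypothetical helper: Jacobian of a map f that is linear in a matrix A.

    Returns J with J[i, j, k] = d f(A)_i / d A[j, k]. Because f is linear,
    feeding it the basis matrix E_jk gives that slice exactly.
    """
    out_shape = f(np.zeros(shape)).shape
    J = np.zeros(out_shape + shape)
    for idx in np.ndindex(*shape):
        E = np.zeros(shape)
        E[idx] = 1.0
        J[(slice(None),) + idx] = f(E)
    return J


# Closed forms from the text, written with explicit indices.
dy_dW = np.einsum("ij,k->ijk", np.eye(d_O), x)       # delta_ij * x_k
dy_dM = np.einsum("ij,k->ijk", np.eye(d_O), N @ x)   # delta_ij * (N x)_k
dy_dN = np.einsum("ij,k->ijk", M, x)                 # M_ij * x_k

# Compare with Jacobians built entry by entry.
assert np.allclose(dy_dW, jacobian_of_linear_map(lambda A: A @ x, W.shape))
assert np.allclose(dy_dM, jacobian_of_linear_map(lambda A: A @ N @ x, M.shape))
assert np.allclose(dy_dN, jacobian_of_linear_map(lambda A: M @ A @ x, N.shape))

# Kronecker-product forms, with the last two indices flattened row-major.
x_row = x[None, :]                                   # x^T as a 1 x d_I row
assert np.allclose(np.kron(np.eye(d_O), x_row), dy_dW.reshape(d_O, -1))
assert np.allclose(M @ np.kron(np.eye(r), x_row), dy_dN.reshape(d_O, -1))

# Derivative layout versus gradient layout for L = 0.5 * ||y||^2, theta = x.
dL_df = y[None, :]                                   # 1 x d_O
df_dx = W                                            # d_O x d_I
dL_dx = dL_df @ df_dx                                # 1 x d_I

grad_f_L = y[:, None]                                # d_O x 1
grad_x_f = W.T                                       # d_I x d_O
grad_x_L = grad_x_f @ grad_f_L                       # d_I x 1

assert np.allclose(grad_x_L, dL_dx.T)
print(dy_dW.shape, dy_dM.shape, dy_dN.shape, dL_dx.shape, grad_x_L.shape)
```

The basis-matrix construction is exact here only because \(y\) is linear in each of \(W\), \(M\), and \(N\) separately. For a nonlinear map, one would need finite differences or automatic differentiation instead.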