Summary

![Multi-head attention overview](multi_head_attention.drawio.png)

$$
\begin{aligned}
&\text{input: } X \in \mathbb{R}^{N \times d_{model}}\\
&Q = XW_Q,\quad K = XW_K,\quad V = XW_V\\
&\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
\end{aligned}
$$

The $h$ head outputs are then concatenated and multiplied by the output projection $W_O$.
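A minimal end-to-end sketch of this pipeline (NumPy; the sizes `N`, `d_model`, `h` and the random weights are illustrative assumptions, not from the original):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, W_Q, W_K, W_V, W_O, h):
    # X: (N, d_model); W_Q/W_K/W_V: h matrices of shape (d_model, d_k); W_O: (h*d_k, d_model)
    heads = []
    for i in range(h):
        Q = X @ W_Q[i]                       # (N, d_k)
        K = X @ W_K[i]                       # (N, d_k)
        V = X @ W_V[i]                       # (N, d_k)
        d_k = Q.shape[-1]
        scores = Q @ K.T / np.sqrt(d_k)      # (N, N) scaled dot products
        heads.append(softmax(scores) @ V)    # (N, d_k) per-head output
    # Concatenate the heads and project back to d_model with W_O.
    return np.concatenate(heads, axis=-1) @ W_O

# Assumed example sizes: N=4 tokens, d_model=8, h=2 heads.
N, d_model, h = 4, 8, 2
d_k = d_model // h
rng = np.random.default_rng(0)
X = rng.normal(size=(N, d_model))
W_Q = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_K = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_V = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
W_O = rng.normal(size=(h * d_k, d_model))
print(multi_head_attention(X, W_Q, W_K, W_V, W_O, h).shape)  # (4, 8), same as X
```

Each head attends in its own $d_k$-dimensional subspace; concatenating and projecting with $W_O$ restores the $d_{model}$ width.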

Input

$$
X \in \mathbb{R}^{N \times d_{model}}
$$

$N$ is the sequence length and $d_{model}$ is the model (embedding) dimension.

Computing Q, K, V

$$
Q = XW_Q,\quad K = XW_K,\quad V = XW_V
$$

These are obtained by multiplying the input $X$ by $W_Q, W_K, W_V$, each of size $d_{model} \times (d_{model}/h)$, where $h$ is the number of heads; each head therefore works in $d_k = d_{model}/h$ dimensions. A quick shape check appears below.
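A minimal shape check of one head's projection (NumPy; the dimensions are assumed for illustration):

```python
import numpy as np

N, d_model, h = 4, 8, 2        # assumed example sizes
d_k = d_model // h             # per-head dimension
rng = np.random.default_rng(0)
X = rng.normal(size=(N, d_model))
W_Q = rng.normal(size=(d_model, d_k))  # one head's query projection
Q = X @ W_Q
print(Q.shape)  # (4, 4): each head sees an N x d_k slice of the model width
```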

Computing the Softmax

$$
\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$$

Each row of $QK^T$ holds the dot products of one query with every key; dividing by $\sqrt{d_k}$ keeps these scores from growing with the head dimension, which would otherwise push the softmax into saturated regions with small gradients.
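A sketch of this step as a standalone function, assuming $Q$, $K$, $V$ of shape $(N, d_k)$:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (N, N): similarity of each query to each key
    # Row-wise softmax turns scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V               # (N, d_k): weighted sum of values
```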