Method

Variational Lower Bound

Nź°œģ˜ ė°ģ“ķ„° ķ¬ģøķŠøģ—ģ„œ ėŖØėøģ“ ė°ģ“ķ„° ģƒ˜ķ”Œģ„ ģƒģ„±ķ•˜ėŠ” log likelihoodģø log⁔pĪø(x(1),x(2),⋯,x(N))=āˆ‘i=1Nlog⁔(pĪø(x(1)))\log p_{\bm \theta}(\bold x^{(1)},\bold x^{(2)},\bold \cdots, x^{(N)})=\sum^N_{i=1}\log(p_{\bm \theta}(\bold x^{(1)}))넼 ģµœģ ķ™”ķ•˜źø° ģœ„ķ•“ log⁔(pĪø(x(i)))\log(p_{\bm \theta}(\bold x^{(i)}))넼 계산핓야 ķ•œė‹¤(Maximum Likelihood).

ź³„ģ‚°ģ“ ė¶ˆź°€ėŠ„ķ•“ ź·øėŒ€ė”œėŠ” ģµœģ ķ™”ķ•  순 ģ—†ģ§€ė§Œ, z\bold z ģƒģ„±ģ— ģ‚¬ģš©ė˜ėŠ” ģøģ½”ė” qĻ•(z∣x)q_{\bm \phi}(\bold z|\bold x)넼 ģ ģš©ķ•“ log-likelihoodģ˜ lower bound(ķ•˜ķ•œ)ģ€ 계산할 수 ģžˆźø°ģ—, ģ“ė„¼ ģµœėŒ€ķ™”ķ•˜ė©“ likelihoodź°€ ģµœėŒ€ķ™”ė  ź°€ėŠ„ģ„±ģ“ ģžˆė‹¤.

ģ—¬źø°ģ„œė¶€ķ„°, ķ•˜ė‚˜ģ˜ ė°ģ“ķ„°ķ¬ģøķŠøx(i)\bold x^{(i)}ėŠ” ź°„ė‹Øķžˆ x\bold x 딜 ķ‘œķ˜„ķ•˜ź² ė‹¤. ė˜ķ•œ ģ‹ģ„ ķ•˜ė‚˜ģ”©ķ•˜ė‚˜ģ”© 정리핓가며 ģ•Œźø°ģ‰½ź²Œ ģ‚“ķŽ“ė³“ģž.

$\log p_{\bm \theta}(\bold x)$ does not depend on $\bold z$, and $q_{\bm \phi}(\bold z|\bold x)$ integrates to 1 over $\bold z$, so the log-likelihood can be rewritten as follows.

$$\log p_{\bm \theta}(\bold x)=\int q_{\bm \phi}(\bold z|\bold x)\log p_{\bm \theta}(\bold x)\,d\bold z$$

ė² ģ“ģ¦ˆ 정리넼 ģ“ģš©ķ•“ ģ •ė¦¬ķ•œė‹¤.

=∫qĻ•(z∣x)log⁔(pĪø(x,z)pĪø(z∣x))dz=∫qĻ•(z∣x)log⁔(pĪø(x,z))dzāˆ’āˆ«qĻ•(z∣x)log⁔(pĪø(z∣x))dz=\int q_{\bm \phi}(\bold z|\bold x) \log \left(\frac{p_{\bm \theta}(\bold x,\bold z)}{p_{\bm \theta}(\bold z|\bold x)}\right)d\bold z= \int q_{\bm \phi}(\bold z|\bold x) \log \left(p_{\bm \theta}(\bold x,\bold z)\right)d\bold z -\int q_{\bm \phi}(\bold z|\bold x) \log \left(p_{\bm \theta}(\bold z|\bold x)\right)d\bold z

This can be written compactly in terms of cross entropy, which is defined as $H(p,q)=-\mathbb E_p[\log q]=-\int p \log q$.

$$=-H(q_{\bm \phi}(\bold z|\bold x),p_{\bm \theta}(\bold x,\bold z))+H(q_{\bm \phi}(\bold z|\bold x),p_{\bm \theta}(\bold z|\bold x))$$

ķ•˜ķ•œģ„ ź³„ģ‚°ź°€ėŠ„ķ•˜ź²Œ ė§Œė“¤źø° ģœ„ķ•“, KL Divergenceź¼“ģ„ ģœ ė„ķ•œė‹¤. KL DivergenceėŠ” DKL(p∣∣q)=H(p,q)āˆ’H(p)D_{KL}(p||q)=H(p,q)-H(p) ģ“ė‹¤.

=āˆ’H(qĻ•(z∣x),pĪø(x,z))+H(qĻ•(z∣x),pĪø(z∣x))āˆ’H(qĻ•(z∣x))+H(qĻ•(z∣x))=-H(q_{\bm \phi}(\bold z|\bold x),p_{\bm \theta}(\bold x,\bold z))+H(q_{\bm \phi}(\bold z|\bold x),p_{\bm \theta}(\bold z|\bold x))-H(q_{\bm \phi}(\bold z|\bold x))+H(q_{\bm \phi}(\bold z|\bold x))
=āˆ’H(qĻ•(z∣x),pĪø(x,z))+DKL(qĻ•(z∣x)∣∣pĪø(z∣x))+H(qĻ•(z∣x))=-H(q_{\bm \phi}(\bold z|\bold x),p_{\bm \theta}(\bold x,\bold z))+D_{KL}(q_{\bm \phi}(\bold z|\bold x)||p_{\bm \theta}(\bold z|\bold x))+H(q_{\bm \phi}(\bold z|\bold x))

ģ“ ģ‹ģ— ė‚˜ķƒ€ė‚œ KL Divergence(근사적 ģøģ½”ė”ģ™€ ģ‹¤ģ œ ģøģ½”ė”ģ˜ ė¶„ķ¬ ģ°Øģ“)ėŠ” pĪø(z∣x)p_{\bm \theta}(\bold z|\bold x) 넼 ģ•Œ 수 ģ—†ģœ¼ėÆ€ė”œ 구할 수 ģ—†ģ§€ė§Œ, KL DivergenceėŠ” ķ•­ģƒ ģ–‘ģˆ˜ģ“ėÆ€ė”œ lower bound넼 구할 수 ģžˆė‹¤.

log⁔(pĪø(x))ā‰„āˆ’H(qĻ•(z∣x),pĪø(x,z))+H(qĻ•(z∣x))\log(p_{\bm \theta}(\bold x))\ge -H(q_{\bm \phi}(\bold z|\bold x),p_{\bm \theta}(\bold x,\bold z))+H(q_{\bm \phi}(\bold z|\bold x))
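The decomposition above can be checked numerically on a toy model. The sketch below is not from the source: it assumes a discrete latent with three states and made-up probabilities, chosen only so that the true posterior is computable and the identity $\log p_{\bm \theta}(\bold x) = \text{lower bound} + D_{KL}$ can be verified directly.

```python
import numpy as np

# Toy model with a discrete latent z in {0, 1, 2} and one fixed observation x.
# All probabilities below are invented for illustration.
prior = np.array([0.5, 0.3, 0.2])      # p(z)
lik = np.array([0.10, 0.60, 0.30])     # p(x | z) for the fixed x
joint = prior * lik                    # p(x, z)
evidence = joint.sum()                 # p(x)
posterior = joint / evidence           # p(z | x); computable only because z is tiny

q = np.array([0.2, 0.5, 0.3])          # an arbitrary approximate posterior q(z | x)

def cross_entropy(p, r):
    """H(p, r) = -sum_z p(z) log r(z); r does not have to be normalized."""
    return -np.sum(p * np.log(r))

entropy_q = cross_entropy(q, q)                       # H(q)
lower_bound = -cross_entropy(q, joint) + entropy_q    # -H(q, p(x,z)) + H(q)
kl = cross_entropy(q, posterior) - entropy_q          # D_KL(q || p(z|x))

print("log p(x)        :", np.log(evidence))
print("lower bound + KL:", lower_bound + kl)
print("lower bound     :", lower_bound)
```

Because the KL term is non-negative, the printed lower bound never exceeds $\log p_{\bm \theta}(\bold x)$, and the two coincide only when `q` equals `posterior`.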

The lower bound can be put in a form that is easier to compute. Since $p_{\bm \theta}(\bold x,\bold z)=p_{\bm \theta}(\bold x| \bold z)p_{\bm \theta}(\bold z)$, the cross entropy splits into

$$=-H(q_{\bm \phi}(\bold z|\bold x),p_{\bm \theta}(\bold x|\bold z))-H(q_{\bm \phi}(\bold z|\bold x),p_{\bm \theta}(\bold z))+H(q_{\bm \phi}(\bold z|\bold x))$$

and a KL divergence appears once more.

$$=-H(q_{\bm \phi}(\bold z|\bold x),p_{\bm \theta}(\bold x|\bold z))-D_{KL}(q_{\bm \phi}(\bold z|\bold x)||p_{\bm \theta}(\bold z))$$
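The last step can be spelled out as follows. It only applies the definition $D_{KL}(p||q)=H(p,q)-H(p)$ given earlier, with the approximate posterior in the first slot and the prior in the second:

$$-H(q_{\bm \phi}(\bold z|\bold x),p_{\bm \theta}(\bold z))+H(q_{\bm \phi}(\bold z|\bold x))=-\left[H(q_{\bm \phi}(\bold z|\bold x),p_{\bm \theta}(\bold z))-H(q_{\bm \phi}(\bold z|\bold x))\right]=-D_{KL}(q_{\bm \phi}(\bold z|\bold x)||p_{\bm \theta}(\bold z))$$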

Written as an expectation, as in the paper, this becomes

$$\log p_{\bm \theta}(\bold x)\ge \mathbb E_{\bold z\sim q_{\bm \phi}(\bold z|\bold x)}[\log p_{\bm \theta}(\bold x|\bold z)]-D_{KL}(q_{\bm \phi}(\bold z|\bold x)||p_{\bm \theta}(\bold z))$$

The prior $p_{\bm \theta}(\bold z)$ can be fixed by assuming a distribution, $q_{\bm \phi}(\bold z|\bold x)$ is the encoder used as the approximation, and $p_{\bm \theta}(\bold x|\bold z)$ is the decoder to be trained, so every term is tractable. We can therefore differentiate the bound and update both $\bm \theta$ and $\bm \phi$.
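As an illustration of why the KL term is tractable, the sketch below assumes a diagonal Gaussian encoder and a standard normal prior (a common choice, but an assumption here, as are the parameter values). In that case the KL divergence has a closed form, which the code compares against a Monte Carlo estimate of $\mathbb E_q[\log q - \log p]$.

```python
import numpy as np

def kl_diag_gaussian_to_std_normal(mu, sigma):
    """KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(mu**2 + sigma**2 - 1.0 - 2.0 * np.log(sigma))

rng = np.random.default_rng(0)
mu = np.array([0.5, -1.0])      # illustrative encoder mean
sigma = np.array([0.8, 1.5])    # illustrative encoder standard deviation

# Monte Carlo check: KL = E_q[log q(z) - log p(z)] with z ~ q
z = mu + sigma * rng.standard_normal((200_000, 2))
log_q = -0.5 * np.sum(((z - mu) / sigma) ** 2 + 2 * np.log(sigma) + np.log(2 * np.pi), axis=1)
log_p = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=1)

kl_exact = kl_diag_gaussian_to_std_normal(mu, sigma)
kl_mc = np.mean(log_q - log_p)
print("closed form:", kl_exact)
print("Monte Carlo:", kl_mc)
```

With the closed form available, only the expected log-likelihood term needs to be estimated by sampling.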

However, the naive Monte Carlo gradient estimator that could be used to update $\bm \phi$, which estimates the expectation of a function by averaging over random samples, has variance too high to be practical for optimization.

Stochastic Gradient VB estimator & Auto-Encoding VB

This section introduces a practical estimator that optimizes the lower bound well. We optimize $q_{\bm \phi}(\bold z|\bold x)$, but the technique also applies to $q_{\bm \phi}(\bold z)$. To use gradient descent, the gradient of the loss must backpropagate all the way to the encoder, so every step on that path must be differentiable. The encoder output, however, is obtained by drawing one sample $\bold{\tilde z}\sim q_{\bm \phi}(\bold z|\bold x)$ from a distribution. Sampling is not a deterministic equality ($=$), so it cannot be differentiated with respect to $\bm \phi$; the chain rule breaks at this step and gradient descent cannot be applied to the encoder.

ė”°ė¼ģ„œ, ėÆøė¶„ź°€ėŠ„ķ•œ ķ•Øģˆ˜ģø gĻ•(ϵ,x)g_{\bm \phi}(\bm \epsilon,\mathbf x)넼 ģ“ģš©ķ•“ reparameterize(ģž¬ė§¤ź°œķ™”)ķ•œė‹¤. ϵ\bm \epsilonėŠ” ė…øģ“ģ¦ˆģ— ėŒ€ķ•œ ė³€ģˆ˜ģ“ė©° 결딠적으딜

z~=gĻ•(ϵ,x),ϵ∼p(ϵ)\bold{\tilde z}=g_{\bm \phi}(\bm \epsilon,\mathbf x), \quad \bm \epsilon\sim p(\bm \epsilon)

딜 ģž¬ė§¤ź°œķ™”ķ•œė‹¤. p(ϵ)p(\bm \epsilon)ėŠ” ė…øģ“ģ¦ˆģ— ėŒ€ķ•œ ģ ģ ˆķ•œ ķ™•ė„ ė¶„ķ¬ģ“ė‹¤. z넼 ķ•Øģˆ˜ė”œ 두고 ė…øģ“ģ¦ˆė§Œ 다넸 ė¶„ķ¬ģ—ģ„œ ė½‘ėŠ” ķŽøė²•ģ“ė‹¤. ė“±ģ‹ģ“ ė˜ėÆ€ė”œ 미분 ź°€ėŠ„ķ•“ģ§„ė‹¤. ģ“ė„¼ ģ“ģš©ķ•œ Monte Carlo źø°ėŒ“ź°’ estimatorėŠ” ė‹¤ģŒź³¼ 같다.

EqĻ•(z∣x(i))[f(z)]=Ep(ϵ)[f(gĻ•(ϵ,x(i)))]ā‰ƒ1Lāˆ‘l=1Lf(gĻ•(ϵ(l),x(i)))\mathbb E_{q_{\bm \phi}(\bold z|\bold x^{(i)})}[f(\bold z)] = \mathbb E_{p(\bm\epsilon)}[f(g_{\bm \phi}(\bm \epsilon,\mathbf x^{(i)}))]\simeq \frac 1 L \sum_{l=1}^{L}f(g_{\bm \phi}(\bm \epsilon^{(l)},\mathbf x^{(i)}))

You may have seen $\pi$ estimated by Monte Carlo sampling. In the same way, the more samples $\bm \epsilon^{(l)}$ we draw, the closer the estimate gets to the true expectation. Previously we had to sample the high-dimensional $\bold z$ directly, and the resulting variance made the estimator inefficient; with reparameterization the variance can be reduced, because sampling is done from a single known distribution.
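The variance difference can be seen in a small experiment. This is a minimal sketch under assumptions not in the source: a one-dimensional Gaussian $q(z)=\mathcal N(\mu,\sigma^2)$ with $g(\epsilon)=\mu+\sigma\epsilon$, a test function $f(z)=z^2$, and the score-function estimator standing in for the "naive" Monte Carlo gradient estimator. The exact gradient of $\mathbb E[f(z)]$ with respect to $\mu$ is $2\mu$.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 1.0       # parameters of q(z) = N(mu, sigma^2), chosen arbitrarily
n_trials, L = 2000, 10     # repeated estimates, and samples per estimate

def f(z):
    return z**2            # test function; d/dmu E[f(z)] = 2 * mu exactly

eps = rng.standard_normal((n_trials, L))
z = mu + sigma * eps       # reparameterized samples z = g(eps)

# Naive (score-function) estimator: average of f(z) * d/dmu log q(z)
score_grads = np.mean(f(z) * (z - mu) / sigma**2, axis=1)

# Reparameterized estimator: average of d/dmu f(mu + sigma * eps) = 2 * z
reparam_grads = np.mean(2.0 * z, axis=1)

print("true gradient  :", 2 * mu)
print("score-function : mean %.3f, variance %.3f" % (score_grads.mean(), score_grads.var()))
print("reparameterized: mean %.3f, variance %.3f" % (reparam_grads.mean(), reparam_grads.var()))
```

Both estimators are centered on the true gradient, but in this setup the reparameterized one has clearly smaller variance across repeated estimates.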

ģ“ ė°©ė²•ģ„ variational lower bound에 ė˜‘ź°™ģ“ ģ ģš©ķ•“ ģµœģ ķ™”ķ•œė‹¤. ėŒ€ģ‹  Lower Bound넼 ģƒ˜ķ”Œė§ķ•  ė•Œ ϵ\bm \epsilonģ„ ģ¶”ģ¶œķ•“ ź³„ģ‚°ķ•˜ź³ , SGDė‚˜ Adagrad ė“±ģ˜ optimizer넼 ģ“ģš©ķ•“ ģˆ˜ė “ķ•  ė•Œ ź¹Œģ§€ ģµœģ ķ™”ķ•œė‹¤.

ėÆøė‹ˆė°°ģ¹˜ 크기가 큰 ź²½ģš°ėŠ” Lģ“ 1ģ“ģ–“ė„ ģƒź“€ 없다.

ėŖ©ģ ķ•Øģˆ˜ė„¼ Auto-Encoder처럼 ģƒź°ķ•˜ė©“, 두 번째 ķ•­ log⁔pĪø(x(i)∣z(i,l))\log p_{\bm \theta}(\bold x^{(i)}|\bold z^{(i,l)})ėŠ” ģž…ė „ģœ¼ė”œė¶€ķ„° z(i,l)=gĻ•(ϵ(l),x(i))\bold z^{(i,l)}=g_{\bm \phi}(\bm \epsilon^{(l)},\mathbf x^{(i)}) ģ—ģ„œ ģƒ˜ķ”Œėœ z딜 x넼 ė³µģ›ķ•˜ėŠ” ģ˜¤ģ°Øģø reconstruction errorģ“ė‹¤. KL Divergence ķ•­ģ€ posterior ė¶„ķ¬ź°€ prior ė¶„ķ¬ģ—ģ„œ 멀얓지지 ģ•Šė„ė” ź·œģ œķ•˜ėŠ” regularizerģ“ė‹¤.

Reparameterization Trick

ģµœģ ķ™” ź°€ėŠ„ķ•˜ė„ė” 문제넼 ķ•“ź²°ķ•˜źø° ģœ„ķ•“ qĻ•(z∣x)q_{\bm \phi}(\bold z|\bold x)ģœ¼ė”œė¶€ķ„° ģƒ˜ķ”Œė§ģ„ ėŒ€ģ²“ķ•˜ėŠ” ķ•Øģˆ˜ė„¼ ģ •ģ˜ķ–ˆė‹¤. z\bold zź°€ ķ™•ė„ ģ ģ“ģ§€ ģ•Šź³  deterministicķ•˜ź²Œ ķ‘œķ˜„ė  수 ģžˆė„ė” z=gĻ•(ϵ,x)\bold{z}=g_{\bm \phi}(\bm \epsilon,\mathbf x)딜 ģ •ģ˜ķ–ˆź³ , 볓씰 ė³€ģˆ˜ė”œĻµāˆ¼p(ϵ)\bm \epsilon\sim p(\bm \epsilon)넼 ė‘ģ—ˆė‹¤.

z=gĻ•(ϵ,x)\bold{z}=g_{\bm \phi}(\bm \epsilon,\mathbf x)ģ“ź³  ϵ∼p(ϵ)\bm \epsilon\sim p(\bm \epsilon) ģ“ėÆ€ė”œ ϵ\bm \epsilon변화에 ėŒ€ķ•œ p(ϵ)p(\bm \epsilon)ģ˜ ė¶„ķ¬ģ™€ z\bold z변화에 ėŒ€ķ•œ qĻ•(z∣x)q_{\bm \phi}(\bold z|\bold x) ģ˜ ė¶„ķ¬ėŠ” 같다(ģ¹˜ķ™˜ģ ė¶„).

p(ϵ)dϵ=qĻ•(z∣x)dzp(\bm \epsilon) d\bm \epsilon=q_{\bm \phi}(\bold z|\bold x)d\bold z

ģ“ė„¼ ģ ģš©ķ•˜ė©“, ėÆøė¶„ź°€ėŠ„ķ•œ estimator넼 구할 수 ģžˆė‹¤.

EqĻ•(z∣x(i))[f(z)]=∫qĻ•(z∣x)f(z)dz=∫p(ϵ)f(ϵ,x)dĻµā‰ƒ1Lāˆ‘l=1Lf(gĻ•(ϵ(l),x))\mathbb E_{q_{\bm \phi}(\bold z|\bold x^{(i)})}[f(\bold z)] = \int q_{\bm \phi}(\bold z|\bold x)f(\bold z)d\bold z=\int p(\bm \epsilon)f(\bm \epsilon,\mathbf x)d\bm \epsilon\simeq \frac 1 L \sum_{l=1}^{L}f(g_{\bm \phi}(\bm \epsilon^{(l)},\mathbf x))

For example, when the latent variable follows a Gaussian distribution, $z \sim \mathcal N(\mu,\sigma^2)$, a suitable reparameterization is

$$z=\mu+\sigma\cdot\epsilon,\qquad\epsilon\sim\mathcal N(0,1)$$

ģ“ ģ™øģ—ė„ reparameterizationģ„ ģ •ķ•˜ėŠ” ė°©ė²•ģ“ ģžˆė‹¤.

  1. ϵ\epsilon넼 ź· ė“±ė¶„ķ¬( ∼U(0,I)\sim \mathcal U(0,\bold I) )딜 두고, gĻ•(ϵ,x)g_{\bm \phi}(\bm \epsilon,\mathbf x)ź°€ ėˆ„ģ ė¶„ķ¬ķ•Øģˆ˜ģ˜ ģ—­ķ•Øģˆ˜ź°€ ė˜ė„ė” ķ•œė‹¤. ź· ė“±ė¶„ķ¬ė„¼ ź°€ģ§€ėŠ” ė³€ģˆ˜ė„¼ ėˆ„ģ ė¶„ķ¬ķ•Øģˆ˜ģ— ėŒ€ģž…ķ•˜ė©“ ķ™•ė„ ė¶„ķ¬ķ•Øģˆ˜ź°€ 되기 ė•Œė¬øģ“ė‹¤.

  2. location, scale딜 ģ •ķ•“ģ§€ėŠ” ė¶„ķ¬ķ•Øģˆ˜ė„¼ ģ •ķ•œė‹¤. ė¼ķ”Œė¼ģŠ¤, ķƒ€ģ›, T, ė”œģ§€ģŠ¤ķ‹±,ź· ė“±ė¶„ķ¬ ė“±ģ“ ģžˆė‹¤.

  3. 다넸 ė¶„ķ¬ė”œ ķ‘œķ˜„ ź°€ėŠ„ķ•œ ź²½ģš°ė„ ģžˆė‹¤. ė”œź·øģ •ź·œė¶„ķ¬, ė² ķƒ€, ģ¹“ģ“ģ œź³±, Fė¶„ķ¬ź°€ ģžˆė‹¤.
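The three strategies in the list can each be sketched with one concrete distribution. The specific choices below (Exponential for the inverse CDF, Laplace for location-scale, Log-Normal for composition) and all parameter values are illustrative assumptions; each sample mean is compared with the known mean of the target distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# 1. Inverse CDF: Exponential(rate) via g(eps) = -log(1 - eps) / rate, eps ~ U(0, 1)
rate = 2.0
u = rng.uniform(size=n)
z_exp = -np.log1p(-u) / rate

# 2. Location-scale: Laplace(loc, scale) via g(eps) = loc + scale * eps, eps ~ Laplace(0, 1)
loc, scale = 1.0, 0.5
z_lap = loc + scale * rng.laplace(size=n)

# 3. Composition: Log-Normal via g(eps) = exp(mu + sigma * eps), eps ~ N(0, 1)
mu, sigma = 0.0, 0.25
z_logn = np.exp(mu + sigma * rng.standard_normal(n))

print("Exponential mean:", z_exp.mean(), "expected", 1 / rate)
print("Laplace mean    :", z_lap.mean(), "expected", loc)
print("Log-Normal mean :", z_logn.mean(), "expected", np.exp(mu + sigma**2 / 2))
```

In every case the randomness comes from a fixed, parameter-free noise distribution, and the parameters enter only through a differentiable transformation.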
