Chapter 1 序論

Exercise 1.1 \(\mathbb{P}(A)>0, \mathbb{P}(B)>0\)とする． \[\mathbb{P}(A\mid B)=\mathbb{P}(B\mid A)\] が成り立つのために，\(\mathbb{P}(A), \mathbb{P}(B)\)および\(\mathbb{P}(A\cap B)\)が満たすべき条件を求めよ．

Solution. 本問はひっかけ問題である．条件付き確率の定義から，もとめる等号が成り立つことと \[\frac{\mathbb{P}(A\cap B)}{\mathbb{P}(B)}=\frac{\mathbb{P}(A\cap B)}{\mathbb{P}(A)}\] が成り立つことは同値である．もし\(\mathbb{P}(A\cap B)>0\)なら，両辺を\(\mathbb{P}(A\cap B)\)でわって，\(\mathbb{P}(A)=\mathbb{P}(B)\)を得る．じつは，これだけがこたえではなく，\(\mathbb{P}(A\cap B)=0\)なら，\(\mathbb{P}(A), \mathbb{P}(B)\)の値がなんであれ成り立つ．したがって，求める等号が成り立つのはつぎの２つの場合である．ひとつは \(\mathbb{P}(A\cap B)>0\)かつ\(\mathbb{P}(A)=\mathbb{P}(B)\)のとき．もうひとつは\(\mathbb{P}(A\cap B)=0\)となるときである．

Exercise 1.2 事象\(A, B\)は独立であり，\(\mathbb{P}(B)>0\)とする．また，\(\mathbb{P}(B\mid A\cup B)=2/3, \mathbb{P}(A\mid B)=1/2\)とする．このとき\(\mathbb{P}(B)\)を求めよ．

Solution. \(A\)と\(B\)が独立だから， \[\mathbb{P}(A\mid B)=\frac{\mathbb{P}(A\cap B)}{\mathbb{P}(B)}=\frac{\mathbb{P}(A)\mathbb{P}(B)}{\mathbb{P}(B)}=\mathbb{P}(A)\] となる．したがって\(\mathbb{P}(A)=\mathbb{P}(A^c)=1/2\)である．いっぽう， \[\mathbb{P}(A\cup B)=\mathbb{P}(A)+\mathbb{P}(A^c\cap B)=\mathbb{P}(A)+\mathbb{P}(A^c)\mathbb{P}(B)=\frac{1}{2}(1+\mathbb{P}(B))\] である．よって \[\mathbb{P}(B\mid A\cup B)=\frac{\mathbb{P}(B)}{\mathbb{P}(A\cap B)}=\frac{\mathbb{P}(B)}{(1+\mathbb{P}(B))/2}=\frac{2}{3}\] となり，かんたんな計算から\(\mathbb{P}(B)=1/2\)を得る．

Exercise 1.3 ある大学では，学生のうち\(60\%\)はスニーカーを履いている．一方，教職員のうち\(20\%\)がスニーカーを履いている．大学の構成員のうち，\(60\%\)が学生，\(40\%\)が教職員とする．その大学の，ある構成員はスニーカーを履いていたとする．このとき彼もしくは彼女が学生である確率を求めよ．

Solution. （20200312解答修正）大学の構成員を勝手に一人選んだとき，それが学生である事象を\(A\)，スニーカーを履いている事象を\(B\)とする．すると \[\mathbb{P}(A)=0.6, \mathbb{P}(B\mid A)=0.6, \mathbb{P}(B\mid A^c)=0.2\] である．また，求める確率は\(\mathbb{P}(A\mid B)\)である．条件付き確率の定義から \[\begin{align*} \mathbb{P}(A\mid B)&=\frac{\mathbb{P}(A\cap B)}{\mathbb{P}(B)}\\ &=\frac{\mathbb{P}(A\cap B)}{\mathbb{P}(A\cap B)+\mathbb{P}(A^c\cap B)}\\ &=\frac{\mathbb{P}(B\mid A)\mathbb{P}(A)}{\mathbb{P}(B\mid A)\mathbb{P}(A)+\mathbb{P}(B\mid A^c)\mathbb{P}(A^c)}\\ &=\frac{0.6\times 0.6}{0.6\times 0.6+0.2\times 0.4}=\frac{9}{11}. \end{align*}\]

Exercise 1.4 \(N\)は正の整数，\(\sigma^2\)は正の実数．観測\(x_1,\ldots, x_N\)は独立で\(\mathcal{N}(\theta,\sigma^2)\)に従い，\(\theta\)の事前分布を\(\mathcal{N}(0,\sigma^2)\)とする．このとき\(\theta\)の事後分布を求めよ．

Solution. 観測をまとめて\(x^N\)と書くと，尤度は \[\begin{align*} p(x^N\mid \theta)&=\prod_{n=1}^N\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{(x_n-\theta)^2}{2\sigma^2}\right)\\ &\propto \exp\left(-\frac{1}{2\sigma^2}\sum_{n=1}^N(x_n-\theta)^2\right) \end{align*}\] となる．従って，事後確率密度関数は \[\begin{align*} p(\theta\mid x^N)&\propto p(x^N\mid \theta)p(\theta)\\ &\propto \exp\left(-\frac{1}{2\sigma^2}\sum_{n=1}^N(x_n-\theta)^2-\frac{\theta^2}{2\sigma^2}\right) \end{align*}\] となる．指数の中身を平方完成すると \[\begin{align*} -\frac{N+1}{2\sigma^2}\left(\theta-\frac{1}{N+1}\sum_{n=1}^N x_n\right)^2+C \end{align*}\] となる．ただし，実数\(C\)は\(\theta\)によらない定数とする．よって事後密度関数の形から，事後分布は \(\mathcal{N}\left(\sum_{n=1}^Nx_n/(N+1), \sigma^2/(N+1)\right)\).

rstanを用いて事後分布を描画してみよう．rstanはベイズ統計学の応用でしばしば使われるStan言語のR版である．教科書では扱えなかったが，ちょっとした計算に便利だ．なお，rstanでは常に，事後分布を導出する際には乱数を用いた近似が利用されることに注意せよ．まずrstanのために，モデルを記述しよう．こうしたことはたいてい，固い教科書よりもユーザによるウェブサイトが役に立つ．説明はそちらに譲る．

モデルの記述の仕方はいくつかあるが，ここでは下記のモデルをnormal.stanというチャンクに記述する．

data {
  int N;
  real x[N];
  real<lower=0> sigma;
}
parameters {
  real theta;
}
model {
  theta ~ normal(0, sigma); // 事前分布
  x ~ normal(theta, sigma); // 尤度
}

そしてそのモデルをrstanの関数に代入しよう．最初の二行はおまじないなので気にしなくて良い．

rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

# データの生成
sigma <- 1.0
x <- rnorm(10,1,sigma)
normal_dat <- list(N =  length(x), x = x, sigma = sigma)

# コンパイル
fit <- rstan::sampling(normal.stan, data = normal_dat, 
            iter = 10^4, chains = 4)

# stanからparameterの取り出し
theta <- extract(fit, 'theta')
theta <- unlist(theta, use.names=FALSE)


# 理論的な事後密度関数
t <- mean(x)+seq(-1,1,length=1000)
fit <- dnorm(t, sum(x)/(length(x)+1), sigma/sqrt(length(x)+1))

fig <- plot_ly(x = theta, type="histogram", histnorm = "probability density", name="rstan")
fig %>% add_lines(x = t, y = fit, mode = 'lines', name="exact")

おおよそ上で計算した理論的な事後密度関数と，rstanで乱数を用いて計算したものは一致するようだ．

Exercise 1.5 \(\nu,\alpha\)が正の実数であるとき，式(1.7)をもちいて，\(\mathcal{G}(\nu,\alpha)\)の平均を求めよ．

Solution. 式(1.7)より \[\begin{align*} \int_0^\infty x~\frac{\alpha^\nu x^{\nu-1}e^{-\alpha x}}{\Gamma(\nu)}\mathrm{d}x &=\frac{\alpha^\nu}{\Gamma(\nu)}\int_0^\infty x^\nu e^{-\alpha x}\mathrm{d}x\\ &=\frac{\alpha^\nu}{\Gamma(\nu)}\frac{\Gamma(\nu+1)}{\alpha^{\nu+1}}\\ &=\frac{\nu}{\alpha}. \end{align*}\] ただし，\(\Gamma(\nu+1)=\nu~\Gamma(\nu)\)を用いた．

Remark. 記号\(\propto\)を使うのは，事後分布を計算するために不要な計算を回避する，計算のテクニックである．計算をこのように適宜省かないと，事後分布の計算はとても煩雑になりがちだからだ．しかし，初学者は記号\(\propto\)の意味するものが well-defined ではないと感じることだろう．そのため，初学者は教科書に最初に紹介した，記号\(\propto\)をもちいずきっちり事後分布の確率密度関数を計算する方法を何度か経験したほうが良い．その経験のあとなら記号\(\propto\)の便利さと意味を納得できるだろう．

Exercise 1.6 \(N\)は正の整数，\(\mu\)は実数，\(\alpha,\beta,\tau\)は正の実数とする．観測\(x_1,\ldots, x_N\)は\(\mathcal{N}(\mu,\tau^{-1})\)に従い独立とする．また，\(\tau\sim\mathcal{G}(\alpha,\beta)\)とする．実数\(\alpha,\beta,\mu\)は既知とするとき，パラメータ\(\tau\)の事後分布を求めよ．

Solution. 尤度は \[\begin{align*} p(x^N\mid \tau)&=\prod_{n=1}^N\sqrt{\frac{\tau}{2\pi}}\exp\left(-\frac{(x_n-\mu)^2}{2}\tau\right)\\ &\propto \tau^{N/2}\exp\left(-\sum_{n=1}^N\frac{(x_n-\mu)^2}{2}\tau\right) \end{align*}\] となる．よって事後密度関数は \[\begin{align*} p(\tau\mid x^N)\propto \tau^{\alpha+N/2-1}\exp\left(-\left\{\beta+\sum_{n=1}^N\frac{(x_n-\mu)^2}{2}\right\}\tau\right) \end{align*}\] となる．よって事後密度関数の形から，事後分布は \[\mathcal{G}\left(\alpha+\frac{N}{2}, \beta+\sum_{n=1}^N\frac{(x_n-\mu)^2}{2}\right). \]

Remark. 確率分布を求めよ，という問題は，求めるべき確率分布がよく知られた確率分布になることを暗に意味している．その確率分布が正規分布やガンマ分布など，どの確率分布なのか，そしてパラメータの値がなにかということが聞かれているのである．

これもrstanを使って確認しよう．次のモデルをnormal_gamma.stanというチャンクに記述した．

data {
  int N;
  real x[N];
  real<lower=0> alpha;
  real<lower=0> beta;
  real mu;
}
parameters {
  real tau;
}
model {
  tau ~ gamma(alpha, beta); // 事前分布
  x ~ normal(mu, 1/sqrt(tau)); // 尤度
}

rstanの結果と理論の結果は整合性があるようだ．

# データの生成
alpha <- 2.0
beta <- 3.0
mu <- 2.5
tau <- 1.0
x <- rnorm(10,mu,1/sqrt(tau))
normal_gamma_dat <- list(N =  length(x), x = x, alpha = alpha, beta = beta, mu = mu)

# コンパイル
fit <- rstan::sampling(normal_gamma.stan, data = normal_gamma_dat, 
            iter = 10^4, chains = 4)

## 
## SAMPLING FOR MODEL '4158cf09bd62d921e268909ea92a86fb' NOW (CHAIN 1).
## Chain 1: Rejecting initial value:
## Chain 1:   Error evaluating the log probability at the initial value.
## Chain 1: Exception: normal_lpdf: Scale parameter is nan, but must be > 0!  (in 'modela68872b8d8a9_4158cf09bd62d921e268909ea92a86fb' at line 13)
## 
## Chain 1: Rejecting initial value:
## Chain 1:   Error evaluating the log probability at the initial value.
## Chain 1: Exception: normal_lpdf: Scale parameter is nan, but must be > 0!  (in 'modela68872b8d8a9_4158cf09bd62d921e268909ea92a86fb' at line 13)
## 
## Chain 1: 
## Chain 1: Gradient evaluation took 1.8e-05 seconds
## Chain 1: 1000 transitions using 10 leapfrog steps per transition would take 0.18 seconds.
## Chain 1: Adjust your expectations accordingly!
## Chain 1: 
## Chain 1: 
## Chain 1: Iteration:    1 / 10000 [  0%]  (Warmup)
## Chain 1: Iteration: 1000 / 10000 [ 10%]  (Warmup)
## Chain 1: Iteration: 2000 / 10000 [ 20%]  (Warmup)
## Chain 1: Iteration: 3000 / 10000 [ 30%]  (Warmup)
## Chain 1: Iteration: 4000 / 10000 [ 40%]  (Warmup)
## Chain 1: Iteration: 5000 / 10000 [ 50%]  (Warmup)
## Chain 1: Iteration: 5001 / 10000 [ 50%]  (Sampling)
## Chain 1: Iteration: 6000 / 10000 [ 60%]  (Sampling)
## Chain 1: Iteration: 7000 / 10000 [ 70%]  (Sampling)
## Chain 1: Iteration: 8000 / 10000 [ 80%]  (Sampling)
## Chain 1: Iteration: 9000 / 10000 [ 90%]  (Sampling)
## Chain 1: Iteration: 10000 / 10000 [100%]  (Sampling)
## Chain 1: 
## Chain 1:  Elapsed Time: 0.084998 seconds (Warm-up)
## Chain 1:                0.062902 seconds (Sampling)
## Chain 1:                0.1479 seconds (Total)
## Chain 1: 
## 
## SAMPLING FOR MODEL '4158cf09bd62d921e268909ea92a86fb' NOW (CHAIN 2).
## Chain 2: Rejecting initial value:
## Chain 2:   Error evaluating the log probability at the initial value.
## Chain 2: Exception: normal_lpdf: Scale parameter is nan, but must be > 0!  (in 'modela68872b8d8a9_4158cf09bd62d921e268909ea92a86fb' at line 13)
## 
## Chain 2: Rejecting initial value:
## Chain 2:   Error evaluating the log probability at the initial value.
## Chain 2: Exception: normal_lpdf: Scale parameter is nan, but must be > 0!  (in 'modela68872b8d8a9_4158cf09bd62d921e268909ea92a86fb' at line 13)
## 
## Chain 2: 
## Chain 2: Gradient evaluation took 4e-06 seconds
## Chain 2: 1000 transitions using 10 leapfrog steps per transition would take 0.04 seconds.
## Chain 2: Adjust your expectations accordingly!
## Chain 2: 
## Chain 2: 
## Chain 2: Iteration:    1 / 10000 [  0%]  (Warmup)
## Chain 2: Iteration: 1000 / 10000 [ 10%]  (Warmup)
## Chain 2: Iteration: 2000 / 10000 [ 20%]  (Warmup)
## Chain 2: Iteration: 3000 / 10000 [ 30%]  (Warmup)
## Chain 2: Iteration: 4000 / 10000 [ 40%]  (Warmup)
## Chain 2: Iteration: 5000 / 10000 [ 50%]  (Warmup)
## Chain 2: Iteration: 5001 / 10000 [ 50%]  (Sampling)
## Chain 2: Iteration: 6000 / 10000 [ 60%]  (Sampling)
## Chain 2: Iteration: 7000 / 10000 [ 70%]  (Sampling)
## Chain 2: Iteration: 8000 / 10000 [ 80%]  (Sampling)
## Chain 2: Iteration: 9000 / 10000 [ 90%]  (Sampling)
## Chain 2: Iteration: 10000 / 10000 [100%]  (Sampling)
## Chain 2: 
## Chain 2:  Elapsed Time: 0.082411 seconds (Warm-up)
## Chain 2:                0.062013 seconds (Sampling)
## Chain 2:                0.144424 seconds (Total)
## Chain 2: 
## 
## SAMPLING FOR MODEL '4158cf09bd62d921e268909ea92a86fb' NOW (CHAIN 3).
## Chain 3: 
## Chain 3: Gradient evaluation took 5e-06 seconds
## Chain 3: 1000 transitions using 10 leapfrog steps per transition would take 0.05 seconds.
## Chain 3: Adjust your expectations accordingly!
## Chain 3: 
## Chain 3: 
## Chain 3: Iteration:    1 / 10000 [  0%]  (Warmup)
## Chain 3: Iteration: 1000 / 10000 [ 10%]  (Warmup)
## Chain 3: Iteration: 2000 / 10000 [ 20%]  (Warmup)
## Chain 3: Iteration: 3000 / 10000 [ 30%]  (Warmup)
## Chain 3: Iteration: 4000 / 10000 [ 40%]  (Warmup)
## Chain 3: Iteration: 5000 / 10000 [ 50%]  (Warmup)
## Chain 3: Iteration: 5001 / 10000 [ 50%]  (Sampling)
## Chain 3: Iteration: 6000 / 10000 [ 60%]  (Sampling)
## Chain 3: Iteration: 7000 / 10000 [ 70%]  (Sampling)
## Chain 3: Iteration: 8000 / 10000 [ 80%]  (Sampling)
## Chain 3: Iteration: 9000 / 10000 [ 90%]  (Sampling)
## Chain 3: Iteration: 10000 / 10000 [100%]  (Sampling)
## Chain 3: 
## Chain 3:  Elapsed Time: 0.079113 seconds (Warm-up)
## Chain 3:                0.072525 seconds (Sampling)
## Chain 3:                0.151638 seconds (Total)
## Chain 3: 
## 
## SAMPLING FOR MODEL '4158cf09bd62d921e268909ea92a86fb' NOW (CHAIN 4).
## Chain 4: Rejecting initial value:
## Chain 4:   Error evaluating the log probability at the initial value.
## Chain 4: Exception: normal_lpdf: Scale parameter is nan, but must be > 0!  (in 'modela68872b8d8a9_4158cf09bd62d921e268909ea92a86fb' at line 13)
## 
## Chain 4: Rejecting initial value:
## Chain 4:   Error evaluating the log probability at the initial value.
## Chain 4: Exception: normal_lpdf: Scale parameter is nan, but must be > 0!  (in 'modela68872b8d8a9_4158cf09bd62d921e268909ea92a86fb' at line 13)
## 
## Chain 4: 
## Chain 4: Gradient evaluation took 8e-06 seconds
## Chain 4: 1000 transitions using 10 leapfrog steps per transition would take 0.08 seconds.
## Chain 4: Adjust your expectations accordingly!
## Chain 4: 
## Chain 4: 
## Chain 4: Iteration:    1 / 10000 [  0%]  (Warmup)
## Chain 4: Iteration: 1000 / 10000 [ 10%]  (Warmup)
## Chain 4: Iteration: 2000 / 10000 [ 20%]  (Warmup)
## Chain 4: Iteration: 3000 / 10000 [ 30%]  (Warmup)
## Chain 4: Iteration: 4000 / 10000 [ 40%]  (Warmup)
## Chain 4: Iteration: 5000 / 10000 [ 50%]  (Warmup)
## Chain 4: Iteration: 5001 / 10000 [ 50%]  (Sampling)
## Chain 4: Iteration: 6000 / 10000 [ 60%]  (Sampling)
## Chain 4: Iteration: 7000 / 10000 [ 70%]  (Sampling)
## Chain 4: Iteration: 8000 / 10000 [ 80%]  (Sampling)
## Chain 4: Iteration: 9000 / 10000 [ 90%]  (Sampling)
## Chain 4: Iteration: 10000 / 10000 [100%]  (Sampling)
## Chain 4: 
## Chain 4:  Elapsed Time: 0.081844 seconds (Warm-up)
## Chain 4:                0.06383 seconds (Sampling)
## Chain 4:                0.145674 seconds (Total)
## Chain 4:

# stanからparameterの取り出し
tau <- extract(fit, 'tau')
tau <- unlist(tau, use.names=FALSE)


# 理論的な事後密度関数
t <- seq(0,4,length=1000)
fit <- dgamma(t, shape = alpha + length(x)/2, rate = beta + sum((x-mu)^2)/2)

fig <- plot_ly(x = tau, type="histogram", histnorm = "probability density", name="rstan")
fig %>% add_lines(x = t, y = fit, mode = 'lines', name="exact")

Exercise 1.7 \(\alpha, \beta\)は正の実数とする．観測\(x\)は\(\mathcal{E}(\theta)\)に従い，\(\theta\)の事前分布を\(\mathcal{G}(\alpha,\beta)\)とする．

(a). このとき\(\theta\)の事後分布を求めよ．

(b). 事後平均を求めよ．

(c). 事後予測分布を求めよ．

Solution. (a). 事後密度関数は \[\begin{align*} p(\theta\mid x)&\propto \theta e^{-\theta x} \theta^{\alpha-1}e^{-\beta\theta}\\ &\propto \theta^\alpha e^{-(x+\beta)\theta} \end{align*}\] となる．従って，事後密度関数の形から，事後分布は \(\mathcal{G}(\alpha+1, x+\beta)\).

(b). 練習問題1.5より \[\frac{\alpha+1}{x+\beta}.\]

(c). 正の実数\(x^*\)を将来の観測とすると，事後予測分布の確率密度関数は \[\begin{align*} p(x^*\mid x)&=\int_0^\infty\theta e^{-\theta x^*}\frac{(x+\beta)^{\alpha+1}\theta^\alpha}{\Gamma(\alpha+1)} e^{-(x+\beta)\theta}\mathrm{d}\theta\\ &=\int_0^\infty\frac{(x+\beta)^{\alpha+1}\theta^{\alpha+1}}{\Gamma(\alpha+1)} e^{-(x+x^*\beta)\theta}\mathrm{d}\theta\\ &=\frac{\Gamma(\alpha+2)}{(x+x^*+\beta)^{\alpha+2}}\frac{(x+\beta)^{\alpha+1}}{\Gamma(\alpha+1)}\\ &=\frac{(\alpha+1)(x+\beta)^{\alpha+1}}{(x+x^*+\beta)^{\alpha+2}}. \end{align*}\]

Exercise 1.8 観測\(x\)は\(\mathcal{N}(\theta_1,\theta_2^{-1})\)に従い，\(\theta_1\)の事前分布を\(\mathcal{N}(0,\theta_2^{-1})\)とし，さらに\(\theta_2\)の事前分布を\(\mathcal{E}(1)\)とする．

(a). \(\theta_1\)と\(x\)で条件づけた\(\theta_2\)の条件つき分布を求めよ．

(b). \(\theta_1\)の周辺事後分布を求めよ．

Solution. （20200312解答修正） (a). 尤度は \[p(x\mid \theta_1,\theta_2)=\sqrt{\frac{\theta_2}{2\pi}}\exp\left(-\frac{(x-\theta_1)^2}{2}\theta_2\right)\] と書ける．したがって事後分布の確率密度関数は \[\begin{align*} p(\theta_1,\theta_2\mid x) &\propto p(x\mid \theta_1,\theta_2)p(\theta_1\mid \theta_2)p(\theta_2)\\ &= \sqrt{\frac{\theta_2}{2\pi}}\exp\left(-\frac{(x-\theta_1)^2}{2}\theta_2\right)\sqrt{\frac{\theta_2}{2\pi}}\exp\left(-\frac{\theta_1^2}{2}\theta_2\right)\exp(-\theta_2)\\ &\propto \theta_2\exp\left(-\left(\frac{(x-\theta_1)^2}{2}+\frac{\theta_1^2}{2}+1\right)\theta_2\right) \end{align*}\] となる．すると\(p(\theta_2\mid x, \theta_1,)\propto p(\theta_1,\theta_2\mid x)\)だから\(\theta_1\)と\(x\)で条件づけた\(\theta_2\)の条件つき分布はガンマ分布とパラメータを見比べると， \[\mathcal{G}_a\left(2,\frac{(x-\theta_1)^2}{2}+\frac{\theta_1^2}{2}+1\right)\] であることがわかる．

(b). ガンマ関数と見比べることで， \[\begin{align*} p(\theta_1\mid x)&=\int_0^\infty p(\theta_1,\theta_2\mid x)\mathrm{d}\theta_2\\ &=\left(\frac{(x-\theta_1)^2}{2}+\frac{\theta_1^2}{2}+1\right)^{-2}\int_0^\infty \left(\frac{(x-\theta_1)^2}{2}+\frac{\theta_1^2}{2}+1\right)^2\\ &\quad\theta_2\exp\left(-\left(\frac{(x-\theta_1)^2}{2}+\frac{\theta_1^2}{2}+1\right)\theta_2\right)\mathrm{d}\theta_2\\ &=\left(\frac{(x-\theta_1)^2}{2}+\frac{\theta_1^2}{2}+1\right)^{-2}\Gamma(2) \end{align*}\] となる．だから\(t\)分布のパラメータ（16ページ）と見比べれば，これは \[ \mathcal{T}_3\left(\frac{x}{2}, \left\{\frac{1}{3}\left(1+\frac{x^2}{4}\right)\right\}^{1/2}\right) \] の確率密度関数である．したがって，周辺事後分布は上記の\(t\)分布である．

Remark. ここでは \[p(\theta_2\mid x,\theta_1)\propto p(\theta_1,\theta_2\mid x) \] を用いた．なぜなら， \[p(\theta_2\mid x,\theta_1)= \frac{p(\theta_1,\theta_2\mid x)}{p(\theta_1\mid x)}\] であり，\(\theta_1\)と\(x\)で条件つけた分布を計算する際には左辺の分母を定数と見て良いからだ．この考え方は，あとでギブスサンプリングを構成する際に重要である．

これもrstanを使って確認しよう．今までとは違う方法でrstanを走らせてみよう．モデルをすべてexcodeという変数に代入する．

excode <- '
data {
  real x;
}
parameters {
  real theta1;
  real<lower=0> theta2;
}
model {
  theta1 ~ normal(0, 1/sqrt(theta2)); // 事前分布
  theta2 ~ exponential(1); // 事前分布
  x ~ normal(theta1, 1/sqrt(theta2)); // 尤度
}'

コンパイルする際も先ほどと命令が若干異なる．

# データの生成
theta1 <- 2.0
theta2 <- 1.0
x <- rnorm(1,theta1,1/sqrt(theta2))
normal_exp_dat <- list(x = x)

# コンパイル
fit <- stan(model_code = excode, data = normal_exp_dat, 
            iter = 10000, chains = 4)

# stanからparameterの取り出し
theta1 <- extract(fit, 'theta1')
theta1 <- unlist(theta1, use.names=FALSE)

# 理論的な事後密度関数
t <- x/2 + seq(-10,10,length=1000)
scale <- sqrt((1 + x^2/4)/3)
fit <- dt( (t - x/2)/scale, df = 3)/scale

fig <- plot_ly(x = theta1, type="histogram", histnorm = "probability density", name="rstan")
fig <- fig %>% add_lines(x = t, y = fit, mode = 'lines', name="exact")
fig <- fig %>% layout(xaxis = list(range = c(x/2-10, x/2+10)))
fig

## Warning: 'scatter' objects don't have these attributes: 'histnorm'
## Valid attributes include:
## 'cliponaxis', 'connectgaps', 'customdata', 'customdatasrc', 'dx', 'dy', 'error_x', 'error_y', 'fill', 'fillcolor', 'groupnorm', 'hoverinfo', 'hoverinfosrc', 'hoverlabel', 'hoveron', 'hovertemplate', 'hovertemplatesrc', 'hovertext', 'hovertextsrc', 'ids', 'idssrc', 'legendgroup', 'legendgrouptitle', 'legendrank', 'line', 'marker', 'meta', 'metasrc', 'mode', 'name', 'opacity', 'orientation', 'selected', 'selectedpoints', 'showlegend', 'stackgaps', 'stackgroup', 'stream', 'text', 'textfont', 'textposition', 'textpositionsrc', 'textsrc', 'texttemplate', 'texttemplatesrc', 'transforms', 'type', 'uid', 'uirevision', 'unselected', 'visible', 'x', 'x0', 'xaxis', 'xcalendar', 'xhoverformat', 'xperiod', 'xperiod0', 'xperiodalignment', 'xsrc', 'y', 'y0', 'yaxis', 'ycalendar', 'yhoverformat', 'yperiod', 'yperiod0', 'yperiodalignment', 'ysrc', 'key', 'set', 'frame', 'transforms', '_isNestedKey', '_isSimpleKey', '_isGraticule', '_bbox'

Exercise 1.9 本問は例1.8の二項分布から多項分布への拡張である． \(K\)は正の整数，\(\alpha_1,\ldots,\alpha_K\)は正の実数とする． \(\mathcal{D}(\alpha_1,\ldots,\alpha_K)\)は確率密度関数 \[p(\theta_1,\ldots,\theta_{K-1})= \left\{ \begin{array}{ll} \frac{\Gamma(\alpha_1+\cdots+\alpha_K)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)}\theta_1^{\alpha_1-1}\cdots\theta_K^{\alpha_K-1}&\mathrm{if}\ 0<\theta_k<1, k=1,\ldots, K\\ 0&\mathrm{otherwise} \end{array} \right.\] をもつ．ただし，\(\theta=(\theta_1,\theta_2,\ldots,\theta_{K-1})\)とし， \[\theta_K=\theta_K(\theta)=1-\theta_1-\cdots-\theta_{K-1}\] とする．さらに，\(N\)は正の整数で，\(\theta_1,\ldots, \theta_{K-1}\)は正の実数で\(\theta_1+\cdots+\theta_{K-1}<1\)のとき，) \(\mathcal{M}(N,\theta_1,\ldots,\theta_{K-1})\)は確率関数 \[p(x_1,\ldots, x_{K-1})= \left\{ \begin{array}{ll} \frac{N!}{x_1!\cdots x_K!}\theta_1^{x_1}\cdots \theta_K^{x_K} &\mathrm{if}\ 0\le x_k\le N, k=1,\ldots, K\\ 0&\mathrm{otherwise} \end{array} \right.\] をもつ．ただし，\(x=(x_1,\ldots, x_{K-1})\)は正の整数で \[x_K=x_K(x)=N-x_1-\cdots-x_{K-1}\] とする． \(x\mid \theta\sim \mathcal{M}(N,\theta)\), \(\theta\sim \mathcal{D}(\alpha_1,\ldots,\alpha_K)\)であるとき，\(\theta\)の事後分布をもとめよ．

Solution. 尤度が \[p(x\mid \theta)\propto \theta_1^{x_1}\cdots \theta_K^{x_K}\] と書けるから事後分布の確率密度関数は \[\begin{align*} p(\theta\mid x)&\propto p(x\mid \theta)p(\theta)\\ &\propto \theta_1^{x_1}\cdots \theta_K^{x_K}~\theta_1^{\alpha_1-1}\cdots\theta_K^{\alpha_K-1}\\ &= \theta_1^{\alpha_1+x_1-1}\cdots \theta_K^{\alpha_K+x_K-1} \end{align*}\] となる．だから，ディリクレ分布の確率密度関数と見比べれば，事後分布が \(\mathcal{D}(\alpha_1+x_1,\ldots, \alpha_K+x_K)\) であることがわかる．

Remark. ディリクレ分布は和が\(1\)になる正の数列\(\theta_1,\ldots, \theta_K\)に値を取る確率分布である．とくに\(K=2\)としたときがベータ分布である．和をとると\(1\)になる列は多項分布の確率関数になるから，ディリクレ分布は確率分布の確率分布とも捉えられる．うえの練習問題はこの事実を使った．

ベータ分布は\(K=2\)であり，\(\theta_1+\theta_2=1\)となるので，\(\theta_1\)さえわかれば\(\theta_2\)もわかるので確率密度関数を描画するのもたやすい．だから，ベータ分布のパラメータがベータ分布にどのように影響するかをみることも容易である．それに比べると，\(K\ge 3\)のディリクレ分布の形状の把握は難しい．ここでは\(K=3\)の場合を考えてみよう．じつは \[X_1\sim\mathcal{G}(\alpha_1,1),\ldots, X_K\sim\mathcal{G}(\alpha_K,1)\] であり，独立であるとき， \[\theta=(\theta_1,\ldots,\theta_K)=\left(\frac{X_1}{\sum_{k=1}^KX_k},\ldots, \frac{X_K}{\sum_{k=1}^KX_k}\right)\] とすると，\((\theta_1,\ldots,\theta_{K-1})\)は\(\mathcal{D}(\alpha_1,\ldots,\alpha_K)\)に従う．以下では \(K=3\)のときに，乱数\(\theta\)をたくさん生成し，３次元空間に描画した．パラメータを変えて実験してみると良い．

set.seed(1234)
N <- 1e3
alpha <- c(1.0, 1.5, 1.8) #Parameter
x <- rgamma(N, alpha[1])
y <- rgamma(N, alpha[2])
z <- rgamma(N, alpha[3])
w <- x + y + z
data.fr <- data.frame(x=x/w, y=y/w, z=z/w)
plot_ly(data.fr, x=~x, y=~y, z=~z) %>% add_markers(marker=list(size=3))

Exercise 1.10 例1.14を参考に，実数\(\mu\), 正の実数\(\sigma\)による正規分布\(\mathcal{N}(\mu,\sigma^2)\)に対するジェフリーズ事前分布を計算したい．パラメータを\(\theta=(\mu,\sigma^2)\)としたときのジェフリーズ事前分布をフィッシャー情報行列の計算による方法と，不変性を用いた方法の二通りの方法で求めよ．

Solution. \(\xi=\sigma^2\)とおく．すると正規分布\(\mathcal{N}(\mu,\xi)\)の確率密度関数は \[\sqrt{\frac{1}{2\pi\xi}}\exp\left(-\frac{1}{2\xi}(x-\mu)^2\right)\] である．だからスコア関数は \[\begin{matrix}S_1\\S_2\end{matrix}:=\begin{pmatrix}\xi^{-1}(x-\mu)\\-\frac{1}{2\xi}+\xi^{-2}\frac{(x-\mu)^2}{2}\end{pmatrix}\] となる．いっぽう，例1.14で紹介したように， \[\mathbb{E}[x-\mu]=\mathbb{E}[(x-\mu)^3]=0,\ \mathbb{E}[(x-\mu)^2]=\xi,\ \mathbb{E}[(x-\mu)^4]=3\xi^2\] となる．したがって，フィッシャー情報行列は \[\mathbb{E}\left[\begin{pmatrix}S_1^2&S_1S_2\\S_2S_1&S_2^2\end{pmatrix}\right]=\begin{pmatrix}\xi^{-1}&0\\0&\xi^{-2}/2\end{pmatrix}\] である．この行列の行列式は\(\xi^{-3}/2\)となるから，パラメータを\((\mu,\xi)\)としたときのジェフリーズ事前分布は \[2^{-1/2}\xi^{-3/2}\] を密度関数として持つ．これでひとつめの直接的な方法でのジェフリーズ事前分布を導出できた．

いっぽう，例1.14では\(\tau=\xi^{-1}\)としてパラメータを\((\mu,\xi)\)としたときのジェフリーズ事前分布が \[\tau^{-1/2}\] を密度関数として持つことがわかっている．また \[(\mu,\tau)\mapsto (\mu,\xi)\] という変数変換によるヤコビ行列式は \[\left|\det\begin{pmatrix}1&0\\0&-\xi^{-2}\end{pmatrix}\right|=\xi^{-2}\] である．よってジェフリーズ事前分布の変数変換による不変性から，パラメータを\((\mu,\xi)\)としたときのジェフリーズ事前分布は \[\xi^{1/2}~\xi^{-2}=\xi^{-3/2}\] となり，たしかにさきほど求めたものと一致する（注: 次のRemark を見よ）．

Remark. この例のジェフリーズ事前分布のように非正則な分布に対しては，密度関数の定数倍の差は無視して良い．このことを，非正則な分布は定数倍の任意性があるとも言う．

Exercise 1.11 \(N\)は正の整数，\(\lambda_1,\lambda_2\)は正の実数とする． \(x_1^1,\ldots, x_N^1\mid \lambda_1\sim\mathcal{P}(\lambda_1)\)， \(x_1^2,\ldots, x_N^2\mid \lambda_2\sim\mathcal{P}(\lambda_2)\)でありすべて独立とする．二つのモデル \[\mathcal{M}_1:\lambda_1,\lambda_2\sim \mathcal{E}(1), \mathcal{M}_0:\lambda_1=\lambda_2\sim\mathcal{E}(1)\] を考える．二つのモデルへの事前確率は\(0.5\)ずつとする．

(a). モデル\(\mathcal{M}_0\)の事後確率を求めよ．

(b). ベイズ因子\(B_{10}\)を求めよ．

Solution. 例1.15でつかった\(x^N\)や\(x_1^N, x_2^N\)といった記号を再利用したうえで，さらに記号をいくつか用意しよう．各集団の和をそれぞれ \[S_1=\sum_{n=1}^Nx_n^1,\ S_2=\sum_{n=1}^Nx_n^2\] とし，全体の和を \[S=S_1+S_2\] とする．また，各集団の積をそれぞれ \[T_1=x_1^1\cdots x_N^1,\ T_2=x_1^2\cdots x_N^2\] とし，全体の積を \[T=T_1+T_2\] としよう．すると，モデル\(\mathcal{M}_1\)のもとの周辺事後確率密度関数は \[p(x^N\mid 1)=\int_0^\infty p(x_1^N\mid \lambda_1)p(\lambda_1)\mathrm{d}\lambda_1\times \int_0^\infty p(x_2^N\mid \lambda_2)p(\lambda_2)\mathrm{d}\lambda_2\] となる．うえの２つの積分の積のうち，最初の一つは \[\int_0^\infty\frac{\lambda_1^{S_1}}{T_1}e^{-N\lambda_1}~e^{-\lambda_1}\mathrm{d}\lambda_1=\frac{\Gamma(1+S_1)}{T_1(N+1)^{1+S_1}}\] と計算できる．２つ目の積についても同じ計算をすれば \[\int_0^\infty\frac{\lambda_2^{S_2}}{T_2}e^{-N\lambda_2}~e^{-\lambda_2}\mathrm{d}\lambda_2=\frac{\Gamma(1+S_2)}{T_2(N+1)^{1+S_2}}\] となることがわかる．よってモデル\(\mathcal{M}_1\)のもとの周辺事後確率密度関数は \[p(x^N\mid 1)=\frac{\Gamma(1+S_1)\Gamma(1+S_2)}{T(N+1)^{2+S}}\] となる．まったく同じ計算で，モデル\(\mathcal{M}_0\)の場合は \[p(x^N\mid 0)=\int_0^\infty\frac{\lambda_1^{S}}{T}e^{-2N\lambda}~e^{-\lambda}\mathrm{d}\lambda=\frac{\Gamma(1+S)}{T(2N+1)^{1+S}}\] を得る．したがって \[\frac{p(x^N\mid 0)}{p(x^N\mid 1)}=\frac{(N+1)^{2+S}}{(2N+1)^{1+S}}\binom{S}{S_1}\] となる．よって

(a). 事後確率は \[\frac{(N+1)^{2+S}}{(2N+1)^{1+S}}\binom{S}{S_1}\left(1+\frac{(N+1)^{2+S}}{(2N+1)^{1+S}}\binom{S}{S_1}\right)^{-1}\].

(b). ベイズ因子は \[\left(\frac{(N+1)^{2+S}}{(2N+1)^{1+S}}\binom{S}{S_1}\right)^{-1}.\]