{ "cells": [ { "cell_type": "markdown", "id": "5f99e577", "metadata": {}, "source": [ "# Exercise 1\n", "The following exercise is from Aki Vehtari's (one of the authors of the BDA book) [course](https://avehtari.github.io/BDA_course_Aalto/assignments/assignment2.html).\n", "\n", "Algae status is monitored in 274 sites at Finnish lakes and rivers. The observations (‘0’: no algae, ‘1’: algae present) for the 2008 algae status at each site are following:\n", "* No algae: 230 sites\n", "* Algae: 44 sites\n", "\n", "Let\n", "* $\\theta$ be the probability of a monitoring site having detectable blue-green algae levels,\n", "* $y$ the number of observed sites with algae detected, and\n", "* $n$ the total number sites surveyed.\n", "\n", "Use a binomial model for the observations and a Beta(2, 10) prior for $\\theta$." ] }, { "cell_type": "markdown", "id": "4389b7ba", "metadata": {}, "source": [ "### Question 1.\n", "\n", "Express the\n", "1. likelihood\n", "$$\n", "p(y \\mid \\theta, n=274) = ...\n", "$$\n", "2. posterior\n", "$$\n", "p(\\theta \\mid y=44, n=274) = ...\n", "$$" ] }, { "cell_type": "markdown", "id": "9032da89", "metadata": {}, "source": [ "### Question 2.\n", "Find the mean and mode of the posterior distribution." ] }, { "cell_type": "code", "execution_count": null, "id": "f3fb2b81", "metadata": {}, "outputs": [], "source": [ "y, n = 44, 274\n", "\n", "# Fill in the blanks\n", "alpha_prior, beta_prior = _, _\n", "alpha_posterior = _\n", "beta_posterior = _\n", "\n", "posterior_mean = _\n", "posterior_mode = _\n", "\n", "\n", "print(\"posterior: (alpha, beta) = \", alpha_posterior, beta_posterior)\n", "print(f\"Posterior Mean: {posterior_mean:.4f}\")\n", "print(f\"Posterior Mode: {posterior_mode:.4f}\")" ] }, { "cell_type": "markdown", "id": "527c70d3", "metadata": {}, "source": [ "Here's a plot." ] }, { "cell_type": "code", "execution_count": null, "id": "4445cb1a", "metadata": {}, "outputs": [], "source": [ "# Plot the prior\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "x = np.linspace(0, 1, 100)\n", "prior_pdf = beta.pdf(x, alpha_prior, beta_prior)\n", "plt.plot(x, prior_pdf, label='Prior', color='blue')\n", "\n", "# Plot the posterior\n", "posterior_pdf = beta.pdf(x, alpha_posterior, beta_posterior)\n", "plt.plot(x, posterior_pdf, label='Posterior', color='orange')\n", "\n", "plt.legend()\n", "\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "56c5bd02", "metadata": {}, "source": [ "### Question 3.\n", "\n", "Let $X$ be a random sample with parameter $\\theta$.\n", "\n", "In the classical (frequentist) approach, the 95% confidence interval for $\\theta$ is interval $(u(X), v(X))$ such that\n", "$$\n", "\\forall \\theta \\in (0, 1): P(u(X) \\leq \\theta \\leq v(X)) = \\gamma.\n", "$$\n", "\n", "In case of the binomial model and $\\gamma = 0.95$, the confidence interval is approximately given by\n", "$$\n", "\\left( \\frac{y}{n} - 1.96 \\sqrt{\\frac{y(n-y)}{n^3}}, \\frac{y}{n} + 1.96 \\sqrt{\\frac{y(n-y)}{n^3}} \\right).\n", "$$\n", "\n", "
\n", "\n", "In the Bayesian approach, the credible interval for $\\theta$ is an interval $(a, b)$ such that\n", "$$\n", "P(a \\leq \\theta \\leq b \\mid X) = \\gamma.\n", "$$\n", "\n", "Find the **95% credible interval**." ] }, { "cell_type": "code", "execution_count": null, "id": "a1e86ada", "metadata": {}, "outputs": [], "source": [ "from scipy.stats import beta\n", "\n", "# 95% confidence interval\n", "mean = y/n\n", "std = np.sqrt(mean * (1 - mean) / n)\n", "lower_bound = mean - 1.96 * std\n", "upper_bound = mean + 1.96 * std\n", "print(\"95% confidence interval: \", lower_bound, upper_bound)\n", "\n", "\n", "# 95% credible interval\n", "## See https://docs.scipy.org/doc/scipy-1.17.0/reference/generated/scipy.stats.beta.html\n", "lower_bound = _\n", "upper_bound = _\n", "print(\"95% credible interval: \", lower_bound, upper_bound)" ] }, { "cell_type": "markdown", "id": "c1dce303", "metadata": {}, "source": [ "### Question 4.\n", "\n", "We are interested in using our posterior distribution to estimate the probability that the proportion of detected algae samples $\\theta$ is smaller than the historical detection rate of $\\theta_0 = 0.2$, i.e. $P(\\theta < 0.2 \\mid y)$." ] }, { "cell_type": "code", "execution_count": null, "id": "2ae05e9e", "metadata": {}, "outputs": [], "source": [ "probability = _\n", "print(f\"P(θ < 0.2 | y) = {probability:.4f}\")" ] }, { "cell_type": "markdown", "id": "e6b697aa", "metadata": {}, "source": [ "### Question 4.\n", "Redo the analysis uning priors Beta(1, 1), Beta(0.5, 0.5), Beta(100, 2) and report\n", "* the posterior mean,\n", "* 95% credible interval, and\n", "* $P(\\theta < 0.2 \\mid y)$.\n", "\n", "Based on testing different priors, would you consider the posterior results believe and defensible (w.r.t. to this data set)?" ] }, { "cell_type": "code", "execution_count": null, "id": "068b0000", "metadata": {}, "outputs": [], "source": [ "# I have done the computations for you.\n", "\n", "def analyze_prior(alpha_prior, beta_prior):\n", " alpha_posterior = alpha_prior + y\n", " beta_posterior = beta_prior + n - y\n", " posterior_mean = alpha_posterior / (alpha_posterior + beta_posterior)\n", " credible_interval = beta.interval(0.95, alpha_posterior, beta_posterior)\n", " probability = beta.cdf(0.2, alpha_posterior, beta_posterior)\n", " print(f\"\\nPrior: Beta({alpha_prior}, {beta_prior})\")\n", " print(f\"Posterior mean: {posterior_mean:.4f}\")\n", " print(f\"95% credible interval: ({credible_interval[0]:.4f}, {credible_interval[1]:.4f})\")\n", " print(f\"P(θ < 0.2 | y) = {probability:.4f}\")\n", "\n", "\n", "# Beta(2, 10) - our original prior\n", "alpha_prior = 2\n", "beta_prior = 10\n", "analyze_prior(alpha_prior, beta_prior)\n", "\n", "# Beta(1,1)\n", "alpha_prior = 1\n", "beta_prior = 1\n", "analyze_prior(alpha_prior, beta_prior)\n", "\n", "# Beta(0.5, 0.5)\n", "alpha_prior = 0.5\n", "beta_prior = 0.5\n", "analyze_prior(alpha_prior, beta_prior)\n", "\n", "# Beta(100, 2)\n", "alpha_prior = 100\n", "beta_prior = 2\n", "analyze_prior(alpha_prior, beta_prior)" ] }, { "cell_type": "markdown", "id": "32bb33ef", "metadata": {}, "source": [ "And here is a plot." ] }, { "cell_type": "code", "execution_count": null, "id": "154b7917", "metadata": {}, "outputs": [], "source": [ "# A 2x2 grid of plots for the priors and posteriors\n", "fig, axes = plt.subplots(2, 2, figsize=(12, 10))\n", "priors = [(2, 10), (1, 1), (0.5, 0.5), (100, 2)]\n", "titles = ['Beta(2, 10)', 'Beta(1, 1)', 'Beta(0.5, 0.5)', 'Beta(100, 2)']\n", "for ax, (alpha_prior, beta_prior), title in zip(axes.flatten(), priors, titles):\n", " alpha_posterior = alpha_prior + y\n", " beta_posterior = beta_prior + n - y\n", " \n", " x = np.linspace(0, 1, 100)\n", " prior_pdf = beta.pdf(x, alpha_prior, beta_prior)\n", " posterior_pdf = beta.pdf(x, alpha_posterior, beta_posterior)\n", " \n", " ax.plot(x, prior_pdf, label='Prior', color='blue')\n", " ax.plot(x, posterior_pdf, label='Posterior', color='orange')\n", " ax.set_title(title)\n", " ax.legend()\n", "plt.tight_layout()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "5645573b", "metadata": {}, "source": [ "# Introduction to PyMC\n", "\n", "All the examples we have done so far have been analytical. We have been able to write down the likelihood and the prior, and then derive the posterior distribution using Bayes' theorem. However, in many cases, it is not possible to derive the posterior distribution analytically, since integrating the denominator in equation\n", "$$\n", "p(\\theta \\mid y) = \\frac{p(y \\mid \\theta) p(\\theta)}{\\int p(y \\mid \\theta) p(\\theta) d\\theta}\n", "$$\n", "can be difficult or analytically intractable. Sometimes this problem can be circumvented by only using the unnormalized posterior distribution, since the denominator does not depend on $\\theta$.\n", "\n", "Even worse, in many cases the likelihood or the prior may not have an analytical form.\n", "\n", "In such cases, approximation methods come to rescue and they have been incorporated into software packages such as PyMC. \n", "\n" ] }, { "cell_type": "markdown", "id": "12d46d1f", "metadata": {}, "source": [ "### Example 1\n", "\n", "Here is an example of how to use PyMC to perform Bayesian inference for the same problem as above. To make it more interesting, we will use as a prior a truncated normal distribution." ] }, { "cell_type": "code", "execution_count": null, "id": "7462298c", "metadata": {}, "outputs": [], "source": [ "# Plot truncated normal and beta(2, 10)\n", "from scipy.stats import truncnorm, beta\n", "\n", "x = np.linspace(0, 1, 100)\n", "truncated_normal_pdf = truncnorm.pdf(x, (0 - 0.15) /\n", " 0.1, (1 - 0.15) / 0.1, loc=0.15, scale=0.1)\n", "beta_pdf = beta.pdf(x, 2, 10)\n", "plt.plot(x, truncated_normal_pdf, label='Truncated Normal', color='blue')\n", "plt.plot(x, beta_pdf, label='Beta(2, 10)', color='orange')\n", "plt.legend()\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "4357072e", "metadata": {}, "source": [ "Here we define the model and sample the posterior distribution." ] }, { "cell_type": "code", "execution_count": null, "id": "8c3795a7", "metadata": {}, "outputs": [], "source": [ "import pymc as pm\n", "import arviz as az\n", "\n", "with pm.Model() as model:\n", " theta = pm.TruncatedNormal('theta', mu=0.15, sigma=0.1, lower=0, upper=1)\n", " y_obs = pm.Binomial('y_obs', n=274, p=theta, observed=44)\n", " trace = pm.sample(2000, tune=1000)" ] }, { "cell_type": "code", "execution_count": null, "id": "8fb63062", "metadata": {}, "outputs": [], "source": [ "az.summary(trace)" ] }, { "cell_type": "code", "execution_count": null, "id": "967c572f", "metadata": {}, "outputs": [], "source": [ "az.plot_trace(trace)\n", "plt.show()" ] }, { "cell_type": "code", "execution_count": null, "id": "450009fc", "metadata": {}, "outputs": [], "source": [ "az.plot_posterior(trace, var_names=['theta'])\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "1d8d51ff", "metadata": {}, "source": [ "Probability $P(\\theta < 0.2 \\mid y)$" ] }, { "cell_type": "code", "execution_count": null, "id": "c9a6511b", "metadata": {}, "outputs": [], "source": [ "np.mean(trace.posterior['theta'] < 0.2).item() # item() converts a 0-dim array to a scalar" ] }, { "cell_type": "code", "execution_count": null, "id": "65d7a45a", "metadata": {}, "outputs": [], "source": [ "# Calculate posterior predictive distribution\n", "with model:\n", " pp = pm.sample_posterior_predictive(trace)" ] }, { "cell_type": "markdown", "id": "15ebcb25", "metadata": {}, "source": [ "### Example 2\n", "\n", "Bayesian inference and probabilistic programming allow us to easily contruct quite complex models. The following example is from [here](https://www.pymc.io/projects/docs/en/latest/learn/core_notebooks/pymc_overview.html#case-study-2-coal-mining-disasters).\n", "\n", "Consider the following time series of recorded coal mining disasters in the UK from 1851 to 1962. The number of disasters is thought to have been affected by changes in safety regulations during this period." ] }, { "cell_type": "code", "execution_count": null, "id": "addcf2fa", "metadata": {}, "outputs": [], "source": [ "disaster_data = np.array([4, 5, 4, 0, 1, 4, 3, 4, 0, 6, 3, 3, 4, 0, 2, 6,\n", " 3, 3, 5, 4, 5, 3, 1, 4, 4, 1, 5, 5, 3, 4, 2, 5,\n", " 2, 2, 3, 4, 2, 1, 3, np.nan, 2, 1, 1, 1, 1, 3, 0, 0,\n", " 1, 0, 1, 1, 0, 0, 3, 1, 0, 3, 2, 2, 0, 1, 1, 1,\n", " 0, 1, 0, 1, 0, 0, 0, 2, 1, 0, 0, 0, 1, 1, 0, 2,\n", " 3, 3, 1, np.nan, 2, 1, 1, 1, 1, 2, 4, 2, 0, 0, 1, 4,\n", " 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1]\n", ")\n", "\n", "years = np.arange(1851, 1962)\n", "\n", "plt.plot(years, disaster_data, \"o\", markersize=8, alpha=0.4)\n", "plt.ylabel(\"Disaster count\")\n", "plt.xlabel(\"Year\");\n" ] }, { "cell_type": "code", "execution_count": null, "id": "0bbaea68", "metadata": {}, "outputs": [], "source": [ "# Summarize the disaster data\n", "print(\"Total number of disasters: \", np.nansum(disaster_data))\n", "print(\"Mean number of disasters per year: \", np.nanmean(disaster_data))\n", "# Count of years with different number of disasters\n", "for i in range(7):\n", " print(f\"Number of years with {i} disasters: \", np.sum(disaster_data == i))\n" ] }, { "cell_type": "markdown", "id": "ddc45dfd", "metadata": {}, "source": [ "Occurrences of disasters in the time series is thought to follow a Poisson process with a large rate parameter in the early part of the time series, and from one with a smaller rate in the later part. We are interested in locating the change point in the series, which is perhaps related to changes in mining safety regulations.\n", "\n", "Our model is as follows:\n", "$$\n", "\\begin{aligned}\n", "D_t &\\sim \\text{Pois}(r_t), \\; r_t = \\begin{cases}\\lambda_1, & t < s \\\\ \\lambda_2, & t \\geq s \\end{cases} \\\\\n", "\\lambda_1 &\\sim \\text{Exponential}(1) \\\\\n", "\\lambda_2 &\\sim \\text{Exponential}(1) \\\\\n", "s &\\sim \\text{DiscreteUniform}(t_l, t_u)\n", "\\end{aligned}\n", "$$\n", "The parameters are defined as follows:\n", "* $D_t$: number of disasters in year $t$,\n", "* $\\lambda_1$: rate parameter for the early part of the time series,\n", "* $\\lambda_2$: rate parameter for the later part of the time series,\n", "* $s$: change point in the time series,\n", "* $t_l$: lower bound for the change point, and\n", "* $t_u$: upper bound for the change point." ] }, { "cell_type": "code", "execution_count": null, "id": "9eddfdec", "metadata": {}, "outputs": [], "source": [ "with pm.Model() as disaster_model:\n", " switchpoint = pm.DiscreteUniform(\"switchpoint\", lower=years.min(), upper=years.max())\n", "\n", " # Priors for pre- and post-switch rates number of disasters\n", " early_rate = pm.Exponential(\"early_rate\", 1.0)\n", " late_rate = pm.Exponential(\"late_rate\", 1.0)\n", "\n", " # Allocate appropriate Poisson rates to years before and after current\n", " rate = pm.math.switch(switchpoint >= years, early_rate, late_rate)\n", "\n", " disasters = pm.Poisson(\"disasters\", rate, observed=disaster_data)" ] }, { "cell_type": "code", "execution_count": null, "id": "e1c6686b", "metadata": {}, "outputs": [], "source": [ "pm.model_to_graphviz(disaster_model)" ] }, { "cell_type": "code", "execution_count": null, "id": "fb2e3680", "metadata": {}, "outputs": [], "source": [ "# Since the switchpoint is a discrete variable,\n", "# PyMC automatically usese Metropolis-Hastings instead of NUTS for sampling.\n", "with disaster_model:\n", " idata = pm.sample(10000)" ] }, { "cell_type": "code", "execution_count": null, "id": "0d816c8b", "metadata": {}, "outputs": [], "source": [ "az.summary(idata, var_names=[\"switchpoint\", \"early_rate\", \"late_rate\"])" ] }, { "cell_type": "code", "execution_count": null, "id": "829f3a3e", "metadata": {}, "outputs": [], "source": [ "axes_arr = az.plot_trace(idata)\n", "# Make more space between the subplots\n", "plt.subplots_adjust(hspace=0.6)\n", "plt.draw()\n", "for ax in axes_arr.flatten():\n", " if ax.get_title() == \"switchpoint\":\n", " labels = [label.get_text() for label in ax.get_xticklabels()]\n", " ax.set_xticklabels(labels, rotation=45, ha=\"right\")\n", " break\n", "plt.draw()" ] }, { "cell_type": "code", "execution_count": null, "id": "4777478b", "metadata": {}, "outputs": [], "source": [ "plt.figure(figsize=(10, 8))\n", "plt.plot(years, disaster_data, \".\", alpha=0.6)\n", "plt.ylabel(\"Number of accidents\", fontsize=16)\n", "plt.xlabel(\"Year\", fontsize=16)\n", "\n", "trace = idata.posterior.stack(draws=(\"chain\", \"draw\"))\n", "\n", "plt.vlines(trace[\"switchpoint\"].mean(), np.nanmin(disaster_data), np.nanmax(disaster_data), color=\"C1\")\n", "average_disasters = np.zeros_like(disaster_data, dtype=\"float\")\n", "for i, year in enumerate(years):\n", " idx = year < trace[\"switchpoint\"]\n", " average_disasters[i] = np.mean(np.where(idx, trace[\"early_rate\"], trace[\"late_rate\"]))\n", "\n", "sp_hpd = az.hdi(idata, var_names=[\"switchpoint\"])[\"switchpoint\"].values\n", "plt.fill_betweenx(\n", " y=[np.nanmin(disaster_data), np.nanmax(disaster_data)],\n", " x1=sp_hpd[0],\n", " x2=sp_hpd[1],\n", " alpha=0.5,\n", " color=\"C1\",\n", ")\n", "plt.plot(years, average_disasters, \"k--\", lw=2)\n", "plt.show()" ] }, { "cell_type": "markdown", "id": "bae0350a", "metadata": {}, "source": [ "# Exercise 2" ] }, { "cell_type": "markdown", "id": "055fcf79", "metadata": {}, "source": [ "Reginald and Tony each have a coin with biases ```theta_1``` and ```theta_2``` respectively. They toss their coins a few times and recorded the following outcomes:" ] }, { "cell_type": "code", "execution_count": null, "id": "15e3a635", "metadata": {}, "outputs": [], "source": [ "url = \"https://github.com/jodyabney/DBDA2Eprograms/raw/refs/heads/master/z6N8z2N7.csv\"\n", "data = pd.read_csv(url)\n", "data" ] }, { "cell_type": "markdown", "id": "f8e24ee8", "metadata": {}, "source": [ "Your task is to construct a PyMC model and find the difference in the biases of the coins, i.e. ```theta_1 - theta_2```." ] }, { "cell_type": "code", "execution_count": null, "id": "9d61e207", "metadata": {}, "outputs": [], "source": [ "# Start writing your code here" ] }, { "cell_type": "markdown", "id": "a93fd266", "metadata": {}, "source": [ "# Exercise 3\n", "\n", "This exercise is from Jüristo's Practical Bayesian statistics course.\n", "\n", "Data was collected from gym teachers of high school boys. For each boy, we know\n", "* the height (in cm),\n", "* their 100m sprint time (in s), and\n", "* whether they liked playing basketball, dodgeball and soccer in the class." ] }, { "cell_type": "code", "execution_count": null, "id": "6c0fff1a", "metadata": {}, "outputs": [], "source": [ "url = \"https://raw.githubusercontent.com/tarmojuristo/bayes2025/main/data/gym_class.csv\"\n", "gym_data = pd.read_csv(url, index_col=0)\n", "gym_data.head()" ] }, { "cell_type": "markdown", "id": "96f2259d", "metadata": {}, "source": [ "We are interested in the effect of height on sprint times. Let's build a simple (Bayesian) linear regression model for this data." ] }, { "cell_type": "code", "execution_count": null, "id": "690ab960", "metadata": {}, "outputs": [], "source": [ "#Fill in the blanks\n", "\n", "with pm.Model() as gym_model:\n", " const = pm.Normal(\"const\", mu=_, sigma=_)\n", " height_coeff = pm.Normal(\"height_coeff\", mu=_, sigma=_)\n", " liking_coeff = pm.Normal(\"liking_coeff\", mu=_, sigma=_, size=3)\n", "\n", " res = const + height_coeff * gym_data[\"height\"] + pm.math.dot(liking_coeff, gym_data[[\"basketball\", \"dodgeball\", \"soccer\"]].values.T)\n", "\n", " std = pm.HalfNormal(\"std\")\n", "\n", " pm.Normal(\"obs\", mu=res, sigma=std, observed=gym_data[\"sprint\"])\n", "\n", " trace = pm.sample()" ] }, { "cell_type": "code", "execution_count": null, "id": "5c7af728", "metadata": {}, "outputs": [], "source": [ "az.plot_posterior(trace);" ] }, { "cell_type": "code", "execution_count": null, "id": "8d56f2aa", "metadata": {}, "outputs": [], "source": [ "pm.summary(trace)" ] }, { "cell_type": "markdown", "id": "efb928ad", "metadata": {}, "source": [ "This is a plot of the raw data." ] }, { "cell_type": "code", "execution_count": null, "id": "b5c85060", "metadata": {}, "outputs": [], "source": [ "from plotnine import *\n", "ggplot(gym_data,aes(x='height',y='sprint',color='basketball')) + geom_point()" ] }, { "cell_type": "markdown", "id": "b1342087", "metadata": {}, "source": [ "Should we make changes to our model?" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.8" } }, "nbformat": 4, "nbformat_minor": 5 }