{ "cells": [ { "cell_type": "markdown", "id": "e59345f4", "metadata": {}, "source": [ "## Exercise 1\n", "\n", "\n", "\n", "Consider the paper \"Beautiful parents have more daughters: ...\" by Kanazawa (2007), who analyzed data from a longitudinal study that included a measure of respondents' attractiveness $(1-5)$, and followed up over the several years and recorded the sex of these people's children.\n", "\n", "**The paper asserts that attractive parents (category 5) have more daughters than parents in other attractiveness categories (categories 1-4).**\n", "\n", "Let's set aside potential concerns of selection and multiple comparisons, because they are not the focus of this exercise." ] }, { "cell_type": "markdown", "id": "4dbc165f", "metadata": {}, "source": [ "### Data" ] }, { "cell_type": "markdown", "id": "39db1df6", "metadata": {}, "source": [ "Let's import the data and take a look at it." ] }, { "cell_type": "code", "execution_count": null, "id": "0f8926b6", "metadata": {}, "outputs": [], "source": [ "import numpy as np\n", "import pandas as pd\n", "import pymc as pm\n", "import matplotlib.pyplot as plt\n", "import arviz as az\n", "\n", "# Read data.csv\n", "data = pd.read_csv('data.csv')\n", "data.head()" ] }, { "cell_type": "markdown", "id": "8f7566b5", "metadata": {}, "source": [ "We will use only\n", "- attractiveness (1-Very unattractive, 2-Unattractive, 3-About average, 4-Attractive, 5-Very attractive)\n", "- first_child_sex (0-female, 1-male)" ] }, { "cell_type": "code", "execution_count": null, "id": "28bd01ab", "metadata": {}, "outputs": [], "source": [ "# Create a subset for category 5 and category 1-4\n", "cat5 = data[data['attractiveness'] == 5]\n", "cat1_4 = data[data['attractiveness'] < 5]\n", "\n", "# Calculate the sex ratios\n", "sex_ratio_cat5 = cat5['first_child_sex'].mean()\n", "sex_ratio_cat1_4 = cat1_4['first_child_sex'].mean()\n", "\n", "print(f\"Proportion of boys for category 5: {sex_ratio_cat5*100:.1f}%\")\n", "print(f\"Proportion of boys for categories 1-4: {sex_ratio_cat1_4*100:.1f}%\")\n", "print(f\"Difference: {(sex_ratio_cat5 - sex_ratio_cat1_4)*100:.1f} percentage points\")" ] }, { "cell_type": "markdown", "id": "ba2f1e46", "metadata": {}, "source": [ "### Frequentist analysis\n", "Let's start with a frequentist analysis of the data. We will use Welch's t-test to compare the mean proportion of daughters between the two groups." ] }, { "cell_type": "code", "execution_count": null, "id": "3f7aab0c", "metadata": {}, "outputs": [], "source": [ "# Compute the t-statistic and p-value for the difference\n", "from scipy import stats\n", "t_stat, p_value = stats.ttest_ind(cat5['first_child_sex'], cat1_4['first_child_sex'], equal_var=False)\n", "\n", "print(f\"T-statistic: {t_stat:.2f}, p-value: {p_value:.4f}\")\n", "print(f\"Standard error: {(sex_ratio_cat5 - sex_ratio_cat1_4)/t_stat*100:.1f} percentage points\")" ] }, { "cell_type": "markdown", "id": "62809f3f", "metadata": {}, "source": [ "Because the p-value is less than 0.05, the result is statistically significant. We therefore conclude that attractive parents have an estimated 7.3 percentage points more daughters than parents in other attractiveness categories, with a standard error of 3.1 percentage points (i.e., $7.3 \\pm 6.2$ percentage points for a 95% confidence interval)." ] }, { "cell_type": "markdown", "id": "a276b63e", "metadata": {}, "source": [ "### Bayesian analysis\n", "\n", "Let's now do a Bayesian analysis of the same data. Let\n", "- $p_1$ be the probability of girl births for beautiful parents (category 5),\n", "- $p_2$ for others (categories 1-4) and\n", "- $\\delta = p_1 - p_2$.\n", "\n", "Our goal is to do inference about $\\delta$." ] }, { "cell_type": "markdown", "id": "bbc18897", "metadata": {}, "source": [ "#### Task 1:\n", "Build a Bayesian model and estimate $\\delta$.\n", "Use **non-informative priors** of your choice (e.g. uniform or Jeffreys' prior).\n", "If you want, you can build second model, where $p_2$ is a fixed value equal to the overall proportion of boy births in USA (0.515) and only estimate $\\delta$ through $p_1$." ] }, { "cell_type": "code", "execution_count": null, "id": "91a8e425", "metadata": {}, "outputs": [], "source": [ "# Do stuff here" ] }, { "cell_type": "markdown", "id": "41805ee2", "metadata": {}, "source": [ "Obtained results should be similar to the one obtained in the frequentist analysis.\n", "\n", "---" ] }, { "cell_type": "markdown", "id": "36c113fd", "metadata": {}, "source": [ "The problem with the above analysis is that non-informative prior distribution contradicts what is known about the stability of the human sex ratio.\n", "\n", "Previous studies have shown, over time and across populations, that the proportion of girl births has been remarkably stable at about 48.5%.\n", "\n", "There is some variation but this variation is low, except in cases of selective abortion, infanticide, and extreme poverty and famine. For example, the proportion of girl births is about half a percentage point higher among whites than blacks in the United States, and there are similar or smaller differences when comparing younger mothers and older mothers, babies born in different seasons of the year, and other factors that have been studied.\n", "\n", "It is hard to believe that sex ratio has a higher correlation with parental attractiveness than with other variables." ] }, { "cell_type": "markdown", "id": "f05f1b2b", "metadata": {}, "source": [ "#### Task 2:\n", "Do a Bayesian analysis with\n", "- a fully informative prior (how do you believe that parental attractiveness affects the sex ratio?) and\n", "- a weakly informative prior (fully informative prior, but spread out / something between the non-informative and fully informative prior)." ] }, { "cell_type": "code", "execution_count": null, "id": "6948bc8b", "metadata": {}, "outputs": [], "source": [ "# Do stuff here" ] }, { "cell_type": "markdown", "id": "a87837a0", "metadata": {}, "source": [ "The point of this exercise is that for the particular problem of estimating the parameter $\\delta$, which certainly should be less than 1 percentage point—the available data from 3000 survey respondents is weak. A uniform prior then represents a strong statement that $\\delta$ can be large.\n", "\n", "In other settings, however, a uniform prior distribution for a parameter estimated in this way would work just fine: 3000 is a large sample size for the purpose of estimating real underlying differences of 5 percentage points or more.\n", "\n", "---\n", "\n", "This exercise was based on an example from the article \"[The prior can often only be understood in the context of the likelihood](https://arxiv.org/pdf/1708.07487)\". Lot of the text was taken from the article." ] }, { "cell_type": "markdown", "id": "7d86cf49", "metadata": {}, "source": [ "## Exercise 2" ] }, { "cell_type": "markdown", "id": "9f6b02ca", "metadata": {}, "source": [ "One problem when choosing priors is that it may be difficult to understand their effect as they propagate down the model into the data. The choices we made in the parameter space may induce something unexpected in the observable data space. A very helpful tool to better understand our assumptions is the prior predictive distribution." ] }, { "cell_type": "markdown", "id": "f6004b21", "metadata": {}, "source": [ "Let's build a model of a football player shooting penalty kicks.\n", "\n", "We assume that the player is aiming at the center of the goal $\\alpha = 0$, but due to other factors $\\alpha \\sim \\mathcal{N}(0, \\sigma)$, where $\\sigma$ is unknown.\n", "\n", "Using little bit of trigonometry, we come up with the following formula for the probability of scoring a goal:\n", "\n", "$$\n", "p_{\\text{goal}} = p\\left(|\\alpha| < \\tan^{-1}\\left(\\frac{L}{x}\\right)\\right) = 2\\Phi\\left(\\frac{\\tan^{-1}\\left(\\frac{L}{x}\\right)}{\\sigma}\\right) - 1\n", "$$\n", "\n", "
\n", " \"drawing\"\n", "
" ] }, { "cell_type": "markdown", "id": "749bba26", "metadata": {}, "source": [ "Let's do a following model:\n", "$$\n", "\\begin{aligned}\n", "\\sigma &\\sim \\mathcal{HN}(\\text{sigma}) \\\\\n", "p_{\\text{goal}} &= 2\\Phi\\left(\\frac{\\tan^{-1}\\left(\\frac{L}{x}\\right)}{\\sigma}\\right) - 1\\\\\n", "Y &\\sim \\text{Bernoulli}(p_{\\text{goal}})\n", "\\end{aligned}\n", "$$" ] }, { "cell_type": "markdown", "id": "d628c51a", "metadata": {}, "source": [ "Let's test out the model.\n", "\n", "Te make the testing easier we will look at 3 models at once for different values of sigma. Insert 3 values of sigma in degrees to the list `sigmas_deg` and run the code below." ] }, { "cell_type": "code", "execution_count": null, "id": "8f963a6a", "metadata": {}, "outputs": [], "source": [ "from scipy.stats import Normal\n", "import pytensor.tensor as pt\n", "\n", "sigmas_deg = [] #<--- Insert 3 values here, e.g. [..., ..., ...]\n", "\n", "half_length = 3.66 # meters\n", "penalty_point = 11 # meters\n", "\n", "# CDF of the standard normal distribution\n", "def Phi(x):\n", " return 0.5 + 0.5 * pt.erf(x / pt.sqrt(2.0))\n", "\n", "ppss = []\n", "sigmas_rad = np.deg2rad(sigmas_deg)\n", "for sigma in sigmas_rad:\n", " with pm.Model() as model:\n", " σ = pm.HalfNormal(\"σ\", sigma)\n", " α = pm.Normal(\"α\", 0, σ)\n", " p_goal = pm.Deterministic(\"p_goal\", 2 * Phi(np.arctan(half_length / penalty_point) / σ) - 1)\n", " pps = pm.sample_prior_predictive(250)\n", " ppss.append(pps)" ] }, { "cell_type": "markdown", "id": "934ee144", "metadata": {}, "source": [ "Let's plot the prior predictive distributions for the 3 models." ] }, { "cell_type": "code", "execution_count": null, "id": "01f3643b", "metadata": {}, "outputs": [], "source": [ "fig, axes = plt.subplots(1, 3, subplot_kw=dict(projection=\"polar\"), figsize=(15, 6))\n", "\n", "max_angle = np.arctan(half_length/penalty_point)\n", "\n", "for sigma, pps, ax in zip(sigmas_deg, ppss, axes):\n", " filtered = pps.prior.where(pps.prior[\"p_goal\"] > 0.1, drop=True)\n", " cax = ax.scatter(\n", " filtered[\"α\"].values,\n", " np.ones_like(filtered[\"α\"].values),\n", " c=filtered[\"p_goal\"].values,\n", " marker=\".\",\n", " cmap=\"viridis_r\",\n", " vmin=0.1\n", " )\n", " ax.fill_between(np.linspace(-max_angle, max_angle, 100), 0, 1.01, alpha=0.25)\n", " ax.set_yticks([])\n", " ax.set_title(rf\"$\\sigma = \\mathcal{{HN}}({sigma})$\", pad=25)\n", " ax.plot(0,0, 'o')\n", "fig.subplots_adjust(wspace=0.3)\n", "fig.colorbar(\n", " cax,\n", " ax=axes,\n", " extend=\"min\",\n", " ticks=[1, 0.5, 0.1],\n", " shrink=0.7,\n", " aspect=40,\n", " pad=0.08\n", ")\n", "\n", "\n", "plt.show()" ] } ], "metadata": { "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.8" } }, "nbformat": 4, "nbformat_minor": 5 }