{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "0177aed2",
   "metadata": {},
   "source": [
    "# Survival analysis<br> and censored data\n",
    "\n",
    "### Question:\n",
    "Does a person's genotype influence their risk of developing cardiovascular disease (CVD)?\n",
    "\n",
    "### Dataset overview:\n",
    "You have access to data from a long-term study of 50,000 Estonian individuals who have been observed since birth. Each person was screened to determine whether or not they developed CVD during their lifetime. All individuals have also been genotyped for a genetic variant (genotype) associated with CVD risk. The goal is to investigate how this genotype influences the age at onset of CVD. The data also includes information about each person’s sex and body mass index.\n",
    "\n",
    "This is a survival dataset, meaning it includes follow-up time for each individual, captured by the \"age\" variable. Some individuals experienced the event of interest (CVD), while others were censored, i.e. they died without ever having the disease during the observation period.\n",
    "\n",
    "#### The dataset contains the following five variables:\n",
    "1.\tAge – the person’s age in years at the time of CVD diagnosis (if they had it), or age at death (if they did not)\n",
    "\n",
    "2.\tEvent – indicates whether the person experienced a CVD event:\n",
    "\n",
    "* 1: the person had CVD\n",
    "* 0: the person did not have CVD and died from another cause\n",
    "\n",
    "3.\tGenotype – the number of risk alleles a person has; it can be:\n",
    "* 0: no risk alleles (e.g., genotype \"AA\")\n",
    "* 1: one risk allele (e.g., \"Aa\")\n",
    "* 2: two risk alleles (e.g., \"aa\")\n",
    "Note: The genotype is located on an autosomal chromosome, meaning it is not associated with biological sex. \n",
    "\n",
    "4.\tBMI – body mass index (kg/m2)\n",
    "\n",
    "5.\tSex – the biological sex of the individual"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "7d36300f",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "df_full = pd.read_csv('CVD_data.csv')\n",
    "df_full.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c58bf2f3",
   "metadata": {},
   "source": [
    "### Data check\n",
    "Before moving on to the modeling, let's first check wheter the data looks realistic."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fb612f2a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Data check\n",
    "df_full.describe()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f0ced7d6",
   "metadata": {},
   "source": [
    "**Hardy-Weinberg equilibrium check**\n",
    "\n",
    "Since we are working with genotpyes (AA, Aa, aa), we can check if the genotype frequencies are approximately in Hardy-Weinberg equilibrium.\n",
    "\n",
    "Let $p_{A}, p_{a}$ be the allele frequencies in infinite population, then when everyone is randomly mating, we expect the genotype frequencies to be:\n",
    "- $p_{AA} = p_{A}^2$\n",
    "- $p_{Aa} = 2p_{A}p_{a}$\n",
    "- $p_{aa} = p_{a}^2$\n",
    "\n",
    "This is Hardy-Weinberg equilibrium. Large deviations from these expected frequencies could indicate issues with the data.\n",
    "\n",
    "\n",
    "### Task 1. Compare the observed genotype frequencies with those expected under Hardy-Weinberg equilibrium."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "87df4216",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_full = pd.read_csv('CVD_data.csv')\n",
    "N = len(df_full)\n",
    "\n",
    "p_AA = (df_full['genotype']==0).mean()\n",
    "p_Aa = (df_full['genotype']==1).mean()\n",
    "p_aa = (df_full['genotype']==2).mean()\n",
    "\n",
    "print(\"Distribution of genotypes in the dataset:\")\n",
    "print(f'#(AA)={N * p_AA:.0f}, #(Aa)={N * p_Aa:.0f}, #(aa)={N * p_aa:.0f}')\n",
    "\n",
    "\n",
    "####### Task 1. Fill in the blanks ########\n",
    "\n",
    "p_A = _\n",
    "p_a = _\n",
    "\n",
    "p_AA_H = _\n",
    "p_Aa_H = _\n",
    "p_aa_H = _\n",
    "\n",
    "###########################################\n",
    "\n",
    "print(\"\\nExpected distribution under Hardy-Weinberg equilibrium:\")\n",
    "print(f'#(AA)={N * p_AA_H:.0f}, #(Aa)={N * p_Aa_H:.0f}, #(aa)={N * p_aa_H:.0f}')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5290e955",
   "metadata": {},
   "source": [
    "**Question:** Without using any statistical tests, give your assessment of whether the observed genotype frequencies appear to be approximately in Hardy-Weinberg equilibrium?"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "26110259",
   "metadata": {},
   "source": [
    "### Method\n",
    "\n",
    "We are going to model the age, when the CVD is theoretically diagnosed, assuming infinite life span and that *almost surely* everyone will be diagnosed with CVD eventually.\n",
    "\n",
    "To model the age at CVD diagnosis, we use the Weibull distribution, which is parameterised by\n",
    "* **shape** $k$,\n",
    "* **scale** $\\lambda$.\n",
    "\n",
    "Weibull distribution is commonly used in survival analysis, as hazard rate is proportional to a power of time:\n",
    "$$\n",
    "    \\frac{k}{\\lambda}\\left(\\frac{t}{\\lambda}\\right)^{k-1}.\n",
    "$$\n",
    "This is useful, because for $k>1$, it models aging process, where person is more likely to be diagnosed as time goes on.\n",
    "\n",
    "\n",
    "**Censoring:**\n",
    "Unfortunately or fortunately, I suppose, not everyone gets diagnosed with CVD since they happen to die before they get diagnosed. This is called right-censoring, and we have to take it into account. Otherswise, we would be underestimating the age at CVD diagnosis.\n",
    "\n",
    "Censoring is taken into account by using the following likelihood:\n",
    "$$\n",
    "L(\\theta) = \\prod_{i=1}^n f(t_i | \\theta)^{\\delta_i} P(T > t_i | \\theta)^{1-\\delta_i}\n",
    "$$\n",
    "where $t_i$ is the observed age of death or CVD diagnosis for the $i$-th person and $\\delta_i$ is the event indicator (1 if the person had CVD, 0 if they were censored).\n",
    "\n",
    "---\n",
    "\n",
    "**Note 1**\n",
    "\n",
    "**Hazard rate** can be interpreted as\n",
    "$$\n",
    "h(t) = \\lim_{\\Delta t \\to 0} \\frac{P(t \\leq T < t + \\Delta t | T \\geq t)}{\\Delta t},\n",
    "$$\n",
    "where $T$ is a random variable representing the time of event (CVD diagnosis in our case). In other words, it is the instantaneous risk of experiencing the event at time $t$, given that the person has survived (not experienced the event) up to time $t$.\n",
    "\n",
    "**Note 2**\n",
    "\n",
    "**Weibull distribution** is a generalisation of the exponential distribution, which is a special case when $k=1$. In that case, the hazard rate is constant over time. The Weibull distribution allows for increasing ($k>1$) or decreasing ($k<1$) hazard rates.\n",
    "\n",
    "---\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "873774b3",
   "metadata": {},
   "source": [
    "### Causal DAG"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d7fcab73",
   "metadata": {},
   "source": [
    "Below we present DAG with the least amount of assumed independencies. Note that the **Death (Age)** is assumed to be before being diagnosed **CVD** and therefore we can assume that given all the common causes of **CVD** and **Death**, they are independent."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d3411a54",
   "metadata": {},
   "outputs": [],
   "source": [
    "from graphviz import Digraph\n",
    "\n",
    "dot = Digraph()\n",
    "\n",
    "# Nodes\n",
    "dot.node('G', 'Genotype')\n",
    "dot.node('B', 'BMI')\n",
    "dot.node('S', 'Sex')\n",
    "dot.node('T', 'Death (Age)')\n",
    "dot.node('C', 'CVD Diagnosis (Age)')\n",
    "dot.node('D', 'Age in the Dataset')\n",
    "dot.node('E', 'Event')\n",
    "\n",
    "# Edges\n",
    "dot.edge('G', 'B')\n",
    "dot.edge('S', 'B')\n",
    "dot.edge('G', 'T')\n",
    "dot.edge('S', 'T')\n",
    "dot.edge('B', 'T')\n",
    "dot.edge('G', 'C')\n",
    "dot.edge('B', 'C')\n",
    "dot.edge('S', 'C')\n",
    "dot.edge('T', 'D')\n",
    "dot.edge('C', 'D')\n",
    "dot.edge('T', 'E')\n",
    "dot.edge('C', 'E')\n",
    "\n",
    "dot"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "84f23f0b",
   "metadata": {},
   "source": [
    "To measure direct effect of **Genotype** on **CVD Diagnosis** we have to block the path **Genotype $\\rightarrow$ BMI $\\rightarrow$ CVD Diagnosis**. As we controlled for **BMI** we created a path **Genotype $\\rightarrow$ Sex $\\rightarrow$ CVD Diagnosis**. This means we have to control for **Sex** as well.\n",
    "\n",
    "Essentially this means that means that we are going to use all the covariates in the model."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2893d17f",
   "metadata": {},
   "source": [
    "### Model\n",
    "\n",
    "We use Weibull proportional hazards model. This means that shape parameter $k$ is constant and we use log-linear model for the scale parameter $\\lambda$:\n",
    "$$\n",
    "\\begin{aligned}\n",
    "\\eta &= \\alpha + \\beta \\cdot x,\\\\\n",
    "\\lambda &= \\exp(\\eta).\n",
    "\\end{aligned}\n",
    "$$\n",
    "\n",
    "We assign following priors to the parameters:\n",
    "$$\n",
    "\\begin{aligned}\n",
    "\\alpha &\\sim \\mathcal{N}(\\text{avg log age}, 1),\\\\\n",
    "\\beta_\\text{GENO} &\\sim \\mathcal{N}(0, 1),\\\\\n",
    "\\beta_\\text{SEX} &\\sim \\mathcal{N}(0, 1),\\\\\n",
    "\\beta_{\\text{BMI}_z} &\\sim \\mathcal{N}(0, 1),\\\\\n",
    "\\text{(shape) } k &\\sim \\text{Gamma}(2, 0.1).\n",
    "\\end{aligned}\n",
    "$$\n",
    "\n",
    "For simplicitly, let's assume that the effect of alleles is additive, meaning that the effect of having two risk alleles is twice the effect of having one risk allele.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6ac07763",
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import pymc as pm\n",
    "import arviz as az\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "#Let's keep the sample size small for performance (2k for experimentations, 5k for final model in class)\n",
    "df = df_full.sample(5000) \n",
    "\n",
    "# Observed times\n",
    "t_obs = df['age'].values\n",
    "\n",
    "# Censoring indicator (True if censored, False if CVD observed)\n",
    "censored = (df['event'] == 0).values\n",
    "\n",
    "# Covariates\n",
    "geno   = df[\"genotype\"].values\n",
    "sex    = (df[\"sex\"] == \"male\").astype(int).values\n",
    "## Standardize the BMI\n",
    "bmi_z = ((df['bmi'] - df['bmi'].mean()) / df['bmi'].std()).values\n",
    "\n",
    "with pm.Model() as survival_model:\n",
    "    # Priors\n",
    "    α = pm.Normal(\"intercept\", mu=np.log(t_obs).mean(), sigma=1)\n",
    "    β_geno = pm.Normal(\"β_geno\", mu=0, sigma=1)\n",
    "    β_sex = pm.Normal(\"β_sex\", mu=0, sigma=1)\n",
    "    β_bmi = pm.Normal(\"β_bmi\", mu=0, sigma=1)\n",
    "    shape = pm.Gamma(\"shape\", alpha=2, beta=0.1)\n",
    "\n",
    "    # Linear predictor\n",
    "    η = α + β_geno * geno + β_sex * sex + β_bmi * bmi_z\n",
    "    scale = pm.math.exp(η)\n",
    "\n",
    "    weib = pm.Weibull(alpha=shape, beta=scale, observed=t_obs)\n",
    "\n",
    "\n",
    "    trace = pm.sample(1000, tune=1000, target_accept=0.9, init=\"adapt_diag\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a121334f",
   "metadata": {},
   "outputs": [],
   "source": [
    "az.summary(trace, var_names=[\"intercept\", \"β_geno\", \"β_sex\", \"β_bmi\", \"shape\"])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e828c016",
   "metadata": {},
   "source": [
    "Let's plot the survival curves for men with average BMI (average over both sexes)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9b8e3e1d",
   "metadata": {},
   "outputs": [],
   "source": [
    "posterior = trace.posterior\n",
    "\n",
    "intercept_samples = posterior[\"intercept\"].values.flatten()\n",
    "β_geno_samples = posterior[\"β_geno\"].values.flatten()\n",
    "β_sex_samples = posterior[\"β_sex\"].values.flatten()\n",
    "β_bmi_samples = posterior[\"β_bmi\"].values.flatten()\n",
    "shape_samples = posterior[\"shape\"].values.flatten()\n",
    "\n",
    "t = np.linspace(t_obs.min(), t_obs.max(), 100)\n",
    "\n",
    "def generate_survival_curves(genotype):\n",
    "    η = (\n",
    "        intercept_samples\n",
    "        + β_geno_samples * genotype\n",
    "        + β_sex_samples * 1 # Male\n",
    "        + β_bmi_samples * 0 # Average centralized BMI\n",
    "    )\n",
    "\n",
    "    scale = np.exp(η)\n",
    "\n",
    "    survival_curves = np.exp(-(t / scale[:, None]) ** shape_samples[:, None])\n",
    "    return survival_curves\n",
    "\n",
    "\n",
    "surv_AA = generate_survival_curves(0)\n",
    "surv_Aa = generate_survival_curves(1)\n",
    "surv_aa = generate_survival_curves(2)\n",
    "\n",
    "fig, ax = plt.subplots(figsize=(10, 6))\n",
    "\n",
    "# Plot HDI with ArviZ\n",
    "az.plot_hdi(t, surv_AA, hdi_prob=0.94, ax=ax, color=\"green\")\n",
    "az.plot_hdi(t, surv_Aa, hdi_prob=0.94, ax=ax, color=\"blue\")\n",
    "az.plot_hdi(t, surv_aa, hdi_prob=0.94, ax=ax, color=\"red\")\n",
    "\n",
    "# Plot mean curves\n",
    "ax.plot(t, surv_AA.mean(axis=0), label=\"Genotype AA\", color=\"green\")\n",
    "ax.plot(t, surv_Aa.mean(axis=0), label=\"Genotype Aa\", color=\"blue\")\n",
    "ax.plot(t, surv_aa.mean(axis=0), label=\"Genotype aa\", color=\"red\")\n",
    "\n",
    "ax.set_xlabel(\"Age\")\n",
    "ax.set_ylabel(\"Survival Probability\")\n",
    "ax.set_title(\"Posterior Survival Curves by Genotype (ArviZ)\")\n",
    "ax.legend()\n",
    "ax.grid(True)\n",
    "plt.tight_layout()\n",
    "plt.show()\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "70b9c4c1",
   "metadata": {},
   "source": [
    "For comparison, we plot the non-parametric Kaplan-Meier estimator of the survival curve."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "737988b4",
   "metadata": {},
   "outputs": [],
   "source": [
    "from lifelines import KaplanMeierFitter\n",
    "\n",
    "geno_map = {0: \"AA\", 1: \"Aa\", 2: \"aa\"}\n",
    "df[\"genotype_str\"] = df[\"genotype\"].map(geno_map)\n",
    "\n",
    "colors = {\"AA\": \"green\", \"Aa\": \"blue\", \"aa\": \"red\"}\n",
    "\n",
    "# Set up the Kaplan-Meier fitter\n",
    "kmf = KaplanMeierFitter()\n",
    "\n",
    "# Plot\n",
    "plt.figure(figsize=(10, 6))\n",
    "\n",
    "for genotype in [\"AA\", \"Aa\", \"aa\"]:\n",
    "    mask = df[\"genotype_str\"] == genotype\n",
    "    T = df.loc[mask, \"age\"]  # Time to event (age)\n",
    "    E = df.loc[mask, \"event\"]  # Event indicator: 1 = event (CVD), 0 = censored\n",
    "    kmf.fit(T, event_observed=E, label=f\"Genotype {genotype}\")\n",
    "    kmf.plot_survival_function(ci_show=True, color=colors[genotype])\n",
    "\n",
    "plt.xlabel(\"Age\")\n",
    "plt.ylabel(\"Survival Probability\")\n",
    "plt.title(\"Kaplan–Meier Survival Curves for CVD Diagnosis by Genotype\")\n",
    "plt.grid(True)\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2819b40f",
   "metadata": {},
   "source": [
    "Finally, let's plot empirical survival curves (without accomodating for right-censoring and covariates)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "a5c3d024",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_uncensored = df[df[\"event\"] == 1].copy()\n",
    "\n",
    "# Map genotype to readable labels\n",
    "geno_map = {0: \"AA\", 1: \"Aa\", 2: \"aa\"}\n",
    "df_uncensored[\"genotype_str\"] = df_uncensored[\"genotype\"].map(geno_map)\n",
    "\n",
    "# Sort values for ECDF computation\n",
    "plt.figure(figsize=(10, 6))\n",
    "for genotype in [\"AA\", \"Aa\", \"aa\"]:\n",
    "    subset = df_uncensored[df_uncensored[\"genotype_str\"] == genotype]\n",
    "    ages = np.sort(subset[\"age\"])\n",
    "    ecdf = np.arange(1, len(ages) + 1) / len(ages)\n",
    "    survival = 1 - ecdf\n",
    "    plt.step(ages, survival, where=\"post\", label=f\"Genotype {genotype}\", color=colors[genotype])\n",
    "\n",
    "plt.xlabel(\"Age\")\n",
    "plt.ylabel(\"Survival Probability (Empirical)\")\n",
    "plt.title(\"Empirical Survival Function of CVD Diagnosis by Genotype\")\n",
    "plt.legend(title=\"Genotype\")\n",
    "plt.grid(True)\n",
    "plt.tight_layout()\n",
    "plt.show()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}