{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "163bb85c",
   "metadata": {},
   "source": [
    "# CanDI and DESeq2\n",
    "Let's say I want to look at changes in RNA expression across some cell lines in CCLE. DESeq2 is my preffered tool for doing differential expression analysis, unforutantely it's written in R. CanDI makes it easy to format CCLE read counts data into the shape that DESeq2 expects."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "72858c31",
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import anndata as ad\n",
    "import CanDI.candi as can\n",
    "\n",
    "from CanDI.pipelines import diffexp"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "319b94c2",
   "metadata": {},
   "source": [
    "#### Object Instantiation\n",
    "For this example I'm going to do differential expression analysis across male and female KRAS mutant cell lines. The cell below uses CanDI to generate the correct CellLineCluster objects for our purpose."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "ddd21097",
   "metadata": {},
   "outputs": [],
   "source": [
    "if type(can.data.mutations) != pd.DataFrame:\n",
    "    can.data.load('mutations')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "e3794753",
   "metadata": {},
   "outputs": [],
   "source": [
    "lung = can.Cancer(\"Lung Cancer\", subtype = \"NSCLC\")\n",
    "lung = can.CellLineCluster(lung.mutated(\"KRAS\", variant = \"Variant_Classification\", item = \"Missense_Mutation\"))\n",
    "\n",
    "lung_male = can.CellLineCluster(list(lung._info.loc[lung._info.sex == \"Male\",].index))\n",
    "lung_female = can.CellLineCluster(list(lung._info.loc[lung._info.sex == \"Female\"].index))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5aae975f",
   "metadata": {},
   "source": [
    "#### Data Munging\n",
    "The follow function takes two objects that we want to compare and automatically generates the counts and coldata matrices that DESeq2 needs to run. It's typically a good idea to filter our genes/transcripts with consistently low counts prior to running DESeq2. This speeds up analysis and avoids issues related to read count scaling and multiple hypthothesis testing corrections. In this case we don't care about different splicing of the same genes so I sum counts for duplicate indeces for all samples. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "29d9ab4c",
   "metadata": {},
   "outputs": [],
   "source": [
    "if type(can.data.rnaseq_reads) != pd.DataFrame:\n",
    "    can.data.load('rnaseq_reads')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "c697995d",
   "metadata": {},
   "outputs": [],
   "source": [
    "def make_counts_coldata(obj1, obj2, condition, factor1, factor2):\n",
    "    \n",
    "    counts1 = obj1.rnaseq_reads\n",
    "    coldat1 = pd.Series(counts1.shape[1] * [factor1], index = counts1.columns, name = condition)\n",
    "    \n",
    "    counts2 = obj2.rnaseq_reads\n",
    "    coldat2 = pd.Series(counts2.shape[1] * [factor2], index = counts2.columns, name = condition)\n",
    "    \n",
    "    #Concatenate Column Data\n",
    "    coldat = pd.concat([coldat1, coldat2], axis = 0)\n",
    "    #Concatenate read count data \n",
    "    counts_mat = pd.concat([counts1, counts2], axis = 1)\n",
    "    #Sum duplicate indeces\n",
    "    counts_mat = counts_mat.groupby(counts_mat.index).sum().astype(int)\n",
    "    \n",
    "    adata = ad.AnnData(\n",
    "        counts_mat.T,\n",
    "        obs = coldat.to_frame()\n",
    "    )\n",
    "\n",
    "    return adata"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "fe08e848",
   "metadata": {},
   "outputs": [],
   "source": [
    "adata = make_counts_coldata(lung_male, lung_female, \"sex\", \"male\", \"female\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d148ea96",
   "metadata": {},
   "source": [
    "#### Running pyDESeq2\n",
    "<!-- In the following cell I use the csvs I just saved as arguments for an r-script that runs DESeq2. The last argument in this script the filname for the results. -->"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "ceb0a995",
   "metadata": {},
   "outputs": [
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "Fitting size factors...\n",
      "... done in 0.05 seconds.\n",
      "\n",
      "Fitting dispersions...\n",
      "... done in 1.39 seconds.\n",
      "\n",
      "Fitting dispersion trend curve...\n",
      "... done in 0.58 seconds.\n",
      "\n",
      "Fitting MAP dispersions...\n",
      "... done in 1.42 seconds.\n",
      "\n",
      "Fitting LFCs...\n",
      "... done in 2.28 seconds.\n",
      "\n",
      "Replacing 5676 outlier genes.\n",
      "\n",
      "Fitting dispersions...\n",
      "... done in 0.28 seconds.\n",
      "\n",
      "Fitting MAP dispersions...\n",
      "... done in 0.28 seconds.\n",
      "\n",
      "Fitting LFCs...\n",
      "... done in 0.36 seconds.\n",
      "\n",
      "Running Wald tests...\n",
      "... done in 1.64 seconds.\n",
      "\n"
     ]
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Log2 fold change & Wald test p-value: sex male vs female\n",
      "              baseMean  log2FoldChange     lfcSE      stat    pvalue      padj\n",
      "gene                                                                          \n",
      "A1BG        200.696432        0.027944  0.671884  0.041591  0.966825  0.994527\n",
      "A1BG-AS1    255.912479        0.202698  0.626175  0.323708  0.746159  0.950009\n",
      "A1CF         12.057056       -1.367054  0.537503 -2.543342  0.010980  0.236614\n",
      "A2M          18.685828        0.365012  0.619866  0.588857  0.555957  0.902738\n",
      "A2M-AS1      49.283425       -0.266289  0.618898 -0.430264  0.667004  0.933027\n",
      "...                ...             ...       ...       ...       ...       ...\n",
      "ZYG11AP1      0.038949        0.057154  3.391294  0.016853  0.986554       NaN\n",
      "ZYG11B     2200.470135       -0.200760  0.214134 -0.937546  0.348478  0.816844\n",
      "ZYX       11155.922014        0.206356  0.377946  0.545994  0.585070  0.912754\n",
      "ZZEF1      4400.173132       -0.483351  0.222239 -2.174911  0.029637  0.362384\n",
      "ZZZ3       3532.326301       -0.212461  0.170959 -1.242756  0.213958  0.722301\n",
      "\n",
      "[52443 rows x 6 columns]\n"
     ]
    }
   ],
   "source": [
    "results = diffexp.run_deseq(\n",
    "    adata,\n",
    "    design = 'sex',\n",
    "    tested_level = 'male',\n",
    "    ref_level = 'female'\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "275e042d",
   "metadata": {},
   "source": [
    "#### Analyzing Results\n",
    "Now we can read the results of the differential expression analysis back into our python enviroment and continue analysis as necessary. Looking at the genes with the lowest adjusted p-value we see that XIST is the most significant differentially expressed genes. XIST is a lncRNA responsible for X inactivation in females so this a good positive control for our analysis."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "f72ee6cd",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>baseMean</th>\n",
       "      <th>log2FoldChange</th>\n",
       "      <th>lfcSE</th>\n",
       "      <th>stat</th>\n",
       "      <th>pvalue</th>\n",
       "      <th>padj</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>gene</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>XIST</th>\n",
       "      <td>4041.971706</td>\n",
       "      <td>-7.145007</td>\n",
       "      <td>0.710899</td>\n",
       "      <td>-10.050660</td>\n",
       "      <td>9.125401e-24</td>\n",
       "      <td>2.692176e-19</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>RPS4Y1</th>\n",
       "      <td>6613.044427</td>\n",
       "      <td>8.723193</td>\n",
       "      <td>1.031580</td>\n",
       "      <td>8.456144</td>\n",
       "      <td>2.763634e-17</td>\n",
       "      <td>4.076636e-13</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>CEACAM5</th>\n",
       "      <td>10444.137420</td>\n",
       "      <td>-7.282857</td>\n",
       "      <td>0.892029</td>\n",
       "      <td>-8.164374</td>\n",
       "      <td>3.231066e-16</td>\n",
       "      <td>3.177430e-12</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>DDX3Y</th>\n",
       "      <td>1195.388509</td>\n",
       "      <td>7.642864</td>\n",
       "      <td>0.983916</td>\n",
       "      <td>7.767805</td>\n",
       "      <td>7.985781e-15</td>\n",
       "      <td>5.889913e-11</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>GJB1</th>\n",
       "      <td>58.891176</td>\n",
       "      <td>-5.623715</td>\n",
       "      <td>0.752347</td>\n",
       "      <td>-7.474896</td>\n",
       "      <td>7.726502e-14</td>\n",
       "      <td>4.558945e-10</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "             baseMean  log2FoldChange     lfcSE       stat        pvalue  \\\n",
       "gene                                                                       \n",
       "XIST      4041.971706       -7.145007  0.710899 -10.050660  9.125401e-24   \n",
       "RPS4Y1    6613.044427        8.723193  1.031580   8.456144  2.763634e-17   \n",
       "CEACAM5  10444.137420       -7.282857  0.892029  -8.164374  3.231066e-16   \n",
       "DDX3Y     1195.388509        7.642864  0.983916   7.767805  7.985781e-15   \n",
       "GJB1        58.891176       -5.623715  0.752347  -7.474896  7.726502e-14   \n",
       "\n",
       "                 padj  \n",
       "gene                   \n",
       "XIST     2.692176e-19  \n",
       "RPS4Y1   4.076636e-13  \n",
       "CEACAM5  3.177430e-12  \n",
       "DDX3Y    5.889913e-11  \n",
       "GJB1     4.558945e-10  "
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "results.sort_values(\"padj\").head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "cde84608",
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}