---
title: "Coloc: using SuSiE to relax the single causal variant assumption"
author: "Chris Wallace"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Coloc: using SuSiE to relax the single causal variant assumption}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Multiple causal variants, using SuSiE to separate the signals

We load some simulated data.

```{r   }
library(coloc)
data(coloc_test_data)
attach(coloc_test_data) ## datasets D1, D2, D3 and D4
```

Datasets 3 and 4 are constructed to deliberately break the single causal variant assumption in `coloc.abf()`. 

```{r sens0, fig.width=8,fig.height=6    }
par(mfrow=c(2,1))
plot_dataset(D3, main="Dataset D3")
plot_dataset(D4, main="Dataset D4")
```

First, let us do a standard coloc (single causal variant) analysis to
serve as a baseline comparison.  The analysis concludes there is
colocalisation, because it "sees" the SNPs on the left which are
strongly associated with both traits.  But it misses the SNPs on the
right of the top left plot which are associated with only one trait.

```{r sens1, fig.width=8,fig.height=6    }
my.res <- coloc.abf(dataset1=D3, dataset2=D4)
class(my.res)
## print.coloc_abf
my.res
sensitivity(my.res,"H4 > 0.9")
```

Even though the sensitivity analysis itself looks good, the Manhattan plots suggest we are violating the assumption of a single causal variant per trait.

coloc has adopted the [SuSiE](https://stephenslab.github.io/susie-paper/index.html) framework for fine mapping in the presence of multiple causal variants.  This framework requires the LD matrix is known, so first check our datasets hold an LD matrix of the right format.  =check_dataset= should return NULL if there are no problems, or print informative error messages if there are.


```{r}
check_dataset(D3,req="LD")
check_dataset(D4,req="LD")
```

SuSiE can take a while to run on larger datasets, so it is best to run once per dataset with the =runsusie= function, store the results and feed those into subsequent analyses.  =runsusie= is just a wrapper around the =susie_rss= function in the [susieR package](https://stephenslab.github.io/susieR/) that automates running until convergence and saves a little extra information about snp names to make subsequent coloc processing simpler.  Here, it does indeed find two signals for dataset  D3 (there are two rows in the credible sets summary) and one for dataset D4. 

```{r}
  S3=runsusie(D3)
  summary(S3)
  S4=runsusie(D4)
  summary(S4)
```

With these susie output objects stored, we can colocalise every pair of signals.  This analysis says the first pair, tagged by s25.1 and s25 for datasets D3 and D4, do not colocalise (posterior for H3 is close to 1), whilst the second pair, tagged by the same SNP, s25, for both datasets, do (posterior for H4 is close to 1).
```{r}
if(requireNamespace("susieR",quietly=TRUE)) {
  susie.res=coloc.susie(S3,S4)
  print(susie.res$summary)
}
```

Note that because we are doing multiple colocalisations, sensitivity() needs to know which to consider, and we need to give it the datasets used if we want to see the Manhattan plots.
```{r sens, fig.width=8,fig.height=6 }
if(requireNamespace("susieR",quietly=TRUE)) {
  sensitivity(susie.res,"H4 > 0.9",row=1,dataset1=D3,dataset2=D4)
  sensitivity(susie.res,"H4 > 0.9",row=2,dataset1=D3,dataset2=D4)
}
```

# Important notes on running SuSiE
`runsusie()` is a wrapper around `susieR::susie_rss()`.  It adds some colnames to the returned objects (so that snps can be identified easily) and repeats the calls to `susie_rss()` until convergence is achieved.

`susie_rss()` has many other options and you should look at them if `runsusie()` is not giving the output you expect.  They can be passed directly through `runsusie()`.

``` r
if(requireNamespace("susieR",quietly=TRUE)) {
  ?susieR::susie_rss
}
```

One option I have varied to help SuSiE detect weaker signals, is the coverage parameter.  By default `susie_rss` looks for signals for which it can form a 95% credible set.  By reducing the coverage, we can find weaker signals.  For example we find nothing in this weaker signal dataset

``` r
if(requireNamespace("susieR",quietly=TRUE)) {
  ## make a dataset with a weaker signal
  D5=D3
  D5$varbeta=D5$varbeta * 4
  D5$N=D5$N / 2
  par(mfrow=c(1,2))
  plot_dataset(D3, main="original D3")
  plot_dataset(D5, main="weaker signal D5")
  summary(runsusie(D5)) # default coverage 0.95
}
```

But by reducing the coverage we can find one signal
``` r
if(requireNamespace("susieR",quietly=TRUE)) {
  summary(runsusie(D5,coverage=0.1))  # lower coverage
}
```

<!-- And reducing it further finds the other as well -->
<!-- ``` r -->
<!-- if(requireNamespace("susieR",quietly=TRUE)) { -->
<!--   summary(runsusie(D5,coverage=0.001)) # even lower -->
<!-- } -->
<!-- ``` -->

These values are just for illustration, I probably wouldn't believe a signal in a real dataset with a \( p > 10^{-5} \).  They let you find weaker signals, but the signals they find may not be real! coloc results should be treated with caution.  Nonetheless, this may be worthwhile if you have a dataset with a weak signal which you know has replicated in independent data.

``` r
if(requireNamespace("susieR",quietly=TRUE)) {
  S5=runsusie(D5,coverage=0.1)  # lower coverage
  summary(S5)
  res=coloc.susie(S5,S4)
  print(res$summary)
}
```