--- title: "Inferring kin relationships with kinference" author: S M Baylis and M V Bravington output: rmarkdown::html_vignette: toc: true vignette: > %\VignetteIndexEntry{Inferring kin relationships with kinference} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- Package `kinference` uses SNP genotypes to find pairs of closely-related kin within a set of biological samples, presumably for use in Close-Kin Mark-Recapture. In this vignette, we: 1. demonstrate the core functions for kin-finding with a perfect dataset (section 1); 1. clean up a 'typical' dataset to get it to work for kinfinding (section 2); 3. and show _some_ warning signs of Things Going Wrong (section 3). Kinference can be pretty complicated, though it isn't always. We can't show everything here; this is actually a subset of a much larger evolving document, and we ourselves are still learning about the process as we throw new datasets with new problems at the software^[And sometimes the right solution is: get a better dataset, or move to a different problem. Sad but true.]! There may be courses; check with the package maintainer. If you have questions, first have a look at the [Github Discussions page](https://github.com/orgs/closekin/discussions) for `closekin`, the parent Organisation of package `kinference`. Also, do look at `?kinference` inside R, plus of course the detailed documention of individual functions. This document makes reference in a few place to "the MS", a long-awaited paper that gives details of all the algorithms used in this package. Unfortunately, the MS doesn't quite exist at time of writing, but it is coming... # Section 1: kin-finding Here, we step through kin-finding with a well-behaved, well-cleaned dataset in `snpgeno` format. We will use the `dropbears` dataset included with package `kinference`. Some standard checks should _always_ be done, but the `dropbears` dataset is already clean and won't require any changes, and the allele frequencies have already been estimated. To see a dataset in need of cleaning (like yours!), see **Section 2**. A dataset for kinference should have all of its samples drawn from a single, genetically-homogenous population, without any duplicate samples, samples from another population (or species!), or cross-contaminated or degraded samples. It also needs all loci to be "well-behaved", ie in Hardy-Weinberg proportions, without substantial allelic dropout^[However, heritable/repeatable null alleles are OK--- see 'gbasics' package. Nulls are inevitable for some species under certain genotyping methods.] or genotyping errors, not duplicated, not under (strong) selection, etc. ## A quick look at the dataset ``` r library( mvbutils) ## not sure if it needs to be explicitly loaded library( gbasics) ## so data objects behave nicely library( atease) ## convenient access to attributes; x@thing library( kinference) ``` ``` r kincol <- kinPalette() ## kin-specific colours, used later ## 'dropbears' is a dataset that comes with 'kinference' ## Very quick look at snpgeno data: covariates and some genotypes of a few samples dim( dropbears) ``` ``` ## [1] 480 2000 ``` ``` r dropbears[1:3, 1:5] ``` ``` ## l l l l l ## o o o o o ## c c c c c ## u u u u u ## s s s s s ## _ _ _ _ _ ## Our_sample sex weight_kg popn dropheight_m 1 2 3 4 5 ## 1 sample_1 M 112.3 Tas 3 AB AAO AAO AAO AB ## 2 sample_2 F 125.7 Tas 4 AAO AB AB AAO AB ## 3 sample_3 Imm 96.6 Eastern 2 AAO AAO AAO AAO AB ``` ``` r ## Boring but necessary data-prep: dropbears <- kin_power(dropbears, k = 0.5) plod_HU <- find_HSPs(dropbears, keep_thresh = 0) histoPLOD( plod_HU, log=TRUE) ``` ![plot of chunk dropbears-overview1](./figures-kinference/dropbears-overview1-1.png) This plot shows observed log-densities of one kin-finding statistic (there are several) across all pairs of animals in the `dropbears` dataset, as the black stair-step lines. It also shows some vertical lines at the the predicted means for various kinships based on allele frequences, and the predicted distribution of unrelated pairs as a smooth purple curve. This is an excellent-looking kin space with all the bumps in all the right places - we'll definitely be able to find kin pairs here. ## Finding and classifying kin pairs From here, there are many potential workflows for finding kin pairs. The workflows can depend on demographic information (e.g., one might look for sibling pairs only among samples of young animals caught at a similar time, to reduce confusion from other kinships that are common in more general comparisons; or, first look for POPs only among pairs with a sizeable birth-year gap). Here, we show one potential workflow for kin-finding that does not use covariate/demographic information. We're going to end up with four categories of pairs: * POPs (Parent-Offspring Pairs) * FSPs (Full-Siblings Pairs): collectively, POPs and FSPs constitute "first-order" kin, or 1KPs * Definite 2KPs, (2nd-order Kin-Pairs, including Half-Siblings HSPs^[Because HSPs are usually much more important to CKMR than other 2KPs (and usually commoner), we sometimes slip into saying "HSPs" when "2KP" would be more correct, eg in the name of the function 'find_HSPs'. Sorry about that...], Grandparent-Grandchild GGPs, and full-thiatic^["Thiatic" is a succinct bidirectional gender-neutral term that comprises aunt-niece, uncle-niece, aunt-nephew, and uncle-nephew kinships. "Half-thiatic" pair is when either the mother of one is the grandmother of the other, or father/grandfather, but not both.] FTPs), and * "other", which includes 3KPs, 4KPs, etc, all "Unrelated" Pairs (UPs), plus a proportion of 2KPs that didn't look definite enough ("false negatives"). The technical complication is that the simple and robust kin-finding statistics which we prefer, are all just for deciding just 2 kinships, eg POP vs FSP, or 2KP VS UP. But here we have at least 4 possible kinships overall. Thus the trick is to divide-and-conquer, splitting the collection of kin-pairs first into coarse groups and into finer ones. The workflow goes: * find provisional POPs, ie all true POPs but potentially some FSPs too; * within the provisional POPs, split definite POPs from FSPs; * find provisional non-POP kin, including FSPs, 2KPs, 3KPs, and maybe some 4KPs, but hopefully no UPs; * within those, split definite FSPs; * identify a subset of (almost) definite 2KPs, and estimate the false-negative rate for that set. Let's begin. ## Find 'strong' kin pairs and split out POPs We'll start by doing pairwise comparisons of all samples, calculating a statistic designed to distinguish POPs from UPs or other types of kin. [Actually, two different POP statistics are available in 'kinference' now: the newer one, which might become the recommended option one after we've tested it more, is shown later, but we'll start with the older one, which has been used in all our real-life CKMR applications to date.] This (older) statistic is called "wpsex" (weighted pseudo-exclusions: see MS) and takes values between 0 and 1: close to 0 for POPs, and higher for UPs or other kin. First, let's get a global overview of the situation: ``` r # strongKin <- find_POPs(dropbears, limit_pairs=10000, keep_thresh=1) strongKin <- find_POPs(dropbears, keep_thresh=0.1) # upper limit guess of "interesting" wpsex values in other studies ``` ``` ## Returning just the 240 pairs with most POP-like wpsex; increase 'limit_pairs' to get more ``` ``` r # ... can disregard that limit_pairs warning *for now*, since this is just an overview ## What summary stats are there? names( attributes( strongKin)) # or atts( strongKin) thx2 'atease' ``` ``` ## [1] "names" "row.names" "class" "bins" ## [5] "n_wpsex_in_bin" "eta" "keep_thresh" "n_loci" ## [9] "mean_UP" "var_UP" "call" ``` ``` r ## Overview plot for all pairs with( attributes( strongKin), plot( bins, n_wpsex_in_bin, type='s' )) abline( v=strongKin@mean_UP, col=kincol['UP']) ``` ![plot of chunk dropbears_findPOPs1](./figures-kinference/dropbears_findPOPs1-1.png) ``` r ## Not very informative, because the huge number of UPs masks the interesting stuff. ## Transform the y-scale... ## 0.25 power transform is a bit like log but copes with 0 (thanks: Bill Venables) with( attributes( strongKin), plot( bins, n_wpsex_in_bin^0.25, type='s')) abline( v=strongKin@mean_UP, col=kincol['UP']) # theoretical mean ``` ![plot of chunk dropbears_findPOPs1](./figures-kinference/dropbears_findPOPs1-2.png) ``` r ## Zoom in on interesting region. First, re-run to make sure we get _all_ ## the kin-pairs in that region, with no limit_pairs warning strongKin <- find_POPs(dropbears, limit_pairs= 10000, # clearly less of them than that keep_thresh=0.07) # upper limit guess of "interesting" wpsex values in other studies ## ... and plot it again with( attributes( strongKin), plot( bins, n_wpsex_in_bin^0.25, type='s')) abline( v=strongKin@mean_UP, col=kincol['UP']) # theoretical mean ``` ![plot of chunk dropbears_findPOPs1](./figures-kinference/dropbears_findPOPs1-3.png) And we can clearly see a big bump of UPs on the right, a spike of about 20 pairs near zero, and in-between a few more bumps which will comprise FSPs, HSPs, and so on. The UP bump isn't _quite_ centred where purple theory says it should be^[We're not sure exactly why not, but it's possibly related to using estimated allele frequencies rather than true. For this dataset, null alleles happen to be non-existent (a consequence of the genotyping method :) but maximum-likelihood estimates of null frequency can of course slightly exceed 0 by chance, and here they do so for about 25% of the loci.], but it's close enough; let's keep going. Based on those plots, we now pick some reasonable threshold of "interesting", and then select all pairs below that threshold: ``` r ## Select threshold for "worth investigating further". By eye. UPcut <- 0.055 ## Focus on those pairs: strongest <- strongKin %where% (wpsex < UPcut) # same as: strongest <- strongKin[ strongKin$wpsex < UPcut,] dim( strongest) ``` ``` ## [1] 96 4 ``` ``` r ## Show just top 5 then bottom 5 of these, for space: head( strongest, 5) ``` ``` ## wpsex i j nABOO ## 1 0 446 445 0 ## 2 0 456 455 0 ## 3 0 442 441 0 ## 4 0 464 463 0 ## 5 0 480 479 0 ``` ``` r tail( strongest, 5) ``` ``` ## wpsex i j nABOO ## 92 0.05321628 372 371 0 ## 93 0.05418490 296 295 0 ## 94 0.05478877 224 223 0 ## 95 0.05495167 210 209 0 ## 96 0.05498302 316 315 0 ``` Now we'll calculate a statistic designed to separate POPs from FSPs^[There might also be some 2KPs within `strongest`, but in terms of this statistic they will look more like FSPs than like POPs. We only want "POPs vs rest" for now, so we can deal with "rest" later.]. Note that we do have to specify a genotyping error rate (`gerr`) here--- but the calculation is quite robust to mis-specification, so don't worry. In _our_ datasets, it's been below 0.01 (1%), but _yours_ might be different, of course. If your error rate is much higher than that, you probably won't be able to use your dataset for kin-finding anyway. ``` r splitPOPs <- split_FSPs_from_POPs( dropbears, strongest, gerr= 0.01) hist( splitPOPs$PLOD_FP, breaks= 50, main= "", xlab= "PLOD_FP") ## Colours: kincol was set by call to kinPalette() abline(v = splitPOPs@E_FSP, col = kincol['FSP'], lwd = 2) abline(v = splitPOPs@E_POP, col = kincol['POP'], lwd = 2) legend("topright", legend = c("POP expectation", "FSP expectation"), lty = c(1,1), lwd = c(2,2), col = kincol[ c('POP', 'FSP')], bg = "white") ``` ![plot of chunk dropbears_split_FSPs_POPs](./figures-kinference/dropbears_split_FSPs_POPs-1.png) ``` r ## Clump in [-50,-100] are POPs; ## rest a mix of FSPs, HSPs, and possibly weaker kin. POPs <- splitPOPs[splitPOPs$PLOD_FP < 0,] ## our first set of definite kin pairs POPs$ij <- upairid(POPs$i, POPs$j) ## a pair ID ``` We now have a set of definite POPs; next we need to find FSPs and 2KPs. ## Find definite FSPs We can't go straight from the POP results in `strongest` because the FSPs may not all be in `strongest` (their "wpsex" score may not have fallen below that `UPcut` threshold in the first place). So we start by provisionally identifying "all interesting kin" (1KPs and 2KPs, but perhaps with some 3KPs etc mixed in), taking out the POPs that we already have, and then separating the remainder into FSPs vs "the rest". The tool for the job is `find_HSPs`, which we already used for a quick overview of all kinships. This time we should be a bit more careful with the arguments because we want to select specific pairs, so here's some detail. `find_HSPs`--- which should really be called `find_2KPs`--- keeps binned counts of PLODs for all pairwise comparisons, but only keeps individual PLODs for those pairs whose PLOD is above `keep_thresh`, which we will need to set^[The logic here is that the overwhelming majority of pairs will be unrelated, and keeping scores for each of them can easily overflow RAM with useless "information". For the UPs, we only need to check whether the PLOD distribution is consistent with theory, so binned counts are fine.] Before calling `find_HSPs`, we can find the expected PLOD for an UP, which will be negative. It so happens that the expected PLOD for an 2KP is roughly the absolute value of the UP mean, and for a 3KP the expected PLOD is close to 0. If kinference is going to work, the 2KP bump should be well to the right of the 3KP mean, so choosing `keep_thresh` just below 0 (ie at a small proportion of the UP mean) should keep the interesting stuff without too much rubbish. It's also a good idea to set `limit_pairs` argument, so that the computer doesn't try to retain zillions of actually-unimportant kin in case we set `keep_thresh` too low; 10000 should be fine. With all that in mind, here we go: ``` r ## find HSPs-or-stronger using find_HSPs ## Expected PLOD for UPs (thanks to 'kin_power' earlier): sum( dropbears$locinfo$E_UP) ``` ``` ## [1] -64.47591 ``` ``` r ## So "-10" should be OK for keep.thresh here. Guard against keeping too many: HSPs_1 <- find_HSPs(dropbears, keep_thresh = -10, limit_pairs=10000) ## plot the counts of PLODs in bins, on a log_10 scale so we can actually see the bumps: histoPLOD( HSPs_1, log=TRUE) ``` ![plot of chunk dropbears-findHSPs1](./figures-kinference/dropbears-findHSPs1-1.png) Note the general pattern - there is a large bump of UPs, with mean and variance exactly where theory predicts them to be, and bumps for 2KPs and 1KPs (ie, FSP as well as POP) centred exactly where theory expects^[Recall that POP stats were slightly off. However, 2KP _stats_ seem a bit more robust, even though 2KPs themselves are harder to identify. So if these HSP/2KP stats _aren't_ very well aligned in your data, be nervous.]. There's also a "shoulder" around 0, between the UP and 2KP bumps, which is just where we'd expect 3KPs to be. This is a good dataset. Now we can zoom in on just the individually-kept PLODs for the retained pairs. We want an unlogged plot, so we must set `lb` to be the lower bound on displayed PLODs (to exclude the huge bump). The unlogged version shows actual retained PLODs, not bin-counts, so there's no point in setting `lb` below the previous `keep_thresh`. We also need to set `fullsib_cut` by eye, anywhere in the air-gap that is hopefully between 1KPs and 2KPs (it's not vitally important where; see '?histoPLOD' for more). ``` r # 'bin' sets bin width for histo histoPLOD( HSPs_1, log=FALSE, fullsib_cut = 120, lb = -10, bin = 10) ``` ![plot of chunk dropbears_findHSPs2](./figures-kinference/dropbears_findHSPs2-1.png) The PLOD in `find_HSPs` doesn't itself discriminate well between FSPs and POPs, but there is a clear gap between 1KPs and 2KPs. Any 1KPS that _aren't_ POPs must be FSPs, so we can pick all 1KPs using this PLOD, subset out the POPs that we already found, and be left with the FSPs. ``` r POPsOrFSPs <- HSPs_1[HSPs_1$PLOD > 110,] ## anywhere in the valley POPsOrFSPs$ij <- upairid(POPsOrFSPs$i, POPsOrFSPs$j) ## pair ID to match the one in POPs FSPs <- POPsOrFSPs[!POPsOrFSPs$ij %in% POPs$ij,] ## find our FSPs ``` Now, we have our set of POPs, our set of FSPs, and a set of tentative 2KPs. Again, those "second-order kin" comprise half-sibling, grandparent-grandchild, and full-thiatic pairs: we cannot split between those genetically^[Well, it is theoretically possible to improve somewhat on a blind guess, but you need high-resolution genomic data.], but in many CKMR applications, covariate data such as ages can do the job. In this particular case, the 2KPs will almost all be HSPs, on demographic grounds (details out of scope here). However, the bigger issue is that the set of tentative 2KPs may well be contaminated by more-distant kin, mainly 3KP (half-thiatic and great-grandparent-grandchild pairs) and 4KP (e.g. half-first-cousin pairs). We have already gotten rid of the 1KPs, but for the rest we now need to decide on a PLOD cutoff above which we will treat everything as "definite 2KP". We will then use the distribution of PLODs among those "definite 2KPs" to work out how many false-negative 2KPs are likely to have fallen below the cutoff. Unlike with 1KPs vs 2KPs, there won't be a clean break, so things get a bit more complicated. The idea of choosing the cutoff is to let us retain a (hopefully large) proportion of true 2KPs, subject to a tolerance on the expected number of contaminating 3KPs and 4KPs that by chance fall above the threshold. That tolerance is up to you; you really don't want many false-positives from 3KPs because they can mess up a CK model. The problem isn't the _number_ of false-positives; rather, it's the different dynamics that different kinships have. Most of all, 3KPs tend to have much longer birth-gaps than HSPs, so that mislabelling even a couple of 3KPs as "HSPs" can cause a substantial bias in mortality-rate estimates. (Note that it doesn't cause bias to discard some true 2KPs below the cutoff, because we can compensate by incorporating the estimated false-negative rate directly into the CKMR model.) While it's a good idea to choose a high cutoff (i.e., few false-positives) to minimise the bias from 3KPs, there is a trade-off; higher cutoffs do lead to fewer definite 2KPs going into the model, and thus to higher variances, as well as making it harder to check model diagnostics. Nevertheless, you should certainly go high enough to get your estimated false-positive rate well _below_ the false-negative rate; you defintely _don't_ want "equal probabilities" here. We have often chosen the cutoff at a round number by eye, but a more systematic approach is via the `autopick_threshold` function. This takes the empirical distribution of PLODs in the tentative HSPs, estimates the impact of linkage, and--- on the assumption that each comes from either a HSP, HTP, or half first cousin pair (H1CP)--- fits a constrained mixture decomposition (see function documentation and MS for details) to estimate the underlying distributions and numbers of HSPs/HTPs/H1CPs, before numerically choosing a cutoff to achieve the desired tolerance.^[note that we're using 'HSP', 'HTP', and 'H1CP' in this assumption, rather than '2KP', '3KP', and '4KP'. The distinction arises because, although expected PLODs are the same within each of 2KP, 3KP, and 4KP, linkage means that PLOD _variances_ can differ between kin-types within a kinship order, and 'autopick_threshold' assumes HSPs, HTPs, and H1CPs specifically. This is unlikely to have much impact in practice, but is one more reason to use demographic information to minimise the number of non-HSP 2KPs at this stage of the analysis] When `autopick_threshold` works well, it's great; but don't trust its output if it doesn't look sensible, because it may have too much flexibility in fitting especially when the number of kin-pairs is low. Whether-or-not you use `autopick_threshold`, the process _is_ intrinsically subjective, because _you_ have to decide how many 3KPs you are willing to tolerate. So, picking a cutoff by eye is also OK (because the false-negative rate computed afterwards will eliminate bias), as long as you don't try to "tweak" the cutoff to include/exclude specific pairs. ``` r HSPs_1$ij <- upairid(HSPs_1$i, HSPs_1$j) HSPs_2 <- HSPs_1[!HSPs_1$ij %in% c(POPs$ij, FSPs$ij),] ## plot the empirical PLODs with the expected distribution ## of the PLOD for half-sibling pairs thresh <- autopick_threshold( dropbears, kin = HSPs_2, fitrange_PLOD = c(0, 110), FPtol_pairs = 1, # tolerate 1 expected false-positive use4th = TRUE, selecto = "ML", plot_bins = 5) ``` ![plot of chunk splitHSPsFromThiatics](./figures-kinference/splitHSPsFromThiatics-1.png) ``` r thresh ``` ``` ## [1] 26.94363 ## attr(,"fitrange_PLOD") ## [1] 0 110 ## attr(,"selecto") ## [1] "ML" ## attr(,"use4th") ## [1] TRUE ## attr(,"FPtol_pairs") ## [1] 1 ## attr(,"info") ## V0_M3 Vx_M3 V0_M4 Vx_M4 V_UP V_HSP ## 179.66091016 192.07090954 162.47650979 166.37340040 135.90524760 185.86912547 ## C_hat rho_hat E2 E3 E4 SD2 ## 211.68747051 0.05300852 61.60599802 -1.43495747 -32.95543522 13.63338276 ## SD3 SD4 P2 P3 P4 N3 ## 13.85896495 12.89858133 0.14357251 0.13856494 0.71786255 49.27040441 ## Nall thresh Pr_FNeg ## 355.57627543 26.94363293 0.00550372 ``` Well, that plot looks great! With that cutoff, we can select our "definite 2KPs" and the estimated false-positive and false-negative rates of that set: ``` r HSPs <- HSPs_2[HSPs_2$PLOD > thresh,] fpos_rate <- thresh@FPtol_pairs / nrow(HSPs) ## ~2% of our 'HSPs' are false-positives fneg_rate <- thresh@info["Pr_FNeg"] ## we're missing about 0.5% of HSPs to false-negatives; ## _extremely_ good, expect 10--20% yourself! nrow(HSPs) ## 52 inferred likely half-sibling pairs ``` ``` ## [1] 52 ``` ## Examine covariates of kin-pairs data It's almost always useful to look at the covariates of the kin-pairs, which typically have a different distribution to the covariates of typical samples. Those covariates are essential for subsequent CKMR models. ``` r summary( dropbears$info[ c( HSPs$i, HSPs$j), ]) ## samples in HSPs ``` ``` ## Our_sample sex weight_kg popn ## Length :104 Length :104 Min. : 47.30 Length :104 ## N.unique :104 N.unique : 4 1st Qu.: 84.42 N.unique : 3 ## N.blank : 0 N.blank : 0 Median :101.30 N.blank : 0 ## Min.nchar: 10 Min.nchar: 1 Mean : 98.88 Min.nchar: 3 ## Max.nchar: 10 Max.nchar: 3 3rd Qu.:116.40 Max.nchar: 7 ## Max. :138.50 ## dropheight_m ## Min. : 0.000 ## 1st Qu.: 2.000 ## Median : 3.000 ## Mean : 3.442 ## 3rd Qu.: 4.000 ## Max. :10.000 ``` ``` r summary( dropbears$info) ## all samples ``` ``` ## Our_sample sex weight_kg popn ## Length :480 Length :480 Min. : 29.00 Length :480 ## N.unique :480 N.unique : 4 1st Qu.: 88.72 N.unique : 3 ## N.blank : 0 N.blank : 0 Median :107.85 N.blank : 0 ## Min.nchar: 8 Min.nchar: 1 Mean :102.02 Min.nchar: 3 ## Max.nchar: 10 Max.nchar: 3 3rd Qu.:119.03 Max.nchar: 7 ## Max. :161.10 ## dropheight_m ## Min. : 0.000 ## 1st Qu.: 2.000 ## Median : 3.000 ## Mean : 3.498 ## 3rd Qu.: 5.000 ## Max. :10.000 ``` ## Alternative POP statistic Incidentally, here's the newer POP statistic in action. You need to specify some genotyping error rate, as with `split_FSPs_from_POPs`, but again any reasonable choice should lead to the same final set of POPs. ``` r ## Assume modest error rate sk1 <- find_POPs_lglk( dropbears, gerr= 0.01, limit_pairs= 200, keep_thresh= -5) histoPLOD( sk1, log=TRUE) ``` ![plot of chunk dropbears_POP_lglk1](./figures-kinference/dropbears_POP_lglk1-1.png) ``` r histoPLOD( sk1, log=FALSE, lb=-5) ``` ![plot of chunk dropbears_POP_lglk1](./figures-kinference/dropbears_POP_lglk1-2.png) ``` r ## Assume bad error rate sk5 <- find_POPs_lglk( dropbears, gerr= 0.05, limit_pairs= 200, keep_thresh= -5) histoPLOD( sk5, log=TRUE) ``` ![plot of chunk dropbears_POP_lglk1](./figures-kinference/dropbears_POP_lglk1-3.png) # Section 2: cleaning the bluefin dataset Your data will almost certainly need cleaning before you can do the kin-finding in Section One. To show how (and, in **Section 3**, _why_), we'll use the `bluefin` dataset included in package `kinference`. This is a small, anonymised subset of the tuna close-kin data described in Farley et al., 2024; see `?bluefin` for details and full citation. ## Population allele frequencies Many of the functions in `kinference` require a `snpgeno` with pre-computed population allele frequencies for each locus, stored in an object called `pbonzer`. The `bluefin` dataset has its allele frequencies pre-computed based on a larger version of the dataset, so we won't recompute them, but here's how to do it: ``` r # 'bluefin' is a dataset that comes with 'kinference'; see '?bluefin' # Check genotype encoding: diplos( bluefin) # aha, 6-way genotyping (legacy CSIRO data) ``` ``` ## [1] "AA" "AB" "AO" "BB" "BO" "OO" ``` ``` r ## if you're using 4-way genotypes(most likely) ## bluefin <- est_ALF_ABO_quick( bluefin) ## if your data are stored 6-way: ## ('bluefin' is 6-way, but 'dropbears' is 4-way) ## bluefin <- est_ALF_6way( bluefin) # needs extra info too ``` Some loci may have an estimated A or B allele frequency of zero, which can cause problems. It is best to drop loci with very low estimated A or B allele frequencies - they wouldn't be very informative anyway. ``` r bluefin <- bluefin[ , bluefin$locinfo$pbonzer[ ,"A"] > 1e-6] bluefin <- bluefin[ , bluefin$locinfo$pbonzer[ ,"B"] > 1e-6] ``` ## Checking for badly-behaved loci The `bluefin` dataset, after these tweaks, has 1627 loci. The function `check6and4` compares observed genotype frequencies for each locus to the expected frequencies under Hardy-Weinberg Equilibrium, giving a P-value for each locus. In an abstract perfect dataset, we would expect their P-values to be uniformly distributed over the interval [0,1] - so we would _expect_ about 0.151 of them to have a P-value less than 1e-4. In practice, they have _always_ looked "worse than expected" on real data, but often this has not mattered downstream. The P-values are more to be taken as _relative_ measures of well-behavedness here. ``` r pvals <- check6and4(bluefin, thresh_pchisq_6and4 = c(1e-4, 1e-6), show6 = FALSE) ``` ``` ## genotypes are stored 6-way. Doing both 6-way and 4-way GOF tests... ``` ``` ## calculating both 6-way and 4-way pvals, plotting 4-way... ``` ![plot of chunk bluefin6and4](./figures-kinference/bluefin6and4-1.png)![plot of chunk bluefin6and4](./figures-kinference/bluefin6and4-2.png) Those 'thresh...' values are entirely subjective, and dataset-dependent. They just control which points get coloured, so you can experiment til you find a threshold that excludes the loci you feel are visually unacceptable. (The real "proof of the pudding" comes much later, when you check whether 'find_HSPs' is looking OK. You should expect that this whole QC process will need to be iterative!) We can see that most loci in this dataset lie pretty close to the "observed = expected" line for all four genotypes - but some of them are off the mark. In real datasets, it is wise to have a little wriggle-room: this dataset may still contain duplicate samples, fish from another population, or contaminated or degraded samples - and each of those can contribute to the _loci_ having lower-than-expected P-values. However, some of these loci are just abysmal - bad fish will make all loci look bad, but they generally won't make a few particular loci look _really bad_. The abysmal-looking loci almost certainly aren't the result of a few bad fish. So for now, we will just remove loci with P < 1e-10. After the rest of the cleaning stages, we can (and will!) revisit these P-values without the bad samples. ``` r bluefin_1 <- bluefin[, pvals$pval4 > 1e-10] pvals_1 <- check6and4(bluefin_1, thresh_pchisq_6and4 = c(1e-4, 1e-6), show6 = FALSE) ``` ``` ## genotypes are stored 6-way. Doing both 6-way and 4-way GOF tests... ``` ``` ## calculating both 6-way and 4-way pvals, plotting 4-way... ``` ![plot of chunk bluefin1](./figures-kinference/bluefin1-1.png)![plot of chunk bluefin1](./figures-kinference/bluefin1-2.png) ``` r ## pvals_1 is storage so we can check back on the same scores later ``` ## Finding and removing duplicate samples Now, let's check for duplicate samples ``` r ## look for duplicates, noting any pairs with <50 different genotypes: dups <- find_duplicates(bluefin_1, max_diff_loci = 50) ``` ![plot of chunk bluefinFindDups](./figures-kinference/bluefinFindDups-1.png) ``` r nrow(dups) ## 24 pairs of animals with fewer than 50 different genotypes ``` ``` ## [1] 24 ``` We've got 24 pairs of animals with fewer than 50 different genotypes - but nearly all of our animal pairs have more than 500 different genotypes, and there's a huge air-gap between 40 and 500 with no animal pairs. The 24 pairs with less than 50 different genotypes are almost certainly duplicates. (On your own dataset, you should of course check covariates at this point to confirm.) Anyway, quite often "duplicates" actually turn out to be triplicates etc, but 'kinference' only records pairwise results. When we have duplicates or triplicates or whatever, we only need to keep one from each set. `kinference` contains a convenient function - `drop_dups_pairwise_equiv` - to help us do exactly that: ``` r bluefin_2 <- bluefin_1[ -c( drop_dups_pairwise_equiv( dups[,2:3])),] ``` ## Samples from another population or species Now, we need to find and remove any outlying "animals not like the others". These may be animals from another species or population that have somehow made it into our dataset, or they might be badly degraded or contaminated, or something might have gone wrong with extraction or sequencing for that sample, or so on and so forth. Animals that don't look like they're drawn from the same population as everyone else aren't usable for kinference, at least not with this software. ``` r ilglk <- ilglk_geno(bluefin_2) abline(v = -1525) ## put our cutoff here ``` ![plot of chunk bluefinIlglk](./figures-kinference/bluefinIlglk-1.png) This pattern of `ilglk_geno` scores is pretty typical when most samples are correctly identified, but there are a handful of animals from a distantly-related stock or another species in the dataset: there is a large, approximately Normally-distributed, spike at the right-hand side of the plot, and a long tail strung out to the left, containing the animals from another stock or species. In our case, any animal with an ilglk score less than -1525 or so is not believable as a member of the same stock as the main spike. Let's remove them. ``` r bluefin_3 <- bluefin_2[ilglk > -1525,] ## now, does it look any better? ilglk <- ilglk_geno(bluefin_3) ``` ![plot of chunk bluefinIlglk2](./figures-kinference/bluefinIlglk2-1.png) ``` r ## Yes! Now re-do that important but boring internal calculation bluefin_3 <- kin_power(bluefin_3, k = 0.5) ``` ## Cross-contaminated and degraded samples Now, we need to _try_ to find and remove cross-contaminated or degraded samples. Checking for excess or deficiency of heterozygosity is the way to go here, but there some subtle details to do with null alleles and so on; see MS^[Which doesn't exist _yet_...]. The function for this is `hetzminoo_fancy`, which has a "rich" variant optimized for finding contamination, and a "poor" variant for degradation. With "rich", you're looking for too-big values; with "poor", for too-small values--- although it must be said that the two variants tend to look pretty similar. ``` r hetzpoor <- hetzminoo_fancy(bluefin_3, "poor") ## ... maybe a couple of outlying low samples ``` ![plot of chunk bluefinHetzes](./figures-kinference/bluefinHetzes-1.png) ``` r hetzrich <- hetzminoo_fancy(bluefin_3, "rich") ## ... maybe no high ones (debatable!) ``` ![plot of chunk bluefinHetzes](./figures-kinference/bluefinHetzes-2.png) ``` r bluefin_4 <- bluefin_3[hetzpoor > 0.17,] ## remove the outlying low hetzpoor scores hetzpoor <- hetzminoo_fancy(bluefin_4, "poor") ## Hey, the hetzpoor outliers are gone! ``` ![plot of chunk bluefinHetzes](./figures-kinference/bluefinHetzes-3.png) That was actually pretty good data (after initial cleanups)--- the heterozygosity outliers weren't particularly drastic and there weren't many. We could have gone a bit safer by trimming a few more "outliers", but let's keep going with what we now have. ## How has removing bad fish impacted our apparent locus quality? Remember how I mentioned that having bad _samples_ in the dataset can make _loci_ look bad? Let's take a look at the P-values for individual loci we stored earlier, re-calculate them now that we've removed all the bad fish, and see whether the loci look any better. ``` r pvals_2 <- check6and4(bluefin_4, thresh_pchisq_6and4 = c(1e-4, 1e-6), show6 = FALSE) ``` ``` ## genotypes are stored 6-way. Doing both 6-way and 4-way GOF tests... ``` ``` ## calculating both 6-way and 4-way pvals, plotting 4-way... ``` ![plot of chunk bluefin6and4sMk2](./figures-kinference/bluefin6and4sMk2-1.png)![plot of chunk bluefin6and4sMk2](./figures-kinference/bluefin6and4sMk2-2.png) ``` r lociPvals <- data.frame( pvals = c(pvals_2$pval4, pvals_1$pval4), stage = c( rep("after", length(pvals_2$pval4)), rep("before", length(pvals_1$pval4))) ) lm1 <- lm(pvals ~ stage, data = lociPvals) summary(lm1) ``` ``` ## ## Call: ## lm(formula = pvals ~ stage, data = lociPvals) ## ## Residuals: ## Min 1Q Median 3Q Max ## -0.11292 -0.10324 -0.07165 0.04540 0.81516 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 0.112922 0.004010 28.160 <2e-16 *** ## stagebefore -0.007792 0.005671 -1.374 0.17 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 0.1549 on 2984 degrees of freedom ## Multiple R-squared: 0.0006323, Adjusted R-squared: 0.0002974 ## F-statistic: 1.888 on 1 and 2984 DF, p-value: 0.1695 ``` Removing bad fish has caused a small improvement in the apparent quality of our loci: the average locus in our dataset has seen its P-value increase by 0.008. This is why we didn't want to be _too_ harsh in trimming loci early in our cleaning pathway - we would have removed loci that were actually fine, but looked bad because of the bad fish, and that would have cost us power by reducing the number of good loci we use for kin-finding. If you _really_ want to refine your locus-selection to keep as many as possible, you might try a _very_ lenient cutoff for locus quality, then look for very bad samples, then tighten the locus quality cutoff, then look for bad samples again. Another approach is to be _very_ stringent initially with both loci _and_ samples, then progressively re-check whether some loci and some samples can be re-admitted to the "good club". Wait for the book! ## Kinfer the bluefin dataset The `bluefin` dataset is now cleaned. We won't go through the full kin-finding routine again here - it would be the same as for `dropbears` (see **Section One**). I'll just show you the "histoPLOD", so you can see that kin-finding would now work for `bluefin`, cleaned as it is. ``` r ## As outlined in Section 1, `keep_thresh` sets a PLOD limit to ## prevent storing umpteen-millions of PLODs. Negative 10 i s a often ## about right, but you may need to set it lower (to store more PLODs) ## or higher (to save RAM or make histoPLODs prettier) HSPs <- find_HSPs(bluefin_4, keep_thresh = -10) histoPLOD( HSPs, log=TRUE) ``` ![plot of chunk bluefinKinfer](./figures-kinference/bluefinKinfer-1.png) This histoPLOD shows a spike of unrelated pairs that is well-characterised by its expected distribution, a spike in density where we would expect to see half-siblings, and another spike in density where we would expect FSPs or POPs. From this, it should be clear that we can kinfer this dataset using the same approach as in Section One, albeit with slightly less power. # Section Three: do we _really_ need to do all that cleaning? In this section, we demonstrate a handful of things that can go wrong in a kinference analysis, and suggest ways to deal with them. This set of examples is not exhaustive! If you have a problem that is not dealt with in this section, you can try getting in touch with experienced kinferencers, including the maintainer of package `kinference`, at the [Github Discussions page](https://github.com/orgs/closekin/discussions) for `closekin`, the parent Organisation of package `kinference`. Offscreen, I've prepared versions of the `bluefin` dataset with all-but-one of the cleaning steps. For example, `bluefin_exceptIlglk` has had its loci checked, has had duplicates removed, had its contaminated and degraded samples removed - but I've kept all the samples that failed our check for animals from another population, using `ilglk_geno`. I've also built `bluefin_exceptDuplicates`, `bluefin_exceptHetzPoor`, `bluefin_exceptHetzRich`, and `bluefin_except6and4`, each of which has skipped the obvious cleaning step. Here, we'll walk through the sorts of things you might see if your data aren't properly cleaned. ## Duplicates _In principle_, there are two reasons we don't want duplicate samples in our kin-finding dataset: first, because our null distributions for our various kin-finding statistics are calculated under the assumption that each sample is drawn independently and at random from the sampled population. Duplicate samples, especially if they account for a large proportion of the dataset, will make those null distributions and decisions based on them inaccurate. Second, if you follow the kin-finding pathway shown in this vignette, and have duplicate samples in your dataset, and don't check your results carefully, those dups will be mistaken for POPs. Duplicate pairs are very easy to spot: they look more closely related than any true kin pair, so they appear far off to the right of the histoPLODs (they might not be visible on the unlogged version though, because it sets its xlim internally and will cut off massive outliers). The `bluefin` dataset doesn't have all that many duplicates, so the differences to the null distributions aren't pronounced. ``` r dupHSPs <- find_HSPs(bluefin_exceptDups, keep_thresh = -10) ## histoPLOD( dupHSPs, log=TRUE) # use the logged version to pick the fullsib_cut histoPLOD( dupHSPs, log=FALSE, lb = 20, fullsib_cut = 150) ``` ![plot of chunk keptDups](./figures-kinference/keptDups-1.png) ## Samples from another population or species Samples from another population or species will look very unlike the bulk of the other samples, so pairs with one animal from population A (the intended population) and another from population B (the unintended population) will tend to have very un-kin-like kinship scores. On the other hand, pairs between two samples from population B _may_ have very high kinship scores, especially if some loci have alleles that are rare in population A but more common (or even fixed) in population B. ``` r # par(mfrow = c(2,1)) ilglkHSPs <- find_HSPs(bluefin_exceptIlglks, keep_thresh = -10, minbin = -150) ## there are a bunch of 'less related than expected' fish in the low tail ## of the unrelated pair bump. histoPLOD( ilglkHSPs, log=TRUE) ``` ![plot of chunk keptIlglks](./figures-kinference/keptIlglks-1.png) Note both the bump out to the left of the unrelated pairs. There may also be additional pairs scattered throughout the 'kin-like' regions of the PLOD distribution (although it's not particularly apparent in this dataset), but these pairs in the 'kin-like- regions will not closely match the expected peak locations for any particular kin group except by accident. ## Contaminated or degraded samples A few contaminated or degraded samples in a mostly-clean dataset behave very similarly to samples from another population or species: when paired with unrelated clean samples, they will look dissimilar, but when paired with each other, they can look very much like kin pairs. The `bluefin` dataset does not have many cross-contaminated or degraded samples, so offscreen, I have added in some cross-contaminated samples by taking random pairs of samples and randomly 'contaminating' the first sample with alleles present in the second sample with a probability that varies by sample. I've then removed the original samples and mixed the 'contaminated' samples back in with the `bluefin` dataset, generating a dataset called `ctm_bluefin`. In Section 2, I showed that bad samples make loci look worse. Because `ctm_bluefin` has many more bad samples than `bluefin`, we will have to re-do the cleaning steps, starting from the locus cleaning. I'll show kin-finding outputs on a version with its contaminated samples left in, then show contamination cleaning, and finally kin-finding outputs on the cleaned dataset. ``` r pvals_ctm <- check6and4(ctm_bluefin, thresh_pchisq_6and4 = c(1e-4, 1e-6), show6 = FALSE) ``` ``` ## genotypes are stored 6-way. Doing both 6-way and 4-way GOF tests... ``` ``` ## calculating both 6-way and 4-way pvals, plotting 4-way... ``` ![plot of chunk ctm_bluefin_loci](./figures-kinference/ctm_bluefin_loci-1.png)![plot of chunk ctm_bluefin_loci](./figures-kinference/ctm_bluefin_loci-2.png) ``` r pvals_clean <- check6and4(bluefin, thresh_pchisq_6and4 = c(1e-4, 1e-6), show6 = FALSE) ``` ``` ## genotypes are stored 6-way. Doing both 6-way and 4-way GOF tests... ## calculating both 6-way and 4-way pvals, plotting 4-way... ``` ![plot of chunk ctm_bluefin_loci](./figures-kinference/ctm_bluefin_loci-3.png)![plot of chunk ctm_bluefin_loci](./figures-kinference/ctm_bluefin_loci-4.png) ``` r mean(pvals_ctm$pval4) ``` ``` ## [1] 0.1205191 ``` ``` r mean(pvals_clean$pval4) ``` ``` ## [1] 0.09623471 ``` ``` r ## pvals by *locus*, for the dataset with contaminated *samples*, are substantially worse than in the clean dataset, on average. ``` ``` r ctm_bluefin_1 <- ctm_bluefin[, pvals_ctm$pval4 > 1e-15] ## so we will be _substantially_ more lenient on locus quality ## pvals_ctm_2 <- check6and4(ctm_bluefin, thresh_pchisq_6and4 = c(1e-4, 1e-6), show6 = FALSE) dups <- find_duplicates(ctm_bluefin_1, max_diff_loci = 50) ``` ![plot of chunk ctm_bluefin_loci2](./figures-kinference/ctm_bluefin_loci2-1.png) ``` r ## Note how cross-contaminated samples make duplicate-detection less clear - but we still ## have an OK air-gap at 50 different genos in this case ctm_bluefin_2 <- ctm_bluefin_1[ -c( drop_dups_pairwise_equiv( dups[,2:3])),] ilglk <- ilglk_geno(ctm_bluefin_2) abline(v = -1550) ## put our cutoff here ``` ![plot of chunk ctm_bluefin_loci2](./figures-kinference/ctm_bluefin_loci2-2.png) ``` r ctm_bluefin_3 <- ctm_bluefin_2[ilglk > -1550,] ilglk <- ilglk_geno(ctm_bluefin_3) ``` ![plot of chunk ctm_bluefin_loci2](./figures-kinference/ctm_bluefin_loci2-3.png) ``` r # Not great... the issue is not so much outliers, but general mis-fit ctm_bluefin_4 <- kin_power(ctm_bluefin_3, k = 0.5) ## ... important but boring calculation - just run it ## kinfinding hsps <- find_HSPs(ctm_bluefin_4, keep_thresh = -10) ``` ``` ## Returning just the 541 pairs with the highest PLOD scores; increase 'limit_pairs' to get more ``` ``` r histoPLOD( hsps, log=TRUE) ``` ![plot of chunk ctm_bluefin_loci2](./figures-kinference/ctm_bluefin_loci2-4.png) ``` r ## Note the large additional bump around PLOD = 0, and lack ## of air-gap between HSPs and POPs/FSPs. histoPLOD( hsps, log=FALSE, lb = -5, fullsib_cut = 100) ``` ![plot of chunk ctm_bluefin_loci2](./figures-kinference/ctm_bluefin_loci2-5.png) ``` r ## Also notice a few pairs with suspiciously-high PLODs ## find and remove cross-contaminated samples: ## cross-contaminated samples are clear outliers hetzrich <- hetzminoo_fancy(ctm_bluefin_4, "rich") ``` ![plot of chunk ctm_bluefin_loci2](./figures-kinference/ctm_bluefin_loci2-6.png) ``` r ## cross-contaminated samples are almost as clear on the degradation variant: hetzpoor <- hetzminoo_fancy(ctm_bluefin_4, "poor") ``` ![plot of chunk ctm_bluefin_loci2](./figures-kinference/ctm_bluefin_loci2-7.png) ``` r ctm_bluefin_4 <- ctm_bluefin_4[hetzrich < 0.27,] ## skip removing the outlying hetzrich scores hetzpoor <- hetzminoo_fancy(ctm_bluefin_4, "poor") ## Now we only have low outliers in hetzpoor! ``` ![plot of chunk ctm_bluefin_loci2](./figures-kinference/ctm_bluefin_loci2-8.png) ``` r ctm_bluefin_5 <- ctm_bluefin_4[hetzpoor > 0.18,] ## skip removing the bad ones hsps <- find_HSPs(ctm_bluefin_5, keep_thresh = -10) histoPLOD( hsps, log=TRUE) ## the PLOD bump around zero has gone ``` ![plot of chunk ctm_bluefin_loci2](./figures-kinference/ctm_bluefin_loci2-9.png) ``` r ## ... and there are no more samples with suspiciously high PLODs histoPLOD( hsps, log=FALSE, lb = -5, fullsib_cut = 100) ``` ![plot of chunk ctm_bluefin_loci2](./figures-kinference/ctm_bluefin_loci2-10.png)