Package kinference
uses SNP genotypes to find pairs of closely-related kin within a set of
biological samples, presumably for use in Close-Kin Mark-Recapture. In
this vignette, we:
Kinference can be pretty complicated, though it isn’t always. We
can’t show everything here; this is actually a subset of a much larger
evolving document, and we ourselves are still learning about the process
as we throw new datasets with new problems at the software1! There may be courses;
check with the package maintainer. If you have questions, first have a
look at the Github Discussions
page for closekin, the parent Organisation of package
kinference. Also, do look at ?kinference
inside R, plus of course the detailed documention of individual
functions. This document makes reference in a few place to “the MS”, a
long-awaited paper that gives details of all the algorithms used in this
package. Unfortunately, the MS doesn’t quite exist at time of writing,
but it is coming…
Here, we step through kin-finding with a well-behaved, well-cleaned
dataset in snpgeno format. We will use the
dropbears dataset included with package
kinference. Some standard checks should always be
done, but the dropbears dataset is already clean and won’t
require any changes, and the allele frequencies have already been
estimated. To see a dataset in need of cleaning (like yours!), see
Section 2.
A dataset for kinference should have all of its samples drawn from a single, genetically-homogenous population, without any duplicate samples, samples from another population (or species!), or cross-contaminated or degraded samples. It also needs all loci to be “well-behaved”, ie in Hardy-Weinberg proportions, without substantial allelic dropout2 or genotyping errors, not duplicated, not under (strong) selection, etc.
library( mvbutils) ## not sure if it needs to be explicitly loaded
library( gbasics) ## so data objects behave nicely
library( atease) ## convenient access to attributes; x@thing
library( kinference)kincol <- kinPalette() ## kin-specific colours, used later
## 'dropbears' is a dataset that comes with 'kinference'
## Very quick look at snpgeno data: covariates and some genotypes of a few samples
dim( dropbears)## [1] 480 2000
## l l l l l
## o o o o o
## c c c c c
## u u u u u
## s s s s s
## _ _ _ _ _
## Our_sample sex weight_kg popn dropheight_m 1 2 3 4 5
## 1 sample_1 M 112.3 Tas 3 AB AAO AAO AAO AB
## 2 sample_2 F 125.7 Tas 4 AAO AB AB AAO AB
## 3 sample_3 Imm 96.6 Eastern 2 AAO AAO AAO AAO AB
## Boring but necessary data-prep:
dropbears <- kin_power(dropbears, k = 0.5)
plod_HU <- find_HSPs(dropbears, keep_thresh = 0)
histoPLOD( plod_HU, log=TRUE)This plot shows observed log-densities of one kin-finding statistic
(there are several) across all pairs of animals in the
dropbears dataset, as the black stair-step lines. It also
shows some vertical lines at the the predicted means for various
kinships based on allele frequences, and the predicted distribution of
unrelated pairs as a smooth purple curve. This is an excellent-looking
kin space with all the bumps in all the right places - we’ll definitely
be able to find kin pairs here.
From here, there are many potential workflows for finding kin pairs. The workflows can depend on demographic information (e.g., one might look for sibling pairs only among samples of young animals caught at a similar time, to reduce confusion from other kinships that are common in more general comparisons; or, first look for POPs only among pairs with a sizeable birth-year gap). Here, we show one potential workflow for kin-finding that does not use covariate/demographic information. We’re going to end up with four categories of pairs:
The technical complication is that the simple and robust kin-finding statistics which we prefer, are all just for deciding just 2 kinships, eg POP vs FSP, or 2KP VS UP. But here we have at least 4 possible kinships overall. Thus the trick is to divide-and-conquer, splitting the collection of kin-pairs first into coarse groups and into finer ones. The workflow goes:
Let’s begin.
We’ll start by doing pairwise comparisons of all samples, calculating a statistic designed to distinguish POPs from UPs or other types of kin. [Actually, two different POP statistics are available in ‘kinference’ now: the newer one, which might become the recommended option one after we’ve tested it more, is shown later, but we’ll start with the older one, which has been used in all our real-life CKMR applications to date.] This (older) statistic is called “wpsex” (weighted pseudo-exclusions: see MS) and takes values between 0 and 1: close to 0 for POPs, and higher for UPs or other kin. First, let’s get a global overview of the situation:
# strongKin <- find_POPs(dropbears, limit_pairs=10000, keep_thresh=1)
strongKin <- find_POPs(dropbears,
keep_thresh=0.1) # upper limit guess of "interesting" wpsex values in other studies## Returning just the 240 pairs with most POP-like wpsex; increase 'limit_pairs' to get more
# ... can disregard that limit_pairs warning *for now*, since this is just an overview
## What summary stats are there?
names( attributes( strongKin)) # or atts( strongKin) thx2 'atease'## [1] "names" "row.names" "class" "bins"
## [5] "n_wpsex_in_bin" "eta" "keep_thresh" "n_loci"
## [9] "mean_UP" "var_UP" "call"
## Overview plot for all pairs
with( attributes( strongKin),
plot( bins, n_wpsex_in_bin, type='s' ))
abline( v=strongKin@mean_UP, col=kincol['UP'])## Not very informative, because the huge number of UPs masks the interesting stuff.
## Transform the y-scale...
## 0.25 power transform is a bit like log but copes with 0 (thanks: Bill Venables)
with( attributes( strongKin),
plot( bins, n_wpsex_in_bin^0.25,
type='s'))
abline( v=strongKin@mean_UP, col=kincol['UP']) # theoretical mean## Zoom in on interesting region. First, re-run to make sure we get _all_
## the kin-pairs in that region, with no limit_pairs warning
strongKin <- find_POPs(dropbears,
limit_pairs= 10000, # clearly less of them than that
keep_thresh=0.07) # upper limit guess of "interesting" wpsex values in other studies
## ... and plot it again
with( attributes( strongKin),
plot( bins, n_wpsex_in_bin^0.25,
type='s'))
abline( v=strongKin@mean_UP, col=kincol['UP']) # theoretical meanAnd we can clearly see a big bump of UPs on the right, a spike of about 20 pairs near zero, and in-between a few more bumps which will comprise FSPs, HSPs, and so on. The UP bump isn’t quite centred where purple theory says it should be5, but it’s close enough; let’s keep going. Based on those plots, we now pick some reasonable threshold of “interesting”, and then select all pairs below that threshold:
## Select threshold for "worth investigating further". By eye.
UPcut <- 0.055
## Focus on those pairs:
strongest <- strongKin %where% (wpsex < UPcut)
# same as: strongest <- strongKin[ strongKin$wpsex < UPcut,]
dim( strongest)## [1] 96 4
## wpsex i j nABOO
## 1 0 446 445 0
## 2 0 456 455 0
## 3 0 442 441 0
## 4 0 464 463 0
## 5 0 480 479 0
## wpsex i j nABOO
## 92 0.05321628 372 371 0
## 93 0.05418490 296 295 0
## 94 0.05478877 224 223 0
## 95 0.05495167 210 209 0
## 96 0.05498302 316 315 0
Now we’ll calculate a statistic designed to separate POPs from FSPs6. Note that
we do have to specify a genotyping error rate (gerr) here—
but the calculation is quite robust to mis-specification, so don’t
worry. In our datasets, it’s been below 0.01 (1%), but
yours might be different, of course. If your error rate is much
higher than that, you probably won’t be able to use your dataset for
kin-finding anyway.
splitPOPs <- split_FSPs_from_POPs( dropbears, strongest, gerr= 0.01)
hist( splitPOPs$PLOD_FP, breaks= 50, main= "", xlab= "PLOD_FP")
## Colours: kincol was set by call to kinPalette()
abline(v = splitPOPs@E_FSP, col = kincol['FSP'], lwd = 2)
abline(v = splitPOPs@E_POP, col = kincol['POP'], lwd = 2)
legend("topright", legend = c("POP expectation", "FSP expectation"),
lty = c(1,1), lwd = c(2,2),
col = kincol[ c('POP', 'FSP')], bg = "white")## Clump in [-50,-100] are POPs;
## rest a mix of FSPs, HSPs, and possibly weaker kin.
POPs <- splitPOPs[splitPOPs$PLOD_FP < 0,] ## our first set of definite kin pairs
POPs$ij <- upairid(POPs$i, POPs$j) ## a pair IDWe now have a set of definite POPs; next we need to find FSPs and 2KPs.
We can’t go straight from the POP results in strongest
because the FSPs may not all be in strongest (their “wpsex”
score may not have fallen below that UPcut threshold in the
first place). So we start by provisionally identifying “all interesting
kin” (1KPs and 2KPs, but perhaps with some 3KPs etc mixed in), taking
out the POPs that we already have, and then separating the remainder
into FSPs vs “the rest”. The tool for the job is find_HSPs,
which we already used for a quick overview of all kinships. This time we
should be a bit more careful with the arguments because we want to
select specific pairs, so here’s some detail.
find_HSPs— which should really be called
find_2KPs— keeps binned counts of PLODs for all pairwise
comparisons, but only keeps individual PLODs for those pairs whose PLOD
is above keep_thresh, which we will need to set7 Before
calling find_HSPs, we can find the expected PLOD for an UP,
which will be negative. It so happens that the expected PLOD for an 2KP
is roughly the absolute value of the UP mean, and for a 3KP the expected
PLOD is close to 0. If kinference is going to work, the 2KP bump should
be well to the right of the 3KP mean, so choosing
keep_thresh just below 0 (ie at a small proportion of the
UP mean) should keep the interesting stuff without too much rubbish.
It’s also a good idea to set limit_pairs argument, so that
the computer doesn’t try to retain zillions of actually-unimportant kin
in case we set keep_thresh too low; 10000 should be fine.
With all that in mind, here we go:
## find HSPs-or-stronger using find_HSPs
## Expected PLOD for UPs (thanks to 'kin_power' earlier):
sum( dropbears$locinfo$E_UP)## [1] -64.47591
## So "-10" should be OK for keep.thresh here. Guard against keeping too many:
HSPs_1 <- find_HSPs(dropbears, keep_thresh = -10, limit_pairs=10000)
## plot the counts of PLODs in bins, on a log_10 scale so we can actually see the bumps:
histoPLOD( HSPs_1, log=TRUE)Note the general pattern - there is a large bump of UPs, with mean and variance exactly where theory predicts them to be, and bumps for 2KPs and 1KPs (ie, FSP as well as POP) centred exactly where theory expects8. There’s also a “shoulder” around 0, between the UP and 2KP bumps, which is just where we’d expect 3KPs to be. This is a good dataset.
Now we can zoom in on just the individually-kept PLODs for the
retained pairs. We want an unlogged plot, so we must set lb
to be the lower bound on displayed PLODs (to exclude the huge bump). The
unlogged version shows actual retained PLODs, not bin-counts, so there’s
no point in setting lb below the previous
keep_thresh. We also need to set fullsib_cut
by eye, anywhere in the air-gap that is hopefully between 1KPs and 2KPs
(it’s not vitally important where; see ‘?histoPLOD’ for more).
# 'bin' sets bin width for histo
histoPLOD( HSPs_1, log=FALSE, fullsib_cut = 120, lb = -10, bin = 10)The PLOD in find_HSPs doesn’t itself discriminate well
between FSPs and POPs, but there is a clear gap between 1KPs and 2KPs.
Any 1KPS that aren’t POPs must be FSPs, so we can pick all 1KPs
using this PLOD, subset out the POPs that we already found, and be left
with the FSPs.
POPsOrFSPs <- HSPs_1[HSPs_1$PLOD > 110,] ## anywhere in the valley
POPsOrFSPs$ij <- upairid(POPsOrFSPs$i, POPsOrFSPs$j) ## pair ID to match the one in POPs
FSPs <- POPsOrFSPs[!POPsOrFSPs$ij %in% POPs$ij,] ## find our FSPsNow, we have our set of POPs, our set of FSPs, and a set of tentative 2KPs. Again, those “second-order kin” comprise half-sibling, grandparent-grandchild, and full-thiatic pairs: we cannot split between those genetically9, but in many CKMR applications, covariate data such as ages can do the job. In this particular case, the 2KPs will almost all be HSPs, on demographic grounds (details out of scope here). However, the bigger issue is that the set of tentative 2KPs may well be contaminated by more-distant kin, mainly 3KP (half-thiatic and great-grandparent-grandchild pairs) and 4KP (e.g. half-first-cousin pairs). We have already gotten rid of the 1KPs, but for the rest we now need to decide on a PLOD cutoff above which we will treat everything as “definite 2KP”. We will then use the distribution of PLODs among those “definite 2KPs” to work out how many false-negative 2KPs are likely to have fallen below the cutoff. Unlike with 1KPs vs 2KPs, there won’t be a clean break, so things get a bit more complicated.
The idea of choosing the cutoff is to let us retain a (hopefully large) proportion of true 2KPs, subject to a tolerance on the expected number of contaminating 3KPs and 4KPs that by chance fall above the threshold. That tolerance is up to you; you really don’t want many false-positives from 3KPs because they can mess up a CK model. The problem isn’t the number of false-positives; rather, it’s the different dynamics that different kinships have. Most of all, 3KPs tend to have much longer birth-gaps than HSPs, so that mislabelling even a couple of 3KPs as “HSPs” can cause a substantial bias in mortality-rate estimates. (Note that it doesn’t cause bias to discard some true 2KPs below the cutoff, because we can compensate by incorporating the estimated false-negative rate directly into the CKMR model.) While it’s a good idea to choose a high cutoff (i.e., few false-positives) to minimise the bias from 3KPs, there is a trade-off; higher cutoffs do lead to fewer definite 2KPs going into the model, and thus to higher variances, as well as making it harder to check model diagnostics. Nevertheless, you should certainly go high enough to get your estimated false-positive rate well below the false-negative rate; you defintely don’t want “equal probabilities” here.
We have often chosen the cutoff at a round number by eye, but a more
systematic approach is via the autopick_threshold function.
This takes the empirical distribution of PLODs in the tentative HSPs,
estimates the impact of linkage, and— on the assumption that each comes
from either a HSP, HTP, or half first cousin pair (H1CP)— fits a
constrained mixture decomposition (see function documentation and MS for
details) to estimate the underlying distributions and numbers of
HSPs/HTPs/H1CPs, before numerically choosing a cutoff to achieve the
desired tolerance.10 When autopick_threshold
works well, it’s great; but don’t trust its output if it doesn’t look
sensible, because it may have too much flexibility in fitting especially
when the number of kin-pairs is low. Whether-or-not you use
autopick_threshold, the process is intrinsically
subjective, because you have to decide how many 3KPs you are
willing to tolerate. So, picking a cutoff by eye is also OK (because the
false-negative rate computed afterwards will eliminate bias), as long as
you don’t try to “tweak” the cutoff to include/exclude specific
pairs.
HSPs_1$ij <- upairid(HSPs_1$i, HSPs_1$j)
HSPs_2 <- HSPs_1[!HSPs_1$ij %in% c(POPs$ij, FSPs$ij),]
## plot the empirical PLODs with the expected distribution
## of the PLOD for half-sibling pairs
thresh <- autopick_threshold( dropbears, kin = HSPs_2, fitrange_PLOD = c(0, 110),
FPtol_pairs = 1, # tolerate 1 expected false-positive
use4th = TRUE, selecto = "ML", plot_bins = 5)## [1] 26.94363
## attr(,"fitrange_PLOD")
## [1] 0 110
## attr(,"selecto")
## [1] "ML"
## attr(,"use4th")
## [1] TRUE
## attr(,"FPtol_pairs")
## [1] 1
## attr(,"info")
## V0_M3 Vx_M3 V0_M4 Vx_M4 V_UP V_HSP
## 179.66091016 192.07090954 162.47650979 166.37340040 135.90524760 185.86912547
## C_hat rho_hat E2 E3 E4 SD2
## 211.68747051 0.05300852 61.60599802 -1.43495747 -32.95543522 13.63338276
## SD3 SD4 P2 P3 P4 N3
## 13.85896495 12.89858133 0.14357251 0.13856494 0.71786255 49.27040441
## Nall thresh Pr_FNeg
## 355.57627543 26.94363293 0.00550372
Well, that plot looks great! With that cutoff, we can select our “definite 2KPs” and the estimated false-positive and false-negative rates of that set:
HSPs <- HSPs_2[HSPs_2$PLOD > thresh,]
fpos_rate <- thresh@FPtol_pairs / nrow(HSPs) ## ~2% of our 'HSPs' are false-positives
fneg_rate <- thresh@info["Pr_FNeg"]
## we're missing about 0.5% of HSPs to false-negatives;
## _extremely_ good, expect 10--20% yourself!
nrow(HSPs) ## 52 inferred likely half-sibling pairs## [1] 52
It’s almost always useful to look at the covariates of the kin-pairs, which typically have a different distribution to the covariates of typical samples. Those covariates are essential for subsequent CKMR models.
## Our_sample sex weight_kg popn
## Length :104 Length :104 Min. : 47.30 Length :104
## N.unique :104 N.unique : 4 1st Qu.: 84.42 N.unique : 3
## N.blank : 0 N.blank : 0 Median :101.30 N.blank : 0
## Min.nchar: 10 Min.nchar: 1 Mean : 98.88 Min.nchar: 3
## Max.nchar: 10 Max.nchar: 3 3rd Qu.:116.40 Max.nchar: 7
## Max. :138.50
## dropheight_m
## Min. : 0.000
## 1st Qu.: 2.000
## Median : 3.000
## Mean : 3.442
## 3rd Qu.: 4.000
## Max. :10.000
## Our_sample sex weight_kg popn
## Length :480 Length :480 Min. : 29.00 Length :480
## N.unique :480 N.unique : 4 1st Qu.: 88.72 N.unique : 3
## N.blank : 0 N.blank : 0 Median :107.85 N.blank : 0
## Min.nchar: 8 Min.nchar: 1 Mean :102.02 Min.nchar: 3
## Max.nchar: 10 Max.nchar: 3 3rd Qu.:119.03 Max.nchar: 7
## Max. :161.10
## dropheight_m
## Min. : 0.000
## 1st Qu.: 2.000
## Median : 3.000
## Mean : 3.498
## 3rd Qu.: 5.000
## Max. :10.000
Incidentally, here’s the newer POP statistic in action. You need to
specify some genotyping error rate, as with
split_FSPs_from_POPs, but again any reasonable choice
should lead to the same final set of POPs.
## Assume modest error rate
sk1 <- find_POPs_lglk( dropbears, gerr= 0.01, limit_pairs= 200, keep_thresh= -5)
histoPLOD( sk1, log=TRUE)## Assume bad error rate
sk5 <- find_POPs_lglk( dropbears, gerr= 0.05, limit_pairs= 200, keep_thresh= -5)
histoPLOD( sk5, log=TRUE)Your data will almost certainly need cleaning before you can do the
kin-finding in Section One. To show how (and, in Section
3, why), we’ll use the bluefin dataset
included in package kinference. This is a small, anonymised
subset of the tuna close-kin data described in Farley et al., 2024; see
?bluefin for details and full citation.
Many of the functions in kinference require a
snpgeno with pre-computed population allele frequencies for
each locus, stored in an object called pbonzer. The
bluefin dataset has its allele frequencies pre-computed
based on a larger version of the dataset, so we won’t recompute them,
but here’s how to do it:
# 'bluefin' is a dataset that comes with 'kinference'; see '?bluefin'
# Check genotype encoding:
diplos( bluefin) # aha, 6-way genotyping (legacy CSIRO data)## [1] "AA" "AB" "AO" "BB" "BO" "OO"
## if you're using 4-way genotypes(most likely)
## bluefin <- est_ALF_ABO_quick( bluefin)
## if your data are stored 6-way:
## ('bluefin' is 6-way, but 'dropbears' is 4-way)
## bluefin <- est_ALF_6way( bluefin) # needs extra info tooSome loci may have an estimated A or B allele frequency of zero, which can cause problems. It is best to drop loci with very low estimated A or B allele frequencies - they wouldn’t be very informative anyway.
The bluefin dataset, after these tweaks, has 1627 loci.
The function check6and4 compares observed genotype
frequencies for each locus to the expected frequencies under
Hardy-Weinberg Equilibrium, giving a P-value for each locus. In an
abstract perfect dataset, we would expect their P-values to be uniformly
distributed over the interval [0,1] - so we would expect about
0.151 of them to have a P-value less than 1e-4. In practice, they have
always looked “worse than expected” on real data, but often
this has not mattered downstream. The P-values are more to be taken as
relative measures of well-behavedness here.
## genotypes are stored 6-way. Doing both 6-way and 4-way GOF tests...
## calculating both 6-way and 4-way pvals, plotting 4-way...
Those ‘thresh…’ values are entirely subjective, and dataset-dependent. They just control which points get coloured, so you can experiment til you find a threshold that excludes the loci you feel are visually unacceptable. (The real “proof of the pudding” comes much later, when you check whether ‘find_HSPs’ is looking OK. You should expect that this whole QC process will need to be iterative!)
We can see that most loci in this dataset lie pretty close to the “observed = expected” line for all four genotypes - but some of them are off the mark. In real datasets, it is wise to have a little wriggle-room: this dataset may still contain duplicate samples, fish from another population, or contaminated or degraded samples - and each of those can contribute to the loci having lower-than-expected P-values. However, some of these loci are just abysmal - bad fish will make all loci look bad, but they generally won’t make a few particular loci look really bad. The abysmal-looking loci almost certainly aren’t the result of a few bad fish. So for now, we will just remove loci with P < 1e-10. After the rest of the cleaning stages, we can (and will!) revisit these P-values without the bad samples.
bluefin_1 <- bluefin[, pvals$pval4 > 1e-10]
pvals_1 <- check6and4(bluefin_1, thresh_pchisq_6and4 = c(1e-4, 1e-6), show6 = FALSE)## genotypes are stored 6-way. Doing both 6-way and 4-way GOF tests...
## calculating both 6-way and 4-way pvals, plotting 4-way...
Now, let’s check for duplicate samples
## look for duplicates, noting any pairs with <50 different genotypes:
dups <- find_duplicates(bluefin_1, max_diff_loci = 50)## [1] 24
We’ve got 24 pairs of animals with fewer than 50 different genotypes
- but nearly all of our animal pairs have more than 500 different
genotypes, and there’s a huge air-gap between 40 and 500 with no animal
pairs. The 24 pairs with less than 50 different genotypes are almost
certainly duplicates. (On your own dataset, you should of course check
covariates at this point to confirm.) Anyway, quite often “duplicates”
actually turn out to be triplicates etc, but ‘kinference’ only records
pairwise results. When we have duplicates or triplicates or whatever, we
only need to keep one from each set. kinference contains a
convenient function - drop_dups_pairwise_equiv - to help us
do exactly that:
Now, we need to find and remove any outlying “animals not like the others”. These may be animals from another species or population that have somehow made it into our dataset, or they might be badly degraded or contaminated, or something might have gone wrong with extraction or sequencing for that sample, or so on and so forth. Animals that don’t look like they’re drawn from the same population as everyone else aren’t usable for kinference, at least not with this software.
This pattern of ilglk_geno scores is pretty typical when
most samples are correctly identified, but there are a handful of
animals from a distantly-related stock or another species in the
dataset: there is a large, approximately Normally-distributed, spike at
the right-hand side of the plot, and a long tail strung out to the left,
containing the animals from another stock or species. In our case, any
animal with an ilglk score less than -1525 or so is not believable as a
member of the same stock as the main spike. Let’s remove them.
bluefin_3 <- bluefin_2[ilglk > -1525,]
## now, does it look any better?
ilglk <- ilglk_geno(bluefin_3)Now, we need to try to find and remove cross-contaminated or
degraded samples. Checking for excess or deficiency of heterozygosity is
the way to go here, but there some subtle details to do with null
alleles and so on; see MS11. The function for this is
hetzminoo_fancy, which has a “rich” variant optimized for
finding contamination, and a “poor” variant for degradation. With
“rich”, you’re looking for too-big values; with “poor”, for too-small
values— although it must be said that the two variants tend to look
pretty similar.
bluefin_4 <- bluefin_3[hetzpoor > 0.17,] ## remove the outlying low hetzpoor scores
hetzpoor <- hetzminoo_fancy(bluefin_4, "poor") ## Hey, the hetzpoor outliers are gone!That was actually pretty good data (after initial cleanups)— the heterozygosity outliers weren’t particularly drastic and there weren’t many. We could have gone a bit safer by trimming a few more “outliers”, but let’s keep going with what we now have.
Remember how I mentioned that having bad samples in the dataset can make loci look bad? Let’s take a look at the P-values for individual loci we stored earlier, re-calculate them now that we’ve removed all the bad fish, and see whether the loci look any better.
## genotypes are stored 6-way. Doing both 6-way and 4-way GOF tests...
## calculating both 6-way and 4-way pvals, plotting 4-way...
lociPvals <- data.frame(
pvals = c(pvals_2$pval4, pvals_1$pval4),
stage = c(
rep("after", length(pvals_2$pval4)),
rep("before", length(pvals_1$pval4)))
)
lm1 <- lm(pvals ~ stage, data = lociPvals)
summary(lm1)##
## Call:
## lm(formula = pvals ~ stage, data = lociPvals)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.11292 -0.10324 -0.07165 0.04540 0.81516
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.112922 0.004010 28.160 <2e-16 ***
## stagebefore -0.007792 0.005671 -1.374 0.17
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1549 on 2984 degrees of freedom
## Multiple R-squared: 0.0006323, Adjusted R-squared: 0.0002974
## F-statistic: 1.888 on 1 and 2984 DF, p-value: 0.1695
Removing bad fish has caused a small improvement in the apparent quality of our loci: the average locus in our dataset has seen its P-value increase by 0.008. This is why we didn’t want to be too harsh in trimming loci early in our cleaning pathway - we would have removed loci that were actually fine, but looked bad because of the bad fish, and that would have cost us power by reducing the number of good loci we use for kin-finding. If you really want to refine your locus-selection to keep as many as possible, you might try a very lenient cutoff for locus quality, then look for very bad samples, then tighten the locus quality cutoff, then look for bad samples again.
Another approach is to be very stringent initially with both loci and samples, then progressively re-check whether some loci and some samples can be re-admitted to the “good club”. Wait for the book!
The bluefin dataset is now cleaned. We won’t go through
the full kin-finding routine again here - it would be the same as for
dropbears (see Section One). I’ll just
show you the “histoPLOD”, so you can see that kin-finding would now work
for bluefin, cleaned as it is.
## As outlined in Section 1, `keep_thresh` sets a PLOD limit to
## prevent storing umpteen-millions of PLODs. Negative 10 i s a often
## about right, but you may need to set it lower (to store more PLODs)
## or higher (to save RAM or make histoPLODs prettier)
HSPs <- find_HSPs(bluefin_4, keep_thresh = -10)
histoPLOD( HSPs, log=TRUE)This histoPLOD shows a spike of unrelated pairs that is well-characterised by its expected distribution, a spike in density where we would expect to see half-siblings, and another spike in density where we would expect FSPs or POPs. From this, it should be clear that we can kinfer this dataset using the same approach as in Section One, albeit with slightly less power.
In this section, we demonstrate a handful of things that can go wrong
in a kinference analysis, and suggest ways to deal with them. This set
of examples is not exhaustive! If you have a problem that is not dealt
with in this section, you can try getting in touch with experienced
kinferencers, including the maintainer of package
kinference, at the Github Discussions
page for closekin, the parent Organisation of package
kinference.
Offscreen, I’ve prepared versions of the bluefin dataset
with all-but-one of the cleaning steps. For example,
bluefin_exceptIlglk has had its loci checked, has had
duplicates removed, had its contaminated and degraded samples removed -
but I’ve kept all the samples that failed our check for animals from
another population, using ilglk_geno. I’ve also built
bluefin_exceptDuplicates,
bluefin_exceptHetzPoor,
bluefin_exceptHetzRich, and
bluefin_except6and4, each of which has skipped the obvious
cleaning step. Here, we’ll walk through the sorts of things you might
see if your data aren’t properly cleaned.
In principle, there are two reasons we don’t want duplicate samples in our kin-finding dataset: first, because our null distributions for our various kin-finding statistics are calculated under the assumption that each sample is drawn independently and at random from the sampled population. Duplicate samples, especially if they account for a large proportion of the dataset, will make those null distributions and decisions based on them inaccurate. Second, if you follow the kin-finding pathway shown in this vignette, and have duplicate samples in your dataset, and don’t check your results carefully, those dups will be mistaken for POPs.
Duplicate pairs are very easy to spot: they look more closely related
than any true kin pair, so they appear far off to the right of the
histoPLODs (they might not be visible on the unlogged version though,
because it sets its xlim internally and will cut off massive outliers).
The bluefin dataset doesn’t have all that many duplicates,
so the differences to the null distributions aren’t pronounced.
dupHSPs <- find_HSPs(bluefin_exceptDups, keep_thresh = -10)
## histoPLOD( dupHSPs, log=TRUE) # use the logged version to pick the fullsib_cut
histoPLOD( dupHSPs, log=FALSE, lb = 20, fullsib_cut = 150)Samples from another population or species will look very unlike the bulk of the other samples, so pairs with one animal from population A (the intended population) and another from population B (the unintended population) will tend to have very un-kin-like kinship scores. On the other hand, pairs between two samples from population B may have very high kinship scores, especially if some loci have alleles that are rare in population A but more common (or even fixed) in population B.
# par(mfrow = c(2,1))
ilglkHSPs <- find_HSPs(bluefin_exceptIlglks, keep_thresh = -10, minbin = -150)
## there are a bunch of 'less related than expected' fish in the low tail
## of the unrelated pair bump.
histoPLOD( ilglkHSPs, log=TRUE)Note both the bump out to the left of the unrelated pairs. There may also be additional pairs scattered throughout the ‘kin-like’ regions of the PLOD distribution (although it’s not particularly apparent in this dataset), but these pairs in the ’kin-like- regions will not closely match the expected peak locations for any particular kin group except by accident.
A few contaminated or degraded samples in a mostly-clean dataset behave very similarly to samples from another population or species: when paired with unrelated clean samples, they will look dissimilar, but when paired with each other, they can look very much like kin pairs.
The bluefin dataset does not have many
cross-contaminated or degraded samples, so offscreen, I have added in
some cross-contaminated samples by taking random pairs of samples and
randomly ‘contaminating’ the first sample with alleles present in the
second sample with a probability that varies by sample. I’ve then
removed the original samples and mixed the ‘contaminated’ samples back
in with the bluefin dataset, generating a dataset called
ctm_bluefin.
In Section 2, I showed that bad samples make loci look worse. Because
ctm_bluefin has many more bad samples than
bluefin, we will have to re-do the cleaning steps, starting
from the locus cleaning. I’ll show kin-finding outputs on a version with
its contaminated samples left in, then show contamination cleaning, and
finally kin-finding outputs on the cleaned dataset.
## genotypes are stored 6-way. Doing both 6-way and 4-way GOF tests...
## calculating both 6-way and 4-way pvals, plotting 4-way...
## genotypes are stored 6-way. Doing both 6-way and 4-way GOF tests...
## calculating both 6-way and 4-way pvals, plotting 4-way...
## [1] 0.1205191
## [1] 0.09623471
## pvals by *locus*, for the dataset with contaminated *samples*, are substantially worse than in the clean dataset, on average.ctm_bluefin_1 <- ctm_bluefin[, pvals_ctm$pval4 > 1e-15]
## so we will be _substantially_ more lenient on locus quality
## pvals_ctm_2 <- check6and4(ctm_bluefin, thresh_pchisq_6and4 = c(1e-4, 1e-6), show6 = FALSE)
dups <- find_duplicates(ctm_bluefin_1, max_diff_loci = 50)## Note how cross-contaminated samples make duplicate-detection less clear - but we still
## have an OK air-gap at 50 different genos in this case
ctm_bluefin_2 <- ctm_bluefin_1[ -c( drop_dups_pairwise_equiv( dups[,2:3])),]
ilglk <- ilglk_geno(ctm_bluefin_2)
abline(v = -1550) ## put our cutoff here# Not great... the issue is not so much outliers, but general mis-fit
ctm_bluefin_4 <- kin_power(ctm_bluefin_3, k = 0.5)
## ... important but boring calculation - just run it
## kinfinding
hsps <- find_HSPs(ctm_bluefin_4, keep_thresh = -10)## Returning just the 541 pairs with the highest PLOD scores; increase 'limit_pairs' to get more
## Note the large additional bump around PLOD = 0, and lack
## of air-gap between HSPs and POPs/FSPs.
histoPLOD( hsps, log=FALSE, lb = -5, fullsib_cut = 100)## Also notice a few pairs with suspiciously-high PLODs
## find and remove cross-contaminated samples:
## cross-contaminated samples are clear outliers
hetzrich <- hetzminoo_fancy(ctm_bluefin_4, "rich")## cross-contaminated samples are almost as clear on the degradation variant:
hetzpoor <- hetzminoo_fancy(ctm_bluefin_4, "poor")ctm_bluefin_4 <- ctm_bluefin_4[hetzrich < 0.27,]
## skip removing the outlying hetzrich scores
hetzpoor <- hetzminoo_fancy(ctm_bluefin_4, "poor") ## Now we only have low outliers in hetzpoor!ctm_bluefin_5 <- ctm_bluefin_4[hetzpoor > 0.18,] ## skip removing the bad ones
hsps <- find_HSPs(ctm_bluefin_5, keep_thresh = -10)
histoPLOD( hsps, log=TRUE) ## the PLOD bump around zero has gone## ... and there are no more samples with suspiciously high PLODs
histoPLOD( hsps, log=FALSE, lb = -5, fullsib_cut = 100)And sometimes the right solution is: get a better dataset, or move to a different problem. Sad but true.↩︎
However, heritable/repeatable null alleles are OK— see ‘gbasics’ package. Nulls are inevitable for some species under certain genotyping methods.↩︎
Because HSPs are usually much more important to CKMR than other 2KPs (and usually commoner), we sometimes slip into saying “HSPs” when “2KP” would be more correct, eg in the name of the function ‘find_HSPs’. Sorry about that…↩︎
“Thiatic” is a succinct bidirectional gender-neutral term that comprises aunt-niece, uncle-niece, aunt-nephew, and uncle-nephew kinships. “Half-thiatic” pair is when either the mother of one is the grandmother of the other, or father/grandfather, but not both.↩︎
We’re not sure exactly why not, but it’s possibly related to using estimated allele frequencies rather than true. For this dataset, null alleles happen to be non-existent (a consequence of the genotyping method :) but maximum-likelihood estimates of null frequency can of course slightly exceed 0 by chance, and here they do so for about 25% of the loci.↩︎
There might also be some 2KPs within
strongest, but in terms of this statistic they will look
more like FSPs than like POPs. We only want “POPs vs rest” for now, so
we can deal with “rest” later.↩︎
The logic here is that the overwhelming majority of pairs will be unrelated, and keeping scores for each of them can easily overflow RAM with useless “information”. For the UPs, we only need to check whether the PLOD distribution is consistent with theory, so binned counts are fine.↩︎
Recall that POP stats were slightly off. However, 2KP stats seem a bit more robust, even though 2KPs themselves are harder to identify. So if these HSP/2KP stats aren’t very well aligned in your data, be nervous.↩︎
Well, it is theoretically possible to improve somewhat on a blind guess, but you need high-resolution genomic data.↩︎
note that we’re using ‘HSP’, ‘HTP’, and ‘H1CP’ in this assumption, rather than ‘2KP’, ‘3KP’, and ‘4KP’. The distinction arises because, although expected PLODs are the same within each of 2KP, 3KP, and 4KP, linkage means that PLOD variances can differ between kin-types within a kinship order, and ‘autopick_threshold’ assumes HSPs, HTPs, and H1CPs specifically. This is unlikely to have much impact in practice, but is one more reason to use demographic information to minimise the number of non-HSP 2KPs at this stage of the analysis↩︎
Which doesn’t exist yet…↩︎