Computing Significance of Overlap between Two Sets using Hypergeometric Test

There are many cases where we have two sets of things, such as transcripts, genes or proteins (e.g. measured under two different conditions), and we want to compute the significance of the overlap between them. The hypergeometric test is a simple and widely used option for such cases.

I’ll use the phyper function in R but you can use the same idea in SciPy (Python).

Let’s say that, out of a total of 200 genes (A), you have:

  • 10 genes common or overlapping (set B ∩ set C)
  • 25 genes in set B
  • 50 genes in set C
  • 135 genes not in set B or set C
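If you start from actual gene lists rather than counts, the four numbers above can be derived with plain set operations. Here is a minimal Python sketch; the integer IDs are hypothetical stand-ins for real gene names, chosen only so that the counts match the example.

```python
# Hypothetical gene IDs: integers stand in for real gene names.
universe = set(range(200))    # all 200 genes (A)
set_b = set(range(0, 25))     # 25 genes in set B
set_c = set(range(15, 65))    # 50 genes in set C; IDs 15..24 are shared with B

overlap = set_b & set_c                 # genes in both B and C
neither = universe - (set_b | set_c)    # genes in neither B nor C

print(len(overlap), len(set_b), len(set_c), len(neither))  # 10 25 50 135
```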

Hypergeometric test

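To compute the significance of the overlap, use:

phyper(10 - 1, 50, 200 - 50, 25, lower.tail = FALSE)

The arguments are, in order: the overlap, the number of genes in set C, the number of genes not in set C, and the number of genes in set B. The first argument is 10 - 1 rather than 10 because, with lower.tail = FALSE, phyper returns P(X > q), whereas the p-value is the probability of seeing an overlap of 10 or more, P(X ≥ 10). This gives a p-value of about 0.058. Passing 10 directly returns P(X > 10) instead:

phyper(10, 50, 200 - 50, 25, lower.tail = FALSE)
[1] 0.0214406

So, if your threshold for the p-value is 0.05 (or 5%), the overlap falls just short of significance. Leaving out the - 1 would wrongly make it look significant.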
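For Python users, here is a minimal sketch of the same upper-tail calculation using only the standard library (math.comb) instead of SciPy. The function name overlap_pvalue and its argument names are my own, not part of R or SciPy. It sums the hypergeometric probabilities for every overlap of at least the observed size, so the observed value itself is included.

```python
from math import comb

def overlap_pvalue(overlap, size_b, size_c, total):
    """Return P(X >= overlap) for a hypergeometric X: size_b genes drawn
    at random from `total` genes, of which size_c belong to set C."""
    max_overlap = min(size_b, size_c)
    # Number of ways to draw size_b genes with exactly k of them in set C,
    # summed over k = overlap .. max_overlap.
    favourable = sum(
        comb(size_c, k) * comb(total - size_c, size_b - k)
        for k in range(overlap, max_overlap + 1)
    )
    return favourable / comb(total, size_b)

p_inclusive = overlap_pvalue(10, 25, 50, 200)  # P(X >= 10), roughly 0.058
p_strict = overlap_pvalue(11, 25, 50, 200)     # P(X > 10), about 0.0214

print(round(p_inclusive, 4), round(p_strict, 7))
```

Because overlap_pvalue includes the observed overlap, the strict tail that phyper(10, 50, 200 - 50, 25, lower.tail = FALSE) reports corresponds to overlap_pvalue(11, ...) here. If SciPy is available, scipy.stats.hypergeom.sf(10 - 1, 200, 50, 25) should give the same inclusive value.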


Please start a discussion down below or send me an email!