In statistics, the "rank" represents the indexed position of the numerical values in ascending (or descending) order measured on a random sample. So, by ranking, we transform the data and get another random variable. Numerically, ranking a random sample of size n corresponds to making a random permutation of the integers from 1 to n. Importantly, since each of the n! possible permutations have the same probability of occurrence, we can calculate the means of these random variables which are invariant with respect to any monotonic transformation of the original data. There are many statistics based on ranks, among which: Wilcoxon-Mann-Whitney test (also known as Wilcoxon Rank Sum Test), Wilcoxon signed-rank test, Kruskal-Wallis test, Friedman test, Spearman’s rank correlation coefficient. The rank-based tests are generally more robust compared to their homologous parametric tests, enable to deal easily with ordinal data, and the asymptotic approximations of the rank statistics are quite accurate even for small sample sizes. Ranking transformation requires unique values in the original data. However, due to numeric rounding or discretization, even for the continuous random variables it is possible the presence of tied values, which turns out to impact the rank distribution. Let be x(j) and x(k) repeated values with frequency fj and fk, respectively in a sample of size N; then, there will be a number of possible rank assignments equal to
and each assignment produces its own rank test statistic. If so, we should try to break the ties to avoid an arbitrary selection of one of the possible rank assignments. The most commonly applied methods include a) replacing the tied values with their middle ranks, b) replacing the tied values with the arithmetic means of their ranks, c) excluding the tied values, or d) introducing noise into the data. Techniques a) and b) are commonly used, but they affect both the Type I error (false positive rate) and the Type II error (false negative rate). Excluding the tied values can help contain the Type I error, but it also reduces the sample size, which can lower the statistical power, and it discards a piece of information. Adding noise breaks the ties while preserving the information content of the data. The amount of noise to insert can be set according to the nature of the data set. Reportedly, breaking ties by jittering does not seem to harm the Type I error or the power, provided that a resampling approach is adopted [1]. The implicit risk of this technique lies in its potential impact on many ranks, but we can tune the noise so that only the tied values are affected.
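As a quick illustration of these tie-handling options, base R's rank() function exposes several policies through its ties.method argument; the short sketch below uses a small hypothetical vector:

# Hypothetical vector with one pair of tied values
v <- c(3.1, 2.7, 2.7, 4.0)
rank(v, ties.method = "average")   # mid-ranks / average ranks: 3.0 1.5 1.5 4.0
rank(v, ties.method = "random")    # ties broken at random, e.g. 3 1 2 4 or 3 2 1 4
rank(v + 1e-6 * rnorm(length(v)))  # jittering: tiny noise breaks the ties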
Let's see an application of this method (the code for reproducing the results is given below). We want to compare two samples of data:

x = [34.2, 27.1, 25, 24.3, 29.7, 30.2, 28, 26.4]
y = [21.8, 25, 20.3, 20.5, 20.7, 22.3, 20.2, 21.6, 19.8, 20.1, 21.5, 21.1, 20.3, 20.6, 25]
The summary statistics and the plot of the distributions of x and y clearly indicate that, overall, the x sample presents greater values than the y sample:
Sample |   Min | 1st Q. | Median |  Mean | 3rd Q. |   Max
x      | 24.30 |  26.05 |  27.55 | 28.11 |  29.82 | 34.20
y      | 19.80 |  20.30 |  20.70 | 21.39 |  21.70 | 25.00
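These summaries can be reproduced with base R (a minimal sketch, assuming x and y hold the two samples listed above):

summary(x)  # Min., 1st Qu., Median, Mean, 3rd Qu., Max. for sample x
summary(y)  # the same statistics for sample y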
The plot also reveals two bumps in the density curve of y that cast some doubt on the normality of its distribution, and the Shapiro-Francia normality test confirms this suspicion (W = 0.78, p-value = 0.003). Therefore, also considering that the two samples are relatively small, we should not use the t-test to compare them. Rather, the non-parametric Wilcoxon Rank Sum test is to be favoured. Indeed, the hypothesis of homogeneous variances is not rejected by the Fligner-Killeen test (chi-squared = 3.7115, df = 1, p-value = 0.05404), which makes a non-parametric test appropriate. Looking closer at the data, however, we notice that there are some tied values (20.3, 20.3, 25.0, 25.0, 25.0). The test therefore needs a continuity correction (that is, a normal approximation to the distribution of the test statistic W), since the exact p-value cannot be computed with ties (the distribution of the test statistic must be thought of as discrete and independent of the distribution of the original data). This yields W = 117, p-value = 0.0002606.
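These figures correspond to a call along the following lines; with tied observations, wilcox.test() falls back to the normal approximation and, by default, applies the continuity correction (the explicit arguments below only make the defaults visible):

# exact = NULL lets R decide: with ties present it uses the normal
# approximation, and correct = TRUE applies the continuity correction
wilcox.test(x, y, exact = NULL, correct = TRUE)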
The p-value, per se, does not convey information about how different the distributions of the two samples are. Instead, the 95% confidence interval of the location parameter, that is, the estimate of the median of the differences between the samples, provides this information. If exact p-values are available, an exact confidence interval can be obtained [2]; otherwise, the confidence interval and the location estimate are based on the normal approximation. Using the original samples, we get a 95% CI from 4.399958 to 9.100084 and a difference in location equal to 6.399. This finding suggests that the median of the differences between the sampled measurements is quite large (relative to the small sample sizes) and that the median of the values from sample x is expected to be at least about 4.4 units greater than the median of sample y.
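The reported difference in location is the Hodges-Lehmann estimate, essentially the median of all pairwise differences x_i - y_j; when ties force the normal approximation, R computes it numerically, so it may differ slightly from the plain median of the pairwise differences:

wilcox.test(x, y, conf.int = TRUE, conf.level = 0.95)  # 95% CI and difference in location
median(outer(x, y, "-"))  # plain median of all pairwise differences, for comparison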
Adding a small amount of Gaussian noise to the tied values (see the noise2tied() function below) replaces them as follows:

tied | no ties
20.3 | 20.28326
20.3 | 20.31196
25   | 24.97646
25   | 25.0038
25   | 25.00668
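This comparison is returned in the CompareData element of the noise2tied() function listed in the R code section below (a usage sketch, with the same noise level used in the example):

out <- noise2tied(x, y, 10^-4)
out$CompareData  # tied values next to their jittered replacements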
If we conduct the Wilcoxon Rank Sum test on the jittered data, we get W = 116, p-value = 0.000048, a difference in location equal to 6.4, and a 95% confidence interval of [4.376465, 9.100]. These results are similar to the ones obtained without jittering; however, theoretically, if the total sample size N is greater than 25 the normal approximation to the distribution of the test statistic is appropriate, whilst if N ≤ 25 we need to calculate the exact test. Therefore, adding noise to the tied values when the total sample size is relatively small is the best choice. Alternatively, we would have to exclude the tied values from the analysis, but this seems to be the "last chance" option, since it involves a loss of information and, ultimately, lowers the statistical power.
R code
# Add Gaussian noise to tied values
noise2tied <- function(x, y, w) {
  # x, y: vectors of data
  # w: a real value in (0, 1] that regulates the amount of noise to be added
  # Plot the densities of the two samples
  plot(density(x), xlim = c(15, 40), ylim = c(0, 0.4),
       main = "distribution of x and y", xlab = "")
  grid(nx = NULL, ny = NULL, lty = 3, col = "lightgray")
  lines(density(y), col = "red")
  legend("topright", bty = "n", legend = c("x", "y"),
         col = c("black", "red"), lty = c(1, 1))
  # Merge the vectors and label the data
  X <- c(x, y)
  n <- length(X)
  idx <- seq(1, n, 1)
  # Group membership
  G1 <- rep(1, times = length(x))
  G2 <- rep(2, times = length(y))
  G <- c(G1, G2)
  # Identify the tied values
  DF <- data.frame(X, G, idx)
  DFo <- DF[order(X), ]
  Xo <- DFo[, 1]
  dX <- diff(Xo)
  u0 <- which(dX == 0)
  # Collect the positions of all tied observations in the sorted vector
  if (length(u0) > 0) {
    u0 <- sort(unique(c(u0, u0 + 1)))
    uTied <- DFo[u0, 3]  # original indices of the tied observations
    print(uTied)
  } else {
    uTied <- integer(0)
    print("there are no tied values")
  }
  nTies <- length(u0)
  # Tied values
  Tied <- X[uTied]
  # Temporary vector where the tied values will be replaced
  xn <- X
  # Break the ties
  nboot <- 10^4  # number of repetitions
  k <- 1         # initialize the counter
  Ranks <- matrix(0, nrow = nboot, ncol = n, byrow = TRUE)
  Ranks_Group <- matrix(0, nrow = nboot, ncol = n, byrow = TRUE)
  while (k <= nboot) {
    # add noise to the tied values (the noise accumulates across iterations)
    xn[uTied] <- xn[uTied] + w * rnorm(length(uTied))
    # store the ordering permutation of the jittered data
    Ranks[k, ] <- sort(xn, index.return = TRUE)$ix
    # assign the ordered data to their group
    Ranks_Group[k, ] <- G[sort(xn, index.return = TRUE)$ix]
    k <- k + 1
  }
  # Median across the iterations of the ordered (jittered) data
  Xrank <- sapply(1:n, function(i) median(xn[Ranks[, i]]))
  # Compare the result with the original (sorted) data
  matched <- data.frame(Xo, Xrank)
  colnames(matched) <- c("tied", "no.ties")
  # Mode function
  Mode <- function(x) {
    ux <- unique(x)
    ux[which.max(tabulate(match(x, ux)))]
  }
  # Assign each position to the most frequent group observed across iterations
  Grank <- sapply(1:n, function(i) Mode(Ranks_Group[, i]))
  return(list(count_Tied = nTies,
              whosTied = uTied,
              untied_all = xn,
              untied_group_1 = xn[1:length(x)],
              untied_group_2 = xn[seq(length(x) + 1, n, 1)],
              Ranks = sort(Xrank, index.return = TRUE)$ix,
              X_Ranked = Xrank,
              Ranks_Group = Grank,
              CompareData = matched))
}
# Example
x <- c(34.2, 27.1, 25, 24.3, 29.7, 30.2, 28, 26.4)
y <- c(21.8, 25, 20.3, 20.5, 20.7, 22.3, 20.2, 21.6, 19.8, 20.1, 21.5, 21.1, 20.3, 20.6, 25)
W <- 10^-4
X <- c(x, y)
# Group membership
G1 <- rep(1, times = length(x))
G2 <- rep(2, times = length(y))
G <- c(G1, G2)
# Shapiro-Francia test of normality
library(nortest)
sf.test(x)
sf.test(y)
# Test homogeneity of variances
fligner.test(X, G)
# Wilcoxon Rank Sum test on the original data
wilcox.test(x, y, conf.int = TRUE)
# Break the ties and repeat the test on the jittered data
out <- noise2tied(x, y, W)
wilcox.test(out$untied_group_1, out$untied_group_2, conf.int = TRUE)
c(median(x), median(y))
c(median(out$untied_group_1), median(out$untied_group_2))
# Sum of ranks per group
tapply(out$Ranks, out$Ranks_Group, sum)
References
1. McGee M. (2018). Case for omitting tied observations in the two-sample t-test and the Wilcoxon-Mann-Whitney test. PLoS ONE, 13(7): e0200837. https://doi.org/10.1371/journal.pone.0200837
2. Bauer D. F. (1972). Constructing confidence sets using rank statistics. Journal of the American Statistical Association, 67, 687-690. https://doi.org/10.1080/01621459.1972.10481279