Sunday, 6 November 2022

Adding Gaussian noise to break tied values

 

In statistics, the "rank" of an observation is its position when the values of a random sample are sorted in ascending (or descending) order. By ranking, therefore, we transform the data and obtain another random variable. Numerically, ranking a random sample of size n corresponds to drawing a random permutation of the integers from 1 to n. Importantly, since each of the n! possible permutations has the same probability of occurrence, we can calculate the means of these rank variables, which are invariant with respect to any monotonic transformation of the original data. Many statistics are based on ranks, among them: the Wilcoxon-Mann-Whitney test (also known as the Wilcoxon Rank Sum test), the Wilcoxon signed-rank test, the Kruskal-Wallis test, the Friedman test, and Spearman’s rank correlation coefficient. Rank-based tests are generally more robust than their parametric counterparts, handle ordinal data easily, and the asymptotic approximations of the rank statistics are quite accurate even for small sample sizes. The ranking transformation requires unique values in the original data. However, due to numeric rounding or discretization, tied values can occur even for continuous random variables, and ties affect the rank distribution. Let x(j) and x(k) be repeated values with frequencies fj and fk, respectively, in a sample of size N; then the number of possible rank assignments equals

fj! × fk!
and each assignment produces its own value of the rank test statistic. In that case, we should try to break the ties to avoid an arbitrary selection of one of the possible rank assignments. The most commonly applied methods include: a) replacing the tied values with their middle ranks; b) replacing the tied values with the arithmetic means of their ranks; c) excluding the tied values; or d) introducing noise into the data. Techniques a) and b) are commonly used, but they affect both the Type I error (false positive rate) and the Type II error (false negative rate). Excluding the tied values can help contain the Type I error, but it reduces the sample size, which lowers the statistical power, and it discards a piece of information. Adding noise breaks the ties while preserving the information content of the data. The amount of noise to insert can be set based on the nature of the data set. Reportedly, breaking ties by jittering does not seem to harm the Type I error or the power, provided that a resampling approach is adopted [1]. The implicit risk of this technique lies in its potential impact on many ranks, but we can tune the noise so that only the tied values are affected. Some of these options are illustrated in the sketch below.
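As a quick illustration of options a), b) and d), base R already exposes tie-handling policies through rank() and uniform jittering through jitter(); the toy vector below is a hypothetical example, not part of the data analysed later.

v <- c(2, 3, 3, 5, 3)  #toy vector with a three-way tie
rank(v, ties.method = "average")  #mid-ranks: 1 3 3 5 3
rank(v, ties.method = "random")   #one of the 3! admissible assignments, chosen at random
rank(jitter(v, amount = 1e-6))    #break the ties with tiny uniform noise, then rank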

Let’s see an application of this method (the code to reproduce the results is reported below). We want to compare two samples of data:

x = [34.2,27.1,25,24.3,29.7,30.2,28,26.4]

y = [21.8,25,20.3,20.5,20.7,22.3,20.2,21.6,19.8,20.1,21.5,21.1,20.3,20.6,25]

The summary statistics and the plot of the distributions of x and y clearly indicate that, overall, the x sample takes greater values than the y sample:

 

Sample   Min     1st Q.   Median   Mean    3rd Q.   Max
x        24.30   26.05    27.55    28.11   29.82    34.20
y        19.80   20.30    20.70    21.39   21.70    25.00
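The table above can be reproduced directly with base R's summary(), as in this minimal sketch:

x <- c(34.2, 27.1, 25, 24.3, 29.7, 30.2, 28, 26.4)
y <- c(21.8, 25, 20.3, 20.5, 20.7, 22.3, 20.2, 21.6, 19.8, 20.1, 21.5, 21.1, 20.3, 20.6, 25)
summary(x)  #Min 24.30, 1st Qu. 26.05, Median 27.55, Mean 28.11, 3rd Qu. 29.82, Max 34.20
summary(y)  #Min 19.80, 1st Qu. 20.30, Median 20.70, Mean 21.39, 3rd Qu. 21.70, Max 25.00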


The plot also reveals two bumps in the density curve of y, which casts some doubt on the normality of y; the Shapiro-Francia normality test confirms this suspicion (W = 0.78, p-value = 0.003). Therefore, also considering that the two samples are relatively small, we should not use the t-test to compare them; the non-parametric Wilcoxon Rank Sum test is preferable. Indeed, the hypothesis of homogeneous variances is not rejected by the Fligner-Killeen test (chi-squared = 3.7115, df = 1, p-value = 0.05404), which makes the use of a non-parametric test appropriate. Looking closer at the data, however, we notice some tied values (20.3, 20.3, 25.0, 25.0, 25.0). With ties, the exact p-value cannot be computed (the null distribution of the test statistic, W, is discrete and independent of the distribution of the original data only when all values are distinct), so the test falls back on a normal approximation of W with a continuity correction, which yields W = 117, p-value = 0.0002606. The p-value, per se, does not convey information about how different the distributions of the two samples are. That information is provided by the 95% confidence interval of the location parameter, i.e. the estimate of the median of the differences between the samples. If exact p-values are available, an exact confidence interval can be obtained [2]; otherwise, the confidence interval and the location estimate are based on normal approximations. Using the original samples, we get a 95% CI from 4.399958 to 9.100084 and a difference in location equal to 6.399. This finding suggests that the median of the differences between the sampled measurements is quite large (relative to the small sample sizes) and that values from sample x are expected to exceed values from sample y by at least 4.399 units. The tie structure of the pooled data, and the number of rank assignments it induces according to the formula above, can be checked as in the sketch below.
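As a small check of the tie structure (the variable names here are illustrative), table() returns the multiplicity of each pooled value, and the product of the factorials of the multiplicities gives the number of admissible rank assignments from the formula above:

X <- c(x, y)  #pooled sample, N = 23
tab <- table(X)
tab[tab > 1]  #tied values: 20.3 (twice) and 25 (three times)
prod(factorial(tab[tab > 1]))  #2! * 3! = 12 possible rank assignments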

When we add noise to the tied values and compute, over many iterations, the median of the sorted data at each position, we obtain unique values and can perform an “exact” Wilcoxon Rank Sum test. Using the samples described above, over 10000 iterations, the medians of the formerly tied values turn out to be very close to the originals:

tied    no ties
20.3    20.28326
20.3    20.31196
25      24.97646
25      25.0038
25      25.00668
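Before the full function reported below, here is a minimal sketch of the core idea, where the seed and the noise amount 10^-4 are illustrative choices: only the tied entries receive Gaussian noise, so all the other ranks remain untouched.

set.seed(1)  #illustrative seed, for reproducibility
X <- c(x, y)
is_tied <- duplicated(X) | duplicated(X, fromLast = TRUE)  #flag every member of a tie
Xj <- X
Xj[is_tied] <- X[is_tied] + 1e-4 * rnorm(sum(is_tied))  #jitter only the tied values
anyDuplicated(Xj)  #0: no ties remain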


If we conduct the Wilcoxon Rank Sum test on the jittered data, we get W = 116, p-value = 0.000048, a difference in location equal to 6.4, and a 95% confidence interval of [4.376465, 9.100]. These results are similar to the ones obtained without jittering; theoretically, however, the normal approximation of the test statistic is appropriate when the total sample size N is greater than 25, whilst for N ≤ 25 we need to calculate the exact test. Therefore, adding noise to the tied values is the best choice when the total sample size is relatively small. The alternative would be to exclude the tied values from the analysis, but this should be the “last chance” option, since it discards information and, ultimately, lowers the statistical power.
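Continuing the sketch above (Xj holds the jittered pooled sample, x first and y second), wilcox.test() computes the exact p-value by default once the ties are gone and the samples are small; setting exact = FALSE would force the normal approximation instead:

xj <- Xj[1:8]  #jittered x sample
yj <- Xj[9:23]  #jittered y sample
wilcox.test(xj, yj, conf.int = TRUE)  #exact test: no ties, N <= 25
wilcox.test(xj, yj, conf.int = TRUE, exact = FALSE)  #normal approximation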

R code

#Add Gaussian noise to tied values
noise2tied <- function(x, y, w) {
  #x, y: vectors of data
  #w: a real value in (0, 1] that regulates the amount of noise to be added
  #PLOT of the two densities
  plot(density(x), xlim=c(15,40), ylim=c(0,0.4), main="distribution of x and y", xlab="")
  grid(nx=NULL, ny=NULL, lty=3, col="lightgray")
  lines(density(y), col="red")
  legend("topright", bty="n", legend=c("x","y"), col=c("black","red"), lty=c(1,1))
  #merge the vectors and index the observations
  X <- c(x, y)
  n <- length(X)
  idx <- seq(1, n, 1)
  #Group membership
  G1 <- rep(1, times=length(x))
  G2 <- rep(2, times=length(y))
  G <- c(G1, G2)
  #identify the tied values: zero differences between consecutive sorted values
  DF <- data.frame(X, G, idx)
  DFo <- DF[order(X), ]
  Xo <- DFo[, 1]
  dX <- diff(Xo)
  u0 <- which(dX == 0)
  if (length(u0) > 0) {
    #dX[i] == 0 means Xo[i] and Xo[i+1] are tied: collect both members of each tie
    u0 <- sort(unique(c(u0, u0 + 1)))
    uTied <- DFo[u0, 3]  #positions of the tied values in the unsorted data
    print(uTied)
  } else {
    uTied <- integer(0)
    print("there are no tied values")
  }
  nTies <- length(u0)  #number of tied observations
  Tied <- X[uTied]  #tied values
  xn <- X  #temporary vector where the tied values will be perturbed
  #break the ties over repeated jittering
  nboot <- 10^4  #number of repetitions
  k <- 1  #initialize the counter
  Ranks_Group <- matrix(0, nrow=nboot, ncol=n)
  Xsorted <- matrix(0, nrow=nboot, ncol=n)
  while (k <= nboot) {
    xn[uTied] <- X[uTied] + w*rnorm(length(uTied))  #fresh noise on the tied values only
    ord <- sort(xn, index.return=TRUE)$ix  #ordering permutation of the jittered data
    Ranks_Group[k, ] <- G[ord]  #group membership in rank order
    Xsorted[k, ] <- xn[ord]  #sorted jittered values of this iteration
    k <- k + 1
  }
  #median, at each rank position, of the sorted jittered values across the iterations
  Xrank <- apply(Xsorted, 2, median)
  matched <- data.frame(Xo, Xrank)  #compare the result with the original sorted data
  colnames(matched) <- c("tied", "no.ties")
  #MODE FUNCTION: most frequent value of a vector
  Mode <- function(x) {
    ux <- unique(x)
    ux[which.max(tabulate(match(x, ux)))]
  }
  #Assign each rank position to the most frequent group observed across the iterations
  Grank <- sapply(1:n, function(i) Mode(Ranks_Group[, i]))
  return(list(count_Tied=nTies, whosTied=uTied, untied_all=xn,
              untied_group_1=xn[1:length(x)], untied_group_2=xn[seq(length(x)+1, n, 1)],
              Ranks=sort(Xrank, index.return=TRUE)$ix, X_Ranked=Xrank,
              Ranks_Group=Grank, CompareData=matched))
}

        

#Example
x <- c(34.2, 27.1, 25, 24.3, 29.7, 30.2, 28, 26.4)
y <- c(21.8, 25, 20.3, 20.5, 20.7, 22.3, 20.2, 21.6, 19.8, 20.1, 21.5, 21.1, 20.3, 20.6, 25)
W <- 10^-4
X <- c(x, y)
#Group membership (must be defined before fligner.test)
G1 <- rep(1, times=length(x))
G2 <- rep(2, times=length(y))
G <- c(G1, G2)
#Shapiro-Francia test of normality
library(nortest)
sf.test(x)
sf.test(y)
fligner.test(X, G)  #test homogeneity of variances
wilcox.test(x, y, conf.int=TRUE)
out <- noise2tied(x, y, W)
wilcox.test(out$untied_group_1, out$untied_group_2, conf.int=TRUE)
c(median(x), median(y))
c(median(out$untied_group_1), median(out$untied_group_2))
tapply(out$Ranks, out$Ranks_Group, sum)  #sum of ranks per group

 

References

1. McGee M. (2018). Case for omitting tied observations in the two-sample t-test and the Wilcoxon-Mann-Whitney test. PLoS ONE, 13(7): e0200837. https://doi.org/10.1371/journal.pone.0200837

2. Bauer D.F. (1972). Constructing confidence sets using rank statistics. Journal of the American Statistical Association, 67, 687–690. doi:10.1080/01621459.1972.10481279



