The curse that is dimensionality
Hazards of big data
Big data has created something of a storm in the technology sector. To say that big data is a buzzword, even in non-tech circles, would be a massive understatement. The idea of big data is everywhere, coursing through the veins of modern society. Big data can mean many rows (samples) with few columns (variables), or many columns (variables) with few rows (samples), as in genomic sequencing in life sciences research. The Curse of Dimensionality, or Large P, Small N (\(P >> N\)), problem applies to the latter case: many variables measured on a comparatively small number of samples.
Every variable in a data set is viewed as a dimension, and together the variables define the space in which the samples sit. Think of a 2D space defined by the height and weight of grade school students. Every student is represented as a point on a plot, with the X axis (dimension) being height and the Y axis (dimension) being weight. In general, older students are taller and heavier, so their points tend to lie in the upper right region of the space. Statistical procedures exist for analysing this 2D data. MANOVA, for instance, can test whether heights and weights differ between boys and girls. This statistical test is appropriate because the data are presumably bivariate normal.
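For readers who want to see what such a low dimensional test looks like in practice, the short sketch below (an illustration added here, not taken from the original analysis) simulates heights and weights for two hypothetical groups and runs a MANOVA; the group means and standard deviations are made-up values chosen purely for demonstration.
### Hypothetical illustration: MANOVA on simulated 2D height/weight data
### for two groups; the means and SDs below are made-up values
set.seed(1)
group  <- factor(rep(c('A', 'B'), each = 50))
height <- c(rnorm(50, 140, 8), rnorm(50, 150, 8))   # height in cm
weight <- c(rnorm(50, 35, 5),  rnorm(50, 45, 5))    # weight in kg
fit <- manova(cbind(height, weight) ~ group)
summary(fit, test = 'Pillai')   # tests whether the bivariate means differ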
When we have to contend with many variables, the Curse of Dimensionality alters the behaviour of the data, and conventional statistical methods give incorrect answers. This leads to increased costs from following up on wrong results with expensive and time-consuming experiments, and it slows product development pipelines. In this blog post by AICoreSpot, we demonstrate how the behaviour of data changes in high dimensions. In a subsequent blog, we will discuss how to avoid these issues in applied analysis of high dimensional data.
Data possesses attributes
Statistics developed over the previous 100 years are based on probability models, or distributions. This approach provides an automatic and objective technique for decision making under set assumptions about the data. This model for data analytics has served its purpose and has proven very effective in basic biomedical research and clinical trials.
Unfortunately, this strategy fails when the number of variables is much larger than the number of samples, i.e., \(P >> N\). In high dimensions the data assumptions required for statistical testing are not met, and in the following we illustrate what happens to the data when \(P>>N\) and why it produces wrong results.
There are four attributes of high dimensional data:
- Points move farther away from one another in high dimensions.
- Points shift farther away from the centre in high dimensions.
- The distances between all pairs of points become the same.
- The accuracy of any predictive model approaches 100%.
Each attribute is detailed below with R code so the reader can verify it for themselves.
Property 1: Points move farther away from one another in high dimensions. This means that the density of data in local neighbourhoods becomes too sparse to fit distributions.
Consider the height/weight data described above. Every student is represented as a vector of length two, \(s_i = (x_{i,1}, x_{i,2})\), where \(x_{i,1}\) is the height and \(x_{i,2}\) is the weight of student \(i\). The Euclidean distance between two students \(i,j\) over \(p = 2\) dimensions is \[d_{p = 2}(s_{i}, s_{j}) = \sqrt{(x_{i,1} - x_{j,1})^2 + (x_{i,2} - x_{j,2})^2}.\] Distance is non-decreasing when variables are added, \[d_{p = 3}(s_{i}, s_{j}) = \sqrt{(x_{i,1} - x_{j,1})^2 + (x_{i,2} - x_{j,2})^2 + (x_{i,3} - x_{j,3})^2} \geq d_{p=2}(s_{i}, s_{j}),\] and strictly increasing when the added variables have differing values (e.g., \(x_{i,3} \neq x_{j,3}\)). This result extends to higher dimensions: \(d_{p = 1} \leq d_{p=2} \leq d_{p=3} \leq \ldots\).
As the number of dimensions increases to infinity, with values differing between samples, the distance between pairs of points goes to infinity, i.e., \(\lim_{p\to\infty} d_{p}(s_{i}, s_{j}) = \infty\). But does this limiting result matter at the dimensions we actually see in big data? Simulations demonstrate that it does.
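Before turning to the full simulation, a minimal sketch (added here for illustration) shows the monotonicity directly: take a single pair of points and keep appending extra \(U(0,1)\) coordinates, and the Euclidean distance between them never decreases.
### Quick check of the non-decreasing distance property for one pair of
### points as U(0,1) coordinates are appended one dimension at a time
set.seed(1)
s_i <- runif(1)
s_j <- runif(1)
d <- numeric(10)
for (p in 1:10) {
  d[p] <- sqrt(sum((s_i - s_j)^2))   # distance using the first p coordinates
  s_i <- c(s_i, runif(1))            # append one more dimension to each point
  s_j <- c(s_j, runif(1))
}
round(d, 3)   # a non-decreasing sequence of distances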
A simulated data set with 100 samples (rows) and 2 covariates \(X,Y\) (columns) was generated from the uniform distribution over the unit square and is shown in the following plot.
### This code generates 100 uniformly distributed data
### points over the unit square used throughout this report
###
library(ggplot2)   # plotting
library(scales)    # trans_breaks(), trans_format(), math_format() for log axes
set.seed(864729)
simData = cbind(X = runif(100), Y = runif(100))
p1 <- ggplot(as.data.frame(simData), aes(x = X, y = Y)) +
  geom_point() + geom_vline(xintercept = 0.5) + geom_hline(yintercept = 0.5) +
  ggtitle('100 Uniform Points on the Unit Square') +
  xlab("X") + ylab("Y") +
  theme_bw()
p1
The mean pairwise distance between these points is \(0.53\). To see the impact of adding extra dimensions on the mean pairwise distance, the following code was run. The plot shows that the mean distance increases steadily with higher dimensions, implying that the points are moving farther apart and becoming sparse, which makes statistical analysis difficult.
### This code adds U(0,1) random variable dimensions to the simulated
### data and measures the average pairwise distance
###
simData.average.dist = c('Noise' = 0, 'Iter' = 1, 'Distance' = 0.53)
noise = c(8, 98, 998, 9998, 99998)
for(iter in 1:10) {
  for(addedNoise in noise) {
    simData.noise = cbind(simData,
                          matrix(runif(dim(simData)[1]*addedNoise),
                                 nrow=dim(simData)[1]))
    simData.dist = dist(simData.noise)
    simData.average.dist = rbind(simData.average.dist,
                                 c('Noise' = addedNoise,
                                   'Iter' = iter,
                                   'Distance' = mean(simData.dist)))
  }
}
simData.dist.agg = aggregate(simData.average.dist[,3],
                             by = list('Dimensions' = simData.average.dist[,1]),
                             mean)
simData.dist.agg$Dimensions = simData.dist.agg$Dimensions + 2
p2 <- ggplot(simData.dist.agg,
             aes(x = Dimensions, y = x)) +
  geom_point() + geom_line() +
  ggtitle('Increase in Distance Between Points') +
  xlab("log(Dimensions)") + ylab("Average Distance") +
  scale_x_log10(breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x))) +
  theme_bw()
p2 + annotation_logticks(sides = 'b')
Impact on analysis: Researchers typically cluster \(P>>N\) data. But sparseness masks clustering, and clusters that are present in low dimensions disappear in higher dimensions. Ten samples were simulated from an \(N(-10,1)\) distribution and 10 from an \(N(10,1)\) distribution. The plot shows these data and three cluster analyses: the original data (clear separation of the two groups, samples \(1-10\) and samples \(11-20\)), the data with 9 added noise variables (some loss of discrimination, seen in the increased height at which the groups merge), and the data with 99 added noise variables (total loss of the clusters).
The result of clustering when \(P>>N\) is a spurious dendrogram in which the clusters are arbitrary groupings of the samples.
### This code generates 2 groups of samples from N(-10,1) and N(10,1)
### and runs cluster analysis with and without added noise
###
par(mfrow=c(2,2))
simDataNorm = c(rnorm(10, -10, 1), rnorm(10, 10, 1))
plot(density(simDataNorm), xlab = 'X', main = 'Two Normal Distributions')
grid()
plot(hclust(dist(simDataNorm)), main = 'No Noise', xlab = '')
plot(hclust(dist(cbind(simDataNorm,
                       matrix(runif(20*9, -10, 10), ncol=9)))),
     main = '9 Noise Variables', xlab = '')
plot(hclust(dist(cbind(simDataNorm,
                       matrix(runif(20*99, -10, 10), ncol=99)))),
     main = '99 Noise Variables', xlab = '')
Property 2: Points shift farther away from the centre in high dimensions. This means the data moves away from the centre and towards the outer edge of the space.
Points in high dimensions end up on the outer edge or shell of the distribution. Consider independent \(X_i \sim U(0,1)\) random variables: as dimensions \(P\) are added to the data, \(\Pr(d(\min(x_1, x_2, \ldots), 0) \le \epsilon) \rightarrow 1\) and \(\Pr(d(\max(x_1, x_2, \ldots), 1) \le \epsilon) \rightarrow 1\). In other words, as more dimensions are added, at least one coordinate gets within \(\epsilon\) of 0 and at least one gets within \(\epsilon\) of 1.
The table shows the probability of 1) falling below 0.001, 2) falling above 0.999, and 3) either falling below 0.001 or above 0.999, as extra dimensions are added. As predicted, the probability of falling on the boundary increases with the number of dimensions \(P\).
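For independent \(U(0,1)\) coordinates this probability also has a simple closed form, \(\Pr(\min_i x_i \le \epsilon) = 1 - (1-\epsilon)^P\) (and symmetrically for the maximum), so the simulated values in the table below can be sanity-checked directly; this check is an addition to the original analysis.
### Closed-form check (added for reference): probability that at least one
### of P independent U(0,1) coordinates falls within 0.001 of the boundary 0
P <- c(2, 10, 100, 1000, 10000, 100000)
round(1 - (1 - 0.001)^P, 3)
## 0.002 0.010 0.095 0.632 1.000 1.000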
The second outcome is that the observed centre of the data gets farther away from the true centre. For a multivariate \(U(0,1)\) distribution the expected centre is at \(0.5\) in every dimension. In the simulation that follows, the observed centre, i.e., the sample mean, was estimated and its distance to the expected centre calculated. As the dimensions increase, the observed-minus-expected distance \(O-E\) increases, demonstrating that the estimated mean is farther from the true mean.
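A rough back-of-the-envelope expectation, added here for context: each coordinate of the sample mean of \(N\) observations from \(U(0,1)\) has variance \(\frac{1}{12N}\), so the expected squared distance between the observed and expected centre is approximately \(\frac{P}{12N}\), giving \(O-E \approx \sqrt{P/(12N)}\). With \(N = 100\) samples this closely tracks the simulated \(O-E\) values in the table below.
### Back-of-the-envelope expectation (added for reference): the O - E
### distance should grow roughly like sqrt(P / (12 * N)) for N = 100
P <- c(2, 10, 100, 1000, 10000, 100000)
round(sqrt(P / (12 * 100)), 3)
## 0.041 0.091 0.289 0.913 2.887 9.129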
Impact on analysis
Accurate parameter estimation is required to fit distributions, perform hypothesis tests, determine power and sample size for experiments, calculate confidence intervals, and carry out many other statistical calculations. As \(O-E\) increases with added dimensions, the ability to accurately fit distributions, perform hypothesis tests, and so on decreases, leading to false conclusions.
### This code calculates the probabilities of being within 0.001 units of
### the data distribution boundaries, and calculates the difference between
### the observed and expected mean parameter for different dimensions
###
boundaryProb = function(data) {
  x = t(apply(data, 1, function(a) {c(min(a) <= 0.001, max(a) >= 0.999)}))
  y = cbind(x, apply(x, 1, max))
  apply(y, 2, mean)
}
noise = c(0, 8, 98, 998, 9998, 99998)
boundaryProb.results = NULL
for(iter in 1:10) {
  for(addedNoise in noise) {
    simData.noise = cbind(simData,
                          matrix(runif(dim(simData)[1]*addedNoise),
                                 nrow=dim(simData)[1]))
    simData.mean = apply(simData.noise, 2, mean)
    simData.dist.ctr = sqrt(sum((simData.mean -
                                   rep(0.5, length(simData.mean)))^2))
    boundaryProb.results = rbind(boundaryProb.results,
                                 c(2+addedNoise,
                                   boundaryProb(simData.noise),
                                   simData.dist.ctr))
  }
}
colnames(boundaryProb.results) = c('Dimensions', 'Pr(Min. <= 0.001)',
                                   'Pr(Max. >= 0.999)', 'Pr(Either)',
                                   'O - E Dist.')
boundaryProb.results = as.data.frame(boundaryProb.results)
round(aggregate(boundaryProb.results[,2:5],
                by=list('Dimensions' = boundaryProb.results$Dimensions), 'mean'),
      3)
## Dimensions Pr(Min. <= 0.001) Pr(Max. >= 0.999) Pr(Either) O - E Dist.
## 1 2e+00 0.000 0.000 0.000 0.040
## 2 1e+01 0.007 0.008 0.015 0.091
## 3 1e+02 0.091 0.094 0.177 0.290
## 4 1e+03 0.625 0.647 0.868 0.915
## 5 1e+04 0.999 1.000 1.000 2.893
## 6 1e+05 1.000 1.000 1.000 9.134
Property 3: The distance between all pairs of points becomes the same. This implies that the two closest points are the same distance apart as the two farthest points!
This is the most perplexing consequence of \(P>>N\) data. Geometrically, three points \((i,j,k)\) in 2 dimensions are equidistant from each other when they are the vertices of an equilateral triangle. Adding a \(4^{th}\) point, \((i,j,k,l)\), equidistance occurs when the points are the vertices of a regular tetrahedron in 3 dimensions. Beyond 4 points we are limited to examining this empirically to obtain evidence of equidistance in higher dimensions.
All pairwise distances lie between the minimum and maximum (the range of) pairwise distances. If the difference between the maximum and the minimum becomes small relative to the distances themselves, then all pairwise distances become nearly the same: for all pairs \((i,j)\), \(d(s_i,s_j) \approx d_{max} \approx d_{min}\), and as \((d_{max}-d_{min})/d_{max} \rightarrow 0\), \(d(s_i,s_j) \rightarrow d_{max} \rightarrow d_{min}\) for all \((i,j)\). In other words, all pairwise distances become equivalent. The following plot demonstrates in simulation that \((d_{max}-d_{min})/d_{max} \rightarrow 0\) as the number of dimensions increases, and thus by necessity \(d(s_i,s_j) \rightarrow d_{max} \rightarrow d_{min}\).
### This code adds U(0,1) random variable dimensions to the sample data
### and calculates the scaled difference between the max and min.
###
noise = c(2, 10, 100, 1000, 10000, 100000) - 2
simData.average.dist = NULL
for(iter in 1:10) {
  for(addedNoise in noise) {
    simData.noise = cbind(simData,
                          matrix(runif(dim(simData)[1]*addedNoise),
                                 nrow=dim(simData)[1]))
    simData.dist = dist(simData.noise)
    simData.average.dist = rbind(simData.average.dist,
                                 c('Noise' = addedNoise, 'Iter' = iter,
                                   summary(simData.dist)))
  }
}
simData.average.agg = aggregate(simData.average.dist[,3:8],
                                by=list('Noise' = simData.average.dist[,1]), mean)
simData.pairwise.dists = data.frame('Dimensions' = simData.average.agg[,1]+2,
                                    'Dist' = (simData.average.agg[,7] -
                                              simData.average.agg[,2]) /
                                             simData.average.agg[,7])
p3 <- ggplot(simData.pairwise.dists,
             aes(x = Dimensions, y = Dist)) +
  geom_point() + geom_line() +
  ggtitle('Decrease in Scaled Distance Between Minimum and Maximum') +
  xlab("log(Dimensions)") + ylab("Scaled Distance (Max - Min) / Max") +
  scale_x_log10(breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x))) +
  theme_bw()
p3 + annotation_logticks(sides = 'b')
Impact on analysis: Nearest-neighbour algorithms classify points based on the majority class of the closest points. In this simulation, 100 \(N(-5,1)\) and 100 \(N(5,1)\) samples were simulated and \(U(-5,5)\) noise dimensions added. The error for the original one-dimensional two-group data and for 10 dimensions was 0, but it increased as more dimensions were added. At 100,000 dimensions, not an unreasonable number of dimensions \(P\) for \(P>>N\) data, the nearest-neighbour classifier was wrong for half of the cases.
### This code measures the misclassification (error) rate in k = 5 nearest
### neighbour analysis with 100 samples from N(-5, 1) and 100 from N(5, 1)
###
library(dbscan)   # provides kNN(); assumed to be the package used here
simDataNorm = as.matrix(c(rnorm(100, -5, 1), rnorm(100, 5, 1)))
simDataNorm.train = simDataNorm
simDataNorm.nn = kNN(simDataNorm.train, k=5)
a = apply(simDataNorm.nn$id[1:100,], 1,
          function(a) ifelse(sum(a<101) > 2, 1, 0))
b = apply(simDataNorm.nn$id[101:200,], 1,
          function(a) ifelse(sum(a>100) > 2, 1, 0))
simDataNorm.results = c('Dimension' = 0, 'Iter' = 1,
                        'Error' = 100 - (100*(sum(c(a,b)) / 200)))
for(noise in c(9, 99, 999, 9999, 99999)) {
  for(i in 1:10) {
    simDataNorm.train = as.matrix(cbind(simDataNorm,
                                        matrix(runif(noise*200, -5, 5),
                                               ncol=noise)))
    simDataNorm.nn = kNN(simDataNorm.train, k=5)
    a = apply(simDataNorm.nn$id[1:100,], 1,
              function(a) ifelse(sum(a<101) > 2, 1, 0))
    b = apply(simDataNorm.nn$id[101:200,], 1,
              function(a) ifelse(sum(a>100) > 2, 1, 0))
    simDataNorm.results = rbind(simDataNorm.results,
                                c('Dimension' = noise,
                                  'Iter' = i,
                                  'Error' = 100 - (100*(sum(c(a,b)) / 200))))
  }
}
simDataNorm.agg = aggregate(simDataNorm.results[,3],
                            by=list('Dimensions' = simDataNorm.results[,1]),
                            mean)
simDataNorm.agg$Dimensions = simDataNorm.agg$Dimensions + 1
p4 <- ggplot(simDataNorm.agg,
             aes(x = Dimensions, y = x)) +
  geom_point() + geom_line() +
  ggtitle('Increase in Error with Higher Dimensions') +
  xlab("log(Dimensions)") + ylab("Error") +
  scale_x_log10(breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x))) +
  theme_bw()
p4 + annotation_logticks(sides = 'b')
Property 4: The accuracy of any predictive model approaches 100%. This means models can always be found that predict group characteristics with high accuracy.
In \(P>>N\) data, where points have moved farther away from one another (Property 1), points are distributed on the outer boundary of the data (Property 2), and all points are equidistant (Property 3), hyperplanes can be found that separate any two or more subsets of data points, even when there are no real groups in the data. A hyperplane in \(P\) dimensions is a \(P-1\) dimensional boundary through the space that divides it into two sides. Hyperplanes are used to classify points into groups depending on which side of the hyperplane(s) a point falls.
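To make this concrete, the following minimal sketch (an addition to the original simulations) shows that once there are at least as many pure-noise predictors as samples, an ordinary linear fit places every sample on the correct side of a hyperplane even though the labels are completely arbitrary.
### Minimal illustration (added for this post): with P >= N pure-noise
### predictors, a linear fit separates arbitrary labels perfectly,
### i.e., a separating hyperplane exists for any labelling
set.seed(1)
n <- 50
y <- rep(c(0, 1), n / 2)                    # arbitrary group labels
X <- matrix(runif(n * (n + 10)), nrow = n)  # pure noise predictors, P > N
fit <- lm(y ~ X)                            # hyperplane fit through the noise
table(Truth = y, Predicted = as.numeric(fitted(fit) > 0.5))
### every sample lands on the correct side of the fitted hyperplane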
Impact on analysis: In \(P>>N\) data, hyperplanes can be found that classify any trait of the samples, even when all of the variables are noise. This simulation fits classification trees to predict even versus odd rows \((Y)\) for a multivariate \(X\sim U(0,1)\) hypercube with differing numbers of dimensions. There should be no model that accurately predicts even and odd rows from purely random data.
However, the plot shows that the ability to accurately predict even and odd rows improves with dimensions. Thus, any predictive model fit to high dimensional data may be the result of purely random data. While such models can be validated with test data, two things should be kept in mind. First, a split-data approach, where some data is held back for evaluation, still shares many of the same hyperplanes, so accuracy on the evaluation data may still be quite high; the limited number of samples in \(P>>N\) data makes this difficult. Second, validating with follow-up experiments is costly and slows product development. Confirmatory studies may show the model is wrong, but it would be preferable not to have needed those studies in the first place.
### This code runs recursive partitioning on purely random U(0,1)
### data and calculates how many rows are correctly classified as coming
### from an even or odd row in the data
###
library(rpart)   # recursive partitioning trees
y = rep(c(1,2), 50)
simData.rpart.predict = NULL
for(noise in c(1, 10, 100, 1000, 10000)) {
  for(i in 1:50) {
    simData.y = as.data.frame(cbind(y, matrix(runif(noise*100), ncol=noise)))
    simData.rpart = rpart(as.factor(y) ~ ., data = simData.y)
    simData.rpart.class = table(y, predict(simData.rpart, type='class'))
    simData.rpart.predict = rbind(simData.rpart.predict,
                                  c('Dimension' = noise, 'Iter' = i,
                                    'Error' = (100 - sum(diag(simData.rpart.class)))))
  }
}
simData.rpart.agg = aggregate(simData.rpart.predict[,3],
                              by=list('Dimensions' = simData.rpart.predict[,1]),
                              mean)
p5 <- ggplot(simData.rpart.agg,
             aes(x = Dimensions, y = x)) +
  geom_point() + geom_line() +
  ggtitle('Reduction in Error with Higher Dimensions') +
  xlab("log(Dimensions)") + ylab("Error") +
  scale_x_log10(breaks = trans_breaks("log10", function(x) 10^x),
                labels = trans_format("log10", math_format(10^.x))) +
  theme_bw()
p5 + annotation_logticks(sides = 'b')
Conclusion
In this blog post, the properties of \(P>>N\) data were described and simulations were used to illustrate their impact on data analysis: as points move away from one another, clusters are disrupted and dendrograms give wrong results; as points move away from the centre, estimates of the mean vector become less accurate; as the distances between all pairs of points converge to a single value, nearest-neighbour analysis becomes less accurate; and as hyperplanes can always be found, models that predict purely arbitrary outcomes from noise become deceptively accurate.
What makes all of this so disconcerting is that the analyses will still produce results, yet it is impossible to know whether those results are valid or completely suspect owing to the Curse of Dimensionality.