Talk:Curse of dimensionality

Overfitting
Could also be added as a complementary viewpoint. 2A02:1812:160C:3D00:E0B3:502A:7C93:7319 (talk) 02:21, 9 October 2016 (UTC)

Example
The current example is unfortunate, since the volume of unit hypercube is 1, exactly the same as the length of the unit interval: so the reader could be excused for saying "well, each sample corresponds to 1% of the space, that's the same, isn't it?". I'll see if I can rephrase it to make the "vastness" more obvious. -- 80.168.224.235 15:34, 10 November 2006 (UTC)


 * It still says in the second paragraph that "the common theme of these problems is that when the dimensionality increases, the volume of the space increases so fast that the available data become sparse." As 80.168.224.235 has already pointed out, the volume of the unit cube is 1---independently of the dimension. If you consider the Euclidean unit ball, then you even have that the volume decreases with the dimension. I think the correct explanation for data sparseness in high dimensions is | concentration of measure. 129.26.133.39 (talk) 10:35, 15 March 2017 (UTC)

Problem types
There are (at least) two types of problem that exhibit the "curse of dimensionality". The article does not distinguish between them, and as a result the explanation is not very clear.


 * In the first type, each variable can take one of several discrete values, or each variable is mapped onto several discrete ranges. Taking the variables together, a huge number of combinations of values must be considered.


 * In the second type, samples are compared using a measure such as a Euclidean distance which is defined using the variables. Because most of the volume of a hypercube is concentrated in the corners, use of the measure to select samples may not be as successful as is naively expected.

The first type includes "optimization problems" such as those considered by Bellman. The objective function must by computed for each combination of values.

The first type also includes "machine learning problems" where each range of values is categorised as a separate feature. An enormous amount of training data is required so that there are several samples with each combination of values. The "Hughes effect" is that with a fixed amount of training data, the predictive power reduces as the dimensionality increases.

The second type includes "nearest neighbor search in high dimensional space". It is not possible to quickly reject candidates by using the difference of one variable as a lower bound for a distance based on all the variables. This is described in R. B. Marimont and M. B. Shapiro, "Nearest Neighbour Searches and the Curse of Dimensionality", Journal of the Institute of Mathematics and its Applications, 24, 1972, 59-70, and in E. Chavez et al., "Searching in Metric Spaces", ACM Computing Surveys, 33, 2001, 273-321.

JonH (talk) 11:33, 10 April 2010 (UTC)

I think the sentence "This is an important intuition for understanding the chi-squared distribution." should be removed unless there is an explanation for this. Otherwise it is just confusing. — Preceding unsigned comment added by Anoyzz (talk • contribs) 15:07, 16 May 2012 (UTC)

Bayesian statistics section
The claim in the Bayesian statistics section that the curse of dimensionality is overcome by MCMC needs clarification and support. Personally, I think that it's flat wrong as the goal in MCMC is to sample, and in the sampling section there is a very nice description of why this should be hard in very high dimensions. I cannot hope to read the mind of the author of the Bayesian statistics section, so if it's not improved in a couple months, half a year, whenever I check this next, I'm just going to remove the section altogether. bradweir (talk) 18:19, 13 March 2013 (UTC)

Link to Coursera Online Course
The curse of dimensionalty is explained in more detail in a video lecture on coursera: https://class.coursera.org/machlearning-001/lecture/215 I don't know where exactly to put it but I think it is very helpful to learn more about it. — Preceding unsigned comment added by 2001:638:902:200B:F21F:AFFF:FE5E:484 (talk) 11:14, 18 June 2014 (UTC)

Original research in "Curse of dimensionality in different domains"
Specifically for the "Sampling" and "Distance functions" sections, there are no references to any of the claims made. It all makes sense, for sure, but it reads like a proof, and it would be nice to find a source for these arguments. — Preceding unsigned comment added by Mobeets (talk • contribs) 13:57, 31 March 2016 (UTC)


 * For the sampling aspect, you can probably use this tutorial as a reference: http://www.dbs.ifi.lmu.de/~zimek/publications/PAKDD2013/TutorialPAKDD2013OutlierDetectionHighDim.pdf for examples slides 30,31; journal version is "A. Zimek, E. Schubert, H.-P. Kriegel: A Survey on Unsupervised Outlier Detection in High-Dimensional Numerical Data. Statistical Analysis and Data Mining, 5(5): 363–387, 2012." - this discusses several aspects of the curse of dimensionality, and is already cited in the article. 138.246.2.177 (talk) 08:29, 1 April 2016 (UTC)

No, it doesn't depend on the algorithm
The unduly prominent, and uncited, original section about how it "depends on the algorithm" is not only unencyclopedic but incorrect. The main point people intend to make when they talk about "the curse of dimensionality" is that it does not depend on the algorithm... the nature of high-dimensional search - the problem itself, and just the problem - is such that you cannot write a good algorithm for it except by making big compromises specialized to particular applications. There are strong theoretical reasons, such as the work of Ryan Williams and others on reduction from SETH, to suspect that it will always be this way, and that high-dimensional search is hard for the same reasons that 3SAT is hard (though not in exactly the same sense of hardness). We don't have a "it's only hard depending on the algorithm" section in the article on 3SAT, and we shouldn't have one here. 130.226.132.3 (talk) 11:08, 6 May 2016 (UTC)


 * Of course it does depend on the algorithm. For example an algorithm that only processes one attribute at a time (e.g. computing the mean of each dimension) is not affected. The question is whether such an algorithm is meaningful (the 'do nothing algorithm' clearly is not affected by this either), but without narrowing down the set of algorithms we are talking about, this statement undoubtedly is correct. The challenge is to find a good way of narrowing down... Chire (talk) 18:38, 16 June 2016 (UTC)


 * Computing the mean of each dimension does not solve similarity search, and its difficulty is not relevant to the hardness of similarity search. By all means the curse of dimensionality depends on the problem - and "compute the mean of each dimension" is different from similarity search because it's a different problem.  But it doesn't depend on the algorithm.  The curse of dimensionality refers to similarity search, which is already adequately defined in the references; it doesn't need to be narrowed down further.  130.226.132.3 (talk) 08:45, 7 July 2016 (UTC)


 * While I don't really object the removal of that paragraph, the curse of dimensionality is not confined to similarity search (and even then, there are similarity measures that are not affected. For example the discrete metric.) For example combinatorial explosion. --Chire (talk) 13:57, 8 July 2016 (UTC)

Corners of a hypercube
The statement the high-dimensional unit hypercube can be said to consist almost entirely of the "corners" of the hypercube, with almost no "middle" is at best misleading, and worst simply wrong. If hypercube has edge $$2r$$ so the distance from the midpoint to the corners is $$r \sqrt{d}$$ and if $$d$$ is large, then the overwhelming majority of points are in the range $$r \sqrt{\tfrac{d}{3}} \pm r$$ from the midpoint and also from the nearest corner, so for example if $$r=1$$ and $$d=30000$$ then well over 99.9% of points are between 99 and 101 away from both the midpoint and the nearest corner. It might be better to say that the high-dimensional unit hypercube can be said to consist almost entirely of the "sides" of the hypercube, with almost no "middle". 2A00:23C6:1482:A100:1038:8965:1773:9999 (talk) 12:18, 22 February 2021 (UTC)
 * Sure! The statement about the concentration of the hypercube in the corners is a typical mistake! I will correct it now.Agor153 (talk) 13:45, 24 February 2021 (UTC)