## Abstract

Learning from high-dimensional data is usually quite challenging, as captured by the well-known phrase *curse of dimensionality*. Data analysis often involves measuring the similarity between different examples. This sometimes becomes a problem, as many widely used metrics tend to concentrate in high-dimensional feature spaces. The reduced contrast makes it more difficult to distinguish between close and distant points, which renders many traditional distance-based learning methods ineffective. Secondary distances based on shared neighbor similarities have recently been proposed as one possible solution to this problem. However, these initial metrics failed to take *hubness* into account. Hubness is a recently described aspect of the dimensionality curse, and it affects all sorts of \(k\)-nearest neighbor learning methods in severely negative ways. This paper is the first to discuss the impact of hubs on forming the shared neighbor similarity scores. We propose a novel, hubness-aware secondary similarity measure \(simhub_s\) and an extensive experimental evaluation shows it to be much more appropriate for high-dimensional data classification than the standard \(simcos_s\) measure. The proposed similarity changes the underlying \(k\)NN graph in such a way that it reduces the overall frequency of label mismatches in \(k\)-neighbor sets and increases the purity of occurrence profiles, which improves classifier performance. It is a hybrid measure, which takes into account both the supervised and the unsupervised hubness information. The analysis shows that both components are useful in their own ways and that the measure is therefore properly defined. This new similarity does not increase the overall computational cost, and the improvement is essentially ‘free’.

This is a preview of subscription content, access via your institution.

## References

- 1.
Tomašev N, Mladenić D (2012) Hubness-aware shared neighbor distances for high-dimensional k-nearest neighbor classification. In: Proceedings of the 7th international conference on hybrid artificial intelligence systems. HAIS ’12

- 2.
Scott D, Thompson J (1983) Probability density estimation in higher dimensions. In: Proceedings of the fifteenth symposium on the interface, pp 173–179

- 3.
Aggarwal CC, Hinneburg A, Keim DA (2001) On the surprising behavior of distance metrics in high dimensional spaces. In: Proceedings of the 8th international conference on database theory (ICDT), pp 420–434

- 4.
François D, Wertz V, Verleysen M (2007) The concentration of fractional distances. IEEE Trans Knowl Data Eng 19(7):873–886

- 5.
Durrant RJ, Kabán A (2009) When is ‘nearest neighbour’ meaningful: a converse theorem and implications. J Complex 25(4):385–397

- 6.
Radovanović M, Nanopoulos A, Ivanović M (2009) Nearest neighbors in high-dimensional data: the emergence and influence of hubs. In: Proceedings of the 26th international conference on machine learning (ICML), pp 865–872

- 7.
Radovanović M, Nanopoulos A, Ivanović M (2010) On the existence of obstinate results in vector space models. In Proceedings of the 33rd annual international ACM SIGIR conference on research and development in information retrieval, pp 186–193

- 8.
Aucouturier J, Pachet F (2004) Improving timbre similarity: how high is the sky? J Negat Res Speech Audio Sci 1

- 9.
Aucouturier J (2006) Ten experiments on the modelling of polyphonic timbre. Technical report, Docteral dissertation, University of Paris 6

- 10.
Flexer A, Gasser M, Schnitzer D (2010) Limitations of interactive music recommendation based on audio content. In: Proceedings of the 5th audio mostly conference: a conference on interaction with sound. ACM, AM ’10, New York, NY, USA, pp 13:1–13:7

- 11.
Flexer A, Schnitzer D, Schlüter J (2012) A mirex meta-analysis of hubness in audio music similarity. In: Proceedings of the 13th international society for music information retrieval conference. ISMIR’12

- 12.
Schedl M, Flexer A (2012) Putting the user in the center of music information retrieval. In: Proceedings of the 13th international society for music information retrieval conference. ISMIR’12

- 13.
Schnitzer D, Flexer A, Schedl M, Widmer G (2011) Using mutual proximity to improve content-based audio similarity. In: ISMIR’11, pp 79–84

- 14.
Gasser M, Flexer A, Schnitzer D (2010) Hubs and orphans—an explorative approach. In: Proceedings of the 7th sound and music computing conference. SMC’10

- 15.
Radovanović M, Nanopoulos A, Ivanović M (2011) Hubs in space: popular nearest neighbors in high-dimensional data. J Mach Learn Res 11:2487–2531

- 16.
Jarvis RA, Patrick EA (1973) Clustering using a similarity measure based on shared near neighbors. IEEE Trans Comput 22:1025–1034

- 17.
Ertz L, Steinbach M, Kumar V (2001) Finding topics in collections of documents: a shared nearest neighbor approach. In: Proceedings of text Mine01, first SIAM international conference on data mining

- 18.
Yin J, Fan X, Chen Y, Ren J (2005) High-dimensional shared nearest neighbor clustering algorithm. In: Fuzzy systems and knowledge discovery, vol 3614 of Lecture Notes in computer science. Springer, Berlin, Heidelberg, pp 484–484

- 19.
Moëllic PA, Haugeard JE, Pitel G (2008) Image clustering based on a shared nearest neighbors approach for tagged collections. In: Proceedings of the international conference on content-based image and video retrieval. CIVR ’08. ACM, New York, NY, USA, pp 269–278

- 20.
Anil KumarPatidar, Agrawal JMN (2012) Analysis of different similarity measure functions and their impacts on shared nearest neighbor clustering approach. Int J Comput Appl 40:1–5

- 21.
Zheng L-Z, Huang DC (2012) Outlier detection and semi-supervised clustering algorithm based on shared nearest neighbors. Comput Syst Appl 29:117–121

- 22.
Houle ME, Kriegel HP, Kröger P, Schubert E, Zimek A (2010) Can shared-neighbor distances defeat the curse of dimensionality? In: Proceedings of the 22nd international conference on scientific and statistical database management. SSDBM’10, Springer, pp 482–500

- 23.
Bennett KP, Fayyad U, Geiger D (1999) Density-based indexing for approximate nearest-neighbor queries. In: ACM SIGKDD conference proceedings, ACM Press, pp 233–243

- 24.
Ayad H, Kamel M (2003) Finding natural clusters using multi-clusterer combiner based on shared nearest neighbors. In: Multiple classifier systems. vol 2709 of Lecture Notes in computer science. Springer, Berlin, Heidelberg, pp 159–159

- 25.
Tomašev N, Radovanović M, Mladenić D, Ivanović M (2011) The role of hubness in clustering high-dimensional data. In: PAKDD (1)’11, pp 183–195

- 26.
Buza K, Nanopoulos A, Schmidt-Thieme L (2011) Insight: efficient and effective instance selection for time-series classification. In: Proceedings of the 15th Pacific-Asia conference on Advances in knowledge discovery and data mining, vol Part II. PAKDD’11, Springer, pp 149–160

- 27.
Tomašev N, Mladenić D (2011) Exploring the hubness-related properties of oceanographic sensor data. In: Proceedings of the SiKDD conference

- 28.
Tomašev N, Radovanović M, Mladenić D, Ivanović M (2011) Hubness-based fuzzy measures for high dimensional k-nearest neighbor classification. In: Machine learning and data mining in pattern recognition, MLDM conference

- 29.
Tomašev N, Radovanović M, Mladenić D, Ivanović M (2011) A probabilistic approach to nearest neighbor classification: Naive hubness bayesian k-nearest neighbor. In: Proceedings of the CIKM conference

- 30.
Tomašev N, Mladenić D Nearest neighbor voting in high-dimensional data: learning from past occurences. In: PhD forum, ICDM conference

- 31.
Tomašev N, Mladenić D (2012) Nearest neighbor voting in high dimensional data: Learning from past occurrences. Comput Sci Inf Syst 9(2):691–712

- 32.
Tomašev N, Mladenić D (2011) The influence of weighting the k-occurrences on hubness-aware classification methods. In: Proceedings of the SiKDD conference

- 33.
Fix E, Hodges J (1951) Discriminatory analysis, nonparametric discrimination: consistency properties. Technical report, USAF School of Aviation Medicine, Randolph Field, Texas

- 34.
Stone CJ (1977) Consistent nonparametric regression. Ann Stat 5:595–645

- 35.
Devroye L, Gyorfi AK, Lugosi G (1994) On the strong universal consistency of nearest neighbor regression function estimates. Ann Stat 22:1371–1385

- 36.
Cover TM, Hart PE (1967) Nearest neighbor pattern classification. IEEE Trans Inf Theory IT 13(1):21–27

- 37.
Devroye L (1981) On the inequality of cover and hart. IEEE Trans Pattern Anal Mach Intell 3:75–78

- 38.
Keller JE, Gray MR, Givens JA (1985) A fuzzy k-nearest-neighbor algorithm. IEEE Trans Syst Man Cybern 15:580–585

- 39.
Jensen R, Cornelis C (2008) A new approach to fuzzy-rough nearest neighbour classification. In: Proceedings of the 6th international conference on rough sets and current trends in computing. RSCTC ’08. Springer, Berlin, Heidelberg, pp 310–319

- 40.
Song Y, Huang J, Zhou D, Zha H, Giles CL (2007) Iknn: Informative k-nearest neighbor pattern classification. In: Proceedings of the 11th European conference on principles and practice of knowledge discovery in databases. PKDD 2007, Springer, Berlin, Heidelberg pp 248–264

- 41.
Hodge VJ, Austin J (September 2005) A binary neural k-nearest neighbour technique. Knowl Inf Syst (KAIS) 8(3):276–291

- 42.
Ougiaroglou S, Nanopoulos A, Papadopoulos AN, Manolopoulos Y, Welzer-druzovec T (2007) Adaptive k-nearest neighbor classification based on a dynamic number of nearest neighbors. In: Proceedings of ADBIS Conference. ADBIS 2007

- 43.
Zhang H, Berg AC, Maire M, Malik J (2006) Svm-knn: Discriminative nearest neighbor classification for visual category recognition. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition—vol 2. CVPR ’06, IEEE Computer Society , Washington, DC, USA, pp 2126–2136

- 44.
Triguero I, García S, Herrera F (2011) Differential evolution for optimizing the positioning of prototypes in nearest neighbor classification. Pattern Recognit 44(4):901–916

- 45.
Ambert KH, Cohen AM (2012) k-information gain scaled nearest neighbors: a novel approach to classifying protein-protein interaction-related documents. EEE/ACM Trans Comput Biol Bioinform 9(1):305–310

- 46.
Xing Z, Pei J, Yu PS (2009) Early prediction on time series: a nearest neighbor approach. In: Proceedings of the 21st international joint conference on artificial intelligence. IJCAI’09, Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, pp 1297–1302

- 47.
Chaovalitwongse WA, Fan YJ, Sachdeo RC (2007) On the time series k-nearest neighbor classification of abnormal brain activity. IEEE Trans Syst Man Cybern Part A 37:1005–1016

- 48.
Holte RC, Acker LE, Porter BW (1989) Concept learning and the problem of small disjuncts. In: Proceedings of 11th international conference AI, vol 1. Morgan Kaufmann Publishers Inc. pp 813–818

- 49.
van den Bosch A, Weijters T, Herik HJVD, Daelemans W (1997) When small disjuncts abound, try lazy learning: a case study

- 50.
Li Y, Zhang X (2011) Improving k-nearest neighbor with exemplar generalization for imbalanced classification. In: Advances in knowledge discovery and data mining, vol 6635. Springer, pp 321–332

- 51.
Tan S (May 2005) Neighbor-weighted k-nearest neighbor for unbalanced text corpus. Expert Syst Appl 28:667–671

- 52.
Wang S, Li X, Xia JF, Zhang XP (2010) Weighted neighborhood classifier for the classification of imbalanced tumor dataset. J Circuits Syst Comput, pp 259–273

- 53.
Van Hulse J, Khoshgoftaar T (December 2009) Knowledge discovery from imbalanced and noisy data. Data Knowl Eng 68(12):1513–1542

- 54.
Chen J, ren Fang H, Saad Y (2009) Fast approximate \(k\)NN graph construction for high dimensional data via recursive Lanczos bisection. J Mach Learn Res 10:1989–2012

- 55.
Tomašev N, Brehar R, Mladenić D, Nedevschi S (2011) The influence of hubness on nearest-neighbor methods in object recognition. In: IEEE conference on intelligent computer communication and Processing

- 56.
Lowe DG (November 2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91

- 57.
Zhang Z, Zhang R (2008) Multimedia data mining: a systematic introduction to concepts and theory. Chapman and Hall, New York

- 58.
Tomašev N, Mladenić D (2012) Under review: reference anonymized for double-blind, review

- 59.
Napierala K, Stefanowski J (2012) Identification of different types of minority class examples in imbalanced data. In: Corchado E, Snel V, Abraham A, Wozniak M, Graa M, Cho SB (eds) Hybrid artificial intelligent systems, vol 7209 of lecture notes in computer science. Springer, Berlin, Heidelberg, pp 139–150

- 60.
Lienhart R, Maydt J (2002) An extended set of haar-like features for rapid object detection. In: IEEE ICIP 2002, pp 900–903

- 61.
Tan PN, Steinbach M, Kumar V (2005) Introduction to data mining. Addison Wesley, Reading

- 62.
Bilenko M, Basu S, Mooney RJ (2004) Integrating constraints and metric learning in semi-supervised clustering. In: Proceedings of the twenty-first international conference on Machine learning. ICML ’04, New York, NY, USA, ACM, pp 11

- 63.
Lu Z (2007) Semi-supervised clustering with pairwise constraints: a discriminative approach. J Mach Learn Res—Proceedings Track, pp 299–306

- 64.
Kumar N, Kummamuru K, Paranjpe D (2005) Semi-supervised clustering with metric learning using relative comparisons. In: Proceedings of the Fifth IEEE international conference on data mining. ICDM ’05, IEEE Computer Society, Washington, DC, USA, pp 693–696

## Acknowledgments

This work was supported by the Slovenian Research Agency, the IST Programme of the EC under PASCAL2 (IST-NoE-216886).

## Author information

### Affiliations

### Corresponding author

## Additional information

This is an extended version of the paper *Hubness-aware Shared Neighbor Distances for High-dimensional k-Nearest Neighbor Classification*, which was presented at the Data Mining: Data Preparation and Analysis special session of the Hybrid Artificial Intelligence conference (HAIS 2012) [1].

## Rights and permissions

## About this article

### Cite this article

Tomašev, N., Mladenić, D. Hubness-aware shared neighbor distances for high-dimensional \(k\)-nearest neighbor classification.
*Knowl Inf Syst* **39, **89–122 (2014). https://doi.org/10.1007/s10115-012-0607-5

Received:

Revised:

Accepted:

Published:

Issue Date:

### Keywords

- Hubs
- Metric learning
- Curse of dimensionality
- \(k\)-nearest neighbor
- Classification
- Shared neighbor distances