Polibits, Vol. 45, pp. 21-25, 2012.
Abstract: Near-duplicate detection is important when dealing with large, noisy databases in data mining tasks. In this paper, we present the results of applying the Rank distance and the Smith-Waterman distance, alongside more widely used string similarity measures such as the Levenshtein distance, in combination with a disjoint-set data structure, to the problem of near-duplicate detection.
Keywords: Near-duplicate detection, string similarity measures, database, data mining
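Illustrative sketch (not taken from the paper): the abstract pairs a string distance with a disjoint-set data structure, and one plausible way to combine them is to merge any two records whose distance falls below a threshold and read off the resulting sets as near-duplicate groups. The example below uses the Levenshtein distance only; the function names, the `max_distance` threshold, and the all-pairs comparison are assumptions made for illustration, and the paper's actual procedure and parameters may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance via dynamic programming (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


class DisjointSet:
    """Union-find with path halving (hypothetical helper for this sketch)."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, x: int, y: int) -> None:
        rx, ry = self.find(x), self.find(y)
        if rx != ry:
            self.parent[ry] = rx


def near_duplicate_groups(records, max_distance=2):
    """Group records whose pairwise distance is at most max_distance.

    max_distance is an assumed threshold, not a value from the paper.
    Compares all pairs, so it is only practical for small inputs.
    """
    ds = DisjointSet(len(records))
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if levenshtein(records[i], records[j]) <= max_distance:
                ds.union(i, j)
    groups = {}
    for i, rec in enumerate(records):
        groups.setdefault(ds.find(i), []).append(rec)
    return list(groups.values())


records = ["John Smith", "Jon Smith", "John Smyth", "Alice Jones"]
groups = near_duplicate_groups(records)
```

Because the disjoint set merges transitively, two records can land in the same group without being within the threshold of each other directly, as long as a chain of close records connects them.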
PDF: String Distances for Near-duplicate Detection