Class OnlineSortedNeighborhoodMethod<T>
- java.lang.Object
-
- com.bakdata.dedupe.candidate_selection.online.OnlineSortedNeighborhoodMethod<T>
-
- Type Parameters:
T
- the type of the record.
- All Implemented Interfaces:
CandidateSelection<T>
,OnlineCandidateSelection<T>
public final class OnlineSortedNeighborhoodMethod<T> extends java.lang.Object implements OnlineCandidateSelection<T>
A sorted neighborhood method (SNM) for online deduplication. Records are sorted in multiple passes by a specific sorting key and all records within a windoww
are compared.Multiple passes with different windows increase probability that duplicates are sorted near each other in a specific pass. It's best to use a wide range of record attributes and combine them to
CompositeValue
if not discriminating enough. Ideally, each sorting value appears only once per unique duplicate cluster. Longer sorting keys pose almost no runtime overhead and rather small additional memory. Note that multiple passes result in duplicate candidates that are removed within this algorithm. Thus, the actual amount of candidates is usually smaller than#passes * w
.Window size:
In an online application, the window is constantly changing, such that each record is compared when first inserted and whenever a new record in the current window is inserted.
Therefore, a record is compared more than
w
times: On average2*w-1
times, with a lower bound ofw-1
(during initial insertion).For comparison, in an offline SNM, each record is compared w-1 times (ignoring the records at the beginning and in the end of the sort index).
Thus, this algorithm still preserves the most reliably properties of offline SNM: A linear amount of comparison to the dataset size and a minimum number of comparisons per record.
-
-
Nested Class Summary
Nested Classes Modifier and Type Class Description static class
OnlineSortedNeighborhoodMethod.OnlineSortedNeighborhoodMethodBuilder<T>
static class
OnlineSortedNeighborhoodMethod.Pass<T,K extends java.lang.Comparable<K>>
Represents a pass over the dataset with a specific sorting key and window size.
-
Method Summary
All Methods Static Methods Instance Methods Concrete Methods Modifier and Type Method Description static <T> OnlineSortedNeighborhoodMethod.OnlineSortedNeighborhoodMethodBuilder<T>
builder()
boolean
equals(java.lang.Object o)
int
getDefaultWindowSize()
The default window size, when not explicitly given.@NonNull java.util.List<OnlineSortedNeighborhoodMethod.Pass<T,?>>
getPasses()
The different passes used to select the candidates.int
hashCode()
@NonNull java.util.stream.Stream<Candidate<T>>
selectCandidates(T newRecord)
Selects the candidates for the a new incoming record.java.lang.String
toString()
-
Methods inherited from class java.lang.Object
clone, finalize, getClass, notify, notifyAll, wait, wait, wait
-
Methods inherited from interface com.bakdata.dedupe.candidate_selection.CandidateSelection
selectCandidates
-
Methods inherited from interface com.bakdata.dedupe.candidate_selection.online.OnlineCandidateSelection
selectCandidates
-
-
-
-
Method Detail
-
selectCandidates
@NonNull public @NonNull java.util.stream.Stream<Candidate<T>> selectCandidates(@NonNull T newRecord)
Description copied from interface:OnlineCandidateSelection
Selects the candidates for the a new incoming record. The selection algorithm will maintain an internal representation of the previously seen records or a subset thereof.Thus, this method is stateful.
- Specified by:
selectCandidates
in interfaceOnlineCandidateSelection<T>
- Parameters:
newRecord
- the new record for which the candidates are generated.- Returns:
- the generated candidates.
-
builder
public static <T> OnlineSortedNeighborhoodMethod.OnlineSortedNeighborhoodMethodBuilder<T> builder()
-
getPasses
@NonNull public @NonNull java.util.List<OnlineSortedNeighborhoodMethod.Pass<T,?>> getPasses()
The different passes used to select the candidates.
-
getDefaultWindowSize
public int getDefaultWindowSize()
The default window size, when not explicitly given. Defaults to 10 but should always be explicitly set when used.
-
equals
public boolean equals(java.lang.Object o)
- Overrides:
equals
in classjava.lang.Object
-
hashCode
public int hashCode()
- Overrides:
hashCode
in classjava.lang.Object
-
toString
public java.lang.String toString()
- Overrides:
toString
in classjava.lang.Object
-
-