Class OnlineSortedNeighborhoodMethod<T>

  • Type Parameters:
    T - the type of the record.
    All Implemented Interfaces:
    CandidateSelection<T>, OnlineCandidateSelection<T>

    public final class OnlineSortedNeighborhoodMethod<T>
    extends java.lang.Object
    implements OnlineCandidateSelection<T>
    A sorted neighborhood method (SNM) for online deduplication. Records are sorted in multiple passes by a specific sorting key and all records within a window w are compared.

    Multiple passes with different windows increase probability that duplicates are sorted near each other in a specific pass. It's best to use a wide range of record attributes and combine them to CompositeValue if not discriminating enough. Ideally, each sorting value appears only once per unique duplicate cluster. Longer sorting keys pose almost no runtime overhead and rather small additional memory. Note that multiple passes result in duplicate candidates that are removed within this algorithm. Thus, the actual amount of candidates is usually smaller than #passes * w.

    Window size:

    In an online application, the window is constantly changing, such that each record is compared when first inserted and whenever a new record in the current window is inserted.

    Therefore, a record is compared more than w times: On average 2*w-1 times, with a lower bound of w-1 (during initial insertion).

    For comparison, in an offline SNM, each record is compared w-1 times (ignoring the records at the beginning and in the end of the sort index).

    Thus, this algorithm still preserves the most reliably properties of offline SNM: A linear amount of comparison to the dataset size and a minimum number of comparisons per record.

    • Method Detail

      • selectCandidates

        @NonNull
        public @NonNull java.util.stream.Stream<Candidate<T>> selectCandidates​(@NonNull
                                                                               T newRecord)
        Description copied from interface: OnlineCandidateSelection
        Selects the candidates for the a new incoming record. The selection algorithm will maintain an internal representation of the previously seen records or a subset thereof.

        Thus, this method is stateful.

        Specified by:
        selectCandidates in interface OnlineCandidateSelection<T>
        Parameters:
        newRecord - the new record for which the candidates are generated.
        Returns:
        the generated candidates.
      • getDefaultWindowSize

        public int getDefaultWindowSize()
        The default window size, when not explicitly given. Defaults to 10 but should always be explicitly set when used.
      • equals

        public boolean equals​(java.lang.Object o)
        Overrides:
        equals in class java.lang.Object
      • hashCode

        public int hashCode()
        Overrides:
        hashCode in class java.lang.Object
      • toString

        public java.lang.String toString()
        Overrides:
        toString in class java.lang.Object