Interface Deduplication<T>

  • Type Parameters:
    T - the type of the record
    All Known Subinterfaces:
    OfflineDeduplication<T>, OnlineDeduplication<T>
    All Known Implementing Classes:
    FusingOnlineDuplicateDetection
    Functional Interface:
    This is a functional interface and can therefore be used as the assignment target for a lambda expression or method reference.

    @FunctionalInterface
    public interface Deduplication<T>
    A full deduplication process, which ensures that no duplicate record is emitted.

    In general, all implementations ensure that the user receives a duplicate-free dataset, either incrementally through repeated updates or by providing a duplicate-free dataset directly.

    The actual implementation may use any means necessary to find duplicates, to ensure proper transitivity (if (A,B) and (B,C) are duplicates, then (A,C) is also a duplicate), and to produce a resulting representation.
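
The transitivity requirement can be illustrated with a union-find structure. This is only a minimal sketch under stated assumptions, not how any implementation of this interface is known to work: the class TransitiveClosureSketch, its methods, and the use of String ids for records are hypothetical names introduced for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper, not part of the documented API. */
final class TransitiveClosureSketch {
    // Maps each record id to its parent; a root is its own parent.
    private final Map<String, String> parent = new HashMap<>();

    /** Returns the representative id of the cluster that contains the given id. */
    String find(String id) {
        parent.putIfAbsent(id, id);
        String root = id;
        while (!root.equals(parent.get(root))) {
            root = parent.get(root);
        }
        // Path compression: point every node on the path directly at the root.
        String current = id;
        while (!current.equals(root)) {
            String next = parent.get(current);
            parent.put(current, root);
            current = next;
        }
        return root;
    }

    /** Records that a and b are duplicates by merging their clusters. */
    void markDuplicate(String a, String b) {
        String rootA = find(a);
        String rootB = find(b);
        if (!rootA.equals(rootB)) {
            parent.put(rootA, rootB);
        }
    }

    /** Two ids are duplicates exactly when they share a cluster representative. */
    boolean isDuplicate(String a, String b) {
        return find(a).equals(find(b));
    }

    public static void main(String[] args) {
        TransitiveClosureSketch clusters = new TransitiveClosureSketch();
        clusters.markDuplicate("A", "B");
        clusters.markDuplicate("B", "C");
        System.out.println(clusters.isDuplicate("A", "C")); // true, by transitivity
        System.out.println(clusters.isDuplicate("A", "D")); // false, D was never linked
    }
}
```

Only the pairs (A,B) and (B,C) are marked explicitly; the pair (A,C) follows from the shared cluster, which is the property the description asks implementations to guarantee.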

    • Method Summary

      Modifier and Type Method Description
      @NonNull java.util.stream.Stream<T> deduplicate​(@NonNull java.util.stream.Stream<? extends T> records)
      Deduplicates the dataset.
      default @NonNull java.util.Collection<T> materializedDeduplicate​(@NonNull java.lang.Iterable<? extends T> records)
      Deduplicates the dataset.
      default @NonNull java.util.Collection<T> materializedDeduplicate​(@NonNull java.lang.Iterable<? extends T> records, @NonNull java.util.function.Function<? super T,​java.lang.Object> idExtractor)
      Selects the candidates for the given records and materializes them.
    • Method Detail

      • deduplicate

        @NonNull java.util.stream.Stream<T> deduplicate(@NonNull java.util.stream.Stream<? extends T> records)
        Deduplicates the dataset.

        Note that for online algorithms, duplicates will be emitted repeatedly with an updated representation, since it is impossible to suppress earlier emissions without blocking execution. The user needs to invalidate earlier results through external means (for example, by putting them in a key-value store with the key being the id of the duplicate).

        Parameters:
        records - the records that should be freed from duplicates.
        Returns:
        a duplicate-free dataset, with the above-mentioned limitation for online algorithms.
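
The key-value approach mentioned above might look like the following sketch. It assumes that the caller supplies a function that reduces each emitted record to an id. The names LatestRepresentationStore, latestById, and idOf, as well as the "id:name" record format, are hypothetical, and a ConcurrentHashMap merely stands in for an external key-value store.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;
import java.util.stream.Stream;

/** Hypothetical consumer-side helper, not part of the documented API. */
final class LatestRepresentationStore {

    /**
     * Drains the emissions in encounter order and keeps only the most recent
     * representation per id, so later emissions invalidate earlier ones.
     */
    static <T, K> Map<K, T> latestById(Stream<? extends T> emissions,
                                       Function<? super T, ? extends K> idOf) {
        Map<K, T> store = new ConcurrentHashMap<>();
        emissions.forEachOrdered(emitted -> store.put(idOf.apply(emitted), emitted));
        return store;
    }

    /** Assumed record format for this example: "id:name". */
    static String idOf(String emitted) {
        return emitted.substring(0, emitted.indexOf(':'));
    }

    public static void main(String[] args) {
        // Stand-in for the output of an online algorithm: id 1 is emitted twice.
        Stream<String> emissions = Stream.of("1:Jon Smith", "2:Ann Lee", "1:John Smith");
        Map<String, String> store = latestById(emissions, LatestRepresentationStore::idOf);
        System.out.println(store.get("1")); // 1:John Smith
        System.out.println(store.size());   // 2
    }
}
```

In a real deployment the stream would typically be unbounded, so the store would be updated continuously rather than returned after the stream ends.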
      • materializedDeduplicate

        default @NonNull java.util.Collection<T> materializedDeduplicate(@NonNull java.lang.Iterable<? extends T> records)
        Deduplicates the dataset.

        For online algorithms, this method can only be applied to a finite stream; it can be used to verify results in a test or to compare performance against an offline algorithm.

        For online algorithms, this method may emit duplicates repeatedly with the same id, since an online algorithm usually emits results eagerly. To obtain a truly duplicate-free dataset, use materializedDeduplicate(Iterable, Function).

        Parameters:
        records - the records that should be freed from duplicates.
        Returns:
        a duplicate-free dataset, with the above-mentioned limitation for online algorithms.
        Implementation Requirements:
        For online algorithms, it is strongly encouraged that duplicates are filtered such that only the final representation remains.
      • materializedDeduplicate

        default @NonNull java.util.Collection<T> materializedDeduplicate(@NonNull java.lang.Iterable<? extends T> records,
                                                                         @NonNull java.util.function.Function<? super T, java.lang.Object> idExtractor)
        Selects the candidates for the given records and materializes them. The additional idExtractor ensures that previously found duplicates are removed from the output.

        For online algorithms, this method can only be applied to a finite stream; it can be used to verify results in a test or to compare performance against an offline algorithm.

        Parameters:
        records - the records that should be freed from duplicates.
        idExtractor - extracts the id of a record; used to remove previously found duplicates from the output.
        Returns:
        a duplicate-free dataset.
        Implementation Requirements:
        For online algorithms, it is strongly encouraged that duplicates are filtered such that only the final representation remains.
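
The difference between the two materializedDeduplicate overloads can be sketched as follows. The nested interface DeduplicationSketch is a hypothetical stand-in that mirrors the documented signatures; its default method bodies are assumptions about one plausible behavior, not the library's actual implementation. The toy lambda and the "id:name" record format are likewise made up for illustration.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.stream.StreamSupport;

final class MaterializeSketch {

    /** Hypothetical stand-in mirroring the documented signatures; NOT the library's code. */
    @FunctionalInterface
    interface DeduplicationSketch<T> {
        Stream<T> deduplicate(Stream<? extends T> records);

        /** Assumed behavior: collect every emission, repeated ids included. */
        default Collection<T> materializedDeduplicate(Iterable<? extends T> records) {
            return deduplicate(StreamSupport.stream(records.spliterator(), false))
                    .collect(Collectors.toList());
        }

        /** Assumed behavior: keep only the last emission per extracted id. */
        default Collection<T> materializedDeduplicate(Iterable<? extends T> records,
                                                      Function<? super T, Object> idExtractor) {
            Map<Object, T> latest = new LinkedHashMap<>();
            deduplicate(StreamSupport.stream(records.spliterator(), false))
                    .forEachOrdered(emitted -> latest.put(idExtractor.apply(emitted), emitted));
            return new ArrayList<>(latest.values());
        }
    }

    /** Assumed record format for this example: "id:name". */
    static String idOf(String emitted) {
        return emitted.substring(0, emitted.indexOf(':'));
    }

    public static void main(String[] args) {
        // Toy "online" algorithm: eagerly re-emits every record unchanged,
        // so several emissions can share the same id.
        DeduplicationSketch<String> online = records -> records.map(r -> r);
        List<String> input = List.of("1:Jon Smith", "2:Ann Lee", "1:John Smith");

        // Without an id extractor, both emissions for id 1 remain.
        System.out.println(online.materializedDeduplicate(input).size()); // 3
        // With an id extractor, only the final representation per id remains.
        System.out.println(online.materializedDeduplicate(input, MaterializeSketch::idOf));
        // [1:John Smith, 2:Ann Lee]
    }
}
```

Because the interface has a single abstract method, a lambda is enough to supply deduplicate, and both materializing overloads come along as default methods.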