Similarity Featurizer used to decide whether a pair of records is similar. Featurizing a pair of rows returns 1.0 if the pair is similar and 0.0 otherwise.
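The 1.0/0.0 contract can be illustrated with a minimal sketch. It assumes Jaccard similarity over whitespace tokens and a fixed threshold; the names `tokenize`, `jaccard_featurize`, and `threshold` are hypothetical and are not part of the documented API.

```python
def tokenize(text):
    """Split a string into a set of lowercase, whitespace-delimited tokens."""
    return set(text.lower().split())


def jaccard_featurize(row_a, row_b, threshold=0.5):
    """Return 1.0 if the Jaccard similarity of the rows meets the threshold, else 0.0.

    Assumption: rows are plain strings and the similarity measure is Jaccard;
    the documented featurizer may use a different measure.
    """
    tokens_a, tokens_b = tokenize(row_a), tokenize(row_b)
    union = tokens_a | tokens_b
    if not union:
        return 0.0  # two empty rows are treated as not similar in this sketch
    similarity = len(tokens_a & tokens_b) / len(union)
    return 1.0 if similarity >= threshold else 0.0
```

For example, `jaccard_featurize("new york city", "new york")` shares two of three distinct tokens, so it clears the 0.5 threshold and yields 1.0.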
If set to true, the algorithm calculates token weights automatically. By default, token weights are derived from the tokens' IDF (inverse document frequency) values.
Adding weights to the join can make pair comparisons more reliable but may add overhead to the algorithm. However, optimizations such as prefix filtering, used in some implementations of AnnotatedSimilarityFeaturizer, can reduce that overhead when the dataset contains many common tokens.
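As a rough illustration of why IDF weights can make comparisons more reliable, the sketch below computes weights as log(N / df) and uses them in a weighted Jaccard score. The formula variant and the function names `idf_weights` and `weighted_jaccard` are assumptions made for this example; the documented algorithm may compute weights differently.

```python
import math
from collections import Counter


def idf_weights(documents):
    """Map each token to log(N / df), where df counts the documents containing it."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each token once per document
    return {token: math.log(n_docs / df) for token, df in doc_freq.items()}


def weighted_jaccard(tokens_a, tokens_b, weights):
    """Weighted Jaccard: weight of shared tokens divided by weight of all tokens."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union_weight = sum(weights.get(t, 0.0) for t in set_a | set_b)
    if union_weight == 0.0:
        return 0.0  # no informative tokens on either side
    shared_weight = sum(weights.get(t, 0.0) for t in set_a & set_b)
    return shared_weight / union_weight


# Hypothetical toy corpus: "globex" is rarer than "acme", so it carries more weight.
corpus = [["acme", "corp"], ["acme", "inc"], ["globex", "corp"], ["initech", "inc"]]
weights = idf_weights(corpus)
```

With these weights, two records that share only the rare token "globex" score higher than two records that share only the more common token "acme", even though both pairs have the same unweighted Jaccard similarity.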
Join two RDDs.
First RDD of rows
Second RDD of rows
True if rddA is a sample of rddB
An RDD of pairs of similar rows.
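The shape of the join can be sketched without Spark, using plain lists of (row_id, text) tuples as stand-ins for the two RDDs. Everything here is hypothetical: `similarity_join` and `exact_token_match` are illustrative names, and the effect given to `sample_a` (skipping a row's own copy in the second collection) is an assumption, since the text above does not say what the sample flag changes.

```python
from itertools import product


def exact_token_match(text_a, text_b):
    """Toy featurizer: 1.0 if both texts have the same token set, else 0.0."""
    same = set(text_a.lower().split()) == set(text_b.lower().split())
    return 1.0 if same else 0.0


def similarity_join(rows_a, rows_b, featurize, sample_a=False):
    """Return every (row_a, row_b) pair that the featurizer marks as similar.

    rows_a, rows_b: lists of (row_id, text) tuples standing in for RDDs of rows.
    featurize: callable returning 1.0 for a similar pair of texts, 0.0 otherwise.
    sample_a: ASSUMPTION for this sketch only: when True, a row from rows_a is
              not paired with its own copy in rows_b (same row_id).
    """
    pairs = []
    # Naive nested-loop comparison; a real join would prune candidates first.
    for (id_a, text_a), (id_b, text_b) in product(rows_a, rows_b):
        if sample_a and id_a == id_b:
            continue
        if featurize(text_a, text_b) == 1.0:
            pairs.append(((id_a, text_a), (id_b, text_b)))
    return pairs


rows_b = [(1, "new york"), (2, "York New"), (3, "boston")]
rows_a = [(1, "new york")]  # a one-row sample of rows_b
```

Calling `similarity_join(rows_a, rows_b, exact_token_match, sample_a=True)` pairs row 1 only with row 2, whereas leaving `sample_a` at its default also returns the pairing of row 1 with its own copy.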
Similarity Featurizer used to decide whether a pair of records is similar or not.