The data cleaning process is inherently iterative, usually starting with cursory exploratory data analysis on small samples where the analyst manually fixes errors and estimates their impact.
As the analyst learns more about the dataset, she often refines the cleaning to make it more accurate and robust.
SampleClean implements a set of interchangeable and composable physical and logical data cleaning operators, allowing quick construction and adaptation of data cleaning pipelines.
Many data scientists clean and wrangle data with one-off scripts, which leads to unreliable software and brittle workflows that shuttle data between different programs. In our research, we identify a set of logical operators commonly found in data cleaning workflows: Sampling, Similarity Join, Filtering, and Extraction. We build an API around these logical operators, allowing them to be executed and optimized at the physical level.
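To make the composability concrete, here is a minimal sketch of such a pipeline in Python. The operator names and record layout are illustrative, not SampleClean's actual API; the point is that each operator consumes and produces a collection of records, so operators chain freely.

```python
import random

# Illustrative stand-ins for logical data cleaning operators.
# Each takes and returns a list of records, so they compose.

def sample(records, fraction, seed=0):
    """Uniformly sample roughly a fraction of the records."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]

def filter_op(records, predicate):
    """Keep only records satisfying the predicate."""
    return [r for r in records if predicate(r)]

def extract(records, field):
    """Project a single field out of each record."""
    return [r[field] for r in records]

records = [{"name": "cafe a", "rating": 4},
           {"name": "cafe b", "rating": 2},
           {"name": "cafe c", "rating": 5}]

# Operators compose into a pipeline without glue scripts.
pipeline = extract(filter_op(records, lambda r: r["rating"] >= 4), "name")
```

Because every stage shares one interface, replacing a stage (or inserting sampling in front of an expensive one) does not disturb the rest of the pipeline.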
Optimized Similarity Join:
Deduplication, entity resolution, and outlier detection all rely on an operator called a "Similarity Join": given two relations, it finds all pairs of tuples that satisfy a similarity condition. A naive implementation applies the similarity metric to every pair of tuples. In SampleClean, we apply prefix filtering to reduce the number of tuple comparisons, and any program written against the logical operator benefits from this optimization.
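The prefix-filtering idea can be sketched as follows (a simplified standalone version for a Jaccard threshold, not SampleClean's implementation): order each record's tokens by global frequency, index only a short prefix of rare tokens, and compare two records only if their prefixes share a token.

```python
import math
from collections import defaultdict

def jaccard(a, b):
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def prefix_filter_join(left, right, threshold):
    """Find (i, j) with jaccard(left[i], right[j]) >= threshold,
    avoiding the all-pairs comparison via prefix filtering."""
    # Order tokens by global frequency so prefixes hold rare tokens.
    freq = defaultdict(int)
    for rec in left + right:
        for tok in set(rec):
            freq[tok] += 1

    def prefix(rec):
        toks = sorted(set(rec), key=lambda t: (freq[t], t))
        # Two records meeting the threshold must share a token within
        # the first |r| - ceil(t * |r|) + 1 tokens of this ordering.
        k = len(toks) - math.ceil(threshold * len(toks)) + 1
        return toks[:k]

    # Index the right relation by prefix tokens.
    index = defaultdict(set)
    for j, rec in enumerate(right):
        for tok in prefix(rec):
            index[tok].add(j)

    results = set()
    for i, rec in enumerate(left):
        candidates = set()
        for tok in prefix(rec):
            candidates |= index[tok]
        # Verify only the surviving candidates with the real metric.
        for j in candidates:
            if jaccard(rec, right[j]) >= threshold:
                results.add((i, j))
    return results
```

Pairs that share no prefix token are pruned without ever computing the similarity, which is where the savings over the naive all-pairs loop come from.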
In many applications, an approximate answer suffices. In SampleClean, we allow users to clean samples of data and extrapolate results.
Below, we show results on a subset of restaurant records from a Yelp challenge dataset in which 15% of the records are duplicates. SampleClean's deduplication is significantly faster than the naive all-pairs implementation, and sampling lets the user trade off accuracy against running time.
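The sample-and-extrapolate idea described above can be sketched as follows. This is a minimal illustration, not SampleClean's actual estimator: clean only a uniform sample, compute the aggregate on the cleaned sample, and report a confidence interval that shrinks as the sample grows.

```python
import random
import statistics

def approx_clean_mean(dirty, clean_fn, fraction, seed=0):
    """Estimate the mean of the cleaned data from a uniform sample.
    clean_fn maps a dirty value to its cleaned value."""
    rng = random.Random(seed)
    sampled = [v for v in dirty if rng.random() < fraction]
    cleaned = [clean_fn(v) for v in sampled]
    est = statistics.mean(cleaned)
    # Half-width of a ~95% confidence interval (central limit theorem).
    margin = 1.96 * statistics.stdev(cleaned) / len(cleaned) ** 0.5
    return est, margin
```

Cleaning 10% of the data costs roughly 10% of the cleaning effort, and the returned margin tells the analyst whether that sample was large enough for the query at hand.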
Crowd input is expensive, so SampleClean is parsimonious with the questions it asks of the crowd. We apply a technique called active learning to budget crowd effort: as crowd responses arrive, we train a model to predict them, and then direct the crowd to the predictions the model is most uncertain about.
Below, we plot the query error in a value-filling task. As crowd workers fill in missing values, we learn which values to add. We find that active learning provides a significant benefit at small sample sizes, which makes it especially valuable for early results and evaluations.
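The selection step of this loop is uncertainty sampling, which can be sketched in a few lines. Here `predict_proba` stands in for any model that returns P(label = 1) for an item; the function name and signature are illustrative.

```python
# Uncertainty sampling: spend the crowd budget on the items whose
# predicted label the model is least sure of (probability near 0.5).

def pick_crowd_questions(items, predict_proba, budget):
    """Return the `budget` items with predictions closest to 0.5."""
    return sorted(items, key=lambda x: abs(predict_proba(x) - 0.5))[:budget]
```

In the full loop, the crowd's answers for the selected items are added to the training set, the model is retrained, and the next batch is chosen, so each round of crowd work targets the model's remaining uncertainty.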
Tuning without rewriting:
Many physical data cleaning implementations are sensitive to their parameter settings. By abstracting the physical implementation behind a logical operator, users can easily try different techniques and parameters without rewriting their pipelines.
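One way to picture this separation (an illustrative sketch, not SampleClean's actual API): the logical operator dispatches to a registry of physical implementations, so switching the implementation or tuning its threshold is a configuration change rather than a code change.

```python
def naive_join(left, right, sim, threshold):
    """All-pairs baseline physical implementation."""
    return {(i, j) for i, a in enumerate(left)
                   for j, b in enumerate(right)
                   if sim(a, b) >= threshold}

# Registry of physical implementations for one logical operator;
# an optimized variant (e.g. prefix filtering) would register here too.
PHYSICAL = {"naive": naive_join}

def similarity_join(left, right, sim, threshold, impl="naive"):
    """Logical operator: callers are unchanged when `impl` or
    `threshold` is tuned."""
    return PHYSICAL[impl](left, right, sim, threshold)
```

A pipeline written against `similarity_join` can be re-run with a different `impl` or `threshold` to compare accuracy and runtime, with no edits to the pipeline itself.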