Release v0.1


The SampleClean project is hosted on GitHub:

Latest Release: SampleClean-0.1 (Jar) (GZip Tar) (Zip)

Requirements: JDK 1.6+, Scala 2.10.x, Spark 1.0-1.2


We provide a set of Scala libraries for Entity Resolution, Crowd Sourcing, and Approximate Query Processing.

Entity Resolution: The problem of linking multiple database representations of the same real world "entity". SampleClean provides a library and programming API for constructing distributed entity resolution pipelines.
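To make the idea of linking records concrete, here is a minimal sketch of a threshold-based matcher using token-level Jaccard similarity. It is an illustration only and not SampleClean's implementation: the object name JaccardSketch and its functions are hypothetical, and the similarity measure SampleClean uses for its threshold argument may differ.

```scala
// Illustrative sketch: token-based Jaccard matching (hypothetical names, not the SampleClean API).
object JaccardSketch {
  // Lowercase a string and split it into a set of word tokens.
  def tokens(s: String): Set[String] =
    s.toLowerCase.split("\\W+").filter(_.nonEmpty).toSet

  // Jaccard similarity: size of the intersection divided by size of the union.
  // Defined as 1.0 when both token sets are empty.
  def jaccard(a: String, b: String): Double = {
    val ta = tokens(a)
    val tb = tokens(b)
    val union = ta union tb
    if (union.isEmpty) 1.0 else (ta intersect tb).size.toDouble / union.size
  }

  // Two records are treated as the same entity when similarity reaches the threshold.
  def isMatch(a: String, b: String, threshold: Double): Boolean =
    jaccard(a, b) >= threshold
}
```

With this sketch, "Chez Panisse" and "Chez Panisse Cafe" share two of three tokens, so they match at a threshold of 0.6 but not at 0.7. A lower threshold links more records at the cost of more false matches.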

Crowd Sourcing: Entity resolution tasks can be hard to automate, and crowdsourcing is often the preferred way to get reliable results. SampleClean provides a library of crowd sourcing tools that also learns adaptively through Active Learning. Crowd sourcing requires a running AMPCrowd server.
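One common way to combine a crowd with Active Learning is uncertainty sampling: send the crowd only the candidate pairs the current model is least sure about. The sketch below shows that general idea under stated assumptions. The source does not describe SampleClean's actual selection strategy, and the names UncertaintySketch, Candidate, and mostUncertain are hypothetical.

```scala
// Illustrative sketch of uncertainty sampling (hypothetical names, not the SampleClean API).
object UncertaintySketch {
  // A candidate pair with the model's estimated probability that it is a match.
  case class Candidate(left: String, right: String, matchProb: Double)

  // Return the k candidates whose probability is closest to 0.5,
  // i.e. the pairs where a human label is likely to be most informative.
  def mostUncertain(cands: Seq[Candidate], k: Int): Seq[Candidate] =
    cands.sortBy(c => math.abs(c.matchProb - 0.5)).take(k)
}
```

Pairs with probabilities near 0 or 1 are left to the automatic matcher, which keeps the number of crowd tasks small.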

Approximate Query Processing: We often want aggregate statistics over a database (SUM, COUNT, AVG). To answer such queries with high accuracy, it often suffices to clean only a small sample of the data. SampleClean provides the primitives to sample the data and extrapolate query results from the sample.
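The extrapolation step can be sketched in a few lines of plain Scala. This is a minimal illustration assuming a uniform random sample and a known table size n. The object and function names are hypothetical and do not come from the SampleClean API, and the sketch omits the confidence intervals a real system would report.

```scala
import scala.util.Random

// Illustrative sketch: estimating aggregates from a uniform sample (hypothetical names).
object SampleEstimateSketch {
  // Draw a uniform random sample of k rows without replacement.
  def sample(data: Seq[Double], k: Int, seed: Long): Seq[Double] =
    new Random(seed).shuffle(data).take(k)

  // AVG is estimated by the sample mean (assumes a non-empty sample).
  def estimateAvg(sample: Seq[Double]): Double = sample.sum / sample.size

  // SUM is the sample mean scaled up by the full table size n.
  def estimateSum(sample: Seq[Double], n: Long): Double = estimateAvg(sample) * n

  // COUNT of rows satisfying a predicate: the sample fraction scaled by n.
  def estimateCount(sample: Seq[Double], n: Long)(p: Double => Boolean): Double =
    sample.count(p).toDouble / sample.size * n
}
```

Only the sampled rows need to be cleaned before calling the estimators, which is why a small, clean sample can stand in for the full, dirty table.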

Programming With SampleClean

You can download the SampleClean jar to include with any Spark program, or you can clone our GitHub repository to check out the source code. We have provided a programming guide to help you get started.

Quick Start

We will walk through a basic tutorial on how to get SampleClean running using Spark Shell either locally or on a cluster.


1. Java Development Kit 7+ Download
2. Scala 2.10.x Download

Spark and SampleClean Local Installation

1. First create a new directory: mkdir sampleclean
2. Download Spark 1.2.x to this directory (Download)
3. Untar Spark: tar xvzf spark-1.2.2.tgz
4. Build Spark:
cd spark-1.2.2
sbt/sbt -Phive assembly/assembly
5. Download SampleClean to the spark directory
6. To avoid permission issues on a local deployment, configure Hive with our default config. Download the config to the spark directory (Download)
7. Put the config in the Spark configuration folder: mv hive-site.xml.default conf/hive-site.xml

Testing Your Installation

8. Download the example dataset to the spark folder (Download)
9. Open the Spark shell: ./bin/spark-shell --jars sampleclean-v0.1.jar
10. Import SampleClean: import sampleclean.api.SampleCleanContext
11. Create a new SampleCleanContext and HiveContext: val scc = new SampleCleanContext(sc)
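12. Load Example Dataset
// The ROW FORMAT clause assumes restaurant.csv is comma-delimited.
scc.hql("""CREATE TABLE
        restaurant(id String,
             entity String,
             name String,
             category String,
             city String)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY ','""")
scc.hql("LOAD DATA LOCAL INPATH 'restaurant.csv' OVERWRITE INTO TABLE restaurant")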

13. Create a working set

14. Count the number of distinct restaurants
scc.hql("select count(distinct name) from restaurant").collect().foreach(println)

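15. Do Entity Resolution
import sampleclean.clean.deduplication.EntityResolution
val algorithm = EntityResolution.longAttributeCanonicalize(scc,"restaurant_working","name",0.7)
// Run the algorithm and persist the result, as in steps 21 and 23
algorithm.exec()
scc.writeToParent("restaurant_working")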

16. Count the number of distinct restaurants

scc.hql("select count(distinct name) from restaurant").collect().foreach(println)
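The distinct count in step 16 is expected to be lower than in step 14 if entity resolution merged variant spellings of the same restaurant. The following self-contained sketch shows why, using a deliberately simplified rule: names are merged when they are identical after lowercasing and removing punctuation and spaces. The names CanonicalizeSketch, normalize, and canonicalize are hypothetical, and SampleClean's own algorithm uses a similarity threshold rather than this exact-match rule.

```scala
import scala.collection.mutable

// Illustrative sketch: canonicalizing names reduces the distinct count (hypothetical names).
object CanonicalizeSketch {
  // Normalized key: lowercase, letters and digits only.
  def normalize(s: String): String = s.toLowerCase.filter(_.isLetterOrDigit)

  // Replace every name with the first spelling seen for its normalized key.
  def canonicalize(names: Seq[String]): Seq[String] = {
    val firstSeen = mutable.Map[String, String]()
    names.map(n => firstSeen.getOrElseUpdate(normalize(n), n))
  }

  // Equivalent of "select count(distinct name)" on an in-memory column.
  def distinctCount(names: Seq[String]): Int = names.distinct.size
}
```

For example, the five names "Cafe Roma", "cafe-roma", "Chez Panisse", "CHEZ PANISSE", and "Spago" have five distinct values before canonicalization and three afterwards.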

Using the Crowd

19. Configure crowd tasks (if you installed AMPCrowd earlier):

import sampleclean.crowd._
val crowdConfig = CrowdConfiguration(crowdName="internal", 
val taskParams = CrowdTaskConfiguration(votesPerPoint=1, maxPointsPerTask=10)

20. Add a crowd matching step to the entity resolution algorithm

val crowdMatcher = EntityResolution.createCrowdMatcher(scc, "name", "restaurant_working")
val crowdAlgorithm = EntityResolution.longAttributeCanonicalize(scc,"restaurant_working","name",0.6)

21. Run the crowd-driven entity resolution (creating crowd tasks): crowdAlgorithm.exec()
22. Do some crowd tasks (navigate your browser to
23. Persist the new results: scc.writeToParent("restaurant_working")
24. Count the number of distinct restaurants

scc.hql("select count(distinct name) from restaurant").collect().foreach(println)

25. Exit the Spark shell: exit

Cluster Installation

You can also run SampleClean on a Spark cluster using our provided scripts, which configure all necessary requirements. Note that you must have valid AWS credentials to start your cluster. See sampleclean-async/deploy/README to learn about deploying EC2 clusters for SampleClean. After starting the cluster, you can log in remotely and use SampleClean with Spark Submit or Spark Shell (similar to the local usage mode). Remember to load your datasets into HDFS using ephemeral or persistent storage before running your application.