Crowdsourced data labeling with AWS SageMaker Ground Truth


What is data labeling needed for?

Once you have an initial idea of what to solve as a machine learning problem, you are going to need high-quality training data. Generating raw data is fairly straightforward, but transforming that raw data into useful, meaningful, and machine-interpretable training data requires human effort. For images, data labels are named areas of the image that contain specific objects.

Computer vision can also detect objects in images, and defining a reasonable set of object types to detect can improve machine labeling accuracy. Machine labeling with improper object types produces low-quality results. An extreme form of this has been made into art by DeepDream, which created nightmarish renditions of dogs from images of jellyfish, for example.

For our purposes, we use human labelers to create the data labels. These labels will then be used to train a machine learning model to classify similar situations.

How does Ground Truth help?

Ground Truth is a crowdsourcing service for running labeling jobs, provided as part of AWS SageMaker, a machine learning service suite. It is a platform for setting up, running, and staffing your labeling jobs online. You can either have the labeling jobs worked on by your own staff, or use Amazon Mechanical Turk to find workers to label for you.

When using Mechanical Turk workers as the workforce, you set the task pricing. There is a recommended price for a given estimate of the time required to complete a single task. Ground Truth offers a set of image labeling task types, each with a selection of durations. As with Mechanical Turk in general, a higher price relative to task duration makes the task more attractive for Turkers to complete early.

The price for labeling the data set is the product of the image set size, the price per task, and the number of workers per task. Ground Truth gives the same labeling task to several human labelers to improve accuracy and to catch outlier results from a single labeler.
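As a quick sketch, the total cost works out like this (the per-task price below is purely illustrative, not a current AWS rate):

```python
def labeling_cost(num_images, price_per_task, workers_per_task):
    """Total cost of a labeling job: every image becomes one task,
    and each task is completed by several independent workers."""
    return num_images * price_per_task * workers_per_task

# e.g. 1,000 images, an illustrative $0.036 per task, 5 workers per task
print(labeling_cost(1000, 0.036, 5))
```

Because the worker count multiplies the whole bill, the redundancy level is the easiest knob to turn when balancing cost against label quality.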

With a good instruction set, a reasonably sized data set could be labeled in a day with very little organizing required, leaving you free to focus on the ML problem.

How do you set it up technically?

As the training data is viewed by unknown parties and stored by AWS, any sensitive content in the training images needs to be anonymized. In my prototype case, I preprocessed the training images, removing all EXIF metadata and renaming the images to remove the timestamp from the name.
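The renaming step can be sketched like this; the filename pattern is a hypothetical camera default, and stripping the EXIF metadata itself would additionally need an imaging library such as Pillow:

```python
import re
import uuid

def anonymize_name(filename):
    """Replace a timestamped camera filename (e.g. 'IMG_20190412_101500.jpg',
    a hypothetical pattern) with a random, non-identifying name,
    keeping only the file extension."""
    ext = filename.rsplit(".", 1)[-1]
    return f"{uuid.uuid4().hex}.{ext}"

print(anonymize_name("IMG_20190412_101500.jpg"))
```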

The training images need to be stored in S3. As usual, the S3 bucket should be limited to as little visibility and as few permissions as possible. Trial, error, and tutorial resources showed that at least the following S3 bucket settings are needed.

  • Public ACL access blocking flags must be disabled. Ground Truth presumably works by modifying the ACL as required to grant specific Turkers’ Cognito users access.
  • Public access can still be blocked via the bucket policy, and there is no reason not to.
  • The bucket must be in the same region as the Ground Truth labeling job.
  • Task instruction media files need to be publicly available.
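In CloudFormation terms, the first two bullets can be sketched on the bucket resource like this (a sketch only; the resource name is a placeholder, and the property names follow the AWS::S3::Bucket PublicAccessBlockConfiguration):

```yaml
TrainingDataBucket:
  Type: "AWS::S3::Bucket"
  Properties:
    PublicAccessBlockConfiguration:
      # ACL-based public access blocking disabled, so Ground Truth
      # can grant individual workers access via ACLs
      BlockPublicAcls: false
      IgnorePublicAcls: false
      # Policy-based public access blocking can stay enabled
      BlockPublicPolicy: true
      RestrictPublicBuckets: true
```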

The CloudFormation YAML template for the Ground Truth role policy allows the Cognito actions in addition to access to the training files.

- Sid: "sagemakergroundtruth"
  Effect: "Allow"
  Action:
    - "cognito-idp:CreateGroup"
    - "cognito-idp:CreateUserPool"
    - "cognito-idp:CreateUserPoolDomain"
    - "cognito-idp:AdminCreateUser"
    - "cognito-idp:CreateUserPoolClient"
    - "cognito-idp:AdminAddUserToGroup"
    - "cognito-idp:DescribeUserPoolClient"
    - "cognito-idp:DescribeUserPool"
    - "cognito-idp:UpdateUserPool"
  Resource: "*"
- Sid: "listTrainingbucket"
  Effect: "Allow"
  Action:
    - "s3:ListBucket"
  Resource: !Sub "arn:aws:s3:::${TrainingDataBucketName}"
- Sid: "readTrainingbucket"
  Effect: "Allow"
  Action:
    - "s3:GetObject"
    - "s3:PutObject"
  Resource: !Sub "arn:aws:s3:::${TrainingDataBucketName}/*"

A manifest file listing all the training images is used as input to the labeling job. In its simplest form, this is a list of S3 objects.
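Generating the manifest can be sketched as follows; each line is a JSON object pointing at one image via the "source-ref" key, and the bucket and image names here are hypothetical:

```python
import json

def build_manifest(bucket, keys):
    """Build a Ground Truth input manifest: one JSON object per line,
    each referencing a training image through the "source-ref" key."""
    lines = [json.dumps({"source-ref": f"s3://{bucket}/{key}"}) for key in keys]
    return "\n".join(lines)

# Hypothetical bucket and image names for illustration
print(build_manifest("my-training-data", ["img-001.jpg", "img-002.jpg"]))
```

The resulting text file is uploaded to S3 and its location is given to the labeling job as the input data source.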

Labeling job instructions

The hardest part of task creation is determining the best labeling criteria, and writing clear, unambiguous instructions on how to meet those criteria. The image above was the result of a multi-class labeling job. The possible labels were “sidewalk”, “bike path”, “car on road” and “car on sidewalk”. As seen from the overlapping green and orange bounding boxes, my instructions were unclear and led to the car being labeled both “on road” and “on sidewalk”.

Instructions should describe both what counts as a good object to label and how the bounding box should be fitted around the object. Partially visible objects are another issue for instructions. My task required labelers to label the sidewalk in an image, which is always only partially visible.

In many labeling cases it might be useful to run several tasks, each labeling a single class.

Results for the example task and further development

The output from this labeling work is a JSON list of labels with confidence ratings. Confidence is calculated from how well the labeling results for an image agree across several labelers’ outputs. Labeling confidence for many objects is low, because the street environment is cluttered with lots of objects. I assume this street imagery contains too many unclear objects and street areas with very subjective boundaries.
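A simplified, illustrative sketch of working with that output: the real output manifest pairs each label attribute with a metadata entry carrying per-object confidence scores, and the structure below is trimmed down to just what the filter needs.

```python
def low_confidence_objects(metadata, threshold=0.5):
    """Return indices of labeled objects whose confidence is below threshold."""
    return [i for i, obj in enumerate(metadata["objects"])
            if obj["confidence"] < threshold]

# Illustrative metadata, shaped like a bounding-box label-attribute
# metadata entry with per-object confidences
example = {
    "objects": [{"confidence": 0.92}, {"confidence": 0.31}, {"confidence": 0.18}],
    "class-map": {"0": "sidewalk", "1": "car on road"},
    "type": "groundtruth/object-detection",
}
print(low_confidence_objects(example))  # → [1, 2]
```

Boxes flagged this way are candidates for re-labeling or for tightening the instructions.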

Sidewalks in particular are labeled with low confidence. This could be improved by using semantic segmentation, i.e. pixel areas, to label exact regions instead.

Labeling cars is also a common task, and the COCO dataset already contains a large number of images with cars in them. Having humans label the mere existence of cars in an image set is a waste of resources.

Once an initial model for detecting interesting features in images is in place, human processing tasks become more valuable for validating the results of ML classification tasks.

Jukka Dahlbom
Head of Data Engineering, Co-founder. As a counterbalance to knowledge work, I hike, make music, dance, and play games.