Recently, a paper titled “Data Poisoning Attacks and Defenses to Crowdsourcing Systems” was published on arXiv, in which the authors analyze data poisoning attacks on crowdsourced data labeling. Here is a summary.
For classification tasks such as image classification, large, high-quality labeled datasets are required to build machine learning models that achieve state-of-the-art performance. However, creating these datasets is often challenging: in many situations only unlabeled data is available, and labeling millions or even billions of items would require enormous manual effort involving thousands of people.
Labeling Datasets via Crowdsourcing
One option to obtain labels for an unlabeled dataset is crowdsourcing. Here, the labeling is done by a crowd that might consist of thousands or tens of thousands of individuals, also known as workers. Each worker receives a small subset of the items in the dataset and assigns labels to them. When a worker has labeled all of their items, they send the results back to the server, which aggregates the results of all workers.
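To make the aggregation step concrete, here is a minimal sketch of one common baseline, majority voting, where the server simply takes the label that most workers assigned to each item. The function name and data layout below are illustrative assumptions, not taken from the paper, which considers crowdsourcing systems and aggregation more generally.

```python
from collections import Counter, defaultdict

def aggregate_majority_vote(worker_labels):
    """Aggregate noisy worker labels per item via simple majority voting.

    worker_labels: iterable of (worker_id, item_id, label) tuples,
    as collected by the server from all workers.
    Returns a dict mapping item_id -> most frequently assigned label.
    """
    votes = defaultdict(Counter)
    for worker_id, item_id, label in worker_labels:
        votes[item_id][label] += 1
    # For each item, pick the label with the most votes (ties broken arbitrarily).
    return {item_id: counts.most_common(1)[0][0]
            for item_id, counts in votes.items()}

# Hypothetical example: three workers label two images; "w3" mislabels "img_2".
labels = [
    ("w1", "img_1", "cat"), ("w2", "img_1", "cat"), ("w3", "img_1", "cat"),
    ("w1", "img_2", "dog"), ("w2", "img_2", "dog"), ("w3", "img_2", "cat"),
]
print(aggregate_majority_vote(labels))  # {'img_1': 'cat', 'img_2': 'dog'}
```

Majority voting tolerates occasional worker mistakes as long as most workers answer honestly, which is exactly the assumption that data poisoning attacks try to break.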
The results are usually noisy and unreliable, as each worker might make mistakes and…