Classifying unsafe ad content has proven an enticing problem space for leveraging large language models (LLMs). The inherent complexity involved in identifying policy-violating content demands solutions capable of deep contextual and cultural understanding, areas of relative strength for LLMs over traditional machine learning systems. But fine-tuning LLMs for such complex tasks requires high-fidelity training data that is difficult and expensive to curate at the necessary quality and scale. Standard data-intensive approaches to training models are costly, especially given the need to handle concept drift as safety policies evolve or as new types of unsafe ad content arise. In the worst case, the model must be retrained on a completely new data set. Reducing the amount of training data needed is therefore paramount.
With this in mind, we describe a new, scalable curation process for active learning that can drastically reduce the amount of training data needed for fine-tuning LLMs while significantly improving model alignment with human experts. The process can be applied to datasets of hundreds of billions of examples to iteratively identify the examples for which annotation would be most valuable and then use the resulting expert labels for fine-tuning.
In our experiments, we were able to reduce the amount of training data needed from 100,000 examples to fewer than 500, while increasing model alignment with human experts by up to 65%. Production systems using larger models have seen even greater reductions in data scale, using up to four orders of magnitude less data while maintaining or improving quality.
Our process starts with a zero- or few-shot initial model (LLM-0), which we provide with a prompt describing the content of interest, e.g., defining clickbait and asking “Is this ad clickbait?” The LLM-0 model then labels ads as clickbait (orange in the figure below) or benign (blue) and generates a large labeled data set, shown as (1) below. Note that this initial data set is typically highly imbalanced, since fewer than 1% of ads in production traffic are actually clickbait. The LLM’s true positive rate is also low because it has not yet been fine-tuned. To find the most informative examples, we separately cluster the examples labeled clickbait and the examples labeled benign, which yields some overlapping clusters, indicating potential model confusion between clickbait and benign examples (2). For each such overlapping cluster pair, we find the pairs of nearest examples that carry different labels (3) and send these to human experts for an opinion. If needed to stay within our review budget, we prioritize pairs of examples that cover a larger area of our search space (4). The resulting curated set is both informative (since it contains the most confusable examples along the decision boundary) and diverse (since it draws from different regions along that decision boundary).
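For concreteness, the sketch below walks through one such curation round in Python. The embedding function, the LLM-0 labeler, the clustering granularity, and the overlap threshold are illustrative assumptions rather than the production implementation.

```python
# Minimal sketch of one curation round (hypothetical helper names, not production code).
# Assumes `embed` maps an ad to a dense vector and `llm0_label` returns
# "clickbait" or "benign" for a given ad.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances

def curate_round(ads, embed, llm0_label, n_clusters=50, pairs_per_overlap=5):
    X = np.array([embed(ad) for ad in ads])
    y = np.array([llm0_label(ad) for ad in ads])
    pos, neg = np.where(y == "clickbait")[0], np.where(y == "benign")[0]

    # (2) Cluster the two label groups separately.
    pos_km = KMeans(n_clusters=min(n_clusters, len(pos))).fit(X[pos])
    neg_km = KMeans(n_clusters=min(n_clusters, len(neg))).fit(X[neg])

    # Treat cluster pairs whose centroids lie close together as "overlapping",
    # i.e. regions where the model's clickbait and benign labels collide.
    centroid_dist = pairwise_distances(pos_km.cluster_centers_, neg_km.cluster_centers_)
    overlap_threshold = np.percentile(centroid_dist, 5)  # assumption: keep the closest 5%

    review_pairs = []
    for i, j in zip(*np.where(centroid_dist <= overlap_threshold)):
        pos_members = pos[pos_km.labels_ == i]
        neg_members = neg[neg_km.labels_ == j]
        # (3) Nearest cross-label pairs inside the overlapping region.
        d = pairwise_distances(X[pos_members], X[neg_members])
        closest = np.dstack(np.unravel_index(np.argsort(d, axis=None), d.shape))[0]
        for a, b in closest[:pairs_per_overlap]:
            review_pairs.append((ads[pos_members[a]], ads[neg_members[b]]))

    # (4) A real system would further prioritize pairs by how much of the
    # embedding space they cover before sending them to expert reviewers.
    return review_pairs
```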
These expert-provided labels are split randomly into two sets. The first is used for model evaluation, based on two key alignment metrics: internal alignment, which measures how well the experts agree with one another, and model–human alignment, which measures agreement between the current model and the human experts. The second is used to fine-tune the current model, producing the next iteration of the model. The process repeats until the model–human alignment either matches the internal alignment or plateaus and cannot be improved further.
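A rough sketch of that outer loop, continuing the assumptions from the previous snippet, might look as follows; `expert_label`, `fine_tune`, and `average_pairwise_kappa` are hypothetical stand-ins for the expert review workflow, the fine-tuning pipeline, and the internal-alignment computation described below, and the 50/50 split ratio is an illustrative choice.

```python
import random
from sklearn.metrics import cohen_kappa_score

def run_curation(model, ads, embed, experts, max_iters=10, plateau_eps=0.01):
    prev_alignment = -1.0
    for _ in range(max_iters):
        pairs = curate_round(ads, embed, model.label)  # most confusable cross-label pairs
        labeled = [(ex, expert_label(experts, ex)) for p in pairs for ex in p]
        random.shuffle(labeled)

        # Random split into an evaluation set and a fine-tuning set.
        eval_set, tune_set = labeled[: len(labeled) // 2], labeled[len(labeled) // 2 :]

        # Internal alignment: how well the experts agree with each other.
        internal = average_pairwise_kappa(experts, [ex for ex, _ in eval_set])
        # Model-human alignment: current model vs. the expert labels.
        alignment = cohen_kappa_score(
            [model.label(ex) for ex, _ in eval_set],
            [label for _, label in eval_set],
        )
        # Stop once the model matches the experts' own agreement, or plateaus.
        if alignment >= internal or alignment - prev_alignment < plateau_eps:
            break
        prev_alignment = alignment

        model = fine_tune(model, tune_set)  # next model iteration
    return model
```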
Our curation process does not assume the existence of ground truth. Many classification problems in the ads safety space, such as content moderation or fraud detection, are inherently ambiguous and require interpretation and deliberation, even among policy experts. We therefore cannot rely on standard metrics like precision and recall, which require a ground truth label. Instead, we use Cohen’s Kappa, a measure of how well two independent annotators align beyond what would be expected from chance agreement. In our experiments, Cohen’s Kappa serves both as a quality indicator for datasets (including model evaluation during the curation process, as noted above) and as a measure of model performance. Values closer to 1 show higher alignment, 0 indicates no alignment above chance, and negative values indicate systematic disagreement. While standards for interpreting these scores vary, Kappa values above .8 are widely considered exceptionally good, and values above .4 are generally considered acceptable.
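As a quick illustration of the metric, here is a small, self-contained computation of Cohen's Kappa on a made-up pair of annotators; scikit-learn's `cohen_kappa_score` computes the same quantity.

```python
# Cohen's Kappa: agreement between two annotators, corrected for chance.
# kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
# p_e is the agreement expected if each annotator labeled at random
# according to their own label frequencies.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Toy example: two annotators that mostly, but not always, agree.
a = ["benign", "benign", "clickbait", "benign", "clickbait", "benign"]
b = ["benign", "clickbait", "clickbait", "benign", "clickbait", "benign"]
print(cohens_kappa(a, b))  # ~0.67: substantial, but not perfect, alignment
```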
We wanted to understand which models and tasks would benefit most from our curation process. As baselines for our experiments, we fine-tuned two LLMs of different sizes (Gemini Nano-1 with 1.8B parameters and Nano-2 with 3.25B parameters) on two tasks of different complexity (lower and higher, based on expert alignment) using crowdsourced labels. Each crowdsourced data set has ~100K annotations and a strong class imbalance, with around 95% benign labels on average.
We compared each of these four baseline conditions against the corresponding curated condition in which each model (Nano-1 and Nano-2) is fine-tuned over multiple rounds using the curation process described above. At each iteration, we selected our curated set of examples and used them for model evaluation and fine-tuning, as described above. All models plateaued before reaching parity with the experts’ internal alignment, so we stopped at 6 iterations (~400 fine-tuning and ~250 evaluation samples) for the lower complexity task and 5 iterations (~250 fine-tuning and ~150 evaluation samples) for the higher complexity task. (Note that the lower complexity task had a larger variety of examples, which may account for the longer time needed to converge.) Both data sets had a final class balance of ~40% positive examples.
The table below provides an overview of the scale and quality of the data used in each condition. Experts reached an average pairwise Cohen’s Kappa of .81 (on the lower complexity task) and .78 (on the higher complexity task) through the curation process. We consider these the ceiling for model performance. To assess the quality of our crowdsourced data, we calculated Kappa alignment between crowdsourced annotations and experts based on our full curated set, which was .59 (lower complexity) and .41 (higher complexity).
Below we show how models trained on these vastly different data sets performed in each of our baseline and curated conditions. The 1.8B parameter model saw comparable performance on both tasks: the baseline and curated models had .24 and .25 alignment, respectively, for the lower complexity task, and both models had .13 alignment on the higher complexity task. By contrast, the 3.25B parameter model showed significant quality improvements when trained with our curation process. Kappa scores for the baseline and curated models were .36 and .56, respectively, for the lower complexity task; and .23 and .38, respectively, for the higher complexity task — an improvement in alignment of 55-65% using three orders of magnitude less data (250 to 450 examples, compared to 100K in the baseline condition).
These results demonstrate that careful curation of LLM datasets to focus on fewer, more informative examples can yield better or equivalent classifier performance using much less data: three orders of magnitude less in the experiments reported here, and up to four orders of magnitude less for the larger models used in production. Of course, these gains require not only good curation but also very high-quality data. For our use cases, we have observed that a label quality above .8 pairwise Cohen’s Kappa is needed to reliably outperform crowdsourced data. Consistently achieving this level of quality poses a separate challenge, to be discussed in a subsequent blog post.
But given sufficient label quality, our curation process is able to leverage the strengths of both LLMs, which can cast a wide net over the problem space, and domain experts, who can focus more efficiently on the most challenging examples. The ability to retrain models with just a handful of examples is especially valuable for handling the rapidly changing landscapes of domains like ads safety. We believe the approach we’ve described will enable systems that can make more flexible, efficient use of high-fidelity labels to escape the data bottleneck.
This work would not have been possible without our outstanding team of engineers and product managers. Steve Walker is a co-founder of our project and co-creator of the curation process as well as the tech lead for the machine learning infrastructure of our project. Kelsie McElroy is the project manager and a co-founder of our project. We also want to thank the Ads Privacy and Safety leadership team for their continued support and belief in our vision.