Ad Click Prediction

An advertising system matches ads to user traffic.

1. requirements

Scenario questions

We have a bidding server that places bids and produces logs. We also have information about impressions and conversions (usually delayed). We want a model that uses this data to predict the probability of a click (or conversion).
What types of ads are we predicting clicks for (display ads, video ads, search ads, sponsored content)?
Are there specific user segments or contexts we should consider (demographics, location, browsing history)?
Do we have a fatigue period (where an ad is no longer shown, for X days, to users who showed no interest)?
What type of user-ad interaction data do we have access to, and can we use it for training our models?
Do we have negative feedback features (such as hide ad, block)?
How do we collect negative samples (not clicked, negative feedback)?

Functional questions

objective

primary business objective: maximize revenue
How will we define and measure the success of click predictions (click-through rate, conversion rate)?
personalization, diversity, handling explicit negative feedback

constraint

scale: number of users
latency: 50ms to 100ms
Imbalanced Data: Click events are sparse relative to impressions, requiring techniques to address imbalance.
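
One common way to handle this imbalance is to downsample negatives for training and then map the predicted probabilities back to the original scale. The notes do not say which technique is intended, so the following is only a minimal sketch: it assumes each row is a dict with a `label` key, and `downsample_negatives`, `recalibrate`, and `keep_rate` are illustrative names.

```python
import random


def downsample_negatives(rows, keep_rate, seed=0):
    """Keep every positive (label 1) and each negative with probability keep_rate."""
    rng = random.Random(seed)
    return [r for r in rows if r["label"] == 1 or rng.random() < keep_rate]


def recalibrate(p, keep_rate):
    """Map a probability learned on downsampled data back to the original scale.

    If negatives were kept with probability keep_rate, the corrected
    probability is p / (p + (1 - p) / keep_rate).
    """
    return p / (p + (1.0 - p) / keep_rate)
```

For example, a model trained with `keep_rate=0.1` that outputs 0.5 corresponds to roughly 0.09 on the original traffic, since only one negative in ten was kept.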

Business flow

advertisers create ads
ads indexing (inverted index, we can use Elasticsearch)
- how to reduce ad-indexing latency: inverted index + DB replica + cache
users search for certain keywords
recall
ranking
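
The indexing and recall steps above can be sketched with a toy in-memory inverted index. This is an illustration only, not how Elasticsearch is implemented; the ad ids, keywords, and function names are made up for the example.

```python
from collections import defaultdict


def build_inverted_index(ads):
    """Map each lowercase keyword to the set of ad ids that target it."""
    index = defaultdict(set)
    for ad_id, keywords in ads.items():
        for kw in keywords:
            index[kw.lower()].add(ad_id)
    return index


def recall(index, query):
    """Return ids of ads matching at least one query term (the recall stage)."""
    matched = set()
    for term in query.lower().split():
        matched |= index.get(term, set())
    return matched


# Hypothetical ads and their targeted keywords.
ads = {"ad1": ["running", "shoes"], "ad2": ["trail", "shoes"], "ad3": ["laptop"]}
index = build_inverted_index(ads)
print(sorted(recall(index, "trail running shoes")))  # prints ['ad1', 'ad2']
```

The candidates returned by `recall` would then be passed to the ranking model, which scores each one with a predicted click probability.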

Characteristics

Data Sources

It is important for CTR prediction to learn implicit feature interactions behind user click behaviors.
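
One widely used way to model such feature interactions is a factorization machine (FM), where each feature gets a latent vector and every pair of active features contributes the dot product of their vectors. The notes do not name a specific model, so FM is used here only as an example; all names and parameter shapes below are assumptions.

```python
import math


def fm_predict(x, w0, w, v):
    """Factorization-machine click probability for a sparse input.

    x:  {feature_index: value} for the active features
    w0: global bias, w: list of linear weights
    v:  list of latent vectors (one list of length k per feature)
    """
    linear = w0 + sum(w[i] * xi for i, xi in x.items())
    k = len(v[0])
    pairwise = 0.0
    for f in range(k):
        # Identity: sum_{i<j} v_if v_jf x_i x_j
        #         = 0.5 * ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2)
        s = sum(v[i][f] * xi for i, xi in x.items())
        s_sq = sum((v[i][f] * xi) ** 2 for i, xi in x.items())
        pairwise += 0.5 * (s * s - s_sq)
    return 1.0 / (1.0 + math.exp(-(linear + pairwise)))
```

Because the interaction weight of a pair is a dot product of learned vectors, the model can score feature pairs that never co-occurred in training, which matters for sparse click logs.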

Contextual Features

User-Ad Interaction Features

Mainstream ad models (ad models are almost always trained point-wise, because ads are rarely presented consecutively as a list)
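
Point-wise training means each impression is scored and penalized on its own, typically with log loss on the click label, without comparing it to other candidates. A minimal sketch using logistic regression with one SGD step per impression; the feature names, learning rate, and function names are illustrative assumptions.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def log_loss(y, p, eps=1e-12):
    """Binary cross-entropy for a single impression."""
    p = min(max(p, eps), 1.0 - eps)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


def sgd_step(weights, features, label, lr=0.1):
    """One point-wise update: the gradient depends on this impression only.

    Returns the loss measured before the update.
    """
    p = sigmoid(sum(weights.get(f, 0.0) * x for f, x in features.items()))
    for f, x in features.items():
        weights[f] = weights.get(f, 0.0) - lr * (p - label) * x
    return log_loss(label, p)
```

A pair-wise or list-wise objective would instead need several candidates shown together for the same request, which is the situation the note says is rare for ads.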

Offline metrics

Online metrics

A/B testing
Real-Time Inference
- low-latency serving framework (e.g., TensorFlow Serving) to generate predictions within 50-100ms
- Cache frequently requested user-ad pairs to reduce latency
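
The caching idea above can be sketched as a small in-process cache with a time-to-live, so that stale predictions expire. This is an assumption-laden illustration: `TTLCache`, `get_or_compute`, and `predict_fn` are hypothetical names, and a production system would more likely use a shared cache with eviction.

```python
import time


class TTLCache:
    """Tiny in-process cache for (user_id, ad_id) -> predicted CTR."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so tests can control time
        self._store = {}

    def get_or_compute(self, user_id, ad_id, predict_fn):
        key = (user_id, ad_id)
        now = self.clock()
        hit = self._store.get(key)
        if hit is not None and now - hit[1] < self.ttl:
            return hit[0]  # fresh entry: skip the model call
        score = predict_fn(user_id, ad_id)
        self._store[key] = (score, now)
        return score
```

The TTL is a trade-off: a longer value saves more model calls but serves predictions that ignore the user's most recent behavior. This sketch never evicts entries, so memory grows with the number of distinct pairs.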

Detecting Issues

Continuous Improvement

bad ads
- focus on the data source (human labeling) and on the small amount of labeled data
- fine-tune an LLM as the teacher, run bulk inference with the teacher, then distill into a student model
calibration:
- adjusting predicted probabilities so that they align with observed click rates
data leakage:
- info from the test or eval dataset influences the training process
- target leakage, data contamination (from test to train set)
catastrophic forgetting
- model trained on new data loses its ability to perform well on previously learned tasks
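impact of regulations such as GDPR and DMA on advertising
uplift: predict the incremental value (the lift), i.e. the causal effect of an intervention on an individual's state or behavior (used to identify users who are responsive to marketing).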

Lift = P(buy|treatment) - P(buy|no treatment)
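
The lift formula can be estimated directly from experiment data by comparing conversion rates between treated and control users. A minimal sketch, assuming treatment was assigned at random (otherwise the difference is not a causal effect); the row layout and function names are illustrative.

```python
def conversion_rate(outcomes):
    """Fraction of 0/1 outcomes that are 1."""
    return sum(outcomes) / len(outcomes)


def lift(treated, control):
    """Lift = P(buy | treatment) - P(buy | no treatment), from 0/1 outcomes."""
    return conversion_rate(treated) - conversion_rate(control)


def lift_by_segment(rows):
    """rows: iterable of (segment, treated_flag, bought_flag).

    Returns {segment: lift}, skipping segments missing either group.
    """
    groups = {}
    for segment, treated, bought in rows:
        groups.setdefault(segment, {True: [], False: []})[bool(treated)].append(bought)
    return {s: lift(g[True], g[False]) for s, g in groups.items() if g[True] and g[False]}
```

Segments with a large positive lift are the marketing-sensitive users; segments whose lift is near zero would buy (or not buy) regardless of the ad.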

Close reading

Further reading
