Pain Points
- Not every team has the resources for double-blind annotation, so labeling quality often falls short.
- The annotation queue contains too few negative samples, and noise from random sampling is unavoidable.
- Bad cases look just like the labeled data, the metrics are a mess, and modeling alone can't fix it!
- Iterating by thresholding model scores is just too tedious!
Confident Learning
Confident Learning: Estimating Uncertainty in Dataset Labels
Confident learning (CL) is an alternative approach which focuses instead on label quality by characterizing and identifying label errors in datasets, based on principles of pruning noisy data, counting with probabilistic thresholds to estimate noise, and ranking examples to train with confidence.
Confident learning identifies label errors, characterizes label noise, and applies the results to learning with noisy labels, and it is backed by solid theory.
Advantages
Open-source toolkit: cleanlab (installable with pip install cleanlab)
See the official cleanlab documentation for details.
CL Methods
Count: Characterize and Find Label Errors using the Confident Joint
(Figure: algorithm procedure)
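As a rough illustration of the counting step: each class j gets a threshold t_j equal to the mean predicted probability of class j over examples whose noisy label is j, and the confident joint C[i][j] counts examples labeled i that the model confidently assigns to class j. A minimal numpy sketch of this idea (names like confident_joint, noisy_labels, and pred_probs are illustrative, not cleanlab's API):

import numpy as np

def confident_joint(noisy_labels, pred_probs):
    """Simplified sketch of the confident joint from the CL paper."""
    n, m = pred_probs.shape
    # Per-class threshold t_j: average self-confidence, i.e. the mean
    # p(label = j) over examples whose noisy label is j.
    thresholds = np.array([
        pred_probs[noisy_labels == j, j].mean() for j in range(m)
    ])
    C = np.zeros((m, m), dtype=int)
    for x in range(n):
        i = noisy_labels[x]
        # Classes whose predicted probability clears their threshold.
        above = np.where(pred_probs[x] >= thresholds)[0]
        if len(above) == 0:
            continue
        # If several classes clear their thresholds, keep the most probable.
        j = above[np.argmax(pred_probs[x, above])]
        C[i, j] += 1
    return C

Off-diagonal entries C[i][j] with i != j are the suspected label errors: examples labeled i that the model confidently believes belong to class j.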
Rank and Prune: Data Cleaning
(Figure: data pruning)
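Pruning removes the suspected off-diagonal examples, and ranking decides which suspects to distrust first; the 'normalized_margin' option used in the snippet below orders examples by p(given label) minus the largest other-class probability. A hedged sketch of that ranking (a simplified rendering, not cleanlab's internal code):

import numpy as np

def rank_by_normalized_margin(noisy_labels, pred_probs, candidate_idx):
    """Order suspected errors: the lower the margin
    p(given label) - max p(other label), the more suspicious.
    candidate_idx is a numpy integer array of suspected error indices,
    e.g. the off-diagonal examples from the confident joint."""
    given = pred_probs[candidate_idx, noisy_labels[candidate_idx]]
    others = pred_probs[candidate_idx].copy()
    others[np.arange(len(candidate_idx)), noisy_labels[candidate_idx]] = -np.inf
    margin = given - others.max(axis=1)
    return candidate_idx[np.argsort(margin)]  # most suspicious first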
Find mislabeled examples with one line of code
# Compute psx (n x m matrix of predicted probabilities)
# in your favorite framework on your own first, with any classifier.
# Be sure to compute psx in an out-of-sample way (e.g. cross-validation)
# Label errors are ordered by likelihood of being an error.
# First index in the output list is the most likely error.
from cleanlab.pruning import get_noise_indices
ordered_label_errors = get_noise_indices(
    s=numpy_array_of_noisy_labels,
    psx=numpy_array_of_predicted_probabilities,
    sorted_index_method='normalized_margin',  # Orders label errors
)
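For reference, a common way to produce the out-of-sample psx that the comments ask for is sklearn's cross_val_predict (variable names here match the snippet above and are illustrative):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Each row of psx is predicted by a model that never saw that example
# during training, which is what "out-of-sample" means here.
psx = cross_val_predict(
    LogisticRegression(max_iter=1000),
    X_train_data,
    numpy_array_of_noisy_labels,
    cv=5,
    method='predict_proba',
)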
Train with three lines of code
from cleanlab.classification import LearningWithNoisyLabels
from sklearn.linear_model import LogisticRegression
# Wrap around any classifier (scikit-learn, PyTorch, TensorFlow, FastText, etc.)
lnl = LearningWithNoisyLabels(clf=LogisticRegression())
lnl.fit(X=X_train_data, s=train_noisy_labels)
predicted_test_labels = lnl.predict(X_test)
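A self-contained toy run on synthetic data, flipping 20% of the training labels to see the wrapper in action (purely a sanity-check sketch; the data and numbers are made up):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from cleanlab.classification import LearningWithNoisyLabels

X, y = make_classification(n_samples=2000, n_classes=3,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Inject 20% uniform label noise into the training labels only.
rng = np.random.RandomState(0)
noisy = y_train.copy()
flip = rng.rand(len(noisy)) < 0.2
noisy[flip] = rng.randint(0, 3, size=flip.sum())

lnl = LearningWithNoisyLabels(clf=LogisticRegression(max_iter=1000))
lnl.fit(X=X_train, s=noisy)
pred = lnl.predict(X_test)
print('accuracy on clean test labels:', (pred == y_test).mean())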
Summary
- In real-world projects, removing noise from sample labels is of great value to supervised learning.
- The theory is sound, so it can be used with confidence; it is backed by an open-source toolkit, and integration is lightweight.
- Light and practical: used correctly, it makes noisy label learning easy to work with.