Pain Points
- Not every team has the resources for double-blind annotation, so labeling quality often falls short.
- The annotation queue contains too few negative samples, and noise from random sampling is unavoidable.
- Bad cases look just like the labeled data, the metrics are a mess, and modeling alone can't fix it!
- Iterating by thresholding model scores is just too tedious!
Confident Learning
Confident Learning: Estimating Uncertainty in Dataset Labels
Confident learning (CL) is an alternative approach which focuses instead on label quality by characterizing and identifying label errors in datasets, based on principles of pruning noisy data, counting with probabilistic thresholds to estimate noise, and ranking examples to train with confidence.
Confident learning identifies label errors, characterizes label noise, and applies the results to learning with noisy labels, and it is backed by solid theory.
Advantages
Open-source toolkit: cleanlab (installable with pip install cleanlab)
See the official cleanlab documentation for details.
CL Methods
Count: Characterize and Find Label Errors using the Confident Joint
(Figure: algorithm procedure)
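As a rough illustration of the counting step: each class j gets a threshold t_j equal to the mean predicted probability of class j over examples whose noisy label is j, and the confident joint C[i][j] counts examples labeled i that the model confidently assigns to class j. A minimal numpy sketch of this idea (names like confident_joint, noisy_labels, and pred_probs are illustrative, not cleanlab's API):

import numpy as np

def confident_joint(noisy_labels, pred_probs):
    """Simplified sketch of the confident joint from the CL paper."""
    n, m = pred_probs.shape
    # Per-class threshold t_j: average self-confidence, i.e. the mean
    # p(label = j) over examples whose noisy label is j.
    thresholds = np.array([
        pred_probs[noisy_labels == j, j].mean() for j in range(m)
    ])
    C = np.zeros((m, m), dtype=int)
    for x in range(n):
        i = noisy_labels[x]
        # Classes whose predicted probability clears their threshold.
        above = np.where(pred_probs[x] >= thresholds)[0]
        if len(above) == 0:
            continue
        # If several classes clear their thresholds, keep the most probable.
        j = above[np.argmax(pred_probs[x, above])]
        C[i, j] += 1
    return C

Off-diagonal entries C[i][j] with i != j are the suspected label errors: examples labeled i that the model confidently believes belong to class j.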
Rank and Prune: Data Cleaning
(Figure: data pruning)
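Pruning removes the suspected off-diagonal examples, and ranking decides which suspects to distrust first; the 'normalized_margin' option used in the snippet below orders examples by p(given label) minus the largest other-class probability. A hedged sketch of that ranking (a simplified rendering, not cleanlab's internal code):

import numpy as np

def rank_by_normalized_margin(noisy_labels, pred_probs, candidate_idx):
    """Order suspected errors: the lower the margin
    p(given label) - max p(other label), the more suspicious.
    candidate_idx is a numpy integer array of suspected error indices,
    e.g. the off-diagonal examples from the confident joint."""
    given = pred_probs[candidate_idx, noisy_labels[candidate_idx]]
    others = pred_probs[candidate_idx].copy()
    others[np.arange(len(candidate_idx)), noisy_labels[candidate_idx]] = -np.inf
    margin = given - others.max(axis=1)
    return candidate_idx[np.argsort(margin)]  # most suspicious first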
Find mislabeled examples with one line of code
# Compute psx (n x m matrix of predicted probabilities)
# in your favorite framework on your own first, with any classifier.
# Be sure to compute psx in an out-of-sample way (e.g. cross-validation)
# Label errors are ordered by likelihood of being an error.
# First index in the output list is the most likely error.
from cleanlab.pruning import get_noise_indices
ordered_label_errors = get_noise_indices(
    s=numpy_array_of_noisy_labels,
    psx=numpy_array_of_predicted_probabilities,
    sorted_index_method='normalized_margin',  # Orders label errors
)
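For reference, a common way to produce the out-of-sample psx that the comments ask for is sklearn's cross_val_predict (variable names here match the snippet above and are illustrative):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Each row of psx is predicted by a model that never saw that example
# during training, which is what "out-of-sample" means here.
psx = cross_val_predict(
    LogisticRegression(max_iter=1000),
    X_train_data,
    numpy_array_of_noisy_labels,
    cv=5,
    method='predict_proba',
)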
Train with three lines of code
from cleanlab.classification import LearningWithNoisyLabels
from sklearn.linear_model import LogisticRegression
# Wrap around any classifier (scikit-learn, PyTorch, TensorFlow, FastText, etc.)
lnl = LearningWithNoisyLabels(clf=LogisticRegression())
lnl.fit(X=X_train_data, s=train_noisy_labels)
predicted_test_labels = lnl.predict(X_test)
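A self-contained toy run on synthetic data, flipping 20% of the training labels to see the wrapper in action (purely a sanity-check sketch; the data and numbers are made up):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from cleanlab.classification import LearningWithNoisyLabels

X, y = make_classification(n_samples=2000, n_classes=3,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Inject 20% uniform label noise into the training labels only.
rng = np.random.RandomState(0)
noisy = y_train.copy()
flip = rng.rand(len(noisy)) < 0.2
noisy[flip] = rng.randint(0, 3, size=flip.sum())

lnl = LearningWithNoisyLabels(clf=LogisticRegression(max_iter=1000))
lnl.fit(X=X_train, s=noisy)
pred = lnl.predict(X_test)
print('accuracy on clean test labels:', (pred == y_test).mean())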
Summary
- In real-world projects, removing noise from sample labels is of great value to supervised learning.
- The theory is sound, so it can be used with confidence; it is backed by an open-source toolkit, and integration is lightweight.
- Light and practical: used correctly, it makes noisy label learning easy to work with.