ALiPy: Active Learning in Python

Supervised machine learning methods usually require a large set of labeled examples for model training. However, in many real applications there are plentiful unlabeled data but limited labeled data, and the acquisition of labels is costly. Active learning (AL) reduces the labeling cost by iteratively selecting the most valuable data and querying their labels from the annotator. This article introduces ALiPy, a Python toolbox for active learning. ALiPy provides a module-based implementation of the active learning framework, which allows users to conveniently evaluate, compare, and analyze the performance of active learning methods. In the toolbox, multiple options are available for each component of the learning framework, including data processing, active selection, label querying, and results visualization. In addition to implementations of more than 20 state-of-the-art active learning algorithms, ALiPy also allows users to easily configure and implement their own approaches under different active learning settings, such as AL for multi-label data, AL with noisy annotators, and AL with different costs. The toolbox is well documented, open-sourced on GitHub, and can be easily installed through PyPI.


1 Introduction

Active learning is a primary approach to learning with limited labeled data. It reduces the human effort spent on data annotation by actively querying the most important examples (Settles (2009)).

ALiPy is a Python toolbox for active learning that is suitable for various users. On one hand, the whole active learning process is fully implemented: users can perform experiments with just a few lines of code, covering everything from data pre-processing to result visualization. More than 20 commonly used active learning methods are implemented in the toolbox, providing users with many choices. Table 1 summarizes the main approaches implemented in ALiPy. On the other hand, ALiPy gives users great freedom to implement their own ideas about active learning. By decomposing the active learning process into multiple components and implementing each of them as a separate module, ALiPy is designed in a loosely coupled way, letting users freely configure and modify any part of the process. Furthermore, in addition to the traditional active learning setting, ALiPy also supports other novel settings: for example, the data examples may be multi-labeled, the oracle may be noisy, and the annotation may be cost-sensitive.
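The iterative select-query-retrain process described above can be sketched as a plain uncertainty-sampling loop. The snippet below is a minimal, self-contained illustration in pure Python on a toy 1-D task; the `train` and `uncertainty` functions and all names are hypothetical and do not come from ALiPy's API:

```python
# Toy pool: 1-D points with a hidden decision boundary at 0.5
# (labels are unknown to the learner until queried).
pool = [i / 100 for i in range(100)]
true_label = lambda x: int(x > 0.5)     # the oracle's ground truth

labeled = {0.0: 0, 0.99: 1}             # small initial labeled set
unlabeled = [x for x in pool if x not in labeled]

def train(labeled):
    """Fit a threshold classifier: midpoint between the largest known 0 and smallest known 1."""
    zeros = [x for x, y in labeled.items() if y == 0]
    ones = [x for x, y in labeled.items() if y == 1]
    return (max(zeros) + min(ones)) / 2

def uncertainty(x, threshold):
    """The closer a point lies to the decision boundary, the more uncertain the model is."""
    return -abs(x - threshold)

for _ in range(10):                     # query budget: 10 labels
    threshold = train(labeled)
    # Select the most uncertain unlabeled example ...
    query = max(unlabeled, key=lambda x: uncertainty(x, threshold))
    # ... query its label from the oracle and grow the labeled set.
    labeled[query] = true_label(query)
    unlabeled.remove(query)

print(len(labeled), round(train(labeled), 3))
```

With only 10 queries, the selected points cluster around the boundary and the learned threshold lands close to 0.5, which is exactly the labeling economy that active selection aims for.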

AL with Instance Selection: Uncertainty (Lewis and Gale (1994)), Query By Committee (Abe and Mamitsuka (1998)), Expected Error Reduction (Roy and McCallum (2001)), Random, Graph Density (Ebert et al. (2012)), BMDR (Wang and Ye (2013)), QUIRE (Huang et al. (2010)), LAL (Konyushkova et al. (2017)), SPAL (Tang and Huang (2019))
AL for Multi-Label Data: AUDI (Huang and Zhou (2013)), QUIRE (Huang et al. (2014)), MMC (Yang et al. (2009)), Adaptive (Li and Guo (2013)), Random
AL by Querying Features: AFASMC (Huang et al. (2018)), Stability (Chakraborty et al. (2013)), Random
AL with Different Costs: HALC (Yan and Huang (2018)), Random, Cost performance
AL with Noisy Oracles: CEAL (Huang et al. (2017)), IEthresh (Donmez et al. (2009)), Repeated (Sheng et al. (2008)), Random
AL with Novel Query Types: AURO (Huang et al. (2015))
AL for Large Scale Tasks: Subsampling
Table 1: Implemented active learning strategies in different settings.

2 Modules in ALiPy

As illustrated in Figure 1, we decompose the active learning implementation into multiple components. To facilitate the implementation of different active learning methods under different settings, we develop ALiPy based on multiple modules, each corresponding to a component of the active learning process.

Below is the list of modules in ALiPy.

  • alipy.data_manipulate: It provides basic functions for data pre-processing and partition. Both cross validation and hold-out tests are supported.

  • alipy.query_strategy: It consists of 25 commonly used query strategies.

  • alipy.index.IndexCollection: It helps to manage the indexes of labeled and unlabeled examples.

  • alipy.metric: It provides multiple criteria to evaluate the model performance.

  • alipy.experiment.state and alipy.experiment.state_io: They help to save the intermediate results after each query and can recover the program from breakpoints.

  • alipy.experiment.stopping_criteria: It implements some commonly used stopping criteria.

  • alipy.oracle: It supports different oracle settings. One can configure multiple oracles with noisy annotations and different costs.

  • alipy.experiment.experiment_analyser: It provides functions for gathering, processing and visualizing the experimental results.

  • alipy.utils.multi_thread: It provides a parallel implementation of k-fold experiments.

The above modules are independently designed and implemented. In this way, the code of each part can be written without restriction, and each module can be replaced by the user's own implementation (without inheriting anything). Because the modules in ALiPy do not depend on each other, they can be substituted freely.
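This loose coupling means, for instance, that a query strategy can be any object exposing a selection method, with no base class required. The sketch below illustrates the idea with duck typing in pure Python; the class and method names (`RandomStrategy`, `select`, etc.) are hypothetical and not ALiPy's actual interface:

```python
import random

class RandomStrategy:
    """A user-defined strategy: pick an unlabeled index at random (seeded for repeatability)."""
    def __init__(self, seed=0):
        self._rng = random.Random(seed)

    def select(self, unlabeled_indices, scores=None):
        return self._rng.choice(sorted(unlabeled_indices))

class GreedyStrategy:
    """Another drop-in strategy: pick the index with the highest informativeness score."""
    def select(self, unlabeled_indices, scores=None):
        return max(unlabeled_indices, key=lambda i: scores[i])

def run_query(strategy, unlabeled_indices, scores):
    # The experiment loop relies only on the duck-typed `select` method,
    # so any user-written object can stand in for a built-in strategy.
    return strategy.select(unlabeled_indices, scores)

scores = {1: 0.2, 4: 0.9, 7: 0.5}
print(run_query(GreedyStrategy(), {1, 4, 7}, scores))   # → 4
```

Swapping one strategy for another changes a single argument, which is the kind of substitution the module design is meant to enable.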

Each module is also highly flexible, making the toolbox adaptive to different settings. For example, the data split function accepts either the shape of the data matrix or a list of example names. The oracle class can further specify the cost of each label, and can answer instance-label pair queries in the multi-label setting. The analyser class also accepts unaligned experimental results in the cost-sensitive setting, in which case interpolation is performed automatically when plotting the learning curves.
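To make the cost-sensitive oracle setting concrete, the toy sketch below implements an oracle that returns a label together with its annotation cost and tracks the budget spent. It is pure Python with hypothetical names (`CostSensitiveOracle`, `query`), not ALiPy's actual API:

```python
class CostSensitiveOracle:
    """A toy oracle: stores ground-truth labels plus a per-label annotation cost."""
    def __init__(self, labels, costs):
        self._labels = labels          # example index -> label
        self._costs = costs            # label -> cost of annotating it
        self.total_cost = 0.0

    def query(self, index):
        label = self._labels[index]
        cost = self._costs[label]
        self.total_cost += cost        # accumulate the labeling budget spent
        return label, cost

# Assumption for illustration: positive labels cost more to annotate than negatives.
oracle = CostSensitiveOracle(labels={0: "neg", 1: "pos", 2: "neg"},
                             costs={"neg": 1.0, "pos": 3.0})
for i in (0, 1, 2):
    oracle.query(i)
print(oracle.total_cost)   # → 5.0
```

A cost-aware strategy would compare this per-query cost against the expected benefit of the label, rather than counting queries alone.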

For more details, please refer to the document at http://parnec.nuaa.edu.cn/huangsj/alipy, and the git repository at https://github.com/NUAA-AL/ALiPy.

Figure 1: A general framework for implementing an active learning approach.

3 Usage of ALiPy

ALiPy provides several usage options for different types of users.

For users who are less familiar with active learning and simply want to apply a method to a dataset, ALiPy provides the class alipy.experiment.AlExperiment, which encapsulates the various tools and implements the main loop of active learning. With this class, users can run experiments in only a few lines of code, without any background knowledge.
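Conceptually, such an encapsulated class hides the split-query-evaluate loop behind a single object. The sketch below mimics that idea in pure Python; the `ToyExperiment` class, its methods, and the majority-vote "model" are illustrative stand-ins, not AlExperiment's real signature:

```python
class ToyExperiment:
    """Encapsulates a whole (toy) active learning run behind two calls."""
    def __init__(self, data, budget):
        self.data = data               # list of (feature, label) pairs
        self.budget = budget
        self.curve = []                # accuracy recorded after each query

    def start(self):
        labeled = []
        pool = list(self.data)
        for _ in range(self.budget):
            labeled.append(pool.pop(0))        # stand-in for a real query strategy
            # Majority-vote "model": predict the most common class seen so far.
            majority = max({y for _, y in labeled},
                           key=lambda c: sum(1 for _, y in labeled if y == c))
            acc = sum(1 for _, y in self.data if y == majority) / len(self.data)
            self.curve.append(acc)
        return self.curve

exp = ToyExperiment(data=[(x, int(x > 2)) for x in range(6)], budget=3)
print(exp.start())   # → [0.5, 0.5, 0.5]
```

The user only constructs the object and calls one method; everything else (querying, model updates, evaluation) happens inside, which is the convenience an encapsulated experiment class provides.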

For users who want to experimentally evaluate the performance of existing active learning methods, ALiPy provides implementations of more than 20 state-of-the-art methods, along with detailed instructions and plentiful example code.

For users who want to implement their own ideas and run active learning experiments, ALiPy's module-based structure allows them to modify any part of the active learning process. More importantly, several novel settings are supported to make implementation more convenient. We also provide detailed API references and usage examples for each module and setting to help users get started quickly. Note that ALiPy does not force users to use any of its tool classes: the classes are designed independently and can be substituted by the user's own implementations without inheriting anything.

For details, please refer to the documentation and code examples available on the ALiPy homepage and GitHub.


References

  • Abe and Mamitsuka (1998) Naoki Abe and Hiroshi Mamitsuka. Query learning strategies using boosting and bagging. In Proceedings of the 15th International Conference on Machine Learning, pages 1–9, 1998.
  • Chakraborty et al. (2013) Shayok Chakraborty, Jiayu Zhou, Vineeth Nallure Balasubramanian, Sethuraman Panchanathan, Ian Davidson, and Jieping Ye. Active matrix completion. In IEEE 13th International Conference on Data Mining, pages 81–90, 2013.
  • Donmez et al. (2009) Pinar Donmez, Jaime G. Carbonell, and Jeff G. Schneider. Efficiently learning the accuracy of labeling sources for selective sampling. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 259–268, 2009.
  • Ebert et al. (2012) Sandra Ebert, Mario Fritz, and Bernt Schiele. RALF: A reinforced active learning formulation for object class recognition. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3626–3633, 2012.
  • Huang and Zhou (2013) Sheng-Jun Huang and Zhi-Hua Zhou. Active query driven by uncertainty and diversity for incremental multi-label learning. In IEEE 13th International Conference on Data Mining, pages 1079–1084, 2013.
  • Huang et al. (2010) Sheng-Jun Huang, Rong Jin, and Zhi-Hua Zhou. Active learning by querying informative and representative examples. In Advances in Neural Information Processing Systems, pages 892–900, 2010.
  • Huang et al. (2014) Sheng-Jun Huang, Rong Jin, and Zhi-Hua Zhou. Active learning by querying informative and representative examples. IEEE Transactions on Pattern Analysis and Machine Intelligence, 36(10):1936–1949, 2014.
  • Huang et al. (2015) Sheng-Jun Huang, Songcan Chen, and Zhi-Hua Zhou. Multi-label active learning: Query type matters. In Proceedings of the 24th International Joint Conference on Artificial Intelligence, pages 946–952, 2015.
  • Huang et al. (2017) Sheng-Jun Huang, Jia-Lve Chen, Xin Mu, and Zhi-Hua Zhou. Cost-effective active learning from diverse labelers. In Proceedings of the 26th International Joint Conference on Artificial Intelligence, pages 1879–1885, 2017.
  • Huang et al. (2018) Sheng-Jun Huang, Miao Xu, Ming-Kun Xie, Masashi Sugiyama, Gang Niu, and Songcan Chen. Active feature acquisition with supervised matrix completion. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1571–1579, 2018.
  • Konyushkova et al. (2017) Ksenia Konyushkova, Raphael Sznitman, and Pascal Fua. Learning active learning from data. In Advances in Neural Information Processing Systems, pages 4228–4238, 2017.
  • Lewis and Gale (1994) David D. Lewis and William A. Gale. A sequential algorithm for training text classifiers. In Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3–12, 1994.
  • Li and Guo (2013) Xin Li and Yuhong Guo. Active learning with multi-label SVM classification. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence, pages 1479–1485, 2013.
  • Roy and McCallum (2001) Nicholas Roy and Andrew McCallum. Toward optimal active learning through sampling estimation of error reduction. In Proceedings of the 18th International Conference on Machine Learning, pages 441–448, 2001.
  • Settles (2009) B. Settles. Active learning literature survey. Technical report, University of Wisconsin-Madison, 2009.
  • Sheng et al. (2008) Victor S. Sheng, Foster J. Provost, and Panagiotis G. Ipeirotis. Get another label? improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 614–622, 2008.
  • Tang and Huang (2019) Ying-Peng Tang and Sheng-Jun Huang. Self-paced active learning: Query the right thing at the right time. In Proceedings of the 33rd AAAI Conference on Artificial Intelligence, 2019.
  • Wang and Ye (2013) Zheng Wang and Jieping Ye. Querying discriminative and representative samples for batch mode active learning. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 158–166, 2013.
  • Yan and Huang (2018) Yifan Yan and Sheng-Jun Huang. Cost-effective active learning for hierarchical multi-label classification. In Proceedings of the 27th International Joint Conference on Artificial Intelligence, pages 2962–2968, 2018.
  • Yang et al. (2009) Bishan Yang, Jian-Tao Sun, Tengjiao Wang, and Zheng Chen. Effective multi-label active learning for text classification. In Proceedings of the 15th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 917–926, 2009.