A Survey of Adaptive Resonance Theory Neural Network Models for Engineering Applications

05/04/2019 ∙ Leonardo Enzo Brito da Silva et al.

This survey samples from the ever-growing family of adaptive resonance theory (ART) neural network models used to perform the three primary machine learning modalities, namely, unsupervised, supervised and reinforcement learning. It comprises a representative list from classic to modern ART models, thereby painting a general picture of the architectures developed by researchers over the past 30 years. The learning dynamics of these ART models are briefly described, and their distinctive characteristics such as code representation, long-term memory and corresponding geometric interpretation are discussed. Useful engineering properties of ART (speed, configurability, explainability, parallelization and hardware implementation) are examined along with current challenges. Finally, a compilation of online software libraries is provided. It is expected that this overview will be helpful to new and seasoned ART researchers.


1 Introduction

Adaptive Resonance Theory (ART) (Grossberg, 1976a, b, 1980, 2013) is a biologically plausible theory of how a brain learns to consciously attend to, learn and recognize patterns in a constantly changing environment. The theory states that resonance regulates learning in neural networks with feedback (recurrence). Thus, it is more than a neural network architecture, or even a family of architectures. However, it has inspired many neural network architectures with very attractive properties for applications in science and engineering, such as fast and stable incremental learning, relatively small memory requirements and straightforward algorithms (Wunsch II, 2009). In this context, fast learning refers to the ability of the neurons’ weight vectors to converge to their asymptotic values directly with each input sample presentation. These and other properties make ART networks attractive to many researchers and practitioners, and they have been used successfully in a variety of science and engineering applications.

ART addresses the problem of stability vs. plasticity (Grossberg, 1980; Carpenter & Grossberg, 1987a). Plasticity refers to the ability of a learning algorithm to adapt and learn new patterns. In many such learning systems, plasticity can lead to instability, a situation in which learning new knowledge leads to the loss or corruption of previously learned knowledge, also known as catastrophic forgetting. Stability, on the other hand, is defined by the condition that no prototype vector can take on a previous value after it has changed, and that an infinite presentation of inputs results in the formation of a finite number of clusters (Xu & Wunsch II, 2009; Moore, 1989). ART addresses this stability-plasticity dilemma by introducing the ability to learn arbitrary input patterns in a fast and stable self-organizing fashion without suffering from catastrophic forgetting.

There have been some previous studies with similar objectives of surveying the ART neural network literature (Du, 2010; Amorim et al., 2011; Jain et al., 2014; RamaKrishna et al., 2014). This survey expands on those works, compiling a broad and informative sampling of ART neural network architectures from the ever-growing machine learning literature. It captures a representative set of examples of various ART architectures in the unsupervised, supervised and reinforcement learning modalities, as well as some models that cross these boundaries and/or combine multiple learning modalities. The overarching goal of this survey is to provide researchers with accessible coverage of these models, with a focus on their motivations, their interpretations for engineering applications and a discussion of open problems for consideration. It is not meant as a comparative assessment of these models, but rather as a roadmap for assessing options.

The remainder of this paper is organized as follows. Section 2 presents a sampling of unsupervised learning (UL) ART models, divided into elementary, topological, hierarchical, biclustering and data fusion architectures. Section 3 discusses supervised learning (SL) ART models for both classification and regression. Reinforcement learning (RL) ART models are discussed in Section 4. Sections 5 and 6 discuss some of the useful properties of ART architectures and open problems in this field, respectively. Section 7 provides links to some repositories of ART neural network code, and Section 8 concludes the paper.

2 ART models for unsupervised learning

2.1 Elementary architectures

At their core, the elementary ART models are predominantly used for unsupervised learning applications. However, they also lay the foundation to build complex ART-based systems capable of performing all three machine learning modalities (Secs. 2, 3, and 4). This section describes the main characteristics of ART family members in terms of their code representation, long-term memory unit, system dynamics (which encompasses activation, match, resonance and learning) and user-defined parameters. For clarity, Table 1 summarizes the common notation used in the following subsections.

An elementary ART neural network model (Fig. 1) usually consists of two fully connected layers as well as a system responsible for its decision-making capabilities:

  • Feature representation field F1: this is the input layer. In feedforward mode, the output of this layer, or short-term memory (STM), simply propagates the input samples to the F2 layer via the bottom-up long-term memory units (LTMs). In feedback mode, the F1 layer works as a comparator, in which the input and F2’s expectation (in the form of a top-down LTM) are compared and the outcome is sent to the orienting subsystem. Hence, F1 is also known as the comparison layer.

  • Category representation field F2: this layer yields the network output (STM). It is also known as the recognition or competitive layer. Neurons, prototypes, categories and templates will be used interchangeably when referring to the F2 nodes. Each category j is associated with an LTM unit w_j. Note that not all elementary ART models discussed in this survey have independent bottom-up and top-down LTM parts; however, w_j is always used to indicate the LTM (or set of adaptive parameters) of a given category.

  • Orienting subsystem: this is a system that regulates both the search and learning mechanisms by inhibiting or allowing categories to resonate.

Figure 1: Elementary ART model underlying various designs. The orienting subsystem uses the vigilance threshold to regulate whether ART can go into resonance or if it must reset.
Notation Description
x : input sample (x ∈ ℝ^d)
d : original data dimensionality
F1 : feature representation field
F2 : category representation field
N : number of categories
y^{F1} : F1 activity/output (STM)
y^{F2} : F2 activity/output (STM)
j : a category
w_j : category parameters (LTM unit)
T_j : activation function
M_j : match function
J : chosen category index (via WTA)
ρ : vigilance parameter
VR_j : vigilance region
Table 1: Unsupervised ART models notation.

Note that some ART models represent pre-processing procedures of the input samples by another layer preceding F1, namely the Input field F0. In this survey, it is assumed that the inputs to an ART network have already gone through the required transformations, and thus this layer is omitted from the discussion.

ART models are competitive, self-organizing, dynamic and modular networks. When a sample is presented, a winner-takes-all (WTA) competition takes place over its categories at the output layer F2. Then, the neuron that optimizes that model’s activation function across the nodes is chosen, e.g., the neuron that maximizes some similarity measure to the presented sample

J = argmax_{j=1,...,N} { T_j }.    (1)

A category represents a hypothesis. Therefore, a hypothesis test cycle, commonly referred to as a vigilance test, is performed by the orienting subsystem to determine the adequacy of the selected category, i.e., the winner category must satisfy a match criterion (or several match criteria). If the confidence in such a hypothesis is larger than the minimum threshold (namely, the vigilance parameter ρ), the neural network enters a resonance state and learning (i.e., adaptation of the long-term memory (LTM) units) is allowed. Otherwise, category J is inhibited, the next highest ranked category is selected, and the search resumes. If no category satisfies the required resonance condition(s), then a new one is created to encode the presented input sample. This ability to reject a hypothesis/category via a two-way similarity measure, i.e., permissive clustering (Seiffertt & Wunsch II, 2010), makes ART stand out from other methods, such as k-means (MacQueen, 1967). A vigilance region (VR_j) for a given network category j can be defined in the data space as

VR_j = { x : M_j(x) ≥ ρ },    (2)

where M_j is the match function, which yields the confidence in hypothesis j. In other words, VR_j is the region of the input space containing the set of all points x for which the resonance criterion is met. Therefore, satisfying (or not) the vigilance test for sample x can be modeled using

V_j(x) = 1_{VR_j}(x) = 1 if x ∈ VR_j, and 0 otherwise,    (3)

where 1_{VR_j}(·) is the indicator function.

The resonance constraint in Eq. (2) depends on the vigilance parameter ρ, which regulates the granularity of the network as ART maps samples to categories. In particular, lower vigilance encourages generalization (Vigdor & Lerner, 2007). Selecting the vigilance parameter is a difficult task in clustering problems: concretely, the problem of choosing the number of clusters is traded for the problem of choosing the vigilance value.

Distinct ART models feature specific LTM units, activation and match functions, vigilance criteria and learning laws. Algorithm 1 summarizes the dynamics of an elementary ART model.

Input: input sample x and model parameters.
Output: F2 activity y^{F2} (category assignment).

/* Notation. Λ: set of committed ART nodes; Λ′: subset of highly active nodes (Λ′ ⊆ Λ); w: LTM unit; ρ: vigilance parameter(s); f_T: activation function (and its parameters); f_M: match function (and its parameters); f_L: learning function (and its parameters); f_V: vigilance function (e.g., f_V = [M_J ≥ ρ]); f_new: initialization function (and its parameters). */
1  Present input sample x.
2  Compute the activation function(s): T_j = f_T(x, w_j), ∀j ∈ Λ.
3  Perform the WTA competition: J = argmax_{j ∈ Λ} T_j.
4  Compute the match function(s): M_J = f_M(x, w_J).
5  Perform the vigilance test(s): V = f_V(M_J, ρ).
6  if V is TRUE then
7      Update category J: w_J ← f_L(x, w_J).
8  else
9      Deactivate category J (remove it from Λ for this presentation).
10     if Λ ≠ ∅ then
11         Go to step 3.
12     else
13         Set J = N + 1.
14         Create a new category: N ← N + 1.
15         Initialize the new category: w_J ← f_new(x).
16 Set the output y^{F2}.
17 Go to step 1 (next input sample).
Algorithm 1 Elementary ART algorithm.
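To make the control flow of Algorithm 1 concrete, the following minimal Python sketch implements the generic presentation/search loop with pluggable activation, match, learning and initialization functions. The function signatures and the list-based category store are illustrative assumptions rather than part of any particular ART model.

```python
def art_present(x, W, activation, match, learn, init, rho):
    """One presentation of sample x to an elementary ART module (cf. Algorithm 1).

    W is a list of category LTM units; activation, match, learn and init are
    model-specific callables (e.g., fuzzy ART's Eqs. (16), (19) and (22)).
    Returns the index of the resonant (or newly created) category.
    """
    candidates = list(range(len(W)))
    while candidates:
        # WTA competition over the remaining (non-deactivated) nodes
        J = max(candidates, key=lambda j: activation(x, W[j]))
        if match(x, W[J]) >= rho:          # vigilance test
            W[J] = learn(x, W[J])          # resonance: update category J
            return J
        candidates.remove(J)               # mismatch: deactivate category J
    W.append(init(x))                      # no category resonated: create one
    return len(W) - 1
```

Plugging in a specific model's functions, e.g., fuzzy ART's Eqs. (16), (19) and (22) from Sec. 2.1.3, recovers that model's training step.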

2.1.1 ART 1

The ART 1 neural network (Carpenter & Grossberg, 1987a) was the seminal implementation of the theory championed by Grossberg used for engineering applications. It relies on crisp set theoretic operators to cluster binary input samples using a similarity measure based on Hamming distance (Serrano-Gotarredona et al., 1998).

LTM. ART 1 categories are parameterized with bottom-up (b_j) and top-down (t_j) adaptive weight vectors.

Activation. When a sample is presented to ART 1, the activation function of each category is computed as

(4)

where x is a binary input, ∧ is a binary logic AND (component-wise minimum), b_j is the bottom-up weight vector, |·| is the L1 norm, and ⟨·,·⟩ is an inner product.

When a given node J is selected via the WTA competition, the F2 activity (short-term memory - STM) becomes

(5)

moreover, the F1 activity (short-term memory - STM) is defined as

(6)

Note that the WTA competition always includes one uncommitted node, which is guaranteed to satisfy the vigilance criterion in Eq. (7).

Match and resonance. The highest activated node is tested for resonance using

(7)

where M_J denotes the match value and ρ ∈ [0, 1] is the vigilance parameter. The vigilance criterion checks if M_J ≥ ρ is true, and, in the affirmative case, the category is allowed to learn.

Learning. When the system enters a resonant state, learning ensues as

(8)
(9)

where L is a user-defined parameter (larger values of L bias the selection of uncommitted nodes over committed ones). Note that the bottom-up weight vectors are normalized versions of their top-down counterparts. If an uncommitted node is selected to learn sample x, then another one is created and initialized as

(10)
(11)

ART 1 features the following appealing properties, thoroughly discussed in (Serrano-Gotarredona et al., 1998): “vigilance or variable coarseness, self-scaling, self-stabilization in a small number of iterations, online learning, capturing rare events, direct access to familiar input patterns, direct access to subset and superset patterns, biasing the network to form new categories.”
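As an illustration, the sketch below follows the standard fast-learning ART 1 equations (binary AND for the match, top-down templates intersected with the input, and bottom-up weights as their L-normalized versions); the variable names are illustrative, and a given reference may present the activation in a slightly different form.

```python
import numpy as np

def art1_step(x, T, B, rho=0.7, L=2.0):
    """One fast-learning ART 1 presentation (sketch; assumes a nonzero binary x).

    x: binary input vector; T: list of top-down weight vectors; B: list of
    bottom-up weight vectors; rho: vigilance; L: uncommitted-node bias parameter.
    """
    candidates = list(range(len(T)))
    while candidates:
        J = max(candidates, key=lambda j: float(np.dot(B[j], x)))  # bottom-up choice
        overlap = np.minimum(x, T[J])                              # binary AND
        if overlap.sum() / x.sum() >= rho:                         # match criterion
            T[J] = overlap                                         # fast top-down learning
            B[J] = L * overlap / (L - 1.0 + overlap.sum())         # normalized bottom-up
            return J
        candidates.remove(J)                                       # reset and keep searching
    T.append(x.copy())                           # new node: an all-ones template fast-learning
    B.append(L * x / (L - 1.0 + x.sum()))        # the sample reduces to the sample itself
    return len(T) - 1
```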

2.1.2 ART 2

ART 2 (Carpenter & Grossberg, 1987b) and ART 2-A (Carpenter et al., 1991b) represent the initial effort toward extending ART 1 (Sec. 2.1.1) applications to real-valued data. They were largely supplanted by fuzzy ART (Sec. 2.1.3), which has since become one of the most widely used and referenced foundational building blocks for ART networks. This was followed by other architectures such as ART 3 (Carpenter & Grossberg, 1990), a hierarchical architecture; Exact ART (Raijmakers & Molenaar, 1997), which is a complete ART network based on ART 2; and Correlation-based ART (Yavaş & Alpaslan, 2009), along with its hierarchical variant (Yavaş & Alpaslan, 2012), which use correlation analysis methods for category matching. Particularly, the ART 2-A (Carpenter et al., 1991b) architecture was developed following ART 2, with the same properties but a much faster speed.

LTM. The internal category representation in ART 2-A consists of an adaptive scaled weight vector w_j.

Activation. The activation function of each category  in response to a normalized input sample is computed as

(12)

where α is the choice parameter.

Match and resonance. The category with the highest activation value is chosen via winner-takes-all selection. Its match function is computed as

(13)

and the vigilance test M_J ≥ ρ is performed to determine whether resonance occurs, where ρ is the vigilance threshold.

If the winning category passes the vigilance test, resonance occurs, and the category is allowed to learn this input pattern. If the category fails the vigilance test, a reset signal is triggered for this category, and the category with the next highest activation is selected for the same process.

Learning. When resonance occurs, the weights of the winning category are updated as

(14)

where β is the learning rate.
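The following sketch illustrates the ART 2-A flow under the common fast-commit/slow-recode simplification; the thresholding (shrinking) step of the original formulation is omitted, so it should be read as an outline rather than a faithful reproduction of (Carpenter et al., 1991b).

```python
import numpy as np

def art2a_step(x, W, rho=0.9, alpha=0.01, beta=0.1):
    """Simplified ART 2-A presentation (sketch; omits the theta-thresholding step).

    x: input vector; W: list of unit-norm weight vectors; rho: vigilance;
    alpha: choice parameter for uncommitted nodes; beta: learning rate.
    """
    x = x / (np.linalg.norm(x) + 1e-12)                    # normalized input
    # committed nodes score via a dot product; an uncommitted node scores alpha * sum(x)
    scores = [float(np.dot(w, x)) for w in W] + [alpha * float(np.sum(x))]
    J = int(np.argmax(scores))
    if J == len(W) or scores[J] < rho:                     # uncommitted wins or mismatch
        W.append(x.copy())                                 # fast commitment of a new node
        return len(W) - 1
    w = beta * x + (1.0 - beta) * W[J]                     # slow recoding toward the input
    W[J] = w / (np.linalg.norm(w) + 1e-12)                 # keep the weight normalized
    return J
```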

2.1.3 Fuzzy ART

Fuzzy ART (FA) (Carpenter et al., 1991c) is arguably the most widely used ART model. It extends the capabilities of ART 1 (Sec. 2.1.1) to process real-valued data by incorporating fuzzy set theoretic operators (Zadeh, 1965). Typically, samples a ∈ [0, 1]^d are pre-processed by applying complement coding (Carpenter et al., 1991a, 1992). This transformation doubles the original input dimension while imposing a constant norm (|x| = d):

x = [a, 1 − a] = [a_1, ..., a_d, 1 − a_1, ..., 1 − a_d].    (15)

This process encodes the degree of presence and absence of each data feature. The augmented input vector prevents a type of category proliferation caused by weight erosion (Carpenter, 1997).

LTM. Each category LTM unit is a weight vector w_j. If complement coding is employed, then w_j = [v_j, 1 − u_j], and the geometric interpretation of category j is a hyperrectangle (or hyperbox) R_j, in the data space, with lower left corner v_j and upper right corner u_j representing feature ranges (minimum and maximum data statistics).

Activation. The activation function of a category j is defined as (Weber law)

T_j = |x ∧ w_j| / (α + |w_j|),    (16)

where ∧ is a component-wise fuzzy AND/intersection (minimum), |·| is the L1 norm, and α > 0 is the choice parameter, which is related to the system’s complexity (it can be seen as a regularization parameter that penalizes large weights). Its role has been thoroughly investigated in (Georgiopoulos et al., 1996). The activation function measures the degree to which w_j is a fuzzy subset of x and is biased towards smaller categories. The F1 activity is defined as

y^{F1} = x if F2 is inactive, and y^{F1} = x ∧ w_J when the F2 winner J is active.    (17)

Match and resonance. When the winner node J is selected, the F2 activity is

y^{F2}_j = 1 if j = J, and 0 otherwise,    (18)

and a hypothesis testing cycle is conducted using

M_J = |x ∧ w_J| / |x|,    (19)

where |x| = d and ρ ∈ [0, 1] is the vigilance parameter. The vigilance criterion checks if M_J ≥ ρ is true, and, in the affirmative case, the category is allowed to learn. An uncommitted category will always satisfy the match criterion. Fuzzy ART vigilance regions are hyperoctagons, as thoroughly discussed in (Anagnostopoulos & Georgiopoulos, 2002; Verzi et al., 2006; Meng et al., 2016). The match function ensures that if learning takes place, the updated category will not exceed the maximum allowed size. Specifically, category j’s size is measured as

|R_j| = Σ_{i=1}^{d} (u_{ji} − v_{ji}) = d − |w_j|,    (20)

where, considering the complement coded inputs, |x| = d (for an uncommitted category, w_j = 1). Particularly, the match function measures the size of the category if it is allowed to learn the presented sample. Thus, the vigilance criterion imposes an upper bound on the category size defined by the vigilance parameter ρ:

|R_J ⊕ x| ≤ d(1 − ρ),    (21)

where R_J ⊕ x represents the smallest hyperrectangle capable of enclosing both R_J and the presented sample x.

Learning. If the vigilance test fails, then the winner category is inhibited, and the search continues until another one is found or created. When the vigilance criterion is met by category J, it adapts using

w_J(new) = β (x ∧ w_J(old)) + (1 − β) w_J(old),    (22)

where β ∈ (0, 1] is the learning parameter. If an uncommitted node is recruited to learn sample x, then another uncommitted one is created and initialized as w = 1 (a vector of ones). According to Eq. (22), the norm of a weight vector is monotonically non-increasing during learning, since categories can only expand (Vigdor & Lerner, 2007).
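Putting Eqs. (15), (16), (19) and (22) together, one presentation of fuzzy ART can be sketched as follows (parameter names mirror the text; the list-based category store is an implementation convenience).

```python
import numpy as np

def complement_code(a):
    """Eq. (15): map a in [0, 1]^d to x = [a, 1 - a], so that |x| = d."""
    return np.concatenate([a, 1.0 - a])

def fuzzy_art_step(a, W, rho=0.75, alpha=0.001, beta=1.0):
    """One fuzzy ART presentation. W: list of weight vectors in [0, 1]^{2d}."""
    x = complement_code(a)
    d = a.shape[0]
    candidates = list(range(len(W)))
    while candidates:
        # Eq. (16): Weber-law activation
        J = max(candidates,
                key=lambda j: np.minimum(x, W[j]).sum() / (alpha + W[j].sum()))
        # Eq. (19): match function |x ^ w_J| / |x|, with |x| = d
        if np.minimum(x, W[J]).sum() / d >= rho:
            # Eq. (22): learning (beta = 1 corresponds to fast learning)
            W[J] = beta * np.minimum(x, W[J]) + (1.0 - beta) * W[J]
            return J
        candidates.remove(J)                    # inhibit the mismatching category
    # no category resonated: commit a new one (an all-ones node fast-learning x yields x)
    W.append(x.copy())
    return len(W) - 1
```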

2.1.4 Fuzzy Min-Max

The Fuzzy Min-Max neural network (Simpson, 1993) is an unsupervised learning network that uses fuzzy set theory to build clusters using a hyperbox representation discovered via the fuzzy min-max learning algorithm. Each category in Fuzzy Min-Max is represented explicitly as a hyperbox, with the minimum and maximum points of the hyperbox as well as a value for the membership function that measures the degree to which each input pattern falls within this category. The category hyperboxes are adjusted to fit each input sample using a contraction and expansion algorithm that expands the hyperbox of the winning category to fit the input sample and then contracts any other hyperboxes that are found to overlap with the new hyperbox boundaries.

2.1.5 Distributed ART

The distributed ART (dART) (Carpenter, 1996a, b, 1997) features distributed code representation for activation, match and learning processes to improve noise robustness and memory compression in a system that features fast and stable learning. Particularly, in WTA mode, distributed ART reduces in functionality to fuzzy ART (Sec. 2.1.3).

LTM. The distributed ART LTM units consist of bottom-up and top-down adaptive thresholds, which are initialized near zero. When employing complement coding, the geometric interpretation of a category is a family of hyperrectangles nested by the activation levels. The edges of each hyperrectangle are defined, for each input dimension, as a bounded interval obtained by applying a rectifier operator to the corresponding thresholds. Note that the hyperrectangle size decreases as the activation level increases. Particularly, a fully active node yields the smallest hyperrectangle, and substituting the complements of the adaptive thresholds recovers fuzzy ART’s LTM (weights).

Activation. The activation function can be defined as a choice-by-difference (Carpenter & Gjaja, 1994) variant

(23)

or a Weber law (Carpenter & Grossberg, 1987a) variant

(24)

where [·]^+ is a component-wise rectifier operator (i.e., applied to each component of its vector argument), and the remaining quantities are the medium-term memory (MTM) depletion parameters. After the nodes’ activations are computed, the F2 activity can be obtained by employing the increased-gradient content-addressable-memory (IG CAM) rule:

(25)

which distributes the activity among a subset of highly active nodes. Examples of such subsets are those selected by the Q-max rule (see Sec. 3.1.10) or those with greater-than-average activations. Note that the power law in the IG CAM rule converges to WTA as its exponent grows.

Match and Resonance. The distributed ART’s match function is defined as

(26)

where the activity is given by

(27)

and

(28)

Resonance occurs if the match value satisfies the vigilance criterion. Otherwise, the MTM depletion parameters are updated as

(29)
(30)

and the distributed dynamics continue by recomputing Eqs. (25) through (26). Note that the depletion parameters are (re)set to zero at the beginning of every input sample presentation.

Learning. When the system enters a resonant state, distributed learning takes place according to the nodes’ activation levels. Specifically, the top-down adaptive thresholds are updated using the distributed outstar learning law (Carpenter, 1994):

(31)

whereas the bottom-up adaptive thresholds are updated using the distributed instar learning law (Carpenter, 1997):

(32)

where β is the learning rate. The adaptive thresholds’ components start near zero and monotonically increase during the learning process. After learning takes place, the depletion parameters are both reset to their initial values (zero). In WTA mode, the distributed instar and outstar learning laws become the instar (Grossberg, 1972) and outstar (Grossberg, 1968, 1969) laws, respectively, and thus distributed ART reduces to fuzzy ART (Sec. 2.1.3).

2.1.6 Gaussian ART

Gaussian ART (Williamson, 1996) was developed to reduce category proliferation in noisy environments and to provide a more efficient category LTM unit.

LTM. Each category j is a Gaussian distribution composed of a mean μ_j, a standard deviation vector σ_j and an instance count n_j (i.e., the number of samples encoded by category j, used to compute its a priori probability). Therefore, a category is geometrically interpreted as a hyperellipse in the data space.

Activation. Gaussian ART is rooted in Bayes’ decision theory, and as such its activation function is defined as:

(33)

where the likelihood is estimated as

(34)

and the prior as

(35)

Note that the evidence is neglected in the computations (since it is equal for all categories), and feature independence is assumed, i.e., the covariance matrix is diagonal. Therefore, since it assumes uncorrelated features, it cannot capture covarying data. A category is then chosen following the maximum a posteriori (MAP) criterion:

(36)

Match and Resonance. The match function is defined as a normalized version of the category likelihood:

(37)

which is then compared to the vigilance parameter threshold ρ. Note that in the original Gaussian ART paper (Williamson, 1996), a log discriminant is used to reduce the computational burden in both the activation (Eq. (33)) and match (Eq. (37)) functions.

Learning. When the vigilance criterion is met, learning ensues for the resonating category J as

(38)
(39)
(40)

If a new category is created, then it is initialized with n_{new} = 1, μ_{new} = x and an isotropic standard deviation σ_{new} = γ1, where γ is a user-defined initial standard deviation. The initial standard deviation in Gaussian ART directly affects the number of categories created.
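A compact sketch of one Gaussian ART presentation is given below, assuming diagonal (feature-independent) Gaussians, priors estimated from the instance counts, and the normalized Gaussian as the match value; gamma denotes the user-defined initial standard deviation.

```python
import numpy as np

def gaussian_art_step(x, cats, rho=0.1, gamma=0.5):
    """One Gaussian ART presentation (sketch). cats: dicts with 'mu', 'sigma', 'n'."""
    best, best_T, best_g = None, -np.inf, 0.0
    n_total = sum(c['n'] for c in cats) or 1
    for j, c in enumerate(cats):
        g = np.exp(-0.5 * np.sum(((x - c['mu']) / c['sigma']) ** 2))  # normalized match
        T = (c['n'] / n_total) * g / np.prod(c['sigma'])              # prior * likelihood (up to constants)
        if T > best_T:
            best, best_T, best_g = j, T, g
    if best is not None and best_g >= rho:            # vigilance test on the match value
        c = cats[best]
        c['n'] += 1
        lr = 1.0 / c['n']
        c['mu'] = (1.0 - lr) * c['mu'] + lr * x                        # sequential mean update
        c['sigma'] = np.sqrt((1.0 - lr) * c['sigma'] ** 2 + lr * (x - c['mu']) ** 2)
        return best
    cats.append({'mu': x.copy(), 'sigma': gamma * np.ones_like(x), 'n': 1})
    return len(cats) - 1
```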

2.1.7 Hypersphere ART

The Hypersphere ART (HA) (Anagnostopoulos & Georgiopulos, 2000) architecture was designed as a successor for Fuzzy ART (Section 2.1.3) that inherits its advantageous qualities while utilizing fewer categories and having a more efficient internal knowledge representation.

LTM. Each category j is represented as w_j = (m_j, R_j), where m_j and R_j are the centroid and radius, respectively. Since it does not require complement coding of the input samples, it uses d + 1 memory positions per category, which is a smaller requirement than fuzzy ART’s 2d to represent the hyperrectangular categories. Naturally, categories are hyperspheres in the data space.

Activation. The category activation function for each category is calculated as:

(41)

where ||·|| is the (Euclidean) L2 norm, α > 0 is the choice parameter and R̄ is the radial extent parameter, which controls the maximum possible category size achieved during training. The lower bound of R̄ is defined as:

(42)

Match and resonance. The winning category is selected using WTA competition, and the match function is computed as

(43)

where the vigilance criterion is M_J ≥ ρ.

Learning. If the winning category satisfies the vigilance test, then resonance occurs, and the radius and centroid of the winning node are updated as follows:

(44)
(45)

where β is the learning rate parameter.

If the winning category fails the vigilance test, it is reset, and the process is repeated. Eventually, either a category succeeds or a new one is created with its radius and centroid initialized as R_{new} = 0 and m_{new} = x, respectively.
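The sketch below strings these steps together using the hypersphere ART expressions for activation, match and learning as described above; R_bar stands for the radial extent parameter, and the tuple-based category store is an implementation convenience.

```python
import numpy as np

def hypersphere_art_step(x, cats, rho=0.8, alpha=1e-3, R_bar=1.0, beta=1.0):
    """One hypersphere ART presentation (sketch). cats: list of (centroid, radius)."""
    candidates = list(range(len(cats)))
    while candidates:
        def T(j):
            # activation: small categories that (nearly) contain x are favored
            m, R = cats[j]
            dist = np.linalg.norm(x - m)
            return (R_bar - max(R, dist)) / (R_bar - R + alpha)
        J = max(candidates, key=T)
        m, R = cats[J]
        dist = float(np.linalg.norm(x - m))
        if 1.0 - max(R, dist) / R_bar >= rho:                  # match criterion
            new_R = R + 0.5 * beta * (max(R, dist) - R)        # grow the radius toward x
            if dist > 0.0:
                m = m + 0.5 * beta * (x - m) * (1.0 - min(R, dist) / dist)
            cats[J] = (m, new_R)
            return J
        candidates.remove(J)
    cats.append((x.copy(), 0.0))                               # new point-sized category
    return len(cats) - 1
```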

2.1.8 Ellipsoid ART

Ellipsoid ART (EA) (Anagnostopoulos & Georgiopoulos, 2001a, b) is a generalization of hypersphere ART that uses hyperellipses instead of hyperspheres to represent the categories. These require 2d + 1 memory positions per category and are subject to two distinct constraints during training: (1) they maintain a constant ratio between the lengths of their major and minor axes, and (2) they maintain a fixed direction of their major axis once it is set. These restrictions, however, can pose some limitations on the categories discovered by ellipsoid ART depending on the order in which the input samples are presented.

LTM. A category in ellipsoid ART is described by its parameters w_j = (m_j, d_j, R_j), where m_j is the centroid of the category’s hyperellipsoid, d_j is the direction of the category’s major axis and R_j is the category’s radius (or half the length of its major axis).

Activation. The distance between an input sample and a category is calculated as:

(46)

where ||·|| is the (Euclidean) vector norm and the remaining term is a user-specified parameter that defines the ratio between a category’s major and minor axes. The category activation function for each category is then calculated as:

(47)

where α is the choice parameter, and the remaining term is a user-specified parameter.

Match and resonance. The match function of the winning category selected using winner-takes-all is given by

(48)

where ρ is the vigilance parameter.

Learning. If the winning category satisfies the vigilance criterion (M_J ≥ ρ), then resonance occurs, and it is updated as follows:

(49)
(50)
(51)

where β is the learning rate, and Eq. (51) involves the second input sample to be encoded by this category. When a new category is created, its major axis direction is initially set to the zero vector, and then Eq. (51) is used to set it when the second pattern is committed to the category. The hyperellipse’s major axis direction stays fixed after that.

If the winning category fails the vigilance check, then it is inhibited, and the entire process is repeated until a winner category satisfies the resonance criterion. If no existing category succeeds, then a new category is created with its centroid initialized to the presented sample, its radius set to zero and its major axis direction set to the zero vector.

2.1.9 Quadratic neuron ART

The quadratic neuron ART model (Su & Liu, 2002, 2005) was developed in the context of a multi-prototype-based clustering framework that integrates dynamic prototype generation and hierarchical agglomerative clustering to retrieve arbitrarily-shaped data structures.

LTM. A category is a quadratic neuron (DeClaris & Su, 1991, 1992; Su et al., 1997; Su & Liu, 2001) parameterized by a set of adaptable LTM components (a transformation matrix, a center vector and a scaling factor). Particularly, these neurons are hyperellipsoid structures in the multidimensional data space.

Activation. The activation of a quadratic neuron  is given by

(52)

where the transformed vector appearing in Eq. (52) is a linear transformation of the input:

(53)

Match and resonance. After the winning node is selected using WTA competition, the system will enter a resonant state if node J’s response is larger than or equal to the vigilance parameter ρ, i.e., if M_J ≥ ρ, where the match function is equal to the activation function (Eq. (52)).

Learning. If the vigilance criterion is satisfied for node J, then its parameters are adapted using gradient ascent

(54)

where the gradient step is scaled by a learning rate. Specifically,

(55)
(56)
(57)

where each LTM component is updated with its own learning rate. Otherwise, a new category is created and initialized from the presented sample together with a user-defined parameter.

2.1.10 Bayesian ART

LTM. Bayesian ART (BA) (Vigdor & Lerner, 2007) is another architecture using multidimensional Gaussian distributions to parameterize the categories: w_j = (μ_j, Σ_j, P(j)), where μ_j, Σ_j and P(j) are the mean, covariance matrix and prior probability, respectively. The latter parameter is computed using the number of samples n_j learned by a category.

Activation. Like Gaussian ART (Sec. 2.1.6), Bayesian ART also integrates Bayes’ decision theory in its framework. Thus, its activation function is given by the posterior probability of category j:

(58)

where the likelihood is the same as in Eq. (34) but uses a full covariance matrix Σ_j (instead of a diagonal one), and the prior is the estimated prior probability of category j as in Eq. (35).

Match and Resonance. After the WTA competition is performed and the winner category is selected using the maximum a posteriori probability (MAP) criterion (Eq. (36)), the match function is computed as

(59)

such that the vigilance criterion is designed to limit category J’s hyper-volume: the vigilance test is defined as M_J ≤ S_max, where S_max represents the maximum allowed hyper-volume.

Learning. If the selected category resonates (i.e., the match criterion is satisfied), then learning occurs. The sample count and means are updated using Eq. (38) and Eq. (39), respectively. The covariance matrix is updated as:

(60)

which corresponds to the sequential maximum-likelihood estimation of the parameters of a multidimensional Gaussian distribution (Vigdor & Lerner, 2007). The Hadamard product is used when a diagonal covariance matrix is desired. Otherwise, a new category is created with n_{new} = 1, μ_{new} = x and an isotropic initial covariance matrix. Naturally, the initial covariance matrix should satisfy the vigilance constraint (i.e., its hyper-volume must not exceed S_max). In this ART model, categories can both grow and shrink.
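A condensed sketch of one Bayesian ART presentation is given below, assuming the covariance determinant is used as the hyper-volume in the vigilance test; S_max and sigma_init denote the maximum allowed hyper-volume and a user-defined initial standard deviation, respectively.

```python
import numpy as np

def bayesian_art_step(x, cats, S_max=1e-3, sigma_init=0.1):
    """One Bayesian ART presentation (sketch). cats: dicts with 'mu', 'Sigma', 'n'."""
    d = x.shape[0]
    n_total = sum(c['n'] for c in cats) or 1

    def posterior(c):
        diff = x - c['mu']
        lik = np.exp(-0.5 * diff @ np.linalg.inv(c['Sigma']) @ diff) / np.sqrt(
            (2.0 * np.pi) ** d * np.linalg.det(c['Sigma']))
        return (c['n'] / n_total) * lik            # prior * likelihood, cf. Eq. (58)

    # search categories in decreasing order of posterior probability (MAP first)
    for J in sorted(range(len(cats)), key=lambda j: posterior(cats[j]), reverse=True):
        c = cats[J]
        if np.linalg.det(c['Sigma']) <= S_max:     # hyper-volume vigilance test
            c['n'] += 1
            lr = 1.0 / c['n']
            c['mu'] = (1.0 - lr) * c['mu'] + lr * x
            diff = x - c['mu']
            c['Sigma'] = (1.0 - lr) * c['Sigma'] + lr * np.outer(diff, diff)  # Eq. (60)
            return J
    cats.append({'mu': x.copy(), 'Sigma': (sigma_init ** 2) * np.eye(d), 'n': 1})
    return len(cats) - 1
```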

2.1.11 Grammatical ART

The Grammatical ART (GramART) architecture (Meuth, 2009) represents a specialized version of ART designed to work with variable-length input patterns which are used to encode grammatical structure. It builds templates while adhering to a Backus-Naur form grammatical structure (Knuth, 1964).

LTM. To allow for comparisons between variable-length input patterns, GramART uses a generalized tree representation to encode its internal categories. Each node in the tree for a category contains an array representing the distribution of the different possible grammatical symbols at that node.

Activation. The activation function for a category is defined as a parallel to fuzzy ART’s activation function (Sec. 2.1.3), but GramART defines its own operator for calculating the intersection between a category and an input pattern. A tree in GramART is defined as an ordered pair consisting of a set of nodes and a set of binary relations that describe the structure of the tree. For any two nodes of a tree:

(61)

The activation of a category in GramART is given by

(62)

where the intersection operator is defined as:

(63)

where the summed terms represent the values stored in the category’s nodes corresponding to the symbols present in the input pattern. The tree norm operator is defined as the number of nodes in the tree.

Match and resonance. The category with the highest activation value is chosen using winner-takes-all selection, and the following vigilance criterion is checked to determine whether the input pattern resonates with this category:

(64)

If this vigilance criterion is satisfied, resonance occurs and the category is allowed to learn this input pattern. Otherwise, it is reset, and the category with the next best activation is checked.

Learning. When resonance occurs, the weight of the winning category is updated using the following learning rule:

(65)

where

(66)

The weights are updated recursively down the grammar tree, and they reflect the probability of a tree symbol occurring in the node representing this particular category.

2.1.12 Validity index-based vigilance fuzzy ART

The validity index-based vigilance fuzzy ART (Brito da Silva & Wunsch II, 2017) endows fuzzy ART with a second vigilance criterion based on cluster validity indices (Xu & Wunsch II, 2009). The usage of this immediate reinforcement signal alleviates input order dependency and allows for a more robust hyper-parameterization.

LTM. This is a fuzzy ART-based architecture. Therefore, categories are hyperrectangles as described in Sec. 2.1.3.

Activation. The validity index-based vigilance fuzzy ART activation function is equal to fuzzy ART’s and thus, is computed using Eq. (16) in Sec. 2.1.3.

Match and Resonance. After a winner is selected, the first match function (M_1) is identical to fuzzy ART’s (Eq. (19) in Sec. 2.1.3), whereas the second (M_2) is defined as

(67)

which represents the penalty (or reward) incurred by assigning sample x to category J and thereby changing the current clustering state of the data set (if there is no change in assignment, then M_2 = 0). The underlying function corresponds to a cluster validity index value computed for the partition of disjoint clusters defined by the categories. The second vigilance criterion then checks whether M_2 satisfies its own vigilance threshold; in the affirmative case, the category is allowed to learn. Note that the discussion so far implies the maximization of a cluster validity index; naturally, when minimization is sought, the inequality defining the second vigilance region should be reversed. This is a greedy algorithm that selects the best clustering assignment based on immediate feedback. Naturally, performance is biased toward the data structures favored by the selected cluster validity index.

Learning. If both vigilance criteria are satisfied, then learning ensues. Otherwise, the search resumes or a new category is created. The learning rules are identical to fuzzy ART’s (Sec. 2.1.3). Note that the validity index-based vigilance fuzzy ART model learns in offline mode, given that the entire data set is used for the computation of Eq. (67).
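The second vigilance test can be sketched as follows for a validity index that is to be maximized; the cvi callable, the rho2 threshold and the label bookkeeping are illustrative assumptions about one possible implementation, not the reference code.

```python
def second_vigilance(data, labels, sample_idx, winner_category, cvi, rho2=0.0):
    """Sketch of the CVI-based second vigilance test (offline setting).

    data: the full data set; labels: current category assignments; sample_idx:
    index of the presented sample; winner_category: the resonating node;
    cvi: callable returning a validity index to be maximized; rho2: threshold.
    """
    current = cvi(labels, data)              # value for the current partition
    proposed = list(labels)
    proposed[sample_idx] = winner_category   # hypothetical assignment of the sample
    delta = cvi(proposed, data) - current    # penalty (negative) or reward (positive)
    return delta >= rho2                     # accept only if the index does not degrade
```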

2.1.13 Dual vigilance fuzzy ART

The dual vigilance fuzzy ART (DVFA) (Brito da Silva et al., 2019) seeks to retrieve arbitrarily shaped clusters with low parameterization requirements via a single fuzzy ART module. This is accomplished by augmenting fuzzy ART with two vigilance parameters, namely, an upper bound (ρ_UB) and a lower bound (ρ_LB), representing quantization and cluster similarity, respectively.

LTM. The categories of the dual vigilance fuzzy ART are hyperrectangles.

Activation. The activation function of the dual vigilance fuzzy ART is the same as fuzzy ART’s (Eq. (16) in Sec. 2.1.3).

Match and resonance. When a category is chosen by the WTA competition, it is subjected to a dual vigilance mechanism. The first match function uses the upper-bound vigilance ρ_UB in Eq. (19), whereas the second is conducted using a more relaxed constraint, i.e., it uses the lower-bound vigilance ρ_LB in Eq. (19).

Learning. If the first vigilance criterion is satisfied, then learning proceeds as in fuzzy ART (Eq. (22)). Otherwise, the second test is performed, and, if satisfied, a new category is created and mapped to the same cluster as the category undergoing the dual vigilance tests via a mapping matrix of size N × C (where N is the number of categories and C is the number of clusters). Alternately, if both tests fail, then the search continues with the next highest ranked category; if there are none left, then a new node is created and the matrix expands:

(68)

The associations between categories and clusters are permanent in this incremental many-to-one mapping (multi-prototype representation of clusters), and they enable the data structures of arbitrary geometries to be detected by dual vigilance fuzzy ART’s simple design.
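The dual vigilance decision logic for the current winner can be sketched as follows; the returned action labels and the cluster_of bookkeeping are illustrative, with the full model expanding the mapping matrix as in Eq. (68) whenever a new category is created.

```python
def dual_vigilance_decision(match_value, winner, rho_ub, rho_lb, cluster_of):
    """Sketch of DVFA's dual vigilance tests for the current winner J.

    match_value: M_J from Eq. (19); winner: index J; rho_ub >= rho_lb are the
    upper- and lower-bound vigilances; cluster_of: list mapping categories to clusters.
    """
    if match_value >= rho_ub:       # first (quantization) test: category J learns
        return 'learn', cluster_of[winner]
    if match_value >= rho_lb:       # second (cluster similarity) test: create a new
        return 'new_same_cluster', cluster_of[winner]   # category in J's cluster
    return 'reset', None            # both tests fail: inhibit J and resume the search
```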

2.2 Topological architectures

The ART models discussed in this section are designed to enable multi-category representation of clusters, thus capturing the data topology more faithfully. Generally, they are used to cluster data in which arbitrarily-shaped structures are expected (multi-prototype clustering methods).

2.2.1 Fuzzy ART-GL

Fuzzy ART with group learning (fuzzy ART-GL) model (Isawa et al., 2007) augments fuzzy ART (Sec. 2.1.3) with topology learning (inspired by neural-gas (Martinetz & Shulten, 1991; Martinetz & Schulten, 1994)) to retrieve clusters with arbitrary shapes. The code representation, LTMs and dynamics of fuzzy ART remain the same. However, when a sample is presented, a connection between the first and second resonating categories (if they both exist) is created by setting the corresponding entry of an adjacency matrix to one. This model also possesses an age matrix, which tracks the duration of such connections and whose dynamics are as follows: the entry related to the first and second current resonating categories is refreshed (i.e., set to zero) following a sample presentation, whereas all other entries related to the first resonating category are incremented by one. Connections with an age value above a certain threshold expire, i.e., they are pruned (note that the threshold varies deterministically over time). This procedure allows this model to dynamically create and remove connections between categories during learning (co-occurrence of resonating categories, thus following a Hebbian approach). Clusters are defined by groups of connected categories.
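The Hebbian edge bookkeeping just described can be sketched as follows; for simplicity the age threshold is held fixed here, whereas in fuzzy ART-GL it varies deterministically over time.

```python
import numpy as np

def update_topology(adjacency, age, first, second, max_age=50):
    """Sketch of fuzzy ART-GL's Hebbian edge bookkeeping after one presentation.

    adjacency, age: N x N integer arrays; first, second: indices of the first and
    second resonating categories (second may be None); max_age: pruning threshold.
    """
    if second is not None:
        adjacency[first, second] = adjacency[second, first] = 1   # create/keep the edge
        age[first, second] = age[second, first] = 0               # refresh its age
    for k in np.flatnonzero(adjacency[first]):                    # age the other edges
        if second is None or k != second:                         # incident to the winner
            age[first, k] += 1
            age[k, first] += 1
    expired = age > max_age                                       # prune expired connections
    adjacency[expired] = 0
    age[expired] = 0
    return adjacency, age
```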

The fuzzy ART combining overlapped category in consideration of connections (C-fuzzy ART) variant (Isawa et al., 2008a) was developed to mitigate category proliferation, which is accomplished by merging the first resonant category with another connecting and overlapping category. Another variant introduced in (Isawa et al., 2008b, 2009) augments the latter model with individual and adaptive vigilance parameters to further reduce category proliferation.

2.2.2 TopoART

Fuzzy topoART (Tscherepanow, 2010) is a model that combines fuzzy ART (Sec. 2.1.3) and topology learning (inspired by self-organizing incremental neural networks (Furao & Hasegawa, 2006)). Specifically, it features the same representation, activation/match functions, vigilance test and search/learning mechanisms as fuzzy ART, while integrating noise robustness and topology-based learning.

Briefly, the topoART model consists of two fuzzy ART-based modules (topoARTs A and B) that cluster, in parallel, the data in two hierarchical levels, while sharing the same complement coded inputs. Each category is endowed with an instance counting feature (i.e., a sample count), such that, after every fixed number of learning cycles (i.e., input presentations), categories that have encoded fewer than a minimum number of samples are dynamically removed. Once this minimum count is reached, “candidate” categories become “permanent” categories, which can no longer be deleted. In this setup, module A serves as a noise filtering mechanism for module B. The propagation of a sample to module B depends on which type of module A’s category was activated. Specifically, a sample is fed to module B if and only if the corresponding module A resonant category is “permanent”; therefore, module B will only focus on certain regions of the data space. Note that no additional information is passed from module A to B, and both can form clusters independently.

Regarding the hierarchical structure, the vigilance parameters of modules A and B are related by

ρ_b = (ρ_a + 1) / 2,    (69)

such that module B’s maximum category size is 50% smaller than module A’s (ρ_a and ρ_b are module A’s and B’s vigilance parameters, respectively), which implies that module B has a higher granularity (ρ_b > ρ_a) and thus yields a finer partition of the data set.

TopoART employs competitive and cooperative learning: not only the winner category but also the second winner is allowed to learn (naturally, both need to satisfy the vigilance criteria). The learning rates are set such that the second winner only partially learns to encode the presented sample (its learning rate is smaller than the first winner’s). If the first and second winner both exist, then they are linked to establish a topological structure. These lateral connections are permanent, unless categories are removed via the noise thresholding procedure. Clusters are formed by the connected categories, thus better reflecting the data distribution and enabling the discovery of arbitrarily-shaped data structures (topoART is a graph-based multi-prototype clustering method).

Finally, in prediction mode, the following activation function, which is independent of category size, is used:

(70)

the vigilance test is neglected, and only “permanent” nodes are allowed to be activated.

A number of topoART variants have been developed in the literature, e.g., the hypersphere topoART (Tscherepanow, 2012), which replaces fuzzy ART modules with hypersphere ARTs (Sec. 2.1.7); the episodic topoART (Tscherepanow et al., 2012) which incorporates temporal information (i.e., time variable and thus the order of input presentation) to build a spatio-temporal mapping throughout the learning process and generate “episode-like” clusters; and the topoART-AM (Tscherepanow et al., 2011), which builds hierarchical hetero-associative memories via a recall mechanism.

2.3 Hierarchical architectures

Elementary ART modules have been used as building blocks to construct both bottom-up (agglomerative) and top-down (divisive) hierarchical architectures. Typically, these follow one of two designs (Massey, 2009): (i) cascade (series connection) of ART modules in which the output of a preceding ART layer is used as the input of the succeeding one, or (ii) parallel ART modules enforcing different vigilance criteria while having a common input layer.

2.3.1 ARTtree

The ARTtree (Wunsch II et al., 1993) is a way of building a hierarchy of ART neural modules in which an input sample is sent simultaneously to every module in every level of the tree. Each node in the ART tree hierarchy is connected to one of its parent’s F2 categories, and each of the F2 categories in this node is connected to one of its children. The nodes in each layer of the tree hierarchy share a common vigilance value, and the vigilance typically increases further down the tree such that tiers of the tree that have more nodes are associated with higher vigilance values.

When an input sample is presented to the ARTtree hierarchy, all the ART nodes can be allowed to perform their match and activation functions, but only the node connected to its parent’s winning F2 category is allowed to resonate with and learn this pattern. Therefore, resonance only cascades down a single path in the ARTtree, and no other nodes outside that path are allowed to learn this sample. This can effectively allow ART to perform a type of varying-k-means clustering (Wunsch II et al., 1993).

The highly parallel nature of ARTtree lends itself well to hardware-based implementations, such as optoelectronic implementations (Wunsch II et al., 1993) and massively parallel implementations via general purpose Graphics Processing Unit (GPU) acceleration (Kim & Wunsch II, 2011). The study presented in (Kim & Wunsch II, 2011) performed this task using NVIDIA CUDA GPU hardware and an implementation of ARTtree that uses fuzzy ART units in the tree nodes. The results reported in the study show a massive speed boost for deep trees when compared to the CPU in terms of computing time, while smaller trees performed worse on the GPU due to the high data transfer penalties between the CPU and GPU memory.

2.3.2 Self-consistent modular ART

The self-consistent modular ART (SMART) (Bartfai, 1994) is a modular architecture designed to perform hierarchical divisive clustering (i.e., to represent different levels of data granularity in a top-down approach). It builds a self-consistent hierarchical structure via self-organization and uses ART 1 (Sec. 2.1.1) as elementary units. In this architecture, a number of ART modules operate in parallel with different vigilance parameter values, while receiving the same input samples and connecting in a manner that makes the hierarchical cluster representation self-consistent. These connections are such that a many-to-one mapping of specific to general categories is learned across such modules. Specifically, the hierarchy is explicitly represented via associative links between modules.

Concretely, a two-level SMART architecture can be implemented using an ARTMAP (Sec. 3.1.1) in auto-associative mode; i.e., ARTMAP is used in an unsupervised manner by presenting the same input sample to both modules A and B with different vigilance parameters and forcing a hierarchical structure by making ρ_a > ρ_b, such that module B enforces its categorization (an internal supervision) on module A.

2.3.3 ArboART

ArboART (Ishihara et al., 1995) is an agglomerative hierarchical clustering method based on ART. More specifically, it uses ART 1.5-SSS (small sample size) (Ishihara et al., 1993) (a variant of ART 1.5 (Levine & Penz, 1990), which in turn is a variation of ART 2 (Carpenter & Grossberg, 1987b)) as a building block. Briefly, the prototypes of one ART are the inputs to another ART with looser vigilance (similarity constraint). Therefore, prototypes obtained from a lower level (bottom part of the dendrogram) are fed to the next ART layer. ART modules on higher layers have progressively lower vigilance values, i.e., the similarity constraint is less strict. This enables the construction of a tree (hierarchical graph structure). One of the advantages over traditional hierarchical methods is that it does not require a full recomputation when a new sample is added; only partial recomputations are needed in ART (inside the specific clusters). ArboART uses several layers of ART as well as one-pass learning. Concretely, it makes super-clusters of previous clusters in a hierarchical way, thereby making a generalization of categories in the process.

2.3.4 Joining hierarchical ART

The joining hierarchical ART (HART-J) (Bartfai, 1996) is a hierarchical agglomerative clustering method (bottom-up approach) that uses ART 1 modules (Sec. 2.1.1) as building blocks and follows a cascade design. Specifically, each layer of this multi-layer model corresponds to an ART 1 network that clusters the prototypes generated by the preceding layer. The input of a given layer is given by: