Cognitive and Connectionist Systems Lab

Monash University, Australia

The GSOM

Past Work | Current Projects

The GSOM is a proud research endeavour of our lab. With the bility to incrementally generate nodes and define structure to match input data, the GSOM has demonstrated many practical advantages over the traditional SOM.

From human genetics to bio-informatics, from seismic activity to medical analysis, from patterns in oncology to large scale text analysis: with a record number of applications to its credit, the Growing Self Organizing Map has proven it’s applicability as a general clustering tool, but with the advantages of significantly faster processing, built in hierarchical cluster generation and input driven network shape. The benchmarks of the GSOM are,

  1. It can be used on a set of data, about which no previous information is available, to identify the clusters present (unsupervised method).

  2. The shape of the GSOM represents the grouping in the data, and therefore, such grouping has a better opportunity of attracting the attention of the data analyst for further investigation.

  3. The number of nodes required to represent a set of data (at the same level of spread) is less than that of a comparable SOM. This will result in faster processing and efficient use of computing resources.

  4. The weights of the newly generated nodes are initialized to fit in with the existing (current) map. This is possible due to the incremental addition of nodes to the network at the areas of necessity. In the SOM, the nodes are normally randomly initialized and have to be first ordered. The GSOM does not require an ordering phase, which further reduces processing time. The ordered weight initialization method also results in less chance of twisted maps.

  5. Having fewer nodes at the early stages and localized weight adaptation also results in less processing time.

  6. Hierarchical analysis of datasets with the spread factor - a primary stage of general clustering and further stages of specific clustering

The GSOM algorithm :

  1. Initialization phase:
    1. Initialize the weight vectors of the starting nodes (usually four) with random numbers between 0 and 1.
    2. Calculate the growth threshold (GT) for the given data set of dimension D according to the spread factor (SF) using the formula GT = - D \times ln ( SF )

  2. Growing Phase:
    1. Present input to the network.
    2. Determine the weight vector that is closest to the input vector mapped to the current feature map (winner), using Euclidean distance . This step can be summarized as: find q' such that \left | v - w_{q'} \right \vert \le \left | v - w_q \right \vert \forall q \in \mathbb{N} where v, w are the input and weight vectors respectively, q is the position vector for nodes and \mathbb{N} is the set of natural numbers.
    3. The weight vector adaptation is applied only to the neighborhood of the winner and the winner itself. The neighborhood is a set of neurons around the winner, but in the GSOM the starting neighborhood selected for weight adaptation is smaller compared to the SOM (localized weight adaptation). The amount of adaptation (learning rate) is also reduced exponentially over the iterations. Even within the neighborhood, weights that are closer to the winner are adapted more than those further away. The weight adaptation can be described by w_j ( k + 1 ) = \begin{cases} w_j ( k ) & \mbox{if}j \notin \Nu_{k+1} \\ w_j ( k ) + LR ( k ) \times (x_k - w_j ( k ) ) & \mbox{if}j \in \Nu_{k+1} \end{cases} where the Learning Rate LR(k), k \in \mathbb{N} is a sequence of positive parameters converging to zero as k \to \infty. wj(k), wj(k + 1) are the weight vectors of the node j before and after the adaptation and Νk + 1 is the neighbourhood of the winning neuron at the (k + 1)th iteration. The decreasing value of LR(k) in the GSOM depends on the number of nodes existing in the map at time k.
    4. Increase the error value of the winner (error value is the difference between the input vector and the weight vectors).
    5. When TE_i \; GT(where TEi is the total error of node i and GT is the growth threshold). Grow nodes if i is a boundary node. Distribute weights to neighbors if i is a non-boundary node.
    6. Initialize the new node weight vectors to match the neighboring node weights.
    7. Initialize the learning rate (LR) to its starting value.
    8. Repeat steps 2 – 7 until all inputs have been presented and node growth is reduced to a minimum level.

  3. Smoothing phase.
    1. Reduce learning rate and fix a small starting neighborhood.
    2. Find winner and adapt the weights of the winner and neighbors in the same way as in growing phase.

Publications associated with the GSOM:

Journal Publications

  1. Hsu, A., Tang, S. and Halgamuge, S. K. (2003) An unsupervised hierarchical dynamic self-organizing approach to cancer class discovery and marker gene identification in microarray data. Bioinformatics 19(16): pp 2131-2140


  2. Alahakoon, D., Halgamuge, S. K. and Sirinivasan, B. (2000) Dynamic Self Organizing Maps With Controlled Growth for Knowledge Discovery, IEEE Transactions on Neural Networks, Special Issue on Knowledge Discovery and Data Mining, 11, pp 601-614.

Conference Publications

  1. Alahakoon, D., Halgamuge, S. K. and Sirinivasan, B. (1998) A Structure Adapting Feature Map for Optimal Cluster Representation in Proceedings of The 5th International Conference on Neural Information Processing (ICONIP 98), Kitakyushu, Japan, pp 809-812


  2. Alahakoon, D., Halgamuge, S. K. and Sirinivasan, B. (1998) A Self Growing Cluster Development Approach to Data Mining in Proceedings of IEEE International Conference on Systems, Man and Cybernetics, San Diego, USA, pp 2901-2906


  3. Alahakoon, D. and Halgamuge, S. K. (1998) Knowledge Discovery with Supervised and Unsupervised Self Evolving Neural Networks in Proceedings of 5th International Conference on Soft Computing and Information/Intelligent Systems, Fukuoka, Japan, pp 907-910


  4. De Silva, L.P.D.P. and Alahakoon, D. (2006) Analysis of Seismic Activity using the Growing SOM for the Identification of Time Dependent Patterns in Proceedings of 5th International Conference on Information and Automation (ICIA), Colombo, Sri Lanka, pp 200-205

 

 

 

©2007 Monash University, Australia