Parameters

The clustering properties are set in an LMCLUSParameters instance, which is defined as follows:

type LMCLUSParameters
    min_dim::Int                     # Minimum cluster dimension
    max_dim::Int                     # Maximum cluster dimension
    number_of_clusters::Int          # Nominal number of resulting clusters
    hist_bin_size::Int               # Fixed number of bins for the distance histogram.
    min_cluster_size::Int            # Minimum cluster size
    best_bound::Float64              # Best bound
    error_bound::Float64             # Error bound
    max_bin_portion::Float64         # Maximum histogram bin size
    random_seed::Int64               # Random seed
    sampling_heuristic::Int          # Sampling heuristic
    sampling_factor::Float64         # Sampling factor
    histogram_sampling::Bool         # Sample points for distance histogram
    zero_d_search::Bool              # Enable zero-dimensional manifold search
    basis_alignment::Bool            # Manifold cluster basis alignment
    dim_adjustment::Bool             # Manifold dimensionality adjustment
    dim_adjustment_ratio::Float64    # Ratio of manifold principal subspace variance
    mdl::Bool                        # Enable MDL heuristic
    mdl_model_precision::Int         # MDL model precision encoding constant
    mdl_data_precision::Int          # MDL data precision encoding constant
    mdl_quant_error::Float64         # Quantization error of a bin size calculation
    mdl_compres_ratio::Float64       # Cluster compression ratio
    log_level::Int                   # Log level (0-5)
end

Here is a description of algorithm parameters and their default values:

| name | description | default |
|------|-------------|---------|
| min_dim | Lower bound of a cluster manifold dimension. | 1 |
| max_dim | Upper bound of a cluster manifold dimension. It cannot be larger than the dimensionality of the dataset. | |
| number_of_clusters | Expected number of clusters. Required for the sampling heuristics. | 10 |
| hist_bin_size | Number of bins for the distance histogram. If this parameter is set to zero, the number of bins is determined by the max_bin_portion parameter. | 0 |
| min_cluster_size | Minimum number of data points that a collection must contain to be considered a proper cluster. | 20 |
| best_bound | Separation best bound, used to evaluate the goodness of a separation, which is characterized by the discriminability and the depth between modes of the distance histogram. | 1.0 |
| error_bound | Sampling error bound. Determines the minimal number of samples required to correctly identify a linear manifold cluster. | 1e-4 |
| max_bin_portion | Maximum histogram bin size. Used to determine the number of histogram bins when hist_bin_size is 0. The value should be selected from the (0, 1) range. | 0.1 |
| random_seed | Random number generator seed. If not specified, the RNG is reinitialized at every run. | 0 |
| sampling_heuristic | Choice of sampling heuristic: (1) a probabilistic heuristic that samples a quantity exponential in the max_dim and number_of_clusters parameters; (2) a fixed number of points; (3) the lesser of the previous two. | 3 |
| sampling_factor | Sampling factor used by the sampling heuristics (options 2 and 3 of sampling_heuristic) to determine the number of samples as a portion of the total dataset size. | 0.01 |
| histogram_sampling | Turn on/off sampling for the distance histogram. Instead of computing the distance histogram from the whole dataset, the algorithm draws a small sample for the histogram construction, which improves performance. This parameter depends on the number_of_clusters value. | false |
| zero_d_search | Turn on/off the zero-dimensional manifold search. | false |
| basis_alignment | Turn on/off alignment of the manifold cluster basis. If it is on, the manifold basis of the generated cluster is aligned along the direction of maximum variance (by performing PCA). | false |
| dim_adjustment | Turn on/off detection of the linear manifold cluster dimensionality, based on the portion of variance associated with the principal components. | false |
| dim_adjustment_ratio | Ratio of manifold principal subspace variance. | 0.99 |
| mdl | Turn on/off the minimum description length (MDL) heuristic for validating the complexity of a generated cluster. | false |
| mdl_model_precision | MDL model precision encoding value. | 32 |
| mdl_data_precision | MDL data precision encoding value. | 16 |
| mdl_quant_error | Quantization error of the bin size calculation for the histogram that is used to determine the entropy of the empirical distance distribution. | 1e-4 |
| mdl_compres_ratio | Compression threshold value for discarding candidate clusters. | 1.05 |
| log_level | Logging level (ranges from 0 to 5). | 0 |
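
The three sampling_heuristic options can be sketched as follows. This is an illustration, not the package's code: the function names are hypothetical, and the probabilistic count N = log(error_bound) / log(1 - (1/K)^d), for K clusters and manifold dimension d, is assumed from the published description of the LMCLUS algorithm. The package's exact rounding and edge-case handling may differ.

```julia
# Hypothetical helpers for illustration; not part of the LMCLUS package API.

# Probabilistic sample count (assumed formula): N = log(eps) / log(1 - (1/K)^d)
function probabilistic_samples(error_bound::Float64, number_of_clusters::Int, dim::Int)
    number_of_clusters == 1 && return 1.0    # a single cluster needs one sample
    p = (1 / number_of_clusters)^dim         # chance that the extra points share one cluster
    return log(error_bound) / log1p(-p)      # log1p keeps precision when p is tiny
end

function sample_count(heuristic::Int, n_points::Int; error_bound=1e-4,
                      number_of_clusters=10, dim=1, sampling_factor=0.01)
    n_prob  = probabilistic_samples(error_bound, number_of_clusters, dim)
    n_fixed = sampling_factor * n_points     # fixed portion of the dataset
    n = heuristic == 1 ? n_prob :
        heuristic == 2 ? n_fixed :
        heuristic == 3 ? min(n_prob, n_fixed) :
        throw(ArgumentError("sampling_heuristic must be 1, 2, or 3"))
    # Clamp values that do not fit into an Int (heuristic 1 with a large dim).
    return n >= typemax(Int) ? typemax(Int) : round(Int, n)
end

println(sample_count(1, 10_000))             # probabilistic count, default settings
println(sample_count(2, 10_000))             # fixed portion of 10,000 points
println(sample_count(1, 10_000; dim=20))     # grows exponentially with the dimension
```

With the default values, option 1 stays small for a one-dimensional manifold, but the count grows by roughly a factor of number_of_clusters for each additional dimension, which is why option 3 caps it with the fixed portion.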

Suggestions

Particular settings could impact performance of the algorithm:

  • If you want reproducible clustering results, fix the random_seed parameter. By default, the RNG is reinitialized every time the algorithm runs.
  • If the dimensionality of the data is low, histogram sampling could speed up the calculations.
  • Value 1 of the sampling_heuristic parameter should not be used if max_dim is large, because it will generate a very large number of samples.
  • Increasing the value of the max_bin_portion parameter could improve the efficiency of the cluster partitioning, but it could also degrade the overall performance of the algorithm.
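
The effect of a fixed seed can be illustrated with Julia's Random standard library. This is only a sketch: draw_sample is a hypothetical stand-in for one sampling step, and how the package seeds its generator internally is not shown in this section.

```julia
using Random

# Hypothetical stand-in for one sampling step: pick k random point indices.
draw_sample(rng, n_points, k) = rand(rng, 1:n_points, k)

# Two generators created from the same seed produce the same draws,
# so a run that starts from a fixed seed can be repeated exactly.
a = draw_sample(MersenneTwister(42), 1000, 3)
b = draw_sample(MersenneTwister(42), 1000, 3)
println(a == b)   # prints true
```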

Parallelization

This implementation of the LMCLUS algorithm uses parallel computations during the manifold sampling stage. You need to add worker processes before executing the algorithm.
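
A minimal sketch of adding workers, assuming Julia 1.x, where addprocs lives in the Distributed standard library (in older Julia versions it was available without an import). The worker count of 2 is arbitrary, and the commented-out line assumes the LMCLUS package is installed.

```julia
using Distributed

n_before = nprocs()
addprocs(2)                       # start two local worker processes
println("workers: ", nworkers())

# Load the package on every process before clustering (assumes LMCLUS is installed):
# @everywhere using LMCLUS
```

Alternatively, starting Julia with the -p option (for example, julia -p 2) launches the workers at startup.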