Cluster analysis Documentation

General

This program is under development. Check your results carefully.

The Python script 'clusterana.py' extracts more detailed information from the data produced by the script clusterstat.py. It requires a completed or partly completed run of clusterstat.py and reads the restart file generated by clusterstat to obtain the system information. In particular, the script makes it possible to check the accuracy of the cluster expansion when selected clusters are included or neglected.

Usage

To start the analysis, type:

clusterana [-opt] input_file

Possible options can be obtained with

clusterana -h

-------------------------------------------------------------------------------
Purpose:     Statistical analysis for the cluster expansion
             of a potential surface.

Usage:       clusterana [-deb -ver -h -? ] inputfile.inp

  -deb       Run in debug mode (more detailed logging).
  -D <name>  The name directory of a clusterstat run.
  -ver       Print version info.
  -h -?      Print this help text. 

Further analysis of data produced by statistical analysis program
'clusterstat'. This requires an (at least partly) completed run
of 'clusterstat'. The system information is obtained from the restart
file of 'clusterstat'.
-------------------------------------------------------------------------------

Input Documentation

The input needed by clusterana generally follows the rules of the usual MCTDH input. It is organized in three sections:

Section	Description
RUN	What is to be done.
EXPANSION	Specification of the expansion terms. (optional)
COORDINATES	Modification of the coordinate ranges. (optional)

Example inputs can be found in $MCTDH_DIR/inputs/clusters/statistics.

RUN-SECTION

Required keywords
Keyword	Description
name = S	The 'name' directory of the previous clusterstat run and directory to which output files are written.
Optional keywords
Keyword	Description
steps = I (,I1)	The number of Monte-Carlo steps to be processed. If one value is given, the first I steps are read; if two numbers are given, steps I to I1 are read. Default: all steps will be processed.
logminmax = I (,I1)	Enable logging of the sets of samples with the largest and lowest values of the true potential, the approximated potential and the difference between them. The first number denotes the maximum number of samples to be logged; the second, if given, the minimum number of samples to be skipped before the next sample can be logged. This can be used to avoid logging all samples while the random walker is in a region of the PES that produces large errors.
count-samples = R,R1 (,S)	Count the number of samples for which the contribution of a cluster is NOT in the region between R and R1. S is a unit, i.e., one of `"cm-1"`, `"eV"`, `"meV"`, `"au"`, `"kcal/mol"` or `"kJ/mol"`.
densities = S (,S1)	Calculate 1D and/or 2D densities (histograms) for the coordinate vectors of the random walker. S and S1 can each be one of "1D" and "2D". If "1D" is set, all one-dimensional densities are calculated; if "2D" is set, all two-dimensional densities are calculated. Requires keyword "bins".
bins = I	The number of bins used in the calculation of the densities.
outunit = S	Unit in which the output is written. Default: same as used in the clusterstat run.
overwrite	Allow overwriting of existing files in the 'name' directory. Similar to option -w in the command line.
mean	Write order-dependent and reference-dependent cumulative mean error of the cluster expansion to file.
rms	Write order-dependent and reference-dependent cumulative root-mean-square error of the cluster expansion to file.
auto	Calculate the auto-correlation of the trajectory of the random walker and write it to file (to check the quality of the random walk).
cluster-statistics	Write detailed statistics for each cluster to file.
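
The reductions behind the count-samples and densities keywords can be illustrated with a minimal NumPy sketch. This is not the code used by clusterana: the function names count_outside, densities_1d and density_2d are hypothetical, the unit conversion is omitted, the treatment of values lying exactly on R or R1 is an assumption, and the normalization of the histograms is chosen here only for illustration.

```python
import numpy as np


def count_outside(contributions, lower, upper):
    """Count samples whose cluster contribution lies outside [lower, upper].

    Returns (n_below, n_above). Values equal to a bound are counted as
    inside the range (an assumption made for this sketch).
    """
    values = np.asarray(contributions, dtype=float)
    n_below = int(np.count_nonzero(values < lower))
    n_above = int(np.count_nonzero(values > upper))
    return n_below, n_above


def densities_1d(coords, bins):
    """One normalized histogram per coordinate of the random walker.

    coords has shape (n_samples, n_coordinates); bins plays the role of
    the 'bins' keyword. Returns a list of (density, bin_edges) pairs.
    """
    coords = np.asarray(coords, dtype=float)
    result = []
    for column in coords.T:
        density, edges = np.histogram(column, bins=bins, density=True)
        result.append((density, edges))
    return result


def density_2d(coords, i, j, bins):
    """Normalized two-dimensional histogram for coordinates i and j."""
    coords = np.asarray(coords, dtype=float)
    return np.histogram2d(coords[:, i], coords[:, j], bins=bins, density=True)
```

With density=True each histogram integrates to one over its bin range, which makes densities from runs of different length directly comparable.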

EXPANSION-SECTION

The EXPANSION-SECTION, if given, is used to define which of the clusters calculated in the clusterstat run are to be used to approximate the PES. Only the clusters specified here will be included in the calculation of mean and RMS values as specified in the RUN-SECTION.

For a detailed description of how to define the cluster expansion, see the documentation of clusterstat.

COORDINATES-SECTION

The COORDINATES-SECTION can be used to alter the range of the coordinates of the random walker. All samples for which the walker has left the specified range are ignored (except for the calculation of the auto-correlation function). The range can be specified in two ways. One possibility is to define a range as described in the documentation of clusterstat. A second possibility is to provide a function which tests whether the walker is in a certain region. In this case the COORDINATES-SECTION may only contain two keywords:

Keyword	Description
user-source = S	The path (relative or absolute) to the module containing the routine which tests the coordinate vector. See using the 'user-source' keyword.
routine = S	Name of the routine provided by the module given with 'user-source' which is used to test the coordinate vector.

The routine specified above receives a NumPy array (type float) containing a coordinate vector and must return True or False, indicating whether the sample belonging to that coordinate vector is valid. If True is returned, the sample is assumed to be valid and is included in the statistics; otherwise the sample is skipped.
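
A routine of this kind might look like the following minimal sketch. The function name in_region, the limit R_MAX and the criterion itself (the first two coordinates must lie inside a circle) are hypothetical and chosen only for illustration; only the interface, a float NumPy array in and True or False out, is taken from the description above.

```python
import numpy as np

# Hypothetical radius limit, chosen only for illustration.
R_MAX = 5.0


def in_region(q):
    """Return True if the sample at coordinate vector q should be kept.

    q is a one-dimensional NumPy array of floats holding the coordinates
    of the random walker for one sample.
    """
    q = np.asarray(q, dtype=float)
    # Illustrative criterion: the first two coordinates must lie within a
    # circle of radius R_MAX around the origin; other coordinates are free.
    radius = np.linalg.norm(q[:2])
    # Cast to a plain Python bool so that exactly True or False is returned.
    return bool(radius <= R_MAX)
```

The module containing such a function would be named with 'user-source' and the function itself with 'routine'.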

Output Documentation

The output generated by clusterana consists of a number of ASCII files containing different types of data. The following files are generated:

analysis.log
The log file containing time stamps and messages sent through the logging system.
analysis_input
String representation of the original input. This file is reproduced from already processed input and is not merely a copy of the original input file.
statistics
Detailed statistics for all clusters, one cluster per line. If more than one reference point is used, the numbers in this file represent the weighted sum over all reference points. The file is organized in up to 6 columns. The first column contains the tuple specifying the cluster, the second and third columns its mean contribution and root-mean-square value. If count-samples is given in the RUN-SECTION, the numbers of samples below and above the range specified with count-samples are given in columns four and five, respectively. If the cluster has been excluded from the cluster expansion, this is flagged in the last column.
The last two lines of this file contain estimates of the overall error of the cluster expansion.
statistics_ref_<i>
Same as statistics, but for one reference point only, where <i> denotes the number of the reference point. The numbers given in these files do not incorporate weighting factors. These files are only generated if more than one reference point is used.
samples_largest_<suffix>
If logminmax is set in the RUN-SECTION, this file contains the coordinate vectors and energies of the samples that produced the largest absolute numbers for the true potential (<suffix> = "exact"), the approximated potential (<suffix> = "approx") and their difference (<suffix> = "delta"). If count-samples is set in the RUN-SECTION, the most relevant clusters are also listed.
samples_lowest_<suffix>
Same as samples_largest_<suffix> but for the smallest absolute values.
deltaE
For each sample (one sample per line) the first column contains the difference between the exact potential and the approximated potential (V_exact - V_approx). The second column contains the exact potential and the third column the approximated potential.
curr_order_ana
For each sample (one sample per line) the summed contribution of all clusters of a certain order to the approximated potential. The file is organized in columns. The first column contains all contributions from the zeroth order clusters, the second column the contribution from all first order clusters, etc.
mean_order_ana
The n-th line contains the mean of the first n lines of file curr_order_ana.
rms_order_ana
The n-th line contains the root-mean-square of the first n lines of file curr_order_ana.
curr_error_order_ana
For each sample (one sample per line) the order-dependent error of the approximation. All clusters in the expansion up to a certain order are summed and subtracted from the exact potential. The file is organized in columns. The first column contains the exact potential minus all zeroth order contributions. The second column contains the exact potential minus all zeroth and first order contributions, and so on.
mean_error_order_ana
The n-th line contains the mean of the first n lines of file curr_error_order_ana.
rms_error_order_ana
The n-th line contains the root-mean-square of the first n lines of file curr_error_order_ana.
curr_ref_ana
For each sample (one sample per line) the weighted contribution of the expansion around the reference points of the approximation. The n-th column contains the contribution of the expansion around the n-th reference point.
mean_ref_ana
The n-th line contains the mean of the first n lines of file curr_ref_ana.
rms_ref_ana
The n-th line contains the root-mean-square of the first n lines of file curr_ref_ana.
curr_error_ref_ana
For each sample (one sample per line) the weighted exact potential minus the weighted approximation. All clusters around the n-th reference point are summed and subtracted from the weighted exact potential. The n-th column contains the result for the n-th reference point.
mean_error_ref_ana
The n-th line of this file contains the mean of the first n lines of file curr_error_ref_ana.
rms_error_ref_ana
The n-th line of this file contains the root-mean-square of the first n lines of file curr_error_ref_ana.
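
The relation between the curr_*, mean_* and rms_* files, and between the order contributions and the order-dependent errors, can be made concrete with a minimal NumPy sketch. It works on in-memory arrays with one sample per row and makes no assumption about the exact file layout; the function names cumulative_mean, cumulative_rms and order_errors are hypothetical and do not refer to clusterana internals.

```python
import numpy as np


def _as_rows(samples):
    """Return the samples as a 2D float array with one sample per row."""
    samples = np.asarray(samples, dtype=float)
    if samples.ndim == 1:
        samples = samples[:, None]
    return samples


def cumulative_mean(samples):
    """Row n (1-based) holds the column-wise mean of the first n rows."""
    samples = _as_rows(samples)
    counts = np.arange(1, samples.shape[0] + 1)[:, None]
    return np.cumsum(samples, axis=0) / counts


def cumulative_rms(samples):
    """Row n (1-based) holds the column-wise RMS of the first n rows."""
    samples = _as_rows(samples)
    counts = np.arange(1, samples.shape[0] + 1)[:, None]
    return np.sqrt(np.cumsum(samples**2, axis=0) / counts)


def order_errors(v_exact, order_contributions):
    """Column k holds V_exact minus the contributions of orders 0..k.

    v_exact has shape (n_samples,); order_contributions has shape
    (n_samples, n_orders) with one column per expansion order.
    """
    v_exact = np.asarray(v_exact, dtype=float)
    contributions = _as_rows(order_contributions)
    return v_exact[:, None] - np.cumsum(contributions, axis=1)
```

In this picture a mean_* or rms_* file corresponds to cumulative_mean or cumulative_rms applied to the rows of the matching curr_* file, so its last line is the estimate over all processed samples and the earlier lines show how that estimate converges.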