Cluster analysis Documentation
This program is under development. Check your results carefully.
The Python script 'clusterana.py' is used to extract more detailed
information from the data produced by the script clusterstat.py. The
program requires a completed or partly completed run of clusterstat.py.
The program reads the restart file generated by clusterstat to extract
the system information. In particular, the script allows one to check the
accuracy of the cluster expansion under consideration or neglect of
To start calculating the clusters type:
Possible options can be obtained with
Purpose: Statistical analysis for the cluster expansion
of a potential surface.
Usage: clusterana [-deb -ver -h -? ] inputfile.inp
-deb Run in debug mode (more detailed logging).
-D <name> The name directory of a clusterstat run.
-ver Print version info.
-h -? Print this help text.
Further analysis of data produced by statistical analysis program
'clusterstat'. This requires an (at least partly) completed run
of 'clusterstat'. The system information is obtained from the restart
file of 'clusterstat'.
The input needed by clusterana in general follows the rules
of the usual MCTDH input. The input is organized in three sections:
|RUN-SECTION
||What is to be done.
|EXPANSION-SECTION
||Specification of the expansion terms (optional).
|COORDINATES-SECTION
||Modification of the coordinate ranges (optional).
Example inputs can be found in
|name = S
||The 'name' directory of the previous clusterstat run and the
directory to which output files are written.
|steps = I (,I1)
||The number of Monte-Carlo steps to be processed. If one value
is given, the first I steps
are read; if two values are given, steps
I through I1 are read.
Default: all steps are processed.
|logminmax = I (,I1)
||Enable logging of the set of samples with the largest and
smallest values of the true potential, the approximated potential
and the difference between them.
The first number denotes the maximum number of samples to be logged,
the second, if given, the minimum number of samples to be skipped
before the next sample can be logged. This can be used to avoid logging
all samples while the random walker is in a region of the PES
that produces large errors.
|count-samples = R,R1 (,S)
||Count the number of samples for which the contribution of a
cluster is NOT in the region between R and R1. S is a unit, i.e.,
one of "cm-1", "eV", "meV", "au",
"kcal/mol" or "kJ/mol".
|densities = S (,S1)
|| Calculate 1D and/or 2D densities (histograms) for the
coordinate vectors of the random walker. S and S1 can be one of "1D"
and "2D". If "1D" is set, all one-dimensional densities are calculated,
if "2D" is set, all two-dimensional densities are calculated. Requires
|bins = I
||The number of bins used in the
calculation of the densities.
|outunit = S
||Unit in which the output is
written. Default: same as used in the clusterstat run.
||Allow overwriting of existing files in the 'name' directory.
Similar to option -w in the command line.
||Write order-dependent and reference-dependent cumulative mean
error of the cluster expansion to file.
||Write order-dependent and
reference-dependent cumulative root-mean-square error of the cluster
expansion to file.
||Calculate the auto-correlation
of the trajectory of the random walker and write it to file
(to check the quality of the random walk).
||Write detailed statistics for
each cluster to file.
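The 1D and 2D densities described for the 'densities' and 'bins' keywords are histograms over the coordinate vectors of the random walker. The following is a minimal sketch of that idea, not the code used by clusterana: the function names and the array layout (one sample per row, one coordinate per column) are assumptions, and whether clusterana normalizes the histograms is not stated, so raw counts are returned here.

```python
import numpy as np


def densities_1d(samples, bins):
    """Return one (counts, bin_edges) pair per coordinate (assumed layout)."""
    samples = np.asarray(samples, dtype=float)
    return [np.histogram(samples[:, i], bins=bins)
            for i in range(samples.shape[1])]


def densities_2d(samples, bins):
    """Return {(i, j): (counts, x_edges, y_edges)} for all pairs i < j."""
    samples = np.asarray(samples, dtype=float)
    ndim = samples.shape[1]
    result = {}
    for i in range(ndim):
        for j in range(i + 1, ndim):
            result[(i, j)] = np.histogram2d(samples[:, i], samples[:, j],
                                            bins=bins)
    return result
```

For a walker with three coordinates this yields three 1D histograms and three 2D histograms, each with the requested number of bins per axis.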
The EXPANSION-SECTION, if given, is used to define which of the clusters
calculated in the clusterstat run are to be used to approximate the
PES. Only the clusters specified here will be included in the
mean and RMS values as specified in the RUN-SECTION.
For a detailed description of how to define the cluster expansion, see
the documentation of clusterstat.
The COORDINATES-SECTION can be used to alter the coordinate range of the
random walker. All samples for which the walker has left the range
are ignored (except for calculation of the auto-correlation function).
The range can be specified in two ways. One possibility is the definition of
as in the documentation of
clusterstat. A second possibility is providing a function that
tests if the walker is in a certain region. In this case the
COORDINATES-SECTION may only contain two keywords:
|user-source = S
||The path (relative or absolute) to the module containing the
routine which tests the coordinate vector. See
using the 'user-source' keyword.
|routine = S
||Name of the routine provided by
the module given with 'user-source' which is used to test the
coordinate vector.
The routine specified above receives a NumPy array (type float) containing
a coordinate vector and must return
True or False, indicating the validity of the sample
belonging to the coordinate vector. If True is returned the
sample is assumed to be valid and included in the statistics, otherwise
the sample is skipped.
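A minimal sketch of such a routine is shown below. Only the interface (a NumPy float array in, True or False out) is taken from the description above; the function name 'inside_region', the three-dimensional coordinate vector and the bounds are hypothetical.

```python
import numpy as np

# Hypothetical bounds for a three-dimensional coordinate vector.
LOWER = np.array([-1.0, -1.0, 0.5])
UPPER = np.array([1.0, 1.0, 3.0])


def inside_region(q):
    """Return True if the coordinate vector q lies inside the box."""
    q = np.asarray(q, dtype=float)
    # Return a plain Python bool rather than a NumPy bool.
    return bool(np.all((q >= LOWER) & (q <= UPPER)))
```

With this module, 'user-source' would point to the file containing the function and 'routine' would be set to inside_region.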
The output generated by clusterana consists of a number of ASCII
files containing different types of data. The following files are generated:
The log file containing time stamp and messages sent through the
String representation of the original input. This file is reproduced
from already processed input and is not merely a copy of the original
Detailed statistics for all clusters, one cluster per line. If more
than one reference point is used the numbers in this file represent the
weighted sum over all reference points. The file is organized in up to
6 columns. The first column contains the tuple specifying the cluster,
the second and third column its mean contribution and root-mean-square
value. If count-samples is given in the RUN-SECTION, the number of
samples below and above the range specified with count-samples
are given in columns four and
five, respectively. If the cluster had been excluded from the cluster
expansion, this is flagged in the last column.
The last two lines of this file contain estimates of the overall
error of the cluster expansion.
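The per-cluster numbers described above (mean contribution, root-mean-square value and the optional counts below and above the count-samples range) can be illustrated with a short sketch. This is not the clusterana implementation: the function name is hypothetical, the RMS is assumed to be the plain root-mean-square of the contributions (not a standard deviation), and weighting over reference points is left out.

```python
import numpy as np


def cluster_statistics(contrib, lower=None, upper=None):
    """Mean, RMS and optional out-of-range counts for one cluster.

    contrib: contribution of one cluster for each sample (1D).
    lower, upper: optional range in the same unit as contrib, playing
    the role of R and R1 of count-samples.
    """
    contrib = np.asarray(contrib, dtype=float)
    mean = float(contrib.mean())
    rms = float(np.sqrt(np.mean(contrib ** 2)))
    if lower is None or upper is None:
        return mean, rms
    n_below = int(np.count_nonzero(contrib < lower))
    n_above = int(np.count_nonzero(contrib > upper))
    return mean, rms, n_below, n_above
```

The two counts correspond to columns four and five of the file; samples inside the range are not counted.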
Same as statistics, but for one reference point only, where <i>
denotes the number of the reference point. The numbers given in these
files do not incorporate weighting factors. These files are only
generated if more than one reference point is used.
If logminmax is set in the RUN-SECTION this file contains the
coordinate vectors and energies of the samples that produced the
largest absolute numbers for the true potential (<suffix>
= "exact"), the approximated potential (<suffix>
= "approx") and their difference (<suffix> =
"delta"). If count-samples is set in the RUN-SECTION also the most
relevant clusters are listed.
Same as samples_largest_<suffix> but for the smallest
For each sample (one sample per line) the first column contains the
difference between the exact potential and the approximated potential
(Vexact - Vapprox).
The second column contains the exact potential
and the third column the approximated potential.
For each sample (one sample per line) the summed contribution of all
clusters of a certain order to the approximated potential. The file is
organized in columns. The first column contains all contributions from
the zeroth order clusters, the second column the contribution from all
first order clusters, etc.
The n-th line contains the mean of the first n lines of file curr_order_ana
The n-th line contains the root-mean-square of the first n lines of
For each sample (one sample per line) the order-dependent error of the
approximation. All clusters in the expansion up to a certain order are
summed and subtracted from the exact potential. The file is organized
in columns. The first column contains the exact potential minus all
zeroth order contributions. The second column contains the exact
potential minus all zeroth and first order contributions, and so on.
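The column layout described above amounts to subtracting a running sum over orders from the exact potential. A minimal sketch, assuming the per-order contributions are available as an array with one row per sample and one column per order (the function name and array layout are assumptions, not part of clusterana):

```python
import numpy as np


def order_dependent_error(v_exact, order_contrib):
    """Exact potential minus the expansion truncated at each order.

    v_exact: shape (n_samples,), exact potential for each sample.
    order_contrib: shape (n_samples, n_orders); column k holds the
    summed contribution of all clusters of order k.
    Returns shape (n_samples, n_orders); column k is the error when
    orders 0 through k are included.
    """
    v_exact = np.asarray(v_exact, dtype=float)
    order_contrib = np.asarray(order_contrib, dtype=float)
    # Running sum over the order axis gives the truncated expansions.
    truncated = np.cumsum(order_contrib, axis=1)
    return v_exact[:, None] - truncated
```

For a converging expansion the magnitude of the entries is expected to shrink from left to right within each row.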
The n-th line contains the mean of the first n lines of file curr_error_order_ana.
The n-th line contains the root-mean-square of the first n lines of
For each sample (one sample per line) the weighted contribution of the
expansion around the reference points of the approximation. The n-th
column contains the contribution of the expansion around the n-th
reference point.
The n-th line contains the mean of the first n lines in file curr_ref_ana.
The n-th line contains the root-mean-square of the first n lines in file
For each sample (one sample per line) the weighted exact potential
minus the weighted
approximation. All clusters around the n-th reference point are summed
and subtracted from the weighted exact potential. The n-th column
contains the result for the n-th reference point.
The n-th line of this file contains the mean of the first n lines of
The n-th line of this file contains the root-mean-square of the first n
lines of file curr_error_ref_ana
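The cumulative files described above all follow the same pattern: line n holds the mean or root-mean-square of the first n lines of a per-sample file. A minimal sketch of that running computation, assuming the per-sample data has been loaded into a NumPy array with one sample per row (the function name is hypothetical and this is not the clusterana implementation):

```python
import numpy as np


def running_mean_and_rms(values):
    """Running mean and RMS along the sample axis (axis 0).

    values: shape (n_samples,) or (n_samples, n_columns).
    Row n of each result uses the first n+1 rows of values.
    """
    values = np.asarray(values, dtype=float)
    # Sample counts 1..n, shaped so they broadcast over any columns.
    counts = np.arange(1, values.shape[0] + 1)
    counts = counts.reshape((-1,) + (1,) * (values.ndim - 1))
    mean = np.cumsum(values, axis=0) / counts
    rms = np.sqrt(np.cumsum(values ** 2, axis=0) / counts)
    return mean, rms
```

Plotting these running values against n gives a quick impression of whether the statistics have converged with the number of samples.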