Clusterstat Documentation

General

The Python script 'clusterstat.py' is used perform a random walk over a potential energy surface and generate raw data concerning a particular cluster expansion defined in the input file.

Within the cluster-expansion a high-dimensional potential V is expanded into n-particle interaction terms, called the clusters. (Elsewhere, this expansion is also refered to as 'HDMR' or 'n-mode representation'.) The raw data can be further analyzed with the Python script 'clusterana.py'. The program is executed in three stages:

Initialization
The input file is parsed and checked for consistency and the hierarchy of clusters is established.
Random walk
After the initialization, if required, a random walk using a Metropolis algorithm is performed to obtain a trajectory of coordinate values. Within the Metropolis algorithm, at each step a random vector ΔQ of the length specified in the RUN-SECTION is generated and added the present coordinate vector Q to generate the trial configuration. The trial configuration Q' is then tested against a random number r in the interval [0,1] via p=exp(-β(V(Q') - V(Q))), where V(Q) is a probe potential provided by the user. If r < p the step is accepted, leading to the new configuration Q' --> Q, otherwise the step is rejected and the procedure is repeated with a different random ΔQ.
Calculation of clusters
After the random walk is completed, the clusters are generated and saved to file for each point in the trajectory. This is the most time-consuming part of the run. Also some basic statistics are written.

Remark:
The script relies on many features of the NumPy/SciPy package which is not installed by default on many platforms. Please make sure that NumPy 1.1 or later is installed. (Note: for Apple computers the Numpy/SciPy developers strongly recommentd using the official Python distribution and not the version shipped with Mac OS X.)

Usage

To start generating statistical data type:

clusterstat [-opt] input_file

Possible options can be obtained with

clusterstat -h

-------------------------------------------------------------------------------
Purpose:     Statistical analysis for the cluster expansion
             of a potential surface.

Usage:       clusterstat [-mnd -c -D -deb -w -ver -h -? ] inputfile.inp

  -mnd       Make name directory
  -c         Continuation run
  -n <int>   Perform <int> steps (ignore value in input file)
  -t <file>  Use a trajectory stored in <file>
  -D <dir>   Denotes the directory where files are written to
             (name in .inp file ignored).
  -deb       Run in debug mode (more detailed logging)
  -w         Allow to overwrite existing data.
  -ver       Print version info
  -h -?      Print this help text.
-------------------------------------------------------------------------------

Note: Continuation runs (-c option) can be used to re-start a crashed or canceled job. If -c is given, the program attempts to read a previously created restart-file in the name directory and reads the current state.

Remark on restart files: restart files are produced with the Python cPickle module and can be read and edited manually in a Python shell.

Input Documentation

The input needed by cluster_stat in general follows the rules of the usual MCTDH input. The input is organized in five sections:

Section	Description
RUN	What is to be done.
COORDINATE	Definition of the physical coordinates and their ranges.
MODES	Definition of logical coordinates, i.e., the mode-combination scheme (similar to the definition of single-particle functions).
REFERENCE	Reference points around which the PES is expanded.
EXPANSION	Specification of the expansion terms.

Example inputs can be found in $MCTDH_DIR/inputs/clusters/statistics.

RUN-SECTION

Within the RUN-SECTION, basic job parameters are defined. This includes file names and path information, output to be generated as well as the setup of the Monte Carlo algorithm. The following keywords can be used:

Required keywords
Keyword	Description
name = S	The 'name' directory to which output files are written.
user-source = S	The path (relative or absolute) to the module containing the potential energy routine and, if required, a probe routine and a switching function. See using the 'user-source' keyword.
steps = I	The number of steps of the random walker.
stepnorm = R	The size, i.e, the Euclidean norm, of a step of the random walker
Optional keywords
Keyword	Description
title = S	The title of the calculation.
potential-routine = S	The name of the routine that evaluates the PES for a given coordinate vector (see below). The routine must be found in the module set by the 'user-source' keyword (see above). Default: S = `"potential"`
probe-routine = S	The name of the routine that evaluates a probe function for a given coordinate vector (see below). This routine can be different from the PES routine (see below). The routine must be found in the module set by the 'user-source' keyword (see ). Default: the same routine set by the 'potential-routine' keyword
temperature = R(,S)	The 'temperature' used in the Monte-Carlo process to calculate the trajectory of the random walker. S can be a unit (see 'outunit' keyword for possible units). Default unit is `"au"` . Note: there is no multiplication with the Boltzmann constant. The number given here really is `k_BT = β^-1` already. This keyword can be omitted if 'trajectory' is set.
outunit = S	By default, all output is written in atomic units (`"au"`). It can, however, be changed to other units. 'outunit' can be one of `"cm-1"`, `"eV"`, `"meV"`, `"au"`, `"kcal/mol"` or `"kJ/mol"`.
trajectory = S	Equivalent to option `-t`. The name of or path to a file containing a trajectory. If `trajectory` is given, no random walk is performed but the coordinates are obtained from this file. Format: ASCII, one set of whitespace separated coordinate values (float) per line. Optional additional column containing a weight. See `coordinates` file in output documentation.
overwrite	Allow overwriting of existing files in the 'name' directory. Similar to option -w in the command line.
mean	While generating the cluster-data calculate the cumulative and order-dependent mean error of the expansion. See `mean_error_order` file in output documentation.
rms	While generating the cluster-data calculate the cumulative and order-dependent root-mean-square of the error of the expansion. See `rms_error_order` file in output documentation.
walk-only	Do not generate any data concerning the cluster-expansion but only perform a random walk following the probe-function.
ignore-weights	If `trajectory = S` or comand line option `-t` is set, possibly existing weight factors in the trajectory file are ignored and all coordinates are weighted equally.

The potential-routine and probe-routine keywords

The potential-routine keyword is used to specify the name of the PES routine in the module specified as 'user-source'. The subroutine receives a 1D numpy array (type float) containing a coordinate vector and must return a single float number, the potential energy for the given coordinates in atomic units. The program assumes that the result is in atomic units.

The probe-routine keyword is used to specify the name of a probe routine. The probe routine is only used to determine the trajectory of the random walker during the Monte-Carlo process. I.e, the trajectory of the random walker follows the Boltzmann distribution of the probe function (~exp(-βV_probe)). If the probe-function is not specified, the routine specified by the potential-routine keyword is used.

Simple example of a PES subroutine:

#!/usr/bin/env python
#
# file: my_pes.py

import numpy

def potential(Q):
    "My PES routine - just a harmonic oscillator"

    return numpy.dot(Q,Q)/2

COORDINATES-SECTION

The COORDINATES-SECTION serves a similar purpose as the PRIMITIVE-BASIS-SECTION of MCTDH. Here the coordinates of the system are defined and given a label and a range.

The coordinates are set up one per line, starting with the coordinate label, followed by the start of the coordinate range and its end.

In addition to the range, also a scaling factor for the random walker can be added. Since the step size of the random walker is always of the same length, one can add a weighting factor for each component after the range (default is unity). If the weighting factor is set to zero, the initial value of coordinate is not altered.

Note: The initial position of the random walker is set to the first reference point.

MODES-SECTION

The MODES-SECTION is used to specify mode combinations, i.e., logical coordinates which are used to generate the clusters. In general, this is very similar to the SINGLE-PARTICLE-SECTION within the MCTDH main program. Within the MODES-SECTION one particle is specified per line by comma separated listing of coordinate labels as specified in the PRIMITIVE-BASIS-SECTION.

Example

MODES-SECTION
 Q1, Q2           # mode 1
 Q3               # mode 2
 Q5, Q4           # mode 3
 Q6               # mode 4
end-modes-section

REFERENCE-SECTION

The cluster expansion is performed around a number of reference points. In most cases one reference point might be sufficient, however, e.g., in some systems the definition of multiple reference points might be helpful (e.g., systems with large amplitude motions).

Within the REFERENCE-SECTION, these reference points are specified one per line between the begin-reference and end-reference statements. The first entry in a line is the label of the coordinate as specified in the COORDINATES-SECTION. The label is followed by whitespace separated coordinate values for each reference point. Note: The initial position of the random walker is set to the first reference point.

Other keywords
In the case of multiple reference points additional keywords can be used to define how the clusters are summed. By default all clusters will be summed with the same relative weight.

Optional keywords

Keyword Description

weights = R,R1,... The clusters for each reference point are summed together with weights R,R1, etc. If weights are given, there must be as many weights as reference points. weights cannot be negative and the sum must be unity. If neither weights nor a switching function are given, all weights are set to 1/N where N is the number of reference points.

switch-function = S Alternatively to the weights keyword, a switching function can be used to generate coordinate dependent weights. S is the name of a routine in the module specified by the 'user-source' keyword in the RUN-SECTION (see below).

Using a switching function
Using a user-defined switching function is similar to defining the potential energy routine. A routine (or other callable object) of the name set by the 'switch-function' keyword must be found in the module or library defined by the 'user-source' keyword in the run-section. The switching function receives a complete coordinate vector containing the currunt position of the random walker (1D numpy array, type float) and must return a 1D float array (or any other ordered sequence) of weights. The array must contain as many entries as there are reference points. There must not be negative elements and the sum of all elements must be unity.

Example: two reference points connected by a switching function

REFERENCE-SECTION
  begin-reference
  # label   ref1  ref2
    Q1     0.0   0.0
    Q2     0.0   0.0
    Q3     0.0   0.0
    Q4    -1.0   1.0
    Q5     1.0  -1.0
    Q6     1.0   1.0
  end-reference
  switch-function = my_switch_routine
end-reference-section

EXPANSION-SECTION

The EXPANSION-SECTION is used to specify the cluster expansion. Clusters are specified by tuples of either coordinate labels or mode numbers. Multiple tuples can be defined in one row of the EXPANSION-SECTION in which case they have to be comma separated. The expansion can also be distributed over several lines.

Complete modes are specified by integers, while single coordinates are specified by their labels as given within the COORDINATE-SECTION or PRIMITIVE-BASIS-SECTION. The integer defining a mode is determined by its position in the MODES-SECTION, i.e., the first mode defined in the MODES-SECTION is adressed by the integer 1, the second by 2, etc. For instance, the integer 3 will refer to mode three containing the coordinates 'Q4' and 'Q5' in the example above.

A tuple is a list of mode integers and/or coordinate labels enclosed in round brackets. For example the tuple (1,2,Q4) defines a cluster containing the coordinates 'Q1', 'Q2', 'Q3' and 'Q4', where the first two modes are completely taken into account by specification through integers, but only one coordinate ('Q4') of mode '3' is included. The same cluster could of course also be specified by the tuple (Q1,Q2,Q3,Q4).

Ordering of the tuples
Within a tuple, the mode number (whether or not a mode is specified by its integer or only parts of it are included by a coordinate label) must always increase from left to right so that the smallest mode number is in the most left position. Also, a mode cannot be specified multiply, for example when a mode integer and a coordinate label are used. Within the example above, the tuples (2,1,3) and (2,3,Q5) are invalid; the first because of the order of integers, the second because of double definition of the coordinate label 'Q5' as it is part of mode 3.

The empty tuple
The empty tuple () is used for clusters with no coordinate dependence, i.e, a singe (reference) point (see REFERENCE-SECTION).

Using wildcards
If a lage number of clusters is to be calculated, specification of all single clusters can be cumbersome. The character '*' can be used as a wildcard. Wildcards always represent mode integers and cannot be used for coordinate labels. Depending on the position within a tuple a wildcard is expanded to all possible mode integers for this particular position.

Examples using the mode definition above:

The tuple (*,3,*) is expanded into (1,3,4),(2,3,4).

The tuple (*,Q5) is expanded into (1,Q5),(2,Q5).

The tuple (*,*,4) is an acronym for the tuples (1,2,4), (1,3,4) and (2,3,4).

The tuple (*,*,2) is invalid as the first mode integer after the wildcards is too small.

The tuple (*,*) is expanded to tuples (1,2),(1,3),(1,4),(2,3),(2,4),(3,4).

Example 1
Expansion to second order around a single point:

EXPANSION-SECTION
 (),(*),(*,*)
end-expansion-section

Example 2:
Full expansion to first order around logical coordinate 4 plus two additional 2nd-order clusters:

EXPANSION-SECTION
 (4)
 (*,4)
 (*,3,4)
end-expansion-section

Output Documentation

Within the different stages of the calculation, different output files are generated or updated within the name directory.

1. Initialization

In the very beginning the program opens a log file clusterstat.log which remains open until termination of the program. It recieves all messages sent through the logging system and contains time-stamps, log levels and messages.

Also in the very beginning a file input containing a string representation of the original input is generated. This file is reproduced from allready processed input and is not merely a copy of the original input file.

In Addition, the restart file is generated. This file is continuously updated to allow restarting the calculation after a crash or manual interruption. Note: The restart file is generated using the Python built-in cPickle module and can be easily inspected and altered in an interactive Python shell (see cPickle documentation).

2. Random walk

After the initialization, if required, a random walk using a Metropolis algorithm is performed. During the random walk the file coordinates is generated. It contains the coordinate vector for each accepted Monte-Carlo step in ASCII format, one vector per line with whitespace separated floating point numbers. Also generated at this stage is the file probe. The n-th line of this file contains the energy returned by the probe function of the n-th accepted Monte-Carlo step, the mean of the first n energies and the root-mean-square of the first n energies. These files are not generated if the trajectory is read from file.

3. Calcululation of the Clusters

When the trajectory is generated, the clusters are calculated for each coordinate vector and reference point. The following files are created (all ASCII and whitespace separated for multiple columns):

clusters_ref_<i>
The files containing the clusters were <i> is the number of the reference point. All clusters for one coordinate vactor are stored in one line, separated by whitespaces. These files can become very large.
energies
The n-th line contains the true energy of the n-th Monte-Carlo step, the mean of the first n energies and the root-mean-square of the first n energies.
ratio
The n-th line contains the average acceptance rate intil the n-th Monte Carlo step in percent.
switch
If a switching function is used, this file contains the weights vor each reference point according to the switching function.
curr_error_order
The order-dependent error of the approximation for the n-th coordinate set. Each column contains the true potential minus the cumulated sum of all clusters to a certain order. The order of a cluster is determined by the number of modes it contains, regardless of wether all coordinates of a mode are included or just a subset. Clusters that only contain the reference point are of oder zero. The first column of this file contains the true potential minus all (weighted) clusters of order zero, the second column contains the true potential minus all zero-th and first order clusters, etc.
mean_error_order
The n-th line of this file contains the mean of the first n lines of the file curr_error_order.
rms_error_order
The n-th line of this file contains the root-mean-square of the first n lines of the file curr_error_order.