Introduction
Multi-locus sequence typing (MLST) is a widely used method for characterizing bacterial isolates. It has been applied to over 50,000 isolates in over 50 different species, and its results are freely available from the pubMLST website. This is the homepage of SimMLST, a coalescent method to jointly simulate MLST data and the clonal genealogy that gave rise to the sample. Such simulations are useful for interpreting real datasets.
Running SimMLST from the Graphical User Interface
Running the graphical user interface (GUI) of SimMLST can be done simply by starting the program without any argument (or, in a Desktop environment such as Windows or KDE, by double-clicking on the icon of the program). Once loaded, the GUI looks like this:
The first five boxes specify the parameters of the simulations and correspond respectively to the -N, -B, -T, -R and -D options of the command line. The population size model can be specified using a combination of -C and -E arguments as detailed below. The last four boxes indicate which files should be generated by SimMLST, and correspond respectively to the -o, -l, -c and -d options of the command line. Finally, the checkbox stores the ancestral material in the DOT graph and corresponds to the -a option of the command line.
Running SimMLST from the command line
SimMLST can be run from the command line using any combination of the following options:
| Option | Description |
|---|---|
| -N NUM | Sets the number of isolates (the default is 100) |
| -T NUM | Sets the value of theta, the scaled mutation rate (the default is 100) |
| -R NUM | Sets the value of rho, the scaled recombination rate (the default is 100) |
| -D NUM | Sets the value of delta, the mean size of imports (the default is 500) |
| -B NUM,...,NUM | Sets the number and length of the fragments (the default is 400,400,400,400,400,400,400) |
| -C T,N | Sets the population size constant and equal to N before time T (see below) |
| -E T,R | Sets the population size exponentially growing with rate R before time T (see below) |
| -s NUM | Uses the given seed to initialize the random number generator (by default the seed is generated from /dev/urandom on Linux systems and from the clock on Windows systems) |
| -o FILE | Exports the data to the given file in XMFA format (see below) |
| -c FILE | Exports the clonal genealogy to the given file in the Newick format (see below) |
| -l FILE | Exports the local trees to the given file in the seq-gen format (see below) |
| -d FILE | Exports the graph of ancestry as a DOT graph to the given file (see below) |
| -a | Includes the ancestral material in the DOT graph (see below) |
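As an illustration of how the options above combine, here is a minimal Python sketch that assembles an argument list for one run. The executable name `simmlst`, the function `build_simmlst_command` and the output file names are assumptions made for this example; only the option letters and their defaults come from the table above.

```python
import shlex


def build_simmlst_command(
    executable="simmlst",      # hypothetical binary name; adjust to your installation
    isolates=100,              # -N
    theta=100,                 # -T
    rho=100,                   # -R
    delta=500,                 # -D
    fragments=(400,) * 7,      # -B
    seed=None,                 # -s
    data_file=None,            # -o (XMFA)
    genealogy_file=None,       # -c (Newick)
    local_trees_file=None,     # -l (seq-gen format)
    dot_file=None,             # -d (DOT graph)
    ancestral_material=False,  # -a
):
    """Return the argument list for one SimMLST run, using only documented options."""
    args = [
        executable,
        "-N", str(isolates),
        "-T", str(theta),
        "-R", str(rho),
        "-D", str(delta),
        "-B", ",".join(str(length) for length in fragments),
    ]
    if seed is not None:
        args += ["-s", str(seed)]
    outputs = (
        ("-o", data_file),
        ("-c", genealogy_file),
        ("-l", local_trees_file),
        ("-d", dot_file),
    )
    for flag, path in outputs:
        if path is not None:
            args += [flag, path]
    if ancestral_material:
        args.append("-a")
    return args


if __name__ == "__main__":
    command = build_simmlst_command(isolates=50, seed=42, data_file="sim.xmfa")
    # Print the command instead of running it, since the binary name is an assumption.
    print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run` once the executable name matches the local installation. Fixing the seed with -s makes a run repeatable.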
Specifying a population size model
In the graphical user interface as in the command line, the user can specify a population size model using a combination of -E and -C arguments. By default, the population size is assumed to be constant, which is equivalent to using the argument -C 0,1. The -C and -E options can be used to specify any population size that is piecewise constant or exponential.
- The -C T,N option indicates a constant piece before time T, during which the population was of size N relative to the present-day population size. A negative value for N indicates that the population size remains constant at the value it has reached at time T.
- The -E T,R option indicates an exponential piece before time T where the rate of growth was R. A positive value of R indicates exponential growth and a negative value exponential decline of the population size.
Each piece lasts until the beginning of the next piece. For example, an exponentially growing population size of rate 10 can be specified using -E 0,10. A bottleneck model where the population size was reduced to a tenth between time 0.2 and 0.5 can be specified using -C 0.2,0.1 -C 0.5,1. The -C and -E arguments of SimMLST correspond respectively to the -eN and -eG arguments of Hudson's ms program.
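To make the piecewise rules concrete, the sketch below evaluates the relative population size at a given time for a list of -C and -E pieces. It is an illustration under stated assumptions, not SimMLST's own code: time is taken to run backwards from the present, an exponential piece is taken to follow the convention of ms (size multiplied by exp(-R·Δt) going back in time), and a negative N is taken to freeze the size reached at time T. The function name `relative_population_size` and the tuple encoding of the pieces are hypothetical.

```python
import math


def relative_population_size(t, pieces):
    """Relative population size at time t, measured backwards from the present.

    pieces: iterable of ("C", T, N) or ("E", T, R) tuples mirroring -C T,N and -E T,R.
    """
    size_at_start = 1.0  # default model: constant size, as with -C 0,1
    start = 0.0          # time at which the active piece begins
    rate = 0.0           # growth rate of the active piece (0 for a constant piece)
    for kind, piece_start, value in sorted(pieces, key=lambda p: p[1]):
        if piece_start > t:
            break
        # Size reached when the previous piece ends and this one begins.
        size_here = size_at_start * math.exp(-rate * (piece_start - start))
        if kind == "C":
            # Assumption: a negative N keeps the size reached at time T.
            size_at_start = size_here if value < 0 else value
            rate = 0.0
        elif kind == "E":
            size_at_start = size_here
            rate = value
        else:
            raise ValueError(f"unknown piece type: {kind!r}")
        start = piece_start
    return size_at_start * math.exp(-rate * (t - start))


if __name__ == "__main__":
    # Bottleneck from the text: -C 0.2,0.1 -C 0.5,1
    bottleneck = [("C", 0.2, 0.1), ("C", 0.5, 1)]
    for t in (0.1, 0.3, 0.7):
        print(t, relative_population_size(t, bottleneck))
```

Under these assumptions, the bottleneck model gives a relative size of 1 before time 0.2, 0.1 between 0.2 and 0.5, and 1 again afterwards.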
The Graphical User Interface allows visualisation of the population dynamics specified by a combination of arguments. For example, the arguments "-C 0.2,0.1 -C 0.5,1 -E 1,-1 -C 2,-1" are represented as:

Output Format
Depending on user choices, SimMLST produces four types of output files.
- The most important one is the simulated MLST data, which is in the eXtended Multi-Fasta Alignment (XMFA) format. It is basically the concatenation of a FASTA file for each simulated gene fragment, with an "=" sign separating the fragments. It is the same format as the input file of ClonalFrame.
- SimMLST can also output in a separate file the clonal genealogy, in the Newick format.
- SimMLST can also export the local trees contained in a dataset, in the input format of the program seq-gen. This format is simply a list of trees in the Newick format, each of which is preceded by a number in square brackets indicating the number of sites that share a given local tree.
- SimMLST can also generate a full description of the graph representing the ancestry of a sample, in the DOT language. This can be used in conjunction with the DOT program to produce publication-quality figures illustrating the ancestry process. The user can specify in the options whether the ancestral material should be shown on the figure. Here is an example of a graph where the ancestral material is not shown:
and here is the same graph with the ancestral material shown:
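The XMFA and local-tree formats described in the list above are simple enough to read with a few lines of Python. The sketch below is a hedged illustration, not part of SimMLST: the function names `parse_xmfa` and `parse_local_trees` are hypothetical, the sample input is made up for the example, and the parsers assume that sequence names are unique within a fragment and that each local tree sits on its own line.

```python
import re


def parse_xmfa(text):
    """Return a list of fragments; each fragment maps a FASTA header to its sequence."""
    fragments, current, header = [], {}, None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("="):
            # An "=" line closes the current fragment block.
            if current:
                fragments.append(current)
            current, header = {}, None
        elif line.startswith(">"):
            header = line[1:].strip()
            current[header] = ""
        elif header is not None:
            # Sequence data may span several lines.
            current[header] += line
    if current:
        fragments.append(current)  # tolerate a missing final "="
    return fragments


# "[number of sites]" followed by a Newick tree ending in ";"
LOCAL_TREE = re.compile(r"^\[(\d+)\]\s*(.+;)\s*$")


def parse_local_trees(text):
    """Return (number_of_sites, newick_string) pairs from a seq-gen style tree list."""
    trees = []
    for line in text.splitlines():
        match = LOCAL_TREE.match(line.strip())
        if match:
            trees.append((int(match.group(1)), match.group(2)))
    return trees


if __name__ == "__main__":
    # Made-up toy input, only to show the expected shape of each file.
    toy_xmfa = ">0\nACGT\n>1\nACGA\n=\n>0\nTTGACC\n>1\nTTGACA\n=\n"
    toy_trees = "[250]((0:0.1,1:0.1):0.2,2:0.3);\n[150]((0:0.2,2:0.2):0.1,1:0.3);\n"
    print(len(parse_xmfa(toy_xmfa)), "fragments")
    print(parse_local_trees(toy_trees))
```

One quick consistency check on real output is to compare the sum of the site counts in the local-tree file with the total fragment length given to -B.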