/* GELSTATS source code and documentation copyright 1995 and 1996
Stephan Pelikan, Steven H. Rogstad and the University of Cincinnati.

We're giving this program away. You alone must decide if it is correct
and suitable for your purposes. If you do use it, you do so at your own
risk. You may redistribute the source code or executable versions of
this program provided that you 1) include this copyright notice, 2) do
not charge money for it, and 3) distribute the program without
modifications. */

This README file is to accompany the program GELSTATS version 2.6
built 28 October 1996

0. INTRODUCTION
I. PREPARING A DATA FILE
II. RUNNING THE PROGRAM
III. UNDERSTANDING THE OUTPUT
IV. BUILDING GELSTATS
V. THANKS
VI. BIBLIOGRAPHY
VII. APPENDIX
VIII. OUTPUT FROM SAMPLE.DAT

0. INTRODUCTION

a. General Remarks
b. Referencing GELSTATS
c. The Authors
d. Obtaining GELSTATS

As you can see, each section of this document starts with an outline of
the subsections it contains.

a. General Remarks

The program GELSTATS is designed to help you make inferences based on
multi-locus VNTR gels. We've made it available as a DOS executable and
as C++ source code. Ideally you will decide on the suitability of the
program by reading the source code carefully. This document is designed
to help you understand the program and its output. We're giving it to
you for free and hope you find it's worth every penny you paid us for it.

b. Referencing GELSTATS

If you want to use the results obtained with GELSTATS in your
publications you can provide an unambiguous reference to the program by
referencing the name (GELSTATS) and version number as well as our names
and addresses. You can always reference the program as a publication:

S. Pelikan and S. Rogstad, "GELSTATS version 2.6" University of
Cincinnati, Cincinnati, Ohio. 1996.

The alternative is to reference a publication describing the GELSTATS
program:

Rogstad, Steven H. and Steve Pelikan. GELSTATS: a computer program for
population genetics analyses using VNTR multilocus probe data.
BioTechniques, Dec 1996 21(6) ???-???

Even if you reference this paper, you should probably mention the
version and date of the program you used.

We would like to hear about problems you find in GELSTATS, corrections
you can propose for any errors, suggestions you have for improvements
to the program, etc. Email is the best way to reach us.

IF YOU USE THIS PROGRAM, WE'D LIKE TO HEAR FROM YOU. PLEASE SEND US A
NOTE OR EMAIL (see below). THIS WILL LET US DOCUMENT THE USEFULNESS OF
THE PROGRAM AND ENABLE US TO NOTIFY YOU DIRECTLY AS NEW VERSIONS ARE
MADE AVAILABLE OR IF WE DISCOVER ERRORS IN THE PROGRAM.

c. The Authors

Steve Pelikan
Department of Mathematical Sciences
University of Cincinnati
Cincinnati, OH, 45221-0025
steve.pelikan@uc.edu -or- pelikan@math.uc.edu

Steven Rogstad
Biological Sciences ML 6
University of Cincinnati
Cincinnati, OH, 45221-0006
rogstad@email.uc.edu

d. Obtaining GELSTATS

You can obtain GELSTATS by anonymous FTP from ftp.uc.edu. You can
choose between a UNIX tar file (gelstats.tar), a compressed UNIX tar
file (gelstats.tar.gz) for which you'll need the decompression program
"gzip", and a compressed DOS zip archive (gelstats.zip) for which
you'll need an "unzip" program such as the shareware program "pkunzip".
Don't forget that you need to transfer these files in BINARY mode!
You can also obtain the documentation, program, or complete
distribution over the WEB from Pelikan's homepage with this URL:

http://math.uc.edu/~pelikan

The distribution includes a DOS executable file and C++ source code,
along with this README file and a sample dataset. See section IV.
(BUILDING GELSTATS) if you want to compile GELSTATS yourself. We have
reason to believe you can build it on MSDOS with Borland C++ version
4.5. The program also builds using The Free Software Foundation's C/C++
compiler (GCC version 2.7.0) on SUNOS 4.1.3. Indeed, development of the
program was carried out exclusively with the fine tools provided by the
FSF.

The distribution should contain these files:

gelstats.exe
sample.dat
readme
gelstats.cpp
arrays.cpp
maths.cpp
arrays.h
maths.h

I. PREPARING A DATA FILE

The data file is an ASCII file. It has no punctuation in it. Numbers
are separated by spaces and/or end-of-lines.

The first line of the data file describes the size and general nature
of your data. On the first line you specify exactly 3 integers: N, P,
and G. N is the number of population bands on your gel (or the number
of alleles you have studied). P is the number of individuals you have
studied (the number of lanes on the gel), and G is the number of groups
to which you have assigned the lanes. The groups are numbered with
consecutive integers starting with 0: 0,1,...,G-1.

On the next line after N,P,G there should be P numbers which indicate
the group to which different lanes are assigned. The number k appears
as the lth number on this line if and only if lane l is assigned to
group k. You must include this line. If you don't want your data
treated as groups, take G as 1 on the first line, and make all P
numbers on the second line equal to 0.

The remainder of the data file you should think of as a gel. The lines
of the file are population (synoptic) bands and the columns of the file
are lanes. There is a 1 in column i and row j if and only if band
number j appeared in lane i. Otherwise, there is a 0 in column i and
row j.

You can prepare your data file using any editor, wordprocessor, or
spreadsheet program you like as long as you can save the file in ASCII
format. GELSTATS will likely get confused if you feed it a Word or
WordPerfect file (for example) that has hidden formatting symbols in
it.

If you want to put comments in your data file, put them after the data.
GELSTATS only reads from the file until it has acquired the amount of
data you have specified with N and P (bands and lanes). Comments after
the data are ignored.

Here's an example of an acceptable data file:

5 12 3
0 1 2 0 1 2 0 1 2 0 1 2
0 0 1 0 1 1 0 1 1 0 0 0
1 1 0 0 0 0 0 0 1 0 1 1
0 1 1 0 0 1 1 0 0 0 1 0
1 1 0 1 1 1 1 1 0 1 0 0
0 0 0 0 0 1 0 0 0 1 0 0

A sample data file of more realistic proportions should have arrived
with your copy of GELSTATS. It is called sample.dat

GELSTATS makes some simple checks of your data file to ensure that the
groups are numbered correctly and that there is enough data for the
number of lanes and bands you have specified. It also checks to make
sure that the main body of the data contains only 0's and 1's. If your
data doesn't meet these minimal requirements, GELSTATS will print a
message and stop.
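If you'd like to check a data file before handing it to GELSTATS, here
is a minimal sketch of a reader that accepts the format just described
and applies the same sort of checks. It is only an illustration of the
file layout (it is not the parser in gelstats.cpp), and the file name
readcheck.cpp is our own invention:

// readcheck.cpp -- a sketch of reading a GELSTATS data file and making
// the minimal checks described above.  Build with: g++ readcheck.cpp
#include <iostream>
#include <fstream>
#include <vector>
using namespace std;

int main(int argc, char *argv[])
{
    if (argc < 2) { cerr << "usage: readcheck datafile\n"; return 1; }
    ifstream in(argv[1]);
    if (!in) { cerr << "can't open " << argv[1] << "\n"; return 1; }

    int N, P, G;                       // bands, lanes, groups
    if (!(in >> N >> P >> G)) { cerr << "bad first line\n"; return 1; }

    vector<int> group(P);              // group label for each lane
    for (int l = 0; l < P; l++) {
        in >> group[l];
        if (group[l] < 0 || group[l] >= G) {
            cerr << "lane " << l << " has group label " << group[l]
                 << " outside 0.." << G - 1 << "\n";
            return 1;
        }
    }

    // N rows (bands) by P columns (lanes) of 0/1 entries
    vector< vector<int> > gel(N, vector<int>(P));
    for (int b = 0; b < N; b++)
        for (int l = 0; l < P; l++) {
            if (!(in >> gel[b][l])) {
                cerr << "ran out of data (or hit a non-numeric entry)\n";
                return 1;
            }
            if (gel[b][l] != 0 && gel[b][l] != 1) {
                cerr << "entry for band " << b << ", lane " << l
                     << " is not 0 or 1\n";
                return 1;
            }
        }
    // Anything left in the file after this point is treated as a comment.
    cout << "data file looks OK: " << N << " bands, " << P
         << " lanes, " << G << " groups\n";
    return 0;
}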
II. RUNNING THE PROGRAM

The easiest way to run the program is to arrange to have the program
(GELSTATS.EXE under DOS or GELSTATS under UNIX) and the datafile (say
it's called SAMPLE.DAT) in the same directory. Then you can simply give
the command

GELSTATS SAMPLE.DAT

and the output of the program will scroll past on your screen. Usually,
you'll want to save and inspect the output and the easiest way to do
this is to redirect the output to a file. The command line to do this
is

GELSTATS SAMPLE.DAT > SAMPLE.OUT

There is only one command line option. It is used to specify the number
of randomizations that are performed by the permutation test procedures
of GELSTATS. You can specify this value, the number of "Iterations", by
giving an integer on the command line after the name of the data file:

GELSTATS SAMPLE.DAT 100 > SAMPLE.OUT

When the number of iterations is large, it can take quite a while for
the program to run, and the default value of Iterations is fairly large
(5000). While you're experimenting with the program and testing to make
sure your data file is in the proper form, you probably want to set
Iterations to a small positive value (10 or 100 are fine).

III. UNDERSTANDING THE OUTPUT

A. Preface
B. Is GELSTATS For You? (The assumptions of GELSTATS)
C. The Output

A. Preface

This section is intended to tell what the output of a GELSTATS run
means. In order to do that we need to explain what GELSTATS does and to
indicate why we have chosen the methods we have. The output of a
GELSTATS run comes in sections, and after a general discussion of the
techniques of the program, we'll describe the output and methods
section by section.

Before we begin, though, a digression on permutation methods is in
order. Essentially all of the output of GELSTATS that you will use for
making inferences is based on permutation methods. Because these
methods aren't as well known as they should be, we'll begin with an
example. A good reference is Good (1993).

Suppose you want to compare the heights in two groups of people. You
take random samples of the groups and measure the heights of the people
in the samples, obtaining X1,X2,...,Xn and Y1,Y2,...,Ym. To find out if
the mean heights of the two groups are the same you could do a t-test.
This would be justified if you knew that heights were normally
distributed in the two groups. You'd also be quite justified in using a
t-test if the distribution of heights wasn't too far from normal and
the sizes of the samples were rather large.

A permutation approach to the question would proceed as follows.
Arrange the heights in a row of n+m numbers: X1,X2,...,Xn,Y1,Y2,...,Ym
with all the group-1 scores on the left. Any arrangement of these
numbers in a row we will call a "configuration". For this initial
configuration, find the sum of the first n numbers and call it the
"target". For any configuration, we'll refer to the sum of the first n
numbers in the configuration as its "score".

Suppose we randomly rearrange the numbers to form another configuration
and compute its score. Will the score be bigger or smaller than the
target? That depends, of course. If the heights from group 1 tend to be
higher than the heights in group 2, the chances are good that the
random configuration will have a score that is smaller than the target.
On the other hand, if the heights don't really differ in the two
groups, we'd expect that the chances would be about 50-50 whether the
score of a random configuration will exceed the target or not.

This is the idea behind a permutation test. Compute all possible
configurations, find their scores, and determine the fraction, p, of
the configurations for which the score exceeds the target. If this
fraction is small, p is the significance with which we conclude that
the center of the distribution of heights in group 1 is larger than the
center of the distribution of the heights in group 2.

When there are many observations, the number of possible configurations
is huge and people usually content themselves with determining the
fraction of a random collection of configurations that have a score
exceeding the target. This general scheme, which might be called an
"approximate permutation method", is what GELSTATS does to test various
hypotheses.
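Here is a small sketch of the approximate permutation method applied to
the height example. The heights in it are made-up numbers used only for
illustration; GELSTATS applies the same scheme to its own statistics
rather than to heights:

// permtest.cpp -- sketch of an approximate permutation test comparing
// the mean heights of two groups.
#include <iostream>
#include <vector>
#include <cstdlib>
#include <ctime>
using namespace std;

int main()
{
    double X[] = { 180, 175, 178, 183, 172 };       // group 1 (n = 5)
    double Y[] = { 170, 168, 174, 169, 171, 166 };  // group 2 (m = 6)
    int n = 5, m = 6;

    vector<double> pool(X, X + n);       // one row of n+m numbers
    pool.insert(pool.end(), Y, Y + m);

    double target = 0.0;                 // score of the observed configuration
    for (int i = 0; i < n; i++) target += X[i];

    srand((unsigned) time(0));
    int iterations = 5000, asBig = 0;
    for (int it = 0; it < iterations; it++) {
        // shuffle to get a random configuration (Fisher-Yates)
        for (int i = n + m - 1; i > 0; i--) {
            int j = rand() % (i + 1);
            double tmp = pool[i]; pool[i] = pool[j]; pool[j] = tmp;
        }
        double score = 0.0;              // sum of the first n numbers
        for (int i = 0; i < n; i++) score += pool[i];
        if (score >= target) asBig++;
    }
    // fraction of random configurations whose score reaches the target
    cout << "p = " << (double) asBig / iterations << endl;
    return 0;
}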
B. Is GELSTATS For You? (The assumptions of GELSTATS)

The advantages of the permutation methods we use are that they don't
assume a particular form for the distribution of the variables we're
working with. They also provide good (exact) significance levels even
with small sample sizes. For large sample sizes, they're as good
(powerful) as any other method.

For making inferences about similarity or heterozygosity within and
between populations, the alternative to the permutation methods adopted
here is a set of approximate parametric methods devised by Lynch
(1990). These methods rely on working out the theoretical mean and
sampling variance for the variates in question and then assuming that
the sample sizes are large enough that the central limit theorem
applies. (This says that the distributions are nearly normal.) With
current methods, we find it impossible to run enough lanes on a gel to
satisfy our concerns about the normality assumptions. Nevertheless,
simulation studies we did suggest that many times, Lynch's methods work
quite well with reasonably small sample sizes and produce results that
agree with permutation methods.

There is one reason for which you should consider avoiding permutation
methods. If your gels are such that comparisons of distant lanes are
inaccurate, you might be better off with Lynch's methods. As pointed
out by Lynch, samples can be assigned to lanes on a gel in such a way
that, using his methods, you only need to make comparisons between
adjacent lanes in order to test hypotheses about different levels of
similarities between groups. In generating random configurations, the
permutation methods assume that all pair-wise comparisons of lanes are
equally valid.

SOME ASSUMPTIONS:

-- We assume all the alleles of all loci appear on the gel.
-- We assume that samples were taken at random.
-- We assume that Hardy-Weinberg equilibrium holds at each locus in
   question.

Note: recent results show that it isn't necessary to assume that all
alleles of all loci appear on the gel. At least as far as computing
heterozygosity is concerned, it's enough to assume that a random sample
of the alleles appear on the gel with the chance of a band appearing
being independent of its frequency.

C. The Output

1. Identifying information
   a. Program name, version, and build date.
   b. The value of Iterations.
   c. The name and contents of the data file.
2. Linked bands
3. Summary stats on band number and frequency
4. Results on similarity
   a. Definition of similarity
   b. Permutation tests on similarities
   c. Chance of identical lanes
   d. Mean between group similarities
   e. More permutation tests
   f. Monomorphic bands and similarity
   g. Lynch-like F_{ST}
5. Heterozygosity computations

1. Identifying information

The first thing GELSTATS prints is some identifying information. In the
output you'll find its name, the version of the program that you've
run, and the date on which the program was compiled.
It also reports the value of the parameter "Iterations" and the name of
the file from which it read its data. After this information, GELSTATS
prints a copy of the data. In practice, we generate a temporary data
file, run GELSTATS on it, and then annotate the GELSTATS output since
it contains a complete copy of the input file.

In the process of reading your data file GELSTATS makes a couple of
checks. It determines if entries other than 0 or 1 appear in the main
array of the data and whether you have numbered the groups with
appropriate labels. If the data file doesn't meet these minimal
requirements, GELSTATS will print a warning message and quit. These
messages should appear on your screen somewhere, not in the output file
(if you've redirected output). If you get an empty output file, look on
your screen or console (where C++ puts cerr) for error messages.

2. Linked bands

The program reports a list of all monomorphic bands in the data set.
These bands could represent alleles that are fixed in the population.
Some theoretical results about how monomorphic bands affect the
band-sharing estimates of similarity are available. See section 4.f
below.

The program then reports any linked bands it detects. Groups of bands
with identical patterns of occurrence could represent closely or
completely linked loci. The program reports all groups of bands with
the same pattern of occurrence except for monomorphic bands, which are
listed earlier in the output. You might want to consider eliminating
all but one of each of these "linkage" groups and running the data
again. If the results are changed significantly, you've got to make a
decision.

3. Summary stats on band number and frequency

Next come summary statistics on band numbers and frequencies. First,
the number of lanes found belonging to each group is reported. Then the
number of bands in each lane. Following this, the maximum and minimum
number of bands are reported, along with an estimate of the number of
loci appearing on the gel.

We don't suggest that this is a great estimate of the number of loci,
but it does tell you something important. Assume that there are L loci
and that each locus contributes either 1 or 2 bands to each lane. Then
the number of bands in any lane should be between L and 2L. If M and m
are the maximum and minimum number of bands appearing in the lanes of
the gel, we should have that L <= m <= M <= 2L so that M/2 <= L <= m.
GELSTATS reports M/2 and m.

If there are no integers that lie between these values, then there are
loci some of whose alleles appear on the gel while others do not. Your
gel is "missing some bands." Perhaps there are alleles that have
co-migrated. Perhaps some were run off the end of the gel. Perhaps some
weren't well visualized by your probes. Perhaps you read the gel wrong.
In any event, when no integer lies between the upper and lower limits
M/2 and m, you KNOW that your data violates some of the assumptions of
the procedures used by GELSTATS. We're not saying "throw the stuff
out", just "be careful". Examples show that the extent of the violation
(number of missing bands, say) need not be monotonically related to the
size of the gap, M/2-m. Frequently, correct conclusions can be drawn
from data that violates some of the assumptions of the program.

By the way, it is worth comparing the interval estimate [M/2,m] of L
with estimates that GELSTATS generates later in the output. Don't
expect exact agreement, but wild departures signal a difficulty.
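For example, in the output from SAMPLE.DAT reproduced at the end of
this file, the largest lane has M = 31 bands and the smallest has
m = 17, so the interval estimate is 15.5 <= L <= 17. The integers 16
and 17 both lie in this interval, so the band counts by themselves give
no sign of missing bands. If instead the same gel had shown M = 31 and
m = 15, the interval would be 15.5 <= L <= 15, which contains no
integer, and you would know that some assumption had been violated.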
After this interval estimate, the frequencies with which each band
appears in each group and in the data set as a whole are given. Then an
estimate (the Jin-Chakraborty estimate --- see below) of the frequency
of the allele creating each band is given.

The mean number of bands, standard deviation of the number of bands,
and standard error of the number of bands are provided both by group
and for the data set as a whole. You can probably do a t-test with
these numbers, at least if you know that the distribution of band
numbers isn't badly non-normal. After this, permutation tests comparing
the number of bands appearing in different groups are performed. We
like these at least as well as t-tests since they make fewer
assumptions. Earlier versions of GELSTATS did t-tests as well as
permutation tests. The results were almost always the same.

4. Results on similarity

a. Definition of similarity
b. Permutation tests on similarities
c. Chance of identical lanes
d. Mean between group similarities
e. More permutation tests
f. Monomorphic bands and similarity
g. Lynch-like F_{ST}

a. Definition of similarity

These results are based on a band-sharing index of similarity. If n_i
is the number of bands appearing in lane i and n_{ij} is the number of
bands that lanes i and j have in common, the similarity of lanes i and
j is defined to be

S_{ij} = 2n_{ij}/(n_i + n_j).

GELSTATS reports the values of the similarities S_{ij} for all possible
pairwise comparisons. Since S_{ij} = S_{ji}, only a lower triangular
matrix of values needs to be reported. GELSTATS throws in the 1's on
the diagonal --- S_{ii} = 1 for all i --- and prints a triangular array
of similarities in the output. You can use a wordprocessor or editor to
cut out this section of output, and put it in a new file. The
similarity matrix can then be imported by a variety of other
statistical packages. See the appendix for detailed information about
how to load the similarity matrix into SYSTAT.

Note that the similarity of two lanes is not defined if the sum of the
number of bands in the lanes is 0. If your data set has lanes with this
property, GELSTATS will set such similarities to 0, print a warning
message, and continue with its work. You should pay attention to the
warning since there's probably something wrong with your data and it is
certainly the case that some of the values GELSTATS reports in the
remainder of the output are wrong.

GELSTATS computes the similarity of each lane with itself. This results
in 1's on the diagonal of the similarity matrix. If one lane has no
bands, the error message mentioned above will be printed and a 0 will
appear as a diagonal entry in the similarity matrix. Again, you should
worry why you have a lane with no bands.
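As a concrete illustration of this index, here is a small sketch that
computes S_{ij} for every pair of lanes of a tiny made-up gel, with the
gel stored the same way the data file is laid out (rows are bands,
columns are lanes). It illustrates the definition only; it is not the
routine GELSTATS itself uses:

// similarity.cpp -- sketch of the band-sharing index
// S_ij = 2*n_ij / (n_i + n_j) for every pair of lanes.
#include <iostream>
#include <vector>
using namespace std;

// Similarity of lanes i and j; returns 0 (as GELSTATS does, after a
// warning) when neither lane has any bands.
double similarity(const vector< vector<int> > &gel, int i, int j)
{
    int ni = 0, nj = 0, nij = 0;
    for (size_t b = 0; b < gel.size(); b++) {
        ni  += gel[b][i];
        nj  += gel[b][j];
        nij += gel[b][i] * gel[b][j];   // band present in both lanes
    }
    if (ni + nj == 0) return 0.0;
    return 2.0 * nij / (ni + nj);
}

int main()
{
    // A tiny made-up gel: 4 bands, 3 lanes
    int raw[4][3] = { {1,1,0}, {0,1,1}, {1,0,1}, {1,1,1} };
    vector< vector<int> > gel(4, vector<int>(3));
    for (int b = 0; b < 4; b++)
        for (int l = 0; l < 3; l++) gel[b][l] = raw[b][l];

    // Print the lower triangular similarity matrix, 1's on the diagonal
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j <= i; j++) cout << similarity(gel, i, j) << " ";
        cout << "\n";
    }
    return 0;
}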
b. Permutation tests on similarities

We've yet to come up with a good method for testing for differences in
similarity levels within and between groups that works well in every
situation. GELSTATS performs two kinds of permutation tests on
similarities: one, we believe, produces correct significance levels,
but sometimes lacks power. The second method can detect differences in
within-group similarity levels, but only produces "pseudo" p-values.
These generally indicate the relative degrees of differences but do not
represent significance levels. To explain the differences between the
methods and indicate (graphically) when the first method can be
expected to perform well we introduce a means for illustrating
differences of within- and between-group similarities.

These graphics are based on a metaphor: to simulate a sample of
dissimilarity measures from within a population, we could select N
random numbers from an interval and compute the squares of the
differences of all possible pairs of numbers. (The average size of
these measures is essentially the variance of the uniform distribution
on the interval from which we drew the random numbers: smaller
intervals correspond to greater similarity values while larger
intervals correspond to smaller similarity (greater dissimilarity)
values.)

Here's a picture that represents two populations by showing the
intervals from which we draw random numbers to simulate their
dissimilarity values:

1: [--------]
2: [---]

The picture indicates that similarities are higher in group 2 than in
group 1. The next picture illustrates a similar situation with respect
to the magnitudes of within-group similarities but differs from the
first in that the sizes of the between-group dissimilarities are much
larger:

1: [--------]
2:                  [---]

In terms of real populations, the within-group similarities of the two
populations differ to the same extent as before, but in the situation
represented by the second picture, the populations are more
differentiated than in the first.

The first kind of permutation test performed by GELSTATS is concerned
with a null hypothesis represented by the picture:

1: [--------]
2: [--------]

that is, the similarities in the groups are the same and the groups are
undifferentiated. The test seeks evidence to reject this null
hypothesis in favor of hypotheses represented by

1: [--------]
2: [---]

(i.e. within group similarities differ) or

1: [--------]
2:                  [--------]

(i.e. within group similarities are the same, groups are
differentiated) or

1: [-----------]
2:                  [---]

(i.e. within group similarities differ and groups are differentiated)
(or any of a number of other possibilities).

Here's what the test does: it computes the mean pairwise similarity in
group 1, in group 2, and between groups 1 and 2, and the differences in
these means are recorded. Then the individuals are permuted randomly to
form new groups and the differences of mean pairwise similarities are
computed again; the fraction of random arrangements of individuals into
groups that give mean differences as large as those observed in the
real data is reported.

This test produces exact p-values in the sense that, if you draw random
samples from populations satisfying the null hypothesis and test them
at significance level alpha, then 100(alpha) per cent of the tests will
reject the null hypothesis by saying that the within-group similarities
of the groups differ. Likewise, testing at the same significance level,
a fraction alpha of the tests will say that there's a difference in,
say, the within-group 1 and between-groups 1 and 2 similarities.

The trick --- or the problem --- is determining what to conclude when
tests report a significant difference. Using w1, w2, b12 to denote the
center of the distributions of the within-group 1, within-group 2, and
between-groups 1 and 2 similarities, we can associate schematic
pictures with test results as follows:

A. w1 < w2, no difference in b12, w1 or in b12, w2

1: [--------]
2: [---]

(conclude: different similarity levels)

B. w1 < w2, b12 < w1 and b12 < w2

1: [--------]
2:                  [---]

(conclude: different similarity levels within the groups, and groups
differentiated)

C. w1 = w2, w1 > b12, w2 > b12

1: [------]
2:                  [------]

or

1: [------]
2:                  [---]

(conclude: groups are differentiated, within-group similarities may or
may not be the same.)
That is, different levels of within-group similarity may not be
detected in the presence of significant differentiation between the
groups. Typically, the test reports no difference between within-group
similarities when the groups are differentiated since small differences
between the groups are lost (compared to the large between-group
differences) when individuals are permuted among groups. So: you can
test, at the provided significance level, for differences in levels of
within-group similarities or for differentiation of the groups. But if
the groups are differentiated, the power of the test of within-group
similarities is very small.

To compensate for the rather poor performance of the first permutation
method at detecting similarity differences in the presence of group
differentiation, GELSTATS performs a second permutation procedure. This
procedure provides a summary statistic that can indicate similarity
differences even when groups are differentiated, but the fraction
reported by the procedure is not a significance level for rejecting the
null hypothesis.

In this procedure, individuals are not permuted among the groups.
Instead, all possible pairwise similarities from within the two groups
are computed and these scores are tested by a permutation test as if
they were independent variates. They're not independent, of course,
because some pairwise similarity determinations have individuals in
common. This procedure permutes only within-group similarity measures
and ignores the between-group similarities, so it doesn't suffer from
the phenomenon that causes loss of power with the first procedure. It's
very good at pointing out differences. The only problem is that,
because of the lack of independence, the fractions reported are not
significance levels. We call them "p'-values" or "pprime" values.

The values reported for the whale data in the BioTechniques paper
(Rogstad & Pelikan 1996) in the section where GELSTATS output is
compared with an MDS analysis are from this second permutation
procedure. In the output of GELSTATS, the results of the second
procedure described above appear first.

c. Chance of identical lanes

After this, the statistic Sbar_to_the_Xbar, which is the mean
similarity raised to the mean number of bands, is reported for the data
as a whole and for each group. This number measures the chances that
two lanes will show identical patterns of bands. This section also
reports the mean within-group similarity by group (and the mean band
number).

d. Mean between group similarities

The means of the between-group similarities are reported next.

e. More permutation tests

The results of the "first" permutation comparison described above are
then reported. These are the tests based on reassigning lanes to groups
at random.

f. Monomorphic bands and similarity

We call bands that appear in every lane "monomorphic" bands. The
terminology comes from the assumption that if a band appears in every
lane it represents an allele occurring with very high frequency ---
practically, the allele is fixed in the population.

If your data has lots of monomorphic bands, you need to decide whether
you should do something about them. If you eliminate them from
consideration, you could be removing important evidence (of low
heterozygosity or high similarity, say). It is conceivable that some
populations are identical at a large number of loci; adding a huge
number of monomorphic loci to an otherwise informative data set can
alter the conclusions you draw from the data.
In deciding what to do about monomorphic bands, you need to know two
things: 1) Adding or removing monomorphic bands can alter the relative
magnitudes of pairwise similarities considerably and 2) Such
alterations can only make substantial changes in the relative sizes of
similarities when different lanes have markedly different numbers of
bands in them.

g. Lynch-like F_{ST}

Lynch (1991) suggests that F_{ST} can be estimated as

F'_{ST} = (1-Sb)/(2-Sb-Sw).

Here, Sw is obtained by finding the mean pairwise similarity within
each of the populations and then computing the average of these means.
Sb is a measure of the between-group similarities and is obtained by
finding, for each pair of populations i and j, the average similarity
S_{ij}' between pairs of individuals selected from the two populations
and setting

S_{ij} = 1 + S_{ij}' - (S_i + S_j)/2

where S_i and S_j are the average similarities of individuals within
the populations i and j. Then Sb is the mean of the S_{ij}, taken over
all pairs of subpopulations i and j. The program reports the value of
this estimate, F'_{ST}.

5. Heterozygosity computations

a. Estimating allele frequencies
b. Estimating heterozygosity
c. Proportion loci polymorphic
d. Comparing heterozygosities
e. Good and bad news about the assumptions

The estimates of heterozygosity are based on finding the frequency of
alleles creating each of the bands on the gel. We assume that each
allele at each locus results in one band on the gel. The term
"population band" refers to the location occupied by the bands created
by one allele. Thus, an individual possesses a particular allele if an
actual band appears at the level of the population band for that allele
in that individual's lane.

a. Estimating allele frequencies

Begin by considering a single population band on a gel with n lanes.
Since alleles at VNTR loci are codominant, a lane will show a band in
this population band if the individual assigned to the lane is either a
homozygote or heterozygote for the allele associated with the
population band. Assuming Hardy-Weinberg equilibrium, we can estimate
the frequency of the allele from the frequency of bands. If there are k
lanes with bands in the population band, and p is the frequency of the
allele, then we expect that

k/n = p^2 + 2p(1-p).

Solving this equation for the allele frequency p yields

p = 1 - sqrt(1-(k/n)).

This method of estimating allele frequencies from band frequencies was
proposed by Stephens et al. (1992). The estimate it provides for p is
biased --- on the average it is an overestimate of p. Nevertheless we
have proved that the estimate is a so-called "maximum likelihood
estimator" of p. This means that asymptotically (with larger and larger
n) it is unbiased and enjoys optimal variance properties.

An improvement on the Stephens et al. estimate is provided by Jin and
Chakraborty (1994). Provided k < n, their formula estimates p as

p = 1 - sqrt(1-s) - (1/(8n))(s/sqrt(1-s)), where s = k/n.

This estimate is also biased, but the bias is quite a bit smaller.
Since the extra correction term in the Jin-Chakraborty formula vanishes
as n tends to infinity, we see that their formula is also
asymptotically unbiased.

b. Estimating heterozygosity

Using either method of estimating allele frequencies, one can proceed
to estimate heterozygosity as follows. Because we've assumed that all
the alleles of all the loci appear on the gel, the sum of the
frequencies of all the alleles creating bands on the gel must be L, the
number of loci contributing bands to the gel.
If the average heterozygosity of the loci creating bands on the gel is
H, then the fraction H of the loci will contribute 2 bands to a lane
and (1-H) of the loci will contribute 1 band to the lane. Thus, we
expect that each lane will have 2HL + (1-H)L = (1+H)L bands. With n
lanes on the gel, we expect a total of T = n(1+H)L bands appearing on
the gel. Solve this equation for H to obtain the formula

H = (T/(nL)) - 1

expressing the heterozygosity in terms of the number of loci, the
number of lanes, and the total number of bands on the gel. We have
proved that this estimate of H is asymptotically unbiased (as n tends
to infinity).

GELSTATS estimates H using the above formula after determining L as the
sum of allele frequencies using the formula of Stephens et al. and
again after determining L as the sum of the frequencies obtained using
the Jin-Chakraborty formula. We call these the Stephens and J-C
estimates of heterozygosity. GELSTATS reports these estimates for the
data set as a whole, and again performing the computations within each
group.

Finally, some versions of GELSTATS provide a third estimate of
heterozygosity based on applying a bias correction to the formula

H = -1 + Sum(k_i)/(n Sum(p_i)).

Even after making the J-C correction in the estimate for the
frequencies p, dividing by their sum introduces another bias since the
expected value of a ratio is not in general the ratio of the expected
values. GELSTATS reports this third method for estimating
heterozygosity. This third estimate is obtained by using the Stephens
formula for gene frequencies and then correcting for the bias after
dividing by the sum of the frequencies. (We used a Taylor expansion
around the expected value of the frequencies, and include all the
variance terms in the correction, neglecting the (generally unknown)
covariance terms --- details will appear elsewhere.) Simulations show
that this corrected estimate is about as accurate as the J-C based
estimate of H.

Of course, all estimates of H improve with sample size (number of
lanes). Both the J-C and our "corrected-Stephens" estimates are quite
good for fairly large values of H, with average errors less than 4%
when H > 0.5 and the sample size is > 11. For 11 lanes and small values
of H, J-C heterozygosities can have average errors of 10% or more,
while corrected-Stephens H's generally have 2% to 4% errors. If you
want to estimate heterozygosity absolutely, report the value of this
bias-corrected estimate if its value is less than 0.5, otherwise report
the J-C based estimate. Based on simulations, we believe that this
should produce the correct value to within about 5% provided your
sample size is > 15. See Pelikan and Rogstad (1996).

In versions that compute it, the corrected-Stephens estimate of
heterozygosity is provided for each group in the data set and for the
data set as a whole. Some argument could be made for using the
whole-data-set estimates of L for finding the within group
heterozygosity estimates, and GELSTATS provides enough data in the
output for you to accomplish this by hand. We don't make GELSTATS do
this since people may be running GELSTATS on data with multiple groups
having different numbers of loci --- a setting in which using the whole
data set's L could result in bad estimates of H.

Since the presence or absence of one band in a lane is not independent
of the presence of other bands, we cannot obtain a useful expression
for the sampling variance of the estimates of L and H provided by these
methods. Of course, the permutation methods used by GELSTATS don't
require knowing the variance in order to make inferences.
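To make the formulas in sections 5.a and 5.b concrete, here is a sketch
that starts from made-up band counts, computes the Stephens and
Jin-Chakraborty allele frequencies, sums them to estimate L, and then
applies H = T/(nL) - 1. It illustrates the arithmetic only; it is not
the code in gelstats.cpp, and the bias-corrected third estimate is not
shown:

// hetero.cpp -- sketch of the Stephens and J-C heterozygosity estimates.
#include <iostream>
#include <cmath>
using namespace std;

int main()
{
    int n = 22;                             // number of lanes
    // k[i] = number of lanes in which band i appears (made-up counts)
    int k[] = { 7, 3, 10, 22, 15, 12, 4, 18 };
    int numBands = (int)(sizeof(k) / sizeof(k[0]));

    double T = 0.0, Lsteph = 0.0, Ljc = 0.0;
    for (int i = 0; i < numBands; i++) {
        double s = (double) k[i] / n;       // observed band frequency
        T += k[i];                          // total bands on the gel

        double pS = 1.0 - sqrt(1.0 - s);    // Stephens et al. estimate
        double pJC;
        if (k[i] < n)                       // Jin-Chakraborty correction
            pJC = 1.0 - sqrt(1.0 - s)
                      - (1.0 / (8.0 * n)) * (s / sqrt(1.0 - s));
        else
            pJC = 1.0;                      // band is fixed: frequency 1
        Lsteph += pS;                       // each estimate of L is the sum
        Ljc    += pJC;                      //   of the allele frequencies
    }

    cout << "Stephens: L = " << Lsteph
         << "  H = " << T / (n * Lsteph) - 1.0 << "\n";
    cout << "J-C:      L = " << Ljc
         << "  H = " << T / (n * Ljc) - 1.0 << "\n";
    return 0;
}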
c. Proportion loci polymorphic

GELSTATS reports estimates of the proportion of loci which are
polymorphic and the average number of alleles per locus. These
estimates are provided for the data set as a whole and for each group.
Since these numbers depend on estimates of the number of loci, GELSTATS
provides them based on both Stephens and J-C estimates of L.

People usually call a locus monomorphic if the most frequent allele has
frequency above some critical value (0.95 or 0.99 for example). Here we
call a locus monomorphic if the observed frequency of an allele at the
locus is 1.0 (that is, if the band appears in every lane). Bands with
frequencies less than 1 are assumed to lie at polymorphic loci.

There's considerable sampling error in this estimation of the
proportion of loci polymorphic, and the error depends on the number of
loci examined. See Nei (1987) page 177, who points out that for small
numbers of loci, the sampling error is so large that estimates of the
proportion of loci that are polymorphic are useless.

d. Comparing heterozygosities

Until version 2.12 GELSTATS provided two methods for comparing the
heterozygosities of groups in the data. In this section, J-C estimates
of L are used to find heterozygosity. These are unbiased enough for the
purpose of comparing different groups.

First, for each pair of groups the program computes the difference in
the J-C estimates of heterozygosity in the groups, and then randomly
permutes the individuals among the groups Iterations times, counting
how often the heterozygosity difference is as extreme as observed with
the original grouping. This method provides a good means for comparing
the heterozygosities of two groups provided that the groups are not
genetically differentiated. This is exactly the situation we were
originally interested in: comparing parent and offspring generations to
determine the extent of inbreeding (assuming no selection). Starting
with version 2.12, this first method is omitted by GELSTATS: it is
useful only in special situations and adds considerably to the
execution time of the program. Still, given the hypothesis of
non-differentiation, this method is probably more powerful than our
second method.

If the groups are differentiated, pooling them and selecting a subset
will almost certainly result in a group with higher heterozygosity. So
the method described above is not appropriate when groups are
genetically differentiated. For this reason, we provide another method
for comparing heterozygosities of groups.

This method compares two groups at a time. With a group of size g there
are g(g-1)/2 subsets of size g-2. Each of these subsets yields an
estimate of heterozygosity of the population from which the group was
selected. If a second group has size h, it has h(h-1)/2 subsets of size
h-2, each yielding an estimate of the heterozygosity of the group. Then
there are g(g-1)h(h-1)/4 possible pairwise comparisons of
heterozygosity values for the two groups. If this number is not too
large (not bigger than 3 times the number of Iterations specified), the
program makes all possible comparisons and reports the fraction of the
comparisons in which the first group had higher heterozygosity than the
second. The program also compares the heterozygosity of the groups by
comparing the heterozygosity values of randomly chosen subsets of the
two groups.
Rather than making all comparisons, it makes Iterations randomly
selected comparisons. Since the sampling variance of the heterozygosity
determinations based on samples of size g-2 is larger than the variance
of determinations based on samples of size g, we expect that the
fractions reported by the above procedures will provide a conservative
estimate of the significance of the difference in heterozygosity of the
two groups.

e. Good and bad news about the assumptions

We performed extensive theoretical and simulation studies on a variety
of methods for estimating heterozygosity before selecting the methods
used in GELSTATS. Some of these results will be submitted for
publication. We use the J-C based estimate of H because, on the
average, it is the most accurate. We use the Stephens estimate for H
because it provides an underestimate of H. (It overestimates the
frequencies p, and hence the sum of the frequencies, L. The reciprocal
of L enters the formula for H, which is why Stephens p's give an
underestimate of H.)

Both these methods are sensitive to departures from our assumptions. In
particular, the estimated H values can be wildly wrong if
Hardy-Weinberg equilibrium doesn't hold. In simulations with
populations having different fixation coefficients, the errors in
estimated values of H were frequently as large as 10 or 20 per cent
with fairly modest (F = 0.2) fixation coefficients. The variance of the
heterozygosity estimates remained small, however, so differences in
heterozygosity estimates for two groups can be quite accurate estimates
of differences in the heterozygosities of the groups provided the
groups have nearly the same fixation coefficient.

A second result of the studies is that neither method of estimating H
is sensitive to missing bands (e.g., bands run off the gel), provided
that the chances of a band not appearing are independent of the band's
frequency. Roughly speaking, eliminating a band has the same
proportional effect on T and L, so a missing band doesn't alter the
ratio of T and L. And H depends only on the ratio. We have established
this fact analytically as well: in the limit of large samples, with
large numbers of alleles, you need only assume that a random sample of
alleles appears on the gel.

IV. BUILDING GELSTATS

There's nothing especially fancy to be done, but remember to link with
a library containing mathematical functions. As provided, the source
compiles using Borland's C++ version 4.5 running under Windows. (From
within the IDE define a project called gelstats, add all the files
*.cpp as nodes, and run "build all".) From DOS use something like

bcc -ml -egelstats.exe -ot *.cpp

For other compilers, you may need to instruct the linker to use a
library containing mathematical functions (sqrt() etc.) We have always
used a "Large" memory model for DOS versions of the program.

The DOS executable we've distributed was built with Borland's C++
version 4.5 and uses the i286 instruction set. You can run it on an
AT-clone. By building the program using i386 or i486 instruction sets
(assuming you've got one of those chips), you might get slightly better
performance or smaller size. I wouldn't bother, though.

On Unix, you probably only need to make one modification: uncomment the
line

#define UNIX

near the top of the source file maths.cpp. Under UNIX, the program uses
getpid() to generate a seed for the random number generator. Under DOS,
it uses the time.
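The kind of switch this define controls looks something like the
following sketch (an illustration only, not the literal code in
maths.cpp, and the function name is ours):

#include <cstdlib>
#include <ctime>
#ifdef UNIX
#include <unistd.h>               // for getpid()
#endif

void seed_random_number_generator()
{
#ifdef UNIX
    srand((unsigned) getpid());   // UNIX: seed from the process id
#else
    srand((unsigned) time(0));    // DOS: seed from the time
#endif
}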
Your UNIX C++ compiler won't be able to link with DOS library functions
for manipulating times and will complain if you ask it to do so. After
this modification, try the commands

gcc -c *.cpp
gcc *.o -o gelstats -lm -lstdc++

If you don't have a C++ compiler and really need to build GELSTATS for
yourself, let us know: we have recent versions of GELSTATS written in C
that build nicely with a variety of compilers.

You are welcome to modify the program to suit your needs. You may also
distribute unmodified versions of the program and source provided you
don't do so for a profit and provided you include the copyright notice
found at the top of this file. Please don't distribute modified
versions of the source code or program. The only reason for this
request is that it is important for researchers to be able to say
exactly what computations they performed. They can only do this by
referencing the program if there's only one version of the program
around. So PLEASE: don't distribute modified versions of GELSTATS under
that name.

----------------------------

V. THANKS

Many people working in Steve Rogstad's lab have used versions of this
program on their data and provided valuable bug reports, feedback and
suggestions. Hae Lim and Dan Busemeyer tested the program extensively
and deserve special thanks, as does Brian Keane whose huge datasets led
us to many improvements in GELSTATS.

Three anonymous referees of the BioTechniques article describing
GELSTATS suggested improvements in the article, the program, and the
documentation. Changes made in direct response to these suggestions
include a discussion of how comments can be included in data files and
a revised treatment of linked and monomorphic bands that makes the
output of GELSTATS easier to use and understand.

We thank Tony Leonard for helpful discussions and suggestions about
testing for differences in similarities. He's developing improved
methods for this sticky problem. A discussion by many people in the
sci.bio.computing newsgroup led us to include a table of J-C frequency
estimates of the alleles in the output of GELSTATS.

VI. BIBLIOGRAPHY

Good, P. Permutation Tests. Springer-Verlag, New York. 1993.

Lynch, Michael. The similarity index and DNA fingerprinting. Mol. Biol.
Evol. (1990) 7(5) pp. 478-484.

Lynch, Michael. Analysis of population genetic structure by DNA
fingerprinting. In "DNA fingerprinting approaches and applications",
T. Burke et al., editors. Birkhauser (1991) pp. 113-126.

Jin, Li and Ranajit Chakraborty. Estimation of genetic distance and
coefficient of gene diversity from single-probe multilocus DNA
fingerprinting data. Mol. Biol. Evol. (1994) 11(1) pp. 120-127.

Nei, Masatoshi. Molecular Evolutionary Genetics. Columbia University
Press, New York. 1987.

Pelikan, S. and S. Rogstad. You can estimate heterozygosity with
multilocus probes. Pre-print, University of Cincinnati, 1996.

Rogstad, S. and S. Pelikan. GELSTATS: a computer program for population
genetics analyses using VNTR multilocus probe data. BioTechniques, Dec
1996 21(6) ???-???

Stephens, J.C., D.A. Gilbert, N. Yuhki, and S.F. O'Brien. Estimation of
heterozygosity from single probe multilocus DNA fingerprints. Mol.
Biol. Evol. (1992) 9 pp. 729-743.

VII. APPENDIX

Loading similarity matrices into SYSTAT.

By redirecting the output from GELSTATS to a file, you can use portions
of the output in other programs. This appendix tells you how to load
the array of pairwise similarities produced by GELSTATS into the SYSTAT
statistical program.
First, load the output (as an ASCII file) into your favorite editor or word processor, cut out the array of similarities and save them in ASCII format in a separate file. Note that it does not matter that what should be a triangular similarity matrix has several of its lower lines that wrap around the screen. (Some wordprocessors automatically wrap long lines.) Just save the matrix exactly as it is. Then load the array into SYSTAT by entering the DATA module and giving the following commands: get filename save filename.sys [this will be the datafile to use for your analyses] input variablename(1-n) [where n is the number of individuals in the dataset] type similarity run We have frequently done multidimensional scaling (the MDS module of SYSTATS) on similarity matrices produced by GELSTATS. VIII. OUTPUT FROM SAMPLE.DAT Here's the complete output that results from running GELSTATS on the file SAMPLE.DAT. You can partially test your version of the program by comparing your output with what is provided here. Remember that the "p-values" are based on random sampling, so you shouldn't expect to get exactly the same values that are shown here. The command line used to generate the file was gelstats sample.dat > sample.out BEGIN SAMPLE.OUT Output from program GELSTATS version 2.6 built 4 November 96 Iterations is set to 5000 Reading data from file: sample.dat Here's the data I've just read ----------------------------------------------- 1 1 1 1 2 2 2 1 1 1 1 0 0 2 2 2 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 1 0 1 1 1 0 1 1 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1 1 0 0 0 0 1 1 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 1 1 1 0 0 1 0 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 0 1 1 1 0 0 0 1 0 1 1 1 0 1 0 1 0 0 0 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 0 1 1 0 0 0 0 1 1 0 0 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 0 0 0 1 1 0 1 1 1 0 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 1 0 1 0 1 1 0 1 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 1 0 0 1 0 0 0 1 1 1 1 1 0 1 0 1 1 0 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1 0 1 1 1 0 1 1 1 1 0 0 0 1 1 0 0 1 0 1 0 1 0 1 0 0 0 0 1 1 1 1 1 1 1 0 0 1 0 0 1 0 1 0 1 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 0 0 0 1 1 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 1 1 0 1 0 0 1 1 1 0 0 0 1 1 0 1 0 1 0 1 0 1 0 0 1 1 1 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 0 0 0 1 1 0 1 1 0 0 1 0 0 1 1 1 0 1 0 0 0 0 1 1 1 0 1 0 1 1 1 1 0 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 1 1 1 1 0 1 1 0 1 1 0 0 0 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 1 0 1 0 1 1 1 1 1 0 0 1 0 1 0 1 1 0 1 1 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 0 1 0 1 0 0 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 0 1 0 1 1 0 1 1 0 0 1 0 0 1 0 1 1 1 1 0 1 0 1 0 1 0 1 1 0 0 1 1 0 1 0 0 1 1 0 1 1 1 0 0 1 1 0 0 0 0 0 1 0 0 1 0 0 0 1 1 1 1 0 1 1 0 0 1 1 1 0 1 1 0 0 0 0 0 0 0 ----------------------------------------------- Number of bands = 41 Number of lanes = 22 Number of groups = 3 ----------------------------------------------- Searching for monomorphic (fixed) bands The following 1 bands are fixed: 31 
----------------------------------------------- Searching for linked markers (identical rows) Monomorphic (fixed) rows are given above and are not reported as linked. You might want to eliminate identical rows and start again. Or maybe not ... Each line contains the numbers of rows in a linkage group: ----------------------------------------------- Group sizes: Group Size 0 8 1 8 2 6 ----------------------------------------------- Number of bands in each lane: 0 28 1 31 2 23 3 26 4 22 5 20 6 22 7 19 8 25 9 20 10 20 11 20 12 19 13 21 14 21 15 20 16 18 17 22 18 25 19 19 20 17 21 19 ----------------------------------------------- The maximum and minimum number of bands in the lanes are max = 31 and min = 17 This gives the estimate of locus number L: 15.5000 <= L <= 17 ----------------------------------------------- Frequency of bands by groups and as a whole: Band 0 1 2 Whole 0 0.2500 0.6250 0.0000 0.3182 1 0.1250 0.2500 0.0000 0.1364 2 0.1250 0.3750 0.0000 0.1818 3 0.0000 0.1250 0.1667 0.0909 4 0.5000 0.5000 0.3333 0.4545 5 0.7500 0.2500 0.3333 0.4545 6 0.2500 0.6250 0.3333 0.4091 7 0.3750 0.2500 0.3333 0.3182 8 0.0000 0.2500 0.3333 0.1818 9 0.2500 0.3750 0.1667 0.2727 10 0.8750 0.8750 0.5000 0.7727 11 0.8750 1.0000 0.6667 0.8636 12 0.3750 0.8750 0.6667 0.6364 13 0.7500 0.5000 0.8333 0.6818 14 0.7500 0.5000 0.3333 0.5455 15 0.7500 0.7500 0.8333 0.7727 16 0.2500 0.5000 0.8333 0.5000 17 0.6250 0.3750 0.6667 0.5455 18 0.7500 0.6250 0.0000 0.5000 19 1.0000 0.7500 0.6667 0.8182 20 0.6250 0.3750 0.6667 0.5455 21 0.2500 0.7500 0.3333 0.4545 22 0.2500 0.6250 0.6667 0.5000 23 0.6250 0.8750 1.0000 0.8182 24 0.7500 1.0000 1.0000 0.9091 25 0.2500 0.2500 0.3333 0.2727 26 0.5000 0.6250 0.5000 0.5455 27 0.3750 0.3750 0.1667 0.3182 28 0.8750 0.7500 1.0000 0.8636 29 0.1250 0.3750 0.8333 0.4091 30 0.5000 0.7500 0.3333 0.5455 31 1.0000 1.0000 1.0000 1.0000 32 0.2500 0.5000 0.6667 0.4545 33 0.3750 0.3750 0.3333 0.3636 34 0.7500 0.7500 0.5000 0.6818 35 0.7500 0.7500 0.5000 0.6818 36 0.7500 1.0000 0.8333 0.8636 37 0.5000 0.5000 0.6667 0.5455 38 0.5000 0.6250 0.5000 0.5455 39 0.1250 0.6250 0.5000 0.4091 40 0.1250 0.7500 0.6667 0.5000 ----------------------------------------------- JC estimates of frequency of alleles : Band/frequency 0 0.1721 1 0.0698 2 0.0943 3 0.0460 4 0.2580 5 0.2580 6 0.2283 7 0.1721 8 0.0943 9 0.1454 10 0.5141 11 0.6174 12 0.3910 13 0.4291 14 0.3212 15 0.5141 16 0.2889 17 0.3212 18 0.2889 19 0.5627 20 0.3212 21 0.2580 22 0.2889 23 0.5627 24 0.6814 25 0.1454 26 0.3212 27 0.1721 28 0.6174 29 0.2283 30 0.3212 31 1.0000 32 0.2580 33 0.1997 34 0.4291 35 0.4291 36 0.6174 37 0.3212 38 0.3212 39 0.2283 40 0.2889 ----------------------------------------------- number of lanes = 22.0000 mean number of bands = 21.6818 Standard deviation = 3.4418 Standard error of mean = 0.7338 ----------------------------------------------- Summary stats on number of bands by group In Group 0 number in group = 8.0000 mean number of bands = 19.8750 standard deviation = 2.5319 standard error = 0.8952 In Group 1 number in group = 8.0000 mean number of bands = 24.0000 standard deviation = 4.2762 standard error = 1.5119 In Group 2 number in group = 6.0000 mean number of bands = 21.0000 standard deviation = 0.8944 standard error = 0.3651 ----------------------------------------------- Permutation tests on the number of bands in different groups Number in group 1 >= number in group 0 with p = 0.0202 Number in group 2 >= number in group 0 with p = 0.1882 Number in group 2 <= number in group 1 with p = 0.0480 
----------------------------------------------- Table of pairwise similarities 1.0000 0.7797 1.0000 0.6667 0.7407 1.0000 0.6667 0.8070 0.7755 1.0000 0.5200 0.6038 0.6667 0.6250 1.0000 0.6250 0.5882 0.5581 0.5652 0.7143 1.0000 0.5600 0.7547 0.6667 0.6250 0.6364 0.5714 1.0000 0.5532 0.6400 0.5714 0.6222 0.5854 0.6154 0.6341 1.0000 0.6792 0.7857 0.6250 0.7451 0.6383 0.6667 0.6383 0.6818 1.0000 0.5833 0.5490 0.6512 0.5652 0.7143 0.6500 0.5714 0.5128 0.4889 1.0000 0.5000 0.5882 0.6047 0.6087 0.5714 0.5500 0.6667 0.5128 0.6222 0.4500 1.0000 0.6250 0.6275 0.5581 0.5652 0.4762 0.6000 0.6190 0.6154 0.6667 0.6000 0.7500 1.0000 0.5957 0.6000 0.7619 0.5778 0.5366 0.4615 0.6341 0.5263 0.4545 0.5641 0.4615 0.4103 1.0000 0.6122 0.6538 0.6364 0.6383 0.5116 0.5366 0.6977 0.7000 0.6522 0.4390 0.6829 0.6341 0.5500 1.0000 0.6122 0.5769 0.6364 0.5957 0.5581 0.6341 0.6977 0.6000 0.5217 0.6341 0.7317 0.6829 0.6000 0.6667 1.0000 0.5833 0.5882 0.5116 0.6522 0.5714 0.6000 0.5238 0.7179 0.6667 0.5000 0.5000 0.5000 0.5128 0.5366 0.6829 1.0000 0.5217 0.5714 0.7805 0.6364 0.6000 0.4737 0.6000 0.4865 0.4651 0.5263 0.6316 0.4737 0.8108 0.5128 0.6154 0.4737 1.0000 0.6400 0.7547 0.7556 0.8333 0.5455 0.4762 0.6364 0.6829 0.6809 0.3810 0.5714 0.4762 0.6341 0.7442 0.5581 0.6667 0.6500 1.0000 0.7925 0.7143 0.6250 0.5882 0.6383 0.7111 0.5106 0.5455 0.6400 0.5333 0.5778 0.5778 0.5909 0.5652 0.6087 0.5778 0.6047 0.6383 1.0000 0.5532 0.6400 0.6190 0.6222 0.4878 0.4615 0.6829 0.5263 0.5909 0.4103 0.6154 0.5128 0.6316 0.7500 0.5500 0.4615 0.6486 0.7805 0.6364 1.0000 0.4889 0.4583 0.6500 0.5116 0.5641 0.4865 0.4615 0.5000 0.5714 0.6486 0.7027 0.6486 0.5556 0.5263 0.5789 0.4324 0.6286 0.4615 0.5714 0.5556 1.0000 0.5532 0.6800 0.5714 0.5333 0.4878 0.5641 0.7317 0.6842 0.6818 0.4615 0.6154 0.6667 0.5789 0.6500 0.6000 0.6154 0.4865 0.6829 0.5909 0.7368 0.5556 1.0000 ----------------------------------------------- Permutation tests of similarity values Within group 1 > within group 0 with pprime = 0.1450 Within group 1 > between groups 1 and 0 with pprime = 0.0990 Within group 0 > between groups 1 and 0 with pprime = 0.4930 Within group 2 > within group 0 with pprime = 0.3730 Within group 2 > between groups 2 and 0 with pprime = 0.0566 Within group 0 > between groups 2 and 0 with pprime = 0.0896 Within group 2 < within group 1 with pprime = 0.2708 Within group 2 < between groups 2 and 1 with pprime = 0.3800 Within group 1 > between groups 2 and 1 with pprime = 0.2328 ----------------------------------------------- Probability of identical lanes, by group: In group 0 : mean similarity is: 0.5999 mean band number is: 19.8750 sbar_to_xbar = 3.8800e-05 In group 1 : mean similarity is: 0.6278 mean band number is: 24.0000 sbar_to_xbar = 1.4025e-05 In group 2 : mean similarity is: 0.6093 mean band number is: 21.0000 sbar_to_xbar = 3.0290e-05 In whole group: mean similarity is: 0.6009 mean band number is: 21.6818 sbar_to_xbar = 1.5993e-05 Mean similarities between groups: 1 and 0 = 0.5996 2 and 0 = 0.5711 2 and 1 = 0.6147 ----------------------------------------------- Lynch's F_{st} = 0.0424 ----------------------------------------------- Similarities by permuting lanes among groups: Within group 1 > within group 0 with p= 0.2400 Within group 1 > between group 0 and 1 with p= 0.1300 Within group 0 > between group 0 and 1 with p= 0.4512 Within group 2 > within group 0 with p= 0.3628 Within group 2 > between group 0 and 2 with p= 0.0788 Within group 0 > between group 0 and 2 with p= 0.0980 Within group 2 < within group 1 with p= 0.2986 Within 
group 2 < between group 1 and2 with p= 0.4120 Within group 1 > between group 1 and 2 with p= 0.2504 ----------------------------------------------- Bias-corrected estimates of heterozygosity and Stephens estimates In group 0: Stephens estimate = 0.4945 Bias-corrected Stephens estimate = 0.6064 In group 1: Stephens estimate = 0.4449 Bias-corrected Stephens estimate = 0.5408 In group 2: Stephens estimate = 0.4367 Bias-corrected Stephens estimate = 0.5561 In group as a whole: Stephens estimate = 0.5477 Bias-corrected Stephens estimate = 0.5937 ----------------------------------------------- Standard-Stephens heterozygosity estimates for whole data set: Number loci = 14.0087 Heterozygosity = 0.5477 Proportion of loci polymorphic = 0.9286 Avg number alleles per locus = 2.9268 In group 0: Number loci = 13.2991 Heterozygosity = 0.4945 Proportion of loci polymorphic= 0.8496 Avg number alleles per locus = 2.9325 In group 1: Number loci = 16.6106 Heterozygosity = 0.4449 Proportion of loci polymorphic= 0.7592 Avg number alleles per locus = 2.4683 In group 2: Number loci = 14.6164 Heterozygosity = 0.4367 Proportion of loci polymorphic= 0.7263 Avg number alleles per locus = 2.5314 ----------------------------------------------- Standard J-C heterozygosity estimates: For whole data: Number loci = 13.7971 Heterozygosity = 0.5715 Proportion of loci polymorphic= 0.9275 Avg number alleles per locus = 2.9716 In group 0: Number loci = 12.7929 Heterozygosity = 0.5536 Proportion of loci polymorphic= 0.8437 Avg number alleles per locus = 3.0486 In group 1: Number loci = 16.0526 Heterozygosity = 0.4951 Proportion of loci polymorphic= 0.7508 Avg number alleles per locus = 2.5541 In group 2: Number loci = 14.0024 Heterozygosity = 0.4997 Proportion of loci polymorphic= 0.7143 Avg number alleles per locus = 2.6424 ----------------------------------------------- Based on JC-heterozygosities, Nei's F_{ST} = 0.0968 ----------------------------------------------- Permutation tests on heterozygosity of subgroups Test of H values using subsets of size n-2 from each group Fraction group 1 bigger than group 0 = 0.1232 That is, H in group 0 is bigger than H in group 1 with p = 0.1232 Exact fraction of all pairwise comparisons with group 1 heterozygosity bigger than group 0 heterozygosity p = 0.1237 That is, H in group 0 is bigger than H in group 1 with p = 0.1237 Fraction group 2 bigger than group 0 = 0.0526 That is, H in group 0 is bigger than H in group 2 with p = 0.0526 Exact fraction of all pairwise comparisons with group 2 heterozygosity bigger than group 0 heterozygosity p = 0.0548 That is, H in group 0 is bigger than H in group 2 with p = 0.0548 Fraction group 2 bigger than group 1 = 0.2268 That is, H in group 1 is bigger than H in group 2 with p = 0.2268 Exact fraction of all pairwise comparisons with group 2 heterozygosity bigger than group 1 heterozygosity p = 0.2310 That is, H in group 1 is bigger than H in group 2 with p = 0.2310 DONE END SAMPLE.OUT