/* GELSTATS source code and documentation copyright 1995 and 1996
Stephan Pelikan, Steven H. Rogstad and the University of Cincinnati.

We're giving this program away. You alone must decide if it is correct
and suitable for your purposes. If you do use it, you do so at your own
risk. You may redistribute the source code or executable versions of
this program provided that you 1) include this copyright notice, 2) do
not charge money for it, and 3) distribute the program without
modifications. */

This README file is to accompany the program GELSTATS version 2.6
built 28 October 1996

0. INTRODUCTION
I. PREPARING A DATA FILE
II. RUNNING THE PROGRAM
III. UNDERSTANDING THE OUTPUT
IV. BUILDING GELSTATS
V. THANKS
VI. BIBLIOGRAPHY
VII. APPENDIX
VIII. OUTPUT FROM SAMPLE.DAT

0. INTRODUCTION

a. General Remarks
b. Referencing GELSTATS
c. The Authors
d. Obtaining GELSTATS

As you can see, each section of this document starts with an outline of
the subsections it contains.

a. General Remarks

The program GELSTATS is designed to help you make inferences based on
multi-locus VNTR gels. We've made it available as a DOS executable and
as C++ source code. Ideally you will decide on the suitability of the
program by reading the source code carefully. This document is designed
to help you understand the program and its output. We're giving it to
you for free and hope you find it's worth every penny you paid us for it.

b. Referencing GELSTATS

If you want to use the results obtained with GELSTATS in your
publications you can provide an unambiguous reference to the program by
referencing the name (GELSTATS) and version number as well as our names
and addresses. You can always reference the program as a publication:

S. Pelikan and S. Rogstad, "GELSTATS version 2.6" University of
Cincinnati, Cincinnati, Ohio. 1996.

The alternative is to reference a publication describing the GELSTATS
program:

Rogstad, Steven H. and Steve Pelikan. GELSTATS: a computer program for
population genetics analyses using VNTR multilocus probe data.
BioTechniques, Dec 1996 21(6) ???-???

Even if you reference this paper, you should probably mention the
version and date of the program you used.

We would like to hear about problems you find in GELSTATS, corrections
you can propose for any errors, suggestions you have for improvements
to the program, etc. Email is the best way to reach us.

IF YOU USE THIS PROGRAM, WE'D LIKE TO HEAR FROM YOU. PLEASE SEND US A
NOTE OR EMAIL (see below). THIS WILL LET US DOCUMENT THE USEFULNESS OF
THE PROGRAM AND ENABLE US TO NOTIFY YOU DIRECTLY AS NEW VERSIONS ARE
MADE AVAILABLE OR IF WE DISCOVER ERRORS IN THE PROGRAM.

c. The Authors

Steve Pelikan
Department of Mathematical Sciences
University of Cincinnati
Cincinnati, OH, 45221-0025
steve.pelikan@uc.edu -or- pelikan@math.uc.edu

Steven Rogstad
Biological Sciences ML 6
University of Cincinnati
Cincinnati, OH, 45221-0006
rogstad@email.uc.edu

d. Obtaining GELSTATS

You can obtain GELSTATS by anonymous FTP from ftp.uc.edu. You can
choose between a UNIX tar file (gelstats.tar), a compressed UNIX tar
file (gelstats.tar.gz) for which you'll need the decompression program
"gzip", and a compressed DOS zip archive (gelstats.zip) for which
you'll need an "unzip" program such as the shareware program "pkunzip".
Don't forget that you need to transfer these files in BINARY mode!
You can also obtain the documentation, program, or complete
distribution over the WEB from Pelikan's homepage with this URL:

http://math.uc.edu/~pelikan

The distribution includes a DOS executable file and C++ source code,
along with this README file and a sample dataset. See section IV.
(BUILDING GELSTATS) if you want to compile GELSTATS yourself. We have
reason to believe you can build it on MSDOS with Borland C++ version
4.5. The program also builds using The Free Software Foundation's C/C++
compiler (GCC version 2.7.0) on SUNOS 4.1.3. Indeed, development of the
program was carried out exclusively with the fine tools provided by the
FSF.

The distribution should contain these files:

gelstats.exe
sample.dat
readme
gelstats.cpp
arrays.cpp
maths.cpp
arrays.h
maths.h

I. PREPARING A DATA FILE

The data file is an ASCII file. It has no punctuation in it. Numbers
are separated by spaces and/or end-of-lines.

The first line of the data file describes the size and general nature
of your data. On the first line you specify exactly 3 integers: N, P,
and G. N is the number of population bands on your gel (or the number
of alleles you have studied). P is the number of individuals you have
studied (the number of lanes on the gel), and G is the number of groups
to which you have assigned the lanes. The groups are numbered with
consecutive integers starting with 0: 0,1,...,G-1.

On the next line after N,P,G there should be P numbers which indicate
the group to which different lanes are assigned. The number k appears
as the lth number on this line if and only if lane l is assigned to
group k. You must include this line. If you don't want your data
treated as groups, take G as 1 on the first line, and make all P
numbers on the second line equal to 0.

The remainder of the data file you should think of as a gel. The lines
of the file are population (synoptic) bands and the columns of the file
are lanes. There is a 1 in column i and row j if and only if band
number j appeared in lane i. Otherwise, there is a 0 in column i and
row j.

You can prepare your data file using any editor, wordprocessor, or
spreadsheet program you like as long as you can save the file in ASCII
format. GELSTATS will likely get confused if you feed it a Word or
WordPerfect file (for example) that has hidden formatting symbols in
it.

If you want to put comments in your data file, put them after the data.
GELSTATS only reads from the file until it has acquired the amount of
data you have specified with N and P (bands and lanes). Comments after
the data are ignored.

Here's an example of an acceptable data file:

5 12 3
0 1 2 0 1 2 0 1 2 0 1 2
0 0 1 0 1 1 0 1 1 0 0 0
1 1 0 0 0 0 0 0 1 0 1 1
0 1 1 0 0 1 1 0 0 0 1 0
1 1 0 1 1 1 1 1 0 1 0 0
0 0 0 0 0 1 0 0 0 1 0 0

A sample data file of more realistic proportions should have arrived
with your copy of GELSTATS. It is called sample.dat

GELSTATS makes some simple checks of your data file to ensure that the
groups are numbered correctly and that there is enough data for the
number of lanes and bands you have specified. It also checks to make
sure that the main body of the data contains only 0's and 1's. If your
data doesn't meet these minimal requirements, GELSTATS will print a
message and stop.
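If you'd like to check a data file before handing it to GELSTATS, here
is a minimal sketch of a reader that accepts the format just described
and applies the same sort of checks. It is only an illustration of the
file layout (it is not the parser in gelstats.cpp), and the file name
readcheck.cpp is our own invention:

// readcheck.cpp -- a sketch of reading a GELSTATS data file and making
// the minimal checks described above.  Build with: g++ readcheck.cpp
#include <iostream>
#include <fstream>
#include <vector>
using namespace std;

int main(int argc, char *argv[])
{
    if (argc < 2) { cerr << "usage: readcheck datafile\n"; return 1; }
    ifstream in(argv[1]);
    if (!in) { cerr << "can't open " << argv[1] << "\n"; return 1; }

    int N, P, G;                       // bands, lanes, groups
    if (!(in >> N >> P >> G)) { cerr << "bad first line\n"; return 1; }

    vector<int> group(P);              // group label for each lane
    for (int l = 0; l < P; l++) {
        in >> group[l];
        if (group[l] < 0 || group[l] >= G) {
            cerr << "lane " << l << " has group label " << group[l]
                 << " outside 0.." << G - 1 << "\n";
            return 1;
        }
    }

    // N rows (bands) by P columns (lanes) of 0/1 entries
    vector< vector<int> > gel(N, vector<int>(P));
    for (int b = 0; b < N; b++)
        for (int l = 0; l < P; l++) {
            if (!(in >> gel[b][l])) {
                cerr << "ran out of data (or hit a non-numeric entry)\n";
                return 1;
            }
            if (gel[b][l] != 0 && gel[b][l] != 1) {
                cerr << "entry for band " << b << ", lane " << l
                     << " is not 0 or 1\n";
                return 1;
            }
        }
    // Anything left in the file after this point is treated as a comment.
    cout << "data file looks OK: " << N << " bands, " << P
         << " lanes, " << G << " groups\n";
    return 0;
}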
II. RUNNING THE PROGRAM

The easiest way to run the program is to arrange to have the program
(GELSTATS.EXE under DOS or GELSTATS under UNIX) and the datafile (say
it's called SAMPLE.DAT) in the same directory. Then you can simply give
the command

GELSTATS SAMPLE.DAT

and the output of the program will scroll past on your screen. Usually,
you'll want to save and inspect the output and the easiest way to do
this is to redirect the output to a file. The command line to do this
is

GELSTATS SAMPLE.DAT > SAMPLE.OUT

There is only one command line option. It is used to specify the number
of randomizations that are performed by the permutation test procedures
of GELSTATS. You can specify this value, the number of "Iterations", by
giving an integer on the command line after the name of the data file:

GELSTATS SAMPLE.DAT 100 > SAMPLE.OUT

When the number of iterations is large, it can take quite a while for
the program to run, and the default value of Iterations is fairly large
(5000). While you're experimenting with the program and testing to make
sure your data file is in the proper form, you probably want to set
Iterations to a small positive value (10 or 100 are fine).

III. UNDERSTANDING THE OUTPUT

A. Preface
B. Is GELSTATS For You? (The assumptions of GELSTATS)
C. The Output

A. Preface

This section is intended to tell what the output of a GELSTATS run
means. In order to do that we need to explain what GELSTATS does and to
indicate why we have chosen the methods we have. The output of a
GELSTATS run comes in sections, and after a general discussion of the
techniques of the program, we'll describe the output and methods
section by section.

Before we begin, though, a digression on permutation methods is in
order. Essentially all of the output of GELSTATS that you will use for
making inferences is based on permutation methods. Because these
methods aren't as well known as they should be, we'll begin with an
example. A good reference is Good (1993).

Suppose you want to compare the heights in two groups of people. You
take random samples of the groups and measure the heights of the people
in the samples, obtaining X1,X2,...,Xn and Y1,Y2,...,Ym. To find out if
the mean heights of the two groups are the same you could do a t-test.
This would be justified if you knew that heights were normally
distributed in the two groups. You'd also be quite justified in using a
t-test if the distribution of heights wasn't too far from normal and
the sizes of the samples were rather large.

A permutation approach to the question would proceed as follows.
Arrange the heights in a row of n+m numbers: X1,X2,...,Xn,Y1,Y2,...,Ym
with all the group-1 scores on the left. Any arrangement of these
numbers in a row we will call a "configuration". For this initial
configuration, find the sum of the first n numbers and call it the
"target". For any configuration, we'll refer to the sum of the first n
numbers in the configuration as its "score".

Suppose we randomly rearrange the numbers to form another configuration
and compute its score. Will the score be bigger or smaller than the
target? That depends, of course. If the heights from group 1 tend to be
higher than the heights in group 2, the chances are good that the
random configuration will have a score that is smaller than the target.
On the other hand, if the heights don't really differ in the two
groups, we'd expect that the chances would be about 50-50 whether the
score of a random configuration will exceed the target or not.

This is the idea behind a permutation test. Compute all possible
configurations, find their scores, and determine the fraction, p, of
the configurations for which the score exceeds the target. If this
fraction is small, p is the significance with which we conclude that
the center of the distribution of heights in group 1 is larger than the
center of the distribution of the heights in group 2.

When there are many observations, the number of possible configurations
is huge and people usually content themselves with determining the
fraction of a random collection of configurations that have a score
exceeding the target. This general scheme, which might be called an
"approximate permutation method", is what GELSTATS does to test various
hypotheses.
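Here is a small sketch of the approximate permutation method applied to
the height example. The heights in it are made-up numbers used only for
illustration; GELSTATS applies the same scheme to its own statistics
rather than to heights:

// permtest.cpp -- sketch of an approximate permutation test comparing
// the mean heights of two groups.
#include <iostream>
#include <vector>
#include <cstdlib>
#include <ctime>
using namespace std;

int main()
{
    double X[] = { 180, 175, 178, 183, 172 };       // group 1 (n = 5)
    double Y[] = { 170, 168, 174, 169, 171, 166 };  // group 2 (m = 6)
    int n = 5, m = 6;

    vector<double> pool(X, X + n);       // one row of n+m numbers
    pool.insert(pool.end(), Y, Y + m);

    double target = 0.0;                 // score of the observed configuration
    for (int i = 0; i < n; i++) target += X[i];

    srand((unsigned) time(0));
    int iterations = 5000, asBig = 0;
    for (int it = 0; it < iterations; it++) {
        // shuffle to get a random configuration (Fisher-Yates)
        for (int i = n + m - 1; i > 0; i--) {
            int j = rand() % (i + 1);
            double tmp = pool[i]; pool[i] = pool[j]; pool[j] = tmp;
        }
        double score = 0.0;              // sum of the first n numbers
        for (int i = 0; i < n; i++) score += pool[i];
        if (score >= target) asBig++;
    }
    // fraction of random configurations whose score reaches the target
    cout << "p = " << (double) asBig / iterations << endl;
    return 0;
}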
B. Is GELSTATS For You? (The assumptions of GELSTATS)

The advantages of the permutation methods we use are that they don't
assume a particular form for the distribution of the variables we're
working with. They also provide good (exact) significance levels even
with small sample sizes. For large sample sizes, they're as good
(powerful) as any other method.

For making inferences about similarity or heterozygosity within and
between populations, the alternative to the permutation methods adopted
here is a set of approximate parametric methods devised by Lynch
(1990). These methods rely on working out the theoretical mean and
sampling variance for the variates in question and then assuming that
the sample sizes are large enough that the central limit theorem
applies. (This says that the distributions are nearly normal.) With
current methods, we find it impossible to run enough lanes on a gel to
satisfy our concerns about the normality assumptions. Nevertheless,
simulation studies we did suggest that many times, Lynch's methods work
quite well with reasonably small sample sizes and produce results that
agree with permutation methods.

There is one reason for which you should consider avoiding permutation
methods. If your gels are such that comparisons of distant lanes are
inaccurate, you might be better off with Lynch's methods. As pointed
out by Lynch, samples can be assigned to lanes on a gel in such a way
that, using his methods, you only need to make comparisons between
adjacent lanes in order to test hypotheses about different levels of
similarities between groups. In generating random configurations, the
permutation methods assume that all pair-wise comparisons of lanes are
equally valid.

SOME ASSUMPTIONS:

-- We assume all the alleles of all loci appear on the gel.
-- We assume that samples were taken at random.
-- We assume that Hardy-Weinberg equilibrium holds at each locus in
   question.

Note: recent results show that it isn't necessary to assume that all
alleles of all loci appear on the gel. At least as far as computing
heterozygosity is concerned, it's enough to assume that a random sample
of the alleles appear on the gel with the chance of a band appearing
being independent of its frequency.

C. The Output

1. Identifying information
   a. Program name, version, and build date.
   b. The value of Iterations.
   c. The name and contents of the data file.
2. Linked bands
3. Summary stats on band number and frequency
4. Results on similarity
   a. Definition of similarity
   b. Permutation tests on similarities
   c. Chance of identical lanes
   d. Mean between group similarities
   e. More permutation tests
   f. Monomorphic bands and similarity
   g. Lynch-like F_{ST}
5. Heterozygosity computations

1. Identifying information

The first thing GELSTATS prints is some identifying information. In the
output you'll find its name, the version of the program that you've
run, and the date on which the program was compiled.
It also reports the value of the parameter "Iterations" and the name of
the file from which it read its data. After this information, GELSTATS
prints a copy of the data. In practice, we generate a temporary data
file, run GELSTATS on it, and then annotate the GELSTATS output since
it contains a complete copy of the input file.

In the process of reading your data file GELSTATS makes a couple of
checks. It determines if entries other than 0 or 1 appear in the main
array of the data and whether you have numbered the groups with
appropriate labels. If the data file doesn't meet these minimal
requirements, GELSTATS will print a warning message and quit. These
messages should appear on your screen somewhere, not in the output file
(if you've redirected output). If you get an empty output file, look on
your screen or console (where C++ puts cerr) for error messages.

2. Linked bands

The program reports a list of all monomorphic bands in the data set.
These bands could represent alleles that are fixed in the population.
Some theoretical results about how monomorphic bands affect the
band-sharing estimates of similarity are available. See section 4.f
below.

The program then reports any linked bands it detects. Groups of bands
with identical patterns of occurrence could represent closely or
completely linked loci. The program reports all groups of bands with
the same pattern of occurrence except for monomorphic bands, which are
listed earlier in the output. You might want to consider eliminating
all but one of each of these "linkage" groups and running the data
again. If the results are changed significantly, you've got to make a
decision.

3. Summary stats on band number and frequency

Next come summary statistics on band numbers and frequencies. First,
the number of lanes found belonging to each group is reported. Then the
number of bands in each lane. Following this, the maximum and minimum
number of bands are reported, along with an estimate of the number of
loci appearing on the gel.

We don't suggest that this is a great estimate of the number of loci,
but it does tell you something important. Assume that there are L loci
and that each locus contributes either 1 or 2 bands to each lane. Then
the number of bands in any lane should be between L and 2L. If M and m
are the maximum and minimum number of bands appearing in the lanes of
the gel, we should have that L <= m <= M <= 2L so that M/2 <= L <= m.
GELSTATS reports M/2 and m.

If there are no integers that lie between these values, then there are
loci some of whose alleles appear on the gel while others do not. Your
gel is "missing some bands." Perhaps there are alleles that have
co-migrated. Perhaps some were run off the end of the gel. Perhaps some
weren't well visualized by your probes. Perhaps you read the gel wrong.
In any event, when no integer lies between the upper and lower limits
M/2 and m, you KNOW that your data violates some of the assumptions of
the procedures used by GELSTATS. We're not saying "throw the stuff
out", just "be careful". Examples show that the extent of the violation
(number of missing bands, say) need not be monotonically related to the
size of the gap, M/2-m. Frequently, correct conclusions can be drawn
from data that violates some of the assumptions of the program.

By the way, it is worth comparing the interval estimate [M/2,m] of L
with estimates that GELSTATS generates later in the output. Don't
expect exact agreement, but wild departures signal a difficulty.
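For example, in the output from SAMPLE.DAT reproduced at the end of
this file, the largest lane has M = 31 bands and the smallest has
m = 17, so the interval estimate is 15.5 <= L <= 17. The integers 16
and 17 both lie in this interval, so the band counts by themselves give
no sign of missing bands. If instead the same gel had shown M = 31 and
m = 15, the interval would be 15.5 <= L <= 15, which contains no
integer, and you would know that some assumption had been violated.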
After this interval estimate, the frequencies with which each band
appears in each group and in the data set as a whole are given. Then an
estimate (the Jin-Chakraborty estimate --- see below) of the frequency
of the allele creating each band is given.

The mean number of bands, standard deviation of the number of bands,
and standard error of the number of bands are provided both by group
and for the data set as a whole. You can probably do a t-test with
these numbers, at least if you know that the distribution of band
numbers isn't badly non-normal. After this, permutation tests comparing
the number of bands appearing in different groups are performed. We
like these at least as well as t-tests since they make fewer
assumptions. Earlier versions of GELSTATS did t-tests as well as
permutation tests. The results were almost always the same.

4. Results on similarity

a. Definition of similarity
b. Permutation tests on similarities
c. Chance of identical lanes
d. Mean between group similarities
e. More permutation tests
f. Monomorphic bands and similarity
g. Lynch-like F_{ST}

a. Definition of similarity

These results are based on a band-sharing index of similarity. If n_i
is the number of bands appearing in lane i and n_{ij} is the number of
bands that lanes i and j have in common, the similarity of lanes i and
j is defined to be

S_{ij} = 2n_{ij}/(n_i + n_j).

GELSTATS reports the values of the similarities S_{ij} for all possible
pairwise comparisons. Since S_{ij} = S_{ji}, only a lower triangular
matrix of values needs to be reported. GELSTATS throws in the 1's on
the diagonal --- S_{ii} = 1 for all i --- and prints a triangular array
of similarities in the output. You can use a wordprocessor or editor to
cut out this section of output, and put it in a new file. The
similarity matrix can then be imported by a variety of other
statistical packages. See the appendix for detailed information about
how to load the similarity matrix into SYSTAT.

Note that the similarity of two lanes is not defined if the sum of the
number of bands in the lanes is 0. If your data set has lanes with this
property, GELSTATS will set such similarities to 0, print a warning
message, and continue with its work. You should pay attention to the
warning since there's probably something wrong with your data and it is
certainly the case that some of the values GELSTATS reports in the
remainder of the output are wrong.

GELSTATS computes the similarity of each lane with itself. This results
in 1's on the diagonal of the similarity matrix. If one lane has no
bands, the error message mentioned above will be printed and a 0 will
appear as a diagonal entry in the similarity matrix. Again, you should
worry why you have a lane with no bands.
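As a concrete illustration of this index, here is a small sketch that
computes S_{ij} for every pair of lanes of a tiny made-up gel, with the
gel stored the same way the data file is laid out (rows are bands,
columns are lanes). It illustrates the definition only; it is not the
routine GELSTATS itself uses:

// similarity.cpp -- sketch of the band-sharing index
// S_ij = 2*n_ij / (n_i + n_j) for every pair of lanes.
#include <iostream>
#include <vector>
using namespace std;

// Similarity of lanes i and j; returns 0 (as GELSTATS does, after a
// warning) when neither lane has any bands.
double similarity(const vector< vector<int> > &gel, int i, int j)
{
    int ni = 0, nj = 0, nij = 0;
    for (size_t b = 0; b < gel.size(); b++) {
        ni  += gel[b][i];
        nj  += gel[b][j];
        nij += gel[b][i] * gel[b][j];   // band present in both lanes
    }
    if (ni + nj == 0) return 0.0;
    return 2.0 * nij / (ni + nj);
}

int main()
{
    // A tiny made-up gel: 4 bands, 3 lanes
    int raw[4][3] = { {1,1,0}, {0,1,1}, {1,0,1}, {1,1,1} };
    vector< vector<int> > gel(4, vector<int>(3));
    for (int b = 0; b < 4; b++)
        for (int l = 0; l < 3; l++) gel[b][l] = raw[b][l];

    // Print the lower triangular similarity matrix, 1's on the diagonal
    for (int i = 0; i < 3; i++) {
        for (int j = 0; j <= i; j++) cout << similarity(gel, i, j) << " ";
        cout << "\n";
    }
    return 0;
}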
b. Permutation tests on similarities

We've yet to come up with a good method for testing for differences in
similarity levels within and between groups that works well in every
situation. GELSTATS performs two kinds of permutation tests on
similarities: one, we believe, produces correct significance levels,
but sometimes lacks power. The second method can detect differences in
within-group similarity levels, but only produces "pseudo" p-values.
These generally indicate the relative degrees of differences but do not
represent significance levels. To explain the differences between the
methods and indicate (graphically) when the first method can be
expected to perform well we introduce a means for illustrating
differences of within- and between-group similarities.

These graphics are based on a metaphor: to simulate a sample of
dissimilarity measures from within a population, we could select N
random numbers from an interval and compute the squares of the
differences of all possible pairs of numbers. (The average size of
these measures is essentially the variance of the uniform distribution
on the interval from which we drew the random numbers: smaller
intervals correspond to greater similarity values while larger
intervals correspond to smaller similarity (greater dissimilarity)
values.)

Here's a picture that represents two populations by showing the
intervals from which we draw random numbers to simulate their
dissimilarity values:

1: [--------]
2: [---]

The picture indicates that similarities are higher in group 2 than in
group 1. The next picture illustrates a similar situation with respect
to the magnitudes of within-group similarities but differs from the
first in that the sizes of the between-group dissimilarities are much
larger:

1: [--------]
2:                  [---]

In terms of real populations, the within-group similarities of the two
populations differ to the same extent as before, but in the situation
represented by the second picture, the populations are more
differentiated than in the first.

The first kind of permutation test performed by GELSTATS is concerned
with a null hypothesis represented by the picture:

1: [--------]
2: [--------]

that is, the similarities in the groups are the same and the groups are
undifferentiated. The test seeks evidence to reject this null
hypothesis in favor of hypotheses represented by

1: [--------]
2: [---]

(i.e. within group similarities differ) or

1: [--------]
2:                  [--------]

(i.e. within group similarities are the same, groups are
differentiated) or

1: [-----------]
2:                  [---]

(i.e. within group similarities differ and groups are differentiated)
(or any of a number of other possibilities).

Here's what the test does: it computes the mean pairwise similarity in
group 1, in group 2, and between groups 1 and 2, and the differences in
these means are recorded. Then the individuals are permuted randomly to
form new groups and the differences of mean pairwise similarities are
computed again; the fraction of random arrangements of individuals into
groups that give mean differences as large as those observed in the
real data is reported.

This test produces exact p-values in the sense that, if you draw random
samples from populations satisfying the null hypothesis and test them
at significance level alpha, then 100(alpha) per cent of the tests will
reject the null hypothesis by saying that the within-group similarities
of the groups differ. Likewise, testing at the same significance level,
a fraction alpha of the tests will say that there's a difference in,
say, the within-group 1 and between-groups 1 and 2 similarities.

The trick --- or the problem --- is determining what to conclude when
tests report a significant difference. Using w1, w2, b12 to denote the
center of the distributions of the within-group 1, within-group 2, and
between-groups 1 and 2 similarities, we can associate schematic
pictures with test results as follows:

A. w1 < w2, no difference in b12, w1 or in b12, w2

1: [--------]
2: [---]

(conclude: different similarity levels)

B. w1 < w2, b12 < w1 and b12 < w2

1: [--------]
2:                  [---]

(conclude: different similarity levels within the groups, and groups
differentiated)

C. w1 = w2, w1 > b12, w2 > b12

1: [------]
2:                  [------]

or

1: [------]
2:                  [---]

(conclude: groups are differentiated, within-group similarities may or
may not be the same.)
That is, different levels of within-group similarity may not be
detected in the presence of significant differentiation between the
groups. Typically, the test reports no difference between within-group
similarities when the groups are differentiated since small differences
between the groups are lost (compared to the large between-group
differences) when individuals are permuted among groups. So: you can
test, at the provided significance level, for differences in levels of
within-group similarities or for differentiation of the groups. But if
the groups are differentiated, the power of the test of within-group
similarities is very small.

To compensate for the rather poor performance of the first permutation
method at detecting similarity differences in the presence of group
differentiation, GELSTATS performs a second permutation procedure. This
procedure provides a summary statistic that can indicate similarity
differences even when groups are differentiated, but the fraction
reported by the procedure is not a significance level for rejecting the
null hypothesis.

In this procedure, individuals are not permuted among the groups.
Instead, all possible pairwise similarities from within the two groups
are computed and these scores are tested by a permutation test as if
they were independent variates. They're not independent, of course,
because some pairwise similarity determinations have individuals in
common. This procedure permutes only within-group similarity measures
and ignores the between-group similarities, so it doesn't suffer from
the phenomenon that causes loss of power with the first procedure. It's
very good at pointing out differences. The only problem is that,
because of the lack of independence, the fractions reported are not
significance levels. We call them "p'-values" or "pprime" values.

The values reported for the whale data in the BioTechniques paper
(Rogstad & Pelikan 1996) in the section where GELSTATS output is
compared with an MDS analysis are from this second permutation
procedure. In the output of GELSTATS, the results of the second
procedure described above appear first.

c. Chance of identical lanes

After this, the statistic Sbar_to_the_Xbar, which is the mean
similarity raised to the mean number of bands, is reported for the data
as a whole and for each group. This number measures the chances that
two lanes will show identical patterns of bands. This section also
reports the mean within-group similarity by group (and the mean band
number).

d. Mean between group similarities

The means of the between-group similarities are reported next.

e. More permutation tests

The results of the "first" permutation comparison described above are
then reported. These are the tests based on reassigning lanes to groups
at random.

f. Monomorphic bands and similarity

We call bands that appear in every lane "monomorphic" bands. The
terminology comes from the assumption that if a band appears in every
lane it represents an allele occurring with very high frequency ---
practically, the allele is fixed in the population.

If your data has lots of monomorphic bands, you need to decide whether
you should do something about them. If you eliminate them from
consideration, you could be removing important evidence (of low
heterozygosity or high similarity, say). It is conceivable that some
populations are identical at a large number of loci; adding a huge
number of monomorphic loci to an otherwise informative data set can
alter the conclusions you draw from the data.
In deciding what to do about monomorphic bands, you need to know two
things: 1) Adding or removing monomorphic bands can alter the relative
magnitudes of pairwise similarities considerably and 2) Such
alterations can only make substantial changes in the relative sizes of
similarities when different lanes have markedly different numbers of
bands in them.

g. Lynch-like F_{ST}

Lynch (1991) suggests that F_{ST} can be estimated as

F'_{ST} = (1-Sb)/(2-Sb-Sw).

Here, Sw is obtained by finding the mean pairwise similarity within
each of the populations and then computing the average of these means.
Sb is a measure of the between-group similarities and is obtained by
finding, for each pair of populations i and j, the average similarity
S_{ij}' between pairs of individuals selected from the two populations
and setting

S_{ij} = 1 + S_{ij}' - (S_i + S_j)/2

where S_i and S_j are the average similarities of individuals within
the populations i and j. Then Sb is the mean of the S_{ij}, taken over
all pairs of subpopulations i and j. The program reports the value of
this estimate, F'_{ST}.

5. Heterozygosity computations

a. Estimating allele frequencies
b. Estimating heterozygosity
c. Proportion loci polymorphic
d. Comparing heterozygosities
e. Good and bad news about the assumptions

The estimates of heterozygosity are based on finding the frequency of
alleles creating each of the bands on the gel. We assume that each
allele at each locus results in one band on the gel. The term
"population band" refers to the location occupied by the bands created
by one allele. Thus, an individual possesses a particular allele if an
actual band appears at the level of the population band for that allele
in that individual's lane.

a. Estimating allele frequencies

Begin by considering a single population band on a gel with n lanes.
Since alleles at VNTR loci are codominant, a lane will show a band in
this population band if the individual assigned to the lane is either a
homozygote or heterozygote for the allele associated with the
population band. Assuming Hardy-Weinberg equilibrium, we can estimate
the frequency of the allele from the frequency of bands. If there are k
lanes with bands in the population band, and p is the frequency of the
allele, then we expect that

k/n = p^2 + 2p(1-p).

Solving this equation for the allele frequency p yields

p = 1 - sqrt(1-(k/n)).

This method of estimating allele frequencies from band frequencies was
proposed by Stephens et al. (1992). The estimate it provides for p is
biased --- on the average it is an overestimate of p. Nevertheless we
have proved that the estimate is a so-called "maximum likelihood
estimator" of p. This means that asymptotically (with larger and larger
n) it is unbiased and enjoys optimal variance properties.

An improvement on the Stephens et al. estimate is provided by Jin and
Chakraborty (1994). Provided k < n, their formula estimates p as

p = 1 - sqrt(1-s) - (1/(8n))(s/sqrt(1-s)), where s = k/n.

This estimate is also biased, but the bias is quite a bit smaller.
Since the extra correction term in the Jin-Chakraborty formula vanishes
as n tends to infinity, we see that their formula is also
asymptotically unbiased.

b. Estimating heterozygosity

Using either method of estimating allele frequencies, one can proceed
to estimate heterozygosity as follows. Because we've assumed that all
the alleles of all the loci appear on the gel, the sum of the
frequencies of all the alleles creating bands on the gel must be L, the
number of loci contributing bands to the gel.
If the average heterozygosity of the loci creating bands on the gel is
H, then the fraction H of the loci will contribute 2 bands to a lane
and (1-H) of the loci will contribute 1 band to the lane. Thus, we
expect that each lane will have 2HL + (1-H)L = (1+H)L bands. With n
lanes on the gel, we expect a total of T = n(1+H)L bands appearing on
the gel. Solve this equation for H to obtain the formula

H = (T/(nL)) - 1

expressing the heterozygosity in terms of the number of loci, the
number of lanes, and the total number of bands on the gel. We have
proved that this estimate of H is asymptotically unbiased (as n tends
to infinity).

GELSTATS estimates H using the above formula after determining L as the
sum of allele frequencies using the formula of Stephens et al. and
again after determining L as the sum of the frequencies obtained using
the Jin-Chakraborty formula. We call these the Stephens and J-C
estimates of heterozygosity. GELSTATS reports these estimates for the
data set as a whole, and again performing the computations within each
group.

Finally, some versions of GELSTATS provide a third estimate of
heterozygosity based on applying a bias correction to the formula

H = -1 + Sum(k_i)/(n Sum(p_i)).

Even after making the J-C correction in the estimate for the
frequencies p, dividing by their sum introduces another bias since the
expected value of a ratio is not in general the ratio of the expected
values. GELSTATS reports this third method for estimating
heterozygosity. This third estimate is obtained by using the Stephens
formula for gene frequencies and then correcting for the bias after
dividing by the sum of the frequencies. (We used a Taylor expansion
around the expected value of the frequencies, and include all the
variance terms in the correction, neglecting the (generally unknown)
covariance terms --- details will appear elsewhere.) Simulations show
that this corrected estimate is about as accurate as the J-C based
estimate of H.

Of course, all estimates of H improve with sample size (number of
lanes). Both the J-C and our "corrected-Stephens" estimates are quite
good for fairly large values of H, with average errors less than 4%
when H > 0.5 and the sample size is > 11. For 11 lanes and small values
of H, J-C heterozygosities can have average errors of 10% or more,
while corrected-Stephens H's generally have 2% to 4% errors. If you
want to estimate heterozygosity absolutely, report the value of this
bias-corrected estimate if its value is less than 0.5, otherwise report
the J-C based estimate. Based on simulations, we believe that this
should produce the correct value to within about 5% provided your
sample size is > 15. See Pelikan and Rogstad (1996).

In versions that compute it, the corrected-Stephens estimate of
heterozygosity is provided for each group in the data set and for the
data set as a whole. Some argument could be made for using the
whole-data-set estimates of L for finding the within group
heterozygosity estimates, and GELSTATS provides enough data in the
output for you to accomplish this by hand. We don't make GELSTATS do
this since people may be running GELSTATS on data with multiple groups
having different numbers of loci --- a setting in which using the whole
data set's L could result in bad estimates of H.

Since the presence or absence of one band in a lane is not independent
of the presence of other bands, we cannot obtain a useful expression
for the sampling variance of the estimates of L and H provided by these
methods. Of course, the permutation methods used by GELSTATS don't
require knowing the variance in order to make inferences.
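To make the formulas in sections 5.a and 5.b concrete, here is a sketch
that starts from made-up band counts, computes the Stephens and
Jin-Chakraborty allele frequencies, sums them to estimate L, and then
applies H = T/(nL) - 1. It illustrates the arithmetic only; it is not
the code in gelstats.cpp, and the bias-corrected third estimate is not
shown:

// hetero.cpp -- sketch of the Stephens and J-C heterozygosity estimates.
#include <iostream>
#include <cmath>
using namespace std;

int main()
{
    int n = 22;                             // number of lanes
    // k[i] = number of lanes in which band i appears (made-up counts)
    int k[] = { 7, 3, 10, 22, 15, 12, 4, 18 };
    int numBands = (int)(sizeof(k) / sizeof(k[0]));

    double T = 0.0, Lsteph = 0.0, Ljc = 0.0;
    for (int i = 0; i < numBands; i++) {
        double s = (double) k[i] / n;       // observed band frequency
        T += k[i];                          // total bands on the gel

        double pS = 1.0 - sqrt(1.0 - s);    // Stephens et al. estimate
        double pJC;
        if (k[i] < n)                       // Jin-Chakraborty correction
            pJC = 1.0 - sqrt(1.0 - s)
                      - (1.0 / (8.0 * n)) * (s / sqrt(1.0 - s));
        else
            pJC = 1.0;                      // band is fixed: frequency 1
        Lsteph += pS;                       // each estimate of L is the sum
        Ljc    += pJC;                      //   of the allele frequencies
    }

    cout << "Stephens: L = " << Lsteph
         << "  H = " << T / (n * Lsteph) - 1.0 << "\n";
    cout << "J-C:      L = " << Ljc
         << "  H = " << T / (n * Ljc) - 1.0 << "\n";
    return 0;
}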
c. Proportion loci polymorphic

GELSTATS reports estimates of the proportion of loci which are
polymorphic and the average number of alleles per locus. These
estimates are provided for the data set as a whole and for each group.
Since these numbers depend on estimates of the number of loci, GELSTATS
provides them based on both Stephens and J-C estimates of L.

People usually call a locus monomorphic if the most frequent allele has
frequency above some critical value (0.95 or 0.99 for example). Here we
call a locus monomorphic if the observed frequency of an allele at the
locus is 1.0 (that is, if the band appears in every lane). Bands with
frequencies less than 1 are assumed to lie at polymorphic loci.

There's considerable sampling error in this estimation of the
proportion of loci polymorphic, and the error depends on the number of
loci examined. See Nei (1987) page 177, who points out that for small
numbers of loci, the sampling error is so large that estimates of the
proportion of loci that are polymorphic are useless.

d. Comparing heterozygosities

Until version 2.12 GELSTATS provided two methods for comparing the
heterozygosities of groups in the data. In this section, J-C estimates
of L are used to find heterozygosity. These are unbiased enough for the
purpose of comparing different groups.

First, for each pair of groups the program computes the difference in
the J-C estimates of heterozygosity in the groups, and then randomly
permutes the individuals among the groups Iterations times, counting
how often the heterozygosity difference is as extreme as observed with
the original grouping. This method provides a good means for comparing
the heterozygosities of two groups provided that the groups are not
genetically differentiated. This is exactly the situation we were
originally interested in: comparing parent and offspring generations to
determine the extent of inbreeding (assuming no selection). Starting
with version 2.12, this first method is omitted by GELSTATS: it is
useful only in special situations and adds considerably to the
execution time of the program. Still, given the hypothesis of
non-differentiation, this method is probably more powerful than our
second method.

If the groups are differentiated, pooling them and selecting a subset
will almost certainly result in a group with higher heterozygosity. So
the method described above is not appropriate when groups are
genetically differentiated. For this reason, we provide another method
for comparing heterozygosities of groups.

This method compares two groups at a time. With a group of size g there
are g(g-1)/2 subsets of size g-2. Each of these subsets yields an
estimate of heterozygosity of the population from which the group was
selected. If a second group has size h, it has h(h-1)/2 subsets of size
h-2, each yielding an estimate of the heterozygosity of the group. Then
there are g(g-1)h(h-1)/4 possible pairwise comparisons of
heterozygosity values for the two groups. If this number is not too
large (not bigger than 3 times the number of Iterations specified), the
program makes all possible comparisons and reports the fraction of the
comparisons in which the first group had higher heterozygosity than the
second. The program also compares the heterozygosity of the groups by
comparing the heterozygosity values of randomly chosen subsets of the
two groups.
Rather than making all comparisons, it makes Iterations randomly
selected comparisons. Since the sampling variance of the heterozygosity
determinations based on samples of size g-2 is larger than the variance
of determinations based on samples of size g, we expect that the
fractions reported by the above procedures will provide a conservative
estimate of the significance of the difference in heterozygosity of the
two groups.

e. Good and bad news about the assumptions

We performed extensive theoretical and simulation studies on a variety
of methods for estimating heterozygosity before selecting the methods
used in GELSTATS. Some of these results will be submitted for
publication. We use the J-C based estimate of H because, on the
average, it is the most accurate. We use the Stephens estimate for H
because it provides an underestimate of H. (It overestimates the
frequencies p, and hence the sum of the frequencies, L. The reciprocal
of L enters the formula for H, which is why Stephens p's give an
underestimate of H.)

Both these methods are sensitive to departures from our assumptions. In
particular, the estimated H values can be wildly wrong if
Hardy-Weinberg equilibrium doesn't hold. In simulations with
populations having different fixation coefficients, the errors in
estimated values of H were frequently as large as 10 or 20 per cent
with fairly modest (F = 0.2) fixation coefficients. The variance of the
heterozygosity estimates remained small, however, so differences in
heterozygosity estimates for two groups can be quite accurate estimates
of differences in the heterozygosities of the groups provided the
groups have nearly the same fixation coefficient.

A second result of the studies is that neither method of estimating H
is sensitive to missing bands (e.g., bands run off the gel), provided
that the chances of a band not appearing are independent of the band's
frequency. Roughly speaking, eliminating a band has the same
proportional effect on T and L, so a missing band doesn't alter the
ratio of T and L. And H depends only on the ratio. We have established
this fact analytically as well: in the limit of large samples, with
large numbers of alleles, you need only assume that a random sample of
alleles appears on the gel.

IV. BUILDING GELSTATS

There's nothing especially fancy to be done, but remember to link with
a library containing mathematical functions. As provided, the source
compiles using Borland's C++ version 4.5 running under Windows. (From
within the IDE define a project called gelstats, add all the files
*.cpp as nodes, and run "build all".) From DOS use something like

bcc -ml -egelstats.exe -ot *.cpp

For other compilers, you may need to instruct the linker to use a
library containing mathematical functions (sqrt() etc.) We have always
used a "Large" memory model for DOS versions of the program.

The DOS executable we've distributed was built with Borland's C++
version 4.5 and uses the i286 instruction set. You can run it on an
AT-clone. By building the program using i386 or i486 instruction sets
(assuming you've got one of those chips), you might get slightly better
performance or smaller size. I wouldn't bother, though.

On Unix, you probably only need to make one modification: uncomment the
line

#define UNIX

near the top of the source file maths.cpp. Under UNIX, the program uses
getpid() to generate a seed for the random number generator. Under DOS,
it uses the time.
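The kind of switch this define controls looks something like the
following sketch (an illustration only, not the literal code in
maths.cpp, and the function name is ours):

#include <cstdlib>
#include <ctime>
#ifdef UNIX
#include <unistd.h>               // for getpid()
#endif

void seed_random_number_generator()
{
#ifdef UNIX
    srand((unsigned) getpid());   // UNIX: seed from the process id
#else
    srand((unsigned) time(0));    // DOS: seed from the time
#endif
}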
Your UNIX C++ compiler won't be able to link with DOS library functions
for manipulating times and will complain if you ask it to do so. After
this modification, try the commands

gcc -c *.cpp
gcc *.o -o gelstats -lm -lstdc++

If you don't have a C++ compiler and really need to build GELSTATS for
yourself, let us know: we have recent versions of GELSTATS written in C
that build nicely with a variety of compilers.

You are welcome to modify the program to suit your needs. You may also
distribute unmodified versions of the program and source provided you
don't do so for a profit and provided you include the copyright notice
found at the top of this file. Please don't distribute modified
versions of the source code or program. The only reason for this
request is that it is important for researchers to be able to say
exactly what computations they performed. They can only do this by
referencing the program if there's only one version of the program
around. So PLEASE: don't distribute modified versions of GELSTATS under
that name.

----------------------------

V. THANKS

Many people working in Steve Rogstad's lab have used versions of this
program on their data and provided valuable bug reports, feedback and
suggestions. Hae Lim and Dan Busemeyer tested the program extensively
and deserve special thanks, as does Brian Keane whose huge datasets led
us to many improvements in GELSTATS.

Three anonymous referees of the BioTechniques article describing
GELSTATS suggested improvements in the article, the program, and the
documentation. Changes made in direct response to these suggestions
include a discussion of how comments can be included in data files and
a revised treatment of linked and monomorphic bands that makes the
output of GELSTATS easier to use and understand.

We thank Tony Leonard for helpful discussions and suggestions about
testing for differences in similarities. He's developing improved
methods for this sticky problem. A discussion by many people in the
sci.bio.computing newsgroup led us to include a table of J-C frequency
estimates of the alleles in the output of GELSTATS.

VI. BIBLIOGRAPHY

Good, P. Permutation Tests. Springer-Verlag, New York. 1993.

Lynch, Michael. The similarity index and DNA fingerprinting. Mol. Biol.
Evol. (1990) 7(5) pp. 478-484.

Lynch, Michael. Analysis of population genetic structure by DNA
fingerprinting. In "DNA fingerprinting approaches and applications",
T. Burke et al., editors. Birkhauser (1991) pp. 113-126.

Jin, Li and Ranajit Chakraborty. Estimation of genetic distance and
coefficient of gene diversity from single-probe multilocus DNA
fingerprinting data. Mol. Biol. Evol. (1994) 11(1) pp. 120-127.

Nei, Masatoshi. Molecular Evolutionary Genetics. Columbia University
Press, New York. 1987.

Pelikan, S. and S. Rogstad. You can estimate heterozygosity with
multilocus probes. Pre-print, University of Cincinnati, 1996.

Rogstad, S. and S. Pelikan. GELSTATS: a computer program for population
genetics analyses using VNTR multilocus probe data. BioTechniques, Dec
1996 21(6) ???-???

Stephens, J.C., D.A. Gilbert, N. Yuhki, and S.F. O'Brien. Estimation of
heterozygosity from single probe multilocus DNA fingerprints. Mol.
Biol. Evol. (1992) 9 pp. 729-743.

VII. APPENDIX

Loading similarity matrices into SYSTAT.

By redirecting the output from GELSTATS to a file, you can use portions
of the output in other programs. This appendix tells you how to load
the array of pairwise similarities produced by GELSTATS into the SYSTAT
statistical program.
First, load the output (as an ASCII file) into your favorite editor or word processor, cut out the array of similarities and save them in ASCII format in a separate file. Note that it does not matter that what should be a triangular similarity matrix has several of its lower lines that wrap around the screen. (Some wordprocessors automatically wrap long lines.) Just save the matrix exactly as it is. Then load the array into SYSTAT by entering the DATA module and giving the following commands: get filename save filename.sys [this will be the datafile to use for your analyses] input variablename(1-n) [where n is the number of individuals in the dataset] type similarity run We have frequently done multidimensional scaling (the MDS module of SYSTATS) on similarity matrices produced by GELSTATS. VIII. OUTPUT FROM SAMPLE.DAT Here's the complete output that results from running GELSTATS on the file SAMPLE.DAT. You can partially test your version of the program by comparing your output with what is provided here. Remember that the "p-values" are based on random sampling, so you shouldn't expect to get exactly the same values that are shown here. The command line used to generate the file was gelstats sample.dat > sample.out BEGIN SAMPLE.OUT Output from program GELSTATS version 2.6 built 4 November 96 Iterations is set to 5000 Reading data from file: sample.dat Here's the data I've just read ----------------------------------------------- 1 1 1 1 2 2 2 1 1 1 1 0 0 2 2 2 0 0 0 0 0 0 1 1 0 1 0 0 0 0 1 0 1 1 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 0 0 0 0 0 0 0 0 1 0 1 1 1 1 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 1 1 1 0 1 0 1 1 1 0 1 1 1 1 1 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 0 1 1 0 0 0 0 1 1 1 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 1 1 1 1 0 0 1 0 1 1 1 1 1 1 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 1 1 1 1 1 1 0 1 1 1 1 0 1 1 1 1 1 1 0 0 1 1 1 0 0 0 1 0 1 1 1 0 1 0 1 0 0 0 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 0 0 0 1 1 0 0 0 0 1 1 0 0 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 0 1 1 0 0 1 1 1 1 1 1 0 1 1 0 0 0 1 1 0 1 1 1 0 0 0 1 1 1 0 0 1 0 1 0 0 1 0 0 1 1 1 0 1 0 1 1 0 1 0 0 0 0 1 1 1 1 1 1 1 1 0 0 0 0 0 1 0 0 1 0 0 0 1 1 1 1 1 0 1 0 1 1 0 1 0 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 1 1 1 0 0 1 1 0 1 0 1 0 1 0 1 1 1 0 1 1 1 1 0 0 0 1 1 0 0 1 0 1 0 1 0 1 0 0 0 0 1 1 1 1 1 1 1 0 0 1 0 0 1 0 1 0 1 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 1 0 0 0 1 1 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 1 1 0 1 0 0 1 1 1 0 0 0 1 1 0 1 0 1 0 1 0 1 0 0 1 1 1 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0 1 0 0 0 1 1 0 1 1 0 0 1 0 0 1 1 1 0 1 0 0 0 0 1 1 1 0 1 0 1 1 1 1 0 1 1 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 1 1 1 1 0 1 1 0 1 1 0 0 0 0 0 0 1 0 1 0 1 0 0 1 0 1 0 0 0 0 0 0 1 0 1 0 1 0 1 1 1 1 1 0 0 1 0 1 0 1 1 0 1 1 0 0 1 1 1 1 1 0 1 1 1 1 1 0 1 1 0 1 0 1 0 0 1 1 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 0 1 1 1 0 1 1 1 1 1 1 1 0 0 0 1 1 1 0 1 0 1 1 0 1 1 0 0 1 0 0 1 0 1 1 1 1 0 1 0 1 0 1 0 1 1 0 0 1 1 0 1 0 0 1 1 0 1 1 1 0 0 1 1 0 0 0 0 0 1 0 0 1 0 0 0 1 1 1 1 0 1 1 0 0 1 1 1 0 1 1 0 0 0 0 0 0 0 ----------------------------------------------- Number of bands = 41 Number of lanes = 22 Number of groups = 3 ----------------------------------------------- Searching for monomorphic (fixed) bands The following 1 bands are fixed: 31 
----------------------------------------------- Searching for linked markers (identical rows) Monomorphic (fixed) rows are given above and are not reported as linked. You might want to eliminate identical rows and start again. Or maybe not ... Each line contains the numbers of rows in a linkage group: ----------------------------------------------- Group sizes: Group Size 0 8 1 8 2 6 ----------------------------------------------- Number of bands in each lane: 0 28 1 31 2 23 3 26 4 22 5 20 6 22 7 19 8 25 9 20 10 20 11 20 12 19 13 21 14 21 15 20 16 18 17 22 18 25 19 19 20 17 21 19 ----------------------------------------------- The maximum and minimum number of bands in the lanes are max = 31 and min = 17 This gives the estimate of locus number L: 15.5000 <= L <= 17 ----------------------------------------------- Frequency of bands by groups and as a whole: Band 0 1 2 Whole 0 0.2500 0.6250 0.0000 0.3182 1 0.1250 0.2500 0.0000 0.1364 2 0.1250 0.3750 0.0000 0.1818 3 0.0000 0.1250 0.1667 0.0909 4 0.5000 0.5000 0.3333 0.4545 5 0.7500 0.2500 0.3333 0.4545 6 0.2500 0.6250 0.3333 0.4091 7 0.3750 0.2500 0.3333 0.3182 8 0.0000 0.2500 0.3333 0.1818 9 0.2500 0.3750 0.1667 0.2727 10 0.8750 0.8750 0.5000 0.7727 11 0.8750 1.0000 0.6667 0.8636 12 0.3750 0.8750 0.6667 0.6364 13 0.7500 0.5000 0.8333 0.6818 14 0.7500 0.5000 0.3333 0.5455 15 0.7500 0.7500 0.8333 0.7727 16 0.2500 0.5000 0.8333 0.5000 17 0.6250 0.3750 0.6667 0.5455 18 0.7500 0.6250 0.0000 0.5000 19 1.0000 0.7500 0.6667 0.8182 20 0.6250 0.3750 0.6667 0.5455 21 0.2500 0.7500 0.3333 0.4545 22 0.2500 0.6250 0.6667 0.5000 23 0.6250 0.8750 1.0000 0.8182 24 0.7500 1.0000 1.0000 0.9091 25 0.2500 0.2500 0.3333 0.2727 26 0.5000 0.6250 0.5000 0.5455 27 0.3750 0.3750 0.1667 0.3182 28 0.8750 0.7500 1.0000 0.8636 29 0.1250 0.3750 0.8333 0.4091 30 0.5000 0.7500 0.3333 0.5455 31 1.0000 1.0000 1.0000 1.0000 32 0.2500 0.5000 0.6667 0.4545 33 0.3750 0.3750 0.3333 0.3636 34 0.7500 0.7500 0.5000 0.6818 35 0.7500 0.7500 0.5000 0.6818 36 0.7500 1.0000 0.8333 0.8636 37 0.5000 0.5000 0.6667 0.5455 38 0.5000 0.6250 0.5000 0.5455 39 0.1250 0.6250 0.5000 0.4091 40 0.1250 0.7500 0.6667 0.5000 ----------------------------------------------- JC estimates of frequency of alleles : Band/frequency 0 0.1721 1 0.0698 2 0.0943 3 0.0460 4 0.2580 5 0.2580 6 0.2283 7 0.1721 8 0.0943 9 0.1454 10 0.5141 11 0.6174 12 0.3910 13 0.4291 14 0.3212 15 0.5141 16 0.2889 17 0.3212 18 0.2889 19 0.5627 20 0.3212 21 0.2580 22 0.2889 23 0.5627 24 0.6814 25 0.1454 26 0.3212 27 0.1721 28 0.6174 29 0.2283 30 0.3212 31 1.0000 32 0.2580 33 0.1997 34 0.4291 35 0.4291 36 0.6174 37 0.3212 38 0.3212 39 0.2283 40 0.2889 ----------------------------------------------- number of lanes = 22.0000 mean number of bands = 21.6818 Standard deviation = 3.4418 Standard error of mean = 0.7338 ----------------------------------------------- Summary stats on number of bands by group In Group 0 number in group = 8.0000 mean number of bands = 19.8750 standard deviation = 2.5319 standard error = 0.8952 In Group 1 number in group = 8.0000 mean number of bands = 24.0000 standard deviation = 4.2762 standard error = 1.5119 In Group 2 number in group = 6.0000 mean number of bands = 21.0000 standard deviation = 0.8944 standard error = 0.3651 ----------------------------------------------- Permutation tests on the number of bands in different groups Number in group 1 >= number in group 0 with p = 0.0202 Number in group 2 >= number in group 0 with p = 0.1882 Number in group 2 <= number in group 1 with p = 0.0480 
----------------------------------------------- Table of pairwise similarities 1.0000 0.7797 1.0000 0.6667 0.7407 1.0000 0.6667 0.8070 0.7755 1.0000 0.5200 0.6038 0.6667 0.6250 1.0000 0.6250 0.5882 0.5581 0.5652 0.7143 1.0000 0.5600 0.7547 0.6667 0.6250 0.6364 0.5714 1.0000 0.5532 0.6400 0.5714 0.6222 0.5854 0.6154 0.6341 1.0000 0.6792 0.7857 0.6250 0.7451 0.6383 0.6667 0.6383 0.6818 1.0000 0.5833 0.5490 0.6512 0.5652 0.7143 0.6500 0.5714 0.5128 0.4889 1.0000 0.5000 0.5882 0.6047 0.6087 0.5714 0.5500 0.6667 0.5128 0.6222 0.4500 1.0000 0.6250 0.6275 0.5581 0.5652 0.4762 0.6000 0.6190 0.6154 0.6667 0.6000 0.7500 1.0000 0.5957 0.6000 0.7619 0.5778 0.5366 0.4615 0.6341 0.5263 0.4545 0.5641 0.4615 0.4103 1.0000 0.6122 0.6538 0.6364 0.6383 0.5116 0.5366 0.6977 0.7000 0.6522 0.4390 0.6829 0.6341 0.5500 1.0000 0.6122 0.5769 0.6364 0.5957 0.5581 0.6341 0.6977 0.6000 0.5217 0.6341 0.7317 0.6829 0.6000 0.6667 1.0000 0.5833 0.5882 0.5116 0.6522 0.5714 0.6000 0.5238 0.7179 0.6667 0.5000 0.5000 0.5000 0.5128 0.5366 0.6829 1.0000 0.5217 0.5714 0.7805 0.6364 0.6000 0.4737 0.6000 0.4865 0.4651 0.5263 0.6316 0.4737 0.8108 0.5128 0.6154 0.4737 1.0000 0.6400 0.7547 0.7556 0.8333 0.5455 0.4762 0.6364 0.6829 0.6809 0.3810 0.5714 0.4762 0.6341 0.7442 0.5581 0.6667 0.6500 1.0000 0.7925 0.7143 0.6250 0.5882 0.6383 0.7111 0.5106 0.5455 0.6400 0.5333 0.5778 0.5778 0.5909 0.5652 0.6087 0.5778 0.6047 0.6383 1.0000 0.5532 0.6400 0.6190 0.6222 0.4878 0.4615 0.6829 0.5263 0.5909 0.4103 0.6154 0.5128 0.6316 0.7500 0.5500 0.4615 0.6486 0.7805 0.6364 1.0000 0.4889 0.4583 0.6500 0.5116 0.5641 0.4865 0.4615 0.5000 0.5714 0.6486 0.7027 0.6486 0.5556 0.5263 0.5789 0.4324 0.6286 0.4615 0.5714 0.5556 1.0000 0.5532 0.6800 0.5714 0.5333 0.4878 0.5641 0.7317 0.6842 0.6818 0.4615 0.6154 0.6667 0.5789 0.6500 0.6000 0.6154 0.4865 0.6829 0.5909 0.7368 0.5556 1.0000 ----------------------------------------------- Permutation tests of similarity values Within group 1 > within group 0 with pprime = 0.1450 Within group 1 > between groups 1 and 0 with pprime = 0.0990 Within group 0 > between groups 1 and 0 with pprime = 0.4930 Within group 2 > within group 0 with pprime = 0.3730 Within group 2 > between groups 2 and 0 with pprime = 0.0566 Within group 0 > between groups 2 and 0 with pprime = 0.0896 Within group 2 < within group 1 with pprime = 0.2708 Within group 2 < between groups 2 and 1 with pprime = 0.3800 Within group 1 > between groups 2 and 1 with pprime = 0.2328 ----------------------------------------------- Probability of identical lanes, by group: In group 0 : mean similarity is: 0.5999 mean band number is: 19.8750 sbar_to_xbar = 3.8800e-05 In group 1 : mean similarity is: 0.6278 mean band number is: 24.0000 sbar_to_xbar = 1.4025e-05 In group 2 : mean similarity is: 0.6093 mean band number is: 21.0000 sbar_to_xbar = 3.0290e-05 In whole group: mean similarity is: 0.6009 mean band number is: 21.6818 sbar_to_xbar = 1.5993e-05 Mean similarities between groups: 1 and 0 = 0.5996 2 and 0 = 0.5711 2 and 1 = 0.6147 ----------------------------------------------- Lynch's F_{st} = 0.0424 ----------------------------------------------- Similarities by permuting lanes among groups: Within group 1 > within group 0 with p= 0.2400 Within group 1 > between group 0 and 1 with p= 0.1300 Within group 0 > between group 0 and 1 with p= 0.4512 Within group 2 > within group 0 with p= 0.3628 Within group 2 > between group 0 and 2 with p= 0.0788 Within group 0 > between group 0 and 2 with p= 0.0980 Within group 2 < within group 1 with p= 0.2986 Within 
group 2 < between group 1 and2 with p= 0.4120 Within group 1 > between group 1 and 2 with p= 0.2504 ----------------------------------------------- Bias-corrected estimates of heterozygosity and Stephens estimates In group 0: Stephens estimate = 0.4945 Bias-corrected Stephens estimate = 0.6064 In group 1: Stephens estimate = 0.4449 Bias-corrected Stephens estimate = 0.5408 In group 2: Stephens estimate = 0.4367 Bias-corrected Stephens estimate = 0.5561 In group as a whole: Stephens estimate = 0.5477 Bias-corrected Stephens estimate = 0.5937 ----------------------------------------------- Standard-Stephens heterozygosity estimates for whole data set: Number loci = 14.0087 Heterozygosity = 0.5477 Proportion of loci polymorphic = 0.9286 Avg number alleles per locus = 2.9268 In group 0: Number loci = 13.2991 Heterozygosity = 0.4945 Proportion of loci polymorphic= 0.8496 Avg number alleles per locus = 2.9325 In group 1: Number loci = 16.6106 Heterozygosity = 0.4449 Proportion of loci polymorphic= 0.7592 Avg number alleles per locus = 2.4683 In group 2: Number loci = 14.6164 Heterozygosity = 0.4367 Proportion of loci polymorphic= 0.7263 Avg number alleles per locus = 2.5314 ----------------------------------------------- Standard J-C heterozygosity estimates: For whole data: Number loci = 13.7971 Heterozygosity = 0.5715 Proportion of loci polymorphic= 0.9275 Avg number alleles per locus = 2.9716 In group 0: Number loci = 12.7929 Heterozygosity = 0.5536 Proportion of loci polymorphic= 0.8437 Avg number alleles per locus = 3.0486 In group 1: Number loci = 16.0526 Heterozygosity = 0.4951 Proportion of loci polymorphic= 0.7508 Avg number alleles per locus = 2.5541 In group 2: Number loci = 14.0024 Heterozygosity = 0.4997 Proportion of loci polymorphic= 0.7143 Avg number alleles per locus = 2.6424 ----------------------------------------------- Based on JC-heterozygosities, Nei's F_{ST} = 0.0968 ----------------------------------------------- Permutation tests on heterozygosity of subgroups Test of H values using subsets of size n-2 from each group Fraction group 1 bigger than group 0 = 0.1232 That is, H in group 0 is bigger than H in group 1 with p = 0.1232 Exact fraction of all pairwise comparisons with group 1 heterozygosity bigger than group 0 heterozygosity p = 0.1237 That is, H in group 0 is bigger than H in group 1 with p = 0.1237 Fraction group 2 bigger than group 0 = 0.0526 That is, H in group 0 is bigger than H in group 2 with p = 0.0526 Exact fraction of all pairwise comparisons with group 2 heterozygosity bigger than group 0 heterozygosity p = 0.0548 That is, H in group 0 is bigger than H in group 2 with p = 0.0548 Fraction group 2 bigger than group 1 = 0.2268 That is, H in group 1 is bigger than H in group 2 with p = 0.2268 Exact fraction of all pairwise comparisons with group 2 heterozygosity bigger than group 1 heterozygosity p = 0.2310 That is, H in group 1 is bigger than H in group 2 with p = 0.2310 DONE END SAMPLE.OUT