STATISTICS IN GENETICS
Schedule
Time Allocation
  • 50 minutes for each of the invited talks (incl. 10 min discussion),
  • 30 minutes for each of the contributed talks (incl. 5 min discussion)
 
Time | Wednesday, August 14, 2002
9.30-10.20 | David Balding | Approximate Bayesian Computation in Population Genetics
10.20-10.50 | Nicola Armstrong | Incorporating Interference into the Linkage Analysis of Experimental Crosses
10.50-11.20 | Oliver Pybus | Using Statistical Genetics To Study Viral Epidemic Behaviour
11.20-13.30 | Lunch Break
13.30-14.20 | Carsten Wiuf | Inferring Recombination from Genotypes
14.20-14.50 | Bojan Basrak | A nonparametric method of QTL mapping using sib-pairs
14.50-15.20 | Heike Bickeböller | Identification and Analysis of Candidate Genes for Complex Diseases
15.20-15.50 | Coffee Break
15.50-16.40 | Gesine Reinert | Words count statistics for biological sequences
16.40-17.10 | Anne-Laure Boulesteix | Stochastic Modelling for the COMET-assay
17.10-17.40 | Patrick Lindsey | Analysis of measurement errors in quantitative PCR data using multivariate distributions with correlation matrices
17.40 | Poster Session and Buffet
 
Time | Thursday, August 15, 2002
9.30-10.20 | Paul Eilers | Statistical Applications in Microarray Technology
10.20-10.50 | Ingrid Glad | Towards a comprehensive statistical model of cDNA microarray experiments
10.50-11.20 | Mark van de Wiel | Significance Analysis of Microarrays using Rank Scores
11.20-13.30 | Lunch Break
13.30-14.20 | Arndt von Haeseler | Assessing Variability by Joint Sampling of Alignments and Mutation Rates
14.20-14.50 | Martin Möhle | A Markov Chain Monte Carlo Algorithm for the Ewens Sampling Distribution and Applications to Neutrality Tests
14.50-15.20 | Ellen Baake | Mutation-selection models: Branching, ancestry, and large deviations
15.20-15.50 | Coffee Break
15.50-16.40 | Nuala Sheehan | A graphical models perspective on complexity in genetics
16.40-17.10 | Andreas Wienke | Genetic analysis of cause of death: advantages and limitations of frailty models
17.10-17.40 | Qihua Tan | A survival analysis approach for measuring gene longevity associations
 
Time | Friday, August 16, 2002
9.30-10.20 | Chris Holmes | Detection and modelling of interactions in gene expression using Bayesian spline models
10.20-10.50 | Anja von Heydebreck | Variance stabilization and robust normalization for microarray gene expression data
10.50-11.20 | Korbinian Strimmer | Quasi-Likelihood and Gene Expression Analysis
11.20-13.30 | Lunch Break
13.30-14.20 | Michael Newton | On modeling genomic aberrations in cancer cells
14.20-14.50 | Annibale Biggeri | A hierarchical Bayesian model to study temperature-dependent variation of sequence-specific hybridization to cDNA Microarray
14.50-15.20 | Hizir Sofyan | Fuzzy Clustering in Gene Expression Data
15.20-16.10 | Terry Speed | Summarizing and comparing GeneChip data
16.10 | Farewell Drinks

Posters

Name | Title
Florian Burckhardt | GEM: Centres for Genetic Epidemiological Studies
David De Lorenzo | Detecting footprints of natural selection along the genome of Drosophila melanogaster
Ma'ayan Fishelson | Superlink: A New Program for Exact Genetic Linkage
Gideon Greenspan | Bayesian Learning of Haplotype Blocks
Andreas Hahn | Expression analysis in Inflammatory bowel disease
Sonia Hernandez Alonso | Strategies for detecting several loci in the same chromosome affecting a trait
Volkmar Liebscher | Stochastic modelling for gene expression data
Andreas Martin | Stochastic Models for Delta-Notch Regulation
Kerstin Roselius | Molecular population genetics and speciation in Lycopersicon
Claus Vogl | Variation within and between populations in Drosophila ananassae


Abstracts

Nicola Armstrong Incorporating Interference into the Linkage Analysis of Experimental Crosses
Genetic mapping makes use of recombination patterns to order a set of loci and estimate interlocus distances. In practice, genetic interference is usually assumed to be absent, although crossover interference has been observed in many organisms. Here we examine the practical implications of incorporating interference via the chi-square model in genetic mapping of experimental organisms. We extend the (no interference) Lander-Green algorithm (PNAS 1987) to include crossover interference by introducing a modification to their hidden Markov model (HMM) of the recombination process.
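
To illustrate what crossover interference does to recombination data, here is a minimal simulation sketch of a chi-square-type interference model. It assumes the usual formulation in which chiasmata are every (m+1)-th point of a Poisson process on the four-strand bundle and each chiasma produces a crossover on a given strand with probability 1/2; m = 0 corresponds to no interference. The function names, the parameter m, and the random starting phase are illustrative assumptions and are not taken from the talk, which concerns the HMM extension rather than simulation.

```python
import random


def simulate_chiasmata(length_morgans, m, rng):
    """Chiasma positions on a four-strand bundle of the given genetic length.

    Assumption: intermediate points form a Poisson process with rate
    2 * (m + 1) per Morgan and every (m + 1)-th point is a chiasma, so
    chiasmata occur at an average rate of 2 per Morgan.
    """
    rate = 2.0 * (m + 1)
    # Random phase: number of points to skip before the first chiasma,
    # chosen uniformly so that the process is stationary.
    countdown = rng.randint(0, m)
    position = 0.0
    chiasmata = []
    while True:
        position += rng.expovariate(rate)
        if position > length_morgans:
            return chiasmata
        if countdown == 0:
            chiasmata.append(position)
            countdown = m
        else:
            countdown -= 1


def simulate_crossovers(length_morgans, m, rng):
    """Crossovers on one strand: each chiasma is kept with probability 1/2."""
    return [x for x in simulate_chiasmata(length_morgans, m, rng)
            if rng.random() < 0.5]


def crossover_count_stats(length_morgans, m, n_sims=20000, seed=1):
    """Mean and variance of the number of crossovers per meiosis."""
    rng = random.Random(seed)
    counts = [len(simulate_crossovers(length_morgans, m, rng))
              for _ in range(n_sims)]
    mean = sum(counts) / n_sims
    var = sum((c - mean) ** 2 for c in counts) / n_sims
    return mean, var


if __name__ == "__main__":
    for m in (0, 4):
        print("m =", m, crossover_count_stats(1.0, m))
```

In this sketch the expected number of crossovers per Morgan is the same for every m, while the variance of the count shrinks as m grows, which is the even spacing of crossovers that a no-interference analysis ignores.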
Ellen Baake Mutation-selection models: Branching, ancestry, and large deviations
In joint work with J. Hermisson, O. Redner, and H.-O. Georgii, we consider the genetics of populations under the joint action of mutation and selection. To this end, mutation and reproduction are modelled as a multi-type branching process, of which we consider both the forward and the backward direction of time. The stationary state of the reversed process is the ancestral distribution, which turns out to be a key to the study of mutation-selection balance. For the single-step mutation model, we analyze large deviations of the empirical measure of the backward process. The rate function for the empirical mean genotype is given by a variational formula which can be solved explicitly in a number of limiting cases. The equilibrium properties of both the present and the ancestral population are then obtained via a second variational step, which involves maximization over one scalar variable only.
Ref.: J. Hermisson, O. Redner, H. Wagner, and E. Baake, Mutation-selection balance: Ancestry, load, and maximum principle, Theor. Pop. Biol., in press.
David Balding Approximate Bayesian Computation in Population Genetics
I will discuss methods that have been evolving recently within the population genetics literature for approximating low-dimensional marginal posterior distributions under complex models involving large numbers of nuisance parameters. Although MCMC is sometimes feasible, there are typically problems with poor mixing, and model comparison is usually unachievable. The alternative being proposed is based on simulation of parameters and datasets, from the prior and model respectively, followed by various forms of nonparametric regression to model the posterior density in terms of appropriate data summary statistics. Several levels of approximation are involved, but the reward is the ability to handle complex models and to perform model comparison via approximate Bayes factors.

I will discuss applications in human population genetics and conservation genetics, as well as the possibility of applications in other fields.

This is joint work with Wenyang Zhang, Statistics, University of Kent, and Mark Beaumont, Animal & Microbial Sciences, University of Reading.
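
The simulate-then-regress idea described in this abstract can be sketched on a toy problem. The example below is an assumption-laden illustration, not the speaker's implementation: the model (a normal mean with known variance), the uniform prior, the single summary statistic (the sample mean), the Epanechnikov weights and the local-linear adjustment are all choices made here for illustration, and the function names `abc_regression` and `toy_example` are hypothetical.

```python
import numpy as np


def abc_regression(s_obs, prior_sampler, simulate_summary,
                   n_sims=20000, accept_frac=0.05, rng=None):
    """Rejection ABC followed by a local-linear regression adjustment.

    Returns adjusted parameter draws and their kernel weights.
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = prior_sampler(rng, n_sims)          # parameters from the prior
    s = simulate_summary(rng, theta)            # one summary per simulated dataset
    dist = np.abs(s - s_obs)
    delta = np.quantile(dist, accept_frac)      # tolerance = quantile of distances
    keep = dist <= delta
    theta_k, s_k, d_k = theta[keep], s[keep], dist[keep]
    w = 1.0 - (d_k / delta) ** 2                # Epanechnikov kernel weights
    # Weighted least squares of theta on (s - s_obs) among accepted draws.
    X = np.column_stack([np.ones_like(s_k), s_k - s_obs])
    sw = np.sqrt(w)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], theta_k * sw, rcond=None)
    # Shift each accepted draw to where it would sit if s had equalled s_obs.
    theta_adj = theta_k - coef[1] * (s_k - s_obs)
    return theta_adj, w


def toy_example(seed=0, n_obs=50, s_obs=1.3):
    """Toy model (assumed): data ~ N(theta, 1), summary = sample mean."""
    rng = np.random.default_rng(seed)
    prior = lambda r, n: r.uniform(-5.0, 5.0, size=n)
    summary = lambda r, th: th + r.normal(0.0, 1.0 / np.sqrt(n_obs), size=th.shape)
    theta_adj, w = abc_regression(s_obs, prior, summary, rng=rng)
    post_mean = np.average(theta_adj, weights=w)
    post_sd = np.sqrt(np.average((theta_adj - post_mean) ** 2, weights=w))
    return post_mean, post_sd


if __name__ == "__main__":
    print(toy_example())
```

For this toy model the exact posterior under a flat prior is normal with mean equal to the observed sample mean and standard deviation 1/sqrt(50), so the weighted mean and spread of the adjusted draws can be compared against a known answer.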
Bojan Basrak A nonparametric method of QTL mapping using sib-pairs
There are several well established parametric methods used for quantitative trait loci (QTL) linkage analysis. Most of them rely on the assumption that the trait is normally distributed. Such methods, used with appropriate thresholds of significance (based for instance on the interval mapping analysis of Lander and Botstein), have been applied with considerable success in many medical and biological studies. Still, many traits do not follow a normal distribution.
Biologists try to deal with this problem by performing some nonlinear transformation of the trait. However, it seems that by doing so, they frequently improve the fit to the normal distribution only for observations close to the mean of the distribution. This is very dangerous, since the informativeness of a sib-pair depends on how extreme its trait values actually are. In particular, Risch and Zhang showed that the most informative pairs are usually found at the very extremes of the phenotypic distribution. This fact is well known among practitioners as well.
Nonparametric methods of QTL mapping have been considered in the past, by Kruglyak and Lander for instance. But unlike their method our approach is derived with human studies in mind. Both simulation studies and real life applications show interesting potential of the method that we propose.
Heike Bickeböller Identification and Analysis of Candidate Genes for Complex Diseases
The goals and means of genetic epidemiological studies using genetic marker data will be discussed. The tool box for the investigation of a susceptibility gene for a complex trait contains linkage and association approaches and various study designs. Principles of linkage and association analyses will be briefly reviewed. Linkage analysis is widely used in genome scans and candidate gene studies. Association, including family-based association studies, is generally used for candidate genes and for fine mapping. Both genome scan and candidate gene strategies have their advantages. They are not mutually exclusive, but should be combined. For genome scans there is no need to define biological mechanisms. With non-parametric linkage methods, the major problem is to achieve good power even for genes with a moderate effect while using stringent criteria for genome-wide significance levels. The use of association in genome scans does not seem feasible at the moment, even with 1 cM maps, because strong to moderately strong linkage disequilibrium is needed. Candidate genes are genes which may be "functionally related to the disease". In a narrow sense the gene (or gene region) should relate to a demonstrated pathophysiologic abnormality or to an animal model of the disease. In a broad sense the gene is part of a biological system hypothesized to play a role in the disorder. Ideally the gene is in a chromosomal region in which (or close to which) some hints of linkage have been observed. If an abundance of candidates is hypothesized, a more stringent significance level, perhaps even close to a genome-wide level, needs to be applied.
Annibale Biggeri A hierarchical Bayesian model to study temperature-dependent variation of sequence-specific hybridization to cDNA Microarray
A. Biggeri, S. Toti, C. DeFilippo, K.Morneau, A. Bergerat, M. Gasparini and D. Cavalieri

"Center for Genomics Research" Dept. of Pharmacology - Dept of Statistics, University of Florence
Dept. of Mathematics, Politecnico di Torino
Bauer Center for Genomic Research, Harvard University

An open issue in microarray experiments on samples whose genomic sequence is not necessarily identical to that used to design the probes printed on the array is the extent of signal variation related to sequence specificity or cross-hybridisation. A second issue regards the consequences of this phenomenon for inference about relative expression levels. To address these topics we designed a microarray experiment on Saccharomyces cerevisiae as follows: S. cerevisiae DNA microarrays were produced by amplifying all the genes of the genome of S. cerevisiae strain S288c (6200 ORFs) and spotting the purified PCR products on poly-L-lysine coated slides using a robot arrayer. S288c is the laboratory strain whose sequence was determined by the S. cerevisiae genome sequencing project. DNA from two different strains, S288c and a wine strain isolated from Montalcino grapes, was allowed to hybridise to a microarray representing all the yeast coding regions at three different temperatures (50, 55, 60 °C). Replicates are produced for each cell of the factorial design with dye swap.
Since we considered two strains of S. cerevisiae whose genomic sequences differ, we expect differential hybridisation to occur. Differential hybridisation can be the result of incompletely specific binding of the wine strain DNA to the array containing probes designed on the S288c DNA sequence. It reflects the amount of sequence variation in a given probe. Varying the hybridisation temperature would modulate such partially specific binding to an extent to be quantified, but might also cause nonspecific binding to other probes on the array.

We specify a hierarchical Bayesian model to decompose the different sources of variability and adjust for temperature-related differential hybridization and cross-hybridization. To this purpose, expression levels for each of the two dyes are written symbolically as

level = slide + dye + temperature(strain) + strain + gene + error

and appropriate interaction terms. The effect of temperature is treated as random, since it could vary between spots. Whether such a variation exists is one of the hypotheses to be tested with this experiment.
Next, gene-specific random effects and spatially structured error random effects are specified. Alternatively, space-varying coefficient models could be assumed for the temperature effects. Both models are realistically complex attempts at taking into account possible array effects and can be analysed in the presence of a large amount of data. Full joint posterior distributions are approximated by MCMC methods. Formal model comparison is based on the Deviance Information Criterion.
Anne-Laure Boulesteix Stochastic Modelling for the COMET-assay
The COMET-assay is a tool in biology and medicine to determine defects in the DNA repair mechanism of mammals by single-cell gel electrophoresis. We present a simple point process model for the available image data. The problem of deducing whether a repair defect is present relates strongly to a deconvolution problem for empirical intensities.
Florian Burckhardt GEM: Centres for Genetic Epidemiological Studies
Wichmann, H.E., Burckhardt

In medical genetics, the focus is shifting from monogenic to genetically complex diseases. Their etiology is multifactorial and is presumably based on several partly interacting genetic mechanisms, which in addition may be modulated by environmental factors. With their higher prevalence, complex diseases such as Alzheimer's disease, asthma or diabetes also cause a higher burden of illness and are thus of high socioeconomic relevance.
Research strategies for complex diseases require large samples of cases and controls or families with pedigree structures according to a well defined study design in order to achieve sufficient statistical power. A close collaboration between clinicians and scientists from the fields of human genetics, molecular genetics, genetic epidemiology and genetic biometry is essential for a successful study.
The logistical complexity involved is coordinated by special competence centres for Genetic Epidemiological Methods (GEM). The centres form an integrated structure and common research platform for the study of complex diseases and offer a broad spectrum of methods and logistical support. Each of the seven GEMs in Germany focuses on separate aspects. GEM Munich specialises in logistics for family ascertainment with the accompanying information management and interface to genotyping facilities for SNPs. Further issues include good epidemiological practice, ethics and privacy management.
David De Lorenzo Detecting footprints of natural selection along the genome of Drosophila melanogaster
David De Lorenzo, Sascha Glinka, Lino Ometto and Wolfgang Stephan

Previous surveys of D. melanogaster have shown that chromosomal regions with low recombination rates exhibit reduced levels of nucleotide variation. In several cases, this has been demonstrated to be due to the effect of positively selected variants, which on their way to fixation sweep linked polymorphisms. These "selective sweeps" affect long DNA tracts in regions of low recombination, but are expected to cause reductions of variation over shorter DNA fragments in chromosomal segments of medium to high recombination.
The goals of this project are: 1) to identify and characterize the genes that were subject to positive directional selection, and 2) to estimate the rate of advantageous substitutions in the recent history of this species. For these purposes, we surveyed levels of intraspecific polymorphism and interspecific divergence (compared with D. simulans) in two natural populations of D. melanogaster, one from Europe and one from Africa. The first results revealed several regions with significantly reduced levels of polymorphism in the European population, but none in the African population. This is consistent with the hypothesis that selective sweeps occurred in the European population during the recent adaptation of this population to temperate zones.
Paul Eilers Statistical Applications in Microarray Technology
Microarrays are a real challenge to statistics: never before were we confronted with data sets with so few observations on so many (very) noisy, unordered variables. And never before were our clients so convinced that there are tons of biological information hidden in these data. As a result, papers with many types of creative ad-hoc "statistical" procedures appear in high-profile journals and "Bioinformatics" is one of the best fund-raising buzzwords of these years.
Much less attention goes to statistics as applied to the technological aspects of microarrays. Perhaps some day they will be reliable and accurate chemical instruments, but in the meantime we are confronted with many sources of variation and bias and it is of great practical value to isolate them, to estimate their size and, if possible, to compensate for them.
My presentation is inspired by a study in which the quality of translation of mRNA to cDNA was investigated. Translation starts at one end of the mRNA molecule and there is a chance that it may not succeed completely, meaning that many mRNA molecules will be translated incompletely. This can be studied systematically by spotting an array with oligos that appear near the start, the middle and the end of a molecule of mRNA or its corresponding cDNA. The study was well designed, with threefold replication of arrays and sixfold replication of genes, comparing the hybridisation efficiency of mRNA and cDNA, to "sense" and "anti-sense" sequences.
This type of data fits nicely into the classical Analysis of Variance (ANOVA) framework. It has similarities with large-scale agricultural experiments, where the spots on the array take the role of the traditional plots of land. A bonus is that classical software can be used, leading to quick answers and meaningful statements of significance.
As in agriculture, spatial analysis is also of great interest here. Background strength and hybridisation efficiency generally show strong systematic patterns, which can be coupled to subgrids, rows, columns, printing pins, or to general spatial trends. This leads to quite complicated models that combine two-dimensional smoothing and ANOVA.
Ma'ayan Fishelson Superlink: A New Program for Exact Genetic Linkage
Superlink: A New Program for Exact Genetic Linkage Analysis of General Pedigrees

Ma'ayan Fishelson, Dan Geiger

Keywords: Bayesian networks, Fastlink, Genehunter, Genetic linkage analysis, Vitesse, Two loci disease models.

Genetic linkage analysis is a useful tool for mapping disease genes. It allows one to use statistical tools to associate functionality of genes to their location on the chromosome. Generally speaking, this analysis uses a probabilistic model of inheritance of genetic materials and applies it to data in the form of pedigrees, where some of the individuals are annotated with information on the trait of interest and information on their genetic makeup. As highly-informative genetic marker maps have been developed, multipoint linkage analysis has become a crucial part of linkage analysis studies due to its superiority over pairwise linkage analysis for locating genes and detecting linkage. However, the computational complexity required to perform such calculations increases exponentially due to the large number of markers that participate in the analysis, the high polymorphism of the markers under study, the size of the pedigree, and the number of untyped people in the pedigree. These factors highly constrain the space and time requirements of existing programs. Some programs fail to run as the number of markers, the degree of polymorphism of the markers, or the size of the pedigree increases. Other programs can handle a large number of markers but can only analyze small pedigrees. We have addressed the increasing need for a program that performs multipoint likelihood calculations on general pedigrees with a higher number of polymorphic markers. We implemented our algorithms in a computer program, called Superlink, that computes pedigree likelihood for complex diseases in the presence of multiple polymorphic markers in fully general pedigrees, taking into account a variety of disease models. Superlink compares favorably with current linkage software with regard to the following criteria: functionality, speed, memory requirements and extensibility. This can be seen from the experimental results described below.
Currently, there are two main approaches to computing pedigree likelihood exactly: Elston-Stewart [3] and Lander-Green [5,6,7]. Both of these algorithms are variants of variable elimination methods [2,16] that depend on different strategies for finding an elimination order. The complexity of the Elston-Stewart algorithm is linear in the pedigree size (for pedigrees with a simple structure) but exponential in the number of markers. On the other hand, the Lander-Green method is linear in the number of markers but exponential in the number of individuals. In Superlink, we used the framework of Bayesian networks as the internal representation of linkage analysis problems [4]. Using this representation allows us to give a unified treatment to both approaches and to handle a wide variety of linkage analysis problems. Whenever feasible, we use variable elimination alone to calculate the likelihood of the pedigree. Otherwise, our algorithm combines variable elimination with conditioning (a divide and conquer approach) to achieve the best time-space tradeoff given the memory available for the linkage analysis problem. The crucial point of the algorithm is that conditioning is performed only after some steps of variable elimination, when the memory requirements are about to exceed the limitations. Such conditioning often applies only to parts of the Bayesian network and thus, computations in other, unrelated, parts of the network are not repeated unnecessarily. The elimination order is chosen automatically according to the parameters of the specific linkage problem. For small pedigrees with a large number of markers, the algorithm chooses a peeling order, based on the Lander-Green approach, that proceeds locus after locus. For large pedigrees with a few markers, the algorithm chooses an Elston-Stewart style elimination order which "peels" one nuclear family at a time. Other linkage problems are handled by finding a good elimination order. Often the program chooses an elimination order that is a combination of these two extreme known choices of ordering.
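
The role of the elimination order can be seen in a minimal sketch of variable elimination on a tiny Bayesian network. This is an illustration only and not Superlink's code: the three-variable chain A -> B -> C, its probability tables, and the function names are hypothetical, and the conditioning step described above is not included. Factors are stored as a tuple of variable names plus a table keyed by value tuples.

```python
from itertools import product


def multiply(f, g, domains):
    """Pointwise product of two factors over the union of their variables."""
    f_vars, f_table = f
    g_vars, g_table = g
    variables = tuple(dict.fromkeys(f_vars + g_vars))
    table = {}
    for values in product(*(domains[v] for v in variables)):
        a = dict(zip(variables, values))
        table[values] = (f_table[tuple(a[v] for v in f_vars)]
                         * g_table[tuple(a[v] for v in g_vars)])
    return variables, table


def sum_out(f, var):
    """Marginalise one variable out of a factor."""
    f_vars, f_table = f
    i = f_vars.index(var)
    table = {}
    for key, value in f_table.items():
        reduced = key[:i] + key[i + 1:]
        table[reduced] = table.get(reduced, 0.0) + value
    return f_vars[:i] + f_vars[i + 1:], table


def eliminate(factors, order, domains):
    """Sum out every variable in `order`; returns the resulting scalar."""
    factors = list(factors)
    for var in order:
        touching = [f for f in factors if var in f[0]]
        rest = [f for f in factors if var not in f[0]]
        if not touching:
            continue
        combined = touching[0]
        for f in touching[1:]:
            combined = multiply(combined, f, domains)
        factors = rest + [sum_out(combined, var)]
    result = 1.0
    for f_vars, f_table in factors:
        assert f_vars == (), "order must cover all variables"
        result *= f_table[()]
    return result


# Hypothetical network A -> B -> C with evidence C = 1 already absorbed.
DOMAINS = {"A": (0, 1), "B": (0, 1)}
P_A = (("A",), {(0,): 0.6, (1,): 0.4})
P_B_GIVEN_A = (("A", "B"), {(0, 0): 0.7, (0, 1): 0.3, (1, 0): 0.2, (1, 1): 0.8})
P_C1_GIVEN_B = (("B",), {(0,): 0.1, (1,): 0.9})
FACTORS = [P_A, P_B_GIVEN_A, P_C1_GIVEN_B]

if __name__ == "__main__":
    print(eliminate(FACTORS, ["A", "B"], DOMAINS))
    print(eliminate(FACTORS, ["B", "A"], DOMAINS))
```

Both orders give the same likelihood; what changes is the size of the intermediate tables, which is the quantity an elimination-order heuristic tries to keep small on realistic pedigrees.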
Another crucial feature of our program is the preprocessing step performed on the Bayesian network that reduces the range of values that are feasible for each variable given the data. This step often has a large impact on the memory and time requirements of the calculations. Superlink allows for analysis of sex-linked traits and also allows for a disease phenotype to be under the control of two loci [11, 14, 15].
We have run several experiments to compare our program to some of the leading linkage programs currently available: Fastlink [1, 8, 12, 13], Genehunter [5, 6, 7] and Vitesse v1.0 [10]. We have not been able, so far, to try Vitesse v2.0 [9] but we have indications that our program outperforms it on all inputs. All experiments were run on a Sun OS version 5.7 (sun4u) machine with 2624 MB RAM. In one of the experiments, we used 12 datasets with a medium-sized topology taken from a coronary heart disease study and increasing complexity in terms of the number of loci. The pedigree size exceeds the size that can be handled by Genehunter, and only the first few files can be run by Fastlink and Vitesse before the memory requirements become too large. Superlink can run on all the files except for the last one, on which the computation would require over 100 hours to complete. It is also important to note that, for the files that could be run by Fastlink and Vitesse, the running times are shorter for Superlink. For example, dataset EA2 required 0.39 seconds by Vitesse and 79.32 seconds by Fastlink and only 0.14 seconds by our program. Dataset EA5 required 84.66 seconds by Vitesse and only 1.19 seconds by Superlink. This dataset cannot run on Fastlink. In another experiment we used a medium-sized looped topology. Vitesse does not handle looped pedigrees and therefore failed to run on these files. Fastlink can only run on the first file and its running time is 3933.7 seconds, whereas Superlink takes only 2.56 seconds to run on this file. More experimental results, the full paper, data sets, and executables are available at bioinfo.cs.technion.ac.il/superlink.

References:
  1. Cottingham, R.W., Idury, R.M. and Schäffer, A.A. 1993. Am. J. of Hum. Genet., 53:252-263.
  2. Dechter, R. 1998. In M.I. Jordan (Ed.), Learning in Graphical Models (pp. 75-104). Kluwer Academic Press.
  3. Elston, R.C. and Stewart, J. 1971. Hum. Hered., 21:523-542.
  4. Friedman, N., Geiger, D. and Lotner, N. 2000. Proc. Sixteenth Conf. of UAI.
  5. Kruglyak, L., Daly, M.J. and Lander, E.S. 1995. Am. J. of Hum. Genet., 56:519-527.
  6. Kruglyak, L., Daly, M.J., Reeve-Daly, M.P. and Lander, E.S. 1996. Am. J. of Hum. Genet., 58:1346-1363.
  7. Lander, E.S. and Green, P. 1987. Proc. Natl. Acad. Sci., 84:2363-2367.
  8. Lathrop, G.M. and Ott, J. 1990. Am. J. of Hum. Genet., 47(A188).
  9. O'Connell, J.R. 2001. Hum. Hered., 51(4):226-240.
  10. O'Connell, J.R. and Weeks, D.E. 1995. Nat. Genet., 11:402-408.
  11. Risch, N. 1990. Am. J. of Hum. Genet., 46:222-228.
  12. Schäffer, A.A. 1996. Hum. Hered., 46:226-235.
  13. Schäffer, A.A., Gupta, S.K., Shriram, K. and Cottingham, R.W. 1994. Hum. Hered., 44:225-237.
  14. Schork, N.J., Boehnke, M., Terwilliger, J.D. and Ott, J. 1993. Am. J. of Hum. Genet., 53:1127-1136.
  15. Strauch, K., Fimmers, R., Kurz, T., Deichmann, K.A., Wienker, T.F. and Baur, M.P. 2000. Am. J. of Hum. Genet., 66:1945-1957.
  16. Zhang, N.L. and Poole, D. 1994. In Proc. of the Tenth Canadian Conference on Artificial Intelligence, 171-178.
Ingrid Glad Towards a comprehensive statistical model of cDNA microarray experiments
A microarray experiment is a complex sequence of several complicated laboratory and/or computer related tasks, leading in the end to an I × J matrix of numbers representing gene expressions for I genes and J individuals or replications.
Often, each task is studied or modelled separately, and the 'result' of one step is plugged into the next step for further processing. This facilitates a good overview and optimal handling in each step, but also creates major drawbacks: uncertainty is not propagated through the sequence of tasks, each step is based on its own statistical model, and these models may contradict one another from task to task, and it is not possible to perform joint inference from the experiment as a whole.

We develop a coherent and structured statistical model which describes the experiment from bottom to top, takes care of uncertainties and dependencies, and answers specific questions of interest. We follow a hierarchical Bayesian modelling framework and resort to MCMC for making inference. Starting out with the unknown number of mRNA molecules of a gene i (i=1,...,I) in the target solution of a specific tissue, we follow these molecules through the process of reverse transcription and dyeing, mixing and centrifugation, bathing, hybridization, and washing. In the hybridization modelling we also include a model for the production of slides. Conditioned on the initial amount of mRNA in the target solution, we obtain a hierarchical model for the final amount of remaining hybridized target on each probe.
On top of this comes the image analysis step, producing colour intensities for spots and background in two channels via imaging machine specific scanning, segmentation, filtering and possibly normalisation. At the ultimate level in our hierarchical model we combine different arrays in order to answer interesting questions related to, for example, differences before and after cancer treatment, differences in biopsies from the same tumor (intra tumor heterogeneity) or clustering of patients and/or genes.

This is joint work with Arnoldo Frigessi (Norwegian Computing Center), Heidi Lyng (Norwegian Cancer Hospital) and Mark van de Wiel (Eindhoven University of Technology).

Related literature:
  • Baggerly, K.A., Coombes, K.R., Hess, K.R, Stivers, D.N., Abruzzo, L.V., and Zhang, W. (2001). Identifying differentially expressed genes in cDNA microarray experiments. Journal of Computational Biology 8, 639-659.
  • Efron, B., Tibshirani, R., Storey, J.D., and Tusher, V. (2001). Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association 96, 1151-1160.
  • Newton, M.A., Kendziorski, C.M., Richmond, C.S., Blattner, F.R., and Tsui, K.W. (2001). On differential variability of expression ratios: Improving statistical inference about gene expression changes from microarray data. Journal of Computational Biology 8, 37-52.
Gideon Greenspan Bayesian Learning of Haplotype Blocks
Observable haplotype blocks arise from the interaction between recombination hot-spots, bottleneck effects and genetic drift. The presence of recombination hot-spots in human chromosomes has been demonstrated by several recent high-resolution studies of SNP covariation. The hot-spots separate stretches of up to 100,000 base pairs in which almost no recombination takes place, so the SNPs lying between hot-spots act as a single multi-site allele or 'haplotype block'. A bottleneck occurs when a locally-reproducing population is descended from a small group of individuals, for example due to migration. As the new population grows, it will exhibit far less genetic variation within each block than expected for its size. These small populations also undergo significant genetic drift, in which the variation is decreased further by many generations of random mating.

An accurate statistical model of the haplotype blocks present in a chromosomal region can be used to strengthen the power of genetic association analysis, improve the accuracy of general haplotype resolution and further our understanding of the recombination process itself. Empirical studies of populations descended from a bottleneck, confirmed by our simulations, show that the haplotype blocks in a chromosomal region can be modelled by a dynamic Bayesian network. Each hidden variable corresponds to the ancestral source of a haplotype block, with first-order Markovian transition probabilities reflecting the recombination which has occurred at the hot-spots in the intervening generations. Each SNP observed in an individual chromosome depends upon the ancestral block from which it is descended, under a suitable site-specific mutation model.

We have developed a general tool which learns this dynamic Bayesian network model from raw SNP data. The problem differs from classical Markov model training in several ways. The location of hot-spots is not given, requiring a selection between 2^(loci-1) possible network topologies. The values for each hidden state must also be inferred, a difficulty compounded by the presence of failed measurements and the fact that only joint SNP measurements from pairs of chromosomes are often available. Utilizing an ML (maximum likelihood) approach leads to over-fitting, producing a model in which there are no recombination hot-spots and too many ancestral haplotypes. So we adopt the MDL (minimum description length) criterion, which seeks to minimize the number of bits required to represent data D with a model M, given by DL(M)-log2(Pr(D|M)).
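
The trade-off expressed by the MDL criterion above can be written out in a few lines. The sketch below only evaluates the stated score DL(M) - log2(Pr(D|M)) and the stated count of 2^(loci-1) topologies; the two candidate models and their bit counts are invented numbers for illustration, and the function names are hypothetical.

```python
def n_topologies(loci: int) -> int:
    """Number of ways to place hot-spots in the loci - 1 gaps between SNPs."""
    return 2 ** (loci - 1)


def mdl_score(model_bits: float, log2_likelihood: float) -> float:
    """Total description length in bits: DL(M) - log2 Pr(D | M)."""
    return model_bits - log2_likelihood


# Invented example: a richer model fits the data better (higher likelihood)
# but costs more bits to describe.
CANDIDATES = {
    "few blocks": {"model_bits": 20.0, "log2_likelihood": -100.0},
    "many blocks": {"model_bits": 90.0, "log2_likelihood": -60.0},
}


def best_by_mdl(candidates):
    """Name of the candidate with the smallest total description length."""
    return min(candidates, key=lambda name: mdl_score(**candidates[name]))


def best_by_ml(candidates):
    """Name of the candidate with the largest likelihood."""
    return max(candidates, key=lambda name: candidates[name]["log2_likelihood"])


if __name__ == "__main__":
    print(n_topologies(10))
    print(best_by_ml(CANDIDATES), best_by_mdl(CANDIDATES))
```

With these invented numbers maximum likelihood prefers the richer model, while the MDL score prefers the simpler one (120 bits against 150), which is the over-fitting protection the abstract refers to.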

Starting with no hot-spots, our search strategy iterates over possible hot-spot insertions (or deletions and nudges in later rounds), trying only those operations which improve our score more than their neighbors, repeating until no further improvement can be found. For a particular assignment of hot-spots, the ancestral haplotype blocks for each subject are obtained via model-based Bayesian clustering, which handles both joint unphased and failed measurements. The transition probabilities between the discovered blocks are inferred by EM, as are site-specific mutation rates. Tests on both simulated and real-world data demonstrate our method's ability to recover the haplotype block distribution of a chromosomal region from phased or unphased samples. Our algorithm is guaranteed to converge and takes O(loci*samples) time.
Arndt von Haeseler Assessing Variability by Joint Sampling of Alignments and Mutation Rates
Joint work with Dirk Metzler, Roland Fleißner and Anton Wakolbinger

When two sequences are aligned with a single set of alignment parameters, or when mutation parameters are estimated on the basis of a single "optimal" sequence alignment, the variability of both the alignment and the estimated parameters can be seriously underestimated. To obtain a more realistic impression of the actual uncertainty, we propose sampling sequence alignments and mutation parameters simultaneously from their joint posterior distribution given the two original sequences. We illustrate our method with human and orangutan sequences from the hypervariable region I and with gene-pseudogene pairs.
Andreas Hahn Expression analysis in Inflammatory bowel disease
Inflammatory bowel disease comes in two forms: Crohn's disease and ulcerative colitis. Both are genetically complex, are also influenced by environmental factors, and show increasing prevalence and incidence. The two forms of the disease are of high interest because, until now, very little has been known about their aetiology. Following the identification of NOD2 as an important gene in the development of Crohn's disease, we analysed expression data from 33 persons and 35000 clones. Expression levels were normalized and compared across three groups of persons with different disease status (normal, Crohn's disease, and ulcerative colitis). We present new candidate genes and give some functional explanation of their influence on the disease.
Sonia Hernandez Alonso Strategies for detecting several loci in the same chromosome affecting a trait
In the last few years, several multilocus linkage methods for the simultaneous detection of multiple loci affecting a complex or quantitative trait have been introduced. These include the approaches by Cordell et al. (2000), Cox et al. (1999), Dupuis et al. (1995) and Tang (2000). Most of these analyses, however, focus on cases in which the various loci influencing the trait lie on different chromosomes. An exception is found in Farrall (1997), who proposes a maximum likelihood-based method which maintains its power to detect linkage when the susceptibility genes are on the same chromosome, provided that they are not tightly linked.
Nevertheless, there are evolutionary reasons to expect that genes affecting the same trait may lie very close to each other on the same chromosome. In such cases, the above-mentioned methods for simultaneous detection fail to detect the different loci. I analyse this problem and possible alternative solutions for both continuous and binary (affected/non-affected) traits.

References:
  • Cox et al. (1999). Nat. Genet. 21:213-215
  • Cordell et al. (2000). Am. J. Hum. Genet. 66:1273-1268
  • Dupuis et al. (1995). Genetics 140:843-856
  • Farrall (1997). Genet. Epidemiol. 14:103-115
  • Tang (2000). Stanford University, Ph. D. Thesis.
Anja von Heydebreck Variance stabilization and robust normalization for microarray gene expression data
Authors: Anja von Heydebreck, Wolfgang Huber, Annemarie Poustka, Martin Vingron

A.v.H., M.V.: Max-Planck-Institute for Molecular Genetics, Berlin, Germany
W.H., A.P.: German Cancer Research Center, Heidelberg, Germany

DNA-microarrays simultaneously measure the expression levels of thousands of genes in a tissue or cell culture sample. Due to variations in the experimental conditions, the measurements from different experiments are generally on different numerical scales and need to be calibrated before further analysis.
Furthermore, the analysis of replicate experiments shows that the variance of the measured expression intensities depends on their expected value. Usually, microarray data are analyzed on a logarithmic scale, which typically leads to larger variability in the small intensity range. This heteroskedasticity makes the logarithms of ratios of expression values from different biological samples difficult to compare.
Our approach for dealing with these problems is based on a simple and well-motivated model of measurement error for the expression intensities [1] comprising an additive and a multiplicative error term. From this model, using a technique described in [2], we derive a parametric family of variance-stabilizing transformations of the measured intensities which are of the form y=arsinh(ax+b). In the limit of large x, where the multiplicative error is dominant, this coincides with the commonly used logarithmic transformation. However, for small values of x, the transformation diminishes the fluctuation of the intensities that is usually visible in log-transformed data.
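The relation between the arsinh transformation and the logarithm can be checked numerically. The following is a minimal sketch using the form y=arsinh(ax+b) as written above; the parameter values a=1 and b=0 are placeholders, not estimates from the talk. It relies on the identity arsinh(z) = log(z + sqrt(z^2 + 1)), which approaches log(2z) for large z.

```python
import math

def vst(x, a=1.0, b=0.0):
    """Variance-stabilizing transform y = arsinh(a*x + b)."""
    return math.asinh(a * x + b)

def log_limit(x, a=1.0, b=0.0):
    """Large-intensity approximation: arsinh(z) is close to log(2z) for large z."""
    return math.log(2.0 * (a * x + b))

# For large intensities the two curves agree up to a vanishing difference.
for x in (10.0, 1e2, 1e4):
    print(x, vst(x) - log_limit(x))

# Near zero (or below it, after background subtraction) arsinh stays finite,
# whereas the logarithm diverges or is undefined.
print(vst(0.0), vst(-5.0))
```

The constant offset log(2) between arsinh and the plain logarithm cancels when differences between samples are taken, which is why the two scales give the same log-ratios at high intensities.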
We incorporate the parameters of both the variance-stabilizing transformation and the calibration between experiments into a statistical model which allows for maximum-likelihood estimation of its parameters. However, for the calibration of data from different biological samples, we include a robust estimation technique that is based on least trimmed sum of squares regression. This uses the assumption that in the comparison of two related cell samples, the majority of genes will have roughly unchanged expression levels, while possibly different numbers of genes are up- or downregulated.
Finally, we evaluate the performance of our approach, comparing it with other methods of data transformation and calibration.

References:
  1. D.M. Rocke and B. Durbin (2001): A model for measurement error for gene expression analysis. Journal of Computational Biology 8(6), 557-569.
  2. R. Tibshirani (1988): Estimating transformations for regression via additivity and variance stabilization. Journal of the American Statistical Association, 83:394-405.
  3. W. Huber, A. v.Heydebreck, H. Sueltmann, A. Poustka and M. Vingron: Variance stabilization applied to microarray data calibration and to the quantification of differential expression. To appear in: Proceedings of ISMB 2002.
Chris Holmes Detection and modelling of interactions in gene expression using Bayesian spline models
Microarray experiments allow for the simultaneous measurement of thousands of gene expression levels within cell samples of interest. Often the object of the experiment is to investigate comparative differences in expression levels within two or more tissue (cell) categories for classification, and to discover genes that have concomitant input into regulatory networks or other biological pathways. It is natural to represent such problems as regression tasks, where the interest is to determine which genes are (jointly or individually) discriminative for the categories. The nature of the experiments usually means we have thousands of gene measurements and only a handful of samples: the so-called "large p small n" scenario. Standard likelihood-based approaches are not feasible, as each gene adds one degree of freedom to the regression model and the system is quickly underdetermined. Bayesian procedures do not suffer from this, as prior regularisation allows all of the genes to be included within a well-formulated model. In this talk we discuss conventional Bayesian approaches to the problem of tissue categorization by gene expression using logistic regression models with and without gene (variable) selection. We also highlight nonlinear approaches to the problem using tensor product splines that can detect and model interactions between gene expression levels. The methods are illustrated using time-course gene expression data from malaria-infected mosquitoes.
Volkmar Liebscher Stochastic modelling for gene expression data
Abstract will be supplied later
Patrick Lindsey Analysis of measurement errors in quantitative PCR data using multivariate distributions with correlation matrices
It is well known that repeated exposure to drugs such as morphine leads to long-lasting changes in general behaviour. Several studies have revealed that drug-induced neuronal adaptations occur in the nucleus accumbens. This brain area appears to be closely related to the expression of psychomotor effects due to drugs of abuse.

In order to assess drug-induced neuronal adaptations which might lead to general behavioural changes persisting long after drug exposure, a study involving male Wistar rats was carried out over a period of 35 days. Morphine and saline doses were repeatedly given during the first 14 days. This was followed by an 18-day washout period. Finally, a morphine or saline dose was given on days 33 and 34. The treatments assigned before and after the washout period were alternated such that all four possible treatment sequences were observed. This design is somewhat different from classical cross-over studies because rats in each treatment arm must be killed at each observation time point in order to obtain the brain tissues from the nucleus accumbens. From each tissue sample, RNA was isolated, cDNA was synthesised, and quantitative PCR measurements were obtained for 196 different genes.

Because several repeats were carried out for 73 of these genes at certain time points, the variability induced at some of these different laboratory steps can be measured using random effect models. Hence, several multivariate distributions with correlation matrices are investigated in order to allow for skewness and heavy tails in the data, namely the multivariate normal, multivariate Student t, multivariate power-exponential, and multivariate skew Laplace distributions.

The use of such models enables us to estimate more accurately several variance components. These can now be used when computing confidence intervals for data obtained from quantitative PCR experiments in order to assess more accurately the actual gene expression difference between treatment groups.
Andreas Martin Stochastic Models for Delta-Notch Regulation
Abstract will be supplied later
Martin Möhle A Markov Chain Monte Carlo Algorithm for the Ewens Sampling Distribution and Applications to Neutrality Tests
The Ewens sampling distribution q is one of the fundamental contributions in population genetics to describe the behavior of allele frequencies. Even for moderate sample sizes the state space S of all allele configurations is quite large. Hence it is usually time-consuming to calculate numerically, for a given function f on S, the integral E(f) = \int_S f dq. An efficient Metropolis-Hastings algorithm for the Ewens sampling probabilities of allele configurations is provided to approximate any functional E(f). The standard deviation is discussed and approximate confidence intervals are provided. The method is applied to test the hypothesis of selective neutrality using the homozygosity test and the Ewens-Watterson-Slatkin neutrality test.
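To see why exact evaluation of E(f) becomes expensive, it may help to compute it by brute force for a small sample. The sketch below is not the Metropolis-Hastings algorithm of the talk: it enumerates every allele configuration (integer partition of the sample size n) and weights it with the standard Ewens sampling formula. The function names and the choice of homozygosity as the example functional are assumptions made for illustration.

```python
import math

def partitions(n, max_part=None):
    """Yield the integer partitions of n as non-increasing tuples."""
    if max_part is None:
        max_part = n
    if n == 0:
        yield ()
        return
    for k in range(min(n, max_part), 0, -1):
        for rest in partitions(n - k, k):
            yield (k,) + rest

def ewens_probability(partition, theta):
    """Ewens sampling probability of an allele configuration.

    `partition` lists the allele counts, e.g. (3, 1, 1) for a sample of 5.
    """
    n = sum(partition)
    rising = math.prod(theta + i for i in range(n))  # theta (theta+1) ... (theta+n-1)
    prob = math.factorial(n) / rising
    multiplicity = {}  # a_j: number of alleles observed exactly j times
    for size in partition:
        multiplicity[size] = multiplicity.get(size, 0) + 1
    for size, a_j in multiplicity.items():
        prob *= (theta / size) ** a_j / math.factorial(a_j)
    return prob

def expected_value(f, n, theta):
    """Exact E(f): a sum over all allele configurations of a sample of size n."""
    return sum(f(p) * ewens_probability(p, theta) for p in partitions(n))

def homozygosity(partition):
    """Sample homozygosity: sum of squared allele frequencies."""
    n = sum(partition)
    return sum((c / n) ** 2 for c in partition)

print(expected_value(homozygosity, 10, 2.0))
```

The number of terms in the sum is the number of partitions of n, which is 42 for n = 10 but about 190 million for n = 100, so enumeration quickly stops being practical and a sampling approach becomes attractive.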
Michael Newton On modeling genomic aberrations in cancer cells
Whereas most cells in the body carry the normal complement of 46 chromosomes, the cells within a cancerous tumor very often present highly abnormal genomic structure. Deletions, amplifications, rearrangements and mutations are common at various scales and are highly variable amongst tumors, as indicated by molecular technologies which enable ever better measurement. It is an important statistical problem to separate those abnormalities which are sporadic from those which may not be sporadic and which may have some biological significance. I will discuss a modeling strategy for genomic aberration data which allows us to infer combinations of aberrations that together increase the chance that a precancerous cell will have a descendant tumor lineage. The likelihood component involves a network of pathway structures and MCMC is used to sample from the space of these oncogenic networks. I illustrate the methodology with comparative genomic hybridizations from several recent studies.
Oliver Pybus Using Statistical Genetics To Study Viral Epidemic Behaviour
Coalescent theory can build a bridge between the disciplines of population genetics and mathematical epidemiology, by using pathogen gene sequences to infer the population dynamic history of infectious diseases. We have developed new models to investigate the epidemic behaviour of viruses and to estimate the fundamental epidemiological quantity R(0) from gene sequence data. We have applied our methods to HIV and the Hepatitis C Virus, and our results show significant differences in demographic history among strains, shedding new light on viral genetic diversity and the mechanisms of viral transmission.
Gesine Reinert Words count statistics for biological sequences
This talk will consider word count statistics in the analysis of biological sequences. A sequence is modeled as a stationary ergodic Markov chain, with particular emphasis on the relatively easy case of independent identically distributed letters. In such models, probabilities can be assigned to word counts, and different sequences can be compared with respect to their word contents.

Special emphasis lies on disentangling the complicated dependence structure between word occurrences, due to self-overlap as well as due to overlap between words. The results are applied to statistical inference from DNA sequence data.

Much of this is joint work with Sophie Schbath and Mike Waterman.
Kerstin Roselius Molecular population genetics and speciation in Lycopersicon
Kerstin Roselius, Traudl Feldmaier-Fuchs, Thomas Städler, Wolfgang Stephan

Wild relatives of the cultivated tomato (genus Lycopersicon) encompass the entire range from highly selfing to obligately outcrossing taxa. Although the extant nine tomato species are morphologically distinct, molecular evidence is mounting that these species are indeed very young. This allows us to use a population genetic approach to analyze effects of the mating system, demography, and natural selection on patterns of nucleotide variability and, eventually, the origin and divergence of the tomato species. Of particular interest is the role of the mating system and its potential interaction with demographic factors, such as local extinction and recolonization. We intend to estimate DNA sequence variation within and between species at a genome-wide scale, i.e. at multiple loci across the genome. Our preliminary results (based on six nuclear loci) reveal highly significant differences in average levels of nucleotide variation (estimates of the parameters theta and pi), in that the partially selfing species show much lower levels of DNA polymorphism than the outcrossing species. Recombination rates for each locus and species can be estimated from the sequence data (composite-likelihood estimator), as well as from a published recombination map of the tomato genome. Thus far, effects of recombination on levels of DNA polymorphism appear to be relatively weak, arguing against the presence of strong positive or purifying (background) selection, as is ubiquitous in Drosophila populations. This suggests that population substructure and demographic factors are important determinants of genetic diversity in the tomato genus.
Nuala Sheehan A graphical models perspective on complexity in genetics
Probability and likelihood calculations form an essential component in the analysis of genetic data on pedigrees. Exact computational algorithms break down when there are too many interconnecting loops due to the enormous storage requirements involved [1,4]. Loops arise when the pedigree structure is complex, typically due to inbreeding or inter-marital relationships. However, they can also arise on relatively simple pedigrees when the genetic model is complicated e.g. when there are several linked loci to consider jointly. In these cases, estimates can be obtained either using Markov chain Monte Carlo (MCMC) [2,7] methods or by simplifying some aspect of the problem.
However, MCMC methods have not really been tested extensively on these large problems and tend to be viewed with some suspicion in practice, due to the unreliability of the resulting estimates. In particular, problems with slow mixing arise unless some form of block or joint updating is used.

Such computational problems become explicit in a graphical model representation, where the term graphical modelling [3] refers to methods which exploit local dependencies to express complex relationships for modelling and computation. Although graphical models feature explicitly in several specific applications in genetics, the general applicability of this approach to solving complex genetics problems has not been widely appreciated. The ideas will be illustrated with an application to the problem of detecting a quantitative trait locus (QTL) from possibly incomplete marker data on general pedigrees. The methods are entirely general and relevant to a wide range of genetics applications.

References:
  1. C. Cannings, E.A. Thompson and M. Skolnick (1978). Probability functions on complex pedigrees. Advances in Applied Probability 10:26-61.
  2. W.K. Hastings (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57:97-109.
  3. S.L. Lauritzen (1996). Graphical Models. Clarendon Press, Oxford, UK.
  4. S.L. Lauritzen and D.J. Spiegelhalter (1988). Local computations with probabilities on graphical structures and their applications to expert systems. Journal of the Royal Statistical Society, Series B 50:157-224.
  5. E.A. Thompson (2001). Monte Carlo methods on genetic structures. In O.E. Barndorff-Nielsen, D.R. Cox and C. Klüppelberg (eds.), Complex Stochastic Systems, 176-218.
Hizir Sofyan Fuzzy Clustering in Gene Expression Data
Clustering can be used to obtain first impressions from gene expression data. The method assigns genes to clusters based on their expression profiles. Since co-regulated genes are assumed to show similar expression patterns, clustering can help discover functionally related genes.

With conventional clustering methods, a gene is either assigned or not assigned to a defined group. Fuzzy clustering, which applies the concept of fuzzy sets to cluster analysis, instead assigns each point of the data set a degree of membership in each group through a membership function. Its advantage is that it can adapt to noisy gene data and to classes that are not well separated. We considered the Saccharomyces cerevisiae data set of Cho et al. (1998). Using a fuzzy clustering method that we implemented in XploRe, we discovered underlying structures and patterns in the gene expression data.
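The idea of graded membership can be made concrete with a small sketch. It assumes the widely used fuzzy c-means membership rule with fuzzifier m, which may differ from the method implemented in XploRe for this work; the cluster centres and the profile below are made-up values.

```python
import math

def memberships(point, centres, m=2.0):
    """Fuzzy c-means style membership of one point in each cluster.

    u_k = 1 / sum_j (d_k / d_j) ** (2 / (m - 1)), where d_k is the
    Euclidean distance from the point to centre k and m > 1 is the fuzzifier.
    """
    dists = [math.dist(point, c) for c in centres]
    if any(d == 0.0 for d in dists):
        # The point coincides with one or more centres: share membership among them.
        hits = [1.0 if d == 0.0 else 0.0 for d in dists]
        return [h / sum(hits) for h in hits]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_k / d_j) ** power for d_j in dists) for d_k in dists]

# Hypothetical two-dimensional expression profile and two cluster centres.
centres = [(0.0, 0.0), (3.0, 0.0)]
print(memberships((1.0, 0.0), centres))
```

A hard clustering would simply assign this profile to the first centre. The fuzzy memberships also record that it lies a third of the way towards the second, which is the kind of information that helps with poorly separated classes.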

References:
  • Cho, R.J., Campbell, M.J., Winzeler, E.A., Steinmetz, E.A., Conway, A., Wodicka, L., Wolfsberg, T.J., Gabrielian, A.E., Landsman, D., Lockhart, D.J. and Davis, R.W. (1998). A genome-wide transcriptional analysis of the mitotic cell cycle. Mol. Cell., 2, 65-73.
  • Härdle, W., Hlavka, Z., and Klinke, S., (2000). XploRe Application Guide, Springer Verlag, Heidelberg.
Terry Speed Summarizing and comparing GeneChip data
Data from the Affymetrix GeneChip system consist of many thousands of fluorescence intensities indicating the amount of binding of sample mRNA to sets of so-called perfect match and mismatch 25mer oligonucleotide sequences. Conventional wisdom has it that these should be summarized at some level for comparative analysis. When and how this is done can make quite a difference. Affymetrix recently changed their long-standing algorithms to new ones, while an alternative approach due to Cheng Li and Wing Wong has also been widely used, and a number of other groups have their own methods for summarizing these data. In this talk I will describe some of the issues involved, present some comparative analyses, and discuss what else I think is needed before this topic ceases to be a pressing one.
Korbinian Strimmer Quasi-Likelihood and Gene Expression Analysis
Error models for gene expression measurements are essential in most analyses of array data. Despite this importance, the true probabilistic model underlying the intensity readings is not fully known. Hence, in currently used statistical approaches the choice is between some simple parametric models (usually a transformed Normal) and reliance on empirical distributions. Neither strategy appears to be optimal for gene expression data, as the non-parametric approach ignores known structural information whereas the fully parametric models run the risk of misspecification. A further problem is the choice of a suitable scale.
Here I present a semi-parametric framework for gene expression analysis that occupies the middle ground between these two extremes. In this approach inference is based on an approximate likelihood function (extended quasi-likelihood). Only partial knowledge about the unknown true distribution is required to construct this function. In the case of gene expression this information is available in the form of the postulated variance structure of the data (on which luckily there is consensus). As the quasi-likelihood behaves (almost) like a proper likelihood, it allows for the estimation of calibration and variance parameters, and it is also straightforward to obtain approximate confidence intervals. Hence, it enables, e.g., the detection of significant differential expression and the assignment of reliability values to clusters. Unlike most other frameworks, it also allows analysis on any preferred scale, i.e. both on the original linear scale and on a transformed scale.
Qihua Tan A survival analysis approach for measuring gene longevity associations
New approaches are needed to explore the different ways in which genes affect the human life span. One needs to assess the genetic effects themselves, as well as gene-environment interactions and sex dependency. In this paper, we present a new model that combines both genotypic and demographic information in the estimation of the genetic influence on life spans. Based on Cox's proportional hazard assumption, the model measures the risks for each gene as well as for gene-environment and gene-sex interactions, while controlling for confounding factors. A two-step MLE is introduced to obtain a non-parametric form of the baseline hazard function. The model is applied to genotypic data from Italian centenarian studies to estimate relative risks of candidate genes, risks due to interactions and initial frequencies of different genes in the population. Results from models that either do or do not take into consideration individual heterogeneity are compared. It is shown that ignoring the existence of heterogeneity in individuals' unobserved frailty can lead to a systematic underestimation of genetic effects and effects due to interactions.
Claus Vogl Variation within and between populations in Drosophila ananassae
Claus Vogl, Aparup Das, Wolfgang Stephan

Before the advent of molecular population genetics, estimation of population subdivision from genetic data had to rely on relatively few loci with relatively low variability. Correspondingly, early estimates of population subdivision were population averages, such as F_st. Nowadays, polymorphic markers are available in almost unlimited quantity. Yet only a few methods are available that exploit the potential of molecular markers for higher resolution estimates of population subdivision. We use an infinite island model from which a set of populations is sampled. Each population is assumed to have its own effective rate of migration gamma=4N_em. With a combination approach using coalescence and Markov chain Monte Carlo integration, we estimate the posterior distribution of effective rates of migration. As an application, we analyze a data set of the tropical fly species Drosophila ananassae.
Mark van de Wiel Significance Analysis of Microarrays using Rank Scores
Tusher et al. (2001) introduced Significance Analysis of Microarrays (SAM) as a statistical technique for finding significant genes in microarrays. This technique aims to control the False Discovery Rate (FDR), which is the proportion of falsely rejected null hypotheses among all rejected null hypotheses. Within the microarray framework a null hypothesis usually corresponds to a statement like 'the gene is not (differentially) expressed'. Usage of SAM is enhanced by the freely available SAM Excel add-in. SAM has the potential of becoming a standard technique and is already used in some medical studies. There is a close connection between SAM and the FDR-based approach introduced by Benjamini and Hochberg (1995). In fact, SAM is a version of this approach that controls the critical levels for the multiple testing procedure in a specific way. SAM has, according to Efron et al. (2000), one major disadvantage: estimation of the number of significant genes is biased, especially when this number is relatively large. This was our main motivation for developing Significance Analysis of Microarrays using Rank Scores (SAM-RS). In SAM, the rejection region is based on the deviation of order statistics from their expected values under the null hypotheses. These order statistics are calculated from test statistics that are computed directly from the gene expression data (e.g. t-type statistics). Since the probability distribution of an order statistic depends on the joint distribution of all gene expressions, which is found by a permutation algorithm, estimation of the number of falsely rejected genes depends on the expression levels of all genes. Therefore, 'unusual' patterns of the expression levels (very high, very low, skewed) may bias estimation. In other words, the estimation of the number of falsely rejected genes depends strongly on the probability distributions of the expressed genes.
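For readers unfamiliar with the FDR-based approach referred to above, the step-up procedure of Benjamini and Hochberg (1995) can be sketched in a few lines. This is the textbook procedure applied to a list of p-values, not SAM or SAM-RS, and the p-values in the example are invented.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the null hypothesis is rejected.

    Step-up rule: find the largest rank k with p_(k) <= k * q / m and
    reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            cutoff_rank = rank
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values for four genes.
print(benjamini_hochberg([0.02, 0.03, 0.035, 0.9], q=0.05))
```

In this example the two smallest p-values exceed their own thresholds (0.0125 and 0.025), yet they are still rejected because the third one passes its threshold of 0.0375. This is what distinguishes a step-up rule from testing each hypothesis against its own cutoff.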
We show that use of rank scores implies an unbiased estimate of the expected number of falsely called genes. Hence, SAM-RS allows for better control of the FDR than SAM does. The choice for the popular Wilcoxon rank scores may very well be too discrete due to the small number of attainable levels. We recommend using normal rank scores, which are inverse standard normal transformations of the ranks, to solve this problem. It is well known in nonparametric testing theory that normal rank score statistics are asymptotically as efficient under normality as t-statistics. More importantly, it has been demonstrated that even for small sample sizes (sample size corresponds to the number of measurements for each gene in this setting) efficiency of a normal scores test is high. Using an example data set provided by the SAM software, we observe that also in this dependent multiple testing setting the normal rank score statistics do the job well for a case with 8 experimental units and 1000 genes. A very practical point of SAM-RS is that it easily fits into the SAM software.
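Normal rank scores are straightforward to compute. The sketch below assumes the common van der Waerden form, in which a rank r among n observations is mapped to the standard normal quantile at r/(n+1); the abstract does not state which variant is used in SAM-RS, and ties are not handled here.

```python
from statistics import NormalDist

def normal_rank_scores(values):
    """Replace each value by the standard normal quantile of rank / (n + 1).

    Assumes there are no ties among the values.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    standard_normal = NormalDist()
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        scores[idx] = standard_normal.inv_cdf(rank / (n + 1))
    return scores

# Hypothetical expression measurements of one gene in three experimental units.
print(normal_rank_scores([3.2, 1.1, 2.5]))
```

Because the scores depend only on the ranks, any monotone distortion of the expression levels (skewness, a few extreme values) leaves them unchanged, while they take more distinct values than sums of raw ranks.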
Finally, in keeping with the aim of controlling the FDR correctly, we propose two procedures that incorporate more selective criteria within the statistical tests, as opposed to the often used 'k-fold rule', which is applied outside the multiple test as an extra criterion for genes to be called. We believe the latter procedure is somewhat odd: one attempts to control the FDR to deal with multiple testing correctly, but then one loses control of this FDR by adding an extra selection criterion outside the statistical test. Including this more selective criterion in the testing framework gives the scientist the opportunity to make a stronger statement about the genes that are called. The new procedures are illustrated with example data sets.

References:
  • Tusher, V.G., R. Tibshirani, and G. Chu (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl. Acad. Sci. 98, 5116-5121.
  • Benjamini, Y., and Y. Hochberg (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. Roy. Statist. Soc., B 57, 289-300.
  • Efron, Tibshirani, Goss, and Chu (2000). Microarrays and their use in a comparative experiment. Technical Report 2000-37B/213, Stanford University. http://www-stat.stanford.edu/~tibs/research.html.
Andreas Wienke Genetic analysis of cause of death: advantages and limitations of frailty models
Twin data on mortality by cause provide a powerful tool for studying the role of genes and other factors in individual susceptibility to complex diseases and mortality. More details about the nature of the ageing process can be obtained when additional information about twins' survival and covariates is available. For example, the genetic nature of observed covariates can be analysed. Further, one can investigate the dependence between unobserved susceptibility and observed covariates. In this paper we discuss the results of the analysis of cause-specific mortality data for Danish twins in the presence of information about individual body mass index (BMI) and smoking behaviour with special focus on coronary heart disease and respiratory diseases. The correlated frailty model was used for the genetic analysis of aggregated and cause-specific mortality data. The contribution of observed covariates in the dependence between causes of death is evaluated. The proportions of variance attributable to genetic and environmental factors are assessed using the structural equation model approach, which allows for the calculation of heritability. The cases with and without observed covariates are compared and different frailty models are considered and discussed (Yashin and Iachine 1995; Wienke et al. 2000; 2001).

References:
  • Yashin, A.I. and Iachine, I.A. (1995). Genetic analysis of durations: Correlated frailty model applied to survival of Danish Twins. Genetic Epidemiology 12: 529-538.
  • Wienke, A., Christensen, K., Holm, N.V., Yashin, A.I. (2000). Heritability of death from respiratory diseases: an analysis of Danish twin survival data using a correlated frailty model. In: Medical Infobahn for Europe. A. Hasman et al. (Eds.), IOS Press, Amsterdam, 407-411.
  • Wienke, A., Holm, N., Skytthe, A., Yashin, A. (2001). The heritability of mortality due to heart diseases: a correlated frailty model applied to Danish twins. Twin Research 4, 266-274.
Carsten Wiuf Inferring Recombination from Genotypes
Hudson and Kaplan (1985) introduced the four-gamete test to test for the presence of recombination in a sample of DNA sequences. Further, they came up with a lower bound R_M to the actual number of recombination events experienced in the sample's history. R_M+1 can be interpreted as the minimum number of topologies required to explain the sequences. They assumed an infinite-site model as the data generating process.
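The four-gamete test and the bound R_M are simple to state in code. The sketch below is an illustrative reading of the Hudson and Kaplan construction for phased haplotypes, assuming biallelic sites coded 0/1: a pair of sites is incompatible if all four gametes occur, and R_M is taken as the largest number of incompatible site pairs whose intervals do not overlap, found greedily. The function names and example data are made up.

```python
def incompatible(haplotypes, i, j):
    """Four-gamete test: True if gametes 00, 01, 10 and 11 all occur at sites i, j."""
    return len({(h[i], h[j]) for h in haplotypes}) == 4

def hudson_kaplan_rm(haplotypes):
    """Lower bound R_M on the number of recombination events.

    Each incompatible pair (i, j) requires at least one recombination
    between sites i and j. Picking intervals greedily by right endpoint
    yields the largest set of non-overlapping intervals.
    """
    n_sites = len(haplotypes[0])
    intervals = [(i, j)
                 for i in range(n_sites)
                 for j in range(i + 1, n_sites)
                 if incompatible(haplotypes, i, j)]
    intervals.sort(key=lambda ij: ij[1])
    count, last_end = 0, 0
    for i, j in intervals:
        if i >= last_end:  # interval starts at or after the last chosen end
            count += 1
            last_end = j
    return count

# Hypothetical sample of four haplotypes at three SNPs.
sample = ["000", "011", "101", "110"]
print(hudson_kaplan_rm(sample))
```

With unphased genotypes the gametes carried by double heterozygotes are not observed directly, so this pairwise check cannot be applied as written. That is the situation the genotype-based bound in the talk addresses.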

Today, many data sets spanning large chromosomal regions are generated. Because of limited resources, these data sets often consist of unphased genotypes rather than haplotypes. That is, if an individual is heterozygous for a SNP with alleles 0 and 1, it is not known which of the individual's two chromosomes harbors the 0 allele and which the 1 allele.

In this talk, I discuss a test similar to the four-gamete test and introduce a lower bound R_M^g to the number of recombination events experienced in the sample's history. R_M^g+1 can be interpreted as a lower bound to the number of topologies required to explain the sample. It is shown, using simulation, that in general [E(R_M^g)+1]/[E(R_M)+1]>0.90. In addition, various results, theoretical and simulated, about incompatibilities in a sample of genotypes are presented. If time allows, an analysis of a real data set is shown, using the above measures.


---------------------------------------------------