Abstracts 1996
In this paper it is shown that the use of non-singular block invariant matrices of covariates leads to `generalized estimating equations' estimators (GEE estimators; Liang, K.-Y. & Zeger, S. (1986). Biometrika, 73(1), 13-22) which are identical regardless of the `working' correlation matrix used. Moreover, they are efficient (McCullagh, P. (1983). The Annals of Statistics, 11(1), 59-67). If, on the other hand, only time invariant covariates are used, the efficiency gain in choosing the `correct' vs. an `incorrect' correlation structure is shown to be negligible. The results of a simple simulation study suggest that although different GEE estimators are no longer identical and no longer as efficient as an ML estimator, the differences are still negligible if both time and block invariant covariates are present.
Discrete-time grouped duration data, with one or multiple types of terminating events, are often observed in social sciences or economics. In this paper we suggest and discuss dynamic models for flexible Bayesian nonparametric analysis of such data. These models allow simultaneous incorporation and estimation of baseline hazards and time-varying covariate effects, without imposing particular parametric forms. Methods for exploring the possibility of time-varying effects, as for example the impact of nationality or unemployment insurance benefits on the probability of re-employment, have recently gained increasing interest. Our modelling and estimation approach is fully Bayesian and makes use of Markov Chain Monte Carlo (MCMC) simulation techniques. A detailed analysis of unemployment duration data, with full-time job, part-time job and other causes as terminating events, illustrates our methods and shows how they can be used to obtain refined results and interpretations.
This paper investigates the sensitivity of maximum quasi likelihood estimators of the covariate effects in duration models in the presence of misspecification due to neglected heterogeneity or misspecification of the hazard function. We consider linear models for r(T) where T is duration and r is a known, strictly increasing function. This class of models is also referred to as location-scale models. In the absence of censoring, Gould and Lawless (1988) have shown that maximum likelihood estimators of the regression parameters are consistent and asymptotically normally distributed under the assumption that the location-scale structure of the model is of the correct form. In the presence of censoring, however, model misspecification leads to inconsistent estimates of the regression coefficients for most of the censoring mechanisms that are widely used in practice. We propose a semiparametric EM-estimator, following ideas of Ritov (1990), and Buckley and James (1979). This estimator is robust against misspecification and is highly recommended if there is heavy censoring and if there may be specification errors. We present the results of simulation experiments illustrating the performance of the proposed estimator.
We investigate the possible bias due to an erroneous missing at random assumption if adjusted odds ratios are estimated from incomplete covariate data using the maximum likelihood principle. A relation between complete case estimates and maximum likelihood estimates allows us to identify situations where the bias vanishes. Numerical computations demonstrate that the bias is most serious if the degree of the violation of the missing at random assumption depends on the value of the outcome variable or of the observed covariate. Implications for the analysis of prospective and retrospective studies are given.
A large number of different Pseudo-R2 measures for some common limited dependent variable models are surveyed. Measures include those based solely on the maximized likelihoods with and without the restriction that slope coefficients are zero, those which require further calculations based on parameter estimates of the coefficients and variances and those that are based solely on whether the qualitative predictions of the model are correct or not. The theme of the survey is that while there is no obvious criterion for choosing which Pseudo-R2 to use, if the estimation is in the context of an underlying latent dependent variable model, a case can be made for basing the choice on the strength of the numerical relationship to the OLS-R2 in the latent dependent variable. As such an OLS-R2 can be known in a Monte Carlo simulation, we summarize Monte Carlo results for some important latent dependent variable models (binary probit, ordinal probit and Tobit) and find that a Pseudo-R2 measure due to McKelvey and Zavoina scores consistently well under our criterion. We also very briefly discuss Pseudo-R2 measures for count data, for duration models and for prediction-realization tables.
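Two of the measure types surveyed above can be sketched in a few lines. The function names are illustrative assumptions, not taken from the paper; the formulas are the standard textbook forms of McFadden's likelihood-based measure and of the McKelvey and Zavoina measure, which estimates the share of the latent variable's variance explained by the fitted linear predictor.

```python
import math
from statistics import pvariance


def mcfadden_r2(loglik_full, loglik_null):
    """Likelihood-based measure: 1 - lnL(full model) / lnL(intercept-only model)."""
    return 1.0 - loglik_full / loglik_null


def mckelvey_zavoina_r2(linear_predictor, error_variance=1.0):
    """Explained share of the latent variable's variance.

    linear_predictor: fitted values x'b on the latent scale.
    error_variance:   1.0 for (binary or ordinal) probit,
                      math.pi ** 2 / 3 for logit.
    """
    explained = pvariance(linear_predictor)
    return explained / (explained + error_variance)
```

Only the second function works on the latent scale, which is why it can be compared directly with an OLS-R2 of the latent regression in a simulation.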
A full likelihood approach for marginal regression modeling of correlated multicategorical data is proposed. It is in fact an extension of the approach of Fitzmaurice and Laird (1993) for repeated binary response. The association is directly modeled in terms of conditional odds ratio parameters resulting in the fact that the maximum likelihood estimates of mean and association parameters are asymptotically independent. The technical details are worked out and the approach is illustrated with data previously analyzed by Miller, Davis and Landis (1993).
The generalized method of moments (GMM) estimation technique is discussed for count data models with endogenous regressors. Count data models can be specified with additive or multiplicative errors. It is shown that, in general, a set of instruments is not orthogonal to both error types. Simultaneous equations with a dependent count variable often do not have a reduced form which is a simple function of the instruments. However, a simultaneous model with a count and a binary variable can only be logically consistent when the system is recursive. The GMM estimator is used to estimate a model explaining the number of visits to doctors, with a self-reported binary health index as a possibly endogenous regressor. Further, a model is estimated, in stages, that includes latent health instead of the binary health index.
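The distinction between the two error types can be written out as moment conditions. The notation below (count y_i, regressors x_i, instruments z_i, exponential mean function) is an assumption made for illustration and is not taken from the paper.

```latex
\begin{align}
  \text{additive error:} \quad
    & y_i = \exp(x_i'\beta) + u_i, &
    & E\bigl[ z_i \,\bigl( y_i - \exp(x_i'\beta) \bigr) \bigr] = 0, \\
  \text{multiplicative error:} \quad
    & y_i = \exp(x_i'\beta)\, v_i, &
    & E\bigl[ z_i \,\bigl( y_i \exp(-x_i'\beta) - 1 \bigr) \bigr] = 0 .
\end{align}
```

Since u_i = \exp(x_i'\beta)(v_i - 1) and x_i is not, in general, a function of z_i alone, orthogonality of z_i to one residual does not imply orthogonality to the other.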
Due to progress in statistical methods and improved data processing capabilities, count data modelling has become increasingly popular in the social sciences. In empirical international relations and international conflict research, however, the use of event count models has been largely restricted to the application of the simple Poisson approach so far. This article outlines the methodological weaknesses of the model and presents some improvements which are applied to the problem of international interventionism. The cross-sectional data set used covers the behaviour of states during the period from 1970 to 1989, and thus avoids some theoretical problems of the standard long-term dyadic approach. The main result of the analysis is the empirical irrelevance of idealist conceptions claiming pacifying effects of democratization or fostering of economic prosperity.
Generalized linear models (GLMs) allow for a wide range of statistical models for regression data. In particular, the logistic model is usually applied for binomial observations. Canonical links for GLMs, such as the logit link in the binomial case, are often used because in this case sufficient statistics for the regression parameter exist which allow for simple interpretation of the results. However, in some applications, the overall fit as measured by the p-values of goodness of fit statistics (such as the residual deviance) can be improved significantly by the use of a noncanonical link. In this case, the interpretation of the influence of the covariables is more complicated compared to GLMs with canonical link functions. It will be illustrated through simulation that the p-value associated with the common goodness of link tests is not appropriate to quantify the changes to mean response estimates and other quantities of interest when switching to a noncanonical link. In particular, the rate of misspecifications becomes considerable when the inverse information value associated with the underlying parametric link model increases. This shows that the classical tests are often too sensitive, in particular when the number of observations is large. The consideration of a generalized p-value function is proposed instead, which allows the exact quantification of a suitable distance to the canonical model at a controlled error rate. Corresponding tests for validating or discriminating the canonical model can easily be performed by means of this function.
The development of adequate models for binary time series data with covariate adjustment has been an active research area in recent years. In the case where interest is focused on marginal and association parameters, generalized estimating equations (GEE) (see for example Lipsitz, Laird and Harrington (1991) and Liang, Zeger and Qaqish (1992)) and likelihood (see for example Fitzmaurice and Laird (1993) and Molenberghs and Lesaffre (1994)) based methods have been proposed. The number of parameters required for the full specification of these models grows exponentially with the length of the binary time series. Therefore, the analysis is often focused on marginal and first order parameters. In this case, the multivariate probit model (Ashford and Sowden (1970)) becomes an attractive alternative to the above models. The application of the multivariate probit model has been hampered by the intractability of the maximum likelihood estimator when the length of the binary time series is large. This paper shows that this difficulty can be overcome by the use of Markov Chain Monte Carlo methods. This analysis also allows for valid point and interval estimates of the parameters in small samples. In addition, the analysis is adapted to handle the case of missing at random responses. The approach is illustrated on data involving binary responses measured at unequally spaced time points. Finally, this data analysis is compared to a GEE analysis given in Fitzmaurice and Lipsitz (1995).
The Generalized Estimating Equations (GEE) proposed by Liang and Zeger (1986) have received considerable attention in recent years and several extensions have been proposed. This paper gives a more intuitive description of how GEE have been developed during recent years. Additionally, we describe the advantages and disadvantages of the different parametrisations that have been proposed in the literature. We also give a brief review of the literature available on this topic. [ Published in: Biometrical Journal 40 (2), 115-139 ]
Data from clinical studies often contain time-dependent covariates, e.g. events like transplantation or an adverse drug reaction, or the changing measurements of laboratory data. The common approach uses only the covariate information at time t=0 for regression analyses, but this baseline analysis is not very satisfying. This paper applies the linear counting process by Aalen for failure time analysis, modified to deal with time-dependent covariates. In the main part we describe methods to estimate and visualize the cumulative regression function with respect to time-dependent covariates. After introducing a test for significance of the influence of covariates, we present different methods to investigate model validity based on martingale residuals or on the Arjas plot. Coding and interpretation problems are briefly discussed. Results are illustrated with data from the Stanford Heart Transplantation Study and a study on Oropharynx carcinoma.
We consider the problem of estimating the unknown breakpoints in segmented generalized linear models. Exact algorithms for calculating maximum likelihood estimators are derived for different types of models. After discussing the case of a GLM with a single covariate having one breakpoint a new algorithm is presented when further covariates are included in the model. The essential idea of this approach is then used for the case of more than one breakpoint. As further extension an algorithm for the situation of two regressors each having a breakpoint is proposed. These techniques are applied for analysing the data of the Munich rental table. It can be seen that these algorithms are easy to handle without too much computational effort. The algorithms are available as GAUSS-programs.
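The single-breakpoint case can be illustrated with a simple profile search. This is a sketch under loudly stated assumptions and not the exact algorithm of the paper: it uses the identity-link Gaussian special case (least squares), a continuous broken-line mean b0 + b1*x + b2*max(x - c, 0), and a user-supplied grid of candidate breakpoints c. The names `solve3` and `fit_segmented` are hypothetical.

```python
def solve3(a, b):
    """Solve a 3x3 linear system by Gaussian elimination with partial pivoting."""
    n = 3
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = s / m[i][i]
    return x


def fit_segmented(x, y, candidates):
    """Return (sse, breakpoint, [b0, b1, b2]) minimising the residual sum of squares.

    Candidates must lie strictly inside the range of x, with data on both
    sides, so that the design matrix has full rank.
    """
    best = None
    for c in candidates:
        design = [[1.0, xi, max(xi - c, 0.0)] for xi in x]
        xtx = [[sum(r[i] * r[j] for r in design) for j in range(3)] for i in range(3)]
        xty = [sum(r[i] * yi for r, yi in zip(design, y)) for i in range(3)]
        beta = solve3(xtx, xty)
        sse = sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
                  for r, yi in zip(design, y))
        if best is None or sse < best[0]:
            best = (sse, c, beta)
    return best
```

For a fixed breakpoint the model is an ordinary linear model, so the search only has to profile over c. A non-Gaussian GLM would replace the least-squares step with a maximum likelihood fit and compare deviances instead of sums of squares.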
If a linear regression is fit to log-transformed mortalities and the estimate is back-transformed according to the formula E e^Y = e^{\mu + \sigma^2/2}, a systematic bias occurs unless the error distribution is normal and the scale estimate is gauged to normal variance. This result is a consequence of the uniqueness theorem for the Laplace transform. We determine the systematic bias of minimum-L_2 and minimum-L_1 estimation with sample variance and interquartile range of the residuals as scale estimates under a uniform and four contaminated normal error distributions. Already under innocent looking contaminations the true mortalities may be underestimated by 50% in the long run.
Moreover, the logarithmic transformation introduces an instability into the model that results in a large discrepancy between rg_Huber estimates as the tuning constant regulating the degree of robustness varies.
In contrast to the logarithm, the square root stabilizes variance, diminishes the influence of outliers, automatically copes with observed zeros, allows the `nonparametric' back-transformation formula E Y^2 = \mu^2 + \sigma^2, and in the homoskedastic case avoids a systematic bias of minimum-L_2 estimation with sample variance.
For the company-specific table 3 of [Loeb94], in the age range of 20-65 years, we fit a parabola to root mortalities by minimum-L_2, minimum-L_1, and robust rg_Huber regression estimates, and a cubic and exponential by least squares. The fits thus obtained in the original model are excellent and practically indistinguishable by a \chi^2 goodness-of-fit test.
Finally, dispensing with the transformation of observations, we employ a Poisson generalized linear model and fit an exponential and a cubic by maximum likelihood.
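The bias of the lognormal back-transformation can be illustrated with a closed-form calculation. This is a sketch only, assuming one particular contaminated normal error model, (1 - eps) N(0, sigma^2) + eps N(0, (k sigma)^2), with the true error variance plugged into the formula; the contamination settings studied in the paper are not reproduced here, and the function names are hypothetical.

```python
import math


def true_mean_factor(eps, k, sigma=1.0):
    """E[e^U] for U ~ (1 - eps) N(0, sigma^2) + eps N(0, (k * sigma)^2)."""
    return ((1 - eps) * math.exp(sigma ** 2 / 2)
            + eps * math.exp((k * sigma) ** 2 / 2))


def normal_formula_factor(eps, k, sigma=1.0):
    """e^{Var(U) / 2}: the lognormal formula with the true variance of U plugged in."""
    variance = (1 - eps) * sigma ** 2 + eps * (k * sigma) ** 2
    return math.exp(variance / 2)


def relative_bias(eps, k, sigma=1.0):
    """Relative error of the lognormal formula; negative means underestimation."""
    return normal_formula_factor(eps, k, sigma) / true_mean_factor(eps, k, sigma) - 1
```

Because the exponential function is convex, Jensen's inequality makes this bias non-positive for any such scale mixture, and zero only without contamination (eps = 0 or k = 1).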
We describe the identification of prognostic factors in the framework of a completely resected stomach cancer survival-study. For the analysis the dynamic grouped Cox-Model was used allowing for time-varying covariate effects. Therefore the hazard rate might be non-proportional. As estimation concept we applied the posterior mode, computed by iteratively weighted Kalman filtering and smoothing steps. The medical study and questions are described, the statistical method is illustrated, the results are given and interpreted and the method is discussed.
Spline smoothing in non- or semiparametric regression models is usually based on the roughness penalty approach. For regression with normal errors, the spline smoother also has a Bayesian justification: Placing a smoothness prior over the regression function, it is the mean of the posterior given the data. For non-normal regression this equivalence is lost, but the spline smoother can still be viewed as the posterior mode. In this paper, we provide a full Bayesian approach to spline-type smoothing. The focus is on generalized additive models, however the models can be extended to other non-normal regression models. Our approach uses Markov Chain Monte Carlo methods to simulate samples from the posterior. Thus it is possible to estimate characteristics like the mean, median, moments, and quantiles of the posterior, or interesting functionals of the regression function. Also, this provides an alternative for the choice of smoothing parameters. For comparison, our approach is applied to real-data examples analyzed previously by the roughness penalty approach.
This dissertation presents discrete-time models for event history analysis with dynamic effects, classified into single-episode single-state, single-episode multi-state, and multi-episode data. Models with non-proportional hazards are also admitted. The models are given in state space form. Right censoring can likewise be incorporated into the modelling. Within this general framework, the posterior mode estimation concept is used to determine time-varying effects. For numerically efficient estimation, the well-known Kalman filter and smoother algorithm is modified into a linearly weighted Kalman filter and smoother, which is in turn iterated. In addition, a new estimation procedure is developed and discussed that handles the case of a diffuse or noninformative initial prior distribution in a numerically stable and efficient way. Altogether, hazard rates and time-varying covariate effects can thus be estimated simultaneously with little time and computational effort compared with other methods (e.g. MCMC).
We deal with time series which are measured on an arbitrary scale, e.g. on a categorical or ordinal scale, and which are recorded together with time varying covariates. The conditional expectations are modelled by a regression model whose parameters are estimated via a likelihood or quasi-likelihood approach. Our main concerns are diagnostic methods and forecasting procedures for such time series models. Diagnostics are based on (partial) residual measures as well as on (partial) residual variables; l-step predictors are obtained by an approximation formula for conditional expectations. The various methods proposed are illustrated by two different data sets.
Dynamic models extend state space models to non-normal observations. This paper suggests a specific hybrid Metropolis-Hastings algorithm as a simple, yet flexible and efficient tool for Bayesian inference via Markov chain Monte Carlo in dynamic models. Hastings proposals from the (conditional) prior distribution of the unknown, time-varying parameters are used to update the corresponding full conditional distributions. Several blocking strategies are discussed to ensure good mixing and convergence properties of the simulated Markov chain. It is also shown that the proposed method is easily extended to robust transition models using mixtures of normals. The applicability is illustrated with an analysis of a binomial and a binary time series, known in the literature.
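The appeal of proposing from the conditional prior can be seen from the acceptance probability. The notation below is an assumption made for illustration: \theta_t is the parameter (block) being updated, \theta_{-t} the remaining parameters, y_t the observations that depend on \theta_t, and \theta_t^* a proposal drawn from p(\theta_t \mid \theta_{-t}).

```latex
\begin{align}
  p(\theta_t \mid \theta_{-t}, y)
    &\propto p(\theta_t \mid \theta_{-t}) \, p(y_t \mid \theta_t), \\
  \alpha(\theta_t, \theta_t^*)
    &= \min\left\{ 1,\;
       \frac{p(\theta_t^* \mid \theta_{-t}) \, p(y_t \mid \theta_t^*) \; p(\theta_t \mid \theta_{-t})}
            {p(\theta_t \mid \theta_{-t}) \, p(y_t \mid \theta_t) \; p(\theta_t^* \mid \theta_{-t})}
       \right\}
     = \min\left\{ 1,\; \frac{p(y_t \mid \theta_t^*)}{p(y_t \mid \theta_t)} \right\}.
\end{align}
```

The prior terms cancel, so each update only requires a draw from the conditional prior and an evaluation of the likelihood ratio.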
The specific model presented here, together with its associated estimation procedure, offers a new alternative approach to the regression analysis of correlated binary response variables. The estimation procedure for the model, which is adapted to the correlation, is derived by comparing the model with loglinear models and by a reparametrization based on odds ratios. Various special cases are distinguished according to the type and number of explanatory variables. Compared with other methods, the new method has the advantage that, besides simpler computation of the estimates, the adequacy of the model can be checked at the same time. The theoretical presentation is followed by a detailed description of the procedure using two data sets.
This paper considers the estimation of coefficients in a linear regression model with missing observations in the independent variables and introduces a modification of the standard first order regression method for imputation of missing values. The modification provides stochastic values for imputation. Asymptotic properties of the estimators for the regression coefficients arising from the proposed modification are derived when either both the number of complete observations and the number of missing values grow large or only the number of complete observations grows large and the number of missing observations stays fixed. Using these results, the proposed procedure is compared with two popular procedures - one which utilizes only the complete observations and the other which employs the standard first order regression imputation method for missing values. It is suggested that an elaborate simulation experiment will be helpful to evaluate the gain in efficiency especially in case of discrete regressor variables and to examine some other interesting issues like the impact of varying degree of multicollinearity in explanatory variables. Applications to some concrete data sets may also shed some light on these aspects. Some work on these lines is in progress and will be reported in a future article to follow.
Using logistic and additive models as well as isotonic regression, proposed limit values for the maximum workplace concentration (MAK) of fine dust and total dust are discussed. The statistical methods are described in detail in Ulm et al. (1996). The threshold proposals are based on cohorts from the iron and steel industry (Moers and Saarbrücken) and from mechanical engineering (Munich) in the DFG study "Chronische Bronchitis".
The economic incentives of work absence are empirically studied using a panel of Swedish blue collar workers, both men and women, who are either married or living with a partner as if married. A model for the daily absence decision is derived from standard economic utility theory. An estimable form for the annual number of absence days is obtained by considering the data generating process in some detail. The model is estimated, using the first two moments, with a generalized method of moments estimator. The panel structure of the data is explicitly considered and a positive dependence between the number of days absent in the two time periods is found for females. A one per cent increase in the cost will lead to a decrease in the mean number of days absent by 1.8 days for females and by 2.7 days for males.
First we show briefly the effects of using the ordinary estimator for the logarithm of the odds ratio in a case-control study with binary risk factor when we have misclassification in the risk factor. Then external validation and repeated measurements, which are two broad strategies to correct for misclassification, are introduced. For both of these models the ML-estimates and their asymptotic variances are derived. Under the assumption that both models have the same costs, the asymptotic variances are compared for two cases. We choose first equal subsample sizes and then optimal subsample sizes. Simulation studies have been carried out in order to get an impression of the probability that the estimates are well defined and of how large the sample sizes have to be so that the asymptotic variances are good approximations.
Data from the Stanford Heart Transplantation Study and our own study on brain tumor include time-dependent covariates like transplantation, which may switch only once, and others changing their value several times during follow-up. But classical analyses never used this additional information. In a comparative study we applied the time-dependent Cox model, pooled Cox regression and the linear counting process by Aalen to these data sets. All methods do show similar results when they are carried out in their 'fixed' version, i.e. using baseline information only, or when covariates are being treated as time-dependent. But the estimated effects do differ remarkably between fixed and time-dependent approaches, thus leading to different interpretations of risks.
This paper describes a software tool for marginal regression methods. MAREG currently handles binary, categorical and continuous data with several link functions. Although intended for the analysis of correlated data, it can also be used to analyse uncorrelated data. It supplies two different approaches for these problems: Maximum Likelihood and GEE methods. Handling of missing data is also provided. [ Published in: Computational Statistics and Data Analysis, 24, 235-241 ]
In the present paper a mixed approach is proposed for the simultaneous estimation of regression and correlation structure parameters in multivariate probit models using generalized estimating equations for the former and pseudo-score equations for the latter. The finite sample properties of the corresponding estimators are compared to estimators proposed by Qu, Williams, Beck and Medendorp (1992) and Qu, Piedmonte and Williams (1994) using generalized estimating equations for both sets of parameters via a Monte Carlo experiment. As a `reference' estimator for an equicorrelation model, the maximum likelihood (ML) estimator of the random effects probit model is calculated. The results show the mixed approach to be the most robust approach in the sense that the number of datasets for which the corresponding estimates converged was largest relative to the other two approaches. Furthermore, the mixed approach led to the most efficient non-ML estimators and to very efficient estimators for regression and correlation structure parameters relative to the ML estimator if individual covariance matrices were used.
The present paper deals with the estimation of a frailty model of multivariate failure times. The failure times are modeled by an Accelerated Failure Time Model including observed covariates and an unobservable frailty component. The frailty is assumed random and differs across elementary units, but is constant across the spells of a unit or a group. We develop an estimator (of the regression parameters) that combines the GEE approach (Liang and Zeger, 1986) with the Buckley-James estimator for censored data. This estimator is robust against violations of the correlation structure and the distributional assumptions. Some simulation studies are conducted in order to study the empirical performance of the estimator. Finally, the methods are applied to data on repeated occurrences of malignant ventricular arrhythmias in patients with implanted defibrillators.
We consider the case where a latent variable X cannot be observed directly and instead a variable W=X+U with a heteroscedastic measurement error U is observed. It is assumed that the distribution of the true variable X is a mixture of normals, and a variant of the EM algorithm is applied to find approximate ML estimates of the distribution parameters of X.
In 1970 Tenenbein presented a double sampling scheme to estimate the proportion parameter of binomial data. In the context of measurement error models this strategy is known as the internal validation method. A second broad strategy is the repeated measurements method. We show how to apply this method for the estimation of a proportion parameter and try to answer the question of which method should be preferred.
A number of authors in the quality control literature have advocated the use of combined-arrays in screening experiments to identify robust product or process designs [Shoemaker, Tsui, and Wu (1991); Nair et al. (1992); Myers, Khuri, and Vining (1992), for example]. This paper considers a product manufacturing or process design setting in which there are several factors under the control of the manufacturer, called control settings, and other environmental (noise) factors that vary under field or manufacturing conditions. We show how Gupta's subset selection philosophy [Gupta (1956, 1965)] can be used in such a quality improvement setting to identify combinations of the levels of the control factors that correspond either to products that are robust to environmental variations during their use or to processes that fabricate items whose quality is independent of the variations in the raw materials used in their manufacture.
To compare several promising product designs, manufacturers must measure their performance under multiple environmental conditions. In many applications, a product design is considered to be seriously flawed if its performance is poor under any level of the environmental factor. For example, if a particular automobile battery design does not function well under some temperature conditions, then a manufacturer may not want to put this design into production. Thus, in this paper we consider the overall measure of a given product's quality to be its worst performance over the environmental levels. We develop statistical procedures to identify the optimal (or a near-optimal) product design among a given set of product designs, i.e., the manufacturing design associated with the greatest overall measure of performance. We accomplish this for intuitive procedures based on the split-plot experimental design (and the randomized complete block design as a special case); split-plot designs have the essential structure of a product array and the practical convenience of local randomization. Two classes of statistical procedures are provided. In the first, the delta-best formulation of selection problems, we determine the number of replications of the basic split-plot design that are needed to guarantee, with a given confidence level, the selection of a product design whose minimum performance is within a specified amount, delta, of the performance of the optimal product design. In particular, if the difference between the quality of the best and 2nd best manufacturing designs is delta or more, then the procedure guarantees that the best design will be selected with specified probability.
For applications where a split-plot experiment involving several product designs has been completed without the planning required of the delta-best formulation, we provide procedures to construct a "confidence subset" of the manufacturing designs; the selected subset contains the optimal product design with a prespecified confidence level. The latter is called the subset selection formulation of selection problems. Examples are provided to illustrate the procedures.
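The quality measure used above, a design's worst performance over the environmental levels, can be sketched as follows. This only computes the point estimate of the maximin design from replicated measurements; it ignores sampling variability and is not the delta-best or subset selection procedure of the paper. The data layout and function names are assumptions made for illustration.

```python
from statistics import mean


def worst_case_scores(data):
    """Map each design to its worst (smallest) mean performance over environments.

    data: {design: {environment: [replicate measurements]}}
    """
    return {design: min(mean(obs) for obs in envs.values())
            for design, envs in data.items()}


def select_maximin(data):
    """Return the design whose worst-case mean performance is largest."""
    scores = worst_case_scores(data)
    return max(scores, key=scores.get)
```

A design with excellent average performance can still lose under this criterion if it performs poorly in a single environment, which matches the battery example given in the abstract.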
The possible discrepancy between a hypothesized model and the observed data is measured by so-called goodness-of-fit statistics. In order to decide whether the observed discrepancy is substantial, the distributions of these statistics under the hypothesized model are needed to perform a statistical test. Because the exact distributions are difficult to compute, better approximations than those provided by common asymptotic theory have to be found, especially when the sample size is small. In the case of a loglinear Poisson model we do this using different bootstrap methods.
On-line monitoring of time series is becoming more and more important in different areas of application like medicine, biometry and finance. In medicine, on-line monitoring of patients after kidney transplantation (Smith83) is an easy and prominent example. In finance, fast and reliable recognition of changes in level and trend of intra-daily stock market prices is of obvious interest for ordering and purchasing. In this project, we currently consider monitoring of surgical data like heart-rate, blood pressure and oxygenation. From a statistical point of view, on-line monitoring can be considered as on-line detection of changepoints in time series. That means changepoints have to be detected in real time as new observations come in, usually in short time intervals. Retrospective detection of changepoints, after the whole batch of observations has been recorded, is nice but useless in monitoring patients during an operation. There are various statistical approaches conceivable for on-line detection of changepoints in time series. Dynamic or state space models seem particularly well suited because ``filtering'' has historically been developed exactly for on-line estimation of the ``state'' of some system. Our approach is based on a recent extension of the so-called multi-process Kalman filter for changepoint detection (Schnatter94). It turned out, however, that some important issues for adequate and reliable application have to be considered, in particular the (appropriate) handling of outliers and, as a central point, adaptive on-line estimation of control parameters or hyperparameters. In this paper, we describe a filter model that has these features and can be implemented in such a way that it is useful for real time applications with high frequency time series data.
Recently, simulation based methods for estimation of non-Gaussian dynamic models have been proposed that may also be adapted and generalized for the purpose of changepoint detection. Most of them solve the smoothing problem, but very recently some proposals have been made that could also be useful for filtering and, thus, for on-line monitoring (Kitagawa96a, Kitagawa96b, Shephard96). Whether these approaches are a useful alternative to our development requires a careful comparison in the future and is beyond the scope of this paper.
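The basic idea of filtering as on-line state estimation can be sketched with a single scalar Kalman filter. This is deliberately much simpler than the multi-process filter described above: it assumes a local level model with known observation and state variances, has no outlier handling and no on-line hyperparameter estimation, and merely reports standardized one-step-ahead forecast errors that a monitoring rule could threshold. All names and the thresholding idea are assumptions made for illustration.

```python
def local_level_filter(ys, obs_var, state_var, m0=0.0, p0=1e6):
    """Scalar Kalman filter for y_t = level_t + noise, level_t = level_{t-1} + drift.

    Returns a list of (filtered_level, standardized_innovation) pairs, one per
    observation, computed sequentially as each observation arrives.
    """
    m, p = m0, p0                      # current state mean and variance
    results = []
    for y in ys:
        p_pred = p + state_var         # predicted state variance
        f = p_pred + obs_var           # variance of the one-step forecast error
        e = y - m                      # one-step-ahead forecast error
        z = e / f ** 0.5               # standardized innovation
        k = p_pred / f                 # Kalman gain
        m = m + k * e                  # filtered state mean
        p = (1 - k) * p_pred           # filtered state variance
        results.append((m, z))
    return results
```

A large absolute standardized innovation signals that the new observation is hard to reconcile with the current level, which may indicate either an outlier or a level change; telling these apart is one of the issues the filter model in the paper is designed to handle.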