JNCI Monographs 1999 1999(26):43-48;
© 1999 by Oxford University Press
Journal of the National Cancer Institute Monographs, No. 26, 43-48, 1999
© 1999 Oxford University Press


II. GENE CHARACTERIZATION PANEL

Multistage Sampling for Disease Family Registries

Kimberly D. Siegmund, Alice S. Whittemore, Duncan C. Thomas

Affiliations of authors: K. D. Siegmund, D. C. Thomas, Department of Preventive Medicine, University of Southern California, Los Angeles; A. S. Whittemore, Department of Health Research and Policy, Stanford University School of Medicine, CA.

Correspondence to: Kimberly D. Siegmund, Ph.D., Department of Preventive Medicine, University of Southern California, 1540 Alcazar St., Suite 220, Los Angeles, CA 90089-9011 (e-mail: kims@rcf.usc.edu).


    ABSTRACT
 
BACKGROUND: The objectives of a family-based disease registry range from characterizing measured genetic factors and gene-environment interaction effects to detecting novel susceptibility genes. Gathering complete information on exposure and disease status in all family members for a sample of affected subjects (probands) to address these diverse objectives would be prohibitively expensive. METHODS: Multistage sampling can be used to design an efficient family-based disease registry. At each stage, the probands are classified on the basis of previously collected data, and a subsample is selected for more detailed observation. The design can be optimized to minimize the variance of any of the model parameter estimates, subject to a constraint on the total sample size. RESULTS: We describe the basic statistical theory and its application to a four-stage sampling scheme proposed for the Cooperative Family Registry for Epidemiologic Studies of Colorectal Cancer at the University of Southern California.



    INTRODUCTION
 
Several papers in this monograph describe a variety of basic designs for characterizing genes and gene-environment interactions. These designs include case-control studies that use unrelated controls from the general population (1), family-based case-control designs (2), family-based cohort designs (3), and variants of these designs for testing interaction effects (4). However, all of these basic designs for rare diseases may still be relatively inefficient for studying genes with rare mutations like BRCA1 because the yield of carriers in a random sample of case patients and control subjects will be very low, even if the genetic relative risk is high. To overcome this difficulty, a variety of approaches has been proposed, based on the restriction of samples to case patients and control subjects with a positive family history, who will tend to have a higher carrier frequency than an unselected series of individuals. For example, D. C. Thomas (unpublished data) has examined the relative efficiency of various family-based case-control designs incorporating such restrictions [see also (2)]. Nevertheless, with increasing degree of restriction, the relevance of the findings from such a study to the general population becomes more questionable, particularly if there may be other unmeasured environmental or genetic risk factors that modify the effect of the genes under study. As an alternative, we discuss a class of designs that entails random sampling of case patients from all strata of family history in such a way as to allow population-based inference while at the same time maximizing the statistical efficiency of the design. (Here, an optimal design minimizes the variance for any of the parameter estimates, such as the probability of disease given genotype [penetrance] or the frequency of the variant allele, under a fixed total sample size.) 
To address these goals, we explore various multistage designs, involving stratified random sampling at each stage with the use of family history data that are collected at each successive stage.

The basic idea of multistage sampling was introduced into epidemiology by White (5) in the context of a case-control design for studying the association between a rare disease and a rare exposure. Since then, there has been an abundance of related work by others (6-11). An elementary two-stage design is the following. In the first stage, one selects a random sample of case patients and control subjects and assesses exposure on all of them; in the second stage, one draws a subsample from each cell of the 2 x 2 table of exposed and unexposed case patients and control subjects and collects additional covariate information only on these sampled individuals. Alternatively, one might obtain a surrogate for exposure on all participants in the first stage as the basis for stratification and then obtain more precise exposure data only on participants in the second stage. The latter is particularly relevant in the genetic context, in which family history is a natural surrogate for the genotypes that might be measured in later stages but would be too expensive to obtain for all participants.
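The elementary two-stage design described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `second_stage_sample`, the record layout, the stage 1 counts, and the cell-specific sampling fractions are all hypothetical and are not taken from White (5) or from the registry design.

```python
import random
from collections import defaultdict

def second_stage_sample(subjects, fractions, seed=0):
    """Draw a stratified random subsample from the 2 x 2 table of
    (case status, exposure status). `fractions` maps each cell to the
    fraction of that cell to retain for detailed data collection."""
    rng = random.Random(seed)
    cells = defaultdict(list)
    for subject in subjects:
        cells[(subject["case"], subject["exposed"])].append(subject)
    sample = []
    for cell, members in cells.items():
        k = round(fractions[cell] * len(members))  # number kept in this cell
        sample.extend(rng.sample(members, k))      # simple random sample within the cell
    return sample

# Hypothetical stage 1 data: 100 cases (10 exposed) and 100 controls (5 exposed).
stage1 = (
    [{"case": True, "exposed": True}] * 10
    + [{"case": True, "exposed": False}] * 90
    + [{"case": False, "exposed": True}] * 5
    + [{"case": False, "exposed": False}] * 95
)

# Hypothetical fractions: keep every exposed subject, 20% of the unexposed.
fractions = {
    (True, True): 1.0,
    (True, False): 0.2,
    (False, True): 1.0,
    (False, False): 0.2,
}

stage2 = second_stage_sample(stage1, fractions)
print(len(stage2))
```

Keeping the rare cells whole and thinning the common cells is what makes the second stage cheaper than collecting covariates on everyone, at the cost of needing the sampling fractions in the analysis.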

Whittemore and Halpern (12) have discussed the application of multistage sampling to the field of genetic epidemiology. As one example, they described a sampling design for a study of prostate cancer. In stage 1, a case-control study using population control subjects was conducted to evaluate the association of diet and other lifestyle characteristics with prostate cancer. A brief family history of disease in fathers and brothers was obtained on all participants in this stage. In stage 2, subsets of the case patients and control subjects were selected conditional on their family history. From these selected participants, additional data on disease status in family members and their ages at onset of prostate cancer or censoring together with medical verification of the reported cancers were collected for segregation analysis. Families that contained three or more medically verified cases with prostate cancer were pursued in stage 3 for DNA samples of blood or tissue specimens, or a combination of both, for linkage analysis. With the use of simulation, Whittemore and Halpern determined the sampling fractions for the second stage that would minimize the variance of the penetrance and allele frequency estimates from the segregation analysis and maximize the yield of carriers for subsequent linkage analysis. They considered an initial sample of 1500 case patients and 1500 control subjects in stage 1 and a subsample of 570 participants in stage 2. In stage 1, 209 case patients and 92 control subjects had a positive family history of disease. Their results indicated that, regardless of which parameter they were estimating, the optimal stage 2 design entailed selecting all available case patients with a positive family history (n = 209) and subsamples of case patients and control subjects from the remaining categories of family history and disease status. The optimal fraction in those remaining categories varied, depending on which parameter was of greatest interest. 
The greatest difference in sampling fractions was observed for estimating the frequency of the variant allele: a large fraction of family history-positive control subjects (64%) was desired, together with much smaller fractions of family history-negative case patients (20%) and family history-negative control subjects (3%). For optimizing the estimate of the hazard rate in noncarriers, smaller fractions of family history-positive control subjects (34%) and family history-negative case patients (12%) were required, together with a greater fraction of family history-negative control subjects (13%). The optimal sampling fractions to estimate the hazard rate in carriers lay between the extremes given by the other two parameters. For linkage analysis, the optimal design involved maximizing the yield of carriers in the sample. This design entailed subsampling the pedigrees having the highest probability of segregating the gene: all family history-positive case patients and control subjects and 21% of the family history-negative case patients. Because the sample sizes vary across the different strata in the first stage, the largest group in absolute numbers in the second-stage sample is sometimes the family history-negative case patients.
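The arithmetic behind these fractions can be checked directly. The sketch below multiplies the stage 1 stratum sizes quoted above (1500 case patients and 1500 control subjects, of whom 209 and 92 were family history-positive) by the quoted sampling fractions. The stratum labels and function name are ours, and the 0% fraction for family history-negative control subjects in the linkage design is an assumption inferred from the description rather than a figure stated in the text.

```python
# Stage 1 stratum sizes implied by the counts quoted in the text.
stage1 = {
    "case_FH+": 209,
    "control_FH+": 92,
    "case_FH-": 1500 - 209,
    "control_FH-": 1500 - 92,
}

# Stage 2 sampling fractions quoted in the text, by objective.
# The 0.0 for control_FH- under "linkage" is an assumption (see lead-in).
fractions = {
    "allele_frequency": {"case_FH+": 1.0, "control_FH+": 0.64, "case_FH-": 0.20, "control_FH-": 0.03},
    "noncarrier_hazard": {"case_FH+": 1.0, "control_FH+": 0.34, "case_FH-": 0.12, "control_FH-": 0.13},
    "linkage": {"case_FH+": 1.0, "control_FH+": 1.0, "case_FH-": 0.21, "control_FH-": 0.0},
}

def expected_counts(design):
    """Expected number of stage 2 participants per stratum for one design."""
    return {stratum: stage1[stratum] * f for stratum, f in fractions[design].items()}

for design in fractions:
    counts = expected_counts(design)
    largest = max(counts, key=counts.get)
    print(design, round(sum(counts.values())), largest)
```

Each design yields a total close to the 570 participants planned for stage 2, and under the allele frequency and linkage designs the family history-negative case patients form the largest stratum in absolute numbers despite their small sampling fraction.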

In the following section, we summarize the basic statistical theory. We then discuss the application of these principles to the design of the National Cancer Institute Cooperative Family Registry for Colorectal Cancer Research at the University of Southern California. This registry has diverse objectives, including characterizing genes that may interact with environmental exposures and providing a resource for locating new genes that influence disease risk. Inevitably, no single design can be optimal for all of these objectives, so we discuss the compromises involved and the design we finally recommended. This design involves three key differences from that discussed by Whittemore and Halpern (12). First, the sample will be based only on families of case patients, with no families of control subjects. Second, we will have available genotype data on the case patients for the purpose of estimating penetrance and allele frequency, rather than relying purely on segregation analysis methods. Third, we will also have genotypes and risk-factor information available on selected family members for the purpose of studying gene-environment interaction effects.


    STATISTICAL METHODS
 
The methods for characterizing disease genes are likelihood based (e.g., conditional logistic regression and segregation analysis). If resources permitted sampling all probands and their family members in a single stage, we would have a sample of independent families. Consistent parameter estimates could then be obtained by solving the score equation (setting the derivative of the log-likelihood to zero), and parameter variances could be estimated, using the Fisher information in the usual way. Instead, we compute the variance of parameter estimates for various two-stage designs in which only a stratified subsample of families is selected. To compare the efficiencies of two designs having equal sample sizes, we compute the asymptotic relative efficiency, the ratio of the inverse variance estimates.

Appropriate statistical methods for data analysis use the information collected in stage 1 to account for differential sampling across strata. Under maximum likelihood, the score equation is a function of both the data observed at the second stage and the distribution of the stratification variable. This latter distribution may be unknown or difficult to calculate. Although the distribution is not of real interest, its misspecification may lead to bias in our parameter estimates. An alternative method, which is not susceptible to this bias, is the Horvitz-Thompson approach (13-16). The Horvitz-Thompson estimating equation is the score equation using only the data observed in the second stage but weighted by the inverses of the sampling fractions. To be well defined, this method requires that all strata be represented in stage 2. The variance estimator can be written as the sum of two parts: the variance of the maximum likelihood estimator based on the complete data and a penalty term for the loss of precision as a result of sampling. This latter term depends on the number of observations in each stratum, the sampling fractions, and the variability of the score function within the strata. The equations for the score and variance are given in Appendix A. To optimize the two-stage design, we minimize the asymptotic variance of the Horvitz-Thompson estimates.
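As a minimal illustration of inverse-sampling-fraction weighting, consider the simplest possible parameter, a population proportion p. For a Bernoulli outcome, the weighted score equation reduces to a weighted mean, so the Horvitz-Thompson estimate is the sum of w·y divided by the sum of w, with w = 1/(sampling fraction). The stratum sizes and outcomes below are hypothetical and chosen only to make the arithmetic easy to follow; the function names are ours.

```python
def horvitz_thompson_proportion(strata):
    """Weighted estimate of a proportion from a stratified stage 2 sample.

    Each stratum is a dict with:
      N : number of stage 1 subjects classified into the stratum
      y : list of 0/1 outcomes observed on the stage 2 subsample
    """
    numerator = 0.0
    denominator = 0.0
    for stratum in strata:
        n = len(stratum["y"])
        if n == 0:
            # The method is only well defined if every stratum is represented.
            raise ValueError("every stratum must be sampled in stage 2")
        weight = stratum["N"] / n  # inverse of the sampling fraction n / N
        numerator += weight * sum(stratum["y"])
        denominator += weight * n
    return numerator / denominator

# Hypothetical data: the small stratum is heavily oversampled.
strata = [
    {"N": 800, "y": [0, 0, 0, 1]},  # sampling fraction 4/800
    {"N": 200, "y": [1, 1, 0, 1]},  # sampling fraction 4/200
]

naive = sum(sum(s["y"]) for s in strata) / sum(len(s["y"]) for s in strata)
weighted = horvitz_thompson_proportion(strata)
print(naive, weighted)
```

The unweighted proportion (0.5) overstates the population value because the high-risk stratum is oversampled; weighting restores each stratum to its stage 1 share and gives 0.35.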

The Horvitz-Thompson approach can be used for estimating population-based parameters from a two-stage sample taken from traditional genetic and epidemiologic study designs. In stage 1 of a family study, we sample affected patients (probands) and classify them according to history of disease in their first-degree relatives (e.g., 0, 1, 2+ members affected). In stage 2, we sample case probands at random from within each stratum and collect extended pedigree data out to first cousins. The penetrance and allele frequency for an unobserved gene in such samples may be estimated with the use of classical segregation analysis models that account for the disease status of the sampled proband (12). The likelihood contribution for a single family is computed as the ratio of the probability of disease in the family over the probability that the proband is affected. The probability of disease in the family is obtained by summing the joint probability of disease status (d) and genotypes (g) over all possible genotypic combinations. When genotypes for some individuals are observed, we can include the probability of observing those genotypes in our model for a modified segregation analysis. Let the subscripts o and u denote the observed and unobserved genotypes, respectively. The numerator of the likelihood can then be written as

\[
P(d, g_o; \theta) = \sum_{g_u} P(d \mid g_o, g_u; \theta)\, P(g_o, g_u; \theta),
\]

where θ denotes the vector of model parameters (carrier and noncarrier penetrances and allele frequency). In theory, this model could be extended to include data on environmental exposure if such data are collected on all family members. However, because it will only be feasible to collect risk factor data on a subset of living family members and comparisons of risk factors would only be meaningful between individuals in the same generation, we resort to conditional logistic regression for family-matched, case-control sets.
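The denominator of the family likelihood, the probability that the proband is affected, is the simplest instance of summing over unobserved genotypes. The sketch below computes it for a single diallelic locus, together with the genotype distribution among affected probands. It assumes Hardy-Weinberg proportions, which the text does not state explicitly, and the allele frequency and penetrance values are hypothetical.

```python
def genotype_frequencies(q):
    """Population frequencies of 0, 1, and 2 copies of the variant allele,
    assuming Hardy-Weinberg proportions (an assumption of this sketch)."""
    return [(1 - q) ** 2, 2 * q * (1 - q), q ** 2]

def proband_affected_probability(q, penetrance):
    """P(proband affected) = sum over genotypes g of P(g) * P(affected | g)."""
    return sum(p * f for p, f in zip(genotype_frequencies(q), penetrance))

def genotype_given_affected(q, penetrance):
    """Genotype distribution among affected probands, by Bayes' rule."""
    total = proband_affected_probability(q, penetrance)
    return [p * f / total for p, f in zip(genotype_frequencies(q), penetrance)]

# Hypothetical dominant gene: allele frequency 1%, penetrance 5% in
# noncarriers and 50% in carriers of one or two variant alleles.
q = 0.01
penetrance = [0.05, 0.50, 0.50]

print(proband_affected_probability(q, penetrance))
print(genotype_given_affected(q, penetrance))
```

Conditioning on the proband being affected enriches the sample for carriers relative to the population, which is why proband genotypes are informative about both penetrance and allele frequency.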

Family-matched, case-control sets are used to estimate the relative risk of disease by environmental exposure and genotype and the ratio of relative risks for gene-by-environment interaction effects. The dominance effect of genes can be modeled as recessive, dominant, or multiplicative on the relative risk scale (see Appendix B). With the use of family-matched risk sets, the distribution of families in the stage 1 strata does not contribute to the conditional likelihood. Furthermore, the stage 2 sampling fractions cancel from the likelihood for case-control sets matched on the history of disease in the family. A control subject who is a sibling of the case patient always shares the same family history of disease as the case patient, so the analysis is straightforward using ordinary conditional logistic regression. However, a control subject who is a cousin of the case patient will not necessarily share the same family history (e.g., affected mother), and in that case the sampling fractions must again be accounted for in the likelihood for the resulting estimates to be valid.

Association or linkage analysis can be used to locate new genes as risk factors for disease. The same case-control methods described above can be used for the purpose of detecting associations of new candidate loci or loci in linkage disequilibrium with unidentified causal genes. Alternatively, linkage analysis can be used, with either model-based methods that describe the transmission of genes in families or model-free methods based on the sharing of ancestral marker alleles. The optimal design for model-based linkage analysis is to maximize the number of carriers in the sample. To accomplish this, we subsample the pedigrees that have a high probability of segregating the gene. We calculate this probability for each stratum and then select families from the stratum with the largest probability of segregating the disease genotype, followed by the stratum with the second largest, and so on until we reach our desired sample size. Such pedigrees could later be extended to include more distant relatives, following the sequential sampling method of Cannings and Thompson (17).
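The stratum-by-stratum selection rule for linkage analysis amounts to a greedy allocation. The sketch below shows one way to write it; the function name, stratum sizes, and segregation probabilities are hypothetical and serve only to illustrate the ordering.

```python
def select_for_linkage(strata, target):
    """Allocate a target number of families across strata, filling the
    stratum with the highest probability of segregating the gene first.

    strata: list of (label, number_of_families, probability_of_segregating)
    Returns a dict mapping each label to the number of families selected.
    """
    plan = {}
    remaining = target
    for label, n_families, prob in sorted(strata, key=lambda s: s[2], reverse=True):
        take = min(n_families, remaining)
        plan[label] = take
        remaining -= take
    return plan

# Hypothetical strata defined by the number of affected first-degree relatives.
strata = [
    ("0 affected", 4000, 0.02),
    ("1 affected", 800, 0.10),
    ("2+ affected", 200, 0.40),
]

plan = select_for_linkage(strata, target=1000)
prob = {label: p for label, _, p in strata}
expected_segregating = sum(plan[label] * prob[label] for label in plan)
print(plan, expected_segregating)
```

With these illustrative numbers the sample is exhausted before any family without affected relatives is chosen, which maximizes the expected number of segregating families but leaves that stratum unrepresented, so the resulting sample would not support Horvitz-Thompson estimation of population parameters.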


    APPLICATION
 
The goals of the Cooperative Family Registries for Breast and Colorectal Cancer Research include developing a resource of family data for use in both gene characterization and gene hunting. For gene characterization, distinctions are made between major genes and metabolic genes. We refer to a single gene that has a large relative risk of disease because of a rare allele as a major gene. Metabolic genes, conversely, are genes that confer a small relative risk and whose disease-predisposing allele is common in the population (e.g., frequency of 20% or more). For the genes with rare alleles, attention focuses on estimating allele frequency and absolute penetrance in carriers and noncarriers. For genes with common alleles, we focus instead on the relative penetrance. Detecting departures from the multiplicative effects of the relative risks for genotype and environmental exposure is of interest for both major and metabolic genotypes.

The primary design considerations include which families to select in a stratified sample of the ascertained case probands and from whom to collect blood samples and risk-factor questionnaires. For all designs, we assumed a genetic relative risk of 20 for a major gene and 2 for a metabolic gene as well as a population prevalence of disease of 10%. We expect to be able to ascertain 5000 probands from whom it will be feasible to sample 1000 families (20%) and to obtain blood samples from approximately 4500 individuals. We considered a fixed family structure consisting of the case proband, two siblings, mother, father, two first cousins, the cousins' mother and father (proband's aunt and uncle), and the connecting grandparents.

First, we computed the optimal sampling fractions for minimizing the variance of the segregation parameters for a variety of genetic models when the genotype of only the case proband is observed. In general, the results suggest only slightly oversampling families with 0 and 2+ affected relatives to minimize the variance of the allele frequency. For example, the optimal design for a rare recessive gene (allele frequency = 0.14) is to sample 20.4%, 18.5%, and 22.6% of case probands with 0, 1, and 2+ affected first-degree relatives, respectively. To minimize the variance of the penetrance in carriers and noncarriers, we should undersample case probands with 0 affected relatives and oversample those with 2+ affected relatives. For the same rare recessive gene, we would sample 17.3%, 18.3%, and 49.9% of case probands with 0, 1, and 2+ affected first-degree relatives, respectively, to efficiently estimate the penetrance in gene carriers. The averages of the fractions over the different parameters of interest are given in Table 1 (19%, 19%, and 35% in the strata with 0, 1, and 2+ affected relatives, respectively). The choice of parameterization of the penetrance as absolute or relative had no noticeable effect on the optimal design.


 
Table 1. Asymptotic relative efficiency of the Horvitz-Thompson estimates for designs in which different genotypes are observed (genetic relative risk = 20)*

 
The asymptotic relative efficiency results comparing designs with different genotyped family members for a rare major gene and common metabolic gene are given in Tables 1 and 2, respectively. The greatest gain in efficiency (per individual) is seen when the proband is genotyped. For a rare major gene, the genotyping of additional family members increases efficiency the most in estimating the hazard rate in carriers and the allele frequency. Efficiency is also improved with additional genotyping for estimating the relative risk of a metabolic gene (Table 2). The largest efficiency gain for the relative risk is observed under the multiplicative model. Compared with genotyping the proband only, the relative efficiency of additionally genotyping both siblings is 3.5 and of additionally genotyping both parents is 4.5.
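To give a feel for what these efficiency figures mean in practice, the sketch below converts an asymptotic relative efficiency into the implied change in confidence-interval width. It assumes that the efficiency is the ratio of asymptotic variances and that Wald-type intervals are used on the scale where the estimator is approximately normal (for a relative risk, the log scale), so that interval width scales with the standard error. The function names are ours.

```python
import math

def relative_ci_width(are):
    """Width of the confidence interval under the more efficient design,
    relative to the reference design, given the ratio of variances `are`
    (reference variance divided by new variance)."""
    if are <= 0:
        raise ValueError("asymptotic relative efficiency must be positive")
    return 1.0 / math.sqrt(are)

def width_reduction(are):
    """Fractional reduction in confidence-interval width."""
    return 1.0 - relative_ci_width(are)

# Efficiencies quoted in the text for the multiplicative model.
for label, are in [("both siblings", 3.5), ("both parents", 4.5)]:
    print(label, round(width_reduction(are), 3))
```

Under these assumptions, an efficiency of 3.5 corresponds to an interval a little under half as wide again narrower (about 47%), and an efficiency of 4.5 to a reduction of about 53%.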


 
Table 2. Asymptotic relative efficiency for the Horvitz-Thompson estimates for designs in which different genotypes are observed (genetic relative risk = 2)*

 
Asymptotic relative efficiency results for gene-environment interaction effects in case-control sib pairs suggest the greatest efficiency is obtained from sampling pairs that have a positive family history of disease (Table 3). The increase in efficiency is greater for the rare genotype than for the common one. Under both the multiplicative and dominant models, the efficiency gain from having an affected sibling is similar to that of having an affected parent. Under a recessive model, the increase in efficiency is greater when the affected relative is a sibling.


 
Table 3. Asymptotic relative efficiency for estimating gene-by-environment interaction effect in case-sib-control pair with an affected relative compared with an unaffected relative*

 
Finally, we find that, for all three dominance models, the probability of a family segregating a disease allele increases with the number of cases in the family (results not shown). Thus, the families with three or more cases will be the most informative for linkage analysis.

Our preferred "compromise" design, as recommended to the Colorectal Cancer Family Registry at the University of Southern California, is to sample the data in four stages.

In stage 1, we plan to enroll approximately 5000 probands with colorectal cancer from population-based cancer registries. In a short telephone interview, we will collect family history data on colorectal cancer in parents and siblings only and stratify the families on the number of affected parents and siblings.

In stage 2, a random sample of families will be drawn from each stratum. We propose selecting 16% of the probands with no family history of disease, 32% of the probands with one additional affected family member, 48% of those with two, 64% of those with three, etc. We anticipate sampling approximately 1000 families in stage 2. In an extended telephone interview with the proband, we will then inquire about cancer incidence in all first- and second-degree relatives and first cousins and request permission to contact all surviving affected family members, one unaffected sibling of each case, and two unaffected cousins of the proband. For probands without an unaffected sibling, we will seek permission to contact the parents. Blood samples will be drawn, and risk-factor and food-frequency questionnaires will be collected from the probands in this subsample.
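The stage 2 sampling rule is linear in the number of additional affected relatives: 16% multiplied by one plus that number. The sketch below encodes the rule and shows how it translates into an expected stage 2 sample. Two things are assumptions of this sketch and not part of the proposal: the cap at 100% for families with six or more affected relatives (the text only says "etc."), and the stratum counts, which are hypothetical figures chosen to sum to 5000 probands.

```python
def sampling_percent(n_affected_relatives):
    """Stage 2 sampling percentage: 16% for 0 affected relatives, 32% for 1,
    48% for 2, 64% for 3, and so on. The cap at 100% is an assumption."""
    return min(100, 16 * (n_affected_relatives + 1))

def expected_stage2_families(stratum_counts):
    """Expected number of families sampled from each stratum.

    stratum_counts maps the number of affected relatives to the number of
    stage 1 probands in that stratum."""
    return {k: n * sampling_percent(k) / 100 for k, n in stratum_counts.items()}

# Hypothetical stage 1 distribution of 5000 probands.
stratum_counts = {0: 4000, 1: 800, 2: 150, 3: 50}

expected = expected_stage2_families(stratum_counts)
print(expected, sum(expected.values()))
```

With this illustrative distribution the rule yields 1000 families, matching the anticipated stage 2 sample size; the actual yield would depend on the observed distribution of family history among the enrolled probands.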

In stage 3, blood samples as well as risk-factor and food-frequency questionnaires will be sought from relatives identified in stage 2 whom the proband has granted permission for us to contact. For probands without an affected sibling, parents will be sought for blood samples only. Alternatively, one could select relatives from a subsample of probands defined by cancer in the extended pedigree or by the genotype of the proband.

In a possible fourth stage, we will consider collecting extended pedigrees from families that are potentially informative for linkage analysis on the basis of their apparent segregation of disease that cannot be explained by already known genes, using the sequential sampling approach of Cannings and Thompson (17).

The expected confidence intervals for the penetrance parameters and allele frequency for a major gene and a metabolic gene are given in Table 4. These expected confidence intervals are calculated using the sampling fractions, the extended family history data, and the genotypes on the probands alone (approximately 1125). On the basis of the relative efficiency comparisons in Tables 1 and 2, we anticipate that, by including the genotypes of additional family members, the confidence intervals on the carrier penetrance, relative risk, and gene frequency could be substantially narrowed. For example, if we include the genotypes of two siblings for the common metabolic gene (approximately 3375 total genotypes), the sizes of the estimated confidence intervals for the genetic relative risk are reduced 35%-52%, depending on the genetic model. For a true genetic relative risk of 2, the estimated 95% confidence interval is 1.17-3.43 for a multiplicative susceptibility gene and approximately 1.43-2.79 for one acting in a dominant or a recessive fashion. In a separate case-control analysis of these same sib triplets, the estimated 95% confidence interval for a gene-environment relative risk ratio of 2 ranges from 0.7-5.8 for the multiplicative gene model to 1.0-4.1 for the dominant and recessive model.
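The quoted intervals for the genetic relative risk are symmetric about 2 on the log scale, which is what one would expect from Wald-type intervals for a log relative risk. The sketch below recovers the standard error implied by a reported interval and rebuilds the interval from it. That the intervals were constructed this way is our inference from the numbers, not something the text states, and the function names are ours.

```python
import math

Z95 = 1.959964  # standard normal quantile for a two-sided 95% interval

def implied_log_se(lower, upper, z=Z95):
    """Standard error of the log relative risk implied by a reported interval."""
    return (math.log(upper) - math.log(lower)) / (2 * z)

def wald_ci(relative_risk, log_se, z=Z95):
    """Wald interval for a relative risk, symmetric on the log scale."""
    centre = math.log(relative_risk)
    return math.exp(centre - z * log_se), math.exp(centre + z * log_se)

for label, (lower, upper) in {
    "multiplicative": (1.17, 3.43),
    "dominant or recessive": (1.43, 2.79),
}.items():
    se = implied_log_se(lower, upper)
    print(label, round(se, 3), tuple(round(v, 2) for v in wald_ci(2.0, se)))
```

Rebuilding each interval around a relative risk of 2 reproduces the reported limits to within rounding, which supports reading them as log-scale Wald intervals.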


 
Table 4. Estimated 95% confidence intervals (CIs) on penetrance and allele frequency from the two-stage sample selecting 16%, 32%, and 48% of families with 0, 1, and 2+ affected relatives, respectively*

 

    DISCUSSION
 
In genetic epidemiology, multistage sampling permits the allocation of resources to families that are likely to be most informative for a given objective while still allowing population-based inference by using the sampling fractions at each stage. The only requirement for drawing valid inference is that researchers have a random sample of subjects from each stratum. Different end points may be optimized in each stage, and the selection of an optimal design depends on the aims of the study.

The potential information gain depends on the initial sample and the genetic model. Whittemore and Halpern (12) use simulations to show that the best sampling strategy for the case-control design for prostate cancer is to select all family history-positive case patients and a substantial fraction of family history-positive control subjects, a smaller fraction of family history-negative case patients, and the smallest fraction of family history-negative control subjects. However, our simulations for the family cancer registry design, based only on families of case patients and including the genotype of the proband, showed that the sampling weights slightly favored oversampling case patients with two or more affected relatives. Little difference was observed in the weights for sampling case patients with no affected relative and for sampling case patients with one affected relative.

Efficiency calculations for estimating gene-by-environment interaction effects from sibling data favor sampling case patients with a positive family history of disease. These same families are preferred for gene hunting with the use of linkage analysis. Combining this finding with earlier results on estimating segregation parameters resulted in a final recommendation to the Cooperative Family Registry for Colorectal Cancer Research at the University of Southern California to sample case patients with affected relatives at a higher rate than case patients without affected relatives. This procedure will improve our ability to detect gene-environment interactions and to discover new genes in later stages of the design while maintaining the ability to characterize known genes.


    APPENDIX A. SCORE EQUATIONS AND HORVITZ-THOMPSON VARIANCE FORMULA
 
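The score equation for the complete likelihood is

\[
\sum_{j} \left\{ \sum_{i=1}^{n_{S_j}} \frac{\partial}{\partial\theta} \log f(y_{ij};\theta) + \left(N_{S_j} - n_{S_j}\right) \frac{\partial}{\partial\theta} \log \omega_{S_j}(\theta) \right\} = 0,
\]

where θ denotes our parameter of interest, f(y; θ) the distribution of the stage 2 data (y), N_Sj − n_Sj the number of probands classified in stratum Sj but not selected in stage 2, and ω_Sj the probability of being in that stratum. The Horvitz-Thompson estimating equation is

\[
\sum_{j} \frac{1}{f_{S_j}} \sum_{i=1}^{n_{S_j}} \frac{\partial}{\partial\theta} \log f(y_{ij};\theta) = 0,
\]

where f_Sj denotes the sampling fraction in stratum Sj. The asymptotic variance is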


where


and



    APPENDIX B. CONDITIONAL LOGISTIC REGRESSION
 
For a single case-control set, let j = 1, . . . , M index members, with j = 1 denoting the case patient. The observed data include the environmental exposure x and the genotype g. The coding for the genotype is given by G_g. For individuals carrying two copies of the normal allele, G_g = 0; for those carrying two copies of the variant allele, G_g = 1. For subjects carrying one allele of each type, G_g = Δ, where Δ is the dominance effect (Δ = 0 for recessive genes, 1 for dominant genes, and 1/2 for genes that act multiplicatively on the relative risk scale). The parameter θ = (β, γ, δ) denotes the logarithms of the genetic and environmental relative risks and of the relative risk ratio for the gene-environment interaction effect. The likelihood contribution for one family is

\[
L(\theta) = \frac{\exp\left(\beta G_{g_1} + \gamma x_1 + \delta G_{g_1} x_1\right)}{\sum_{j=1}^{M} \exp\left(\beta G_{g_j} + \gamma x_j + \delta G_{g_j} x_j\right)}.
\]

For a sample of independent case-control sets, the likelihood is given by the product of such terms.
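A minimal sketch of this likelihood contribution follows, assuming the standard conditional logistic form in which the case patient's relative risk is divided by the sum of relative risks over all members of the matched set. The function names and the example set are hypothetical; genotypes are entered as the number of variant alleles and exposure as 0 or 1.

```python
import math

def genotype_code(n_variant_alleles, dominance):
    """Genotype coding G: 0 for no variant alleles, 1 for two, and the
    dominance effect (0, 1/2, or 1) for heterozygotes."""
    return {0: 0.0, 1: dominance, 2: 1.0}[n_variant_alleles]

def conditional_likelihood(members, beta, gamma, delta, dominance):
    """Likelihood contribution of one matched set.

    members: list of (n_variant_alleles, exposure); members[0] is the case.
    beta, gamma, delta: log relative risks for genotype, exposure, and
    their interaction."""
    def relative_risk(n_variant, x):
        g = genotype_code(n_variant, dominance)
        return math.exp(beta * g + gamma * x + delta * g * x)

    risks = [relative_risk(n_variant, x) for n_variant, x in members]
    return risks[0] / sum(risks)

def log_likelihood(sets, beta, gamma, delta, dominance):
    """Log-likelihood for independent matched sets (sum of log contributions)."""
    return sum(
        math.log(conditional_likelihood(s, beta, gamma, delta, dominance)) for s in sets
    )

# Hypothetical set: exposed homozygous-variant case with one unexposed
# sibling control carrying no variant alleles.
example = [(2, 1), (0, 0)]
print(conditional_likelihood(example, math.log(2), math.log(3), 0.0, dominance=0.5))
```

With no interaction, a genetic relative risk of 2, and an exposure relative risk of 3, the case patient's relative risk is 6 against 1 for the control, so the contribution is 6/7. Setting all three parameters to zero gives 1/M, as expected when neither genotype nor exposure affects risk.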


    NOTE
 
Supported by Public Health Service grants CA52862 (K. D. Siegmund, D. C. Thomas) and 5R35CA47448 National Institutes of Health Outstanding Investigator Grant, Research in Cancer Epidemiology and Biostatistics (A. S. Whittemore), National Cancer Institute, National Institutes of Health, Department of Health and Human Services.


    REFERENCES
 

1 Caporaso N, Rothman N, Wacholder S. Case-control studies of common alleles and environmental factors. Monogr Natl Cancer Inst 1999;26:25-30.

2 Gauderman WJ, Witte JS, Thomas DC. Family-based association studies. Monogr Natl Cancer Inst 1999;26:31-7.

3 Gail MH, Pee D, Carroll R. Kin-cohort designs for gene characterization. Monogr Natl Cancer Inst 1999;26:55-60.

4 Goldstein AM, Andrieu N. Detection of interaction involving identified genes: available study designs. Monogr Natl Cancer Inst 1999;26:49-54.

5 White JE. A two-stage design for the study of the relationship between a rare exposure and a rare disease. Am J Epidemiol 1982;115:119-28.

6 Breslow NE, Cain KC. Logistic regression for two-stage case-control data. Biometrika 1988;75:11-20.

7 Cain KC, Breslow NE. Logistic regression analysis and efficient design for two-stage studies. Am J Epidemiol 1988;128:1198-206.

8 Weinberg CR, Wacholder S. The design and analysis of case-control studies with biased sampling. Biometrics 1990;46:963-76.

9 Zhao LP, Lipsitz S. Designs and analysis of two-stage studies. Stat Med 1992;11:769-82.

10 Reilly M. Optimal sampling strategies for two-stage studies. Am J Epidemiol 1996;143:92-100.

11 Breslow NE, Holubkov R. Weighted likelihood, pseudo-likelihood and maximum likelihood methods for logistic regression analysis of two-stage data. Stat Med 1997;16:103-16.

12 Whittemore AS, Halpern J. Multi-stage sampling in genetic epidemiology. Stat Med 1997;16:153-67.

13 Horvitz DG, Thompson DJ. A generalization of sampling without replacement from a finite population. J Am Stat Assoc 1952;47:663-85.

14 Flanders WD, Greenland S. Analytic methods for two-stage case-control studies and other stratified designs. Stat Med 1991;10:739-47.

15 Pepe MS, Reilly M, Fleming T. Auxiliary outcome data and the mean score method. J Stat Planning Inference 1994;42:137-60.

16 Robins JM, Rotnitzky A, Zhao LP. Estimation of regression coefficients when some regressors are not always observed. J Am Stat Assoc 1994;89:846-66.

17 Cannings C, Thompson EA. Ascertainment in the sequential sampling of pedigrees. Clin Genet 1977;12:208-12.

