
If overdispersion is detected, the ZINB model often provides an adequate alternative. A zero-inflated negative binomial random variable Y has a mixture distribution that combines a point mass at zero with a negative binomial distribution. Because the ZINB model assumes a negative binomial distribution for the first component of the mixture, it has a more flexible variance function.
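For reference, one common parameterization of the ZINB probability mass function is sketched below. The symbols are assumptions chosen for illustration, since the original notation does not survive in this text: $\omega$ is the zero-inflation probability, $\lambda$ is the mean of the negative binomial component, and $k$ is its dispersion parameter.

```latex
\Pr(Y = y) =
\begin{cases}
\omega + (1-\omega)\,(1 + k\lambda)^{-1/k}, & y = 0,\\[4pt]
(1-\omega)\,\dfrac{\Gamma(y + 1/k)}{\Gamma(y+1)\,\Gamma(1/k)}\,
\dfrac{(k\lambda)^{y}}{(1 + k\lambda)^{y + 1/k}}, & y = 1, 2, \ldots
\end{cases}
\qquad
\mathrm{E}(Y) = (1-\omega)\lambda, \quad
\mathrm{Var}(Y) = (1-\omega)\lambda\,(1 + \omega\lambda + k\lambda).
```

Under this parameterization the variance reduces to the ZIP variance $(1-\omega)\lambda(1+\omega\lambda)$ as $k \to 0$, which is one way to see that the extra parameter buys the additional flexibility.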

Thus it provides a means to account for overdispersion that is not due to the excess zeros. However, the negative binomial, and thus the ZINB model, achieves this additional flexibility at the cost of an additional parameter. Thus, if you fit a ZINB model when there is no overdispersion, the parameter estimates are less efficient than those of the more parsimonious ZIP model. If the ZINB model does not fully account for the overdispersion, more flexible mixture models can be considered.

Consider a horticultural experiment to study the number of roots produced by a certain species of apple tree.

The objective is to assess the effect of both the photoperiod and the concentration levels of BAP on the number of roots produced. The analysis begins with a graphical inspection of the data. The FREQ procedure is then used to produce plots of the marginal and conditional distributions of the response variable Roots.

Inspection of Figure 1 reveals a percentage of zero counts that is much larger than what you would expect to observe if the data were generated by simple Poisson or negative binomial processes. Plots of the distribution of Roots conditional on Photoperiod are shown in Figure 2 (Distribution of Roots Conditional on Photoperiod).

Figure 2 reveals that under the 8-hour photoperiod, almost all of the shoots produced roots. However, under the other photoperiod, almost half of the shoots produced no roots. This provides compelling evidence that the data generating process is a mixture and that the probability of observing a zero count is conditional on the photoperiod.

Figure 3 reveals differences in the modes and the skew of the conditional distributions. It is reasonable to conclude that the expected value of Roots is a function of the level of BAP. However, there is little variation in the percentage of zero counts in these conditional distributions, suggesting that BAP is probably not a predictor of the probability of a zero count.

There is some indication of interaction effects, but it is difficult to predict whether they are significant. To summarize, the graphical evidence indicates that a simple Poisson or negative binomial model will not likely account for the prevalence of zero counts and that a mixture model such as a zero-inflated Poisson (ZIP) model or zero-inflated negative binomial (ZINB) model is needed.

There is also clear evidence that the probability of a zero count depends on the level of Photoperiod. If there is overdispersion, then the model is misspecified and the standard errors of the model parameters are biased downwards. Output 1 displays the fit criteria for the ZIP model. Most of the criteria are useful only for comparing the model fit among given alternative models. However, the Pearson statistic can be used to determine if there is any evidence of overdispersion.

If the model is correctly specified and there is no overdispersion, the Pearson chi-square statistic divided by the degrees of freedom has an expected value of 1. The obvious question is whether the observed value is significantly greater than 1 and therefore indicative of overdispersion.

As indicated in the section Analysis, the scaled Pearson statistic for generalized linear models has a limiting chi-square distribution under certain regularity conditions with degrees of freedom equal to the number of observations minus the number of estimated parameters. For Poisson and negative binomial models, the scale is fixed at 1, so there is no difference between the scaled and unscaled versions of the statistic.

Therefore, a formal one-sided test for overdispersion is performed by computing the probability of observing a larger value of the statistic. Output 2 reports the p-value for this test, which provides evidence of overdispersion. Output 3 presents the parameter estimates for the ZIP model. Because of the evidence of overdispersion, inferences based on these estimates are suspect; the standard errors are likely to be biased downwards.
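The one-sided test described above amounts to evaluating the upper tail of a chi-square distribution at the Pearson statistic. The following Python sketch illustrates the calculation using only the standard library. It is not the SAS code from the original example. The function names and the numbers in the usage line are hypothetical, and the series used for the incomplete gamma function is only adequate for moderate values of the statistic.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square variable with df degrees of freedom."""
    if x <= 0:
        return 1.0
    a = df / 2.0
    z = x / 2.0
    # Series expansion of the regularized lower incomplete gamma function P(a, z):
    # P(a, z) = z^a e^{-z} / Gamma(a + 1) * sum_{n>=0} z^n / ((a+1)(a+2)...(a+n))
    term = 1.0
    total = 1.0
    n = 0
    while abs(term) > 1e-16 * abs(total) and n < 100000:
        n += 1
        term *= z / (a + n)
        total += term
    log_prefix = a * math.log(z) - z - math.lgamma(a + 1.0)
    lower = math.exp(log_prefix) * total
    return max(0.0, 1.0 - lower)

def overdispersion_pvalue(pearson_chi2, n_obs, n_params):
    """One-sided p-value: probability of a Pearson statistic at least this large."""
    df = n_obs - n_params  # observations minus estimated parameters
    return chi2_sf(pearson_chi2, df)

# Hypothetical inputs, not taken from the apple-shoot example.
print(overdispersion_pvalue(pearson_chi2=300.0, n_obs=270, n_params=10))
```

A small p-value suggests that the Pearson statistic is larger than a correctly specified model would be expected to produce.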

Nevertheless, the results as presented indicate that Photoperiod and BAP are significant determinants of the expected value, as are three of the four interactions. Also as expected, Photoperiod is a significant predictor of the probability of a zero count. Another method for assessing the goodness-of-fit of the model is to compare the observed relative frequencies of the various counts to the maximum likelihood estimates of their respective probabilities.

The following SAS statements demonstrate one method of computing the estimated probabilities and generating two comparative plots. The first step is to observe the value of the largest count and the sample size and save them into macro variables. Next, you use the model predictions and the estimated zero-inflation probabilities that are stored in the output data set Zip to compute the conditional probabilities.

You also generate an indicator variable for each count, where each observation is assigned a value of 1 if that count is observed and 0 otherwise. That is, there is one observation for each variable. In order to generate comparative plots, the data need to be in what is referred to as long form. Ultimately, you need four variables: one whose observations are an index of the values of the counts, a second whose observations are the observed relative frequencies, a third whose observations contain the ZIP model estimates of the probabilities, and a fourth whose observations contain the difference between the observed relative frequencies and the estimated probabilities.
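The four long-form variables described above can be sketched in Python as follows. This is an illustrative stand-in for the SAS data steps, not a reproduction of them. The function names and inputs are hypothetical: `lams` holds the per-observation Poisson means and `omegas` the per-observation zero-inflation probabilities that a fitted ZIP model would supply.

```python
import math
from collections import Counter

def zip_pmf(y, lam, omega):
    """ZIP probability of count y, for Poisson mean lam > 0 and zero-inflation probability omega."""
    poisson = math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))
    if y == 0:
        return omega + (1 - omega) * poisson
    return (1 - omega) * poisson

def compare_frequencies(counts, lams, omegas):
    """Return rows of (count, observed relative frequency, estimated probability, difference)."""
    n = len(counts)
    observed = Counter(counts)
    rows = []
    for y in range(max(counts) + 1):
        obs_freq = observed.get(y, 0) / n
        # Average the model probability of count y over all observations.
        est_prob = sum(zip_pmf(y, lam, omega) for lam, omega in zip(lams, omegas)) / n
        rows.append((y, obs_freq, est_prob, obs_freq - est_prob))
    return rows

# Hypothetical data and fitted values, for illustration only.
for row in compare_frequencies([0, 0, 1, 3], [1.0] * 4, [0.2] * 4):
    print(row)
```

Each row corresponds to one value of the count, which is the long-form layout needed to plot observed frequencies against estimated probabilities.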

The two output data sets are then transposed so that they are in long form.

That is, subjects with higher baseline sexual behavior tend to be less likely to be a structural zero in VCD at the 3-month follow-up. Here VCD denotes the number of vaginal sex encounters using condoms in the 3 months after enrollment.

Analysis of maximum likelihood zero inflation parameter estimates for inflated zero component of VCD. In this example, application of zero-inflated models enables us to ascertain the exact effect of the educational intervention. HIV knowledge is associated with VCD, but mainly through its effect on the count for the at-risk subgroup.

However, zero-inflated models are not yet available in SPSS. In the sample code, text in italics can be modified to specify the data source and models, while text not in italics consists of SAS keywords and must be entered exactly as it appears. In the sample code, the logit link is used, yielding logistic regression.

However, as in modeling binary outcomes, other commonly used link functions such as probit and complementary log-log may also be used; both are available in SAS. The same procedure can be applied to the ZIP analysis of the example in the previous section.

This article discusses structural zeros in count outcomes and how to use zero-inflated models to address this issue.

Zero-inflated models are the natural approach when the status of structural zeros is unknown, that is, when structural zeros cannot be distinguished from random zeros. In cases where this distinction is known, we may take advantage of the additional information and apply hurdle models. Again, the Poisson and NB may be used for modeling the count response, and logistic regression may be applied for the structural zeros.

However, since the status of each structural zero is known, no mixture distribution is needed, and the Poisson or NB and logistic regression components of the hurdle model are essentially two separate models. Thus, no new software is needed for fitting the hurdle model. As illustrated by the real data example presented, zero-inflated models have both conceptual and analytic advantages when there are excessive zeros. Zero-inflated models not only correct the overdispersion arising from the existence of structural zeros, but also allow for the distinction of different risk groups, providing a better understanding of the data.

We limited ourselves to parametric models and cross-sectional data analysis because of their availability in common software packages. However, parametric approaches are prone to distribution misspecification, potentially yielding bias in estimates. For example, if the count response for the at-risk subgroup in a study does not follow the NB distribution, assuming and fitting a ZINB model may yield biased estimates.

Another problem is that cross-sectional models cannot be applied to investigate temporal changes from repeated assessments in longitudinal studies. Some new methods have been developed to address both of these limitations [22], [23], but they have not yet been included in popular statistical software packages such as SAS and Stata. We have only discussed the structural zero issue when zero-inflated count variables are used as the response.

The issue is also present when such variables serve as predictors in regression analyses. Indeed, using such variables as predictors and failing to distinguish structural and random zeros results in biased inference and makes it quite difficult to interpret estimates. Her research interests are in ROC analysis, semi-parametric and non-parametric inference, missing data modeling, causal inference, social network analysis, count data analysis and applications of statistical methods to psychosocial research.

Conflict of interest: The authors declare no conflict of interest.

Journal: Shanghai Arch Psychiatry.

Keywords: count response, structural zeroes, random zeroes, zero-inflated models.

Introduction

Count or frequency responses such as number of heart attacks, number of days of alcohol drinking, number of suicide attempts, and number of unprotected sexual encounters during a period of time arise quite often in biomedical and psychosocial research.

Zero-inflated Poisson models

Zero-inflated Poisson distribution

In biomedical and psychosocial research the observed frequency of zeros often exceeds the frequency of zeros predicted by the Poisson model.

Figure 1. Frequencies of scores on the 9-item Patient Health Questionnaire (PHQ-9), sexual partners in past year, and heavy drinking days per month.

Zero-inflated Poisson regression models

It is both conceptually and theoretically reasonable to model the outcomes from the two groups of subjects separately due to the heterogeneity of the study sample.

Example

We use a study of sexual behavior among adolescent girls to illustrate the application of zero-inflated models.

Figure 2.

Table 1. Analysis of maximum likelihood parameter estimates for count component of VCD.

Table 2.

References

Site matters: multisite randomized trial of motivational enhancement therapy in community drug abuse clinics. J Consult Clin Psychol.
A randomized controlled study of a web-based performance improvement system for substance abuse treatment providers. J Subst Abuse Treat.
Collegiate sporting events and celebratory drinking. J Stud Alcohol Drugs.
Early adolescent psychopathology as a predictor of alcohol use disorders by young adulthood. Drug Alcohol Depend.
Alcohol conscientiousness and event-level condom use. Br J Health Psychol.
Alcohol outlet density levels of drinking and alcohol-related harm in New Zealand: a national study. J Epidemiol Community Health.
New variable selection methods for zero-inflated count data with applications to the substance abuse field. Stat Med.
Randomized trials of alcohol-use interventions with college students and their parents: lessons from the Transitions Project. Clin Trials.
Parental alcohol involvement and adolescent alcohol expectancies predict alcohol involvement in male adolescents. Psychol Addict Behav.
When should clinicians switch treatments? An application of signal detection theory to two treatments for women with alcohol use disorders. Behav Res Ther.
Targeted versus daily naltrexone: secondary analysis of effects on average daily drinking. Alcohol Clin Exp Res.
Using information, motivational enhancement, and skills training to reduce the risk of HIV infection for low-income urban women: a second randomized clinical trial. Health Psychol.
A peer-education intervention to reduce injection risk behaviors for HIV and hepatitis C virus infection in young injection drug users.
Modeling count outcomes from HIV risk reduction interventions: a comparison of competing statistical models for count responses.
On the use of zero-inflated and hurdle models for modeling vaccine adverse event count data. J Biopharm Stat.
Zero-inflated negative binomial mixed regression modeling of over-dispersed count data with extra zeros. Biom J.
Effects of major depression on crack use and arrests among women in drug court.
Modeling count data with excess zeroes: an empirical application to traffic accidents. Sociol Methods Res.
Applied Categorical and Count Data Analysis.
Modeling the abundance of rare species: statistical models for counts with extra zeros. Ecol Modell.
Hall DB. Zero-inflated Poisson and binomial regression with random effects: a case study.
Zero-inflated Poisson regression with random effects to evaluate an occupational injury prevention programme.

Zero-inflated count models provide one method to explain the excess zeros by modeling the data as a mixture of two separate distributions: one distribution is typically a Poisson or negative binomial distribution that can generate both zero and nonzero counts, and the second distribution is a constant distribution that generates only zero counts. When the underlying count distribution is a Poisson distribution, the mixture is called a zero-inflated Poisson (ZIP) distribution; when the underlying count distribution is a negative binomial distribution, the mixture is called a zero-inflated negative binomial (ZINB) distribution.

Count data that have an incidence of zeros greater than expected for the underlying probability distribution can be modeled with a zero-inflated distribution. The population is considered to consist of two subpopulations. Observations drawn from the first subpopulation are realizations of a random variable that typically has either a Poisson or negative binomial distribution, which might contain zeros.

Observations drawn from the second subpopulation always provide a zero count. Suppose the mean of the underlying Poisson or negative binomial distribution is $\lambda$ and the probability of an observation being drawn from the constant distribution that always generates zeros is $\omega$. The parameter $\omega$ is often called the zero-inflation probability.

The parameters $\lambda$ and $\omega$ can be modeled as functions of linear predictors, $g(\lambda_i) = x_i'\beta$ and $h(\omega_i) = z_i'\gamma$, where $g$ and $h$ are link functions. The log link function is typically used for $g$. The excess zeros are a form of overdispersion. Fitting a zero-inflated Poisson model can account for the excess zeros, but there are also other sources of overdispersion that must be considered.
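The two-subpopulation description above can be made concrete with a small simulation. The following Python sketch draws from a ZIP mixture using only the standard library. The function names, the seed handling, and the parameter values in the usage lines are illustrative assumptions, with `lam` playing the role of the Poisson mean and `omega` the zero-inflation probability.

```python
import math
import random

def draw_poisson(rng, lam):
    """Draw one Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k = 0
    p = 1.0
    while True:
        k += 1
        p *= rng.random()
        if p <= limit:
            return k - 1

def simulate_zip(n, lam, omega, seed=0):
    """Simulate n counts: a structural zero with probability omega, otherwise Poisson(lam)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n):
        if rng.random() < omega:
            draws.append(0)  # second subpopulation: always zero
        else:
            draws.append(draw_poisson(rng, lam))  # first subpopulation: may also be zero
    return draws

# Hypothetical parameter values, for illustration only.
sample = simulate_zip(20000, lam=2.0, omega=0.3, seed=1)
print(sample.count(0) / len(sample), 0.3 + 0.7 * math.exp(-2.0))
```

The two printed numbers are the simulated share of zeros and the theoretical value $\omega + (1-\omega)e^{-\lambda}$. They should be close, which shows that zeros arise from both subpopulations.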

If there are sources of overdispersion that cannot be attributed to the excess zeros, failure to account for them constitutes a model misspecification, which results in biased standard errors. The Poisson distribution assumes that the mean and the variance are equal. If this is an invalid assumption, the data exhibit overdispersion or underdispersion. A useful diagnostic tool that can aid you in detecting overdispersion is the Pearson chi-square statistic.

This statistic, under certain regularity conditions, has a limiting chi-square distribution, with degrees of freedom equal to the number of observations $n$ minus the number of parameters estimated $p$. Comparing the computed Pearson chi-square statistic to an appropriate quantile of a chi-square distribution with $n - p$ degrees of freedom constitutes a test for overdispersion.
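As a rough illustration of the diagnostic, the sketch below computes a Pearson chi-square statistic for a ZIP fit and divides it by the degrees of freedom. It assumes that the statistic is formed from the mean and variance of the ZIP mixture. The function names and the inputs are hypothetical, and the sketch is not the computation performed by any particular software package.

```python
def pearson_chi2_zip(counts, lams, omegas):
    """Sum of squared Pearson residuals, using the ZIP mean and variance for each observation."""
    total = 0.0
    for y, lam, omega in zip(counts, lams, omegas):
        mean = (1 - omega) * lam
        var = (1 - omega) * lam * (1 + omega * lam)
        total += (y - mean) ** 2 / var
    return total

def dispersion_ratio(counts, lams, omegas, n_params):
    """Pearson statistic divided by degrees of freedom (observations minus parameters)."""
    df = len(counts) - n_params
    return pearson_chi2_zip(counts, lams, omegas) / df

# Hypothetical counts and fitted values, for illustration only.
print(dispersion_ratio([0, 2, 0, 5, 1], [2.0] * 5, [0.3] * 5, n_params=2))
```

Values well above 1 point toward overdispersion that the ZIP model has not absorbed.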


We treat the variable camper as a categorical variable by including it in the class statement. This will also make post-estimation easier. We might want to compare the current zero-inflated negative binomial model with the plain negative binomial model, which can be done with, for example, the Vuong test.

Currently the Vuong test is not a standard part of proc genmod, but a macro program that performs the Vuong test is available from SAS. You can download this macro program by following the link and store it on your hard drive. This macro program takes quite a few arguments, shown below. We rerun the models to produce these required input arguments. We have also used the store statement to store the estimates so that we can do post-estimation using the same model via proc plm without having to rerun the model.

With the zero-inflated negative binomial model, there are a total of six parameters: the intercept, the regression coefficients for child and camper, and the dispersion parameter for the negative binomial portion of the model, as well as the intercept and regression coefficient for persons. The plain negative binomial regression model has a total of four parameters.

The scale parameters scale1 and scale2 are the dispersion parameters from each corresponding model. The output above shows the Vuong test followed by the Clarke sign test. The positive values of the Z statistics for the Vuong test indicate that it is the first model, the zero-inflated negative binomial model, that is closer to the true model.
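The idea behind both tests can be sketched from the per-observation log-likelihoods of the two fitted models. The Python code below is a minimal illustration and not the SAS macro: it applies no correction for the number of parameters, uses the divisor n for the variance, and all names and input values are hypothetical.

```python
import math

def vuong_z(loglik_1, loglik_2):
    """Unadjusted Vuong statistic; positive values favor model 1, negative values favor model 2."""
    n = len(loglik_1)
    diffs = [a - b for a, b in zip(loglik_1, loglik_2)]
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / n
    return math.sqrt(n) * mean / math.sqrt(var)

def clarke_sign_count(loglik_1, loglik_2):
    """Number of observations for which model 1 has the larger log-likelihood."""
    return sum(1 for a, b in zip(loglik_1, loglik_2) if a > b)

# Hypothetical per-observation log-likelihoods, for illustration only.
zinb_ll = [-1.0, -2.0, -3.0, -1.0]
nb_ll = [-2.0, -2.0, -5.0, -1.5]
print(vuong_z(zinb_ll, nb_ll), clarke_sign_count(zinb_ll, nb_ll))
```

The Vuong statistic looks at the average log-likelihood difference, while the Clarke test looks only at its sign for each observation. Because they use different information, the two tests can disagree on the same data.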

Both of these tests have the same null hypothesis, and it happens that the two tests are not consistent with each other, leading to weak support for the zero-inflated negative binomial model. First, we examine the distribution of the predicted probability of being an excessive zero by the number of persons in the group. We can see that the larger the group, the smaller the probability, meaning it is more likely that the person went fishing.

To get the predicted counts we have used the option ilink for inverse link. Notice that by default, SAS fixes the value of the predictor variable persons at its mean value. Next, we can also ask proc plm to plot the fitted values by the camper variable.

This page was updated using SAS 9.

OLS regression: count data are highly non-normal and are not well estimated by OLS regression.
Zero-inflated Poisson regression: zero-inflated Poisson regression does better when the data are not overdispersed.
Ordinary count models: Poisson or negative binomial models might be more appropriate if there are no excess zeros.

SAS zero-inflated negative binomial analysis using proc genmod

A zero-inflated model assumes that a zero outcome is due to two different processes.

Model Information: general information about the data set, outcome variable, distribution, and the number of observations used in the model.
Class Level Information: for each categorical variable, the number of levels and how the levels are coded. The last displayed level will be the reference group in the model. In this example, it will be 0.
Analysis Of Maximum Likelihood Parameter Estimates: the negative binomial part of the model, estimated using maximum likelihood.
Analysis Of Maximum Likelihood Zero Inflation Parameter Estimates: the logistic regression part of the model, for estimating the probability of being an excessive zero.

Looking through the results for the regression parameters, we see the following: The predictors child and camper in the negative binomial part of the model, which predicts the number of fish caught (count), are both significant. The predictor persons in the logit part of the model, which predicts excessive zeros, is statistically significant. For these data, the expected change in log count for a one-unit increase in child is given by its estimated coefficient in the count part of the model. The log odds of being an excessive zero decrease with every additional person in the group.

In other words, the more people in the group, the less likely it is that a zero is due to not having gone fishing. Put plainly, the larger the group the person was in, the more likely it is that the person went fishing. The estimate of the dispersion parameter is displayed with its confidence interval.
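The link between a negative logit coefficient and a falling probability of being an excessive zero can be illustrated with the inverse logit. The coefficients in this Python sketch are hypothetical placeholders and are not the estimates from the fitted model.

```python
import math

def prob_excess_zero(intercept, slope, persons):
    """Inverse logit of the linear predictor for the zero-inflation part of the model."""
    eta = intercept + slope * persons
    return 1.0 / (1.0 + math.exp(-eta))

# Hypothetical coefficients: a negative slope for persons, as described in the text.
for persons in (1, 2, 3, 4):
    print(persons, round(prob_excess_zero(1.5, -1.5, persons), 3))
```

With a negative slope, each additional person lowers the log odds by a fixed amount, so the probability of an excessive zero falls steadily as group size grows.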


There seems to be enough indication of overdispersion, meaning that the negative binomial model might be more appropriate. The question of the overdispersion parameter is in general a tricky one. A large overdispersion parameter could be due to a misspecified model or could be due to a real process with overdispersion.

We can see from the table of descriptive statistics above that the variance of the outcome variable is quite large relative to the mean. This might be an indication of overdispersion. A zero-inflated model assumes that a zero outcome is due to two different processes. If the person has not gone fishing, the only outcome possible is zero. The two parts of a zero-inflated model are a binary model, usually a logit model, to model which of the two processes the zero outcome is associated with, and a count model, in this case a negative binomial model, to model the count process.

The expected count is expressed as a combination of the two processes. The model is then fit in Stata in a way that makes post-estimation easier. We cannot include the vuong option when using robust standard errors. Using the robust option changes the model chi-square, which is now a Wald chi-square.
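The combination of the two processes can be written out as a short worked formula. The notation is assumed for illustration and does not appear in the original text: $\omega_i$ is the probability that observation $i$ is an excessive zero, $\mu_i$ is the mean of the count process, and $x_i$, $z_i$ are the predictors for the count and logit parts.

```latex
\mathrm{E}(Y_i) = \omega_i \cdot 0 + (1 - \omega_i)\,\mu_i = (1 - \omega_i)\,\mu_i,
\qquad
\omega_i = \frac{1}{1 + \exp(-z_i'\gamma)},
\qquad
\mu_i = \exp(x_i'\beta).
```

For example, if the count process has mean 4 and the probability of an excessive zero is 0.25, the expected count is $0.75 \times 4 = 3$.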

This statistic is based on log pseudo-likelihoods instead of log-likelihoods. The model is still statistically significant. We then look at the distribution of the predicted probability by the number of persons in the group. We can see that the larger the group, the smaller the probability, meaning it is more likely that the person went fishing.

Notice that by default the margins command fixed the expected predicted probability of being an excessive zero at its mean.

Version info: Code for this page was tested in Stata.

Examples of zero-inflated negative binomial regression

Example 1.

Example 2. The data set used in this example is from Stata.

Zero-inflated negative binomial regression

A zero-inflated model assumes that a zero outcome is due to two different processes.

The output begins with the iteration log, giving the values of the log likelihoods starting with a model that has no predictors. The last value in the log is the final value of the log likelihood for the full model and is repeated below. Next comes the header information. On the right-hand side the number of observations used is given along with the likelihood ratio chi-squared.

This compares the full model to a model without count predictors, giving a difference of two degrees of freedom. This is followed by the p-value for the chi-square. The model, as a whole, is statistically significant.

Following these are logit coefficients for predicting excess zeros along with their standard errors, z-scores, p-values and confidence intervals. Additionally, there will be an estimate of the natural log of the overdispersion coefficient, alpha, along with the untransformed value. If the alpha coefficient is zero, then the model is better estimated using a Poisson regression model.
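The role of alpha can be illustrated with the usual negative binomial variance function, in which the variance equals the mean plus alpha times the squared mean. The Python sketch below assumes that form. The function name and the input values are hypothetical.

```python
import math

def nb_variance(mu, ln_alpha):
    """Negative binomial variance mu + alpha * mu^2, with alpha recovered from its natural log."""
    alpha = math.exp(ln_alpha)
    return mu + alpha * mu ** 2

# Hypothetical values: the variance approaches the Poisson variance (mu) as alpha shrinks.
for ln_alpha in (0.0, -2.0, -10.0):
    print(ln_alpha, nb_variance(2.0, ln_alpha))
```

As alpha approaches zero the variance approaches the mean, which is why a zero alpha points to a Poisson-based model.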

Below the various coefficients you will find the results of the zip and vuong options. The zip option tests the zero-inflated negative binomial model versus the zero-inflated Poisson model. The Vuong test compares the zero-inflated negative binomial model with an ordinary negative binomial regression model. A significant z-test indicates that the zero-inflated model is preferred.

Looking through the regression parameters, we see the following: The predictors child and camper in the negative binomial part of the model, which predicts the number of fish caught (count), are both significant. The predictor persons in the logit part of the model, which predicts excessive zeros, is statistically significant.

Each group was questioned about how many fish they caught (count), how many children were in the group (child), how many people were in the group (persons), and whether or not they brought a camper to the park (camper). In addition to predicting the number of fish caught, there is interest in predicting the existence of excess zeros. We will use the variables child, persons, and camper in our model.

We can see from the table of descriptive statistics above that the variance of the outcome variable is quite large relative to the mean. This might be an indication of overdispersion. A zero-inflated model assumes that a zero outcome is due to two different processes. If a person has not gone fishing, the only possible outcome is zero. The two parts of a zero-inflated model are a binary model, usually a logit model, which models which of the two processes a zero outcome is associated with, and a count model, in this case a negative binomial model, which models the count process.
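The two-process idea can be written out as a small Python sketch. It assumes the mean-dispersion parameterization of the negative binomial (mean mu, dispersion alpha > 0); the function names and the parameter `pi_zero` (the probability of belonging to the always-zero process) are hypothetical labels introduced for illustration, not names used by Stata or SAS.

```python
import math


def nb_pmf(y, mu, alpha):
    """Negative binomial probability of count y (mean mu > 0, dispersion alpha > 0)."""
    r = 1.0 / alpha  # "size" parameter of the negative binomial
    log_p = (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
    )
    return math.exp(log_p)


def zinb_pmf(y, mu, alpha, pi_zero):
    """Zero-inflated negative binomial probability of count y.

    A zero can come from either process; a positive count can only
    come from the count process.
    """
    p = (1.0 - pi_zero) * nb_pmf(y, mu, alpha)
    if y == 0:
        p += pi_zero
    return p


def zinb_mean(mu, pi_zero):
    """Expected count: the count-process mean scaled by P(not an excess zero)."""
    return (1.0 - pi_zero) * mu


# Assumed illustrative values, not estimates from the fish data.
print(zinb_pmf(0, mu=2.0, alpha=1.5, pi_zero=0.25))
print(zinb_mean(mu=2.0, pi_zero=0.25))
```

With `pi_zero` set to zero, `zinb_pmf` reduces to the ordinary negative binomial, which is the comparison a test such as the Vuong test is meant to address.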

The expected count is expressed as a combination of the two processes. The SAS commands are shown below. We treat the variable camper as a categorical variable by including it in the class statement. This will also make the post-estimation easier. We might want to compare the current zero-inflated negative binomial model with the plain negative binomial model, which can be done with, for example, the Vuong test.

Currently the Vuong test is not a standard part of proc genmod, but a macro program that performs the Vuong test is available from SAS. You can download this macro program by following the link and store it on your hard drive. The macro takes quite a few arguments, shown below. We rerun the models to produce these required input arguments. We have also used the store statement to store the estimates so we can do post-estimation using the same model via proc plm without having to rerun the model.

The zero-inflated negative binomial model has a total of six parameters: the intercept, the regression coefficients for child and camper, and the dispersion parameter for the negative binomial portion of the model, plus the intercept and the regression coefficient for persons in the zero-inflation portion. The plain negative binomial regression model has a total of four parameters.

The scale parameters scale1 and scale2 are the dispersion parameters from each corresponding model. The output above shows the Vuong test followed by the Clarke sign test. The positive values of the Z statistics for the Vuong test indicate that it is the first model, the zero-inflated negative binomial model, which is closer to the true model.
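The sign convention of the Vuong statistic can be illustrated with a minimal Python sketch. This is not the SAS macro: it shows only the basic, unadjusted form of the statistic and a simple count used by a sign-type test, and it assumes that per-observation log-likelihood contributions from both fitted models are available as lists. The function names and the input values are hypothetical.

```python
import math


def vuong_z(loglik_1, loglik_2):
    """Unadjusted Vuong statistic from per-observation log likelihoods.

    Positive values favor model 1, negative values favor model 2.
    """
    if len(loglik_1) != len(loglik_2) or not loglik_1:
        raise ValueError("need two non-empty lists of equal length")
    diffs = [a - b for a, b in zip(loglik_1, loglik_2)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / n
    if variance == 0.0:
        raise ValueError("log-likelihood differences have zero variance")
    return math.sqrt(n) * mean / math.sqrt(variance)


def count_favoring_model_1(loglik_1, loglik_2):
    """Number of observations where model 1 has the higher log likelihood."""
    return sum(1 for a, b in zip(loglik_1, loglik_2) if a > b)


# Hypothetical per-observation log likelihoods (assumed values).
ll_zinb = [-1.0, -2.0, -1.5, -0.5]
ll_negbin = [-1.5, -2.1, -1.4, -1.0]

print(vuong_z(ll_zinb, ll_negbin))
print(count_favoring_model_1(ll_zinb, ll_negbin))
```

Swapping the two arguments flips the sign of the statistic, which is why the order in which the models are passed to a Vuong routine matters when reading the output.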

Both of these tests have the same null hypothesis, and it happens that the two tests are not consistent with each other, leading to only weak support for the zero-inflated negative binomial model. First, we examine the distribution of the predicted probability of being an excessive zero by the number of persons in the group. We can see that the larger the group, the smaller the probability, meaning it is more likely that the person went fishing.
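The relationship between group size and the predicted probability of an excessive zero follows from the logit part of the model, and can be sketched in Python as below. The intercept and slope are hypothetical round numbers chosen for illustration, not the fitted estimates; the only property carried over from the text is that the coefficient for persons is negative.

```python
import math


def prob_excess_zero(persons, intercept, slope):
    """Inverse logit of the linear predictor for the zero-inflation part."""
    eta = intercept + slope * persons
    return 1.0 / (1.0 + math.exp(-eta))


# Hypothetical coefficients (assumptions, not model output).
HYPOTHETICAL_INTERCEPT = 1.5
HYPOTHETICAL_SLOPE = -1.5  # negative: larger groups are less likely to be excess zeros

probs = [
    prob_excess_zero(p, HYPOTHETICAL_INTERCEPT, HYPOTHETICAL_SLOPE)
    for p in (1, 2, 3, 4)
]
for persons, prob in zip((1, 2, 3, 4), probs):
    print(f"persons={persons}  P(excess zero)={prob:.3f}")
```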

To get the predicted counts, we have used the option ilink for inverse link. Notice that by default, SAS fixes the value of the predictor variable persons at its mean value. Next, we can also ask proc plm to plot the fitted values by the camper variable.

This page was updated using SAS 9.

SAS zero-inflated negative binomial analysis using proc genmod

A zero-inflated model assumes that a zero outcome is due to two different processes.

Model Information: General information about the data set, outcome variable, distribution and the number of observations used in the model.

Class Level Information: For each categorical variable, the number of levels and how the levels are coded. The last displayed level will be the reference group in the model.

Zero-inflated negative binomial regression is for modeling count variables with excessive zeros, and it is usually used for overdispersed count outcome variables.

Furthermore, theory suggests that the excess zeros are generated by a separate process from the count values and that the excess zeros can be modeled independently. Please note: The purpose of this page is to show how to use various data analysis commands. It does not cover all aspects of the research process which researchers are expected to do. In particular, it does not cover data cleaning and checking, verification of assumptions, model diagnostics or potential follow-up analyses.

School administrators study the attendance behavior of high school juniors at two schools. Predictors of the number of days of absence include gender of the student and standardized test scores in math and language arts. The state wildlife biologists want to model how many fish are being caught by fishermen at a state park.

Visitors are asked how long they stayed, how many people were in the group, whether there were children in the group, and how many fish were caught. Some visitors do not fish, but there is no data on whether a person fished or not. Some visitors who did fish did not catch any fish, so there are excess zeros in the data because of the people that did not fish.

We have data on groups that went to a park. Each group was questioned before leaving the park about how many fish they caught (count), how many children were in the group (child), how many people were in the group (persons), and whether or not they brought a camper to the park (camper). The outcome variable of interest will be the number of fish caught. Even though the question about the number of fish caught was asked of everyone, it does not mean that everyone went fishing.

What would be the reason for someone to report a zero count? If a person did not go fishing, the only possible outcome is zero. Otherwise, if a person went fishing, the count could be zero or non-zero. We will start with reading in the data and the descriptive statistics and plots. This helps us understand the data and gives us some hint on how we should model the data.

The expected count is expressed as a combination of the two processes. The Stata command is shown below. This will make the post estimations easier. We cannot include the vuong option when using robust standard errors. Using the robust option has resulted in some change in the model chi-square, which is now a Wald chi-square. This statistic is based on log pseudo-likelihoods instead of log-likelihoods.
