
Understanding Poisson Distribution and Modeling in R

  • Writer: Gamze Bulut
  • Mar 27
  • 2 min read

This week in my 602 course I learned about the Poisson distribution. The Poisson distribution is a fundamental tool in statistics for modeling count data — situations where the outcome is the number of times an event occurs in a fixed interval of time, space, or another dimension. This distribution assumes that events happen independently and at a constant average rate.


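The Poisson distribution has a single parameter, its mean, which is also its variance. It arises as the limit of a binomial distribution when the number of trials is large and the probability of an event in each trial is small. As the mean increases, the Poisson distribution itself is increasingly well approximated by a normal distribution.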
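
These relationships can be checked numerically. The following is a minimal sketch in base R; the specific values of lambda, the number of trials, and the ranges of counts are arbitrary choices for illustration.

```r
# Simulated Poisson counts: the sample mean and variance should both be near lambda
set.seed(1)
x <- rpois(1e5, lambda = 4)
c(mean = mean(x), variance = var(x))

# Binomial with many trials and a small event probability vs. a Poisson
# with the same mean (1000 * 0.002 = 2)
k <- 0:10
max_diff_binom <- max(abs(dbinom(k, size = 1000, prob = 0.002) - dpois(k, lambda = 2)))

# Poisson with a large mean vs. a normal with the same mean and variance
k_large <- 60:140
max_diff_norm <- max(abs(dpois(k_large, lambda = 100) -
                         dnorm(k_large, mean = 100, sd = sqrt(100))))

# Both maximum differences in probability should be small
c(binomial_vs_poisson = max_diff_binom, poisson_vs_normal = max_diff_norm)
```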


Here is a photo of Siméon Denis Poisson, the French mathematician the distribution is named after. 🙏


When to Use Poisson Models?


Poisson models are used when:

  • The outcome variable is a count (e.g., number of ER visits, crashes, phone calls).

  • Events are rare and occur independently.

  • The variance of the outcome is equal to its mean (equidispersion).


Using "person-time" as the measure of exposure lets each person's follow-up time contribute in proportion to how long they were actually observed. Below we see 18 person-years in total.
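
To show how person-time turns counts into a rate, here is a small sketch in R. The follow-up times and event counts are hypothetical values made up for illustration (chosen only so that they add up to 18 person-years); they are not taken from the figure.

```r
# Hypothetical follow-up times in years for six people (illustrative values)
follow_up_years <- c(5, 2, 4, 1, 3, 3)
# Hypothetical number of events observed for each person
events <- c(1, 0, 2, 0, 0, 1)

# Total person-time at risk
person_years <- sum(follow_up_years)

# Incidence rate: events per person-year of follow-up
incidence_rate <- sum(events) / person_years

# The same rate expressed per 100 person-years
incidence_rate * 100
```

Someone followed for five years counts five times as much in the denominator as someone followed for one year, which is the sense in which every person's follow-up time contributes.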



Modeling Poisson in R


Here is a basic example using R:

# Simulated data
set.seed(123)
data <- data.frame(
  age = sample(20:60, 100, replace = TRUE),
  gender = sample(c("M", "F"), 100, replace = TRUE),
  exposure = runif(100, 1, 10)
)
data$events <- rpois(100, lambda = 0.3 * data$exposure)

# Fit Poisson regression
model <- glm(events ~ age + gender + offset(log(exposure)), 
             data = data, family = poisson)
summary(model)

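This fits a Poisson regression with an offset, log(exposure), to account for varying exposure time. With the offset in place, the coefficients describe the event rate per unit of exposure rather than the raw count.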
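
Because the model uses a log link, the coefficients are on the log scale. One common way to read them, sketched below, is to exponentiate them into incidence rate ratios. The block repeats the simulation from this post so that it runs on its own; the Wald-type intervals from confint.default() are one option among several, not the only way to get intervals.

```r
# Same simulated data and model as in the example above
set.seed(123)
data <- data.frame(
  age = sample(20:60, 100, replace = TRUE),
  gender = sample(c("M", "F"), 100, replace = TRUE),
  exposure = runif(100, 1, 10)
)
data$events <- rpois(100, lambda = 0.3 * data$exposure)
model <- glm(events ~ age + gender + offset(log(exposure)),
             data = data, family = poisson)

# Exponentiate to move from the log scale to rate ratios
irr <- exp(coef(model))

# Wald-type 95% confidence intervals, also on the rate-ratio scale
ci <- exp(confint.default(model))

round(cbind(IRR = irr, ci), 3)
```

In this simulation the true rate does not depend on age or gender, so their rate ratios should land close to 1, while the exponentiated intercept estimates the baseline rate per unit of exposure.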


Problem 1: Overdispersion


A core assumption of Poisson regression is that the variance equals the mean. However, in real-world data the variance often exceeds the mean. This is called overdispersion. It can happen if the model is missing an interaction term, treats a nonlinear covariate effect as linear, or omits an important variable.


Symptoms of Overdispersion:


  • Residual deviance much greater than the residual degrees of freedom

  • Standard errors that are too small, so effects look more significant than they really are

  • Poor model fit
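
A rough way to check the first symptom is to divide the sum of squared Pearson residuals by the residual degrees of freedom. The sketch below compares that ratio for Poisson data and for deliberately overdispersed data. The helper name dispersion_ratio and the simulation settings are assumptions made for this illustration.

```r
set.seed(123)
n <- 500
exposure <- runif(n, 1, 10)

# One outcome that really is Poisson, one that is overdispersed
# (negative binomial with the same mean)
y_pois <- rpois(n, lambda = 0.3 * exposure)
y_over <- rnbinom(n, size = 1, mu = 0.3 * exposure)

# Hypothetical helper: Pearson chi-square divided by residual degrees of freedom
dispersion_ratio <- function(fit) {
  sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
}

# Fit the same Poisson model to both outcomes
fit_pois <- glm(y_pois ~ offset(log(exposure)), family = poisson)
fit_over <- glm(y_over ~ offset(log(exposure)), family = poisson)

c(poisson_data = dispersion_ratio(fit_pois),
  overdispersed_data = dispersion_ratio(fit_over))
```

A ratio near 1 is consistent with equidispersion, while a ratio well above 1 points to overdispersion.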


Fixing Overdispersion:


One common way is to use:

  • Negative Binomial regression: Adds a dispersion parameter.

library(MASS)
nb_model <- glm.nb(events ~ age + gender + offset(log(exposure)), 
                   data = data)
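
To see what the extra dispersion parameter buys, the sketch below fits both models to data that are overdispersed by construction and compares them. The simulation settings (sample size, size = 1.5) are assumptions chosen for illustration, and AIC is just one possible way to compare the fits.

```r
library(MASS)

set.seed(123)
n <- 200
sim <- data.frame(
  age = sample(20:60, n, replace = TRUE),
  exposure = runif(n, 1, 10)
)
# Overdispersed counts: negative binomial with mean 0.3 * exposure
sim$events <- rnbinom(n, size = 1.5, mu = 0.3 * sim$exposure)

# Same formula, two different distributional assumptions
pois_fit <- glm(events ~ age + offset(log(exposure)),
                data = sim, family = poisson)
nb_fit <- glm.nb(events ~ age + offset(log(exposure)), data = sim)

# Lower AIC indicates the better-fitting model
AIC(pois_fit, nb_fit)

# Estimated dispersion parameter (theta); smaller values mean more overdispersion
nb_fit$theta

# Compare the standard errors of the age coefficient under the two models
c(poisson = summary(pois_fit)$coefficients["age", "Std. Error"],
  negbin  = summary(nb_fit)$coefficients["age", "Std. Error"])
```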

Problem 2: Zero Inflation


Some count data have more zeros than a Poisson model would predict. This is called zero inflation, and it is common in biological, health, and social data. Consider the red graph below, which shows the number of hospital visits in a year. Some patients never go to see a doctor, while others just happened not to go this year. Both groups record a zero, so the zeros are inflated.
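
A simple check for excess zeros is to compare the number of zeros in the data with the number a fitted Poisson model would expect. The sketch below uses simulated visits; the 30% share of "always zero" people and the Poisson mean of 2 are assumptions chosen for illustration.

```r
set.seed(123)
n <- 1000

# Assumption: 30% of people are "always zero"; the rest follow Poisson(2)
always_zero <- rbinom(n, size = 1, prob = 0.3)
visits <- ifelse(always_zero == 1, 0, rpois(n, lambda = 2))

# Intercept-only Poisson fit to the pooled data
fit <- glm(visits ~ 1, family = poisson)

# Zeros the fitted Poisson model expects vs. zeros actually observed
expected_zeros <- sum(dpois(0, lambda = fitted(fit)))
observed_zeros <- sum(visits == 0)

c(observed = observed_zeros, expected_under_poisson = expected_zeros)
```

When the observed count of zeros is clearly larger than the expected one, a plain Poisson model is probably missing a second source of zeros.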



Solution: Zero-Inflated Models


Zero-Inflated Poisson (ZIP) and Zero-Inflated Negative Binomial (ZINB) models use a two-part process:

  1. A logistic model for the excess zeros (who is always zero, versus who just happened to be zero)

  2. A count model (Poisson or Negative Binomial) for everyone else, which can still produce zeros by chance; the two parts are fitted together

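library(pscl)
# Left of "|": count model, with the same exposure offset as the models above
# Right of "|": intercept-only logistic model for the excess zeros
zip_model <- zeroinfl(events ~ age + gender + offset(log(exposure)) | 1,
                      data = data, dist = "poisson")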
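
To make the two-part structure concrete, here is a minimal base R sketch of the zero-inflated Poisson probabilities. The function name dzip and its arguments pi_zero and lambda are hypothetical, chosen for this illustration; they are not part of pscl.

```r
# Probability mass function of a zero-inflated Poisson.
# pi_zero: probability of belonging to the "always zero" group
# lambda:  Poisson mean for everyone else
dzip <- function(k, pi_zero, lambda) {
  ifelse(k == 0,
         # A zero can come from either group
         pi_zero + (1 - pi_zero) * dpois(0, lambda),
         # A positive count can only come from the Poisson group
         (1 - pi_zero) * dpois(k, lambda))
}

# Probability of a zero under the mixture vs. under a plain Poisson
p_zero_zip <- dzip(0, pi_zero = 0.3, lambda = 2)
p_zero_pois <- dpois(0, lambda = 2)
c(zero_inflated = p_zero_zip, plain_poisson = p_zero_pois)

# Sanity check: the probabilities add up to (essentially) 1
total_prob <- sum(dzip(0:100, pi_zero = 0.3, lambda = 2))
total_prob
```

The zero probability mixes both sources of zeros, which is why the logistic part and the count part have to be estimated jointly rather than one after the other.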

Summary


Poisson regression is powerful for modeling rates and counts, and incidence rates in epidemiology are a big application area for it, but real-world data can violate its assumptions. Be on the lookout for overdispersion and zero inflation, and use alternative models like Negative Binomial or Zero-Inflated Poisson when needed.


I hope you find this post helpful! Let's keep learning!
