Binary data in state space models

Last updated on Nov 18, 2022 3 min read

Generate some data

Here we specify a bivariate latent process where the 1st process affects the 2nd, and there are stable individual differences in the processes.

library(ctsem)
gm <- ctModel(LAMBDA=diag(2), #diagonal factor loading, 2 latents 2 observables
  Tpoints = 50,
  DRIFT=matrix(c(-1,.5,0,-1),2,2), #temporal dynamics
  TRAITVAR = diag(.5,2), #stable latent intercept variance (cholesky factor)
  DIFFUSION=diag(2)) #within person covariance 

ctModelLatex(gm) #to view latex system equations

#when generating data, free pars are set to 0
d <- data.frame(ctGenerate(ctmodelobj = gm,n.subjects = 100,
  burnin = 20,dtmean = .1))

We didn’t specify any measurement error in the above code, because we’re going to mix the regular Gaussian approach with binary data measurements:

d$Y2binary <-rbinom(n = nrow(d),size = 1, #create binary data based on the latent
  prob = ctsem::inv_logit(d$Y2))
d$Y1 <- d$Y1 + rnorm(nrow(d),0,.2) #gaussian measurement error

Now we setup our model for estimation – we no longer hard code the values of the simulation! We also fix the MANIFESTMEANS (measurement intercept) and free the CINT (continuous / latent process intercept) because that will be more appropriate if there are stable differences in the latent process. In Gaussian cases this choice of how to identify the model very often makes no difference, but it is more crucial with non-linearity. Note that there are many system matrices left unspecified, which will be freely estimated by default, and note also that correlated individual differences are enabled by default for all intercept terms (in this case, inital latent process intercepts, and continuous intercepts). The final step is to specify the measurement model by altering the model object directly – once the binary feature gets more development time I will probably create a nicer approach for this. 0 for the first manifest specifies that it is Gaussian, while the 1 specifies that our second variable (listed in manifestNames) is binary.

m <- ctModel(LAMBDA=diag(2),type='stanct',
  manifestNames = c('Y1','Y2binary'),
  MANIFESTMEANS = 0,
  CINT=c('cint1','cint2'))
##      [,1]
## [1,]  "0"
## [2,]  "0"
##      [,1]   
## [1,] "cint1"
## [2,] "cint2"

m$manifesttype=c(0,1)

Then we fit and summarise the model – note that at time of writing the latex output doesn’t know about the binary measurement model for the 2nd observable, and incorrectly shows the errors as Gaussian.

f <- ctStanFit(datalong = d,ctstanmodel = m,cores=6)
s=summary(f)
print(s$popmeans)
##                    mean     sd    2.5%     50%   97.5%
## T0m_eta1         0.1138 0.0907 -0.0650  0.1168  0.2798
## T0m_eta2        -0.1224 0.1155 -0.3534 -0.1259  0.1065
## drift_eta1      -0.9252 0.1277 -1.1902 -0.9146 -0.7104
## drift_eta1_eta2  0.1157 0.1728 -0.2206  0.1177  0.4593
## drift_eta2_eta1  1.0519 0.2989  0.4570  1.0510  1.6362
## drift_eta2      -1.0935 0.3588 -1.9238 -1.0557 -0.5062
## diff_eta1        0.9664 0.0289  0.9075  0.9655  1.0237
## diff_eta2_eta1  -0.0114 0.0762 -0.1605 -0.0081  0.1318
## diff_eta2        0.8806 0.1876  0.5844  0.8568  1.2874
## mvarY1           0.2016 0.0070  0.1878  0.2017  0.2157
## cint1            0.0238 0.0621 -0.0940  0.0252  0.1442
## cint2           -0.0990 0.0883 -0.2758 -0.0991  0.0782
ctModelLatex(f)

Then add a few plots showing temporal regression coefficients conditional on time interval, measurement expectations conditional on estimated parameters and past data, and finally latent states conditional on estimated parameters and both past and future data.

ctStanDiscretePars(f,plot=T)


ctKalman(f,subjects = c(2,5),plot=T)

ctKalman(f,subjects = c(2,5),plot=T,kalmanvec=c('etasmooth'))

Binary data in state space models

Generate some data

Charles Driver

Research Scientist