
Journal of Statistical Software May 2006, Volume 16, Issue 1.

http://www.jstatsoft.org/

Formulating State Space Models in R with Focus on Longitudinal Regression Models

Claus Dethlefsen (Aalborg Hospital, Aarhus University Hospital)

Søren Lundbye-Christensen (Aalborg University)

Abstract

We provide a language for formulating a range of state space models with response densities within the exponential family. The described methodology is implemented in the R-package sspir. A state space model is specified similarly to a generalized linear model in R, and then the time-varying terms are marked in the formula. Special functions for specifying polynomial time trends, harmonic seasonal patterns, unstructured seasonal patterns and time-varying covariates can be used in the formula. The model is fitted to data using iterated extended Kalman filtering, but the formulation of models does not depend on the implemented method of inference. The package is demonstrated on three datasets.

Keywords: dynamic models, exponential family, generalized linear models, iterated extended Kalman smoothing, Kalman filtering, seasonality, time series, trend.

1. Introduction

Generalized linear models, see McCullagh and Nelder (1989), are used when analyzing data where response densities are assumed to belong to the exponential family. Time series of counts may adequately be described by such models. However, if serial correlation is present or if the observations are overdispersed, these models may not be adequate, and several approaches can be taken. The book by Diggle, Heagerty, Liang, and Zeger (2002) gives an excellent review of many approaches incorporating serial correlation and overdispersion in generalized linear models.

Dynamic generalized linear models (DGLM), often called state space models, also address those problems and are treated in a paper by West, Harrison, and Migon (1985) in a conjugate Bayesian setting. They have been subject to further research by e.g. Zeger (1988) using generalized estimating equations (GEE), Gamerman (1998) using Markov chain Monte Carlo (MCMC) methods and Durbin and Koopman (1997) using iterated extended Kalman filtering and importance sampling.


Standard statistical software does not include procedures for DGLMs and offers only sparse support for Gaussian state space models. There is a need for a simple, yet flexible way of specifying complicated non-Gaussian state space models. Often, one needs to tailor-make software for each specific application. A function, StructTS, has been developed for analysis of a subclass of Gaussian state space models, see Ripley (2002). The binary library SsfPack for Ox may be used freely for academic research and provides a tool set for analysis of Gaussian state space models with some support for non-Gaussian models, see Koopman, Shephard, and Doornik (1999). The interface is very flexible, but not as easy to use as a glm call in R.

Section 2 describes Gaussian state space models and shows how generalized linear models can naturally be extended to allow the parameters to evolve over time. We define components (e.g. trend and seasonal components) that separate the time series into parts that may be inspected individually after analysis. In Section 3 the syntax for defining objects describing the proposed state space models is described as a simple, yet powerful, extension to the glm call in R (R Development Core Team 2006). The techniques are illustrated on three examples in Section 4.

2. State space models

The Gaussian state space model for univariate observations involves two processes, namely the state process (or latent process), $\{\boldsymbol{\theta}_k\}$, and the observation process, $\{y_k\}$. The random variation in the state space model is specified through descriptions of the sampling distribution, the evolution of the state vector, and the initialization of the state vector. Let $\{y_k\}$ be measured at timepoints $t_k$ for $k = 1, \ldots, n$. The state space model is defined by

\[ y_k = \mathbf{F}_k^\top \boldsymbol{\theta}_k + \nu_k, \qquad \nu_k \sim N(0, V_k) \tag{1} \]

\[ \boldsymbol{\theta}_k = \mathbf{G}_k \boldsymbol{\theta}_{k-1} + \boldsymbol{\omega}_k, \qquad \boldsymbol{\omega}_k \sim N_p(\mathbf{0}, \mathbf{W}_k) \tag{2} \]

\[ \boldsymbol{\theta}_0 \sim N_p(\mathbf{m}_0, \mathbf{C}_0). \tag{3} \]
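As an illustration of equations (1) to (3), the following minimal sketch simulates the simplest special case, a local level model. The choices $F_k = 1$, $G_k = 1$, constant $V_k$ and $W_k$, and all numeric values are assumptions made for this example only; the code uses base R and does not rely on sspir.

```r
# Simulate a univariate Gaussian state space model (local level special case).
# Assumed for illustration: F_k = 1, G_k = 1, constant V and W, state dimension p = 1.
set.seed(1)
n  <- 100
V  <- 1.0   # observation variance V_k
W  <- 0.1   # evolution variance W_k
m0 <- 0     # prior mean of theta_0
C0 <- 1     # prior variance of theta_0

theta <- numeric(n)
y     <- numeric(n)
prev  <- rnorm(1, mean = m0, sd = sqrt(C0))        # theta_0 ~ N(m0, C0), eq. (3)
for (k in seq_len(n)) {
  theta[k] <- prev + rnorm(1, sd = sqrt(W))        # state equation, eq. (2)
  y[k]     <- theta[k] + rnorm(1, sd = sqrt(V))    # observation equation, eq. (1)
  prev     <- theta[k]
}
```

The vector `theta` holds the latent state path and `y` the observations generated from it.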

We assume that the disturbances $\{\nu_k\}$ and $\{\boldsymbol{\omega}_k\}$ are both serially independent and also independent of each other. The possibly time-dependent quantities $\mathbf{F}_k$, $\mathbf{G}_k$, $V_k$ and $\mathbf{W}_k$ may depend on a parameter vector, but this is suppressed in the notation. We now consider the case where the state process is Gaussian and the sampling distribution belongs to the exponential family,

\[ p(y_k \mid \eta_k) = \exp\{ y_k \eta_k - b_k(\eta_k) + c_k(y_k) \}. \tag{4} \]

The density (4) contains the Gaussian, Poisson, gamma and the binomial distributions as special cases. The natural parameter $\eta_k$ is related to the linear predictor $\lambda_k$ by the equation $\eta_k = v(\lambda_k)$ or equivalently $\lambda_k = u(\eta_k)$. The linear predictor in a generalized linear model is of the form $\lambda_k = \mathbf{Z}_k \boldsymbol{\beta}$, where $\mathbf{Z}_k$ is a row vector of explanatory variables and $\boldsymbol{\beta}$ is the vector of regression parameters. The link function, $g$, relates the mean, $E(y_k) = \mu_k$, and the linear predictor, $\lambda_k$, as $g(\mu_k) = \lambda_k$. The inverse link function, $h$, is defined as $\mu_k = \tau(\eta_k) = h(\lambda_k)$, where $\tau$ is the mean value mapping. The following relations hold: $\eta_k = v(\lambda_k) = \tau^{-1}(h(\lambda_k))$ and $\lambda_k = u(\eta_k) = g(\tau(\eta_k))$, where $u$ is the inverse of $v$. The link function is said to be canonical if $\eta_k = \lambda_k$, i.e. if $g = \tau^{-1}$.
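The relation between link function and natural parameter can be checked numerically with the family objects in base R. The sketch below compares the log-link of the Poisson family (canonical, since the natural parameter is $\log \mu$) with a probit link for binary data (not canonical, since the natural parameter is the logit). The numeric values are arbitrary and chosen only for illustration.

```r
# Poisson with log-link: g(mu) = log(mu) equals the natural parameter eta = log(mu).
fam    <- poisson(link = "log")
mu     <- c(0.5, 2, 10)
lambda <- fam$linkfun(mu)       # linear predictor, g(mu)
eta    <- log(mu)               # natural parameter, tau^{-1}(mu)
canonical_poisson <- isTRUE(all.equal(lambda, eta))

# The inverse link h recovers the mean from the linear predictor.
mu_back <- fam$linkinv(lambda)  # h(lambda) = exp(lambda)

# Binary data with probit link: g(p) = qnorm(p), while eta = log(p / (1 - p)).
fam2    <- binomial(link = "probit")
p       <- c(0.1, 0.5, 0.9)
lambda2 <- fam2$linkfun(p)
eta2    <- qlogis(p)
canonical_probit <- isTRUE(all.equal(lambda2, eta2))
```

Here `canonical_poisson` is TRUE and `canonical_probit` is FALSE, in line with the definition $g = \tau^{-1}$.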


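2.1. Dynamic extension

The static generalized linear model is extended by adding a dynamic term, $\mathbf{X}_k \boldsymbol{\beta}_k$, to the linear predictor, where $\boldsymbol{\beta}_k$ varies randomly over time according to a first order Markov process. Hence,

\[ \lambda_k = \mathbf{Z}_k \boldsymbol{\beta} + \mathbf{X}_k \boldsymbol{\beta}_k, \tag{5} \]

where $\boldsymbol{\beta}$ is the coefficient of the static component and $\{\boldsymbol{\beta}_k\}$ are the time-varying coefficients of the dynamic component. For notational convenience, we will use the notation

\[ \lambda_k = \mathbf{F}_k^\top \boldsymbol{\theta}_k, \qquad \boldsymbol{\theta}_k = \begin{bmatrix} \boldsymbol{\beta} \\ \boldsymbol{\beta}_k \end{bmatrix}. \tag{6} \]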

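The evolution through time of the state vector, $\boldsymbol{\theta}_k$, is modelled by the relation

\[ \boldsymbol{\theta}_k = \mathbf{G}_k \boldsymbol{\theta}_{k-1} + \boldsymbol{\omega}_k, \tag{7} \]

for an evolution matrix $\mathbf{G}_k$, determined by the model. The error terms, $\{\boldsymbol{\omega}_k\}$, are assumed to be independent Gaussian variables with zero mean and variance $\operatorname{VAR}(\boldsymbol{\omega}_k)$, with non-zero entries corresponding to the entries of the time-varying coefficients, $\boldsymbol{\beta}_k$, and zero elsewhere. The model is fully specified by the initializing parameters $\mathbf{m}_0$ and $\mathbf{C}_0$, the matrices $\mathbf{F}_k$, $\mathbf{G}_k$, and the variance parameters $V_k$ and $\operatorname{VAR}(\boldsymbol{\omega}_k)$. The variances may be parametrized as e.g. $\operatorname{VAR}(\boldsymbol{\omega}_k) = \psi \cdot \operatorname{diag}(1, 0, 0, 1, 1)$ or $\operatorname{VAR}(\boldsymbol{\omega}_k) = \operatorname{diag}(\psi_1, \psi_2, \psi_2)$.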
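The stacking in (6) and the two variance parametrizations can be written out in a few lines of base R. This is a sketch with made-up dimensions (two static coefficients, one dynamic coefficient) and made-up values; none of the names below are part of sspir.

```r
# State vector theta_k = (beta, beta_k): two static and one dynamic coefficient.
Zk     <- c(1, 0.3)            # static covariates (row vector Z_k), hypothetical values
Xk     <- 1.7                  # dynamic covariate X_k, hypothetical value
Fk     <- c(Zk, Xk)            # F_k stacks Z_k and X_k to match theta_k
theta  <- c(0.5, -1, 2)        # (beta_1, beta_2, beta_k)
lambda <- sum(Fk * theta)      # linear predictor F_k' theta_k, eqs. (5) and (6)

# Evolution variance: zero for static entries, non-zero for the dynamic entry.
psi <- 0.05
W1  <- psi * diag(c(0, 0, 1))  # one common scale, psi * diag(...)

# Separate parameters, VAR = diag(psi1, psi2, psi2). In R the diagonal has to be
# passed as a single vector: diag(c(a, b, c)), not diag(a, b, c).
psi12 <- c(0.2, 0.01)
W2    <- diag(c(psi12[1], psi12[2], psi12[2]))
```

A zero on the diagonal of the variance matrix keeps the corresponding coefficient constant over time, which is how static and dynamic terms coexist in one state vector.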

2.2. Inferential procedures

For a Gaussian state space model, we write $\boldsymbol{\theta}_k \mid D_k \sim N_p(\mathbf{m}_k, \mathbf{C}_k)$, where $D_k$ is all information available at time $t_k$. The Kalman filter recursively yields $\mathbf{m}_k$ and $\mathbf{C}_k$ with the recursion starting in $\boldsymbol{\theta}_0 \sim N_p(\mathbf{m}_0, \mathbf{C}_0)$. Assessment of the state vector, $\boldsymbol{\theta}_k$, using all available information, $D_n$, is called Kalman smoothing and we write $\boldsymbol{\theta}_k \mid D_n \sim N_p(\tilde{\mathbf{m}}_k, \tilde{\mathbf{C}}_k)$. Starting with $\tilde{\mathbf{m}}_n = \mathbf{m}_n$ and $\tilde{\mathbf{C}}_n = \mathbf{C}_n$, the Kalman smoother is a backwards recursion in time, $k = n - 1, \ldots, 1$.

For exponential family sampling distributions, the iterated extended Kalman filter yields an approximation to the conditional distribution of the state vector given $D_n$, see e.g. Durbin and Koopman (2000). By Taylor expansion, the sampling distribution (4) is approximated with a Gaussian density, giving an approximating Gaussian state space model. The conditional distributions of the state vector given $D_n$ in the exact model and in the Gaussian approximation have the same mode. The iterated extended Kalman filter is used as filter and smoother method in sspir.
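For the Gaussian case, the forward filter and backward smoother can be sketched in base R as follows. This is the textbook recursion for a univariate response with time-invariant system matrices, written for illustration; it is not the sspir implementation, and the function name `kalman` and its arguments are assumptions of this example. The iterated extended version for non-Gaussian responses is not shown.

```r
# Kalman filter and smoother for y_k = F' theta_k + nu_k, theta_k = G theta_{k-1} + omega_k.
# Assumptions: univariate y, time-invariant F (vector), G, W (p x p matrices), scalar V.
kalman <- function(y, Fv, G, V, W, m0, C0) {
  n <- length(y); p <- length(m0)
  m <- a <- matrix(0, n, p)                 # filtered means m_k, one-step predictions a_k
  C <- R <- array(0, c(p, p, n))            # filtered variances C_k, predicted variances R_k
  mk <- m0; Ck <- C0
  for (k in seq_len(n)) {                   # forward recursion
    a[k, ]   <- G %*% mk
    R[, , k] <- G %*% Ck %*% t(G) + W
    Rk <- matrix(R[, , k], p, p)
    f  <- sum(Fv * a[k, ])                  # forecast mean of y_k
    Q  <- drop(t(Fv) %*% Rk %*% Fv) + V     # forecast variance of y_k
    A  <- Rk %*% Fv / Q                     # Kalman gain (p x 1)
    mk <- a[k, ] + drop(A) * (y[k] - f)
    Ck <- Rk - A %*% t(A) * Q
    m[k, ] <- mk; C[, , k] <- Ck
  }
  ms <- m; Cs <- C                          # smoother starts at m_n, C_n
  for (k in rev(seq_len(n - 1))) {          # backward recursion, k = n-1, ..., 1
    Ck <- matrix(C[, , k], p, p)
    R1 <- matrix(R[, , k + 1], p, p)
    B  <- Ck %*% t(G) %*% solve(R1)
    ms[k, ]   <- m[k, ] + B %*% (ms[k + 1, ] - a[k + 1, ])
    Cs[, , k] <- Ck + B %*% (matrix(Cs[, , k + 1], p, p) - R1) %*% t(B)
  }
  list(m = m, C = C, ms = ms, Cs = Cs)
}

# Usage on a simulated local level series with a diffuse initial variance.
set.seed(2)
y   <- cumsum(rnorm(50, sd = 0.3)) + rnorm(50)
fit <- kalman(y, Fv = 1, G = matrix(1), V = 1, W = matrix(0.09),
              m0 = 0, C0 = matrix(1e6))
```

At the last timepoint the smoothed and filtered moments coincide, and at earlier timepoints the smoothed variance is no larger than the filtered variance, because the smoother conditions on more data.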

2.3. Decomposition

The variation in the linear predictor, random or not, may be decomposed into four components: a time trend ($T_k$), harmonic seasonal patterns ($H_k$), unstructured seasonal patterns ($S_k$), and a regression with possibly time-varying covariates ($R_k$). Each component may contain static and/or dynamic components, which is specified by zero and non-zero diagonal elements in $\operatorname{VAR}(\boldsymbol{\omega}_k)$, respectively, as described in the following.


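The block-diagonal evolution matrix takes the form

\[ \mathbf{G}_k = \begin{bmatrix} \mathbf{G}_k^{(1)} & & & \\ & \mathbf{I} & & \\ & & \mathbf{G}_k^{(3)} & \\ & & & \mathbf{I} \end{bmatrix}, \]

where $\mathbf{G}_k^{(1)}$ is defined in (9), and $\mathbf{G}_k^{(3)}$ in (12). The components are only present if the model includes the corresponding terms. The linear predictor,

\[ \lambda_k = \mathbf{T}_k \boldsymbol{\theta}_k^{(1)} + \mathbf{H}_k \boldsymbol{\theta}_k^{(2)} + \mathbf{S}_k \boldsymbol{\theta}_k^{(3)} + \mathbf{R}_k \boldsymbol{\theta}_k^{(4)} = T_k + H_k + S_k + R_k, \]

will be detailed in the following.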
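To make the block structure concrete, the sketch below assembles an evolution matrix from four blocks in base R. The helper `bdiag_base` is a hypothetical name introduced here, and the block sizes (a linear trend with unit time step, one harmonic, a seasonal block with period 4, one regression coefficient) are arbitrary choices for the example.

```r
# Place square blocks along the diagonal of a zero matrix.
bdiag_base <- function(...) {
  blocks <- list(...)
  p   <- sum(vapply(blocks, nrow, integer(1)))
  out <- matrix(0, p, p)
  pos <- 0
  for (B in blocks) {
    idx <- pos + seq_len(nrow(B))
    out[idx, idx] <- B
    pos <- pos + nrow(B)
  }
  out
}

G1 <- matrix(c(1, 0, 1, 1), 2, 2)              # trend block (9) with p = 1, Delta t = 1
I2 <- diag(2)                                  # harmonic block: random walk, identity
G3 <- rbind(rep(-1, 3), cbind(diag(2), 0))     # seasonal block (12) with m = 4
I4 <- diag(1)                                  # regression block: random walk, identity

G <- bdiag_base(G1, I2, G3, I4)
```

The result is an 8 x 8 matrix whose off-diagonal blocks are zero, so each component evolves independently of the others.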

Time trend

The long term trend is usually modelled by a sufficiently smooth function. In static regression models, this can be done by e.g. a high degree polynomial, a spline, or a generalized additive model. In the dynamic setting, however, a low degree polynomial with time-varying coefficients may suffice. By stacking a polynomial, $q(t) = b_0 + b_1 t + \cdots + b_p t^p$, and the first $p$ derivatives, the transition from $t_{k-1}$ to $t_k$ obeys the relation

\[ \begin{bmatrix} q(t_k) \\ q'(t_k) \\ \vdots \\ q^{(p)}(t_k) \end{bmatrix} = \mathbf{G}_k^{(1)} \begin{bmatrix} q(t_{k-1}) \\ q'(t_{k-1}) \\ \vdots \\ q^{(p)}(t_{k-1}) \end{bmatrix}, \tag{8} \]

where $\Delta t_k = t_k - t_{k-1}$, and the upper triangular transition matrix is given by

\[ \mathbf{G}_k^{(1)} = \begin{bmatrix} 1 & \Delta t_k & \cdots & \Delta t_k^{p}/p! \\ & 1 & \cdots & \Delta t_k^{p-1}/(p-1)! \\ & & \ddots & \vdots \\ & & & 1 \end{bmatrix}. \tag{9} \]

Using $\boldsymbol{\theta}_k^{(1)}$ for the left hand side of (8), a polynomial growth model with time-varying coefficients can be written as $\boldsymbol{\theta}_k^{(1)} = \mathbf{G}_k^{(1)} \boldsymbol{\theta}_{k-1}^{(1)} + \boldsymbol{\omega}_k^{(1)}$. The error term has variance $\operatorname{VAR}(\boldsymbol{\omega}_k^{(1)}) = \Delta t_k \mathbf{W}^{(1)}$, where $\mathbf{W}^{(1)}$ is diagonal in the case with independent random perturbations in each of the derivatives.

The trend component is the first element in $\boldsymbol{\theta}_k^{(1)}$, i.e.

\[ T_k = \mathbf{T}_k \boldsymbol{\theta}_k^{(1)} = [\, 1 \;\; 0 \;\; \cdots \;\; 0 \,]\, \boldsymbol{\theta}_k^{(1)}. \]
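The transition matrix (9) is a Taylor expansion in matrix form, which can be verified numerically. The sketch below builds the matrix for a given time step and degree and checks relation (8) on a quadratic polynomial. The function name `poly_G` and the test polynomial are assumptions made for this example.

```r
# Upper triangular transition matrix (9): entry (i, j) is dt^(j - i) / (j - i)! for j >= i.
poly_G <- function(dt, p) {
  G <- diag(p + 1)
  for (i in seq_len(p + 1)) {
    for (j in i:(p + 1)) {
      G[i, j] <- dt^(j - i) / factorial(j - i)
    }
  }
  G
}

# Check (8) with q(t) = 1 + 2t + 3t^2, so q'(t) = 2 + 6t and q''(t) = 6.
state <- function(t) c(1 + 2 * t + 3 * t^2, 2 + 6 * t, 6)
G     <- poly_G(dt = 0.5, p = 2)
moved <- drop(G %*% state(1))     # propagate the stacked state from t = 1 to t = 1.5
```

Here `moved` agrees with `state(1.5)`, i.e. with the polynomial and its derivatives evaluated directly at the new timepoint.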

Alternatively, the time trend may be modelled as a random function, $q(t)$, for which the increments over time are described by a random walk, resulting in a cubic spline, see Kitagawa and Gersch (1984). The transition is the same as in (8) with $p = 2$, but only one variance parameter is necessary as,

\[ \operatorname{VAR}(\boldsymbol{\omega}_k^{(1)}) = \sigma_w^2 \begin{bmatrix} \Delta t_k^3/3 & \Delta t_k^2/2 \\ \Delta t_k^2/2 & \Delta t_k \end{bmatrix}. \tag{10} \]

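Harmonic seasonal pattern

Seasonal patterns with a given period, $m$, can be described by the following $d$th degree trigonometric polynomial

\[ H_k = \mathbf{H}_k \boldsymbol{\theta}_k^{(2)} = \sum_{i=1}^{d} \left\{ \theta_{c,i} \cos\left( i \cdot \frac{2\pi}{m} t_k \right) + \theta_{s,i} \sin\left( i \cdot \frac{2\pi}{m} t_k \right) \right\} = [\, c_{1k} \;\cdots\; c_{dk} \;\; s_{1k} \;\cdots\; s_{dk} \,]\, \boldsymbol{\theta}_k^{(2)}, \tag{11} \]

where $c_{ik} = \cos(i \cdot 2\pi t_k / m)$ and $s_{ik} = \sin(i \cdot 2\pi t_k / m)$. This component can be used to describe seasonal effects showing cyclic patterns. Further seasonal components may be added for each period of interest.

The random fluctuations in $\boldsymbol{\theta}_k^{(2)}$ are modelled by a random walk, $\boldsymbol{\theta}_k^{(2)} = \boldsymbol{\theta}_{k-1}^{(2)} + \boldsymbol{\omega}_k^{(2)}$ with $\operatorname{VAR}(\boldsymbol{\omega}_k^{(2)}) = \Delta t_k \mathbf{W}^{(2)}$.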
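The row vector $\mathbf{H}_k$ in (11) is straightforward to compute. The following base R sketch builds it for a given timepoint, period and degree; the function name `harmonic_row` and the example values (monthly period, degree 2) are assumptions for illustration.

```r
# Row vector [c_1k ... c_dk s_1k ... s_dk] of (11) for timepoint t, period m, degree d.
harmonic_row <- function(t, m, d) {
  i <- seq_len(d)
  c(cos(i * 2 * pi * t / m), sin(i * 2 * pi * t / m))
}

Hk      <- harmonic_row(t = 3, m = 12, d = 2)    # quarter of the way through the period
Hk_next <- harmonic_row(t = 15, m = 12, d = 2)   # one full period later

# With static coefficients theta, the seasonal effect is the inner product.
theta2 <- c(1, 0.5, -2, 0.25)                    # (theta_c1, theta_c2, theta_s1, theta_s2)
H_eff  <- sum(Hk * theta2)
```

The row repeats after $m$ time units, so with static coefficients the component is exactly periodic; letting the coefficients follow a random walk allows amplitude and phase to drift.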

Unstructured seasonal component

For equidistant observations, a commonly used parameterization for the seasonal component is to let the effects, $\gamma_k$, for each period sum to zero in the static case, or to a white noise error sequence in the time-varying case, see Kitagawa and Gersch (1984). For an integer period, $m$, the sum-to-zero constraint can be expressed as $\sum_{i=0}^{m-1} \gamma_{k-i} = 0$ in the static case, and in the dynamic case, $\sum_{i=0}^{m-1} \gamma_{k-i} = \omega_k^{(3)}$, with $\omega_k^{(3)} \sim N(0, \sigma_w^2)$. This is expressed in matrix form by letting $\boldsymbol{\theta}_k^{(3)} = [\gamma_k, \gamma_{k-1}, \ldots, \gamma_{k-m+2}]^\top$, and defining the $(m-1) \times (m-1)$ matrix

\[ \mathbf{G}_k^{(3)} = \begin{bmatrix} -1 & -1 & \cdots & -1 \\ 1 & 0 & \cdots & 0 \\ \vdots & \ddots & \ddots & \vdots \\ 0 & \cdots & 1 & 0 \end{bmatrix}. \tag{12} \]

Then, $\boldsymbol{\theta}_k^{(3)} = \mathbf{G}_k^{(3)} \boldsymbol{\theta}_{k-1}^{(3)} + \boldsymbol{\omega}_k^{(3)}$, with $\operatorname{VAR}(\boldsymbol{\omega}_k^{(3)}) = \mathbf{W}^{(3)} = \operatorname{diag}(\sigma_w^2, 0, \ldots, 0)$ defines the evolution of the seasonal component. The corresponding term in the linear predictor is extracted by

\[ S_k = \mathbf{S}_k \boldsymbol{\theta}_k^{(3)} = [\, 1 \;\; 0 \;\; \cdots \;\; 0 \,]\, \boldsymbol{\theta}_k^{(3)}. \]
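The effect of the matrix (12) is easiest to see in the static case, where the noise term is zero. The sketch below builds the matrix for a period $m \geq 3$ and iterates it from an arbitrary starting state. The function name `season_G` and the starting values are assumptions for this example.

```r
# Seasonal evolution matrix (12) of dimension (m - 1) x (m - 1); written for m >= 3.
season_G <- function(m) {
  rbind(rep(-1, m - 1),          # new effect = minus the sum of the previous m - 1 effects
        cbind(diag(m - 2), 0))   # shift the remaining effects one position down
}

G3    <- season_G(4)
theta <- c(1, -2, 0.5)           # (gamma_k, gamma_{k-1}, gamma_{k-2}), arbitrary start
gam   <- numeric(8)
for (k in 1:8) {
  theta  <- drop(G3 %*% theta)   # static case: no evolution noise
  gam[k] <- theta[1]             # S_k extracts the first element
}
```

In this static run any four consecutive effects in `gam` sum to zero and the sequence repeats with period 4; adding noise with variance $\sigma_w^2$ to the first element relaxes the constraint as in the dynamic case.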

Regression component

Observed time-varying covariates, $\mathbf{R}_k$, enter the model through the usual regression term

\[ R_k = \mathbf{R}_k \boldsymbol{\theta}_k^{(4)}, \]

with $\boldsymbol{\theta}_k^{(4)} = \boldsymbol{\theta}_{k-1}^{(4)} + \boldsymbol{\omega}_k^{(4)}$ and $\operatorname{VAR}(\boldsymbol{\omega}_k^{(4)}) = \Delta t_k \mathbf{W}^{(4)}$. The structure of $\mathbf{W}^{(4)}$ is specified by the modeller and depends on the context.

3. Specification of state space objects

The package sspir can be downloaded and installed from http://CRAN.R-project.org/ and is then activated in R by library("sspir"). Assuming that the data are available either in a data frame or in the current environment, a state space model is set up using glm-style formula and family arguments. Terms are considered static unless wrapped in the special function tvar(), described further in Section 3.2.
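For orientation, the static starting point that this syntax mirrors is an ordinary glm call. The sketch below fits a Poisson generalized linear model with a quadratic trend and one covariate to simulated data, using base R only. The variable names and the simulated data are assumptions for this example; the sspir call itself is not shown, since it requires the package.

```r
# Static Poisson regression with log-link: formula and family as in a glm call.
set.seed(3)
n    <- 120
time <- seq_len(n)
x    <- rnorm(n)                                        # hypothetical covariate
y    <- rpois(n, lambda = exp(0.5 + 0.01 * time + 0.3 * x))

fit <- glm(y ~ time + I(time^2) + x, family = poisson(link = "log"))
coef(fit)                                               # all coefficients constant over time
```

In the state space formulation described in the text, the same formula and family arguments are used, and terms whose coefficients should evolve over time are marked with tvar().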

3.1. State space model objects

In sspir, a state space model is defined as an object from the class ssm. The object defines the model and contains the slots that are needed for the subsequent statistical analysis. The definition of a state space model object has the following syntax

    ssm(formula, family=gaussian, data, subset, fit=TRUE, phi, m0, C0,
        Fmat, Gmat, Vmat, Wmat)

The call is designed to be similar to the glm call. The elements in the call are

formula  a specification of the linear predictor (5) of the model. The syntax is defined in Section 3.2.

family  a specification of the observation error distribution and link function to be used in the model, as in a glm call. This can be a character string naming a family function, a family function or the result of a call to a family function. Currently, only Poisson with log-link, binomial with logit-link, and Gaussian with identity-link have been implemented. It is possible to expand with further combinations within the exponential family.

data  an optional data frame containing the variables in the model. By default the variables are taken from 'environment(formula)', typically the environment from which 'ssm' is called. The response has to be of class ts.

subset  an optional vector specifying a subset of observations to be used in the fitting process.

fit  a logical defaulting to TRUE, which means that the iterated extended Kalman smoother is used to fit the model. If FALSE, the model is only defined and no inferential calculations are made.

phi  a vector of hyper parameters that are passed directly to Fmat, Gmat, Vmat, and Wmat. If phi is not provided, it defaults to a vector of ones with the length determined by the number of hyper parameters needed on the basis of the formula provided.

m0  a vector with the initial state vector. Defaults to a vector of zeros.

C0  a matrix with the variance matrix of the initial state. Defaults to a diagonal matrix with diagonal entries set to 10^6.


Fmat  a function giving the regression matrix at a given timepoint. If not supplied, this is constructed from the formula.

Gmat  a function giving the evolution matrix at a given timepoint. If not supplied, this is constructed from the formula.

Wmat  a function giving the evolution variance matrix at a given timepoint. If not supplied, this is constructed from the formula.

Vmat  a function giving the observation variance matrix at a given timepoint. If not supplied, this is constructed from the formula.

The call creates an object defining the system matrices $\mathbf{F}_t$, $\mathbf{G}_t$, $\mathbf{W}_t$, and $V_t$ in terms of functions, returning the matrix in question at a given time point. For example, the Wmat function could be defined as Wmat