Dynamic Structural Econometrics - Unobserved heterogeneity and CCP estimation
The Expectation-Maximization (EM) Algorithm
We start with an overview of the Expectation-Maximization algorithm. Formally described by Dempster et al. (1977), the EM algorithm provides a computationally convenient way to obtain maximum likelihood estimates in "incomplete data" environments. For our purposes, we can think of settings in which each observation in the data is generated by one of two different distribution types. However, the econometrician does not observe which type generated each data point, and only observes the overall distribution. The goal is to identify the distribution of each type and the distribution of the latent variable that governs the distribution over types.
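As a reminder of the algorithm's general structure (standard notation, not specific to the example below; $Z$ denotes the unobserved type labels), each iteration $k$ alternates between two steps:

$$Q\big(\vartheta \mid \vartheta^{(k)}\big) = \mathbb{E}_{Z \mid Y, \vartheta^{(k)}}\big[\log L(\vartheta; Y, Z)\big] \qquad \text{(E-step)}$$

$$\vartheta^{(k+1)} = \arg\max_{\vartheta} Q\big(\vartheta \mid \vartheta^{(k)}\big) \qquad \text{(M-step)}$$

Here $\vartheta$ collects all parameters, i.e., both the type-specific parameters and the mixing probabilities.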
As an example, suppose we have the following data:
nobs = 1000; p = rand(nobs,1);   % 1000 draws; p assigns each observation's latent type (this setup line is not shown in the original snippet)
Y=(p>0.6).*poissrnd(2,[nobs,1]) + (p<=0.6).*poissrnd(8,[nobs,1]);
histogram(Y,'Normalization',"probability")
The above figure shows the empirical distribution of 1000 observations of an integer-valued random variable Y. Suppose we would like to fit a parametric model to this data. The main feature of the data that complicates this exercise is that the data is bimodal with a heavy right tail, and few (if any) standard distributions for discrete random variables will accurately capture this feature. To test this, let's try fitting two common parametric distributions, Poisson and Negative Binomial, and see how they do.
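The fitting step itself is not shown; a minimal sketch of how poiss_hat and nbin_hat (the fitted parameters used in the plot below) might be obtained with the Statistics and Machine Learning Toolbox is:
poiss_hat = poissfit(Y);    % maximum likelihood estimate of the Poisson mean
nbin_hat  = nbinfit(Y);     % maximum likelihood estimates of the negative binomial (r,p) parameters
x = (0:max(Y))';            % support over which the fitted pmfs are plotted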
histogram(Y,'Normalization','pdf','FaceColor',[0.4,0.4,0.4]);
hold on
plot(x,poisspdf(x,poiss_hat));
plot(x,nbinpdf(x,nbin_hat(1),nbin_hat(2)),'--');
legend('Empirical distribution','Poisson estimates','Neg. Bin. estimates','location','northeast')
hold off
As we can see from the figure above, the results are not encouraging. Simple parametric models may not suffice to accurately capture the qualitative features of the empirical distribution. While we could consider nonparametric approaches to estimating the distribution, this will have demanding data requirements. We also might have other reasons to believe that the distribution is generated by a finite number of different "types" of random draws that are mixed together in the overall distribution of Y.
A common way to estimate these distributions is to consider a mixture model. That is, we will assume that each draw of Y is obtained from distribution $F_1$ with probability $\pi_1$ and distribution $F_2$ with probability $\pi_2$. The mixing parameters $(\pi_1, \pi_2)$ govern the distribution over types, while the distributions $F_1$ and $F_2$ describe the distribution of outcomes within each type. Typically, we will parameterize the distributions $F_1$ and $F_2$ with parameters $\theta_1$ and $\theta_2$. Noting the restriction that $\pi_1 + \pi_2 = 1$, we have three parameters we need to estimate: $\theta_1$, $\theta_2$, and $\pi_1$. The log-likelihood for observation $i$ can be expressed as

$$\ell_i(\theta, \pi) = \log\big(\pi_1 f(y_i \mid \theta_1) + \pi_2 f(y_i \mid \theta_2)\big).$$

Practical applications have found that direct maximum likelihood approaches to estimating these parameters perform poorly -- convergence may be slow, there can be many local optima, and the estimates may be extremely sensitive to starting values for the parameters. The alternative approach of Dempster et al. (1977) is to update the parameters π and θ in separate stages. The main features of the estimator can be described in four steps:
- Fix initial values for $\theta_1$, $\theta_2$, and $\pi_1$ and set convergence criteria.
- Given the current values for $(\theta, \pi)$, compute the probability of each observation type, i.e., calculate $\Pr(z_i = j \mid y_i, \theta, \pi)$ for $j = 1, 2$. This yields an updated value for π and pointwise estimates for the probability that each observation i is of type j.
- Hold fixed the type probabilities for each $i$ computed in step 2 and maximize the likelihood with respect to $(\theta_1, \theta_2)$ given these probabilities. This yields updated values for $\theta_1$ and $\theta_2$.
- Repeat steps 2-3 until convergence (per-iteration change in the overall likelihood is "small").
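Concretely, steps 2 and 3 can be read as maximizing the expected complete-data log-likelihood (a standard way of writing the EM objective; the notation $q_{ij}$ for the type probabilities is ours and corresponds to the "EZ" matrix in the code below):

$$Q(\theta, \pi) = \sum_{i=1}^{N} \sum_{j=1}^{2} q_{ij}\big[\log f(y_i \mid \theta_j) + \log \pi_j\big], \qquad q_{ij} = \Pr(z_i = j \mid y_i, \theta, \pi).$$

The θ-part and the π-part of this objective are additively separable, which is what makes the maximization step easy to split in two.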
We will implement this algorithm on the data set above using the assumptions that (i) there are two components in the mixture (i.e., observations are each associated with one of two unobserved types) and (ii) the two types $j \in \{1, 2\}$ each follow a Poisson distribution with parameter $\theta_j$.
% Step 1 (a): initial values for theta and pi (illustrative starting values; the originals are not shown)
theta_0 = [1, 10]; pi_0 = [0.5, 0.5];
% Step 1 (b): specify convergence criteria
tolerance = 1e-8; maxiter = 1000; normdiff = 1; lik0 = -Inf; iter = 1;
while normdiff>tolerance && iter<=maxiter
Expectation step.
The expectation step takes the current values of $(\theta_1, \theta_2, \pi_1)$ as fixed and calculates the probability of observation $i$ being of type 1 ($z_i = 1$) or type 2 ($z_i = 2$). We can write the density function of Y conditional on θ, π, and the (unobserved) $z_i$ as

$$f(y_i \mid \theta, \pi, z_i = j) = f(y_i \mid \theta_j).$$

Similarly, the distribution of $z_i$ conditional on θ and π is

$$\Pr(z_i = j \mid \theta, \pi) = \pi_j,$$

where $\pi_2 = 1 - \pi_1$. These equations together with Bayes' theorem imply that the conditional distribution of the $z_i$ is the proportional likelihood of each mixture component j weighted by the current estimates of the mixing probabilities π:

$$\Pr(z_i = j \mid y_i, \theta, \pi) = \frac{\pi_j f(y_i \mid \theta_j)}{\sum_{k=1}^{2} \pi_k f(y_i \mid \theta_k)}.$$
obs = length(Y);
EZ=zeros(obs,length(pi_0));
temp_pdf=poisspdf(repmat(Y,1,length(theta_0)),repmat(theta_0,obs,1));
for j=1:length(pi_0)   % Bayes update: posterior probability that observation i is of type j
EZ(:,j)=(pi_0(j)*temp_pdf(:,j))./(temp_pdf*pi_0');
end
In the code above, the "temp_pdf" variable creates an $N \times 2$ matrix of likelihood values, where the first column is the set of likelihoods associated with $\theta_1$ and the second column is the set of likelihoods associated with $\theta_2$: the $(i, j)$ element is $f(y_i \mid \theta_j)$.
Maximization step.
The maximization step performs a standard maximum likelihood procedure in which the probabilities of being in each group, $q_{ij} = \Pr(z_i = j \mid y_i, \theta, \pi)$, are treated as known and fixed. Because the likelihood is additively separable in θ and π, the maximization step can be done in two parts. Updating π is simple: maximizing $Q(\theta, \pi)$ with respect to π simply requires taking the average of the probability of being each type:

$$\hat{\pi}_j = \frac{1}{N}\sum_{i=1}^{N} q_{ij}.$$

That is, we simply take the mean of each column of the "EZ" matrix we computed above. Updating θ requires us to numerically optimize the first component of the log-likelihood function above.
Since Matlab has a built-in function for the pmf of a Poisson distribution, this likelihood function can be defined and optimized with two lines of code.
theta_fun= @(theta) (-1)*sum(sum(log(poisspdf(repmat(Y,1,length(theta_0)),repmat(theta,obs,1))).*EZ));
[theta_1,~]=fminsearch(theta_fun,theta_0);
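The π update described in the text is not shown as code above; it is just the column means of the "EZ" matrix. As an aside, for the Poisson case the weighted MLE of θ also has a closed form (the EZ-weighted mean of Y), so the numerical search could in principle be skipped. A minimal sketch using the variable names from the snippets above:
pi_1 = mean(EZ,1);               % updated mixing probabilities (column means of EZ)
% theta_1 = (Y'*EZ)./sum(EZ,1);  % optional closed-form alternative to the fminsearch update above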
The last step in each iteration is to calculate the overall likelihood $Q(\theta, \pi)$ and check for convergence. If the change in $Q$ relative to the previous iteration is less than the convergence criterion, we stop. Otherwise, we proceed to the next iteration.
lik1=sum(sum(log(poisspdf(repmat(Y,1,length(theta_1)),repmat(theta_1,obs,1))).*EZ)) + sum(sum(log(pi_1).*EZ));
% Update for next iteration
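The bookkeeping for this update is not shown in the original snippet; under the variable names used above, it might look like the following sketch:
normdiff = abs(lik1 - lik0);                   % change in the EM objective since the last iteration
lik0 = lik1; pi_0 = pi_1; theta_0 = theta_1;   % carry the updated values into the next iteration
iter = iter + 1;
end                                            % closes the while loop opened in step 1
PiHat = pi_1; ThetaHat = theta_1;              % final estimates used in the plots and table below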
With the estimates obtained, let's see how we did:
Y_fit=PiHat(1)*poisspdf(x,ThetaHat(1))+PiHat(2)*poisspdf(x,ThetaHat(2));
plot(x,Y_fit); hold on
histogram(Y,'Normalization','probability','FaceColor',[0.4,0.4,0.4]);
legend('Estimated mixture distribution','Empirical distribution','location','northeast')
hold off
This looks much better: the estimated distribution matches the shape of the empirical distribution much more closely. Since we know the true parameters used to generate the data, we can also compare these with the estimated parameters:
FiniteMixtureEstimatesTable = table([ThetaHat'; PiHat'],[2; 8; 0.4; 0.6],'VariableNames',{'Estimates','True values'},'RowNames',{'Theta_1','Theta_2','Pi_1','Pi_2'})
FiniteMixtureEstimatesTable = 4×2 table
|         | Estimates | True values |
|---------|-----------|-------------|
| Theta_1 | 1.9176    | 2           |
| Theta_2 | 7.9876    | 8           |
| Pi_1    | 0.41922   | 0.4         |
| Pi_2    | 0.58078   | 0.6         |
The estimator also converges quickly, taking fewer than 30 iterations to converge.
Using the EM algorithm in a dynamic discrete choice model
Now we apply the EM algorithm to a dynamic discrete choice model. The basic features of the algorithm do not change: at each iteration, we compute the probability that each individual (or bus) is of type s given the most recent estimates for θ, and then use these probabilities to obtain a new estimate for θ.
Model description and setup
We continue to use the model of Harold Zurcher's bus engine replacement problem introduced in Rust (1987) as the framework for estimation. To briefly reiterate the model, we have the following setup, noting that we now assume that the bus type s is unobservable:
- Time is discrete, with $t$ denoting each period, with $t = 1, 2, \ldots$
- Each period, Harold Zurcher chooses between two actions. He can either replace the bus engine, $d_{1t} = 1$, or not replace, $d_{2t} = 1$. These are mutually exclusive, so that $d_{1t} + d_{2t} = 1$.
- Zurcher's payoff depends on a single state variable $x_t$ that represents the mileage of the bus. Mileage advances by a random amount if the bus engine is not replaced and is set to zero if the bus engine is replaced. Denote by $f_j(x_{t+1} \mid x_t)$ the distribution over next period mileage conditional on current mileage after making choice j. When the bus engine is replaced, mileage resets to zero, so that $f_1(x_{t+1} = 0 \mid x_t) = 1$ for all $x_t$. Mileage accumulation when the engine is not replaced follows a discrete analog of the exponential distribution up to some maximum mileage $\bar{x}$, where the mileage increment is $\Delta x$.
- In addition to mileage $x_t$, there is a permanent and unobservable bus-specific type $s \in \{1, 2\}$. Assume that the deterministic components of flow payoffs for each action $j$ for a bus of type s are given by
$$u_1(x_t, s) = 0, \qquad u_2(x_t, s) = \theta_0 + \theta_1 x_t + \theta_2 s.$$
- Payoffs also have an additively separable idiosyncratic component $\epsilon_{jt}$, so that the total flow payoff from choice j in period t is $u_j(x_t, s) + \epsilon_{jt}$. We assume that the $\epsilon_{jt}$ are i.i.d. draws from a Type I EV distribution.
- Denoting the optimal decision rule at t by $d_{jt}^{o}(x_t, s, \epsilon_t)$, we define the ex-ante value function for state $x_t$ in period t as
$$V_t(x_t, s) = E\left[\sum_{\tau=t}^{\infty} \beta^{\tau - t} \sum_{j=1}^{2} d_{j\tau}^{o}(x_\tau, s, \epsilon_\tau)\big(u_j(x_\tau, s) + \epsilon_{j\tau}\big) \,\middle|\, x_t, s\right].$$
- The conditional value function $v_j(x_t, s)$ is the value function for choice j in period t and state $x_t$, less the utility shock $\epsilon_{jt}$. In the case of Zurcher's bus engine replacement problem, these functions can be expressed as
$$v_1(x, s) = \beta V(0, s), \qquad v_2(x, s) = \theta_0 + \theta_1 x + \theta_2 s + \beta \sum_{x'} V(x', s) f_2(x' \mid x).$$
- Note that all t subscripts have been removed from value functions and state transitions: this is a stationary, infinite horizon problem.
- In the binary choice setting with Type I EV preference shocks we can express conditional choice probabilities very simply:
$$p_j(x, s) = \frac{\exp\big(v_j(x, s)\big)}{\exp\big(v_1(x, s)\big) + \exp\big(v_2(x, s)\big)}.$$
- Representation: finally, we can use the results of Hotz and Miller (1993) to express the difference between the value function $V(x, s)$ and conditional value function $v_j(x, s)$:
$$V(x, s) - v_j(x, s) = \psi_j\big(p(x, s)\big),$$
where $\psi_j\big(p(x, s)\big) = \gamma - \ln p_j(x, s)$ when the $\epsilon_{jt}$ are Type I EV, where $\gamma$ is the Euler-Mascheroni constant.
x_min = 0.0 ; x_max = 10.0;
delta_x = 0.25;   % mileage grid step (illustrative value; the original is not shown)
x_len = (x_max-x_min)/delta_x+1;
x = linspace(x_min, x_max, x_len);
x_tday = repmat(x, [x_len,1]);
x_next = x_tday';   % so that x_next - x_tday gives all (next, today) mileage pairs
f = (x_next >= x_tday)' .* exp(- (x_next - x_tday)') .* (1 - exp(-delta_x));
f(:,end) = 1 - sum(f(:,1:(end-1)), 2); % The last column is set so that the sum of probs is 1
emc = double(eulergamma);
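The later code also refers to a cumulative transition matrix F (passed to generate_data as x_tran_cumul) and to F2 (the mileage transition matrix used in the CCP likelihood). These objects are not constructed in the snippets shown; a sketch of what we assume they are, given how they are used later, is:
F2 = f;                                       % transition matrix conditional on not replacing (rows: current state)
F  = cumsum(f,2);                             % row-wise cumulative transition probabilities, used to draw next-period mileage
F1 = [ones(x_len,1), zeros(x_len,x_len-1)];   % transition conditional on replacing: mileage resets to the first grid point
assert(all(abs(sum(f,2) - 1) < 1e-12))        % sanity check: every row of the transition matrix sums to one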
% Define parameters and simulate data
% (Illustrative settings: N, T, and beta are not reported in the text; pi is the probability of type s=1.)
N = 1000; T = 20; beta = 0.9; pi = 0.4; S = [1,2]; X = x;
theta = [2.00, -0.15, 1.0];
[data_t, data_x, data_s, data_d, data_x_index] = generate_data(N, T, S, f, F, X,beta, pi, theta);
% Data_vars={'data_t', 'data_x', 'data_s', 'data_d', 'data_x_index'};
% Const_vars={'beta','theta','S','X','x_len','pi','T','x','N','f','F','F1','F2','emc'};
% All_vars=[Data_vars, Const_vars];
% save("Bus_engine_unobs_het_data.mat",All_vars{:});
Estimation
We begin by setting initial values for the parameters θ and π, and use the initial values for π to initialize the conditional choice probabilities $\hat{p}_1(x, s)$. The convergence parameter is a tolerance level of 1e-7, and we check for convergence by testing the percentage change in the likelihood for values 25 iterations apart. Finally, we set the maximum number of iterations to 1000 and include a check (and error message) in case we reach this level before converging.
% load("Bus_engine_unobs_het_data.mat");
pi_hat = 0.5; theta_0 = [1, -0.1, 0.5];       % illustrative starting values (the originals are not shown)
max_iter = 1000; tol = 1e-7; epsilon = 1e-4;  % iteration cap and tolerance from the text; epsilon bounds CCPs away from 0 and 1
stored_lik_vals=zeros(max_iter,1);
q_new=[pi_hat*ones(N,1), (1-pi_hat)*ones(N,1)];
ccp_hat = zeros(length(X),length(S));
for s=1:length(S), for j=1:length(X)          % weighted bin estimator of the replacement CCPs
ccp_hat(j,s) = sum(repmat(q_new(:,s),[T,1]).*(data_x_index(:)==j).*(data_d(:)==1))/sum(repmat(q_new(:,s),[T,1]).*(data_x_index(:)==j));
end, end
ccp_hat=max(epsilon, min(1-epsilon,ccp_hat));
To illustrate the method as it applies to the bus engine replacement case, we go through one iteration first before putting each term inside the loop that completes the estimation.
Expectation step.
With initial values for the parameters established, we first need to calculate the likelihood for each observation bus n, time period t, and bus type s. Recall that in the finite Poisson mixture in the example above, we updated the guesses for the probability that observation i was type j by using the π-weighted likelihoods of observing outcome $y_i$ conditional on being type j and having type-specific parameters $\theta_j$. With the dynamic bus engine replacement problem, we proceed similarly. However, we have to remember that the relevant likelihood for establishing whether bus n is of type s is the sequence of replacement decisions $\{d_{nt}\}_{t=1}^{T}$ made for that bus. Therefore, we calculate the individual likelihood components $L_{nt}(s)$ for each period, bus, and type, and then we take the product across time periods t to compute the overall likelihood associated with each bus type s. Each individual likelihood term can be calculated as

$$L_{nt}(s) = \frac{d_{1nt} + d_{2nt}\exp\big(v_2(x_{nt}, s) - v_1(x_{nt}, s)\big)}{1 + \exp\big(v_2(x_{nt}, s) - v_1(x_{nt}, s)\big)},$$

where we exploit the one-period finite dependence property inherent to the renewal of the bus engine to express the difference in conditional value functions in terms of the one-period-ahead conditional choice probabilities:

$$v_2(x, s) - v_1(x, s) = \theta_0 + \theta_1 x + \theta_2 s + \beta \sum_{x'} \big[\gamma - \ln \hat{p}_1(x', s)\big] f_2(x' \mid x) - \beta \big[\gamma - \ln \hat{p}_1(0, s)\big].$$
u2 = theta_0(1) + theta_0(2)*X' + theta_0(3)*S;
v1_ccp = repmat(beta*(-log(ccp_hat(1,:))) + beta*emc,[length(X),1]);
v2_ccp = u2 + beta*F2(:,:)*(0 - log(ccp_hat(:,:))) + beta*emc;
likelihood_pointwise_1=((data_d(:)==2).*exp((v2_ccp(data_x_index(:),1)-v1_ccp(data_x_index(:),1)))+(1-(data_d(:)==2)))./(1+exp(v2_ccp(data_x_index(:),1)-v1_ccp(data_x_index(:),1)));
likelihood_pointwise_2=((data_d(:)==2).*exp((v2_ccp(data_x_index(:),2)-v1_ccp(data_x_index(:),2)))+(1-(data_d(:)==2)))./(1+exp(v2_ccp(data_x_index(:),2)-v1_ccp(data_x_index(:),2)));
The likelihood associated with each bus n conditional on engine type s is the product of the likelihood of the sequence of decisions $\{d_{nt}\}_{t=1}^{T}$ conditional on the states $\{x_{nt}\}_{t=1}^{T}$, engine type s, conditional choice probabilities $\hat{p}_1$, and utility parameters θ:

$$L_n(s) = \prod_{t=1}^{T} L_{nt}(s).$$

This is the term that captures the observation-level likelihood terms $f(y_i \mid \theta_j)$ from our simple EM algorithm example above. As before, we weight the engine-type-specific likelihoods by the mixing parameter π, yielding an overall per-bus estimate of the probability of being engine type s of

$$q_n(s) = \frac{\pi_s L_n(s)}{\sum_{s'} \pi_{s'} L_n(s')}.$$
ll_pw1=reshape(likelihood_pointwise_1,[N,T]);
ll_pw2=reshape(likelihood_pointwise_2,[N,T]);
likelihood_by_n=[prod(ll_pw1,2), prod(ll_pw2,2)];
q_new(:,1) = pi_hat*likelihood_by_n(:,1)./(pi_hat*likelihood_by_n(:,1) + (1-pi_hat)*likelihood_by_n(:,2));
q_new(:,2) = 1-q_new(:,1);
Maximization step.
Having established the expectation that each bus n is of type s in the previous step, we now perform the maximization step of the algorithm. As before, updating the mixing parameter π is simple, as we simply take the mean of the bus-level probabilities of being of type s:

$$\hat{\pi} = \frac{1}{N}\sum_{n=1}^{N} q_n(1).$$

To update the utility parameters θ, we first need to update the conditional choice probabilities $\hat{p}_1(x, s)$, since we are using CCP methods to calculate the likelihood terms. We update the conditional choice probabilities using the bin estimator after applying weights via the per-bus probability of being of type s:

$$\hat{p}_1(x_j, s) = \frac{\sum_{n}\sum_{t} q_n(s)\, 1\{x_{nt} = x_j\}\, 1\{d_{1nt} = 1\}}{\sum_{n}\sum_{t} q_n(s)\, 1\{x_{nt} = x_j\}}.$$

The last maximization step is to update the utility parameters θ. Here we maximize the likelihood, again weighting each likelihood component $\ln L_{nt}(s)$ by the probability that bus n is of type s, using the estimates for $q_n(s)$ obtained in the previous step:

$$\hat{\theta} = \arg\max_{\theta} \sum_{n=1}^{N} \sum_{s} q_n(s) \sum_{t=1}^{T} \ln L_{nt}(s; \theta, \hat{p}_1).$$

pi_hat = mean(q_new(:,1));
ccp_new = zeros(length(X),length(S));
for s=1:length(S), for j=1:length(X)   % weighted bin estimator, now using the updated type probabilities
ccp_new(j,s) = sum(repmat(q_new(:,s),[T,1]).*(data_x_index(:)==j).*(data_d(:)==1))/sum(repmat(q_new(:,s),[T,1]).*(data_x_index(:)==j));
end, end
never_reached=isnan(ccp_new(:)); % keep track of states that never occur in the data
ccp_new=max(epsilon, min(1-epsilon,ccp_new)); % Makes sure that all probabilities are in the open interval (0,1), as "1" and "0" observations cause problems.
lik_fun = @(theta) ccp_likelihood_inner(theta,X,S,f,data_x_index,q_new,data_d,ccp_new,beta,T);
[updated_theta,fval] = fminunc(lik_fun,theta_0,optimoptions('fminunc','Display','none'));
Finally, we include everything in a while loop so as to iterate until convergence:
cond = 0; iter = 1; tic   % initialize loop bookkeeping (sketch; the original initialization is not shown)
while cond==0 && iter<max_iter
u2 = theta_0(1) + theta_0(2)*X' + theta_0(3)*S;
v1_ccp = repmat(beta*(-log(ccp_hat(1,:))) + beta*emc,[length(X),1]);
v2_ccp = u2 + beta*F2(:,:)*(0 - log(ccp_hat(:,:))) + beta*emc;
likelihood_pointwise_1=((data_d(:)==2).*exp((v2_ccp(data_x_index(:),1)-v1_ccp(data_x_index(:),1)))+(1-(data_d(:)==2)))./(1+exp(v2_ccp(data_x_index(:),1)-v1_ccp(data_x_index(:),1)));
likelihood_pointwise_2=((data_d(:)==2).*exp((v2_ccp(data_x_index(:),2)-v1_ccp(data_x_index(:),2)))+(1-(data_d(:)==2)))./(1+exp(v2_ccp(data_x_index(:),2)-v1_ccp(data_x_index(:),2)));
ll_pw1=reshape(likelihood_pointwise_1,[N,T]);
ll_pw2=reshape(likelihood_pointwise_2,[N,T]);
likelihood_by_n=[prod(ll_pw1,2), prod(ll_pw2,2)];
q_new(:,1) = pi_hat*likelihood_by_n(:,1)./(pi_hat*likelihood_by_n(:,1) + (1-pi_hat)*likelihood_by_n(:,2));
q_new(:,2) = 1 - q_new(:,1);
pi_hat = mean(q_new(:,1));
ccp_new = zeros(length(X),length(S));
for s=1:length(S), for j=1:length(X)   % weighted bin estimator with updated type probabilities
ccp_new(j,s) = sum(repmat(q_new(:,s),[T,1]).*(data_x_index(:)==j).*(data_d(:)==1))/sum(repmat(q_new(:,s),[T,1]).*(data_x_index(:)==j));
end, end
never_reached=isnan(ccp_new(:)); % keep track of states that never occur in the data
ccp_new=max(epsilon, min(1-epsilon,ccp_new)); % Makes sure that all probabilities are in the open interval (0,1), as "1" and "0" observations cause problems.
lik_fun = @(theta) ccp_likelihood_inner(theta,X,S,f,data_x_index,q_new,data_d,ccp_new,beta,T);
[updated_theta,fval] = fminunc(lik_fun,theta_0,optimoptions('fminunc','Display','none'));
% STEP 3: Re-calculate pointwise likelihood with updated thetas, pi
% hat, and ccps, and check convergence
u2 = updated_theta(1) + updated_theta(2)*X' + updated_theta(3)*S;
v1_ccp = repmat(beta*(-log(ccp_new(1,:))) + beta*emc,[length(X),1]);
v2_ccp = u2 + beta*F2(:,:)*(0 - log(ccp_new(:,:))) + beta*emc;
likelihood_pointwise_1=((data_d(:)==2).*exp((v2_ccp(data_x_index(:),1)-v1_ccp(data_x_index(:),1)))+(1-(data_d(:)==2)))./(1+exp(v2_ccp(data_x_index(:),1)-v1_ccp(data_x_index(:),1)));
likelihood_pointwise_2=((data_d(:)==2).*exp((v2_ccp(data_x_index(:),2)-v1_ccp(data_x_index(:),2)))+(1-(data_d(:)==2)))./(1+exp(v2_ccp(data_x_index(:),2)-v1_ccp(data_x_index(:),2)));
temp_val=sum(q_new(:,1).*log(prod(reshape(likelihood_pointwise_1,[N,T]),2)) + q_new(:,2).*log(prod(reshape(likelihood_pointwise_2,[N,T]),2))) + sum(log(pi_hat)*q_new(:,1) + log(1-pi_hat)*q_new(:,2));
stored_lik_vals(iter)=temp_val;
if iter>25 && abs((stored_lik_vals(iter) - stored_lik_vals(iter-25))./stored_lik_vals(iter))<tol, cond=1; end
% STEP 4: Update values and continue to next iteration if cond==0
theta_0 = updated_theta; ccp_hat = ccp_new; iter = iter + 1;
end
if cond==0, disp('Maximum number of iterations reached'), end
toc
Elapsed time is 75.517123 seconds.
UnobservedHetCCP_EstimatesTable = table([updated_theta; pi_hat],[theta'; pi],'VariableNames',{'CCP estimates','True values'},'RowNames',{'Theta_0','Theta_1','Theta_2','Pi'})
UnobservedHetCCP_EstimatesTable = 4×2 table
|         | CCP estimates | True values |
|---------|---------------|-------------|
| Theta_0 | 1.9939        | 2           |
| Theta_1 | -0.14786      | -0.15       |
| Theta_2 | 1.0279        | 1           |
| Pi      | 0.41936       | 0.4         |
Important disclaimer: in the data generation function, we have assigned exogenous state variables in the initial period $t = 1$, so that a bus's initial mileage is independent of its unobserved type. This is unlikely to hold in general. Suppose all buses had started with zero mileage in period 0 but we, the econometricians, only observed the buses starting in period 10. Because the mileage accumulation and replacement decisions will be correlated with the unobserved type s, we will have an initial conditions problem: the first state at which we observe the bus, $x_{n1}$, will be correlated with the unobserved type, and we will need to control for this in order to obtain consistent estimates. There are several ways to account for this; see the supplementary material for Arcidiacono and Miller (2011) for one such implementation.
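To make the point concrete (a sketch of the idea, not the specific implementation in Arcidiacono and Miller (2011)): when the initial state is informative about the type, the E-step weights should be based on type probabilities conditional on the initial state rather than on the unconditional π, e.g.

$$q_n(s) = \frac{\pi(s \mid x_{n1})\, L_n(s)}{\sum_{s'} \pi(s' \mid x_{n1})\, L_n(s')},$$

where $\pi(s \mid x_{n1})$ could be specified flexibly (for example, as a logit in the initial mileage) and estimated jointly with the other parameters.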
Functions
function [ccp_lik] = ccp_likelihood_inner(theta,X,S,x_tran,data_x_index,q,data_d,ccp_hat,beta,T)
emc = double(eulergamma);
u2 = theta(1) + theta(2)*X' + theta(3)*S;
v1_ccp = repmat(beta*(-log(ccp_hat(1,:))) + beta*emc,[length(X),1]);
v2_ccp = u2 + beta*x_tran(:,:)*(0 - log(ccp_hat(:,:))) + beta*emc;
q_use=repmat(q(:,1),[1,T]);
ccp_lik_1 = -sum(q_use(:).*((data_d(:)==2).*(v2_ccp(data_x_index,1)-v1_ccp(data_x_index,1)) - log(1+exp(v2_ccp(data_x_index,1)-v1_ccp(data_x_index,1)))));
ccp_lik_2 = -sum((1-q_use(:)).*((data_d(:)==2).*(v2_ccp(data_x_index,2)-v1_ccp(data_x_index,2)) - log(1+exp(v2_ccp(data_x_index,2)-v1_ccp(data_x_index,2)))));
ccp_lik = ccp_lik_1+ccp_lik_2;
end
The value function iteration and data generation functions below are only used in generating the data, not in estimating the model. We have modified the data generating function slightly compared with previous lecture notes to assign exogenous mileage state variables x to each bus in the first time period observed, rather than letting the model run for a certain number of periods before sampling mileage observations. This allows us to ignore the initial conditions problem in estimating the distribution of s (see disclaimer above).
function [value_function,value_diff,ccps] = value_function_iteration(S,x_tran,X, beta,theta)
x_len = length(unique(X));
emc = double(eulergamma);
value_function_2 = zeros(x_len,length(S));
ccps=zeros(x_len,length(S));
value_diff = 1; iter = 0; max_iter = 10000;   % initialize convergence bookkeeping
while value_diff>1e-5 && iter < max_iter
value_function_1 = value_function_2;
for s=1:length(S)   % iterate over bus types
v1 = repmat(beta*value_function_1(1,s),[x_len,1]);   % replace: mileage resets to the first grid point
v2=theta(1) + theta(2)*X' + theta(3)*s + beta*(x_tran(:,:)*value_function_1(:,s));   % continue
value_function_2(:,s) = log(exp(v1)+exp(v2)) + emc;   % log-sum-exp formula for Type I EV shocks
ccps(:,s)=1./(1+exp(v2-v1));   % probability of replacing
end
value_diff = max(max((value_function_2 - value_function_1).^2));
iter = iter + 1;
end
value_function = value_function_2;
end
function [data_t, data_x, data_s, data_d, data_x_index] = generate_data(N, T, S, x_tran, x_tran_cumul, X, beta,pi,theta)
initial_run=0; % Initial mileage states are assigned exogenously (at random), so no burn-in periods are simulated
x_index = ones(N, T+initial_run); x_index(:,1)=randi(length(X),[N,1]);
x_sim = zeros(N,T+initial_run); x_sim(:,1)=X(x_index(:,1));
s_sim = (rand(N,1) > pi) + 1;   % bus type: 1 with probability pi, 2 otherwise
d_sim = zeros(N, T+initial_run);
draw_ccp = rand(N, T+initial_run);
draw_x = rand(N, T+initial_run);
value_function = value_function_iteration(S,x_tran,X,beta, theta);
u1 = 0;   % flow payoff from replacement is normalized to zero
for n=1:N
for t=1:T+initial_run
v1 = u1 + beta*value_function(1,s_sim(n)); % Reset to zero mileage deterministically if engine is replaced.
u2 = theta(1) + theta(2) * X(x_index(n,t)) + theta(3) * s_sim(n);
v2 = u2 + beta* x_tran(x_index(n,t),:)*value_function(:,s_sim(n));
% Conditional choice probability
p1 = exp(v1)/(exp(v1)+exp(v2));
d_sim(n,t) = (draw_ccp(n,t) > p1) + 1; % 1: replace, 2: continue
if t < T+initial_run
x_index(n,t+1) = (d_sim(n,t)==2) * sum((draw_x(n,t) > x_tran_cumul(x_index(n,t),:))) + 1;
x_sim(n,t+1) = X(x_index(n,t+1));
end
end
end
data_x = x_sim(:, initial_run+1:T+initial_run);
data_d = d_sim(:, initial_run+1:T+initial_run);
data_s = repmat(s_sim, [1,T]);
data_t = repmat(transpose(1:T), [1,N])';
[~, data_x_index] = ismember(data_x,X);
end
References
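Arcidiacono, P. and R. A. Miller (2011), "Conditional Choice Probability Estimation of Dynamic Discrete Choice Models with Unobserved Heterogeneity," Econometrica, 79(6), 1823-1867.
Dempster, A. P., N. M. Laird, and D. B. Rubin (1977), "Maximum Likelihood from Incomplete Data via the EM Algorithm," Journal of the Royal Statistical Society, Series B, 39(1), 1-38.
Hotz, V. J. and R. A. Miller (1993), "Conditional Choice Probabilities and the Estimation of Dynamic Models," Review of Economic Studies, 60(3), 497-529.
Rust, J. (1987), "Optimal Replacement of GMC Bus Engines: An Empirical Model of Harold Zurcher," Econometrica, 55(5), 999-1033.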