Financial Institutions Commission

March 6, 2015
File No.: 45004
Ref. No.: 0877

To: BC Credit Unions

Re: Research Report on Residential Mortgage Default Probability

This letter announces a new research report by the Financial Institutions Commission (FICOM) on models and methods for analyzing the probability of default of residential mortgages. FICOM has conducted a review of academic literature to identify some of the theories and models for estimating the probability of default for credit union personal real estate secured loans. Please find the report summarizing the results at 1m vov Ca/illde fid/ ublicallonsr

We anticipate that the information presented will be of assistance to credit unions in developing more effective credit risk models that can potentially be used for stress testing and other risk management activities. For any questions or concerns regarding the report, please contact me at mehrdad.rastan@ficombc.ca.

Yours truly,

Mehrdad Rastan
Executive Director, Risk Surveillance Analytics

Superintendent of Financial Institutions
Superintendent of Pensions
Superintendent of Real Estate
Registrar of Mortgage Brokers
2800-555 West Hastings, Vancouver, BC V6B 4N6
Telephone: 604 660 555
Facsimile: 604 660 4365

Research Paper
October 2014

Residential Mortgage Probability of Default: Models and Methods

by Mingxin Li
Risk Surveillance and Analytics
Financial Institutions Commission

About this report

Mingxin Li is a PhD candidate in the Beedie School of Business at Simon Fraser University. The research was completed under the supervision of Financial Institutions Commission staff. The views expressed in this paper are those of the author. No responsibility for them should be attributed to the Financial Institutions Commission.

Acknowledgements

I would like to take this opportunity to acknowledge the following individuals. I thank Dr. Evan Gatev and Dr. Christina Atanasova for reviewing the paper and providing many helpful comments.
I would like to thank Mr. Mehrdad Rastan, Mr. Gilbert Yuen, Mr. Peter Lee, and Mr. Jack Ni for the intensive discussion and valuable feedback during the development of this project. I would also like to thank Ms. Angel Chen for proofreading and editing this research paper.

Table of Contents

Executive Summary
I. Introduction
II. Evaluating Mortgage Default Risk in the Early Days
III. Models for Default Risk of an Individual Loan
    Model 1: linear regression analysis on default risk
    Model 2: logistic model
    Model 3: survival analysis
    Model 4: optimization model
IV. Models for Default Probability of a Loan Portfolio
    Model 5: linear regression analysis on default rates
    Model 6: linear regression analysis on log odds
V.
Default Determinants Implied from Economic Theories
    Explaining default in the early days
    Competing theories of default behavior
    Option-based theory of default behavior
    Macroeconomic factors
VI. Issue of Model Stability
VII. Conclusion
Appendix 1.a: Overview of Models
Appendix 1.b: Loan-level Model versus Portfolio-level Model
Appendix 2: Determinants of Residential Mortgage Default Risk
References

Executive Summary

Stress testing is the investigation of an entity's performance under abnormal circumstances. Financial institutions should conduct stress tests to gauge the resilience of their balance sheets to substantial macroeconomic shocks. One way to measure the performance of a financial institution is by assessing the institution's loan portfolio loss under stressed scenarios. The first step in assessing loan loss is to estimate the probability of default (PD). Understanding PD is necessary for the purpose of stress testing and risk management.
Financial institutions may also find PD modeling beneficial, as insights from default modeling can guide improvements in underwriting practice and competitive mortgage pricing.

This paper serves as rigorous background research on PD. We draw upon the academic literature on residential mortgage default and research papers on stress testing published by other regulatory bodies, and pull together six models (five statistical models and one economic model) that can be used to generate quantitative assessments of PD. We also comb through the development of economic theories aimed at explaining default behavior. The economic theories provide the basis for selecting default determinants, which in turn are used as inputs in statistical models to predict PD. This paper sheds light on what drives default and how to model the probability of default for residential mortgages and mortgage portfolios. Our goal is to present available methods for modeling PD, rather than to recommend specific models or default determinants for financial institutions to use. In choosing a model, FICOM and the credit unions should assess the suitability of the model in light of specific business requirements. Further research into the model may be required for seamless execution.

I. Introduction

Although default rates on residential mortgages in BC have been relatively low in the past, credit unions should still be concerned about mortgage default for several reasons. First, residential mortgages make up a large portion of the asset portfolios of BC credit unions: almost 68 per cent of BC credit unions' total loans are personal real estate backed assets. Second, home mortgages represent a large share of outstanding household debt. As of the second quarter of 2014, mortgages account for 47 per cent of total consumer debt in BC. 1 Default is costly to everyone involved.
The lender and the insuring institution incur costs when the net cash recouped from foreclosure is less than the remaining balance of the defaulted mortgage. In the extreme case, systemic defaults may impair the soundness of lending institutions. Default is also costly to the borrower: examples include the loss of a home, a lower credit rating, an impaired ability to acquire financing, and even mental distress. In addition, default risk is of particular concern given continuously climbing housing prices in the Greater Vancouver area. When the US last experienced a housing price run-up, what followed was a disastrous crash, the effects of which still persist today. Acknowledging the differences between the real estate and mortgage markets of BC and those of the US, we do not attempt to make predictions about the housing market in BC; rather, we emphasize the importance of understanding the risk of mortgage default, as real estate backed loans play a key role in our financial system. Understanding mortgage default risk will not only provide guidance for designing stress testing scenarios but also help improve underwriting practices and enhance the pricing of mortgage products.

The goal of this paper is to provide an overview of alternative methods that can be applied to answer the question: how should lending institutions assess the default probability on a pool of mortgage loans? Section II briefly discusses how default risk was assessed in the early days and why that is insufficient for understanding default risk today. Sections III and IV then describe six models that can be used to estimate default probability given certain factors; Appendix 1 offers an overview. The models are introduced in the order in which they were first applied in studies of residential mortgage default.
Adoption of later models was often spurred by some inadequacy of earlier ones in answering the question of interest, or was inspired by new developments in statistical methods and computing capabilities. Models 1, 2, 3, and 4 are for individual loans; Models 5 and 6 are for loan portfolios. Model 1 uses a linear probability function to model default risk; it is simple and robust in discriminating loans based on a predicted default risk index, but it does not provide a number for the default probability. Model 2 overcomes this shortfall and uses a logistic function to model default probability. Model 3 applies a time-to-event method to model the length of time before a mortgage terminates. Model 4 departs from these regression-type models; instead, for every possible outcome for house prices and interest rates over a period of time, it simulates a borrower's decision over three choices: continuing with the current mortgage, defaulting, or prepaying the current mortgage. Models 5 and 6 view a mortgage portfolio as a whole and analyze the default rate of the portfolio.

Sections III and IV do not discuss (except for Model 4) the factors that one would input into the models. These factors are macroeconomic measures and loan- and borrower-specific characteristics that potentially drive default behavior; they are sometimes referred to as default determinants. The models are flexible in the factors they accept as inputs, and it is up to the users to choose the factors. Section V discusses these factors as suggested by economic theories, and Appendix 2 presents a summary of default determinants. Finally, section VI discusses the issue of model stability, and section VII concludes the paper.

Models and methods discussed hereafter draw upon studies done in the past by researchers.

1 Information on BC credit unions' asset mix and total household debt distribution is from the FICOM DTI Q2 2014 report.
A list of references is provided at the end for further investigation.

II. Evaluating Mortgage Default Risk in the Early Days

Prior to the 1980s, the evaluation of mortgage default risk was largely established on rules of thumb and risk ratings based on experience ([34]). Mortgage applications were scored or rated on a grid given borrower-, loan-, and property-related criteria. Four ratios were employed back then and are still in use today: the loan to value ratio, the monthly mortgage payment to gross income ratio, the total debt obligation to gross income ratio, and the house value to gross income ratio. These ratio analyses and risk ratings specify some indicators of default risk; however, they are insufficient in two main ways. Firstly, they look at the likelihood of default during the lifetime of a mortgage but do not deal with the timing of default. As shown by researchers, marginal probabilities of default display a rising-then-falling pattern over time. 2 Secondly, the risk ratings do not provide quantitative assessments of the likelihood of default. This shortcoming is twofold. A rating or score of, say, 1 out of 10 may indicate that the mortgage is likely to default, but it does not tell how likely it is to default (i.e., whether there is a 90 per cent or a 60 per cent probability of default). Also, these risk ratings do not estimate the degree of impact each criterion has on the likelihood of default: a differential in rating indicates that one mortgage is more or less likely to default than another, but offers no insight into how much more or less likely.

2 Von Furstenberg ([34]) is the first to reveal this pattern. For his loan sample, default rates peak around 3 to 4 years after origination and subsequently fall, becoming negligible after half the term of a mortgage has passed.

III.
Models for Default Risk of an Individual Loan

This section outlines four default risk models that take individual mortgages as the subject of study. Models 1, 2, and 3 are statistical models that predict default risk by estimating relationships between default risk and default determinants. Model 4 is an economic model based on optimization, which estimates default risk by describing a borrower's behavior under certain economic forces. A description is provided for each model, followed by the model implementation with data structure examples; each model is then compared to earlier ones to show its advantages and disadvantages.

Model 1: Linear regression analysis on default risk

Description

Regression analysis looks for relationships between default risk and an array of variables that may have an impact on default behavior. Default risk is treated as a dependent variable, which can be explained by some independent or explanatory variables. The relationship between default risk and its explanatory factors is assumed to be linear. A common formulation 3 is

    default risk = α + β1 X1 + β2 X2 + ⋯ + βk Xk + ε    (1)

where X1, X2, …, Xk are explanatory variables, factors or predictors that may help determine default risk; α is a constant; β1, β2, …, βk are coefficients that capture the impact that each factor may have on default risk; and ε is an error term, which is assumed to be independent and is sometimes in addition assumed to be normally distributed.

3 See Quercia and Stegman ([29]) for a list of studies.

Default risk here is not measured by the probability of default, as a loan is either in default or not in default. One does not observe a "probability" for a single loan; rather, the status of the loan is observed. Loan status is used as a proxy for default risk. If a mortgage is in good standing, the default risk measure takes a value of zero; if a mortgage is in default (either in delinquency or foreclosure), the default risk measure takes a value of one.

Explanatory variables, the X's, are any factors that may affect the default risk of a mortgage. These factors can be macroeconomic, loan specific, or borrower-, lender-, or property-related. They are derived from economic reasoning as well as empirical observations. In the simplest specification, default risk is assumed to have a linear relationship with the factors. Factors may be transformed before entering the regression equation. We discuss the selection of explanatory variables later.

Implementation

One can observe the performance status of a sample of loans and conduct regression analysis. There are two ways to do this: 1) a cross-sectional dataset is obtained if a sample is observed at one point in time; 2) a panel dataset is obtained if a sample is observed at multiple points in time. If data is prepared as a snapshot of a loan profile at one point in time, the regression is cross-sectional. Figure 1 gives an example of cross-sectional loan data.

Figure 1. Cross-section data on individual mortgages: data structure example

Loan ID | Loan Status | X1: loan-to-value | X2: term of mortgage | X3: borrower occupation
1       | 0           | 80%               | 20                   | 3
2       | 0           | 85%               | 25                   | 4
3       | 1           | 90%               | 25                   | 2
…       | …           | …                 | …                    | …

Fitting the model to data yields estimates of the coefficients, the β's, in equation (1). The coefficients estimate the impact of each factor on default risk, that is, by how much default risk changes when a factor changes by a particular amount. Alternatively, the estimation may suggest that a factor does not have a significant impact on default risk. Using estimated coefficients and given values of explanatory variables, we can compute the predicted default risk for a particular mortgage from equation (1).

If data is prepared such that there are multiple mortgages in the sample and each mortgage is observed at multiple points in time, one has a panel dataset.
Estimation of the model then follows panel regression techniques. An example of panel loan data is shown in Figure 2.

Figure 2. Panel data on individual mortgages: data structure example

Loan ID | Date | Loan Status | X1: loan-to-value | X2: term of mortgage | X3: borrower occupation | X4: GDP growth
1       | 2005 | 0 | 80% | 20 | 3 | 1.5%
1       | 2006 | 1 | 85% | 20 | 3 | 1.2%
2       | 2005 | 0 | 85% | 25 | 4 | 1.5%
2       | 2006 | 0 | 84% | 25 | 4 | 1.2%
2       | 2007 | 0 | 80% | 25 | 4 | 1.3%
2       | 2008 | 0 | 83% | 25 | 4 | 1.0%
3       | 2005 | 1 | 90% | 25 | 2 | 1.5%
…       | …    | … | …   | …  | … | …

Advantage and disadvantage

The linear regression model is easy to implement, and the interpretation of the output is straightforward. Equation (1) can have good discriminating power and can be used to rank mortgages by estimated default risk: lower output values indicate lower default risk and higher output values indicate higher default risk. However, the model has several problems. When default risk is measured by loan status, it only assumes a value of either zero or one. From equation (1), one can see that with a dichotomous dependent variable, the error term ε is dichotomous as well, which is inconsistent with the model assumption of normally distributed errors.

Predictions from a linear probability function may also be difficult to interpret. In order to have a probability interpretation, the output of the estimated equation should be a number between zero and one, even when particular values are assigned to the explanatory variables. For example, when designing stress scenarios, one may set the house price index at a stressed level to estimate the resulting default probability. If the output of equation (1) is negative or above one for some set of factors, then one cannot interpret the estimated default risk as a probability of default. So the output of the model may be viewed as a default risk index rather than a default probability of a mortgage. The model does not answer the question of interest: what is the probability of default given values of the explanatory variables?
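As a concrete illustration of fitting equation (1), the sketch below estimates a linear probability model by ordinary least squares on a handful of made-up loans. All numbers and variable names are hypothetical; this is a minimal sketch, not a recommended implementation.

```python
import numpy as np

# Hypothetical cross-sectional sample mirroring Figure 1:
# loan status (0 = performing, 1 = default), loan-to-value, term in years.
status = np.array([0, 0, 1, 0, 1, 0], dtype=float)
ltv    = np.array([0.80, 0.85, 0.90, 0.75, 0.95, 0.70])
term   = np.array([20, 25, 25, 30, 25, 20], dtype=float)

# Design matrix with a constant column for the intercept alpha.
X = np.column_stack([np.ones_like(ltv), ltv, term])

# Ordinary least squares estimates of (alpha, beta_1, beta_2) in equation (1).
coef, *_ = np.linalg.lstsq(X, status, rcond=None)

# Predicted default risk index for a new mortgage (88% LTV, 25-year term).
risk_index = coef @ np.array([1.0, 0.88, 25.0])

# Note: risk_index can fall outside [0, 1], which is why the text treats it
# as a ranking index rather than a probability.
```

The output is useful for ranking loans by risk, but, as the text notes, it is not guaranteed to be a valid probability.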
Model 2: Logistic model

Description

The performance status of a mortgage loan is often described as current, 30-, 60-, or 90-day delinquent, foreclosed, refinanced, et cetera. In statistical analysis, this information is qualitative data and is represented using categorical indicators. 4 A logistic model is particularly suitable for empirical studies with qualitative data. Consider the loan status, a binary variable which takes a value of either zero (for mortgages that are performing) or one (for non-performing mortgages). A logistic model formulates the probability of a loan being non-performing as a logistic function of some combination of explanatory variables: 5

    P(loan status = 1) = 1 / (1 + e^−(α + β1 X1 + β2 X2 + ⋯ + βk Xk))    (2)

where P(loan status = 1) is the probability of a mortgage being non-performing. 6 Equation (2) can be seen as a transformation of equation (1): a positive monotone transformation that maps the linear probability predictor into the unit interval. Such a transformation retains the linear structure of the model while ensuring the estimated output stays between zero and one.

Implementation

Suppose that the one-year default probability is desired and one draws a loan sample in 2010. All loans that are outstanding at the beginning of 2010 enter the sample, and one observes the loan status at the end of 2010. An example of such loan data looks like Figure 1. The model is estimated using likelihood techniques, and goodness-of-fit tests can be conducted to assess whether or not the model fits the data on hand. Logit coefficients, the β's, estimate the impact of a unit change in the factors on the natural logarithm of odds. Odds have the intuitive meaning of π/(1 − π), where π is the probability of a mortgage being non-performing.
For example, the odds of a loan being in default are the probability of default versus the probability of non-default. A simple conversion gives the impact of factors on the default probability π. Using estimated coefficients and given values of explanatory variables, one can compute the predicted probability of default for a particular mortgage from equation (2).

The predicted probability can also be used to classify mortgages. For example, one may choose a cut-off value, say 0.5, and classify mortgages with predicted probability above 0.5 into a group of predicted default loans and mortgages with predicted probability below 0.5 into a group of predicted performing loans.

If one has a separate record for each time period in which each mortgage is observed, the data structure is similar to Figure 2, and panel regression techniques apply.

Sometimes one may have a finer categorization of mortgages than just "performing" and "non-performing". Consider a loan sample consisting of three groups of mortgages based on their performance status: loan status equals 1 for mortgages that are performing, 2 for mortgages that default, and 3 for mortgages that are prepaid. In this case, one would use a multinomial logistic model.

4 For example, if mortgages in a portfolio are either current or non-current, one may use a value of zero for mortgages that are current and one for mortgages that are non-current. If mortgages in a portfolio are current, delinquent, or foreclosed, one may use a value of one for mortgages that are current, two for delinquency, and three for foreclosure.

5 See Quercia and Stegman ([29]) for a list of studies.

6 McFadden ([25]) shows that the logistic function is an appropriate representation of consumer choice behavior under reasonable assumptions. In this application, it is the borrower's choice of continuing to service the current mortgage, becoming delinquent, defaulting, or prepaying.
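The binary-logit prediction and cut-off classification described above can be sketched as follows. The coefficient values are made up purely for illustration; in practice they would come from likelihood estimation on a loan sample.

```python
import math

# Hypothetical fitted coefficients for equation (2):
# intercept, loan-to-value effect, and term effect (all made-up numbers).
alpha, b_ltv, b_term = -6.0, 5.0, 0.05

def pd_logit(ltv, term):
    """Probability that loan status = 1 (non-performing) under equation (2)."""
    z = alpha + b_ltv * ltv + b_term * term
    return 1.0 / (1.0 + math.exp(-z))

def classify(ltv, term, cutoff=0.5):
    """Assign a loan to the predicted-default group if its PD exceeds the cutoff."""
    return 1 if pd_logit(ltv, term) > cutoff else 0

p = pd_logit(0.90, 25)   # PD for a 90% LTV, 25-year mortgage
odds = p / (1.0 - p)     # odds of default versus non-default, pi / (1 - pi)
```

Note that the odds equal e^z, so each coefficient has the log-odds interpretation given in the text.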
Figure 3 is an example of multinomial data with the loan sample observed at one point in time.

Figure 3. Cross-section data on individual loans: data structure example

Loan ID | Loan Status | X1: loan-to-value | X2: term of mortgage | X3: borrower occupation
1       | 3           | 90%               | 20                   | 3
2       | 2           | 85%               | 25                   | 4
3       | 1           | 80%               | 25                   | 2
…       | …           | …                 | …                    | …

Estimation of a multinomial logistic model takes one group as the base group and identifies coefficients for the rest of the groups. For example, if one uses performing mortgages (group 1) as the base group, then the model estimates two sets of coefficients, one for defaulted mortgages (group 2) and one for prepaid mortgages (group 3). The natural logarithm of the odds of a mortgage falling in group i versus the base group is

    ln[ P(loan status = i) / P(loan status = 1) ] = β0^i + β1^i X1 + ⋯ + βk^i Xk    (3)

where group i = 2 or 3, and the β^i's are coefficients quantifying the impact of factors on a mortgage falling in group i versus the base group. For example, a β1^2 of 0.5 is interpreted such that a one unit increase in X1 results in a 0.5 increase in the natural logarithm of the odds that the loan falls into group 2 versus group 1; equivalently, the odds of falling into group 2 versus group 1 increase by a factor of e^0.5 as a result of a one unit increase in X1. Coefficients for the base group, the β^1's, are set to zero for the purpose of estimating the model. From equation (3), one can derive the probability of a mortgage falling in group i to be

    P(loan status = i) = e^(β0^i + β1^i X1 + ⋯ + βk^i Xk) / Σj e^(β0^j + β1^j X1 + ⋯ + βk^j Xk)    (4)

where group i = 1, 2, or 3. For a particular mortgage with a given set of explanatory variables, equation (4) computes the predicted probability of the mortgage falling in group i. Any of the three groups can be used as the base group. Coefficient estimates differ depending on the choice of the base group; however, predicted probabilities are the same regardless of the choice. For classification, a mortgage is assigned to the group with the largest predicted probability.
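The group probabilities in equation (4) can be computed directly once coefficients are in hand. The sketch below uses made-up coefficient vectors with a single explanatory variable (loan-to-value); the base group's coefficients are fixed at zero, as the text describes.

```python
import math

# Hypothetical coefficients (intercept, LTV effect) for the three groups.
# Group 1 (performing) is the base group, so its coefficients are zero.
BETAS = {1: (0.0, 0.0),    # performing (base group)
         2: (-4.0, 3.0),   # default
         3: (-2.0, 1.0)}   # prepaid

def group_probabilities(ltv):
    """Equation (4): probability of the mortgage falling in each group."""
    scores = {g: math.exp(b0 + b1 * ltv) for g, (b0, b1) in BETAS.items()}
    total = sum(scores.values())
    return {g: s / total for g, s in scores.items()}

def classify(ltv):
    """Assign the mortgage to the group with the largest predicted probability."""
    probs = group_probabilities(ltv)
    return max(probs, key=probs.get)

probs = group_probabilities(0.85)
# The probabilities sum to one by construction.
```

Because the softmax in equation (4) normalizes the scores, the predicted probabilities would be unchanged if a different group were chosen as the base and the coefficients re-estimated accordingly.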
For example, fitting the model with data one can estimate β^2 and β^3, with β^1 for the base group set to 0. Then for a mortgage with given values of the X's, equation (4) might predict a 70 per cent probability of falling in the performing group, a 20 per cent probability of falling in the default group, and a 10 per cent probability of falling in the prepayment group; one would then classify this mortgage into the group of predicted performing loans.

Figure 4 is an example of a panel dataset in which a particular mortgage has multiple records from multiple points in time. Panel regression techniques apply.

Figure 4. Panel data on individual loans: data structure example

Loan ID | Date | Loan Status | X1: loan-to-value | X2: term of mortgage | X3: borrower occupation | X4: GDP growth
1       | 2005 | 1 | 90% | 20 | 3 | 1.5%
1       | 2006 | 3 | 85% | 20 | 3 | 1.2%
2       | 2005 | 1 | 85% | 25 | 4 | 1.5%
2       | 2006 | 1 | 84% | 25 | 4 | 1.2%
2       | 2007 | 2 | 86% | 25 | 4 | 1.3%
3       | 2005 | 1 | 80% | 25 | 2 | 1.5%
3       | 2006 | 1 | 76% | 25 | 2 | 1.2%
3       | 2007 | 1 | 75% | 25 | 2 | 1.3%
3       | 2008 | 1 | 77% | 25 | 2 | 1.0%
…       | …    | … | …   | …  | … | …

Advantage and disadvantage

If one is only concerned with the significance of the relationship between loan status and explanatory factors, the linear regression model and the logistic model may yield similar results. If the goal is to estimate the probability of an event, such as mortgage default, then the logistic model is better. It overcomes the problems of a linear regression model in analysing categorical data and fits the observed loan status better than a linear regression model. Coefficient estimates under the logistic model are efficient and well behaved even when the sample size is relatively small. 7 Equations (2) and (4) can be used to predict probabilities of default for mortgages. The outputs from the model fall within a sensible range between zero and one, and default probability predictions are produced on a loan-by-loan basis. 8

One caveat is that the logistic function may not fit a particular dataset.
If the probability of default is not monotonic in relation to an explanatory variable, then a logistic regression will not fit the data well. For example, von Furstenberg ([36]) reveals a single-peaked pattern for the term structure of mortgage default, where the term structure is the relationship between the default rate and the mortgage age: on average, default rates increase and peak a few years after origination and subsequently decrease until they become negligible. Here the default probability is not a monotonic function of the mortgage age. To accommodate this, one can include both the mortgage age and the squared mortgage age as explanatory variables. Alternatively, one can use multiple age categories and dummy variables to account for the non-linear relationship between age-of-mortgage and default probability. 9

Another consideration before using a multinomial logistic model is that the model relies on an assumption of independence of irrelevant alternatives, which says that the odds of one group relative to another are unaffected by the presence or absence of a third group. In our example, this is to assume that 1) the possibility of refinancing is irrelevant to how likely a mortgage would be in default rather than performing; 2) the possibility of default is irrelevant to how likely a mortgage would be refinanced rather than performing; and 3) the possibility of performing is irrelevant to how likely a mortgage would be refinanced rather than in default. Whether or not this assumption holds for our application is debatable, which is a limitation. However, even if the assumption were violated, the multinomial logistic model may still be more effective than alternative models that do not rely on this assumption.

7 Another advantage of the logistic model arises from the consideration of sampling schemes: whether sampling is prospective or retrospective. The logistic model specifies a functional form for the odds of one outcome relative to another, instead of for probabilities directly, and odds are identical regardless of the sampling scheme.

8 The prediction resulting from the model is of the form: loan #1 has a 2% probability of default, loan #2 has a 1% probability of default, and so on. It is not of the form: the default rate of the portfolio is 3%. The latter is dealt with in Model 6.

9 An APRA working paper by Coleman et al. ([8]) uses multiple age categories (dummy variables) to account for the non-linear relationship between age-of-mortgage and default probability.

Model 3: Survival analysis

Description

Survival analysis is a modelling technique for time-to-event data or duration data. Consider the life course of a mortgage. At each point in time, the mortgage may occupy one of a number of mutually exclusive states, such as performing, default, and prepayment. With the passage of time, the mortgage moves between these states or remains static. Typically the mortgage starts in the performing state and later either stays there or moves into default or prepayment. Survival analysis is a tool to study the length of time the mortgage spends in the performing state, in other words, how long the mortgage survives before it defaults or prepays. We seek the relationship between mortgage status and the passage of time, along with other explanatory variables.
A common formulation for survival analysis 10 is

    h(t) = h0(t) e^(β1 X1 + β2 X2 + ⋯ + βk Xk)    (5)

where h(t) is the hazard rate, or the conditional probability that a mortgage survives until time t but fails during the next time interval; 11 time t represents the age of the mortgage; h0(t) is the baseline hazard, which captures the shape of the hazard function and summarizes how the probability of mortgage termination (either default or prepayment) changes over time; X1, X2, …, Xk are explanatory variables that also influence the risk of mortgage termination; and β1, β2, …, βk are coefficients that measure the impacts of the explanatory variables on the hazard rate.

10 Refer to Survival Analysis by Stephen P. Jenkins.

11 This interpretation is appropriate for a discrete time hazard rate. In continuous time, h(t)Δt has a similar interpretation. In this context, survival of a mortgage means that it stays in the performing state; failure of a mortgage means that it exits the performing state and moves into either default or prepayment.

Implementation

Before conducting survival analysis, one first organizes data into a loan-period format. One needs an event indicator (loan status) and time variables that can be used to imply the duration of time before a mortgage moves out of the performing state. Figure 5 is an example of the data structure where the loan sample is observed at one point in time.

Figure 5. Duration data on individual loans: data structure example

Loan ID | Origination date | Event date | Event type | X1: initial loan-to-value | X2: term of mortgage | X3: borrower occupation
1       | 2005Q1 | 2009Q3 | 2 | 90% | 20 | 3
2       | 2003Q3 | 2009Q4 | 1 | 85% | 25 | 4
3       | 2001Q2 | 2010Q1 | 0 | 80% | 25 | 2
…       | …      | …      | … | …   | …  | …

In this example, the sample period ends in 2010 Q1. At this time loan #3 is still in the performing state; the "Event type" indicator for loan #3 is 0, and its "Event date" is the same as the end of the sample period.
Loan #1 prepays in 2009 Q3, with an "Event type" indicator equal to 2; loan #2 defaults in 2009 Q4, with an "Event type" indicator equal to 1.

Now suppose that one draws a sample period from 2009 Q2 to 2010 Q1 and observes the loan sample every quarter. Figure 6 gives an example of the data structure with time-varying explanatory variables.

Figure 6. Duration data on individual loans: data structure example

Loan ID | Origination date | Event date | Event type | X1: current loan-to-value | X2: term of mortgage | X3: borrower occupation | X4: GDP growth
1       | 2005Q1 | 2009Q2 | 0 | 80% | 20 | 3 | 1.5%
1       | 2005Q1 | 2009Q3 | 2 | 85% | 20 | 3 | 1.2%
2       | 2003Q3 | 2009Q2 | 0 | 85% | 25 | 4 | 1.5%
2       | 2003Q3 | 2009Q3 | 0 | 83% | 25 | 4 | 1.2%
2       | 2003Q3 | 2009Q4 | 1 | 84% | 25 | 4 | 1.3%
3       | 2001Q2 | 2009Q2 | 0 | 61% | 25 | 2 | 1.5%
3       | 2001Q2 | 2009Q3 | 0 | 63% | 25 | 2 | 1.2%
3       | 2001Q2 | 2009Q4 | 0 | 62% | 25 | 2 | 1.3%
3       | 2001Q2 | 2010Q1 | 0 | 60% | 25 | 2 | 1.1%
…       | …      | …      | … | …   | …  | … | …

The data structure above reflects the competing nature of two types of mortgage termination risk: the risk of default and the risk of prepayment. The lender of an outstanding mortgage faces both types of termination risk, and the mortgage terminates once one of the two risks is realized. The two risks are jointly present; however, their realizations are mutually exclusive: a mortgage that prepays will not default, whereas a mortgage that defaults will not prepay. Thus default and prepayment are "competing" risks. For this reason, a complete model of mortgage termination risk should simultaneously consider both default risk and prepayment risk. 12 The most current method for mortgage termination under this framework is a competing risks hazard model, illustrated by Deng, Quigley and Van Order ([12]). The competing risks hazard model is a unified model that analyzes the joint choices of default and prepayment, and estimates the influence of factors on the default decision as well as the prepayment decision.
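Setting aside the competing-risks extension, the mechanics of equation (5) can be sketched with made-up numbers. The baseline hazard values and coefficients below are hypothetical; a discrete-time interpretation is used so each hazard is a per-period probability.

```python
import math

# Hypothetical discrete-time baseline hazard h0(t) by mortgage age in years,
# showing the rising-then-falling pattern discussed in section II.
BASELINE = {1: 0.002, 2: 0.006, 3: 0.010, 4: 0.009, 5: 0.005}

# Made-up coefficients for current loan-to-value and GDP growth.
B_LTV, B_GDP = 2.0, -20.0

def hazard(age, ltv, gdp_growth):
    """Equation (5): conditional termination probability at mortgage age t."""
    return BASELINE[age] * math.exp(B_LTV * ltv + B_GDP * gdp_growth)

# Hazard at age 3 for an 80% LTV loan when GDP growth is 1.5%.
h = hazard(3, 0.80, 0.015)
```

The proportional structure is visible here: changing a covariate scales the whole baseline hazard by a constant factor, while the baseline h0(t) alone carries the age profile.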
The estimation of the model uses likelihood techniques, which find the values of the coefficients, β's, that maximize the probability of observing the data at hand. Assumptions on the functional form of the baseline hazard, h0(t), are not required to estimate the coefficients. Cox regression fits the model to the data and estimates equation (5) by maximizing the partial likelihood function derived from the equation; the baseline hazard is common to all mortgages and its contributions cancel out in the partial likelihood expression, so when the estimation process maximizes the partial likelihood, the baseline hazard does not make a difference and only the coefficients, β's, are estimated.13 Using the estimated coefficients and the empirical baseline hazard, one can compute conditional default probabilities from equation (5) for a particular mortgage with given values of the explanatory variables.

The likelihood estimation techniques allow one to control for unobserved variables. Equation (5) implies that two mortgages with the same age, t, and the same explanatory characteristics, X's, have identical default risks. For simplicity, consider only two factors influencing default: the loan-to-value ratio (LTV) and borrower occupation. Suppose two mortgages are both one year old with an LTV of 70 per cent, and both borrowers are in the same occupational category. Equation (5) would predict an identical probability of default for these two mortgages. The problem is that the two borrowers most likely differ in ways that are not captured by occupation; they may differ in consumer behavior, habits, the ability to draw on external financial resources, et cetera. These differences are unobserved or unmeasured, but they do play a role in the borrowers' default decisions. To deal with this issue, one can assume that the mortgages in the sample belong to some number of groups, where mortgages in each group are similar in terms of the unobserved characteristics. Mortgages are not pre-assigned to groups. The likelihood estimation process will generate coefficient estimates taking into account the presence of the unobserved characteristics. One can start with two groups and subsequently increase the number of groups until the model's performance no longer improves. Incorporating unobserved characteristics enhances the estimation results. This paper does not discuss the details of the estimation process; Buis ([4]) explains how the likelihood estimation process incorporates unobserved characteristics.14

12 Using option theory, we can consider the prepayment option as a call option and the default option as a put option. The borrower of an outstanding mortgage holds both options. Once one of the two options is exercised, the mortgage is terminated and the other option is foregone. That is to say, when the borrower decides to exercise one option, he/she bears in mind the value of the other option. Hence a model of default risk should also address the presence of prepayment risk.
13 Refer to Survival Analysis by Stephen P. Jenkins and Buis ([4]) for more explanation.

Advantage and disadvantage

The survival analysis method is well accepted in studies of default probability because it matches the life course of a mortgage and its termination process. The estimated output provides forecasts of default probabilities as a function of time (the mortgage age) and other default determinants. It models both the probability of default and the time dependence of that probability. This is an advantage over the logistic model: in a logistic model, the predicted probability has a fixed time horizon, and to obtain a prediction for a different time horizon one needs to revise the loan sample and repeat the estimation process. Survival analysis can estimate default risk for any time horizon.
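To make the partial-likelihood idea concrete, the sketch below evaluates the Cox negative log partial likelihood for a single covariate on a toy sample and locates the coefficient by a crude grid search. The data, the single-covariate restriction, and the grid search are all illustrative simplifications; a real implementation would use Newton–Raphson (or a statistical package) and handle tied event times.

```python
import math

# Toy sample of (duration, event, x): event = 1 is default, 0 is censored;
# x is a covariate such as initial LTV. All values are illustrative.
sample = [
    (2, 1, 0.95), (3, 1, 0.90), (5, 1, 0.85),
    (6, 0, 0.70), (8, 1, 0.60), (10, 0, 0.80),
]

def neg_log_partial_likelihood(beta, data):
    """Cox negative log partial likelihood; the baseline hazard cancels out,
    so only the coefficient beta appears."""
    nll = 0.0
    for t_i, event, x_i in data:
        if not event:
            continue  # censored observations contribute only through risk sets
        # risk set: everyone still at risk (duration >= this event time)
        risk_set = [x for t, _, x in data if t >= t_i]
        nll -= beta * x_i - math.log(sum(math.exp(beta * x) for x in risk_set))
    return nll

# Crude one-dimensional grid search over beta (illustration only).
betas = [b / 10 for b in range(-200, 201)]
beta_hat = min(betas, key=lambda b: neg_log_partial_likelihood(b, sample))
```

Note how the censored loans never contribute an event term of their own, yet they still appear in the risk sets of earlier event times, which is exactly how survival analysis uses censored data.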
Also, survival analysis handles censored data, an issue not addressed by the logistic model.15 Another advantage of survival analysis is its versatility, because assumptions on the functional form of the baseline hazard are not required to estimate the model. To predict the default probability for a particular mortgage over time, the model uses the empirical baseline hazard along with the estimated coefficients and given values of the explanatory variables.

14 Logistic and multinomial logistic models with a panel dataset may also be estimated with fixed effects, which control for time-invariant, borrower-specific unobserved variables.
15 A time-to-event (survival time) is censored if we only know that the observation either entered or exited within the sample period, and the total length of survival time is not known exactly. For example, a mortgage that is still outstanding at the end of the sample period is censored because we only know that it has not defaulted yet; we do not know whether it will mature without default. As another example, a mortgage may exit the sample during the sample period because the lender sold the mortgage, so the performance status of this mortgage is no longer observed. For the former example, the logistic model implicitly ignores the issue; for the latter, the logistic model abandons the observation due to missing data. Survival analysis incorporates censored data in its estimation process.

Model 4: optimization model

Description

An optimization model of default attempts to capture the core structure of the economic dynamics surrounding the default process. This type of model assumes that a borrower makes mortgage payment decisions with the objective of maximizing wealth and utility or minimizing housing-related costs. At any point in time, the vector of choices available to the borrower normally includes: 1) make the scheduled mortgage payment and continue with the current mortgage, 2) prepay the current mortgage, or 3) default on the current mortgage.16
There are various wealth effects associated with each of the choices. A borrower compares these wealth effects and chooses to default if doing so meets his/her objective better than the other alternatives. Capozza, Kazarian, and Thomson ([7]) provide an example of a dynamic optimization model for estimating residential mortgage default behavior. Consider a time line divided into monthly intervals.17 At each time interval, a borrower makes a decision about the mortgage payment and chooses the least costly action: to default, to refinance, or to continue with the current mortgage. At each interval, the borrower's choice can be written as:

Pt(Ht, rt) = min[ Pt^d(Ht, rt), Pt^r(Ht, rt), Pt^w(Ht, rt) ]    (6)

where Pt is the borrower's housing cost at time t; Pt^d is the housing cost if the borrower chooses to default; Pt^r is the housing cost if the borrower chooses to refinance; Pt^w is the housing cost if the borrower chooses to continue with the current mortgage; Ht is the property value at time t, modeled as a stochastic process; and rt is the interest rate at time t, modeled as another stochastic process. The stochastic processes are functions that specify how house prices and interest rates evolve over time. Given initial values, one can use the processes to simulate possible house prices and interest rates over time.

16 The specification of the borrower's choices follows Capozza, Kazarian, and Thomson ([7]). Souissi ([33]) has a similar setup. Others may have finer or coarser differentiation among choices; for example, another model may distinguish prepayment by sale of the property from prepayment by refinancing.
17 Each time interval is one time-step; the length of the time interval can be one month to represent monthly mortgage payments.

Implementation

How is the default probability estimated from the model?
First, one generates possible outcomes for house prices and interest rates over the term of the mortgage. This is done by assuming stochastic processes for house prices and interest rates, respectively. At each time interval, this provides the distributions of the house price and the interest rate at that time. Then one explicitly expresses the functions Pt^d, Pt^r, and Pt^w. At a time interval t, the cost of default, Pt^d, is the property value at that time plus the transaction costs of default;18 the cost of refinancing, Pt^r, is the periodic mortgage payment plus the outstanding mortgage balance along with the deadweight cost of refinancing;19 the cost of continuing with the current mortgage, Pt^w, is the periodic mortgage payment plus the expected mortgage cost in the future, E(Pt+1). The expected future mortgage cost does not affect the cost of default or refinancing, because the mortgage is terminated after default or refinancing. Equation (6) is recursive – the borrower repeatedly makes such a decision at every time interval until the mortgage matures or terminates, whichever comes first. Moreover, the borrower's choice today is influenced by the borrower's expectation of housing costs tomorrow: Pt^w is a function of Pt+1, which satisfies the same equation as (6) with subscript t+1 in place of t. In addition, the function Pt^w can be modified to include a trigger event. A trigger event is a random event, such as divorce or unemployment, which can happen to an average borrower and "triggers" mortgage termination. The trigger event is modeled by Capozza, Kazarian, and Thomson ([7]) as the probability that such a random event happens. With a 10-year mortgage, equation (6) represents a system of 120 equations, one for every monthly time interval.20 Finally, from the distributions of Ht and rt at each time interval, the model generates a distribution of mortgage payment choices, from which one calculates the probability of default for that time interval.
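The simulation step just described can be illustrated with a deliberately simplified sketch. Instead of solving the full recursive system (6), the example below simulates monthly house-price paths under geometric Brownian motion and applies a myopic decision rule: default in month t when the property value plus the transaction cost of default falls below the outstanding balance. The process, parameters, and decision rule are illustrative assumptions, not the model of Capozza, Kazarian, and Thomson, which additionally carries the expected future cost E(Pt+1) and the interest-rate process.

```python
import math
import random

def gbm_paths(h0, months, n_paths, mu=0.0025, sigma=0.029, seed=7):
    """Monthly house-price paths under geometric Brownian motion
    (illustrative monthly drift and volatility)."""
    rng = random.Random(seed)
    paths = []
    for _ in range(n_paths):
        h, path = h0, [h0]
        for _ in range(months):
            h *= math.exp(mu - 0.5 * sigma ** 2 + sigma * rng.gauss(0, 1))
            path.append(h)
        paths.append(path)
    return paths

def default_probability(paths, balance0, annual_rate=0.04, default_cost=20_000.0):
    """Share of simulated paths on which the myopic rule triggers default:
    default in month t if H_t + default_cost < outstanding balance."""
    months = len(paths[0]) - 1
    m = annual_rate / 12
    payment = balance0 * m / (1 - (1 + m) ** -months)  # level amortizing payment
    defaults = 0
    for path in paths:
        balance = balance0
        for t in range(1, months + 1):
            balance = balance * (1 + m) - payment  # amortize one month
            if path[t] + default_cost < balance:
                defaults += 1
                break  # mortgage terminates; the path defaults once at most
    return defaults / len(paths)

paths = gbm_paths(h0=500_000.0, months=120, n_paths=500)
pd_10yr = default_probability(paths, balance0=450_000.0)
```

Varying an input across runs (for example, a higher sigma) shows the sensitivity of the estimated default probability to that parameter, which is one of the uses of the model discussed below.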
The solution techniques and the calculation of default probabilities are described in Capozza, Kazarian, and Thomson ([7]).21

18 Assume that the mortgagor loses the property if he/she defaults. Transaction costs of default can incorporate a wide range of monetary and non-monetary items, including moving expenses, legal fees, a negative impact on the borrower's credit quality, mental stigma, and so on.
19 Assume that the new mortgage starts after the current time interval. The deadweight cost is a transaction cost of refinancing; it may be a fixed amount plus a variable amount expressed as a percentage of the outstanding mortgage balance.
20 The equation for the final time interval is the boundary condition; it has a slightly different and simpler form, as the housing cost during the last month before the mortgage matures is the cost of default (property value plus transaction costs of default) or the periodic mortgage payment, whichever is less.
21 One way to understand the solution technique is to draw a reference from the binomial option pricing model.

The model can be used in at least three ways. First, it can generate an estimate of the default probability for a given loan over a chosen time horizon. Second, the model can generate probabilities of default under different values of a particular parameter (for example, higher versus lower house price volatility) to assess the impact of that parameter on the default probability. Lastly, the model can simulate the performance of hypothetical mortgages; the simulated sample can then be used to test the robustness of a statistical model of default.

Advantage and disadvantage

An optimization model of default differs from the previous models. Models 1, 2, and 3 are statistical models that abstract from the economic structure behind the mortgage default process; they use statistical properties inherent in the loan data to draw inferences.
An optimization model, on the other hand, tries to describe what happens when a borrower chooses to default, and to capture those dynamics with equations. The advantage is that the optimization model does not assume that the default probability is a given function of the explanatory variables; instead, the estimated probability arises from the economic forces that drive the borrower's behavior. However, the model requires extensive programming and is more difficult to implement than the previous models. The outcomes of the model rely on the assumptions made to construct it – here, assumptions on how house prices and interest rates evolve over time – and bad assumptions lead to poor predictions of default probability. Also, because of its reliance on a particular economic structure, the model is less flexible. For example, the specification described above mainly incorporates the impacts of house prices and interest rates; adding the impact of, say, GDP growth is not straightforward. Another advantage of this model is that its ability to estimate default probability relies more on the economic structure and less on historical data. It can be useful when one has a poor collection of loan data. The model estimates the default probability for a "typical" mortgage, where "typical" is characterized by a set of initial values assigned as model inputs, including the parameters of the house price and interest rate processes and the parameters of the housing cost functions. Depending on the purpose of the estimation, one can vary the set of initial values and generate the default probability for a particular loan, for the median loan in a portfolio, or under a stressed scenario. The model described by equation (6) may be criticized for not considering the borrower's ability to continue making periodic mortgage payments: the borrower may be forced to default because of insufficient cash to meet mortgage obligations.
One justification is that the ability-to-pay is accounted for by the inclusion of the "trigger event". The ability-to-pay – perhaps more precisely, the inability-to-pay – can also be implicitly accounted for by the transaction costs of default. For example, for a borrower who is financially distressed, defaulting on the mortgage relieves the borrower of an unaffordable financial burden, which can be reflected in the costs of default. One possible modification that deals directly with the ability-to-pay is to introduce another stochastic process for income (net of non-housing expenditures) or non-housing wealth, revising the borrower's decision to incorporate liquidity constraints. Another group of optimization models falls under the utility-maximization framework that is often used in economics to model household consumption. This type of model defines household utility as a function of non-durable consumption over time, housing consumption over time, and/or terminal wealth (financial wealth plus housing wealth). At any point in time, a borrower's housing and mortgage decisions are the outcomes of maximizing expected lifetime utility.22 These models are highly structured; they make assumptions that balance model tractability against accuracy in describing consumer behavior as well as housing and mortgage market practices.

IV. Models for default probability of a loan portfolio

The models in section III treat an individual mortgage as the subject of study. In this section, a portfolio of mortgages is viewed as one subject.

Model 5: Linear regression analysis of default rate

Description

Regression analysis looks for the relationship between default risk and an array of explanatory variables. Default risk is treated as a dependent variable, which can be explained by some independent, or explanatory, variables.
For a portfolio of loans, the default rate is calculated as the number of loans in default over the total number of loans in the portfolio.23 The default rate in turn serves as a measure of default risk for the loan portfolio. The regression model is formulated as:

default rate = α + β1X1 + β2X2 + ⋯ + βkXk + ε    (7)

where X1, X2, …, Xk are explanatory variables, factors or predictors that may help determine default risk; α is a constant; β1, β2, …, βk are coefficients that capture the impact that each factor may have on default risk; and ε is an error term.

22 Two current working papers, Garriga and Schlagenhauf ([18]) and Campbell and Cocco ([5]), are examples of structural models for household mortgage decisions. The former study emphasizes the multiplier effect of leverage in increasing default risk; the leverage position of a mortgagor is measured by the loan-to-value ratio. The latter offers a dynamic model incorporating income, house price, inflation, and interest rate risks. Both studies provide ample implications for an empirical model of default risk.

Implementation

When viewing a loan portfolio as one subject, the question arises as to how the explanatory variables in equation (7) are measured. For example, one may use the loan-to-value (LTV) ratio as one predictor of default risk; each mortgage in the portfolio has an LTV. The question is then how to measure the LTV for the portfolio. Broadly speaking, there are two ways to construct the sample.

1. A particular lending institution may consider its entire mortgage portfolio as one subject under examination. The average or median measures of the explanatory variables (e.g., LTV) may be used in the analysis. One observes the default rate and the explanatory variables for the entire portfolio periodically over time. An example of the data looks like Figure 7.

Figure 7.
Time series data on a loan portfolio: data structure example

Date | Default rate | X1: average loan-to-value | X2: average term of mortgage | X3: GDP growth
2005 | 1% | 52% | 21 | 2%
2006 | 2% | 53% | 22 | 3%
2007 | 3% | 55% | 22 | 1.5%
…… | …… | …… | …… | ……

When using average portfolio measures, one should bear in mind the variation in the sizes of the loans in the portfolio. Instead of using a simple average LTV, one may use an average weighted by selected weighting factors; for example, one may calculate the average LTV weighted by individual loan sizes relative to the portfolio size. Forming a sample this way is simple and straightforward. As a result, the institution obtains a time series dataset on its entire mortgage portfolio as a whole. A disadvantage of this approach is that certain explanatory factors (e.g., LTV) are smoothed out over time due to averaging, and the impacts of these variables are not well estimated.

23 Alternatively, one may calculate the default rate as the loan value in default over the total value of the loan portfolio. An IMF working paper by Hardy and Schmieder ([20]) suggests that the credit loss rate (dollar loan loss from the profit and loss account over the total dollar value of the loan stock) accounts for both the probability of default and the loss given default.

2. Another way to construct the sample is to group the entire mortgage portfolio into smaller sub-portfolios and view each sub-portfolio as a subject under examination. Mortgages in one sub-portfolio share some common combination of characteristics. The characteristics are the criteria used to group loans; they also enter the regression equation as explanatory variables. For example, one may use two criteria to group mortgages: 1) a loan term of 20 or 25 years; and 2) an initial LTV of above 90 per cent, between 80 and 90 per cent, or below 80 per cent. As shown in Figure 8, using this grid one sorts all mortgages in the loan portfolio into 6 layered groups, or cohorts, each of which is a sub-portfolio under examination.

Figure 8.
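A single-predictor version of equation (7) can be fit in closed form on a time series like Figure 7's. The sketch below regresses the default rate on the average LTV using the three example rows from the figure; with so few observations this is purely an illustration of the mechanics, and a real analysis would use a longer series and more predictors.

```python
# Time-series observations taken from the Figure 7 example:
# annual default rate (%) and portfolio-average loan-to-value (%).
avg_ltv = [52.0, 53.0, 55.0]
default_rate = [1.0, 2.0, 3.0]

def ols_simple(x, y):
    """Closed-form ordinary least squares for: default rate = a + b * x + e."""
    n = len(x)
    xm, ym = sum(x) / n, sum(y) / n
    # slope = sample covariance / sample variance of x
    b = sum((xi - xm) * (yi - ym) for xi, yi in zip(x, y)) / \
        sum((xi - xm) ** 2 for xi in x)
    a = ym - b * xm  # intercept passes through the means
    return a, b

alpha, beta_ltv = ols_simple(avg_ltv, default_rate)
```

With several predictors, as in equation (7), the same idea generalizes to solving the normal equations (or calling a regression routine) instead of the two-variable closed form.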
Layered groups: example

               | LTV > 90%        | 80% < LTV < 90%  | LTV < 80%
Term: 20 years | Loan portfolio 1 | Loan portfolio 3 | Loan portfolio 5
Term: 25 years | Loan portfolio 2 | Loan portfolio 4 | Loan portfolio 6

The grouping technique transforms loan-by-loan data into a cohort-by-cohort sample. Each cohort (sub-portfolio) is then treated as one subject and is observed in each period. The more criteria one adds to the grid, and the finer one defines the grid, the more sub-portfolios one has in the sample. Suppose one adds a third criterion, borrower's income – above median or below median; this increases the number of sub-portfolios from 6 to 12. Figure 9 is an example of the data structure using this approach.

Figure 9. Panel data on loan portfolios: data structure example

Loan portfolio ID | Date | Default rate | X1: LTV>90% | X2: 80%
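The cohort construction of Figure 8 can be sketched as follows: each loan is assigned to a (term, LTV band) cell, and each cell's default rate is the number of defaulted loans over the cell's size. The loan records are illustrative, and the band boundaries are written with an inclusive lower edge so that no loan falls between bands (the figure itself uses strict inequalities).

```python
def ltv_band(ltv):
    """LTV bands from the Figure 8 grid (lower edge made inclusive here)."""
    if ltv > 0.90:
        return "LTV>90%"
    if ltv >= 0.80:
        return "80%<LTV<90%"
    return "LTV<80%"

def cohort_default_rates(loans):
    """Group loans into (term, LTV band) cohorts and compute each cohort's
    default rate as defaulted loans over cohort size."""
    counts, defaults = {}, {}
    for loan in loans:
        key = (loan["term"], ltv_band(loan["ltv"]))
        counts[key] = counts.get(key, 0) + 1
        defaults[key] = defaults.get(key, 0) + (1 if loan["defaulted"] else 0)
    return {k: defaults[k] / counts[k] for k in counts}

loans = [  # illustrative loan-level records
    {"term": 20, "ltv": 0.95, "defaulted": True},
    {"term": 20, "ltv": 0.92, "defaulted": False},
    {"term": 25, "ltv": 0.85, "defaulted": False},
    {"term": 25, "ltv": 0.85, "defaulted": True},
    {"term": 20, "ltv": 0.70, "defaulted": False},
]
rates = cohort_default_rates(loans)
```

Observing these cohort default rates in every period, together with the grouping characteristics as dummy variables, yields the panel structure of Figure 9.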