University of Paris 7 - Lyxor Asset Management

Master thesis

Momentum Strategies: From novel Estimation Techniques to Financial Applications

Author: Tung-Lam Dao

Supervisor: Prof. Thierry Roncalli

September 30, 2011

Contents

Acknowledgments
Confidential notice
Introduction

1 Trading Strategies with L1 Filtering
  1.1 Introduction
  1.2 Motivations
  1.3 L1 filtering schemes
    1.3.1 Application to trend-stationary process
    1.3.2 Extension to mean-reverting process
    1.3.3 Mixing trend and mean-reverting properties
    1.3.4 How to calibrate the regularization parameters?
  1.4 Application to momentum strategies
    1.4.1 Estimating the optimal filter for a given trading date
    1.4.2 Backtest of a momentum strategy
  1.5 Extension to the multivariate case
  1.6 Conclusion

2 Volatility Estimation for Trading Strategies
  2.1 Introduction
  2.2 Range-based estimators of volatility
    2.2.1 Range based daily data
    2.2.2 Basic estimator
    2.2.3 High-low estimators
    2.2.4 How to eliminate both drift and opening effects?
    2.2.5 Numerical simulations
    2.2.6 Backtest
  2.3 Estimation of realized volatility
    2.3.1 Moving-average estimator
    2.3.2 IGARCH estimator
    2.3.3 Extension to range-based estimators
    2.3.4 Calibration procedure of the estimators of realized volatility
  2.4 High-frequency volatility estimators
    2.4.1 Microstructure effect
    2.4.2 Two time-scale volatility estimator
    2.4.3 Numerical implementation and backtesting
  2.5 Conclusion

3 Support Vector Machine in Finance
  3.1 Introduction
  3.2 Support vector machine at a glance
    3.2.1 Basic ideas of SVM
    3.2.2 ERM and VRM frameworks
  3.3 Numerical implementations
    3.3.1 Dual approach
    3.3.2 Primal approach
    3.3.3 Model selection - Cross validation procedure
  3.4 Extension to SVM multi-classification
    3.4.1 Basic idea of multi-classification
    3.4.2 Implementations of multiclass SVM
  3.5 SVM-regression in finance
    3.5.1 Numerical tests on SVM-regressors
    3.5.2 SVM-filtering for forecasting the trend of a signal
    3.5.3 SVM for multivariate regression
  3.6 SVM-classification in finance
    3.6.1 Test of SVM-classifiers
    3.6.2 SVM for classification
    3.6.3 SVM for score construction and stock selection
  3.7 Conclusion

4 Analysis of Trading Impact in the CTA strategy
  4.1 Introduction
  4.2 Conclusion

Conclusions

A Appendix of chapter 1
  A.1 Computational aspects of L1, L2 filters
    A.1.1 The dual problem
    A.1.2 The interior-point algorithm
    A.1.3 The scaling of the smoothing parameter of the L1 filter
    A.1.4 Calibration of the L2 filter
    A.1.5 Implementation issues

B Appendix of chapter 2
  B.1 Estimator of volatility
    B.1.1 Estimation with realized return

C Appendix of chapter 3
  C.1 Dual problem of SVM
    C.1.1 Hard-margin SVM classifier
    C.1.2 Soft-margin SVM classifier
    C.1.3 ε-SV regression
  C.2 Newton optimization for the primal problem
    C.2.1 Quadratic loss function
    C.2.2 Soft-margin SVM

Published paper

List of Figures

1.1 L1-T filtering versus HP filtering for the model (1.2)
1.2 L1-T filtering versus HP filtering for the model (1.3)
1.3 L1-C filtering versus HP filtering for the model (1.5)
1.4 L1-C filtering versus HP filtering for the model (1.6)
1.5 L1-TC filtering versus HP filtering for the model (1.2)
1.6 L1-TC filtering versus HP filtering for the model (1.3)
1.7 Influence of the smoothing parameter λ
1.8 Scaling power law of the smoothing parameter λmax
1.9 Cross-validation procedure for determining optimal value λ⋆
1.10 Calibration procedure with the S&P 500 index
1.11 Cross-validation procedure for two-trend model
1.12 Comparison between different L1 filters on S&P 500 Index

2.1 Data set of 1 trading day
2.2 Volatility estimators without drift and opening effects (M = 50)
2.3 Volatility estimators without drift and opening effect (M = 500)
2.4 Volatility estimators with µ = 30% and without opening effect (M = 500)
2.5 Volatility estimators with opening effect f = 0.3 and without drift (M = 500)
2.6 Volatility estimators with correction of the opening jump (f = 0.3)
2.7 Volatility estimators on stochastic volatility simulation
2.8 Test of voltarget strategy with stochastic volatility simulation
2.9 Test of voltarget strategy with stochastic volatility simulation
2.10 Comparison between different probability density functions
2.11 Comparison between the different cumulative distribution functions
2.12 Volatility estimators on S&P 500 index
2.13 Volatility estimators on BHI UN Equity
2.14 Estimation of the closing interval for S&P 500 index
2.15 Estimation of the closing interval for BHI UN Equity
2.16 Likelihood function for various estimators on S&P 500
2.17 Likelihood function for various estimators on BHI UN Equity
2.18 Backtest of voltarget strategy on S&P 500 index
2.19 Backtest of voltarget strategy on BHI UN Equity
2.20 Comparison between IGARCH estimator and CC estimator
2.21 Likelihood function of high-low estimators versus filtered parameter β
2.22 Likelihood function of high-low estimators versus effective moving window
2.23 IGARCH estimator versus moving-average estimator for close-to-close prices
2.24 Comparison between different IGARCH estimators for high-low prices
2.25 Daily estimation of the likelihood function for various close-to-close estimators
2.26 Daily estimation of the likelihood function for various high-low estimators
2.27 Backtest for close-to-close estimator and realized estimators
2.28 Backtest for IGARCH high-low estimators compared to the IGARCH close-to-close estimator
2.29 Two-time scale estimator of intraday volatility

3.1 Geometric interpretation of the margin in a linear SVM
3.2 Binary decision tree strategy for multiclassification problem
3.3 L1-regressor versus L2-regressor with Gaussian kernel for model (3.16)
3.4 L1-regressor versus L2-regressor with Gaussian kernel for model (3.17)
3.5 Comparison of different regression kernels for model (3.16)
3.6 Comparison of different regression kernels for model (3.17)
3.7 Cross-validation procedure for determining optimal values C⋆ and σ⋆
3.8 SVM-filtering with fixed horizon scheme
3.9 SVM-filtering with dynamic horizon scheme
3.10 L1-regressor versus L2-regressor with Gaussian kernel for model (3.16)
3.11 Comparison of different kernels for multivariate regression
3.12 Comparison between Dual algorithm and Primal algorithm
3.13 Illustration of non-linear classification with Gaussian kernel
3.14 Illustration of multiclassification with SVM-BDT for in-sample data
3.15 Illustration of multiclassification with SVM-BDT for out-of-sample data
3.16 Illustration of multiclassification with SVM-BDT for ε = 0
3.17 Illustration of multiclassification with SVM-BDT for ε = 0.2
3.18 Multiclassification with SVM-BDT on training set
3.19 Prediction efficiency with SVM-BDT on the validation set
3.20 Comparison between simulated score and Probit score for d = 2
3.21 Comparison between simulated score CDF and Probit score CDF for d = 2
3.22 Comparison between simulated score PDF and Probit score PDF for d = 2
3.23 Selection curve for long strategy for simulated data and Probit model
3.24 Probit scores for Eurostoxx data with d = 20 factors
3.25 SVM scores for Eurostoxx data with d = 20 factors

A.1 Spectral density of moving-average and L2 filters
A.2 Relationship between the value of λ and the length of the moving-average filter

List of Tables

1.1 Results for the Backtest
2.1 Estimation error for various estimators
2.2 Performance of σ̂²_HL versus σ̂²_CC for different averaging windows
2.3 Performance of σ̂²_HL versus σ̂²_CC for different filters of f

Acknowledgments

During these six unforgettable months in the R&D team of Lyxor Asset Management, I have experienced and enjoyed every moment. Apart from all the professional experience that I have gained from everyone in the department, I really appreciated the great atmosphere in the team, which motivated me every day.

I would like first to thank Thierry Roncalli for his supervision during my stay in the team. I could never have imagined learning so many interesting things during my internship without his direction and his confidence. Thierry introduced me to the financial concepts of the asset management world in a very interactive way; I would say that I have learnt finance in every single discussion with him. He taught me how to combine learning and practice. On the professional side, Thierry helped me to fill the gaps in my financial knowledge by allowing me to work on various interesting topics, and he gave me the confidence to present my understanding of this field. In daily life, Thierry shared his own experience and also taught me how to adapt to this new world.

I would like to thank Nicolas Gaussel for his warm reception in the Quantitative Management department, for his confidence and for his encouragement during my stay at Lyxor. I had the chance to work with him on a very interesting topic concerning the CTA strategy, which plays an important role in asset management.

I would like to thank Benjamin Bruder, my nearest neighbor, for his guidance and his supervision throughout my internship. Informally, Benjamin was almost my co-advisor. I owe him a lot for all of his patience in our daily discussions, teaching me and working out the many questions that came up in my projects. I am also grateful for his sense of humor, which warmed up the atmosphere.

To all members of the R&D team, I would like to express my gratitude for their help, their advice and everything that they shared with me during my stay; I am really happy to have been one of them. Thank you, Jean-Charles, for your friendship, for all our daily discussions and for your support of all the initiatives in my projects. A great thanks to Stephane, who always cheered up the breaks with his intelligent humor; I have learnt from him the most interesting view of the "Binomial world". Thank you, Karl, for explaining your macro-world. Thank you, Pierre, for all your help with data collection and your passion in every explanation, such as the story of "Merrill Lynch's investment clock". Thank you, Zelia, for a very stimulating collaboration on my last project and the great time during our internship.

For the people on the other side of the room, I would like to thank Philippe Balthazard for his comments on my projects and his point of view on financial aspects, and Hoang-Phong Nguyen for his help with the database and his support during my stay. There are many other people with whom I had the chance to interact but whom I cannot cite here.

Thanks to my parents and my sister, who always believed in me and supported me during my change of direction. Finally, I would like to reserve the greatest thanks for my wife and my son, for their love and daily encouragement; they were always behind me during the most difficult moments of this year.

Confidential notice

This thesis is based on confidential research carried out in the R&D team of Lyxor Asset Management. It is divided into two main parts. The first part, comprising the first three chapters (1, 2 and 3), consists of applications of novel estimation techniques for the trend and the volatility of financial time series. We present the main results in detail together with a publication in the Lyxor White Paper series. The second part, concerning the analysis of the CTA performance in the risk-return framework (see The Lyxor White Paper Series, Issue #7, June 2011), is omitted due to confidentiality. Only a brief introduction and the final conclusion of this part (chapter 4) are presented in order to sketch out the main features.

This document contains information confidential and proprietary to Lyxor Asset Management. The information may not be used, disclosed or reproduced without the prior written authorization of Lyxor Asset Management, and those so authorized may only use the information for the purpose of evaluation consistent with that authorization.

Introduction

During the internship in the Research and Development team of Lyxor Asset Management, we studied novel techniques applicable to asset management. We focused on the analysis of some special classes of momentum strategies, such as trend-following strategies and voltarget strategies. These strategies play a crucial role in quantitative management as they aim to optimize returns based on exploitable signals of market inefficiency and to limit market risk via an efficient control of the volatility.

The objectives of this report are twofold. We first studied some novel techniques from statistics and signal processing, such as trend filtering, daily and high-frequency volatility estimators, and the support vector machine. We employed these techniques to extract interesting financial signals. These signals are used to implement the momentum strategies which are described in detail in each chapter of this report. The second objective concerns the study of the performance of these strategies within the general risk-return analysis framework (see B. Bruder and N. Gaussel, Lyxor White Paper #7).

This report is organized as follows. In the first chapter, we discuss various implementations of L1 filtering in order to detect some properties of noisy signals. This filter consists of using an L1 penalty condition in order to obtain a filtered signal composed of a set of straight trends or steps. This penalty condition, which determines the number of breaks, is implemented in a constrained least squares problem and is represented by a regularization parameter λ, which is estimated by a cross-validation procedure. Financial time series are usually characterized by a long-term trend (called the global trend) and some short-term trends (named local trends). A combination of these two time scales can form a simple model describing a global trend process with some mean-reverting properties. Explicit applications to momentum strategies are also discussed in detail with appropriate uses of the trend configurations.

We next review in the second chapter various techniques for estimating the volatility. We start by discussing the estimators based on the range of daily monitoring data; then we consider the stochastic volatility model in order to determine the instantaneous volatility. At high trading frequency, stock prices fluctuate with an additional noise, the so-called microstructure noise. This effect comes from the bid-ask spread and the short time scale: within a short time interval, the trading price does not reflect exactly the equilibrium price determined by supply and demand, but bounces between the bid and ask prices. In the second part of the chapter, we discuss the effect of the microstructure noise on the volatility estimation; it is a very important topic concerning the large field of "high-frequency" trading. Examples of backtesting on an index and on single stocks illustrate the efficiency of the considered techniques.

The third chapter is dedicated to the study of a general machine-learning framework. We review the well-known machine-learning technique called the support vector machine (SVM). This technique can be employed in different contexts such as classification, regression or density estimation, following Vapnik [1998]. Within the scope of this report, we first give an overview of this method and its numerical implementations, and then bridge it to financial applications such as trend forecasting, stock selection, sector recognition and score construction.

We finish in Chapter 4 with the performance analysis of the CTA strategy. We first review trend-following strategies within the Kalman filter framework and study the impact of the trend estimation error. We start the discussion with the case of a momentum strategy on a single asset and then generalize the analysis to the multi-asset case. In order to construct the allocation strategy, we employ the observed trend, which is filtered by an exponential moving average. It can be demonstrated that the cumulated return of the strategy can be split into two important parts. The first one is called the "Option Profile", which involves only the current measured trend; this idea is very similar in concept to the straddle profile suggested by Fung and Hsieh (2001). The second part is called the "Trading Impact", which involves an integral of the measured trend over the trading period. We focus on this second quantity by estimating its probability distribution function and the associated gain and loss expectations. We illustrate how the number of assets and their correlations influence the performance of a strategy via a "toy model". This study reveals important results which can be directly tested on CTA funds.

Chapter 1

Trading Strategies with L1 Filtering

In this chapter, we discuss various implementations of L1 filtering in order to detect some properties of noisy signals. This filter consists of using an L1 penalty condition in order to obtain a filtered signal composed of a set of straight trends or steps. This penalty condition, which determines the number of breaks, is implemented in a constrained least squares problem and is represented by a regularization parameter λ, which is estimated by a cross-validation procedure. Financial time series are usually characterized by a long-term trend (called the global trend) and some short-term trends (which are named local trends). A combination of these two time scales can form a simple model describing a global trend process with some mean-reverting properties. Explicit applications to momentum strategies are also discussed in detail with appropriate uses of the trend configurations.

Keywords: Momentum strategy, L1 filtering, L2 filtering, trend-following, mean-reverting.

1.1 Introduction

Trend detection is a major task of time series analysis from both the mathematical and the financial points of view. The trend of a time series is considered as the component containing the global change, in contrast to the local change due to the noise. The procedure of trend filtering concerns not only the problem of denoising; it must also take into account the dynamics of the underlying process. This explains why mathematical approaches to trend extraction have a long history and why this subject still attracts great interest in the scientific community1. From an investment perspective, trend filtering is at the core of most momentum strategies developed in the asset management industry and the hedge fund community in order to improve performance and to limit portfolio risk.

For a general review, see Alexandrov et al. (2008).


This chapter is organized as follows. In Section 1.2, we discuss the trend-cycle decomposition of time series and review general properties of L1 and L2 filtering. In Section 1.3, we describe the L1 filter with its various extensions and the calibration procedure. In Section 1.4, we apply L1 filters to some momentum strategies and present the results of some backtests with the S&P 500 index. In Section 1.5, we discuss the possible extension to the multivariate case, and we conclude in the last section.

1.2 Motivations

In economics, the trend-cycle decomposition plays an important role in describing a non-stationary time series in terms of permanent and transitory stochastic components. Generally, the permanent component is assimilated to a trend whereas the transitory component may be a noise or a stochastic cycle. Moreover, the literature on the business cycle has produced a large number of empirical studies on this topic (see for example Cleveland and Tiao (1976), Beveridge and Nelson (1981), Harvey (1991) or Hodrick and Prescott (1997)). These last authors introduced a new method to estimate the trend of long-run GDP. The method, widely used by economists, is based on L2 filtering. Recently, Kim et al. (2009) have developed a similar filter by replacing the L2 penalty function by an L1 penalty function.

Let us consider a time series yt which can be decomposed into a slowly varying trend xt and a rapidly varying noise process εt:

$$y_t = x_t + \varepsilon_t$$

Let us first recall the well-known L2 filter (the so-called Hodrick-Prescott filter). This scheme determines the trend xt by minimizing the following objective function:

$$\frac{1}{2}\sum_{t=1}^{n}(y_t - x_t)^2 + \lambda\sum_{t=2}^{n-1}(x_{t-1} - 2x_t + x_{t+1})^2$$

with λ > 0 the regularization parameter which controls the competition between the smoothness of xt and the residual yt − xt (or the noise εt). We remark that the second term is the discrete second derivative of the trend xt, which characterizes the smoothness of the curve. Minimizing this objective function gives a solution which is a trade-off between fidelity to the data and the smoothness of its curvature. In finance, this scheme does not give a clear signature of the market tendency. By contrast, if we replace the L2 norm by the L1 norm in the objective function, we obtain more interesting properties. Therefore, Kim et al. (2009) propose to consider the following objective function:

$$\frac{1}{2}\sum_{t=1}^{n}(y_t - x_t)^2 + \lambda\sum_{t=2}^{n-1}|x_{t-1} - 2x_t + x_{t+1}|$$

This problem is closely related to the Lasso regression of Tibshirani (1996) and to the L1-regularized least squares problem of Daubechies et al. (2004). Here, taking the L1 norm imposes the condition that the second derivative of the filtered signal must be zero at most points. Hence, the filtered signal is composed of a set of straight trends and breaks2. The competition between these two terms in the objective function turns into a competition between the number of straight trends (or number of breaks) and the closeness to the raw data. Therefore, the smoothing parameter λ plays an important role in detecting the number of breaks. In what follows, we present briefly how the L1 filter works for trend detection and its extension to mean-reverting processes. The calibration procedure for the λ parameter is also discussed in detail.

1.3 L1 filtering schemes

1.3.1 Application to trend-stationary process

The Hodrick-Prescott scheme discussed in the last section can be rewritten in the vector space R^n with its L2 norm ‖·‖2 as:

$$\frac{1}{2}\|y - x\|_2^2 + \lambda\|Dx\|_2^2$$

where y = (y1, . . . , yn), x = (x1, . . . , xn) ∈ R^n and the D operator is the (n − 2) × n second-difference matrix:

$$D = \begin{pmatrix} 1 & -2 & 1 & & & \\ & 1 & -2 & 1 & & \\ & & \ddots & \ddots & \ddots & \\ & & & 1 & -2 & 1 \end{pmatrix} \qquad (1.1)$$

The exact solution of this estimation problem is given by:

$$x^{\star} = \left(I + 2\lambda D^{\top} D\right)^{-1} y$$
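As an illustration (not part of the original text), the closed-form solution above can be coded in a few lines. The sketch below uses dense NumPy linear algebra for readability, whereas the sparse-matrix implementation mentioned in the text would rely on the banded structure of D.

import numpy as np

def second_difference_matrix(n):
    """Build the (n-2) x n second-difference operator D of equation (1.1)."""
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i], D[i, i + 1], D[i, i + 2] = 1.0, -2.0, 1.0
    return D

def hp_filter(y, lam):
    """Hodrick-Prescott (L2) trend: x* = (I + 2*lam*D'D)^{-1} y."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    D = second_difference_matrix(n)
    return np.linalg.solve(np.eye(n) + 2.0 * lam * D.T @ D, y)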

The explicit expression of x⋆ allows a very simple numerical implementation with sparse matrices. As the L2 filter is a linear filter, the regularization parameter λ is calibrated by comparison with the usual moving-average filter. The detail of the calibration procedure is given in Appendix A.1.4. The idea of the L2 filter can be generalized to a larger class, the so-called Lp filters, by using an Lp penalty condition instead of the L2 penalty. This generalization is already discussed in the work of Daubechies et al. (2004) for the linear inverse problem and in the Lasso regression problem of Tibshirani (1996). If we consider an L1 filter, the objective function becomes:

$$\frac{1}{2}\sum_{t=1}^{n}(y_t - x_t)^2 + \lambda\sum_{t=2}^{n-1}|x_{t-1} - 2x_t + x_{t+1}|$$

A break is the position where the trend of signal changes.
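The thesis solves this program with the primal-dual interior-point method of Appendix A.1.2. Purely as a sketch, the same objective can also be handed to a generic convex solver; the use of the CVXPY modeling library below is my assumption and is not the implementation used in the thesis.

import numpy as np
import cvxpy as cp

def l1_trend_filter(y, lam):
    """L1-T filter: minimize 0.5*||y - x||_2^2 + lam*||D x||_1, with D the
    second-difference operator, solved with a generic convex solver."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i], D[i, i + 1], D[i, i + 2] = 1.0, -2.0, 1.0
    x = cp.Variable(n)
    objective = cp.Minimize(0.5 * cp.sum_squares(y - x) + lam * cp.norm1(D @ x))
    cp.Problem(objective).solve()
    return x.value

For λ larger than λmax (defined in Section 1.3.4), the returned trend collapses to a single affine function of time.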


which is equivalent to the following vectorial form:

$$\frac{1}{2}\|y - x\|_2^2 + \lambda\|Dx\|_1$$

It has been demonstrated in Kim et al. (2009) that the dual problem of this L1 filter scheme is a quadratic program with some boundary constraints. The detail of this derivation is shown in Appendix A.1.1. In order to optimize the numerical computation speed, we follow Kim et al. (2009) by using a "primal-dual interior point" method (see Appendix A.1.2). In the following, we check the efficiency of this technique on various trend-stationary processes. The first model consists of data simulated as a set of straight trend lines with a white noise perturbation:

$$\begin{cases} y_t = x_t + \varepsilon_t \\ \varepsilon_t \sim \mathcal{N}(0, \sigma^2) \\ x_t = x_{t-1} + v_t \\ \Pr\{v_t = v_{t-1}\} = p \\ \Pr\{v_t = b\,(U_{[0,1]} - \tfrac{1}{2})\} = 1 - p \end{cases} \qquad (1.2)$$

We present in Figure 1.1 the comparison between the L1−T and HP filtering schemes3. The top-left graph is the real trend xt whereas the top-right graph presents the noisy signal yt. The bottom graphs show the results of the L1−T and HP filters. Here, we have chosen λ = 5 258 for the L1−T filtering and λ = 1 217 464 for the HP filtering. This choice of λ for the L1−T filtering is based on the number of breaks in the trend, which is fixed to 10 in this example4. The second model is a random walk generated by the following process:

$$\begin{cases} y_t = y_{t-1} + v_t + \varepsilon_t \\ \varepsilon_t \sim \mathcal{N}(0, \sigma^2) \\ \Pr\{v_t = v_{t-1}\} = p \\ \Pr\{v_t = b\,(U_{[0,1]} - \tfrac{1}{2})\} = 1 - p \end{cases} \qquad (1.3)$$

We present in Figure 1.2 the comparison between L1 − T filtering and HP filtering on this second model5 .
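For concreteness, a minimal simulation of model (1.2) might look as follows; the function name and random-number conventions are mine, and the parameter values are those of footnote 3.

import numpy as np

def simulate_model_12(n=2000, p=0.99, b=0.5, sigma=15.0, seed=0):
    """Simulate model (1.2): a piecewise-linear trend x_t whose slope v_t is
    kept with probability p and redrawn as b*(U[0,1]-1/2) with probability 1-p,
    observed with Gaussian noise."""
    rng = np.random.default_rng(seed)
    x = np.zeros(n)
    v = b * (rng.uniform() - 0.5)
    for t in range(1, n):
        if rng.uniform() > p:            # slope break
            v = b * (rng.uniform() - 0.5)
        x[t] = x[t - 1] + v
    y = x + rng.normal(0.0, sigma, size=n)
    return x, y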

1.3.2 Extension to mean-reverting process

As shown in the last paragraph, the use of the L1 penalty on the second derivative gives a correct description of the signal tendency. Hence, a similar idea can be applied to other orders of the derivative. We present here the extension of this L1 filtering technique to the case of mean-reverting processes. If we impose now the L1 penalty

3 We consider n = 2000 observations. The parameters of the simulation are p = 0.99, b = 0.5 and σ = 15.
4 We discuss how to obtain λ in the next section.
5 The parameters of the simulation are p = 0.993, b = 5 and σ = 15.


Figure 1.1: L1-T filtering versus HP filtering for the model (1.2) [panels: Signal, Noisy signal, L1-T filter, HP filter]

Figure 1.2: L1-T filtering versus HP filtering for the model (1.3) [panels: Signal, Noisy signal, L1-T filter, HP filter]

condition on the first derivative, we can expect to get a fitted signal with zero slope. The cost of this penalty is proportional to the number of jumps. In this case, we would like to minimize the following objective function:

$$\frac{1}{2}\sum_{t=1}^{n}(y_t - x_t)^2 + \lambda\sum_{t=2}^{n}|x_t - x_{t-1}|$$

or in the vectorial form:

$$\frac{1}{2}\|y - x\|_2^2 + \lambda\|Dx\|_1$$

Here the D operator is the (n − 1) × n matrix which is the discrete version of the first-order derivative:

$$D = \begin{pmatrix} -1 & 1 & & & \\ & -1 & 1 & & \\ & & \ddots & \ddots & \\ & & & -1 & 1 \end{pmatrix} \qquad (1.4)$$

We may apply the same minimization algorithm as previously (see Appendix A.1.1). To illustrate this, we consider a model with step trend lines perturbed by a white noise process:

$$\begin{cases} y_t = x_t + \varepsilon_t \\ \varepsilon_t \sim \mathcal{N}(0, \sigma^2) \\ \Pr\{x_t = x_{t-1}\} = p \\ \Pr\{x_t = b\,(U_{[0,1]} - \tfrac{1}{2})\} = 1 - p \end{cases} \qquad (1.5)$$

We employ this model for testing the L1−C filtering and the HP filtering adapted to the first derivative6, which corresponds to the following optimization program:

$$\min\; \frac{1}{2}\sum_{t=1}^{n}(y_t - x_t)^2 + \lambda\sum_{t=2}^{n}(x_t - x_{t-1})^2$$

In Figure 1.3, we have reported the corresponding results7. For the second test, we consider a mean-reverting process (Ornstein-Uhlenbeck process) whose mean value follows a regime-switching process:

$$\begin{cases} y_t = y_{t-1} + \theta\,(x_t - y_{t-1}) + \varepsilon_t \\ \varepsilon_t \sim \mathcal{N}(0, \sigma^2) \\ \Pr\{x_t = x_{t-1}\} = p \\ \Pr\{x_t = b\,(U_{[0,1]} - \tfrac{1}{2})\} = 1 - p \end{cases} \qquad (1.6)$$

Here, xt is the process which characterizes the mean value and θ is inversely proportional to the return time to the mean value. In Figure 1.4, we show how the L1−C filter can capture the original signal in comparison to the HP filter8.

6 We use the term HP filter in order to keep homogeneous notations. However, we notice that this filter is indeed the FLS filter proposed by Kalaba and Tesfatsion (1989) when the exogenous regressors are only a constant.
7 The parameters are p = 0.998, b = 50 and σ = 8.
8 For the simulation of the Ornstein-Uhlenbeck process, we have chosen p = 0.9985, b = 20, θ = 0.1 and σ = 2.


Figure 1.3: L1-C filtering versus HP filtering for the model (1.5) [panels: Signal, Noisy signal, L1-C filter, HP filter]

Figure 1.4: L1-C filtering versus HP filtering for the model (1.6) [panels: Signal, Noisy signal, L1-C filter, HP filter]

1.3.3 Mixing trend and mean-reverting properties

We now combine the two schemes proposed above. In this case, we define two regularization parameters λ1 and λ2 corresponding to the two penalty conditions $\sum_{t=2}^{n}|x_t - x_{t-1}|$ and $\sum_{t=2}^{n-1}|x_{t-1} - 2x_t + x_{t+1}|$. Our objective function for the primal problem now becomes:

$$\frac{1}{2}\sum_{t=1}^{n}(y_t - x_t)^2 + \lambda_1\sum_{t=2}^{n}|x_t - x_{t-1}| + \lambda_2\sum_{t=2}^{n-1}|x_{t-1} - 2x_t + x_{t+1}|$$

which can again be rewritten in matrix form:

$$\frac{1}{2}\|y - x\|_2^2 + \lambda_1\|D_1 x\|_1 + \lambda_2\|D_2 x\|_1$$

where the D1 and D2 operators are respectively the (n − 1) × n and (n − 2) × n matrices defined in equations (1.4) and (1.1). In Figures 1.5 and 1.6, we test the efficiency of the mixing scheme on the straight trend lines model (1.2) and the random walk model (1.3)9.

Figure 1.5: L1-TC filtering versus HP filtering for the model (1.2) [panels: Signal, Noisy signal, L1-TC filter, HP filter]

Figure 1.6: L1-TC filtering versus HP filtering for the model (1.3) [panels: Signal, Noisy signal, L1-TC filter, HP filter]


1.3.4 How to calibrate the regularization parameters?

As shown above, the trend obtained from L1 filtering depends on the regularization parameter λ: for large values of λ, we obtain the long-term trend of the data, while for small values of λ, we obtain short-term trends of the data. In this paragraph, we define a procedure for choosing the smoothing parameter according to the desired horizon of trend extraction.

9 For both models, the parameters are p = 0.99, b = 0.5 and σ = 5.

A preliminary remark. For small values of λ, we recover the original form of the signal. For large values of λ, there exists a maximum value λmax above which the trend signal has the affine form xt = α + βt, where α and β are two constants which do not depend on the time t. The value of λmax is given by:

$$\lambda_{\max} = \left\|\left(D D^{\top}\right)^{-1} D\,y\right\|_{\infty}$$



We can use this remark to get an idea of the order of magnitude of λ which should be used to determine the trend over a certain time period T. If we want the global trend over the whole period, we fix λ = λmax; this λ gives a unique trend for the signal over the whole period. If one needs more detail on the trend over shorter periods, we can divide the signal into p time intervals and then estimate λ via the mean value of all the λ^i_max parameters:

$$\lambda = \frac{1}{p}\sum_{i=1}^{p}\lambda_{\max}^{i}$$
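A direct transcription of the λmax formula is sketched below, assuming the second-difference operator D of the L1−T filter; for the L1−C filter one would use the first-difference matrix (1.4) instead.

import numpy as np

def lambda_max(y):
    """lambda_max = || (D D')^{-1} D y ||_inf for the L1-T filter."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i], D[i, i + 1], D[i, i + 2] = 1.0, -2.0, 1.0
    v = np.linalg.solve(D @ D.T, D @ y)
    return np.max(np.abs(v))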

In Figure 1.7, we show the results obtained with p = 2 (λ = 1 500) and p = 6 (λ = 75) on the S&P 500 index.

Figure 1.7: Influence of the smoothing parameter λ [log price of the S&P 500, 2007-2011, with the filtered trends for two values of λ]

Moreover, the explicit calculation for a Brownian motion process gives us the scaling law of the smoothing parameter λmax. For the trend filtering scheme, λmax scales as T^{5/2}, while for the mean-reverting scheme, λmax scales as T^{3/2} (see Figure 1.8). Numerical calculation of these powers over 500 simulations of the model (1.3) gives very good agreement with the analytical result for Brownian motion. Indeed, we obtain empirically that the power for the L1−T filter is 2.51 while the one for the L1−C filter is 1.52.

Figure 1.8: Scaling power law of the smoothing parameter λmax

Cross validation procedure. In this paragraph, we discuss how to employ a cross-validation scheme in order to calibrate the smoothing parameter λ of our model. We define two additional parameters which characterize the trend detection mechanism. The first parameter T1 is the width of the data window used to estimate the optimal λ with respect to our target strategy; this parameter controls the precision of our calibration. The second parameter T2 is used to estimate the prediction error of the trends obtained in the main window; this parameter characterizes the time horizon of the investment strategy. Figure 1.9 shows how the data set is divided into different windows in the cross-validation procedure. In order to get the optimal parameter λ, we compute the total error after scanning the whole data set with the window T1. The algorithm of this calibration process is described as follows:

Figure 1.9: Cross-validation procedure for determining optimal value λ⋆ [a training window of length T1 and an adjacent test window of length T2 are rolled over the historical data; the forecast is made over a window of length T2 beyond today]


Algorithm 1 Cross validation procedure for L1 filtering
procedure CV_Filter(T1, T2)
  Divide the historical data into m rolling test sets T2^i (i = 1, . . . , m)
  For each test window T2^i, compute the statistic λ^i_max
  From the array of λ^i_max, compute the average λ̄ and the standard deviation σ_λ
  Compute the boundaries λ1 = λ̄ − 2σ_λ and λ2 = λ̄ + 2σ_λ
  for j = 1 : n do
    Compute λ_j = λ1 (λ2/λ1)^(j/n)
    Divide the historical data into p rolling training sets T1^k (k = 1, . . . , p)
    for k = 1 : p do
      For each training window T1^k, run the L1 filter
      Forecast the trend for the adjacent test window T2^k
      Compute the error e^k(λ_j) on the test window T2^k
    end for
    Compute the total error e(λ_j) = Σ_{k=1}^{p} e^k(λ_j)
  end for
  Minimize the total error e(λ) to find the optimal value λ⋆
  Run the L1 filter with λ = λ⋆
end procedure

Figure 1.10 illustrates the calibration procedure with T1 = 400 and T2 = 50 for the S&P 500 index (the number of observations is equal to 1 008 trading days). With m = p = 12 and n = 15, the estimated optimal value λ⋆ for the L1−T filter is equal to 7.03.

We have observed that this calibration procedure is more favorable for a long-term time horizon, that is, for estimating a global trend. For a short-term time horizon, the prediction of local trends is much more perturbed by the noise. We have computed the probability of a good prediction of the market tendency for long-term and short-term time horizons: this probability is about 70% for a 3-month time horizon, while it is just 50% for a one-week time horizon. It follows that even if the fit is good for the past, the noise is large, meaning that the prediction of the future tendency is just 1/2 for an increasing market and 1/2 for a decreasing market.

In order to obtain better results for smaller time horizons, we improve the last algorithm by proposing a two-trend model. The first trend is the local one, which is determined by the first algorithm with the parameter T2 corresponding to the local prediction. The second trend is the global one, which gives the tendency of the market over a longer period T3. The choice of this global trend parameter is very similar to the choice of the moving-average parameter. This model can be considered as a simple version of a mean-reverting model for the trend. In Figure 1.11, we describe how the data set is divided for estimating the local trend and the global trend. The procedure for estimating the trend of the signal in the two-trend model is summarized in Algorithm 2. The corrected trend is determined by studying the relative position of the historical data to the global trend. The reference position is characterized by the standard deviation σ(y_t − x^G_t), where x^G_t is the filtered global trend.

Figure 1.10: Calibration procedure with the S&P 500 index [top panel: S&P 500 (log scale), 2007-2011; bottom panel: total error e(λ) versus ln λ]
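A compact Python skeleton of Algorithm 1 is sketched below. The helpers l1_filter and forecast_error are hypothetical placeholders (the thesis does not specify the error measure here), and lambda_max refers to the sketch given in the preliminary remark above.

import numpy as np

def calibrate_lambda(y, T1, T2, l1_filter, forecast_error, n_grid=15):
    """Skeleton of Algorithm 1: cross-validation of the smoothing parameter."""
    y = np.asarray(y, dtype=float)
    # Step 1: bound the search interval with lambda_max over rolling test windows
    lam_max_i = [lambda_max(y[s:s + T2]) for s in range(0, len(y) - T2 + 1, T2)]
    lam_bar, sd = np.mean(lam_max_i), np.std(lam_max_i)
    lam_lo, lam_hi = max(lam_bar - 2 * sd, 1e-6), lam_bar + 2 * sd
    # Step 2: geometric grid lambda_j = lam_lo * (lam_hi / lam_lo)^(j/n)
    grid = lam_lo * (lam_hi / lam_lo) ** (np.arange(1, n_grid + 1) / n_grid)
    # Step 3: total out-of-sample error over rolling training/test windows
    total_error = []
    for lam in grid:
        err = 0.0
        for s in range(0, len(y) - T1 - T2 + 1, T1):
            fitted = l1_filter(y[s:s + T1], lam)                   # training window
            err += forecast_error(fitted, y[s + T1:s + T1 + T2])   # adjacent test window
        total_error.append(err)
    return grid[int(np.argmin(total_error))]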

1.4 Application to momentum strategies

In this section, we apply the previous framework to the S&P 500 index. First, we illustrate the calibration procedure for a given trading date. Then, we backtest a momentum strategy by estimating dynamically the optimal filters.

1.4.1 Estimating the optimal filter for a given trading date

We would like to estimate the optimal filter for January 3rd, 2011 by considering the period from January 2007 to December 2010. We use the previous algorithms with T1 = 400 and T2 = 50. The optimal parameters are λ1 = 2.46 (for the L1−C filter) and λ2 = 15.94 (for the L1−T filter). Results are reported in Figure 1.12. The trend for the next 50 trading days is estimated at 7.34% for the L1−T filter and 7.84% for the HP filter, whereas it is null for the L1−C and L1−TC filters. By comparison, the true performance of the S&P 500 index is 1.90% from January 3rd, 2011 to March 15th, 2011, which corresponds exactly to a period of 50 trading days.

Figure 1.11: Cross-validation procedure for two-trend model [training window T1, then a global-trend window T3 and a local-trend window T2 on the historical data; the prediction is made over T2 (local trend) and T3 (global trend) beyond today]

Figure 1.12: Comparison between different L1 filters on S&P 500 Index

Algorithm 2 Prediction procedure for the two-trend model
procedure Predict_Filter(T2, T3)
  Compute the local trend x^L_t for the time horizon T2 with the CV_Filter procedure
  Compute the global trend x^G_t for the time horizon T3 with the CV_Filter procedure
  Compute the standard deviation σ(y_t − x^G_t) of the data with respect to the global trend
  if |y_t − x^G_t| < σ(y_t − x^G_t) then
    Prediction ← x^L_t
  else
    Prediction ← x^G_t
  end if
end procedure


1.4.2 Backtest of a momentum strategy

Design of the strategy. Let us consider a class of self-financed strategies on a risky asset St and a risk-free asset Bt. We assume that the dynamics of these assets is:

$$dB_t = r_t\,B_t\,dt$$
$$dS_t = \mu_t\,S_t\,dt + \sigma_t\,S_t\,dW_t$$

where rt is the risk-free rate, µt is the trend of the asset price and σt is the volatility. We denote by αt the proportion invested in the risky asset and by (1 − αt) the part invested in the risk-free asset. We start with an initial budget W0 and expect a final wealth WT. The optimal strategy is the one which maximizes the expectation of the utility function U(WT), which is increasing and concave. It is equivalent to the Markowitz problem, which consists of maximizing the wealth of the portfolio under a penalty of risk:

$$\sup_{\alpha\in\mathbb{R}}\; E\left(W_T^{\alpha}\right) - \frac{\lambda}{2}\,\sigma^2\left(W_T^{\alpha}\right)$$

which is equivalent to:

$$\sup_{\alpha\in\mathbb{R}}\; \alpha_t\,\mu_t - \frac{\lambda}{2}\,W_0\,\alpha_t^2\,\sigma_t^2$$

As the objective function is concave, the maximum corresponds to the zero of the gradient µt − λ W0 αt σt². We obtain the optimal solution:

$$\alpha_t^{\star} = \frac{1}{\lambda W_0}\,\frac{\mu_t}{\sigma_t^2}$$

In order to limit the explosion of αt, we also impose the constraint αmin ≤ αt ≤ αmax:

$$\alpha_t^{\star} = \max\left(\min\left(\frac{1}{\lambda W_0}\,\frac{\mu_t}{\sigma_t^2},\,\alpha_{\max}\right),\,\alpha_{\min}\right)$$

The wealth of the portfolio is then given by the following expression:

$$W_{t+1} = W_t + W_t\left(\alpha_t^{\star}\left(\frac{S_{t+1}}{S_t} - 1\right) + \left(1 - \alpha_t^{\star}\right) r_t\right)$$

Results. In the following simulations, we use the estimators µ̂t and σ̂t in place of µt and σt. For µ̂t, we consider different models such as the L1, HP and moving-average filters11, whereas we use the following estimator for the volatility:

$$\hat{\sigma}_t^2 = \frac{1}{T}\int_{0}^{T}\sigma_u^2\,du = \frac{1}{T}\sum_{i=t-T+1}^{t}\ln^2\frac{S_i}{S_{i-1}}$$
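A minimal sketch of this allocation rule and wealth recursion follows; the function and variable names are mine, and the trend and volatility estimates are taken as given inputs.

import numpy as np

def backtest(prices, mu_hat, sigma_hat, r=0.0, lam=1.0, w0=1.0,
             alpha_min=-1.0, alpha_max=1.0):
    """Backtest sketch: alpha*_t = mu_t / (lam * W0 * sigma_t^2), capped to
    [alpha_min, alpha_max]; wealth updated with the realized asset return.
    `lam` is the risk-aversion parameter (not the filter's smoothing parameter)."""
    prices = np.asarray(prices, dtype=float)
    wealth = [w0]
    for t in range(len(prices) - 1):
        alpha = mu_hat[t] / (lam * w0 * sigma_hat[t] ** 2)
        alpha = min(max(alpha, alpha_min), alpha_max)
        asset_ret = prices[t + 1] / prices[t] - 1.0
        wealth.append(wealth[-1] + wealth[-1] * (alpha * asset_ret + (1.0 - alpha) * r))
    return np.array(wealth)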

We consider a long/short strategy, that is (αmin, αmax) = (−1, 1). In the particular case of the µ̂^L1_t estimator, we consider three different models, listed below.

11 We denote them respectively µ̂^L1_t, µ̂^HP_t and µ̂^MA_t.

Table 1.1: Results for the Backtest

Model          Performance  Volatility  Sharpe  IR    Drawdown
S&P 500           2.04%      21.83%     -0.06    -     56.78
µ̂MA               3.13%      18.27%     -0.01   0.03   33.83
µ̂HP               6.39%      18.28%      0.17   0.13   39.60
µ̂L1 (LT)          3.17%      17.55%     -0.01   0.03   25.11
µ̂L1 (GT)          6.95%      19.01%      0.19   0.14   31.02
µ̂L1 (LGT)         6.47%      18.18%      0.17   0.13   31.99

1. the first one is based on the local trend;
2. the second one is based on the global trend;
3. the third one is the combination of both local and global trends.

For all these strategies, the test set of the local trend T2 is equal to 6 months (or 130 trading days), whereas the length of the test set for the global trend is four times the length of the local test set – T3 = 4 T2 – meaning that T3 is one year (or 520 trading days). This choice of T3 agrees with the habitual choice of the window width in the moving-average estimator. The length of the training set T1 is also four times the length of the test set. The study period is from January 1998 to December 2010. In the backtest, the trend estimation is updated every day.

In Table 1.1, we summarize the results obtained with the different models cited above for the backtest. We remark that the best performances correspond to the global trend, HP and two-trend models. Because the HP filter is calibrated to the window of the moving-average filter, which is equal to T3, it is not surprising that the performances of these three models are similar. Over the considered backtest period, the S&P 500 does not have a clear upward or downward trend. Hence, the local trend estimator does not give a good prediction, and this strategy gives the worst performance. By contrast, the two-trend model takes into account the trade-off between the local trend and the global trend and gives a better result.

1.5 Extension to the multivariate case

We now extend the L1 filtering scheme to a multivariate time series yt = (y_t^{(1)}, . . . , y_t^{(m)}). The underlying idea is to estimate the common trend of several univariate time series. In finance, the time series correspond to the prices of several assets. Therefore, we can build long/short strategies between these assets by comparing the individual trends and the common trend. For the sake of simplicity, we assume that all the signals are rescaled to the same

order of magnitude12. The objective function becomes:

$$\frac{1}{2}\sum_{i=1}^{m}\left\|y^{(i)} - x\right\|_2^2 + \lambda\,\|Dx\|_1$$

In Appendix A.1.1, we show that this problem is equivalent to the L1 univariate problem applied to the average signal $\bar{y}_t = m^{-1}\sum_{i=1}^{m} y_t^{(i)}$.
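A small sketch of this reduction is given below, assuming an l1_filter routine such as the one sketched in Section 1.3.1; the standardization follows footnote 12.

import numpy as np

def common_trend(signals, lam, l1_filter):
    """Multivariate L1 filtering reduced to the univariate case: standardize
    each series, average them, and filter the average signal y_bar."""
    Y = np.asarray(signals, dtype=float)                         # shape (m, n)
    Z = (Y - Y.mean(axis=1, keepdims=True)) / Y.std(axis=1, keepdims=True)
    y_bar = Z.mean(axis=0)
    return l1_filter(y_bar, lam)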

1.6 Conclusion

Momentum strategies are efficient ways to use the market tendency for building trading strategies. Hence, a good estimator of the trend is essential from this perspective. In this paper, we show that we can use L1 filters to forecast the trend of the market in a very simple way. We also propose a cross-validation procedure to calibrate the optimal regularization parameter λ, where the only information to provide is the investment time horizon. A more sophisticated model based on a local and a global trend is also discussed; we remark that this model can reflect the effect of mean reversion towards the global trend of the market. Finally, we consider several backtests on the S&P 500 index and obtain competitive results with respect to the traditional moving-average filter.

12 For example, we may center and standardize the time series by subtracting the mean and dividing by the standard deviation.


Bibliography

[1] Alexandrov T., Bianconcini S., Dagum E.B., Maass P. and McElroy T. (2008), A Review of Some Modern Approaches to the Problem of Trend Extraction, US Census Bureau, RRS #2008/03.
[2] Beveridge S. and Nelson C.R. (1981), A New Approach to the Decomposition of Economic Time Series into Permanent and Transitory Components with Particular Attention to Measurement of the Business Cycle, Journal of Monetary Economics, 7(2), pp. 151-174.
[3] Boyd S. and Vandenberghe L. (2009), Convex Optimization, Cambridge University Press.
[4] Cleveland W.P. and Tiao G.C. (1976), Decomposition of Seasonal Time Series: A Model for the Census X-11 Program, Journal of the American Statistical Association, 71(355), pp. 581-587.
[5] Daubechies I., Defrise M. and De Mol C. (2004), An Iterative Thresholding Algorithm for Linear Inverse Problems with a Sparsity Constraint, Communications on Pure and Applied Mathematics, 57(11), pp. 1413-1457.
[6] Efron B., Tibshirani R. and Friedman R. (2009), The Elements of Statistical Learning, Second Edition, Springer.
[7] Harvey A. (1991), Forecasting, Structural Time Series Models and the Kalman Filter, Cambridge University Press.
[8] Hodrick R.J. and Prescott E.C. (1997), Postwar U.S. Business Cycles: An Empirical Investigation, Journal of Money, Credit and Banking, 29(1), pp. 1-16.
[9] Kalaba R. and Tesfatsion L. (1989), Time-varying Linear Regression via Flexible Least Squares, Computers & Mathematics with Applications, 17, pp. 1215-1245.
[10] Kim S-J., Koh K., Boyd S. and Gorinevsky D. (2009), ℓ1 Trend Filtering, SIAM Review, 51(2), pp. 339-360.
[11] Tibshirani R. (1996), Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society B, 58(1), pp. 267-288.

Chapter 2

Volatility Estimation for Trading Strategies

We review in this chapter various techniques for estimating the volatility. We start by discussing the estimators based on the range of daily monitoring data; then we consider the stochastic volatility model in order to determine the instantaneous volatility. At high trading frequency, stock prices fluctuate with an additional noise, the so-called microstructure noise. This effect comes from the bid-ask bounce due to the short time scale: within a short time interval, the trading price does not converge to the equilibrium price determined by the "supply-demand" equilibrium. In the second part, we discuss the effect of the microstructure noise on the volatility estimation; it is a very important topic concerning the enormous field of "high-frequency" trading. Examples of backtesting on an index and on single stocks illustrate the efficiency of the considered techniques.

Keywords: Volatility, voltarget strategy, range-based estimator, high-low estimator, microstructure noise.

2.1 Introduction

Measuring the volatility is one of the most important questions in finance. As its name states, volatility is a direct measurement of the risk of a given asset. Under the hypothesis that the realized return follows a Brownian motion, volatility is usually estimated by the standard deviation of the daily price movement. As this assumption relates the stock price to the most common object of stochastic calculus, much mathematical work has been carried out on volatility estimation. With the increasing amount of trading data, we can exploit more and more useful information in order to improve the precision of the volatility estimator. A new class of estimators based on the high and low prices has been introduced. However, in the real world the asset price is not just a simple geometric Brownian process; different effects have been observed, including the drift and the opening jump. A general correction scheme based on the combination of various estimators has been studied in order to eliminate these effects.

As the trading frequency increases, we expect the precision of the estimator to improve as well. However, when the trading frequency reaches a certain limit1, new phenomena due to the non-equilibrium of the market emerge and spoil the precision. This is the microstructure noise, which is characterized by the bid-ask bounce and the transaction effect. Because of this noise, the realized variance estimator overestimates the true volatility of the price process. A correction based on the use of two different time scales aims to eliminate this effect.

The chapter is organized as follows. In Section 2.2, we review the basic volatility estimator using the variance of realized returns (note from B. Bruder's article) and then introduce the variations based on range estimation. In Section 2.3, we discuss how to measure the instantaneous volatility and the effect of the lag induced by the moving average. In Section 2.4, we discuss the effect of the microstructure noise on high-frequency volatility estimation.

2.2 Range-based estimators of volatility

2.2.1 Range based daily data

In this paragraph, we discuss the general characteristics of the asset price and introduce the basic notations which will be used for the rest of the chapter. Let us assume that the dynamics of the asset price follows the habitual Black-Scholes model. We denote by St the asset price, which follows a geometric Brownian motion in continuous time:

$$\frac{dS_t}{S_t} = \mu_t\,dt + \sigma_t\,dB_t \qquad (2.1)$$

Here, µt is the return or the drift of the process whereas σt is the volatility. Over a period of T = 1 trading day, the evolution is divided into two time intervals: the first interval, with ratio f, describes the closing interval (before opening), and the second interval, with ratio 1 − f, describes the opening interval (trading interval). In the monitored data, the closing interval is unobservable and is characterized by the jumps at the opening of the market: the measure of the closing interval is not given by the real closing time but by the jumps at the opening of the market. If the logarithm of the price follows a standard Brownian motion without drift, then the fraction f / (1 − f) is given by the square of the ratio between the standard deviation of the opening jump and that of the daily price movement. We will see that this idea can give a first correction for the close-open effect for all the estimators discussed below. In order to fix the notation, we define here different quantities concerning the statistics of the price evolution:

• T is the time interval of 1 trading day

1 This limit defines the optimal frequency for the classical estimator. It is more or less agreed to be one trade every 5 minutes.

Figure 2.1: Data set of 1 trading day

• f is the fraction of the closing period
• σ̂²_t is the estimator of the variance σ²_t
• O_{t_i} is the opening price on a given period [t_i, t_{i+1}[
• C_{t_i} is the closing price on a given period [t_i, t_{i+1}[
• H_{t_i} = max_{t∈[t_i, t_{i+1}[} S_t is the highest price on a given period [t_i, t_{i+1}[
• L_{t_i} = min_{t∈[t_i, t_{i+1}[} S_t is the lowest price on a given period [t_i, t_{i+1}[
• o_{t_i} = ln O_{t_i} − ln C_{t_{i−1}} is the opening jump
• u_{t_i} = ln H_{t_i} − ln O_{t_i} is the highest price movement during the trading open
• d_{t_i} = ln L_{t_i} − ln O_{t_i} is the lowest price movement during the trading open
• c_{t_i} = ln C_{t_i} − ln O_{t_i} is the daily price movement over the trading open period
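These quantities can be computed directly from open/high/low/close series; the following sketch (names are mine, not from the thesis) assumes one observation per trading day:

import numpy as np

def daily_quantities(open_, high, low, close):
    """Daily log quantities of Section 2.2.1: opening jump o, intraday
    high/low excursions u and d, and open-to-close movement c."""
    open_, high, low, close = map(np.asarray, (open_, high, low, close))
    o = np.log(open_[1:]) - np.log(close[:-1])   # opening jump vs previous close
    u = np.log(high[1:]) - np.log(open_[1:])     # highest intraday movement
    d = np.log(low[1:]) - np.log(open_[1:])      # lowest intraday movement
    c = np.log(close[1:]) - np.log(open_[1:])    # open-to-close movement
    return o, u, d, c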

2.2.2 Basic estimator

For the sake of simplicity, let us start this paragraph by assuming that there is no opening jump (f = 0). The asset price St described by the process (2.1) is observed at a series of discrete dates {t0, . . . , tn}. In general, this series is not necessarily regular. Let R_{t_i} be the realized return over the period [t_{i−1}, t_i[; then we obtain:

$$R_{t_i} = \ln S_{t_i} - \ln S_{t_{i-1}} = \int_{t_{i-1}}^{t_i}\left(\sigma_u\,dB_u + \left(\mu_u - \frac{1}{2}\sigma_u^2\right)du\right)$$

In the following, we assume that the couple (µt, σt) is independent of the Brownian motion Bt of the asset price evolution.

Estimator over a given period. In Appendix B.1, we show that the realized return R_{t_i} is related to the volatility as:

$$E\left[R_{t_i}^2 \mid \sigma, \mu\right] = (t_i - t_{i-1})\,\sigma_{t_i}^2 + (t_i - t_{i-1})^2\left(\mu_{t_{i-1}} - \frac{1}{2}\sigma_{t_{i-1}}^2\right)^2$$

This quantity cannot be a good estimator of the volatility because its standard deviation is √2 (t_i − t_{i−1}) σ²_{t_i}, which is proportional to the estimated quantity. In order to reduce the estimation error, we focus on the estimation of the average volatility over the period t_n − t_0. The average variance is defined as:

$$\bar{\sigma}^2 = \frac{1}{t_n - t_0}\int_{t_0}^{t_n}\sigma_u^2\,du \qquad (2.2)$$

This quantity can be measured by using the canonical estimator defined as:

$$\hat{\sigma}^2 = \frac{1}{t_n - t_0}\sum_{i=1}^{n} R_{t_i}^2$$

The variance of this estimator is approximated as var(σ̂²) ≈ 2σ̄⁴/n, or the standard deviation is proportional to √2 σ̄²/√n. It means that the estimation error is small if n is large enough. Indeed, the variance of the average volatility reads var(σ̂) ≈ σ̄²/(2n) and the standard deviation is approximately σ̄/√(2n).

Effect of the weight distribution. In general, we can define an estimator with a weight distribution w_i such as:

$$\hat{\sigma}^2 = \sum_{i=1}^{n} w_i\,R_{t_i}^2$$

Then the expectation value of the estimator is given by:

$$E\left[\hat{\sigma}^2 \mid \sigma, \mu\right] = \sum_{i=1}^{n}\int_{t_{i-1}}^{t_i} w_i\,\sigma_u^2\,du$$

A simple example of the general definition is the estimator with annualized return √ Ri / ti+1 − ti . In this case, our estimator becomes: n

σ ˆ2 =

1 X Rt2i n tn − t1 i=1

for which the expectation value is: n  2  X E σ ˆ |σ, µ = i=1

1 ti − ti−1 24

Z

ti

ti−1

σu2 du

(2.3)


We remark that if the time step (time increment) is constant, t_i − t_{i−1} = T, then we obtain the same result as with the canonical estimator. However, if the time step t_i − t_{i−1} is not constant, the long-period returns are underweighted while the short-period returns are overweighted. We will see in the discussion of realized volatility below that the choice of the weight distribution can help to improve the quality of the estimator. For example, we will show that the IGARCH estimation leads to an exponential weight distribution which is more appropriate for estimating the realized volatility.

Close to close, open to close estimators

As discussed above, the volatility can be estimated by using a moving average on a discrete data set. The standard measurement is to apply the above result of the canonical estimator to the closing prices (the so-called "close to close" estimator):

\hat{\sigma}_{CC}^2 = \frac{1}{(n-1)\,T} \sum_{i=1}^{n} \left( (o_{t_i} + c_{t_i}) - (\bar{o} + \bar{c}) \right)^2

Here, T is the time period corresponding to 1 trading day. In the rest of the paper, we use CC to denote the close-to-close estimator. We remark that this formula differs in two points from the one defined above. Firstly, we have subtracted the mean value of the close-to-close return (\bar{o} + \bar{c}) in order to eliminate the drift effect:

\bar{o} = \frac{1}{n} \sum_{i=1}^{n} o_{t_i}, \qquad \bar{c} = \frac{1}{n} \sum_{i=1}^{n} c_{t_i}

Secondly, the prefactor is now 1/((n−1)T) and not 1/(nT). Indeed, since we have subtracted the mean value, the maximum likelihood procedure leads to the factor 1/((n−1)T). We can also define two other volatility estimators, namely the "open to close" estimator (OC):

\hat{\sigma}_{C}^2 = \frac{1}{(n-1)\,T} \sum_{i=1}^{n} (c_{t_i} - \bar{c})^2

and the "close to open" estimator (CO):

\hat{\sigma}_{O}^2 = \frac{1}{(n-1)\,T} \sum_{i=1}^{n} (o_{t_i} - \bar{o})^2

We recall that o_{t_i} is the opening jump for a given trading period and c_{t_i} is the daily movement of the asset price, so that the close-to-close return is equal to o_{t_i} + c_{t_i}. We remark that the "close to close" estimator does not depend on the drift or on the closing interval f. In the absence of microstructure noise, this estimator is unbiased. Hence, it is usually used as a benchmark to judge the efficiency of the other estimators σ̂, which is defined as:

\mathrm{eff}(\hat{\sigma}^2) = \frac{\mathrm{var}(\hat{\sigma}_{CC}^2)}{\mathrm{var}(\hat{\sigma}^2)}

where var(\hat{\sigma}_{CC}^2) = 2\sigma^4/n. A good estimator is characterized by a high value of the efficiency, eff(\hat{\sigma}^2) > 1.
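To make these definitions concrete, the following minimal NumPy sketch (our own illustration, not the original code of the study) computes the CC, OC and CO estimators from daily open and close prices; the function names and the annualization convention T = 1/252 are our assumptions.

```python
import numpy as np

def daily_log_moves(open_, close):
    """Opening jumps o_t = ln O_t - ln C_{t-1} and intraday moves c_t = ln C_t - ln O_t."""
    open_, close = np.asarray(open_, float), np.asarray(close, float)
    o = np.log(open_[1:]) - np.log(close[:-1])
    c = np.log(close[1:]) - np.log(open_[1:])
    return o, c

def var_cc(o, c, T=1.0 / 252):
    """Close-to-close variance estimator (drift removed, 1/(n-1) normalisation)."""
    r = o + c
    n = r.size
    return np.sum((r - r.mean()) ** 2) / ((n - 1) * T)

def var_oc(c, T=1.0 / 252):
    """Open-to-close estimator based on the intraday moves only."""
    n = c.size
    return np.sum((c - c.mean()) ** 2) / ((n - 1) * T)

def var_co(o, T=1.0 / 252):
    """Close-to-open estimator based on the opening jumps only."""
    n = o.size
    return np.sum((o - o.mean()) ** 2) / ((n - 1) * T)

if __name__ == "__main__":
    # Toy example with synthetic prices and no opening jump
    rng = np.random.default_rng(0)
    T = 1.0 / 252
    close = 100 * np.exp(np.cumsum(0.15 * np.sqrt(T) * rng.standard_normal(500)))
    open_ = np.r_[close[0], close[:-1]]
    o, c = daily_log_moves(open_, close)
    print("sigma_CC =", np.sqrt(var_cc(o, c)))   # annualised volatility, close to 15%
```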


2.2.3 High-low estimators

We have seen that the daily deviation can be used to define an estimator of the volatility. This comes from the assumption that the logarithm of the price follows a Brownian motion. We know that the standard deviation of a diffusive process over a time interval ∆t is proportional to σ√∆t, hence using the variance to estimate the volatility is quite intuitive. Within a given time interval, if additional information on the price movement is available, such as the highest and lowest values, this range must also provide a good measure of the volatility. This idea was first addressed by W. Feller in 1951. Later, Parkinson (1980) employed the first result of Feller's work to provide the first "high-low" estimator (the so-called Parkinson estimator). If one uses closing prices to estimate the volatility, one can eliminate the effect of the drift by subtracting the mean value of the daily variation. By contrast, the use of high and low prices cannot eliminate the drift effect in such a simple way. In addition, the high and low prices can only be observed during the opening interval, so they cannot eliminate the second effect due to the opening jump. Moreover, as demonstrated in the work of Parkinson (1980), this estimator gives a better confidence interval but it obviously underestimates the volatility because of the discrete observation of the price: the maximum and minimum values observed over a time interval ∆t are not the true ones of the Brownian motion. They are underestimated, so it is not surprising that the result depends strongly on the frequency of the price quotation. In a high-frequency market, this third effect can be negligible; we will discuss it later. Because of the limitations of Parkinson's estimator, another estimator, also based on the work of Feller, was proposed by Kunitomo (1992). In order to eliminate the drift, he constructs a Brownian bridge whose range is again related to the diffusion coefficient. In the same line of thought, Rogers and Satchell (1991) propose another use of the high and low prices in order to obtain a drift-independent volatility estimator. In this section, we review these techniques, which all remain affected by the opening jump.

The Parkinson estimator

Let us consider the random variable u_{t_i} − d_{t_i} (namely the range of the Brownian motion over the period [t_i, t_{i+1}[). The Parkinson estimator is defined by using the following result (Feller 1951):

E[(u - d)^2] = (4 \ln 2)\, \sigma^2 T

By inverting this formula, we obtain a natural estimator of the volatility based on high and low prices. Parkinson's volatility estimator is then defined as (Parkinson 1980):

\hat{\sigma}_P^2 = \frac{1}{4 \ln 2}\, \frac{1}{nT} \sum_{i=1}^{n} (u_{t_i} - d_{t_i})^2


In order to estimate the error of this estimator, we compute the variance of σ̂_P², which is given by the following expression:

\mathrm{var}(\hat{\sigma}_P^2) = \frac{\sigma^4}{n} \left( \frac{9\,\zeta(3)}{16 (\ln 2)^2} - 1 \right)

Here, ζ(x) is the Riemann zeta function. In comparison with the benchmark "close to close" estimator, we have an efficiency:

\mathrm{eff}(\hat{\sigma}_P^2) = \frac{32 (\ln 2)^2}{9\,\zeta(3) - 16 (\ln 2)^2} = 4.91

The Garman-Klass estimator

Another idea, employing the additional information from the high and low values of the price movement within the trading day in order to increase the estimator efficiency, was introduced by Garman and Klass (1980). They construct a best analytic scale-invariant estimator by proposing a quadratic form and imposing the well-known invariance condition of the Brownian motion on the set of variables (u, d, c). By minimizing its variance, they obtain the optimal quadratic estimator, which is given by the following property:

E\left[ 0.511\,(u-d)^2 - 0.019\,\left( c\,(u+d) - 2ud \right) - 0.383\,c^2 \right] = \sigma^2 T

The Garman-Klass estimator is then defined as:

\hat{\sigma}_{GK}^2 = \frac{1}{nT} \sum_{i=1}^{n} \left[ 0.511\,(u_{t_i} - d_{t_i})^2 - 0.019\,\left( c_{t_i}(u_{t_i} + d_{t_i}) - 2 u_{t_i} d_{t_i} \right) - 0.383\,c_{t_i}^2 \right]

The minimal value of the variance of this quadratic estimator is var(\hat{\sigma}_{GK}^2) = 0.27\,\sigma^4/n and its efficiency is eff(\hat{\sigma}_{GK}^2) = 7.4.

The Kunitomo estimator

Let X_t be the logarithm of the price process, X_t = ln S_t. Itô's lemma gives its evolution:

dX_t = \left( \mu_t - \frac{\sigma_t^2}{2} \right) dt + \sigma_t \, dB_t

If the drift term becomes relevant in the estimation of the volatility, one can eliminate it by constructing a Brownian bridge over the period T as follows:

W_t = X_t - \frac{t}{T}\, X_T

If the initial condition is normalized to X_0 = 0, then by construction we always have W_T = 0. This construction automatically eliminates the drift term when its daily variation is small, µ_{t_{i+1}} − µ_{t_i} ≪ µ_{t_i}. We define the range of the Brownian bridge as D_{t_i} = M_{t_i} − m_{t_i}, where M_{t_i} = max_{t∈[t_i,t_{i+1}[} W_t and m_{t_i} = min_{t∈[t_i,t_{i+1}[} W_t. It has been demonstrated that the expected squared range of the Brownian bridge is directly proportional to the variance (Feller 1951):

E[D^2] = \frac{\pi^2}{6}\, \sigma^2 T \qquad (2.4)

Hence, Kunitomo's estimator is defined as follows:

\hat{\sigma}_K^2 = \frac{6}{\pi^2}\, \frac{1}{nT} \sum_{i=1}^{n} (M_{t_i} - m_{t_i})^2

Higher moments of the range of the Brownian bridge can also be calculated analytically and are given by formula (2.10) in Kunitomo (1992). In particular, the variance of Kunitomo's estimator is equal to var(\hat{\sigma}_K^2) = \sigma^4/(5n), which implies an efficiency eff(\hat{\sigma}_K^2) = 10.

The Rogers-Satchell estimator

Another way to eliminate the drift effect was proposed by Rogers and Satchell (1991). They consider the following property of the Brownian motion:

E[u\,(u - c) + d\,(d - c)] = \sigma^2 T

This expectation value does not depend on the drift of the Brownian motion, hence it provides a drift-independent estimator, which can be defined as:

\hat{\sigma}_{RS}^2 = \frac{1}{nT} \sum_{i=1}^{n} \left[ u_{t_i}(u_{t_i} - c_{t_i}) + d_{t_i}(d_{t_i} - c_{t_i}) \right]

The variance of this estimator is given by var(\hat{\sigma}_{RS}^2) = 0.331\,\sigma^4/n, which gives an efficiency eff(\hat{\sigma}_{RS}^2) = 6.

Like the other techniques based on the high-low range, this estimator underestimates the volatility due to the fact that the maximum of a discretized Brownian motion is smaller than the true one. Rogers and Satchell have also proposed a correction scheme which can be generalized to the other techniques. Let M be the number of quoted prices and h = T/M the discretization step; the corrected estimator taking into account the finite-step error is given by the root of the following equation:

\hat{\sigma}_h^2 = 2bh\,\hat{\sigma}_h^2 + 2(u - d)\,a\sqrt{h}\,\hat{\sigma}_h + \hat{\sigma}_{RS}^2

where a = \sqrt{2\pi}\left( 1/4 - (\sqrt{2} - 1) \right)/6 and b = (1 + 3\pi/4)/12.
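The high-low estimators reviewed above can be sketched in a few lines of NumPy. The snippet below is our own illustration (not the thesis code): it implements the Parkinson, Garman-Klass and Rogers-Satchell formulas from daily open/high/low/close data; the Kunitomo estimator is omitted because it requires the intraday path to build the Brownian bridge.

```python
import numpy as np

def hl_ranges(open_, high, low, close):
    """Log ranges u, d, c measured from the open, as defined in Section 2.2.1."""
    u = np.log(high) - np.log(open_)
    d = np.log(low) - np.log(open_)
    c = np.log(close) - np.log(open_)
    return u, d, c

def var_parkinson(u, d, T=1.0 / 252):
    # E[(u-d)^2] = 4 ln(2) sigma^2 T
    return np.mean((u - d) ** 2) / (4 * np.log(2) * T)

def var_garman_klass(u, d, c, T=1.0 / 252):
    q = 0.511 * (u - d) ** 2 - 0.019 * (c * (u + d) - 2 * u * d) - 0.383 * c ** 2
    return np.mean(q) / T

def var_rogers_satchell(u, d, c, T=1.0 / 252):
    return np.mean(u * (u - c) + d * (d - c)) / T
```

Each function returns an annualized variance; the corresponding volatility is obtained by taking the square root.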

2.2.4 How to eliminate both drift and opening effects?

A common way to eliminate both effects, coming from the drift and from the opening jump, is to combine the various available volatility estimators. The general scheme is to form a linear combination of the opening estimator σ̂_O² and the close estimator σ̂_C² or a high-low estimator σ̂_HL². The coefficients of this combination are determined by minimizing the variance of the resulting estimator. Given the fraction of the closing interval f, we can improve all the high-low estimators discussed above by introducing the combination:

\hat{\sigma}^2 = \alpha\, \frac{\hat{\sigma}_O^2}{f} + (1 - \alpha)\, \frac{\hat{\sigma}_{HL}^2}{1 - f}

Here, the trivial choice is α = f, for which the estimator becomes independent of the opening jump. However, the optimal value of the coefficient is α = 0.17 for the Parkinson and Kunitomo estimators whereas it is α = 0.12 for the Garman-Klass estimator (Garman and Klass 1980). This technique can eliminate the effect of the opening jump for all estimators, but only the Kunitomo estimator can avoid both effects. Applying the same idea, Yang and Zhang (2000) have proposed another combination which, like the Kunitomo estimator, can also eliminate both effects. They choose the following combination:

\hat{\sigma}_{YZ}^2 = \alpha\, \frac{\hat{\sigma}_O^2}{f} + \frac{1 - \alpha}{1 - f} \left( \kappa\, \hat{\sigma}_C^2 + (1 - \kappa)\, \hat{\sigma}_{HL}^2 \right)

In the work of Yang and Zhang, σ̂_RS² is used as the high-low estimator because it is drift-independent. The coefficient α is chosen as α = f and κ is obtained by optimizing the variance of the estimator. The minimization procedure gives the optimal value of the parameter κ:

\kappa_o = \frac{\beta - 1}{\beta + \frac{n+1}{n-1}}

where β = E\left[ \left( u(u - c) + d(d - c) \right)^2 \right] / \left( \sigma^4 (1 - f)^2 \right). As the numerator is proportional to (1 − f)², β is independent of f. In fact, the value of β does not vary much (from 1.331 to 1.5) when the drift changes. In practice, the value of β is chosen as 1.34.
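As an illustration, a possible implementation of the Yang-Zhang combination with α = f and the optimal κ is sketched below. This is our own sketch, not the original code: the helper computes σ̂_O², σ̂_C² and σ̂_RS² internally, assumes f > 0, and the default β = 1.34 follows the practical choice mentioned above.

```python
import numpy as np

def var_yang_zhang(o, u, d, c, f, T=1.0 / 252, beta=1.34):
    """Yang-Zhang combination with alpha = f (assumes 0 < f < 1):
    sigma^2_YZ = sigma^2_O/f*alpha + (1-alpha)/(1-f) * (kappa*sigma^2_C + (1-kappa)*sigma^2_RS)."""
    n = o.size
    var_o = np.sum((o - o.mean()) ** 2) / ((n - 1) * T)      # close-to-open estimator
    var_c = np.sum((c - c.mean()) ** 2) / ((n - 1) * T)      # open-to-close estimator
    var_rs = np.mean(u * (u - c) + d * (d - c)) / T          # Rogers-Satchell (drift independent)
    kappa = (beta - 1.0) / (beta + (n + 1.0) / (n - 1.0))    # optimal kappa
    alpha = f
    return alpha * var_o / f + (1 - alpha) / (1 - f) * (kappa * var_c + (1 - kappa) * var_rs)
```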

2.2.5 Numerical simulations

Simulation with constant volatility

We test the various volatility estimators on a simulation of a geometric Brownian motion with a constant annualized drift µ = 30% and a constant annualized volatility σ = 15%. We run the simulation over N = 1000 trading days with M = 50 or 500 intraday observations in order to illustrate the effect of the price discretization on the family of high-low estimators (a simulation sketch is given after the list below).

• Effect of the discretization. We first test the effect of the discretization on the various estimators. Here,


we take M = 50 or 500 intraday observations with µ = 0 and f = 0. In Figure 2.2, we present the simulation results for M = 50 price quotations per trading day. All the high-low estimators are weakly biased due to the discretization effect: they underestimate the volatility because the observed range is smaller than the true range of the Brownian motion. We remark that the close-to-close estimator is unbiased but its variance is too large. The correction scheme proposed by Rogers and Satchell can eliminate the discretization effect. When the number of observations is larger, the discretization effect is negligible and all estimators are unbiased (see Figure 2.3).

Figure 2.2: Volatility estimators without drift and opening effects (M = 50)

• Effect of a non-zero drift. We now consider the case of a non-zero annualized drift µ = 30%. Here, we take M = 500 intraday observations. In Figure 2.4, we observe that the Parkinson and Garman-Klass estimators are strongly dependent on the drift of the Brownian motion, whereas the Kunitomo and Rogers-Satchell estimators are not.

• Effect of the opening jump. To study the effect of the opening jump, we simulate data with f = 0.3. In Figure 2.5, we take M = 500 intraday observations with zero drift µ = 0. We observe that, with an opening jump, all high-low estimators underestimate the volatility except for the YZ estimator. By combining the open volatility estimator σ̂_O² with the other estimators, the effect of the opening jump can be completely eliminated (see Figure 2.6).
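The simulation sketch announced above is given here (our own code): it generates a geometric Brownian motion with M intraday steps per day and a closing fraction f, and returns the daily open/high/low/close series needed by the estimators. The function name and the 252-day convention are assumptions made for this illustration.

```python
import numpy as np

def simulate_ohlc(n_days=1000, m=50, mu=0.30, sigma=0.15, f=0.0, s0=100.0, seed=0):
    """Simulate a GBM sampled m times per trading day and record open/high/low/close.
    A fraction f of each day's variance is attributed to the (unobserved) closing period."""
    rng = np.random.default_rng(seed)
    T = 1.0 / 252
    ohlc = np.empty((n_days, 4))
    x = np.log(s0)
    for day in range(n_days):
        # overnight (closing-period) move, observed only through the opening jump
        x += (mu - 0.5 * sigma ** 2) * f * T + sigma * np.sqrt(f * T) * rng.standard_normal()
        dt = (1.0 - f) * T / m
        steps = (mu - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * rng.standard_normal(m)
        path = x + np.cumsum(steps)
        ohlc[day] = [np.exp(x),
                     np.exp(max(x, path.max())),
                     np.exp(min(x, path.min())),
                     np.exp(path[-1])]
        x = path[-1]
    return ohlc  # columns: open, high, low, close
```

Feeding these simulated prices to the estimators of Sections 2.2.2 and 2.2.3 reproduces the discretization, drift and opening-jump effects discussed above.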


Figure 2.3: Volatility estimators without drift and opening effects (M = 500)

Figure 2.4: Volatility estimators with µ = 30% and without opening effect (M = 500)


Figure 2.5: Volatility estimators with opening effect f = 0.3 and without drift (M = 500)

Figure 2.6: Volatility estimators with correction of the opening jump (f = 0.3)


Simulation with stochastic volatility

We now consider a simulation with stochastic volatility, described by the following model:

\begin{cases} dS_t = \mu_t S_t \, dt + \sigma_t S_t \, dB_t \\ d\sigma_t^2 = \xi\, \sigma_t^2 \, dB_t^\sigma \end{cases} \qquad (2.5)

in which B_t^σ is a Brownian motion independent of the one driving the asset price.

We first estimate the volatility with all the proposed estimators and then verify their quality via a backtest using the voltarget strategy (2). For the simulation of the volatility, we take the same parameters as above with f = 0, µ = 0, N = 5000, M = 500, ξ = 0.01 and σ_0 = 0.4. In Figure 2.7, we present the results corresponding to the different estimators. We remark that the group of high-low estimators gives a better result for the volatility estimation.

Figure 2.7: Volatility estimators on the stochastic volatility simulation

We can estimate the error committed by each estimator with the following quantity:

\sum_{t=1}^{N} (\hat{\sigma}_t - \sigma_t)^2

The errors obtained for the various estimators are summarized in Table 2.1 below. We now apply the volatility estimates to run the voltarget strategies. The result of this test is presented in Figure 2.8.

(2) The detailed description of the voltarget strategy is presented in the Backtest section (2.2.6).

Table 2.1: Estimation error for the various estimators

Estimator                         σ̂²_CC    σ̂²_P    σ̂²_K    σ̂²_GK   σ̂²_RS   σ̂²_YZ
Σ_{t=1}^N (σ̂_t − σ_t)²           0.135     0.072    0.063    0.08     0.076    0.065

In order to control the quality of the voltarget strategy, we compute the volatility of the voltarget strategy obtained with each estimator. We remark that the volatility of the voltarget strategies is computed with the close-to-close estimator using the same averaging window of 3 months (or 65 trading days). The result is reported in Figure 2.9. As shown in the figure, all estimators give more or less the same results. If we compute the error committed by these estimators, we obtain 0.9491 for CC, 1.0331 for P, 0.9491 for K, 1.2344 for GK, 1.2703 for RS and 1.1383 for YZ. This result may come from the fact that we have used the close-to-close estimator to calculate the volatility of all the voltarget strategies.

Figure 2.8: Test of the voltarget strategy with the stochastic volatility simulation

Hence, we consider another check of the estimation quality. We compute the realized returns of the voltarget strategies:

R_V(t_i) = \ln V_{t_i} - \ln V_{t_{i-1}}

where V_{t_i} is the wealth of the voltarget portfolio. We expect this quantity to follow a Gaussian probability distribution with volatility σ* = 15%. Figure 2.10 shows the probability density function (Pdf) of the realized returns corresponding to all the considered estimators. In order to have a more visible result, we compute the difference between the cumulative distribution function (Cdf) of each estimator and the expected Cdf (see Figure 2.11). Both results confirm that the Parkinson and Kunitomo estimators improve the quality of the volatility estimation.

Figure 2.9: Volatility of the voltarget strategies on the stochastic volatility simulation

2.2.6 Backtest

Volatility estimation for the S&P 500 index

We now apply the estimators discussed above to the S&P 500 index. Here, we do not have tick-by-tick intraday data, hence the Kunitomo estimator and the Rogers-Satchell correction cannot be applied. We remark that the effect of the drift is almost negligible, which is confirmed by the Parkinson and Garman-Klass estimators. The instantaneous fraction of the closing interval is estimated simply by:

f_t = \left( 1 + \left( \frac{\hat{\sigma}_C}{\hat{\sigma}_O} \right)^2 \right)^{-1}

We then employ an exponential-average technique to obtain a filtered version of this quantity. The average value of the closing interval over the considered data is f̄ = 0.015 for the S&P 500 and f̄ = 0.21 for the BBVA SQ Equity. In the following, we use different estimators in order to extract the signal f_t. The trivial one uses f_t itself as the prediction of the opening jump, denoted f̂_t; we then construct the usual ones, namely the moving average f̂_ma, the exponential moving average f̂_exp and the cumulated average f̂_c. In Figure 2.15, we show the results corresponding to the different filtered values of f on the


Figure 2.10: Comparison between the different probability density functions

Figure 2.11: Comparison between the different cumulative distribution functions

Figure 2.12: Volatility estimators on the S&P 500 index

Figure 2.13: Volatility estimators on BHI UN Equity

Figure 2.14: Estimation of the closing interval for the S&P 500 index

Figure 2.15: Estimation of the closing interval for BHI UN Equity

BHI UN Equity data. Figure 2.13 shows that the family of high-low estimators gives a better result than the classical close-to-close estimator. In order to check the quality of these estimators for the prediction of the volatility, we compute the value of the likelihood function corresponding to each estimator. Assuming that the observed signal follows a Gaussian distribution, the likelihood function is defined as:

l(\sigma) = -\frac{n}{2} \ln 2\pi - \frac{1}{2} \sum_{i=1}^{n} \ln \sigma_i^2 - \frac{1}{2} \sum_{i=1}^{n} \left( \frac{R_{i+1}}{\sigma_i} \right)^2

where R_{i+1} is the future realized return. In Figures 2.16 and 2.17, we present the value of the likelihood function for the different estimators. This function reaches its maximal value for the Rogers-Satchell estimator.

Figure 2.16: Likelihood function for the various estimators on the S&P 500
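As an illustration, the likelihood criterion above can be evaluated with the following sketch (our own code, not part of the thesis); we assume that σ̂_i is an annualized forecast, so it is converted into a daily variance with T = 1/252 before being compared with the next realized return.

```python
import numpy as np

def predictive_log_likelihood(future_returns, sigma_hat, T=1.0 / 252):
    """Gaussian log-likelihood l(sigma) of the next-day returns R_{i+1} under the
    volatility forecasts sigma_hat_i (arrays aligned so that sigma_hat_i predicts R_{i+1})."""
    r = np.asarray(future_returns, float)
    var = (np.asarray(sigma_hat, float) ** 2) * T     # daily variance implied by the annualised forecast
    n = r.size
    return (-0.5 * n * np.log(2 * np.pi)
            - 0.5 * np.sum(np.log(var))
            - 0.5 * np.sum(r ** 2 / var))
```

Comparing this value across estimators reproduces the ranking shown in Figures 2.16 and 2.17.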


Backtest on the voltarget strategy

We now backtest the efficiency of the various volatility estimators with a vol-target strategy on the S&P 500 index and on an individual stock. Within the vol-target strategy, the exposure to the risky asset is determined by the following expression:

\alpha_t = \frac{\sigma^\star}{\hat{\sigma}_t}

where σ* is the target volatility of the strategy and σ̂_t is the volatility forecast given by the estimators above (a sketch of the strategy wealth computation is given after the list below). In the backtest, we take the annualized target volatility σ* = 15% and use historical data from 01/01/2001 to 31/12/2011. We present the results for two cases:

Figure 2.17: Likelihood function for the various estimators on BHI UN Equity

• Backtest on the S&P 500 index with a moving average over 1 month (n = 21) of historical data. We remark that in this case the volatility of the index is low, so the error of the volatility estimation has less impact. However, the high-low estimators suffer from the discretization effect and therefore underestimate the volatility. For the index this effect is more important, so the close-to-close estimator gives the best performance.

• Backtest on a single asset with a moving average over 1 month (n = 21) of historical data. For a particular asset such as the BBVA SQ Equity, the volatility is high, so the error due to the efficiency of the volatility estimators matters. The high-low estimators now give better results than the classical one.
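The voltarget wealth computation used in these backtests can be sketched as follows (our own minimal implementation, ignoring transaction costs and the cash leg): the exposure α_t = σ*/σ̂_t decided at date t is applied to the next daily return.

```python
import numpy as np

def voltarget_backtest(prices, sigma_hat, sigma_star=0.15):
    """Wealth of the vol-target strategy, rebalanced daily.
    `sigma_hat` must contain one annualised volatility forecast per price date."""
    prices = np.asarray(prices, dtype=float)
    sigma_hat = np.asarray(sigma_hat, dtype=float)
    returns = prices[1:] / prices[:-1] - 1.0
    alpha = sigma_star / sigma_hat[:-1]            # exposure decided with information up to t
    wealth = np.r_[1.0, np.cumprod(1.0 + alpha * returns)]
    return wealth
```

The Sharpe-ratio comparison of Tables 2.2 and 2.3 is then obtained by running this wealth computation once with σ̂_CC and once with a high-low estimator.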

In order to illustrate the efficiency of the range-based estimators, we rank the high-low estimators against the benchmark close-to-close estimator. We apply the voltarget strategy with the close-to-close estimator σ̂²_CC and with a high-low estimator σ̂²_HL. We then compare the Sharpe ratios obtained by these two estimators and compute the frequency with which the high-low estimator gives the better performance over the ensemble of stocks. The results over the S&P 500 index and its first 100 components are summarized in Tables 2.2 and 2.3.


Figure 2.18: Backtest of the voltarget strategy on the S&P 500 index

Figure 2.19: Backtest of the voltarget strategy on BHI UN Equity

Table 2.2: Performance of σ̂²_HL versus σ̂²_CC for different averaging windows

Estimator   6 months   3 months   2 months   1 month
σ̂²_P        56.2%      52.8%      60.7%      65.2%
σ̂²_GK       52.8%      49.4%      60.7%      64.0%
σ̂²_RS       52.8%      51.7%      60.7%      64.0%
σ̂²_YZ       57.3%      53.9%      56.2%      64.0%

Table 2.3: Performance of σ̂²_HL versus σ̂²_CC for different filters of f

Estimator   f̂_c      f̂_ma     f̂_exp    f̂_t
σ̂²_P        65.2%    64.0%    64.0%    64.0%
σ̂²_GK       64.0%    61.8%    61.8%    61.8%
σ̂²_RS       64.0%    61.8%    60.7%    60.7%
σ̂²_YZ       64.0%    64.0%    64.0%    64.0%

2.3 Estimation of realized volatility

The common way to estimate the realized volatility is to estimate the expectation value of the variance over an observation window and then to compute the corresponding volatility. However, in doing so we face a dilemma: taking a long historical window helps to decrease the estimation error, as discussed in the last paragraph, whereas taking a short history gives an estimate of the volatility closer to the present volatility. In order to overcome this dilemma, we need some idea of the dynamics of the variance σ_t² that we would like to measure. Combining this knowledge of the dynamics of σ_t² with the error committed over a long historical window, we can find an optimal window for the volatility estimator. We assume that the variance follows the simplified dynamics already used in the last numerical simulation:

\begin{cases} dS_t = \mu_t S_t \, dt + \sigma_t S_t \, dB_t \\ d\sigma_t^2 = \xi\, \sigma_t^2 \, dB_t^\sigma \end{cases}

in which B_t^σ is a Brownian motion independent of the one driving the asset price.

2.3.1 Moving-average estimator

In this section, we show how the optimal window of the moving-average estimator is obtained via a simple example. Let us consider the canonical estimator:

\hat{\sigma}^2 = \frac{1}{nT} \sum_{i=1}^{n} R_{t_i}^2

Here, the time increment is chosen to be constant, t_i − t_{i−1} = T, so the variance of this estimator at time t_n is:

\mathrm{var}(\hat{\sigma}^2) = \frac{2 \sigma_{t_n}^4\, T}{t_n - t_0} = \frac{2 \sigma_{t_n}^4}{n}

On the other hand, σ_t² is itself a stochastic process, hence its conditional variance given σ_{t_n}² gives us the error due to the use of historical observations. We rewrite:

\frac{1}{t_n - t_0} \int_{t_0}^{t_n} \sigma_t^2 \, dt = \sigma_{t_n}^2 - \frac{1}{t_n - t_0} \int_{t_0}^{t_n} (t - t_0)\, \sigma_t^2\, \xi \, dB_t^\sigma

so that the error due to the stochastic volatility is given by:

\mathrm{var}\left( \frac{1}{t_n - t_0} \int_{t_0}^{t_n} \sigma_t^2 \, dt \,\Big|\, \sigma_{t_n} \right) = \frac{t_n - t_0}{3}\, \sigma_{t_n}^4\, \xi^2 \approx \frac{nT\, \sigma_{t_n}^4\, \xi^2}{3}

The total error of the canonical estimator is simply the sum of these two errors, since the two Brownian motions are assumed to be independent. We define the total estimation error function as follows:

e(\hat{\sigma}^2) = \frac{2 \sigma_{t_n}^4}{n} + \frac{nT\, \sigma_{t_n}^4\, \xi^2}{3}

In order to obtain the optimal window for the volatility estimation, we minimize the error function e(σ̂²) with respect to nT, which leads to the following equation:

-\frac{2 \sigma_{t_n}^4}{n^2 T} + \frac{\sigma_{t_n}^4\, \xi^2}{3} = 0

This equation provides a very simple solution, nT = \sqrt{6T}/\xi, with the optimal error e(\hat{\sigma}_{\mathrm{opt}}^2) \approx 2\sqrt{2T/3}\, \sigma_{t_n}^4\, \xi. The major difficulty with this estimator is the calibration of the parameter ξ, which is not trivial because σ_t² is an unobservable process. Different techniques can be considered, such as the maximum likelihood approach discussed later.

2.3.2 IGARCH estimator

We now discuss another approach for estimating the realized volatility, based on the IGARCH model. The detailed theoretical derivation of the method is given in Drost and Nijman (1993) and Drost and Werker (1999). It consists of a volatility estimator of the form:

\hat{\sigma}_t^2 = \beta\, \hat{\sigma}_{t-T}^2 + \frac{1 - \beta}{T}\, R_t^2

where T is a constant estimation increment. By iterating the recurrence relation above, we obtain the IGARCH variance estimator as a function of the past observed returns:

\hat{\sigma}_t^2 = \frac{1 - \beta}{T} \sum_{i=1}^{n} \beta^i R_{t-iT}^2 + \beta^n\, \hat{\sigma}_{t-nT}^2 \qquad (2.6)
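A minimal implementation of the recursion above is sketched below (our own code, with an arbitrary initialization that is forgotten at the rate β^n); it returns the annualized variance path.

```python
import numpy as np

def igarch_variance(returns, beta=0.94, T=1.0 / 252, var0=None):
    """IGARCH / exponentially weighted estimator:
    sigma^2_t = beta * sigma^2_{t-T} + (1 - beta) * R_t^2 / T."""
    r = np.asarray(returns, dtype=float)
    var = np.empty(r.size)
    v = (r[0] ** 2) / T if var0 is None else var0      # crude initialisation
    for i, ri in enumerate(r):
        v = beta * v + (1.0 - beta) * ri ** 2 / T
        var[i] = v
    return var   # annualised variance; the volatility is np.sqrt(var)
```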


We remark that the contribution of the last term tends to 0 as n tends to infinity. This estimator again has the form of a weighted average, so an approach similar to the one used for the canonical estimator is applicable. Assuming that the volatility follows the lognormal dynamics described by the model (2.5), the optimal value of β is given by:

\beta^\star = \frac{\xi \sqrt{8T - \xi^2 T^2} - 4}{\xi^2 T - 4} \qquad (2.7)

We encounter here the same question as in the canonical case, namely how to calibrate the parameter ξ of the lognormal dynamics. In practice, we proceed the other way around: we first seek the optimal value β* of the IGARCH estimator and then use the inverse relation of Equation (2.7) to determine the value of ξ:

\xi = \sqrt{\frac{4\,(1 - \beta^\star)^2}{T\,(1 + \beta^{\star 2})}}

Remark 1 Finally, as emphasized at the beginning of this discussion, we would like to point out that the IGARCH estimator can be viewed as an exponentially weighted average. Consider an IGARCH estimator with a constant time increment. The expectation value of this estimator is:

E[\hat{\sigma}_t^2 \,|\, \sigma] = E\left[ \frac{1 - \beta}{T} \sum_{i=1}^{+\infty} \beta^i R_{t-iT}^2 \,\Big|\, \sigma \right]
= \frac{1 - \beta}{T} \sum_{i=1}^{+\infty} \beta^i \int_{t-iT}^{t-iT+T} \sigma_u^2 \, du
= \frac{1}{\sum_{i=1}^{+\infty} \beta^i} \sum_{i=1}^{+\infty} \beta^i\, \frac{1}{T} \int_{t-iT}^{t-iT+T} \sigma_u^2 \, du
= \frac{1}{\sum_{i=1}^{+\infty} T e^{-iT\lambda}} \sum_{i=1}^{+\infty} e^{-iT\lambda} \int_{t-iT}^{t-iT+T} \sigma_u^2 \, du

with λ = − ln β / T. In this form, we conclude that the IGARCH estimator is a weighted average of the variance σ_t² with an exponential weight distribution. The annualized estimator of the variance can be written as:

E[\hat{\sigma}_t^2 \,|\, \sigma] = \frac{\sum_{i=1}^{+\infty} e^{-iT\lambda} \int_{t-iT}^{t-iT+T} \sigma_u^2 \, du}{\sum_{i=1}^{+\infty} T e^{-iT\lambda}}

This expression admits a continuous limit when T → 0.


2.3.3 Extension to range-based estimators

The estimation of the optimal window in the last discussion can also be generalized to the case of range-based estimators. The main idea is to find the trade-off between the estimator error (the variance of the estimator) and the volatility dynamics described by the model (2.5). The equation that determines the total error of the estimator is given by:

e(\hat{\sigma}^2) = \mathrm{var}(\hat{\sigma}^2) + \frac{nT}{3}\, \sigma_{t_n}^4\, \xi^2

We recall that the first term in this expression is the estimator error coming from the discrete sum, whereas the second term is the error due to the stochastic volatility. The first term is already given by the study of the various estimators in the last section, while the second term depends on the choice of the volatility dynamics. Using the notation of the estimator efficiency, we rewrite the above expression as:

e(\hat{\sigma}^2) = \frac{1}{\mathrm{eff}(\hat{\sigma}^2)}\, \frac{2 \sigma_{t_n}^4}{n} + \frac{nT}{3}\, \sigma_{t_n}^4\, \xi^2

The minimization of the total error is exactly the same as in the previous example with the canonical estimator, and we obtain the following optimal averaging window:

nT = \sqrt{\frac{6T}{\mathrm{eff}(\hat{\sigma}^2)\, \xi^2}} \qquad (2.8)

The IGARCH estimator can also be applied to the various types of high-low estimators: the extension consists of performing an exponential moving average instead of the simple average. The parameter β of the exponential moving average is again determined by the maximum likelihood method, as shown in the discussion below.

2.3.4 Calibration procedure of the estimators of realized volatility

As discussed above, the estimators of the realized volatility depend on the choice of the underlying dynamics of the volatility. In order to obtain the best estimation of the realized volatility, we must estimate the parameter which characterizes this dynamics. Two possible approaches to obtain the optimal value of these estimators are the following (a calibration sketch is given after the list):

• using a least squares criterion, which consists in minimizing the objective function:

\sum_{i=1}^{n} \left( R_{t_i+T}^2 - T\, \hat{\sigma}_{t_i}^2 \right)^2

• or using the maximum likelihood criterion, which consists in maximizing the log-likelihood objective function:

-\frac{n}{2} \ln 2\pi - \frac{1}{2} \sum_{i=0}^{n} \ln \left( T\, \hat{\sigma}_{t_i}^2 \right) - \sum_{i=0}^{n} \frac{R_{t_i+T}^2}{2T\, \hat{\sigma}_{t_i}^2}
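The calibration sketch announced above (our own illustration, not the thesis code) scans a grid of β values and keeps the one maximizing the one-step-ahead Gaussian log-likelihood, which corresponds to the second criterion; the grid bounds and the initialization are arbitrary choices.

```python
import numpy as np

def igarch_log_likelihood(returns, beta, T=1.0 / 252):
    """One-step-ahead Gaussian log-likelihood of R_{t+T} under the IGARCH variance forecast."""
    r = np.asarray(returns, dtype=float)
    v = r[0] ** 2 / T                                   # crude initialisation
    ll = 0.0
    for i in range(r.size - 1):
        v = beta * v + (1.0 - beta) * r[i] ** 2 / T     # sigma^2_t, built from returns up to t
        daily_var = T * v
        ll += -0.5 * np.log(2 * np.pi) - 0.5 * np.log(daily_var) - 0.5 * r[i + 1] ** 2 / daily_var
    return ll

def calibrate_beta(returns, betas=np.linspace(0.80, 0.99, 40)):
    """Grid search of the beta maximising the predictive log-likelihood."""
    lls = [igarch_log_likelihood(returns, b) for b in betas]
    return betas[int(np.argmax(lls))], max(lls)
```

The same scan applied to the averaging window n plays the role of the moving-average calibration, and Equation (2.7) converts the optimal β into the implied ξ.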


We remark here that the moving-average estimator depends only on the averaging window, whereas the IGARCH estimator depends only on the parameter β. In general, there is no way to compare these two estimators if we do not specify a dynamics. With a specific dynamics, the optimal values of both parameters are obtained from the optimal value of ξ, which offers a direct comparison of the quality of these two estimators.

Example of realized volatility

We illustrate here how the realized volatility is computed by the two methods discussed above. In order to show how the optimal value of the averaging window nT or of β* is calibrated, we plot the likelihood functions of these two estimators for one value of the volatility at a given date. In Figure 2.20, we present the logarithm of the likelihood function for different values of ξ. The maximal value of the function l(ξ) gives us the optimal value ξ*, which is then used to evaluate the volatility for the two methods. We remark that the IGARCH estimator is better suited to finding the global maximum because its log-likelihood is a concave function. For the moving-average method, the log-likelihood function is not smooth and presents a complicated structure with local maxima, which is less convenient for the optimization procedure.

Figure 2.20: Comparison between the IGARCH estimator and the CC estimator


We now test the implementation of the IGARCH estimator for the various high-low estimators. As we have demonstrated that the IGARCH estimator is equivalent to an


exponential moving average, the implementation for the high-low estimators can be set up in the same way as for the close-to-close estimator. In order to determine the optimal parameter β*, we perform an optimization of the log-likelihood function. In Figure 2.21, we compare the log-likelihood functions of the different estimators as a function of the parameter β. The optimal parameter β* of each estimator corresponds to the maximum of its log-likelihood function. In order to get a clear idea of the moving-average window size corresponding to the optimal parameter β*, we use formula (2.7) to perform the conversion. The result is reported in Figure 2.22 below.

Figure 2.21: Likelihood function of the high-low estimators versus the filter parameter β

Backtest on the voltarget strategy

We take historical data of the S&P 500 index over the period from 01/2001 to 12/2011, and the averaging window of the close-to-close estimator is chosen as n = 25. In Figure 2.23, we show the different estimations of the realized volatility. In order to test the efficiency of these realized-volatility estimators (moving average and IGARCH), we first evaluate the likelihood function for the close-to-close estimator and the realized estimators, and then apply these estimators to the voltarget strategy as in the previous section. In Figure 2.25, we present the value of the likelihood function over the period from 01/2001 to 12/2010 for three estimators: CC, CC optimal (moving average) and IGARCH. The estimator with the highest likelihood value is the one that gives the best prediction of the volatility.


Figure 2.22: Likelihood function of the high-low estimators versus the effective moving window

Figure 2.23: IGARCH estimator versus moving-average estimator for close-to-close prices


Figure 2.24: Comparison between the different IGARCH estimators for high-low prices

Figure 2.25: Daily estimation of the likelihood function for the various close-to-close estimators

Figure 2.26: Daily estimation of the likelihood function for the various high-low estimators

In Figure 2.27, the backtest of the voltarget strategy is performed for the three considered estimators. The estimators with a dynamical choice of the averaging parameter always give better results than a simple close-to-close estimator with a fixed averaging window n = 25. We next backtest the IGARCH estimator applied to the high-low price data; the comparison with the IGARCH estimator applied to close-to-close data is shown in Figure 2.28. We observe that the IGARCH estimator for close-to-close prices is one of the estimators producing the best backtest.

2.4 High-frequency volatility estimators

We have discussed in the previous sections how to measure the daily volatility based on the range of the observed prices. If more information is available in the trading data, such as the full real-time quotation, can one estimate the volatility more accurately? As the trading frequency increases, we expect the precision of the estimator to improve as well. However, when the trading frequency reaches a certain limit, a new phenomenon coming from the non-equilibrium of the market emerges and spoils the precision. This limit defines the optimal frequency for the classical estimator; in the literature, it is more or less agreed to be around one trade every 5 minutes. This phenomenon is called microstructure noise and is characterized by the bid-ask spread and transaction effects. In this section, we summarize and test some recent proposals which attempt to eliminate the microstructure noise.


Figure 2.27: Backtest for the close-to-close estimator and the realized estimators

Figure 2.28: Backtest for the IGARCH high-low estimators compared to the IGARCH close-to-close estimator


2.4.1 Microstructure effect

It has been demonstrated in the financial literature that the estimator based on realized returns is not robust when the sampling frequency is too high. Two possible explanations of this effect are the following. From the probabilistic point of view, this phenomenon comes from the fact that the cumulated return (or the logarithm of the price) is not a semimartingale, as we assumed in the last section; however, this appears only at short time scales, when the trading frequency is high enough. From the financial point of view, this effect is explained by the existence of so-called market microstructure noise, which comes from the existence of the bid-ask spread. We now discuss the simplest model, which adds the microstructure noise as a perturbation independent of the underlying Brownian motion. We assume that the true cumulated return is an unobservable process which follows a Brownian motion:

dX_t = \left( \mu_t - \frac{\sigma_t^2}{2} \right) dt + \sigma_t \, dB_t

The observed signal Y_t is the cumulated return perturbed by the microstructure noise ε_t:

Y_t = X_t + \varepsilon_t

For the sake of simplicity, we use the following assumptions:

(i) the ε_{t_i} are iid with E[ε_{t_i}] = 0 and E[ε_{t_i}²] = E[ε²];
(ii) ε_t ⊥ B_t.

From these assumptions, we see immediately that the volatility estimator based on the historical data Y_{t_i} is biased:

\mathrm{var}(Y) = \mathrm{var}(X) + E[\varepsilon^2]

The first term, var(X), scales with t (the estimation horizon), while E[ε²] is constant, so this estimator can be considered as unbiased if the time horizon is large enough (t > E[ε²]/σ²). At high frequency, the second term is not negligible, and a better estimator must be able to eliminate this term.
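The bias described above is easy to reproduce numerically. The following sketch (our own toy example, not part of the thesis) simulates one day of the efficient log-price and its noisy observation, and compares the realized variance at several sampling frequencies with the true integrated variance; the noise level and the number of observations are arbitrary assumptions.

```python
import numpy as np

def noisy_realized_variance(sigma=0.20, noise_std=5e-4, T=1.0 / 252, M=390, seed=0):
    """Simulate X and Y = X + eps over one day and compute the realized variance
    of Y at several sampling steps (in 'minutes' of the simulated grid)."""
    rng = np.random.default_rng(seed)
    dt = T / M
    x = np.cumsum(sigma * np.sqrt(dt) * rng.standard_normal(M + 1))
    y = x + noise_std * rng.standard_normal(M + 1)     # i.i.d. microstructure noise
    out = {}
    for step in (1, 5, 15, 30):
        out[f"every {step} obs"] = np.sum(np.diff(y[::step]) ** 2)
    out["true integrated variance"] = sigma ** 2 * T
    return out
```

At the highest frequency the realized variance is dominated by the 2·M·E[ε²] bias, while sparser sampling brings it back towards the true integrated variance, at the cost of a larger statistical error.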

2.4.2 Two time-scale volatility estimator

Using different time scales to extract the true volatility of the hidden price process (without noise) was independently proposed by Zhang et al. (2005) and Bandi and Russell (2006). In this paragraph, we follow the approach of the first reference to define the intraday volatility estimator. We prefer to discuss the main idea of this method and its practical implementation rather than all the details of the stochastic calculus concerning the expectation value and the variance of the realized variance (3).

(3) Details of the derivation of this technique can be found in Zhang et al. (2005).


Definitions and notations

In order to fix the notations, let us consider a time period [0, T] which is divided into M − 1 intervals (M can be understood as the frequency). The quadratic variation of the Brownian motion over this period is denoted:

\langle X, X \rangle_T = \int_0^T \sigma_t^2 \, dt

For the discretized version of the quadratic variation, we employ the [·,·] notation:

[X, X]_T = \sum_{t_i, t_{i+1} \in [0,T]} \left( X_{t_{i+1}} - X_{t_i} \right)^2

The usual realized-variance estimator over the interval [0, T] is then given by:

[Y, Y]_T = \sum_{t_i, t_{i+1} \in [0,T]} \left( Y_{t_{i+1}} - Y_{t_i} \right)^2

We remark that the number of points in the interval [0, T] can be changed; in fact, the expectation value of the quadratic variation should not depend on the distribution of the points in this interval. Let us define the ensemble of points in one period as a grid G:

G = \{ t_0, \ldots, t_M \}

A subgrid H is then defined as:

H = \{ t_{k_1}, \ldots, t_{k_m} \}

where (t_{k_j}), j = 1, \ldots, m, is a subsequence of (t_i), i = 1, \ldots, M. The number of increments is denoted by:

|H| = \mathrm{card}(H) - 1

With these notations, the quadratic variation over a subgrid H reads:

[Y, Y]_T^H = \sum_{t_{k_i}, t_{k_{i+1}} \in H} \left( Y_{t_{k_{i+1}}} - Y_{t_{k_i}} \right)^2

The realized volatility estimator over the full grid

If we compute the quadratic variation over the full grid G, that is, at the highest frequency, it is not surprising that it suffers most from the microstructure noise:

[Y, Y]_T^G = [X, X]_T^G + 2\,[X, \varepsilon]_T^G + [\varepsilon, \varepsilon]_T^G

Under the hypothesis on the microstructure noise, the conditional expectation value of this estimator is equal to:

E\left[ [Y, Y]_T^G \,|\, X \right] = [X, X]_T^G + 2M\, E[\varepsilon^2]


and the variance of the estimator:

\mathrm{var}\left( [Y, Y]_T^G \,|\, X \right) = 4M\, E[\varepsilon^4] + 8\,[X, X]_T^G\, E[\varepsilon^2] - 2\,\mathrm{var}(\varepsilon^2) + O(n^{-1/2})

In the two expressions above, the sums are arranged order by order. In the limit M → ∞, we obtain the usual central limit theorem result:

M^{-1/2} \left( [Y, Y]_T^G - 2M\, E[\varepsilon^2] \right) \xrightarrow{L} 2 \left( E[\varepsilon^4] \right)^{1/2} \mathcal{N}(0, 1)

Hence, as M increases, [Y, Y]_T^G becomes a good estimator of the microstructure noise and we denote:

\widehat{E[\varepsilon^2]} = \frac{1}{2M}\, [Y, Y]_T^G

The central limit theorem for this estimator states:

M^{1/2} \left( \widehat{E[\varepsilon^2]} - E[\varepsilon^2] \right) \xrightarrow{L} \left( E[\varepsilon^4] \right)^{1/2} \mathcal{N}(0, 1) \quad \text{as } M \to \infty

The realized volatility estimator over a subgrid

As mentioned in the last discussion, increasing the frequency spoils the estimation of the volatility due to the presence of the microstructure noise. A naive solution is to reduce the number of points in the grid, or to consider only a subgrid and then average over a chosen number of subgrids. Let us consider a subgrid H with |H| = m − 1; the same result as for the full grid is obtained by replacing M with m:

E\left[ [Y, Y]_T^H \,|\, X \right] = [X, X]_T^H + 2m\, E[\varepsilon^2]

Let us now consider a sequence of subgrids H^{(k)}, k = 1, \ldots, K, which satisfies G = \bigcup_{k=1}^{K} H^{(k)} and H^{(k)} \cap H^{(l)} = \emptyset for k \neq l. By averaging over these K subgrids, we define:

[Y, Y]_T^{\mathrm{avg}} = \frac{1}{K} \sum_{k=1}^{K} [Y, Y]_T^{H^{(k)}}

We define the average length of the subgrids as \bar{m} = \frac{1}{K} \sum_{k=1}^{K} m_k, so that the final expression is:

E\left[ [Y, Y]_T^{\mathrm{avg}} \,|\, X \right] = [X, X]_T^{\mathrm{avg}} + 2\bar{m}\, E[\varepsilon^2]

This estimator of the volatility is still biased, and its precision depends strongly on the choice of the subgrid length and of the number of subgrids. In the paper of Zhang et al. (2005), the authors demonstrate that there exists an optimal value K* for which the best performance of the estimator is reached.


Two time-scale estimator

Since the full-grid estimator and the subgrid-averaging estimator both contain the same microstructure-noise component up to a factor, we can combine both estimators to obtain a new one in which the microstructure noise is completely eliminated. Let us consider the following estimator:

\hat{\sigma}_{ts}^2 = \left( 1 - \frac{\bar{m}}{M} \right)^{-1} \left( [Y, Y]_T^{\mathrm{avg}} - \frac{\bar{m}}{M}\, [Y, Y]_T^G \right)

This estimator is now unbiased, and its precision is determined by the choice of K and m̄. In the theoretical framework, the optimal value is given as a function of the noise variance and of the fourth moment of the volatility. In practice, we scan over the number of subgrids of size m ∝ M/K in order to look for the optimal estimator.
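A compact implementation of the two time-scale estimator for a given number of subgrids K is sketched below (our own code, not the thesis implementation): it averages the realized variance over K regular subgrids and removes the noise bias estimated on the full grid, following the formula above.

```python
import numpy as np

def realized_variance(y):
    """[Y, Y] over a grid of observed log-prices y."""
    y = np.asarray(y, dtype=float)
    return np.sum(np.diff(y) ** 2)

def two_scale_variance(y, K):
    """Two time-scale estimator of the integrated variance for K regular subgrids."""
    y = np.asarray(y, dtype=float)
    M = y.size - 1                                       # increments on the full grid
    rv_full = realized_variance(y)
    rv_sub = [realized_variance(y[k::K]) for k in range(K)]
    m_bar = np.mean([len(y[k::K]) - 1 for k in range(K)])
    rv_avg = np.mean(rv_sub)
    return (rv_avg - (m_bar / M) * rv_full) / (1.0 - m_bar / M)
```

Scanning K and keeping the value suggested by the formula for K* below gives the practical version of the estimator.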

2.4.3 Numerical implementation and backtesting

We now backtest the proposed technique on the S&P 500 index with the following choice of subgrids. The full grid is defined by the set of one-minute data from the opening to the close of the trading day (9:00 to 17:30). The data cover the period from 1 February 2011 to 6 June 2011. We denote the full grid for each trading-day period by:

G = \{ t_0, \ldots, t_M \}

and the subgrids are chosen as follows:

H^{(k)} = \{ t_{k-1}, t_{k-1+K}, \ldots, t_{k-1+n_k K} \}

where the index k = 1, \ldots, K and n_k is the integer making t_{k-1+n_k K} the last element of H^{(k)}. As we cannot compute exactly the optimal value K* for each trading period, we employ an iterative scheme which tends to converge to the optimal value. An analytical expression of K* is given by Zhang et al. (2005):

K^\star = \left( \frac{12 \left( E[\varepsilon^2] \right)^2}{T\, E[\eta^2]} \right)^{1/3} M^{2/3}

where η is given by the expression:

\eta^2 = \int_0^T \sigma_t^4 \, dt

In a first approximation, we consider the case where the intraday volatility is constant, so that the expression of η simplifies to η² = Tσ⁴. In Figure 2.29, we present the result for the intraday volatility, computed over the trading day only, for the S&P 500 index under the assumption of constant volatility. The two time-scale estimator reduces the effect of the microstructure noise on the realized volatility computed over the full grid.

Figure 2.29: Two time-scale estimator of the intraday volatility

2.5 Conclusion

Voltarget strategies are an efficient way to control risk when building trading strategies. Hence, a good estimator of the volatility is essential from this perspective. In this paper, we show that the data range can be used to improve the forecast of the market volatility. The use of high and low prices is less important for the index, as it gives more or less the same result as the traditional close-to-close estimator. However, for an individual stock with a higher volatility level, the high-low estimators improve the prediction of the volatility. We consider several backtests on the S&P 500 index and obtain results that compete with the traditional moving-average estimator of the volatility. We also consider a simple stochastic volatility model which permits integrating the dynamics of the volatility into the estimator. An optimization scheme via the maximum likelihood algorithm allows us to obtain the optimal averaging window dynamically. We then compare these results for the range-based estimators with the well-known IGARCH model. The comparison between the optimal values of the likelihood functions of the various estimators also gives us a ranking of the estimation errors. Finally, we studied high-frequency volatility estimators, a very active topic in financial mathematics. Using the simple model proposed by Zhang et al. (2005), we show that the microstructure noise can be eliminated by the two time-scale estimator.


Bibliography

[1] Bandi F. M. and Russell J. R. (2006), Separating Microstructure Noise from Volatility, Journal of Financial Economics, 79, pp. 655-692.
[2] Drost F. C. and Nijman T. E. (1993), Temporal Aggregation of GARCH Processes, Econometrica, 61, pp. 909-927.
[3] Drost F. C. and Werker B. J. M. (1999), Closing the GARCH Gap: Continuous Time GARCH Modeling, Journal of Econometrics, 74, pp. 31-57.
[4] Feller W. (1951), The Asymptotic Distribution of the Range of Sums of Independent Random Variables, Annals of Mathematical Statistics, 22, pp. 427-432.
[5] Garman M. B. and Klass M. J. (1980), On the Estimation of Security Price Volatilities from Historical Data, Journal of Business, 53, pp. 67-78.
[6] Kunitomo N. (1992), Improving the Parkinson Method of Estimating Security Price Volatilities, Journal of Business, 65, pp. 295-302.
[7] Parkinson M. (1980), The Extreme Value Method for Estimating the Variance of the Rate of Return, Journal of Business, 53, pp. 61-65.
[8] Rogers L. C. G. and Satchell S. E. (1991), Estimating Variance from High, Low and Closing Prices, Annals of Applied Probability, 1, pp. 504-512.
[9] Yang D. and Zhang Q. (2000), Drift-Independent Volatility Estimation Based on High, Low, Open and Close Prices, Journal of Business, 73, pp. 477-491.
[10] Zhang L., Mykland P. A. and Aït-Sahalia Y. (2005), A Tale of Two Time Scales: Determining Integrated Volatility with Noisy High-Frequency Data, Journal of the American Statistical Association, 100(472), pp. 1394-1411.


Chapter 3

Support Vector Machine in Finance

In this chapter, we review the well-known machine learning technique called the support vector machine (SVM). This technique can be employed in different contexts such as classification, regression or density estimation, according to Vapnik (1998). Within this paper, we first give an overview of this method and its various numerical implementations, and then bridge it to financial applications such as stock selection.

Keywords: Machine learning, statistical learning, support vector machine, regression, classification, stock selection.

3.1 Introduction

The support vector machine is an important part of statistical learning theory. It was first introduced in the early 1990s by Boser et al. (1992) and has led to important applications in various domains such as pattern recognition (for example handwritten digits or images) and bioinformatics. This technique can be employed in different contexts such as classification, regression or density estimation, according to Vapnik (1998). Recently, different applications in the financial field have been developed along two main directions. The first one employs SVM as a non-linear estimator in order to forecast the market tendency or volatility. In this context, SVM is used as a regression technique with a feasible extension to the non-linear case thanks to the kernel approach. The second direction consists of using SVM as a classification technique which aims to elaborate the stock selection of a trading strategy (for example a long/short strategy). In this paper, we review the support vector machine and its financial applications from both points of view. The literature in this recent field is quite diversified, with many approaches and different techniques. We first give an overview of the SVM from its basic construction to its extensions, including the multi-classification problem. We then present different numerical implementations and bridge them to financial applications.


This paper is organized as follows. In Section 2, we recall the framework of support vector machine theory, following the approach proposed in Chapelle (2002). We then work out various implementations of this technique, from both the primal and the dual problems, in Section 3. The extension of SVM to multi-classification is discussed in Section 4. We finish with the introduction of SVM in the financial domain via an example of stock selection in Sections 5 and 6.

3.2 Support vector machine at a glance

In this section, we attempt to give an overview of the support vector machine method. In order to introduce the basic idea of SVM, we start with a first discussion of the classification method via the concepts of hard margin and soft margin classification. As the work pioneered by Vapnik and Chervonenkis (1971) established a framework for statistical learning theory, the so-called "VC theory", we give a brief introduction with the basic notation and the important Vapnik-Chervonenkis theorem for the Empirical Risk Minimization (ERM) principle. The extension of ERM to Vicinal Risk Minimization (VRM) will also be discussed.

3.2.1 Basic ideas of SVM

We illustrate here the basic ideas of SVM as a classification method. The main advantage of SVM is that it can not only be described very intuitively in the context of linear classification, but also be extended in an intelligent way to the non-linear case. Let us define the training data set as consisting of pairs of "input/output" points (x_i, y_i), with 1 ≤ i ≤ n. Here the input vector x_i belongs to some space X whereas the output y_i belongs to {−1, 1} in the case of bi-classification. The output y_i is used to identify the two possible classes.

Hard margin classification

The simplest idea of linear classification is to look at the whole set of inputs {x_i} ⊂ X and search for a hyperplane which can separate the data into two classes according to the labels y_i = ±1. It consists of constructing a linear discriminant function of the form:

h(x) = w^T x + b

where the vector w is the weight vector and b is called the bias. The hyperplane is defined by the following equation:

H = \{ x : h(x) = w^T x + b = 0 \}

This hyperplane divides the space X into two regions: the region where the discriminant function takes positive values and the region where it takes negative values. The hyperplane is also called the decision boundary. The term linear classification comes from the fact that this boundary depends on the data in a linear way.


Figure 3.1: Geometric interpretation of the margin in a linear SVM.

We now define the notion of margin. In Figure 3.1 (reprinted from Ben-Hur et al., 2010), we give a geometric interpretation of the margin in a linear SVM. Let x_+ and x_− be the closest points to the hyperplane on the positive and negative sides. The circled data points are the support vectors, i.e. the points closest to the decision boundary (see Figure 3.1). The vector w is the normal vector to the hyperplane; we denote its norm by ‖w‖ = \sqrt{w^T w} and its direction by ŵ = w/‖w‖. We assume that x_+ and x_− are equidistant from the decision boundary. They determine the margin with which the two classes of points of the data set D are separated:

m_D(h) = \frac{1}{2}\, \hat{w}^T (x_+ - x_-)

Geometrically, this margin is just half of the distance between the two closest points on either side of the hyperplane H, projected onto the direction ŵ. We use the equations that define the relative positions of these points with respect to the hyperplane H:

h(x_+) = w^T x_+ + b = a
h(x_-) = w^T x_- + b = -a

where a > 0 is some constant. As the normal vector w and the bias b are determined only up to a scale factor, we can simply divide them by a and renormalize these equations. This is equivalent to setting a = 1 in the above expressions, and we finally get:

m_D(h) = \frac{1}{2}\, \hat{w}^T (x_+ - x_-) = \frac{1}{\|w\|}


The basic idea of the maximum margin classifier is to determine the hyperplane which maximizes the margin. For a separable dataset, we can define the hard margin SVM as the following optimization problem:

\min_{w, b} \; \frac{1}{2} \|w\|^2 \qquad (3.1)
\text{u.c.} \quad y_i (w^T x_i + b) \geq 1, \quad i = 1, \ldots, n

Here, y_i(w^T x_i + b) ≥ 1 is just a compact way to express the relative position of the two classes of data points with respect to the hyperplane H. In fact, we have w^T x_i + b ≥ 1 for the class y_i = 1 and w^T x_i + b ≤ −1 for the class y_i = −1. The historical approach to solving this quadratic program is to map the primal problem to the dual problem. We give here the main result, while the detailed derivation can be found in Appendix C.1. Via the KKT theorem, this approach gives us the following optimal solution (w*, b*):

w^\star = \sum_{i=1}^{n} \alpha_i^\star y_i x_i

where α* = (α_1*, ..., α_n*) is the solution of the dual optimization problem with dual variable α = (α_1, ..., α_n) of dimension n:

\max_{\alpha} \; \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j x_i^T x_j
\text{u.c.} \quad \alpha_i \geq 0, \quad i = 1, \ldots, n
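As an illustration of the hard margin problem, the following sketch (our own example, not part of the thesis) uses scikit-learn's SVC with a linear kernel and a very large C, which approximates the hard margin solution on a separable toy dataset; it then reads off w*, b*, the margin 1/‖w‖ and the support vectors.

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data: two Gaussian clouds labelled y = +1 / -1
rng = np.random.default_rng(0)
x_pos = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))
x_neg = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))
X = np.vstack([x_pos, x_neg])
y = np.r_[np.ones(50), -np.ones(50)]

# A very large C makes the soft-margin solution behave like the hard-margin classifier
clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w* =", w, "b* =", b)
print("margin 1/||w|| =", 1.0 / np.linalg.norm(w))
print("number of support vectors:", clf.support_vectors_.shape[0])
```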

We remark that the above optimization problem is a quadratic program in the vector space R^d with n linear inequality constraints. It may become meaningless if it has no solution (the dataset is inseparable) or too many solutions (instability of the decision boundary with respect to the data). The questions of the existence of a solution of Problem (3.1) and of the sensitivity of the solution to the dataset are very difficult; a quantitative characterization can be found in the next discussion on the Vapnik-Chervonenkis framework. We present here an intuitive view of this problem, which depends on two main factors. The first one is the dimension of the space of functions h(x) which determine the decision boundary. In the linear case, it is simply determined by the dimension of the couple (w, b). If the dimension of this function space is too small, as in the linear case, it is possible that no linear solution exists, i.e. the dataset cannot be separated by a simple linear classifier. The second factor is the number of data points, which enters the optimization program via the n inequality constraints. If the number of constraints is too large, the solution may not exist either. In order to overcome this problem, we must increase the dimension of the optimization problem. There exist two possible ways to do this. The first one consists of relaxing the inequality constraints by introducing additional variables which tolerate deviations from the strict separation: we allow the separation to be violated with a certain error (some data points on the wrong side). This technique was first introduced by

Support Vector Machine in Finance

Cortes C. and Vapnik V. (1995) under the name “Soft margin SVM”. The second one consists of using the non-linear classifier which directly extend the function space to higher dimension. The use of non-linear classifier can increase rapidly the dimension of the optimization problem which invokes a computation problem. An intelligent way to get over is employing the notion of kernel. In the next discussions, we will try to clarify these two approaches then finish this section by introducing two general frameworks of this learning theory. Soft margin classification  In fact, the inequality constrains described above yi wT xi + b > 1 ensure that all data points will be well classified with respect to the optimal hyperplane. As the data may be inseparable, an intuitive way to overcome is relaxing the strict constrains by introducing additional variables ξi with i = 1, . . . , n so-called slack variables. They allow to commit certain error in the classification via new constrains:  yi w T xi + b > 1 − ξi

i = 1...n

(3.2)

For $\xi_i > 1$, the data point $x_i$ is completely misclassified, whereas $0 \leq \xi_i \leq 1$ can be interpreted as a margin error. By this definition of the slack variables, $\sum_{i=1}^{n} \xi_i$ is directly related to the number of misclassified points. In order to fix our expected error in the classification problem, we introduce an additional term $C\sum_{i=1}^{n} \xi_i^p$ in the objective function and rewrite the optimization problem as follows:
$$\begin{aligned} \min_{w,b,\xi} \quad & \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i \\ \text{u.c.} \quad & y_i\left(w^T x_i + b\right) \geq 1 - \xi_i, \; \xi_i \geq 0, \quad i = 1,\ldots,n \end{aligned} \qquad (3.3)$$

Here, $C$ is the parameter used to fix our desired level of error and $p \geq 1$ is the usual way to ensure the convexity of the additional term¹. The soft-margin formulation of the SVM problem can be interpreted as a regularization technique, of the kind one finds in different optimization problems such as regression, filtering or matrix inversion. The same result will be recovered with a regularization argument later, when we discuss the possible use of kernels. Before switching to the next discussion on non-linear classification with the kernel approach, we remark that the soft margin SVM problem now has the higher dimension $d + 1 + n$. However, the computational cost is not increased. Thanks to the KKT theorem, we can turn this primal problem into a dual problem with simpler constraints. We can also work directly with the primal problem by performing a trivial optimization over $\xi$. The primal problem is then no longer a quadratic program; however, it can be solved by Newton optimization or conjugate gradient, as demonstrated in Chapelle O. (2007).

¹ It is equivalent to defining an $L_p$ norm on the slack vector $\xi \in \mathbb{R}^n$.

Non-linear classification, kernel approach

The second approach to improve the classification is to employ a non-linear SVM. In the context of SVM, we would like to stress that the construction of the non-linear discriminant function $h(x)$ consists of two steps. We first extend the data space $\mathcal{X}$ of dimension $d$ to a feature space $\mathcal{F}$ of higher dimension $N$ via a non-linear transformation $\phi : \mathcal{X} \to \mathcal{F}$, then a hyperplane is constructed in the feature space $\mathcal{F}$ as presented before:
$$h(x) = w^T \phi(x) + b$$
Here, the resulting vector $z = (z_1, \ldots, z_N) = \phi(x)$ is an $N$-component vector in the space $\mathcal{F}$, hence $w$ is also a vector of size $N$. The hyperplane $\mathcal{H} = \{z : w^T z + b = 0\}$ defined in $\mathcal{F}$ is no longer a linear decision boundary in the initial space $\mathcal{X}$:
$$\mathcal{B} = \{x : w^T \phi(x) + b = 0\}$$
At this stage, the generalization to the non-linear case helps us to avoid the problems of overfitting or underfitting. However, a computational problem emerges due to the high dimension of the feature space. For example, if we consider a quadratic transformation, it can lead to a feature space of dimension $N = d(d+3)/2$. The main question is how to construct the separating hyperplane in the feature space. The answer is to employ the mapping to the dual problem. In this way, our $N$-dimensional problem turns again into the following $n$-dimensional optimization problem with dual variable $\alpha$:
$$\begin{aligned} \max_{\alpha} \quad & \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j\, \phi(x_i)^T \phi(x_j) \\ \text{u.c.} \quad & \alpha_i \geq 0, \quad i = 1,\ldots,n \end{aligned}$$
Indeed, the expansion of the optimal solution $w^\star$ has the following form:
$$w^\star = \sum_{i=1}^{n} \alpha_i^\star y_i\, \phi(x_i)$$

In order to solve the quadratic program, we do not need the explicit form of the non-linear mapping but only the kernel $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$, which is usually assumed to be symmetric. If we provide only the kernel $K(x_i, x_j)$ to the optimization problem, this is enough to construct later the hyperplane $\mathcal{H}$ in the feature space $\mathcal{F}$ or the decision boundary in the data space $\mathcal{X}$. Thanks to the expansion of the optimal $w^\star$ on the initial data $x_i$, $i = 1,\ldots,n$, the discriminant function can be computed as follows:
$$h(x) = \sum_{i=1}^{n} \alpha_i y_i K(x, x_i) + b$$
From this expression, we can construct the decision function which can be used to classify a given input $x$ as $f(x) = \mathrm{sign}(h(x))$.

For a given non-linear function $\phi(x)$, we can compute the kernel $K(x_i, x_j)$ via the scalar product of two vectors in the space $\mathcal{F}$. However, the converse does not hold unless the kernel satisfies the conditions of Mercer's theorem (1909). Here, we consider some standard kernels which are already widely used in the pattern recognition domain:

i. Polynomial kernel: $K(x, y) = \left(x^T y + 1\right)^p$

ii. Radial Basis kernel: $K(x, y) = \exp\left(-\|x - y\|^2 / 2\sigma^2\right)$

iii. Neural Network kernel: $K(x, y) = \tanh\left(a\, x^T y - b\right)$
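To make these expressions concrete, the following minimal sketch (in Python with numpy; the function names and default parameter values are our own illustrative choices, not part of the thesis code) computes the three kernels above and the Gram matrix $K_{ij} = K(x_i, x_j)$ used by the dual problem.

import numpy as np

def polynomial_kernel(x, y, p=2):
    # K(x, y) = (x'y + 1)^p
    return (np.dot(x, y) + 1.0) ** p

def rbf_kernel(x, y, sigma=1.0):
    # K(x, y) = exp(-||x - y||^2 / (2 sigma^2))
    return np.exp(-np.sum((x - y) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(x, y, a=1.0, b=0.0):
    # K(x, y) = tanh(a x'y - b); only conditionally positive definite
    return np.tanh(a * np.dot(x, y) - b)

def gram_matrix(X, kernel, **params):
    # Build the n x n Gram matrix K_ij = K(x_i, x_j) that enters the dual problem
    n = X.shape[0]
    K = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            K[i, j] = kernel(X[i], X[j], **params)
    return K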

3.2.2 ERM and VRM frameworks

We finish the review of SVM by briefly discussing the general framework of Statistical Learning Theory, which includes the SVM. Without entering into details such as the important theorems of Vapnik-Chervonenkis (1998), we would like to give a more general view of the SVM by answering questions such as how to approach SVM as a regression, or how to interpret the soft-margin SVM as a regularization technique.

Empirical Risk Minimization framework

The Empirical Risk Minimization framework was studied by Vapnik and Chervonenkis in the 70s. In order to present the main idea, we first fix some notation. Let $(x_i, y_i)$, $1 \leq i \leq n$, be the training dataset of input/output pairs. The dataset is supposed to be generated i.i.d. from an unknown distribution $P(x, y)$. The dependency between the input $x$ and the output $y$ is characterized by this distribution. For example, if the input $x$ has a distribution $P(x)$ and the output is related to $x$ via a function $y = f(x)$ altered by a Gaussian noise $\mathcal{N}(0, \sigma^2)$, then $P(x, y)$ reads
$$P(x, y) = P(x)\, \mathcal{N}\!\left(f(x) - y, \sigma^2\right)$$
We remark in this example that if $\sigma \to 0$ then $\mathcal{N}(0, \sigma^2)$ tends to a Dirac distribution, which means that the relation between input and output can be exactly determined by the maximum of the distribution $P(x, y)$. Estimating the function $f(x)$ is fundamental. In order to measure the estimation quality, we compute the expected value of a loss function with respect to the distribution $P(x, y)$. We define here the loss function in two different contexts:

1. Classification: $l(f(x), y) = \mathrm{I}_{f(x) \neq y}$ where $\mathrm{I}$ is the indicator function.

2. Regression: $l(f(x), y) = (f(x) - y)^2$

The objective of statistical learning is to determine the function $f$, in a certain function space $\mathcal{F}$, which minimizes the expected loss, or risk:
$$R(f) = \int l(f(x), y)\, \mathrm{d}P(x, y)$$

As the distribution $P(x, y)$ is unknown, the expected loss cannot be evaluated. However, with the available training dataset $\{x_i, y_i\}$, one can compute the empirical risk as follows:
$$R_{\mathrm{emp}}(f) = \frac{1}{n} \sum_{i=1}^{n} l\left(f(x_i), y_i\right)$$

In the limit of a large dataset $n \to \infty$, we expect the convergence $R_{\mathrm{emp}}(f) \to R(f)$ for every tested function $f$, thanks to the law of large numbers. However, is the learning function $f$ which minimizes $R_{\mathrm{emp}}(f)$ the one minimizing the true risk $R(f)$? The answer to this question is no. In general, there is an infinite number of functions $f$ which can learn the training dataset perfectly, $f(x_i) = y_i\ \forall i$. In fact, we have to restrict the function space $\mathcal{F}$ in order to ensure the uniform convergence of the empirical risk to the true risk. The characterization of the complexity of a space of functions $\mathcal{F}$ was first studied in VC theory via the concept of VC dimension (1971) and the important VC theorem, which gives an upper bound on the convergence probability $P\left\{\sup_{f \in \mathcal{F}} |R(f) - R_{\mathrm{emp}}(f)| > \varepsilon\right\} \to 0$. A common way to restrict the function space is to impose a regularization condition. Denoting by $\Omega(f)$ a measure of regularity, the regularized problem consists of minimizing the regularized risk:
$$R_{\mathrm{reg}}(f) = R_{\mathrm{emp}}(f) + \lambda\, \Omega(f)$$
Here $\lambda$ is the regularization parameter and $\Omega(f)$ can be, for example, an $L_p$ norm of some derivative of $f$.

Vapnik and Chervonenkis theory

We are not going to discuss VC theory of statistical learning machines in detail, but only recall the most important results concerning the characterization of the complexity of a function class. In order to quantify the trade-off between the overfitting problem and the inseparable data problem, Vapnik and Chervonenkis introduced a very important concept, the VC dimension, together with an important theorem which characterizes the convergence of the empirical risk function. First, the VC dimension is introduced to measure the complexity of the class of functions $\mathcal{F}$.

Definition 3.2.1 The VC dimension of a class of functions $\mathcal{F}$ is defined as the maximum number of points that can be exactly learned by a function of $\mathcal{F}$:
$$h = \max\left\{|X|,\; X \subset \mathcal{X}, \text{ such that } \forall b \in \{-1,1\}^{|X|},\ \exists f \in \mathcal{F}\ \forall x_i \in X,\ f(x_i) = b_i\right\} \qquad (3.4)$$
With the definition of the VC dimension, we now present the VC theorems, a very powerful tool which controls the upper limit of the convergence of the empirical risk to the true risk function. These theorems allow us to have a clear idea of the upper bound given the available information and the number of observations $n$ in the training set. By satisfying this theorem, we can control the trade-off between overfitting and underfitting. The relation between the factors, or coordinates, of the vector $x$ and the VC dimension is given in the following theorem:

Theorem 3.2.2 (VC theorem of hyperplanes) Let $\mathcal{F}$ be the set of hyperplanes in $\mathbb{R}^d$:
$$\mathcal{F} = \left\{x \mapsto \mathrm{sign}\left(w^T x + b\right),\; w \in \mathbb{R}^d,\; b \in \mathbb{R}\right\}$$
then its VC dimension is $d + 1$.

This theorem gives the explicit relation between the VC dimension and the number of factors, or coordinates, in the input vectors of the training set. It can be used in the next theorem in order to evaluate the information necessary for a good classification or regression.

Theorem 3.2.3 (Vapnik and Chervonenkis) Let $\mathcal{F}$ be a class of functions of VC dimension $h$. Then for any distribution $\mathrm{Pr}$ and for any sample $\{(x_i, y_i)\}_{i=1,\ldots,n}$ drawn from this distribution, the following inequality holds:
$$\mathrm{Pr}\left\{\sup_{f \in \mathcal{F}} |R(f) - R_{\mathrm{emp}}(f)| > \varepsilon\right\} < 4 \exp\left\{h\left(1 + \ln\frac{2n}{h}\right) - \left(\varepsilon - \frac{1}{n}\right)^2 n\right\}$$
An important corollary of the VC theorem is the upper bound for the convergence of the empirical risk function to the risk function:

Corollary 3.2.4 Under the same hypotheses as the VC theorem, the following inequality holds with probability $1 - \eta$:
$$\forall f \in \mathcal{F}, \quad R(f) - R_{\mathrm{emp}}(f) \leq \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n}} + \frac{1}{n}$$
We skip the proofs of these theorems and postpone the discussion of their practical importance to Section 6, as the overfitting and underfitting problems are very present in any financial application.

Vicinal Risk Minimization framework

The Vicinal Risk Minimization (VRM) framework was formally developed in the work of Chapelle O. (2000s). In the ERM framework, the risk is evaluated by using the empirical probability distribution:
$$\mathrm{d}P_{\mathrm{emp}}(x, y) = \frac{1}{n} \sum_{i=1}^{n} \delta_{x_i}(x)\, \delta_{y_i}(y)$$

where $\delta_{x_i}(x)$, $\delta_{y_i}(y)$ are Dirac distributions located at $x_i$ and $y_i$ respectively. In the VRM framework, instead of $\mathrm{d}P_{\mathrm{emp}}$, the Dirac distribution is replaced by an estimated density in the vicinity of $x_i$:
$$\mathrm{d}P_{\mathrm{vic}}(x, y) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{d}P_{x_i}(x)\, \delta_{y_i}(y)$$
Hence, the vicinal risk is defined as follows:
$$R_{\mathrm{vic}}(f) = \int l(f(x), y)\, \mathrm{d}P_{\mathrm{vic}}(x, y) = \frac{1}{n} \sum_{i=1}^{n} \int l(f(x), y_i)\, \mathrm{d}P_{x_i}(x)$$

In order to illustrate the difference between the ERM framework and the VRM framework, let us consider the following example of linear regression. In this case, our loss function is $l(f(x), y) = (f(x) - y)^2$ where the learning function is of the form $f(x) = w^T x + b$. Assume that the vicinal probability density $\mathrm{d}P_{x_i}(x)$ is approximated by a white noise of variance $\sigma^2$. The vicinal risk is calculated as follows:
$$\begin{aligned} R_{\mathrm{vic}}(f) &= \frac{1}{n} \sum_{i=1}^{n} \int (f(x) - y_i)^2\, \mathrm{d}P_{x_i}(x) \\ &= \frac{1}{n} \sum_{i=1}^{n} \int (f(x_i + \varepsilon) - y_i)^2\, \mathrm{d}\mathcal{N}(0, \sigma^2) \\ &= \frac{1}{n} \sum_{i=1}^{n} (f(x_i) - y_i)^2 + \sigma^2 \|w\|^2 \end{aligned}$$

It is equivalent to the regularized risk minimization problem $R_{\mathrm{vic}}(f) = R_{\mathrm{emp}}(f) + \sigma^2 \|w\|^2$, with an $L_2$ penalty of parameter $\sigma^2$.
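As a quick numerical sanity check of this equivalence, the following sketch (in Python with numpy; all numerical values are illustrative assumptions) estimates the vicinal risk of a linear function by Monte Carlo over Gaussian vicinities and compares it with $R_{\mathrm{emp}}(f) + \sigma^2\|w\|^2$.

import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 500, 3, 0.3

X = rng.normal(size=(n, d))
w_true, b_true = np.array([1.0, -2.0, 0.5]), 0.1
y = X @ w_true + b_true + rng.normal(scale=0.1, size=n)

w, b = np.array([0.8, -1.5, 0.3]), 0.0          # a candidate linear function f(x) = w'x + b
f = lambda Z: Z @ w + b

# Empirical risk (ERM)
R_emp = np.mean((f(X) - y) ** 2)

# Vicinal risk (VRM): average the loss over Gaussian vicinities of each x_i
m = 2000                                        # Monte Carlo draws per point
eps = rng.normal(scale=sigma, size=(m, n, d))
R_vic = np.mean((f(X[None, :, :] + eps) - y) ** 2)

# The two quantities below should be close
print(R_vic, R_emp + sigma ** 2 * np.sum(w ** 2))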

3.3 Numerical implementations

In this section, we discuss explicitly the two possible ways to implement the SVM algorithm. As discussed above, the kernel approach can be applied directly to the dual problem, and it leads to the simple form of a quadratic program. We first discuss the dual approach, for historical reasons. A direct implementation of the primal problem is a little more delicate, which is why it was implemented much later by Chapelle O. (2007) using the Newton optimization method and the conjugate gradient method. According to Chapelle O., in terms of complexity both approaches offer more or less the same efficiency, while in some contexts the latter gives some advantage in solution precision.

3.3.1 Dual approach

We discuss here in more detail the two main applications of SVM, the classification problem and the regression problem, within the dual approach. The reason for the historical choice of this approach is simply that it offers the possibility to obtain a standard quadratic program whose numerical implementation is well-established. Here, we summarize the results presented in Cortes C. and Vapnik V. (1995), where the notion of soft-margin SVM was introduced. We next discuss the extension to regression.

Classification problem

As introduced in the last section, the classification encounters two main problems: overfitting and underfitting. If the dimension of the function space is too large, the result will be very sensitive to the input, so a small change in the data can cause instability in the final result. The second problem concerns non-separable data, in the sense that the function space is too small and we cannot obtain a solution which minimizes the risk function. In both cases, a regularization scheme is necessary to make the problem well-posed. In the first case, one should restrict the function space by imposing some conditions and working with a specific function class (the linear case for example). In the latter case, one needs to extend the function space by introducing some tolerable error (soft-margin approach) or by working with a non-linear transformation.

a) Linear SVM with soft-margin approach

In their work, Cortes C. and Vapnik V. (1995) first introduced the notion of soft margin by accepting that there will be some error in the classification. They characterize this error by additional variables $\xi_i$ associated with each data point $x_i$. These parameters intervene in the classification via the constraints. For a given hyperplane, the constraint $y_i\left(w^T x_i + b\right) \geq 1$ means that the point $x_i$ is well classified and lies outside the margin. When we change this condition to $y_i\left(w^T x_i + b\right) \geq 1 - \xi_i$ with $\xi_i \geq 0$, $i = 1,\ldots,n$, it first allows the point $x_i$ to be well classified but inside the margin for $0 \leq \xi_i < 1$. For the values $\xi_i > 1$, there is a possibility that the input $x_i$ is misclassified. As written above, the primal problem becomes an optimization with respect to the margin and the total committed error:
$$\begin{aligned} \min_{w,b,\xi} \quad & \frac{1}{2}\|w\|^2 + C\, F\!\left(\sum_{i=1}^{n} \xi_i^p\right) \\ \text{u.c.} \quad & y_i\left(w^T x_i + b\right) \geq 1 - \xi_i,\ \xi_i \geq 0, \quad i = 1,\ldots,n \end{aligned}$$
Here, $p$ is the degree of the regularization. We remark that only for the choice $p \geq 1$ does the soft margin have a unique solution. The function $F(u)$ is usually chosen as a convex function with $F(0) = 0$, for example $F(u) = u^k$. In the following we consider two specific cases: (i) the hard-margin limit with $C = 0$; (ii) the $L_1$ penalty with $F(u) = u$, $p = 1$. We define the dual vector $\Lambda = (\alpha_1, \ldots, \alpha_n)$ and the output vector $y = (y_1, \ldots, y_n)$. In order to write the optimization problem in vector form, we also define the operator $D = (D_{ij})_{n \times n}$ with $D_{ij} = y_i y_j x_i^T x_j$.

i. Hard-margin limit with $C = 0$. As shown in Appendix C.1.1, this problem can be mapped to the following dual problem:
$$\begin{aligned} \max_{\Lambda} \quad & \Lambda^T \mathbf{1} - \frac{1}{2}\Lambda^T D \Lambda \\ \text{u.c.} \quad & \Lambda^T y = 0,\ \Lambda \geq \mathbf{0} \end{aligned} \qquad (3.5)$$

ii. $L_1$ penalty with $F(u) = u$, $p = 1$. In this case the associated dual problem is given by:
$$\begin{aligned} \max_{\Lambda} \quad & \Lambda^T \mathbf{1} - \frac{1}{2}\Lambda^T D \Lambda \\ \text{u.c.} \quad & \Lambda^T y = 0,\ \mathbf{0} \leq \Lambda \leq C\mathbf{1} \end{aligned} \qquad (3.6)$$

The full derivation is given in Appendix C.1.2.

Remark 2 For the case of the $L_2$ penalty ($F(u) = u$, $p = 2$), we will demonstrate in the next discussion that it is a special case of the kernel approach for the hard-margin case. Hence, the dual problem is written exactly as in the hard-margin case, with an additional regularization term $\frac{1}{2C}$ added to the matrix $D$:
$$\begin{aligned} \max_{\Lambda} \quad & \Lambda^T \mathbf{1} - \frac{1}{2}\Lambda^T \left(D + \frac{1}{2C} I\right) \Lambda \\ \text{u.c.} \quad & \Lambda^T y = 0,\ \Lambda \geq \mathbf{0} \end{aligned} \qquad (3.7)$$

b) Non-linear SVM with kernel approach

The second possibility to extend the function space is to employ a non-linear transformation $\phi(x)$ from the initial space $\mathcal{X}$ to the feature space $\mathcal{F}$, and then construct the hard margin problem. This approach leads to the same dual problems with the use of an explicit kernel $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ instead of $x_i^T x_j$. In this case, the operator $D$ is the matrix $D = (D_{ij})_{n \times n}$ with elements:
$$D_{ij} = y_i y_j K(x_i, x_j)$$
With this convention, the first two quadratic programs above can be rewritten in the context of non-linear classification by replacing the operator $D$ by this new definition with the kernel. We finally remark that the case of the soft-margin SVM with quadratic penalty ($F(u) = u$, $p = 2$) can also be seen as a hard-margin SVM with a modified kernel. We introduce the new transformation $\tilde{\phi}(x_i) = \left(\phi(x_i),\ 0, \ldots, y_i/\sqrt{2C}, \ldots, 0\right)$ where the element $y_i/\sqrt{2C}$ is at position $i + \dim(\phi(x_i))$, and the new vector $\tilde{w} = \left(w,\ \xi_1\sqrt{2C}, \ldots, \xi_n\sqrt{2C}\right)$. In this new representation, the objective function $\|w\|^2/2 + C\sum_{i=1}^{n} \xi_i^2$ becomes simply $\|\tilde{w}\|^2/2$, whereas the inequality constraint $y_i\left(w^T \phi(x_i) + b\right) \geq 1 - \xi_i$ becomes $y_i\left(\tilde{w}^T \tilde{\phi}(x_i) + b\right) \geq 1$. Hence, we obtain the hard-margin SVM with a modified kernel which can be computed simply as:
$$\tilde{K}(x_i, x_j) = \tilde{\phi}(x_i)^T \tilde{\phi}(x_j) = K(x_i, x_j) + \frac{\delta_{ij}}{2C}$$
This kernel is consistent with the QP program in the last remark.

In summary, the linear SVM is nothing else than a special case of the non-linear SVM within the kernel approach. In the following, we therefore study the SVM problem only for the two cases of hard and soft margin within the kernel approach. After obtaining the optimal vector $\Lambda^\star$ by solving the associated QP program described above, we can compute $b$ from the KKT condition and then derive the decision function $f(x)$. We recall that $w^\star = \sum_{i=1}^{n} \alpha_i^\star y_i\, \phi(x_i)$.

i. For the hard-margin case, the KKT condition given in Appendix C.1.1 reads:
$$\alpha_i^\star \left[y_i\left(w^{\star T} \phi(x_i) + b^\star\right) - 1\right] = 0$$
We notice that for the values $\alpha_i > 0$, the inequality constraint becomes an equality. As the inequality constraint becomes an equality constraint, these points are the closest points to the optimal frontier and they are called support vectors. Hence, $b$ can be computed easily for a given support vector $(x_i, y_i)$ as follows:
$$b^\star = y_i - w^{\star T} \phi(x_i)$$
In order to enhance the precision of $b^\star$, we evaluate this value as the average over the set $SV$ of support vectors:
$$b^\star = \frac{1}{n_{SV}} \sum_{i \in SV} \left(y_i - \sum_{j \in SV} \alpha_j^\star y_j\, \phi(x_j)^T \phi(x_i)\right) = \frac{1}{n_{SV}} \sum_{i \in SV} \left(y_i - \sum_{j \in SV} \alpha_j^\star y_j\, K(x_i, x_j)\right)$$

ii. For the soft-margin case, the KKT condition given in Appendix C.1.2 is slightly different:
$$\alpha_i^\star \left[y_i\left(w^{\star T} \phi(x_i) + b^\star\right) - 1 + \xi_i\right] = 0$$
However, if $\alpha_i$ satisfies the condition $0 < \alpha_i < C$ then we can show that $\xi_i = 0$. This condition defines the subset of training points (support vectors) which are closest to the separating frontier. Hence, $b$ can be computed by exactly the same expression as in the hard-margin case.

From the optimal value of the triple $(\Lambda^\star, w^\star, b^\star)$, we can construct the decision function which can be used to classify a given input $x$ as follows:
$$f(x) = \mathrm{sign}\left(\sum_{i=1}^{n} \alpha_i^\star y_i K(x, x_i) + b^\star\right) \qquad (3.8)$$
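As an illustration of this dual formulation, the following minimal sketch (in Python with numpy and scipy; a generic SLSQP optimizer stands in for a dedicated QP solver, and the tolerance thresholds are arbitrary choices) solves the soft-margin dual (3.6) with a Gaussian kernel, recovers $b^\star$ from the support vectors, and evaluates the decision function (3.8).

import numpy as np
from scipy.optimize import minimize

def svc_dual_fit(X, y, C=1.0, sigma=1.0):
    # Minimal sketch of the L1 soft-margin dual (3.6) with a Gaussian kernel.
    # Not an efficient solver: a generic SLSQP routine is used for illustration.
    n = X.shape[0]
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
    K = np.exp(-sq / (2.0 * sigma ** 2))
    D = (y[:, None] * y[None, :]) * K                  # D_ij = y_i y_j K(x_i, x_j)

    objective = lambda a: 0.5 * a @ D @ a - np.sum(a)  # minimize minus the dual objective
    grad = lambda a: D @ a - np.ones(n)
    cons = {"type": "eq", "fun": lambda a: a @ y}      # Lambda' y = 0
    res = minimize(objective, np.zeros(n), jac=grad, method="SLSQP",
                   bounds=[(0.0, C)] * n, constraints=[cons])
    alpha = res.x

    sv = (alpha > 1e-6) & (alpha < C - 1e-6)           # unbounded support vectors
    if not np.any(sv):
        sv = alpha > 1e-6
    b = np.mean(y[sv] - (alpha * y) @ K[:, sv])        # average of the KKT conditions
    return alpha, b, K

def svc_decision(alpha, b, y, K_test_train):
    # h(x) = sum_i alpha_i y_i K(x, x_i) + b ; the predicted class is sign(h(x))
    return np.sign(K_test_train @ (alpha * y) + b)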

Regression problem

In the previous sections, we have discussed the SVM problem only in the classification context. In this section, we show how the regression problem can be interpreted as an SVM problem. As discussed in the general frameworks of statistical learning (ERM or VRM), the SVM problem consists of minimizing the risk function $R_{\mathrm{emp}}$ or $R_{\mathrm{vic}}$. The risk function can be computed via the loss function $l(f(x), y)$ which defines our objective (classification or regression). Explicitly, the risk function is calculated as:
$$R(f) = \int l(f(x), y)\, \mathrm{d}P(x, y)$$
where the distribution $\mathrm{d}P(x, y)$ can be computed in the ERM framework or in the VRM framework. For the classification problem, the loss function is defined as $l(f(x), y) = \mathrm{I}_{f(x) \neq y}$, which means that we count an error whenever the given point is misclassified. The minimization of the risk function for the classification can then be mapped to the maximization of the margin $1/\|w\|$. For the regression problem, the loss function is $l(f(x), y) = (f(x) - y)^2$, which means that we count the loss as the regression error.

Remark 3 We have chosen here the least-square loss just for illustration. In general, it can be replaced by any positive function $F$ of $f(x) - y$, so that the loss function in its general form is $l(f(x), y) = F(f(x) - y)$. We remark that the least-square case corresponds to the $L_2$ norm; the simplest generalization is then to take the loss function as an $L_p$ norm, $l(f(x), y) = |f(x) - y|^p$. We show later that the special case of $L_1$ brings the regression problem to a form similar to the soft-margin classification.

In the last discussion on classification, we concluded that the linear SVM problem is just a special case of the non-linear SVM within the kernel approach. Hence, we work here directly with the non-linear case where the training vector $x$ is already transformed by a non-linear mapping $\phi(x)$. Therefore, the approximating function of the regression reads $f(x) = w^T \phi(x) + b$. In the ERM framework, the risk function is estimated simply as the empirical sum over the dataset:
$$R_{\mathrm{emp}} = \frac{1}{n} \sum_{i=1}^{n} (f(x_i) - y_i)^2$$

whereas in the VRM framework, if we assume that $\mathrm{d}P(x, y)$ is a Gaussian noise of variance $\sigma^2$, then the risk function reads:
$$R_{\mathrm{vic}} = \frac{1}{n} \sum_{i=1}^{n} (f(x_i) - y_i)^2 + \sigma^2 \|w\|^2$$

The risk function in the VRM framework can be interpreted as a regularized form of the risk function in the ERM framework. We rewrite the risk function after renormalizing it by the factor $2\sigma^2$:
$$R_{\mathrm{vic}} = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i^2$$
with $C = 1/(2\sigma^2 n)$. Here, we have introduced new variables $\xi = (\xi_i)_{i=1,\ldots,n}$ which satisfy $y_i = f(x_i) + \xi_i = w^T \phi(x_i) + b + \xi_i$. The regression problem can now be written as a QP program with equality constraints as follows:
$$\begin{aligned} \min_{w,b,\xi} \quad & \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \xi_i^2 \\ \text{u.c.} \quad & y_i = w^T \phi(x_i) + b + \xi_i, \quad i = 1,\ldots,n \end{aligned}$$

In this form, the regression looks very similar to the SVM problem for classification. We notice that the regression problem in the SVM context can easily be generalized in two possible ways:

• The first way is to introduce a more general loss function $F(f(x_i) - y_i)$ instead of the least-square loss function. This generalization can lead to other types of regression, such as the $\varepsilon$-SV regression proposed by Vapnik (1998).

• The second way is to introduce a weight distribution $\omega_i$ for the empirical distribution instead of the uniform distribution:
$$\mathrm{d}P_{\mathrm{emp}}(x, y) = \sum_{i=1}^{n} \omega_i\, \delta_{x_i}(x)\, \delta_{y_i}(y)$$

As financial quantities depend more on the recent past, an asymmetric weight distribution in favor of recent data should improve the estimator. The idea of this generalization is quite similar to the exponential moving average. By doing this, we recover the results obtained in Gestel T.V. et al. (2001) and in Tay F.E.H. and Cao L.J. (2002) for the LS-SVM formalism. For example, we can choose the weight distribution as proposed in Tay F.E.H. and Cao L.J. (2002): $\omega_i = 2i/\left(n(n+1)\right)$ (linear weights) or $\omega_i = 1/\left(1 + \exp(a - 2ai/n)\right)$ (exponential weights).

Our least-square regression problem can again be mapped to a dual problem after introducing the Lagrangian. Detailed calculations are given in Appendix C.1. We give here the principal result, which again invokes the kernel $K_{ij} = K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$ to treat the non-linearity. As in the classification case, we consider only two problems, which are similar to the hard margin and the soft margin in the context of regression.

i. Least-square SVM regression: In fact, the regression problem discussed above is similar to the hard-margin problem. Here, we have to keep the regularization parameter $C$ as it defines a tolerance error for the regression. Moreover, this problem with the $L_2$ constraint is equivalent to a hard margin with a modified kernel. The quadratic optimization program is given as follows:
$$\begin{aligned} \max_{\Lambda} \quad & \Lambda^T y - \frac{1}{2}\Lambda^T \left(K + \frac{1}{2C} I\right) \Lambda \\ \text{u.c.} \quad & \Lambda^T \mathbf{1} = 0 \end{aligned} \qquad (3.9)$$

ii. $\varepsilon$-SVM regression: The $\varepsilon$-SVM regression problem was introduced by Vapnik (1998) in order to have a formalism similar to the soft-margin SVM. He proposed to employ the loss function in the following form:
$$l(f(x), y) = \left(|y - f(x)| - \varepsilon\right) \mathrm{I}_{\{|y - f(x)| \geq \varepsilon\}}$$
The $\varepsilon$-SVM loss function is just a generalization of the $L_1$ error. Here, $\varepsilon$ is an additional tolerance parameter which allows us not to count regression errors smaller than $\varepsilon$. Inserting this loss function into the expression of the risk function, we obtain the objective of the optimization problem:
$$R_{\mathrm{vic}} = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \left(|f(x_i) - y_i| - \varepsilon\right) \mathrm{I}_{\{|y_i - f(x_i)| \geq \varepsilon\}}$$

Because the two sets $\{y_i - f(x_i) \geq \varepsilon\}$ and $\{y_i - f(x_i) \leq -\varepsilon\}$ are disjoint, we can break the function $\mathrm{I}_{\{|y_i - f(x_i)| \geq \varepsilon\}}$ into two terms:
$$\mathrm{I}_{\{|y_i - f(x_i)| \geq \varepsilon\}} = \mathrm{I}_{\{y_i - f(x_i) - \varepsilon \geq 0\}} + \mathrm{I}_{\{f(x_i) - y_i - \varepsilon \geq 0\}}$$
We then introduce slack variables $\xi$ and $\xi'$ as in the previous case, satisfying the conditions $\xi_i \geq y_i - f(x_i) - \varepsilon$ and $\xi_i' \geq f(x_i) - y_i - \varepsilon$. Hence, we obtain the following optimization problem:
$$\begin{aligned} \min_{w,b,\xi,\xi'} \quad & \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} \left(\xi_i + \xi_i'\right) \\ \text{u.c.} \quad & w^T \phi(x_i) + b - y_i \leq \varepsilon + \xi_i, \quad \xi_i \geq 0, \quad i = 1,\ldots,n \\ & y_i - w^T \phi(x_i) - b \leq \varepsilon + \xi_i', \quad \xi_i' \geq 0, \quad i = 1,\ldots,n \end{aligned}$$

Remark 4 We remark that our approach gives exactly the same result as the traditional approach discussed in the work of Vapnik (1998), in which the objective function is constructed by minimizing the norm $\|w\|$ (i.e. maximizing the margin) with additional terms defining the regression error, these terms being controlled by the pair of slack variables.

The dual problem in this case can be obtained by performing the same calculation as for the soft-margin SVM:
$$\begin{aligned} \max_{\Lambda,\Lambda'} \quad & \left(\Lambda - \Lambda'\right)^T y - \varepsilon \left(\Lambda + \Lambda'\right)^T \mathbf{1} - \frac{1}{2}\left(\Lambda - \Lambda'\right)^T K \left(\Lambda - \Lambda'\right) \\ \text{u.c.} \quad & \left(\Lambda - \Lambda'\right)^T \mathbf{1} = 0, \quad \mathbf{0} \leq \Lambda, \Lambda' \leq C\mathbf{1} \end{aligned} \qquad (3.10)$$

For the particular case $\varepsilon = 0$, we obtain:
$$\begin{aligned} \max_{\Lambda} \quad & \Lambda^T y - \frac{1}{2}\Lambda^T K \Lambda \\ \text{u.c.} \quad & \Lambda^T \mathbf{1} = 0, \quad |\Lambda| \leq C\mathbf{1} \end{aligned}$$

After the optimization procedure using the QP program, we obtain the optimal vector $\Lambda^\star$ and then compute $b^\star$ from the KKT condition $w^T \phi(x_i) + b - y_i = 0$ for the support vectors $(x_i, y_i)$ (see Appendix C.1.3 for more detail). In order to obtain a good accuracy for the estimation of $b$, we average over the set of support vectors $SV$ and obtain:
$$b^\star = \frac{1}{n_{SV}} \sum_{i=1}^{n_{SV}} \left(y_i - \sum_{j=1}^{n} \alpha_j^\star K(x_i, x_j)\right)$$

The SVM regressor is then given by the following formula:
$$f(x) = \sum_{i=1}^{n} \alpha_i^\star K(x, x_i) + b^\star$$
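In practice, the $\varepsilon$-SV regression described above is available in standard toolboxes; the following minimal sketch (using scikit-learn, assumed to be available, with illustrative parameter values) fits an $\varepsilon$-SVR with a Gaussian kernel on a noisy cubic trend of the same form as model (3.16) below.

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(42)
t = np.linspace(0.0, 5.0, 260).reshape(-1, 1)
y = (t.ravel() - 2.5) ** 3 + rng.normal(scale=2.0, size=t.shape[0])

# epsilon-SV regression with a Gaussian (RBF) kernel:
# C is the error-tolerance parameter, epsilon the width of the insensitive tube,
# and gamma = 1 / (2 sigma^2) plays the role of the kernel horizon.
model = SVR(kernel="rbf", C=10.0, epsilon=0.5, gamma=0.5)
model.fit(t, y)
y_hat = model.predict(t)
print("number of support vectors:", model.support_.size)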

3.3.2 Primal approach

We now discuss the possibility of a direct implementation of the primal problem. This problem has been proposed and studied by Chapelle O. (2007). In this work, the author argued that both primal and dual implementations have the same complexity, of the order $O\left(\max(n, d)\, \min(n, d)^2\right)$. Indeed, according to the author, the primal problem might give a more accurate solution as it treats directly the quantity that one is interested in. It can be easily understood via the special case of an LS-SVM linear estimator, where both primal and dual problems can be solved analytically. The main idea of the primal implementation is to rewrite the constrained optimization problem as an unconstrained problem by performing a trivial minimization over the slack variables $\xi$. We then obtain:
$$\min_{w,b} \quad \frac{1}{2}\|w\|^2 + C \sum_{i=1}^{n} L\left(y_i, w^T \phi(x_i) + b\right) \qquad (3.11)$$

Here, we have $L(y, t) = (y - t)^p$ for the regression problem, whereas $L(y, t) = \max(0, 1 - yt)^p$ for the classification problem. In the case of quadratic loss or $L_2$ penalty, the function $L(y, t)$ is differentiable with respect to the second variable, hence one can write the zero-gradient equation. In the case where $L(y, t)$ is not differentiable, such as $L(y, t) = \max(0, 1 - yt)$, we have to approximate it by a regular function. Assuming that $L(y, t)$ is differentiable with respect to $t$, we obtain:
$$w + C \sum_{i=1}^{n} \frac{\partial L}{\partial t}\left(y_i, w^T \phi(x_i) + b\right) \phi(x_i) = 0$$
which leads to the following representation of the solution $w$:
$$w = \sum_{i=1}^{n} \beta_i\, \phi(x_i)$$
By introducing the kernel $K_{ij} = K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$, we rewrite the primal problem as follows:
$$\min_{\beta, b} \quad \frac{1}{2}\beta^T K \beta + C \sum_{i=1}^{n} L\left(y_i, K_i^T \beta + b\right) \qquad (3.12)$$
where $K_i$ is the $i$-th column of the matrix $K$. We note that this is now an unconstrained optimization problem which can be solved by gradient descent whenever $L(y, t)$ is differentiable. In Appendix C.1, we present a detailed derivation of the primal implementation for the cases of quadratic loss and soft-margin classification.
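A minimal sketch of this primal approach is given below (in Python with numpy; the learning rate and the number of iterations are illustrative assumptions and may need tuning), applying plain gradient descent to problem (3.12) with the quadratic loss.

import numpy as np

def primal_kernel_fit(K, y, C=1.0, lr=1e-3, n_iter=5000):
    # Minimal sketch of the primal problem (3.12) with quadratic loss
    # L(y, t) = (y - t)^2, solved by plain gradient descent on (beta, b).
    n = K.shape[0]
    beta, b = np.zeros(n), 0.0
    for _ in range(n_iter):
        r = K @ beta + b - y                    # residuals K_i' beta + b - y_i
        g_beta = K @ beta + 2.0 * C * (K @ r)   # gradient of 1/2 beta'K beta + C sum r_i^2
        g_b = 2.0 * C * np.sum(r)
        beta -= lr * g_beta
        b -= lr * g_b
    return beta, b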

3.3.3 Model selection - Cross validation procedure

The possibility to enlarge or restrict the function space gives us the possibility to obtain a solution to the SVM problem. However, the choice of the additional parameters, such as the error tolerance $C$ in the soft-margin SVM or the kernel parameter in the extension to the non-linear case, is fundamental. How can we choose these parameters for a given data set? In this section, we discuss the calibration procedure, the so-called "model selection", which aims to determine the set of parameters for the SVM. This discussion is essentially based on the results presented in O. Chapelle's thesis (2002). In order to define the calibration procedure, let us first define the test function which is used to evaluate the SVM problem. In the case where we have a lot of data, we can follow the traditional cross-validation procedure by dividing the total data into two independent sets: the training set and the validation set. The training set $\{x_i, y_i\}_{1 \leq i \leq n}$ is used for the optimization problem, whereas the validation set $\{x_i', y_i'\}_{1 \leq i \leq m}$ is used to evaluate the error via the following test function:
$$T = \frac{1}{m} \sum_{i=1}^{m} \psi\left(-y_i' f(x_i')\right)$$

where $\psi(x) = \mathrm{I}_{\{x > 0\}}$, with $\mathrm{I}_A$ the standard notation for the indicator function. In the case where we do not have enough data for the SVM problem, we can directly employ the training set to evaluate the error via the "leave-one-out error". Let $f^0$ be the classifier obtained on the full training set and $f^p$ the one with the point $(x_p, y_p)$ left out. The error is defined by the test of the decision rule $f^p$ on the missing point $(x_p, y_p)$ as follows:
$$T = \frac{1}{n} \sum_{p=1}^{n} \psi\left(-y_p f^p(x_p)\right)$$

We focus here on the first test error function, with an available validation data set. However, the error function involves the step function $\psi$, which is discontinuous and can cause some difficulty if we want to determine the best selection parameters via the optimal test error. In order to perform the search for the minimal test error by gradient descent, for example, we should smooth the test error by regularizing the step function:
$$\tilde{\psi}(x) = \frac{1}{1 + \exp(-Ax + B)}$$
The choice of the parameters $A$, $B$ is important. If $A$ is too small the approximation error is too large, whereas if $A$ is too large the test error is not smooth enough for the minimization procedure.
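In practice, this model selection step is often carried out by a simple grid search with cross validation; the following sketch (using scikit-learn, assumed available; the simulated data and the grids of $C$ and kernel parameters are our own illustrative choices) shows such a procedure for a Gaussian-kernel classifier.

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X[:, 0] + 0.2 * rng.normal(size=200) > 0.0, 1, -1)

# Grid search over the error tolerance C and the kernel parameter gamma,
# using k-fold cross validation as a stand-in for the training/validation split.
grid = GridSearchCV(SVC(kernel="rbf"),
                    param_grid={"C": [0.1, 1.0, 10.0, 100.0],
                                "gamma": [0.01, 0.1, 1.0]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)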

3.4 Extension to SVM multi-classification

The single SVM classification (binary classification) discussed in the last section is very well established and has become a standard method for various applications. However, the extension to the multi-classification problem is not straightforward. This problem still remains a very active research topic in the pattern recognition domain. In this section, we give a quick overview of this progressing field and of some practical implementations.

3.4.1 Basic idea of multi-classification

The multiclass SVM can be formulated as follows. Let $(x_i, y_i)_{i=1,\ldots,n}$ be the training set of data with characteristics $x \in \mathbb{R}^d$ under a classification criterion $y$. For example, the training data belong to $m$ different classes labeled from 1 to $m$, which means that $y \in \{1, \ldots, m\}$. Our task is to determine a classification rule $F : \mathbb{R}^d \to \{1, \ldots, m\}$, based on the training set, which aims to predict to which class a test point $x_t$ belongs by evaluating the decision rule $f(x_t)$. Recently, many important contributions have advanced the field both in accuracy and in complexity (i.e. reduction of computation time). The extensions have been developed in two main directions. The first one consists of dividing the multi-classification problem into many binary classification problems by using the "one-against-all" or "one-against-one" strategy. The next step is to construct the decision function in the recognition phase. The implementation of the decision for the "one-against-all" strategy is based on the maximum output among all binary SVMs. The outputs are usually mapped into estimated probabilities, as proposed by different authors such as Platt (1999). For the "one-against-one" strategy, in order to take the right decision, the Max Wins algorithm is adopted: the resulting class is the one voted for by the majority of binary classifiers. Both techniques encounter the limitation of complexity and the high cost of computation time. Other improvements in the same direction, such as the binary decision tree (SVM-BDT), were recently proposed by Madzarov G. et al. (2009). This technique proved able to speed up the computation time. The second direction consists of generalizing the kernel concept in the SVM algorithm into a more general form. This method treats the multi-classification problem directly by writing a general form of the large margin problem, which is again mapped into the dual problem by incorporating the kernel concept. Crammer K. and Singer Y. (2001) introduced an efficient algorithm which decomposes the dual problem into multiple optimization problems that can be solved later by a fixed-point algorithm.

3.4.2 Implementations of multiclass SVM

We describe here the two principal implementations of SVM for the multi-classification problem. The first one concerns a direct application of the binary SVM classifier; however, the recognition phase requires a careful choice of decision strategy. We next describe and implement the multiclass kernel-based SVM algorithm, which is a more elegant approach.

Remark 5 Before discussing the details of the two implementations, we remark that there exist other implementations of SVM, such as the application of Nonnegative Matrix Factorization (Poluru V. K. et al., 2009) in the binary case, by rewriting the SVM problem in the NMF framework. The extension of this application to the multi-classification case would be an interesting topic for future work.

Decomposition into multiple binary SVMs

The two most popular extensions of the single SVM classifier to a multiclass SVM classifier use the one-against-all strategy and the one-against-one strategy. Recently, another technique utilizing a binary decision tree required less effort in training the data and is much faster in the recognition phase, with a complexity of order $O(\log_2 N)$. All these techniques employ directly the SVM implementation described above.

a) One-against-all strategy: In this case, we construct $m$ single SVM classifiers in order to separate the training data of each class from the rest of the classes. Let us consider the construction of the classifier separating class $k$ from the rest. We start by attributing the response $z_i = 1$ if $y_i = k$ and $z_i = -1$ for all $y_i \in \{1, \ldots, m\} \setminus \{k\}$. Applying this construction for all classes, we finally obtain the $m$ classifiers $f_1(x), \ldots, f_m(x)$. For a test point $x$, the decision rule is obtained by the maximum of the outputs given by these $m$ classifiers:
$$y = \operatorname{argmax}_{k \in \{1,\ldots,m\}} f_k(x)$$
In order to avoid the error coming from the fact that we compare the outputs of different classifiers, we can map the output of each SVM into the same form of probability, as proposed by Platt (1999):
$$\hat{\mathrm{Pr}}\left(\omega_k \,|\, f_k(x)\right) = \frac{1}{1 + \exp\left(A_k f_k(x) + B_k\right)}$$

where $\omega_k$ is the label of the $k$-th class. This quantity can be interpreted as a measure of the acceptance probability of the class $\omega_k$ for a given point $x$ with output $f_k(x)$. However, nothing guarantees that $\sum_{k=1}^{m} \hat{\mathrm{Pr}}\left(\omega_k \,|\, f_k(x)\right) = 1$, hence we have to renormalize this probability:
$$\hat{\mathrm{Pr}}\left(\omega_k \,|\, x\right) = \frac{\hat{\mathrm{Pr}}\left(\omega_k \,|\, f_k(x)\right)}{\sum_{j=1}^{m} \hat{\mathrm{Pr}}\left(\omega_j \,|\, f_j(x)\right)}$$

In order to obtain these probabilities, we have to calibrate the parameters $(A_k, B_k)$. This can be done by performing a maximum likelihood estimation on the training set (Platt (1999)).

b) One-against-one strategy: Another way to employ the binary SVM classifier is to construct $N_c = m(m-1)/2$ binary classifiers which separate all pairs of classes $(\omega_i, \omega_j)$. We denote the ensemble of classifiers $\mathcal{C} = \{f_1, \ldots, f_{N_c}\}$. In the recognition phase, we evaluate all possible outputs $f_1(x), \ldots, f_{N_c}(x)$ over $\mathcal{C}$ for a given point $x$. These outputs can be mapped to the response function of each classifier, $\mathrm{sign}\, f_k(x)$, which determines to which class the point $x$ belongs according to the classifier $f_k$. We denote by $N_1, \ldots, N_m$ the numbers of times that the point $x$ is classified in the classes $\omega_1, \ldots, \omega_m$ respectively. Using these responses we can construct a probability distribution $\hat{\mathrm{Pr}}(\omega_k \,|\, x)$ over the set of classes $\{\omega_k\}$. This probability is again used to decide the recognition of $x$.

c) Binary decision tree: Both methods above are quite easy to implement as they employ the binary solver directly. However, they all suffer from a high cost of computation time. We now discuss the technique recently proposed by Madzarov G. et al. (2009), which uses the binary decision tree strategy. Thanks to the binary tree, the technique gains both in complexity and in computation time. It needs only $m - 1$ classifiers, which do not always run on the whole training set during their construction. By construction, recognizing a test point $x$ requires only $O(\log_2 N)$ evaluations by descending the tree. Figure 3.2 illustrates how this algorithm works for classifying 7 classes.

Multiclass Kernel-based Vector Machines

A more general and elegant formalism can be obtained for multi-classification by generalizing the kernel concept. In this discussion, we follow the approach given in the work of Crammer K. et al. (2001), but with a more geometrical explanation. We show that this approach can be interpreted as a simultaneous combination of the "one-against-all" and "one-against-one" strategies. As in the linear case, we have to define a decision function. For the binary case, $f(x) = \mathrm{sign}(h(x))$ where $h(x)$ is the boundary (i.e. $f(x) = +1$ if $x \in$ class 1 whereas $f(x) = -1$ if $x \in$ class 2). Here, the decision function must also indicate the class index. In the work of Crammer K. et al. (2001), it is proposed to construct the decision rule $F : \mathbb{R}^d \to \{1, \ldots, m\}$ as follows:
$$F(x) = \operatorname{argmax}_{k \in \{1,\ldots,m\}} \left(W_k^T x\right)$$

Figure 3.2: Binary decision tree strategy for the multiclassification problem

Here, $W$ is the $d \times m$ weight matrix in which each column $W_k$ corresponds to a $d \times 1$ weight vector. Therefore, we can write the weight matrix as $W = (W_1\ W_2\ \ldots\ W_m)$. We recall that the vector $x$ is of dimension $d$. In fact, the vector $W_k$ corresponding to the $k$-th class can be interpreted as the normal vector of the hyperplane in the binary SVM. It characterizes the sensitivity of a given point $x$ to the $k$-th class. The quantity $W_k^T x$ is similar to a "score" that we attribute to the class $\omega_k$.

Remark 6 This construction looks quite similar to the "one-against-all" strategy. The main difference is that in the "one-against-all" strategy all vectors $W_1, \ldots, W_m$ are constructed independently, one by one, with binary SVMs, whereas within this formalism they are constructed simultaneously, all together. We will show in the following that the selection rule of this approach is more similar to the "one-against-one" strategy.

Remark 7 In order to have an intuitive geometric interpretation, we treat here the case of the linear classifier. However, the generalization to the non-linear case is straightforward when we replace $x_i^T x_j$ by $\phi(x_i)^T \phi(x_j)$. This step introduces the notion of the kernel $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$.

By definition, $W_k$ is the vector defining the boundary which distinguishes the class $\omega_k$ from the rest. It is a normal vector to the boundary pointing to the region occupied by class $\omega_k$. Assume that we are able to separate all data correctly with the classifier $W$. For any point $(x, y)$, when we compute the position of $x$ with respect to the two classes $\omega_y$ and $\omega_k$ for all $k \neq y$, we must find that $x$ belongs to class $\omega_y$. As $W_k$ defines the vector pointing to the class $\omega_k$, when we compare a class $\omega_y$ to a class $\omega_k$ it is natural to take $W_y - W_k$ as the vector pointing to class $\omega_y$ but not $\omega_k$. As a consequence, $W_k - W_y$ is the vector pointing to class $\omega_k$ but not $\omega_y$. When $x$ is well classified, we must have $\left(W_y^T - W_k^T\right) x > 0$ (i.e. the class $\omega_y$ has the best score). In order to have a margin as in the binary case, we impose strictly that $\left(W_y^T - W_k^T\right) x \geq 1$, $\forall k \neq y$. This condition can be written for all $k = 1,\ldots,m$ by adding $\delta_{y,k}$ (the Kronecker symbol) as follows:
$$\left(W_y^T - W_k^T\right) x + \delta_{y,k} \geq 1$$
Therefore, solving the multi-classification problem for the training set $(x_i, y_i)_{i=1,\ldots,n}$ is equivalent to finding $W$ satisfying:
$$\left(W_{y_i}^T - W_k^T\right) x_i + \delta_{y_i,k} \geq 1 \quad \forall i, k$$

We notice here that $w = W_i - W_j$ is a normal vector to the separation boundary $\mathcal{H}_w = \left\{z \,|\, w^T z + b_{ij} = 0\right\}$ between the two classes $\omega_i$ and $\omega_j$. Hence the width of the margin between the two classes is, as in the binary case:
$$M(\mathcal{H}_w) = \frac{1}{\|w\|}$$

Maximizing the margin is equivalent to minimizing the norm $\|w\|$. Indeed, we have $\|w\|^2 = \|W_i - W_j\|^2 \leq 2\left(\|W_i\|^2 + \|W_j\|^2\right)$. In order to maximize all the margins at the same time, it turns out that we have to minimize the $L_2$-norm of the matrix $W$:
$$\|W\|_2^2 = \sum_{i=1}^{m} \|W_i\|^2 = \sum_{i=1}^{m} \sum_{j=1}^{d} W_{ij}^2$$

Finally, we obtain the following optimization problem:
$$\begin{aligned} \min_{W} \quad & \frac{1}{2}\|W\|^2 \\ \text{u.c.} \quad & \left(W_{y_i}^T - W_k^T\right) x_i + \delta_{y_i,k} \geq 1, \quad \forall i = 1,\ldots,n,\ k = 1,\ldots,m \end{aligned}$$

The extension to the analogous "soft-margin" case can be formulated easily by introducing slack variables $\xi_i$, one for each training point. As before, these slack variables allow a point to be classified inside the margin. The minimization problem now becomes:
$$\begin{aligned} \min_{W,\xi} \quad & \frac{1}{2}\|W\|^2 + C\, F\!\left(\sum_{i=1}^{n} \xi_i^p\right) \\ \text{u.c.} \quad & \left(W_{y_i}^T - W_k^T\right) x_i + \delta_{y_i,k} \geq 1 - \xi_i,\ \xi_i \geq 0, \quad \forall i, k \end{aligned}$$

Remark 8 Within the ERM or VRM frameworks, we can construct the risk function via the loss function $l(x) = \mathrm{I}_{\{F(x) \neq y\}}$ for a data pair $(x, y)$. For example, in the ERM framework, we have:
$$R_{\mathrm{emp}}(W) = \frac{1}{n} \sum_{i=1}^{n} \mathrm{I}_{\{F(x_i) \neq y_i\}}$$
The classification problem is now equivalent to finding the optimal matrix $W^\star$ which minimizes the empirical risk function. In the binary case, we have seen that the optimization of the risk function is equivalent to maximizing the margin under a linear constraint. We remark that in the VRM framework, this problem can be tackled exactly as in the binary case. In order to prove the equivalence between minimizing the risk function and the large margin principle, we look for a linear upper bound of the indicator function $\mathrm{I}_{\{F(x) \neq y\}}$. As shown in Crammer K. et al. (2001), we consider the following function:
$$g(x, y; k) = \left(W_k^T - W_y^T\right) x + 1 - \delta_{y,k}$$
In fact, we can prove that
$$\mathrm{I}_{\{F(x) \neq y\}} \leq g(x, y) = \max_{k} g(x, y; k) \quad \forall (x, y)$$

We first remark that $g(x, y; y) = \left(W_y^T - W_y^T\right) x + 1 - \delta_{y,y} = 0$, hence $g(x, y) \geq g(x, y; y) = 0$. If the point $(x_i, y_i)$ satisfies $F(x_i) = y_i$, then $W_{y_i}^T x_i = \max_k W_k^T x_i$ and $\mathrm{I}_{\{F(x) \neq y\}}(x_i) = 0$; in this case, it is obvious that $\mathrm{I}_{\{F(x) \neq y\}}(x_i) \leq g(x_i, y_i)$. If instead $F(x_i) \neq y_i$, then $W_{y_i}^T x_i < \max_k W_k^T x_i$ and $\mathrm{I}_{\{F(x) \neq y\}}(x_i) = 1$; in this case, $g(x_i, y_i) = \max_k \left(W_k^T - W_{y_i}^T\right) x_i + 1 \geq 1$. Hence, we obtain again $\mathrm{I}_{\{F(x) \neq y\}}(x_i) \leq g(x_i, y_i)$. Finally, we obtain the upper bound of the risk function by the following expression:
$$R_{\mathrm{emp}}(W) \leq \frac{1}{n} \sum_{i=1}^{n} \max_{k} \left[\left(W_k^T - W_{y_i}^T\right) x_i + 1 - \delta_{y_i,k}\right]$$

If the data are separable, then the optimal value of the risk function is zero. If one requires the upper bound of the risk function to be zero, then the $W^\star$ which optimizes this bound must be the one that optimizes $R_{\mathrm{emp}}(W)$. The minimization can be expressed as:
$$\max_{k} \left[\left(W_k^T - W_{y_i}^T\right) x_i + 1 - \delta_{y_i,k}\right] = 0 \quad \forall i$$
or, in the same form as the large margin problem:
$$\left(W_{y_i}^T - W_k^T\right) x_i + \delta_{y_i,k} \geq 1 \quad \forall i, k$$

Following the traditional route for solving this problem, we map it into the dual problem as in the binary classification case. The details of this mapping are given in K. Crammer and Y. Singer (2001). We summarize here their important result in the dual form with the dual variables $\eta_i$ of dimension $m$, $i = 1,\ldots,n$. Define $\tau_i = 1_{y_i} - \eta_i$, where $1_{y_i}$ is the column vector of zeros except for the $y_i$-th element, which equals one. Then, in the case of the soft margin with $p = 1$ and $F(u) = u$, we have the dual problem:
$$\begin{aligned} \max_{\tau} \quad & Q(\tau) = -\frac{1}{2} \sum_{i,j} \left(x_i^T x_j\right) \left(\tau_i^T \tau_j\right) + \frac{1}{C} \sum_{i=1}^{n} \tau_i^T 1_{y_i} \\ \text{u.c.} \quad & \tau_i \leq 1_{y_i} \ \text{ and } \ \tau_i^T \mathbf{1} = 0 \quad \forall i \end{aligned}$$

We remark here again that we obtain a quadratic program which involves only the inner products between all pairs of vectors $x_i$, $x_j$. Hence the generalization to the non-linear case is straightforward with the introduction of the kernel concept. The general problem is finally written by replacing the factor $x_i^T x_j$ by the kernel $K(x_i, x_j)$:
$$\max_{\tau} \quad Q(\tau) = -\frac{1}{2} \sum_{i,j} K(x_i, x_j) \left(\tau_i^T \tau_j\right) + \frac{1}{C} \sum_{i=1}^{n} \tau_i^T 1_{y_i} \qquad (3.13)$$
$$\text{u.c.} \quad \tau_i \leq 1_{y_i} \ \text{ and } \ \tau_i^T \mathbf{1} = 0 \quad \forall i \qquad (3.14)$$

The optimal solution of this problem allows us to evaluate the classification rule:
$$H(x) = \operatorname{arg\,max}_{r=1,\ldots,m} \left\{\sum_{i=1}^{n} \tau_{i,r}\, K(x, x_i)\right\} \qquad (3.15)$$

For a small number of classes $m$, we can implement the above optimization with a traditional QP program with a matrix of size $mn \times mn$. However, for a large number of classes we must employ an efficient algorithm, as even storing an $mn \times mn$ matrix is already a complicated problem. Crammer and Singer have introduced an interesting algorithm which handles this optimization problem efficiently both in storage and in computation speed.
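For illustration, the following sketch (using scikit-learn, assumed available; the simulated data are our own toy example, not from the thesis) compares the one-against-one decomposition with the direct Crammer-Singer multiclass formulation in the linear case.

import numpy as np
from sklearn.svm import SVC, LinearSVC

rng = np.random.default_rng(1)
m, n, d = 4, 400, 5
X = rng.normal(size=(n, d))
y = rng.integers(0, m, size=n)
X[np.arange(n), 0] += 2.0 * y                 # make the first coordinate informative

# One-against-one decomposition: m(m-1)/2 binary SVMs with a Gaussian kernel
ovo = SVC(kernel="rbf", decision_function_shape="ovo").fit(X, y)

# Direct multiclass formulation of Crammer and Singer (linear case)
cs = LinearSVC(multi_class="crammer_singer", C=1.0, max_iter=10000).fit(X, y)

print(ovo.score(X, y), cs.score(X, y))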

3.5 SVM-regression in finance

Recently, different applications in the financial field have been developed in two main directions. The first one employs SVM as a non-linear estimator in order to forecast the market tendency or the volatility. In this context, SVM is used as a regression technique that can readily be extended to the non-linear case thanks to the kernel approach. The second direction consists of using SVM as a classification technique which aims at stock selection in a trading strategy (for example a long/short strategy). The SVM regression can be considered as a non-linear filter for time series or as a regression for evaluating a score. We first discuss here how to employ the SVM regression as an estimator of the trend of a given asset. The estimated trend can be used later in momentum strategies such as the trend-following strategy. We next use SVM as a method for constructing stock scores for a long/short strategy.

3.5.1 Numerical tests on SVM-regressors

We test here the efficiency of the different regressors discussed above. They can be distinguished by the form of the loss function ($L_1$-type or $L_2$-type) or by the form of the non-linear kernel. We do not focus yet on the calibration of the SVM parameters, which is reserved for the next discussion on the trend extraction of financial time series, with a full description of the cross-validation procedure. For a given time series $y_t$, we would like to regress the data on the training vector $x = t = (t_i)_{i=1,\ldots,n}$. Let us consider two models of time series. The first model is simply a deterministic trend perturbed by a white noise:
$$y_t = (t - a)^3 + \sigma\, \mathcal{N}(0, 1) \qquad (3.16)$$
The second model for our tests is the Black-Scholes model of the stock price:
$$\frac{\mathrm{d}S_t}{S_t} = \mu_t\, \mathrm{d}t + \sigma_t\, \mathrm{d}B_t \qquad (3.17)$$
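A minimal simulation of these two test models is sketched below (in Python with numpy; the constant $a$, the noise scale, the initial price and the time grid are illustrative choices, while $\mu$, $\sigma$ and $n$ match the values quoted just below).

import numpy as np

rng = np.random.default_rng(0)
n = 260                                   # one trading year
dt = 1.0 / n

# Model (3.16): deterministic cubic trend plus white noise
t = np.linspace(0.0, 5.0, n)
a, sigma_noise = 2.5, 2.0
y_trend = (t - a) ** 3 + sigma_noise * rng.normal(size=n)

# Model (3.17): Black-Scholes dynamics, simulated on the log-price
mu, sigma = 0.05, 0.20
log_returns = (mu - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * rng.normal(size=n)
S = 100.0 * np.exp(np.cumsum(log_returns))
y_bs = np.log(S)                          # the studied signal is y_t = ln S_t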

We notice here that the studied signal is $y_t = \ln S_t$. The parameters of the model are the annualized return $\mu = 5\%$ and the annualized volatility $\sigma = 20\%$. We consider the regression over a period of one year, corresponding to $N = 260$ trading days. The first test consists of comparing the $L_1$-regressor and the $L_2$-regressor with a Gaussian kernel (see Figures 3.3-3.4). As shown in Figures 3.3 and 3.4, the $L_2$-regressor seems preferable for the regression. Indeed, via many tests on simulated data from Model 3.17, we observe that the $L_2$-regressor is more stable than the $L_1$-regressor (i.e. $L_1$ is more sensitive to the training data set). In the second test, we compare different $L_2$ regressions corresponding to four typical kernels: 1. Linear, 2. Polynomial, 3. Gaussian, 4. Sigmoid.

Figure 3.3: $L_1$-regressor versus $L_2$-regressor with Gaussian kernel for model (3.16)


3.5.2 SVM-Filtering for forecasting the trend of signal

Here, we employ SVM as a non-linear filtering technique for extracting the hidden trend of a time series signal. The regression principle was explained in the previous discussion.

Figure 3.4: $L_1$-regressor versus $L_2$-regressor with Gaussian kernel for model (3.17)


Figure 3.5: Comparison of different regression kernels for model (3.16)


Figure 3.6: Comparison of different regression kernels for model (3.17)


We now apply this technique to estimate the derivative of the trend $\hat{\mu}_t$, and then plug it into a trend-following strategy.

Description of the trend-following strategy

We choose here the simplest trend-following strategy, whose exposure is given by:
$$e_t = m\, \frac{\hat{\mu}_t}{\hat{\sigma}_t^2}$$
with $m$ the risk tolerance and $\hat{\sigma}_t$ the estimator of the volatility given by:
$$\hat{\sigma}_t^2 = \frac{1}{T} \int_0^T \sigma_t^2\, \mathrm{d}t = \frac{1}{T} \sum_{i=t-T+1}^{t} \ln^2 \frac{S_i}{S_{i-1}}$$
In order to limit the risk of explosion of the exposure $e_t$, we cap it between a lower boundary $e_{\min}$ and an upper boundary $e_{\max}$:
$$e_t = \max\left(\min\left(m\, \frac{\hat{\mu}_t}{\hat{\sigma}_t^2},\, e_{\max}\right),\, e_{\min}\right)$$
The wealth of the portfolio is then given by the following expression:
$$W_{t+1} = W_t + W_t \left[e_t \left(\frac{S_{t+1}}{S_t} - 1\right) + (1 - e_t)\, r_t\right]$$
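A minimal sketch of this strategy is given below (in Python with numpy; the function signature, the default caps and the constant risk-free rate are illustrative assumptions), computing the capped exposure from an externally supplied trend estimate $\hat{\mu}_t$ and accumulating the wealth according to the expression above.

import numpy as np

def trend_following_backtest(S, mu_hat, T=20, m=1.0, e_min=-2.0, e_max=2.0, r=0.0):
    # S      : price series, mu_hat : estimated trend (e.g. from the SVM filter),
    # T      : volatility window, m : risk tolerance, r : risk-free rate per period.
    log_ret = np.diff(np.log(S))
    W = np.ones(len(S))
    for t in range(T, len(S) - 1):
        sigma2_hat = np.mean(log_ret[t - T:t] ** 2)        # realized variance over T days
        e = np.clip(m * mu_hat[t] / sigma2_hat, e_min, e_max)
        W[t + 1] = W[t] * (1.0 + e * (S[t + 1] / S[t] - 1.0) + (1.0 - e) * r)
    return W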

SVM-Filtering

We now discuss how to build a cross-validation procedure which can help to learn the trend of a given signal. We employ the moving average as a benchmark against which to compare this new filter. An important parameter in moving-average filtering is the estimation horizon $T$, so we use this horizon as a reference to calibrate our SVM filtering. For the sake of simplicity, we study here only the SVM filter with a Gaussian kernel and $L_2$ penalty. The two typical parameters of the SVM filter are $C$ and $\sigma$: $C$ is the parameter which allows a certain level of error in the regression curve, while $\sigma$ characterizes the estimation horizon and is directly proportional to $T$. We propose two schemes for the validation procedure, which are based on the following division of the data: training set, validation set and testing set. In the first scheme, we fix the kernel parameter $\sigma = T$ and optimize the error tolerance parameter $C$ on the validation set. This scheme is comparable to our moving-average benchmark. The second scheme consists of optimizing the couple of parameters $(C, \sigma)$ on the validation set. In this case, we let the validation data decide the estimation horizon. This scheme is more complicated to interpret as $\sigma$ is now a dynamic parameter. However, by relating $\sigma$ to the local horizon, we can gain an additional understanding of the changes in the price of the underlying asset. For example, we can determine from the historical data whether the underlying asset undergoes a period with a long or a short trend. It can help to recognize additional signatures such as the cycle between long and short trends. We report the two schemes in the following algorithm.

Figure 3.7: Cross-validation procedure for determining the optimal values $C^\star$, $\sigma^\star$ (the historical data is split into a training window of length $T_1$, a validation window of length $T_2$ and a forecasting period starting today)
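The two calibration schemes can be sketched as follows (in Python, assuming scikit-learn's SVR as the regression engine; the parameter grids, the training/validation split and the convention $\gamma = 1/(2\sigma^2)$ are our own illustrative choices, not the thesis implementation).

import numpy as np
from sklearn.svm import SVR

def svm_filter(y, T=20, dynamic_horizon=False):
    # Minimal sketch of the two calibration schemes described above (and in Algorithm 3):
    # the series y is split into training and validation windows, then C (and optionally
    # the kernel horizon sigma) is chosen by minimizing the validation error.
    y = np.asarray(y, dtype=float)
    n = len(y)
    t = np.arange(n, dtype=float).reshape(-1, 1)
    n_train = n - T                                   # last T points used for validation
    C_grid = [1.0, 10.0, 100.0, 1000.0]
    sigma_grid = [T] if not dynamic_horizon else [5.0, 10.0, 20.0, 40.0]

    best, best_err = None, np.inf
    for C in C_grid:
        for sigma in sigma_grid:
            model = SVR(kernel="rbf", C=C, gamma=1.0 / (2.0 * sigma ** 2))
            model.fit(t[:n_train], y[:n_train])
            err = np.mean((model.predict(t[n_train:]) - y[n_train:]) ** 2)
            if err < best_err:
                best, best_err = (C, sigma), err

    C_star, sigma_star = best
    model = SVR(kernel="rbf", C=C_star, gamma=1.0 / (2.0 * sigma_star ** 2)).fit(t, y)
    return model, C_star, sigma_star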

Backtesting

We first check the SVM filter on simulated data given by the Black-Scholes model of the price. We consider a stock price with annualized return $\mu = 10\%$ and annualized volatility $\sigma = 20\%$. The regression is based on one trading year of data ($n = 260$ days) with a fixed horizon of one month, $T = 20$ days. In Figure 3.8, we present the result of the SVM trend prediction with fixed horizon $T = 20$, whereas Figure 3.9 presents the SVM trend prediction for the second scheme.

3.5.3 SVM for multivariate regression

As a regression method, we can employ SVM for multivariate regression. Assume that we consider a universe of $d$ stocks $X = \left(X^{(i)}\right)_{i=1,\ldots,d}$ over a period of $n$ dates.

Figure 3.8: SVM-filtering with fixed horizon scheme


Figure 3.9: SVM-filtering with dynamic horizon scheme


Algorithm 3 SVM score construction
procedure SVM_Filter(X, y, T)
    Divide data into training set Dtrain, validation set Dvalid and testing set Dtest
    Regression on the training data Dtrain
    Construct the SVM prediction on the validation set Dvalid
    if Fixed horizon then
        σ = T
        Compute the prediction error Error(C) on Dvalid
        Minimize Error(C) and obtain the optimal parameter (C*)
    else
        Compute the prediction error Error(σ, C) on Dvalid
        Minimize Error(σ, C) and obtain the optimal parameters (σ*, C*)
    end if
    Use the optimal parameters to predict the trend on the testing set Dtest
end procedure

The performance of the index or of an individual stock that we are interested in is given by $y$. We are looking for the prediction of the value $y_{n+1}$ by using a regression on the historical data $(X_t, y_t)_{t=1,\ldots,n}$. In this case, the different stocks play the role of the factors of the vectors in the training set. We can also apply other regressions, such as the prediction of the performance of a stock based on the available information on all the factors.

Multivariate regression

We first test here the efficiency of the multivariate regression on a simulated model. Assume that all the factors follow a Brownian motion:
$$\mathrm{d}X_t^{(i)} = \mu_t\, \mathrm{d}t + \sigma_t\, \mathrm{d}B_t^{(i)} \quad \forall i = 1,\ldots,d$$
Let $(y_t)_{t=1,\ldots,n}$ be the vector to be regressed, which is related to the input $X$ by a function:
$$y_t = f(X_t) = W_t^T X_t$$
We would like to regress the vector $y = (y_t)_{t=2,\ldots,n}$ on the historical data $(X_t)_{t=1,\ldots,n-1}$ by SVM regression. This regression is given by the function $y_t = F(X_{t-1})$. Hence, the prediction of the future performance $y_{n+1}$ is given by:
$$\mathrm{E}\left[y_{n+1} \,|\, X_n\right] = F(X_n)$$
In Figure 3.10, we present the results obtained with the Gaussian kernel under the $L_1$ and $L_2$ penalty conditions, whereas in Figure 3.11 we compare the results obtained with different types of kernel. Here, we consider just a simple scheme with a lag of one trading day for the regression. In all figures, we remark this lag in the prediction of the value of $y$.
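A minimal sketch of this multivariate regression is given below (in Python, assuming numpy and scikit-learn; the simulated factor dynamics and the parameter values are illustrative), regressing $y_t$ on the lagged factors $X_{t-1}$ and predicting the next value.

import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(3)
n, d = 500, 5
X = np.cumsum(0.01 * rng.normal(size=(n, d)), axis=0)     # d simulated factor paths
w = rng.normal(size=d)
y = X @ w + 0.05 * rng.normal(size=n)                      # target series y_t = W' X_t + noise

# Regress y_t on the lagged factors X_{t-1} and predict the next value y_{n+1}
model = SVR(kernel="rbf", C=10.0, gamma=0.1)
model.fit(X[:-1], y[1:])
y_next = model.predict(X[-1:])
print(y_next)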

Figure 3.10: $L_1$-regressor versus $L_2$-regressor with Gaussian kernel for model (3.16)

yt

2 1 0 −1

Real signal L1 regression L2 regression

−2 −3 0

50

100

150

200

250

300

350

400

450

500

t

Figure 3.11: Comparison of different kernels for multivariate regression


3.6 SVM-classification in finance

We now discuss the second application of SVM in finance, namely as a stock classifier. We first test our implementations of the binary classifier and of the multi-classifier. We then employ the SVM technique to study two different problems: (i) the recognition of sectors and (ii) the construction of an SVM score for a stock picking strategy.

3.6.1 Test of SVM-classifiers

For the binary classification problem, we consider both approaches (dual and primal) to determine the boundary between two given classes based on the available information of each data point. For the multi-classification problem, we first extend the binary classifier to the multi-class case by using the binary decision tree (SVM-BDT). This algorithm has been shown to be more efficient than traditional approaches such as "one-against-all" or "one-against-one", both in computation time and in precision. The general approach of multi-SVM will then be compared to SVM-BDT.

Binary-SVM classifier — Let us compare the two proposed approaches (dual and primal) for solving the SVM-classification problem numerically. In order to run the test, we consider a random training data set of n vectors x_i with classification criterion y_i = sign(x_i). We present the comparison of the two classification approaches with a linear kernel. The result of the primal approach is obtained directly with the software of O. Chapelle [2], which implements the L2 penalty condition. Our dual solver is implemented for both L1 and L2 penalty conditions by simply employing a QP program. In Figure 3.12, we show the classification results obtained by both methods with the L2 penalty condition. A minimal sketch of the dual QP formulation is given below. We next test the non-linear classification by using the Gaussian (RBF) kernel with the binary dual solver. We generate the simulated data in the same way as in the last example with x ∈ R². The result of the classification is illustrated in Figure 3.13 for the RBF kernel with parameters C = 0.5 and σ = 2 [3].

Multi-SVM classifier — We first test the implementation of SVM-BDT on simulated data (x_i)_{i=1,...,n} which are generated randomly. We suppose that these data are distributed in N_c classes.

[2] The free software of O. Chapelle is available at http://olivier.chapelle.cc/primal/
[3] We used the "plotlssvm" function of the LS-SVM toolbox for the graphical illustration. A similar result was also obtained with the "trainlssvm" function of the same toolbox.
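The thesis solves the soft-margin dual with a generic QP routine but does not list code; the sketch below shows one possible formulation using the cvxopt QP solver. The use of cvxopt, the random one-dimensional data and the value of C are assumptions made for illustration only.

# Soft-margin linear SVM through its dual QP:
#   max_L  1'L - 0.5 L' D L   s.t.  L'y = 0, 0 <= L <= C,  with D_ij = y_i y_j x_i' x_j
import numpy as np
from cvxopt import matrix, solvers

rng = np.random.default_rng(0)
n = 100
x = rng.standard_normal(n)                    # one-dimensional training points
y = np.sign(x)                                # classification criterion y_i = sign(x_i)
X = x.reshape(-1, 1)
C = 1.0

D = (y[:, None] * y[None, :]) * (X @ X.T)     # D_ij = y_i y_j <x_i, x_j>
P = matrix(D + 1e-8 * np.eye(n))              # small ridge for numerical stability
q = matrix(-np.ones(n))
G = matrix(np.vstack([-np.eye(n), np.eye(n)]))
h = matrix(np.hstack([np.zeros(n), C * np.ones(n)]))
A = matrix(y.reshape(1, -1))
b = matrix(0.0)

sol = solvers.qp(P, q, G, h, A, b)            # minimizes 0.5 L'PL + q'L
alpha = np.array(sol["x"]).ravel()

w = (alpha * y) @ X                           # primal weight vector
sv = (alpha > 1e-6) & (alpha < C - 1e-6)      # margin support vectors
bias = np.mean(y[sv] - X[sv] @ w)
print(w, bias)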


Figure 3.12: Comparison between the dual algorithm and the primal algorithm

Figure 3.13: Illustration of non-linear classification with Gaussian kernel


In order to test our multi-SVM implementation efficiently, the response vector y = (y_1, . . . , y_n) is supposed to depend only on the first coordinate of the data vector:

$$z \sim \mathcal{U}(0,1), \qquad x_1 = N_c\, z + \epsilon\, \mathcal{N}(0,1), \qquad y = [N_c\, z], \qquad x_i \sim \mathcal{U}(0,1) \quad \forall\, i > 1$$

Here [a] denotes the integer part of a. We could generate the simulated data in a much more general way, but it would then be very hard to visualize the result of the classification. With the above choice of simulated data, we can see that in the case ε = 0 the data are separable along the axis x_1. From the geometric point of view, the space R^d is divided into N_c zones along the axis x_1: R^{d-1} × [0, 1), . . . , R^{d-1} × [N_c, N_c + 1). The boundaries are simply the N_c hyperplanes R^{d-1} crossing x_1 = 1, . . . , N_c. When we introduce some noise on the coordinate x_1 (ε > 0), the training set is no longer separable by this ensemble of linear hyperplanes: there will be some misclassified points and some deformation of the boundaries thanks to the non-linear kernel. For the sake of simplicity, we assume that the data (x, y) are already gathered by group. In Figures 3.14 and 3.15, we present the classification results for in-sample and out-of-sample data in the case ε = 0 (i.e. separable data). We then introduce noise in the data coordinate x_1 with ε = 0.2. A minimal sketch of the data generation is given below.

Figure 3.14: Illustration of multiclassification with SVM-BDT for in-sample data
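The SVM-BDT implementation used in the thesis is not reproduced here; the sketch below generates the simulated multi-class data described above and fits a standard multi-class SVM from scikit-learn (which internally uses a one-versus-one scheme) as a stand-in. The library choice and the parameter values are assumptions.

# Simulated N_c-class data whose label depends only on the first coordinate,
# classified with a standard multi-class SVM as a stand-in for the SVM-BDT of the text.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, d, Nc, eps = 500, 2, 10, 0.2

z = rng.uniform(0.0, 1.0, n)
X = rng.uniform(0.0, 1.0, (n, d))
X[:, 0] = Nc * z + eps * rng.standard_normal(n)   # noisy first coordinate
y = np.floor(Nc * z).astype(int)                  # class label from the noiseless coordinate

split = int(0.8 * n)
clf = SVC(kernel="rbf", C=20.0, gamma=1.0 / (2.0 * 2.0**2))   # sigma = 2
clf.fit(X[:split], y[:split])

in_sample = clf.score(X[:split], y[:split])       # in-sample accuracy
out_sample = clf.score(X[split:], y[split:])      # out-of-sample accuracy
print(in_sample, out_sample)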


Figure 3.15: Illustration of multiclassification with SVM-BDT for out-of-sample data

Figure 3.16: Illustration of multiclassification with SVM-BDT for ε = 0

Figure 3.17: Illustration of multiclassification with SVM-BDT for ε = 0.2

3.6.2 SVM for classification

We employ here the multi-SVM algorithm on the constituents of the Eurostoxx 300 index. Our goal is to determine the boundaries between the various sectors to which the constituents of the index belong. As the algorithm contains two main parts, classification and prediction, we can classify our stocks via their common properties resulting from the available factors. The number of misclassified stocks, or the classification error, gives us an assessment of the sector definition. We next study the recognition phase on the ensemble of tested data.

Classification of stocks by sectors — In order to classify the stocks composing the Eurostoxx 300 index, we consider the Ntrain = 100 most representative stocks in terms of market value. In order to establish the multiclass SVM classification using the binary decision tree, we sort the Ntrain = 100 assets by sector. We then employ the SVM-BDT for computing the Ntrain − 1 binary separators. In Figure 3.18, we present the classification result with the Gaussian kernel and the L2 penalty condition. For σ = 2 and C = 20, we are able to classify correctly the 100 assets over the ten main sectors: Oil & Gas, Industrials, Financials, Telecommunications, Health Care, Basic Materials, Consumer Goods, Technology, Utilities and Consumer Services. In order to check the efficiency of the classification, we test the prediction quality on a test set composed of Ntest = 50 assets. In Figure 3.19, we compare the SVM-BDT result with the true sector distribution of the 50 assets. We obtain in this case a rate of correct prediction of about 58%.

Figure 3.18: Multiclassification with SVM-BDT on the training set

Calibration procedure — As discussed above in the implementation part of the SVM solver, there are two kinds of parameters which play an important role in the classification process. The first parameter, C, concerns the error tolerance of the margin, and the second concerns the choice of the kernel (σ for the Gaussian kernel, for example). In the last example, we optimized the couple of parameters (C, σ) in order to obtain the best classifiers, which do not commit any error on the training set. However, this result holds only if the sectors are correctly defined; nothing guarantees that the given notion of sectors is the most appropriate one. Hence, the classification process should consist of two steps: (i) determine the binary SVM classifiers on the training set and (ii) calibrate the parameters on the validation set. In fact, we decide to optimize the couple of parameters (C, σ) by minimizing the realized error on the validation set, because the committed error on the training set (learning set) is always smaller than the one on the validation set (unknown set). In the second phase, we can redefine the sectors in the sense that if any asset is misclassified, we change its sector label and repeat the optimization on the validation set until convergence. At the end of the calibration procedure, we expect to obtain, first, a new recognition of sectors and, second, a multi-classifier for new assets. Since SVM uses the training set to learn the classification, it must commit fewer errors on this set than on the validation set. We therefore propose to optimize the SVM parameters by minimizing the error on the validation set.

Figure 3.19: Prediction efficiency with SVM-BDT on the validation set

We use the same error function defined in Section 3, but apply it to the validation data set V:

$$\text{Error} = \frac{1}{\operatorname{card}(\mathcal{V})} \sum_{i \in \mathcal{V}} \psi\!\left(-y_i'\, f\!\left(x_i'\right)\right)$$

where ψ(x) = I_{\{x>0\}}, with I_A the standard notation for the indicator function. However, this error function involves the step function ψ, which is discontinuous and can cause difficulties if we wish to determine the best selection parameters via the optimal test error. In order to search for the minimal test error by gradient descent, for example, we should smooth the test error by regularizing the step function:

$$\tilde{\psi}(x) = \frac{1}{1 + \exp(-Ax + B)}$$
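As an illustration of this smoothing, here is a small Python sketch; the values of A and B and the use of NumPy are illustrative assumptions, since the thesis does not specify them at this point.

# Smoothed validation error: the hard step psi(x) = 1_{x>0} is replaced by a logistic
# approximation so that the error becomes differentiable in the SVM parameters.
import numpy as np

def psi_tilde(x, A=20.0, B=0.0):
    """Smooth approximation of the indicator 1_{x > 0}."""
    return 1.0 / (1.0 + np.exp(-A * x + B))

def validation_error(y_valid, f_valid, A=20.0, B=0.0):
    """Smoothed misclassification rate on the validation set."""
    return np.mean(psi_tilde(-y_valid * f_valid, A, B))

# Example: labels in {-1, +1} and decision values f(x) on the validation set
y_valid = np.array([1, -1, 1, 1, -1])
f_valid = np.array([0.8, -0.3, -0.1, 1.2, 0.4])
print(validation_error(y_valid, f_valid))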

The choice of the parameters A and B is important: if A is too small, the approximation error is too large, whereas if A is too large, the test error is not smooth enough for the minimization procedure.

Recognition of sectors — By construction, the SVM-classifier is a very efficient method for recognizing and classifying a new element with respect to a given number of classes. However, it is not able by itself to recognize the sectors or to introduce a new, more correct definition of the available sectors over a universe of available data (stocks). In finance, the classification by sector is more


related to the origin of the stock than to its intrinsic properties in the market. A misclassified stock may cause problems for a trading strategy, for example in the case of a pair-trading strategy. Here, we try to overcome this weak point of SVM by introducing a method which modifies the initial definition of sectors. The main idea of the sector recognition procedure is the following. We divide the available data into two sets: a training set and a validation set. We employ the training set to learn the classification and the validation set to optimize the SVM parameters. We start with the initial definition of the given sectors. Within each iteration, we learn on the training set in order to determine the classifiers, then we compute the validation error. An optimization procedure on the validation error helps us to determine the optimal parameters of the SVM. For each ensemble of optimal parameters, we may encounter some errors on the training set. If the validation error is smaller than a certain threshold and there is no error on the training set, we have reached the optimal configuration of the sector definition. If there are errors on the training set, we relabel the misclassified data points and define new sectors with this correction. All the sector labels are changed by this rule for both the training and validation sets. The iteration procedure is repeated until no error on the training set is committed, for a given expected threshold of error on the validation set. This sector recognition procedure is summarized in the following algorithm:

Algorithm 4 Sector recognition by SVM classification
procedure SVM_SectorRecognition(X, y, ε)
    Divide the historical data into a training set T and a validation set V
    Initialize the sector labels with the physical sector names: Sec^0_1, . . . , Sec^0_m
    while E_T > ε do
        while E_V > ε do
            Compute the SVM separators for the labels Sec_1, . . . , Sec_m on T for given (C, σ)
            Construct the SVM predictor from the separators Sec_1, . . . , Sec_m
            Compute the error E_V on the validation set
            Update the parameters (C, σ) until convergence of E_V
        end while
        Compute the error E_T on the training set
        Identify the misclassified points of the training set
        Relabel the misclassified points and update the definition of the sectors
    end while
end procedure

3.6.3 SVM for score construction and stock selection

Traditionally, in order to improve the stock picking, we rank the stocks by constructing a "score" based on all the characteristics (so-called factors) of the considered stock. We require that the construction of this global quantity (a combination of factors)


must satisfy some classification criterion, for example the performance. We denote by (x_i)_{i=1,...,n} the data, where x_i is the ensemble of factors for the i-th stock. The classification criterion, such as the performance, is denoted by the vector y = (y_i)_{i=1,...,n}. The aim of the SVM-classifier in this problem is to recognize which stocks (scores) belong to the high or low performance class (outperforming or underperforming). More precisely, we have to identify a separation boundary as a function of score and performance, f(x, y). Hence, the SVM stock picking consists of two steps: (i) construction of the factor ensemble (i.e. harmonizing all the characteristics of a given stock, such as the price, the risk, macro properties, etc., into comparable quantities); (ii) application of the SVM-classifier algorithm with an adaptive choice of parameters. In the following, we first give a brief description of score constructions and then establish the backtest of the stock-picking strategy.

Probit model for score construction — We summarize here briefly the main idea of the score construction by the Probit model. Assume that a set of training data (x_i, y_i)_{i=1,...,n} is available, where x is the vector of factors and y is the binary response. We look for a conditional probability distribution of the random variable Y at a given point X. This probability distribution can be used later to predict the response of a new data point x_new. The Probit model estimates this conditional probability in the form:

$$\Pr(Y = 1 \mid X) = \Phi\!\left(X^\top \beta + \alpha\right)$$

with Φ(x) the cumulative distribution function (CDF) of the standard normal distribution. The couple of parameters (α, β) can be obtained by maximum likelihood estimation. The choice of the function Φ(x) is quite natural when working with a binary random variable, because it provides a symmetric probability distribution.

Remark 9 — This model can be written in another form with the introduction of a hidden random variable: Y⋆ = X^⊤β + α + ε, where ε ∼ N(0, 1). Hence, Y can be interpreted as an indicator of whether Y⋆ is positive:

$$Y = I_{\{Y^\star > 0\}} = \begin{cases} 1 & \text{if } Y^\star > 0 \\ 0 & \text{otherwise} \end{cases}$$

In finance, we can employ this model for the score construction. We define the binary variable Y from the relative return of a given asset with respect to the benchmark: Y = 1 if the return of the asset is higher than that of the benchmark and Y = 0 otherwise. Hence, Pr(Y = 1|X) is the probability that the given asset, with vector of factors X, outperforms. Naturally, we can define this quantity as a score measuring the probability of gain over the benchmark:

$$S = \Pr(Y = 1 \mid X)$$


In order to estimate the regression parameters (α, β), we maximize the log-likelihood function:

$$\mathcal{L}(\alpha, \beta) = \sum_{i=1}^{n} \left\{ y_i \ln \Phi\!\left(x_i^\top \beta + \alpha\right) + (1 - y_i) \ln\!\left[1 - \Phi\!\left(x_i^\top \beta + \alpha\right)\right] \right\}$$

Using the parameters estimated by maximum likelihood, we can predict the score of a given asset with factor vector X as follows:

$$\hat{S} = \Phi\!\left(X^\top \hat{\beta} + \hat{\alpha}\right)$$

The probability distribution of the score Ŝ can be computed by the empirical formula

$$\Pr\left(\hat{S} < s\right) = \frac{1}{n} \sum_{i=1}^{n} I_{\{\hat{S}_i < s\}}$$

Numerical test — We test the Probit score construction on simulated data generated from the latent-variable form of the model, Y = I_{\{X^\top \beta_0 + \alpha_0 + \epsilon > 0\}} with ε ∼ N(0, 1).

Here, the parameters of the model, α0 and β0, are chosen as α0 = 0.1 and β0 = 1. We employ the Probit regression in order to determine the scores of n = 500 simulated data points in the cases d = 2 and d = 5. The comparisons between the Probit score and the simulated score are presented in Figures 3.20-3.22. A minimal sketch of this Probit score construction is given below.
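The thesis does not list code for this test; the sketch below fits the Probit score on simulated data under stated assumptions: statsmodels is used for the maximum-likelihood fit and the factor distribution is an illustrative choice.

# Probit score construction on simulated data: Y = 1{X'b0 + a0 + eps > 0},
# then S_hat = Phi(X'b_hat + a_hat) estimated by maximum likelihood.
import numpy as np
import statsmodels.api as sm
from scipy.stats import norm

rng = np.random.default_rng(0)
n, d = 500, 2
alpha0, beta0 = 0.1, np.ones(d)                   # true parameters (beta0 = 1 for each factor)

X = rng.standard_normal((n, d))                   # illustrative factor distribution
eps = rng.standard_normal(n)
y = (X @ beta0 + alpha0 + eps > 0).astype(int)    # binary response

model = sm.Probit(y, sm.add_constant(X))          # the constant term plays the role of alpha
result = model.fit(disp=0)
score_hat = result.predict(sm.add_constant(X))    # S_hat = Pr(Y = 1 | X)

# "True" simulated score for comparison: Pr(Y = 1 | X) = Phi(X'b0 + a0)
score_true = norm.cdf(X @ beta0 + alpha0)
print(np.corrcoef(score_hat, score_true)[0, 1])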


Figure 3.20: Comparison between simulated score and Probit score for d = 2

Figure 3.21: Comparison between simulated score CDF and Probit score CDF for d = 2

Figure 3.22: Comparison between simulated score PDF and Probit score PDF for d = 2

SVM score construction — We now discuss how to employ SVM to construct the score for a given ensemble of assets. In the work of G. Simon (2005), the SVM score is constructed by using the SVM-regression algorithm. Indeed, with SVM-regression we are able to forecast the future performance E[µ_{t+1} | X_t] = µ̂_t based on the present ensemble of factors; this value can then be employed directly as the prediction in a trend-following strategy, without any need for a score construction. We propose here another use of the SVM algorithm, based on SVM-classification, for building scores which later allow us to implement long/short strategies by using selection curves. Our main idea of the SVM-score construction is very similar to the Probit model. We first define a binary variable Y_i = ±1 associated with each asset x_i. This variable characterizes the performance of the asset with respect to the benchmark: if Y_i = −1 the stock underperforms, whereas if Y_i = 1 the stock outperforms. We next employ the binary SVM-classification to separate the universe of stocks into two classes, high performance and low performance. Finally, we define the score of each stock as its distance to the decision boundary. A minimal sketch of this construction is given below.
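The following Python sketch illustrates the idea under stated assumptions: scikit-learn's SVC decision function is used as the signed decision value (proportional to the distance to the boundary), and the min-max rescaling of the score to [0, 1] is an illustrative choice not specified in the text.

# SVM score: classify out/under-performance (+1/-1) and use the signed decision value
# with respect to the boundary as the raw score of each stock.
import numpy as np
from sklearn.svm import SVC

def svm_scores(X, y, C=1.0, sigma=2.0):
    """X: (n, d) factor matrix, y: labels in {-1, +1} (out/under-performance)."""
    clf = SVC(kernel="rbf", C=C, gamma=1.0 / (2.0 * sigma**2))
    clf.fit(X, y)
    raw = clf.decision_function(X)                        # signed decision value
    return (raw - raw.min()) / (raw.max() - raw.min())    # rescaled to [0, 1]

# Illustrative example with simulated factors
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
y = np.where(X[:, 0] + 0.5 * rng.standard_normal(200) > 0, 1, -1)
print(svm_scores(X, y)[:5])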

Selection curve — In order to construct a simple strategy of the long/short type, for example, we must be able to establish a selection rule based on the score obtained by the Probit model or by the SVM. Depending on the strategy (long, short or long/short), we expect to build a selection curve which determines the portion of selected assets together with the associated level of error. For a long strategy, we prefer to buy a certain portion of high-performance stocks with knowledge of the possible committed error. To do so, we define a

Support Vector Machine in Finance

selection curve, for which the score plays the role of the parameter:

$$Q(s) = \Pr(S \geq s), \qquad E(s) = \Pr(S \geq s \mid Y = 0), \qquad \forall\, s \in [0, 1]$$

This parametric curve can be traced in the square [0, 1] × [0, 1], as shown in Figure 3.23. On the x-axis, Q(s) defines the quantile corresponding to the stock selection among the considered universe of stocks. On the y-axis, E(s) defines the committed error corresponding to the stock selection; precisely, for a certain quantile, it measures the chance of picking a badly performing stock. Two trivial limits are the points (0, 0) and (1, 1): the first corresponds to the limit with no selection, whereas the second corresponds to the limit where everything is selected. A good score construction method should produce a selection curve that is as convex as possible, because this guarantees a selection with fewer errors. A small sketch of the computation of this curve is given below.

Figure 3.23: Selection curve for long strategy for simulated data and Probit model
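No implementation is given in the text; the following sketch computes the long-strategy selection curve (Q(s), E(s)) from scores and binary outcomes, under the assumption that the curve is evaluated on a grid of thresholds.

# Selection curve for a long strategy: Q(s) = Pr(S >= s), E(s) = Pr(S >= s | Y = 0),
# traced over a grid of score thresholds s in [0, 1].
import numpy as np

def selection_curve(scores, y, n_grid=101):
    """scores in [0, 1]; y = 1 for outperformance, 0 otherwise."""
    s_grid = np.linspace(0.0, 1.0, n_grid)
    Q = np.array([np.mean(scores >= s) for s in s_grid])
    bad = scores[y == 0]                               # scores of the badly performing stocks
    E = np.array([np.mean(bad >= s) for s in s_grid])
    return Q, E

# Illustrative example
rng = np.random.default_rng(0)
scores = rng.uniform(0.0, 1.0, 500)
y = (rng.uniform(size=500) < scores).astype(int)       # better scores -> more likely to outperform
Q, E = selection_curve(scores, y)
print(Q[:5], E[:5])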


Reciprocally, for a short strategy, the selection curve can be obtained by tracing the following parametric curve:

$$Q(s) = \Pr(S \leq s), \qquad E(s) = \Pr(S \leq s \mid Y = 1), \qquad \forall\, s \in [0, 1]$$

Here, Q(s) allows us to determine the quantile of low-performance stocks to be shorted, while E(s) helps us to avoid selling the high-performance ones. As the selection

Figure 3.24: Probit scores for Eurostoxx data with d = 20 factors

curve is independent of the score definition, it is an appropriate quantity for comparing different scoring techniques. In the following, we employ the selection curve to compare the score constructions of the Probit model and of the SVM. Figure 3.24 shows the comparison of the selection curves constructed from the SVM score and the Probit score on the training set. Here, we did not perform any calibration of the SVM parameters.

Backtesting and comparison — As in the earlier discussion on the regression, we have to build a cross-validation procedure to optimize the SVM parameters. We follow the traditional routine by dividing the data into three independent sets: (i) a training set, (ii) a validation set and (iii) a testing set. The classifier is obtained from the training set, whereas its optimal parameters (C, σ) are obtained by minimizing the fitting error on the validation set. The efficiency of the SVM algorithm is finally checked on the testing set. We summarize the cross-validation procedure in the algorithm below. In order to make the training set close to both the validation data and the testing data, we decide to divide the data in the following time order: validation set, training set and testing set. In this way, the prediction score on the testing set contains more information from the recent past. We now employ this procedure to compute the SVM score on the universe of stocks of the Eurostoxx index. Figure 3.25 presents the construction of the score based on the training set and the validation set.

Algorithm 5 SVM score construction
procedure SVM_Score(X, y)
    Divide data into training set Dtrain, validation set Dvalid and testing set Dtest
    Classify the training data by using the high/low performance criterion
    Compute the decision boundary on Dtrain
    Construct the SVM score on Dvalid by using the distance to the decision boundary
    Compute Error(σ, C), the prediction and classification error on Dvalid
    Minimize Error(σ, C) and obtain the optimal parameters (σ⋆, C⋆)
    Use the optimal parameters to compute the final SVM score on the testing set Dtest
end procedure

The SVM parameters are optimized on the validation set, while the final score construction uses both the training and validation sets in order to have the largest possible data ensemble.

Figure 3.25: SVM scores for Eurostoxx data with d = 20 factors


3.7 Conclusion

The support vector machine is a well-established method that is very widely used in various domains. From the financial point of view, this method can be used to recognize and predict high-performance stocks. Hence, SVM is a good indicator for building efficient trading strategies over a universe of stocks. Within this chapter, we first revisited the basic idea of SVM in both the classification and regression contexts.


The extension to the multi-classification case was also discussed in detail, and various applications of the technique were introduced. The first class of applications employs SVM as a forecasting method for time series. We proposed two such applications: the first consists in using SVM as a signal filter, whose advantage is that the model parameters can be calibrated using only the available data; the second employs SVM as a multi-factor regression technique, which allows us to refine the prediction with additional inputs such as economic factors. For the second class of applications, we dealt with SVM classification. The two main applications discussed in the scope of this chapter are the score construction and the sector recognition. Both resulting pieces of information are important for building momentum strategies, which are at the core of modern asset management.


Bibliography

[1] Allwein E.L. et al. (2000), Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers, Journal of Machine Learning Research, 1, pp. 113-141.
[2] At A. (2005), Optimisation d'un Score de Stock Screening, Rapport de stage ENSAE, Société Générale Asset Management.
[3] Basak D., Pal S. and Patranabis D.J. (2007), Support Vector Regression, Neural Information Processing, 11, pp. 203-224.
[4] Ben-Hur A. and Weston J. (2010), A User's Guide to Support Vector Machines, Methods in Molecular Biology, 609, pp. 223-239.
[5] Burges C.J.C. (1998), A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 2, pp. 121-167.
[6] Chapelle O. (2002), Support Vector Machine: Induction Principles, Adaptive Tuning and Prior Knowledge, PhD thesis, Paris 6.
[7] Chapelle O. et al. (2002), Choosing Multiple Parameters for Support Vector Machine, Machine Learning, 46, pp. 131-159.
[8] Chapelle O. (2007), Training a Support Vector Machine in the Primal, Neural Computation, 19, pp. 1155-1178.
[9] Cortes C. and Vapnik V. (1995), Support-Vector Networks, Machine Learning, 20, pp. 273-297.
[10] Crammer K. and Singer Y. (2001), On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines, Journal of Machine Learning Research, 2, pp. 265-292.
[11] Gestel T.V. et al. (2001), Financial Time Series Prediction Using Least Squares Support Vector Machines Within the Evidence Framework, IEEE Transactions on Neural Networks, 12, pp. 809-820.
[12] Madzarov G. et al. (2009), A Multi-class SVM Classifier Utilizing Binary Decision Tree, Informatica, 33, pp. 233-241.


[13] Milgram J. et al. (2006), "One Against One" or "One Against All": Which One is Better for Handwriting Recognition with SVMs?, Tenth International Workshop on Frontiers in Handwriting Recognition.
[14] Potluru V.K. et al. (2009), Efficient Multiplicative Updates for Support Vector Machines, Proceedings of the 2009 SIAM Conference on Data Mining.
[15] Simon G. (2005), L'Économétrie Non Linéaire en Gestion Alternative, Rapport de stage ENSAE, Société Générale Asset Management.
[16] Tay F.E.H. and Cao L.J. (2002), Modified Support Vector Machines in Financial Time Series Forecasting, Neurocomputing, 48, pp. 847-861.
[17] Tsochantaridis I. et al. (2004), Support Vector Machine Learning for Interdependent and Structured Output Spaces, Proceedings of the 21st International Conference on Machine Learning, Banff, Canada.
[18] Vapnik V. (1998), Statistical Learning Theory, John Wiley and Sons, New York.


Chapter 4

Analysis of Trading Impact in the CTA Strategy

We review in this chapter trend-following strategies within the Kalman filter framework and study the impact of the trend estimator error. We first study the momentum strategy in the single-asset case and then generalize the analysis to the multi-asset case. In order to construct the allocation strategy, we employ the observed trend, which is filtered by an exponential moving average. It can be shown that the cumulated return of the strategy can be broken down into two important parts: the option profile, which is similar in concept to the straddle profile suggested by Fung and Hsieh (2001), and the trading impact, which directly relates the estimator error to the efficiency of the strategy. We focus in this chapter on the second quantity by estimating its probability distribution function and the associated gain and loss expectations. We illustrate how the number of assets and their correlations influence the performance of a strategy via a "toy model". This study reveals important results which can be directly tested on a CTA fund such as the "Epsilon" fund.

Keywords: CTA, Momentum strategy, Trend following, Kalman filter, Trading impact, Chi-square distribution.

4.1 Introduction

Trend-following strategies are a specific example of an investment style that has recently emerged as an industry. They are run by so-called Commodity Trading Advisors (CTAs) and play an important role in the hedge fund industry (about 15% of total hedge fund AUM). Recently, this investment style has been carefully reviewed and analyzed in the 7th Lyxor White Paper. We present here complementary results to this paper and give a more specific analysis of a typical CTA. We focus on the trading impact by estimating its probability distribution function and the associated gain and loss expectations. We illustrate how the number of assets and their correlations influence the performance of a strategy via a "toy model". This


study reveals important results which can be directly tested on a CTA fund such as the "Epsilon" fund. This chapter is organized as follows. In the first part, we recall the main result on the trend-following strategy in the univariate case, which was demonstrated in the 7th Lyxor White Paper. We then generalize this result to the multivariate case, which establishes a framework for studying the impact of the correlation and of the number of assets in a CTA fund. Finally, we finish with the study of a toy model which allows us to understand the efficiency of the trend-following strategy.

4.2 Conclusion

Momentum strategies are efficient ways to use the market tendency for building trading strategies. Hence, a good estimator of the trend is essential from this perspective. In this chapter, we study the impact of the estimator error on a trend-following strategy, both in the single-asset and multi-asset cases. The objective is twofold. First, we have established a general framework for analyzing a CTA fund. Second, we illustrate important results on the trading impact of the CTA strategy via a simple "toy model". We have shown that the gain probability and the gain expectation depend strongly on the correlation and on the number of assets. Increasing the number of assets can help to improve the performance and reduce the risk (volatility) of a momentum strategy. However, when the number of assets reaches a certain limit, we observe a saturation of the performance. This implies that, above this limit, adding more assets does not improve the performance much, but it does make the strategy more complicated and increases the management cost, as the portfolio is rebalanced frequently. The correlation between assets plays an important role as well. As usual, the higher the correlation level, the less efficient the strategies are. Interestingly, we remark that when the correlation increases, we approach the single-asset limit, in which the gain probability is smaller than the loss probability but the conditional expectation of gains is much higher than the conditional expectation of losses.


Bibliography [1] Al-Naffouri T. Y. Babak H. (2009), On the Distribution of Indefinite Quadratic Forms in Gaussian Random Variables, Information Theory, pp. 1744 - 1748 . [2] Davies R. B.(1973), Numerical Inversion of Characteristic Function, Biometrika, 60, pp. 415-417. [3] Davies R. B. (1980), The Distribution of a Linear Combination of χ2 Random Variables, Applied Statistics, 29, pp. 323-333. [4] Imhoff J. P.(1961), Computing the Distribution of Quadratic Form in Normal variables, Biometrika, 48, pp. 419-426. [5] Khatri C. G.(1978), A remark on the necessary and sufficient conditions for a quadratic form to be distributed as a chi-square, Biometrika, 65, pp. 239-240. [6] Kotz S., Johnson N.L. and Boyd D.W. (1967), Series Representations of Distributions of Quadratic Forms in Normal Variables II. Non-Central Case, The Annals of Mathematical Statistics, 38, pp. 838-848. [7] Murison R. (2005), Distribution theory and inference, School of Science and Technology , ch. 6, pp. 86-88. [8] Ruben H.(1962), Probability Content of Regions Under Spherical Normal Distributions, IV: The Distribution of Homogeneous and Non-Homogeneous Quadratic Functions of Normal Variables, The Annals of Mathematical Statistics, 33, pp. 542-570. [9] Ruben H.(1962), A New Result on the Distribution of Quadratic Forms, The Annals of Mathematical Statistics, 34, pp. 1582-1584. [10] Shah B.K. (1963) Distribution of Definite and of Indefinite Quadratic Forms from a Non- Central Normal Distribution, The Annals of Mathematical Statistics, 34, pp. 186-190. [11] Shah B.K. and Khatri C.G. (1961) Distribution of a Definite Quadratic Form for Non-Central Normal Variates, The Annals of Mathematical Statistics, 32, pp. 883-887. 111


[12] Tziritas G. G.(1987), On the Distribution of Positive-definite Gaussian Quadratic Forms, IEEE Transtractions on Information Theory, 33, pp. 895906.


Conclusions

During the internship in the R&D team of Lyxor Asset Management, I had the chance to work on many interesting topics concerning quantitative asset management. Beyond this report, the results obtained during the stay have been employed for the 8th edition of the Lyxor White Paper series. The main results of this internship can be divided into three broad lines. The first consists in improving the trend and volatility estimations, which are important quantities for implementing dynamical strategies. The second concerns the application of machine learning technology in finance: we employ the support vector machine for forecasting the expected return of financial assets and for obtaining a criterion for stock selection. The third main result is devoted to the analysis of the performance of the trend-following strategy (CTA) in the general case; it consists in studying the efficiency of a CTA under changes in the market, such as the correlation between the assets or their performance.

In the first part, we focused on improving the trend and volatility estimations in order to implement two crucial momentum strategies: trend-following and vol-target. We show that we can use L1 filters to forecast the trend of the market in a very simple way. We also propose a cross-validation procedure to calibrate the optimal regularization parameter λ, where the only information to provide is the investment time horizon. More sophisticated models based on local and global trends are also discussed; we remark that these models can reflect the effect of mean-reversion towards the global trend of the market. Finally, we consider several backtests on the S&P 500 index and obtain competitive results with respect to the traditional moving-average filter. On the other hand, vol-target strategies are efficient ways to control risk when building trading strategies; hence, a good estimator of the volatility is essential from this perspective. In this report, we present improvements in the forecasting of volatility by using some novel techniques. The use of high and low prices is less important for the index, as it gives more or less the same result as the traditional close-to-close estimator. However, for an individual stock with a higher volatility level, the high-low estimators improve the prediction of volatility. We consider several backtests on the S&P 500 index and obtain competitive results with respect to the traditional moving-average estimator of volatility. Indeed, we consider a simple stochastic volatility model which permits the integration of the volatility dynamics into the estimator. An optimization scheme via the maximum likelihood algorithm allows us to obtain dynamically the optimal averaging window. We also compare these


results for the range-based estimators with the well-known IGARCH model. The comparison between the optimal values of the likelihood functions for the various estimators also gives us a ranking of the estimation errors. Finally, we studied high-frequency volatility estimators, which are a very active topic in financial mathematics. Using the simple model proposed by Zhang et al. (2005), we show that the microstructure noise can be eliminated by the two-time-scale estimator.

The support vector machine is a well-established method that is very widely used in various domains. From the financial point of view, this method can be used to recognize and predict high-performance stocks; SVM is thus a good indicator for building efficient trading strategies over a stock universe. In the second part of this report, we first revisited the basic idea of SVM in both the classification and regression contexts. The extension to the multi-classification case was also discussed in detail, and various applications of the technique were introduced. The first class of applications employs SVM as a forecasting method for time series. We proposed two such applications: the first consists in using SVM as a signal filter, whose advantage is that the model parameters can be calibrated using only the available data; the second employs SVM as a multi-factor regression technique, which allows us to refine the prediction with additional inputs such as economic factors. For the second class of applications, we dealt with SVM classification. The two main applications discussed in the scope of this report are the score construction and the sector recognition. Both resulting pieces of information are important for building momentum strategies, which play an important role in Lyxor's quantitative management.

Finally, we carried out a detailed analysis of the performance of the trend-following strategy in order to understand its important role in risk diversification and in optimizing the absolute return. In the third part, we studied the impact of the estimator error and of market parameters, such as the correlation and the average performance of individual stocks, on a trend-following strategy, both in the single-asset and multi-asset cases. The objective of this part is twofold. First, we established a general framework for analyzing a CTA fund. Second, we illustrate important results on the trading impact of the CTA strategy via a simple "toy model". We have shown that the gain probability and the gain expectation depend strongly on the correlation and on the number of assets. Increasing the number of assets can help to improve the performance and reduce the risk (volatility) of a momentum strategy. However, when the number of assets reaches a certain limit, we observe a saturation of the performance. This implies that, above this limit, adding more assets does not improve the performance very much, but it does make the strategy more complicated and increases the management cost, as the portfolio is rebalanced frequently. The correlation between assets plays an important role as well. As usual, the higher the correlation level, the less efficient the strategies are. Interestingly, we remark that when the correlation increases, we approach the single-asset limit, in which the gain probability is smaller than the loss probability but the conditional expectation of gains is much higher than the conditional expectation of losses.

Appendix A

Appendix of chapter 1

A.1 Computational aspects of L1 and L2 filters

A.1.1 The dual problem

The L1 − T filter — This problem can be solved by considering the dual problem, which is a QP program. We first rewrite the primal problem with the new variable z = Dx:

$$\min \; \frac{1}{2} \left\| y - x \right\|_2^2 + \lambda \left\| z \right\|_1 \qquad \text{u.c.} \quad z = Dx$$

We now construct the Lagrangian function with the dual variable ν ∈ R^{n-2}:

$$L(x, z, \nu) = \frac{1}{2} \left\| y - x \right\|_2^2 + \lambda \left\| z \right\|_1 + \nu^\top (Dx - z)$$

The dual objective function is obtained in the following way:

$$\inf_{x,z} L(x, z, \nu) = -\frac{1}{2} \nu^\top D D^\top \nu + y^\top D^\top \nu$$

for −λ1 ≤ ν ≤ λ1. According to the Kuhn-Tucker theorem, the initial problem is equivalent to the dual problem:

$$\min \; \frac{1}{2} \nu^\top D D^\top \nu - y^\top D^\top \nu \qquad \text{u.c.} \quad -\lambda \mathbf{1} \leq \nu \leq \lambda \mathbf{1}$$

This QP program can be solved by a traditional Newton algorithm or by interior-point methods, and the final solution for the trend reads x⋆ = y − D^⊤ν⋆. A small numerical sketch of this dual formulation is given below.
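For illustration, here is a minimal Python sketch of this dual QP; the use of the cvxopt solver, the simulated input and the value of λ are assumptions, since the thesis solves the same program with its own interior-point implementation.

# L1-T trend filtering through its dual QP:
#   min_nu 0.5 nu' D D' nu - y' D' nu   s.t.  -lambda <= nu <= lambda,   x* = y - D' nu*
import numpy as np
from cvxopt import matrix, solvers

def l1_trend_filter(y, lam):
    n = len(y)
    # Second-order difference operator D of size (n-2) x n
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    P = matrix(D @ D.T)
    q = matrix(-(D @ y))
    G = matrix(np.vstack([np.eye(n - 2), -np.eye(n - 2)]))
    h = matrix(lam * np.ones(2 * (n - 2)))
    nu = np.array(solvers.qp(P, q, G, h)["x"]).ravel()
    return y - D.T @ nu                      # filtered trend x*

# Illustrative example on a noisy piecewise-linear signal
rng = np.random.default_rng(0)
t = np.arange(200)
signal = np.concatenate([0.05 * t[:100], 5.0 - 0.02 * (t[100:] - 100)])
y = signal + 0.5 * rng.standard_normal(200)
x_star = l1_trend_filter(y, lam=50.0)
print(x_star[:5])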


The L1 − C filter — The optimization procedure for the L1 − C filter follows the same strategy as for the L1 − T filter. We obtain the same quadratic program, with the operator D replaced by the (n − 1) × n matrix which is the discrete version of the first-order derivative:

$$D = \begin{pmatrix} -1 & 1 & & & \\ & -1 & 1 & & \\ & & \ddots & \ddots & \\ & & & -1 & 1 \end{pmatrix}$$

The L1 − T C filter

In order to follow the same strategy as above, we introduce two additional variables z1 = D1 x and z2 = D2 x. The initial problem becomes:

$$\min \; \frac{1}{2} \left\| y - x \right\|_2^2 + \lambda_1 \left\| z_1 \right\|_1 + \lambda_2 \left\| z_2 \right\|_1 \qquad \text{u.c.} \quad z_1 = D_1 x, \;\; z_2 = D_2 x$$

The Lagrangian function with the dual variables ν1 ∈ R^{n-1} and ν2 ∈ R^{n-2} is:

$$L(x, z_1, z_2, \nu_1, \nu_2) = \frac{1}{2} \left\| y - x \right\|_2^2 + \lambda_1 \left\| z_1 \right\|_1 + \lambda_2 \left\| z_2 \right\|_1 + \nu_1^\top (D_1 x - z_1) + \nu_2^\top (D_2 x - z_2)$$

whereas the dual objective function is:

$$\inf_{x, z_1, z_2} L(x, z_1, z_2, \nu_1, \nu_2) = -\frac{1}{2} \left\| D_1^\top \nu_1 + D_2^\top \nu_2 \right\|_2^2 + y^\top \left( D_1^\top \nu_1 + D_2^\top \nu_2 \right)$$

for −λ_i 1 ≤ ν_i ≤ λ_i 1 (i = 1, 2). Introducing the variables z = (z1, z2) and ν = (ν1, ν2), the initial problem is equivalent to the dual problem:

$$\min \; \frac{1}{2} \nu^\top Q \nu - R^\top \nu \qquad \text{u.c.} \quad -\nu^{+} \leq \nu \leq \nu^{+}$$

with $D = \begin{pmatrix} D_1 \\ D_2 \end{pmatrix}$, $Q = D D^\top$, $R = D y$ and $\nu^{+} = \begin{pmatrix} \lambda_1 \mathbf{1} \\ \lambda_2 \mathbf{1} \end{pmatrix}$. The solution of the primal problem is then given by x⋆ = y − D^⊤ν⋆.

The L1 − T multivariate filter — As in the univariate case, this problem can be solved by considering the dual problem, which is a QP program. The primal problem is:

$$\min \; \frac{1}{2} \sum_{i=1}^{m} \left\| y^{(i)} - x \right\|_2^2 + \lambda \left\| z \right\|_1 \qquad \text{u.c.} \quad z = Dx$$


Let us define ȳ = (ȳ_t) with ȳ_t = m^{-1} \sum_{i=1}^{m} y_t^{(i)}. The dual objective function becomes:

$$\inf_{x,z} L(x, z, \nu) = -\frac{1}{2} \nu^\top D D^\top \nu + \bar{y}^\top D^\top \nu + \frac{1}{2} \sum_{i=1}^{m} \left( y^{(i)} - \bar{y} \right)^\top \left( y^{(i)} - \bar{y} \right)$$

for −λ1 ≤ ν ≤ λ1. According to the Kuhn-Tucker theorem, the initial problem is equivalent to the dual problem:

$$\min \; \frac{1}{2} \nu^\top D D^\top \nu - \bar{y}^\top D^\top \nu \qquad \text{u.c.} \quad -\lambda \mathbf{1} \leq \nu \leq \lambda \mathbf{1}$$

This QP program can be solved by a traditional Newton algorithm or by interior-point methods, and the solution is x⋆ = ȳ − D^⊤ν⋆.

A.1.2 The interior-point algorithm

We present briefly the interior-point algorithm of Boyd and Vandenberghe (2009) in the case of the following optimization problem:

$$\min \; f_0(x) \qquad \text{u.c.} \quad Ax = b, \;\; f_i(x) < 0 \;\; \text{for } i = 1, \ldots, m$$

where f_0, ..., f_m : R^n → R are convex and twice continuously differentiable and rank(A) = p < n. The inequality constraints become implicit if one rewrites the problem as:

$$\min \; f_0(x) + \sum_{i=1}^{m} I_{-}\left(f_i(x)\right) \qquad \text{u.c.} \quad Ax = b$$

where I_−(u) : R → R is the non-positive indicator function

$$I_{-}(u) = \begin{cases} 0 & u \leq 0 \\ \infty & u > 0 \end{cases}$$

This indicator function is discontinuous, hence the Newton method cannot be applied. In order to overcome this problem, we approximate I_−(u) by the logarithmic barrier function \hat{I}_{-}(u) = -\tau^{-1} \ln(-u) with τ → ∞. Finally, the Kuhn-Tucker condition for this approximated problem gives r_τ(x, λ, ν) = 0 with:

$$r_\tau(x, \lambda, \nu) = \begin{pmatrix} \nabla f_0(x) + \nabla f(x)^\top \lambda + A^\top \nu \\ -\operatorname{diag}(\lambda)\, f(x) - \tau^{-1} \mathbf{1} \\ Ax - b \end{pmatrix}$$


The solution of r_τ(x, λ, ν) = 0 can be obtained by Newton's iteration for the triple y = (x, λ, ν):

$$r_\tau(y + \Delta y) \simeq r_\tau(y) + \nabla r_\tau(y)\, \Delta y = 0$$

This equation gives the Newton step Δy = −∇r_τ(y)^{-1} r_τ(y), which defines the search direction.

A.1.3 The scaling of the smoothing parameter of the L1 filter

We can try to estimate the order of magnitude of the parameter λmax by considering the continuous case. Assume that the signal is a process W_t. The value of λmax in the discrete case, defined by

$$\lambda_{\max} = \left\| \left( D D^\top \right)^{-1} D y \right\|_\infty$$

can be considered as the first primitive I_1(T) = \int_0^T W_t\,\mathrm{d}t of the process W_t if D = D_1 (L1 − C filtering), or as the second primitive I_2(T) = \int_0^T \int_0^t W_s\,\mathrm{d}s\,\mathrm{d}t of W_t if D = D_2 (L1 − T filtering). We have:

$$I_1(T) = \int_0^T W_t\,\mathrm{d}t = W_T\, T - \int_0^T t\,\mathrm{d}W_t = \int_0^T (T - t)\,\mathrm{d}W_t$$

The process I_1(T) is a Wiener integral (i.e. a Gaussian process) with variance:

$$\mathbb{E}\left[I_1^2(T)\right] = \int_0^T (T - t)^2\,\mathrm{d}t = \frac{T^3}{3}$$

In this case, we expect that λmax ∼ T^{3/2}. The second-order primitive can be calculated in the following way:

$$I_2(T) = \int_0^T I_1(t)\,\mathrm{d}t = I_1(T)\,T - \int_0^T t\,W_t\,\mathrm{d}t = I_1(T)\,T - \frac{T^2}{2} W_T + \int_0^T \frac{t^2}{2}\,\mathrm{d}W_t = \frac{1}{2}\int_0^T (T - t)^2\,\mathrm{d}W_t$$


This quantity is again a Gaussian process with variance:

$$\mathbb{E}\left[I_2^2(T)\right] = \frac{1}{4} \int_0^T (T - t)^4\,\mathrm{d}t = \frac{T^5}{20}$$

In this case, we expect that λmax ∼ T 5/2 .

A.1.4 Calibration of the L2 filter

We discuss here how to calibrate the L2 filter in order to extract the trend with respect to the investment time horizon T. Although the L2 filter admits an explicit solution, which is a great advantage for numerical implementation, the calibration of the smoothing parameter λ is not trivial. We propose to calibrate the L2 filter by comparing the spectral density of this filter with the one obtained with the moving-average filter. For this last filter, we have:

$$\hat{x}_t^{\mathrm{MA}} = \frac{1}{T} \sum_{i=t-T}^{t-1} y_i$$

It follows that the spectral density is:

$$f^{\mathrm{MA}}(\omega) = \frac{1}{T^2} \left| \sum_{t=0}^{T-1} e^{-i\omega t} \right|^2$$

For the L2 filter, we know that the solution is x̂^{HP} = (1 + 2λ D^⊤ D)^{-1} y. Therefore, the spectral density is:

$$f^{\mathrm{HP}}(\omega) = \frac{1}{1 + 4\lambda\,(3 - 4\cos\omega + \cos 2\omega)} \simeq \frac{1}{1 + 2\lambda\,\omega^4}$$

The width of the spectral density for the L2 filter is then (2λ)^{-1/4}, whereas it is 2πT^{-1} for the moving-average filter. Calibrating the L2 filter can be done by matching these two quantities. Finally, we obtain the following relationship:

$$\lambda \propto \lambda^\star = \frac{1}{2} \left( \frac{T}{2\pi} \right)^4$$

In Figure A.1, we represent the spectral density of the moving-average filter for different windows T. We also report the spectral density of the corresponding L2 filters; for that, we have calibrated the optimal parameter λ⋆ by least-squares minimization. In Figure A.2, we compare the optimal estimator λ⋆ with the one corresponding to 10.27 × λ⋆. We notice that the approximation is very good.


Figure A.1: Spectral density of moving-average and L2 filters

Figure A.2: Relationship between the value of λ and the length of the moving-average filter


A.1.5 Implementation issues

The computational time may be large when working with dense matrices, even if we consider interior-point algorithms. It could be reduced by using sparse matrices, but the most efficient way to optimize the implementation is to consider band matrices. Moreover, we also notice that we have to solve a large linear system at each iteration. Depending on the filtering problem (L1 − T, L1 − C or L1 − T C filter), the system has 6 bands or 3 bands but is always symmetric. For computing λmax, one may remark that it is equivalent to solving a band system which is positive definite. We suggest adapting the algorithms in order to take all these properties into account.


Appendix B

Appendix of chapter 2

B.1 Estimator of volatility

B.1.1 Estimation with realized return

We consider only one return; the estimator of the volatility can then be obtained as follows:

$$R_{t_i}^2 = \left( \ln S_{t_i} - \ln S_{t_{i-1}} \right)^2 = \left( \int_{t_{i-1}}^{t_i} \sigma_u\,\mathrm{d}W_u + \int_{t_{i-1}}^{t_i} \left( \mu_u - \frac{1}{2}\sigma_u^2 \right) \mathrm{d}u \right)^2$$

The conditional expectation with respect to the couple (σ_u, µ_u), which are supposed to be independent of dW_u, is given by:

$$\mathbb{E}\left[ R_{t_i}^2 \,\middle|\, \sigma, \mu \right] = \int_{t_{i-1}}^{t_i} \sigma_u^2\,\mathrm{d}u + \left( \int_{t_{i-1}}^{t_i} \left( \mu_u - \frac{1}{2}\sigma_u^2 \right) \mathrm{d}u \right)^2$$

which is approximately equal to:

$$(t_i - t_{i-1})\,\sigma_{t_{i-1}}^2 + (t_i - t_{i-1})^2 \left( \mu_{t_{i-1}} - \frac{1}{2}\sigma_{t_{i-1}}^2 \right)^2$$

The variance of this estimator characterizes the error and reads:

$$\operatorname{var}\left( R_{t_i}^2 \,\middle|\, \sigma, \mu \right) = \operatorname{var}\left[ \left( \int_{t_{i-1}}^{t_i} \sigma_u\,\mathrm{d}W_u + \int_{t_{i-1}}^{t_i} \left( \mu_u - \frac{1}{2}\sigma_u^2 \right) \mathrm{d}u \right)^2 \,\middle|\, \sigma, \mu \right]$$

Conditionally on σ and µ, the quantity \int_{t_{i-1}}^{t_i} \sigma_u\,\mathrm{d}W_u + \int_{t_{i-1}}^{t_i} (\mu_u - \frac{1}{2}\sigma_u^2)\,\mathrm{d}u is a Gaussian variable with mean \int_{t_{i-1}}^{t_i} (\mu_u - \frac{1}{2}\sigma_u^2)\,\mathrm{d}u and variance \int_{t_{i-1}}^{t_i} \sigma_u^2\,\mathrm{d}u. Therefore, we obtain the variance of the estimator:

$$\operatorname{var}\left( R_{t_i}^2 \,\middle|\, \sigma, \mu \right) = 2 \left( \int_{t_{i-1}}^{t_i} \sigma_u^2\,\mathrm{d}u \right)^2 + 4 \left( \int_{t_{i-1}}^{t_i} \sigma_u^2\,\mathrm{d}u \right) \left( \int_{t_{i-1}}^{t_i} \left( \mu_u - \frac{1}{2}\sigma_u^2 \right) \mathrm{d}u \right)^2 \tag{B.1}$$

! Z σu2 du

!2 1 2 µu du − σu du (B.1) 2 ti−1 ti

Analysis of Trading Impact in the CTA strategy

which is approximatively equal to: 2 (ti − ti−1 )

2

σt4i−1

+ 4 (ti − ti−1 )

3

σt2i−1

2  1 2 µu du − σu du 2

We remark that when the time step (ti√− ti−1 ) becomes small, the estimator becomes unbiased with its standard deviation 2 (ti − ti−1 ) σt2i−1 . This error is directly proportional to the quantity to be estimated. In order to estimate the average variance between t0 and tn or the approached volatility at tn , we can employ the canonical estimator n X

Rt2i =

i=1

n X i=1

ln Sti − ln Sti−1

2

The expectation value of this estimator reads !2 " n # Z Z ti n tn X X 1 2 2 2 E µu du − σu du Rti σ, µ = σu du + 2 ti−1 t0 i=1

i=1

We observe that his estimator is weakly biased, however this effect is totally negligible. If we consider a volatility of 20% with a trend of 10%, the estimation of volatility is 20.006% instead of 20%. The variance of the canonical estimator (estimation error) reads: n X i=1

2

Z

ti

ti−1

!2

σu2 du

+4

Z

ti

ti−1

! Z σu2 du

!2 1 2 µu du − σu du 2 ti−1 ti

which can be roughly estimated by: n X i=1

2

Z

ti

ti−1

!2

σu2 du

≈ 2σ 4

n X i=1

(ti − ti−1 )2

If the recorded time ti are regularly distributed with time-spacing ∆t, then we have: ! n X ≈ 2σ 4 (tn − t0 ) ∆ var Rt2i σ, µ i=1

124

Appendix C

Appendix of chapter 3 C.1

Dual problem of SVM

In the traditional approach, the SVM problem is first mapped to the dual problem then is solved by a QP program. We present here the detail derivation of the dual problem in both hard-margin SVM and soft-margin SVM case.

C.1.1

Hard-margin SVM classifier

Let us start first with the hard-margin SVM problem for the classification: min w,b

1 kwk2 2

 u.c. yi wT xi + b ≥ 1 i = 1...n

In order to get the dual problem, we construct the Lagrangian for inequality constrains by introducing positive Lagrange multipliers Λ = (α1 , . . . , αi ) ≥ 0: L (w, b, Λ) =

n n X  X 1 kwk2 − α i yi w T x i + b + αi 2 i=1

i=1

In minimizing the Lagrangian with respect to (w, b), we obtain the following equations: n

X ∂L = w − α i yi xi = 0 ∂wT i=1

∂L =− ∂b

n X

α i yi = 0

i=1

Insert these results into the Lagrangian, we obtain the dual objective LD function with respect to the variable w: 1 LD (Λ) = ΛT 1 − ΛT DΛ 2 125

Analysis of Trading Impact in the CTA strategy

with Dij = yi yj xTi xj and the constrains ΛT y = 0 and Λ ≥ 0. Thank to the KKT theorem, the initial optimization problem is equivalent to maximizing the dual objective function LD (Λ) 1 max ΛT 1 − ΛT DΛ 2 Λ u.c. ΛT y = 0, Λ ≥ 0

C.1.2

Soft-margin SVM classifier

We turn now to the soft-margin SVM classifier with L1 constrain case F (u) = u, p = 1. We first write down the primal problem: ! n X 1 min kwk2 + C.F ξip w,b,ξ 2 i=1  T u.c. yi w xi + b ≥ 1 − ξi , ξi ≥ 0 i = 1...n

For both case, we construct Lagrangian by introducing the couple of Lagrange multiplier (Λ, µ) for 2n constraints. 1 L (w, b, Λ, µ) = kwk2 + C.F 2

n X i=1

ξi

!



n X i=1

n   X αi yi wT xi + b − 1 + ξi − µi ξi i=1

with the following constraints on the Lagrange multipliers Λ ≥ 0 and µ ≥ 0. Minimizing the Lagrangian with respect to (w, b, ξ) gives us: n

X ∂L = w − α i yi xi = 0 ∂wT i=1

∂L =− ∂b

n X

α i yi = 0

i=1

∂L =C −Λ−µ=0 ∂ξ with inequality constraints Λ ≥ 0 and µ ≥ 0. Insert these results into the Lagrangian leads to the dual problem: 1 max ΛT 1 − ΛT DΛ 2 Λ T u.c. Λ y = 0, 0 ≤ Λ ≤ C1 126

(C.1)

Analysis of Trading Impact in the CTA strategy

C.1.3

ε-SV regression

We study here the ε-SV regression. We first write down the primal problem with all constrains: ! n X 1 kwk2 + C min ξi w,b,ξ 2 i=1

u.c.

T

w xi + b − yi ≤ ε + ξi yi − wT xi − b ≤ ε + ξi0

ξi ≥ 0 ξi0 ≥ 0 i = 1...n In this case, we have 4n inequality constrain. Hence, we construct Lagrangian by introducing the positive Lagrange multipliers (Λ, Λ0 , µ, µ0 ). The Lagrangian of this primal problem reads: ! n n n X X X  1 2 0 µi ξi − µ0i ξi0 ξi − L w, b, Λ, Λ , µ = kwk + C.F 2 i=1

i=1



n X i=1



αi wT φ (xi ) + b − yi + ε + ξi −

n X i=1

i=1

βi −wT φ (xi ) − b + yi + ε + ξi0



with Λ = (αi )i=1...n , Λ0 = (βi )i=1...n and the following constraints on the Lagrange multipliers Λ, Λ0 , µ, µ0 ≥ 0. Minimizing the Lagrangian with respect to (w, b, ξ) gives us: n

X ∂L = w − (αi − βi ) yi xi = 0 ∂wT i=1

∂L = ∂b

n X i=1

(βi − αi ) yi = 0

∂L = CI − Λ − µ = 0 ∂ξ ∂L = CI − Λ0 − µ0 = 0 ∂ξ 0 Insert these results into the Lagrangian leads to the dual problem: max Λ,Λ0

Λ − Λ0

u.c.

Λ − Λ0

T

T

y − ε Λ + Λ0 1 = 0,

T

1−

T  1 Λ − Λ0 K Λ − Λ0 2

(C.2)

0 ≤ Λ, Λ0 ≤ C1

T

When ε = 0, the term ε (Λ + Λ0 ) 1 in the objective function disappears, then we can reduce the optimization problem by changing variable (Λ − Λ0 ) → Λ. The inequality constrain for new variable reads |Λ| < CI. 127

Analysis of Trading Impact in the CTA strategy

The dual problem can be solved by the QP program which gives the optimal solution Λ? . In order to compute b, we use the KKT condition:  αi wT φ (xi ) + b − yi + ε + ξi = 0  βi yi − wT φ (xi ) − b + ε + ξi = 0 (C − αi ) ξi = 0 (C − βi ) ξi0 = 0

We remark that the two last conditions give us: ξi = 0 for 0 < αi < C and ξi0 = 0 for 0 < βi < C. This result implies direclty the following condition for all support vectors of training set (xi , yi ): wT φ (xi ) + b − yi = 0 We denote here SV the set of support vectors. Using the condition w = and averaging over the training set, we obtain finally: b=

nSV 1 X

nSV

i

Pn

i=1 (αi

− βi ) φ (xi )

(yi − (z)i ) = 0

with z = K (Λ − Λ0 ).

C.2

Newton optimization for the primal problem

We consider here the Newton optimization scheme for solving the unconstrainted primal problem: n X  1 L yi , KiT β + b min LP (β, b) = min β T Kβ + C β ,b β ,b 2 i=1

The required condition of this scheme is that the function L (y, t) is differentiable. We study first the case of quadratic loss where L (y, t) is differentiable then the case with soft-margin where we have to regularize L (y, t).

C.2.1

Quadratic loss function

For the quadratic loss case, the penalty function has a suitable form: L (yi , f (xi )) = max (0, 1 − yi f (xi ))2 This function is differentiable everywhere and its derivative reads: ∂L (y, t) = 2y (yt − 1) I{yt≤1} ∂t However, the second derivative is not defined at the point yt = 1. In order to avoid this problem, we consider directly the function L as a function of the vector β and 128

Analysis of Trading Impact in the CTA strategy

perform a quasi-Newton optimization. The second derivative now is replaced by an approximation of the Hessian matrix. The gradient of the objective function with respect to the vector (bβ)T is given as following:     T 0  2C1T I0 1 2C1T I0 K b 1 I y ∇LP = − 2C 2CK T I0 1 K + CKI0 K β KI0 y and the pseudo-Hessian matrix is given by:   2C1T I0 1 2C1T I0 K H= 2CKI0 1 K + 2CKI0 K Then the Newton iteration consists of updating the vector (bβ)T until convergence as following:     b b ← − γH −1 ∇LP β β

C.2.2

Soft-margin SVM

For the soft-margin case, the penalty function has the following form L (yi , f (xi )) = max (0, 1 − yi f (xi )) which requires a regularization. A differentiable approximation is to use the following penalty function:   0 if yt > 1 + h  2 (1+h−yt) L (y, t) = if |1 − yt| ≤ h 4h   1− yt if yt < 1 − h

129

Published paper in the Lyxor White Paper Series:

Trend Filtering Methods For Momentum Strategies Lyxor White Paper Series, Issue # 8, December 2011 http://www.lyxor.com/fr/publications/white-papers/wp/52/

December 2 0 11 Issue #8

W H I T E PA PE R

T R E N D F I LT E R I N G METHODS FOR M O M E N T U M S T R AT E G I E S

Benjamin Bruder Research & Development Lyxor Asset Management, Paris [email protected]

Tung-Lam Dao Research & Development Lyxor Asset Management, Paris [email protected]

Jean-Charles Richard Research & Development Lyxor Asset Management, Paris [email protected]

Thierry Roncalli Research & Development Lyxor Asset Management, Paris [email protected]

T R E N D F I LT E R I N G M E T H O D S F O R M O M E N T U M S T R AT E G I E S

Issue # 8

Foreword The widespread endeavor to “identify” trends in market prices has given rise to a significant amount of literature. Elliott Wave Principles, Dow Theory, Business cycles, among many others, are common examples of attempts to better understand the nature of market prices trends. Unfortunately this literature often proves frustrating. In their attempt to discover new rules, many authors eventually lack precision and forget to apply basic research methodology. Results are indeed often presented without any reference neither to necessary hypotheses nor to confidence intervals. As a result, it is difficult for investors to find there firm guidance and to differentiate phonies from the real McCoy. This said, attempts to differentiate meaningful information from exogenous noise lie at the core of modern Statistics and Time Series Analysis. Time Series Analysis follows similar goals as the above mentioned approaches but in a manner which can be tested. Today more than ever, modern computing capacities can allow anybody to implement quite powerful tools and to independently tackle trend estimation issues. The primary aim of this 8th White Paper is to act as a comprehensive and simple handbook to the most widespread trend measurement techniques. Even equipped with refined measurement tools, investors have still to remain wary about their representation of trends. Trends are sometimes thought about as some hidden force pushing markets up or down. In this deterministic view, trends should persist. However, random walks also generate trends! Five reds drawn in a row from a non biased roulette wheel do not give any clue about the next drawn color. It is just a past trend with nothing to do with any underlying structure but a mere succession of independent events. And the bottom line is that none of those two hypotheses can be confirmed or dismissed with certainty. As a consequence, overfitting issues constitute one of the most serious pitfalls in applying trend filtering techniques in finance. Designing effective calibration procedures reveals to be as important as the theoretical knowledge of trend measurement theories. The practical use of trend extraction techniques for investment purposes constitutes the other topic addressed in this 8th White Paper. Nicolas Gaussel Global Head of Quantitative Asset Management

Q U A N T R E S E A R C H B Y LY X O R

1

2

T R E N D F I LT E R I N G M E T H O D S F O R M O M E N T U M S T R AT E G I E S

Issue # 8

Executive Summary

Introduction The efficient market hypothesis implies that all available information is reflected in current prices, and thus that future returns are unpredictable. Nevertheless, this assumption has been rejected in a large number of academic studies. It is commonly accepted that financial assets may exhibit trends or cycles. Some studies cite slow-moving economic variables related to the business cycle as an explanation for these trends. Other research argues that investors are not fully rational, meaning that prices may underreact in the short run and overreact at long horizons. Momentum strategies try to benefit from these trends. There are two opposing types: trend following and contrarian. Trend following strategies are momentum strategies in which an asset is purchased if the price is rising, while in the contrarian strategy assets are sold if the price is falling. The first step in both strategies is trend estimation, which is the focus of this paper. After a review of trend filtering techniques, we address practical issues, depending on whether trend detection is designed to explain the past or forecast the future.

The principles of trend filtering In time series analysis, the trend is considered to be the component containing the global change, which contrasts with local changes due to noise. The separation between trend and noise has a long mathematical history, and continues to be of great interest to the scientific community. There is no precise definition of the trend, but it is generally accepted that it is a smooth function representing long-term movement. Thus, trends should exhibit slow change, while noise is assumed to be highly volatile. The simplest trend filtering method is the moving average filter. On average, the noisy parts of observations tend to cancel each other out, while the trend has a cumulative nature. But observations can be averaged using many different types of weightings. More generally, the different averages obtained are referred to as linear filtering. Several examples representing trend filtering for various linear filters are shown in Figure 1. In this example, the averaging horizon (65 business days or one year) has much more influence than the type of averaging. Other trend following methods, which are classified as nonlinear, use more complex calculations to obtain more specific results (such as filters based on wavelet analysis, support vector machines or singular spectrum analysis). For instance, the L1 filter is designed to obtain piecewise constant trends, which can be interpreted more easily.

Q U A N T R E S E A R C H B Y LY X O R

3

Figure 1: Trend estimate of the S&P 500 index

Variations around a benchmark estimator Trend filtering can be performed either to explain past behaviour of asset prices, or to forecast future returns. The choice of the estimator and its calibration primarily depend on that objective. If the goal is to explain past price behaviour, there are two possible approaches. The first is to select the model and parameters that minimise past prediction error. This can be performed using a cross-validation procedure, for example. The second option is to consider a benchmark estimator, such as the six-month moving average, and to calibrate another model to be as close to the benchmark as possible. For instance, the L1 filter of Figure 2 is calibrated to deliver a constant trend over an average six-month period. This type of filter is more easily interpreted than the original six-month moving average, with clearly delimited trend periods. This procedure can be performed on any time series.

From trend filtering to forecasting Trend filtering may also be a predictive tool. This is a much more ambitious objective. It supposes that the last observed trend has an influence on future asset returns. More precisely, trend following predictions suppose that positive (or negative) trends are more likely to be followed by positive (or negative) returns. Any trend following method would be useless if this assumption did not hold. Figure 3 illustrates that the distributions of the one-month GSCI index returns after a very positive three-month trend (i.e. above a threshold) clearly dominate the return distribution after a very negative trend (i.e. below the threshold).

4

T R E N D F I LT E R I N G M E T H O D S F O R M O M E N T U M S T R AT E G I E S

Issue # 8

Figure 2: L1 versus moving average filtering

Figure 3: Distribution of the conditional standardised monthly return

Q U A N T R E S E A R C H B Y LY X O R

5

Furthermore, this persistence effect is also tested in Table 1 for a number of major financial indices. This table compares the average one-month return following a positive three-month trend period to the average one-month return following a negative three month trend period. Table 1: Average one-month conditional return based on past trends Trend Eurostoxx 50 S&P 500 MSCI WORLD MSCI EM TOPIX EUR/USD USD/JPY GSCI

Positive 1.1% 0.9% 0.6% 1.9% 0.4% 0.2% 0.2% 1.3%

Negative 0.2% 0.5% −0.3% −0.3% −0.4% −0.2% −0.2% −0.4%

Difference 0.9% 0.4% 1.0% 2.2% 0.9% 0.4% 0.4% 1.6%

On average, for all indices under consideration, returns are higher after a positive trend than after a negative one. Thus, the trends are persistent, and seem to have a predictive value. This makes the case for the study of trend following strategies, and highlights the appeal of trend filtering methods.

Conclusion The ultimate goal of trend filtering in finance is to design portfolio strategies that may benefit from the identified trends. Such strategies must rely on appropriate trend estimators and time horizons. This paper highlights the variety of estimators available in the academic literature. But the choice of trend estimator is just one of the many questions that arises in the definition of those strategies. In particular, diversification and risk budgeting are key aspects of success.

6

T R E N D F I LT E R I N G M E T H O D S F O R M O M E N T U M S T R AT E G I E S

Issue # 8

Table of Contents 1 Introduction 2 A review of econometric estimators for 2.1 The trend-cycle model . . . . . . . . . 2.2 Linear filtering . . . . . . . . . . . . . 2.3 Nonlinear filtering . . . . . . . . . . . . 2.4 Multivariate filtering . . . . . . . . . .

9 . . . .

10 10 11 21 27

3 Trend filtering in practice 3.1 The calibration problem . . . . . . . . . . . . . . . . . . . . . . 3.2 What about the variance of the estimator? . . . . . . . . . . . . 3.3 From trend filtering to trend forecasting . . . . . . . . . . . . .

30 30 33 38

4 Conclusion

40

A Statistical complements A.1 State space model and Kalman filtering A.2 L1 filtering . . . . . . . . . . . . . . . . A.3 Wavelet analysis . . . . . . . . . . . . . A.4 Support vector machine . . . . . . . . A.5 Singular spectrum analysis . . . . . . .

Q U A N T R E S E A R C H B Y LY X O R

trend . . . . . . . . . . . . . . . .

. . . . .

. . . . .

. . . . .

. . . . .

filtering . . . . . . . . . . . . . . . . . . . . . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . . .

. . . .

. . . . .

. . . .

. . . . .

. . . .

. . . . .

. . . . .

41 41 42 44 47 50

7

8

T R E N D F I LT E R I N G M E T H O D S F O R M O M E N T U M S T R AT E G I E S

Issue # 8

Trend Filtering Methods for Momentum Strategies∗ Benjamin Bruder Research & Development Lyxor Asset Management, Paris [email protected]

Tung-Lam Dao Research & Development Lyxor Asset Management, Paris [email protected]

Jean-Charles Richard Research & Development Lyxor Asset Management, Paris [email protected]

Thierry Roncalli Research & Development Lyxor Asset Management, Paris [email protected]

December 2011

Abstract This paper studies trend filtering methods. These methods are widely used in momentum strategies, which correspond to an investment style based only on the history of past prices. For example, the CTA strategy used by hedge funds is one of the best-known momentum strategies. In this paper, we review the different econometric estimators to extract a trend of a time series. We distinguish between linear and nonlinear models as well as univariate and multivariate filtering. For each approach, we provide a comprehensive presentation, an overview of its advantages and disadvantages and an application to the S&P 500 index. We also consider the calibration problem of these filters. We illustrate the two main solutions, the first based on prediction error, and the second using a benchmark estimator. We conclude the paper by listing some issues to consider when implementing a momentum strategy.

Keywords: Momentum strategy, trend following, moving average, filtering, trend extraction. JEL classification: G11, G17, C63.

1

Introduction

The efficient market hypothesis tells us that financial asset prices fully reflect all available information (Fama, 1970). One consequence of this theory is that future returns are not predictable. Nevertheless, since the beginning of the nineties, a large body of academic research has rejected this assumption. One of the arguments is that risk premiums are time varying and depend on the business cycle (Cochrane, 2001). In this framework, returns on financial assets are related to some slow-moving economic variables that exhibit cyclical patterns in accordance with the business cycle. Another argument is that some agents are ∗ We

are grateful to Guillaume Jamet and Hoang-Phong Nguyen for their helpful comments.

Q U A N T R E S E A R C H B Y LY X O R

9

not fully rational, meaning that prices may underreact in the short run but overreact at long horizons (Hong and Stein, 1997). This phenomenon may be easily explained by the theory of behavioural finance (Barberis and Thaler, 2002). Based on these two arguments, it is now commonly accepted that prices may exhibit trends or cycles. In some sense, these arguments chime with the Dow theory (Brown et al., 1998), which is one of the first momentum strategies. A momentum strategy is an investment style based only on the history of past prices (Chan et al., 1996). We generally distinguish between two types of momentum strategy: 1. the trend following strategy, which consists of buying (or selling) an asset if the estimated price trend is positive (or negative); 2. the contrarian (or mean-reverting) strategy, which consists of selling (or buying) an asset if the estimated price trend is positive (or negative). Contrarian strategies are clearly the opposite of trend following strategies. One of the tasks involved in these strategies is to estimate the trend, excepted when based on mean-reverting processes (see D’Aspremont, 2011). In this paper, we provide a survey of the different trend filtering methods. However, trend filtering is just one of the difficulties in building a momentum strategy. The complete process of constructing a momentum strategy is highly complex, especially as regards transforming past trends into exposures – an important factor that is beyond the scope of this paper. The paper is organized as follows. Section two presents a survey of the different econometric trend estimators. In particular, we distinguish between methods based on linear filtering and nonlinear filtering. In section three, we consider some issues that arise when trend filtering is applied in practice. We also propose some methods for calibrating trend filtering models and highlight the problem of estimator variance. Section four offers some concluding remarks.

2

A review of econometric estimators for trend filtering

Trend filtering (or trend detection) is a major task of time series analysis from both a mathematical and financial viewpoint. The trend of a time series is considered to be the component containing the global change, which contrasts with local changes due to noise. The trend filtering procedure concerns not only the problem of denoising; it must also take into account the dynamics of the underlying process. This explains why mathematical approaches to trend extraction have a long history, and why this subject is still of great interest to the scientific community1 . From an investment perspective, trend filtering is fundamental to most momentum strategies developed in asset management and hedge funds sectors in order to improve performance and limit portfolio risks.

2.1

The trend-cycle model

In economics, trend-cycle decomposition plays an important role by identifying the permanent and transitory stochastic components in a non-stationary time series. Generally, the permanent component can be interpreted as a trend, whereas the transitory component may 1 See

10

Alexandrov et al. (2008).

T R E N D F I LT E R I N G M E T H O D S F O R M O M E N T U M S T R AT E G I E S

Issue # 8

be a noise or a stochastic cycle. Let yt be a stochastic process. We assume that yt is the sum of two different unobservable parts: yt = xt + εt where xt represents the trend and εt is a stochastic (or noise) process. There is no precise definition for trend, but it is generally accepted to be a smooth function representing longterm movements: “[...] the essential idea of trend is that it shall be smooth.” (Kendall, 1973). It means that changes in the trend xt must be smaller than those of the process yt . From a statistical standpoint, it implies that the volatility of yt − yt−1 is higher than the volatility of xt − xt−1 : σ (yt − yt−1 )  σ (xt − xt−1 ) One of the major problems in financial econometrics is the estimation of xt . This is the subject of signal extraction and filtering (Pollock, 2009).

Finite moving average filtering for trend estimation has a long history. It has been used in actuarial science since the beginning of the twentieth century2 . But the modern theory of signal filtering has its origins in the Second World War and was formulated independently by Norbert Wiener (1941) and Andrei Kolmogorov (1941) in two different ways. Wiener worked principally in the frequency domain whereas Kolmogorov considered a time-domain approach. This theory was extensively developed in the fifties and sixties by mathematicians and statisticians such as Hermann Wold, Peter Whittle, Rudolf Kalman, Maurice Priestley, George Box, etc. In economics, the problem of trend filtering is not a recent one, and may date back to the seminal article of Muth (1960). It was extensively studied in the eighties and nineties in the literature on business cycles, which led to a vast body of empirical research being carried out in this area3 . However, it is in climatology that trend filtering is most extensively studied nowadays. Another important point is that the development of filtering techniques has evolved according to the development of computational power and the IT industry. The Savitzky-Golay smoothing procedure may appear very basic today though it was revolutionary4 when it was published in 1964. In what follows, we review the class of filtering techniques that is generally used to estimate a trend. Moving average filters play an important role in finance. As they are very intuitive and easy to implement, they undoubtedly represent the model most commonly used in trading strategies. The moving average technique belongs to the class of linear filters, which share a lot of common properties. After studying this class of filters, we consider some nonlinear filtering techniques, which may be well suited to solving financial problems.

2.2 2.2.1

Linear filtering The convolution representation

We denote by y = {. . . , y−2 , y−1 , y0 , y1 , y2 , . . .} the ordered sequence of observations of the ˆt be the estimator of the underlying trend xt which is by definition an process yt . Let x 2 See,

in particular, the works of Henderson (1916), Whittaker (1923) and Macaulay (1931). for example Cleveland and Tiao (1976), Beveridge and Nelson (1981), Harvey (1991) or Hodrick and Prescott (1997). 4 The paper of Savitzky and Golay (1964) is still considered by the Analytical Chemistry journal to be one of its 10 seminal papers. 3 See

Q U A N T R E S E A R C H B Y LY X O R

11

unobservable process. A filtering procedure consists of applying a filter L to the data y: x ˆ = L (y) ˆ−1 , x ˆ0 , x ˆ1 , x ˆ2 , . . .}. When the filter is linear, we have x ˆ = Ly with the with x ˆ = {. . . , x ˆ−2 , x normalisation condition 1 = L1. If we assume that the signal yt is observed at regular dates5 , we obtain: ∞  Lt,t−i yt−i (1) x ˆt = i=−∞

We deduce that linear filtering may be viewed as a convolution. The previous filter may not be of much use, however, because it uses future values of yt . As a result, we generally impose some restriction on the coefficients Lt,t−i in order to use only past and present values of the signal. In this case, we say that the filter is causal. Moreover, if we restrict our study to time invariant filters, the equation (1) becomes a simple convolution of the observed signal yt with a window function Li : n−1  x ˆt = Li yt−i (2) i=0

With this notation, a linear filter is characterised by a window kernel Li and its support. The kernel defines the type of filtering, whereas the support defines the range of the filter. For instance, if we take a square window on a compact support [0, T ] with T = nΔ the width of the averaging window, we obtain the well-known moving average filter: Li =

1 1 {i < n} n

We finish this description by considering the lag representation: x ˆt =

n−1  i=0

Li Li yt

with the lag operator L satisfying Lyt = yt−1 . 2.2.2

Measuring the trend and its derivative

We discuss here how to use linear filtering to measure the trend of an asset price and its derivative. Let St be the asset price which follows the dynamics of the Black-Scholes model: dSt = μt dt + σt dWt St where μt is the drift, σt is the volatility and Wt is a standard Brownian motion. The asset price St is observed in a series of discrete dates {t0 , . . . , tn }. Within this model, the appropriate signal to be filtered is the logarithm of the price yt = ln St but not the price itself. Let Rt = ln St − ln St−1 represent the realised return at time t over a unit period. If μt and σt are known, we have:   √ 1 Rt = μt − σt2 Δ + σt Δηt 2 5 We

12

have ti+1 − ti = Δ.

T R E N D F I LT E R I N G M E T H O D S F O R M O M E N T U M S T R AT E G I E S

Issue # 8

where ηt is a standard Gaussian white noise. The filtered trend can be extracted using the following equation: n−1  Li yt−i x ˆt = i=0

6

and the estimator of μt is :

μ ˆt 

n−1 1  Li Rt−i Δ i=0

We can also obtain the same result by applying the filter directly to the signal and defining the derivative of the window function as i = L˙ i : n

μ ˆt 

1  i yt−i Δ i=0

We obtain the following correspondence: ⎧ ⎨ L0 Li − Li−1 i = ⎩ −Ln−1

if i = 0 if i = 1, . . . , n − 1 if i = n

(3)

ˆt are related by the following expression: Remark 1 In some senses, μ ˆt and x μ ˆt =

d x ˆt dt

Econometric methods principally involve x ˆt , whereas μ ˆt is more important for trading strategies. Remark 2 μ ˆt is a biased estimator of μt and the bias increases with the volatility of the process σt . The expression of the unbiased estimator is then: μ ˆt =

n−1 1 2 1  Li Rt−i σt + 2 Δ i=0

Remark 3 In the previous analysis, x ˆt and μ ˆt are two estimators. We may also represent them by their corresponding probability density functions. It is therefore easy to derive estimates, but we should not forget that these estimators present some variance. In finance, and in particular in trading strategies, the question of statistical inference is generally not addressed. However, it is a crucial factor in designing a successful momentum strategy. 2.2.3

Moving average filters

Average return over a given period Here, we consider the simplest case corresponding to the moving average filter where the form of the window is: Li =

1 1 {i < n} n

In this case, the only calibration parameter is the window support, i.e. T = nΔ. It characterises the smoothness of the filtered signal. For the limit T → 0, the window becomes a Dirac distribution δt and the filtered signal is exactly the same as the observed signal: 6 If

we neglect the contribution from the term σt2 . Moreover, we consider Δ = 1 to simplify the calculation.

Q U A N T R E S E A R C H B Y LY X O R

13

x ˆt = yt . For T > 0, if we assume that the noise εt is independent from xt and is a centered process, the first contribution of the filtered signal is the average trend: x ˆt =

n−1 1 xt−i n i=0

If the trend is homogeneous, this average value is located at t − (n − 1) /2 by construction. It means that the filtered signal lags the observed signal by a time period which is half the window. To extract the derivative of the trend, we compute the derivative kernel i which is given by the following formula: i =

1 (δi,0 − δi,n ) nΔ

where δi,j is the Kronecker delta7 . The main advantage of using a moving average filter is the reduction of noise due to the central limit theorem. For the limit case n → ∞, the signal is completely denoised but it corresponds to the average value of the trend. The estimator is also biased. In trend filtering, we also face a trade-off between denoising maximisation and bias minimisation. The problem is the calibration procedure for the lag window T . Another way to determine the optimal parameter T  is to take into account the dynamics of the trend. The above moving average filter can be applied directly to the signal. However, μ ˆt is simply the cumulative return over the window period. It needs only the first and last dates of the period under consideration. Moving average crossovers Many practitioners, and even individual investors, use the moving average of the price itself as a trend indication, instead of the moving average of returns. These moving averages are generally uniform moving averages of the price. Here we will consider an average of the logarithm of the price, in order to be consistent with the previous examples: n−1 1 yt−i yˆtn = n i=0

Of course, an average price does not estimate the trend μt . This trend is estimated from the difference between two moving averages over two different time horizons n1 and n2 . Supposing that n1 > n2 , the trend μ may be estimated from: μ ˆt 

2 (ˆ y n2 − yˆtn1 ) (n1 − n2 ) Δ t

(4)

In particular, the estimated trend is positive if the short-term moving average is higher than the long-term moving average. Thus, the sign of the trend changes when the shortterm moving average crosses the long-term moving average. Of course, when the short-term horizon n1 is one, then the short-term moving average is just the current asset price. The −1 scaling term 2 (n1 − n2 ) is explained below. It is derived from the interpretation of this estimator as a weighted moving average of asset returns. Indeed, this estimator can be interpreted in terms of asset returns by inverting the formula (3) with Li being interpreted as the primitive of i : ⎧ if i = 0 ⎨ 0 i + Li−1 if i = 1, . . . , n − 1 Li = ⎩ if i = n −n+1 7δ

14

i,j

is equal to 1 if i = j and 0 otherwise.

T R E N D F I LT E R I N G M E T H O D S F O R M O M E N T U M S T R AT E G I E S

Issue # 8

The weighting of each return in the estimator (4) is represented in Figure 1. It forms a triangle, and the biggest weighting is given at the horizon of the smallest moving average. Therefore, depending on the horizon n2 of the shortest moving average, the indicator can be focused toward the current trend (if n2 is small) or toward past trends (if n2 is as large as n1 /2 for instance). From these weightings, in the case of a constant trend μ, we can compute the expectation of the difference between the two moving averages:   n1 − n2 1 2 n2 n1 μ − σt Δ E [ˆ yt − yˆt ] = 2 2 Therefore, the scaling factor in formula (4) appears naturally. Figure 1: Window function Li of moving average crossovers (n1 = 100)

Enhanced filters To improve the uniform moving average estimator, we may take the following kernel function: n 4 i = 2 sgn −i n 2 We notice that the estimator μ ˆt now takes into account all the dates of the window period. By taking the primitive of the function i , the trend filter is given as follows: n

4  n

Li = 2 − i −

n 2 2

We now move to the second type of moving average filter which is characterised by an asymmetric form of the convolution kernel. One possibility is to take an asymmetric window function with a triangular form: Li =

Q U A N T R E S E A R C H B Y LY X O R

2 (n − i) 1 {i < n} n2

15

By computing the derivative of this window function, we obtain the following kernel: i =

2 (δi − 1 {i < n}) n

The filtering equation of μt then becomes: n−1 2 1 μ ˆt = xt−i xt − n n i=0 Remark 4 Another way to define μ ˆt is to consider the Lanczos generalised derivative (Groetsch, 1998). Let f (x) be a function. We define the Lanczos derivative of f (x) in terms of the following relationship:

ε 3 dL tf (x + t) dt f (x) = lim 3 ε→0 2ε dx −ε In the discrete case, we have: dL f (x) = lim h→0 dx

n

kf (x + kh) n 2 k=1 k h

k=−n

2

We first notice that the Lanczos derivative is more general than the traditional derivative. Although Lanczos’ formula is a more onerous method for finding the derivative, it offers some advantages. This technique allows us to compute a “pseudo-derivative” at points where the function is not differentiable. For the observable signal yt , the traditional derivative does not exist because of the noise εt , but does in the case of the Lanczos derivative. Let us apply the Lanczos’ formula to estimate the derivative of the trend at the point t − T /2. We obtain: n dL 12   n x ˆt = 3 − i yt−i dt n i=0 2

We deduce that the kernel is: i =

12  n − i 1 {0 ≤ i ≤ n} 3 n 2

By computing an integration by parts, we obtain the trend filter: Li =

6 i (n − i) 1 {0 ≤ i ≤ n} n3

In Figure 2, we have represented the different functions Li given in this paragraph. We may extend these filters by computing the convolution of two or more filters. For exemple, the mixed filter in Figure 2 is the convolution of the asymmetric filter with the Lanczos filter. Let us apply these filters to the S&P 500 index. The results are given in Figure 3 for two values of the window length (n = 65 days and n = 260 days). We notice that the choice of n has a big impact on the filtered series. The choice of the window function seems to be less important at first sight. However, we should mention that traders are principally interested in the derivative of the trend, and not the absolute value of the trend itself. In this case, the window function may have a significant impact. Figure 4 is the scatterplot of the μ ˆt statistic in the case of the S&P 500 index from January 2000 to July 2011 (we have considered the uniform and Lanczos filters using n = 260). We may also show that this impact increases when we reduce the length of the window as illustrated in Table 1.

16

T R E N D F I LT E R I N G M E T H O D S F O R M O M E N T U M S T R AT E G I E S

Issue # 8

Figure 2: Window function Li of moving average filters (n = 100)

Figure 3: Trend estimate for the S&P 500 index

Q U A N T R E S E A R C H B Y LY X O R

17

Table 1: Correlation between the uniform and Lanczos derivatives n Pearson ρ Kendall τ Spearman

5 84.67 65.69 83.15

10 87.86 68.92 86.09

22 90.14 70.94 88.17

65 90.52 71.63 88.92

130 92.57 73.63 90.18

260 94.03 76.17 92.19

Figure 4: Comparison of the derivative of the trend

2.2.4

Least squares filters

L2 filtering The previous Lanczos filter may be viewed as a local linear regression (Burch et al., 2005). More generally, least squares methods are often used to define trend estimators: n

ˆn } = arg min {ˆ x1 , . . . , x

1 2 (yt − x ˆt ) 2 t=1

However, this problem is not well-defined. We also need to impose some restrictions on the ˆt to obtain a solution. For example, we may underlying process yt or on the filtered trend x consider a deterministic constant trend: xt = xt−1 + μ In this case, we have: yt = μt + εt Estimating the filtered trend x ˆt is also equivalent to estimating the coefficient μ: n tyt μ ˆ = t=1 n 2 t=1 t

18

(5)

T R E N D F I LT E R I N G M E T H O D S F O R M O M E N T U M S T R AT E G I E S

Issue # 8

If we consider a trend that is not constant, we may define the following objective function: n n−1  1 2 2 (yt − x ˆt ) + λ (ˆ xt−1 − 2ˆ xt + x ˆt+1 ) 2 t=1 t=2

In this function, λ is the regularisation parameter which controls the competition between ˆt and the noise yt − x ˆt . We may rewrite the objective function in the the smoothness8 of x vectorial form: 1 2 2 y − x ˆ 2 + λ Dˆ x 2 2 where y = (y1 , . . . , yn ), x ˆ = (ˆ x1 , . . . , x ˆn ) and the D operator is the (n − 2) × n matrix: ⎤ ⎡ 1 −2 1 ⎥ ⎢ 1 −2 1 ⎥ ⎢ ⎥ ⎢ .. D=⎢ ⎥ . ⎥ ⎢ ⎦ ⎣ 1 −2 1 1 2 1

The estimator is then given by the following solution:  −1 x ˆ = I + 2λD D y

It is known as the Hodrick-Prescott filter (or L2 filter). This filter plays an important role in calibrating the business cycle.

Kalman filtering Another important trend estimation technique is the Kalman filter, which is described in Appendix A.1. In this case, the trend μt is a hidden process which follows a given dynamic. For example, we may assume that the model is9 :  Rt = μt + σζ ζt (6) μt = μt−1 + ση ηt Here, the equation of Rt is the measurement equation and Rt is the observable signal of to follow a random walk. We define realised returns. The hidden process μt is supposed  2  ˆt|t−1 − μt . Using the results given in Appendix μ ˆt|t−1 = Et−1 [μt ] and Pt|t−1 = Et−1 μ A.1, we have: ˆt|t−1 + Kt Rt μ ˆt+1|t = (1 − Kt ) μ  where Kt = Pt|t−1 / Pt|t−1 + σζ2 is the Kalman gain. The estimation error is determined by Riccati’s equation: Pt+1|t = Pt|t−1 + ση2 − Pt|t−1 Kt Riccati’s equation gives us the stationary solution:  ση  P∗ = ση + ση2 + 4σζ2 2 The filter equation becomes:

μ ˆt+1|t = (1 − κ) μ ˆt|t−1 + κRt 8 We notice that the second term is the discrete derivative of the trend x ˆt which characterises the smoothness of the curve. 9 Equation (5) is a special case of this model if σ = 0. η

Q U A N T R E S E A R C H B Y LY X O R

19

with: κ=

2σ  η ση + ση2 + 4σζ2

This Kalman filter can be considered as an exponential moving average filter with parameter10 λ = − ln (1 − κ): ∞   e−λi Rt−i μ ˆt = 1 − e−λ i=0

with11 μ ˆt = Et [μt ]. The filter of the trend x ˆt is therefore determined by the following equation: ∞   e−λi yt−i x ˆt = 1 − e−λ i=0

while the derivative of the trend may be directly related to the observed signal yt as follows: ∞      e−λi yt−i μ ˆt = 1 − e−λ yt − 1 − e−λ eλ − 1 i=1

In Figure 5, we reported the window function of the Kalman filter for several values of λ. We notice that the cumulative weightings increase strongly with λ. The half-life of this filter is approximatively equal to λ−1 − 2−1 ln 2 . For example, the half-life for λ = 5% is 14 days.

Figure 5: Window function Li of the Kalman filter

10 We 11 We

20

have 0 < κ < 1 and lambda > 0. notice that μ ˆt+1|t = μ ˆt .

T R E N D F I LT E R I N G M E T H O D S F O R M O M E N T U M S T R AT E G I E S

Issue # 8

We may wonder what the link is between the regression model (5) and the Markov model (6). Equation (5) is equivalent to the following state space model12 : 

yt = xt + σε εt xt = xt−1 + μ

If we now consider that the trend is stochastic, the model becomes: 

yt = xt + σε εt xt = xt−1 + μ + σζ ζt

This model is called the local level model. We may also assume that the slope of the trend is stochastic, in which case we obtain the local linear trend model: ⎧ ⎨ yt = xt + σε εt xt = xt−1 + μt−1 + σζ ζt ⎩ μt = μt−1 + ση ηt

These three models are special cases of structural models (Harvey, 1989) and may be easily solved by Kalman filtering. We also deduce that the Markov model (6) is a special case of the latter when σε = 0. Remark 5 We have shown that Kalman filtering may be viewed as an exponential moving average filter when we consider the Markov model (6). Nevertheless, we cannot regard the Kalman filter simply as a moving average filter. First, the Kalman filter is the optimal filter in the case of the linear Gaussian model described in Appendix A.1. Second, it could be regarded as “an efficient computational solution of the least squares method” (Sorensen, 1970). Third, we could use it to solve more sophisticated processes than the Markov model (6). However, some nonlinear or non Gaussian models may be too complex for Kalman filtering. These nonlinear models can be solved by particle filters or sequential Monte Carlo methods (see Doucet et al., 1998). Another important feature of the Kalman approach is the derivation of an optimal smoother (see Appendix A.1). At time t, we are interested by the numerical value of xt , but also by the past values of xt−i because we would like to measure the slope of the trend. The Kalman smoother improves the estimate of x ˆt−i by using all the information between t − i and t. Let us consider the previous example in relation to the S&P 500 index, using the local level model. Figure 6 gives the filtered and smoothed components xt and μt for two sets of parameters13 . We verify that the Kalman smoother reduces the noise by incorporating more information. We also notice that the restriction σε = 0 increases the variance of the trend and slope estimators.

2.3

Nonlinear filtering

In this section, we review other filtering approaches. They are generally classed as nonlinear filters, because it is not possible to express the trend as a linear convolution of the signal and a window function. 12 In

what follows, the noise processes are white noise: εt ∼ N (0, 1), ζt ∼ N (0, 1) and ηt ∼ N (0, 1). the first set of parameters, we assume that σε = 100σζ and ση = 1/100σζ . For the second set of parameters, we impose the restriction σε = 0. 13 For

Q U A N T R E S E A R C H B Y LY X O R

21

Figure 6: Kalman filtered and smoothed components

2.3.1

Nonparametric regression

In the regression model (5), we assume that xt = f (t) while f (t) = μt. The model is said to be parametric because the estimation of the trend consists of estimating the parameter μ. ˆt. With nonparametric regression, we directly estimate the function f , We then have x ˆt = μ obtaining x ˆt = fˆ (t). Some examples of nonparametric regression are kernel regression, loess regression and spline regression. A popular method for trend filtering is local polynomial regression: yt

= f (t) + εt p  j = β0 (τ ) + βj (τ ) (τ − t) + εt j=1

For a given value of τ , we estimate the parameters βˆj (τ ) using weighted least squares with the following weightings:   τ −t wt = K h where K is the kernel function with a bandwidth h. We deduce that: x ˆt = E [ yt | τ = t] = βˆ0 (t) Cleveland (1979) proposed an improvement to the kernel regression through a two-stage procedure (loess regression). First, we  fit a polynomial regression to estimate the residuals ε|)) and run a εˆt . Then, we compute δt = 1 − u2t · 1 {|ut | ≤ 1} with ut = εˆt / (6 median (|ˆ second kernel regression14 with weightings δt wt . 14 Cleveland

22

(1979) suggests using the tricube kernel function to define K.

T R E N D F I LT E R I N G M E T H O D S F O R M O M E N T U M S T R AT E G I E S

Issue # 8

A spline function is a C 2 function S (τ ) which corresponds to a cubic polynomial function on each interval [t, t + 1[. Let SP be the set of spline functions. We then have to solve the following optimisation programme:

T n  2 2 wt (yt − S (t)) + h wτ S  (τ ) dτ min (1 − h) S∈SP

t=0

0

where h is the smoothing parameter – h = 0 corresponds to the interpolation case15 and h = 1 corresponds to the linear regression16 .

Figure 7: Illustration of the kernel, loess and spline filters

We illustrate these three nonparametric methods in Figure 7. The calibration of these filters is more complicated than for moving average filters, where the only parameter is the length n of the window. With these methods, we have to decide the polynomial degree17 p, the kernel function18 K and the smoothing parameter19 h. 2.3.2

L1 filtering

The idea of the Hodrick-Prescott filter can be generalised to a larger class of filters by using the Lp penalty condition instead of the L2 penalty. This generalisation was previously 15 We

have x ˆt = S (t) = yt . have x ˆt = S (t) = cˆ + μ ˆt with (ˆ c, μ ˆ) the OLS estimate of yt on a constant and time t because the optimum is reached for S  (τ ) = 0. 17 For the kernel regression, we use a Gaussian kernel with a bandwidth h = 0.10. We notice the impact of the degree of polynomial. The higher the degree, the smoother the trend (and the slope of the trend). 18 For the loess regression, the degree of polynomial is set to 1 and the bandwidth h is 0.02. We show the impact of the second step which modifies the kernel function. 19 For the spline regression, we consider a uniform kernel function. We notice that the parameter h has an impact on the smoothness of the trend. 16 We

Q U A N T R E S E A R C H B Y LY X O R

23

discussed in the work of Daubechies et al. (2004) in relation to the linear inverse problem, while Tibshirani (1996) considers the Lasso regression problem. If we consider an L1 filter, the objective function becomes: n n−1  1 2 (yt − x ˆt ) + λ |ˆ xt−1 − 2ˆ xt + x ˆt+1 | 2 t=1 t=2

which is equivalent to the following vectorial form:

1 2 x 1 y − x ˆ 2 + λ Dˆ 2 Kim et al. (2009) shows that the dual problem of this L1 filter scheme is a quadratic ˆ, we may also use the quadratic programme with some boundary constraints20 . To find x programming algorithm, but Kim et al. (2009) suggest using the primal-dual interior point method in order to optimise the numerical computation speed. We have illustrated the L1 filter in Figure 8. Contrary to all other previous methods, the filtered signal comprises a set of straight trends and breaks21 , because the L1 norm imposes the condition that the second derivative of the filtered signal must be zero. The competition between the two terms in the objective function turns to the competition between the number of straight trends (or the number of breaks) and the closeness to the data. Thus, the smoothing parameter λ plays an important role for detecting the number of breaks. This explains why L1 filtering is radically different to L2 (or Hodrick-Prescott) filtering. Moreover, it is easy to compute the slope of the trend μ ˆt for the L1 filter. It is a step function, indicating clearly if the trend is up or down, and when it changes (see Figure 8). 2.3.3

Wavelet filtering

Another way to estimate the trend xt is to denoise the signal yt by using spectral analysis. The Fourier transform is an alternative representation of the original signal yt , which becomes a frequency function: n  y (ω) = yt e−iωt t=1

We note y (ω) = F (y). By construction, we have y = F −1 (y) with F −1 the inverse Fourier transform. A simple idea for denoising in spectral analysis is to set some coefficients y (ω) to zero before reconstructing the signal. Figure 9 is an illustration of denoising using the thresholding rule. Selected parts of the frequency spectrum can easily be manipulated by filtering tools. For example, some can be attenuated, and others may be completely removed. Applying the inverse Fourier transform to this filtered spectrum leads to a filtered time series. Therefore, a smoothing signal can be easily performed by applying a low-pass filter, that is, by removing the higher frequencies. For example, we have represented two denoised signals of the S&P 500 index in Figure 9. For the first one, we use a 95% thresholding procedure whereas 99% of the Fourier coefficients are set to zero in the second case. One difficulty with this approach is the bad time location for low frequency signals and the bad frequency location for the high frequency signals. It is then difficult to localise when the trend (which is located in low frequencies) reverses. But the main drawback of spectral analysis is that it is not well suited to nonstationary processes (Martin and Flandrin, 1985, Fuentes, 2002, Oppenheim and Schafer, 2009). 20 The 21 A

24

detail of this derivation is shown in Appendix A.2. break is the position where the signal trend changes.

Issue # 8

T R E N D F I LT E R I N G M E T H O D S F O R M O M E N T U M S T R AT E G I E S

Figure 8: L1 versus L2 filtering

Figure 9: Spectral filtering

Q U A N T R E S E A R C H B Y LY X O R

25

A solution consists of adopting a double dimension analysis, both in time and frequency. This approach corresponds to the wavelet analysis. The method of denoising is the same as described previously and the estimation of xt is done in three steps: 1. we compute the wavelet transform W of the original signal yt to obtain the wavelet coefficients ω = W (y); 2. we modify the wavelet coefficients according to a denoising rule D: ω  = D (ω) 3. We convert the modified wavelet coefficients into a new signal using the inverse wavelet transform W −1 : x = W −1 (ω  ) There are two principal choices in this approach. First, we have to specify which mother wavelet to use. Second, we have to define the denoising rule. Let ω − and ω + be two scalars with 0 < ω − < ω + . Donoho and Johnstone (1995) define several shrinkage methods22 : • Hard shrinkage • Soft shrinkage

  ωi = ωi · 1 |ωi | > ω +

  ωi = sgn (ωi ) · |ωi | − ω + +

• Semi-soft shrinkage ⎧ si |ωi | ≤ ω − ⎨ 0 −1 +  + − − ωi = sgn (ωi ) (ω − ω ) ω (|ωi | − ω ) si ω − < |ωi | ≤ ω + ⎩ ωi si |ωi | > ω +

• Quantile shrinkage is a hard shrinkage method where w+ is the q th quantile of the coefficients |ωi |. Wavelet filtering is illustrated in Figure 10. We have computed the wavelet coefficients using the cascade algorithm of Mallat (1989) and the low-pass and high-pass filters of order 6 proposed by Daubechies (1992). The filtered trend is obtained using quantile shrinkage. In the first case, the noisy signal remains because we consider all the coefficients (q = 0). In the second and third cases, 95% and 99% of the wavelet coefficients are set to zero23 . 2.3.4

Other methods

Many other methods can be used to perform trend filtering. The most recent include, for example, singular spectrum analysis24 (Vautard et al., 1992), support vector machines25 and empirical mode decomposition (Flandrin et al., 2004). Moreover, we notice that traders sometimes use their own techniques (see, inter alia, Ehlers, 2001). 22 In

practice, the coefficients ωi are standardised before being computed. is interesting to note that the denoising procedure retains some wavelet coefficients corresponding to high and medium frequencies and located around the 2008 crisis. 24 See Appendix A.5 for an illustration. 25 A brief presentation is given in Appendix A.4. 23 It

26

T R E N D F I LT E R I N G M E T H O D S F O R M O M E N T U M S T R AT E G I E S

Issue # 8

Figure 10: Wavelet filtering

2.4

Multivariate filtering

Until now, we have assumed that the trend is specific to a financial asset. However, we may be interested in estimating the common trend of several financial assets. For example, if we wanted to estimate the trend of emerging markets equities, we could use a global index like the MSCI EM or extract the trend by considering several indices, e.g. the Bovespa index (Brazil), the RTS index (Russia), the Nifty index (India), the HSCEI index (China), etc. In this case, the trend-cycle model becomes: ⎞ ⎛ (1) (1) ε yt ⎜ t. ⎜ . ⎟ ⎜ . ⎟ = xt + ⎜ . ⎝ . ⎝ . ⎠ (m) (m) yt εt ⎛

(j)

⎞ ⎟ ⎟ ⎠

(j)

where yt and εt are respectively the signal and the noise of the financial asset j and xt is the common trend. One idea for estimating the common trend is to obtain the mean of the specific trends: m

x ˆt =

Q U A N T R E S E A R C H B Y LY X O R

1  (j) x ˆ m j=1 t

27

If we consider moving average filtering, it is equivalent to applying the filter to the average m (j) 1 filter26 y¯t = m j=1 yt . This rule is also valid for some nonlinear filters such as L1 filtering (see Appendix A.2). In what follows, we consider the two main alternative approaches developed in econometrics to estimate a (stochastic) common trend. 2.4.1

Error-correction model, common factors and the P-T decomposition

The econometrics of nonstationary time series may also help us to estimate a common trend. (j) (j) (j) yt is said to be integrated of order 1 if the change yt − yt−1 is stationary. We will note (j)

yt

(j)

∼ I (1) and (1 − L) yt

(1)

(m)

∼ I (0). Let us now define yt = yt , . . . , yt

. The vector yt

is cointegrated of rank r if there exists a matrix β of rank r such that zt = β  yt ∼ I (0). In this case, we show that yt may be specified by an error-correction model (Engle and Granger, 1987): ∞  Φi Δyt−i + ζt (7) Δyt = γzt−1 + i=1

where ζt is a I (0) vector process. Stock and Watson (1988) propose another interesting representation of cointegration systems. Let ft be a vector of r common factors which are I (1). Therefore, we have: yt = Aft + ηt

(8)

where ηt is a I (0) vector process and ft is a I (1) vector process. One of the difficulties with this type of model is the identification step (Peña and Box, 1987). Gonzalo and Granger (1995) suggest defining a permanent-transitory (P-T) decomposition: y t = P t + Tt such that the permanent component Pt is difference stationary, the transitory component Tt is covariance stationary and (ΔPt , Tt ) satisfies a constrained autoregressive representation. Using this framework and some other conditions, Gonzalo and Granger show that we may obtain the representation (8) by estimating the relationship (7): ft = γ˘  yt

(9)

where γ˘  γ = 0. They then follow the works of Johansen (1988, 1991) to derive the maximum likelihood estimator of γ˘ . Once we have estimated the relationship (9), it is also easy to ˆt . identify the common trend27 x 26 We

have: x ˆt

=

=

=

m n−1 1 XX (j) Li yt−i m j=1 i=0 0 1 n−1 m X 1 X (j) A @ Li y m j=1 t−i i=0 n−1 X i=0

27 If

28

Li y¯t−i

a common trend exists, it is necessarily one of the common factors.

T R E N D F I LT E R I N G M E T H O D S F O R M O M E N T U M S T R AT E G I E S

Issue # 8

2.4.2

Common stochastic trend model

Another idea is to consider an extension of the local linear trend model: ⎧ ⎨ yt = αxt + εt xt = xt−1 + μt−1 + σζ ζt ⎩ μt = μt−1 + ση ηt

  (1) (m) (1) (m) , εt = εt , . . . , εt ∼ N (0, Ω), ζt ∼ N (0, 1) and ηt ∼ N (0, 1). with yt = yt , . . . , yt Moreover, we assume that εt , ζt and ηt are independent of each other. Given the parameters (α, Ω, σζ , ση ), we may run the Kalman filter to estimate the trend xt and the slope μt whereas the Kalman smoother allows us to estimate xt−i and μt−i at time t. Remark 6 The case ση = 0 has been extensively studied by Chang et al. (2009). In particular, they show that yt is cointegrated with β = Ω−1 Γ and Γ a m × (m − 1) matrix such that Γ Ω−1 α = 0 and Γ Ω−1 Γ = Im−1 . Using the P-T decomposition, they also found that the common stochastic trend is given by α Ω−1 yt , implying that the above averaging rule is not optimal. We come back to the example given in Figure 6 page 22. Using the second set of parameters, we now consider three stock indices: the S&P 500 index, the Stoxx 600 index and the MSCI EM index. For each index, we estimate the filtered trend. Moreover, using the previous common stochastic trend model28 , we estimate the common trend for the bivariate signal (S&P 500, Stoxx 600) and the trivariate signal (S&P 500, Stoxx 600, MSCI EM). Figure 11: Multivariate Kalman filtering

28 We

assume that αj takes the value 1 for the three signals.

Q U A N T R E S E A R C H B Y LY X O R

29

3 3.1

Trend filtering in practice The calibration problem

For the practical use of the trend extraction techniques discussed above, the calibration of filtering parameters is crucial. These calibrated parameters must incorporate our prediction requirement or they can be mapped to a commonly-known benchmark estimator. These constraints offer us some criteria for determining the optimal parameters for our expected prediction horizon. Below, we consider two possible calibration schemes based on these criteria. 3.1.1

Calibration based on prediction error

One idea for estimating the parameters of a model is to use statistical inference tools. Let us consider the local linear trend model. We may estimate the set of parameters (σε , σζ , ση ) by maximising the log-likelihood function29 : n

=

v2 1 ln 2π + ln Ft + t 2 t=1 Ft

# $ where vt = yt − Et−1 [yt ] is the innovation process and Ft = Et−1 vt2 is the variance of vt . In Figure 12, we have reported the filtered and smoothed trend and slope estimated by the maximum likelihood method. We notice that the estimated components are more noisy than those obtained in Figure 6. We can explain this easily because maximum likelihood is based on the one-day innovation process. If we want to look at a longer trend, we have to consider the innovation process vt = yt − Et−h [yt ] where h is the horizon time. We have reported the slope for h = 50 days in Figure 12. It is very different from the slope corresponding to h = 1 day. The problem is that the computation of the log-likelihood for the innovation process vt = yt − Et−h [yt ] is trickier because there is generally no analytic expression. This is why we do not recommend this technology for trend filtering problems, because the trends estimated are generally very short-term. A better solution is to employ a cross-validation procedure to calibrate the parameters θ of the filters discussed above. Let us consider the calibration scheme presented in Figure 13. We divide our historical data into a training set and a validation set, which are characterised by two time parameters T1 and T2 . The size of training set T1 controls the precision of our calibration, for a fixed parameter θ. For this training set, the value of the expectation of Et−h [yt ] is computed. The second parameter 29 Another way of estimating the parameters is to consider the log-likelihood function in the frequency domain analysis (Roncalli, 2010). In the case of the local linear trend model, the stationary form of yt is S (yt ) = (1 − L)2 yt . We deduce that the associated log-likelihood function is:

=−

n−1 n−1 n 1 X I (λj ) 1 X ln f (λj ) − ln 2π − 2 2 j=0 2 j=0 f (λj )

where I (λj ) is the periodogram of S (yt ) and f (λ) is the spectral density: f (λ) = because we have:

30

ση2 + 2 (1 − cos λ) σζ2 + 4 (1 − cos λ)2 σε2 2π

S (yt ) = ση ηt−1 + σζ (1 − L) ζt + σε (1 − L)2 εt

T R E N D F I LT E R I N G M E T H O D S F O R M O M E N T U M S T R AT E G I E S

Issue # 8

Figure 12: Maximum likelihood of the trend and slope components

T2 determines the size of the validation set, which is used to estimate the prediction error: e (θ; h) =

n−h  t=1

2

(yt − Et−h [yt ])

This quantity is directly related to the prediction horizon h = T2 for a given investment strategy. The minimisation of the prediction error leads to the optimal value θ of the filter parameters which will be used to predict the trend for the test set. For example, we apply this calibration scheme for L1 filtering for h equal to 50 days. Figure 14 illustrates the calibration procedure for the S&P 500 index with T1 = 400 and T2 = 50. Minimising the cumulative prediction error over the validation set gives the optimal value λ = 7.03. Figure 13: Cross-validation procedure for determining optimal parameters θ Training set |

3.1.2

|

T1

Forecasting

Test set |

Historical data

 T2

| Today

 T2

 Prediction

Calibration based on benchmark estimator

The trend filtering algorithm can be calibrated with a benchmark estimator. In order to illustrate this idea, we present in this discussion the calibration procedure for L2 filters by

Q U A N T R E S E A R C H B Y LY X O R

31

Figure 14: Calibration procedure with the S&P 500 index for the L1 filter

using spectral analysis. Though the L2 filter provides an explicit solution which is a great advantage for numerical implementation, the calibration of the smoothing parameter λ is not straightforward. We propose to calibrate the L2 filter by comparing the spectral density of this filter with that obtained using the uniform moving average filter with horizon n for which the spectral density is: f

MA

1 (ω) = 2 n



2

n−1

 −iωt

e





t=0

 −1 For the L2 filter, the solution has the analytical form x ˆ = 1 + 2λD D y. Therefore, the spectral density can also be computed explicitly: f HP (ω) =



1 1 + 4λ (3 − 4 cos ω + cos 2ω)

2

2  This spectral density can then be approximated by 1/ 1 + 2λω 4 . Hence, the spectral −1/4 for the L2 filter whereas it is 2πn−1 for the uniform moving average filter. width is (2λ) The calibration of the L2 filter could be achieved by matching these two quantities. Finally, we obtain the following relationship: λ ∝ λ =

1  n 4 2 2π

In Figure 15, we represent the spectral density of the uniform moving average filter for different window sizes n. We also report the spectral density of the corresponding L2 filters. To obtain this, we calibrated the optimal parameter λ by least square minimisation. In

32

T R E N D F I LT E R I N G M E T H O D S F O R M O M E N T U M S T R AT E G I E S

Issue # 8

Figure 16, we compare the optimal estimator λ with that corresponding to 10.27 × λ . We notice that the approximation is very good30 . Figure 15: Spectral density of moving average and L2 filters

3.2

What about the variance of the estimator?

Let μ ˆt be the estimator of the slope of the trend. There may be a confusion between the estimator of the slope and the estimated value of the slope (or the estimate). The estimator is a random variable and is defined by a probability distribution function. Based on the sample data, the estimator takes a value which is the estimate of the slope. Suppose that we obtain an estimate of 10%. It means that 10% is the most likely value of the slope given the data. But it does not mean that 10% is the true value of the slope. 3.2.1

Measuring the efficiency of trend filters

Let μ0t be the true value of the slope. In statistical inference, the quality of an estimator is defined by the mean squared error (or MSE):  2  ˆt − μ0t MSE (ˆ μt ) = E μ (1)

It indicates how far the estimates are from the true value. We say that the estimator μ ˆt (2) is more efficient than the estimator μ ˆt if its MSE is lower:   (1) (2) (1) (2) μ ˆt  μ ≤ MSE μ ˆt ˆt ⇔ MSE μ ˆt 30 We

estimated the figure 10.27 using least squares.

Q U A N T R E S E A R C H B Y LY X O R

33

Figure 16: Relationship between the value of λ and the length of the moving average filter

We may decompose the MSE statistic into two components:   # $2 2 μt ] − μ0t μt − E [ˆ μt ]) + E E [ˆ MSE (ˆ μt ) = E (ˆ

The first component is the variance of the estimator var (ˆ μt ) whereas the second component is the square of the bias B (ˆ μt ). Generally, we are interested by estimators that are unbiased (B (ˆ μt ) = 0). If this is the case, comparing two estimators is equivalent to comparing their variances. Let us assume that the price process is a geometric Brownian motion: dSt = μ0 St dt + σ 0 St dWt In this case, the slope of the trend is constant and is equal to μ0 . In Figure 17, we have reported the probability density function of the estimator μ ˆt when the true slope μ0 is 10%. We consider the estimator based on a uniform moving average filter of length n. First, we notice that using filters is better than using the noisy signal. We also observe that the variance of the estimators increases with the parameter σ 0 and decreases with the length n. 3.2.2

Trend detection versus trend filtering

In the previous paragraph, we saw that an estimate of the trend may not be significant if the variance of the estimator is too large. Before computing an estimate of the trend, we then have to decide if there is a trend or not. This process is called trend detection. Mann (1945) considers the following statistic: (n) St

=

n−2  n−1 

i=0 j=i+1

34

sgn (yt−i − yt−j )

Issue # 8

T R E N D F I LT E R I N G M E T H O D S F O R M O M E N T U M S T R AT E G I E S

Figure 17: Density of the estimator μ ˆt

Figure 18: Impact of μ0 on the estimator μ ˆt

Q U A N T R E S E A R C H B Y LY X O R

35

with sgn(y_{t-i} − y_{t-j}) = 1 if y_{t-i} > y_{t-j} and sgn(y_{t-i} − y_{t-j}) = −1 if y_{t-i} < y_{t-j}. We have³¹:

    \operatorname{var}(S_t^{(n)}) = \frac{n(n-1)(2n+5)}{18}

We can show that:

    -\frac{n(n+1)}{2} \leq S_t^{(n)} \leq \frac{n(n+1)}{2}

The bounds are reached if y_t < y_{t-i} (negative trend) or y_t > y_{t-i} (positive trend) for i ∈ N*. We can then normalise the score:

    \bar{S}_t^{(n)} = \frac{2 S_t^{(n)}}{n(n+1)}

S̄_t^(n) takes the value +1 (or −1) if we have a perfect positive (or negative) trend. If there is no trend, it is obvious that S̄_t^(n) ≃ 0. Under this null hypothesis, we have:

    Z_t^{(n)} \xrightarrow[n \to \infty]{} \mathcal{N}(0, 1)

with:

    Z_t^{(n)} = \frac{S_t^{(n)}}{\sqrt{\operatorname{var}(S_t^{(n)})}}
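As an illustration, here is a minimal Python sketch of this detection statistic (the window convention and the use of the normal approximation without the tie correction of footnote 31 are simplifying assumptions):

```python
import numpy as np

def mann_kendall(y):
    """Mann statistic S, normalised score and Gaussian Z-score computed on a
    window y ordered from the oldest to the most recent observation."""
    y = np.asarray(y, dtype=float)
    n = len(y)
    s = 0.0
    for i in range(n - 1):
        s += np.sign(y[i + 1:] - y[i]).sum()   # sum of sgn(newer - older) over all pairs
    var_s = n * (n - 1) * (2 * n + 5) / 18.0   # variance without the tie correction
    score = 2.0 * s / (n * (n + 1))            # normalised score as in the text
    z = s / np.sqrt(var_s)                     # approximately N(0,1) under "no trend"
    return s, score, z

# Example: a noisy upward trend over 60 observations
rng = np.random.default_rng(0)
y = 0.05 * np.arange(60) + rng.normal(0, 1, 60)
print(mann_kendall(y))
```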

In Figure 19, we have reported the normalised score S̄_t^(n) for the S&P 500 index and different values of n. Statistics relating to the null hypothesis are given in Table 2 for the study period. We notice that we generally reject the hypothesis that there is no trend when we consider a period of one year. The number of cases in which we observe no trend increases if we consider a shorter period. For example, if n is equal to 10 days, we accept the hypothesis that there is no trend in 42% of cases when the confidence level α is set to 90%.

Table 2: Frequencies of rejecting the null hypothesis with confidence level α

    α       n = 10 days    n = 3 months    n = 1 year
    90%     58.06%         85.77%          97.17%
    95%     49.47%         82.87%          96.78%
    99%     29.37%         76.68%          95.33%

Remark 7 We have reported the statistic S̄_t^(10) against the trend estimate³² μ̂_t for the S&P 500 index since January 2000. We notice that μ̂_t may be positive whereas S̄_t^(10) is negative. This illustrates that a trend measurement is just an estimate. It does not mean that a trend exists.

³¹ If there are some tied sequences (y_{t-i} = y_{t-i-1}), the formula becomes:

    \operatorname{var}(S_t^{(n)}) = \frac{1}{18}\left( n(n-1)(2n+5) - \sum_{k=1}^{g} n_k (n_k - 1)(2 n_k + 5) \right)

with g the number of tied sequences and n_k the number of data points in the kth tied sequence.
³² It is computed with a uniform moving average of 10 days.


Figure 19: Trend detection for the S&P 500 index

Figure 20: Trend detection versus trend filtering


3.3 From trend filtering to trend forecasting

There are two possible applications for the trend following problem. First, trend filtering can be used to analyse the past. A noisy signal can be transformed into a smoother signal, which can be interpreted more easily. An ex-post analysis of this kind can, for instance, clearly separate increasing price periods from decreasing price periods. This analysis can be performed on any time series, or even on a random walk. For example, we have reported four simulations of a geometric Brownian motion without drift and with an annual volatility of 20% in Figure 21. In this context, trend filtering could help us to estimate the different trends in the past.

Figure 21: Four simulations of a geometric Brownian motion without drift
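For reference, such paths can be generated with a few lines of Python (a minimal sketch; the discretisation step, horizon and seeds are arbitrary choices, not those used for the figure):

```python
import numpy as np

def simulate_gbm(sigma=0.20, mu=0.0, s0=100.0, n_days=520, dt=1.0 / 260, seed=1):
    """Simulate one geometric Brownian motion path S_t with drift mu and volatility sigma."""
    rng = np.random.default_rng(seed)
    log_returns = (mu - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * rng.standard_normal(n_days)
    return s0 * np.exp(np.cumsum(log_returns))

# Four driftless paths, as in Figure 21
paths = [simulate_gbm(seed=k) for k in range(4)]
```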

On the other hand, trend analysis may be used as a predictive tool. Prediction is a much more ambitious objective than analysing the past. It cannot be performed on any time series. For instance, trend following predictions suppose that the last observed trend influences future returns. More precisely, these predictors suppose that positive (or negative) trends are more likely to be followed by positive (or negative) returns. Such an assumption has to be tested empirically. For example, it is obvious that the time series in Figure 21 exhibit certain trends, whereas we know that there is no trend in a geometric Brownian motion without drift. Thus, we may still observe some trends in an ex-post analysis. It does not mean, however, that trends will persist in the future. The persistence of trends is tested here in a simple framework for major financial indices³³. For each of these indices the one-month returns are separated into two sets. The first set includes one-month returns that immediately follow a positive three-month return (this is negative for the second set). The average one-month return is computed for each of these two sets, and the results are given in Table 3. These results clearly show that, on average, higher returns can be expected after a positive three-month return than after a negative three-month period. Therefore, observation of the current trend may have a predictive value for the indices under consideration. Moreover, we consider the distribution of the one-month returns, based on past three-month returns. Figure 22 illustrates the case of the GSCI index. In the first quadrant, the one-month returns are divided into two sets, depending on whether the previous three-month return is positive or negative. The cumulative distributions of these two sets are shown. In the second quadrant, we consider, on the one hand, the distribution of one-month returns following a three-month return below −5% and, on the other hand, the distribution of returns following a three-month return exceeding +5%. The same procedure is repeated in the other quadrants, for a 10% and a 15% threshold. This simple test illustrates the usefulness of trend following strategies. Here, trends seem persistent enough to study such strategies. Of course, on other time scales or for other assets, one may obtain opposite results that would support contrarian strategies.

³³ The study period begins in January 1995 (January 1999 for the MSCI EM) and finishes in October 2011.

Figure 22: Distribution of the conditional standardised monthly return

Table 3: Average one-month conditional return based on past trends

    Trend         Eurostoxx 50   S&P 500   MSCI WORLD   MSCI EM   TOPIX   EUR/USD   USD/JPY   GSCI
    Positive          1.1%         0.9%       0.6%        1.9%     0.4%     0.2%      0.2%     1.3%
    Negative          0.2%         0.5%      −0.3%       −0.3%    −0.4%    −0.2%     −0.2%    −0.4%
    Difference        0.9%         0.4%       1.0%        2.2%     0.9%     0.4%      0.4%     1.6%
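A stripped-down version of this persistence test can be written in a few lines of Python (a sketch under simplifying assumptions: it takes an array of month-end prices and ignores the exact study period and data sources behind Table 3):

```python
import numpy as np

def trend_persistence(prices):
    """Average one-month return conditional on the sign of the previous
    three-month return. `prices` is a 1-D array of month-end prices."""
    prices = np.asarray(prices, dtype=float)
    r1 = prices[1:] / prices[:-1] - 1.0       # one-month returns
    r3 = prices[3:] / prices[:-3] - 1.0       # three-month returns
    past3, next1 = r3[:-1], r1[3:]            # pair the 3-month return ending at t with the return over [t, t+1]
    pos = next1[past3 > 0].mean()
    neg = next1[past3 < 0].mean()
    return pos, neg, pos - neg
```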


4 Conclusion

The ultimate goal of trend filtering in finance is to design portfolio strategies that may benefit from these trends. But the path between trend measurement and portfolio allocation is not straightforward. It involves studies and explanations that would not fit in this paper. Nevertheless, let us point out some major issues.

Of course, the first problem is the selection of the trend filtering method. This selection may lead to a single procedure or to a pool of methods. The selection of several methods raises the question of an aggregation procedure. This can be done through averaging or dynamic model selection, for instance.

The resulting trend indicator is meant to forecast future asset returns at a given horizon. Intuitively, an investor should buy assets with positive return forecasts and sell assets with negative forecasts. But the size of each long or short position is a quantitative problem that requires a clear investment process. This process should take into account the risk entailed by each position, compared with the expected return. Traditionally, individual risks can be calculated in relation to asset volatility. A correlation matrix can aggregate those individual risks into a global portfolio risk. But in the case of a multi-asset trend following strategy, should we consider the correlation of assets or the correlation of each individual strategy? These may be quite different, as the correlations between strategies are usually smaller than the correlations between assets in absolute terms.

Even when the portfolio risks can be calculated, the distribution of those risks between assets or strategies remains an open problem. Clearly, this distribution should take into account the individual risks, their correlations and the expected return of each asset. But there are many competing allocation procedures, such as Markowitz portfolio theory or risk budgeting methods.

In addition, the total amount of risk in the portfolio must be decided. The average target volatility of the portfolio is closely related to the risk aversion of the final investor. But this total amount of risk may not be constant over time, as some periods could bring higher expected returns than others. For example, some funds do not change the average size of their positions during periods of high market volatility. This increases their risks, but they consider that their return opportunities, even when risk-adjusted, are greater during those periods. On the contrary, some investors reduce their exposure to markets during volatility peaks, in order to limit their potential drawdowns. In any case, a consistent investment process should measure and control the global risk of the portfolio.

These are just a few questions relating to trend following strategies. Many more arise in practical cases, such as execution policies and transaction cost management. Each of these issues must be studied in depth, and re-examined on a regular basis. This is the essence of quantitative management processes.


A Statistical complements

A.1 State space model and Kalman filtering

A state space model is defined by a transition equation and a measurement equation. In the measurement equation, we postulate the relationship between an observable vector and a state vector, while the transition equation describes the generating process of the state variables. The state vector α_t is generated by a first-order Markov process of the form:

    \alpha_t = T_t \alpha_{t-1} + c_t + R_t \eta_t

where α_t is the vector of the m state variables, T_t is an m × m matrix, c_t is an m × 1 vector and R_t is an m × p matrix. The measurement equation of the state-space representation is:

    y_t = Z_t \alpha_t + d_t + \varepsilon_t

where y_t is an n-dimensional time series, Z_t is an n × m matrix and d_t is an n × 1 vector. η_t and ε_t are assumed to be white noise processes of dimensions p and n respectively. These two uncorrelated processes are Gaussian with zero mean and respective covariance matrices Q_t and H_t. α_0 ∼ N(a_0, P_0) describes the initial position of the state vector. We define a_t and a_{t|t−1} as the optimal estimators of α_t based on all the information available respectively at time t and t−1. Let P_t and P_{t|t−1} be the associated covariance matrices³⁴. The Kalman filter consists of the following set of recursive equations (Harvey, 1989):

    a_{t|t-1} = T_t a_{t-1} + c_t
    P_{t|t-1} = T_t P_{t-1} T_t' + R_t Q_t R_t'
    \hat{y}_{t|t-1} = Z_t a_{t|t-1} + d_t
    v_t = y_t - \hat{y}_{t|t-1}
    F_t = Z_t P_{t|t-1} Z_t' + H_t
    a_t = a_{t|t-1} + P_{t|t-1} Z_t' F_t^{-1} v_t
    P_t = (I_m - P_{t|t-1} Z_t' F_t^{-1} Z_t) P_{t|t-1}

where v_t is the innovation process with covariance matrix F_t and ŷ_{t|t−1} = E_{t−1}[y_t]. Harvey (1989) shows that we can obtain a_{t+1|t} directly from a_{t|t−1}:

    a_{t+1|t} = (T_{t+1} - K_t Z_t) a_{t|t-1} + K_t y_t + (c_{t+1} - K_t d_t)

where K_t = T_{t+1} P_{t|t−1} Z_t' F_t^{−1} is the gain matrix. We also have:

    a_{t+1|t} = T_{t+1} a_{t|t-1} + c_{t+1} + K_t (y_t - Z_t a_{t|t-1} - d_t)

Finally, we obtain:

    y_t = Z_t a_{t|t-1} + d_t + v_t
    a_{t+1|t} = T_{t+1} a_{t|t-1} + c_{t+1} + K_t v_t

This system is called the innovation representation.

Let t' be a fixed given date. We define a_{t|t'} = E_{t'}[α_t] and P_{t|t'} = E_{t'}[(a_{t|t'} − α_t)(a_{t|t'} − α_t)'] with t ≤ t'. We have a_{t'|t'} = a_{t'} and P_{t'|t'} = P_{t'}. The Kalman smoother is then defined by the following set of recursive equations:

    P_t^* = P_t T_{t+1}' P_{t+1|t}^{-1}
    a_{t|t'} = a_t + P_t^* (a_{t+1|t'} - a_{t+1|t})
    P_{t|t'} = P_t + P_t^* (P_{t+1|t'} - P_{t+1|t}) P_t^{*'}

³⁴ We have a_t = E_t[α_t], a_{t|t−1} = E_{t−1}[α_t], P_t = E_t[(a_t − α_t)(a_t − α_t)'] and P_{t|t−1} = E_{t−1}[(a_{t|t−1} − α_t)(a_{t|t−1} − α_t)'], where E_t indicates the conditional expectation operator.
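As an illustration of these recursions, here is a minimal Python sketch of the Kalman filter for a univariate observation; the local linear trend specification (choice of T, Z, Q and H, with R set to the identity) is an assumption made for the example, not part of the text above:

```python
import numpy as np

def kalman_filter(y, T, Z, Q, H, a0, P0, c=None, d=0.0):
    """Time-invariant Kalman filter for a univariate observation, following the
    recursions above. Returns the filtered states a_t and the innovations v_t."""
    m = len(a0)
    c = np.zeros(m) if c is None else np.asarray(c, dtype=float)
    a, P = np.asarray(a0, dtype=float), np.asarray(P0, dtype=float)
    states, innovations = [], []
    for yt in y:
        a_pred = T @ a + c                        # a_{t|t-1}
        P_pred = T @ P @ T.T + Q                  # P_{t|t-1}
        v = float(yt - (Z @ a_pred)[0] - d)       # innovation v_t
        F = (Z @ P_pred @ Z.T + H)[0, 0]          # innovation variance F_t
        K = (P_pred @ Z.T).flatten() / F          # P_{t|t-1} Z' F^{-1}
        a = a_pred + K * v                        # a_t
        P = P_pred - np.outer(K, Z @ P_pred)      # P_t
        states.append(a.copy())
        innovations.append(v)
    return np.array(states), np.array(innovations)

# Example: local linear trend model (level + slope) with hypothetical noise variances
T = np.array([[1.0, 1.0], [0.0, 1.0]])
Z = np.array([[1.0, 0.0]])
Q = np.diag([1e-4, 1e-6])
H = np.array([[1e-2]])
rng = np.random.default_rng(0)
y = np.cumsum(0.01 + 0.1 * rng.standard_normal(250))              # noisy trending series
states, _ = kalman_filter(y, T, Z, Q, H, a0=[y[0], 0.0], P0=np.eye(2))
trend_slope = states[:, 1]                                        # filtered slope estimate
```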

A.2 L1 filtering

A.2.1 The dual problem

The L1 filtering problem can be solved by considering the dual problem, which is a QP programme. We first rewrite the primal problem with a new variable z = Dx̂:

    min   \frac{1}{2} \|y - \hat{x}\|_2^2 + \lambda \|z\|_1
    u.c.  z = D\hat{x}

We now construct the Lagrangian function with the dual variable ν ∈ R^{n−2}:

    L(\hat{x}, z, \nu) = \frac{1}{2} \|y - \hat{x}\|_2^2 + \lambda \|z\|_1 + \nu' (D\hat{x} - z)

The dual objective function is obtained in the following way:

    \inf_{\hat{x}, z} L(\hat{x}, z, \nu) = -\frac{1}{2} \nu' D D' \nu + y' D' \nu

for −λ1 ≤ ν ≤ λ1. According to the Kuhn-Tucker theorem, the initial problem is equivalent to the dual problem:

    min   \frac{1}{2} \nu' D D' \nu - y' D' \nu
    u.c.  -\lambda 1 \leq \nu \leq \lambda 1

This QP programme can be solved by a traditional Newton algorithm or by interior-point methods, and finally, the solution of the trend is:

    \hat{x} = y - D' \nu
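In practice, the equivalent primal problem can also be handed to a generic convex solver. The sketch below uses the cvxpy package (an assumed dependency, not referenced in the text), with D taken as the second-difference operator of the L1-T filter:

```python
import numpy as np
import cvxpy as cp

def l1_trend_filter(y, lam):
    """Solve min_x 0.5*||y - x||_2^2 + lam*||D x||_1 with D the second-difference
    operator (L1-T filtering)."""
    n = len(y)
    D = np.zeros((n - 2, n))                 # second-difference matrix of size (n-2) x n
    for i in range(n - 2):
        D[i, i], D[i, i + 1], D[i, i + 2] = 1.0, -2.0, 1.0
    x = cp.Variable(n)
    objective = cp.Minimize(0.5 * cp.sum_squares(y - x) + lam * cp.norm(D @ x, 1))
    cp.Problem(objective).solve()
    return x.value

# Example: noisy piecewise-linear signal
rng = np.random.default_rng(0)
t = np.arange(200)
signal = np.concatenate([0.05 * t[:100], 5.0 - 0.03 * (t[100:] - 100)])
trend = l1_trend_filter(signal + rng.normal(0, 0.5, 200), lam=50.0)
```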

A.2.2 Solving using interior-point algorithms

We briefly present the interior-point algorithm of Boyd and Vandenberghe (2009) in the case of the following optimisation problem:

    min   f_0(\theta)
    u.c.  A\theta = b
          f_i(\theta) < 0   for i = 1, ..., m

where f_0, ..., f_m : R^n → R are convex and twice continuously differentiable and rank(A) = p < n. The inequality constraints become implicit if the problem is rewritten as:

    min   f_0(\theta) + \sum_{i=1}^{m} I_-(f_i(\theta))
    u.c.  A\theta = b

where I_−(u) : R → R is the non-positive indicator function³⁵. This indicator function is discontinuous, so the Newton method cannot be applied. In order to overcome this problem, we approximate I_−(u) using the logarithmic barrier function Î_−(u) = −τ^{−1} ln(−u) with τ → ∞. Finally, the Kuhn-Tucker conditions for this approximated problem give r_τ(θ, λ, ν) = 0 with:

    r_\tau(\theta, \lambda, \nu) = \begin{pmatrix} \nabla f_0(\theta) + \nabla f(\theta)' \lambda + A' \nu \\ -\operatorname{diag}(\lambda) f(\theta) - \tau^{-1} 1 \\ A\theta - b \end{pmatrix}

The solution of r_τ(θ, λ, ν) = 0 can be obtained using Newton's iteration for the triple π = (θ, λ, ν):

    r_\tau(\pi + \Delta\pi) \simeq r_\tau(\pi) + \nabla r_\tau(\pi) \Delta\pi = 0

This equation gives the Newton step Δπ = −∇r_τ(π)^{−1} r_τ(π), which defines the search direction.

³⁵ We have:

    I_-(u) = \begin{cases} 0 & u \leq 0 \\ \infty & u > 0 \end{cases}
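To make the barrier idea concrete, the following sketch applies it to the box-constrained dual QP of the previous subsection. It is a toy implementation under simplifying assumptions (fixed numbers of outer and Newton iterations, naive backtracking instead of a proper line search), not the algorithm of Boyd and Vandenberghe in full:

```python
import numpy as np

def barrier_qp_box(Q, c, lam, tau=1.0, mu=10.0, n_outer=8, n_newton=25, tol=1e-8):
    """Log-barrier method for  min 0.5*v'Qv - c'v  subject to  -lam <= v <= lam,
    i.e. the box-constrained dual QP of the L1 filter (illustrative sketch only)."""
    n = len(c)
    v = np.zeros(n)                                   # strictly feasible start (lam > 0)
    for _ in range(n_outer):
        for _ in range(n_newton):
            # gradient and Hessian of 0.5*v'Qv - c'v - (1/tau)*sum(log(lam-v) + log(lam+v))
            g = Q @ v - c + (1.0 / tau) * (1.0 / (lam - v) - 1.0 / (lam + v))
            H = Q + np.diag((1.0 / tau) * (1.0 / (lam - v) ** 2 + 1.0 / (lam + v) ** 2))
            if np.linalg.norm(g) < tol:
                break
            step = np.linalg.solve(H, -g)             # Newton search direction
            t = 1.0
            while np.any(np.abs(v + t * step) >= lam):
                t *= 0.5                              # backtrack to stay strictly inside the box
            v = v + t * step
        tau *= mu                                     # sharpen the barrier
    return v
```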

A.2.3 The multivariate case

In the multivariate case, the primal problem is:

    min   \frac{1}{2} \sum_{j=1}^{m} \| y^{(j)} - \hat{x} \|_2^2 + \lambda \|z\|_1
    u.c.  z = D\hat{x}

The dual objective function becomes:

    \inf_{\hat{x}, z} L(\hat{x}, z, \nu) = -\frac{1}{2} \nu' D D' \nu + \bar{y}' D' \nu + \frac{1}{2} \sum_{j=1}^{m} (y^{(j)} - \bar{y})' (y^{(j)} - \bar{y})

for −λ1 ≤ ν ≤ λ1. According to the Kuhn-Tucker theorem, the initial problem is equivalent to the dual problem:

    min   \frac{1}{2} \nu' D D' \nu - \bar{y}' D' \nu
    u.c.  -\lambda 1 \leq \nu \leq \lambda 1

The solution is then x̂ = ȳ − D'ν.

A.2.4 The scaling of the smoothing parameter

We can attempt to estimate the order of magnitude of the parameter λ_max by considering the continuous case. We assume that the signal is a process W_t. The value of λ_max in the discrete case is defined by:

    \lambda_{\max} = \left\| (D D')^{-1} D y \right\|_\infty

y can be considered as the first primitive I_1(T) = ∫_0^T W_t dt of the process W_t if D = D_1 (L1-C filtering) or the second primitive I_2(T) = ∫_0^T ∫_0^t W_s ds dt of W_t if D = D_2 (L1-T filtering). We have:

    I_1(T) = \int_0^T W_t \, dt = W_T T - \int_0^T t \, dW_t = \int_0^T (T - t) \, dW_t

The process I_1(T) is a Wiener integral (or a Gaussian process) with variance:

    \mathbb{E}\big[I_1^2(T)\big] = \int_0^T (T - t)^2 \, dt = \frac{T^3}{3}

In this case, we expect that λ_max ∼ T^{3/2}. The second-order primitive can be calculated in the following way:

    I_2(T) = \int_0^T I_1(t) \, dt
           = I_1(T) T - \int_0^T t \, dI_1(t)
           = I_1(T) T - \int_0^T t W_t \, dt
           = I_1(T) T - \frac{T^2}{2} W_T + \int_0^T \frac{t^2}{2} \, dW_t
           = -\frac{T^2}{2} W_T + \int_0^T \left( T^2 - T t + \frac{t^2}{2} \right) dW_t
           = \frac{1}{2} \int_0^T (T - t)^2 \, dW_t

This quantity is again a Gaussian process with variance:

    \mathbb{E}\big[I_2^2(T)\big] = \frac{1}{4} \int_0^T (T - t)^4 \, dt = \frac{T^5}{20}

In this case, we expect that λ_max ∼ T^{5/2}.
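In the discrete case, the defining formula above translates directly into code; the sketch below assumes D is given as a dense numpy array, which is fine for moderate sample sizes:

```python
import numpy as np

def second_difference(n):
    """Second-difference operator D_2 of size (n-2) x n used in L1-T filtering."""
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i], D[i, i + 1], D[i, i + 2] = 1.0, -2.0, 1.0
    return D

def lambda_max(y, D):
    """lambda_max = ||(D D')^{-1} D y||_inf, the largest useful value of the
    regularisation parameter for the L1 filter."""
    return np.linalg.norm(np.linalg.solve(D @ D.T, D @ y), ord=np.inf)
```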

A.3 Wavelet analysis

The time analysis can detect anomalies in time series, such as a market crash on a specific date. The frequency analysis detects repeated sequences in a signal. The double-dimension analysis makes it possible to coordinate time and frequency detection, as we use a larger time window with a smaller frequency interval (see Figure 23). In this area, the uncertainty of localisation is 1/dt, with dt the sampling step and f = 1/dt the sampling frequency. The wavelet transform can be a solution for analysing time series in the time-frequency dimension. The first wavelet approach appeared in the early eighties in seismic data analysis. The term wavelet was introduced in the scientific community by Grossmann and Morlet (1984). Since 1986, a great deal of theoretical research on wavelets has been developed. The wavelet transform uses a basic function, called the mother wavelet, then dilates and translates it to capture features that are local in time and frequency. The distribution of the time-frequency domain with respect to the wavelet transform is long in time when capturing low frequency events and long in frequency when capturing high frequency events. As an example, we represent some mother wavelets in Figure 24. The aim of wavelet analysis is to separate signal trends and details. These different components can be distinguished by different levels of resolution or different sizes/scales of detail. In this sense, it generates a phase space decomposition which is defined by two


Figure 23: Time-frequency dimension

Figure 24: Some mother wavelets


parameters (scale and location), in opposition to a Fourier decomposition. A wavelet ψ(t) is a function of time t such that:

    \int_{-\infty}^{+\infty} \psi(t) \, dt = 0
    \int_{-\infty}^{+\infty} |\psi(t)|^2 \, dt = 1

The continuous wavelet transform is a function of two variables W(u, s) and is given by projecting the time series x(t) onto a particular wavelet ψ:

    W(u, s) = \int_{-\infty}^{+\infty} x(t) \, \psi_{u,s}(t) \, dt

with:

    \psi_{u,s}(t) = \frac{1}{\sqrt{s}} \, \psi\!\left( \frac{t - u}{s} \right)

which corresponds to the mother wavelet translated by u (location parameter) and dilated by s (scale parameter). If the wavelet satisfies the previous properties, the inverse operation may be performed to produce the original signal from its wavelet coefficients:

    x(t) = \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} W(u, s) \, \psi_{u,s}(t) \, du \, ds

The continuous wavelet transform of a time series signal x(t) gives an infinite number of coefficients W(u, s), where u ∈ R and s ∈ R⁺, but many coefficients are close or equal to zero. The discrete wavelet transform can be used to decompose a signal into a finite number of coefficients, where we use s = 2^{−j} as the scale parameter and u = k 2^{−j} as the location parameter with j ∈ Z and k ∈ Z. Therefore ψ_{u,s}(t) becomes:

    \psi_{j,k}(t) = 2^{j/2} \, \psi(2^j t - k)

where j = 1, 2, ..., J in a J-level decomposition. The wavelet representation of a discrete signal x(t) is given by:

    x(t) = s_{(0)} \phi(t) + \sum_{j=0}^{J-1} \sum_{k=0}^{2^j - 1} d_{(j),k} \, \psi_{j,k}(t)

where φ(t) = 1 if t ∈ [0, 1] and J is the number of multi-resolution levels. Therefore, computing the wavelet transform of the discrete signal is equivalent to computing the smooth coefficient s_(0) and the detail coefficients d_(j),k. Introduced by Mallat (1989), the multi-scale analysis corresponds to the following iterative scheme:

    x → s → ss → sss → ssss
      ↘ d   ↘ sd   ↘ ssd   ↘ sssd


where the high-pass filter defines the details of the data and the low-pass filter defines the smoothed signal. In this example, we obtain these wavelet coefficients:

    W = (ssss, sssd, ssd, sd, d)'

Applying this pyramidal algorithm to the time series signal up to the J resolution level gives us the wavelet coefficients:

    W = (s_{(0)}, d_{(0)}, d_{(1)}, \ldots, d_{(J-1)})'
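In practice, this pyramidal decomposition is readily available in the PyWavelets package (an assumed dependency); the sketch below extracts a trend by keeping the smooth coefficients and zeroing the detail coefficients:

```python
import numpy as np
import pywt  # PyWavelets

def wavelet_trend(y, wavelet="db4", level=5):
    """Multi-resolution trend extraction: keep the smooth coefficients s(0),
    discard the detail coefficients d(j),k and reconstruct the signal."""
    y = np.asarray(y, dtype=float)
    coeffs = pywt.wavedec(y, wavelet, level=level)
    denoised = [coeffs[0]] + [np.zeros_like(d) for d in coeffs[1:]]
    return pywt.waverec(denoised, wavelet)[: len(y)]

# Example on a simulated random walk
rng = np.random.default_rng(0)
trend = wavelet_trend(np.cumsum(rng.standard_normal(512)))
```

The wavelet family ("db4") and the number of levels are user choices that play the same calibration role as the window length in the other filters.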

A.4 Support vector machine

The support vector machine is an important part of statistical learning theory (Hastie et al., 2009). It was first introduced by Boser et al. (1992) and has been used in various domains such as pattern recognition, biometrics, etc. This technique can be employed in different contexts such as classification, regression or density estimation (see Vapnik, 1998). Recently, applications in finance have been developed in two main directions. The first employs the SVM as a nonlinear estimator in order to forecast the trend or volatility of financial assets. In this context, the SVM is used as a regression technique, with the possibility of extension to nonlinear cases thanks to the kernel approach. The second direction consists of using the SVM as a classification technique which aims to define the stock selection in trading strategies.

A.4.1 SVM in a nutshell

We illustrate here the basic idea of the SVM as a classification method. Let us define the training data set consisting of n pairs of "input/output" points (x_i, y_i) where x_i ∈ X and y_i ∈ {−1, 1}. The idea of linear classification is to look for a possible hyperplane that can separate {x_i} ⊂ X into two classes corresponding to the labels y_i = ±1. It consists of constructing a linear discriminant function h(x) = w'x + b where w is the vector of weights and b is called the bias. The hyperplane is then defined by the following equation:

    H = \{ x : h(x) = w'x + b = 0 \}

The vector w is interpreted as the normal vector to the hyperplane. We denote its norm ‖w‖ and its direction ŵ = w/‖w‖. In Figure 25, we give a geometric interpretation of the margin in the linear case. Let x₊ and x₋ be the closest points to the hyperplane from the positive side and negative side. These points determine the margin to the boundary from which the two classes of points D are separated:

    m_D(h) = \frac{1}{2} \hat{w}' (x_+ - x_-) = \frac{1}{\|w\|}


Figure 25: Geometric interpretation of the margin in a linear SVM

The main idea of a maximum margin classifier is to determine the hyperplane that maximises the margin. For a separable dataset, the margin SVM is defined by the following optimisation problem:

    min_{w,b}   \frac{1}{2} \|w\|^2
    u.c.        y_i (w' x_i + b) \geq 1   for i = 1, ..., n

The historical approach to solving this quadratic problem with nonlinear constraints is to map the primal problem to the dual problem:

    max_\alpha  \sum_{i=1}^{n} \alpha_i - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j \, x_i' x_j
    u.c.        \alpha_i \geq 0   for i = 1, ..., n

Because of the Kuhn-Tucker conditions, the optimised solution (w⋆, b⋆) of the primal problem is given by w⋆ = Σ_{i=1}^{n} α_i⋆ y_i x_i, where α⋆ = (α_1⋆, ..., α_n⋆) is the solution of the dual problem. We notice that the linear SVM depends on the input data via the inner product. An intelligent way to extend the SVM formalism to the nonlinear case is then to replace the inner product with a nonlinear kernel. Hence, the nonlinear SVM dual problem can be obtained by systematically replacing the inner product x_i'x_j by a general kernel K(x_i, x_j). Some standard kernels are widely used in pattern recognition, for example polynomial, radial basis or neural


network kernels³⁶. Finally, the decision/prediction function is given by:

    f(x) = \operatorname{sgn}(h(x)) = \operatorname{sgn}\left( \sum_{i=1}^{n} \alpha_i^\star y_i K(x, x_i) + b^\star \right)
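As an illustration, this dual formulation is what standard libraries solve internally; below is a minimal sketch with scikit-learn (an assumed dependency) using the radial basis kernel. Note that SVC implements the soft-margin variant, so the regularisation parameter C is an addition not discussed in the text above:

```python
import numpy as np
from sklearn.svm import SVC

# Toy training set: two Gaussian clouds labelled -1 / +1
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 0.8, (50, 2)), rng.normal(+1.0, 0.8, (50, 2))])
y = np.array([-1] * 50 + [+1] * 50)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # soft-margin SVM with an RBF kernel
clf.fit(X, y)
print(clf.predict(np.array([[0.5, 0.5], [-0.5, -1.0]])))   # decision f(x) = sgn(h(x))
```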

A.4.2 SVM regression

In the last discussion, we presented the basic idea of the SVM in the classification context. We now show how the regression problem can be interpreted as an SVM problem. In the general framework of statistical learning, the SVM problem consists of minimising the risk function R(f), which depends on the form of the prediction function f(x). The risk function is calculated via the loss function L(f(x), y), which clearly defines our objective (classification or regression):

    R(f) = \int L(f(x), y) \, dP(x, y)

where the distribution P(x, y) can be computed by the empirical distribution³⁷ or an approximated distribution³⁸. For the regression problem, the loss function is simply defined as L(f(x), y) = (f(x) − y)² or L(f(x), y) = |f(x) − y|^p in the case of the L_p norm. We have seen that the linear SVM is a special case of the nonlinear SVM within the kernel approach. We therefore consider the nonlinear case directly, where the approximate function of the regression has the form f(x) = w'φ(x) + b. In the VRM framework, we assume that P(x, y) is a Gaussian noise with variance σ²:

    R(f) = \frac{1}{n} \sum_{i=1}^{n} |f(x_i) - y_i|^p + \sigma^2 \|w\|^2

We introduce the variable ξ = (ξ_1, ..., ξ_n), which satisfies y_i = f(x_i) + ξ_i. The optimisation problem of the risk function can now be written as a QP programme with nonlinear constraints:

    min_{w,b,\xi}   \frac{1}{2} \|w\|^2 + (2 n \sigma^2)^{-1} \sum_{i=1}^{n} |\xi_i|^p
    u.c.            y_i = w' \phi(x_i) + b + \xi_i   for i = 1, ..., n

In the present form, the regression looks very similar to the SVM classification problem and can be solved in the same way by mapping to the dual problem. We notice that the SVM regression can be easily generalised in two possible ways:

1. by introducing a more general loss function, such as the ε-SV regression proposed by Vapnik (1998);
2. by using a weighting distribution ω for the empirical distribution:

    dP(x, y) = \sum_{i=1}^{n} \omega_i \, \delta_{x_i}(x) \, \delta_{y_i}(y)

³⁶ We have, respectively, K(x_i, x_j) = (x_i'x_j + 1)^p, K(x_i, x_j) = exp(−‖x_i − x_j‖² / (2σ²)) or K(x_i, x_j) = tanh(a x_i'x_j − b).
³⁷ This framework, called ERM, was first introduced by Vapnik and Chervonenkis (1991).
³⁸ This framework is called VRM (Chapelle, 2002).


As financial series have short memory and depend more on the recent past, an asymmetric weight distribution focusing on recent data would improve the prediction³⁹. The dual problem in the case p = 1 is given by:

    max_\alpha  \alpha' y - \frac{1}{2} \alpha' K \alpha
    u.c.        \alpha' 1 = 0
                |\alpha| \leq (2 n \sigma^2)^{-1} 1

³⁹ See Gestel et al. (2001) and Tay and Cao (2002).

As previously, the optimal vector α⋆ is obtained by solving the QP programme. We then deduce that w⋆ = Σ_{i=1}^{n} α_i⋆ φ(x_i) and b is computed using the Kuhn-Tucker condition:

    w^{\star\prime} \phi(x_i) + b - y_i = 0

for the support vectors (x_i, y_i). In order to achieve a good level of accuracy for the estimation of b, we average over the set of support vectors and obtain b⋆. The SVM regressor is then given by the following formula:

    f(x) = \sum_{i=1}^{n} \alpha_i^\star K(x, x_i) + b^\star

with K(x, x_i) = φ(x)'φ(x_i). In Figure 26, we apply SVM regression with the Gaussian kernel to the S&P 500 index. The kernel parameter σ characterises the estimation horizon, which is equivalent to the period n in the moving average regression.
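A short sketch of such a filter is given below. It uses scikit-learn's SVR (an assumed stand-in: SVR implements the ε-insensitive loss mentioned above as the ε-SV generalisation, with a Gaussian/RBF kernel), regressing log-prices on time so that the fitted curve plays the role of the filtered trend:

```python
import numpy as np
from sklearn.svm import SVR

def svm_filter(prices, gamma=1e-3, C=10.0, epsilon=0.1):
    """Fit an RBF-kernel support vector regression of log-prices on time;
    the fitted curve is used as the filtered trend."""
    y = np.log(np.asarray(prices, dtype=float))
    t = np.arange(len(y)).reshape(-1, 1)
    model = SVR(kernel="rbf", gamma=gamma, C=C, epsilon=epsilon)
    model.fit(t, y)
    return np.exp(model.predict(t))
```

The kernel width gamma plays the role of the estimation horizon σ discussed above: smaller values give smoother, longer-horizon trends.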

A.5 Singular spectrum analysis

In recent years the singular spectrum analysis (SSA) technique has been developed as a time-frequency domain method⁴⁰. It consists of decomposing a time series into a trend, oscillatory components and a noise. The method is based on the principal component analysis of the auto-covariance matrix of the time series y = (y_1, ..., y_t). Let n be the window length such that n = t − m + 1 with m < t/2. We define the n × m Hankel matrix H as the matrix of the m concatenated lag vectors of y:

    H = \begin{pmatrix}
    y_1 & y_2 & y_3 & \cdots & y_m \\
    y_2 & y_3 & y_4 & \cdots & y_{m+1} \\
    y_3 & y_4 & y_5 & \cdots & \vdots \\
    \vdots & \vdots & \vdots & \ddots & y_{t-1} \\
    y_n & y_{n+1} & y_{n+2} & \cdots & y_t
    \end{pmatrix}

We recover the time series y by diagonal averaging:

    y_p = \frac{1}{\alpha_p} \sum_{j=1}^{m} H^{(i,j)}    (10)

⁴⁰ Introduced by Broomhead and King (1986).


Figure 26: SVM filtering

where i = p − j + 1, 0 < i < n + 1 and:

    \alpha_p = \begin{cases} p & \text{if } p < m \\ t - p + 1 & \text{if } p > t - m + 1 \\ m & \text{otherwise} \end{cases}

This relationship seems trivial because each H^{(i,j)} is equal to y_p with respect to the conditions on i and j. But this equality no longer holds if we apply factor analysis. Let C = H'H be the covariance matrix of H. By performing the eigenvalue decomposition C = VΛV', we can deduce the corresponding principal components:

    P_k = H V_k

where V_k is the matrix of the first k eigenvectors of C. Let us now define the n × m matrix Ĥ as follows:

    \hat{H} = P_k V_k'

We have Ĥ = H if all the components are selected. If k < m, we have removed the noise and the trend x̂ is estimated by applying the diagonal averaging procedure (10) to the matrix Ĥ. We have applied the singular spectrum decomposition to the S&P 500 index with different lags m. For each lag, we compute the Hankel matrix H, then deduce the matrix Ĥ using only the first eigenvector (k = 1) and estimate the corresponding trend. Results are given in Figure 27. As for other methods, such as nonlinear filters, the calibration depends on the parameter m, which controls the window length.
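A compact Python sketch of this procedure (embedding, rank-k truncation via the SVD of H, which is equivalent to the eigendecomposition of C = H'H used above, and diagonal averaging) could look as follows; the choice of m and k is left to the user:

```python
import numpy as np

def ssa_trend(y, m=20, k=1):
    """Basic SSA trend extraction: build the Hankel matrix, keep the first k
    principal components and reconstruct by diagonal averaging."""
    y = np.asarray(y, dtype=float)
    t = len(y)
    n = t - m + 1
    H = np.column_stack([y[j:j + n] for j in range(m)])    # n x m Hankel matrix, H[i, j] = y[i + j]
    U, s, Vt = np.linalg.svd(H, full_matrices=False)       # H = U diag(s) V'
    H_hat = (U[:, :k] * s[:k]) @ Vt[:k, :]                 # rank-k approximation of H
    # diagonal averaging: average H_hat over the anti-diagonals i + j = constant
    trend = np.zeros(t)
    counts = np.zeros(t)
    for j in range(m):
        trend[j:j + n] += H_hat[:, j]
        counts[j:j + n] += 1.0
    return trend / counts
```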


Figure 27: SSA filtering


References

[1] Alexandrov T., Bianconcini S., Dagum E.B., Maass P. and McElroy T. (2008), A Review of Some Modern Approaches to the Problem of Trend Extraction, US Census Bureau, RRS #2008/03.
[2] Antoniadis A., Gregoire G. and McKeague I.W. (1994), Wavelet Methods for Curve Estimation, Journal of the American Statistical Association, 89(428), pp. 1340-1353.
[3] Barberis N. and Thaler T. (2002), A Survey of Behavioral Finance, NBER Working Paper, 9222.
[4] Beveridge S. and Nelson C.R. (1981), A New Approach to the Decomposition of Economic Time Series into Permanent and Transitory Components with Particular Attention to Measurement of the Business Cycle, Journal of Monetary Economics, 7(2), pp. 151-174.
[5] Boser B.E., Guyon I.M. and Vapnik V. (1992), A Training Algorithm for Optimal Margin Classifier, Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 114-152.
[6] Boyd S. and Vandenberghe L. (2009), Convex Optimization, Cambridge University Press.
[7] Brockwell P.J. and Davis R.A. (2003), Introduction to Time Series and Forecasting, Springer.
[8] Broomhead D.S. and King G.P. (1986), On the Qualitative Analysis of Experimental Dynamical Systems, in Sarkar S. (ed.), Nonlinear Phenomena and Chaos, Adam Hilger, pp. 113-144.
[9] Brown S.J., Goetzmann W.N. and Kumar A. (1998), The Dow Theory: William Peter Hamilton's Track Record Reconsidered, Journal of Finance, 53(4), pp. 1311-1333.
[10] Burch N., Fishback P.E. and Gordon R. (2005), The Least-Squares Property of the Lanczos Derivative, Mathematics Magazine, 78(5), pp. 368-378.
[11] Carhart M.M. (1997), On Persistence in Mutual Fund Performance, Journal of Finance, 52(1), pp. 57-82.
[12] Chan L.K.C., Jegadeesh N. and Lakonishok J. (1996), Momentum Strategies, Journal of Finance, 51(5), pp. 1681-1713.
[13] Chang Y., Miller J.I. and Park J.Y. (2009), Extracting a Common Stochastic Trend: Theory with Some Applications, Journal of Econometrics, 150(2), pp. 231-247.
[14] Chapelle O. (2002), Support Vector Machine: Induction Principles, Adaptive Tuning and Prior Knowledge, PhD thesis, University of Paris 6.
[15] Cleveland W.P. and Tiao G.C. (1976), Decomposition of Seasonal Time Series: A Model for the Census X-11 Program, Journal of the American Statistical Association, 71(355), pp. 581-587.
[16] Cleveland W.S. (1979), Robust Locally Weighted Regression and Smoothing Scatterplots, Journal of the American Statistical Association, 74(368), pp. 829-836.


[17] Cleveland W.S. and Devlin S.J. (1988), Locally Weighted Regression: An Approach to Regression Analysis by Local Fitting, Journal of the American Statistical Association, 83(403), pp. 596-610.
[18] Cochrane J. (2001), Asset Pricing, Princeton University Press.
[19] Cortes C. and Vapnik V. (1995), Support-Vector Networks, Machine Learning, 20(3), pp. 273-297.
[20] D'Aspremont A. (2011), Identifying Small Mean Reverting Portfolios, Quantitative Finance, 11(3), pp. 351-364.
[21] Daubechies I. (1992), Ten Lectures on Wavelets, SIAM.
[22] Daubechies I., Defrise M. and De Mol C. (2004), An Iterative Thresholding Algorithm for Linear Inverse Problems with a Sparsity Constraint, Communications on Pure and Applied Mathematics, 57(11), pp. 1413-1457.
[23] Donoho D.L. (1995), De-Noising by Soft-Thresholding, IEEE Transactions on Information Theory, 41(3), pp. 613-627.
[24] Donoho D.L. and Johnstone I.M. (1994), Ideal Spatial Adaptation via Wavelet Shrinkage, Biometrika, 81(3), pp. 425-455.
[25] Donoho D.L. and Johnstone I.M. (1995), Adapting to Unknown Smoothness via Wavelet Shrinkage, Journal of the American Statistical Association, 90(432), pp. 1200-1224.
[26] Doucet A., De Freitas N. and Gordon N. (2001), Sequential Monte Carlo in Practice, Springer.
[27] Ehlers J.F. (2001), Rocket Science for Traders: Digital Signal Processing Applications, John Wiley & Sons.
[28] Elton E.J. and Gruber M.J. (1972), Earnings Estimates and the Accuracy of Expectational Data, Management Science, 18(8), pp. 409-424.
[29] Engle R.F. and Granger C.W.J. (1987), Co-Integration and Error Correction: Representation, Estimation, and Testing, Econometrica, 55(2), pp. 251-276.
[30] Fama E. (1970), Efficient Capital Markets: A Review of Theory and Empirical Work, Journal of Finance, 25(2), pp. 383-417.
[31] Flandrin P., Rilling G. and Goncalves P. (2004), Empirical Mode Decomposition as a Filter Bank, Signal Processing Letters, 11(2), pp. 112-114.
[32] Fliess M. and Join C. (2009), A Mathematical Proof of the Existence of Trends in Financial Time Series, in El Jai A., Afifi L. and Zerrik E. (eds), Systems Theory: Modeling, Analysis and Control, Presses Universitaires de Perpignan, pp. 43-62.
[33] Fuentes M. (2002), Spectral Methods for Nonstationary Spatial Processes, Biometrika, 89(1), pp. 197-210.
[34] Gençay R., Selçuk F. and Whitcher B. (2002), An Introduction to Wavelets and Other Filtering Methods in Finance and Economics, Academic Press.


[35] Gestel T.V., Suykens J.A.K., Baestaens D., Lambrechts A., Lanckriet G., Vandaele B., De Moor B. and Vandewalle J. (2001), Financial Time Series Prediction Using Least Squares Support Vector Machines Within the Evidence Framework, IEEE Transactions on Neural Networks, 12(4), pp. 809-821.
[36] Golyandina N., Nekrutkin V.V. and Zhigljavsky A.A. (2001), Analysis of Time Series Structure: SSA and Related Techniques, Chapman & Hall, CRC.
[37] Gonzalo J. and Granger C.W.J. (1995), Estimation of Common Long-Memory Components in Cointegrated Systems, Journal of Business & Economic Statistics, 13(1), pp. 27-35.
[38] Grinblatt M., Titman S. and Wermers R. (1995), Momentum Investment Strategies, Portfolio Performance, and Herding: A Study of Mutual Fund Behavior, American Economic Review, 85(5), pp. 1088-1105.
[39] Groetsch C.W. (1998), Lanczos' Generalized Derivative, American Mathematical Monthly, 105(4), pp. 320-326.
[40] Grossmann A. and Morlet J. (1984), Decomposition of Hardy Functions into Square Integrable Wavelets of Constant Shape, SIAM Journal of Mathematical Analysis, 15, pp. 723-736.
[41] Härdle W. (1992), Applied Nonparametric Regression, Cambridge University Press.
[42] Harvey A.C. (1989), Forecasting, Structural Time Series Models and the Kalman Filter, Cambridge University Press.
[43] Harvey A.C. and Trimbur T.M. (2003), General Model-Based Filters for Extracting Cycles and Trends in Economic Time Series, Review of Economics and Statistics, 85(2), pp. 244-255.
[44] Hastie T., Tibshirani R. and Friedman R. (2009), The Elements of Statistical Learning, second edition, Springer.
[45] Henderson R. (1916), Note on Graduation by Adjusted Average, Transactions of the Actuarial Society of America, 17, pp. 43-48.
[46] Hodrick R.J. and Prescott E.C. (1997), Postwar U.S. Business Cycles: An Empirical Investigation, Journal of Money, Credit and Banking, 29(1), pp. 1-16.
[47] Holt C.C. (1959), Forecasting Seasonals and Trends by Exponentially Weighted Moving Averages, ONR Research Memorandum, 52, reprinted in International Journal of Forecasting, 2004, 20(1), pp. 5-10.
[48] Hong H. and Stein J.C. (1997), A Unified Theory of Underreaction, Momentum Trading and Overreaction in Asset Markets, NBER Working Paper, 6324.
[49] Johansen S. (1988), Statistical Analysis of Cointegration Vectors, Journal of Economic Dynamics and Control, 12(2-3), pp. 231-254.
[50] Johansen S. (1991), Estimation and Hypothesis Testing of Cointegration Vectors in Gaussian Vector Autoregressive Models, Econometrica, 52(6), pp. 1551-1580.
[51] Kalaba R. and Tesfatsion L. (1989), Time-varying Linear Regression via Flexible Least Squares, Computers & Mathematics with Applications, 17, pp. 1215-1245.


[52] Kalman R.E. (1960), A New Approach to Linear Filtering and Prediction Problems, Transactions of the ASME – Journal of Basic Engineering, 82(D), pp. 35-45.
[53] Kendall M.G. (1973), Time Series, Charles Griffin.
[54] Kim S-J., Koh K., Boyd S. and Gorinevsky D. (2009), ℓ1 Trend Filtering, SIAM Review, 51(2), pp. 339-360.
[55] Kolmogorov A.N. (1941), Interpolation and Extrapolation of Random Sequences, Izvestiya Akademii Nauk SSSR, Seriya Matematicheskaya, 5(1), pp. 3-14.
[56] Macaulay F. (1931), The Smoothing of Time Series, National Bureau of Economic Research.
[57] Mallat S.G. (1989), A Theory for Multiresolution Signal Decomposition: The Wavelet Representation, IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(7), pp. 674-693.
[58] Mann H.B. (1945), Nonparametric Tests against Trend, Econometrica, 13(3), pp. 245-259.
[59] Martin W. and Flandrin P. (1985), Wigner-Ville Spectral Analysis of Nonstationary Processes, IEEE Transactions on Acoustics, Speech and Signal Processing, 33(6), pp. 1461-1470.
[60] Muth J.F. (1960), Optimal Properties of Exponentially Weighted Forecasts, Journal of the American Statistical Association, 55(290), pp. 299-306.
[61] Oppenheim A.V. and Schafer R.W. (2009), Discrete-Time Signal Processing, third edition, Prentice-Hall.
[62] Peña D. and Box G.E.P. (1987), Identifying a Simplifying Structure in Time Series, Journal of the American Statistical Association, 82(399), pp. 836-843.
[63] Pollock D.S.G. (2006), Wiener-Kolmogorov Filtering, Frequency-Selective Filtering and Polynomial Regression, Econometric Theory, 23, pp. 71-83.
[64] Pollock D.S.G. (2009), Statistical Signal Extraction: A Partial Survey, in Kontoghiorges E. and Belsley D.E. (eds.), Handbook of Empirical Econometrics, John Wiley and Sons.
[65] Rao S.T. and Zurbenko I.G. (1994), Detecting and Tracking Changes in Ozone Air Quality, Journal of Air and Waste Management Association, 44(9), pp. 1089-1092.
[66] Roncalli T. (2010), La Gestion d'Actifs Quantitative, Economica.
[67] Savitzky A. and Golay M.J.E. (1964), Smoothing and Differentiation of Data by Simplified Least Squares Procedures, Analytical Chemistry, 36(8), pp. 1627-1639.
[68] Silverman B.W. (1985), Some Aspects of the Spline Smoothing Approach to Non-Parametric Regression Curve Fitting, Journal of the Royal Statistical Society, B47(1), pp. 1-52.
[69] Sorenson H.W. (1970), Least-Squares Estimation: From Gauss to Kalman, IEEE Spectrum, 7, pp. 63-68.


[70] Stock J.H. and Watson M.W. (1988), Variable Trends in Economic Time Series, Journal of Economic Perspectives, 2(3), pp. 147-174.
[71] Tay F.E.H. and Cao L.J. (2002), Modified Support Vector Machines in Financial Time Series Forecasting, Neurocomputing, 48(1-4), pp. 847-861.
[72] Tibshirani R. (1996), Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society, B58(1), pp. 267-288.
[73] Vapnik V. (1998), Statistical Learning Theory, John Wiley and Sons, New York.
[74] Vapnik V. and Chervonenkis A. (1991), On the Uniform Convergence of Relative Frequency of Events to their Probabilities, Theory of Probability and its Applications, 16(2), pp. 264-280.
[75] Vautard R., Yiou P. and Ghil M. (1992), Singular Spectrum Analysis: A Toolkit for Short, Noisy Chaotic Signals, Physica D, 58(1-4), pp. 95-126.
[76] Wahba G. (1990), Spline Models for Observational Data, CBMS-NSF Regional Conference Series in Applied Mathematics, 59, SIAM.
[77] Wang Y. (1998), Change Curve Estimation via Wavelets, Journal of the American Statistical Association, 93(441), pp. 163-172.
[78] Wiener N. (1949), Extrapolation, Interpolation and Smoothing of Stationary Time Series with Engineering Applications, MIT Technology Press and John Wiley & Sons (originally published in 1941 as a Report on the Services Research Project, DIC-6037).
[79] Whittaker E.T. (1923), On a New Method of Graduation, Proceedings of the Edinburgh Mathematical Society, 41, pp. 63-75.
[80] Winters P.R. (1960), Forecasting Sales by Exponentially Weighted Moving Averages, Management Science, 6(3), pp. 324-342.
[81] Yue S. and Pilon P. (2004), A Comparison of the Power of the t-test, Mann-Kendall and Bootstrap Tests for Trend Detection, Hydrological Sciences Journal, 49(1), pp. 21-37.
[82] Zurbenko I., Porter P.S., Rao S.T., Ku J.K., Gui R. and Eskridge R.E. (1996), Detecting Discontinuities in Time Series of Upper-Air Data: Demonstration of an Adaptive Filter Technique, Journal of Climate, 9(12), pp. 3548-3560.


Lyxor White Paper Series

List of Issues

• Issue #1 – Risk-Based Indexation.

Paul Demey, Sébastien Maillard and Thierry Roncalli, March 2010.

• Issue #2 – Beyond Liability-Driven Investment: New Perspectives on Defined Benefit Pension Fund Management. Benjamin Bruder, Guillaume Jamet and Guillaume Lasserre, March 2010. • Issue #3 – Mutual Fund Ratings and Performance Persistence.

Pierre Hereil, Philippe Mitaine, Nicolas Moussavi and Thierry Roncalli, June 2010.

• Issue #4 – Time Varying Risk Premiums & Business Cycles: A Survey. Serge Darolles, Karl Eychenne and Stéphane Martinetti, September 2010.

• Issue #5 – Portfolio Allocation of Hedge Funds.

Benjamin Bruder, Serge Darolles, Abdul Koudiraty and Thierry Roncalli, January 2011.

• Issue #6 – Strategic Asset Allocation.

Karl Eychenne, Stéphane Martinetti and Thierry Roncalli, March 2011.

• Issue #7 – Risk-Return Analysis of Dynamic Investment Strategies. Benjamin Bruder and Nicolas Gaussel, June 2011.


Disclaimer

Each of this material and its content is confidential and may not be reproduced or provided to others without the express written permission of Lyxor Asset Management ("Lyxor AM"). This material has been prepared solely for informational purposes only and it is not intended to be and should not be considered as an offer, or a solicitation of an offer, or an invitation or a personal recommendation to buy or sell participating shares in any Lyxor Fund, or any security or financial instrument, or to participate in any investment strategy, directly or indirectly. It is intended for use only by those recipients to whom it is made directly available by Lyxor AM. Lyxor AM will not treat recipients of this material as its clients by virtue of their receiving this material. This material reflects the views and opinions of the individual authors at this date and in no way the official position or advices of any kind of these authors or of Lyxor AM and thus does not engage the responsibility of Lyxor AM nor of any of its officers or employees. All performance information set forth herein is based on historical data and, in some cases, hypothetical data, and may reflect certain assumptions with respect to fees, expenses, taxes, capital charges, allocations and other factors that affect the computation of the returns. Past performance is not necessarily a guide to future performance. While the information (including any historical or hypothetical returns) in this material has been obtained from external sources deemed reliable, neither Société Générale ("SG"), Lyxor AM, nor their affiliates, officers or employees guarantee its accuracy, timeliness or completeness. Any opinions expressed herein are statements of our judgment on this date and are subject to change without notice. SG, Lyxor AM and their affiliates assume no fiduciary responsibility or liability for any consequences, financial or otherwise, arising from an investment in any security or financial instrument described herein or in any other security, or from the implementation of any investment strategy. Lyxor AM and its affiliates may from time to time deal in, profit from the trading of, hold, have positions in, or act as market makers, advisers, brokers or otherwise in relation to the securities and financial instruments described herein. Service marks appearing herein are the exclusive property of SG and its affiliates, as the case may be. This material is communicated by Lyxor Asset Management, which is authorized and regulated in France by the "Autorité des Marchés Financiers" (French Financial Markets Authority).

© 2011 LYXOR ASSET MANAGEMENT – ALL RIGHTS RESERVED


The Lyxor White Paper Series is a quarterly publication providing our clients access to intellectual capital, risk analytics and quantitative research developed within Lyxor Asset Management. The Series covers in depth studies of investment strategies, asset allocation methodologies and risk management techniques. We hope you will find the Lyxor White Paper Series stimulating and interesting.

PUBLISHING DIRECTORS
Alain Dubois, Chairman of the Board
Laurent Seyer, Chief Executive Officer

EDITORIAL BOARD
Nicolas Gaussel, PhD, Managing Editor
Thierry Roncalli, PhD, Associate Editor
Benjamin Bruder, PhD, Associate Editor

Lyxor Asset Management
Tour Société Générale – 17 cours Valmy
92987 Paris – La Défense Cedex – France
[email protected] – www.lyxor.com

Réf. 712100 – Studio Société Générale +33 (0)1 42 14 27 05 – 12/2011