University of Paris 7 - Lyxor Asset Management

Master thesis

Momentum Strategies: From novel Estimation Techniques to Financial Applications

Author: Tung-Lam Dao

Supervisor: Dr. Thierry Roncalli

September 30, 2011

Contents

Acknowledgments

Introduction

1 Trading Strategies with L1 Filtering
   1.1 Introduction
   1.2 Motivations
   1.3 L1 filtering schemes
       1.3.1 Application to trend-stationary process
       1.3.2 Extension to mean-reverting process
       1.3.3 Mixing trend and mean-reverting properties
       1.3.4 How to calibrate the regularization parameters?
   1.4 Application to momentum strategies
       1.4.1 Estimating the optimal filter for a given trading date
       1.4.2 Backtest of a momentum strategy
   1.5 Extension to the multivariate case
   1.6 Conclusion

2 Volatility Estimation for Trading Strategies
   2.1 Introduction
   2.2 Range-based estimators of volatility
       2.2.1 Range-based daily data
       2.2.2 Basic estimator
       2.2.3 High-low estimators
       2.2.4 How to eliminate both drift and opening effects?
       2.2.5 Numerical simulations
       2.2.6 Backtest
   2.3 Estimation of realized volatility
       2.3.1 Moving-average estimator
       2.3.2 IGARCH estimator
       2.3.3 Extension to range-based estimators
       2.3.4 Calibration procedure of the estimators of realized volatility
   2.4 High-frequency volatility estimators
       2.4.1 Microstructure effect
       2.4.2 Two-time-scale volatility estimator
       2.4.3 Numerical implementation and backtesting
   2.5 Conclusion

3 Support Vector Machine in Finance
   3.1 Introduction
   3.2 Support vector machine at a glance
       3.2.1 Basic ideas of SVM
       3.2.2 ERM and VRM frameworks
   3.3 Numerical implementations
       3.3.1 Dual approach
       3.3.2 Primal approach
       3.3.3 Model selection - Cross-validation procedure
   3.4 Extension to SVM multi-classification
       3.4.1 Basic idea of multi-classification
       3.4.2 Implementations of multiclass SVM
   3.5 SVM-regression in finance
       3.5.1 Numerical tests on SVM-regressors
       3.5.2 SVM-filtering for forecasting the trend of a signal
       3.5.3 SVM for multivariate regression
   3.6 SVM-classification in finance
       3.6.1 Test of SVM-classifiers
       3.6.2 SVM for classification
       3.6.3 SVM for score construction and stock selection
   3.7 Conclusion

Conclusions

A Appendix of chapter 1
   A.1 Computational aspects of L1 and L2 filters
       A.1.1 The dual problem
       A.1.2 The interior-point algorithm
       A.1.3 The scaling of the smoothing parameter of the L1 filter
       A.1.4 Calibration of the L2 filter
       A.1.5 Implementation issues

B Appendix of chapter 2
   B.1 Estimator of volatility
       B.1.1 Estimation with realized return

C Appendix of chapter 3
   C.1 Dual problem of SVM
       C.1.1 Hard-margin SVM classifier
       C.1.2 Soft-margin SVM classifier
       C.1.3 ε-SV regression
   C.2 Newton optimization for the primal problem
       C.2.1 Quadratic loss function
       C.2.2 Soft-margin SVM

Published paper

List of Figures

1.1 L1-T filtering versus HP filtering for the model (1.2)
1.2 L1-T filtering versus HP filtering for the model (1.3)
1.3 L1-C filtering versus HP filtering for the model (1.5)
1.4 L1-C filtering versus HP filtering for the model (1.6)
1.5 L1-TC filtering versus HP filtering for the model (1.2)
1.6 L1-TC filtering versus HP filtering for the model (1.3)
1.7 Influence of the smoothing parameter λ
1.8 Scaling power law of the smoothing parameter λmax
1.9 Cross-validation procedure for determining the optimal value λ⋆
1.10 Calibration procedure with the S&P 500 index
1.11 Cross-validation procedure for the two-trend model
1.12 Comparison between different L1 filters on the S&P 500 index

2.1 Data set of one trading day
2.2 Volatility estimators without drift and opening effects (M = 50)
2.3 Volatility estimators without drift and opening effect (M = 500)
2.4 Volatility estimators with µ = 30% and without opening effect (M = 500)
2.5 Volatility estimators with opening effect f = 0.3 and without drift (M = 500)
2.6 Volatility estimators with correction of the opening jump (f = 0.3)
2.7 Volatility estimators on a stochastic volatility simulation
2.8 Test of voltarget strategy with stochastic volatility simulation
2.9 Test of voltarget strategy with stochastic volatility simulation
2.10 Comparison between different probability density functions
2.11 Comparison between the different cumulative distribution functions
2.12 Volatility estimators on the S&P 500 index
2.13 Volatility estimators on BHI UN Equity
2.14 Estimation of the closing interval for the S&P 500 index
2.15 Estimation of the closing interval for BHI UN Equity
2.16 Likelihood function for various estimators on the S&P 500
2.17 Likelihood function for various estimators on BHI UN Equity
2.18 Backtest of voltarget strategy on the S&P 500 index
2.19 Backtest of voltarget strategy on BHI UN Equity
2.20 Comparison between the IGARCH estimator and the CC estimator
2.21 Likelihood function of high-low estimators versus the filtered parameter β
2.22 Likelihood function of high-low estimators versus the effective moving window
2.23 IGARCH estimator versus moving-average estimator for close-to-close prices
2.24 Comparison between different IGARCH estimators for high-low prices
2.25 Daily estimation of the likelihood function for various close-to-close estimators
2.26 Daily estimation of the likelihood function for various high-low estimators
2.27 Backtest for close-to-close estimator and realized estimators
2.28 Backtest for IGARCH high-low estimators compared to the IGARCH close-to-close estimator
2.29 Two-time-scale estimator of intraday volatility

3.1 Geometric interpretation of the margin in a linear SVM
3.2 Binary decision tree strategy for the multiclassification problem
3.3 L1-regressor versus L2-regressor with Gaussian kernel for model (3.16)
3.4 L1-regressor versus L2-regressor with Gaussian kernel for model (3.17)
3.5 Comparison of different regression kernels for model (3.16)
3.6 Comparison of different regression kernels for model (3.17)
3.7 Cross-validation procedure for determining the optimal values C⋆ and σ⋆
3.8 SVM-filtering with fixed-horizon scheme
3.9 SVM-filtering with dynamic-horizon scheme
3.10 L1-regressor versus L2-regressor with Gaussian kernel for model (3.16)
3.11 Comparison of different kernels for multivariate regression
3.12 Comparison between the dual algorithm and the primal algorithm
3.13 Illustration of non-linear classification with a Gaussian kernel
3.14 Illustration of multiclassification with SVM-BDT for in-sample data
3.15 Illustration of multiclassification with SVM-BDT for out-of-sample data
3.16 Illustration of multiclassification with SVM-BDT for ε = 0
3.17 Illustration of multiclassification with SVM-BDT for ε = 0.2
3.18 Multiclassification with SVM-BDT on the training set
3.19 Prediction efficiency with SVM-BDT on the validation set
3.20 Comparison between simulated score and Probit score for d = 2
3.21 Comparison between simulated score CDF and Probit score CDF for d = 2
3.22 Comparison between simulated score PDF and Probit score PDF for d = 2
3.23 Selection curve for a long strategy for simulated data and the Probit model
3.24 Probit scores for Eurostoxx data with d = 20 factors
3.25 SVM scores for Eurostoxx data with d = 20 factors

A.1 Spectral density of moving-average and L2 filters
A.2 Relationship between the value of λ and the length of the moving-average filter

List of Tables

1.1 Results for the backtest
2.1 Estimation error for various estimators
2.2 Performance of σ̂²_HL versus σ̂²_CC for different averaging windows
2.3 Performance of σ̂²_HL versus σ̂²_CC for different filters of f

Acknowledgments

During six unforgettable months in the R&D team of Lyxor Asset Management, I experienced and enjoyed every moment. Beyond all the professional experience I gained from everyone in the department, I truly appreciated the great atmosphere in the team, which motivated me every day.

I would first like to thank Thierry Roncalli for his supervision during my stay in the team. Without his direction and his confidence, I could never have imagined learning so many interesting things during my internship. Thierry introduced me to the financial concepts of the asset management world in a very interactive way; I can say that I learned finance in every single discussion with him. He taught me how to combine learning and practice. On the professional side, Thierry helped me fill the gaps in my financial knowledge by allowing me to work on various interesting topics, and he gave me the confidence to present my understanding of the field. On the personal side, he shared his own experiences and taught me how to adapt to this new world.

I would like to thank Nicolas Gaussel for his warm welcome to the Quantitative Management department, for his confidence and for his encouragement during my stay at Lyxor. I had the chance to work with him on a very interesting topic concerning the CTA strategy, which plays an important role in asset management.

I would like to thank Benjamin Bruder, my nearest neighbor, for his guidance and supervision throughout my internship. Informally, Benjamin was almost my co-advisor. I owe him a great deal for his patience in our daily discussions, in which he taught me and helped work out the many questions that came up in my projects. I am also grateful for his sense of humor, which warmed up the atmosphere.

I would like to express my gratitude to all the members of the R&D team for their help, their advice and everything they shared with me during my stay. I am really happy to have been one of them. Thanks to Jean-Charles for his friendship, for all our daily discussions and for his support of every initiative in my projects. A great thank-you to Stephane, who always cheered up our breaks with his intelligent humor; I learned from him the most interesting view of the "Binomial world". Thanks to Karl for explaining his macro-world, to Pierre for all his help with data collection and his passion in every explanation, such as the story of "Merrill Lynch's investment clock", and to Zelia for a very stimulating collaboration on my last project and the great time during our internship.

For everyone on the other side of the room, I would like to thank Philippe Balthazard for his comments on my projects and his point of view on financial matters, and Hoang-Phong Nguyen for his help with the database and his support during my stay. There are many other people I had the chance to interact with whom I cannot cite here.

Thanks to my parents and my sister, who always believed in me and supported me during my change of direction. Finally, I reserve the greatest thanks for my wife and my son, for their love and daily encouragement. They were always behind me during the most difficult moments of this year.

Introduction

During the internship in the Research and Development team of Lyxor Asset Management, we studied novel techniques applicable to asset management. We focused on the analysis of some special classes of momentum strategies, such as trend-following strategies and voltarget strategies. These strategies play a crucial role in quantitative management, as they aim to optimize returns by exploiting signals of market inefficiency and to limit market risk through an efficient control of volatility.

The objective of this report is the study of some novel techniques from statistics and signal processing, such as trend filtering, daily and high-frequency volatility estimation, and support vector machines. We employ these techniques to extract interesting financial signals, which are then used to implement the momentum strategies described in detail in each chapter of this report.

In the first chapter, we discuss various implementations of L1 filtering for detecting certain properties of noisy signals. This filter uses an L1 penalty in order to obtain a filtered signal composed of a set of straight trends or steps. The penalty, which determines the number of breaks, is implemented in a constrained least-squares problem and is controlled by a regularization parameter λ estimated by a cross-validation procedure. Financial time series are usually characterized by a long-term trend (called the global trend) and some short-term trends (named local trends). A combination of these two time scales yields a simple model describing a global trend process with mean-reverting properties. Explicit applications to momentum strategies are also discussed in detail, with appropriate uses of the trend configurations.

In the second chapter, we review various techniques for estimating volatility. We start with estimators based on the range of daily data, then consider the stochastic volatility model in order to determine the instantaneous volatility. At high trading frequencies, stock prices fluctuate with an additional noise, the so-called microstructure noise. This effect comes from the bid-ask spread and the short time scale: within a short time interval, the traded price does not exactly reflect the equilibrium price determined by supply and demand, but bounces between the bid and ask prices. In the second part of the chapter, we discuss the effect of microstructure noise on volatility estimation, an important topic for the large field of high-frequency trading. Backtests on an index and on individual stocks illustrate the efficiency of the techniques considered.

The third chapter is dedicated to the general framework of machine-learning techniques, in particular the well-known support vector machine (SVM). This technique can be employed in different contexts such as classification, regression or density estimation, following Vapnik [1998]. Within the scope of this report, we first give an overview of the method and its numerical implementations, then bridge it to financial applications such as trend forecasting, stock selection, sector recognition and score construction.

The material of this thesis has been used for the review of trend filtering techniques published in the 8th issue of the Lyxor White Paper Series: http://www.lyxor.com/fr/publications/white-papers/wp/52/.

Chapter 1

Trading Strategies with L1 Filtering

In this chapter, we discuss various implementations of L1 filtering for detecting certain properties of noisy signals. This filter uses an L1 penalty in order to obtain a filtered signal composed of a set of straight trends or steps. The penalty, which determines the number of breaks, is implemented in a constrained least-squares problem and is controlled by a regularization parameter λ estimated by a cross-validation procedure. Financial time series are usually characterized by a long-term trend (called the global trend) and some short-term trends (named local trends). A combination of these two time scales yields a simple model describing a global trend process with mean-reverting properties. Explicit applications to momentum strategies are also discussed in detail, with appropriate uses of the trend configurations.

Keywords: Momentum strategy, L1 filtering, L2 filtering, trend-following, mean-reverting.

1.1 Introduction

Trend detection is a major task of time series analysis, from both a mathematical and a financial point of view. The trend of a time series is the component capturing the global change, in contrast to the local changes due to noise. Trend filtering concerns not only the problem of denoising; it must also take into account the dynamics of the underlying process. This explains why mathematical approaches to trend extraction have a long history, and why this subject still attracts great interest in the scientific community.¹ From an investment perspective, trend filtering is at the core of most momentum strategies developed in the asset management industry and the hedge fund community in order to improve performance and limit portfolio risk.

1. For a general review, see Alexandrov et al. (2008).


The chapter is organized as follows. In Section 1.2, we discuss the trend-cycle decomposition of time series and review general properties of L1 and L2 filtering. In Section 1.3, we describe the L1 filter with its various extensions and the calibration procedure. In Section 1.4, we apply L1 filters to some momentum strategies and present the results of some backtests on the S&P 500 index. In Section 1.5, we discuss a possible extension to the multivariate case, and we conclude in the last section.

1.2 Motivations

In economics, the trend-cycle decomposition plays an important role in describing a non-stationary time series in terms of permanent and transitory stochastic components. Generally, the permanent component is assimilated to a trend, whereas the transitory component may be a noise or a stochastic cycle. The business cycle literature has produced a large amount of empirical research on this topic (see for example Cleveland and Tiao (1976), Beveridge and Nelson (1991), Harvey (1991) or Hodrick and Prescott (1997)). The last authors introduced a new method to estimate the trend of long-run GDP; this method, widely used by economists, is based on L2 filtering. Recently, Kim et al. (2009) have developed a similar filter by replacing the L2 penalty function by an L1 penalty function.

Let us consider a time series y_t which can be decomposed into a slowly varying trend x_t and a rapidly varying noise process ε_t:

$$y_t = x_t + \varepsilon_t$$

Let us first recall the well-known L2 filter (the so-called Hodrick-Prescott filter). This scheme determines the trend x_t by minimizing the following objective function:

$$\frac{1}{2} \sum_{t=1}^{n} (y_t - x_t)^2 + \lambda \sum_{t=2}^{n-1} (x_{t-1} - 2x_t + x_{t+1})^2$$

where λ > 0 is the regularization parameter which controls the trade-off between the smoothness of x_t and the size of the residual y_t − x_t (the noise ε_t). We remark that the second term is the discrete second derivative of the trend x_t, which characterizes the smoothness of the curve. Minimizing this objective function yields a solution that balances fidelity to the data against the smoothness of its curvature. In finance, this scheme does not give a clear signature of the market tendency. By contrast, if we replace the L2 norm by the L1 norm in the objective function, we obtain more interesting properties. Kim et al. (2009) therefore propose to consider the following objective function:

$$\frac{1}{2} \sum_{t=1}^{n} (y_t - x_t)^2 + \lambda \sum_{t=2}^{n-1} |x_{t-1} - 2x_t + x_{t+1}|$$

This problem is closely related to the Lasso regression of Tibshirani (1996) and to the L1-regularized least-squares problem of Daubechies et al. (2004). Here, taking the L1 norm imposes the condition that the second derivative of the filtered signal


must be zero almost everywhere. Hence, the filtered signal is composed of a set of straight trends and breaks.² The competition between the two terms in the objective function becomes a competition between the number of straight trends (or number of breaks) and the closeness to the raw data. The smoothing parameter λ therefore plays an important role in detecting the number of breaks. In the following, we briefly present how the L1 filter detects trends, and its extension to mean-reverting processes. The calibration procedure for the parameter λ is also discussed in detail.

1.3 L1 filtering schemes

1.3.1 Application to trend-stationary process

The Hodrick-Prescott scheme discussed in the last section can be rewritten in the vector space ℝⁿ with the L2 norm ‖·‖₂ as:

$$\frac{1}{2} \|y - x\|_2^2 + \lambda \|Dx\|_2^2$$

where y = (y_1, ..., y_n), x = (x_1, ..., x_n) ∈ ℝⁿ and D is the (n − 2) × n second-difference operator:

$$D = \begin{pmatrix} 1 & -2 & 1 & & & \\ & 1 & -2 & 1 & & \\ & & \ddots & \ddots & \ddots & \\ & & & 1 & -2 & 1 \end{pmatrix} \qquad (1.1)$$

The exact solution of this estimation problem is given by:

$$x^\star = \left(I + 2\lambda D^\top D\right)^{-1} y$$

The explicit expression of x⋆ allows a very simple numerical implementation with sparse matrices. As the L2 filter is a linear filter, the regularization parameter λ can be calibrated by comparison with the usual moving-average filter. The details of the calibration procedure are given in Appendix A.1.4.

The idea of the L2 filter can be generalized to a larger class, the so-called Lp filters, by using an Lp penalty instead of the L2 penalty. This generalization is discussed in the work of Daubechies et al. (2004) for the linear inverse problem and in the Lasso regression problem of Tibshirani (1996). For the L1 filter, the objective function becomes:

$$\frac{1}{2} \sum_{t=1}^{n} (y_t - x_t)^2 + \lambda \sum_{t=2}^{n-1} |x_{t-1} - 2x_t + x_{t+1}|$$

2. A break is a position where the trend of the signal changes.
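To make the closed-form L2 solution concrete, here is a minimal sketch (our own illustration, not code from the thesis) that builds the sparse second-difference operator of equation (1.1) and solves x⋆ = (I + 2λD⊤D)⁻¹y:

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import spsolve

def hp_filter(y, lam):
    """Hodrick-Prescott (L2) filter: solve x* = (I + 2*lam*D'D)^{-1} y,
    where D is the (n-2) x n second-difference operator of equation (1.1)."""
    n = len(y)
    # scalar diagonals are broadcast when the shape is given explicitly
    D = sparse.diags([1.0, -2.0, 1.0], [0, 1, 2], shape=(n - 2, n))
    A = sparse.eye(n) + 2.0 * lam * (D.T @ D)
    return spsolve(A.tocsc(), y)

# sanity check: a straight line lies in the null space of D, so the
# penalty vanishes and the filter should return the input unchanged
t = np.arange(200.0)
x = hp_filter(0.5 * t, lam=1600.0)
```

Because affine signals are invisible to the second-difference penalty, this check confirms the implementation before applying it to noisy data.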


which is equivalent to the following vector form:

$$\frac{1}{2} \|y - x\|_2^2 + \lambda \|Dx\|_1$$

Kim et al. (2009) have demonstrated that the dual problem of this L1 filtering scheme is a quadratic program with box constraints; the derivation is detailed in Appendix A.1.1. In order to optimize the numerical computation speed, we follow Kim et al. (2009) in using a "primal-dual interior point" method (see Appendix A.1.2).

In the following, we check the efficiency of this technique on various trend-stationary processes. The first model consists of data simulated as a set of straight trend lines with a white noise perturbation:

$$\begin{cases} y_t = x_t + \varepsilon_t \\ \varepsilon_t \sim \mathcal{N}(0, \sigma^2) \\ x_t = x_{t-1} + v_t \\ \Pr\{v_t = v_{t-1}\} = p \\ \Pr\{v_t = b\,(U_{[0,1]} - \tfrac{1}{2})\} = 1 - p \end{cases} \qquad (1.2)$$

We present in Figure 1.1 the comparison between the L1-T and HP filtering schemes.³ The top-left graph is the real trend x_t, whereas the top-right graph presents the noisy signal y_t. The bottom graphs show the results of the L1-T and HP filters. Here, we have chosen λ = 5 258 for the L1-T filter and λ = 1 217 464 for the HP filter. The choice of λ for the L1-T filter is based on the number of breaks in the trend, which is fixed to 10 in this example.⁴

The second model is a random walk generated by the following process:

$$\begin{cases} y_t = y_{t-1} + v_t + \varepsilon_t \\ \varepsilon_t \sim \mathcal{N}(0, \sigma^2) \\ \Pr\{v_t = v_{t-1}\} = p \\ \Pr\{v_t = b\,(U_{[0,1]} - \tfrac{1}{2})\} = 1 - p \end{cases} \qquad (1.3)$$

We present in Figure 1.2 the comparison between L1-T filtering and HP filtering on this second model.⁵

3. We consider n = 2000 observations. The parameters of the simulation are p = 0.99, b = 0.5 and σ = 15.
4. We discuss how to obtain λ in the next section.
5. The parameters of the simulation are p = 0.993, b = 5 and σ = 15.

1.3.2 Extension to mean-reverting process

As shown in the last paragraph, the use of an L1 penalty on the second derivative gives a correct description of the signal tendency. A similar idea can be applied to other orders of the derivative. We present here the extension of this L1 filtering technique to the case of mean-reverting processes. If we now impose the L1 penalty


Figure 1.1: L1-T filtering versus HP filtering for the model (1.2)

Figure 1.2: L1-T filtering versus HP filtering for the model (1.3)


condition to the first derivative, we can expect to obtain a fitted signal with zero slope, at a cost proportional to the number of jumps. In this case, we would like to minimize the following objective function:

$$\frac{1}{2} \sum_{t=1}^{n} (y_t - x_t)^2 + \lambda \sum_{t=2}^{n} |x_t - x_{t-1}|$$

or, in vector form:

$$\frac{1}{2} \|y - x\|_2^2 + \lambda \|Dx\|_1$$

Here D is the (n − 1) × n matrix which is the discrete version of the first-order derivative:

$$D = \begin{pmatrix} -1 & 1 & & & \\ & -1 & 1 & & \\ & & \ddots & \ddots & \\ & & & -1 & 1 \end{pmatrix} \qquad (1.4)$$

We may apply the same minimization algorithm as previously (see Appendix A.1.1). To illustrate this, we consider a model of step trend lines perturbed by a white noise process:

$$\begin{cases} y_t = x_t + \varepsilon_t \\ \varepsilon_t \sim \mathcal{N}(0, \sigma^2) \\ \Pr\{x_t = x_{t-1}\} = p \\ \Pr\{x_t = b\,(U_{[0,1]} - \tfrac{1}{2})\} = 1 - p \end{cases} \qquad (1.5)$$

We use this model to test the L1-C filter against the HP filter adapted to the first derivative,⁶ which corresponds to the following optimization program:

$$\min \; \frac{1}{2} \sum_{t=1}^{n} (y_t - x_t)^2 + \lambda \sum_{t=2}^{n} (x_t - x_{t-1})^2$$

The corresponding results are reported in Figure 1.3.⁷ For the second test, we consider a mean-reverting (Ornstein-Uhlenbeck) process whose mean follows a regime-switching process:

$$\begin{cases} y_t = y_{t-1} + \theta\,(x_t - y_{t-1}) + \varepsilon_t \\ \varepsilon_t \sim \mathcal{N}(0, \sigma^2) \\ \Pr\{x_t = x_{t-1}\} = p \\ \Pr\{x_t = b\,(U_{[0,1]} - \tfrac{1}{2})\} = 1 - p \end{cases} \qquad (1.6)$$

Here, x_t is the process which characterizes the mean value and θ is inversely proportional to the time of return to the mean. Figure 1.4 shows how the L1-C filter captures the original signal in comparison with the HP filter.⁸

6. We use the term HP filter in order to keep notations homogeneous. However, we note that this filter is in fact the FLS filter proposed by Kalaba and Tesfatsion (1989) when the exogenous regressors reduce to a constant.
7. The parameters are p = 0.998, b = 50 and σ = 8.
8. For the simulation of the Ornstein-Uhlenbeck process, we have chosen p = 0.9985, b = 20, θ = 0.1 and σ = 2.
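To make the test concrete, model (1.6) can be simulated in a few lines. This is an illustrative sketch (the function name and seed are ours), using the Figure 1.4 parameters p = 0.9985, b = 20, θ = 0.1 and σ = 2:

```python
import numpy as np

def simulate_ou_regime(n, p, b, theta, sigma, seed=0):
    """Simulate model (1.6): an Ornstein-Uhlenbeck process y_t reverting to a
    mean x_t that follows a regime-switching (piecewise-constant) process.
    With probability p the mean is kept; otherwise it jumps to b*(U[0,1] - 1/2)."""
    rng = np.random.default_rng(seed)
    x = np.empty(n)
    y = np.empty(n)
    x[0] = y[0] = 0.0
    for t in range(1, n):
        if rng.random() < p:
            x[t] = x[t - 1]                      # keep the current mean level
        else:
            x[t] = b * (rng.random() - 0.5)      # jump to a new mean level
        # mean reversion with speed theta plus Gaussian noise
        y[t] = y[t - 1] + theta * (x[t] - y[t - 1]) + rng.normal(0.0, sigma)
    return x, y

x, y = simulate_ou_regime(2000, p=0.9985, b=20, theta=0.1, sigma=2)
```

With p = 0.9985 the mean jumps only a handful of times over 2000 observations, producing the few flat regimes visible in Figure 1.4.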


Figure 1.3: L1-C filtering versus HP filtering for the model (1.5)

Figure 1.4: L1-C filtering versus HP filtering for the model (1.6)
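Both L1 variants share the same dual structure: a box-constrained quadratic program in ν with x⋆ = y − D⊤ν⋆. The sketch below is our own prototype, using L-BFGS-B as a simple stand-in for the primal-dual interior-point method of Appendix A.1.2; k = 2 gives the L1-T filter, k = 1 the L1-C filter:

```python
import numpy as np
from scipy.optimize import minimize

def l1_filter(y, lam, k=2):
    """L1 filter via its dual box-constrained QP (Kim et al., 2009):
    minimize 0.5*nu'(DD')nu - nu'Dy  s.t. |nu_i| <= lam, then x* = y - D'nu*.
    k = 2 penalizes second differences (L1-T), k = 1 first differences (L1-C)."""
    n = len(y)
    D = np.diff(np.eye(n), n=k, axis=0)      # (n-k) x n difference operator
    Dy, DDt = D @ y, D @ D.T
    m = n - k
    res = minimize(lambda nu: 0.5 * nu @ DDt @ nu - nu @ Dy,
                   np.zeros(m),
                   jac=lambda nu: DDt @ nu - Dy,
                   method="L-BFGS-B",
                   bounds=[(-lam, lam)] * m)
    return y - D.T @ res.x

# step signal plus noise: the k = 1 (L1-C) variant should return a
# near-piecewise-constant estimate, far smoother than the raw data
rng = np.random.default_rng(0)
steps = np.repeat([0.0, 20.0, -10.0], 100)
yn = steps + rng.normal(0.0, 2.0, 300)
xc = l1_filter(yn, lam=50.0, k=1)
```

This is the behavior illustrated in Figure 1.3: the total variation of the filtered signal collapses to a few jumps, while the fit term keeps it close to the data.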


1.3.3 Mixing trend and mean-reverting properties

We now combine the two schemes proposed above. In this case, we define two regularization parameters λ₁ and λ₂ corresponding to the two penalty conditions Σ_{t=2}^{n} |x_t − x_{t−1}| and Σ_{t=2}^{n−1} |x_{t−1} − 2x_t + x_{t+1}|. The objective function of the primal problem becomes:

$$\frac{1}{2} \sum_{t=1}^{n} (y_t - x_t)^2 + \lambda_1 \sum_{t=2}^{n} |x_t - x_{t-1}| + \lambda_2 \sum_{t=2}^{n-1} |x_{t-1} - 2x_t + x_{t+1}|$$

which can again be rewritten in matrix form:

$$\frac{1}{2} \|y - x\|_2^2 + \lambda_1 \|D_1 x\|_1 + \lambda_2 \|D_2 x\|_1$$

where D₁ and D₂ are respectively the (n − 1) × n and (n − 2) × n matrices defined in equations (1.4) and (1.1). In Figures 1.5 and 1.6, we test the efficiency of this mixed scheme on the straight-trend-lines model (1.2) and the random walk model (1.3).⁹

Figure 1.5: L1-TC filtering versus HP filtering for the model (1.2)
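The same dual trick extends to the mixed objective: stacking D₁ and D₂ yields one box-constrained QP in which each dual block has its own bound. Below is our own prototype (solved with L-BFGS-B rather than an interior-point method; names are ours):

```python
import numpy as np
from scipy.optimize import minimize

def l1_tc_filter(y, lam1, lam2):
    """Mixed L1-TC filter: minimize 0.5||y-x||^2 + lam1||D1 x||_1 + lam2||D2 x||_1.
    Dual: minimize 0.5*nu'(DD')nu - nu'Dy with D = [D1; D2] stacked and per-block
    bounds |nu_i| <= lam1 (first differences) or lam2 (second differences);
    the primal solution is recovered as x* = y - D'nu*."""
    n = len(y)
    D1 = np.diff(np.eye(n), n=1, axis=0)     # (n-1) x n: steps
    D2 = np.diff(np.eye(n), n=2, axis=0)     # (n-2) x n: trend breaks
    D = np.vstack([D1, D2])
    Dy, DDt = D @ y, D @ D.T
    bounds = [(-lam1, lam1)] * (n - 1) + [(-lam2, lam2)] * (n - 2)
    res = minimize(lambda nu: 0.5 * nu @ DDt @ nu - nu @ Dy,
                   np.zeros(2 * n - 3),
                   jac=lambda nu: DDt @ nu - Dy,
                   method="L-BFGS-B", bounds=bounds)
    return y - D.T @ res.x

# noisy ramp: the filtered signal should be much smoother than the data
rng = np.random.default_rng(2)
yn = 0.3 * np.arange(200) + rng.normal(0.0, 2.0, 200)
xt = l1_tc_filter(yn, lam1=5.0, lam2=200.0)
```

The two bounds let the filter mix behaviors: a large λ₂ enforces long straight trends while a moderate λ₁ still allows occasional step-like jumps of the level.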

[Figure: four panels (Signal, Noisy signal, L1-TC filter, HP filter) plotted against t]

1.3.4 How to calibrate the regularization parameters?

As shown above, the trend obtained from L1 filtering depends on the parameter λ of the regularization procedure. For large values of λ, we obtain the long-term trend of

9 For both models, the parameters are p = 0.99, b = 0.5 and σ = 5.


Figure 1.6: L1 − TC filtering versus HP filtering for the model (1.3)
[Figure: four panels (Signal, Noisy signal, L1-TC filter, HP filter) plotted against t]

the data, while for small values of λ we obtain the short-term trends of the data. In this paragraph, we attempt to define a procedure which allows us to make the right choice of the smoothing parameter according to our trend-extraction needs.

A preliminary remark
For small values of λ, we recover the original form of the signal. For large values of λ, we remark that there exists a maximum value λmax above which the trend signal has the affine form x_t = α + βt, where α and β are two constants which do not depend on the time t. The value of λmax is given by:

λmax = ||(D D^T)^{-1} D y||_∞

We can use this remark to get an idea of the order of magnitude of λ that should be used to determine the trend over a certain time period T. To apply this idea, we take the data over the total period T. If we want the global trend on this period, we fix λ = λmax; this λ gives the unique trend for the signal over the whole period. If one needs more detail on the trend over shorter periods, we can divide the signal into p time intervals and then estimate λ


via the mean value of the λmax^i parameters:

λ̄ = (1/p) Σ_{i=1}^{p} λmax^i
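The formula for λmax is easy to check numerically. A minimal sketch with numpy (the helper name is ours), using the second-difference operator D of the L1 − T filter:

```python
import numpy as np

def lambda_max(y):
    """lambda_max = ||(D D^T)^{-1} D y||_inf, with D the second-difference
    operator of the L1-T trend filter."""
    n = len(y)
    D = np.zeros((n - 2, n))
    for i in range(n - 2):
        D[i, i], D[i, i + 1], D[i, i + 2] = 1.0, -2.0, 1.0
    v = np.linalg.solve(D @ D.T, D @ y)
    return float(np.abs(v).max())

t = np.arange(100, dtype=float)
lam_affine = lambda_max(2.0 + 0.5 * t)   # affine signal: D y = 0
lam_quad = lambda_max(t ** 2)            # curved signal: positive lambda_max
```

For a purely affine signal the second differences vanish, so λmax = 0, consistent with the remark that above λmax the fitted trend is exactly affine.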

In Figure 1.7, we show the results obtained with p = 2 (λ = 1 500) and p = 6 (λ = 75) on the S&P 500 index.

Figure 1.7: Influence of the smoothing parameter λ
[Figure: log-price of the S&P 500 index from 2007 to 2011 together with the filtered trends for the two values of λ]

Moreover, an explicit calculation for a Brownian motion process gives us the scaling law of the smoothing parameter λmax. For the trend filtering scheme, λmax scales as T^{5/2}, while for the mean-reverting scheme λmax scales as T^{3/2} (see Figure 1.8). Numerical estimation of these powers over 500 simulations of the model (1.3) is in very good agreement with the analytical result for Brownian motion: we obtain empirically that the exponent for the L1 − T filter is 2.51 while the one for the L1 − C filter is 1.52.

Cross-validation procedure
In this paragraph, we discuss how to employ a cross-validation scheme in order to calibrate the smoothing parameter λ of our model. We define two additional parameters which characterize the trend detection mechanism. The first parameter T1 is the width of the data windows used to estimate the optimal λ with respect to our target strategy; this parameter controls the precision of our calibration. The second parameter T2 is used to estimate the prediction error of the trends obtained in the

Figure 1.8: Scaling power law of the smoothing parameter λmax
[Figure: log-log plot of λmax versus T for the L1-T and L1-C filters]

main window; this parameter characterizes the time horizon of the investment strategy. Figure 1.9 shows how the data set is divided into different windows in the cross-validation procedure. In order to obtain the optimal parameter λ, we compute the total error after scanning the whole data set with the window T1. The algorithm of this calibration process is described as follows:

Figure 1.9: Cross-validation procedure for determining the optimal value λ*
[Diagram: the historical data are split into a training set of width T1 followed by a test set of width T2; the prediction is made over a forecasting window of width T2 after today]


Algorithm 1 Cross-validation procedure for L1 filtering

procedure CV_Filter(T1, T2)
    Divide the historical data into m rolling test sets T2^i (i = 1, ..., m)
    For each test window T2^i, compute the statistic λmax^i
    From the array of λmax^i, compute the average λ̄ and the standard deviation σλ
    Compute the boundaries λ1 = λ̄ − 2σλ and λ2 = λ̄ + 2σλ
    for j = 1 : n do
        Compute λj = λ1 (λ2/λ1)^{j/n}
        Divide the historical data into p rolling training sets T1^k (k = 1, ..., p)
        for k = 1 : p do
            For each training window T1^k, run the L1 filter
            Forecast the trend for the adjacent test window T2^k
            Compute the error e^k(λj) on the test window T2^k
        end for
        Compute the total error e(λj) = Σ_k e^k(λj)
    end for
    Minimize the total error e(λ) to find the optimal value λ*
    Run the L1 filter with λ = λ*
end procedure

Figure 1.10 illustrates the calibration procedure for the S&P 500 index with T1 = 400 and T2 = 50 (the number of observations is equal to 1 008 trading days). With m = p = 12 and n = 15, the estimated optimal value λ* for the L1 − T filter is equal to 7.03.

We have observed that this calibration procedure is better suited to long-term time horizons, that is, to estimating a global trend. For short-term time horizons, the prediction of local trends is much more perturbed by the noise. We have computed the probability of correctly predicting the tendency of the market for long-term and short-term horizons: it is about 70% for a 3-month horizon, whereas it is just 50% for a one-week horizon. It follows that even if the fit is good in-sample, the noise is large, so the predicted future tendency is right only half of the time: 1/2 for an increasing market and 1/2 for a decreasing market. In order to obtain better results for smaller time horizons, we improve the last algorithm by proposing a two-trend model.
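Before turning to the two-trend model, the cross-validation loop of Algorithm 1 can be sketched as follows. This is a simplified stand-in: the moving-average "filter", the slope extrapolation and the squared-error metric are our own choices, not the author's implementation:

```python
import numpy as np

def cv_lambda(y, T1, T2, lambdas, fit_filter):
    """Grid search over lambda (Algorithm 1): fit on rolling windows of
    width T1, forecast the trend over the adjacent test window of width T2
    and keep the lambda with the smallest total squared error."""
    errors = []
    for lam in lambdas:
        err = 0.0
        start = 0
        while start + T1 + T2 <= len(y):
            train = y[start:start + T1]
            test = y[start + T1:start + T1 + T2]
            trend = fit_filter(train, lam)
            # extrapolate the last fitted slope over the test window
            slope = trend[-1] - trend[-2]
            forecast = trend[-1] + slope * np.arange(1, T2 + 1)
            err += float(np.sum((test - forecast) ** 2))
            start += T2
        errors.append(err)
    return lambdas[int(np.argmin(errors))]

def ma_filter(y, lam):
    """Stand-in for the L1 filter: a moving average whose window grows with lambda."""
    w = max(2, int(lam))
    pad = np.concatenate([np.full(w - 1, y[0]), y])
    return np.convolve(pad, np.ones(w) / w, mode="valid")

rng = np.random.default_rng(1)
y = np.cumsum(rng.normal(size=600))
best = cv_lambda(y, T1=100, T2=20, lambdas=[2, 5, 10, 20], fit_filter=ma_filter)
```

In the actual procedure, `fit_filter` would be the L1 filter itself and the λ grid would be built from the λmax statistics as in Algorithm 1.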
The first trend is the local one, which is determined by the first algorithm with the parameter T2 corresponding to the local prediction. The second trend is the global one, which gives the tendency of the market over a longer period T3. The choice of this global trend parameter is very similar to the choice of the moving-average parameter. This model can be considered as a simple version of a mean-reverting model for the trend. In Figure 1.11, we describe how the data set is divided for estimating the local trend and the global trend. The procedure for estimating the trend of the signal in the two-trend model is summarized in Algorithm 2. The corrected trend is now determined by studying the relative position of the historical data with respect to the global trend. The reference position is characterized by the standard deviation σ(y_t − x_t^G), where x_t^G is the filtered global

Figure 1.10: Calibration procedure with the S&P 500 index
[Figure: top panel, log-price of the S&P 500 index (2007-2011) with the calibrated trend; bottom panel, total error e(λ) as a function of ln λ]

trend.

1.4 Application to momentum strategies

In this section, we apply the previous framework to the S&P 500 index. First, we illustrate the calibration procedure for a given trading date. Then, we backtest a momentum strategy by estimating dynamically the optimal filters.

1.4.1 Estimating the optimal filter for a given trading date

We would like to estimate the optimal filter for January 3rd, 2011 by considering the period from January 2007 to December 2010. We use the previous algorithms

Figure 1.11: Cross-validation procedure for the two-trend model
[Diagram: the historical data are split into a training set of width T1 and test sets; the global trend is estimated over a window T3 and the local trend over a window T2, with the corresponding prediction windows after today]
Algorithm 2 Prediction procedure for the two-trend model

procedure Predict_Filter(T2, T3)
    Compute the local trend x_t^L for the time horizon T2 with the CV_Filter procedure
    Compute the global trend x_t^G for the time horizon T3 with the CV_Filter procedure
    Compute the standard deviation σ(y_t − x_t^G) of the data with respect to the global trend
    if |y_t − x_t^G| < σ(y_t − x_t^G) then
        Prediction ← x_t^L
    else
        Prediction ← x_t^G
    end if
end procedure
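The decision rule of Algorithm 2 translates directly into code; a minimal sketch (the function name and the synthetic trends are ours):

```python
import numpy as np

def two_trend_prediction(y, local_trend, global_trend):
    """Algorithm 2: keep the local trend while the data stay within one standard
    deviation of the global trend, otherwise revert to the global trend."""
    resid = y - global_trend
    sigma = resid.std()
    return np.where(np.abs(resid) < sigma, local_trend, global_trend)

t = np.arange(200, dtype=float)
y = 0.05 * t + np.sin(t / 10.0)          # global drift plus a local oscillation
pred = two_trend_prediction(y, local_trend=y, global_trend=0.05 * t)
```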

with T1 = 400 and T2 = 50. The optimal parameters are λ1 = 2.46 (for the L1 − C filter) and λ2 = 15.94 (for the L1 − T filter). Results are reported in Figure 1.12. The trend for the next 50 trading days is estimated at 7.34% for the L1 − T filter and 7.84% for the HP filter, whereas it is null for the L1 − C and L1 − TC filters. By comparison, the true performance of the S&P 500 index is 1.90% from January 3rd, 2011 to March 15th, 201110.

Figure 1.12: Comparison between different L1 filters on the S&P 500 index

10 It corresponds exactly to a period of 50 trading days.



1.4.2 Backtest of a momentum strategy

Design of the strategy
Let us consider a class of self-financed strategies on a risky asset St and a risk-free asset Bt. We assume that the dynamics of these assets are:

    dBt = rt Bt dt
    dSt = µt St dt + σt St dWt

where rt is the risk-free rate, µt is the trend of the asset price and σt is the volatility. We denote by αt the proportion invested in the risky asset and by (1 − αt) the part invested in the risk-free asset. We start with an initial budget W0 and expect a final wealth WT. The optimal strategy is the one which maximizes the expectation of a utility function U(WT), which is increasing and concave. It is equivalent to the Markowitz problem, which consists of maximizing the wealth of the portfolio under a risk penalty:

sup_{α} { E(W_T^α) − (λ/2) σ²(W_T^α) }

which is equivalent to:

sup_{α} { αt µt − (λ/2) W0 αt² σt² }

As the objective function is concave, the maximum corresponds to the zero of the gradient µt − λ W0 αt σt². We obtain the optimal solution:

αt* = (1/(λ W0)) (µt / σt²)

In order to limit the explosion of αt, we also impose the constraint αmin ≤ αt ≤ αmax:

αt* = max( min( (1/(λ W0)) (µt / σt²), αmax ), αmin )

The wealth of the portfolio is then given by the following expression:

W_{t+1} = Wt + Wt [ αt* (S_{t+1}/St − 1) + (1 − αt*) rt ]

Results
In the following simulations, we use the estimators µ̂t and σ̂t in place of µt and σt. For µ̂t, we consider different models such as the L1, HP and moving-average filters11, whereas we use the following estimator for the volatility:

σ̂t² = (1/T) Σ_{i=t−T+1}^{t} ln²(S_i / S_{i−1})
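The allocation rule and wealth recursion above can be sketched as follows (the function name and the illustrative constant estimates are ours):

```python
import numpy as np

def backtest(S, mu_hat, sigma_hat, lam=1.0, W0=1.0, r=0.0, amin=-1.0, amax=1.0):
    """Wealth recursion with the clipped Markowitz allocation
    alpha*_t = clip(mu_t / (lam * W0 * sigma_t^2), amin, amax)."""
    W = [W0]
    for t in range(len(S) - 1):
        alpha = float(np.clip(mu_hat[t] / (lam * W0 * sigma_hat[t] ** 2), amin, amax))
        ret = S[t + 1] / S[t] - 1.0
        W.append(W[-1] + W[-1] * (alpha * ret + (1.0 - alpha) * r))
    return np.array(W)

S = np.array([100.0, 101.0, 102.0, 101.0, 103.0])
mu_hat = np.full(4, 0.10)     # constant trend estimate, for illustration only
sigma_hat = np.full(4, 0.20)  # constant volatility estimate, for illustration only
W = backtest(S, mu_hat, sigma_hat)
```

With these estimates the allocation saturates at αmax = 1, so the wealth simply tracks the asset.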

We consider a long/short strategy, that is (αmin, αmax) = (−1, 1). In the particular case of the µ̂t^L1 estimator, we consider three different models:

11 We denote them respectively µ̂t^L1, µ̂t^HP and µ̂t^MA.


Table 1.1: Results for the backtest

Model           Trend   Performance  Volatility  Sharpe   IR     Drawdown
S&P 500                 2.04%        21.83%      −0.06           56.78
µ̂t^MA                  3.13%        18.27%      −0.01    0.03   33.83
µ̂t^HP                  6.39%        18.28%       0.17    0.13   39.60
µ̂t^L1          (LT)    3.17%        17.55%      −0.01    0.03   25.11
µ̂t^L1          (GT)    6.95%        19.01%       0.19    0.14   31.02
µ̂t^L1          (LGT)   6.47%        18.18%       0.17    0.13   31.99

1. the first one is based on the local trend;
2. the second one is based on the global trend;
3. the third one combines both the local and global trends.

For all these strategies, the test set of the local trend T2 is equal to 6 months (or 130 trading days), whereas the length of the test set for the global trend is four times larger, T3 = 4T2, i.e. 520 trading days (about two years). This choice of T3 agrees with the usual choice of the window width in moving-average estimators. The length of the training set T1 is also four times the length of the test set. The study period is from January 1998 to December 2010. In the backtest, the trend estimation is updated every day.

In Table 1.1, we summarize the results obtained with the different models cited above. We remark that the best performances correspond to the global trend, HP and two-trend models. Because the HP filter is calibrated to the window of the moving-average filter, which is equal to T3, it is not surprising that the performances of these three models are similar. Over the backtest period, the S&P 500 does not have a clear upward or downward trend. Hence, the local trend estimator does not give a good prediction and this strategy has the worst performance. By contrast, the two-trend model takes into account the trade-off between the local and global trends and gives a better result.

1.5 Extension to the multivariate case

We now extend the L1 filtering scheme to a multivariate time series y_t = (y_t^(1), ..., y_t^(m)). The underlying idea is to estimate the common trend of several univariate time series. In finance, these time series correspond to the prices of several assets. Therefore, we can build long/short strategies between these assets by comparing the individual trends with the common trend. For the sake of simplicity, we assume that all the signals are rescaled to the same


order of magnitude12 . The objective function becomes new: m

2 1 X

(i)

y − x + λ kDxk1 2 2 i=1

In Appendix A.1.1, we show thatPthis problem is equivalent to the L1 univariate (i) as the signal. problem by considering y¯t = m−1 m i=1 y
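The reduction to the univariate problem rests on a quadratic identity: Σ_i ||y^(i) − x||² differs from m ||ȳ − x||² by a constant independent of x. A quick numerical check (random data and helper names are ours):

```python
import numpy as np

# Check the reduction of Appendix A.1.1: (1/2) sum_i ||y(i) - x||^2 equals
# (m/2) ||ybar - x||^2 plus a constant independent of x, so the multivariate
# problem reduces to univariate L1 filtering on the averaged signal ybar.
rng = np.random.default_rng(5)
m, n = 4, 50
Y = rng.normal(size=(m, n))       # m univariate series
ybar = Y.mean(axis=0)

def quad_multi(x):
    return 0.5 * np.sum((Y - x) ** 2)

def quad_uni(x):
    return 0.5 * m * np.sum((ybar - x) ** 2)

x1, x2 = rng.normal(size=n), rng.normal(size=n)
c1 = quad_multi(x1) - quad_uni(x1)
c2 = quad_multi(x2) - quad_uni(x2)   # same constant for any x
```

Since the ||Dx||₁ penalty does not involve y, the two problems share the same minimizer up to a rescaling of λ.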

1.6 Conclusion

Momentum strategies are efficient ways to exploit the market tendency in trading strategies. Hence, a good estimator of the trend is essential from this perspective. In this paper, we show that we can use L1 filters to forecast the trend of the market in a very simple way. We also propose a cross-validation procedure to calibrate the optimal regularization parameter λ, where the only information to provide is the investment time horizon. More sophisticated models based on local and global trends are also discussed. We remark that these models can reflect the effect of mean reversion towards the global trend of the market. Finally, we consider several backtests on the S&P 500 index and obtain competitive results with respect to the traditional moving-average filter.

12 For example, we may center and standardize the time series by subtracting the mean and dividing by the standard deviation.


Bibliography

[1] Alexandrov T., Bianconcini S., Dagum E.B., Maass P. and McElroy T. (2008), A Review of Some Modern Approaches to the Problem of Trend Extraction, US Census Bureau, RRS #2008/03.
[2] Beveridge S. and Nelson C.R. (1981), A New Approach to the Decomposition of Economic Time Series into Permanent and Transitory Components with Particular Attention to Measurement of the Business Cycle, Journal of Monetary Economics, 7(2), pp. 151-174.
[3] Boyd S. and Vandenberghe L. (2009), Convex Optimization, Cambridge University Press.
[4] Cleveland W.P. and Tiao G.C. (1976), Decomposition of Seasonal Time Series: A Model for the Census X-11 Program, Journal of the American Statistical Association, 71(355), pp. 581-587.
[5] Daubechies I., Defrise M. and De Mol C. (2004), An Iterative Thresholding Algorithm for Linear Inverse Problems with a Sparsity Constraint, Communications on Pure and Applied Mathematics, 57(11), pp. 1413-1457.
[6] Hastie T., Tibshirani R. and Friedman J. (2009), The Elements of Statistical Learning, Second Edition, Springer.
[7] Harvey A. (1991), Forecasting, Structural Time Series Models and the Kalman Filter, Cambridge University Press.
[8] Hodrick R.J. and Prescott E.C. (1997), Postwar U.S. Business Cycles: An Empirical Investigation, Journal of Money, Credit and Banking, 29(1), pp. 1-16.
[9] Kalaba R. and Tesfatsion L. (1989), Time-varying Linear Regression via Flexible Least Squares, Computers & Mathematics with Applications, 17, pp. 1215-1245.
[10] Kim S-J., Koh K., Boyd S. and Gorinevsky D. (2009), ℓ1 Trend Filtering, SIAM Review, 51(2), pp. 339-360.
[11] Tibshirani R. (1996), Regression Shrinkage and Selection via the Lasso, Journal of the Royal Statistical Society B, 58(1), pp. 267-288.

Chapter 2

Volatility Estimation for Trading Strategies

We review in this chapter various techniques for estimating the volatility. We start by discussing estimators based on the range of daily monitoring data, then we consider the stochastic volatility model in order to determine the instantaneous volatility. At high trading frequency, stock prices fluctuate with an additional noise, the so-called microstructure noise. This effect comes from the bid-ask bounce due to the short time scale: within a short time interval, the trading price does not converge to the equilibrium price determined by the supply-demand balance. In the second part, we discuss the effect of the microstructure noise on the volatility estimation. This is a very important topic for the enormous field of high-frequency trading. Examples of backtesting on indices and stocks will illustrate the efficiency of the considered techniques.

Keywords: Volatility, voltarget strategy, range-based estimator, high-low estimator, microstructure noise.

2.1 Introduction

Measuring the volatility is one of the most important questions in finance. As its name states, volatility is a direct measurement of the risk of a given asset. Under the hypothesis that the realized return follows a Brownian motion, volatility is usually estimated by the standard deviation of daily price movements. As this assumption relates the stock price to the most common object of stochastic calculus, much mathematical work has been carried out on volatility estimation. With the increasing availability of trading data, we can exploit more and more useful information in order to improve the precision of the volatility estimator. A new class of estimators based on the high and low prices was invented. However, in the real world the asset price is not just a simple geometric Brownian motion; different effects have been observed, including the drift and the opening jump. A general correction


scheme based on the combination of various estimators has been studied in order to eliminate these effects. As the trading frequency increases, we expect the precision of the estimator to improve as well. However, when the trading frequency reaches a certain limit1, new phenomena due to the non-equilibrium of the market emerge and spoil the precision. This is the so-called microstructure noise, which is characterized by the bid-ask bounce and the transaction effect. Because of this noise, the realized variance estimator overestimates the true volatility of the price process. A correction based on the use of two different time scales can eliminate this effect.

The chapter is organized as follows. In Section 2.2, we review the basic volatility estimator using the variance of realized returns (following a note by B. Bruder), then we introduce the variations based on range estimation. In Section 2.3, we discuss how to measure the instantaneous volatility and the lag effect of the moving average. In Section 2.4, we discuss the effect of the microstructure noise on high-frequency volatility estimation.

2.2 Range-based estimators of volatility

2.2.1 Range-based daily data

In this paragraph, we discuss the general characteristics of the asset price and introduce the basic notations which will be used in the rest of the chapter. Let us assume that the asset price St follows the usual Black-Scholes dynamics, i.e. a geometric Brownian motion in continuous time:

dSt / St = µt dt + σt dBt    (2.1)

Here, µt is the return or drift of the process whereas σt is the volatility. Over a period of T = 1 trading day, the evolution is divided into two time intervals: the first interval, with fraction f, describes the closing interval (before the open) and the second interval, with fraction 1 − f, describes the opening interval (trading interval). In the monitoring of the data, the closing interval is unobservable and is characterized by the jumps at the market open. The closing interval is thus measured not by the actual closing time but by the jumps at the market open. If the logarithm of the price follows a standard Brownian motion without drift, then the fraction f/(1 − f) is given by the squared ratio between the standard deviation of the opening jump and that of the daily price movement. We will see that this idea provides a first correction for the close-open effect for all the estimators discussed below. In order to fix the notation, we define here the different quantities concerning the statistics of the price evolution:

• T is the time interval of 1 trading day

1 This limit defines the optimal frequency for the classical estimator. It is more or less agreed to be one trade every 5 minutes.


Figure 2.1: Data set of 1 trading day

• f is the fraction of the closing period
• σ̂t² is the estimator of the variance σt²
• O_ti is the opening price on a given period [ti, ti+1[
• C_ti is the closing price on a given period [ti, ti+1[
• H_ti = max_{t∈[ti,ti+1[} St is the highest price on a given period [ti, ti+1[
• L_ti = min_{t∈[ti,ti+1[} St is the lowest price on a given period [ti, ti+1[
• o_ti = ln O_ti − ln C_{ti−1} is the opening jump
• u_ti = ln H_ti − ln O_ti is the highest price movement during the open period
• d_ti = ln L_ti − ln O_ti is the lowest price movement during the open period
• c_ti = ln C_ti − ln O_ti is the daily price movement over the open period
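For illustration, these quantities can be computed from daily open/high/low/close series; a minimal numpy sketch (the function name and sample prices are ours):

```python
import numpy as np

def daily_ranges(O, H, L, C):
    """Compute o, u, d, c of the notation above from open/high/low/close
    series (the opening jump o is undefined for the first day)."""
    o = np.log(O[1:]) - np.log(C[:-1])   # opening jump
    u = np.log(H) - np.log(O)            # highest move over the open period
    d = np.log(L) - np.log(O)            # lowest move over the open period
    c = np.log(C) - np.log(O)            # open-to-close move
    return o, u, d, c

O = np.array([100.0, 101.0, 99.0])
H = np.array([102.0, 103.0, 100.0])
L = np.array([99.0, 100.0, 97.0])
C = np.array([100.5, 102.0, 98.0])
o, u, d, c = daily_ranges(O, H, L, C)
```

By construction u ≥ 0 ≥ d and u ≥ c ≥ d for every day.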

2.2.2 Basic estimator

For the sake of simplicity, let us start this paragraph by assuming that there is no opening jump (f = 0). The asset price St described by the process (2.1) is observed at a series of discrete dates {t0, ..., tn}. In general, this series is not necessarily regular. Let Rti be the realized return over the period [ti−1, ti[; then we obtain:

R_ti = ln S_ti − ln S_{ti−1} = ∫_{t_{i−1}}^{t_i} σ_u dB_u + (µ_u − σ_u²/2) du

In the following, we assume that the couple (µt, σt) is independent of the Brownian motion Bt of the asset price evolution.


Estimator over a given period
In Appendix B.1, we show that the realized return Rti is related to the volatility as:

E[R_ti² | σ, µ] = (t_i − t_{i−1}) σ_{t_i}² + (t_i − t_{i−1})² (µ_{t_{i−1}} − σ_{t_{i−1}}²/2)²

This quantity cannot be a good estimator of the volatility because its standard deviation, √2 (t_i − t_{i−1}) σ_{t_i}², is proportional to the estimated quantity. In order to reduce the estimation error, we focus on the estimation of the average volatility over the period tn − t0. The average volatility is defined as:

σ̄² = (1/(t_n − t_0)) ∫_{t_0}^{t_n} σ_u² du    (2.2)

This quantity can be measured by using the canonical estimator defined as:

σ̂² = (1/(t_n − t_0)) Σ_{i=1}^{n} R_ti²

The variance of this estimator is approximated as var(σ̂²) ≈ 2σ⁴/n, i.e. its standard deviation is proportional to √2 σ²/√n. It means that the estimation error is small if n is large enough. Indeed, the variance of the average volatility reads var(σ̂) ≈ σ²/(2n) and its standard deviation is approximately σ/√(2n).

Effect of the weight distribution
In general, we can define an estimator with a weight distribution wi such as:

σ̂² = Σ_{i=1}^{n} w_i R_ti²

Then the expectation value of the estimator is given by:

E[σ̂² | σ, µ] = Σ_{i=1}^{n} w_i ∫_{t_{i−1}}^{t_i} σ_u² du

A simple example of the general definition is the estimator based on the annualized returns R_ti/√(t_i − t_{i−1}). In this case, our estimator becomes:

σ̂² = (1/n) Σ_{i=1}^{n} R_ti² / (t_i − t_{i−1})

for which the expectation value is:

E[σ̂² | σ, µ] = (1/n) Σ_{i=1}^{n} (1/(t_i − t_{i−1})) ∫_{t_{i−1}}^{t_i} σ_u² du    (2.3)


We remark that if the time step (time increment) is constant, t_i − t_{i−1} = T, then we obtain the same result as with the canonical estimator. However, if the time step t_i − t_{i−1} is not constant, long-term returns are underweighted while short-term returns are overweighted. We will see in the next discussion on realized volatility that the choice of the weight distribution can help improve the quality of the estimator. For example, we will show that the IGARCH estimation leads to an exponential weight distribution which is more appropriate for estimating the realized volatility.

Close-to-close and open-to-close estimators
As discussed above, the volatility can be obtained by using a moving average on discrete data. The standard measurement employs the above result of the canonical estimator for the closing prices (the so-called "close to close" estimator):

σ̂_CC² = (1/((n − 1) T)) Σ_{i=1}^{n} ((o_ti + c_ti) − (ō + c̄))²

Here, T is the time period corresponding to 1 trading day. In the rest of the paper, we use CC to denote the close-to-close estimator. We remark two differences in this formula compared to the one defined above. Firstly, we have subtracted the mean value of the close-to-close return (ō + c̄) in order to eliminate the drift effect:

ō = (1/n) Σ_{i=1}^{n} o_ti,   c̄ = (1/n) Σ_{i=1}^{n} c_ti

Secondly, the prefactor is now 1/((n − 1) T) and not 1/(nT). Indeed, since we have subtracted the mean value, the maximum likelihood procedure leads to the factor 1/((n − 1) T). We can also define two other volatility estimators, the "open to close" estimator (OC):

σ̂_C² = (1/((n − 1) T)) Σ_{i=1}^{n} (c_ti − c̄)²

and the "close to open" estimator (CO):

σ̂_O² = (1/((n − 1) T)) Σ_{i=1}^{n} (o_ti − ō)²

We recall that o_ti is the opening jump for a given trading period and c_ti is the daily movement of the asset price, such that the close-to-close return is equal to (o_ti + c_ti). We remark that the "close to close" estimator depends neither on the drift nor on the closing interval f. In the absence of microstructure noise, this estimator is unbiased. Hence, it is usually used as a benchmark to judge the efficiency of another estimator σ̂, which is defined as:

eff(σ̂²) = var(σ̂_CC²) / var(σ̂²)

where var(σ̂_CC²) = 2σ⁴/n. The quality of an estimator is determined by a high value of its efficiency: eff(σ̂²) > 1.
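As a sketch, the close-to-close estimator can be implemented and checked on synthetic returns (the function name, seed and parameter values are ours):

```python
import numpy as np

def close_to_close_var(o, c, T=1.0 / 252):
    """Close-to-close variance estimator: demeaned (o + c) returns,
    normalized by (n - 1) T, with T one trading day in years."""
    r = o + c
    n = len(r)
    return np.sum((r - r.mean()) ** 2) / ((n - 1) * T)

rng = np.random.default_rng(2)
sigma, T = 0.15, 1.0 / 252
# synthetic daily returns with std sigma * sqrt(T) and no opening jump
c = sigma * np.sqrt(T) * rng.normal(size=5000)
o = np.zeros(5000)
sigma_hat = float(np.sqrt(close_to_close_var(o, c, T)))
```

With n = 5000 days the estimate recovers the 15% annualized volatility within the expected sampling error of order σ/√(2n).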


2.2.3 High-low estimators

We have seen that the daily deviation can be used to define an estimator of the volatility. This comes from the assumption that the logarithm of the price follows a Brownian motion. We know that the standard deviation of a diffusive process over a time interval ∆t is proportional to σ√∆t; hence, using the variance to estimate the volatility is quite intuitive. Within a given time interval, if additional information on the price movement is available, such as the highest and lowest values, this range must also provide a good measure of the volatility. This idea was first addressed by W. Feller in 1951. Later, Parkinson (1980) employed the first result of Feller's work to provide the first "high-low" estimator (the so-called Parkinson estimator). If one uses close prices to estimate the volatility, one can eliminate the effect of the drift by subtracting the mean value of the daily variation. By contrast, the use of high and low prices cannot eliminate the drift effect in such a simple way. In addition, the high and low prices can only be observed in the opening interval, so this approach cannot eliminate the second effect, due to the opening jump. Moreover, as demonstrated in the work of Parkinson (1980), this estimator gives better confidence but it obviously underestimates the volatility because of the discrete observation of the price: the maximum and minimum values over a time interval ∆t are not the true ones of the Brownian motion. They are underestimated, so it is not surprising that the result depends strongly on the frequency of the price quotation. In a high-frequency market this third effect can be negligible; we will nevertheless discuss it later. Because of the limitations of Parkinson's estimator, another estimator also based on the work of Feller was proposed by Kunitomo (1992). In order to eliminate the drift, he constructs a Brownian bridge; the deviation of this motion is again related to the diffusion coefficient. In the same line of thought, Rogers and Satchell (1991) propose another use of the high and low prices in order to obtain a drift-independent volatility estimator. In this section, we review these techniques, which all remain constrained by the opening jump.

The Parkinson estimator
Let us consider the random variable u_ti − d_ti (namely the range of the Brownian motion over the period [ti, ti+1[). The Parkinson estimator is defined by using the following result (Feller 1951):

E[(u − d)²] = (4 ln 2) σ² T

By inverting this formula, we obtain a natural estimator of the volatility based on the high and low prices. Parkinson's volatility estimator is then defined as (Parkinson 1980):

σ̂_P² = (1/(4 nT ln 2)) Σ_{i=1}^{n} (u_ti − d_ti)²
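A minimal sketch of the Parkinson estimator, checked on simulated intraday random walks (all names and parameter values are ours):

```python
import numpy as np

def parkinson_var(u, d, T=1.0 / 252):
    """Parkinson variance estimator from the daily high-low moves u and d."""
    n = len(u)
    return np.sum((u - d) ** 2) / (4.0 * np.log(2.0) * n * T)

# daily high/low log-moves simulated from an intraday random walk (sigma = 15%)
rng = np.random.default_rng(3)
sigma, T, M, n = 0.15, 1.0 / 252, 390, 2000
steps = sigma * np.sqrt(T / M) * rng.normal(size=(n, M))
paths = np.cumsum(steps, axis=1)
u = np.maximum(paths.max(axis=1), 0.0)   # highest log-price move of each day
d = np.minimum(paths.min(axis=1), 0.0)   # lowest log-price move of each day
sigma_hat = float(np.sqrt(parkinson_var(u, d, T)))
```

With M = 390 intraday steps the estimate comes out slightly below 15%, illustrating the discretization bias discussed above: the discrete maximum and minimum understate the true range.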



In order to estimate the error of this estimator, we compute the variance of σ̂_P², which is given by the following expression:

var(σ̂_P²) = (σ⁴/n) (9 ζ(3)/(16 (ln 2)²) − 1)

Here, ζ(x) is the Riemann zeta function. In comparison to the benchmark "close to close" estimator, we have an efficiency:

eff(σ̂_P²) = 32 (ln 2)² / (9 ζ(3) − 16 (ln 2)²) = 4.91

The Garman-Klass estimator
Another idea, employing the additional information from the high and low values of the price movement within the trading day in order to increase the estimator efficiency, was introduced by Garman and Klass (1980). They construct a best analytic scale estimator by proposing a quadratic form and imposing the well-known invariance condition of Brownian motion on the set of variables (u, d, c). By minimizing its variance, they obtain the optimal quadratic estimator, which is given by the following property:

E[0.511 (u − d)² − 0.019 (c (u + d) − 2ud) − 0.383 c²] = σ² T

The Garman-Klass estimator is then defined as:

σ̂_GK² = (1/(nT)) Σ_{i=1}^{n} [0.511 (u_ti − d_ti)² − 0.019 (c_ti (u_ti + d_ti) − 2 u_ti d_ti) − 0.383 c_ti²]

The minimal variance of this quadratic estimator is var(σ̂_GK²) = 0.27σ⁴/n and its efficiency is eff(σ̂_GK²) = 7.4.

The Kunitomo estimator
Let X_t be the logarithm of the price process, X_t = ln S_t. Ito's theorem gives its evolution:

dX_t = (µ_t − σ_t²/2) dt + σ_t dB_t

If the drift term becomes relevant in the estimation of the volatility, one can eliminate it by constructing a Brownian bridge on the period T as follows:

W_t = X_t − (t/T) X_T

If the initial condition is normalized to X_0 = 0, then by definition we always have W_T = 0. This construction automatically eliminates the drift term when its daily variation is small, µ_{t_{i+1}} − µ_{t_i} ≪ µ_{t_i}. We define the range of the Brownian bridge


D_ti = M_ti − m_ti, where M_ti = max_{t∈[ti,ti+1[} W_t and m_ti = min_{t∈[ti,ti+1[} W_t. It has been demonstrated that the expectation of the squared range of the Brownian bridge is directly proportional to the variance (Feller 1951):

E[D²] = T π² σ²/6    (2.4)

Hence, Kunitomo's estimator is defined as follows:

σ̂_K² = (6/(π² nT)) Σ_{i=1}^{n} (M_ti − m_ti)²

Higher moments of the Brownian bridge can also be calculated analytically and are given by formula 2.10 in Kunitomo (1992). In particular, the variance of Kunitomo's estimator is equal to var(σ̂_K²) = σ⁴/(5n), which implies an efficiency eff(σ̂_K²) = 10.

The Rogers-Satchell estimator
Another way to eliminate the drift effect was proposed by Rogers and Satchell. They consider the following property of the Brownian motion:

E[u (u − c) + d (d − c)] = σ² T

This expectation does not depend on the drift of the Brownian motion, hence it provides a drift-independent estimator which can be defined as:

σ̂_RS² = (1/(nT)) Σ_{i=1}^{n} [u_ti (u_ti − c_ti) + d_ti (d_ti − c_ti)]

The variance of this estimator is given by var(σ̂_RS²) = 0.331σ⁴/n, which gives an efficiency eff(σ̂_RS²) = 6. Like the other techniques based on the "high-low" range, this estimator underestimates the volatility due to the fact that the maximum of a discretized Brownian motion is smaller than the true value. Rogers and Satchell have also proposed a correction scheme which can be generalized to the other techniques. Let M be the number of quoted prices, so that h = T/M is the discretization step; the corrected estimator taking into account the finite-step error is given by the root of the following equation:

σ̂_h² = 2bh σ̂_h² + 2 (u − d) a √h σ̂_h + σ̂_RS²

where a = √(2π) (1/4 − (√2 − 1)/6) and b = (1 + 3π/4)/12.

2.2.4 How to eliminate both drift and opening effects?

A common way to eliminate both effects, coming from the drift and the opening jump, is to combine the various available volatility estimators. The general scheme


is to form a linear combination of the opening estimator σ̂²_O with the close estimator σ̂²_C or a high-low estimator σ̂²_HL. The coefficients of this combination are determined by minimizing the variance of the resulting estimator. Given the fraction of the closing interval f, we can improve all the high-low estimators discussed above by introducing the combination:

σ̂² = α σ̂²_O / f + (1 − α) σ̂²_HL / (1 − f)

Here, the trivial choice is α = f, which makes the estimator independent of the opening jump. However, the optimal value of the coefficient is α = 0.17 for the Parkinson and Kunitomo estimators, whereas it is α = 0.12 for the Garman-Klass estimator (Garman and Klass 1980). This technique can eliminate the effect of the opening jump for all estimators, but only the Kunitomo estimator can avoid both effects. Applying the same idea, Yang and Zhang (2000) have proposed another combination which, like the Kunitomo estimator, eliminates both effects. They choose:

σ̂²_YZ = α σ̂²_O / f + ((1 − α) / (1 − f)) [κ σ̂²_C + (1 − κ) σ̂²_HL]

In their work, Yang and Zhang used σ̂²_RS as the high-low estimator because it is drift independent. The coefficient α is chosen as α = f, and κ is obtained by minimizing the variance of the estimator. The minimization procedure gives the optimal value of the parameter κ:

κ_o = (β − 1) / (β + (n + 1)/(n − 1))

where β = E[(u (u − c) + d (d − c))²] / (σ⁴ (1 − f)²). As the numerator is proportional to (1 − f)², β is independent of f. In practice, β varies little (from 1.331 to 1.5) as the drift changes, and its value is usually taken as 1.34.
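The estimators above translate into a few lines of code. The following is a minimal sketch (our illustration, not from the thesis; function names are ours), where u, d and c denote the high, low and close log-prices relative to the open:

```python
import numpy as np

def parkinson(u, d, T=1.0):
    # u = ln(High/Open), d = ln(Low/Open); estimator based on the daily range
    return np.mean((u - d) ** 2) / (4.0 * np.log(2.0) * T)

def rogers_satchell(u, d, c, T=1.0):
    # c = ln(Close/Open); drift-independent since E[u(u-c) + d(d-c)] = sigma^2 T
    return np.mean(u * (u - c) + d * (d - c)) / T

def with_opening(sig2_o, sig2_hl, f, alpha=None):
    # combination sig2 = alpha*sig2_O/f + (1-alpha)*sig2_HL/(1-f);
    # the trivial choice alpha = f removes the opening-jump effect
    if alpha is None:
        alpha = f
    return alpha * sig2_o / f + (1.0 - alpha) * sig2_hl / (1.0 - f)
```

For instance, with u = 0.02, d = −0.02 and c = 0, the Rogers-Satchell term equals u² + d² = 8·10⁻⁴.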

2.2.5 Numerical simulations

Simulation with constant volatility

We test the various volatility estimators on a simulation of a geometric Brownian motion with constant annualized drift µ = 30% and constant annualized volatility σ = 15%. The simulation is based on N = 1000 trading days with M = 50 or 500 intraday observations, in order to illustrate the effect of discrete prices on the family of high-low estimators.

• Effect of the discretization
We first test the effect of the discretization on the various estimators. Here,


we take M = 50 or 500 intraday observations with µ = 0 and f = 0. In Figure 2.2, we present the simulation results for M = 50 price quotations per trading day. All the high-low estimators are weakly biased due to the discretization effect: they underestimate the volatility, as the range of the discretized path is smaller than the true range of the Brownian motion. We remark that the close-to-close estimator is unbiased, but its variance is too large. The correction scheme proposed by Rogers and Satchell eliminates the discretization effect. When the number of observations is larger, the discretization effect is negligible and all estimators are unbiased (see Figure 2.3).

Figure 2.2: Volatility estimators without drift and opening effects (M = 50)

• Effect of the non-zero drift
We now consider the case with a non-zero annual drift µ = 30%. Here, we take M = 500 intraday observations. In Figure 2.4, we observe that the Parkinson estimator and the Garman-Klass estimator depend strongly on the drift of the Brownian motion, whereas the Kunitomo and Rogers-Satchell estimators do not.

• Effect of the opening jump
For the effect of the opening jump, we simulate data with f = 0.3. In Figure 2.5, we take M = 500 intraday observations with zero drift µ = 0. We observe that, in the presence of the opening jump, all high-low estimators underestimate the volatility except the YZ estimator. By combining the open volatility estimator σ̂²_O with the other estimators, the effect of the opening can be completely eliminated (see Figure 2.6).
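A minimal sketch of this kind of experiment (our code, with an arbitrary seed) simulates the intraday GBM path day by day and compares the close-to-close and Parkinson estimates on the driftless case:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N, M = 0.0, 0.15, 1000, 500   # drift, annualized vol, days, intraday steps
dt = 1.0 / 252.0                         # one trading day in years

cc, pk = [], []
for _ in range(N):
    # one day of the log-price relative to the open, on M intraday steps
    z = rng.standard_normal(M)
    x = np.cumsum((mu - 0.5 * sigma**2) * dt / M + sigma * np.sqrt(dt / M) * z)
    cc.append(x[-1] ** 2)                           # squared close-to-close return
    u, d = max(x.max(), 0.0), min(x.min(), 0.0)     # intraday high/low vs the open
    pk.append((u - d) ** 2 / (4.0 * np.log(2.0)))   # Parkinson daily variance

sig_cc = np.sqrt(np.mean(cc) / dt)   # both should come out close to sigma = 15%,
sig_pk = np.sqrt(np.mean(pk) / dt)   # with Parkinson slightly biased low for finite M
```

The Parkinson estimate has a visibly smaller dispersion across runs, illustrating the efficiency gain of range-based estimators.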


Figure 2.3: Volatility estimators without drift and opening effects (M = 500)

Figure 2.4: Volatility estimators with µ = 30% and without opening effect (M = 500)


Figure 2.5: Volatility estimators with opening effect f = 0.3 and without drift (M = 500)

Figure 2.6: Volatility estimators with correction of the opening jump (f = 0.3)


Simulation with stochastic volatility

We now consider a simulation with stochastic volatility, described by the following model:

dS_t = µ_t S_t dt + σ_t S_t dB_t
dσ²_t = ξ σ²_t dB^σ_t    (2.5)

in which B^σ_t is a Brownian motion independent of the one driving the asset process. We first estimate the volatility with all the proposed estimators, then verify the quality of these estimators via a backtest using the voltarget strategy. For the simulation of the volatility, we take the same parameters as above with f = 0, µ = 0, N = 5000, M = 500, ξ = 0.01 and σ_0 = 0.4. In Figure 2.7, we present the results corresponding to the different estimators. We remark that the group of high-low estimators gives a better volatility estimate.

Figure 2.7: Volatility estimators on stochastic volatility simulation

We can estimate the error

committed for each estimator by the following formula:

ε = Σ_{t=1}^N (σ̂_t − σ_t)²

The errors obtained for the various estimators are summarized in Table 2.1 below. We now apply the volatility estimates to run the voltarget strategies; the result of this test is presented in Figure 2.8. (A detailed description of the voltarget strategy is given in Section 2.2.6.) In order to control the

Table 2.1: Estimation error for various estimators

Estimator                     σ̂²_CC   σ̂²_P    σ̂²_K    σ̂²_GK   σ̂²_RS   σ̂²_YZ
ε = Σ_{t=1}^N (σ̂_t − σ_t)²    0.135   0.072   0.063   0.08    0.076   0.065

quality of the voltarget strategy, we compute the volatility of the voltarget strategy obtained with each estimator. We remark that the volatility of the voltarget strategies is computed with the close-to-close estimator, using the same averaging window of 3 months (or 65 trading days). The result is reported in Figure 2.9. As shown in the figure, all estimators give more or less the same results. If we compute the error committed by these estimators, we obtain ε_CC = 0.9491, ε_P = 1.0331, ε_K = 0.9491, ε_GK = 1.2344, ε_RS = 1.2703, ε_YZ = 1.1383. This result may come from the fact that we have used the close-to-close estimator to calculate the volatility of all the voltarget strategies.

Figure 2.8: Test of voltarget strategy with stochastic volatility simulation

Hence, we consider another check of the

estimation quality. We compute the realized return of the voltarget strategies:

RV(t_i) = ln V_{t_i} − ln V_{t_{i−1}}

where V_{t_i} is the wealth of the voltarget portfolio. We expect this quantity to follow a Gaussian distribution with volatility σ* = 15%. Figure 2.10 shows the probability density function (Pdf) of the realized returns corresponding to all the considered estimators. In order to obtain a more visible result, we compute the difference between the cumulative distribution function (Cdf) of each estimator and

Figure 2.9: Test of voltarget strategy with stochastic volatility simulation

the expected Cdf (see Figure 2.11). Both results confirm that the Parkinson and the Kunitomo estimators improve the quality of the volatility estimation.

2.2.6 Backtest

Volatility estimation of the S&P 500 index

We now apply the estimators discussed above to the S&P 500 index. Here, we do not have tick-by-tick intraday data, hence Kunitomo's estimator and the Rogers-Satchell correction cannot be applied. We remark that the effect of the drift is almost negligible, which is confirmed by the Parkinson and Garman-Klass estimators. The spontaneous opening jump is estimated simply by:

f_t = (1 + (σ̂_C / σ̂_O)²)^{−1}

We then apply the exponential-average technique to obtain a filtered version of this quantity. The average value of the closing interval over the considered data is f̄ = 0.015 for the S&P 500 and f̄ = 0.21 for BBVA SQ Equity. In the following, we use different estimators to extract the signal f_t. The trivial one uses f_t itself as the prediction of the opening jump, denoted f̂_t; we then construct the usual ones: the moving average f̂_ma, the exponential moving average f̂_exp and the cumulated average f̂_c. In Figure 2.15, we show the results corresponding to the different filtered f on the

Figure 2.10: Comparison between different probability density functions

Figure 2.11: Comparison between the different cumulative distribution functions

Figure 2.12: Volatility estimators on the S&P 500 index

Figure 2.13: Volatility estimators on BHI UN Equity


Figure 2.14: Estimation of the closing interval for the S&P 500 index

Figure 2.15: Estimation of the closing interval for BHI UN Equity


BHI UN Equity data. Figure 2.13 shows that the family of high-low estimators gives a better result than the classical close-to-close estimator. In order to check the quality of these estimators for predicting the volatility, we examine the value of the likelihood function corresponding to each estimator. Assuming that the observed signal follows a Gaussian distribution, the likelihood function is defined as:

l(σ) = −(n/2) ln 2π − (1/2) Σ_{i=1}^n ln σ²_i − (1/2) Σ_{i=1}^n (R_{i+1} / σ_i)²

where R is the future realized return. In Figure 2.17, we present the values of the likelihood function for the different estimators. This function reaches its maximal value for the Rogers-Satchell estimator.

Figure 2.16: Likelihood function for various estimators on the S&P 500

Backtest on the voltarget strategy

We now backtest the efficiency of the various volatility estimators with the voltarget strategy on the S&P 500 index and on an individual stock. Within the voltarget strategy, the exposure to the risky asset is determined by:

α_t = σ* / σ̂_t

where σ* is the expected volatility of the strategy and σ̂_t is the volatility predicted by the estimators above. In the backtest, we take the annualized volatility σ* = 15%, with historical data from 01/01/2001 to 31/12/2011. We present the results for two cases:
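The wealth of the strategy can be sketched as follows (our code, not from the thesis; the estimate σ̂_t must be known before the return it is applied to):

```python
import numpy as np

def voltarget_wealth(returns, sigma_hat, target=0.15):
    # exposure alpha_t = sigma*/sigma_hat_t, applied to the NEXT period's return
    alpha = target / sigma_hat
    wealth = np.concatenate(([1.0], np.cumprod(1.0 + alpha[:-1] * returns[1:])))
    return wealth
```

With a constant estimate σ̂ = 30% and target σ* = 15%, the exposure is 0.5 at every step, so the strategy earns half of each period's return.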

Figure 2.17: Likelihood function for various estimators on BHI UN Equity

• Backtest on the S&P 500 index with a moving-average window of 1 month (n = 21) of historical data. We remark that in this case the volatility of the index is small, so the error in the volatility estimation has less impact. However, the high-low estimators suffer from the discretization effect and therefore underestimate the volatility. For the index, this effect is more important, so the close-to-close estimator gives the best performance.

• Backtest on a single asset with a moving-average window of 1 month (n = 21) of historical data. For a particular asset such as BBVA SQ Equity, the volatility is important, hence the errors due to the efficiency of the volatility estimators matter. The high-low estimators now give better results than the classical one.

In order to illustrate the efficiency of the range-based estimators, we rank the high-low estimators against the benchmark close-to-close estimator. We apply the voltarget strategy with the close-to-close estimator σ̂²_CC and with a high-low estimator σ̂²_HL, compare the Sharpe ratios obtained by these two estimators, and count the number of times the high-low estimator performs better over the ensemble of stocks. The result over the S&P 500 index and its first 100 constituents is summarized in Table 2.3.


Figure 2.18: Backtest of voltarget strategy on S&P 500 index

Figure 2.19: Backtest of voltarget strategy on BHI UN Equity

Table 2.2: Performance of σ̂²_HL versus σ̂²_CC for different averaging windows

Estimator   6 months   3 months   2 months   1 month
σ̂²_P        56.2%      52.8%      60.7%      65.2%
σ̂²_GK       52.8%      49.4%      60.7%      64.0%
σ̂²_RS       52.8%      51.7%      60.7%      64.0%
σ̂²_YZ       57.3%      53.9%      56.2%      64.0%

Table 2.3: Performance of σ̂²_HL versus σ̂²_CC for different filters of f

Estimator   f̂_c      f̂_ma     f̂_exp    f̂_t
σ̂²_P        65.2%    64.0%    64.0%    64.0%
σ̂²_GK       64.0%    61.8%    61.8%    61.8%
σ̂²_RS       64.0%    61.8%    60.7%    60.7%
σ̂²_YZ       64.0%    64.0%    64.0%    64.0%

2.3 Estimation of realized volatility

The common way to estimate the realized volatility is to estimate the expected value of the variance over an observation window and then compute the corresponding volatility. In doing so, we face a dilemma: a long historical window decreases the estimation error, as discussed in the last paragraph, whereas a short one yields an estimate closer to the present volatility. To overcome this dilemma, we need some idea of the dynamics of the variance σ²_t that we would like to measure. Combining this knowledge of the dynamics of σ²_t with the error committed over a long historical window, we can find an optimal window for the volatility estimator. We assume that the variance follows the simplified dynamics used in the last numerical simulation:

dS_t = µ_t S_t dt + σ_t S_t dB_t
dσ²_t = ξ σ²_t dB^σ_t

in which B^σ_t is a Brownian motion independent of the one driving the asset process.

2.3.1 Moving-average estimator

In this section, we show how the optimal window of the moving-average estimator is obtained via a simple example. Let us consider the canonical estimator:

σ̂² = (1 / (n T)) Σ_{i=1}^n R²_{t_i}


Here, the time increment is chosen to be constant, t_i − t_{i−1} = T; the variance of this estimator at time t_n is then:

var(σ̂²) = 2 σ⁴_{t_n} T / (t_n − t_0) = 2 σ⁴_{t_n} / n

On the other hand, σ²_t is itself a stochastic process, hence its variance conditional on σ²_{t_n} gives us the error due to the use of historical observations. We rewrite:

(1 / (t_n − t_0)) ∫_{t_0}^{t_n} σ²_t dt = σ²_{t_n} − (1 / (t_n − t_0)) ∫_{t_0}^{t_n} (t − t_0) σ²_t ξ dB^σ_t

so the error due to the stochastic volatility is given by:

var( (1 / (t_n − t_0)) ∫_{t_0}^{t_n} σ²_t dt | σ²_{t_n} ) = ((t_n − t_0) / 3) σ⁴_{t_n} ξ² ≈ n T σ⁴_{t_n} ξ² / 3

The total error of the canonical estimator is simply the sum of these two errors, since the two Brownian motions are assumed to be independent. We define the total estimation error as:

e(σ̂²) = 2 σ⁴_{t_n} / n + n T σ⁴_{t_n} ξ² / 3

In order to obtain the optimal window for the volatility estimation, we minimize the error function e(σ̂²) with respect to nT, which leads to the following equation:

−2 σ⁴_{t_n} / (n² T) + σ⁴_{t_n} ξ² / 3 = 0

This equation provides a very simple solution, nT = √(6T)/ξ, with the optimal error e(σ̂²_opt) ≈ 2 √(2T/3) σ⁴_{t_n} ξ. The major difficulty of this estimator is the calibration of the parameter ξ, which is not trivial because σ²_t is an unobservable process. Different techniques can be considered, such as the maximum likelihood approach which will be discussed later.

√ This equation provides a very simple solution nT = 6T /ξ with the optimal error p 2 ≈ 2 2T /3 σt4n ξ. The major difficulty of this estimator is to calibrate is now e σ ˆopt the parameter ξ which is not trivial because ξt2 is an unobservable process. Different techniques can be considered such as the maximum likehood which will be discussed later.

2.3.2 IGARCH estimator

We now discuss another approach for estimating the realized volatility, based on the IGARCH model. The detailed theoretical derivation of the method is given in Drost et al. (1993, 1999). It consists of a volatility estimator of the form:

σ̂²_t = β σ̂²_{t−T} + ((1 − β) / T) R²_t

where T is a constant estimation increment. Iterating the recurrence relation above, we obtain the IGARCH variance estimator as a function of past observed returns:

σ̂²_t = ((1 − β) / T) Σ_{i=1}^n β^i R²_{t−iT} + β^n σ̂²_{t−nT}    (2.6)
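The recursion is one line of code. A minimal sketch (our illustration; the initialization is an assumption, the thesis does not specify one):

```python
import numpy as np

def igarch_variance(returns, beta, T=1.0 / 252.0, init=None):
    # recursion sigma2_t = beta * sigma2_{t-T} + (1 - beta) * R_t^2 / T
    sigma2 = np.empty(len(returns))
    s = init if init is not None else returns[0] ** 2 / T  # assumed starting value
    for i, r in enumerate(returns):
        s = beta * s + (1.0 - beta) * r**2 / T
        sigma2[i] = s
    return sigma2
```

With constant returns and the matching initial value, the recursion sits at its fixed point R²/T, which makes the exponential-weighting interpretation below easy to verify.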


We remark that the contribution of the last term tends to 0 as n tends to infinity. This estimator again has the form of a weighted average, so an approach similar to the canonical case applies. Assuming that the volatility follows the lognormal dynamics described by Equation 2.3, the optimal value of β is given by:

β* = (ξ √(8T − ξ²T²) − 4) / (ξ²T − 4)    (2.7)

We encounter again the same question as in the canonical case: how to calibrate the parameter ξ of the lognormal dynamics. In practice, we proceed the other way around: we first seek the optimal value β* of the IGARCH estimator, then use the inverse of relation 2.7 to determine the value of ξ:

ξ = √( 4 (1 − β*)² / (T (1 + β*²)) )

Remark 1 Finally, as stressed at the beginning of this discussion, we point out that the IGARCH estimator can be viewed as an exponentially weighted average. Consider an IGARCH estimator with constant time increment; its expected value is:

E[σ̂²_t | σ] = E[ ((1 − β) / T) Σ_{i=1}^{+∞} β^i R²_{t−iT} | σ ]
            = ((1 − β) / T) Σ_{i=1}^{+∞} β^i ∫_{t−iT}^{t−iT+T} σ²_u du
            = Σ_{i=1}^{+∞} (β^i / (Σ_{i=1}^{+∞} β^i T)) ∫_{t−iT}^{t−iT+T} σ²_u du
            = Σ_{i=1}^{+∞} (e^{−iTλ} / (Σ_{i=1}^{+∞} T e^{−iTλ})) ∫_{t−iT}^{t−iT+T} σ²_u du

with λ = −ln β / T. In this form, we conclude that the IGARCH estimator is a weighted average of the variance σ²_t with an exponential weight distribution. The annualized estimator of the variance can thus be written as:

E[σ̂²_t | σ] = Σ_{i=1}^{+∞} e^{−iTλ} ∫_{t−iT}^{t−iT+T} σ²_u du / Σ_{i=1}^{+∞} T e^{−iTλ}

This expression admits a continuous limit as T → 0.


2.3.3 Extension to range-based estimators

The estimation of the optimal window in the last discussion can also be generalized to range-based estimators. The main idea is to obtain the trade-off between the estimator error (the variance of the estimator) and the volatility dynamics described by model (2.3). The total error of the estimator is given by:

e(σ̂²) = var(σ̂²) + (n T / 3) σ⁴_{t_n} ξ²

Here, the first term is the estimator error coming from the discrete sum, whereas the second term is the error due to the stochastic volatility. The first term is already known from the study of the various estimators in the last section; the second term depends on the chosen volatility dynamics. Using the notation of the estimator efficiency, we rewrite the above expression as:

e(σ̂²) = (1 / eff(σ̂²)) (2 σ⁴_{t_n} / n) + (n T / 3) σ⁴_{t_n} ξ²

The minimization of the total error proceeds exactly as in the canonical example, and we obtain the optimal averaging window:

n T = √( 6 T / (eff(σ̂²) ξ²) )    (2.8)

The IGARCH estimator can also be applied to the various high-low estimators; the extension consists of performing an exponential moving average instead of the simple average. The parameter β of the exponential moving average is again determined by the maximum likelihood method, as shown in the discussion below.

2.3.4 Calibration procedure of the estimators of realized volatility

As discussed above, the estimators of realized volatility depend on the choice of the underlying volatility dynamics. In order to obtain the best estimation of the realized volatility, we must estimate the parameter characterizing this dynamics. Two possible approaches to obtain the optimal value of these estimators are:

• the least-squares problem, which consists in minimizing the objective function:

Σ_{i=1}^n (R²_{t_i+T} − T σ̂²_{t_i})²

• or the maximum likelihood problem, which consists in maximizing the log-likelihood objective function:

−(n/2) ln 2π − Σ_{i=0}^n (1/2) ln(T σ̂²_{t_i}) − Σ_{i=0}^n R²_{t_i+T} / (2 T σ̂²_{t_i})
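The maximum likelihood calibration can be sketched as a simple grid search over the IGARCH parameter β (our code, not the thesis implementation; the grid and the initialization are arbitrary assumptions):

```python
import numpy as np

def log_likelihood(returns, sigma2, T=1.0 / 252.0):
    # Gaussian log-likelihood of R_{t+T} under the one-step-ahead variance T * sigma2_t
    r2, v = returns[1:] ** 2, T * sigma2[:-1]
    return (-0.5 * len(r2) * np.log(2.0 * np.pi)
            - 0.5 * np.sum(np.log(v)) - 0.5 * np.sum(r2 / v))

def calibrate_beta(returns, grid=None, T=1.0 / 252.0):
    # grid search of the IGARCH parameter beta maximizing the log-likelihood;
    # assumes returns[0] != 0 so the recursion starts from a positive variance
    grid = np.linspace(0.80, 0.99, 20) if grid is None else grid
    ll = []
    for beta in grid:
        sigma2, s = np.empty(len(returns)), returns[0] ** 2 / T
        for i, r in enumerate(returns):
            s = beta * s + (1.0 - beta) * r**2 / T
            sigma2[i] = s
        ll.append(log_likelihood(returns, sigma2, T))
    return grid[int(np.argmax(ll))]
```

The optimal β* then maps back to ξ through the inverse of relation 2.7, which allows the comparison with the moving-average estimator.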


We remark here that the moving-average estimator depends only on the averaging window, whereas the IGARCH estimator depends only on the parameter β. In general, there is no way to compare these two estimators without assuming a specific dynamics. With one, the optimal values of both parameters are deduced from the optimal value of ξ, which offers a direct comparison between the quality of the two estimators.

Example of realized volatility

We illustrate here how the realized volatility is computed by the two methods discussed above. In order to illustrate how the optimal averaging window nT or β* is calibrated, we plot the likelihood functions of these two estimators for one value of the volatility at a given date. In Figure 2.20, we present the logarithm of the likelihood functions for different values of ξ. The maximum of the function l(ξ) gives us the optimal value ξ*, which is then used to evaluate the volatility for the two methods. We remark that the IGARCH estimator is better suited to finding the global maximum because its log-likelihood is a concave function. For the moving-average method, the log-likelihood function is not smooth and presents a complicated structure with local maxima, which is less convenient for the optimization procedure.

Figure 2.20: Comparison between IGARCH estimator and CC estimator

We now test the implementation of IGARCH estimators for the various high-low estimators. As we have demonstrated that the IGARCH estimator is equivalent to


an exponential moving average, the implementation for high-low estimators can be set up in the same way as for the close-to-close estimator. In order to determine the optimal parameter β*, we perform an optimization scheme on the log-likelihood function. In Figure 2.21, we present the comparison of the log-likelihood functions of the different estimators as a function of the parameter β. The optimal parameter β* of each estimator corresponds to the maximum of its log-likelihood function. In order to have a clear idea of the size of the

Figure 2.21: Likelihood function of high-low estimators versus filtered parameter β

moving-average window corresponding to the optimal parameter β*, we use formula (2.7) to perform the conversion. The result is reported in Figure 2.22 below.

Backtest on the voltarget strategy

We take the historical data of the S&P 500 index over the period from 01/2001 to 12/2011; the averaging window of the close-to-close estimator is chosen as n = 25. In Figure 2.23, we show the different estimations of the realized volatility. In order to test the efficiency of these realized-volatility estimators (moving average and IGARCH), we first evaluate the likelihood function for the close-to-close estimator and the realized estimators, then apply these estimators to the voltarget strategy as in the last section. In Figure 2.25, we present the value of the likelihood function over the period from 01/2001 to 12/2010 for three estimators: CC, CC optimal (moving average) and IGARCH. The estimator with the highest value of the likelihood function is the one that gives the best prediction of the volatility.


Figure 2.22: Likelihood function of high-low estimators versus effective moving window

Figure 2.23: IGARCH estimator versus moving-average estimator for close-to-close prices


Figure 2.24: Comparison between different IGARCH estimators for high-low prices

Figure 2.25: Daily estimation of the likelihood function for various close-to-close estimators

Figure 2.26: Daily estimation of the likelihood function for various high-low estimators

In Figure 2.27, the result of the backtest of the voltarget strategy is shown for the three considered estimators. The estimators with a dynamical choice of the averaging parameter always give better results than a simple close-to-close estimator with a fixed averaging window n = 25. We next backtest the IGARCH estimator applied to the high-low price data; the comparison with the IGARCH estimator applied to close-to-close data is shown in Figure 2.28. We observe that the IGARCH estimator for close-to-close prices is one of the estimators that produce the best backtest.

2.4 High-frequency volatility estimators

We have discussed in the previous sections how to measure the daily volatility based on the range of the observed prices. If more information is available in the trading data, such as the full real-time quotation, can one estimate the volatility more accurately? As the trading frequency increases, we expect the precision of the estimator to improve as well. However, when the trading frequency reaches a certain limit, a new phenomenon coming from the non-equilibrium of the market emerges and spoils the precision. This limit defines the optimal frequency for the classical estimator; in the literature, it is more or less agreed to be at the frequency of one trade every 5 minutes. The phenomenon is called microstructure noise and is characterized by the bid-ask spread or the transaction effect. In this section, we summarize and test some recent proposals which attempt to eliminate the microstructure noise.


Figure 2.27: Backtest for close-to-close estimator and realized estimators

Figure 2.28: Backtest for IGARCH high-low estimators compared to the IGARCH close-to-close estimator


2.4.1 Microstructure effect

It has been demonstrated in the financial literature that the realized-return estimator is not robust when the sampling frequency is too high. Two possible explanations of this effect are the following. From the probabilistic point of view, the phenomenon comes from the fact that the cumulated return (the logarithm of the price) is not a semimartingale, as we assumed in the last section; this shows up only at short time scales, when the trading frequency is high enough. From the financial point of view, the effect is explained by the existence of so-called market microstructure noise, which originates from the bid-ask spread. We now discuss the simplest model, which includes the microstructure noise as a perturbation independent of the underlying Brownian motion. We assume that the true cumulated return is an unobservable process following a Brownian motion:

dX_t = (µ_t − σ²_t / 2) dt + σ_t dB_t

The observed signal Y_t is the cumulated return perturbed by the microstructure noise ε_t:

Y_t = X_t + ε_t

For the sake of simplicity, we use the following assumptions:

(i) ε_{t_i} is iid with E[ε_{t_i}] = 0 and E[ε²_{t_i}] = E[ε²]
(ii) ε_t ⊥ B_t

From these assumptions, we see immediately that the volatility estimator based on the historical data Y_{t_i} is biased:

var(Y) = var(X) + E[ε²]

The first term var(X) scales with t (the estimation horizon) while E[ε²] is constant, so this estimator can be considered unbiased only if the time horizon is large enough (t > E[ε²]/σ²). At high frequency, the second term is not negligible, and a better estimator must eliminate it.
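A quick simulation (our illustration, with arbitrary parameter values) makes the bias visible: the realized variance of the noisy returns is dominated by the noise term once the number of ticks M is large.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, M, T = 0.20, 10_000, 1.0 / 252.0   # daily horizon sampled on M intraday ticks
eps2 = 1e-7                               # assumed E[eps^2] of the microstructure noise

x = np.cumsum(sigma * np.sqrt(T / M) * rng.standard_normal(M))  # efficient log-price
y = x + np.sqrt(eps2) * rng.standard_normal(M)                  # observed log-price

rv_true = np.sum(np.diff(x) ** 2)   # ~ sigma^2 * T (integrated variance)
rv_obs = np.sum(np.diff(y) ** 2)    # ~ sigma^2 * T + 2 M E[eps^2] (heavily biased)
```

Here σ²T ≈ 1.6·10⁻⁴ while 2 M E[ε²] = 2·10⁻³, so the noise contribution is an order of magnitude larger than the signal.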

2.4.2 Two time-scale volatility estimator

Using different time scales to extract the true volatility of the hidden (noise-free) price process was independently proposed by Zhang et al. (2005) and Bandi et al. (2004). In this paragraph, we follow the approach of the first reference to define the intraday volatility estimator. We prefer to discuss the main idea of this method and its practical implementation rather than all the details of the stochastic calculus concerning the expectation and the variance of the realized return; details of the derivation can be found in Zhang et al. (2005).


Definitions and notations In order to fix the notations, let us consider a time-period [0, T ] which is divided in to M − 1 intervals (M can be understood as the frequency). The quadratic variation of the Bronian motion over this period is denoted: Z T hX, XiT = σt2 dt 0

For the discretized version of the quadratic variation, we employ the [., .] notation: X 2 [X, X]T = Xti+1 − Xti ti ,ti+1 ∈[0,T ]

Then the habitual estimator of realized return over the interval [0, T ] is given by: X 2 [Y, Y ]T = Yti+1 − Yti ti ,ti+1 ∈[0,T ]

We remark that the number of points in the interval [0, T ] can be changed. In fact, the expectation value of the quadratic variation should not depend on the distribution of points in this interval. Let us define the ensemble of points in one period as a grid G: G = {t0 , . . . , tM } Then a subgrid H is defined as:

H = {tk1 , . . . , tkm } where (tkj ) with j = 1, . . . m is a subsequence of (ti ) with i = 1, . . . M . The number of increments is denoted as: |H| = card (H) − 1

With these notations, the quadratic variation over a subgrid $\mathcal{H}$ reads:
$$[Y, Y]_T^{\mathcal{H}} = \sum_{t_{k_i}, t_{k_{i+1}} \in \mathcal{H}} \left(Y_{t_{k_{i+1}}} - Y_{t_{k_i}}\right)^2$$

The realized volatility estimator over the full grid
Computing the quadratic variation over the full grid $\mathcal{G}$ means working at the highest frequency. As discussed above, it is not surprising that it suffers the most from the microstructure noise:
$$[Y, Y]_T^{\mathcal{G}} = [X, X]_T^{\mathcal{G}} + 2\,[X, \varepsilon]_T^{\mathcal{G}} + [\varepsilon, \varepsilon]_T^{\mathcal{G}}$$
Under the hypothesis on the microstructure noise, the conditional expectation of this estimator is equal to:
$$\mathbb{E}\left[\,[Y, Y]_T^{\mathcal{G}} \mid X\right] = [X, X]_T^{\mathcal{G}} + 2M\,\mathbb{E}[\varepsilon^2]$$


and the variance of the estimator:
$$\operatorname{var}\left(\,[Y, Y]_T^{\mathcal{G}} \mid X\right) = 4M\,\mathbb{E}[\varepsilon^4] + 8\,[X, X]_T^{\mathcal{G}}\,\mathbb{E}[\varepsilon^2] - 2\operatorname{var}(\varepsilon^2) + O(M^{-1/2})$$
In the two expressions above, the sums are arranged order by order. In the limit $M \to \infty$, we obtain the usual central limit theorem result:
$$M^{-1/2}\left([Y, Y]_T^{\mathcal{G}} - 2M\,\mathbb{E}[\varepsilon^2]\right) \xrightarrow{\;L\;} 2\left(\mathbb{E}[\varepsilon^4]\right)^{1/2} N(0, 1)$$
Hence, as $M$ increases, $[Y, Y]_T^{\mathcal{G}}$ becomes a good estimator of the microstructure noise and we define:
$$\widehat{\mathbb{E}[\varepsilon^2]} = \frac{1}{2M}\,[Y, Y]_T^{\mathcal{G}}$$
The central limit theorem for this estimator states that, as $M \to \infty$:
$$M^{1/2}\left(\widehat{\mathbb{E}[\varepsilon^2]} - \mathbb{E}[\varepsilon^2]\right) \xrightarrow{\;L\;} \left(\mathbb{E}[\varepsilon^4]\right)^{1/2} N(0, 1)$$

The realized volatility estimator over a subgrid
As mentioned in the last discussion, increasing the frequency spoils the estimation of the volatility due to the presence of the microstructure noise. A naive solution is to reduce the number of points in the grid, i.e. to consider only a subgrid; one can then average over a number of choices of subgrids. Let us consider a subgrid $\mathcal{H}$ with $|\mathcal{H}| = m$; the same result as for the full grid is obtained by replacing $M$ by $m$:
$$\mathbb{E}\left[\,[Y, Y]_T^{\mathcal{H}} \mid X\right] = [X, X]_T^{\mathcal{H}} + 2m\,\mathbb{E}[\varepsilon^2]$$
Let us now consider a sequence of subgrids $\mathcal{H}^{(k)}$, $k = 1, \ldots, K$, which satisfies $\mathcal{G} = \bigcup_{k=1}^{K} \mathcal{H}^{(k)}$ and $\mathcal{H}^{(k)} \cap \mathcal{H}^{(l)} = \emptyset$ for $k \neq l$. Averaging over these $K$ subgrids gives the estimator:
$$[Y, Y]_T^{\mathrm{avg}} = \frac{1}{K} \sum_{k=1}^{K} [Y, Y]_T^{\mathcal{H}^{(k)}}$$

We define the average number of increments per subgrid $\bar{m} = (1/K)\sum_{k=1}^{K} m_k$; the final expression is then:
$$\mathbb{E}\left[\,[Y, Y]_T^{\mathrm{avg}} \mid X\right] = [X, X]_T^{\mathrm{avg}} + 2\bar{m}\,\mathbb{E}[\varepsilon^2]$$
This estimator of volatility is still biased, and its precision depends strongly on the choice of the length of the subgrids and the number of subgrids. In their paper, Zhang et al. demonstrate that there exists an optimal value $K^\star$ for which the best performance of the estimator is reached.


Two-time-scale estimator
As the full-grid estimator and the subgrid-averaging estimator both contain the same microstructure-noise component up to a factor, we can combine them into a new estimator from which the microstructure noise is completely eliminated. Let us consider the following estimator:
$$\hat{\sigma}^2_{\mathrm{ts}} = \left(1 - \frac{\bar{m}}{M}\right)^{-1}\left([Y, Y]_T^{\mathrm{avg}} - \frac{\bar{m}}{M}\,[Y, Y]_T^{\mathcal{G}}\right)$$
This estimator is now unbiased, with a precision determined by the choice of $K$ and $\bar{m}$. In the theoretical framework, the optimal value is given as a function of the noise variance and the fourth moment of the volatility. In practice, we scan over the number of subgrids, of size $m \propto M/K$, in order to look for the optimal estimator.
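The two-time-scale construction can be sketched in a few lines. The code below is an illustrative Python implementation under the assumptions of the model above (constant volatility, Gaussian iid noise); the parameter values and variable names are our own. It compares the naive full-grid realized variance with the bias-corrected two-scale estimate on one simulated day:

```python
import random

random.seed(0)

M = 23400            # e.g. one observation per second over a 6.5-hour day
sigma2_day = 1e-4    # daily integrated variance of the hidden price (illustrative)
noise_sd = 5e-4      # microstructure noise standard deviation (illustrative)

# Hidden efficient price X (constant volatility) and observed price Y = X + eps
step_sd = (sigma2_day / M) ** 0.5
X = [0.0]
for _ in range(M):
    X.append(X[-1] + random.gauss(0.0, step_sd))
Y = [x + random.gauss(0.0, noise_sd) for x in X]

def rv(path):
    """Realized variance: sum of squared increments over the given grid."""
    return sum((path[i + 1] - path[i]) ** 2 for i in range(len(path) - 1))

def tsrv(Y, K):
    """Two-scale realized variance (Zhang, Mykland and Ait-Sahalia, 2005)."""
    M = len(Y) - 1
    # average the realized variances over the K offset subgrids H(k)
    rv_avg = sum(rv(Y[k::K]) for k in range(K)) / K
    # average number of increments per subgrid (m bar)
    m_bar = sum(len(Y[k::K]) - 1 for k in range(K)) / K
    # bias-corrected and rescaled combination of the two time scales
    return (rv_avg - (m_bar / M) * rv(Y)) / (1.0 - m_bar / M)

est = tsrv(Y, K=100)
print(rv(Y), est, sigma2_day)   # naive full-grid RV is heavily biased upward
```

The naive estimator is two orders of magnitude above the true integrated variance here, while the two-scale estimate recovers its order of magnitude.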

2.4.3 Numerical implementation and backtesting

We now backtest the proposed technique on the S&P 500 index with the following choice of subgrids. The full grid is defined by the set of one-minute data from the opening to the close of the trading day (9h00 to 17h30). Data are taken from 1 February 2011 to 6 June 2011. We denote the full grid for each trading-day period by:
$$\mathcal{G} = \{t_0, \ldots, t_M\}$$
and the subgrids are chosen as follows:

$$\mathcal{H}^{(k)} = \{t_{k-1}, t_{k-1+K}, \ldots, t_{k-1+n_k K}\}$$
where the index $k = 1, \ldots, K$ and $n_k$ is the integer making $t_{k-1+n_k K}$ the last element of $\mathcal{H}^{(k)}$. As we cannot compute exactly the optimal value $K^\star$ for each trading period, we employ an iterative scheme which tends to converge to the optimal value. The analytical expression of $K^\star$ is given by Zhang et al.:
$$K^\star = \left(\frac{12\left(\mathbb{E}[\varepsilon^2]\right)^2}{T\,\mathbb{E}[\eta^2]}\right)^{1/3} M^{2/3}$$
where $\eta$ is given by the expression:
$$\eta^2 = \int_0^T \sigma_t^4 \,\mathrm{d}t$$

In a first approximation, we consider the case where the intraday volatility is constant; the expression of $\eta$ then simplifies to $\eta^2 = T\sigma^4$. In Figure 2.29, we present the result for the intraday volatility of the S&P 500 index, taking into account only the trading day, under the assumption of constant volatility. The two-time-scale estimator reduces the effect of the microstructure noise on the realized volatility computed over the full grid.
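The subgrid choice above amounts to taking every $K$-th point with offsets $0, \ldots, K-1$, which in Python is plain list slicing. A minimal illustration (the sizes are our own stand-ins for the one-minute S&P 500 grid) checks that the $K$ subgrids are pairwise disjoint and cover the full grid:

```python
# Stand-in for the one-minute grid of a 9h00-17h30 trading day: 511 timestamps
grid = list(range(511))   # t_0, ..., t_M with M = 510
K = 5

# H(k) = {t_{k-1}, t_{k-1+K}, ...} for k = 1, ..., K is plain list slicing
subgrids = [grid[k - 1::K] for k in range(1, K + 1)]

# the subgrids are pairwise disjoint and their union is the full grid
all_points = sorted(p for H in subgrids for p in H)
print(len(subgrids), all_points == grid)
```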

Figure 2.29: Two-time-scale estimator of intraday volatility for the S&P 500 index from February 2011 to June 2011 (volatility with the full grid, with a subgrid, and with the two-scale estimator).

2.5 Conclusion

Voltarget strategies are efficient ways to control risk when building trading strategies. Hence, a good estimator of the volatility is essential from this perspective. In this paper, we show that we can use the price range to improve the forecasting of market volatility. The use of high and low prices is less important for the index, as it gives more or less the same result as the traditional close-to-close estimator. However, for individual stocks with a higher volatility level, the high-low estimators improve the prediction of volatility. We consider several backtests on the S&P 500 index and obtain competitive results with respect to the traditional moving-average estimator of volatility. Indeed, we consider a simple stochastic volatility model which permits integrating the dynamics of the volatility into the estimator. An optimization scheme via the maximum-likelihood algorithm allows us to obtain dynamically the optimal averaging window. We also compare these results for the range-based estimators with the well-known IGARCH model. The comparison of the optimal values of the likelihood functions for the various estimators also gives us a ranking of the estimation errors. Finally, we studied high-frequency volatility estimation, which is a very active topic in financial mathematics. Using the simple model proposed by Zhang et al. (2005), we show that the microstructure noise can be eliminated by the two-time-scale estimator.


Bibliography

[1] Bandi F. M. and Russell J. R. (2006), Separating Microstructure Noise from Volatility, Journal of Financial Economics, 79, pp. 655-692.
[2] Drost F. C. and Nijman T. E. (1993), Temporal Aggregation of GARCH Processes, Econometrica, 61, pp. 909-927.
[3] Drost F. C. and Werker B. J. M. (1996), Closing the GARCH Gap: Continuous Time GARCH Modeling, Journal of Econometrics, 74, pp. 31-57.
[4] Feller W. (1951), The Asymptotic Distribution of the Range of Sums of Independent Random Variables, Annals of Mathematical Statistics, 22, pp. 427-432.
[5] Garman M. B. and Klass M. J. (1980), On the Estimation of Security Price Volatilities from Historical Data, Journal of Business, 53, pp. 67-78.
[6] Kunitomo N. (1992), Improving the Parkinson Method of Estimating Security Price Volatilities, Journal of Business, 65, pp. 295-302.
[7] Parkinson M. (1980), The Extreme Value Method for Estimating the Variance of the Rate of Return, Journal of Business, 53, pp. 61-65.
[8] Rogers L. C. G. and Satchell S. E. (1991), Estimating Variance from High, Low and Closing Prices, Annals of Applied Probability, 1, pp. 504-512.
[9] Yang D. and Zhang Q. (2000), Drift-Independent Volatility Estimation Based on High, Low, Open and Close Prices, Journal of Business, 73, pp. 477-491.
[10] Zhang L., Mykland P. A. and Aït-Sahalia Y. (2005), A Tale of Two Time Scales: Determining Integrated Volatility with Noisy High-Frequency Data, Journal of the American Statistical Association, 100(472), pp. 1394-1411.


Chapter 3

Support Vector Machine in Finance

In this chapter, we review the well-known machine learning technique called the support vector machine (SVM). This technique can be employed in different contexts such as classification, regression or density estimation, according to Vapnik [1998]. Within this paper, we would first like to give an overview of this method and its numerical implementations, then bridge it to financial applications such as stock selection.

Keywords: Machine learning, Statistical learning, Support vector machine, regression, classification, stock selection.

3.1 Introduction

The support vector machine is an important part of Statistical Learning Theory. It was first introduced in the early 1990s by Boser et al. (1992) and has contributed important applications in various domains such as pattern recognition (for example, handwritten digit or image recognition), bioinformatics, etc. This technique can be employed in different contexts such as classification, regression or density estimation, according to Vapnik [1998]. Recently, different applications in the financial field have been developed along two main directions. The first one employs the SVM as a non-linear estimator in order to forecast the market tendency or volatility. In this context, the SVM is used as a regression technique, with a feasible extension to the non-linear case thanks to the kernel approach. The second direction consists of using the SVM as a classification technique which aims to perform the stock selection in a trading strategy (for example a long/short strategy). In this paper, we review the support vector machine and its applications in finance from both points of view. The literature in this recent field is quite diversified, with many approaches and different techniques. We would first like to give an overview of the SVM from its basic construction to its extensions, including the multi-classification problem. We next present different numerical implementations, then bridge them to financial applications.


This paper is organized as follows. In Section 2, we recall the framework of support vector machine theory, based on the approach proposed in Chapelle (2002). We next work out various implementations of this technique from both the primal and dual problems in Section 3. The extension of the SVM to the case of multi-classification is discussed in Section 4. We finish with the introduction of the SVM in the financial domain via an example of stock selection in Sections 5 and 6.

3.2 Support vector machine at a glance

We attempt to give an overview of the support vector machine method in this section. In order to introduce the basic idea of the SVM, we start with a first discussion of the classification method via the concepts of hard-margin and soft-margin classification. As the work pioneered by Vapnik and Chervonenkis (1971) established a framework for Statistical Learning Theory, the so-called "VC theory", we give a brief introduction with the basic notation and the important Vapnik-Chervonenkis theorem for the Empirical Risk Minimization (ERM) principle. The extension of ERM to Vicinal Risk Minimization (VRM) will also be discussed.

3.2.1 Basic ideas of SVM

We illustrate here the basic ideas of the SVM as a classification method. The main advantage of the SVM is that it can not only be described very intuitively in the context of linear classification, but also extended in an intelligent way to the non-linear case. Let us define the training dataset consisting of pairs of "input/output" points $(x_i, y_i)$, with $1 \leq i \leq n$. Here the input vector $x_i$ belongs to some space $\mathcal{X}$ whereas the output $y_i$ belongs to $\{-1, 1\}$ in the case of binary classification. The output $y_i$ is used to identify the two possible classes.

Hard margin classification
The simplest idea of linear classification is to look at the whole set of inputs $\{x_i\} \subset \mathcal{X}$ and search for a hyperplane which separates the data into two classes based on the labels $y_i = \pm 1$. It consists of constructing a linear discriminant function of the form:
$$h(x) = w^T x + b$$
where the vector $w$ is the weight vector and $b$ is called the bias. The hyperplane is defined by the following equation:
$$\mathcal{H} = \{x : h(x) = w^T x + b = 0\}$$
This hyperplane divides the space $\mathcal{X}$ into two regions: the region where the discriminant function takes positive values and the region where it takes negative values. The hyperplane is also called the decision boundary. The classification is called linear because this boundary depends on the input in a linear way.


Figure 3.1: Geometric interpretation of the margin in a linear SVM.

We now define the notion of margin. In Figure 3.1 (reprinted from Ben-Hur et al., 2010), we give a geometric interpretation of the margin in a linear SVM. Let $x_+$ and $x_-$ be the closest points to the hyperplane from the positive side and the negative side. The circled data points are the support vectors, i.e. the points closest to the decision boundary (see Figure 3.1). The vector $w$ is the normal vector to the hyperplane; we denote its norm $\|w\| = \sqrt{w^T w}$ and its direction $\hat{w} = w/\|w\|$. We assume that $x_+$ and $x_-$ are equidistant from the decision boundary. They determine the margin by which the two classes of points of the dataset $\mathcal{D}$ are separated:
$$m_{\mathcal{D}}(h) = \frac{1}{2}\,\hat{w}^T (x_+ - x_-)$$
Geometrically, this margin is just half the distance between the two closest points from either side of the hyperplane $\mathcal{H}$, projected in the direction $\hat{w}$. We use the equations that define the relative positions of these points with respect to the hyperplane $\mathcal{H}$:
$$h(x_+) = w^T x_+ + b = a, \qquad h(x_-) = w^T x_- + b = -a$$
where $a > 0$ is some constant. As the normal vector $w$ and the bias $b$ are determined only up to a scale factor, we can simply divide them by $a$ and renormalize these equations. This is equivalent to setting $a = 1$ in the above expressions, and we finally get:
$$m_{\mathcal{D}}(h) = \frac{1}{2}\,\hat{w}^T (x_+ - x_-) = \frac{1}{\|w\|}$$


The basic idea of the maximum margin classifier is to determine the hyperplane which maximizes the margin. For a separable dataset, we can define the hard-margin SVM as the following optimization problem:
$$\min_{w,b}\; \frac{1}{2}\|w\|^2 \quad \text{u.c.}\quad y_i\left(w^T x_i + b\right) \geq 1,\; i = 1, \ldots, n \tag{3.1}$$
Here, $y_i(w^T x_i + b) \geq 1$ is just a compact way to express the relative position of the two classes of data points with respect to the hyperplane $\mathcal{H}$. In fact, we have $w^T x_i + b \geq 1$ for the class $y_i = 1$ and $w^T x_i + b \leq -1$ for the class $y_i = -1$. The historical approach to solving this quadratic program is to map the primal problem to the dual problem. We give here the main result, while the detailed derivation can be found in Appendix C.1. Via the KKT theorem, this approach gives the following optimal solution $(w^\star, b^\star)$:
$$w^\star = \sum_{i=1}^{n} \alpha_i^\star y_i x_i$$
where $\alpha^\star = (\alpha_1^\star, \ldots, \alpha_n^\star)$ is the solution of the dual optimization problem with dual variable $\alpha = (\alpha_1, \ldots, \alpha_n)$ of dimension $n$:
$$\max_{\alpha}\; \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i^T x_j \quad \text{u.c.}\quad \sum_{i=1}^{n} \alpha_i y_i = 0,\; \alpha_i \geq 0,\; i = 1, \ldots, n$$
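The primal/dual correspondence can be checked by hand on a toy dataset of two opposite points, for which the dual solution is symmetric, $\alpha_1 = \alpha_2 = a$ (which automatically satisfies the equality constraint on $\sum_i \alpha_i y_i$). The sketch below (illustrative Python of our own, not from the text) maximizes the dual objective by a crude grid search and recovers $w^\star$ and the margin $1/\|w^\star\|$:

```python
# Toy check of the primal/dual correspondence: two points x1 = (1, 0) with
# y1 = +1 and x2 = (-1, 0) with y2 = -1. By symmetry the dual solution has
# alpha_1 = alpha_2 = a, which satisfies sum_i alpha_i y_i = 0 automatically.
xs = [(1.0, 0.0), (-1.0, 0.0)]
ys = [1.0, -1.0]

def dual_objective(a):
    # W(a) = sum_i a - 0.5 * sum_ij a^2 * y_i y_j x_i^T x_j  with alpha_i = a
    total = 0.0
    for i in range(2):
        for j in range(2):
            dot = sum(xi * xj for xi, xj in zip(xs[i], xs[j]))
            total += ys[i] * ys[j] * dot
    return 2.0 * a - 0.5 * a * a * total

# crude grid search for the maximizer of the dual objective
best_a = max((k / 100.0 for k in range(201)), key=dual_objective)

# recover the primal solution w* = sum_i alpha_i* y_i x_i and the margin 1/||w*||
w = [sum(best_a * ys[i] * xs[i][d] for i in range(2)) for d in range(2)]
margin = 1.0 / (w[0] ** 2 + w[1] ** 2) ** 0.5
print(best_a, w, margin)
```

The maximum is at $a = 1/2$, giving $w^\star = (1, 0)$ and margin $1$, which is the hand-computed hard-margin solution for these two points.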

We remark that the above optimization problem is a quadratic program in the vector space $\mathbb{R}^d$ with $n$ linear inequality constraints. It may become meaningless if it has no solution (the dataset is inseparable) or too many solutions (instability of the decision boundary with respect to the data). The questions of the existence of a solution of Problem 3.5, or of the sensitivity of the solution to the dataset, are very difficult. A quantitative characterization can be found in the later discussion of the framework of Vapnik-Chervonenkis theory. We present here an intuitive view of this problem, which depends on two main factors. The first one is the dimension of the space of functions $h(x)$ which determine the decision boundary. In the linear case, it is simply determined by the dimension of the couple $(w, b)$. If the dimension of this function space is too small, as in the linear case, it is possible that no linear solution exists, i.e. the dataset cannot be separated by a simple linear classifier. The second factor is the number of data points, which enters the optimization program via the $n$ inequality constraints. If the number of constraints is too large, the solution may not exist either. In order to overcome this problem, we must increase the dimension of the optimization problem. There exist two possible ways to do this. The first one consists of relaxing the inequality constraints by introducing additional variables which tolerate deviations from the strict separation. We allow the separation to commit a certain error (some data points on the wrong side). This technique was first introduced by


Cortes C. and Vapnik V. (1995) under the name "soft margin SVM". The second one consists of using a non-linear classifier, which directly extends the function space to a higher dimension. The use of a non-linear classifier can rapidly increase the dimension of the optimization problem, which raises a computational problem. An intelligent way to get around it is to employ the notion of kernel. In the next discussions, we will try to clarify these two approaches, then finish this section by introducing two general frameworks of learning theory.

Soft margin classification
The inequality constraints described above, $y_i(w^T x_i + b) \geq 1$, ensure that all data points are well classified with respect to the optimal hyperplane. As the data may be inseparable, an intuitive way to overcome this is to relax the strict constraints by introducing additional variables $\xi_i$, $i = 1, \ldots, n$, the so-called slack variables. They allow a certain error to be committed in the classification via the new constraints:
$$y_i\left(w^T x_i + b\right) \geq 1 - \xi_i,\quad i = 1, \ldots, n \tag{3.2}$$

For $\xi_i > 1$, the data point $x_i$ is completely misclassified, whereas $0 \leq \xi_i \leq 1$ can be interpreted as a margin error. By this definition of the slack variables, $\sum_{i=1}^{n} \xi_i$ is directly related to the number of misclassified points. In order to fix our expected error in the classification problem, we introduce an additional term $C\sum_{i=1}^{n} \xi_i^p$ in the objective function and rewrite the optimization problem as follows:
$$\min_{w,b,\xi}\; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n} \xi_i^p \quad \text{u.c.}\quad y_i\left(w^T x_i + b\right) \geq 1 - \xi_i,\; \xi_i \geq 0,\; i = 1, \ldots, n \tag{3.3}$$

Here, $C$ is the parameter used to fix the desired level of error, and $p \geq 1$ is the usual way to ensure convexity of the additional term¹. The soft-margin solution of the SVM problem can be interpreted as a regularization technique, as found in different optimization problems such as regression, filtering or matrix inversion. The same result will be recovered with a regularization technique later, when we discuss the possible use of kernels. Before switching to the next discussion on non-linear classification with the kernel approach, we remark that the soft-margin SVM problem now lives in the higher dimension $d + 1 + n$. However, the computational cost does not increase. Thanks to the KKT theorem, we can turn this primal problem into a dual problem with simpler constraints. We can also work directly with the primal problem by performing a trivial optimization over $\xi$. The primal problem is then no longer a quadratic program; however, it can be solved by Newton optimization or conjugate gradient, as demonstrated in Chapelle O. (2007).

¹ It is equivalent to defining an $L_p$ norm on the slack vector $\xi \in \mathbb{R}^n$.
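The trivial optimization over $\xi$ mentioned above gives $\xi_i = \max(0, 1 - y_i(w^T x_i + b))$, the hinge loss, so the $p = 1$ soft-margin primal becomes an unconstrained objective in $(w, b)$. A minimal sketch (illustrative Python with toy data of our own):

```python
def soft_margin_objective(w, b, data, C):
    # slack xi_i at its optimum equals the hinge loss of point i
    hinge = sum(max(0.0, 1.0 - y * (sum(wk * xk for wk, xk in zip(w, x)) + b))
                for x, y in data)
    return 0.5 * sum(wk * wk for wk in w) + C * hinge

# toy data: two separable points plus one margin violator (illustrative)
data = [((1.0, 0.0), 1.0), ((-1.0, 0.0), -1.0), ((0.2, 0.0), -1.0)]
val_sep = soft_margin_objective([1.0, 0.0], 0.0, data[:2], C=1.0)  # no violations
val_all = soft_margin_objective([1.0, 0.0], 0.0, data, C=1.0)      # pays C * 1.2
print(val_sep, val_all)
```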



Non-linear classification: kernel approach
The second approach to improve the classification is to employ a non-linear SVM. In the context of SVM, the construction of a non-linear discriminant function $h(x)$ consists of two steps. We first map the data space $\mathcal{X}$ of dimension $d$ to a feature space $\mathcal{F}$ of higher dimension $N$ via a non-linear transformation $\phi : \mathcal{X} \to \mathcal{F}$; a hyperplane is then constructed in the feature space $\mathcal{F}$ as before:
$$h(x) = w^T \phi(x) + b$$
Here, the resulting vector $z = (z_1, \ldots, z_N) = \phi(x)$ is an $N$-component vector in the space $\mathcal{F}$, hence $w$ is also a vector of size $N$. The hyperplane $\mathcal{H} = \{z : w^T z + b = 0\}$ defined in $\mathcal{F}$ is no longer a linear decision boundary in the initial space $\mathcal{X}$:
$$\mathcal{B} = \{x : w^T \phi(x) + b = 0\}$$
At this stage, the generalization to the non-linear case helps us avoid the problems of overfitting or underfitting. However, a computational problem emerges due to the high dimension of the feature space. For example, a quadratic transformation leads to a feature space of dimension $N = d(d+3)/2$. The main question is how to construct the separating hyperplane in the feature space. The answer is to employ the mapping to the dual problem. In this way, our $N$-dimensional problem turns again into the following $n$-dimensional optimization problem with dual variable $\alpha$:
$$\max_{\alpha}\; \sum_{i=1}^{n} \alpha_i - \frac{1}{2}\sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j\, \phi(x_i)^T \phi(x_j) \quad \text{u.c.}\quad \alpha_i \geq 0,\; i = 1, \ldots, n$$

Indeed, the expansion of the optimal solution $w^\star$ has the following form:
$$w^\star = \sum_{i=1}^{n} \alpha_i^\star y_i \,\phi(x_i)$$

In order to solve the quadratic program, we do not need the explicit form of the non-linear map but only the kernel $K(x_i, x_j) = \phi(x_i)^T \phi(x_j)$, which is usually supposed to be symmetric. Providing only the kernel $K(x_i, x_j)$ to the optimization problem is enough to construct afterwards the hyperplane $\mathcal{H}$ in the feature space $\mathcal{F}$, or the decision boundary in the data space $\mathcal{X}$. Thanks to the expansion of the optimal $w^\star$ on the initial data $x_i$, $i = 1, \ldots, n$, the discriminant function can be computed as:
$$h(x) = \sum_{i=1}^{n} \alpha_i y_i K(x, x_i) + b$$

From this expression, we can construct the decision function used to classify a given input $x$ as $f(x) = \operatorname{sign}(h(x))$.
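The discriminant $h(x) = \sum_i \alpha_i y_i K(x, x_i) + b$ and the decision rule $f(x) = \operatorname{sign}(h(x))$ can be sketched directly. Below, an illustrative Python fragment using the linear kernel and the hand-solvable toy values $\alpha = (1/2, 1/2)$, $b = 0$ (our own example, not from the text):

```python
def k_linear(x, y):
    return sum(xi * yi for xi, yi in zip(x, y))

def h(x, alphas, ys, xs, b, kernel):
    # discriminant h(x) = sum_i alpha_i y_i K(x, x_i) + b
    return sum(a * yi * kernel(x, xi) for a, yi, xi in zip(alphas, ys, xs)) + b

# toy support vectors and multipliers (hand-solved two-point example)
xs = [(1.0, 0.0), (-1.0, 0.0)]
ys = [1.0, -1.0]
alphas = [0.5, 0.5]

def f(x):
    # decision rule f(x) = sign(h(x))
    return 1 if h(x, alphas, ys, xs, 0.0, k_linear) > 0 else -1

print(f((2.0, 3.0)), f((-0.5, 1.0)))
```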


For a given non-linear function $\phi(x)$, we can compute the kernel $K(x_i, x_j)$ via the scalar product of two vectors in the space $\mathcal{F}$. However, the converse does not hold unless the kernel satisfies the condition of Mercer's theorem (1909). Here, we list some standard kernels which are already widely used in the pattern recognition domain:

i. Polynomial kernel: $K(x, y) = \left(x^T y + 1\right)^p$

ii. Radial basis kernel: $K(x, y) = \exp\left(-\|x - y\|^2 / 2\sigma^2\right)$

iii. Neural network kernel: $K(x, y) = \tanh\left(a\, x^T y - b\right)$
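These three kernels are straightforward to code. A minimal sketch in plain Python (the parameter values $p$, $\sigma$, $a$, $b$ are illustrative defaults of our own):

```python
import math

def k_poly(x, y, p=2):
    # polynomial kernel (x^T y + 1)^p
    return (sum(xi * yi for xi, yi in zip(x, y)) + 1.0) ** p

def k_rbf(x, y, sigma=1.0):
    # radial basis kernel exp(-||x - y||^2 / 2 sigma^2)
    d2 = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-d2 / (2.0 * sigma * sigma))

def k_nn(x, y, a=1.0, b=0.0):
    # neural network kernel tanh(a x^T y - b)
    return math.tanh(a * sum(xi * yi for xi, yi in zip(x, y)) - b)

print(k_poly((1.0, 0.0), (1.0, 1.0)), k_rbf((0.0,), (0.0,)), k_nn((1.0,), (0.0,)))
```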

3.2.2 ERM and VRM frameworks

We finish the review of the SVM by discussing briefly the general framework of Statistical Learning Theory, which includes the SVM. Without entering into details such as the important theorem of Vapnik-Chervonenkis (1998), we would like to give a more general view of the SVM by answering questions such as: how can the SVM be approached as a regression? how can the soft-margin SVM be interpreted as a regularization technique?

Empirical Risk Minimization framework
The Empirical Risk Minimization framework was studied by Vapnik and Chervonenkis in the 1970s. In order to present the main idea, we first fix some notations. Let $(x_i, y_i)$, $1 \leq i \leq n$, be the training dataset of input/output pairs. The dataset is supposed to be generated i.i.d. from an unknown distribution $P(x, y)$. The dependency between the input $x$ and the output $y$ is characterized by this distribution. For example, if the input $x$ has a distribution $P(x)$ and the output is related to $x$ via a function $y = f(x)$ altered by a Gaussian noise $N(0, \sigma^2)$, then $P(x, y)$ reads:
$$P(x, y) = P(x)\, N\!\left(y - f(x);\, 0, \sigma^2\right)$$
We remark in this example that if $\sigma \to 0$, then $N(0, \sigma^2)$ tends to a Dirac distribution, which means that the relation between input and output can be exactly determined by the position of the maximum of the distribution $P(x, y)$. Estimating the function $f(x)$ is fundamental. In order to measure the estimation quality, we compute the expectation of a loss function with respect to the distribution $P(x, y)$. We define the loss function in two different contexts:

1. Classification: $l(f(x), y) = I_{f(x) \neq y}$, where $I$ is the indicator function.

2. Regression: $l(f(x), y) = (f(x) - y)^2$.

The objective of statistical learning is to determine the function $f$, in a certain function space $\mathcal{F}$, which minimizes the expected loss, or risk, functional:
$$R(f) = \int l(f(x), y)\, \mathrm{d}P(x, y)$$


As the distribution $P(x, y)$ is unknown, the expected loss cannot be evaluated. However, given the training dataset $\{x_i, y_i\}$, one can compute the empirical risk:
$$R_{\mathrm{emp}}(f) = \frac{1}{n}\sum_{i=1}^{n} l(f(x_i), y_i)$$

In the limit of a large dataset, $n \to \infty$, we expect the convergence $R_{\mathrm{emp}}(f) \to R(f)$ for every tested function $f$, thanks to the law of large numbers. However, is the learning function $f$ which minimizes $R_{\mathrm{emp}}(f)$ also the one minimizing the true risk $R(f)$? The answer to this question is no. In general, there is an infinite number of functions $f$ which can learn the training dataset perfectly, i.e. $f(x_i) = y_i$ for all $i$. In fact, we have to restrict the function space $\mathcal{F}$ in order to ensure the uniform convergence of the empirical risk to the true risk. The characterization of the complexity of a function space $\mathcal{F}$ was first studied in VC theory via the concept of VC dimension (1971) and the important VC theorem, which bounds the probability $P\{\sup_{f \in \mathcal{F}} |R(f) - R_{\mathrm{emp}}(f)| > \varepsilon\}$. A common way to restrict the function space is to impose a regularization condition. Denoting by $\Omega(f)$ a measure of regularity, the regularized problem consists of minimizing the regularized risk:
$$R_{\mathrm{reg}}(f) = R_{\mathrm{emp}}(f) + \lambda\, \Omega(f)$$
Here $\lambda$ is the regularization parameter and $\Omega(f)$ can be, for example, an $L_p$ norm on some derivative of $f$.

Vapnik and Chervonenkis theory
We are not going to discuss the VC theory of statistical learning machines in detail, but only recall the most important results concerning the characterization of the complexity of a function class. In order to quantify the trade-off between the overfitting problem and the inseparable-data problem, Vapnik and Chervonenkis introduced a very important concept, the VC dimension, together with an important theorem characterizing the convergence of the empirical risk.
First, the VC dimension is introduced to measure the complexity of a class of functions $\mathcal{F}$.

Definition 3.2.1 The VC dimension of a class of functions $\mathcal{F}$ is defined as the maximum number of points that can be exactly learned by a function of $\mathcal{F}$:
$$h = \max\left\{|X| : X \subset \mathcal{X}\ \text{such that}\ \forall b \in \{-1, 1\}^{|X|},\ \exists f \in \mathcal{F},\ \forall x_i \in X,\ f(x_i) = b_i\right\} \tag{3.4}$$

With the definition of the VC dimension in hand, we now present the VC theorems, which are very powerful tools controlling the upper limit of the convergence of the empirical risk to the true risk functional. These theorems give a clear idea of the bound in terms of the available information and the number of observations $n$ in the training set. Using them, we can control the trade-off between overfitting and underfitting. The relation between the factors, or coordinates, of the vector $x$ and the VC dimension is given in the following theorem:


Theorem 3.2.2 (VC theorem of hyperplanes) Let $\mathcal{F}$ be the set of hyperplanes in $\mathbb{R}^d$:
$$\mathcal{F} = \left\{x \mapsto \operatorname{sign}\left(w^T x + b\right),\ w \in \mathbb{R}^d,\ b \in \mathbb{R}\right\}$$
Then the VC dimension is $d + 1$.

This theorem gives the explicit relation between the VC dimension and the number of factors, i.e. the number of coordinates of the input vectors of the training set. It can be used with the next theorem in order to evaluate the information necessary for a good classification or regression.

Theorem 3.2.3 (Vapnik and Chervonenkis) Let $\mathcal{F}$ be a class of functions of VC dimension $h$. Then, for any distribution $\Pr$ and for any sample $\{(x_i, y_i)\}_{i=1,\ldots,n}$ drawn from this distribution, the following inequality holds:
$$\Pr\left\{\sup_{f \in \mathcal{F}} |R(f) - R_{\mathrm{emp}}(f)| > \varepsilon\right\} < 4\exp\left\{h\left(1 + \ln\frac{2n}{h}\right) - n\left(\varepsilon - \frac{1}{n}\right)^2\right\}$$

An important corollary of the VC theorem is the following upper bound on the deviation of the empirical risk from the true risk:

Corollary 3.2.4 Under the hypotheses of the VC theorem, the following inequality holds with probability $1 - \eta$:
$$\forall f \in \mathcal{F},\quad R(f) - R_{\mathrm{emp}}(f) \leq \sqrt{\frac{h\left(\ln\frac{2n}{h} + 1\right) - \ln\frac{\eta}{4}}{n}} + \frac{1}{n}$$

We skip the proofs of these theorems and postpone the discussion of their practical importance to Section 6, as the overfitting and underfitting problems are very present in financial applications.

Vicinal Risk Minimization framework
The Vicinal Risk Minimization (VRM) framework was formally developed in the work of Chapelle et al. (2000s). In the ERM framework, the risk is evaluated using the empirical probability distribution:
$$\mathrm{d}P_{\mathrm{emp}}(x, y) = \frac{1}{n}\sum_{i=1}^{n} \delta_{x_i}(x)\,\delta_{y_i}(y)$$
where $\delta_{x_i}(x)$ and $\delta_{y_i}(y)$ are Dirac distributions located at $x_i$ and $y_i$ respectively. In the VRM framework, the Dirac distribution in $x$ is replaced by an estimated density in the vicinity of $x_i$:
$$\mathrm{d}P_{\mathrm{vic}}(x, y) = \frac{1}{n}\sum_{i=1}^{n} \mathrm{d}P_{x_i}(x)\,\delta_{y_i}(y)$$



Hence, the vicinal risk is defined as:
$$R_{\mathrm{vic}}(f) = \int l(f(x), y)\, \mathrm{d}P_{\mathrm{vic}}(x, y) = \frac{1}{n}\sum_{i=1}^{n} \int l(f(x), y_i)\, \mathrm{d}P_{x_i}(x)$$
In order to illustrate the difference between the ERM and VRM frameworks, let us consider the following example of linear regression. In this case, the loss function is $l(f(x), y) = (f(x) - y)^2$, where the learning function is of the form $f(x) = w^T x + b$. Assuming that the vicinal probability density $\mathrm{d}P_{x_i}(x)$ is approximated by a white noise of variance $\sigma^2$, the vicinal risk is calculated as follows:
$$R_{\mathrm{vic}}(f) = \frac{1}{n}\sum_{i=1}^{n} \int (f(x) - y_i)^2\, \mathrm{d}P_{x_i}(x) = \frac{1}{n}\sum_{i=1}^{n} \int (f(x_i + \varepsilon) - y_i)^2\, \mathrm{d}N(0, \sigma^2) = \frac{1}{n}\sum_{i=1}^{n} (f(x_i) - y_i)^2 + \sigma^2\|w\|^2$$
It is equivalent to the regularized risk minimization problem:
$$R_{\mathrm{vic}}(f) = R_{\mathrm{emp}}(f) + \sigma^2\|w\|^2$$
with parameter $\sigma^2$ and an $L_2$ penalty constraint.
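The identity $R_{\mathrm{vic}}(f) = R_{\mathrm{emp}}(f) + \sigma^2\|w\|^2$ can be verified by Monte Carlo. The sketch below (illustrative Python with a toy linear model and dataset of our own) averages the squared loss over Gaussian vicinities of each $x_i$ and compares it with the closed form:

```python
import random

random.seed(1)

# Toy linear model f(x) = w^T x + b and a small dataset (illustrative values)
w, b = [2.0, -1.0], 0.5
data = [([1.0, 0.0], 1.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 2.0)]
sigma = 0.3  # width of the Gaussian vicinity dP_{x_i}

def f(x):
    return sum(wk * xk for wk, xk in zip(w, x)) + b

# Empirical risk (ERM) with the squared loss
r_emp = sum((f(x) - y) ** 2 for x, y in data) / len(data)

# Vicinal risk: Monte Carlo average of the loss over vicinities of the x_i
n_mc = 100000
acc = 0.0
for x, y in data:
    for _ in range(n_mc):
        xp = [xk + random.gauss(0.0, sigma) for xk in x]
        acc += (f(xp) - y) ** 2
r_vic = acc / (len(data) * n_mc)

# Closed form derived above: R_vic = R_emp + sigma^2 ||w||^2
r_theory = r_emp + sigma ** 2 * sum(wk * wk for wk in w)
print(r_emp, r_vic, r_theory)
```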

3.3 Numerical implementations

In this section, we discuss explicitly the two possible ways to implement the SVM algorithm. As discussed above, the kernel approach can be applied directly to the dual problem and leads to the simple form of a quadratic program. We discuss the dual approach first, for historical reasons. The direct implementation of the primal problem is a little more delicate, which is why it was implemented only later, by Chapelle O. (2007), using the Newton optimization method and the conjugate gradient method. According to Chapelle, in terms of complexity both approaches offer more or less the same efficiency, while in some contexts the latter gives some advantage in the precision of the solution.

3.3.1 Dual approach

We discuss here in more detail the two main applications of the SVM, namely the classification problem and the regression problem, within the dual approach. The reason for the historical choice of this approach is simply that it offers the possibility of obtaining a standard quadratic program whose numerical implementation is well established. Here, we summarize the results presented in Cortes C. and Vapnik V. (1995), where the notion of soft-margin SVM was introduced. We next discuss the extension to regression.


Classification problem
As introduced in the last section, classification encounters two main problems: overfitting and underfitting. If the dimension of the function space is too large, the result will be very sensitive to the input: a small change in the data can cause an instability in the final result. The second problem concerns non-separable data, in the sense that the function space is too small, so that no solution minimizing the risk function can be obtained. In both cases, a regularization scheme is necessary to make the problem well-posed. In the first case, one should restrict the function space by imposing some condition and working with a specific function class (the linear case for example). In the latter case, one needs to extend the function space by introducing some tolerable error (the soft-margin approach) or by working with a non-linear transformation.

a) Linear SVM with the soft-margin approach
In their work, Cortes C. and Vapnik V. (1995) first introduced the notion of soft margin by accepting that there will be some error in the classification. They characterize this error by additional variables $\xi_i$ associated with each data point $x_i$. These parameters enter the classification via the constraints. For a given hyperplane, the constraint $y_i(w^T x_i + b) \geq 1$ means that the point $x_i$ is well classified and lies outside the margin. When we change this condition to $y_i(w^T x_i + b) \geq 1 - \xi_i$ with $\xi_i \geq 0$, $i = 1, \ldots, n$, it first allows the point $x_i$ to be well classified but inside the margin, for $0 \leq \xi_i < 1$. For a value $\xi_i > 1$, the input $x_i$ may be misclassified. As written above, the primal problem becomes an optimization over both the margin and the total committed error:
$$\min_{w,b,\xi}\; \frac{1}{2}\|w\|^2 + C\,F\!\left(\sum_{i=1}^{n} \xi_i^p\right) \quad \text{u.c.}\quad y_i\left(w^T x_i + b\right) \geq 1 - \xi_i,\; \xi_i \geq 0,\; i = 1, \ldots, n$$
Here, $p$ is the degree of regularization. We remark that only for the choice $p \geq 1$ can the soft margin have a unique solution.
The function F(u) is usually chosen as a convex function with F(0) = 0, for example F(u) = u^k. In the following we consider two specific cases: (i) the hard-margin limit with C = 0; (ii) the L1 penalty with F(u) = u, p = 1. We define the dual vector Λ = (α_1, . . . , α_n) and the output vector y = (y_1, . . . , y_n). In order to write the optimization problem in vector form, we define as well the operator D = (D_ij)_{n×n} with D_ij = y_i y_j x_i^T x_j.

i. Hard-margin limit with C = 0. As shown in Appendix C.1.1, this problem can be mapped to the following dual problem:

max_Λ  Λ^T 1 − (1/2) Λ^T D Λ    (3.5)
u.c.  Λ^T y = 0,  Λ ≥ 0

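On a two-point training set, the dual (3.5) can be solved by hand, which gives a quick sanity check: with y_1 = +1, y_2 = −1, the constraint Λ^T y = 0 forces α_1 = α_2 = α, and the objective 2α − α²‖x_1 − x_2‖²/2 is maximized at α = 2/‖x_1 − x_2‖². A minimal numerical check (the two points are arbitrary):

```python
import numpy as np

x1, x2 = np.array([1.0, 1.0]), np.array([3.0, 2.0])   # labels y1 = +1, y2 = -1
alpha = 2.0 / np.dot(x1 - x2, x1 - x2)                # closed-form dual solution
w = alpha * x1 - alpha * x2                           # w = sum_i alpha_i y_i x_i
b = 1.0 - w @ x1                                      # support-vector condition y1 (w^T x1 + b) = 1
```

Both points sit exactly on the margin, i.e. w^T x_1 + b = +1 and w^T x_2 + b = −1, as the hard-margin KKT conditions require.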
ii. L1 penalty with F(u) = u, p = 1. In this case the associated dual problem is given by:

max_Λ  Λ^T 1 − (1/2) Λ^T D Λ    (3.6)
u.c.  Λ^T y = 0,  0 ≤ Λ ≤ C1

The full derivation is given in Appendix C.1.2.

Remark 2 For the case with L2 penalty (F(u) = u, p = 2), we will demonstrate in the next discussion that it is a special case of the kernel approach for the hard-margin case. Hence, the dual problem is written exactly as in the hard-margin case, with an additional regularization term 1/(2C) added to the matrix D:

max_Λ  Λ^T 1 − (1/2) Λ^T ( D + (1/2C) I ) Λ    (3.7)
u.c.  Λ^T y = 0,  Λ ≥ 0

b) Non-linear SVM with kernel approach The second possibility to extend the function space is to employ a non-linear transformation φ(x) from the initial space X to the feature space F, and then construct the hard-margin problem there. This approach leads to the same dual problem, with an explicit kernel K(x_i, x_j) = φ(x_i)^T φ(x_j) in place of x_i^T x_j. In this case, the operator D is the matrix D = (D_ij)_{n×n} with elements:

D_ij = y_i y_j K(x_i, x_j)

With this convention, the first two quadratic programs above can be rewritten in the context of non-linear classification by replacing the operator D by this new definition with the kernel. We finally remark that the soft-margin SVM with quadratic penalty (F(u) = u, p = 2) can also be seen as a hard-margin SVM with a modified kernel. We introduce a new transformation φ̃(x_i) = ( φ(x_i), 0, . . . , y_i/√(2C), . . . , 0 ), where the element y_i/√(2C) is at position i + dim(φ(x_i)), and a new vector w̃ = ( w, ξ_1 √(2C), . . . , ξ_n √(2C) ). In the new representation, the objective function ‖w‖²/2 + C Σ_{i=1}^n ξ_i² becomes simply ‖w̃‖²/2, whereas the inequality constraint y_i (w^T φ(x_i) + b) ≥ 1 − ξ_i becomes y_i (w̃^T φ̃(x_i) + b) ≥ 1. Hence, we obtain a hard-margin SVM with a modified kernel which can be computed simply as:

K̃(x_i, x_j) = φ̃(x_i)^T φ̃(x_j) = K(x_i, x_j) + δ_ij/(2C)

This kernel is consistent with the QP program in the last remark.
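In matrix terms the modification only shifts the diagonal of the Gram matrix by 1/(2C). A small sketch (the Gaussian kernel and the γ value are illustrative choices):

```python
import numpy as np

def gaussian_gram(X, gamma=0.5):
    # K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2)
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def modified_gram(X, C, gamma=0.5):
    # K~(x_i, x_j) = K(x_i, x_j) + delta_ij / (2C): L2 soft margin as hard margin
    K = gaussian_gram(X, gamma)
    return K + np.eye(len(X)) / (2.0 * C)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
Kt = modified_gram(X, C=5.0)
```

Only the diagonal changes (here from 1 to 1 + 1/(2C) = 1.1); off-diagonal kernel values are untouched.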


In summary, the linear SVM is nothing else than a special case of the non-linear SVM within the kernel approach. In the following, we therefore study the SVM problem only for the two cases with hard and soft margin within the kernel approach. After obtaining the optimal vector Λ* by solving the associated QP program described above, we can compute b* from the KKT conditions and then derive the decision function f(x). We remind that w* = Σ_{i=1}^n α_i* y_i φ(x_i).

i. For the hard-margin case, the KKT condition given in Appendix C.1.1 is:

α_i* [ y_i (w*^T φ(x_i) + b*) − 1 ] = 0

We notice that for α_i > 0, the inequality constraint becomes an equality. These points are the closest points to the optimal frontier and they are called support vectors. Hence, b* can be computed easily for a given support vector (x_i, y_i) as follows:

b* = y_i − w*^T φ(x_i)

In order to enhance the precision of b*, we evaluate this value as the average over the set SV of support vectors:

b* = (1/n_SV) Σ_{i∈SV} ( y_i − Σ_{j∈SV} α_j* y_j φ(x_j)^T φ(x_i) )
   = (1/n_SV) Σ_{i∈SV} ( y_i − Σ_{j∈SV} α_j* y_j K(x_i, x_j) )

ii. For the soft-margin case, the KKT condition given in Appendix C.1.2 is slightly different:

α_i* [ y_i (w*^T φ(x_i) + b*) − 1 + ξ_i ] = 0

However, if α_i satisfies the condition 0 < α_i < C, then we can show that ξ_i = 0. This condition defines the subset of training points (support vectors) which are closest to the separation frontier. Hence, b* can be computed by exactly the same expression as in the hard-margin case.

From the optimal triple (Λ*, w*, b*), we can construct the decision function which is used to classify a given input x as follows:

f(x) = sign( Σ_{i=1}^n α_i* y_i K(x, x_i) + b* )    (3.8)
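The averaging of b* over support vectors is easy to code once the dual weights are known. A minimal sketch, reusing the two-point hard-margin example where the dual is solvable by hand (α_1 = α_2 = 2/‖x_1 − x_2‖² = 0.4 and both points are support vectors):

```python
import numpy as np

def bias_from_support_vectors(alpha, y, K, tol=1e-8):
    # b* = (1/n_SV) sum_{i in SV} ( y_i - sum_j alpha_j y_j K(x_j, x_i) )
    sv = np.flatnonzero(alpha > tol)
    return np.mean([y[i] - np.sum(alpha * y * K[:, i]) for i in sv])

X = np.array([[1.0, 1.0], [3.0, 2.0]])
y = np.array([1.0, -1.0])
K = X @ X.T                       # linear kernel Gram matrix
alpha = np.array([0.4, 0.4])      # hard-margin dual solution for this pair
b_star = bias_from_support_vectors(alpha, y, K)
```

For this configuration both support vectors give the same value b* = 2.2, so the average is exact; on numerically solved duals the averaging reduces the rounding error of any single support vector.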

Regression problem In the last sections, we have discussed the SVM problem only in the classification context. In this section, we show how the regression problem can be interpreted as an SVM problem. As discussed in the general frameworks of statistical learning (ERM or VRM), the SVM problem consists of minimizing the risk function R_emp or R_vic. The risk function can be computed via the loss function l(f(x), y), which defines our objective (classification or regression). Explicitly, the risk function is calculated as:

R(f) = ∫ l(f(x), y) dP(x, y)

where the distribution dP(x, y) can be computed in the ERM framework or in the VRM framework. For the classification problem, the loss function is defined as l(f(x), y) = I_{f(x)≠y}, which means that we count an error whenever the given point is misclassified. The minimization of the risk function for classification can then be mapped to the maximization of the margin 1/‖w‖. For the regression problem, the loss function is l(f(x), y) = (f(x) − y)², which means that we count the loss as the regression error.

Remark 3 We have chosen here the least-square error as the loss just for illustration. In general, it can be replaced by any positive function F of f(x) − y, which gives the loss function in general form l(f(x), y) = F(f(x) − y). We remark that the least-square case corresponds to the L2 norm, so the simplest generalization is the Lp norm l(f(x), y) = |f(x) − y|^p. We show later that the special case L1 brings the regression problem to a form similar to the soft-margin classification.

In the last discussion on classification, we concluded that the linear SVM problem is just a special case of the non-linear SVM within the kernel approach. Hence, we work here directly with the non-linear case, where the training vector x is already transformed by a non-linear map φ(x). Therefore, the approximate function of the regression reads f(x) = w^T φ(x) + b. In the ERM framework, the risk function is estimated simply as the empirical sum over the dataset:

R_emp = (1/n) Σ_{i=1}^n (f(x_i) − y_i)²

whereas in the VRM framework, if we assume that dP(x, y) is a Gaussian noise of variance σ², then the risk function reads:

R_vic = (1/n) Σ_{i=1}^n (f(x_i) − y_i)² + σ² ‖w‖²

The risk function in the VRM framework can be interpreted as a regularized form of the risk function in the ERM framework. We rewrite the risk function after renormalizing it by the factor 2σ²:

R_vic = (1/2)‖w‖² + C Σ_{i=1}^n ξ_i²

with C = 1/(2σ²n). Here, we have introduced new variables ξ = (ξ_i)_{i=1...n} which satisfy y_i = f(x_i) + ξ_i = w^T φ(x_i) + b + ξ_i. The regression problem can now be written as a QP program with equality constraints as follows:

min_{w,b,ξ}  (1/2)‖w‖² + C Σ_{i=1}^n ξ_i²
u.c.  y_i = w^T φ(x_i) + b + ξ_i,  i = 1...n

In the present form, the regression looks very similar to the SVM problem for classification. We notice that the regression problem in the context of SVM can be easily generalized in two possible ways:

• The first way is to introduce a more general loss function F(f(x_i) − y_i) instead of the least-square loss function. This generalization can lead to other types of regression, such as the ε-SV regression proposed by Vapnik (1998).

• The second way is to introduce a weight distribution ω_i for the empirical distribution instead of the uniform distribution:

dP_emp(x, y) = Σ_{i=1}^n ω_i δ_{x_i}(x) δ_{y_i}(y)

As financial quantities depend more on the recent past, an asymmetric weight distribution in favor of recent data should improve the estimator. The idea of this generalization is quite similar to the exponential moving average. By doing this, we recover the results obtained in Gestel T.V. et al. (2001) and in Tay F.E.H. and Cao L.J. (2002) for the LS-SVM formalism. For example, we can choose the weight distribution as proposed in Tay F.E.H. and Cao L.J. (2002): ω_i = 2i/(n(n + 1)) (linear distribution) or ω_i = 1/(1 + exp(a − 2ai/n)) (exponential weight distribution).

Our least-square regression problem can again be mapped to a dual problem after introducing the Lagrangian. Detailed calculations are given in Appendix C.1. We give here the principal result, which invokes again the kernel K_ij = K(x_i, x_j) = φ(x_i)^T φ(x_j) for treating the non-linearity. As in the classification case, we consider only two problems, which are similar to the hard margin and the soft margin in the context of regression.

i. Least-square SVM regression: In fact, the regression problem discussed above is similar to the hard-margin problem. Here, we have to keep the regularization parameter C, as it defines a tolerance error for the regression. This problem with the L2 constraint is equivalent to a hard margin with a modified kernel. The quadratic optimization program is given as follows:

max_Λ  Λ^T y − (1/2) Λ^T ( K + (1/2C) I ) Λ    (3.9)
u.c.  Λ^T 1 = 0
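A convenient property of the least-square SVM regression is that the KKT conditions of problem (3.9) form a linear system: stationarity gives y − (K + I/2C)Λ − b1 = 0 together with 1^T Λ = 0, so (b, Λ) can be obtained by a single linear solve. This is only a minimal sketch; the Gaussian kernel, the values of C and γ, and the sine test signal are illustrative choices.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=5.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def lssvm_fit(X, y, C=100.0, gamma=5.0):
    """Solve the bordered KKT system of the LS-SVM regression dual (3.9)."""
    n = len(y)
    M = np.zeros((n + 1, n + 1))
    M[0, 1:] = 1.0                                   # constraint  1^T Lambda = 0
    M[1:, 0] = 1.0                                   # bias column
    M[1:, 1:] = gaussian_kernel(X, X, gamma) + np.eye(n) / (2.0 * C)
    sol = np.linalg.solve(M, np.concatenate([[0.0], y]))
    return sol[1:], sol[0]                           # (Lambda, b)

def lssvm_predict(X_train, lam, b, X_new, gamma=5.0):
    return gaussian_kernel(X_new, X_train, gamma) @ lam + b

t = np.linspace(0.0, 1.0, 20)[:, None]
y = np.sin(2 * np.pi * t).ravel()
lam, b = lssvm_fit(t, y)
fit = lssvm_predict(t, lam, b, t)
```

The solve enforces Σ Λ_i = 0 exactly (up to floating-point error), and with a moderate ridge 1/(2C) the fitted curve stays close to the smooth target on the training points.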


ii. ε-SVM regression The ε-SVM regression problem was introduced by Vapnik (1998) in order to have a formalism similar to the soft-margin SVM. He proposed to employ the loss function in the following form:

l(f(x), y) = ( |y − f(x)| − ε ) I_{|y−f(x)|≥ε}

The ε-SVM loss function is just a generalization of the L1 error. Here, ε is an additional tolerance parameter which allows us not to count regression errors smaller than ε. Inserting this loss function into the expression of the risk function, we obtain the objective of the optimization problem:

R_vic = (1/2)‖w‖² + C Σ_{i=1}^n ( |f(x_i) − y_i| − ε ) I_{|y_i−f(x_i)|≥ε}

Because the two sets {y_i − f(x_i) ≥ ε} and {y_i − f(x_i) ≤ −ε} are disjoint, we can break the function I_{|y_i−f(x_i)|≥ε} into two terms:

I_{|y_i−f(x_i)|≥ε} = I_{y_i−f(x_i)−ε≥0} + I_{f(x_i)−y_i−ε≥0}

We introduce the slack variables ξ and ξ′ as in the last case, which satisfy the conditions ξ_i ≥ y_i − f(x_i) − ε and ξ′_i ≥ f(x_i) − y_i − ε. Hence, we obtain the following optimization problem:

min_{w,b,ξ,ξ′}  (1/2)‖w‖² + C Σ_{i=1}^n ( ξ_i + ξ′_i )
u.c.  w^T φ(x_i) + b − y_i ≤ ε + ξ_i,  ξ_i ≥ 0,  i = 1...n
      y_i − w^T φ(x_i) − b ≤ ε + ξ′_i,  ξ′_i ≥ 0,  i = 1...n
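The ε-insensitive loss itself is one line of code; the sample values below are illustrative:

```python
import numpy as np

def eps_insensitive_loss(y, f, eps=0.1):
    # l(f(x), y) = (|y - f(x)| - eps) * I{|y - f(x)| >= eps}
    return np.maximum(np.abs(y - f) - eps, 0.0)

y = np.array([1.0, 2.0, 3.0])
f = np.array([1.05, 2.5, 2.0])
loss = eps_insensitive_loss(y, f, eps=0.1)
```

Residuals inside the tube of half-width ε cost nothing (the first point), while larger residuals are charged linearly beyond ε.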

Remark 4 We remark that our approach gives exactly the same result as the traditional approach discussed in the work of Vapnik (1998), in which the objective function is constructed by minimizing the margin with additional terms defining the regression error, these terms being controlled by the pair of slack variables. The dual problem in this case can be obtained by performing the same calculation as for the soft-margin SVM:

max_{Λ,Λ′}  (Λ − Λ′)^T y − ε (Λ + Λ′)^T 1 − (1/2) (Λ − Λ′)^T K (Λ − Λ′)    (3.10)
u.c.  (Λ − Λ′)^T 1 = 0,  0 ≤ Λ, Λ′ ≤ C1

For the particular case ε = 0, we obtain:

max_Λ  Λ^T y − (1/2) Λ^T K Λ
u.c.  Λ^T 1 = 0,  |Λ| ≤ C1


After the optimization procedure using the QP program, we obtain the optimal vector Λ* and then compute b* from the KKT condition w^T φ(x_i) + b − y_i = 0 for the support vectors (x_i, y_i) (see Appendix C.1.3 for more detail). In order to have good accuracy in the estimation of b*, we average over the set SV of support vectors and obtain:

b* = (1/n_SV) Σ_{i=1}^{n_SV} ( y_i − Σ_{j=1}^n α_j* K(x_i, x_j) )

The SVM regressor is then given by the following formula:

f(x) = Σ_{i=1}^n α_i* K(x, x_i) + b*

3.3.2 Primal approach

We now discuss the possibility of a direct implementation of the primal problem. This problem has been proposed and studied by Chapelle O. (2007). In this work, the author argued that both primal and dual implementations give the same complexity, of the order O( max(n, d) min(n, d)² ). Indeed, according to the author, the primal problem might give a more accurate solution, as it treats directly the quantity that one is interested in. This can be easily understood via the special case of an LS-SVM linear estimator, where both primal and dual problems can be solved analytically. The main idea of the primal implementation is to rewrite the constrained optimization problem as an unconstrained problem by performing a trivial minimization over the slack variables ξ. We then obtain:

min_{w,b}  (1/2)‖w‖² + C Σ_{i=1}^n L( y_i, w^T φ(x_i) + b )    (3.11)

Here, we have L(y, t) = (y − t)^p for the regression problem, whereas L(y, t) = max(0, 1 − yt)^p for the classification problem. In the case of a quadratic loss or L2 penalty, the function L(y, t) is differentiable with respect to the second variable, hence one can obtain the zero-gradient equation. In the case where L(y, t) is not differentiable, such as L(y, t) = max(0, 1 − yt), we have to approximate it by a regular function. Assuming that L(y, t) is differentiable with respect to t, we obtain:

w + C Σ_{i=1}^n (∂L/∂t)( y_i, w^T φ(x_i) + b ) φ(x_i) = 0

which leads to the following representation of the solution w:

w = Σ_{i=1}^n β_i φ(x_i)


By introducing the kernel K_ij = K(x_i, x_j) = φ(x_i)^T φ(x_j), we rewrite the primal problem as follows:

min_{β,b}  (1/2) β^T K β + C Σ_{i=1}^n L( y_i, K_i^T β + b )    (3.12)

where K_i is the i-th column of the matrix K. We note that this is now an unconstrained optimization problem, which can be solved by gradient descent whenever L(y, t) is differentiable. In Appendix C.1, we present a detailed derivation of the primal implementation for the cases of quadratic loss and soft-margin classification.
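For the quadratic loss L(y, t) = (y − t)², problem (3.12) is differentiable, with gradients ∇_β = Kβ + 2C·K(Kβ + b1 − y) and ∂/∂b = 2C·1^T(Kβ + b1 − y), so plain gradient descent applies. A minimal sketch (the learning rate, iteration count and toy data are illustrative; we only check that the objective decreases):

```python
import numpy as np

def primal_descent(K, y, C=1.0, lr=1e-3, iters=2000):
    """Gradient descent on 1/2 b^T K b + C ||K b + bias - y||^2 (quadratic loss)."""
    n = len(y)
    beta, bias = np.zeros(n), 0.0

    def objective(beta, bias):
        r = K @ beta + bias - y
        return 0.5 * beta @ K @ beta + C * (r @ r)

    start = objective(beta, bias)
    for _ in range(iters):
        r = K @ beta + bias - y
        beta -= lr * (K @ beta + 2 * C * K @ r)
        bias -= lr * (2 * C * r.sum())
    return beta, bias, start, objective(beta, bias)

t = np.linspace(0.0, 1.0, 20)[:, None]
K = np.exp(-5.0 * (t - t.T) ** 2)            # Gaussian kernel Gram matrix
y = np.sin(2 * np.pi * t).ravel()
beta, bias, obj0, obj1 = primal_descent(K, y)
```

With a learning rate below 2/λ_max of the Hessian, the convex objective decreases monotonically; Chapelle's point is that one can also use Newton steps here for faster convergence.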

3.3.3 Model selection - Cross validation procedure

Enlarging or restricting the function space gives us the possibility to obtain a solution for the SVM problem. However, the choice of the additional parameters, such as the error tolerance C in the soft-margin SVM or the kernel parameter in the extension to the non-linear case, is fundamental. How can we choose these parameters for a given data set? In this section, we discuss the calibration procedure, the so-called "model selection", which aims to determine the ensemble of parameters for the SVM. This discussion is essentially based on the results presented in O. Chapelle's thesis (2002). In order to define the calibration procedure, let us first define the test function which is used to evaluate the SVM problem. When we have a lot of data, we can follow the traditional cross-validation procedure by dividing the total data into two independent sets: the training set and the validation set. The training set {x_i, y_i}_{1≤i≤n} is used for the optimization problem, whereas the validation set {x′_i, y′_i}_{1≤i≤m} is used to evaluate the error via the following test function:

T = (1/m) Σ_{i=1}^m ψ( −y′_i f(x′_i) )

where ψ(x) = I_{x>0}, with I_A the standard notation for the indicator function. When we do not have enough data for the SVM problem, we can employ the training set directly to evaluate the error via the "leave-one-out" error. Let f⁰ be the classifier obtained on the full training set and f^p be the one with the point (x_p, y_p) left out. The error is defined by testing the decision rule f^p on the missing point (x_p, y_p):

T = (1/n) Σ_{p=1}^n ψ( −y_p f^p(x_p) )

We focus here on the first test error function, with an available validation data set. However, the error function requires the step function ψ, which is discontinuous and can cause some difficulty if we want to determine the best selection parameters via the optimal test error. In order to search for the minimal test error by gradient descent, for example, we should smooth the test error by regularizing the step function:

ψ̃(x) = 1 / ( 1 + exp(−Ax + B) )

The choice of the parameters A and B is important. If A is too small, the approximation error is too large, whereas if A is too large, the test error is not smooth enough for the minimization procedure.
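A quick numerical check of this smoothing: applied to the margins y_i f(x_i) of a validation set, the smoothed error built from ψ̃ converges to the exact misclassification rate as A grows (the sample margins below are illustrative):

```python
import numpy as np

def psi_smooth(x, A=10.0, B=0.0):
    # smoothed step function: psi~(x) = 1 / (1 + exp(-A x + B))
    return 1.0 / (1.0 + np.exp(-A * x + B))

margins = np.array([2.1, 0.7, -0.3, 1.5, -1.2, 0.5])   # y_i f(x_i) on a validation set
exact = np.mean(margins < 0)                           # test error with the step function
smooth = np.mean(psi_smooth(-margins, A=50.0))         # smoothed test error
```

With two of six margins negative, the exact error is 1/3 and the smoothed error is within a tiny tolerance of it at A = 50; smaller A trades accuracy for a smoother objective.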

3.4 Extension to SVM multi-classification

The single SVM classification (binary classification) discussed in the last section is very well established and has become a standard method for various applications. However, the extension to the multiclassification problem is not straightforward, and it remains a very active research topic in the pattern-recognition domain. In this section, we give a quick overview of this progressing field and some practical implementations.

3.4.1 Basic idea of multi-classification

The multiclass SVM can be formulated as follows. Let (x_i, y_i)_{i=1...n} be the training set of data with characteristics x ∈ R^d under the classification criterion y. For example, the training data belong to m different classes labeled from 1 to m, which means that y ∈ {1, . . . , m}. Our task is to determine a classification rule F : R^d → {1, . . . , m} based on the training set, which aims to predict the class of a test data point x_t by evaluating the decision rule F(x_t). Recently, many important contributions have advanced the field in both accuracy and complexity (i.e. reduction of computation time). The extensions have been developed along two main directions. The first one consists of dividing the multiclassification problem into many binary classification problems by using the "one-against-all" or "one-against-one" strategy; the next step is then to construct the decision function in the recognition phase. The implementation of the decision for the "one-against-all" strategy is based on the maximum output among all binary SVMs; the outputs are usually mapped into probability estimates, as proposed by different authors such as Platt (1999). For the "one-against-one" strategy, the Max Wins algorithm is adopted in order to take the right decision: the resulting class is the one voted for by the majority of binary classifiers. Both techniques encounter the limitations of complexity and high computation-time cost. Another improvement in the same direction, the binary decision tree (SVM-BDT), was recently proposed by Madzarov G. et al. (2009); this technique has proved able to speed up the computation. The second direction consists of generalizing the kernel concept in the SVM algorithm into a more general form. This method treats the multiclassification problem directly by writing a general form of the large-margin problem, which is again mapped into a dual problem incorporating the kernel concept.


Crammer K. and Singer Y. (2001) introduced an efficient algorithm which decomposes the dual problem into multiple optimization problems that can then be solved by a fixed-point algorithm.

3.4.2 Implementations of multiclass SVM

We describe here the two principal implementations of SVM for the multiclassification problem. The first one concerns a direct application of the binary SVM classifier; however, the recognition phase requires a careful choice of decision strategy. We next describe and implement the multiclass kernel-based SVM algorithm, which is a more elegant approach.

Remark 5 Before discussing the details of the two implementations, we remark that there exist other implementations of SVM, such as the application of Nonnegative Matrix Factorization (Poluru V. K. et al., 2009) to the binary case by rewriting the SVM problem in the NMF framework. Extending this application to the multiclassification case would be an interesting topic for future work.

Decomposition into multiple binary SVM The two most popular extensions of the single SVM classifier to a multiclass SVM classifier use the one-against-all and one-against-one strategies. Recently, another technique utilizing a binary decision tree has required less effort in training the data and is much faster in the recognition phase, with a complexity of order O(log2 N). All these techniques employ the above SVM implementation directly.

a) One-against-all strategy: In this case, we construct m single SVM classifiers in order to separate the training data of each class from the rest of the classes. Let us consider the construction of the classifier separating class k from the rest. We start by attributing the response z_i = 1 if y_i = k and z_i = −1 for all y_i ∈ {1, . . . , m} \ {k}. Applying this construction for all classes, we finally obtain the m classifiers f_1(x), . . . , f_m(x). For a testing data point x, the decision rule is obtained by the maximum of the outputs given by these m classifiers:

y = argmax_{k∈{1...m}} f_k(x)

In order to avoid the error coming from the fact that we compare outputs corresponding to different classifiers, we can map the output of each SVM into the same form of probability, as proposed by Platt (1999):

P̂r( ω_k | f_k(x) ) = 1 / ( 1 + exp( A_k f_k(x) + B_k ) )

where ω_k is the label of the k-th class. This quantity can be interpreted as a measure of the accepting probability of the classifier ω_k for a given point x with output f_k(x). However, nothing guarantees that Σ_{k=1}^m P̂r( ω_k | f_k(x) ) = 1, hence we have to renormalize this probability:

P̂r( ω_k | x ) = P̂r( ω_k | f_k(x) ) / Σ_{j=1}^m P̂r( ω_j | f_j(x) )

In order to obtain these probabilities, we have to calibrate the parameters (A_k, B_k). This can be done by performing maximum likelihood on the training set (Platt (1999)).

b) One-against-one strategy: Another way to employ the binary SVM classifier is to construct N_c = m(m − 1)/2 binary classifiers which separate all pairs of classes (ω_i, ω_j). We denote the ensemble of classifiers C = {f_1, . . . , f_{N_c}}. In the recognition phase, we evaluate all possible outputs f_1(x), . . . , f_{N_c}(x) over C for a given point x. These outputs can be mapped to the response function of each classifier, sign f_k(x), which determines to which class the point x belongs with respect to the classifier f_k. We denote by N_1, . . . , N_m the numbers of times that the point x is classified in the classes ω_1, . . . , ω_m respectively. Using these responses, we can construct a probability distribution P̂r( ω_k | x ) over the set of classes {ω_k}, which is again used to decide the recognition of x.

c) Binary decision tree: Both methods above are quite easy to implement as they employ the binary solver directly. However, they both suffer from a high cost in computation time. We now discuss the technique proposed recently by Madzarov G. et al. (2009), which uses a binary decision tree strategy. Thanks to the binary tree, the technique gains in both complexity and computation-time consumption. It needs only m − 1 classifiers, which do not always run on the whole training set during their construction. By construction, recognizing a testing point x requires only O(log2 N) evaluations by descending the tree. Figure 3.2 illustrates how this algorithm works for classifying 7 classes.

Multiclass Kernel-based Vector Machines A more general and elegant formalism can be obtained for multiclassification by generalizing the kernel concept.
Within this discussion, we follow the approach given in the work of Crammer K. and Singer Y. (2001), but with a more geometrical explanation. We demonstrate that this approach can be interpreted as a simultaneous combination of the "one-against-all" and "one-against-one" strategies. As in the linear case, we have to define a decision function. In the binary case, f(x) = sign(h(x)), where h(x) is the boundary function (i.e. f(x) = +1 if x ∈ class 1, whereas f(x) = −1 if x ∈ class 2). In the multiclass case, the decision function must instead indicate the class index. In the work of Crammer K. et al. (2001), they proposed to construct the decision rule F : R^d → {1, . . . , m} as follows:

F(x) = argmax_{k∈{1,...,m}} W_k^T x

Figure 3.2: Binary decision tree strategy for multiclassification problem

Here, W is the d × m weight matrix in which each column W_k corresponds to a d × 1 weight vector. Therefore, we can rewrite the weight matrix as W = (W_1 W_2 . . . W_m). We remind that the vector x is of dimension d. In fact, the vector W_k corresponding to the k-th class can be interpreted as the normal vector of the hyperplane in the binary SVM. It characterizes the sensitivity of a given point x to the k-th class. The quantity W_k^T x is similar to a "score" that we attribute to the class ω_k.

Remark 6 This construction looks quite similar to the "one-against-all" strategy. The main difference is that in the "one-against-all" strategy all the vectors W_1 . . . W_m are constructed independently, one by one, with a binary SVM, whereas within this formalism they are all constructed simultaneously. We will show in the following that the selection rule of this approach is more similar to the "one-against-one" strategy.

Remark 7 In order to have an intuitive geometric interpretation, we treat here the case of the linear classifier. However, the generalization to the non-linear case is straightforward: we replace x_i^T x_j by φ(x_i)^T φ(x_j), which introduces the notion of kernel K(x_i, x_j) = φ(x_i)^T φ(x_j).

By definition, W_k is the vector defining the boundary which distinguishes the class ω_k from the rest. It is a normal vector to the boundary, pointing to the region occupied by class ω_k. Assume that we are able to separate all the data correctly with the classifier W. For any point (x, y), when we compute the position of x with respect to the two classes ω_y and ω_k for all k ≠ y, we must find that x belongs to class ω_y. As W_k defines the vector pointing to the class ω_k, when we compare a class ω_y to a class ω_k it is natural to take W_y − W_k as the vector pointing to class ω_y but not ω_k. As a consequence, W_k − W_y is the vector pointing to class ω_k but not ω_y. When x is well classified, we must have (W_y^T − W_k^T) x > 0 (i.e. the class ω_y has the best score). In order to have a margin as in the binary case, we impose strictly that (W_y^T − W_k^T) x ≥ 1 for all k ≠ y. This condition can be translated to all k = 1 . . . m by adding δ_{y,k} (the Kronecker symbol) as follows:

(W_y^T − W_k^T) x + δ_{y,k} ≥ 1

Therefore, solving the multiclassification problem for the training set (x_i, y_i)_{i=1...n} is equivalent to finding W satisfying:

(W_{y_i}^T − W_k^T) x_i + δ_{y_i,k} ≥ 1  ∀i, k

We notice here that w = W_i^T − W_j^T is a normal vector to the separation boundary H_w = { z | w^T z + b_ij = 0 } between the two classes ω_i and ω_j. Hence the width of the margin between two classes is, as in the binary case:

M(H_w) = 1/‖w‖

Maximizing the margin is equivalent to minimizing the norm ‖w‖. Indeed, we have ‖w‖² = ‖W_i − W_j‖² ≤ 2( ‖W_i‖² + ‖W_j‖² ). In order to maximize all the margins at the same time, it turns out that we have to minimize the L2-norm of the matrix W:

‖W‖² = Σ_{i=1}^m ‖W_i‖² = Σ_{i=1}^m Σ_{j=1}^d W_ij²

Finally, we obtain the following optimization problem:

min_W  (1/2)‖W‖²
u.c.  (W_{y_i}^T − W_k^T) x_i + δ_{y_i,k} ≥ 1  ∀i = 1...n, k = 1...m

The extension to the similar case with "soft margin" can be formulated easily by introducing the slack variables ξ_i corresponding to each training data point. As before, these slack variables allow the points to be classified inside the margin. The minimization problem now becomes:

min_{W,ξ}  (1/2)‖W‖² + C·F( Σ_{i=1}^n ξ_i^p )
u.c.  (W_{y_i}^T − W_k^T) x_i + δ_{y_i,k} ≥ 1 − ξ_i,  ξ_i ≥ 0  ∀i, k

Remark 8 Within the ERM or VRM frameworks, we can construct the risk function via the loss function l(x) = I_{F(x)≠y} for a data pair (x, y). For example, in the ERM framework, we have:

R_emp(W) = (1/n) Σ_{i=1}^n I_{F(x_i)≠y_i}


The classification problem is now equivalent to finding the optimal matrix W* which minimizes the empirical risk function. In the binary case, we have seen that the optimization of the risk function is equivalent to minimizing ‖w‖² (i.e. maximizing the margin) under linear constraints. We remark that in the VRM framework this problem can be tackled exactly as in the binary case. In order to prove the equivalence of minimizing the risk function with the large-margin principle, we look for a linear upper bound of the indicator function I_{F(x)≠y}. As shown in Crammer K. et al. (2001), we consider the following function:

g(x, y; k) = (W_k^T − W_y^T) x + 1 − δ_{y,k}

In fact, we can prove that

I_{F(x)≠y} ≤ g(x, y) = max_k g(x, y; k)  ∀(x, y)

We first remark that g(x, y; y) = (W_y^T − W_y^T) x + 1 − δ_{y,y} = 0, hence g(x, y) ≥ g(x, y; y) = 0. If the point (x_i, y_i) satisfies F(x_i) = y_i, then W_{y_i}^T x_i = max_k W_k^T x_i and I_{F(x)≠y}(x_i) = 0. In this case, it is obvious that I_{F(x)≠y}(x_i) ≤ g(x_i, y_i). If instead F(x_i) ≠ y_i, then W_{y_i}^T x_i < max_k W_k^T x_i and I_{F(x)≠y}(x_i) = 1. In this case, g(x_i, y_i) = max_k ( W_k^T x_i − W_{y_i}^T x_i ) + 1 ≥ 1. Hence, we obtain again I_{F(x)≠y}(x_i) ≤ g(x_i, y_i). Finally, we obtain the upper bound of the risk function by the following expression:

R_emp(W) ≤ (1/n) Σ_{i=1}^n max_k [ (W_k^T − W_{y_i}^T) x_i + 1 − δ_{y_i,k} ]

If the data is separable, then the optimal value of the risk function is zero. If one requires that the upper bound of the risk function is zero, then the W* which optimizes this bound must be the one which optimizes R_emp(W). The minimization can be expressed as:

max_k [ (W_k^T − W_{y_i}^T) x_i + 1 − δ_{y_i,k} ] = 0  ∀i

or, in the same form as the large-margin problem:

(W_{y_i}^T − W_k^T) x_i + δ_{y_i,k} ≥ 1  ∀i, k

Following the traditional routine for solving this problem, we map it into the dual problem as in the binary classification case. The details of this mapping are given in K. Crammer and Y. Singer (2001). We summarize here their important result in dual form, with dual variables η_i of dimension m for i = 1...n. Define τ_i = 1_{y_i} − η_i, where 1_{y_i} is the zero column vector except for the y_i-th element. Then, in the case of soft margin with p = 1 and F(u) = u, we have the dual problem:

max_τ  Q(τ) = −(1/2) Σ_{i,j} ( x_i^T x_j )( τ_i^T τ_j ) + (1/C) Σ_{i=1}^n τ_i^T 1_{y_i}
u.c.  τ_i ≤ 1_{y_i}  and  τ_i^T 1 = 0  ∀i
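The bound I_{F(x)≠y} ≤ max_k g(x, y; k) used in this derivation can be checked numerically for an arbitrary weight matrix W; the random data below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
d, m, n = 4, 3, 50
W = rng.standard_normal((d, m))               # columns W_k
X = rng.standard_normal((n, d))
y = rng.integers(0, m, n)

scores = X @ W                                # scores[i, k] = W_k^T x_i
F = scores.argmax(axis=1)                     # decision rule F(x) = argmax_k W_k^T x
delta = np.eye(m)[y]                          # rows are the Kronecker deltas delta_{y_i, k}
g = (scores - scores[np.arange(n), y][:, None] + 1.0 - delta).max(axis=1)
loss = (F != y).astype(float)                 # indicator I{F(x_i) != y_i}
```

The term k = y_i contributes exactly 0 to the max, so g ≥ 0 everywhere, and whenever a point is misclassified the winning class pushes g above 1, dominating the 0/1 loss.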


We remark here again that we obtain a quadratic program which involves only the inner products between all pairs of vectors x_i, x_j. Hence the generalization to the non-linear case is straightforward with the introduction of the kernel concept. The general problem is finally written by replacing the factor x_i^T x_j by the kernel K(x_i, x_j):

max_τ  Q(τ) = −(1/2) Σ_{i,j} K(x_i, x_j) τ_i^T τ_j + (1/C) Σ_{i=1}^n τ_i^T 1_{y_i}    (3.13)
u.c.  τ_i ≤ 1_{y_i}  and  τ_i^T 1 = 0  ∀i    (3.14)

The optimal solution of this problem allows us to evaluate the classification rule:

H(x) = argmax_{r=1...m} Σ_{i=1}^n τ_{i,r} K(x, x_i)    (3.15)

For a small number of classes m, we can implement the above optimization with a traditional QP program using a matrix of size mn × mn. However, for a large number of classes, we must employ an efficient algorithm, as merely storing an mn × mn matrix is already a complicated problem. Crammer and Singer have introduced an interesting algorithm which handles this optimization problem efficiently in terms of both storage and computation speed.
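As an illustration of the decomposition approach described above, here is a minimal one-against-all sketch. For brevity, the binary learner is a regularized least-squares classifier on ±1 labels, used as a stand-in for the binary SVM solver; the three-blob toy data and the λ value are illustrative choices.

```python
import numpy as np

def fit_binary(X, z, lam=1e-3):
    # stand-in binary learner: ridge regression on +/-1 responses z_i
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return np.linalg.solve(Xb.T @ Xb + lam * np.eye(Xb.shape[1]), Xb.T @ z)

def one_vs_all_fit(X, y, m):
    # classifier k separates class k (z = +1) from all the others (z = -1)
    return [fit_binary(X, np.where(y == k, 1.0, -1.0)) for k in range(m)]

def one_vs_all_predict(models, X):
    Xb = np.hstack([X, np.ones((len(X), 1))])
    scores = np.column_stack([Xb @ w for w in models])
    return scores.argmax(axis=1)       # decision rule: y = argmax_k f_k(x)

rng = np.random.default_rng(7)
centers = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])
X = np.vstack([c + 0.5 * rng.standard_normal((30, 2)) for c in centers])
y = np.repeat(np.arange(3), 30)
models = one_vs_all_fit(X, y, 3)
pred = one_vs_all_predict(models, X)
```

On well-separated blobs the argmax of the m raw outputs already classifies correctly; in practice one would add the Platt probability mapping before comparing outputs across classifiers.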

3.5 SVM-regression in finance

Recently, different applications in the financial field have been developed along two main directions. The first one employs SVM as a non-linear estimator in order to forecast the market tendency or volatility. In this context, SVM is used as a regression technique, with a feasible extension to the non-linear case thanks to the kernel approach. The second direction consists of using SVM as a classification technique which aims to elaborate the stock selection in a trading strategy (for example a long/short strategy). The SVM regression can be considered as a non-linear filter for time series or as a regression for evaluating a score. We first discuss how to employ SVM regression as an estimator of the trend for a given asset; the observed trend can be used later in momentum strategies such as the trend-following strategy. We next use SVM as a method for constructing the score of a stock for a long/short strategy.

3.5.1 Numerical tests on SVM-regressors

We test here the efficiency of the different regressors discussed above. They can be distinguished by the form of the loss function (L1-type or L2-type) or by the form of the non-linear kernel. We do not focus yet on the calibration of the SVM parameters and reserve it for the next discussion on trend extraction of financial time series, with a full description of the cross-validation procedure. For a given time series $y_t$, we would like to regress the data on the training vector $x = t = (t_i)_{i=1\ldots n}$.

Support Vector Machine in Finance

Let us consider two models of time series. The first model is simply a deterministic trend perturbed by a white noise:

$$y_t = (t - a)^3 + \sigma N(0, 1) \qquad (3.16)$$

The second model for our tests is the Black-Scholes model of the stock price:

$$\frac{dS_t}{S_t} = \mu_t\, dt + \sigma_t\, dB_t \qquad (3.17)$$

We notice here that the studied signal is $y_t = \ln S_t$. The parameters of the model are the annualized return μ = 5% and the annualized volatility σ = 20%. We consider the regression over a period of one year, corresponding to N = 260 trading days. The first test consists of comparing the L1-regressor and the L2-regressor with a Gaussian kernel (see Figures 3.3-3.4). As shown in Figures 3.3 and 3.4, the L2-regressor seems to be more suitable for the regression. Indeed, over many tests on data simulated from model (3.17), we observe that the L2-regressor is more stable than the L1-regressor (i.e. L1 is more sensitive to the training data set). In the second test, we compare different L2 regressions corresponding to four typical kernels: 1. Linear, 2. Polynomial, 3. Gaussian, 4. Sigmoid.

Figure 3.3: L1-regressor versus L2-regressor with Gaussian kernel for model (3.16) [plot omitted]
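To make the test setup concrete, the two simulated models can be sketched in numpy as follows. The centering parameter a = 2.5 of the cubic trend and its noise scale are illustrative assumptions; the text only specifies μ = 5% and σ = 20% for the Black-Scholes case.

```python
import numpy as np

rng = np.random.default_rng(0)

# Model (3.16): deterministic cubic trend plus white noise, observed on [0, 5]
n, a, sigma = 260, 2.5, 1.0            # a and sigma are illustrative choices
t = np.linspace(0.0, 5.0, n)
y_trend = (t - a) ** 3 + sigma * rng.standard_normal(n)

# Model (3.17): Black-Scholes log-price with mu = 5%, sigma = 20%, one trading year
mu_a, sigma_a, dt = 0.05, 0.20, 1.0 / 260
log_ret = (mu_a - 0.5 * sigma_a ** 2) * dt + sigma_a * np.sqrt(dt) * rng.standard_normal(n)
y_bs = np.cumsum(log_ret)              # y_t = ln(S_t / S_0)
```

Both series contain n = 260 points, matching one trading year; `y_bs` plays the role of the studied signal $y_t = \ln S_t$.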


3.5.2 SVM-filtering for forecasting the trend of a signal

Here, we employ SVM as a non-linear filtering technique for extracting the hidden trend of a time series signal.

Figure 3.4: L1-regressor versus L2-regressor with Gaussian kernel for model (3.17) [plot omitted]

Figure 3.5: Comparison of different regression kernels for model (3.16) [plot omitted]

Figure 3.6: Comparison of different regression kernels for model (3.17) [plot omitted]

The regression principle was explained in the previous discussion. We now apply this technique for estimating the trend $\hat\mu_t$ of the signal, and then plug it into a trend-following strategy.

Description of the trend-following strategy

We choose here the simplest trend-following strategy, whose exposure is given by:

$$e_t = m\, \frac{\hat\mu_t}{\hat\sigma_t^2}$$

with m the risk tolerance and $\hat\sigma_t$ the estimator of volatility given by:

$$\hat\sigma_t^2 = \frac{1}{T}\int_0^T \sigma_t^2\, dt = \frac{1}{T}\sum_{i=t-T+1}^{t} \ln^2 \frac{S_i}{S_{i-1}}$$

In order to limit the risk of explosion of the exposure $e_t$, we cap it between lower and upper bounds $e_{\min}$ and $e_{\max}$:

$$e_t = \max\left(\min\left(m\,\frac{\hat\mu_t}{\hat\sigma_t^2},\, e_{\max}\right),\, e_{\min}\right)$$

The wealth of the portfolio is then given by the following expression:

$$W_{t+1} = W_t + W_t\left[e_t\left(\frac{S_{t+1}}{S_t} - 1\right) + (1 - e_t)\, r_t\right]$$
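A minimal numpy sketch of this strategy follows. In practice, the trend estimate `mu_hat` would come from the SVM filter; here it is a hypothetical constant, and the values of m, T, e_min, e_max and the cash rate r are illustrative assumptions.

```python
import numpy as np

def backtest_trend_following(S, mu_hat, r, m=1.0, T=20, e_min=-2.0, e_max=2.0, W0=100.0):
    """Simple trend-following backtest: e_t = clip(m * mu_hat_t / sigma_hat_t^2)."""
    log_ret = np.diff(np.log(S))
    W = [W0]
    for t in range(T, len(S) - 1):
        # realized-variance estimator over the last T log-returns (per period)
        sigma2 = np.mean(log_ret[t - T:t] ** 2)
        e = np.clip(m * mu_hat[t] / sigma2, e_min, e_max)
        # exposed part earns the asset return, the remainder earns the rate r_t
        W.append(W[-1] * (1.0 + e * (S[t + 1] / S[t] - 1.0) + (1.0 - e) * r[t]))
    return np.array(W)

rng = np.random.default_rng(1)
S = 100.0 * np.exp(np.cumsum(0.01 * rng.standard_normal(260)))  # simulated price path
mu_hat = np.full(260, 2e-4)   # hypothetical per-day trend estimate from the SVM filter
r = np.zeros(260)             # cash rate set to zero for the illustration
W = backtest_trend_following(S, mu_hat, r)
```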

SVM-filtering

We discuss now how to build a cross-validation procedure which can help to learn the trend of a given signal. We employ the moving average as a benchmark against which to compare this new filter. An important parameter in moving-average filtering is the estimation horizon T, so we use this horizon as a reference to calibrate our SVM-filtering. For the sake of simplicity, we study here only the SVM-filter with Gaussian kernel and L2 penalty. The two typical parameters of the SVM-filter are C and σ: C is the parameter which allows a certain level of error in the regression curve, while σ characterizes the horizon of estimation and is directly proportional to T. We propose two schemes for the validation procedure, based on the following division of the data: training set, validation set and testing set. In the first scheme, we fix the kernel parameter σ = T and optimize the error-tolerance parameter C on the validation set. This scheme is comparable to our moving-average benchmark. The second scheme consists of optimizing both parameters (C, σ) on the validation set. In this case, we let the validation data decide the estimation horizon. This scheme is more complicated to interpret, as σ is now a dynamic parameter. However, by relating σ to the local horizon, we gain additional insight into changes in the price of the underlying asset. For example, we can determine from the historical data whether the underlying asset undergoes a period with a long or short trend, which can help to recognize additional signatures such as the cycle between long and short trends. We report the two schemes in the following algorithm.

Figure 3.7: Cross-validation procedure for determining the optimal values C*, σ*: the historical data are split into a training window (up to T1), a validation window (T1 to T2) and a forecasting window after T2 (today), on which the prediction is made.
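The two calibration schemes can be sketched as follows, using kernel ridge regression (the least-squares, L2-loss form of the SVM-regressor, with 1/C as the ridge penalty) as a stand-in for the full SVM-filter; the parameter grids and the simulated trend are illustrative assumptions.

```python
import numpy as np

def gauss_kernel(u, v, sig):
    return np.exp(-(u[:, None] - v[None, :]) ** 2 / (2.0 * sig ** 2))

def lssvm_fit_predict(t_tr, y_tr, t_va, C, sig):
    # LS-SVM / kernel ridge with squared loss: alpha = (K + I/C)^-1 y
    K = gauss_kernel(t_tr, t_tr, sig)
    alpha = np.linalg.solve(K + np.eye(len(t_tr)) / C, y_tr)
    return gauss_kernel(t_va, t_tr, sig) @ alpha

rng = np.random.default_rng(2)
t = np.arange(200.0)
y = 0.001 * t + 0.05 * rng.standard_normal(200)     # noisy linear trend
t_tr, y_tr, t_va, y_va = t[:150], y[:150], t[150:], y[150:]

T = 20.0
# Scheme 1: sigma fixed at the reference horizon T, calibrate C only
errs1 = {C: np.mean((lssvm_fit_predict(t_tr, y_tr, t_va, C, T) - y_va) ** 2)
         for C in (0.1, 1.0, 10.0)}
C_star = min(errs1, key=errs1.get)

# Scheme 2: calibrate the pair (C, sigma) jointly on the validation set
errs2 = {(C, s): np.mean((lssvm_fit_predict(t_tr, y_tr, t_va, C, s) - y_va) ** 2)
         for C in (0.1, 1.0, 10.0) for s in (5.0, 20.0, 60.0)}
C2, sig2 = min(errs2, key=errs2.get)
```

Since the second grid contains the first one (σ = T is among the candidates), the second scheme can never do worse on the validation set.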

Backtesting

We first check the SVM-filter on simulated data given by the Black-Scholes model of the price. We consider a stock price with annualized return μ = 10% and annualized volatility σ = 20%. The regression is based on one trading year of data (n = 260 days) with a fixed horizon of one month (T = 20 days). In Figure 3.8, we present the result of the SVM trend prediction with fixed horizon T = 20, whereas Figure 3.9 presents the SVM trend prediction for the second scheme.

3.5.3 SVM for multivariate regression

As a regression method, SVM can also be employed for multivariate regression.

Figure 3.8: SVM-filtering with the fixed-horizon scheme [plot omitted]

Figure 3.9: SVM-filtering with the dynamic-horizon scheme [plot omitted]

Algorithm 3 SVM-filtering procedure
procedure SVM_Filter(X, y, T)
  Divide the data into training set Dtrain, validation set Dvalid and testing set Dtest
  Perform the regression on the training data Dtrain
  Construct the SVM prediction on the validation set Dvalid
  if fixed horizon then
    σ = T
    Compute the prediction error Error(C) on Dvalid
    Minimize Error(C) and obtain the optimal parameter C*
  else
    Compute the prediction error Error(σ, C) on Dvalid
    Minimize Error(σ, C) and obtain the optimal parameters (σ*, C*)
  end if
  Use the optimal parameters to predict the trend on the testing set Dtest
end procedure

Assume a universe of d stocks $X = (X^{(i)})_{i=1\ldots d}$ observed over a period of n dates. The performance of the index or of an individual stock that we are interested in is given by y. We look for the prediction of the value of $y_{n+1}$ by regressing on the historical data $(X_t, y_t)_{t=1\ldots n}$. In this case, the different stocks play the role of the factors of the vectors in the training set. We can likewise apply other regressions, such as the prediction of the performance of a stock based on the available information of all the factors.

Multivariate regression

We first test here the efficiency of the multivariate regression on a simulated model. Assume that all the factors follow a Brownian motion:

$$dX_t^{(i)} = \mu_t\, dt + \sigma_t\, dB_t^{(i)} \qquad \forall i = 1 \ldots d$$

Let $(y_t)_{t=1\ldots n}$ be the vector to be regressed, which is related to the input X by a function:

$$y_t = f(X_t) = W_t^T X_t$$

We would like to regress the vector $y = (y_t)_{t=2\ldots n}$ on the historical data $(X_t)_{t=1\ldots n-1}$ by SVM-regression. This regression is given by the function $y_t = F(X_{t-1})$. Hence, the prediction of the future performance $y_{n+1}$ is given by:

$$E[y_{n+1} \,|\, X_n] = F(X_n)$$

In Figure 3.10, we present the results obtained by the Gaussian kernel with L1 and L2 penalty conditions, whereas in Figure 3.11 we compare the results obtained with different types of kernel. Here, we consider just a simple scheme with a lag of one trading day for the regression. In all figures, we remark this lag in the prediction of the value of y.
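A minimal sketch of this lag-one multivariate regression, with ordinary least squares standing in for the SVM-regressor; constant weights $W_t = w$ and the simulation parameters are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 500, 5
# factors follow independent driftless Brownian motions (discretized)
X = np.cumsum(0.01 * rng.standard_normal((n, d)), axis=0)
w = rng.standard_normal(d)
y = X @ w                                  # y_t = W^T X_t with constant W_t = w

# regress y_t on the lagged factors X_{t-1} (lag of one trading day)
A, b = X[:-1], y[1:]
coef, *_ = np.linalg.lstsq(A, b, rcond=None)
y_hat_next = X[-1] @ coef                  # prediction E[y_{n+1} | X_n] = F(X_n)
```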

Figure 3.10: L1-regressor versus L2-regressor with Gaussian kernel for model (3.16) [plot omitted]

Figure 3.11: Comparison of different kernels for multivariate regression [plot omitted]


Backtesting

3.6 SVM-classification in finance

We now discuss in this section the second application of SVM in finance, as a stock classifier. We first test our implementations of the binary classifier and the multiclassifier. We next employ the SVM technique to study two different problems: (i) recognition of sectors and (ii) construction of an SVM score for a stock-picking strategy.

3.6.1 Test of SVM-classifiers

For the binary classification problem, we consider both approaches (dual/primal) to determine the boundary between two given classes based on the available information of each data point. For the multiclassification problem, we first extend the binary classifier to the multi-class case by using the binary decision tree (SVM-BDT). This algorithm has been demonstrated to be more efficient than traditional approaches such as "one-against-all" or "one-against-one", both in computation time and in precision. The general multi-SVM approach will then be compared to SVM-BDT.

Binary SVM classifier

Let us compare here the two proposed approaches (dual/primal) for solving the SVM-classification problem numerically. In order to carry out the test, we consider a random training data set of n vectors $x_i$ with classification criterion $y_i = \operatorname{sign}(x_i)$. We present here the comparison of the two classification approaches with a linear kernel. The result of the primal approach is obtained directly with the software of O. Chapelle². This software was implemented with the L2 penalty condition. Our dual solver is implemented for both L1 and L2 penalty conditions by simply employing a QP program. In Figure 3.12, we show the results of the classification obtained by both methods with the L2 penalty condition. We next test the non-linear classification by using the Gaussian kernel (RBF kernel) with the binary dual solver. We generate the simulated data in the same way as in the last example with $x \in \mathbb{R}^2$. The result of the classification is illustrated in Figure 3.13 for the RBF kernel with parameters C = 0.5 and σ = 2³.

Multi-SVM classifier

We first test the implementation of SVM-BDT via simulated data $(x_i)_{i=1\ldots n}$ which are generated randomly. We suppose that these data are distributed in $N_c$ classes.

² The free software of O. Chapelle can be found at the following website: http://olivier.chapelle.cc/primal/
³ We used here the "plotlssvm" function of the LS-SVM toolbox for graphical illustration. A similar result was also obtained by using the "trainlssvm" function of the same toolbox.
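Our dual solver is not reproduced here, but the following sketch solves the same box-constrained dual (L1 penalty, linear kernel) by projected gradient ascent; the bias is folded into the kernel via K + 1 so that the equality constraint can be dropped. The data follow the test design $y_i = \operatorname{sign}(x_i)$; the values of C, the learning rate and the iteration count are illustrative.

```python
import numpy as np

def svm_dual_train(X, y, C=10.0, lr=1e-3, n_iter=5000):
    """L1-penalty dual SVM solved by projected gradient ascent."""
    K = X @ X.T + 1.0                      # linear kernel with implicit bias term
    Q = (y[:, None] * y[None, :]) * K
    alpha = np.zeros(len(y))
    for _ in range(n_iter):
        alpha += lr * (1.0 - Q @ alpha)    # gradient of sum(a) - 1/2 a'Qa
        np.clip(alpha, 0.0, C, out=alpha)  # project onto the box 0 <= a <= C
    return alpha

def svm_predict(X_tr, y_tr, alpha, X):
    return np.sign((X @ X_tr.T + 1.0) @ (alpha * y_tr))

rng = np.random.default_rng(4)
x = np.concatenate([rng.uniform(0.2, 1.0, 20), rng.uniform(-1.0, -0.2, 20)])
X = x[:, None]
y = np.sign(x)                             # classification criterion y_i = sign(x_i)
alpha = svm_dual_train(X, y)
acc = np.mean(svm_predict(X, y, alpha, X) == y)
```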

Figure 3.12: Comparison between the dual algorithm and the primal algorithm [plot omitted]

Figure 3.13: Illustration of non-linear classification with Gaussian kernel [plot omitted]

In order to test our multi-SVM implementation efficiently, the response vector $y = (y_1, \ldots, y_n)$ is supposed to depend only on the first coordinate of the data vector:

$$z = U(0,1), \qquad x_1 = N_c\, z + \epsilon\, N(0,1), \qquad y = [N_c\, z], \qquad x_i = U(0,1) \quad \forall i > 1$$

Here, $[a]$ denotes the integer part of a. We could generate our simulated data in a much more general way, but it would then be very hard to visualize the result of the classification. With the above choice of simulated data, we can see that in the case ε = 0 the data are separable along the axis $x_1$. From the geometric point of view, the space $\mathbb{R}^d$ is divided into $N_c$ zones along the axis $x_1$: $\mathbb{R}^{d-1} \times [0, 1[, \ldots, \mathbb{R}^{d-1} \times [N_c - 1, N_c[$. The boundaries are simply the hyperplanes $\mathbb{R}^{d-1}$ crossing $x_1 = 1, \ldots, N_c - 1$. When we introduce some noise on the coordinate $x_1$ (ε > 0), the training set is no longer separable by this ensemble of linear hyperplanes: there will be some misclassified points and some deformation of the boundaries thanks to the non-linear kernel. For the sake of simplicity, we assume that the data (x, y) are already gathered by group. In Figures 3.14 and 3.15, we present the classification results for in-sample data and out-of-sample data in the case ε = 0 (i.e. separable data). We then introduce the noise in the data coordinate $x_1$ with ε = 0.2.

Figure 3.14: Illustration of multiclassification with SVM-BDT for in-sample data [plot omitted]
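The simulated data of this test can be generated as follows; the exact placement of the noise ε on the first coordinate is our reconstruction of the garbled formula above, and the default sizes are illustrative.

```python
import numpy as np

def make_sector_data(n=500, d=2, n_classes=10, eps=0.2, seed=5):
    """Simulated multiclass data: labels depend only on the first coordinate,
    which is uniform on [0, Nc] plus Gaussian noise of scale eps."""
    rng = np.random.default_rng(seed)
    z = rng.uniform(0.0, 1.0, n)
    X = rng.uniform(0.0, 1.0, (n, d))
    X[:, 0] = n_classes * z + eps * rng.standard_normal(n)
    y = np.floor(n_classes * z).astype(int)      # integer part -> class label
    return X, np.clip(y, 0, n_classes - 1)

X, y = make_sector_data()
```

With eps = 0, the label of every point is simply the integer part of its first coordinate, so the classes are separable by the hyperplanes x1 = 1, ..., Nc - 1.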

Figure 3.15: Illustration of multiclassification with SVM-BDT for out-of-sample data [plot omitted]

Figure 3.16: Illustration of multiclassification with SVM-BDT for ε = 0 [plot omitted]

Figure 3.17: Illustration of multiclassification with SVM-BDT for ε = 0.2 [plot omitted]

3.6.2 SVM for classification

We employ here the multi-SVM algorithm on the constituents of the Eurostoxx 300 index. Our goal is to determine the boundaries between the various sectors to which the constituents of the index belong. As the algorithm contains two main parts, classification and prediction, we can classify our stocks via the common properties resulting from the available factors. The number of misclassified stocks, or the classification error, gives us an assessment of the sector definition. We next study the recognition phase on the ensemble of test data.

Classification of stocks by sectors

In order to classify the stocks composing the Eurostoxx 300 index well, we consider the Ntrain = 100 most representative stocks in terms of value. In order to establish the multiclass SVM classification using the binary decision tree, we sort the Ntrain = 100 assets by sector. We then employ SVM-BDT to compute the Ntrain − 1 binary separators. In Figure 3.18, we present the classification result with Gaussian kernel and L2 penalty condition. For σ = 2 and C = 20, we are able to classify the 100 assets correctly over ten main sectors: Oil & Gas, Industrials, Financials, Telecommunications, Health Care, Basic Materials, Consumer Goods, Technology, Utilities, Consumer Services. In order to check the efficiency of the classification, we test the prediction quality on a test set composed of Ntest = 50 assets. In Figure 3.19, we compare the SVM-BDT result with the true sector distribution of the 50 assets.

Figure 3.18: Multiclassification with SVM-BDT on the training set [plot omitted]

In this case, the rate of correct prediction is about 58%.

Calibration procedure

As discussed above in the implementation part of the SVM solver, there are two kinds of parameters which play an important role in the classification process. The first parameter, C, concerns the error tolerance of the margin, while the second kind concerns the choice of kernel (σ for the Gaussian kernel, for example). In the last example, we optimized the pair of parameters (C, σ) in order to obtain the best classifiers, which commit no error on the training set. However, this result is meaningful only if the sectors are correctly defined, and nothing guarantees that the given notion of sectors is the most appropriate one. Hence, the classification process should consist of two steps: (i) determination of the binary SVM classifiers on the training data set and (ii) calibration of the parameters on the validation set. In fact, we decide to optimize the pair of parameters (C, σ) by minimizing the realized error on the validation set, because the committed error on the training set (learning set) is always smaller than the one on the validation set (unseen set). In the second phase, we can redefine the sectors in the sense that if any asset is misclassified, we change its sector label and repeat the optimization on the validation set until convergence. At the end of the calibration procedure, we expect to obtain first a new recognition of sectors and second a multi-classifier for new assets. As SVM uses the training set to learn the classification, it must commit fewer errors on this set than on the validation set.

Figure 3.19: Prediction efficiency with SVM-BDT on the validation set [plot omitted]

We propose here to optimize the SVM parameters by minimizing the error on the validation set. We use the same error function as defined in Section 3, but apply it to the validation data set V:

$$\text{Error} = \frac{1}{\operatorname{card}(V)} \sum_{i \in V} \psi\left(-y_i'\, f(x_i')\right)$$

where $\psi(x) = I_{\{x>0\}}$, with $I_A$ the standard notation for the indicator function. However, the error function involves the step function ψ, which is discontinuous and can cause some difficulty if we want to determine the best parameter selection via the optimal test error. In order to search for the minimal test error by gradient descent, for example, we should smooth the test error by regularizing the step function:

$$\tilde\psi(x) = \frac{1}{1 + \exp(-Ax + B)}$$
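The step loss and its sigmoid surrogate can be compared directly; the values A = 50, B = 0 and the margins below are illustrative.

```python
import numpy as np

def psi(x):
    """Exact step-function loss: counts a point as an error when x > 0."""
    return (x > 0).astype(float)

def psi_smooth(x, A=10.0, B=0.0):
    """Sigmoid surrogate 1 / (1 + exp(-A x + B)); larger A gives a sharper step."""
    return 1.0 / (1.0 + np.exp(-A * x + B))

margins = np.array([-2.0, -0.1, 0.1, 2.0])    # values of -y_i' f(x_i')
exact = psi(margins).mean()                    # exact validation error
smooth = psi_smooth(margins, A=50.0).mean()    # smooth, differentiable version
```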

The choice of the parameters A and B is important. If A is too small, the approximation error is too large, whereas if A is too large, the test error is not smooth enough for the minimization procedure.

Recognition of sectors

By construction, the SVM classifier is a very efficient method for recognizing and classifying a new element with respect to a given number of classes. However, it is not able to recognize the sectors themselves, nor to introduce a new, corrected definition of the available sectors over a universe of available data (stocks). In finance, the classification by sector is more

related to the origin of the stock than to its intrinsic properties in the market. This may cause problems in a trading strategy if a stock is misclassified, for example in the case of a pair-trading strategy. Here, we try to overcome this weak point of SVM by introducing a method which modifies the initial definition of the sectors.

The main idea of the sector-recognition procedure is the following. We divide the available data into two sets: a training set and a validation set. We employ the training set to learn the classification and the validation set to optimize the SVM parameters. We start with the initial definition of the given sectors. Within each iteration, we learn on the training set in order to determine the classifiers, then we compute the validation error. An optimization procedure on the validation error helps us to determine the optimal parameters of the SVM. For each ensemble of optimal parameters, we may encounter some errors on the training set. If the validation error is smaller than a certain threshold with no error on the training set, we have reached the optimal configuration of the sector definition. If there are errors on the training set, we relabel the misclassified data points and define new sectors with this correction. All the sector labels are changed by this rule for both training and validation sets. The iteration procedure is repeated until no error is committed on the training set for a given expected threshold of error on the validation set. The algorithm of this sector-recognition procedure is summarized in the following table:

Algorithm 4 Sector recognition by SVM classification
procedure SVM_SectorRecognition(X, y, ε)
  Divide the historical data into training set T and validation set V
  Initialize the sector labels with the physical sector names: Sec_1^0, ..., Sec_m^0
  while E_T > ε do
    while E_V > ε do
      Compute the SVM separators for labels Sec_1, ..., Sec_m on T for given (C, σ)
      Construct the SVM predictor from the separators Sec_1, ..., Sec_m
      Compute the error E_V on the validation set V
      Update the parameters (C, σ) until convergence of E_V
    end while
    Compute the error E_T on the training set T
    Identify the misclassified points of the training set
    Relabel the misclassified points, then update the definition of the sectors
  end while
end procedure

3.6.3 SVM for score construction and stock selection

Traditionally, in order to improve stock picking, we rank the stocks by constructing a "score" based on all the characteristics (so-called factors) of the considered stock. We require that the construction of this global quantity (a combination of factors) must satisfy some classification criterion, for example the performance. We denote by $(x_i)_{i=1\ldots n}$, with $x_i$ the ensemble of factors of the i-th stock. The classification criterion, such as the performance, is denoted by the vector $y = (y_i)_{i=1\ldots n}$. The aim of the SVM classifier in this problem is to recognize which stocks (scores) belong to the high/low performance class (overperforming/underperforming). More precisely, we have to identify a separation boundary as a function of score and performance f(x, y). Hence, SVM stock picking consists of two steps: (i) construction of the factor ensemble (i.e. harmonizing all the characteristics of a given stock, such as the price, the risk, macro-properties, etc., into comparable quantities); (ii) application of the SVM classification algorithm with an adaptive choice of parameters. In the following, we first give a brief description of score constructions and then establish the backtest of a stock-picking strategy.

Probit model for score construction

We summarize here briefly the main idea of score construction by the Probit model. Assume that a set of training data $(x_i, y_i)_{i=1\ldots n}$ is available, where x is the vector of factors and y is a binary response. We look to construct the conditional probability distribution of the random variable Y at a given point X. This probability distribution can be used later to predict the response of a new data point $x_{\text{new}}$. The Probit model estimates this conditional probability in the form:

$$\Pr(Y = 1 \,|\, X) = \Phi(X^T \beta + \alpha)$$

with Φ(x) the cumulative distribution function (CDF) of the standard normal distribution. The pair of parameters (α, β) can be obtained by maximum-likelihood estimation. The choice of the function Φ(x) is quite natural when we work with a binary random variable, because it gives a symmetric probability distribution.
Remark 9 This model can be written in another form with the introduction of a hidden random variable $Y^\star = X^T \beta + \alpha + \epsilon$, where $\epsilon \sim N(0, 1)$. Then Y can be interpreted as an indicator of whether $Y^\star$ is positive:

$$Y = I_{\{Y^\star > 0\}} = \begin{cases} 1 & \text{if } Y^\star > 0 \\ 0 & \text{otherwise} \end{cases}$$

In finance, we can employ this model for score construction. We define the binary variable Y from the relative return of a given asset with respect to the benchmark: Y = 1 if the return of the asset is higher than that of the benchmark and Y = 0 otherwise. Hence, Pr(Y = 1|X) is the probability that the asset with factor vector X outperforms. Naturally, we can define this quantity as a score measuring the probability of gain over the benchmark:

$$S = \Pr(Y = 1 \,|\, X)$$

In order to estimate the regression parameters (α, β), we maximize the log-likelihood function:

$$L(\alpha, \beta) = \sum_{i=1}^{n} \left[ y_i \ln \Phi(x_i^T \beta + \alpha) + (1 - y_i) \ln\left(1 - \Phi(x_i^T \beta + \alpha)\right) \right]$$

Using the maximum-likelihood estimates, we can predict the score of a given asset with factor vector X as follows:

$$\hat S = \Phi(X^T \hat\beta + \hat\alpha)$$

The probability distribution of the score $\hat S$ can be computed by the empirical formula:

$$\Pr(\hat S < s) = \frac{1}{n} \sum_{i=1}^{n} I_{\{S_i < s\}}$$

We test the Probit score construction on data simulated from the latent-variable model $Y = I_{\{X^T \beta_0 + \alpha_0 + \epsilon > 0\}}$. Here, the parameters of the model, $\alpha_0$ and $\beta_0$, are chosen as $\alpha_0 = 0.1$ and $\beta_0 = 1$. We employ the Probit regression in order to determine the score of n = 500 data points in the cases d = 2 and d = 5. The comparisons between the Probit score and the simulated score are presented in Figures 3.20-3.22.

SVM score construction

We discuss now how to employ SVM to construct the score for a given ensemble of assets. In the work of G. Simon (2005), the SVM score is constructed by using the SVM-regression algorithm. In fact, with the SVM-regression algorithm we are able to forecast the future performance $E[\mu_{t+1} | X_t] = \hat\mu_t$ based on the present ensemble of factors; this value can then be employed directly as the prediction in a trend-following strategy without the need for a score construction. We propose here another use
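A self-contained sketch of the Probit score construction on data simulated from the latent model; plain gradient ascent on the log-likelihood stands in for a full maximum-likelihood routine, and the learning rate and iteration count are illustrative choices.

```python
import numpy as np
from math import erf, sqrt

Phi = np.vectorize(lambda t: 0.5 * (1.0 + erf(t / sqrt(2.0))))  # standard normal CDF

def probit_loglik(a, b, X, y):
    p = np.clip(Phi(X @ b + a), 1e-10, 1 - 1e-10)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

def probit_fit(X, y, lr=0.05, n_iter=300):
    """Maximum likelihood by gradient ascent on L(alpha, beta)."""
    n, d = X.shape
    a, b = 0.0, np.zeros(d)
    for _ in range(n_iter):
        t = X @ b + a
        p = np.clip(Phi(t), 1e-10, 1 - 1e-10)
        phi = np.exp(-0.5 * t ** 2) / np.sqrt(2 * np.pi)   # normal pdf
        w = (y - p) * phi / (p * (1 - p))                   # score of the log-likelihood
        a += lr * w.mean()
        b += lr * (w @ X) / n
    return a, b

# simulated data from the latent model Y = 1{X'b0 + a0 + eps > 0}
rng = np.random.default_rng(6)
n, d, a0, b0 = 500, 2, 0.1, np.ones(2)
X = rng.standard_normal((n, d))
Y = ((X @ b0 + a0 + rng.standard_normal(n)) > 0).astype(float)
a_hat, b_hat = probit_fit(X, Y)
score = Phi(X @ b_hat + a_hat)             # S_hat = Phi(X' beta_hat + alpha_hat)
```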

Figure 3.20: Comparison between the simulated score and the Probit score for d = 2 [plot omitted]

Figure 3.21: Comparison between the simulated score CDF and the Probit score CDF for d = 2 [plot omitted]

Figure 3.22: Comparison between the simulated score PDF and the Probit score PDF for d = 2 [plot omitted]

of the SVM algorithm, based on SVM-classification, for building scores which later allow us to implement long/short strategies by means of selection curves. Our main idea for the SVM-score construction is very similar to the Probit model. We first define a binary variable $Y_i = \pm 1$ associated to each asset $x_i$. This variable characterizes the performance of the asset with respect to the benchmark: if $Y_i = -1$ the stock underperforms, whereas if $Y_i = 1$ the stock outperforms. We next employ binary SVM-classification to separate the universe of stocks into two classes: high performance and low performance. Finally, we define the score of each stock as its distance to the decision boundary.
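With a linear separator w'x + b (a hypothetical trained one in this sketch), the score-as-distance construction reads as follows; for a kernelized classifier, the decision-function value would play the same role.

```python
import numpy as np

def svm_scores(X, w, b):
    """Score of each stock = signed distance to the decision boundary w'x + b = 0
    (positive side = outperforming, negative side = underperforming)."""
    return (X @ w + b) / np.linalg.norm(w)

rng = np.random.default_rng(7)
X = rng.standard_normal((8, 3))            # 8 stocks, 3 factors (illustrative)
w, b = np.array([1.0, -0.5, 0.2]), 0.1     # hypothetical trained separator
s = svm_scores(X, w, b)
ranking = np.argsort(-s)                   # best-scored stocks first
```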

Selection curve

In order to construct a simple strategy, of long/short type for example, we must be able to establish a selection rule based on the score obtained by the Probit model or by the SVM. Depending on the strategy (long, short or long/short), we want to build a selection curve which determines the proportion of selected assets together with the associated level of error. For a long strategy, we prefer to buy a certain proportion of high-performance stocks with knowledge of the possible committed error. To do so, we define a

selection curve for which the score plays the role of the parameter:

$$Q(s) = \Pr(S \ge s), \qquad E(s) = \Pr(S \ge s \,|\, Y = 0) \qquad \forall\, s \in [0, 1]$$

This parametric curve can be traced in the square [0, 1] × [0, 1], as shown in Figure 3.23. On the x-axis, Q(s) defines the quantile corresponding to the stock selection among the considered universe of stocks. On the y-axis, E(s) defines the committed error corresponding to the stock selection: for a given quantile, it measures the chance that we pick a badly performing stock. Two trivial limits are the points (0, 0) and (1, 1): the first corresponds to the limit with no selection, whereas the second corresponds to the limit where everything is selected. A good score-construction method should produce a selection curve that is as convex as possible, because this guarantees a selection with fewer errors.

Figure 3.23: Selection curve of the long strategy for simulated data and the Probit model [plot omitted]
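The selection curve can be computed empirically from a set of scores and binary outcomes; the simulated scores below, mildly correlated with the outcome, are illustrative.

```python
import numpy as np

def selection_curve(scores, y):
    """Long-strategy selection curve: for each threshold s,
    Q(s) = P(S >= s) and E(s) = P(S >= s | Y = 0)."""
    thresholds = np.sort(np.unique(scores))
    Q = np.array([np.mean(scores >= s) for s in thresholds])
    E = np.array([np.mean(scores[y == 0] >= s) for s in thresholds])
    return Q, E

rng = np.random.default_rng(8)
y = rng.integers(0, 2, 200)                # binary performance outcome
scores = np.clip(0.5 + 0.3 * (y - 0.5) + 0.2 * rng.standard_normal(200), 0, 1)
Q, E = selection_curve(scores, y)
```

An informative score keeps E(s) below Q(s) for intermediate thresholds, i.e. the curve stays below the diagonal of the unit square.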


Reciprocally, for a short strategy, the selection curve can be obtained by tracing the following parametric curve:

$$Q(s) = \Pr(S \le s), \qquad E(s) = \Pr(S \le s \,|\, Y = 1) \qquad \forall\, s \in [0, 1]$$

Here, Q(s) helps us to determine the quantile of low-performance stocks to be shorted, while E(s) helps us to avoid selling the high-performance ones. As the selection

Figure 3.24: Probit scores for Eurostoxx data with d = 20 factors [plot omitted]

curve is independent of the score definition, it is an appropriate quantity for comparing different scoring techniques. In the following, we employ the selection curve to compare the score constructions of the Probit model and of the SVM approach. Figure 3.24 shows the comparison of the selection curves constructed by the SVM score and the Probit score on the training set. Here, we did not perform any calibration of the SVM parameters.

Backtesting and comparison

As presented in the last discussion on the regression, we have to build a cross-validation procedure to optimize the SVM parameters. We follow the traditional routine by dividing the data into three independent sets: (i) a training set, (ii) a validation set and (iii) a testing set. The classifier is obtained on the training set, whereas its optimal parameters (C, σ) are obtained by minimizing the fitting error on the validation set. The efficiency of the SVM algorithm is finally checked on the testing set. We summarize the cross-validation procedure in the algorithm below. In order to make the training set close to both the validation data and the testing data, we decide to divide the data in the following time order: validation set, training set and testing set. In this way, the prediction score on the testing set contains more information from the recent past. We now employ this procedure to compute the SVM score on the universe of stocks of the Eurostoxx index. Figure 3.25 presents the construction of the score based on the training set and validation set. The SVM parameters are optimized on

Algorithm 5 SVM score construction
procedure SVM_Score(X, y)
  Divide the data into training set Dtrain, validation set Dvalid and testing set Dtest
  Classify the training data using the high/low performance criterion
  Compute the decision boundary on Dtrain
  Construct the SVM score on Dvalid by using the distance to the decision boundary
  Compute the prediction and classification error Error(σ, C) on Dvalid
  Minimize Error(σ, C) and obtain the optimal parameters (σ*, C*)
  Use the optimal parameters to compute the final SVM score on the testing set Dtest
end procedure

the validation set, while the final score construction uses both the training and validation sets in order to have the largest data ensemble.

Figure 3.25: SVM scores for Eurostoxx data with d = 20 factors [plot omitted]


3.7 Conclusion

Support vector machines are a well-established method widely used in various domains. From the financial point of view, this method can be used to recognize and to predict high-performance stocks. Hence, SVM is a good indicator for building efficient trading strategies over a universe of stocks. Within this paper, we first revisited the basic idea of SVM in both the classification and regression contexts.


The extension to the multi-classification case is also discussed in detail, and various applications of this technique were introduced. The first class of applications employs SVM as a forecasting method for time series. We proposed two applications: the first consists of using SVM as a signal filter; the advantage of this method is that we can calibrate the model parameters using only the available data. The second employs SVM as a multi-factor regression technique, which allows us to refine the prediction with additional inputs such as economic factors. For the second class of applications, we deal with SVM classification. The two main applications discussed in the scope of this paper are score construction and sector recognition. Both types of information are important for building momentum strategies, which are at the core of modern asset management.


Bibliography

[1] Allwein E.L. et al. (2000), Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers, Journal of Machine Learning Research, 1, pp. 113-141.
[2] At A. (2005), Optimisation d'un Score de Stock Screening, Rapport de stage ENSAE, Société Générale Asset Management.
[3] Basak D., Pal S. and Patranabis D.J. (2007), Support Vector Regression, Neural Information Processing, 11, pp. 203-224.
[4] Ben-Hur A. and Weston J. (2010), A User's Guide to Support Vector Machines, Methods in Molecular Biology, 609, pp. 223-239.
[5] Burges C.J.C. (1998), A Tutorial on Support Vector Machines for Pattern Recognition, Data Mining and Knowledge Discovery, 2, pp. 121-167.
[6] Chapelle O. (2002), Support Vector Machines: Induction Principles, Adaptive Tuning and Prior Knowledge, PhD thesis, University of Paris 6.
[7] Chapelle O. et al. (2002), Choosing Multiple Parameters for Support Vector Machines, Machine Learning, 46, pp. 131-159.
[8] Chapelle O. (2007), Training a Support Vector Machine in the Primal, Neural Computation, 19, pp. 1155-1178.
[9] Cortes C. and Vapnik V. (1995), Support-Vector Networks, Machine Learning, 20, pp. 273-297.
[10] Crammer K. and Singer Y. (2001), On the Algorithmic Implementation of Multiclass Kernel-based Vector Machines, Journal of Machine Learning Research, 2, pp. 265-292.
[11] Gestel T.V. et al. (2001), Financial Time Series Prediction Using Least Squares Support Vector Machines Within the Evidence Framework, IEEE Transactions on Neural Networks, 12, pp. 809-820.
[12] Madzarov G. et al. (2009), A Multi-class SVM Classifier Utilizing Binary Decision Tree, Informatica, 33, pp. 233-241.


[13] Milgram J. et al. (2006), "One Against One" or "One Against All": Which One is Better for Handwriting Recognition with SVMs?, Tenth International Workshop on Frontiers in Handwriting Recognition.
[14] Potluru V.K. et al. (2009), Efficient Multiplicative Updates for Support Vector Machines, Proceedings of the 2009 SIAM Conference on Data Mining.
[15] Simon G. (2005), L'Économétrie Non Linéaire en Gestion Alternative, Rapport de stage ENSAE, Société Générale Asset Management.
[16] Tay F.E.H. and Cao L.J. (2002), Modified Support Vector Machines in Financial Time Series Forecasting, Neurocomputing, 48, pp. 847-861.
[17] Tsochantaridis I. et al. (2004), Support Vector Machine Learning for Interdependent and Structured Output Spaces, Proceedings of the 21st International Conference on Machine Learning, Banff, Canada.
[18] Vapnik V. (1998), Statistical Learning Theory, John Wiley and Sons, New York.


Conclusions

During my internship in the R&D team of Lyxor Asset Management, I had the chance to work on many interesting topics concerning quantitative asset management. Beyond this report, the results obtained during the stay have been employed for the 8th issue of the Lyxor White Paper series, on the trend filtering technique. The main results of this internship can be divided into two main lines. The first consists of improving the trend and volatility estimations, which are important quantities for implementing dynamical strategies. The second concerns the application of machine learning techniques in finance: we employ the support vector machine (SVM) for forecasting the expected return of financial assets and as a criterion for stock selection.

In the first part, we focused on improving the trend and volatility estimations in order to implement two crucial momentum strategies: trend-following and voltarget. We showed that L1 filters can forecast the trend of the market in a very simple way. We also proposed a cross-validation procedure to calibrate the optimal regularization parameter λ, where the only information to provide is the investment time horizon. More sophisticated models based on local and global trends were also discussed; these models can capture the mean reversion toward the global trend of the market. Finally, we considered several backtests on the S&P 500 index and obtained competitive results with respect to the traditional moving-average filter. On the other hand, voltarget strategies are an efficient way to control risk when building trading strategies, so a good volatility estimator is essential from this perspective. In this report, we presented improvements in volatility forecasting based on several novel techniques.
The use of high and low prices matters less for an index, where it gives more or less the same result as the traditional close-to-close estimator. However, for individual stocks with higher volatility levels, the high-low estimators improve the volatility prediction. We considered several backtests on the S&P 500 index and obtained competitive results with respect to the traditional moving-average estimator of volatility. Indeed, we considered a simple stochastic volatility model which permits integrating the dynamics of the volatility into the estimator. An optimization scheme via maximum likelihood allows us to obtain the optimal averaging window dynamically. We also compared these results for the range-based estimators with the well-known IGARCH model. The comparison of the optimal values of the likelihood functions for the various estimators


also gives us a ranking of the estimation errors. Finally, we studied high-frequency volatility estimation, which is a very active topic in financial mathematics. Using the simple model proposed by Zhang et al. (2005), we showed that the microstructure noise can be eliminated by the two-time-scale estimator.

Support vector machine is a well-established method widely used in various domains. From a financial point of view, it can be used to recognize and to predict high-performance stocks; the SVM score is thus a good indicator for building efficient trading strategies over a universe of stocks. Within the second part of this report, we first revisited the basic idea of SVM in both the classification and regression contexts. The extension to the multi-classification case was also discussed in detail, along with various applications of this technique. The first class of applications employs SVM as a forecasting method for time series. We proposed two applications: the first consists of using SVM as a signal filter, whose advantage is that the model parameters can be calibrated using only the available data; the second employs SVM as a multi-factor regression technique, which allows us to refine the prediction with additional inputs such as economic factors. For the second class of applications, we deal with SVM classification. The two main applications discussed in the scope of this report are score construction and sector recognition. Both types of information are important for building momentum strategies, which play an important role in Lyxor's quantitative management.


Appendix A

Appendix of chapter 1

A.1 Computational aspects of the L1 and L2 filters

A.1.1 The dual problem

The L1−T filter

This problem can be solved by considering the dual problem, which is a QP program. We first rewrite the primal problem with the new variable z = Dx:

$$\min \frac{1}{2} \left\|y - x\right\|_2^2 + \lambda \left\|z\right\|_1 \quad \text{u.c. } z = Dx$$

We now construct the Lagrangian function with the dual variable $\nu \in \mathbb{R}^{n-2}$:

$$L(x, z, \nu) = \frac{1}{2} \left\|y - x\right\|_2^2 + \lambda \left\|z\right\|_1 + \nu^\top (Dx - z)$$

The dual objective function is obtained in the following way:

$$\inf_{x,z} L(x, z, \nu) = -\frac{1}{2} \nu^\top D D^\top \nu + y^\top D^\top \nu$$

for $-\lambda \mathbf{1} \le \nu \le \lambda \mathbf{1}$. According to the Kuhn-Tucker theorem, the initial problem is equivalent to the dual problem:

$$\min \frac{1}{2} \nu^\top D D^\top \nu - y^\top D^\top \nu \quad \text{u.c. } -\lambda \mathbf{1} \le \nu \le \lambda \mathbf{1}$$

This QP program can be solved by a traditional Newton algorithm or by interior-point methods, and the final solution of the trend reads $x^\star = y - D^\top \nu$.


The L1−C filter

The optimization procedure for the L1−C filter follows the same strategy as for the L1−T filter. We obtain the same quadratic program, with the operator D replaced by the (n−1)×n matrix which is the discrete version of the first-order derivative:

$$D = \begin{pmatrix} -1 & 1 & & & \\ & -1 & 1 & & \\ & & \ddots & \ddots & \\ & & & -1 & 1 \end{pmatrix}$$

The L1−TC filter

In order to follow the same strategy presented above, we introduce two additional variables z₁ = D₁x and z₂ = D₂x. The initial problem becomes:

$$\min \frac{1}{2} \left\|y - x\right\|_2^2 + \lambda_1 \left\|z_1\right\|_1 + \lambda_2 \left\|z_2\right\|_1 \quad \text{u.c. } z_1 = D_1 x,\ z_2 = D_2 x$$

The Lagrangian function with the dual variables $\nu_1 \in \mathbb{R}^{n-1}$ and $\nu_2 \in \mathbb{R}^{n-2}$ is:

$$L(x, z_1, z_2, \nu_1, \nu_2) = \frac{1}{2} \left\|y - x\right\|_2^2 + \lambda_1 \left\|z_1\right\|_1 + \lambda_2 \left\|z_2\right\|_1 + \nu_1^\top (D_1 x - z_1) + \nu_2^\top (D_2 x - z_2)$$

whereas the dual objective function is:

$$\inf_{x,z_1,z_2} L(x, z_1, z_2, \nu_1, \nu_2) = -\frac{1}{2} \left\|D_1^\top \nu_1 + D_2^\top \nu_2\right\|_2^2 + y^\top \left(D_1^\top \nu_1 + D_2^\top \nu_2\right)$$

for $-\lambda_i \mathbf{1} \le \nu_i \le \lambda_i \mathbf{1}$ (i = 1, 2). Introducing the variables z = (z₁, z₂) and ν = (ν₁, ν₂), the initial problem is equivalent to the dual problem:

$$\min \frac{1}{2} \nu^\top Q \nu - R^\top \nu \quad \text{u.c. } -\nu^+ \le \nu \le \nu^+$$

with $D = \begin{pmatrix} D_1 \\ D_2 \end{pmatrix}$, $Q = DD^\top$, $R = Dy$ and $\nu^+ = \begin{pmatrix} \lambda_1 \mathbf{1} \\ \lambda_2 \mathbf{1} \end{pmatrix}$. The solution of the primal problem is then given by $x^\star = y - D^\top \nu$.

The L1−T multivariate filter

As in the univariate case, this problem can be solved by considering the dual problem, which is a QP program. The primal problem is:

$$\min \frac{1}{2} \sum_{i=1}^m \left\|y^{(i)} - x\right\|_2^2 + \lambda \left\|z\right\|_1 \quad \text{u.c. } z = Dx$$

Let us define $\bar y = (\bar y_t)$ with $\bar y_t = m^{-1} \sum_{i=1}^m y_t^{(i)}$. The dual objective function becomes:

$$\inf_{x,z} L(x, z, \nu) = -\frac{1}{2} \nu^\top D D^\top \nu + \bar y^\top D^\top \nu + \frac{1}{2} \sum_{i=1}^m \left(y^{(i)} - \bar y\right)^\top \left(y^{(i)} - \bar y\right)$$

for $-\lambda \mathbf{1} \le \nu \le \lambda \mathbf{1}$. According to the Kuhn-Tucker theorem, the initial problem is equivalent to the dual problem:

$$\min \frac{1}{2} \nu^\top D D^\top \nu - \bar y^\top D^\top \nu \quad \text{u.c. } -\lambda \mathbf{1} \le \nu \le \lambda \mathbf{1}$$

This QP program can be solved by a traditional Newton algorithm or by interior-point methods, and the solution is $x^\star = \bar y - D^\top \nu$.

A.1.2 The interior-point algorithm

We briefly present the interior-point algorithm of Boyd and Vandenberghe (2009) in the case of the following optimization problem:

$$\min f_0(x) \quad \text{u.c. } Ax = b,\ f_i(x) < 0 \text{ for } i = 1, \dots, m$$

where $f_0, \dots, f_m : \mathbb{R}^n \to \mathbb{R}$ are convex and twice continuously differentiable, and $\operatorname{rank}(A) = p < n$. The inequality constraints become implicit if one rewrites the problem as:

$$\min f_0(x) + \sum_{i=1}^m I_-(f_i(x)) \quad \text{u.c. } Ax = b$$

where $I_- : \mathbb{R} \to \mathbb{R}$ is the non-positive indicator function, $I_-(u) = 0$ if $u \le 0$ and $I_-(u) = \infty$ if $u > 0$. This indicator function is discontinuous, hence the Newton method cannot be applied. In order to overcome this problem, we approximate $I_-(u)$ by the logarithmic barrier function $I_-^\star(u) = -\tau^{-1} \ln(-u)$ with $\tau \to \infty$. Finally, the Kuhn-Tucker condition for this approximate problem gives $r_\tau(x, \lambda, \nu) = 0$ with:

$$r_\tau(x, \lambda, \nu) = \begin{pmatrix} \nabla f_0(x) + \nabla f(x)^\top \lambda + A^\top \nu \\ -\operatorname{diag}(\lambda)\, f(x) - \tau^{-1} \mathbf{1} \\ Ax - b \end{pmatrix}$$


The solution of $r_\tau(x, \lambda, \nu) = 0$ can be obtained by Newton's iteration for the triple $y = (x, \lambda, \nu)$:

$$r_\tau(y + \Delta y) \simeq r_\tau(y) + \nabla r_\tau(y)\, \Delta y = 0$$

This equation gives the Newton step $\Delta y = -\nabla r_\tau(y)^{-1} r_\tau(y)$, which defines the search direction.

A.1.3 The scaling of the smoothing parameter of the L1 filter

We can try to estimate the order of magnitude of the parameter $\lambda_{\max}$ by considering the continuous case, assuming that the signal is a Wiener process $W_t$. The value of $\lambda_{\max}$ in the discrete case is defined by:

$$\lambda_{\max} = \left\|\left(D D^\top\right)^{-1} D y\right\|_\infty$$

It can be considered as the first primitive $I_1(T) = \int_0^T W_t \, dt$ of the process $W_t$ if $D = D_1$ (L1−C filtering), or the second primitive $I_2(T) = \int_0^T \int_0^t W_s \, ds \, dt$ of $W_t$ if $D = D_2$ (L1−T filtering). We have:

$$I_1(T) = \int_0^T W_t \, dt = W_T T - \int_0^T t \, dW_t = \int_0^T (T - t) \, dW_t$$

The process $I_1(T)$ is a Wiener integral (a Gaussian process) with variance:

$$\mathbb{E}\left[I_1^2(T)\right] = \int_0^T (T - t)^2 \, dt = \frac{T^3}{3}$$

In this case, we expect that $\lambda_{\max} \sim T^{3/2}$. The second-order primitive can be calculated in the following way:

$$I_2(T) = \int_0^T I_1(t) \, dt = I_1(T)\, T - \int_0^T t \, dI_1(t) = I_1(T)\, T - \int_0^T t\, W_t \, dt = I_1(T)\, T - \frac{T^2}{2} W_T + \int_0^T \frac{t^2}{2} \, dW_t = \int_0^T \left(\frac{T^2}{2} - Tt + \frac{t^2}{2}\right) dW_t = \frac{1}{2} \int_0^T (T - t)^2 \, dW_t$$

This quantity is again a Gaussian process with variance:

$$\mathbb{E}\left[I_2^2(T)\right] = \frac{1}{4} \int_0^T (T - t)^4 \, dt = \frac{T^5}{20}$$

In this case, we expect that $\lambda_{\max} \sim T^{5/2}$.

A.1.4 Calibration of the L2 filter

We discuss here how to calibrate the L2 filter in order to extract the trend with respect to the investment time horizon T. Though the L2 filter admits an explicit solution, which is a great advantage for numerical implementation, the calibration of the smoothing parameter λ is not trivial. We propose to calibrate the L2 filter by comparing its spectral density with the one obtained with the moving-average filter. For this last filter, we have:

$$\hat x_t^{\mathrm{MA}} = \frac{1}{T} \sum_{i=t-T}^{t-1} y_i$$

It comes that the spectral density is:

$$f^{\mathrm{MA}}(\omega) = \frac{1}{T^2} \left| \sum_{t=0}^{T-1} e^{-i\omega t} \right|^2$$

For the L2 filter, we know that the solution is $\hat x^{\mathrm{HP}} = \left(1 + 2\lambda D^\top D\right)^{-1} y$. Therefore, the spectral density is:

$$f^{\mathrm{HP}}(\omega) = \left(1 + 4\lambda \left(3 - 4\cos\omega + \cos 2\omega\right)\right)^{-2} \simeq \left(1 + 2\lambda \omega^4\right)^{-2}$$

The width of the spectral density for the L2 filter is then $(2\lambda)^{-1/4}$, whereas it is $2\pi T^{-1}$ for the moving-average filter. Calibrating the L2 filter can be done by matching these two quantities. Finally, we obtain the following relationship:

$$\lambda \propto \lambda^\star = \frac{1}{2}\left(\frac{T}{2\pi}\right)^4$$

In Figure A.1, we represent the spectral density of the moving-average filter for different windows T. We also report the spectral density of the corresponding L2 filters, for which we calibrated the optimal parameter $\lambda^\star$ by least-squares minimization. In Figure A.2, we compare the optimal estimator $\lambda^\star$ with the one corresponding to $10.27 \times \lambda^\star$. We notice that the approximation is very good.
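A minimal sketch of the L2 filter $\hat x = (1 + 2\lambda D^\top D)^{-1} y$ together with the calibration rule above; dense linear algebra is used for clarity (band solvers would be used in practice).

```python
import math

# L2 (Hodrick-Prescott-type) filter and the calibration rule
# lam_star = (1/2) * (T / (2*pi))**4 from the text.

def second_difference(n):
    D = [[0.0] * n for _ in range(n - 2)]
    for i in range(n - 2):
        D[i][i], D[i][i + 1], D[i][i + 2] = 1.0, -2.0, 1.0
    return D

def solve(A, rhs):
    """Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(M[i][k]))
        M[k], M[p] = M[p], M[k]
        for i in range(k + 1, n):
            f = M[i][k] / M[k][k]
            for j in range(k, n + 1):
                M[i][j] -= f * M[k][j]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        x[k] = (M[k][n] - sum(M[k][j] * x[j] for j in range(k + 1, n))) / M[k][k]
    return x

def l2_filter(y, lam):
    n = len(y)
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for r in second_difference(n):
        for i in range(n):
            if r[i] == 0.0:
                continue
            for j in range(n):
                A[i][j] += 2.0 * lam * r[i] * r[j]   # A = I + 2*lam * D'D
    return solve(A, y)

def lambda_star(T):
    return 0.5 * (T / (2.0 * math.pi)) ** 4
```

As λ → 0 the filter returns the raw signal, while for very large λ the output converges to the best affine fit (vanishing second differences).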


Figure A.1: Spectral density of moving-average and L2 filters

Figure A.2: Relationship between the value of λ and the length of the moving-average filter



A.1.5 Implementation issues

The computational time may be large when working with dense matrices, even with interior-point algorithms. It can be reduced by using sparse matrices, but the most efficient way to optimize the implementation is to exploit band matrices. Moreover, we have to solve a large linear system at each iteration. Depending on the filtering problem (L1−T, L1−C and L1−TC filters), the system is 6-banded or 3-banded, but always symmetric. For computing λmax, one may remark that it is equivalent to solving a banded system which is positive definite. We suggest adapting the algorithms in order to take all these properties into account.
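For the 3-band symmetric positive-definite case, the classic Thomas algorithm solves the system in O(n) rather than O(n³); a minimal sketch:

```python
def thomas(a, b, c, d):
    """Solve a tridiagonal system with sub-diagonal a, diagonal b,
    super-diagonal c and right-hand side d, in O(n).
    a[0] and c[-1] are unused.  No pivoting is performed, which is
    safe for the symmetric positive-definite band systems discussed here."""
    n = len(d)
    cp, dp = [0.0] * n, [0.0] * n
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]       # eliminated pivot
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = [0.0] * n
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):        # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

The 6-band systems of the L1−T and L1−TC filters admit the same forward-elimination/back-substitution treatment with a wider stencil.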


Appendix B

Appendix of chapter 2

B.1 Estimators of volatility

B.1.1 Estimation with realized returns

We consider only one return; the estimator of volatility can then be obtained as follows:

$$R_{t_i}^2 = \left(\ln S_{t_i} - \ln S_{t_{i-1}}\right)^2 = \left( \int_{t_{i-1}}^{t_i} \sigma_u \, dW_u + \int_{t_{i-1}}^{t_i} \left(\mu_u - \frac{1}{2}\sigma_u^2\right) du \right)^2$$

The conditional expectation with respect to the couple $(\sigma_u, \mu_u)$, which are supposed to be independent of $dW_u$, is given by:

$$\mathbb{E}\left[R_{t_i}^2 \mid \sigma, \mu\right] = \int_{t_{i-1}}^{t_i} \sigma_u^2 \, du + \left( \int_{t_{i-1}}^{t_i} \left(\mu_u - \frac{1}{2}\sigma_u^2\right) du \right)^2$$

which is approximately equal to:

$$(t_i - t_{i-1})\, \sigma_{t_{i-1}}^2 + (t_i - t_{i-1})^2 \left(\mu_{t_{i-1}} - \frac{1}{2}\sigma_{t_{i-1}}^2\right)^2$$

The variance of this estimator characterizes the error and reads:

$$\operatorname{var}\left[R_{t_i}^2 \mid \sigma, \mu\right] = \operatorname{var}\left[ \left( \int_{t_{i-1}}^{t_i} \sigma_u \, dW_u + \int_{t_{i-1}}^{t_i} \left(\mu_u - \frac{1}{2}\sigma_u^2\right) du \right)^2 \;\Bigg|\; \sigma, \mu \right]$$

The conditional law of $\int_{t_{i-1}}^{t_i} \sigma_u \, dW_u + \int_{t_{i-1}}^{t_i} \left(\mu_u - \frac{1}{2}\sigma_u^2\right) du$ given $\sigma$ and $\mu$ is Gaussian, with mean $\int_{t_{i-1}}^{t_i} \left(\mu_u - \frac{1}{2}\sigma_u^2\right) du$ and variance $\int_{t_{i-1}}^{t_i} \sigma_u^2 \, du$. Therefore, we obtain the variance of the estimator:

$$\operatorname{var}\left[R_{t_i}^2 \mid \sigma, \mu\right] = 2\left( \int_{t_{i-1}}^{t_i} \sigma_u^2 \, du \right)^2 + 4\left( \int_{t_{i-1}}^{t_i} \left(\mu_u - \frac{1}{2}\sigma_u^2\right) du \right)^2 \left( \int_{t_{i-1}}^{t_i} \sigma_u^2 \, du \right) \tag{B.1}$$

which is approximately equal to:

$$2\,(t_i - t_{i-1})^2\, \sigma_{t_{i-1}}^4 + 4\,(t_i - t_{i-1})^3\, \sigma_{t_{i-1}}^2 \left(\mu_{t_{i-1}} - \frac{1}{2}\sigma_{t_{i-1}}^2\right)^2$$

We remark that when the time step $(t_i - t_{i-1})$ becomes small, the estimator becomes unbiased, with standard deviation $\sqrt{2}\,(t_i - t_{i-1})\, \sigma_{t_{i-1}}^2$. This error is directly proportional to the quantity to be estimated. In order to estimate the average variance between $t_0$ and $t_n$, or the approximate volatility at $t_n$, we can employ the canonical estimator:

$$\sum_{i=1}^n R_{t_i}^2 = \sum_{i=1}^n \left(\ln S_{t_i} - \ln S_{t_{i-1}}\right)^2$$

The expected value of this estimator reads:

$$\mathbb{E}\left[\sum_{i=1}^n R_{t_i}^2 \;\Big|\; \sigma, \mu\right] = \int_{t_0}^{t_n} \sigma_u^2 \, du + \sum_{i=1}^n \left( \int_{t_{i-1}}^{t_i} \left(\mu_u - \frac{1}{2}\sigma_u^2\right) du \right)^2$$

We observe that this estimator is weakly biased; however, this effect is totally negligible. If we consider a volatility of 20% with a trend of 10%, the estimated volatility is 20.006% instead of 20%. The variance of the canonical estimator (the estimation error) reads:

$$\sum_{i=1}^n \left[ 2\left( \int_{t_{i-1}}^{t_i} \sigma_u^2 \, du \right)^2 + 4\left( \int_{t_{i-1}}^{t_i} \left(\mu_u - \frac{1}{2}\sigma_u^2\right) du \right)^2 \left( \int_{t_{i-1}}^{t_i} \sigma_u^2 \, du \right) \right]$$

which can be roughly estimated by:

$$\sum_{i=1}^n 2\left( \int_{t_{i-1}}^{t_i} \sigma_u^2 \, du \right)^2 \approx 2\sigma^4 \sum_{i=1}^n (t_i - t_{i-1})^2$$

If the recorded times $t_i$ are regularly spaced with time step $\Delta t$, then we have:

$$\operatorname{var}\left[\sum_{i=1}^n R_{t_i}^2 \;\Big|\; \sigma, \mu\right] \approx 2\sigma^4\, (t_n - t_0)\, \Delta t$$
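A quick simulation check of the canonical estimator; the GBM parameters, sample size and seed are illustrative assumptions.

```python
import math
import random

# On a geometric Brownian motion with constant mu and sigma, the
# canonical estimator sqrt( sum (log-returns)^2 / (n*dt) ) should
# recover sigma, up to the (negligible) drift bias discussed above.

def simulate_gbm(s0, mu, sigma, n, dt, rng):
    path = [s0]
    for _ in range(n):
        z = rng.gauss(0.0, 1.0)
        path.append(path[-1] * math.exp((mu - 0.5 * sigma ** 2) * dt
                                        + sigma * math.sqrt(dt) * z))
    return path

def realized_vol(path, dt):
    rv = sum(math.log(b / a) ** 2 for a, b in zip(path, path[1:]))
    return math.sqrt(rv / ((len(path) - 1) * dt))
```

With 20% volatility and a 10% trend, the drift-induced bias is of the 20.006% versus 20% order mentioned in the text, far below the sampling noise.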


Appendix C

Appendix of chapter 3

C.1 Dual problem of the SVM

In the traditional approach, the SVM problem is first mapped to its dual problem, which is then solved by a QP program. We present here the detailed derivation of the dual problem in both the hard-margin and soft-margin SVM cases.

C.1.1 Hard-margin SVM classifier

Let us start with the hard-margin SVM problem for classification:

$$\min_{w,b} \frac{1}{2}\|w\|^2 \quad \text{u.c. } y_i\left(w^\top x_i + b\right) \ge 1,\ i = 1, \dots, n$$

In order to get the dual problem, we construct the Lagrangian for the inequality constraints by introducing positive Lagrange multipliers $\Lambda = (\alpha_1, \dots, \alpha_n) \ge 0$:

$$L(w, b, \Lambda) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^n \alpha_i y_i\left(w^\top x_i + b\right) + \sum_{i=1}^n \alpha_i$$

Minimizing the Lagrangian with respect to $(w, b)$, we obtain the following equations:

$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^n \alpha_i y_i x_i = 0, \qquad \frac{\partial L}{\partial b} = -\sum_{i=1}^n \alpha_i y_i = 0$$

Inserting these results into the Lagrangian, we obtain the dual objective function $L_D$ with respect to the variable Λ:

$$L_D(\Lambda) = \Lambda^\top \mathbf{1} - \frac{1}{2}\Lambda^\top D \Lambda$$

with $D_{ij} = y_i y_j x_i^\top x_j$ and the constraints $\Lambda^\top y = 0$ and $\Lambda \ge 0$. Thanks to the KKT theorem, the initial optimization problem is equivalent to maximizing the dual objective function $L_D(\Lambda)$:

$$\max_\Lambda \Lambda^\top \mathbf{1} - \frac{1}{2}\Lambda^\top D \Lambda \quad \text{u.c. } \Lambda^\top y = 0,\ \Lambda \ge 0$$
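For intuition, the dual can be solved in closed form on a two-point training set: with $y_1 = +1$ and $y_2 = -1$, the constraint $\Lambda^\top y = 0$ forces $\alpha_1 = \alpha_2 = \alpha$, and maximizing $2\alpha - \frac{1}{2}\alpha^2\|x_1 - x_2\|^2$ gives $\alpha^\star = 2/\|x_1 - x_2\|^2$. A minimal sketch (the data points below are illustrative):

```python
# Hard-margin SVM dual in closed form for two points x1 (label +1)
# and x2 (label -1): alpha* = 2 / ||x1 - x2||^2, w = alpha* (x1 - x2),
# and b from the support-vector condition y_i (w.x_i + b) = 1.

def dot(a, b):
    return sum(u * v for u, v in zip(a, b))

def two_point_svm(x1, x2):
    diff = [u - v for u, v in zip(x1, x2)]
    alpha = 2.0 / dot(diff, diff)
    w = [alpha * d for d in diff]
    b = 1.0 - dot(w, x1)          # KKT condition on the support vector x1
    return w, b, alpha
```

Both points are support vectors and lie exactly on the margin, i.e. $y_i(w^\top x_i + b) = 1$.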

C.1.2 Soft-margin SVM classifier

We turn now to the soft-margin SVM classifier with the L1 penalty, i.e. F(u) = u and p = 1. We first write down the primal problem:

$$\min_{w,b,\xi} \frac{1}{2}\|w\|^2 + C \cdot F\left(\sum_{i=1}^n \xi_i^p\right) \quad \text{u.c. } y_i\left(w^\top x_i + b\right) \ge 1 - \xi_i,\ \xi_i \ge 0,\ i = 1, \dots, n$$

We construct the Lagrangian by introducing the couple of Lagrange multipliers $(\Lambda, \mu)$ for the 2n constraints:

$$L(w, b, \xi, \Lambda, \mu) = \frac{1}{2}\|w\|^2 + C \cdot F\left(\sum_{i=1}^n \xi_i\right) - \sum_{i=1}^n \alpha_i\left[y_i\left(w^\top x_i + b\right) - 1 + \xi_i\right] - \sum_{i=1}^n \mu_i \xi_i$$

with the constraints $\Lambda \ge 0$ and $\mu \ge 0$ on the Lagrange multipliers. Minimizing the Lagrangian with respect to $(w, b, \xi)$ gives us:

$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^n \alpha_i y_i x_i = 0, \qquad \frac{\partial L}{\partial b} = -\sum_{i=1}^n \alpha_i y_i = 0, \qquad \frac{\partial L}{\partial \xi} = C - \Lambda - \mu = 0$$

Inserting these results into the Lagrangian leads to the dual problem:

$$\max_\Lambda \Lambda^\top \mathbf{1} - \frac{1}{2}\Lambda^\top D \Lambda \quad \text{u.c. } \Lambda^\top y = 0,\ 0 \le \Lambda \le C\mathbf{1} \tag{C.1}$$


C.1.3 ε-SV regression

We study here the ε-SV regression. We first write down the primal problem with all its constraints:

$$\min_{w,b,\xi,\xi'} \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \left(\xi_i + \xi_i'\right)$$

$$\text{u.c. } y_i - w^\top \phi(x_i) - b \le \varepsilon + \xi_i, \quad w^\top \phi(x_i) + b - y_i \le \varepsilon + \xi_i', \quad \xi_i \ge 0,\ \xi_i' \ge 0,\ i = 1, \dots, n$$

In this case, we have 4n inequality constraints. Hence, we construct the Lagrangian by introducing the positive Lagrange multipliers $(\Lambda, \Lambda', \mu, \mu')$, with $\Lambda = (\alpha_i)_{i=1,\dots,n}$ and $\Lambda' = (\beta_i)_{i=1,\dots,n}$:

$$L(w, b, \xi, \xi', \Lambda, \Lambda', \mu, \mu') = \frac{1}{2}\|w\|^2 + C \sum_{i=1}^n \left(\xi_i + \xi_i'\right) - \sum_{i=1}^n \mu_i \xi_i - \sum_{i=1}^n \mu_i' \xi_i' - \sum_{i=1}^n \alpha_i\left(\varepsilon + \xi_i + w^\top \phi(x_i) + b - y_i\right) - \sum_{i=1}^n \beta_i\left(\varepsilon + \xi_i' + y_i - w^\top \phi(x_i) - b\right)$$

with all multipliers $\Lambda, \Lambda', \mu, \mu' \ge 0$. Minimizing the Lagrangian with respect to $(w, b, \xi, \xi')$ gives us:

$$\frac{\partial L}{\partial w} = w - \sum_{i=1}^n (\alpha_i - \beta_i)\, \phi(x_i) = 0, \qquad \frac{\partial L}{\partial b} = \sum_{i=1}^n (\beta_i - \alpha_i) = 0$$

$$\frac{\partial L}{\partial \xi} = C\mathbf{1} - \Lambda - \mu = 0, \qquad \frac{\partial L}{\partial \xi'} = C\mathbf{1} - \Lambda' - \mu' = 0$$

Inserting these results into the Lagrangian leads to the dual problem:

$$\max_{\Lambda, \Lambda'} \left(\Lambda - \Lambda'\right)^\top y - \varepsilon\left(\Lambda + \Lambda'\right)^\top \mathbf{1} - \frac{1}{2}\left(\Lambda - \Lambda'\right)^\top K \left(\Lambda - \Lambda'\right) \tag{C.2}$$

$$\text{u.c. } \left(\Lambda - \Lambda'\right)^\top \mathbf{1} = 0, \quad 0 \le \Lambda, \Lambda' \le C\mathbf{1}$$

When ε = 0, the term $\varepsilon(\Lambda + \Lambda')^\top \mathbf{1}$ in the objective function disappears; we can then reduce the optimization problem by the change of variable $(\Lambda - \Lambda') \to \Lambda$. The inequality constraint for the new variable reads $|\Lambda| \le C\mathbf{1}$.


The dual problem can be solved by a QP program, which gives the optimal solution $\Lambda^\star$. In order to compute b, we use the KKT conditions:

$$\alpha_i\left(\varepsilon + \xi_i + w^\top \phi(x_i) + b - y_i\right) = 0, \qquad \beta_i\left(\varepsilon + \xi_i' + y_i - w^\top \phi(x_i) - b\right) = 0$$

$$(C - \alpha_i)\, \xi_i = 0, \qquad (C - \beta_i)\, \xi_i' = 0$$

We remark that the two last conditions give $\xi_i = 0$ for $0 < \alpha_i < C$ and $\xi_i' = 0$ for $0 < \beta_i < C$. This result directly implies the following condition (taking ε = 0 as above) for all support vectors $(x_i, y_i)$ of the training set:

$$w^\top \phi(x_i) + b - y_i = 0$$

We denote by SV the set of support vectors. Using the condition $w = \sum_{i=1}^n (\alpha_i - \beta_i)\, \phi(x_i)$ and averaging over the support vectors, we finally obtain:

$$b = \frac{1}{n_{\mathrm{SV}}} \sum_{i \in \mathrm{SV}} \left(y_i - z_i\right)$$

with $z = K(\Lambda - \Lambda')$.

C.2 Newton optimization for the primal problem

We consider here the Newton optimization scheme for solving the unconstrained primal problem:

$$\min_{\beta, b} L_P(\beta, b) = \min_{\beta, b} \frac{1}{2}\beta^\top K \beta + C \sum_{i=1}^n L\left(y_i, K_i^\top \beta + b\right)$$

The required condition of this scheme is that the loss function L(y, t) is differentiable. We first study the case of the quadratic loss, where L(y, t) is differentiable, then the soft-margin case, where L(y, t) has to be regularized.

C.2.1 Quadratic loss function

For the quadratic loss case, the penalty function has a suitable form:

$$L(y_i, f(x_i)) = \max\left(0, 1 - y_i f(x_i)\right)^2$$

This function is differentiable everywhere and its derivative reads:

$$\frac{\partial L(y, t)}{\partial t} = 2y\,(yt - 1)\, \mathbb{1}_{\{yt \le 1\}}$$

However, the second derivative is not defined at the point yt = 1. In order to avoid this problem, we consider L directly as a function of the vector β and perform a quasi-Newton optimization, where the second derivative is replaced by an approximation of the Hessian matrix. With $I^0$ the diagonal indicator matrix of the active (support-vector) points, the gradient of the objective function with respect to the vector $(b, \beta)^\top$ is given by:

$$\nabla L_P = \begin{pmatrix} 2C\,\mathbf{1}^\top I^0 \mathbf{1} & 2C\,\mathbf{1}^\top I^0 K \\ 2C\,K I^0 \mathbf{1} & K + 2C\,K I^0 K \end{pmatrix} \begin{pmatrix} b \\ \beta \end{pmatrix} - 2C \begin{pmatrix} \mathbf{1}^\top I^0 y \\ K I^0 y \end{pmatrix}$$

and the pseudo-Hessian matrix is given by:

$$H = \begin{pmatrix} 2C\,\mathbf{1}^\top I^0 \mathbf{1} & 2C\,\mathbf{1}^\top I^0 K \\ 2C\,K I^0 \mathbf{1} & K + 2C\,K I^0 K \end{pmatrix}$$

The Newton iteration then consists of updating the vector $(b, \beta)^\top$ until convergence:

$$\begin{pmatrix} b \\ \beta \end{pmatrix} \leftarrow \begin{pmatrix} b \\ \beta \end{pmatrix} - \gamma H^{-1} \nabla L_P$$

C.2.2 Soft-margin SVM

For the soft-margin case, the penalty function has the form $L(y_i, f(x_i)) = \max\left(0, 1 - y_i f(x_i)\right)$, which requires a regularization. A differentiable approximation uses the following penalty function:

$$L(y, t) = \begin{cases} 0 & \text{if } yt > 1 + h \\ \dfrac{(1 + h - yt)^2}{4h} & \text{if } |1 - yt| \le h \\ 1 - yt & \text{if } yt < 1 - h \end{cases}$$
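The smoothed penalty can be written directly (a minimal sketch; h is the smoothing width):

```python
# Smoothed (Huberized) hinge loss used to regularize the soft-margin
# penalty: quadratic on the band |1 - yt| <= h, hinge-like outside.

def smoothed_hinge(y, t, h=0.5):
    yt = y * t
    if yt > 1.0 + h:
        return 0.0
    if yt < 1.0 - h:
        return 1.0 - yt
    return (1.0 + h - yt) ** 2 / (4.0 * h)
```

The three branches match at the joins (both the quadratic and linear pieces equal h at yt = 1 − h), so the function is continuously differentiable, and it reduces to the plain hinge as h → 0.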

Published paper in the Lyxor White Paper Series:

Trend Filtering Methods for Momentum Strategies
Lyxor White Paper Series, Issue #8, December 2011
http://www.lyxor.com/fr/publications/white-papers/wp/52/


Benjamin Bruder, Research & Development, Lyxor Asset Management, Paris ([email protected])
Tung-Lam Dao, Research & Development, Lyxor Asset Management, Paris ([email protected])
Jean-Charles Richard, Research & Development, Lyxor Asset Management, Paris ([email protected])
Thierry Roncalli, Research & Development, Lyxor Asset Management, Paris ([email protected])