Article, book and MOOC summaries by Vincent Zoonekynd (2018-01)

Deep reinforcement learning bootcamp
P. Abbeel et al. (2017)

1. Planning, or optimal control, is the search for the optimal policy on a known Markov decision process (MDP). Value iteration iterates the Bellman equation

$$V_0(s) = 0$$
$$V_{k+1}(s) = \max_a \mathop{E}_{s'} \big[ R(s,a,s') + \gamma V_k(s') \big] = \max_a \sum_{s'} P(s'|a,s) \big[ R(s,a,s') + \gamma V_k(s') \big]$$

– Use Huber loss instead of square loss;
– Use RMSProp instead of standard SGD;
– Anneal the exploration rate.

Neural fitted Q-iteration is a batch version of DQN: generate a lot of ε-greedy episodes; fit the resulting target for a while; iterate.

Double DQN uses two sets of weights, to select the best action, and to estimate it (otherwise $\max_a Q_\theta(s,a)$ is biased upwards).

Prioritized experience replay uses the Bellman error as transition weights,

$$r + \gamma \max_{a'} Q_\theta(s',a') - Q_\theta(s,a).$$

In a duelling DQN, the neural network adds structure to the Q-value and separately forecasts state value and advantage, $Q(s,a) = V(s) + A(s,a)$.

Q-value iteration is similar, but computes $Q(s,a)$, the expected value, if we start in state $s$, take action $a$, and act optimally thereafter.

$$Q_{k+1}(s,a) = \sum_{s'} P(s'|a,s) \big[ R(s,a,s') + \gamma \max_{a'} Q_k(s',a') \big]$$

These are just systems of equations but, because of the maximum, they are not linear. If the policy π is fixed, however, it is just a linear system, which can be solved iteratively or directly (policy evaluation).

$$V^\pi(s) = \sum_{s'} P(s'|a,s) \big[ R(s,a,s') + \gamma V^\pi(s') \big], \qquad a = \pi(s)$$

Policy iteration alternates two steps:
– Compute the value $V^{\pi_k}$ of the policy $\pi_k$;
– Improve the policy by taking the best 1-step action

$$\pi_{k+1}(s) = \operatorname*{Argmax}_a \sum_{s'} P(s'|a,s) \big[ R(s,a,s') + \gamma V^{\pi_k}(s') \big]$$
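As an illustration of the planning step, here is a minimal Python sketch of value iteration followed by greedy policy extraction. The two-state MDP, the layout of the transition table `P` and the function names are hypothetical choices made for this example; they do not come from the bootcamp.

```python
# Hypothetical toy MDP: P[s][a] is a list of (probability, next_state, reward).
P = {
    0: {0: [(1.0, 0, 1.0)],   # action 0: stay in state 0, reward 1
        1: [(1.0, 1, 5.0)]},  # action 1: jump to state 1, reward 5
    1: {0: [(1.0, 1, 0.0)]},  # state 1 is absorbing, reward 0
}

def q_value(P, V, s, a, gamma):
    """Expected one-step return of action a in state s, bootstrapping on V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

def value_iteration(P, gamma=0.9, tol=1e-10, max_iter=10_000):
    """Iterate the Bellman optimality equation, starting from V_0 = 0."""
    V = {s: 0.0 for s in P}
    for _ in range(max_iter):
        new_V = {s: max(q_value(P, V, s, a, gamma) for a in P[s]) for s in P}
        delta = max(abs(new_V[s] - V[s]) for s in P)
        V = new_V
        if delta < tol:
            break
    return V

def greedy_policy(P, V, gamma=0.9):
    """Best 1-step action in each state (the policy improvement step)."""
    return {s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma)) for s in P}

V = value_iteration(P)
policy = greedy_policy(P, V)
```

With γ = 0.9, staying in state 0 forever is worth 1/(1 − γ) = 10, which beats the one-off reward of 5, so the greedy policy keeps action 0 in state 0.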

2. If the MDP is not known, we can act at random and estimate the expected Q-value (tabular Q-learning).

$$Q(s,a) \leftarrow (1-\alpha)\, Q(s,a) + \alpha \cdot \text{target}$$
$$\text{target} = R(s,a,s') + \gamma \max_{a'} Q(s',a')$$

This is off-policy learning: the first term comes from the policy actually followed (e.g., ε-greedy), the second from the optimal policy found so far. The learning rate α should satisfy $\sum_t \alpha_t = \infty$, $\sum_t \alpha_t^2 < \infty$.

Approximate Q-learning replaces the table $Q(s,a)$ with a function (e.g., a linear combination of hand-selected features), and a gradient update

$$\theta \leftarrow \theta - \alpha \nabla_\theta \tfrac12 \big( Q_\theta(s,a) - \text{target} \big)^2.$$

3. Deep Q networks (DQN) use a neural network to model $Q_\theta$, but
– The target is not stationary;
– The data is not iid.
To address those concerns:
– Use batches (experience replay);
– Use an older set of weights for the target (target network);
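The tabular Q-learning update of item 2 can be sketched as follows. The four-state chain environment, the purely random behaviour policy and the constant learning rate are assumptions made for this illustration (a constant rate is enough here only because the toy environment is deterministic).

```python
import random

N_STATES, TERMINAL = 4, 3
ACTIONS = (0, 1)  # 0 = left, 1 = right

def step(s, a):
    """Hypothetical deterministic chain: reward 1 when the last state is reached."""
    s2 = max(0, s - 1) if a == 0 else s + 1
    return s2, (1.0 if s2 == TERMINAL else 0.0)

def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        s = 0
        while s != TERMINAL:
            a = rng.choice(ACTIONS)  # behaviour policy: act at random
            s2, r = step(s, a)
            # The target uses the best action found so far (off-policy).
            best_next = 0.0 if s2 == TERMINAL else max(Q[(s2, b)] for b in ACTIONS)
            target = r + gamma * best_next
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
            s = s2
    return Q

Q = q_learning()
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(TERMINAL)}
```

Although the agent only ever acts at random, the learned table approaches the optimal values (0.81, 0.9 and 1 for moving right from states 0, 1 and 2), and the greedy policy moves right everywhere.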

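Noisy nets add noise to the parameters for better exploration.

4. Policy gradient methods directly look for an optimal stochastic policy $\pi_\theta(s)$ (a deterministic policy is a combinatorial object, more difficult to optimize). This is on-policy (less exploration) and less sample-efficient, but can deal with large action spaces (with Q-learning, $\operatorname{Argmax}_a Q(s,a)$ can be difficult to compute).

Consider a whole episode and let $\tau = (s_0, u_0, s_1, u_1, \dots, s_N, u_N)$ be a state-action sequence. We want to maximize the expected utility $U(\theta) = \sum_\tau P_\theta(\tau) R(\tau)$.

$$\begin{aligned}
\nabla_\theta U &= \nabla_\theta \sum_\tau P_\theta(\tau) R(\tau) \\
&= \sum_\tau P_\theta(\tau) \frac{\nabla_\theta P_\theta(\tau)}{P_\theta(\tau)} R(\tau) \\
&= \sum_\tau P_\theta(\tau) \nabla_\theta \log P_\theta(\tau) R(\tau) \\
&= E\big[ \nabla_\theta \log P_\theta(\tau) R(\tau) \big] \\
&\approx \frac1m \sum_i \nabla_\theta \log P_\theta(\tau_i) R(\tau_i)
\end{aligned}$$

$$\begin{aligned}
\nabla_\theta \log P_\theta(\tau) &= \nabla_\theta \log \prod_t P(s_{t+1}|s_t,a_t)\, \pi_\theta(a_t|s_t) \\
&= \sum_t \nabla_\theta \log P(s_{t+1}|s_t,a_t) + \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) \\
&= \sum_t \nabla_\theta \log \pi_\theta(a_t|s_t)
\end{aligned}$$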

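This can also be derived using importance sampling:

$$\begin{aligned}
U(\theta) &= \mathop{E}_{\tau\sim\theta} \big[ R(\tau) \big] = \mathop{E}_{\tau\sim\theta_\text{old}} \left[ \frac{p_\theta(\tau)}{p_{\theta_\text{old}}(\tau)} R(\tau) \right] \\
\nabla U(\theta)\big|_{\theta=\theta_\text{old}} &= \mathop{E}_{\tau\sim\theta_\text{old}} \left[ \frac{\nabla_\theta p_\theta(\tau)|_{\theta=\theta_\text{old}}}{p_{\theta_\text{old}}(\tau)} R(\tau) \right] \\
&= \mathop{E}_{\tau\sim\theta_\text{old}} \big[ \nabla_\theta \log P_\theta(\tau)|_{\theta=\theta_\text{old}}\, R(\tau) \big]
\end{aligned}$$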
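A minimal sketch of the resulting Monte-Carlo gradient estimator, on a deliberately tiny problem: a one-step episode with a Bernoulli policy π_θ(a = 1) = sigmoid(θ) and reward R = a. The problem, the function names and the optional `baseline` argument (a constant subtracted from the reward to reduce variance) are assumptions for this example; the exact gradient is known here, since U(θ) = sigmoid(θ).

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def score_function_gradient(theta, m=20_000, baseline=0.0, seed=0):
    """Estimate dU/dtheta as the average of grad log pi(a) * (R - baseline)."""
    rng = random.Random(seed)
    p = sigmoid(theta)            # pi_theta(a = 1)
    total = 0.0
    for _ in range(m):
        a = 1 if rng.random() < p else 0
        reward = float(a)         # hypothetical reward: 1 if a = 1, else 0
        grad_log_pi = a - p       # d/dtheta log pi_theta(a) for a Bernoulli policy
        total += grad_log_pi * (reward - baseline)
    return total / m

theta = 0.0
exact = sigmoid(theta) * (1 - sigmoid(theta))  # derivative of U(theta) = sigmoid(theta)
estimate = score_function_gradient(theta)
estimate_with_baseline = score_function_gradient(theta, baseline=0.5)
```

At θ = 0, every sample contributes exactly 0.25 once the baseline 0.5 is subtracted, so the estimator has zero variance in this toy case, while the plain estimator fluctuates around 0.25.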


To reduce variance, one can add a baseline

$$\nabla U(\theta) \approx \frac1m \sum_i \nabla_\theta \log P_\theta(\tau_i) R(\tau_i) \approx \frac1m \sum_i \nabla_\theta \log P_\theta(\tau_i) \big( R(\tau_i) - b \big)$$

and discard terms that do not depend on the current action

$$\nabla U(\theta) \approx \frac1m \sum_i \sum_t \nabla_\theta \log \pi_\theta(a_{it}|s_{it}) \big( R(\tau_i) - b \big) = \frac1m \sum_i \sum_t \nabla_\theta \log \pi_\theta(a_{it}|s_{it}) \Big( \sum_{k \geqslant t} R(s_{ik}, a_{ik}) - b \Big).$$

The difference $A_{it} = \sum_{k \geqslant t} R(s_{ik}, a_{ik}) - b$ is the advantage.

Here are possible choices for the baseline:
– Constant: $b = E[R(\tau)] \approx \frac1m \sum_i R(\tau_i)$;
– Optimal constant, $b = \dfrac{\sum_i \big( \nabla \log p(\tau_i) \big)^2 R(\tau_i)}{\sum_i \big( \nabla \log p(\tau_i) \big)^2}$;
– Time-dependent: $b = E[R(\tau_{\geqslant t})] \approx \frac1m \sum_i \sum_{k \geqslant t} R(s_{ik}, a_{ik})$;
– State-dependent (actor-critic – in particular, if the current state is hopeless, regardless of the actions taken, there is no valuable information and the gradient is zero): $b(s_t) = E[r_t + r_{t+1} + \cdots] = V^\pi(s_t)$.

To estimate the value function $V^\pi(s)$, we can use supervised learning, e.g.,
– Monte-Carlo: minimize $\sum \big[ V^\pi_\phi(s_{it}) - \sum_{k \geqslant t} R(s_{ik}, u_{ik}) \big]^2$ (we end up with two neural nets: one for the policy, one for the value function);
– Temporal difference (TD): use the Bellman equation for $V^\pi$,
$$\phi_{i+1} = \operatorname*{Argmin}_\phi \sum \big\| R + \gamma V^\pi_{\phi_i}(s_{t+1}) - V^\pi_\phi(s_t) \big\|_2^2 + \lambda \| \phi - \phi_i \|_2^2$$
(there are now two hyperparameters, γ and λ).

To reduce the variance:
– Discount more;
– Use function approximation:

$$\begin{aligned}
Q^\pi(s,u) &= E[r_0 + \gamma r_1 + \gamma^2 r_2 + \cdots] \\
&= Q_0 \\
&= E[r_0 + \gamma V^\pi(s_1)] \\
&= Q_1 \\
&= E[r_0 + \gamma r_1 + \gamma^2 V^\pi(s_2)] \\
&= Q_2 \\
&= E[r_0 + \gamma r_1 + \cdots + \gamma^4 r_4 + \gamma^5 V^\pi(s_5)].
\end{aligned}$$

The Async advantage actor critic (A3C) uses $Q_5$; the generalized advantage estimation (GAE) averages all of those:

$$Q^\pi = (1-\lambda) \sum_{n \geqslant 0} \lambda^n Q^\pi_n.$$

5. Notice that the policy gradient step is also the gradient step of some optimization problem: we can replace that step with a few iterations of an optimization algorithm solving that problem, with a constraint or penalty to limit the step size.

More precisely, the gradient $E_t[\nabla_\theta \log \pi(a|s) A]$ could come from the surrogate loss $E_t[\log \pi(a|s) A]$ or from the importance sampling loss

$$E_t \left[ \frac{\pi_\theta(a|s)}{\pi_{\theta_\text{old}}(a|s)} A \right].$$

Trust region policy optimization (TRPO) uses a constraint to define the “trust region” (the region where we trust our approximation to be close enough)

$$E\big[ \mathrm{KL}\big( \pi_{\theta_\text{old}}(\cdot|s), \pi_\theta(\cdot|s) \big) \big] \leqslant \delta.$$

In practice, use a linear approximation of the objective, a quadratic approximation of the constraint (the Kullback-Leibler divergence becomes the Fisher information matrix – it is a natural gradient) and solve using the conjugate gradient (Hessian-free optimization).

Proximal policy optimization (PPO) uses a penalty instead of a constraint – but adjusting the level of the penalty is harder than adjusting the size of the trust region – a constant value does not work.

Those methods do not work well with CNNs, RNNs and complicated architectures.

Variants include:
– KFAC: natural policy gradient with a blockwise approximation of the Fisher information matrix;
– ACKTR: A2C + KFAC natural gradient;
– PPO with clipped objective.

6. Practical advice:
– Use several baselines: cross-entropy, policy gradient, Q-learning, SARSA;
– Use more samples per batch;
– Check the sensitivity to all parameters – the method should be robust;
– Try different random seeds;
– Standardize: $\operatorname{clip}\big( (x-\mu)/\sigma, -10, 10 \big)$;
– Use $\mathrm{KL}(\pi_\text{old}, \pi_\text{new})$ as diagnostic; spikes indicate a drop in performance.

7. When computing the gradient of an expectation, the parameter can be in the distribution, or in the integrand, or in both.

$$\nabla_\theta \mathop{E}_{x \sim p_\theta} \big[ f(x) \big] = E\big[ f(x) \nabla_\theta \log p_\theta(x) \big]$$
$$\nabla_\theta \mathop{E}_{x \sim N(0,1)} \big[ f_\theta(x) \big] = E\big[ \nabla_\theta f_\theta(x) \big]$$
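The two identities for the gradient of an expectation (item 7) can be compared numerically. In this sketch, the test function f(x) = x², the distribution N(θ, 1) and all names are assumptions chosen because the exact answer is known: E[x²] = θ² + 1, so the gradient is 2θ. The first estimator keeps θ in the distribution (score function); the second moves it into the integrand by writing x = θ + ε with ε ∼ N(0, 1).

```python
import random

def gradient_estimates(theta, m=100_000, seed=0):
    """Return (score-function estimate, pathwise estimate) of d/dtheta E[x^2]."""
    rng = random.Random(seed)
    score_total = 0.0
    pathwise_total = 0.0
    for _ in range(m):
        eps = rng.gauss(0.0, 1.0)
        x = theta + eps
        # Parameter in the distribution: f(x) * d/dtheta log N(x; theta, 1).
        score_total += (x * x) * (x - theta)
        # Parameter in the integrand: d/dtheta (theta + eps)^2.
        pathwise_total += 2.0 * x
    return score_total / m, pathwise_total / m

theta = 1.5
score_estimate, pathwise_estimate = gradient_estimates(theta)
exact = 2.0 * theta
```

Both estimators are unbiased, but on this example the second one has a much smaller variance (4, against roughly 52 for the first).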

A stochastic computation graph is a DAG some of whose nodes are stochastic, i.e., of the form “sample $x$ from $p_\text{input}$” instead of “$x = f(\text{input})$”; we want the gradient of the expectation of the final output. Examples include dropout or hard attention.

8. The cross-entropy method (CEM, closely related to CMA-ES) works well for Tetris (for a small number of parameters: 22). It generalizes the finite difference method, which corresponds to a 2-element population defined with mirrored (antithetic) noise.

In practice, transform the reward $R(\tau)$ to its percentile or to an indicator variable for the best k% policies.

Derivative-free methods are not sample-efficient, but very easy to parallelize.

9. Model-based reinforcement learning is sample-efficient, and can even be used on real (physical) robots: model $p(s'|s,a)$ using Gaussian processes, neural nets or Gaussian mixture models; optimize $\pi_\theta(a|s)$; iterate a few times (otherwise the distribution of states can be very different).

The global model (GP, NN, GMM) can be used as a prior, with a local (Bayesian linear) model in the neighbourhood of the current trajectory, using a constraint $\mathrm{KL}\big( p(\tau) \,\|\, p_\text{old}(\tau) \big) < \varepsilon$.

For high-dimensional data (e.g., the goal could be specified by an image, with no explicit reward), learn in latent space.

10a. If an agent has rational preferences (i.e., it cannot be taken advantage of), its preferences can be summarized by a utility function, and its behaviour described by expected utility maximization (von Neumann–Morgenstern theorem).

Humans are not rational (Allais’s paradox).

10b. Inverse reinforcement learning (IRL) infers a reward function from demonstrations – but it is not uniquely defined, and the demonstrations are not truly optimal. Do not learn a single reward function, but a distribution of reward functions – the maximum entropy one (the partition function is problematic, but the corresponding gradient descent algorithm can be seen as a generative adversarial network (GAN)).

11. The main challenges in RL are sample efficiency and exploration (we are still relying on random actions).

Distributional RL (C51) does not learn a value function but value distributions.

Unsupervised RL deals with sparse rewards by learning more about the environment through auxiliary prediction and control tasks (e.g., predicting that some pixels will change colours, or inducing those changes – for images, use CNN features instead of pixels).

Imagination-augmented agents learn an imperfect model, and use samples from that model as additional trajectories to learn from.

Hierarchical RL improves exploration by learning subtasks and combining them to reach the desired goal. Feudal networks use two agents, a manager, who decides on the subtasks (and is rewarded if they help get close to the goal), and a worker, who tries to perform the subtasks (and is not aware of the ultimate goal).

For real robots, train the agent in a simulated environment, and fine-tune it in the physical world. One-shot imitation learning can then learn a new task from a single example.

12. Current reinforcement learning focuses on a single task, in contrast to machine learning, where generalization matters.

To improve generalization, train on different physical robots.

Given a single demonstration, decompose it into intermediate steps (e.g., the latent representation from a CNN), and use them as intermediate goals. Translation from the demonstration context to the current context (e.g., the position of the objects) may be needed.

Model-based RL learns about the physical world and can use that knowledge to forecast the outcome of an action and act towards a new goal.

Multi-task learning is a form of regularization and data augmentation – some aspects of the world, useful to plan a task, may be easier to learn when performing another task.

If the problem has some structure, i.e., if you can decompose the task into different “modules”, have separate neural nets learn them, and only combine them at the end (we already saw that with duelling DQNs).

Model-agnostic meta learning looks for a policy that can easily be adjusted to new tasks by policy gradient – for instance, a robot that could be steered in any direction after just one gradient step (one-shot learning).

13. The programming exercises use Chainer.

Can’t decide? Undecide!
C. Goodman-Strauss (2010)

Turing machines can be encoded in seemingly innocuous mathematical objects – the undecidability of the halting problem leads to more concrete undecidable problems:
– Given a finite collection of tiles and a “seed” pattern, can we tile the whole plane?
– Collatz-like functions, in particular Fractran programs – given a list of fractions and a starting integer $n$, multiply it with the first fraction $p/q$ in the list so that $np/q$ is integer; continue with that integer – for instance
$$\frac{3}{11},\ \frac{847}{45},\ \frac{143}{6},\ \frac{7}{3},\ \frac{10}{91},\ \frac{3}{7},\ \frac{36}{325},\ \frac{1}{2},\ \frac{36}{6}$$
generates all prime powers of 10, in order;

– Post’s tag production systems: start with a sequence of zeroes and ones, remove the first three digits and add 00 or 1101 at the end, depending on whether the first digit was 0 or 1;
– SAT (with statements such as “the kth cell contains a at time t”, “the machine is reading cell k at time t in state s”, etc.);
– Conway’s game of life;
– Wolfram’s rule 110.

Facets of entropy
R.W. Yeung (2012)

The entropy function of a collection of discrete random variables $(X_i)_{1 \leqslant i \leqslant n}$ is

$$H : \begin{cases} \mathcal{P}([\![1,n]\!]) \to \mathbf{R} \\ \alpha \mapsto H(X_\alpha) \end{cases}$$

where $H(X_\alpha)$ is the joint entropy of $(X_i)_{i \in \alpha}$. It satisfies the polymatroidal axioms

$$H(\emptyset) = 0$$
$$\alpha \subset \beta \implies H(\alpha) \leqslant H(\beta)$$
$$H(\alpha) + H(\beta) \geqslant H(\alpha \cap \beta) + H(\alpha \cup \beta).$$

In addition to entropy, the Shannon information measures are

$$H(X|Y) = H(X,Y) - H(Y)$$
$$I(X;Y) = H(X) + H(Y) - H(X,Y)$$
$$I(X;Y|Z) = H(X,Z) + H(Y,Z) - H(X,Y,Z) - H(Z).$$

Joint entropy and mutual information behave like set union and intersection.

H(X)       A
H(Y)       B
H(X, Y)    A ∪ B
I(X; Y)    A ∩ B

The polymatroidal axioms are equivalent to the basic inequalities:
entropy ⩾ 0
conditional entropy ⩾ 0
mutual information ⩾ 0
conditional mutual information ⩾ 0

Those inequalities are not sufficient to describe entropy functions:

$$H_n = \{ H : \mathcal{P}([\![1,n]\!]) \to \mathbf{R} \}$$
$$\Gamma^*_n = \{ H \in H_n : H \text{ is entropic} \}$$
$$\Gamma_n = \{ H \in H_n : H \text{ polymatroidal} \}$$
$$\Gamma^*_n \subset \Gamma_n$$
$$\Gamma^*_2 = \Gamma_2$$
$$\Gamma^*_3 \subsetneq \Gamma_3 \text{ but } \overline{\Gamma^*_3} = \Gamma_3$$

For n ⩾ 3, $\Gamma^*_n$ is neither closed nor convex, but its closure $\overline{\Gamma^*_n}$ is a cone; for n ⩾ 4, there are many more non-Shannon-type inequalities: $\overline{\Gamma^*_4} \subsetneq \Gamma_4$.

Nonlinear time series analysis with R
R. Huffaker et al. (2017)

Given a dynamical system, i.e., a system of differential equations, we can study its orbits – they are sometimes chaotic, in the sense that
– Nearby trajectories diverge exponentially fast (chaos);
– The trajectories tend to accumulate around an “attractor” with non-integral dimension (strangeness).

For continuous (autonomous) systems, chaotic behaviour is only possible in dimension at least 3, but discrete systems (e.g., the Poincaré map of a continuous system, i.e., the intersections of its trajectories with a plane) in dimension 1 or 2 can be chaotic.

Conversely, given just one coordinate of one of those trajectories, observed with noise, is it possible to reconstruct the other coordinates, the whole attractor, and the original dynamical system? This is the goal of nonlinear time series analysis.

The embedding of a univariate time series $(x_t)_{t \geqslant 0}$ with delay τ and dimension m is the multivariate time series $(x_t, x_{t+\tau}, \dots, x_{t+(m-1)\tau})_{t \geqslant 0}$. Under reasonable assumptions, it can be used to describe the dynamics of the system: one coordinate is indeed enough (Takens).

To estimate the delay τ, use the first minimum of the automutual information (AMI) – but there is no theoretical justification, and even no guarantee that a minimum exists.

To estimate the embedding dimension, look at the proportion of false near neighbours (FNN), i.e., the proportion of points close, in phase space, at time t, but not at time t + 1 – a sign that some information has been left out. PCA can also be used.

For this (and for the Lyapunov exponent), we want the points to be close in phase space, but not because they are close in time – they should be somehow “independent”. The Theiler window specifies a minimum time separation. To estimate it, plot the proportion of FNN as a function of both space and time separation, or use the first zero of the autocorrelation function (ACF).

The dimension of the attractor can be defined using box counting

$$\lim_{\varepsilon \to 0} \frac{\log N(\varepsilon)}{\log(1/\varepsilon)}$$

where $N(\varepsilon)$ is the number of balls (or boxes) of size ε needed to cover the set, or the correlation dimension

$$D_2 = \lim_{\varepsilon \to 0} \frac{\log C(\varepsilon)}{\log \varepsilon}, \qquad C(\varepsilon) = \lim_{n \to +\infty} \frac{1}{n^2} \#\big[ |y_i - y_j| \leqslant \varepsilon \big]$$

(this is a different notion: it gives more weight to areas visited more often). The embedding dimension should be at least 2D + 1.
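The delay embedding and the finite-sample correlation sum C(ε) defined in this summary can be sketched in a few lines of Python. The function names are hypothetical, the Euclidean norm is an assumption (the text does not fix the norm), and the double loop is only meant for short series.

```python
import math

def delay_embedding(x, m, tau):
    """Points (x_t, x_{t+tau}, ..., x_{t+(m-1)tau}) for every valid t."""
    n = len(x) - (m - 1) * tau
    return [tuple(x[t + k * tau] for k in range(m)) for t in range(n)]

def correlation_sum(points, eps):
    """Proportion of pairs (i, j), including i = j, at distance at most eps."""
    n = len(points)
    close = sum(
        1
        for i in range(n)
        for j in range(n)
        if math.dist(points[i], points[j]) <= eps
    )
    return close / n ** 2

series = [0, 1, 2, 3, 4, 5]
embedded = delay_embedding(series, m=3, tau=2)
c_small = correlation_sum(embedded, eps=1.0)
c_large = correlation_sum(embedded, eps=2.0)
```

The correlation dimension would then be estimated from the slope of log C(ε) against log ε over a range of small ε; the R packages listed in the summary (e.g., `d2` in tseriesChaos) do this on real data.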

The recurrence plot shows the distance (in phase space) between the observation at time t and that at time s; it can be thresholded to give a binary image, and reduced to a few numeric quantities: recurrence quantification analysis (RQA) looks, among other things, at the proportion of points in vertical or diagonal segments.

Singular spectrum analysis (SSA) can help separate signal from noise – but it is manual, and some of the components have to be grouped.

To test for stationarity, use Schreiber’s nonlinear cross-prediction stationarity test (divide the time series into non-overlapping segments; fit a nonlinear model on each, e.g., k-NN on an embedding; measure the performance of model i on segment j ≠ i versus that on segment i).

A linear system is a differential equation of the form $\dot y = Ay$. With real eigenvalues, there can be a stable, unstable or saddle point; with complex eigenvalues, it can be a stable or unstable focus equilibrium, or a centre point. If you see exponentially damped or exploding trajectories, use a linear model.

To test for nonlinearity, use the BDS test, or the Hellinger distance $S_\rho(k)$ between $f_{X_t, X_{t+k}}$ and $f_{X_t} f_{X_{t+k}}$ (which can be seen as a “nonlinear correlation”) or the moving block bootstrap (in a time series of 0s and 1s, estimate $\hat p$ and check if its standard deviation is $\sqrt{\hat p (1 - \hat p)/n}$ as it would for an iid sequence).

Non-linearity can also be assessed using surrogate data, i.e., time series with the same “linear properties”, e.g.,
– ARMA surrogates: Fourier-transform the time series, randomize the phases, transform it back;
– Amplitude-adjusted Fourier transform (AAFT) surrogates: idem, but transform the output to recover the same distribution;
– PPS surrogates (for aperiodic oscillations): (first coordinate of a) random walk on the reconstructed shadow attractor
and looking at some statistic, e.g., correlation dimension, maximum Lyapunov exponent, nonlinear prediction error or permutation entropy.

Granger causality only works for stochastic systems. For deterministic ones, try to predict future values of y using x’s attractor (convergent cross-mapping, CCM): if $x(t_a), x(t_b), \dots$ are neighbours of $x(t_0)$, then $y(t_a), y(t_b), \dots$ should be close to $y(t_0)$, and increasingly so as more data arrives. Delayed cross-mapping checks how the forecast skill changes with the lag between x and y.

To detect changepoints, compute the SSA decomposition before and after a possible changepoint, compute the scalar product (i.e., correlation) between the first eigenvectors, and compare with surrogate data to have a p-value. To detect changepoints, one can also estimate the distribution of values on a moving window: it will usually have one peak, but may have two (or be a mixture distribution) during transition periods.

Phenomenological modeling estimates the coefficients of an ODE from data, e.g., with regression, assuming the coefficients are polynomial, approximating the derivatives with high (4th) order centered finite differences, with the lasso or adaptive lasso ($|\beta|^\nu$ penalty, 0 < ν < 1) to deal with multicolinearity.

Implementations in tseriesChaos (mutual, d2, false.nearest, lyap_k), nonlinearTseries (RQA), tseriesEntropy (Srho.test.ts, surrogateAR, Trho.test.AR, Trho.test.SA), Rssa, fractal (surrogate), multispatialCCM, MESS, extRemes.

A differential equation for modeling Nesterov’s accelerated gradient method: theory and insight
E. Su et al.

Nesterov momentum

$$x_k = y_{k-1} - s \nabla f(y_{k-1})$$
$$y_k = x_k + \frac{k-1}{k+2} (x_k - x_{k-1})$$

can be modeled by an ODE

$$\ddot x + \frac{3}{t} \dot x + \nabla f(x) = 0.$$

The $O(1/t^2)$ convergence rate is guaranteed if the damping term $\frac{3}{t} \dot x$ is sufficiently large: 3 is the smallest constant guaranteeing it but, for large t, the system is over-damped. This suggests resetting t whenever $\langle \dot x, \ddot x \rangle = 0$ or, for Nesterov momentum, resetting k whenever ∆x starts decreasing.

Training deep and recurrent networks with Hessian-free optimization
J. Martens and I. Sutskever (2015)

Hessian-free (HF) optimization (or truncated Newton) approximates the objective function with a quadratic

$$h(\theta + \delta) \approx h(\theta) + G' \delta + \tfrac12 \delta' H \delta$$

but computes the update $\theta \leftarrow \theta - H^{-1} G$ using a few conjugate gradient (CG) steps: there is no need to invert the curvature matrix or even to compute it.
– Hessian-vector multiplications are directional derivatives of the gradient: they can be computed with forward differentiation;
– Replace the Hessian with the generalized Gauss–Newton (GGN) matrix, which assumes $f'' = 0$ (i.e., replaces the model f with its first order approximation): for scalar functions,
$$h(\theta) = L\big( f(\theta) \big)$$
$$h'(\theta) = f'(\theta) L'\big( f(\theta) \big)$$
$$h''(\theta) = f''(\theta) L'\big( f(\theta) \big) + f'(\theta) L''\big( f(\theta) \big) f'(\theta) \approx f'(\theta) L''\big( f(\theta) \big) f'(\theta);$$
for supervised learning,
$$h(\theta) = \frac{1}{|S|} \sum_{(x,y) \in S} L\big( y, f(x,\theta) \big)$$
$$h''(\theta) \approx \frac{1}{|S|} \sum_{(x,y) \in S} J' H_L J,$$
where $J$ is the Jacobian of the output $z = f(x,\theta)$ with respect to θ and $H_L = \nabla_{zz} L$;
– Use Tikhonov damping (L2 regularization) and make it scale-sensitive by using the diagonal of the Hessian (Levenberg–Marquardt);
– Use structural damping, i.e., a penalty, not just to changes in the parameters, but to some intermediate quantities (e.g., hidden activations – just add them to the output of the network and feed them to the loss function);
– Increase or decrease the scale of the penalty by looking at the ratio ∆objective/∆penalty (Levenberg–Marquardt heuristic);
– Instead of a penalty, try a trust region (harder);
– Use early CG stopping;
– Initialize CG with the previous direction (slightly scaled down) rather than 0;
– Use a preconditioner: diagonal preconditioners are effective with deep neural nets, because the scale of the gradients can grow and shrink exponentially between layers – but less so for RNNs, because the weights are reused. You may want to blend the diagonal matrix with the identity, $P = \big( \operatorname{diag}(d) + \kappa I \big)^\xi$. There is no efficient algorithm to compute the diagonal of the Hessian: use the diagonal of the empirical Fisher information
$$\bar F = \frac{1}{|S|} \sum_{(x,y) \in S} \nabla L\big( y, f(x,\theta) \big) \nabla L\big( y, f(x,\theta) \big)'$$
$$d = \operatorname{diag} \bar F = \frac{1}{|S|} \sum_{(x,y) \in S} \operatorname{sq} \nabla L\big( y, f(x,\theta) \big), \qquad \operatorname{sq} = \text{element-wise square}$$
(but it is biased) or, for an unbiased estimator,
$$d = \mathop{E}_{v \sim N(0,1)} \big[ \operatorname{sq} \big( J' H_L^{1/2} v \big) \big];$$
– If possible, estimate the gradient and the Hessian on the same (large) minibatch.

Learning gradient descent: better generalization and longer horizons
K. Lv et al. (2017)

The optimization algorithm can be added to the model, to have the computer choose the best stochastic gradient descent variant. For instance, one can use the gradient as input and the updates as output of a coordinate-wise LSTM cell whose latent variables are the first and second moments of the previous gradients (momentum and AdaDelta’s normalizing factors). To help make the algorithm scale-invariant:
– Randomly scale the functions to minimize at training time;
– Add a random convex function, $g(x) = \frac1n \sum (x_i - v_i)^2$, to make the objective better-behaved;
– Do not feed the raw gradient and momentum to the LSTM, but rescale them with the AdaDelta factor.

Fighting biases with dynamic boosting
A.V. Dorogush et al.

Boosting uses the same data to build the model and to compute the gradient: this data reuse introduces bias in the gradient. CatBoost avoids it. Oblivious decision trees use the same criterion across an entire level of the tree.

Escaping from saddle points – online stochastic gradient for tensor decomposition
R. Ge et al.

Even for non-convex functions, SGD converges to a local minimum in a polynomial number of iterations.

Failures of gradient-based deep learning
S. Shalev-Shwartz et al.

Gradient-based optimization fails when:
– The gradient is too flat and/or not informative enough about the position of the minimum;
– The sample gradient has a low signal-to-noise ratio;
– The condition number is high.
Possible solutions include:
– Preconditioning;
– Intermediate losses in end-to-end architectures;
– “Forward-only update rule”: if a derivative is zero, in intermediate computations, replace it with 1.

An overview of gradient descent optimization algorithms
S. Ruder

A tutorial on Bayesian optimization of expensive cost functions, with applications to active user modeling and hierarchical reinforcement learning
R. Brochu et al. (2010)

Illustrated introduction to Bayesian optimization followed by non-trivial examples:
– Bayesian modeling of preferences
$$P[X \succ Y] = \Phi\big( f(X) - f(Y) \big), \qquad f \sim GP$$
(the Bradley–Terry model assumes f is linear and uses a logit instead of the probit Φ);
– Hierarchical reinforcement learning.

Batched high-dimensional Bayesian optimization via structural kernel learning
Z. Wang et al. (2017)

Bayesian optimization can scale to high-dimensional functions by assuming a latent additive structure, and learning it with Gibbs sampling (Dirichlet prior for the mixing proportions: $\theta \sim \mathrm{Dir}(\alpha)$, $z_j \sim \mathrm{Multi}(\theta)$). Function evaluations can be batched using a determinantal point process (DPP) and processed in parallel.

Multivariate quantiles and multivariate L-moments
A. Decurninge (2014)

Univariate L-moments can be defined from order statistics:

$$\lambda_1 = E[X]$$
$$\lambda_2 = \tfrac12 E[X_{2:2} - X_{1:2}]$$
$$\lambda_3 = \tfrac13 E[X_{3:3} - 2 X_{2:3} + X_{1:3}]$$
$$\cdots$$
$$\lambda_r = \frac1r \sum_{k=0}^{r-1} (-1)^k \binom{r-1}{k} E[X_{r-k:r}]$$

or by projecting the quantile function Q on the shifted Legendre polynomials (an orthogonal basis of $L^2([0,1]; \mathbf{R})$ for the scalar product $\langle f, g \rangle = \int_0^1 fg$ – Legendre polynomials use $\int_{-1}^1 fg$).

Multivariate L-moments can be defined using:
– The monotone transport between the distribution of interest and $\mathrm{Unif}[0,1]^d$ to define the quantiles, $Q(F(u)) = T(u)$;
– The basis of products of univariate Legendre polynomials.
Other target distributions (the min-copula, a standard Gaussian) and other transports (e.g., Rosenblatt), or other orthogonal polynomials (Hermite) can be used.

The monotone transport from µ to ν is the only measurable map T such that $T_\# \mu = \nu$ and $T = \nabla \phi$, for some convex function ϕ (it is guaranteed to exist if $\mu = \mathrm{Unif}[0,1]^d$ or, more generally, if it gives zero measure to sets of Hausdorff dimension at most d − 1) [Brenier’s theorem].

The Rosenblatt transport is

$$T(x_1, \dots, x_d) = \begin{pmatrix} F_{X_1}(x_1) \\ F_{X_2 | X_1 = x_1}(x_2) \\ F_{X_3 | X_1 = x_1, X_2 = x_2}(x_3) \\ \vdots \end{pmatrix}.$$

The power Voronoi diagram is defined using the “distance”

$$d(x_i, u) = \| u - x_i \|^2 + w_i.$$

The convex piecewise linear function

$$\phi_h(u) = \max_i\, u \cdot x_i + h_i,$$

for $h_i = -\tfrac12 ( \| x_i \|^2 + w_i )$, has constant gradient $x_i$ on each Voronoi cell. The monotone transport between µ and the empirical distribution of a sample $x_1, \dots, x_n$ is $T = \nabla \phi_h$ with

$$h = \operatorname*{Argmin}_h \int \phi_h(u)\, du - \frac1n \sum_i h_i$$

(which can be computed by Newton’s method).

Distribution-free predictive inference for regression
J. Lei et al. (2017)

Conformal inference provides distribution-free, finite-sample prediction sets. With a naive approach, fitting the model on $(x_1, y_1), \dots, (x_n, y_n)$ and using the quantiles of the residuals to form a prediction interval $\hat\mu(x_{n+1}) \pm q_\alpha$, the prediction intervals are too small. Instead, fit the model on $(x_1, y_1), \dots, (x_n, y_n), (x_{n+1}, y)$, for all values of y: the prediction set is the set of y’s in their $\hat\mu(x_{n+1}) \pm q_{y,\alpha}$ interval.

To reduce the computations, do not refit the model for all values of y, but split the data into equal-sized parts, use the first to fit the model, and the second to estimate the residual quantiles. Using N splits, each at level α/N, and taking the intersection of the corresponding intervals, reduces randomness but enlarges the interval – the Bonferroni effect dominates.

The Jackknife (using quantiles of the leave-one-out residuals) relies on a similar idea, but is more fragile, unless we impose strong conditions on the estimator.
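The sample-splitting variant of conformal inference summarized in the notes on Lei et al. can be sketched as follows. The least-squares line, the simulated data and all names are assumptions for this illustration; the quantile index ⌈(n + 1)(1 − α)⌉ is the usual finite-sample choice for split conformal prediction.

```python
import math
import random

def fit_line(xs, ys):
    """Ordinary least-squares line, used here as the hypothetical base model."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return lambda x: my + slope * (x - mx)

def split_conformal(xs, ys, alpha=0.1):
    """Fit on the first half, calibrate absolute residuals on the second half."""
    half = len(xs) // 2
    mu = fit_line(xs[:half], ys[:half])
    residuals = sorted(abs(y - mu(x)) for x, y in zip(xs[half:], ys[half:]))
    k = math.ceil((len(residuals) + 1) * (1 - alpha))
    # If k exceeded the calibration size, the interval should be infinite;
    # it is clamped here to keep the sketch short.
    q = residuals[min(k, len(residuals)) - 1]
    return lambda x: (mu(x) - q, mu(x) + q)

rng = random.Random(0)

def sample(n):
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [1 + 2 * x + rng.gauss(0, 1) for x in xs]
    return xs, ys

train_x, train_y = sample(1000)
interval = split_conformal(train_x, train_y, alpha=0.1)

test_x, test_y = sample(2000)
hits = 0
for x, y in zip(test_x, test_y):
    lo, hi = interval(x)
    hits += lo <= y <= hi
coverage = hits / len(test_x)
width = interval(0.0)[1] - interval(0.0)[0]
```

On this simulated data, the empirical coverage on fresh points should be close to the nominal 90%, with an interval half-width near the 90% quantile of |N(0, 1)|.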

Forecaster’s dilemma: extreme events and forecast evaluation
S. Lerch et al.

Forecasts are often only evaluated when an extreme event occurs (earthquake, financial crisis, etc.): this encourages forecasters to always predict disaster – whenever they are examined, they will be right. There is no fix for point forecasts, but probability forecasts are more flexible.

The joint distribution of forecast F and observation Y can be decomposed as

$$[F, Y] = [F]\,[Y|F] = \text{sharpness} \times \text{calibration}.$$

The quality of a forecast distribution F for an observation y can be measured by the logarithmic score

$$\mathrm{LogS}(F, y) = -\log f(y)$$

or by the continuous ranked probability score

$$\mathrm{CRPS}(F, y) = \int_{-\infty}^{+\infty} \big( F(z) - \mathbf{1}_{y \leqslant z} \big)^2 dz.$$

There are weighted variants:

$$\mathrm{CL}(F, y) = -w(y) \log \frac{f(y)}{\int w f}$$
$$\mathrm{twCRPS}(F, y) = \int w(z) \big( F(z) - \mathbf{1}_{y \leqslant z} \big)^2 dz.$$
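The unweighted CRPS can be evaluated by direct numerical integration of its definition. The sketch below does this for a standard Gaussian forecast and compares the result with the standard closed form for Gaussian forecasts, which is not part of the summary and is used only as a check; the integration bounds, the step count and the function names are assumptions.

```python
import math

def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def crps_numeric(cdf, y, lo=-10.0, hi=10.0, steps=20_000):
    """Midpoint-rule approximation of the integral of (F(z) - 1{y <= z})^2."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        z = lo + (i + 0.5) * h
        indicator = 1.0 if y <= z else 0.0
        total += (cdf(z) - indicator) ** 2
    return total * h

def crps_normal(y, mu=0.0, sigma=1.0):
    """Closed form of the CRPS for a N(mu, sigma^2) forecast."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return sigma * (z * (2.0 * normal_cdf(z) - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))

numeric = crps_numeric(normal_cdf, 0.3)
closed_form = crps_normal(0.3)
```

A weight function w(z) could be inserted inside the loop to obtain the threshold-weighted variant in the same way.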

The Cramér distance as a solution to biased Wasserstein gradients
M.G. Bellemare et al.

In machine learning, the Kullback-Leibler divergence (relative entropy) is often used to measure the “distance” between the data and the fitted model. Instead, one can use the Wasserstein metric, which does not only consider probabilities, but also proximities. It is scale-invariant

$$d(cX, cY) \leqslant |c|^\beta\, d(X, Y)$$

and sum-invariant

$$d(A + X, A + Y) = d(X, Y)$$

but does not have unbiased sample gradients

$$\nabla_\theta d(P, Q_\theta) \neq \mathop{E}_{x_1, \dots, x_m \sim P} \nabla_\theta d\Big( \frac1m \sum_i \delta_{x_i}, Q_\theta \Big).$$

Prefer the Cramér distance

$$d^2(P, Q) = \int \big( F_P(x) - F_Q(x) \big)^2 dx,$$

which is scale-invariant, sum-invariant, and has unbiased sample gradients.

A note on the evaluation of generative models
L. Theis et al. (2016)

The Jensen-Shannon divergence (JSD) is a symmetrized KL divergence:

$$\mathrm{JSD}(p, q) = \frac12 \mathrm{KL}\Big( p \,\Big\|\, \frac{p+q}{2} \Big) + \frac12 \mathrm{KL}\Big( q \,\Big\|\, \frac{p+q}{2} \Big)$$

The maximum mean discrepancy (MMD) is

$$\mathrm{MMD}(p, q) = \Big( \mathop{E}_{\substack{x, x' \sim p \\ y, y' \sim q}} \big[ k(x, x') - 2 k(x, y) + k(y, y') \big] \Big)^{1/2}.$$

Random projection through multiple optical scattering: approximating kernels at the speed of light
A. Saade et al.

Analogue, optical devices can efficiently compute random projections.

Alpha-beta divergences discover micro and macro structures in data
K. Narayan et al. (2015)

The Kullback-Leibler divergence

$$D(P \| Q) = \sum_{i \neq j} P_{ij} \log \frac{P_{ij}}{Q_{ij}}$$

minimized by t-SNE can be replaced by the α-β-divergence

$$D(P \| Q) = \frac{1}{\alpha\beta} \sum_{i \neq j} \Big( -P_{ij}^\alpha Q_{ij}^\beta + \frac{\alpha}{\alpha+\beta} P_{ij}^{\alpha+\beta} + \frac{\beta}{\alpha+\beta} Q_{ij}^{\alpha+\beta} \Big).$$

Different values of (α, β) focus on macro-structures, micro-structures or hard-to-classify observations (α ≈ 1, β ≈ 0, α + β ≈ 1).

Entropic graph-based posterior regularization
M.W. Libbrecht et al. (2015)

To encourage variables associated to nodes in a graph to have similar posterior distributions when those nodes are linked, add a penalty for their KL divergence. More precisely, posterior regularization introduces an auxiliary joint distribution q, adds a regularizer to it, and a penalty to make it close to the posterior p,

$$\mathrm{Penalty}(p) = \max_q\, -D(q \| p) + \mathrm{Penalty}(q).$$

Persistence topology of syntax
A. Port et al.

Persistence homology shows that language evolution cannot be summarized by a tree, but often needs a more general phylogenetic network – for Indo-European, the first homology generator comes from ancient Greek.

Spin glass models of syntax and language evolution
K. Siva et al.

Spin model to model and forecast the evolution of a weighted graph with binary feature vectors at each node.

Graph grammars, insertion Lie algebras and quantum field theory
M. Marcolli and A. Port

Graph grammars generalize context-free and context-sensitive grammars, and model parallelism: the production rules replace a subgraph (a single, non-terminal node, in the case of context-free grammars) with a new graph.

Prevalence and recoverability of syntactic parameters in sparse distributed memories
J.J. Park et al.
A Kanerva network (or sparse distributed memory) learns a mapping $\mathbf{F}_2^N \to \{\pm 1\}$ from a dataset $(x_i, y_i)_i$ as follows:
– Pick k points $\hat x_1, \dots, \hat x_k \in \mathbf{F}_2^N$, each associated with a count variable $\hat y_1, \dots, \hat y_k$, initialized at 0;
– For each observation $(x_i, y_i)$, find the points within distance d of $x_i$, and increment/decrement their counts.
Here, they are used to check which language features (SVO order, etc. – from the SSWL (now TerraLing) or WALS databases) can be recovered. They can also learn the identity map $\mathbf{F}_2^N \to \{\pm 1\}^N$, i.e., learn a dataset. Implementation in msbrogli/sdm.

Submodularity in data subset selection and active learning
K. Wei et al. (2015)
To select a small sample of data on which to train a classifier with minimal performance loss, consider the data log-likelihood,
$$\ell(S) = \sum_{i \in V} \log p\big(x_i, y_i \mid \theta(S)\big),$$
i.e., the likelihood of the whole data V when the model is estimated on a subset S ⊂ V. For the naive Bayes or the nearest neighbour classifiers, this is a difference of submodular functions but, under the constraints |S| = k and S balanced (same label distribution as V), it reduces to a submodular function, which can be approximately maximized with the lazy greedy algorithm.

Langevin diffusions and the Metropolis-adjusted Langevin algorithm
T. Xifara et al.
Given a diffusion $dX_t = \mu(X_t)\,dt + \sigma(X_t)\,dW_t$, the probability density function p(x, t) satisfies the Fokker-Planck equation
$$\frac{\partial p}{\partial t} = -\frac{\partial (\mu p)}{\partial x} + \frac{1}{2} \frac{\partial^2 (\sigma^2 p)}{\partial x^2}$$
or
$$\frac{\partial p}{\partial t} = -\mathbf{1}' \nabla_x (\mu p) + \frac{1}{2} \mathbf{1}' \nabla_x^2 (\sigma \sigma' p) \mathbf{1}$$
in dimension n. One can therefore build a diffusion with a prescribed stationary distribution π, e.g.,
$$dX_t = \frac{1}{2} \nabla \log \pi \, dt + dW_t$$
or
$$dX_t = \frac{1}{2} A \nabla \log \pi \, dt + \Gamma \, dt + \sqrt{A} \, dW_t$$
(with Γ(x) = 0 if A(x) does not depend on x).

FairTest: discovering unwarranted associations in data-driven applications
F. Tramèr et al.
Unwarranted associations are statistically significant associations, in a subpopulation, between a protected attribute and an output, with no accompanying explanatory factor. Examples include
– Unintended side effects (e.g., discounted prices if there is a competitor nearly exclude low-income areas);
– Large errors affecting a subpopulation (e.g., future health prediction for the elderly).

The dual-sparse topic model: mining focused topics and focused terms in short text
T. Lin et al. (2014)
Spike-and-slab prior for sparse topic mixtures and sparse word usage.

A unified model for unsupervised opinion spamming detection incorporating text generality
Y. Xu et al.
Hierarchical Bayesian model for spam detection, combining text, user and item features, estimated with Gibbs-EM, i.e., alternating between collapsed Gibbs sampling and variational inference gradient descent.

Bayesian post-selection inference in the linear model
S. Panigrahi et al.
In selective inference (adaptive data analysis), the analyst looks at the data before deciding which question to ask. This can be modeled by conditioning on selection, i.e., by using a truncated log-likelihood, for a sequence model (k largest statistics, BY correction) or a (generalized) linear model (through a convex approximation of the truncated likelihood).

Learning the nonlinear geometry of high-dimensional data: models and algorithms
T. Wu and W.U. Bajwa (2015)
Union-of-subspace (UoS) models have trouble when the subspaces are close: constrain the subspaces to be close, using a distance on the Grassmannian, e.g., $d(S_1, S_2) = \| D_1 - P_{S_2} D_1 \|_F$, where $D_i$ is an orthonormal basis of $S_i$ and $P_{S_2}$ is the projection on $S_2$ – it can also be defined from the principal angles $\theta_k$ as
$$d(S_1, S_2) = \sqrt{s - \sum_k \cos^2 \theta_k},$$
where s is the subspace dimension (for nonlinear data, use a kernel).

Scalable Gaussian processes for characterizing multidimensional change surfaces
W. Herlands et al. (2015)
Mixture models
$$f(x) = \sum_i p_i f_i(x), \qquad p = \mathrm{softmax}(w(x)),$$
with a Gaussian process prior on w, can be seen as generalizations of changepoint models.
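The two expressions for the subspace distance in the Wu and Bajwa summary above (projection residual and principal angles) can be checked numerically. This is a minimal sketch with NumPy; the function names are illustrative, and the principal-angle cosines are taken as the singular values of $D_1' D_2$, which is the usual definition rather than anything specific to the paper.

```python
import numpy as np

def orthonormal_basis(matrix):
    """Orthonormal basis (columns) of the column space, via a QR decomposition."""
    q, _ = np.linalg.qr(matrix)
    return q

def projection_distance(d1, d2):
    """|| D1 - P_{S2} D1 ||_F, with P_{S2} = D2 D2' the projection on span(D2)."""
    return float(np.linalg.norm(d1 - d2 @ (d2.T @ d1), "fro"))

def principal_angle_distance(d1, d2):
    """sqrt(s - sum_k cos^2 theta_k); the cosines are the singular values of D1' D2."""
    cosines = np.linalg.svd(d1.T @ d2, compute_uv=False)
    s = d1.shape[1]
    return float(np.sqrt(max(s - np.sum(cosines**2), 0.0)))
```

Both functions agree because $\|D_1 - P_{S_2} D_1\|_F^2 = \mathrm{tr}\, D_1'(I - P_{S_2}) D_1 = s - \|D_2' D_1\|_F^2$.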

Scalable nearest neighbor algorithms for high-dimensional data
M. Muja and D.G. Lowe (2014)
The FLANN library for approximate nearest neighbour search automatically selects the algorithm among:
– Randomized k-d forests: multiple k-d trees searched in parallel, in which the split dimension is chosen randomly from the top 5 with highest variances – non-axis-aligned variants exist but do not perform significantly better;
– Priority search k-means trees: find k clusters and process each cluster recursively until they reach a minimum size – it is just another way of hierarchically partitioning the space.

Deterministic independent component analysis
R. Huang et al.
Variants of the HKICA algorithm (FastICA, i.e., finding directions maximizing some measure of non-Gaussianity, only has theoretical guarantees in the noiseless case):
– Sample $\phi, \psi \sim N(0, I)$;
– Compute $m_p(\eta) = E[(\eta' X)^p]$, $f = \frac{1}{12}(m_4 - 3 m_2^2)$, $\nabla^2 \hat f$;
– Compute the eigenvectors of $(\nabla^2 \hat f(\phi))(\nabla^2 \hat f(\psi))^{-1}$, $A = (\mu_1 | \cdots | \mu_d)$;
– We then have x ≈ As + Gaussian noise.

On restricted nonnegative matrix factorization
D. Chistikov et al. (2016)
The nonnegative rank of an n × m nonnegative matrix M is the smallest d for which we can find nonnegative n × d and d × m matrices W and H such that M = WH. If M is rational, W and H need not be so, i.e., $\mathrm{rank}_+^{\mathbf{Q}} > \mathrm{rank}_+^{\mathbf{R}}$ in general (in dimensions beyond 3). The restricted NMF requires rank M = rank W, defining a restricted nonnegative rank.

Distributional rank aggregation and an axiomatic analysis
A. Prasad et al. (2015)
Rank aggregation is the problem of combining several rankings (e.g., from search engines) into one. Distributional rank aggregation only uses the distribution (histogram) of the rankings. The normative axioms of “social welfare theory”, non-dictatorship, universality, transitivity, Pareto efficiency and independence to irrelevant alternatives (Arrow’s impossibility theorem) can be relaxed and satisfied.

Parallel resampling in the particle filter
L.M. Murray et al. (2015)
The propagation and weighting steps of sequential Monte Carlo (SMC, particle filters) are easy to parallelize, but the resampling step is less so – try:
– Rejection sampling, if an upper bound on the weights is known: repeat $j \sim \mathrm{Unif}[\![1, N]\!]$ while $\mathrm{Unif}(0, 1) > w_j / w_{\max}$;
– Metropolis: run N identical Markov chains in parallel, for B steps, sampling from Multinom(w) – only the weight ratios $w_i / w_j$ are needed.

Six myths of polynomial interpolation and quadrature
L.N. Trefethen
Equispaced interpolation may diverge (Runge phenomenon) but Chebychev interpolation, i.e., interpolation on [−1, 1] at $x_j = \cos(j\pi/n)$, converges for Lipschitz continuous functions. It can be numerically evaluated with the barycentric formula: with $w_j = 1/\ell'(x_j)$,
$$\ell_j(x) = \prod_{i \neq j} \frac{x - x_i}{x_j - x_i}, \qquad \ell(x) = \prod_i (x - x_i), \qquad \ell'(x_j) = \prod_{i \neq j} (x_j - x_i),$$
$$\ell_j(x) = \frac{\ell(x)}{x - x_j} \frac{1}{\ell'(x_j)} = \ell(x) \frac{w_j}{x - x_j},$$
$$g(x) = \sum_j \ell_j(x) y_j = \ell(x) \sum_j \frac{w_j y_j}{x - x_j}, \qquad 1 = \ell(x) \sum_j \frac{w_j}{x - x_j},$$
$$g(x) = \frac{g(x)}{1} = \frac{\sum_j w_j y_j / (x - x_j)}{\sum_j w_j / (x - x_j)}.$$
The monomials $x^k$ are well-suited to find roots on the circle; for roots on [−1, 1], prefer Chebychev polynomials.

Fixed points of belief propagation – an analysis via polynomial continuation
C. Knoll et al.
Homotopy continuation is a way of numerically finding the solutions of a system of equations, by following a homotopy from an easy-to-solve system to the desired one. In the case of a polynomial system, the polyhedral homotopy method finds all the solutions (it scales better than Gröbner bases).
Non-free implementation in Hom4PS-3.
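The barycentric formula from the Trefethen summary above translates almost line by line into code. The following is a minimal sketch, assuming the nodes $x_j = \cos(j\pi/n)$ and the weights $w_j = 1/\prod_{i \neq j}(x_j - x_i)$ computed by brute force; the function names are illustrative.

```python
import numpy as np

def chebyshev_points(n):
    """Nodes x_j = cos(j * pi / n), j = 0..n, on [-1, 1]."""
    return np.cos(np.arange(n + 1) * np.pi / n)

def barycentric_weights(nodes):
    """w_j = 1 / prod_{i != j} (x_j - x_i), computed directly from the definition."""
    diff = nodes[:, None] - nodes[None, :]
    np.fill_diagonal(diff, 1.0)
    return 1.0 / diff.prod(axis=1)

def barycentric_interpolate(nodes, values, weights, x):
    """g(x) = sum_j (w_j y_j / (x - x_j)) / sum_j (w_j / (x - x_j))."""
    x = np.atleast_1d(np.asarray(x, dtype=float))
    result = np.empty_like(x)
    for k, xk in enumerate(x):
        delta = xk - nodes
        hit = delta == 0.0
        if hit.any():
            # x coincides with a node: return the data value directly.
            result[k] = values[hit][0]
        else:
            terms = weights / delta
            result[k] = terms @ values / terms.sum()
    return result
```

With 101 Chebychev nodes, this reproduces the Runge function $1/(1 + 25x^2)$ to high accuracy on [−1, 1], where equispaced interpolation of the same degree would diverge near the endpoints.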

Transfinite game values in infinite chess
C.D.A. Evans and J.D. Hamkins
A position has value (at most) n if there is a strategy leading to a check-mate in (at most) n moves, whatever the opponent does.
A position with value ω, the first infinite ordinal, is a position with Black to play resulting in a mate-in-n position for White, with n as large as Black wants.
A position with value ω + n is n moves away from a position with value ω.
A position with value nω is a position in which Black can announce n times “I will make an announcement in $k_i$ moves” or, for the last one, “I will lose in $k_n$ moves”, with the $k_i$’s as large as desired.
In a position with value $\omega^2$, Black can play to be in a position with value kω, with k as large as desired.

A position in infinite chess with game value $\omega^4$
C.D.A. Evans et al.

What is a Leavitt path algebra?
G. Abrams
Given a monoid M, it is possible to find a ring R so that (Mod R, ⊕) ≃ M.

Highly comparative time-series analysis: the empirical structure of time series and their methods
B.D. Fulcher et al. (2013)
The hctsa Matlab library (GPL) provides 1000 time series features for classification, clustering and biclustering, to be used with the UCR time series dataset for nearest neighbour search.

FATS: Feature analysis for time series
I. Nun et al. (2015)
The fats Python library computes 30 time series features (to classify astronomy time series).

Feature-based time series analysis
B.D. Fulcher (2017)
Look at global features:
– Distribution of the values (µ, σ, etc.);
– Stationarity, e.g., $\mathrm{StatAv} = \mathrm{sd}(\bar x_{1:w}, \bar x_{w+1:2w}, \dots) / \mathrm{sd}(x)$;
– ACF, Fourier and wavelet coefficients;
– Nonlinear time series analysis: Lyapunov coefficients, correlation dimension, correlation entropy;
– Entropy: approximate entropy, sample entropy, permutation entropy;
– Scaling (fractality): Hurst exponent (DFA);
– Time series models: parameters and goodness-of-fit statistics, properties of the residuals;
and subsequence features:
– If the time series are aligned, consider $\mathrm{Mean}(x_{i:j})$, $\mathrm{Sd}(x_{i:j})$, $\mathrm{Slope}(x_{i:j})$ – these are families of features: select them greedily, using the entropy gain (as in decision trees);
– Shapelets are small patterns that may appear in a time series, $\operatorname*{Min}_i d(x_{i-w:i+w}, \text{shapelet})$, selected by genetic algorithms – you can also look at their position or the number of occurrences.

Automatic time series phenotyping using massive feature extraction
B.D. Fulcher and N.S. Jones (2016)
Normalize the time series with a “robust sigmoid”,
$$x \leftarrow \left(1 + \exp\left(-\frac{x - \mathrm{median}}{1.35 \times \mathrm{IQR}}\right)\right)^{-1}.$$

Large-scale unusual time series detection
R.J. Hyndman et al. (2015)
A few more time series features:
– Lumpiness: variance of the variances in 24-observation blocks;
– Spikiness: variance of the l.o.o. variances of the STL residuals;
– Level and variance change: maximum change in mean or variance in consecutive 24-observation blocks;
– KL score: maximum Kullback-Leibler divergence between kernel density estimates of consecutive 48-observation blocks;
– Flat spots: maximum run length of the (10-value) quantized time series.
Some features are not defined: peak, trough, curvature.

Forecastable component analysis
G.M. Goerg (2013)
Find orthogonal directions, maximizing forecastability, measured by Ω = 1 − H/ log 2π, where H is the spectral entropy, i.e., the entropy of the spectral density (the spectrum, rescaled to be a probability density on $S^1$). Implementation in ForeCA::Omega.

A scalable method for time series clustering
X. Wang et al. (2004)
Cluster time series using the following features:
– $\mathrm{Trend} = 1 - \dfrac{\mathrm{Var(detrended)}}{\mathrm{Var(raw)}}$;
– $\mathrm{Seasonality} = 1 - \dfrac{\mathrm{Var(deseasoned)}}{\mathrm{Var(raw)}}$;
– Periodicity: position of the first peak in the ACF, if it is positive, and significantly higher than the previous trough;
– Serial correlation: Box-Pierce statistic;
– Non-linearity: Teräsvirta test statistic;
– Skewness, kurtosis, Hurst exponent, maximum Lyapunov exponent.

All those features have values in [0, +∞): rescale them parametrically,
$$q \mapsto \frac{e^{aq} - 1}{b + e^{aq}},$$
so that quantiles 10% and 90% correspond to time series with and without the features (respectively an example from the literature, and white noise).

pdc: an R package for complexity-based clustering of time series
A.M. Brandmaier (2015)
To cluster time series, use an m-dimensional embedding, replace the m-dimensional vectors with the corresponding permutation (their ranks), and use the Hellinger distance (a metric approximation of the KL divergence) between the permutation distributions as a dissimilarity measure. Choose the embedding dimension m and the delay τ to minimize the (average) normalized permutation entropy
$$-\frac{1}{\log m!} \sum_{\sigma \in S_m} p_\sigma \log p_\sigma.$$

TSclust: an R package for time series clustering
P. Montero and J.A. Vilar (2014)
The TSclust package defines the following dissimilarity measures:
– $L^p$ distance;
– Dynamic time warp (DTW) distance, $d(x, y) = \operatorname*{Min}_r \sum_{(i,j) \in r} |x_i - y_j|$;
– Fréchet distance, $d(x, y) = \operatorname*{Min}_r \operatorname*{Max}_{(i,j) \in r} |x_i - y_j|$;
– Correlation: $\sqrt{2(1 - \mathrm{Cor}(x, y))}$, $\left(\dfrac{1 - \mathrm{Cor}(x, y)}{1 + \mathrm{Cor}(x, y)}\right)^{\beta/2}$;
– $L^2$ distance between the ACF, with constant or exponentially decaying weights;
– $L^2$ distance between the periodograms, the normalized periodograms, the integrated periodograms (Cramér–von Mises distance between the spectral densities);
– Distance between the wavelet coefficients, the AR(∞) coefficients or the cepstral coefficients;
– Statistics from ARMA tests checking if the two time series come from the same model;
– Distance between the SAX representations (obtained by aggregating, and then quantizing with quantiles);
– Divergence between the distributions of the permutations induced by an m-dimensional embedding;
– Normalized compression distance and variants: $\dfrac{C(xy) - \mathrm{Min}\{C(x), C(y)\}}{\mathrm{Max}\{C(x), C(y)\}}$, $\dfrac{C(xy)}{C(x) + C(y)}$;
– Distance between the k-step-ahead forecast distributions, from a sieve bootstrap (i.e., bootstrap on the residuals of an AR or similar model).
Those measures can be corrected, for correlation or complexity, by multiplying by
$$\frac{2}{1 + \exp\big(k \, \mathrm{Cor}(\Delta x, \Delta y)\big)} \quad\text{or}\quad \frac{\mathrm{Max}\{\mathrm{CE}(x), \mathrm{CE}(y)\}}{\mathrm{Min}\{\mathrm{CE}(x), \mathrm{CE}(y)\}},$$
where $\mathrm{CE}(x) = \|\Delta x\|_2$.

Visualising forecasting algorithm performance using time series instance spaces
Y. Kang et al. (2016)
Time series features (spectral entropy, trend and seasonality strength from the STL decomposition, period (if known), ACF, optimal Box-Cox transformation to make the time series linear) can be used to
– Check how diverse the M3 dataset (in the Mcomp package) is;
– Add synthetic time series in its gaps, using a genetic algorithm;
– Visualize which forecasting algorithms perform well with which type of time series.

JIDT: an information-theoretic toolkit for studying the dynamics of complex systems
J.T. Lizier (2014)
The JIDT Java library provides several information-theoretic measures for time series, centered on transfer entropy, and several estimators.
Entropy rate: $H(X_{n+1} \mid X_n)$
Active information storage: $I(X_n ; X_{n+1})$
Transfer entropy: $I(Y_n ; X_{n+1} \mid X_n)$
Conditional TE: $I(Y_n ; X_{n+1} \mid X_n, Z_n)$
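To make the JIDT measures above concrete, here is a minimal plug-in (counting) estimator of transfer entropy for discrete series. It does not use JIDT; it assumes the usual convention $I(Y_n ; X_{n+1} \mid X_n)$ for the transfer from Y to X with history length 1, and the function names are illustrative.

```python
from collections import Counter
from math import log2
import random

def conditional_mutual_information(triples):
    """Plug-in estimate, in bits, of I(A; B | C) from a list of (a, b, c) samples."""
    n = len(triples)
    n_abc = Counter(triples)
    n_ac = Counter((a, c) for a, _, c in triples)
    n_bc = Counter((b, c) for _, b, c in triples)
    n_c = Counter(c for _, _, c in triples)
    total = 0.0
    for (a, b, c), count in n_abc.items():
        # p(a,b,c) * log[ p(a,b,c) p(c) / (p(a,c) p(b,c)) ], written with counts.
        total += count / n * log2(count * n_c[c] / (n_ac[(a, c)] * n_bc[(b, c)]))
    return total

def transfer_entropy(source, target):
    """I(source_n ; target_{n+1} | target_n), with history length 1."""
    triples = [
        (source[i], target[i + 1], target[i]) for i in range(len(target) - 1)
    ]
    return conditional_mutual_information(triples)
```

For instance, if x is a copy of a fair binary series y delayed by one step, the transfer entropy from y to x is close to 1 bit, while the transfer from x to y is close to 0. Plug-in estimates are biased upwards on short series, which is one reason toolkits offer several estimators.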

(you can replace $X_n$ with $(X_n, \dots, X_{n+k-1})$).

Multiscale entropy analysis
M. Costa et al. (2000)
Compute the average of the datapoints on non-overlapping windows of size τ, and then the sample entropy; plot the sample entropy versus τ – for 1/f noise, it is flat.

Introducing nonlinear time series analysis in undergraduate courses
M. Perc (2004)
Nonlinear time series analysis studies stationary, deterministic time series, from data alone, i.e., without knowledge of the underlying dynamical system.
The dynamical system can often be recovered from the Takens embedding, i.e., the cloud of points $p_t = (x_t, x_{t+\tau}, \dots, x_{t+(m-1)\tau}) \in \mathbf{R}^m$. To choose the delay τ, check when the ACF decreases to 1/e, or when the AMI (auto-mutual information) reaches its first minimum – though there is no reason such a minimum should exist. To choose the embedding dimension m, notice that two points close in the embedding should remain close at the next time step – if not, i.e., if
$$\|p_t - p_s\| \leqslant \sigma \quad\text{but}\quad \frac{|x_{t+m\tau} - x_{s+m\tau}|}{\|p_t - p_s\|} > 10,$$
they are false nearest neighbours (FNN) – we are missing some information about them. Plot the proportion of FNN as a function of m to help select m.
To test for stationarity, split the time series into non-overlapping segments and compare the mean and standard deviation on each segment, or look at the cross-prediction error $\delta_{ij}$ when forecasting the next observation in segment j with a k-NN model fitted on segment i.
To recover the dynamical system,
– Split the embedding space into boxes;
– Whenever a trajectory crosses a box, compute its average direction, as a unit vector;
– Average the vectors in each (non-empty) box – if the directions are consistent, the average will still have unit norm.
The Lyapunov exponent measures how fast nearby trajectories diverge; it can be estimated by choosing a point $p_s$ close to $p_t$ and averaging
$$\frac{1}{\nu} \log \frac{\|p_{t+\nu} - p_{s+\nu}\|}{\|p_t - p_s\|}$$
with ν = mτ, and replacing the point $p_s$ at each step, trying to keep the same direction.

Nonlinear time series analysis revisited
E. Bradley and H. Kantz (2015)
Nonlinear time series analysis reconstructs the state space of a dynamical system, from data alone, to compute fractal dimension, Lyapunov exponents (instability), Kolmogorov-Sinai entropy (unpredictability), etc. – but noise and the finiteness of the data call for caution.
– The fraction of pairs of data points at distance at most ε (the correlation sum) scales as $\varepsilon^{D_2}$, where $D_2$ is the fractal (Rényi) dimension.
– The time series should be long enough: as a rule of thumb, $N > 42^{D_2}$.
– To test if the values are significantly different from a null model, use surrogate time series: Fourier transform the data, randomize the phases, transform back, map onto the original values (only keeping the rank), and somehow recover the correlations.
– Permutation entropy.
– Recurrence plots and recurrence quantitative analysis (RQA): proportion of black points in the plot, in diagonals, in vertical lines, length distribution of those lines.
– Network characteristics: interpret the recurrence matrix as a graph and look at some of its metrics: centrality, shortest paths, clustering coefficients, etc.
Distinguishing noise from chaos is difficult.

Complex network approach for recurrence analysis of time series
N. Marwan et al. (2009)
The following five RQA features
– Maximum diagonal length;
– Laminarity: proportion of points in vertical lines;
– Link density (for the network whose adjacency matrix is the recurrence plot);
– Clustering coefficient;
– Average minimum path length
can identify different regimes of the logistic map $x_{n+1} \leftarrow \alpha x_n (1 - x_n)$ as α varies, or (when estimated on a moving window) the different regimes of a real-world time series.

Characterizing the structural diversity of complex networks across domains
K. Ikehara and A. Clauset (2016)
To compare networks, look at scale-invariant features: clustering coefficient, degree assortativity and network motifs (compared with a random graph with the same degree distribution).

Time series classification with COTE: the collective of transformation-based ensembles
A. Bagnall et al. (2014)
Classify time series using ensemble methods (random forests or rotation forests – random forests with a PCA on a random subset of the variables at each node), on data transformed into the time (shapelets), frequency (periodogram) or autocorrelation (ACF, PACF, AR coefficients) domains. The shapelet features are the distances to smaller time series s (“shapelets”),
$$d(s, x) = \operatorname*{Min}_{y \subset x,\ \mathrm{len}(y) = \mathrm{len}(s)} d(s, y);$$
those smaller time series are taken from subsequences of the training data, pruned to retain only the most discriminating ones (small distances to one class, large distance to the others) and to avoid redundancy.

Forecasting at scale
S.J. Taylor and B. Letham
To forecast time series, Facebook uses a GAM-like model, with trend, seasonality (Fourier) and (irregular) holiday components – curve fitting is easier than time series modeling. The trend is either piecewise linear (with a large number of potential changepoints, but a Laplace prior keeps them sparse), or a logistic growth model, with changepoints accounting for changes in market size (e.g., using the population from the World Bank). R/Python implementation, via Stan, in fbprophet.
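The shapelet feature used by COTE (see the Bagnall et al. summary above), i.e., the minimum distance between a short pattern and all subsequences of the same length, can be sketched in a few lines. The Euclidean distance and the function name are assumptions for this example; the original method may normalize the subsequences first.

```python
import numpy as np

def shapelet_distance(shapelet, series):
    """Minimum Euclidean distance between the shapelet and every
    contiguous subsequence of the series with the same length."""
    shapelet = np.asarray(shapelet, dtype=float)
    series = np.asarray(series, dtype=float)
    # One row per subsequence of length len(shapelet).
    windows = np.lib.stride_tricks.sliding_window_view(series, len(shapelet))
    return float(np.min(np.linalg.norm(windows - shapelet, axis=1)))
```

A time series is then described by its distances to a small dictionary of shapelets, and those distances are fed to the classifier.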

Distributed and parallel time series feature extraction for industrial big data applications
M. Christ et al. (2017)
Select time series features (from hctsa) using tests for feature ⊥⊥ outcome (Fisher, Kolmogorov-Smirnov or Kendall rank, depending on whether the variables are binary or continuous) and the Benjamini-Yekutieli (BY) procedure to control the false discovery rate (FDR).

The all-relevant feature selection using random forest
M.B. Kursa and W.R. Rudnicki (2010)
To find all the relevant features (instead of just a non-redundant set), add a “shadow” variable for each feature, by shuffling its values, fit a random forest, and mark as relevant the features more important than the best random one; remove them, and iterate until nothing left is relevant.
R implementation in Boruta.

Surrogate time series
T. Schreiber and A. Schmitz (1999)
To test if a time series is linear, e.g.,
– H0: iid;
– H0: ARMA (“Gaussian linear”);
– H0: Nonlinear transformation of an ARMA process;
(these are composite null hypotheses), look at statistics such as time reversibility $E[(X_n - X_{n-\tau})^3]$ or the nonlinear prediction error (nlpe). The p-value can be computed with simulations.
The bootstrap would estimate an ARMA model on the data and generate new data from the fitted model. Constrained randomization generates data with the same ARMA estimates, i.e., with the same ACF. For instance, one can Gaussianize the data, compute its FFT, randomize the phase, transform back, and restore the distribution, but this tends to give a flatter power spectrum. Instead, start with a shuffled time series and iterate:
– Force the spectrum to be correct, by rescaling it to the correct amplitudes, keeping the phases;
– Force the distribution to be correct (resampling).
Conserving the Fourier amplitudes preserves the periodic ACF, not the ACF: to limit the problem, measure the corresponding artefacts and remove a few observations at the beginning or the end to minimize them.
Alternatively, use simulated annealing to find a permutation of the original time series with a small discrepancy between the ACFs (not exactly the minimum discrepancy: that would be the original time series).

A large set of audio features for sound description (similarity and classification) in the Cuidado project
G. Peeters (2004)
Audio features include:
– Spectral shape (linear or exponential fit; properties of the spectral density);
– ACF, zero crossing rate;
– Length of the ADSR (envelope) phases;
– Fundamental frequency, inharmonicity (weighted average distance between successive spectral peaks and multiples of the fundamental), harmonic deviation (deviation of the harmonics from the spectral envelope), odd/even harmonic ratio, tristimulus (relative energy in harmonics 1, 2–4 and 5+).

Time-varying market beta: does the estimation methodology matter?
B. Nieto et al. (2014)
Comparison of several time-varying beta estimators (prefer Kalman or GARCH):
– Regression on a moving window, with constant or decreasing weights;
– Multivariate GARCH,
  BEKK: $H_{t+1} = C'C + A' \varepsilon_t \varepsilon_t' A + B' H_t B$
  DCC: $H_t = D_t R_t D_t$
  (where $D_t$ is a diagonal GARCH model) and asymmetric variants;
– Kalman filters,
  random walk: $\beta_{t+1} = \beta_t + \text{noise}$
  random observations: $\beta_t = \beta_0 + \text{noise}$
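Two of the time-varying beta estimators listed in the Nieto et al. summary above, the moving-window regression and the random-walk Kalman filter, can be sketched as follows. This is only an illustration under simple assumptions: no intercept, known and constant noise variances (`state_var`, `obs_var`), and function names chosen for this example.

```python
import numpy as np

def rolling_beta(asset, market, window):
    """OLS beta on a moving window with constant weights (NaN until the window is full)."""
    betas = np.full(len(asset), np.nan)
    for t in range(window, len(asset) + 1):
        a, m = asset[t - window:t], market[t - window:t]
        betas[t - 1] = np.cov(a, m)[0, 1] / np.var(m, ddof=1)
    return betas

def kalman_beta(asset, market, state_var, obs_var, beta0=1.0, p0=1.0):
    """Scalar Kalman filter for asset_t = beta_t * market_t + noise,
    with a random-walk state beta_t = beta_{t-1} + noise."""
    beta, p = beta0, p0
    filtered = np.empty(len(asset))
    for t, (r, m) in enumerate(zip(asset, market)):
        p = p + state_var                      # predict: variance grows by the state noise
        innovation = r - beta * m              # one-step-ahead forecast error
        gain = p * m / (m * m * p + obs_var)   # Kalman gain
        beta = beta + gain * innovation        # update the state estimate
        p = (1.0 - gain * m) * p               # update its variance
        filtered[t] = beta
    return filtered
```

The ratio `state_var / obs_var` plays the same role as the window length: a larger ratio lets beta move faster but makes the estimate noisier.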

Arbitrated ensemble for time series forecasting
V. Cerqueira et al.
To combine time series forecasts, model how each model performs for different types of inputs and how this performance changes with time (the metamodel forecasts the model error).

Graph-theoretic properties of the darkweb
V. Griffith et al.
The Darkweb is only loosely connected: most sites have no outgoing links. Data from the tor2web proxy onion.link, crawled with scrapinghub.com (commercial), starting from directoryvi6plzm.onion and ahmia.fi/onions.

A divide-and-conquer framework for distributed graph clustering
W. Yang and H. Xu (2015)
The generalized stochastic block model divides the nodes into clusters and outliers; the edge probability is p for nodes inside the same cluster and q < p otherwise (edges between clusters or involving an outlier). It can be fitted, in parallel, with a divide-and-conquer approach:
– Randomly partition the nodes into groups;
– Find clusters in each of those subgraphs;
– Build a “fused” graph, whose nodes are the subgraph clusters, and with an edge between two clusters if the edge density exceeds some threshold t (q < t < p);
– Find clusters in the fused graph.

Real-time community detection in large social networks on a laptop
B. Chamberlain et al.
The minhash of a subset $A \subset [\![1, n]\!]$ is $h_\sigma(A) = \mathrm{Min}\{\sigma(a),\ a \in A\}$ for a (fixed) random permutation $\sigma \in S_n$. The Jaccard similarity can be approximated with minhashes:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} = \mathop{P}_{\sigma \in S_n}\big[h_\sigma(A) = h_\sigma(B)\big].$$
Given a large graph, the neighbours of a node can be compressed (lossily) using minhash signatures (e.g., 1000 hash functions for 1,000,000 nodes), and locality-sensitive hashing (LSH) can be used to find approximate nearest neighbours: with the Jaccard similarity, they form a weighted graph, on which we can compute (local) communities. The Jaccard similarity also helps solve coverage problems (“find the smallest number of athletes that have influence over half of Twitter”) – |A ∪ B| = |A| + |B| − |A ∩ B| links the number of neighbours of a community and of a node and their Jaccard similarity.

Querying k-truss community in large and dynamic graphs
X. Huang et al. (2014)
The k-truss of a graph G is the largest subgraph in which each edge is in at least k − 2 triangles.
A k-truss community is a maximal k-truss subgraph in which any two edges are reachable through a series of adjacent triangles.
After putting the k-truss subgraphs in a suitable index, one can retrieve the k-truss community of a vertex in linear time.

Bayesian dynamic modeling and analysis of streaming network data
X. Chen and K. Irie (2015)
To model the flow of visitors on a graph (website), consider a non-Gaussian state space model
  i: node
  t: time
  $n_{it}$: number of occupants of node i at time t
  $x_{ijt}$: flow from i to j between t − 1 and t
  $x_{0it} \sim \mathrm{Pois}(\phi_{it})$: new visitors
  $x_{i \cdot t} \sim \mathrm{Mult}(n_{i,t-1}, \theta_{i \cdot t})$: movement between nodes
  $\phi_{it}$: arrival rate
  $\theta_{ijt}$: transition probabilities, $\theta_{ijt} = \phi_{ijt} / \sum_{j'} \phi_{ij't}$
  $\phi_{ijt} = \mu_t \alpha_{ij} \beta_{jt} \gamma_{ijt}$
  $\beta_{jt}$: attractivity of node j at time t

Unpaired image-to-image translation using cycle-consistent adversarial networks
J.Y. Zhu et al.
CycleGAN translates (unpaired) images between two domains (paintings and photographs, horses and zebras, summer and winter, aerial photographs and maps, etc.) by learning two mappings, $G : X \to Y$ and $F : Y \to X$, with an adversarial loss to make the distribution of G(X) indistinguishable from that of Y, and to ensure F ◦ G ≃ id (cycle-consistency).

InfoGAN: interpretable representation learning by information maximizing generative adversarial nets
X. Chen et al.
To make the latent space of GANs more interpretable (in an ICA-like way), split the noise into an incompressible part z and a “code” c, and solve the minimax problem
$$\operatorname*{Min}_G \operatorname*{Max}_D\; V_{\mathrm{GAN}}(D, G) - \lambda I\big(c ; G(z, c)\big)$$
$$V_{\mathrm{GAN}}(D, G) = \mathop{E}_{x \sim \mathrm{Data}}\big[\log D(x)\big] + \mathop{E}_{(c,z) \sim \mathrm{Noise}}\big[\log\big(1 - D(G(z, c))\big)\big]$$
after replacing the mutual information with a (variational) lower bound.

Adversarial feature learning
J. Donahue et al. (2017)
The BiGAN has an encoder, converting the data to features, and its discriminator uses both the data and the latent representation.
[Figure: BiGAN architecture, with boxes labelled noise, generator, generated data, real data, encoder, features and discriminator.]
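The minhash identity $J(A, B) = P[h_\sigma(A) = h_\sigma(B)]$ from the Chamberlain et al. summary above can be illustrated with explicit random permutations. This is a minimal sketch for small universes; real implementations replace the permutations with hash functions, and all names here are illustrative.

```python
import random

def random_permutations(universe_size, count, seed=0):
    """A list of `count` random permutations of range(universe_size)."""
    rng = random.Random(seed)
    permutations = []
    for _ in range(count):
        perm = list(range(universe_size))
        rng.shuffle(perm)
        permutations.append(perm)
    return permutations

def minhash_signature(items, permutations):
    """One minhash h_sigma(A) = min{sigma(a), a in A} per permutation."""
    return [min(perm[a] for a in items) for perm in permutations]

def estimated_jaccard(signature_a, signature_b):
    """Fraction of permutations on which the two minhashes agree."""
    matches = sum(u == v for u, v in zip(signature_a, signature_b))
    return matches / len(signature_a)
```

With k permutations, the standard error of the estimate is about $\sqrt{J(1 - J)/k}$, so a few hundred hashes already give a usable approximation, and each neighbourhood is stored as a fixed-length signature regardless of its size.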

Texture networks: feed-forward synthesis of textures and stylized images
D. Ulyanov et al. (2016)
To generate a texture (target), find a CNN f generating similar VGG features, i.e., minimizing
$$\| \mathrm{VGG}\, f(\text{noise}) - \mathrm{VGG}\, \text{target} \|.$$
For style transfer, use
$$\| \mathrm{VGG}\, f(\text{noise}, \text{content}) - \mathrm{VGG}\, \text{target} \|,$$
where the target is fixed.

Rethinking the inception architecture for computer vision
C. Szegedy et al.
[Figure: factorized convolutions, with boxes labelled 5×5 and 3×3, 3×3; n×n and n×1, 1×n.]

X-CNN: cross-modal convolutional neural networks for sparse datasets
P. Veličković et al.
Process the input layers separately, mix them, and merge them.
[Figure: network diagram with inputs R, G, B, conv layers, FC and softmax.]

Skip-thought vectors
R. Kiros et al.
Compute sentence embeddings by reconstructing the surrounding sentences, using a seq2seq RNN with GRU units, trained on the BookCorpus dataset.

AudioSet: an ontology and human-labeled dataset for audio events
J.F. Gemmeke et al.
AudioSet is the audio equivalent of ImageNet: 600 classes, 2 million 10-second samples, from YouTube.

Unsupervised machine translation using monolingual corpora only
G. Lample et al.
Learn embeddings of two languages into the same latent space, initialized with a bilingual lexicon (only words – no sentences), and learn to reconstruct a sentence from language A from a noisy version of its embedding. Add an adversarial regularization, with a discriminator to recognize the language from the latent space. The translation loop is an example of noise.
[Figure: A_sentence and B_sentence mapped to a shared latent space, A_latent = B_latent.]

Neural architecture search with reinforcement learning
B. Zoph and Q.V. Le (2017)
Reinforcement learning can help improve existing network architectures (number of layers, dimension, filter size, stride, etc.) and devise new LSTM-like cells.

Multiplicative LSTM for sequence modelling
B. Krause et al.
Tensor RNNs have character-specific hidden-to-hidden weights
$$h_t = W_{hh}^{x_t} h_{t-1} + W_{hx} x_t.$$
Multiplicative RNNs use a sparse description of those weight matrices
$$W_{hh}^{x_t} = W_{hm} \cdot \mathrm{diag}(W_{mx} x_t) \cdot W_{mh}.$$
This idea can be combined with LSTMs (which already contain a multiplication).
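The factorization $W_{hh}^{x_t} = W_{hm} \cdot \mathrm{diag}(W_{mx} x_t) \cdot W_{mh}$ means the input-dependent matrix never has to be formed: an elementwise product is enough. Here is a minimal NumPy sketch of one (linear) recurrence step as written in the summary above; the function name and shapes are assumptions, and the usual squashing nonlinearity is left out.

```python
import numpy as np

def multiplicative_rnn_step(h_prev, x, w_hm, w_mx, w_mh, w_hx):
    """h_t = W_hm diag(W_mx x_t) W_mh h_{t-1} + W_hx x_t,
    computed without building the input-specific matrix."""
    # diag(W_mx x) applied to (W_mh h) is an elementwise product.
    m = (w_mx @ x) * (w_mh @ h_prev)
    return w_hm @ m + w_hx @ x
```

With hidden size H, input size V and intermediate size M, this stores about M(2H + V) weights instead of the V·H² of a full tensor RNN.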

Self-critical sequence training for image captioning
S.J. Rennie et al.
Reinforcement learning deals with non-differentiable rewards: it maximizes the expected reward, measured relative to a baseline to reduce minibatch variance. Those ideas can be used for the non-differentiable evaluation metrics commonly used in natural language processing (BLEU, etc.) instead of cross-entropy.

Learning options in reinforcement learning
M. Stolle and D. Precup
Use frequently-visited states as subgoals for your reinforcement learning system to easily adapt to different goals.

Google’s multilingual neural machine translation system: enabling zero-shot translation
M. Johnson et al.
The same network can translate into many languages: just add a token, at the beginning of the sentence, specifying the target language.

Effect of reward function choices in risk-averse reinforcement learning
S. Ma and J.Y. Yu
Given a Markov decision process (MDP), one can maximize the value-at-risk (VaR) or the expected shortfall (ES, CVaR) instead of the expected discounted reward.
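For the risk measures mentioned in the Ma and Yu summary above, a minimal sketch of their empirical versions on simulated discounted returns may help. The conventions are assumptions for this example: VaR is taken as the α-quantile of the return distribution (lower tail), and ES as the mean of the returns at or below it.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Sum_t gamma^t r_t for one episode."""
    return float(sum(r * gamma**t for t, r in enumerate(rewards)))

def value_at_risk(returns, alpha):
    """Empirical alpha-quantile of the returns (lower tail)."""
    return float(np.quantile(returns, alpha))

def expected_shortfall(returns, alpha):
    """Mean of the returns at or below the alpha-quantile (CVaR)."""
    returns = np.asarray(returns, dtype=float)
    threshold = value_at_risk(returns, alpha)
    return float(returns[returns <= threshold].mean())
```

A risk-averse agent would compare policies on `value_at_risk` or `expected_shortfall` of their simulated returns rather than on the mean.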


Hyperband: a novel bandit-based approach to hyperparameter optimization L. Li et al. (2016)

To tune hyperparameters, successive halving uniformly allocates resources (number of training sets, number of features, iterations) to n random configurations, evaluates their performance, throws away the bottom half, and continues, improving the precision on the performance of the remaining configurations, until there is only one left. For a given time budget B, Hyperband tries different trade-offs between n and B/n.

Learning concept embedding with combined human-machine expertise M.W. Wilber

Add triplet constraints d_ij ⩽ d_ik (from Mechanical Turk) to t-SNE.

The high-dimensional geometry of binary neural networks A.G. Anderson and C.P. Berg

In high dimensions, binarization approximately preserves directions and batch-normalized weight-activation dot products – but keep continuous weights for the first layer: it is very different.

World literature according to Wikipedia: introduction to a DBpedia-based framework C. Hube et al. (2017)

DBpedia extracts data from Wikipedia infoboxes, using hand-written (crowd-sourced) rules. The data is not as clean as we would like: for instance, there is no single, reliable way of identifying “writers”.

PathNet: evolution channels gradient descent in super neural networks C. Fernando et al.

To learn different tasks on the same network (“evolutionary dropout”): pick a random set of paths through the network; train the corresponding subnet for the first task; freeze the corresponding weights; proceed to the second task.

Deep forest: towards an alternative to deep neural networks Z.H. Zhou and J. Feng (2017)

Deep models need not be limited to neural nets: use a random forest for a classification problem; concatenate its input and its output (class probabilities) and feed them to another random forest; iterate a few times. Those layers could be convolution-like, taking patches of an image or sequence as input.

Exploring loss function topology with cyclical learning rates L.N. Smith and N. Topin (2017)

Repeatedly increasing and decreasing the learning rate (e.g., 0.1 → 1 → 0.1) allows the optimization to find several different minima, often better than a constant learning rate.

Generating focused molecule libraries for drug discovery with recurrent neural networks M.H.S. Segler et al.

Fit a recurrent neural net (3 stacked LSTM layers, 1024 dimensions, dropout=0.2) on the SMILES descriptions (1-dimensional, textual descriptions of molecules) of 1,000,000 molecules and use it to generate new molecules for further (in silico) testing.

Batch renormalization: towards reducing minibatch dependence in batch-normalized models S. Ioffe

When using batch normalization with small or non-iid batches, add a (non-learned) affine transform to account for the difference in mean and variance between the minibatch and the population.

PixelNet: representation of the pixels, by the pixels, and for the pixels A. Bansal

For pixel-level predictions (semantic segmentation, edge detection, surface normal extraction), use a spatially-invariant (convolutional) model, with multi-scale features (“hypercolumns”, i.e., connections from earlier, more detailed layers) and only sample a few (2000) pixels per image (to reduce the dependence, which could harm stochastic gradient descent).

Kernel approximation methods for speech recognition A. May et al.

Random features approximate the kernel trick. They can be selected iteratively: generate random features, fit the model, keep the best features; start again with a new set of random features. The following loss functions do not penalize incorrect forecasts as much as cross-entropy:
– entropy-penalized: $-\sum_{i,y} \bigl( 1_{y=y_i} + \beta\, p(y|x_i) \bigr) \log p(y|x_i)$;
– capped: $\sum_i \bigl( \log p(y_i|x_i) + \lambda \bigr)$;
– top-k: $\sum_{1 \leqslant i \leqslant k} \log p(y_i|x_i)$.
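To make the entropy-penalized loss of the kernel approximation summary concrete, here is a minimal NumPy sketch. It reads the formula as −∑_{i,y} (1[y = y_i] + β p(y|x_i)) log p(y|x_i), i.e., cross-entropy plus β times the entropy of the forecasts; the function names and the toy probabilities are assumptions made for illustration.

```python
import numpy as np

def cross_entropy(probs, labels):
    """-sum_i log p(y_i | x_i); probs has shape (n_samples, n_classes)."""
    idx = np.arange(len(labels))
    return float(-np.sum(np.log(probs[idx, labels])))

def entropy_penalized_loss(probs, labels, beta):
    """-sum_{i,y} (1[y = y_i] + beta * p(y|x_i)) * log p(y|x_i)."""
    onehot = np.zeros_like(probs)
    onehot[np.arange(len(labels)), labels] = 1.0
    return float(-np.sum((onehot + beta * probs) * np.log(probs)))

probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
labels = np.array([0, 1])
print(cross_entropy(probs, labels), entropy_penalized_loss(probs, labels, beta=0.5))
```

With β = 0 the two functions coincide; a positive β adds a term that is small when the forecast distribution is concentrated.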


Cosine normalization: using cosine similarity instead of dot product in neural networks L. Chunjie et al.

Replace the dot product x ↦ w · x in neural networks with the cosine similarity (i.e., correlation) $x \mapsto \dfrac{w \cdot x}{\|w\|\,\|x\|}$.

Database learning: towards a database that becomes smarter every time Y. Park et al.

The answer to each database query can help fine-tune a probabilistic model of the data, used to speed up future queries and/or provide faster approximate results.

Active convolution: learning the shape of convolution for image classification Y. Jeon and J. Kim

Active convolution units have no fixed shape and allow for fractional pixel coordinates.

Neural audio synthesis of musical notes with WaveNet autoencoders J. Engel et al.

The NSynth dataset contains 300,000 4-second notes, from 1000 instruments, from commercial sample libraries, with different pitches and velocities.

Learning character-level compositionality with visual features F. Liu et al.

For alphabetical languages, character-level models can understand word morphology and deal with unknown words. For more complex writing systems (Chinese, Japanese, Korean), convert the characters to 36 × 36 images and feed them to a CNN + GRU-CNN.

A simple neural network module for relational learning A. Santoro et al.

Relation networks (RN) are built from modules of the form $x \longmapsto f_\phi\Bigl( \sum_{i,j} g_\theta(x_i, x_j) \Bigr)$, where f and g are multi-layer perceptrons (but it is the same g_θ for all pairs).

Tacotron: towards end-to-end speech synthesis Y. Wang et al.

End-to-end speech synthesis by combining convolution (n-grams), batch normalization, residual connections, highway networks, bidirectional GRUs and attention to produce spectrogram frames.

A study of complex deep learning networks on high-performance, neuromorphic and quantum computers T.E. Potok et al.

Adiabatic quantum computers can fit (non-restricted, i.e., non-bipartite) Boltzmann machines – networks with intra-layer connections are becoming tractable. Neuromorphic computers (spiking neural nets, SNN), built with memristors, may make computations more efficient.

Opening the black box of deep neural networks via information R. Shwartz-Ziv and N. Tishby

Representing a learning (deep) neural net in the information plane (I(input; hidden), I(hidden; output)), where the mutual information I(X, Y) = KL(p_{x,y} ∥ p_x p_y) measures the information the input variable X has on the output Y, reveals two phases:
– Compressing the information, i.e., increasing I(hidden; output) while preserving I(input; hidden);
– Discarding irrelevant information, i.e., decreasing I(input; hidden), which helps generalization.

Fast algorithms for convolutional neural networks A. Lavin and S. Gray

The Winograd minimal filtering algorithm computes 1- or 2-dimensional filters with the minimal number of multiplications – for the small filters used in CNNs, it is faster than the FFT.

HyperNetworks D. Ha et al.

Hypernetworks are networks generating the weights of another network; they can be trained end-to-end with backpropagation. They can generate non-shared LSTM (or CNN) weights.

Deep feature interpolation for image content changes P. Upchurch et al.

Deep neural networks can describe images with features in a linear (not curved) space: linear interpolation in a direction (age, gender, glasses, smile, etc.) produces high-resolution image transforms.

Fitted learning: models with awareness of their limits N. Kardan and K.O. Stanley

For classification problems, use k neurons for each class in the output layer (instead of 1), train to set their values to 1, but use their product at test time – this limits overly-confident generalizations.
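The effect of taking a product at test time can be seen on a toy example. The sketch below only illustrates the idea as worded in the summary: it assumes that each of the k output units of a class produces a value in [0, 1] (for instance through a sigmoid), which is an assumption of this example and not necessarily the construction used in the paper.

```python
import numpy as np

def class_scores(unit_outputs):
    """unit_outputs: array (n_classes, k) of unit activations in [0, 1].

    The test-time score of a class is the product of its k units, so a class
    is only scored highly when all of its units agree.
    """
    return np.prod(np.asarray(unit_outputs, dtype=float), axis=1)

# Near the training data the k units of the correct class agree ...
familiar = class_scores([[0.9, 0.9, 0.9], [0.1, 0.1, 0.1]])
# ... far from it they need not, and the product collapses.
unfamiliar = class_scores([[0.9, 0.9, 0.1], [0.1, 0.1, 0.1]])
print(familiar, unfamiliar)
```

A single disagreeing unit is enough to pull the score of a class down, which is how the product limits confident predictions away from the training data.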

Wide and deep learning for recommender systems H.T. Cheng et al.

Unsupervised learning for physical interaction through video prediction C. Finn et al.

Predict motion, at the pixel level, assuming that pixels do not move large distances and that pixels are only transformed in a small number of ways (pixels belonging to the same object move in the same way).

Residual networks behave like ensembles of relatively shallow networks A. Veit et al.

Aggregated residual transformations for deep neural networks S. Xie

[Figure: block built from 1×1 and 3×3 convolutions, aggregated with a sum (+).]

Input convex neural networks B. Amos et al.

Neural nets can be constrained to be convex functions of their input, by requiring the weights to be positive (skip connections are useful for the network to remain expressive),

[Figure: network with input y and layers z1, z2, z3, ..., zk.]

or of part of the input, z_{i+1} ∼ z_i ⊙ u_i + z_i ⊙ y + u_i, by requiring part of the input (the z ⊙ u term) to be positive.

[Figure: network with inputs x and y, layers u1, u2, ..., u_{k−1} and z1, z2, ..., z_{k−1}, zk.]

Such networks can be used in optimization problems, Argmin_y f(x, y; θ).

Aggregated residual transformations for deep neural networks S. Xie et al.

Put the non-linearity (purple) before or after aggregating.

[Figure: the two variants, side by side.]

Self-normalizing neural networks G. Klambauer

No batch normalization is needed for the activation function

$\operatorname{selu}(x) = \lambda \begin{cases} x & \text{if } x > 0 \\ \alpha(e^x - 1) & \text{if } x \leqslant 0. \end{cases}$

Memory networks J. Weston et al. (2015)

Add (classical) memory cells to a neural network: at each time step, the neural network decides which cells to update and how, using their contents together with the new data.

DeepCoder: learning to write programs M. Balog et al.

Inductive program synthesis (IPS) for a simple array manipulation language (only linear control flow: head, last, take, drop, access, min, max, rev, sort, sum; map, filter, count, zip, scan; (+1), (−1), (×2), (/2), (×(−1)), (^2), (> 0), (< 0), odd, even). The neural network is only used to guide more traditional approaches: stochastic local search, depth-first search, etc.

mixup: beyond empirical risk minimization H. Zhang et al.

To help the network generalize better to examples slightly outside the training distribution, augment the data with convex combinations (1 − λ)(x1, y1) + λ(x2, y2) (use one-hot-encoding for y).
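The mixup augmentation described above fits in a few lines. This is a minimal sketch on NumPy arrays; the function names are hypothetical, and drawing λ from a Beta(α, α) distribution and pairing examples through a random permutation of the batch are choices made here that the summary does not spell out.

```python
import numpy as np

def one_hot(labels, n_classes):
    """Encode integer labels as one-hot rows."""
    out = np.zeros((len(labels), n_classes))
    out[np.arange(len(labels)), labels] = 1.0
    return out

def mixup_batch(x, y_onehot, alpha, rng):
    """Replace each example with (1 - lam) * (x1, y1) + lam * (x2, y2)."""
    lam = rng.beta(alpha, alpha)          # assumed sampling scheme for lambda
    partner = rng.permutation(len(x))     # random partner for each example
    x_mix = (1.0 - lam) * x + lam * x[partner]
    y_mix = (1.0 - lam) * y_onehot + lam * y_onehot[partner]
    return x_mix, y_mix, lam

rng = np.random.default_rng(0)
x = np.arange(12, dtype=float).reshape(4, 3)
y = one_hot(np.array([0, 1, 2, 1]), n_classes=3)
x_mix, y_mix, lam = mixup_batch(x, y, alpha=0.2, rng=rng)
print(lam, y_mix)
```

The mixed labels are no longer one-hot, but each row still sums to 1, so they can be used directly as soft targets in a cross-entropy loss.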


Coherent line drawing H. Kang et al.

The edge tangent flow of an image I is computed iteratively:

$t_x \leftarrow \nabla I(x)^\perp$

$t_x \leftarrow \dfrac{1}{k} \sum_{\|x-y\| \leqslant r} \sigma\bigl( \|\nabla I(x)\| - \|\nabla I(y)\| \bigr)\, (t_x \cdot t_y)\, t_y.$

Applying a difference of Gaussians (DoG) in the direction orthogonal to the edge flow gives a nice line drawing.

Simplex noise demystified S. Gustavson (2005)

Simplex noise is similar to Perlin noise, but replaces hypercubes with simplices, which facilitates interpolation.

SATGraf: visualizing the evolution of SAT formula structures in solvers Z. Newsham et al. (2014)

To visualize how SAT solvers transform a boolean formula, look at the clause-variable incidence graph – its community structure is related to the running time.

Network risk and financial crisis Y.A. Chen (2015)

Build a parametric portfolio from the Katz centrality x = γAx + β1 of the weighted network defined by the variance of residual returns. Compute the beta of each stock wrt that portfolio; decompose the idiosyncratic volatility into “network volatility” and a remainder.

Media network and return predictability L. Guo and Y. Tao (2017)

The connection score of two stocks i, j is

$CS_{ij} = \sum_k \text{tone}_{ik} \times \text{tone}_{jk}$

where the sum is over all news k mentioning both i and j and the tone is in [−1, 1]. The media connection index

$\dfrac{\sum_{i \neq j} CS_{ij}}{\sum_{i,j} CS_{ij}}$

can be used as a risk measure.

Deep stock representation learning: from candlestick charts to investment decisions G. Hu et al.

Feed candlestick charts (i.e., images), for the FTSE 100 stocks, to a convolutional autoencoder; cluster them using the resulting latent representation, with a community detection algorithm (no need to specify the number of clusters); invest in the stocks with the largest Sharpe ratio in the cluster.

Local explosion modelling by noncausal processes C. Gouriéroux and J.M. Zakoian (2016)

Non-causal AR(1) processes can model bubbles growing and bursting; aggregating them allows for different rates of increase (e.g., a continuous distribution); one can also add a (causal, Gaussian) AR(1).

The myths and limits of passive hedge fund replication N. Amenc et al. (2007)

There are two approaches to replicate hedge funds (neither works well):
– Compute the exposure of hedge fund returns to a set of pre-selected factors;
– Build an option, on a reference asset (anything, e.g., S&P 500) and choose the payoff function giving the desired payoff distribution (“moment matching”). The price of that option can be used as a measure of performance. This “replicating” strategy has comparable returns, but not at the same times. The first moment is not captured: the returns are lower. [There is a bivariate, copula-based variant but it only models (portfolio, underlying), not (portfolio, fund).]

Possible improvements to capture nonlinearities, dynamic trading, non-stationarity and time series properties include:
– Heuristic dynamic (rule-based) strategies;
– Instrumental variables for time- and state-dependence;
– Future factor loading forecasts;
– Time-varying models (Kalman filter, particle filter, Markov switching);
– Ad hoc options portfolios (e.g., ATM and OTM puts and calls on S&P 500);
– Statistical techniques to estimate the options (and underlyings) needed.


An alternative approach to alternative beta T. Roncalli and J. Teïletche (2007)

Use the Kalman filter to estimate time-dependent factor exposures. It also decomposes the returns (for one period) into traditional (i.e., average) beta, alternative (difference with the average) beta and alpha (residual, intercept).
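A Kalman filter for time-dependent factor exposures can be sketched as follows. The state-space model is an assumption of this example (exposures following a random walk, returns linear in the factors with Gaussian noise), as are the function name and the variance parameters; the paper's exact specification may differ.

```python
import numpy as np

def kalman_betas(returns, factors, state_var=1e-6, obs_var=1e-4, p0=1.0):
    """Filtered exposures beta_t for r_t = x_t . beta_t + noise, beta_t a random walk."""
    n_obs, n_factors = factors.shape
    beta = np.zeros(n_factors)                 # current state estimate
    P = p0 * np.eye(n_factors)                 # current state covariance
    Q = state_var * np.eye(n_factors)          # random-walk innovation covariance
    path = np.zeros((n_obs, n_factors))
    for t in range(n_obs):
        x = factors[t]
        P = P + Q                              # prediction step (random walk)
        innovation = returns[t] - x @ beta     # forecast error on the return
        S = x @ P @ x + obs_var                # variance of the forecast error
        K = P @ x / S                          # Kalman gain
        beta = beta + K * innovation           # update step
        P = P - np.outer(K, x) @ P
        path[t] = beta
    return path

rng = np.random.default_rng(1)
factors = rng.standard_normal((500, 2))
returns = factors @ np.array([0.5, -0.2]) + 0.01 * rng.standard_normal(500)
path = kalman_betas(returns, factors)
traditional_beta = path.mean(axis=0)           # average exposure
alternative_beta = path - traditional_beta     # deviation from the average
print(path[-1], traditional_beta)
```

The last two lines mirror the decomposition mentioned in the summary: the average exposure plays the role of the traditional beta, and the time-varying deviation that of the alternative beta.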


Tracking problems, hedge fund replication and alternative beta T. Roncalli and G. Weisang (2008)

The Kalman filter provides (Gaussian, linear) factor exposures. The unscented Kalman filter (UKF) allows for non-Gaussian innovations. One can add non-linear assets to the list of factors; if they depend on a parameter (say, a strike), it can be modeled as a random walk, as part of the hidden state, but the model becomes nonlinear – use a particle filter (PF, sequential Monte Carlo, SMC). Hedge fund performance can be decomposed into: traditional beta, alternative beta, lag (alternative beta one period ahead), alternative alpha (the rest – this includes illiquid assets).

Hedge fund replication: a model combination approach M.S. O’Doherty et al. (2016)

Instead of directly estimating a linear replication model, estimate separate replication models, using domestic equities (S&P 500, S&P Midcap 400), international equities (FTSE 100, Nikkei 225), currencies (GBP, CHF, JPY) and commodities (gold, corn, oil) and average using “log-scale” weights (BMA).

Regular(ized) hedge fund clones D. Giamouridis and S. Paterlini (2009)

Use the lasso for factor-based hedge fund replication [the resulting weights are biased towards zero].

Can hedge fund returns be replicated? The linear case J. Hasanhodzic and W. Lo

Replicate individual hedge funds using in-sample or 24-month rolling window regressions against USD, AA bonds, BAA−treasury, S&P 500, commodities (GSCI) and VIX monthly changes, with no intercept, weights summing to 1, and using leverage to match the target’s volatility. The clones have lower autocorrelation (a proxy for illiquidity).

A primer on alternative risk premia R. Hamdan et al. (2016)

Five concerns with the five-factor model D. Blitz et al.

The two factors added to the Fama-French model (market, size, value, profitability, investment) are not robust, lack an economic rationale, and still ignore momentum.

Bayesian Poisson Tucker decomposition for learning the structure of international relations A. Schein et al. (2016)

Model international relations (country i takes action a on country j at time t) using country-community, action-topic and time-regime factor matrices:

$y^{(t)}_{i \xrightarrow{a} j} \sim \text{Poisson}\Bigl( \sum_{c} \sum_{d} \sum_{k} \sum_{r} \theta_{ic}\, \theta_{jd}\, \phi_{ak}\, \psi_{tr}\, \lambda^{(r)}_{c \xrightarrow{k} d} \Bigr)$

with Gamma priors, estimated with MCMC.

Hedge fund replication: putting the pieces together V. Weber and F. Peres (2013)

Do not only use static factors, but also dynamic ones (carry, momentum). Correct hedge fund returns for illiquidity (Geltner):

$\hat R_t = \dfrac{R_t - \alpha R_{t-1}}{1 - \alpha}$, where $R_t$ = observed returns and $\alpha = \operatorname{Cor}(R_t, R_{t-1})$.
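The illiquidity correction above is easy to try on simulated data. In the sketch below, the observed returns are built by smoothing an iid series with a hypothetical smoothing coefficient of 0.4; the function names and the simulation are assumptions of the example, only the correction formula comes from the text.

```python
import numpy as np

def lag1_autocorrelation(r):
    """Sample correlation between R_t and R_{t-1}."""
    r = np.asarray(r, dtype=float)
    return float(np.corrcoef(r[1:], r[:-1])[0, 1])

def unsmooth(r):
    """Apply R_hat_t = (R_t - alpha R_{t-1}) / (1 - alpha), alpha = Cor(R_t, R_{t-1})."""
    r = np.asarray(r, dtype=float)
    alpha = lag1_autocorrelation(r)
    return (r[1:] - alpha * r[:-1]) / (1.0 - alpha), alpha

rng = np.random.default_rng(0)
true_returns = rng.standard_normal(5000)
observed = np.zeros_like(true_returns)
for t in range(1, len(true_returns)):
    # Hypothetical smoothing: reported return = mix of the true and the previous reported return.
    observed[t] = 0.6 * true_returns[t] + 0.4 * observed[t - 1]

corrected, alpha = unsmooth(observed)
print(alpha, lag1_autocorrelation(corrected), observed.std(), corrected.std())
```

On this example the corrected series shows almost no autocorrelation and a higher volatility than the reported one, which is the point of the correction: smoothed returns understate risk.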

Send in the clones? Hedge fund replication using futures contracts N.P.B. Bollen and G.S. Fisher (2012)

Replicating hedge fund indices [many people invest in funds of funds] with futures (USD, 10-year, gold, oil, S&P 500) using 1 to 4 year rolling window regressions gives poor performance and shows no sign of market timing, $r \sim \sum f_i + \sum 1_{f_i > 0} f_i$; the clone-target correlation is high, but mostly due to market exposure.

A gentle introduction to value at risk L. Ballotta and G. Fusai (2017)

The idiosyncratic momentum anomaly D. Blitz et al. (2017)

Use idiosyncratic momentum (rather than total momentum) to forecast future returns.

Deep learning for forecasting stock returns in the cross-section M. Abe and H. Nakayama (2018)

Deep neural networks, with 1 to 6 fully-connected hidden layers, 50% dropout, tanh activations, to forecast (uniformized) 1-month ahead stock returns for MSCI Japan, from 25 (uniformized) investment factors, both current and past values (3, 6, 9 and 12 months), trained for 100 epochs on universe-sized minibatches, on a 10-year window.

Proofs are programs: 19th century logic and 21st century computing P. Wadler (2000)

Frege’s logic considers judgments (e.g., ⊢ A), axioms (e.g., ⊢ A → A) and deduction rules, e.g.,

$\dfrac{\vdash B \to A \qquad \vdash B}{\vdash A}.$

Gentzen’s logic adds assumptions to the judgments (e.g., B1, ..., Bn ⊢ A instead of ⊢ A) and to the deduction rules, e.g.,

$\dfrac{\Gamma \vdash B \to A \qquad \Delta \vdash B}{\Gamma, \Delta \vdash A}.$

While this may look more complex (in particular, the three if-then notions: → for propositions, ⊢ for judgments and $\frac{\,\cdot\,}{\,\cdot\,}$ in the inference rules), the deduction rules end up simpler and more intuitive. Typed λ-calculus has typing judgments (e.g., x1 : B1, ..., xn : Bn ⊢ t : A, meaning that if the xi’s have types Bi, then t has type A) and the same deduction rules as Gentzen’s system, e.g.,

$\dfrac{\Gamma \vdash t : B \to A \qquad \Delta \vdash u : B}{\Gamma, \Delta \vdash t(u) : A}.$

(Unless we explicitly add recursion, all functions terminate and the calculus is not Turing-complete.) Term reduction corresponds to proof simplification (Curry-Howard correspondence).

Physics, topology, logic and computation: a Rosetta stone J.C. Baez and M.M. Stay (2009)

Variants of monoidal categories
– braided;
– symmetric;
– cartesian: ⊗ = ×;
– closed: ⊗ ⊣ Hom;
– compact: Hom(X, Y) = X* ⊗ Y;
– dagger (with a functor † : C → C^op, the identity on Ob C, with f†† = f)
appear in many domains, e.g.,
– cobordisms;
– tangles in R^n;
– finite-dimensional Hilbert spaces (the inner product gives a dagger structure);
– finite-dimensional representations of a Lie (resp. quantum) group form a compact symmetric (braided) monoidal category.

Propositional calculus forms a monoidal category with propositions as objects, a single morphism X → Y whenever X ⇒ Y, and ⊗ = ∧, Hom(X, Y) = X ⇒ Y. It is a Heyting algebra, i.e., a preorder C such that C and C^op are cartesian closed.

Linear logic replaces ∧, ⇒, ⊤ with ⊗, ⊸ and I and only requires a symmetric monoidal category, not a cartesian closed one: there are no natural morphisms X → X ⊗ X or X → I, i.e., we cannot duplicate or delete information (as in quantum physics or chemistry).

Alternatively, one can consider propositions as objects and proofs as morphisms, where the proofs are obtained from inference rules, i.e., natural or dinatural transformations, most of which come from the monoidal/braided/cartesian structure.

In λ-calculus, everything is a function. For instance, Church numerals are

$0 = \lambda f \mapsto \lambda x \mapsto x \qquad (f \mapsto \mathrm{id})$
$1 = \lambda f \mapsto \lambda x \mapsto f(x) \qquad (f \mapsto f)$
$2 = \lambda f \mapsto \lambda x \mapsto f(f(x)) \qquad (f \mapsto f^2)$

and addition and multiplication are

$\text{plus} = \lambda a\, \lambda b\, \lambda f\, \lambda x\; a(f)\bigl( b(f)(x) \bigr)$
$\text{times} = \lambda a\, \lambda b\, \lambda f\; a(b(f)).$

(I write λx ↦ t or λx t instead of (λx.t) and dispense with many of the parentheses.)

Typed λ-calculus, where every term has a type, forms a cartesian closed category with types as objects and equivalence classes of terms as morphisms (if we do not focus on what is computed but how it is computed, we get a 2-category, with terms as morphisms and equivalence classes of rewrites as 2-morphisms).

Linear type theories correspond to symmetric monoidal categories. They can also be described by combinators. The usual ones

$I = \lambda x \mapsto x$
$K = \lambda x \mapsto \lambda y \mapsto x$
$S = \lambda x \mapsto \lambda y \mapsto \lambda z \mapsto x(z)\bigl( y(z) \bigr)$

will not do, because K forgets information and S duplicates it. Each term t is equivalent to cp(t) vp(t), where cp(t) is the combinator part and vp(t) the variable part.

Internal set theory E. Nelson (2002)

There are two ways of presenting non-standard analysis: either explicitly, by constructing the hyperreals, or by tweaking the ZFC axioms (internal set theory):
– The subset axiom, to define {x ∈ X : A(x)}, is only valid for internal formulas A (those which do not use the “standard” predicate);
– Transfer: $\forall^{st} t_1 \cdots \forall^{st} t_n\, (\forall^{st} x\, A \Leftrightarrow \forall x\, A)$, for any internal formula A;
– Idealization: $\forall^{st\,fin} X\ \exists y\ \forall x \in X\ A \Leftrightarrow \exists y\ \forall^{st} x\ A$ (for A internal);
– Standardization: $\forall^{st} X\ \exists^{st} Y\ \forall^{st} z\ (z \in Y \Leftrightarrow z \in X \wedge A(z))$ for any formula A (internal or external), which allows us to define $^{S}\{z \in X : A(z)\}$, the standard set whose standard elements are the standard elements z of X such that A(z).

The “standard” predicate is no longer limited to real numbers, but applies to sets, functions, topological spaces, points in topological spaces, etc.


Includes exercises. Check Radically elementary probability theory, by the same author, on probability theory with non-standard finite probability spaces.

Division by three P.G. Doyle and J.H. Conway (1994)

Proving A × 3 ≈ B × 3 ⟹ A ≈ B does not require the axiom of choice.

Developing bug-free machine learning systems with formal mathematics S. Selsam et al. (2017)

The Lean theorem prover tries to combine program verification (à la Coq) and formal mathematics (à la Mizar). Here, it is used for a provably correct gradient descent implementation.

The ring of algebraic functions on persistence bar codes A. Adcock et al. (2013)

To turn a barcode [x1, y1], ..., [xn, yn] into features, for machine learning, use multisymmetric polynomials $p_{ab} = \sum_i x_i^a y_i^b$, for instance

$\sum_i x_i (y_i - x_i)$
$\sum_i (y_{\max} - y_i)(y_i - x_i)$
$\sum_i x_i^2 (y_i - x_i)^4$
$\sum_i (y_{\max} - y_i)^2 (y_i - x_i)^4.$

Barcodes form an affine (semi)algebraic ind-variety, with ring of algebraic functions Λ2 = lim→ k[x1, y1, ..., xn, yn]^{S_n} (multisymmetric polynomials), quotiented by an ideal D = lim→ D_n corresponding to the identification of intervals of length zero. Λ2 is freely generated by the multisymmetric power sums $p_{ab} = \sum_i x_i^a y_i^b$, and D by the $p_{a+1,b} - p_{a,b+1}$. An alternative is to use the bottleneck or Wasserstein distance.

Persistent homology H. Edelsbrunner and D. Morozov (Handbook of discrete and computational geometry, 2017)

The implementation details are not that complicated.

High-dimensional topological data analysis F. Chazal (Handbook of discrete and computational geometry, 2017)

1. A subset X ⊂ R^n can be represented by its distance function

$\phi : \mathbb{R}^n \to \mathbb{R}_+, \quad x \mapsto d(x, X)$

(it is proper and $x \mapsto \|x\|^2 - \phi(x)^2$ is convex). To increase the robustness to outliers, consider measures instead of subsets, e.g., $\mu = \sum_{x \in X} \delta_x$ if X is discrete, and the functions

$\delta_{\mu,\ell} : \mathbb{R}^n \to \mathbb{R}_+, \quad x \mapsto \inf\{ r > 0 : \mu(B(x, r)) > \ell \}.$

They are not smooth, but the distance to measure $d_{\mu,m}$ is

$d^2_{\mu,m} : \mathbb{R}^n \to \mathbb{R}_+, \quad x \mapsto \dfrac{1}{m} \int_0^m \delta_{\mu,\ell}(x)^2 \, d\ell$

(it is the average of the squared distances from x to its k nearest neighbours). If two measures µ and ν are close (for the Wasserstein distance), then the sublevel sets of their distance-to-measure functions are homotopy equivalent (reconstruction theorem).

2. For many tasks, full reconstruction is not needed and a few homotopy or homology invariants suffice. In particular, the nerve of an open cover $X = \bigcup_i U_i$ all of whose finite intersections are contractible is homotopy equivalent to X (nerve theorem).

3. Persistent homology keeps track, not only of the dimensions of the homology groups (Betti numbers) of $\bigcup_{x \in X} B(x, \varepsilon)$ as ε increases, but also of individual connected components, cycles and cavities – the result, the (birth_i, death_i) pairs, forms the persistence diagram. The bottleneck distance between two persistence diagrams D and D′ is the smallest δ > 0 for which we can match all points of D at distance at least δ from the diagonal with a point of D′ at distance at most δ. It is bounded by twice the Gromov-Hausdorff distance.

4. The Mapper algorithm visualizes a dataset X ⊂ R^n and a function f : X → R by choosing an open cover $\bigcup I_i$ of f(X) ⊂ R by intervals, clustering the points in each $f^{-1}(I_i)$ into a (finite) cover, and plotting the 1-skeleton of the corresponding cover of X.

Topological analysis of financial time series: landscapes of crashes M. Gidea and Y. Katz (2017)

The persistence landscape is obtained from the (birth, death) persistence diagram by rotating it by 45°, associating a piecewise linear function to each birth-death pair, and considering the sequence of functions given by the kth largest values. While persistence diagrams form a (non-complete) metric space for the Wasserstein distance, persistence landscapes form a Banach space for the L^p norm. (Measure theory on an infinite dimensional Banach space is a little more complicated: for instance, the notions of weak (Pettis) and strong (Bochner) integrability differ.)

[Figure: persistence diagram, with axes birth and death.]

The L1 or L2 norm of the persistence landscape of the daily returns of a few indices (S&P 500, DJIA, Nasdaq, Russell 2000) on a 50-day window may help forecast crises.
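Given a persistence diagram, the landscape and its norms can be computed directly. The sketch below starts from a list of (birth, death) pairs (computing the diagram itself from the index returns is not covered), uses tent functions min(t − b, d − t) floored at zero, and approximates the integral by a Riemann sum on a regular grid; the function names and the toy diagram are assumptions of the example.

```python
import numpy as np

def persistence_landscape(diagram, grid, n_layers):
    """Rows are lambda_1, ..., lambda_{n_layers} evaluated on the grid."""
    diagram = np.asarray(diagram, dtype=float)
    births = diagram[:, [0]]
    deaths = diagram[:, [1]]
    # Tent function of each (birth, death) pair, evaluated on the grid.
    tents = np.maximum(0.0, np.minimum(grid[None, :] - births, deaths - grid[None, :]))
    # Sort each column in decreasing order: row k holds the (k+1)-th largest value.
    tents = -np.sort(-tents, axis=0)
    landscape = np.zeros((n_layers, len(grid)))
    m = min(n_layers, tents.shape[0])
    landscape[:m] = tents[:m]
    return landscape

def landscape_norm(landscape, grid, p):
    """L^p norm over all layers, using a Riemann sum on a regular grid."""
    step = grid[1] - grid[0]
    return float((np.sum(np.abs(landscape) ** p) * step) ** (1.0 / p))

grid = np.linspace(0.0, 2.0, 201)
diagram = [(0.0, 2.0), (0.5, 1.5)]          # toy diagram, two nested intervals
landscape = persistence_landscape(diagram, grid, n_layers=3)
print(landscape_norm(landscape, grid, 1), landscape_norm(landscape, grid, 2))
```

In the setting of the paper, such a norm would be recomputed on each rolling window, giving a time series that can be monitored.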

Statistical topological data analysis using persistence landscapes P. Bubenik (2015)

Initial paper on persistence landscapes.

A persistence landscapes toolbox for topological statistics P. Bubenik and P. Dłotka

Standalone, file-based C++ implementation (computing persistence landscapes, their averages, their L^p distances). Also check the TDA R package.

Topology-based data analysis identifies a subgroup of breast cancers with a unique mutational profile and excellent survival M. Nicolau et al. (2011)

The mapper algorithm takes a cloud of points X and a (Morse) function f : X → R (e.g., a random projection, a measure of density, a distance to a baseline, an eccentricity measure), applies a clustering algorithm to the points in each f_a = f^{−1}(]a − ε, a + ε[), and builds a graph with those clusters as nodes and edges between clusters x of f_a and y of f_b if x ∩ y ≠ ∅ (partial clustering, discrete Morse theory). In R, check TDAmapper. It is another way of “filtering” a cloud of points into a graph, keeping most of its features (loops, appendages, etc.).

Topological methods for the analysis of high dimensional datasets and 3D object recognition G. Singh et al.

In the Mapper algorithm, using several Morse functions defines a simplicial complex instead of a graph – and letting ε vary leads to persistent homology.

The Topology Toolkit J. Tierny et al.

Topological data analysis (TDA) is not a single algorithm (Mapper, or persistent homology, depending on who you ask), but a smorgasbord of techniques. TTK is a library (C++, BSD, accessible from Python) and graphical toolkit (built on Paraview) using low-dimensional TDA (mostly discrete Morse theory) to visualize scientific data (PDE solutions and other scalar fields).

The data is given as a Morse function f : M → R, i.e., a scalar field (or several of them); in contrast with high-dimensional TDA, the PL manifold M is often uninteresting (R² or R³). TDA studies the transitions of the Betti numbers β_k f^{−1}{x} or β_k f^{−1}]−∞, x] or β_k f^{−1}[x, +∞[ as x varies.

A point p is critical if one of

Link+(p) = {q ∈ Neigh(p) : f(q) > f(p)}
Link−(p) = {q ∈ Neigh(p) : f(q) < f(p)}

is not simply connected. Critical points can be paired into birth-death pairs, defining a persistence diagram. To simplify f, one can prune short-lived pairs (those whose persistence, f(death) − f(birth), is small). The persistence curve is the plot of the number of critical points with persistence at least x, as x increases.

The Reeb graph segments M into regions where π_0 f^{−1}(c) does not change.

The Morse complex is defined by the basins of attraction of f.

Subsampling methods for persistent homology F. Chazal et al. (2015)

Compute the persistence landscape on several subsamples and average them.

Equilibrated adaptive learning rates for non-convex optimization Y.N. Dauphin et al.

In the presence of different curvatures, in particular around a saddle point, gradient descent oscillates a lot. Preconditioning uses a change of variables,

$\theta = D^{-1/2} \hat\theta$
$\hat f(\hat\theta) = f(D^{-1/2} \hat\theta) = f(\theta)$
$\nabla \hat f = D^{-1/2} \nabla f$
$\nabla^2 \hat f = (D^{-1/2})' (\nabla^2 f) D^{-1/2}$
$\theta_{t+1} \leftarrow \theta_t - \eta D^{-1} \nabla f(\theta_t).$

The absolute Hessian (defined from the eigendecomposition of the Hessian, by taking the absolute value of the eigenvalues) gives a perfect conditioner (the curvature is the same in all directions), D = |H|, but it is computationally expensive.

The Jacobi conditioner is the diagonal of the Hessian. It could be estimated as

$\operatorname{diag} H = \mathop{E}_{v \sim \text{Unif}(\{\pm 1\}^n)} [v \odot Hv].$

AdaGrad, AdaDelta, RMSProp are other diagonal conditioners.

The equilibrium conditioner is the diagonal of the absolute Hessian; it can be defined as the L2 norm of the rows of the Hessian; it can be estimated as

$\|H_{i\cdot}\|^2 = \mathop{E}_{v_i \sim N(0,1)} [(Hv)_i^2].$
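Both estimators only need Hessian-vector products, which is what makes them practical. Here is a minimal Monte Carlo sketch; the function names are hypothetical, and the Hessian-vector product is supplied as a plain function (in practice it would come from automatic differentiation). The small explicit matrix is only there to check the estimates.

```python
import numpy as np

def jacobi_estimate(hvp, n, n_samples, rng):
    """Estimate diag(H) as the average of v * (H v), with v uniform on {-1, +1}^n."""
    total = np.zeros(n)
    for _ in range(n_samples):
        v = rng.choice([-1.0, 1.0], size=n)
        total += v * hvp(v)
    return total / n_samples

def equilibration_estimate(hvp, n, n_samples, rng):
    """Estimate the row norms ||H_i.|| as sqrt of the average of (H v)^2, v ~ N(0, I)."""
    total = np.zeros(n)
    for _ in range(n_samples):
        v = rng.standard_normal(n)
        total += hvp(v) ** 2
    return np.sqrt(total / n_samples)

H = np.array([[2.0, 1.0],
              [1.0, -3.0]])        # indefinite Hessian, as near a saddle point
hvp = lambda v: H @ v              # stand-in for an autodiff Hessian-vector product
rng = np.random.default_rng(0)
jacobi = jacobi_estimate(hvp, 2, 20000, rng)
equilibration = equilibration_estimate(hvp, 2, 20000, rng)
print(jacobi, equilibration)
```

Note that the Jacobi estimate can be negative (here the second diagonal entry is −3), whereas the equilibration estimate is always non-negative, which makes it usable as a scaling even with an indefinite Hessian.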


A smart stochastic algorithm for nonconvex optimization with applications to robust machine learning A. Aravkin and D. Davis

Machine learning minimizes $\sum_{1 \leqslant i \leqslant n} f_i(x)$. Trimmed machine learning minimizes $\sum_{1 \leqslant i \leqslant k} f_{(i)}(x)$, where the $f_{(i)}(x)$ are the order statistics of $(f_i(x))_i$. The problem can be formulated as

Find $w \in \mathbb{R}^n$, $x \in H$
To minimize $\dfrac{1}{n} \sum_i w_i f_i(x) + r_1(w) + r_2(x)$
Where $r_1 = 1_{\Delta_h}$, $\Delta_h = \{ w \in [0,1]^n : w'1 \leqslant h \}$

and solved by proximal alternating minimization (w ← prox_{r1}(···), x ← prox_{r2}(···)). With mini-batches, use variance reduction (SVRG).

An entropy search portfolio for Bayesian optimization B. Shahriari et al.

Use a portfolio of acquisition functions and follow that giving the largest decrease in the uncertainty of the location of the minimizer (or sample accordingly).

A theoretical analysis of optimization by Gaussian continuation H. Mobahi and J.W. Fisher (2015)

Optimization by continuation [or mollifying] starts to optimize an easy simplification (e.g., a convex relaxation) of the problem, and progressively transforms it into the actual task.

Model calibration with neural networks A. Hernandez (2016)

Given enough training data, optimization can be replaced with a trained neural net.

The irace package: iterated racing for automatic algorithm configuration M. López-Ibáñez et al.

Implementation of the iterated F-race algorithm, with a few extensions (other statistical tests, restarts, etc.).

Pymanopt: a Python toolbox for optimization on manifolds using automatic differentiation J. Townsend et al. (2016)

Automatic differentiation can also compute (first and second) derivatives on Riemannian manifolds.

Learning to learn for global optimization of black box functions Y. Chen et al.

A recurrent neural network (RNN: LSTM or differentiable computer) can learn to reproduce what a Bayesian blackbox optimizer does (it uses its memory to store information about the previous function evaluations). It is much faster than GP-based algorithms, and competitive in case of model mis-specification.

Understanding deep learning requires rethinking generalization C. Zhang et al.

We still do not know why deep neural networks work well: VC dimension, Rademacher complexity, uniform stability, explicit/implicit regularization, model capacity fail to explain why they generalize so well in some cases (real data) but not others.

A tutorial on Bayesian optimization of expensive cost functions, with applications to active user modeling and hierarchical reinforcement learning E. Brochu et al. (2010)

Guaranteed non-convex optimization: submodular maximization over continuous domains A.A. Bian et al. (2016)

The notion of submodularity can be generalized to continuous domains: for f : R^n → R, the following conditions are equivalent.
(i) f(x) + f(y) ⩾ f(x ∧ y) + f(x ∨ y), where ∧ and ∨ are the coordinate-wise min and max;
(ii) $\forall x\ \forall i \neq j\quad \dfrac{\partial^2 f}{\partial x_i \partial x_j} \leqslant 0$;
(iii) f(λe_i + a) − f(a) ⩾ f(λe_i + b) − f(b), whenever a ⩽ b, a_i = b_i, λ ⩾ 0, where e_i is the ith basis vector.

Best subset selection via a modern optimization lens D. Bertsimas et al. (2015)

Best subset selection for linear regression can be done exactly for p ⩽ 30 (leaps), or approximated with the lasso or with non-convex penalized regression: the minimax concave penalty (MCP), implemented in Sparsenet, is quadratic for β small and then constant; the smoothly clipped absolute deviation (SCAD) penalty is linear, then quadratic, then constant. Progress in mixed integer programming makes solving the exact problem reasonable for p = 100, n = 1000.

Hard thresholding computes $H_k(c) = \operatorname{Argmin}_{\|\beta\|_0 \leqslant k} \|\beta - c\|_2^2$.


If g is convex and its gradient is L-Lipschitz,

$g(\eta) \leqslant g(\beta) + \dfrac{L}{2} \|\eta - \beta\|_2^2 + \langle \nabla g(\beta), \eta - \beta \rangle.$

To solve

Minimize g(β) over β, subject to ∥β∥_0 ⩽ k

for such a g, iterate

$\beta_{m+1} \leftarrow H_k\Bigl( \beta_m - \dfrac{1}{L} \nabla g(\beta_m) \Bigr)$

(discrete projected gradient) until ∥∆β∥_2 ⩽ ε; then solve the continuous problem for the corresponding non-zero coordinates. A line search β_{m+1} = (1 − λ)β_m + λH, where H is the hard-thresholded point above, gives better empirical performance.

Subgroup discovery M. Atzmueller (2005)

Subgroup discovery (SD) is a frequent-itemset-based supervised clustering technique: after extracting frequent association rules for the target variable, select a small number of them with good quality (measured by some statistical test: χ², gof, etc.) so that they hardly overlap and have good coverage.

Anytime discovery of a diverse set of patterns with Monte Carlo Tree search G. Bosc et al.

Pattern mining is a form of structured learning (progressively build patterns, one element at a time): Monte Carlo Tree Search (MCTS) produces more diverse rules than greedy approaches.
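The discrete projected gradient iteration from the best subset selection summary above can be sketched for least squares, g(β) = ½∥y − Xβ∥², whose gradient has Lipschitz constant L = λ_max(X′X). The function names, the stopping rule parameters and the simulated data are assumptions made for this example; the line search variant is not included.

```python
import numpy as np

def hard_threshold(c, k):
    """H_k(c): keep the k largest entries of c in absolute value, zero the others."""
    beta = np.zeros_like(c)
    keep = np.argsort(-np.abs(c))[:k]
    beta[keep] = c[keep]
    return beta

def best_subset(X, y, k, n_iter=500, eps=1e-8):
    """Discrete projected gradient for g(beta) = 0.5 * ||y - X beta||^2, ||beta||_0 <= k."""
    L = np.linalg.eigvalsh(X.T @ X)[-1]        # Lipschitz constant of the gradient
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y)
        new_beta = hard_threshold(beta - grad / L, k)
        converged = np.linalg.norm(new_beta - beta) <= eps
        beta = new_beta
        if converged:
            break
    # Polishing step: ordinary least squares on the selected coordinates.
    support = np.flatnonzero(beta)
    polished = np.zeros_like(beta)
    if support.size:
        polished[support] = np.linalg.lstsq(X[:, support], y, rcond=None)[0]
    return polished

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
true_beta = np.zeros(10)
true_beta[[0, 3, 7]] = [3.0, -2.0, 1.5]
y = X @ true_beta + 0.01 * rng.standard_normal(200)
beta_hat = best_subset(X, y, k=3)
print(np.flatnonzero(beta_hat), beta_hat[np.flatnonzero(beta_hat)])
```

This is a local search: it returns a good k-sparse point quickly, which is the kind of solution that can then be handed to a mixed integer solver as a starting point.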

Phase recovery from a Bayesian point of view: the variational approach
A. Drémeau and F. Krzakala

The phase recovery problem, finding x ∈ Cⁿ such that y = |Dx|, can be solved by alternating projections, convex relaxation, or variational Bayes.

Katyusha: the first direct acceleration of stochastic gradient methods
Z. Allen-Zhu (2016)

Katyusha momentum combines variance reduction and momentum.
  $x_{n+1} \leftarrow \lambda z_n + \mu x^* + (1 - \lambda - \mu) y_n$
  $i \sim U(⟦1, n⟧)$
  $\nabla_{n+1} \leftarrow \nabla f(x^*) + \nabla f_i(x_{n+1}) - \nabla f_i(x^*)$
  $y_{n+1} \leftarrow x_{n+1} - \beta \nabla_{n+1}$
  $z_{n+1} \leftarrow z_n - \alpha \nabla_{n+1}$
  $x^*$ updated from time to time
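A minimal sketch of the Katyusha updates listed above, on a least-squares problem f(x) = (1/n) Σᵢ ½(aᵢ·x − bᵢ)². The function names, the fixed weights λ = 0.3 and µ = 0.5, the step sizes α = 1/(3λL) and β = 1/(3L), the epoch length 2n and the plain average used to refresh x* are all assumptions made for illustration, not the paper's exact schedule.

```python
import numpy as np

def katyusha_least_squares(A, b, epochs=20, lam=0.3, mu=0.5, seed=0):
    """Simplified Katyusha loop (illustrative parameter choices)."""
    rng = np.random.default_rng(seed)
    n, d = A.shape
    L = np.max(np.sum(A * A, axis=1))        # smoothness bound for every f_i
    alpha = 1.0 / (3 * lam * L)              # step for z (assumed)
    beta = 1.0 / (3 * L)                     # step for y (assumed)

    def grad_i(x, i):
        # Gradient of f_i(x) = 0.5 * (a_i . x - b_i)^2
        return A[i] * (A[i] @ x - b[i])

    snapshot = np.zeros(d)                   # x* in the notes
    y, z = snapshot.copy(), snapshot.copy()
    for _ in range(epochs):
        full_grad = A.T @ (A @ snapshot - b) / n
        y_sum = np.zeros(d)
        for _ in range(2 * n):
            x = lam * z + mu * snapshot + (1 - lam - mu) * y
            i = rng.integers(n)
            g = full_grad + grad_i(x, i) - grad_i(snapshot, i)
            y = x - beta * g
            z = z - alpha * g
            y_sum += y
        snapshot = y_sum / (2 * n)           # "x* updated from time to time"
    return snapshot

def loss(A, b, x):
    return 0.5 * np.mean((A @ x - b) ** 2)

rng = np.random.default_rng(1)
A_demo = rng.normal(size=(100, 5))
b_demo = A_demo @ rng.normal(size=5)
x_hat = katyusha_least_squares(A_demo, b_demo)
```

The variance-reduced gradient g is unbiased for ∇f(x) and its variance shrinks as x and the snapshot approach the optimum, which is what makes the momentum terms safe to use.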

Incorporating Nesterov momentum into Adam
T. Dozat

Adding gradient noise improves learning for very deep networks
A. Neelakantan et al. (2015)

Exploiting the structure: stochastic gradient methods using raw clusters
Z. Allen-Zhu et al. (2016)

(LSH-based) approximate clustering transforms “big data” into “data”; the clusters can improve variance reduction in SVRG.

Lip reading sentences in the wild
J.S. Chung et al.

Curriculum learning.

Machine teaching: an inverse problem to machine learning and an approach toward optimal education
X. Zhu

Machine teaching is the inverse problem of machine learning (ML): not going from training data to fitted model, but manufacturing a training dataset for which the ML algorithm of interest would derive the desired model.

[Figure: training sets → ML algorithm (e.g., SVM) → fitted model; machine teaching (e.g., SVM⁻¹) goes from the fitted model back to training sets.]

Sparse solutions to nonnegative linear systems and applications
A. Bhaskara et al.

To solve Ax = b, where A, x, b have nonnegative coefficients and ‖b‖₁ = 1, increment x coordinatewise while keeping the “potential” $\Phi(x) = \sum_j b_j (1 + \delta)^{(Ax)_j / b_j}$ small; a small number of iterations leads to a sparse approximate solution.

On b-bit min-wise hashing for large-scale regression and classification with sparse data
R.D. Shah and N. Meinshausen (2016)

Large-scale datasets (large p, large n, but sparse) can be tackled with random projections (X ↦ XA, A random) or sketches (X ↦ AX). b-bit min-wise hashing reduces a large sparse binary matrix X as follows:
– Pick a random permutation of the columns;
– In each row, find the index (in the permuted matrix) of the first non-zero value;
– Use the last b bits of those indices as b new columns in the compressed matrix;
– Repeat a few times with more permutations.
“Similarity” between rows is preserved.
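The b-bit min-wise hashing steps (Shah and Meinshausen, above) can be sketched as follows. This is an illustration under assumptions: the function name and its parameters are hypothetical, the input is a small dense 0/1 array rather than a sparse matrix, and empty rows are given the sentinel index p.

```python
import numpy as np

def b_bit_minwise_hash(X, b=2, n_perm=4, seed=0):
    """Compress a binary matrix X (n x p) into n x (b * n_perm) bits."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=bool)
    n, p = X.shape
    blocks = []
    for _ in range(n_perm):
        rank = rng.permutation(p)            # rank[j]: new position of column j
        # Position of the first non-zero entry of each row after permutation
        # (p is used as a sentinel for rows with no non-zero entry).
        first = np.where(X, rank, p).min(axis=1)
        # Keep the last b bits of that position, one new column per bit.
        bits = (first[:, None] >> np.arange(b)) & 1
        blocks.append(bits)
    return np.hstack(blocks).astype(np.uint8)

X_demo = np.array([[1, 0, 1, 0, 0, 1],
                   [1, 0, 1, 0, 0, 1],
                   [0, 1, 0, 1, 0, 0]])
H_demo = b_bit_minwise_hash(X_demo, b=2, n_perm=5)
```

Rows with the same support always receive the same code, and rows with overlapping supports agree on a permutation whenever their first non-zero positions coincide, which is the sense in which similarity is preserved.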

Unitary evolution recurrent neural networks
A. Arjovsky et al. (2016)

To avoid exploding or vanishing gradients, use (real) unitary weight matrices (all eigenvalues have modulus 1). Unitary matrices are not easy to parametrize: model them as products of
– (Complex) diagonal matrices;
– Reflections: $R = I - 2vv^* / \|v\|^2$, v ∈ Cⁿ;
– Permutation matrices (use a single, fixed one);
– Fourier and inverse Fourier transforms.
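As a quick check of the parametrization above, the sketch below multiplies one possible arrangement of diagonal, reflection, permutation and Fourier factors and verifies that the product is unitary. The ordering of the factors and the function name are assumptions made for illustration; any product of unitary factors would pass the same check.

```python
import numpy as np

def random_unitary(n, seed=0):
    """Product of simple unitary factors (one illustrative ordering)."""
    rng = np.random.default_rng(seed)

    def diagonal():
        # Complex diagonal matrix with unit-modulus entries.
        return np.diag(np.exp(1j * rng.uniform(-np.pi, np.pi, n)))

    def reflection():
        # R = I - 2 v v* / ||v||^2
        v = rng.normal(size=n) + 1j * rng.normal(size=n)
        return np.eye(n) - 2 * np.outer(v, v.conj()) / np.vdot(v, v).real

    F = np.fft.fft(np.eye(n), norm="ortho")      # unitary DFT matrix
    P = np.eye(n)[rng.permutation(n)]            # fixed permutation matrix
    return (diagonal() @ reflection() @ F.conj().T @ diagonal()
            @ P @ reflection() @ F @ diagonal())

W_demo = random_unitary(8)
```

Each factor costs O(n) or O(n log n) to apply to a vector, which is the practical appeal of this kind of product over a dense n × n matrix.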

WaveNet generates raw audio samples (16 000 Hz) with dilated causal convolutions, gated units, and residual and skip connections.

Strongly-typed recurrent neural networks
D. Balduzzi and M. Ghifary (2016)

Physicists’ dimensional analysis suggests RNN, LSTM, GRU variants.

Full-capacity unitary recurrent neural networks
S. Wisdom et al. (2016)

Unitary matrices can eliminate the vanishing/exploding gradient problem in RNNs. Instead of using a parametrization of some of those matrices, consider them as points on the (Stiefel) manifold of unitary matrices.

Learning scalable deep kernels with recurrent structure
M. Al-Shedivat et al.

Adding a Gaussian process (GP) layer to a recurrent neural net (RNN) and using the negative log-marginal likelihood as objective is problematic: the objective no longer factorizes over the data and stochastic optimization cannot be used – but semi-stochastic alternating gradient descent (full gradient for the GP parameters, stochastic gradient for the RNN) is good enough.

Self-normalizing neural networks
G. Klambauer et al.

Batch/layer/weight normalization may not be needed: scaled exponential linear units (SELU)
  $f(x) = \lambda x$ if $x \geq 0$,
  $f(x) = \alpha (e^x - 1)$ if $x < 0$,
with λ > 1 (e.g., λ = 1.0507, α = 1.7581) are self-normalizing, in the sense that if X ∼ N(0, 1) then E f(X) = 0 and Var f(X) = 1, and if X is close to standardized, then f(X) is more so.
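The fixed-point property can be checked numerically. The sketch below uses the constants quoted above (λ = 1.0507, α = 1.7581) and integrates f and f² against the standard normal density on a fine grid; the grid bounds and step are arbitrary choices.

```python
import numpy as np

LAMBDA, ALPHA = 1.0507, 1.7581   # constants quoted in the notes

def selu(x):
    x = np.asarray(x, dtype=float)
    # lambda * x for x >= 0, alpha * (exp(x) - 1) for x < 0
    return np.where(x >= 0, LAMBDA * x, ALPHA * np.expm1(np.minimum(x, 0.0)))

# Riemann-sum approximation of E f(X) and Var f(X) for X ~ N(0, 1).
grid = np.linspace(-12.0, 12.0, 240001)
weights = np.exp(-0.5 * grid**2) / np.sqrt(2 * np.pi) * (grid[1] - grid[0])
selu_mean = float(np.sum(weights * selu(grid)))
selu_var = float(np.sum(weights * selu(grid) ** 2) - selu_mean**2)
```

With these rounded constants the mean and variance come out at 0 and 1 up to the rounding of λ and α.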

Professor forcing: a new algorithm for training recurrent networks
A. Lamb et al. (2016)

Recurrent neural networks (RNN) generate sequences one character at a time, each conditional on the previous ones. During training, one can use:
– The previous characters of the ground truth sequence (“teacher forcing”, maximum likelihood);
– The previous outputs;
– A mixture of both;
– An adversarial approach, training another model to distinguish between true and generated sequences, and using its performance as a penalty.

Understanding the difficulty of training deep feedforward neural networks
X. Glorot and Y. Bengio (2010)

Xavier initialization selects the random initial weights so that the activations of all layers have zero mean and unit variance.

Modeling temporal dependencies in high-dimensional sequences: applications to polyphonic music generation and transcription
N. Boulanger-Lewandowski et al. (2012)

Restricted Boltzmann machines (RBM) economically model high-dimensional discrete distributions. A recurrent temporal RBM (RTRBM) is a sequence of RBMs whose parameters are a (fixed) linear function of the previous hidden state. An RNN-RBM is a sequence of RBMs whose parameters are given by a recurrent neural net (RNN). They can be used as a prior to improve the accuracy of polyphonic transcription.

Improved semantic representations from tree-structured long short-term memory networks
K.S. Tai et al.

LSTMs can be generalized from chains to trees.

Deep recursive neural networks for compositionality in language
O. İrsoy and C. Cardie

[Figure: recursive and deep recursive networks on the sentence “the movie was great”.]

WaveNet: a generative model for raw audio
A. van den Oord et al.

AtomNet: a deep convolutional neural network for bioactivity prediction in structure-based drug discovery
I. Wallach et al.

CNNs can leverage molecule shapes (on a 3-dimensional grid) for drug discovery.

Unsupervised representation learning with deep convolutional generative adversarial networks
A. Radford et al. (2016)

– Replace pooling layers with strided convolutions (discriminator) or fractionally strided convolutions (generator);
– Remove fully-connected layers;
– Use batch normalization;
– Use ReLU (generator), tanh (generator output) and leaky ReLU (discriminator).

Convolutional networks on graphs for learning molecular fingerprints
D. Duvenaud et al.

To fingerprint molecules (or graphs), for each node, compute a hash of its neighbours, then a hash of the neighbours’ hashes, and so on (circular fingerprints). Alternatively, use a graph convnet with large random weights and tanh activations.

Geometric deep learning on graphs and manifolds using mixture model CNNs
F. Monti et al.

CNNs can be defined on arbitrary graphs or manifolds, not just lattices in Euclidean space. For instance, for each vertex x, and each neighbouring vertex y ∈ N(x), define pseudo-coordinates u(x, y) (e.g., geodesic polar coordinates (ρ, θ) on a surface, or the degree on a graph), apply a weight function w(u(x, y)) (e.g., a Gaussian or triangular kernel, possibly with a learnable parameter) and set
  $D_j(x) f = \sum_{y \in N(x)} w\bigl(u(x, y)\bigr) f(y)$.

Graph stream algorithms: a survey
A. McGregor

Many graph problems can be solved, at least approximately, with streaming data, i.e., without storing the whole graph in memory (but you still need Ω(#V) storage):
– Test connectivity by building a spanning forest (a set of edges): add the next edge {u, v} if there is currently no path from u to v;
– Approximate distances by building a subgraph: add the next edge {u, v} if the current d(u, v) is above some threshold (i.e., unless the new edge closes a small loop);
– Minimum spanning tree (MST);
– Approximately count triangles by considering the vectors with coordinates $x_T$ = #edges in T, where T ∈ P₃(V);
– Greedy matching;
– Random walks (√t passes on the data suffice: stitch √t walks of length √t).
Many streaming algorithms use a linear sketch, i.e., a random projection (of the adjacency or some related matrix). Algorithms for sliding window graphs (the last w edges in an infinite stream) often keep the k most recent edges satisfying some property.

Darwini: generating realistic large-scale social graphs
S. Edunov et al.

To generate a (large) graph with prescribed degree and clustering coefficient distributions (conditional on degree):
– Assign a target degree and clustering coefficient to each node;
– Group the nodes in clusters, according to cd(d − 1);
– Build an Erdős-Rényi graph on each cluster, choosing the probability (and cluster size) to get the desired expected clustering coefficient (and a degree inferior to the target);
– Add edges between clusters to get the desired degree distribution [tricky for large-degree nodes], trying to keep the joint degree distribution.

How to partition a billion-node graph
L. Wang et al. (2014)

To partition a graph (for distributed processing, or sparse matrix ordering):
– Coarsen it by finding a maximum matching (a set of edges sharing no vertices) and collapsing its edges, until the graph is sufficiently small;
– Assign a unique label to each vertex; set the label of each vertex to the majority label in its neighbourhood;
– If there are too many labels, collapse vertices with the same label, and propagate labels anew.

Graphons, mergeons and so on
J. Eldridge et al. (2016)

A graphon is a symmetric measurable function w : [0, 1]² → [0, 1]. It can be seen as a weighted graph on the uncountable vertex set [0, 1]. It defines a sequence of random graphs (graph-valued random variables) (Gₙ)ₙ, where Gₙ samples n points xᵢ uniformly in [0, 1] and then an edge i–j with probability w(xᵢ, xⱼ). It is consistent (the random graph obtained by removing the (n + 1)st vertex from Gₙ₊₁ has the same distribution as Gₙ) and local (if S, T ⊂ [n] are disjoint, then Gₙ|S ⊥⊥ Gₙ|T); these properties characterize graphons. A subset A ⊂ [0, 1] is disconnected at level λ if there exists S ⊂ A such that 0 < µ(S) < µ(A) and w < λ on S × (A \ S). Define an equivalence relation

∼ on the set $A_\lambda$ of λ-connected subsets by A₁ ∼ A₂ iff $\exists A \in A_\lambda$, A ⊃ A₁ ∪ A₂. The clusters at level λ are the (essential) largest elements in each equivalence class. A mergeon describes how clusters merge as λ decreases: it is a graphon M such that
  $M^{-1}[\lambda, 1] = \bigcup_{C \ \lambda\text{-cluster}} C \times C$.

GraphChi-DB: simple design for a scalable graph database system – on just a PC
A. Kyrola and C. Guestrin

Adapt the log-structured merge tree (LSM tree) to store edges: partition them on the destination vertex, but sort each partition along the source vertex.

A machine learning perspective on predictive coding with PAQ
B. Knoll and N. de Freitas (2011)

Prediction by partial matching (PPM) is a lossless compression algorithm well-suited to text:
– Start with a prior probability distribution on the letters;
– Encode the first character with the Huffman code or the arithmetic code for this distribution;
– Somehow update the probability distribution; iterate.
For instance, one could use an n-gram model, i.e., keep track of all the substrings seen so far, and use the longest matches to predict (the probability of) the next letter. With noisy inputs, approximate matching helps. For images, scan by row or column, or along a space-filling curve (Hilbert) – you can beat jpeg but not jpeg2000.

PAQ8 uses 500 such models, for bit-level predictions, ensembled with a mixture of experts (implemented as a neural net).

Time series analysis of the S&P 500 index: a horizontal visibility graph approach
M.D. Vamvakaris et al. (2017)

The horizontal visibility graph is a variant of the visibility graph:
– Fit the degree distribution with an exponential law, P(k) ∝ exp(−γk): if the exponent γ is greater than log(3/2), the time series is auto-correlated; if it is less, it is chaotic;
– The Hellinger distance between the in- and out-degree distributions measures irreversibility.
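For reference, a small sketch of how a horizontal visibility graph can be built from a series, using the usual definition (assumed here, since the notes do not restate it): time points i < j are linked when every value strictly between them lies below min(xᵢ, xⱼ). The function name is illustrative.

```python
def horizontal_visibility_edges(x):
    """Edges (i, j), i < j, of the horizontal visibility graph of x."""
    edges = []
    n = len(x)
    for i in range(n - 1):
        highest = float("-inf")          # tallest value strictly between i and j
        for j in range(i + 1, n):
            if highest < min(x[i], x[j]):
                edges.append((i, j))
            highest = max(highest, x[j])
            if highest >= x[i]:          # i cannot see past a value >= x[i]
                break
    return edges

hvg_demo = horizontal_visibility_edges([3, 1, 2, 4])
```

The degree sequence needed for the exponent fit is then obtained by counting, for each index, the edges in which it appears; directing edges forward in time gives the in- and out-degrees.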

Compression can be used to define distances
  $d_C(x, y) = \dfrac{C(x|y) + C(y|x)}{C(xy)}$
  $d_{CDM}(x, y) = \dfrac{C(xy)}{C(x) + C(y)}$
  $d_e(x, y) = \tfrac{1}{2} \bigl[ E[x|y] + E[y|x] \bigr]$
  $d_{NDM}(x, y) = \dfrac{C(x, y) - \operatorname{Min}\{C(x), C(y)\}}{\operatorname{Max}\{C(x), C(y)\}}$
where C(x) is the compressed size of x (approximation of its Kolmogorov complexity), C(x|y) is the compressed size of x if the compressor is trained on y, E[x|y] is the corresponding cross-entropy of x.
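Two of these distances can be tried with an off-the-shelf compressor. In the sketch below, C is approximated with zlib and C(x, y) with the compressed size of the concatenation; both are assumptions made for illustration (PAQ-style compressors would give tighter estimates), and the function names are not from the paper.

```python
import random
import zlib

def C(data: bytes) -> int:
    """Compressed size in bytes (zlib stands in for the ideal compressor)."""
    return len(zlib.compress(data, 9))

def d_cdm(x: bytes, y: bytes) -> float:
    return C(x + y) / (C(x) + C(y))

def d_ndm(x: bytes, y: bytes) -> float:
    return (C(x + y) - min(C(x), C(y))) / max(C(x), C(y))

text = b"the quick brown fox jumps over the lazy dog " * 20
rng = random.Random(0)
noise = bytes(rng.randrange(256) for _ in range(len(text)))
```

A text is much closer to itself than to random bytes under both distances, because concatenating two copies adds almost nothing to the compressed size.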

Arithmetic coding encodes a text, given a sequence of probability distributions on letters, as a single rational number: divide the interval [0, 1] into segments, one for each possible letter, of length their probabilities; pick the segment corresponding to the current letter; repeat the process on that segment, with the next letter; iterate and return any rational number in the final segment.

Process systems engineering as a modeling paradigm for analyzing systemic risk in financial networks
R. Bookstaber et al. (2015)

The financial system can be modeled as a signed directed graph (SDG): the directed arcs represent causal relations, and the sign of x → y is that of ∂y/∂x. Process hazard analysis can help identify feedback loops, which amplify shocks – many agents act in a locally stabilizing but globally destabilizing way.

Diffusion-convolutional neural networks
J. Atwood and D. Towsley (2016)

On a graph, one can forecast node labels Y from node features X by using the AⁿX as predictors, where A is the adjacency matrix. Averaging over the nodes, this can be generalized to edge or graph labels.

A brief review of the ChaLearn AutoML challenge
I. Guyon et al. (2016)

AutoML is a competition for automated (black-box, with no human intervention) supervised learning, combining the following ideas:
– Decision trees, k-NN, naive Bayes, neural nets;
– Uniformize, replace missing values, group modalities of categorical variables, feature selection, dimension reduction (PCA, kPCA, ICA, MDS, LLE, Laplacian eigenmaps), clustering (k-means);
– Hyperparameter tuning (SMAC, hyperopt);
– Model selection (k-fold cross-validation, l.o.o, o.o.b., bilevel optimization).

Structure discovery in nonparametric regression through compositional kernel search
D. Duvenaud et al. (2013)

The “automated statistician” performs a search in a space of kernel structures, built from sums and products of linear, square exponential, periodic, etc. bricks, in a Gaussian process.

Machine learning techniques: reductions between prediction quality metrics
A. Beygelzimer et al.

Reduction is the set of techniques converting a loss-minimization problem into a well-studied machine-learning one, typically binary classification or regression. Examples include:
– Importance-weighted classification (different cost for false positives and negatives) by reweighting the training set;
– Multiclass classification with one-against-all (inconsistent), error correcting codes (ECOC, inconsistent), probabilistic ECOC, all-pairs;
– Cost-sensitive multiclass classification, with weighted all-pairs;
– Quantile regression with importance weighted classification;
– Ranking (of binary data).

Multi-view machines
B. Cao et al.

Sparse variant of factorization machines.

FastBDT: A speed-optimized and cache-friendly implementation of stochastic gradient-boosted decision trees for multivariate classification
T. Keck

May be faster than XGBoost. Also check LightGBM and CatBoost.

Coresets for scalable Bayesian logistic regression
J.H. Huggins et al. (2016)

Replace large datasets with a coreset, i.e., a weighted (smaller) subset (obtained, e.g., from an approximate streaming clustering algorithm) to approximate Bayesian posterior likelihood (for logistic regression).

Domain-adversarial training of neural networks
Y. Ganin et al. (2016)

Domain adaptation (DA) is the transfer of a model from one domain to another, sometimes by a mapping (alignment) between the two domains.

[Figure: source domain X1, target domain X2, output Y.]

One can also use an adversarial approach to build domain-invariant features.

[Figure: output; discriminator; source domain features; target domain features (no training data).]

Progressive neural networks
A.A. Rusu et al.

To learn a task related to an already-learnt one, consider a new neural net, parallel to the first one, initialized with random weights, with lateral connections. To keep the model scalable (task n + 1 has connections from tasks 1, 2, …, n), use dimension reduction (single-layer network) for those lateral connections.

[Figure: input; output1; output2.]

Net2Net: accelerating learning via knowledge transfer
T. Chen et al. (2016)

Transform an already trained network into a deeper one by inserting a layer initialized as the identity function, or into a wider one by duplicating some of the nodes (keep their input weights and halve their output weights).

Sampling generative networks
T. White (2016)

In high dimensions, linear interpolation goes through atypical points (too close to the origin): prefer spherical interpolation. To study analogies, e.g., king − man + woman, generate images in a grid having those three points as corners. In the latent space, find the directions corresponding to “smile”, “open mouth”, “gender” or even “blurry” and generate a 1- or 2-dimensional grid of images by moving along those directions.

Efficient convolutional auto-encoding via random convexification and frequency-domain minimization
M.C. Oveneke et al. (2016)

Use random projections for the non-linear encoding part of the auto-encoders, and only learn the (linear) decoding part. [cf echo networks, reservoir learning]

To estimate a Gaussian graphical model, identify the neighbourhood of each node with a lasso regression. For mixed data (exponential family), use GLMs. For time-varying models, use (Gaussian) weights.

Inference compilation and universal probabilistic programming
T.A. Le et al.

Neural networks can be trained to provide better proposal distributions for sequential importance sampling (SIS).

Universal adversarial perturbations
S.M. Moosavi-Dezfooli et al.

There exist almost universal adversarial perturbations: a small perturbation which, when added to almost any image, tricks a neural network into misclassifying it.

Memory efficient kernel approximation
S. Si et al. (2014)

Approximate large kernel matrices as a direct sum of low-rank matrices, where the blocks correspond to clusters.

Matrix neural networks
J. Gao et al. (2016)

If your neural network has matrices as input (instead of vectors), try bilinear layers: Y = σ(UXV′ + B) (U, V and B are matrices).

Multi-class generative adversarial networks with the L2 loss function
X. Mao et al.

Replace the sigmoid cross-entropy loss in the discriminator with an L2 loss.

FastText.zip: compressing text classification models
A. Joulin et al.

Product quantization decomposes the space into k orthogonal subspaces $\mathbf{R}^n = \bigoplus_{i=1}^k V_i$, computes $2^b$ (with b = 8) k-means centroids in each of them, and approximates points as sums of those centroids, $x \approx \sum_i q_i(x)$. Use when you want to approximate scalar products – e.g., for vector embeddings.

Tensorizing neural networks
A. Novikov et al.

Dimension reduction (TensorTrain) for dense weight matrices (of fully-connected layers).

Neural network based clustering using pairwise constraints
Y.C. Hsu and Z. Kira (2016)

Cluster data by training a (softmax) network on similar and dissimilar pairs.

Composing graphical models with neural networks for structured representations and fast inference
M.J. Johnson et al.

Use neural nets to add non-linear transformations to your graphical models; inference is tricky: SVAE (structural variational encoders) use stochastic gradients of a variational approximation.

GANs for sequences of discrete elements with Gumbel-softmax distribution
M. Kusner and J.M. Hernández-Lobato

The Gumbel softmax distribution is the distribution of
  $y = \operatorname{softmax}\bigl( \tfrac{1}{\tau} (h + g) \bigr)$
where the $g_i \sim \operatorname{Gumbel}(0, 1)$ are iid. When τ → 0, this becomes the one-hot encoding of $\operatorname{Argmax}_i (h_i + g_i)$, i.e., y ∼ Multinomial(p), where p = softmax h, i.e., $p_i \propto \exp h_i$. It can be used as a differentiable approximation of the multinomial distribution, e.g., in GANs.
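The Gumbel-softmax relaxation described above is easy to simulate. The sketch below (function name and test values are illustrative assumptions) draws relaxed samples with NumPy and compares the frequency of the largest coordinate with p = softmax h.

```python
import numpy as np

def gumbel_softmax(h, tau, rng):
    """Sample softmax((h + g) / tau) with g_i iid Gumbel(0, 1), row-wise."""
    h = np.asarray(h, dtype=float)
    g = rng.gumbel(0.0, 1.0, size=h.shape)
    z = (h + g) / tau
    z = z - z.max(axis=-1, keepdims=True)    # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
p_demo = np.array([0.2, 0.3, 0.5])
h_demo = np.tile(np.log(p_demo), (20000, 1))  # softmax(h) = p on every row
y_demo = gumbel_softmax(h_demo, tau=0.1, rng=rng)
freq_demo = np.bincount(y_demo.argmax(axis=1), minlength=3) / len(y_demo)
```

Lower temperatures τ give samples closer to one-hot vectors but noisier gradients with respect to h; larger τ gives smoother, more biased samples.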

Augmenting supervised neural networks with unsupervised objectives for large-scale image classification
Y. Zhang et al. (2016)

Augment supervised networks with autoencoder losses.

[Figure: three network diagrams, with labels input, output and target.]

mgm: Structure estimation for time-varying mixed graphical models in high-dimensional data
J.M.B. Haslbeck and L.J. Waldorp

Infinite-dimensional word embedding
E.T. Nalisnick and S. Ravi (2016)

Make word embeddings “infinite dimensional” by initializing them with high-dimensional (not really infinite) vectors, using only the first z components, with a

penalty for z, and a sparsity penalty on the vectors [unrelated to Dirichlet processes, iHMM and other infinite-dimensional models].

Pointer networks
O. Vinyals et al.

Attention-based networks for TSP, convex hull, Delaunay triangulation.

Structured prediction energy networks
D. Belanger and A. McCallum (2016)

Structured prediction can be performed indirectly, by learning an energy function and minimizing it.

[Figure: direct prediction, y = f(x), vs energy-based prediction, y = Argmin_y E(x, y), with y ∈ {0, 1}ⁿ.]

For instance, the model could use a 2-layer neural net to compute features and combine them, linearly and bilinearly, with the label. The problem can be relaxed from y ∈ {0, 1}ⁿ to y ∈ [0, 1]ⁿ.

[Figure: a 2-layer NN computes features of x, which are combined linearly and bilinearly with y and added to a label prior to give the energy.]

To learn the model, notice that we want ∀y E(xᵢ, y) ⩾ E(xᵢ, yᵢ), with a larger difference if y and yᵢ are far apart, e.g., ∀y E(xᵢ, y) − E(xᵢ, yᵢ) ⩾ d(y, yᵢ) ⩾ 0. This suggests minimizing a structured SVM (SSVM) loss
  $\sum_i \operatorname{Max}_y \bigl[ d(y, y_i) - E(x_i, y) + E(x_i, y_i) \bigr]_+$.

Machine learning techniques for Stackelberg security games: a survey
G. De Nittis and F. Trovò (2016)

In the Stackelberg security game (SSG), a defender protects T targets from an attacker, by allocating resources R < T. The strategies are {x ∈ Rᵀ : 0 ⩽ x ⩽ 1, x′1 ⩽ R}. If the adversary attacks t, both receive a penalty and a reward:
  Attacker: $x_t P_t^a + (1 - x_t) R_t^a$
  Defender: $(1 - x_t) P_t^d + x_t R_t^d$
Boundedly rational adversaries attack a target at random, with a probability depending on their (potentially incorrect) subjective utility. The subjective utility function can be estimated from data (poaching and illegal fishing provide real-world data). The defender’s strategy should be robust to suboptimal adversaries.

Apprenticeship learning using inverse reinforcement learning and gradient methods
G. Neu et al. (2007)

Inverse reinforcement learning (IRL) learns a policy (a map from state features to actions) from an expert’s observed behaviour. Instead, one can learn the reward function for which the optimal policy is as close to the expert’s as possible.

Dueling network architectures for deep reinforcement learning
Z. Wang et al.

In deep reinforcement learning, separately learn the state value function and the state-dependent action advantage function.

Deep successor reinforcement learning
T.D. Kulkarni et al.

Successor reinforcement learning learns the value function as the inner product between a reward predictor and a successor map.

Convexified convolutional neural networks
Y. Zhang et al. (2016)

Training a 1-hidden-layer CNN with linear activations is an optimization problem with a low-rank constraint (corresponding to weight sharing) which can be relaxed into a nuclear norm constraint. For non-linear activations, use a kernel (RKHS). For deeper networks, this no longer works, but greedy layer-wise training seems a good heuristic.

End-to-end kernel learning with supervised convolutional kernel networks
J. Mairal (2016)

A convolutional kernel network uses the kernel trick to increase the dimension and add non-linearities
  $\mathbf{R}^n \xrightarrow{\ \phi\ } \mathbf{R}^N \xrightarrow{\text{projection}} \mathbf{R}^m$, N ≫ n, m
without explicitly computing the high-dimensional coordinates. The kernel parameters can be learnt as usual.

Network in network
M. Lin et al.

CNNs slide simple masks over an image, to detect features. Instead, one could slide slightly deeper networks (“micro-networks”).

Towards a mathematical theory of super-resolution
E.J. Candès and C. Fernandez-Granda (2012)

It is possible to recover a spike train (i.e., a discrete complex measure, $x = \sum_i a_i \delta_{t_i}$) from low-resolution measurements or, equivalently, from a truncated Fourier transform $\hat{x} = F_n x$, by solving a convex problem
  Find y, a complex measure
  To minimize $\|y\|_{TV}$
  Such that $F_n y = \hat{x}$
provided the spikes are sufficiently separated, where the total variation of a measure µ on [0, 1] is a continuous analogue of the ℓ¹ norm: […]

The more you know: using knowledge graphs for image classification
K. Marino et al.

LSTMs can be generalized to arbitrary DAGs (instead of time). Propagating over the “promising” subset of the graph may be sufficient.

A deep and autoregressive approach for topic modeling of multimodal data
Y. Zheng et al. (2016)

Represent images as bags of visual words (e.g., SIFT features) to use topic models (LDA, DocNADE). v1, …, vd words; p(v) = ∏ p(vi | v […]

[…] 0 the strength of the prior. A Dirichlet process is a family of random variables $(G(A))_{A \subset \Theta}$ whose finite margins are Dirichlet: for all (measurable) partitions Θ = A₁ ⊔ ··· ⊔ Aₙ,
  $\bigl( G(A_1), \ldots, G(A_n) \bigr) \sim \operatorname{Dir}\bigl( \alpha H(A_1), \ldots, \alpha H(A_n) \bigr)$.
A Dirichlet process is a random probability distribution G on Θ whose samples are of the form $\sum_{k \geq 1} \pi_k \delta_{\theta_k}$ with (stick-breaking construction)
  $\theta_k \sim H$
  $\pi_k = \beta_k \prod_{1 \leq \ell < k} (1 - \beta_\ell)$
  $\beta_k \sim \operatorname{Beta}(1, \alpha)$.
This is sometimes written π ∼ GEM(α).

The urn scheme explains how to sample from a Dirichlet process G ∼ DP(α, H), sample from that sample, θ₁, …, θₙ ∼ G, and marginalize G out. Start with a sample θ₁ ∼ H. Once you have θ₁, …, θₙ, with probability α/(α + n), sample a new point θₙ₊₁ ∼ H; with probability n/(α + n), sample from the existing points θₙ₊₁ ∼ Unif({θ₁, …, θₙ}). The Chinese restaurant process (CRP) is similar, but only looks at the partition of ⟦1, n⟧ induced by the unique values in (θ₁, …, θₙ).

The Dirichlet process mixture model (or infinite mixture model) can be written
  Θ              parameter space
  H              prior
  α              strength of the prior
  G ∼ DP(α, H)
  θᵢ ∼ G
  xᵢ ∼ F_{θᵢ}
(note that there are usually duplicated values in the θᵢ) or, equivalently
  Θ              parameter space
  H              prior
  α              strength of the prior
  θₖ ∼ H         cluster k parameters
  π ∼ GEM(α)     cluster probabilities
  zᵢ ∼ Mult(π)   cluster membership
  xᵢ ∼ F_{θ_{zᵢ}}   ith observation
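The urn scheme and the stick-breaking construction described above can both be simulated in a few lines. This is a sketch under assumptions: the function names are illustrative, the base distribution H is taken to be N(0, 1), and the stick-breaking weights are truncated after k_max terms.

```python
import random

def urn_sample(n, alpha, base, rng):
    """theta_1..theta_n from the urn scheme, with G marginalized out."""
    thetas = []
    for i in range(n):
        if rng.random() < alpha / (alpha + i):
            thetas.append(base(rng))            # new value from H
        else:
            thetas.append(rng.choice(thetas))   # repeat an existing value
    return thetas

def stick_breaking(alpha, k_max, rng):
    """First k_max weights pi_k = beta_k * prod_{l<k} (1 - beta_l)."""
    weights, remaining = [], 1.0
    for _ in range(k_max):
        beta = rng.betavariate(1, alpha)
        weights.append(beta * remaining)
        remaining *= 1 - beta
    return weights

thetas_demo = urn_sample(200, 2.0, lambda r: r.gauss(0.0, 1.0), random.Random(1))
weights_demo = stick_breaking(2.0, 50, random.Random(1))
```

The number of distinct values among θ₁, …, θₙ grows roughly like α log n, which is why the mixture model can be fitted without fixing the number of clusters in advance.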

Hierarchical Bayesian nonparametric models with applications Y.W. Teh and M.I. Jordan (2009) A hierarchical Dirichlet process (HDP) is a Dirichlet process (DP) whose base distribution is a sample from a DP. For instance, the HDP mixture model allows multiple clustering problems to share clusters, e.g., to share topics across documents. clustering problem cluster

corpus document topic

population genetic makeup subpopulation

The infinite HMM (iHMM) or HDP-HMM is a hidden Markov model (HMM) with an unbounded number of hidden states (one needs to increase the probability of self-transitions to avoid the creation of many redundant states). The Pitman-Yor process, which replaces Beta(1, α) with Beta(1 − d, α + kd) and yields the power laws commonly seen in language modeling, also has a hierarchical variant (HPY). A random measure G is completely random if A1 , . . . , An disjoint =⇒ G(A1 ), . . . , G(An ) independent. The Dirichlet process is not completely random because it is a probability measure. The Beta process is a completely random measure generalizing the Dirichlet process and often used as a prior for featural representations (i.e., binary matrices, i.e., clustering problems where an observation can be in several clusters). A sample from ∑ a Beta process B ∼ BP(c, B0 ) is of the form B = k⩾1 ωk δthetak , where (ωk , θk )k⩾1 is a sample from a Poisson process on [0, 1] × Θ with rate measure ν(dω, dθ) = cω −1 (1 − ω)c−1 dωB0 (dθ) where c > 0 (concentration). There is a stick-breaking construction (for c = 1) ∑ B= ωk δθk vk ∼ Beta(1, α) ωk =

k ∏ (1 − vi ) 1

θ k ∼ B0 91/587

or (2-parameter Beta process) B=

Kn ∑∑ n⩾1 k=1

Kn ∼ Poisson

where λ1 , λ2 , λ3 depend on the trigram (x1 , x2 , x3 ), (in particular, λ = 0 if the denominator is zero), e.g.,

ωnk δθnk (

cα c+n−1

)

θnk ∼ B ωnk ∼ Beta(1, c + n − 1)

count(x1 , x2 ) count(x1 , x2 ) + γ count(x2 ) λ1 = (1 − λ1 ) count(x2 ) + γ λ3 = 1 − λ1 − λ2 λ1 =

count∗ (x1 , x2 ) if count(x1 , x2 ) > 0, count(x1 ) where the discounted count is

– P (x2 |x1 ) = Bayesian nonparametric models P. Orbanz and Y.W. Teh A nonparametric Bayesian model is an infinitedimensional (parametric) Bayesian model, i.e., a finitedimensional parametric Bayesian model whose dimension increases with sample size. Examples include: – Gaussian process: f ∼ GP means ( ) f (x1 ), . . . , f (xn ) ∼ N ; ∑ ∑ – Dirichlet process G = k⩾1 πk δθk , πk = 1; – Chinese restaurant∑process (partitions from a DP); – Beta process G = k⩾1 πk δθk ; – Indian buffet process (partitions from a BP); – Pitman-Yor process; – Hierarchical DP, BP or PY; – Dependent DP; – etc.

count∗ (x1 , x2 ) = count(x1 , x2 ) − β and the missing probability mass 1−

∑ count∗ (x1 , x2 ) x2

count(x1 )

is assigned to the bigrams with no counts.

2. One can hope to describe sentences with a contextfree grammar (CFG). For a given (grammatical) sentence, there can be several (left-most) derivations – the sentence can be ambiguous. A probabilistic CFG (PCFG) assigns a probability to each rule of the grammar, and therefore defines a probability distribution on the set of derivations (not just the set of sentences). The training data provides the grammar and empirical probabilities from the rules. The CKY algorithm uses dynamic programming to find the most likely parse tree Showing that those processes can be tricky: (for a grammar in Chomsky normal form (CNF), i.e., – Compatible finite-dimensional marginals and Kol- whose rules are of the form X → Y1 Y2 or X → y) by mogorov’s extension theorem (only for countable computing π(i, j, X) = Max likelihood(X → xi:j ). The families); inside algorithm similarly computes the probability of ∑ – Explicit construction (e.g., stick-breaking); a sentence using π(i, j, X) = likelihood(X → xi:j ). – Limit of finite-dimensional distributions; 3. PCFGs are not sensitive to lexical information: – Exchangeable sequences and de Finetti’s theorem. for instance, they cannot notice that into/IN (resp. Consistency of infinite-dimensional Bayesian models is of/IN) is more (less) likely to be attached to a VP not as common as for finite-dimensional ones. than to a NP. There is also no way of encoding the “close attachment preference” (a PP is more likely to In R, check the DPpackage package. be attached to a nearby NP than to a distant one). A lexicalized PCFG is a PCFG in which each nonNatural language processing terminal is paired with a lexical item (the “head” of M. Collins (Coursera & Columbia, 2013) the non-terminal), e.g., 1. A language model is a probability distribution S(examined) →2 NP(lawyer) VP(examined) on V + , the set of sentences written with a vocabulary V . 
The sample distribution, from a corpus, is not a good model: it assigns a zero probability to (the subscript indicates the origin of the lexical item). sentences never observed. Markov models (n-gram Since the number of parameters explodes, they should models) assume that P (xk+1 |past) = P (xk+1 |x1:k ) = be regularized, for instance: P (xk+1 |xk−n+1:k ). These probabilities can be esti- – P [S(examined) →2 NP(lawyer)VP(examined)] = mated as follows: P [S(examined)→2 NP VP(examined)] P [lawyer| · · · ] – Shrink P [S(examined) →2 NP VP(examined)] tocount(x1 , x2 , x3 ) – P (xx |x1 , x2 ) = wards P [S →2 NP VP] count(x1 , x2 ) – Shrink P [lawyer|S(examined)→2 NP VP(examined)] count(x1 , x2 , x3 ) – P (x3 |x1 , x2 ) = λ1 + towards P [lawyer|S →2 NP VP]. count(x1 , x2 ) count(x2 , x3 ) 4. The noisy channel model (generative model) to λ2 + count(x2 ) translate from French f to English e is count(x3 ) λ3 p(e, f ) = p(e)p(f |e). # words Article and book summaries by Vincent Zoonekynd


It uses a language model for the target, $p(e)$, and a conditional model $p(f \mid e)$, even though we are eventually interested in $p(e \mid f)$.

It is too complicated to model $p(f_1 \cdots f_m \mid e_1 \cdots e_n, m)$ directly, but one can model $p(f_1 \cdots f_m, a_1 \cdots a_m \mid e_1 \cdots e_n, m)$, where $a_i \in [\![1, n]\!]$ is the position of the English word corresponding to the $i$th French word, and then integrate the $a_i$ out. One can use a model for the alignments, $P[a_i = j \mid m, n]$,

(Figure: alignment patterns for three cases: similar languages, noun/adjective inversions, SVO vs SOV.)

another for the word pairings, $P[f_i \mid e_j, a_i = j]$, and multiply them (IBM model 2). The decoding problem, $\operatorname{Argmax}_e p(e)\,p(f \mid e)$, is hard, but the alignment and word pairing models are easier, and useful in other models: use the EM algorithm (do not use hard alignments but weighted mixtures) and start with the uniform alignment probability (IBM model 1).

5. Considering phrases, i.e., consecutive groups of words, improves performance. To translate with a phrase model and a trigram language model, progressively build the translation, keeping track of the "state": last two words (for the trigram model), last position (for the alignment model), boolean vector indicating which words have already been translated, score. Do not use a completely greedy algorithm, but keep a pool of good partial translations (states) and progressively expand them.

6. (Penalized) log-linear models generalize (smoothed) n-gram models: define features, e.g.,
$$f_k(x_1, \dots, x_n) = \mathbf{1}_{x_n = \text{world}}, \qquad f_k(x_1, \dots, x_n) = \mathbf{1}_{x_{n-1}, x_n = \text{hello}, \text{world}},$$
let
$$p(x_{n+1} \mid x_1 \dots x_n) \propto \exp[\theta \cdot f(x_1 \cdots x_n x_{n+1})]$$
or, writing $y = x_{n+1}$ and $x = (x_1, \dots, x_n)$,
$$p(y \mid x) \propto \exp[\theta \cdot f(x, y)], \qquad p(y \mid x) = \frac{\exp[\theta \cdot f(x, y)]}{\sum_z \exp[\theta \cdot f(x, z)]}$$
and find $\theta$ that maximizes the penalized log-likelihood
$$L(\theta) = \sum_i \log p\big(y^{(i)} \mid x^{(i)}, \theta\big) - \tfrac{1}{2} \|\theta\|_2^2$$
using gradient ascent; the gradient is easy to compute,
$$\frac{\partial L}{\partial \theta_k} = \sum_i f_k\big(x^{(i)}, y^{(i)}\big) - \sum_i \sum_z p\big(z \mid x^{(i)}\big)\, f_k\big(x^{(i)}, z\big) - \theta_k.$$
A log-linear model to label a sentence is of the form
$$P[\text{label} \mid \text{input}, \theta] \propto \exp[\theta \cdot \phi(\text{input}, \text{label})]$$
where $\phi : \text{Inputs} \times \text{Labels} \to \mathbb{R}^d$ is a feature map. This is reminiscent of multinomial logistic regression:
– log-linear: score $= \theta \cdot \phi(\text{input}, \text{label})$;
– logistic: score $= \theta_{\text{label}} \cdot \phi(\text{input})$.

7. A maximum entropy Markov model (MEMM) is of the form
$$P(\text{tag}_1, \dots, \text{tag}_m \mid \text{word}_1, \dots, \text{word}_m) = \prod_i P(\text{tag}_i \mid \text{tag}_1, \dots, \text{tag}_{i-1}, \text{word}_1, \dots, \text{word}_m) = \prod_i P(\text{tag}_i \mid \text{tag}_{i-1}, \text{word}_1, \dots, \text{word}_m) \quad \text{(Markov)}$$
and each factor is modeled with a log-linear model
$$P[\text{tag}_i \mid \text{tag}_{i-1}, \text{sentence}] \propto \exp[\theta \cdot \phi(\text{sentence}, i, \text{tag}_{i-1}, \text{tag}_i)].$$
The most likely tag sequence can be estimated with the Viterbi algorithm. MEMMs differ from HMMs in the use of features (they are difficult to add to a HMM). A conditional random field (CRF) is a big log-linear model, in which the feature vector $\Phi(\text{sentence}, \text{sequence of tags})$ is of the form
$$\Phi(\text{sentence}; \text{tag}_1, \dots, \text{tag}_m) = \sum_i \phi(\text{sentence}, i, \text{tag}_{i-1}, \text{tag}_i).$$
It can be used with the Viterbi algorithm and fitted by gradient descent (the big sum in the normalizing constant can be dealt with using the forward-backward algorithm). One can also consider trigram MEMMs, modeling $P[\text{tag}_i \mid \text{tag}_{i-1}, \text{tag}_{i-2}, \text{sentence}]$, and generalizing trigram HMMs, $P[\text{tag}_i \mid \text{tag}_{i-1}, \text{tag}_{i-2}]$. Features can include
– $x_i = W \wedge \text{tag}_i = T$
– $\text{suffix}(x_i, 3) = ABC \wedge \text{tag}_i = T$
– $\text{prefix}(x_i, 3) = ABC \wedge \text{tag}_i = T$
– $\text{tag}_i = T_1 \wedge \text{tag}_{i-1} = T_2 \wedge \text{tag}_{i-2} = T_3$
– $\text{tag}_i = T_1 \wedge \text{tag}_{i-1} = T_2$
– $\text{tag}_i = T$
– $x_{i-1} = W \wedge \text{tag}_i = T$
– $x_{i-2} = W \wedge \text{tag}_i = T$
– $x_{i+1} = W \wedge \text{tag}_i = T$
– $x_{i+2} = W \wedge \text{tag}_i = T$
– $x_i$ contains [:digit:] $\wedge\ \text{tag}_i = T$
– $x_i$ contains [:upper:] $\wedge\ \text{tag}_i = T$
– $x_i$ contains [-] $\wedge\ \text{tag}_i = T$.

8. The naive Bayes model assumes $X_i \perp\!\!\!\perp X_j \mid Y$:
$$P(Y = y, X_1 = x_1, \dots, X_n = x_n) = P(Y = y) \prod_i P[X_i = x_i \mid Y = y]$$


i.e.,
$$p(y, x_1, \dots, x_n) = q(y) \prod_i q_i(x_i \mid y).$$
The maximum likelihood estimator turns out to be the sample frequencies. If the labels $Y$ are not observed, it is a clustering problem (not unlike k-means, but with discrete variables), which can be tackled with the EM algorithm:
– E-step: estimate the cluster membership probabilities;
– M-step: estimate the parameters.

9. The Viterbi algorithm uses dynamic programming to compute the most likely sequence of hidden states in a HMM, with $T_{i,j}$ = probability of the most likely path of hidden states emitting $o_{1:j}$ and ending in state $x_i$. The forward-backward algorithm computes
$$\sum_{s \,:\, s_j = a} \psi(s) \qquad \text{and} \qquad \sum_{s \,:\, s_j = a,\ s_{j+1} = b} \psi(s)$$
where
$$\psi(s) = \prod_j \psi(s_{j-1}, s_j, j),$$
$$\psi(s_{j-1}, s_j, j) = t(s_j \mid s_{j-1})\, e(x_j \mid s_j) \quad \text{(HMM)}$$
$$\psi(s_{j-1}, s_j, j) \propto \exp[\theta \cdot \phi(x_{1:m}, s_{j-1}, s_j, j)] \quad \text{(CRF)}.$$
Those quantities can be used to compute marginal probabilities. It can be generalized from sequences to parse trees: the inside-outside algorithm allows the EM estimation of PCFGs.

Workflow control-flow patterns: a revised view
R. Russell et al. (2006)

Dependencies between tasks are often represented by a directed acyclic graph (e.g., in a Makefile), but more complicated dependencies are sometimes needed (sequence, and-split, and-join, or-split, or-join, first-to-complete-or-join, m-out-of-n-or-join, cancellations, interleaving, etc.); they can be modeled with coloured Petri nets (CPN).

Temporal evolution of financial-market correlations
D.J. Fenn et al. (2011)

The inverse participation ratio of the $k$th principal component $w_k$ is $\text{IPR} = \sum_i w_{ik}^4$; the participation ratio, $1/\text{IPR}$, is the effective number of assets contributing to the $k$th component.

The evolution of boosting algorithms
A. Mayr et al. (2014)

Bagging fits a model on random subsets of the data and averages their forecasts. Boosting uses a similar idea, with weighted subsets (giving more weight to currently misclassified observations) and a weighted average of the forecasts. Statistical boosting progressively builds an (interpretable) statistical model, instead of averaging forecasts. For instance, if the weak learners are 1-variable GAMs, each boosting iteration adds one variable to the model, in a way similar to the lasso (it provides variable selection and a regularized path). While gradient boosting finds the best (incremental) candidate $h_i(x_i, \beta)$ among $h_1(x_1, \beta), \dots, h_k(x_k, \beta)$ and moves the model slightly in its direction, by adding $\varepsilon h_i(x_i, \beta)$, likelihood-based boosting looks at penalized $h_i(x_i, \beta)$, each maximizing $\text{loglik}(\beta) - \text{penalty}(\beta)$, and adds the best one: thanks to the penalty, there is no need to multiply by $\varepsilon$.

R implementations include mboost, gbm (gradient boosting); GAMBoost, CoxBoost (likelihood-based boosting).

Teaching logic using a state-of-the-art proof assistant
C. Kaliszyk et al. (2007)

Coq is used to teach logic, to avoid the almost-but-not-completely-right proofs many students generate, after defining Coq tactics that exactly match the rules of logic as they are taught.

jHoles: a tool for understanding biological complex networks via clique weight rank persistent homology
J. Binchi et al. (2014)

Look at the persistent homology of the filtered simplicial complex associated to the truncated graphs of a weighted graph. (jHoles converts the weighted graph into a filtered simplicial complex and gives it to JavaPlex to compute the persistent homology.)

Infinite-dimensional word embeddings
E.T. Nalisnick and S. Ravi (2016)

Many statistical or machine learning models have a latent state, with values in a finite-dimensional vector space. One can also use an infinite-dimensional one, e.g., $\ell^2$, by adding a per-dimension penalty.

Distributed representations of sentences and documents
Q. Le and T. Mikolov (2014)

Word2vec can be generalized to paragraphs: instead of predicting the next word from the previous words, keep the word weights $w$ fixed and add the paragraph id as a predictor: the paragraph weights give a vector representation of each paragraph.

(Figure: the paragraph id (§id, weights D) and the context words "the", "cat", "ate", "the" (weights W) are used to predict the next word, "mouse".)

Topological pattern recognition for point cloud data
G. Carlsson (2013)

Gentle introduction to topological data analysis (TDA), with more examples than the other review articles.

To gain insight on the structure of a (contractible) metric space, transform it in some way, e.g., by removing a point, adding a point (1-point compactification, for separated, locally compact, non-compact spaces), removing singular points, removing the "center" (i.e., keeping the "end points") to identify appendages.

Functional persistence homology is the persistence homology of $f^{-1}(\alpha)$ or $f^{-1}([\alpha, +\infty[)$, for some map $f : X \to \mathbb{R}$ (e.g., the centrality).

Applications include:
– Chemistry: the functional persistence homology barcodes measure how close two molecules (clouds of points/atoms) are;
– Genetics: non-trivial homology of a set of (viral) genetic sequences (for the Hamming distance) reveals horizontal gene transfer;
– Time series: the homology of the set of (centered, normalized) fragments $(x_t, x_{t+1}, \dots, x_{t+k})$ can help identify periodic time series;
– Cosmology: persistence homology (in particular the Euler characteristic) can be used to compare the structure of the universe (galaxies form clusters, filaments, walls) with that from a Gaussian random field.

Introduction to the R package TDA
B.T. Fasy et al.

Homology and cohomology computation in finite element modeling
M. Pellikka et al.

gmsh can compute homology and cohomology; for some PDE problems, the boundary only determines the solution within a given cohomology class.

PReMiuM: an R package for profile regression mixture models using Dirichlet processes
S. Liverani et al. (JSS, 2015)

Profile regression is a mixed model in which the groups come from a Dirichlet process mixture clustering.

GPfit: an R package for fitting a Gaussian process model to deterministic simulator outputs
B. MacDonald et al. (JSS, 2015)

When fitting a Gaussian process on non-noisy data, with a Gaussian correlation function
$$R_{ij} = \operatorname{Cor}(y_i, y_j) = \prod_k \exp\big(-\theta_k |x_{ik} - x_{jk}|^2\big), \qquad \theta_k > 0,$$
the correlation matrix can be ill-conditioned (some of the points are too close), and the likelihood has local extrema very close to zero. One can replace $R$ with $R + \delta I$, with the smallest nugget $\delta$ that makes the matrix well-conditioned, and reparametrize $\theta$ as $\theta = \exp \phi$. Check the R packages GPfit (no noise), tgp (noise) or mlegp (noise, numerically unstable). Also mentions lhs::maximinLHS (random but well-dispersed points).

Introducing multivator: a multivariable emulator
R.K.S. Hankin

Gaussian processes in higher dimensions.

Newton's versus Halley's method: a dynamical systems approach
G.E. Roberts and J. Horgan-Kobelski (2003)

To solve $f(x) = 0$, Newton's method approximates $f$ with an affine function, Halley's method with a hyperbola.

Pricing composable contracts on the GP-GPU
J. Ahnfelt-Rønne and M.F. Werk (2011)

SPL (stochastic process language) is a Haskell DSL to describe and (efficiently) price derivatives.

Fractional-parabolic deformations with sinh-acceleration
S. Levendorskiĭ

Some integrals can be computed using:
– A fractional-parabolic deformation, i.e., by replacing an integral over $\mathbb{R}$ with an integral over the deformed contour $\sigma \exp[\alpha \log(1 + i\mathbb{R})]$;
– Sinh acceleration:
$$\int_{\mathbb{R}} f(x)\,dx = \int_{\mathbb{R}} f(\sinh y) \cosh y \, dy \approx \sum_{|n| \leqslant N} f(\sinh n) \cosh n.$$

Market timing and return prediction under model instability
M.H. Pesaran and A. Timmermann (2002)

To detect structural breaks, apply cusum tests to observations reversed in time.
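The sinh-acceleration rule quoted above for Levendorskiĭ's paper can be tried on a simple integrand. The sketch below is only an illustration, not the paper's algorithm: it substitutes x = sinh(y) and applies a plain trapezoid rule, and the step `h`, the truncation `n_max` and the test integrand `cauchy_kernel` are assumptions introduced here (the summary writes the sum with unit step).

```python
import math


def sinh_quadrature(f, h=0.5, n_max=60):
    """Approximate the integral of f over the real line.

    Substitute x = sinh(y), then sum f(sinh(n h)) cosh(n h) for |n| <= n_max
    and multiply by the step h (trapezoid rule on the transformed integrand).
    h and n_max are illustrative choices, not values from the paper.
    """
    total = 0.0
    for n in range(-n_max, n_max + 1):
        y = n * h
        total += f(math.sinh(y)) * math.cosh(y)
    return h * total


def cauchy_kernel(x):
    # Test integrand: its integral over the real line is pi.
    return 1.0 / (1.0 + x * x)


approx = sinh_quadrature(cauchy_kernel)
print(approx, math.pi)
```

With this integrand the transformed function is 1/cosh(y), which decays exponentially, so a few dozen terms already suffice; a heavy-tailed integrand summed directly on a uniform grid would need far more.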


On the equivalence between quadrature rules and random features
F. Bach (2015)

The quadrature problem can be formulated as the problem of approximating an element $i : f \mapsto \int_D f$ of a Hilbert space as a linear combination of well-chosen elements $\text{ev}_x : f \mapsto f(x)$.

It is similar to the kernel approximation problem: finding a low-dimensional embedding $\phi$ such that $k(x, y) \approx \langle \phi(x), \phi(y) \rangle$.

The CMA evolution strategy: a tutorial
N. Hansen (2011)

CMA-ES is a population-based optimization algorithm that models the current population as a Gaussian, samples from this model, keeps the best candidates and iterates.

Gaussian processes for machine learning
C.E. Rasmussen and C.K.I. Williams (2006)

A Gaussian process is an infinite family of random variables $(X_t)_{t \in \mathbb{R}}$; the mean and covariance functions
$$m(t) = E[X_t], \qquad k(s, t) = \operatorname{Cov}(X_s, X_t)$$
suffice to define it. The joint distribution of $X_{t_1}, \dots, X_{t_n}$ is Gaussian, and the conditional distribution $X_{s_1}, \dots, X_{s_m} \mid X_{t_1}, \dots, X_{t_n}$, obtained in the usual way (Schur complement), is the posterior distribution.

To train a Gaussian process, one can use a hierarchical prior, i.e., posit a parametrization of the mean and covariance functions, e.g.,
$$m(t) = a t^2 + b t + c$$
$$k(s, t) = \sigma_1^2 \exp\left(-\frac{(s - t)^2}{\ell^2}\right) + \sigma_2^2 \delta_{ij}$$
(the $\sigma_2^2 \delta_{ij}$ term accounts for noisy observations) and select the hyperparameters via log-marginal likelihood.

Time series analysis using Gaussian processes in Python and the search for Earth 2.0
D. Foreman-Mackey (PyData, 2014)

Application of Gaussian processes to detect transits of extrasolar planets.

Portfolio optimization for VAR, CVaR, omega and utility with general return distributions
W.T. Shaw

Biased random portfolios.

Algorithms for hyper-parameter optimization
J. Bergstra et al. (2011)

A Parzen estimator of a conditional probability distribution $p(x \mid y)$ is an estimator of the form
$$p(x \mid y) = \begin{cases} \ell(x) & \text{if } y < y^* \\ g(x) & \text{if } y \geqslant y^* \end{cases}$$
where $\ell$ and $g$ are kernel estimators (possibly with a Gaussian, etc. prior) and the threshold $y^*$ is some quantile of the sample data, $p(y < y^*) = \gamma$.

The expected improvement ($x$ = parameter, $y$ = loss) is (an increasing function of) $\ell(x)/g(x)$: to get a new candidate solution, one can sample many points from $\ell$ and take that with the highest $\ell(x)/g(x)$.

The parameter space of hyperparameter optimization problems often has a tree structure: the Parzen estimator can be generalized to a tree-structured Parzen estimator (TPE).

Sequential search can be parallelized with the constant liar approach: once a candidate point has been chosen, provisionally set its loss to the average loss.

Check the hyperopt Python package.

Sequential model-based optimization for general algorithm configuration
F. Hutter et al. (2011)

Gaussian processes (GP) can be generalized to mixed continuous-categorical variables:
$$k(x, y) = \exp\Big(-\sum_j \lambda_j\, d(x_j, y_j)^2\Big),$$
where $d$ is the Euclidean or Hamming distance (note that the coordinates are independent).

Sequential model-based optimization (SMBO, aka Bayesian optimization) can use random forests (of regression trees) instead of GPs.

Depending on the problem, you may want to transform the objective function (e.g., replacing the algorithm running time with its logarithm).

Instead of estimating the performance of an algorithm as a sample mean
$$\operatorname*{Mean}_{i \in \text{Instances}} \text{Performance}(i, \theta)$$
on a small set of instances, one can learn the mapping $(\text{instance}, \theta) \mapsto \text{performance}$, using instance features, and use it to estimate $E[\text{Performance}(\cdot, \theta)]$. (You also need a distribution on instance features. SATzilla uses a similar idea to choose the most promising algorithm (DPLL, LP, etc.) for a given instance.)

To maximize the expected improvement (EI), and produce a diverse set of configurations with high EI, use multi-start local search.
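The mixed continuous-categorical kernel from the Hutter et al. summary, k(x, y) = exp(−Σ λ_j d(x_j, y_j)²), is easy to write down explicitly. The following is a minimal sketch under stated assumptions: the function name `mixed_kernel`, the example configurations and the values of λ are all made up for illustration, with d taken as the absolute difference for continuous coordinates and the 0/1 Hamming distance for categorical ones.

```python
import math


def mixed_kernel(x, y, lambdas, categorical):
    """k(x, y) = exp(-sum_j lambda_j * d(x_j, y_j)^2).

    d is |x_j - y_j| for continuous coordinates and the Hamming distance
    (0 if equal, 1 otherwise) for categorical ones.
    """
    total = 0.0
    for xj, yj, lam, is_cat in zip(x, y, lambdas, categorical):
        if is_cat:
            d = 0.0 if xj == yj else 1.0
        else:
            d = abs(xj - yj)
        total += lam * d * d
    return math.exp(-total)


# Hypothetical configurations: (learning rate, batch size, activation).
config_a = (0.1, 32, "relu")
config_b = (0.1, 32, "tanh")
lambdas = (1.0, 0.01, 0.5)
categorical = (False, False, True)

k_same = mixed_kernel(config_a, config_a, lambdas, categorical)
k_diff = mixed_kernel(config_a, config_b, lambdas, categorical)
print(k_same, k_diff)
```

Because the exponent is a sum over coordinates, the kernel is a product of one-dimensional kernels, which is the independence of coordinates noted in the summary.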

Preliminary evaluation of hyperopt algorithms on HPOLib
J. Bergstra et al. (2014)

Testing the hyperopt algorithms on the HPOLib benchmarking suite suggests that vanilla Gaussian processes (with RBF, rather than Matérn kernels) are good enough.

Making a science of model search
J. Bergstra et al. (2013)

TPE is slightly better than random search.

Time-bounded sequential parameter optimization
F. Hutter et al. (2010)

A few improvements on SPO:
– Instead of fitting a noise-free Gaussian process to smoothed data (if there are several, contradictory measurements for the same parameter values, you need to somehow reconcile/smooth them), include noise in the model;
– (When minimizing running time) stop the computations early if they take too long: a censored running time is still informative;
– Use an approximate Gaussian process, the projected process (PP) approximation: instead of considering (inverting) the whole $n \times n$ kernel matrix, select (without replacement) $p$ instances and make do with the $n \times p$ and $p \times p$ submatrices.

SMAC: sequential model-based algorithm configuration

Random-forest-based Bayesian optimization, in Java.

BayesOpt: a Bayesian optimization library for nonlinear optimization, experimental design and bandits
R. Martinez-Cantin (2014)

Gaussian process Bayesian optimization, in C++.

Efficient benchmarking of hyperparameter optimizers via surrogates
K. Eggensperger et al. (2015)

Comparison of three Bayesian optimization methods:
– Gaussian processes (GP, as implemented in Spearmint);
– Random forests (RF, as implemented in SMAC);
– Tree-structured Parzen density estimators (TPE, as implemented in hyperopt)
on realistic benchmarks, surrogates of real-world problems (real-world problems usually require time, software licenses, and/or specialized hardware, and are therefore hard to reproduce). On low-dimensional continuous problems, GP perform best; on more general problems, RF and TPE perform best.

Optimization: algorithms and applications
R.K. Arora (2015)

Relatively exhaustive list of all optimization-related topics, not too deep, but with examples, exercises and Matlab (?) code.

Tree-structured Gaussian process approximation
T. Bui and R. Turner

Gaussian process estimation can be sped up using a (smaller) pseudo-dataset of (noiseless) pseudo-observations; they can be arranged in a tree (à la Barnes-Hut) and used as a graphical model.

F-Race and iterated F-Race: an overview
M. Birattari et al. (2009)

To minimize an expensive function, e.g., some expected cost, estimated by Monte Carlo simulations: take a set of candidates (e.g., a grid); evaluate them in parallel; regularly test if the (more and more precise) estimated values are significantly different (e.g., with a non-parametric ANOVA test); if so, discard the worst candidate; iterate until only one candidate remains. This can be combined with CMA-ES: when the number of candidates has been sufficiently reduced, replace them with new, nearby candidates.

ParamILS: an automatic algorithm configuration framework
F. Hutter et al.

Iterated local search (ILS) builds a sequence of local minima: start with a point; use local search to find a nearby local minimum; add some noise to escape its basin of attraction; iterate, keeping the new minimum if it is better (other acceptance criteria are possible). If the objective function is estimated from a Monte Carlo simulation, there is no need to always use the same number of iterations. If the objective function is the running time of some algorithm, censored values ("more than x", i.e., we stopped after x seconds) are sometimes good enough.

GGA: a gender-based genetic algorithm for the automatic configuration of algorithms
K. Tierney

Gender-based genetic algorithms keep two subpopulations (genders), using different fitness functions. For instance, if the standard fitness is time-consuming to compute, one could use it for one gender and use the other as a "variety store".

R in Finance 2016

The ForecastCombinations package provides various ways of combining predictions:
– Equal-weighted average;
– Precision-weighted average;
– Regression;
– Regression with L1 loss;
– Constrained regression: $\sum_i \beta_i = 1$, $\beta_i \geqslant 0$;
– Average of the best 10% subset regressions;
– BIC-weighted average of all subset regressions.
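The first two combination schemes in the list above can be sketched in a few lines. This is not the ForecastCombinations API (that package is in R): the function names and the toy data are hypothetical, and "precision-weighted" is taken here to mean weights proportional to the inverse mean squared error of each model, which is an assumption about the intended definition.

```python
def equal_weights(n_models):
    """Equal-weighted average: every model gets weight 1/n."""
    return [1.0 / n_models] * n_models


def precision_weights(forecasts, actual):
    """Weights proportional to 1/MSE of each model (assumed definition).

    Assumes no model is perfect on the sample (MSE > 0).
    """
    inv_mse = []
    for series in forecasts:
        mse = sum((f - a) ** 2 for f, a in zip(series, actual)) / len(actual)
        inv_mse.append(1.0 / mse)
    total = sum(inv_mse)
    return [w / total for w in inv_mse]


def combine(forecasts, weights):
    """Weighted average of the forecasts, date by date."""
    n_dates = len(forecasts[0])
    return [
        sum(w * series[t] for w, series in zip(weights, forecasts))
        for t in range(n_dates)
    ]


# Toy data: model_a is accurate, model_b is noisy.
actual = [1.0, 2.0, 3.0, 4.0]
model_a = [1.1, 1.9, 3.1, 3.9]
model_b = [1.5, 2.5, 2.5, 4.5]

weights = precision_weights([model_a, model_b], actual)
combined = combine([model_a, model_b], weights)
print(weights, combined)
```

In practice the weights would be estimated on a training window and applied out of sample; using the same sample for both, as here, only serves to show the arithmetic.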


Stan, a C++-like language to describe Bayesian models, in the spirit of BUGS, is not limited to hierarchical models, but can also be used to model time series: stochastic volatility, mixtures, hidden Markov models, hierarchical stochastic volatility.

Deep learning libraries are no longer limited to Python and Lua: MxNet (a C++ library) can be used from R.

There is now an R package for automatic differentiation (AD): madness.

Several talks dealt with portfolio optimization:
– Multi-objective portfolio construction, with NSGA2 (mco::nsga2; risk parity, cccp::rp, was also mentioned);
– The ROML package provides portfolio optimization functions, a bit like a limited version of disciplined convex optimization, with predefined functions for many objectives. [R finally has a package for disciplined convex optimization: cvxr.]
– Portfolio optimization with PortfolioAnalytics and random portfolios.

Most talks were about finance:
– Do not use the ADF test to identify trading opportunities: it highlights mean reversion (Ornstein-Uhlenbeck process), while you want oscillations. Try a model of the form
$$x_{t+1} = \beta_0 + \beta_1 \times \text{level}_t + \beta_2 \times \text{slope}_t + \text{noise}_{t+1};$$
filter for $\beta_1 < 0$, $\beta_2 > 0$; keep the best fits. One could also try a Hilbert transform.
– SVM for stock selection (on the S&P100): comparison of 15 different kernels [one could also combine them; this is called multiple kernel learning (MKL)].
– One can measure the connectedness of the bank network by fitting a VAR model (use an adaptive lasso penalty to deal with the large number of banks, and an OHLC-based volatility estimator) and computing the H-step ahead contribution of bank $i$ to bank $j$.
– Leveraged ETFs give less return than you may think:
$$f(x) = \operatorname{log1p}\big(k \times \operatorname{expm1}(x)\big) \leqslant k \times x.$$
This becomes more visible when volatility is high, and with compounding:
$$\sum_t f(x_t) \leqslant k \sum_t x_t.$$
– Building an index from GoogleTrends and a list of search terms.

A few talks covered data management issues, e.g., h5, to read and write hdf5 files (a portable file format to store arrays of numbers, data.frames, etc.), feather (to store columnar data and read it from R or Python) or ff to manipulate datasets larger than memory. R can be used in other environments, such as Postgres or Hadoop.

A few talks focused on efficient or extensible computations.
– The Arborist package for random forests also provides quantile regression (i.e., the leaves do not only contain the mean, but quantiles) and monotonicity constraints (reject proposals that breach the constraints with probability $p$, not necessarily 1).
– The roll package provides more rolling functions (mean, var, sd, cor, cov, lm, eigen, pcr, vif); the computations use Rcpp and are parallelized.
– The pirls package re-implements glmnet; it is slower but allows arbitrary link functions.

Many talks showcased Shiny, interactive (Javascript) plots and the use of R to teach finance or options.

There were a few talks about options and stochastic processes:
– The illiquidity of a market can be estimated by comparing the price of an ETF and the NAV of the underlying portfolio; this can be interpreted as a bid-ask spread by noticing that bid and ask prices are American option prices.
– Computation of the returns of the dual moving average (MACD) strategy, under a Brownian motion assumption.
– Convertible bonds, divergence swaps, stochastic local volatility model, etc.

A few talks were only remotely related to finance. For instance, the rearrangement problem is the problem of finding the intra-column rearrangement of a matrix that minimizes row sum variability (variance, maximum minus minimum). Here are a few heuristics:
– Pick a column at random and re-order it so that it be in the order opposite to the sum of the columns;
– Idem with a subset of columns versus the others;
– This subset of columns can be chosen so that its sum has as similar a variance as the other columns.

Why should I trust you? Explaining the predictions of any classifier
M.T. Ribeiro et al. (2015)

To interpret the forecasts of a complex (black box) model, approximate it, locally (as in local regression), with simpler models: sparse linear model (e.g., for a small number of binary features), decision trees, falling rule lists, etc. The explanation is local: each observation gets a different explanation. To explain the whole model, provide a diverse, representative set of observations and explanations; it is a (weighted) set cover problem: given an instance×feature binary (or weight) matrix, select a small number of rows that cover as many columns as possible (a greedy algorithm is good enough).

Falling rule lists
F. Wang and C. Rudin (2014)

A falling rule list is a set of rules of the form "if patient satisfies condition 1, then risk = x% else…", where the risk is decreasing. From a set of candidate conditions $B$ (frequent item set mining: FP-growth, Eclat, Apriori), one can build a Bayesian model (greedily selected decision rules do not perform well).

– $B$ : set of candidate rules
– $\lambda$ : expected number of rules
– $w_j$ : prior probability of $j \in B$
– $\Gamma(\alpha_j, \beta_j)$ : prior for the risk of rule $j$
– $\Gamma(\alpha_\infty, \beta_\infty)$ : prior for the risk of the default rule
– $L \sim \text{Pois}(\lambda)$ : number of rules
– $c_1, \dots, c_L$ : drawn from $B$, without replacement, with probabilities proportional to $w_j$
– $\gamma_j \sim \Gamma(\alpha_j, \beta_j)\,\mathbf{1}_{\geqslant 1}$
– $\gamma_\infty \sim \Gamma(\alpha_\infty, \beta_\infty)$
– $\text{risk}_j = \sum_{i \geqslant j} \log \gamma_i$

Interpretable classifiers using rules and Bayesian analysis: building a better stroke prediction model
B. Letham et al. (2015)

Scalable Bayesian rule lists
H. Yang et al. (2016)

Bayesian rule lists generalize falling rule lists to multinomial outcomes.

The alternating decision tree learning algorithm
Y. Freund and L. Mason (1999)

An alternating decision tree alternates between prediction nodes and splitting nodes; the value of a leaf is the sum of the predictions on its path; the value of a sample is the sum of the values of the paths leading to it.

(Figure: an example alternating decision tree, with prediction nodes (+.5, −.7, +.4, +.2, −.2, −.1, +.3, −.6, +.1) and splitting nodes testing conditions on a and b, such as a > 1 and b > 1.)

They can be learnt by boosting; their performance is similar to boosted decision trees (C5.0), but they are easier to interpret.

Optimizing the induction of alternating decision trees
B. Pfahringer et al.

Alternating decision trees are not scalable: heuristics are needed.

Ensemble samplers with affine invariance
J. Goodman and J. Weare (2010)

The affine-invariant Nelder-Mead optimization algorithm motivates an affine-invariant ensemble of MCMC samplers, using a "stretch move" (adjust the acceptance probability accordingly)
$$x_k^{\text{candidate}} = x_j + Z (x_k - x_j), \qquad j \neq k \text{ random}, \qquad Z \sim p, \qquad p(z) \propto \frac{1}{\sqrt{z}}\, \mathbf{1}_{[1/a,\, a]}(z), \qquad a = 2.$$
Similarly, CMA-ES suggests a "walk move",
$$x_k^{\text{candidate}} \sim N(\mu_S, \Sigma_S)$$
where $S$ is a random subset of particles, $|S| \geqslant 2$, $k \notin S$, and $\mu_S$ and $\Sigma_S$ are their mean and variance.

This is implemented in the emcee Python package.

Integration in finite terms
M. Rosenlicht (1972)

Elementary proof of Liouville's 1834 theorem, which gives a criterion for an indefinite integral to be expressible in closed form. For instance, $\int e^{-x^2} dx$ has no closed form.

An invitation to integration in finite terms
E.A. Marchisotto and G.A. Zakeri (1994)

More elementary presentation of Liouville's theorem.

Smooth numbers: computational number theory and beyond
A. Granville (2008)

Smooth numbers are numbers whose prime factors are small. A number is $y$-smooth if its prime factors are at most $y$. Let $\psi(x, y)$ be the number of $y$-smooth integers up to $x$. Then
$$\frac{\psi(x, x^{1/u})}{x} \xrightarrow[x \to \infty]{} \rho(u) > 0$$
where
$$\rho(u) = 1, \qquad u \in [0, 1]$$
$$\rho(u) = 1 - \log u, \qquad 1 \leqslant u \leqslant 2$$
$$\rho(u) = \frac{1}{u} \int_{u-1}^{u} \rho(t)\,dt, \qquad u > 1.$$
They play a role in cryptography and computational number theory.

Generalizing dynamic time warping to the multidimensional case requires an adaptive approach
M. Shokoohi-Yekta et al.

Dependent and independent multidimensional dynamic time warping (DTW) give different results. Which is best may be problem-, class- or exemplar-specific, and can be learned.
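The difference between dependent and independent multidimensional DTW can be made concrete on a tiny example. The sketch below is an illustration only (the function names and toy series are introduced here, and the squared Euclidean local cost is an assumption): the dependent variant warps all dimensions with a single path, while the independent variant warps each dimension separately and sums the distances.

```python
def dtw(a, b, dist):
    """Classic O(len(a) * len(b)) dynamic time warping distance."""
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = dist(a[i - 1], b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


def sq(u, v):
    return (u - v) ** 2


def sq_vec(p, q):
    return sum((u - v) ** 2 for u, v in zip(p, q))


def dtw_dependent(series_a, series_b):
    """One warping path shared by all dimensions."""
    return dtw(series_a, series_b, sq_vec)


def dtw_independent(series_a, series_b):
    """One warping path per dimension; the distances are summed."""
    dims = range(len(series_a[0]))
    return sum(
        dtw([p[k] for p in series_a], [q[k] for q in series_b], sq)
        for k in dims
    )


# Dimension 0 has its bump delayed in B, dimension 1 has it advanced:
# no single warping path can align both dimensions at once.
A = [(0, 0), (1, 0), (0, 1), (0, 0)]
B = [(0, 0), (0, 1), (1, 0), (0, 0)]

d_dep = dtw_dependent(A, B)
d_ind = dtw_independent(A, B)
print(d_dep, d_ind)
```

The independent distance can never exceed the dependent one for this cost, since any shared path is also admissible for each dimension taken alone; whether the extra freedom helps or hurts classification is the data-dependent question the paper addresses.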


Clustering time series using unsupervised shapelets
J. Zakaria et al.

(After normalizing the time series) apply clustering algorithms to the matrix of distances between the time series and "shapelets" (shorter time series); a small number of discriminative shapelets suffices.

Foster-Hart optimal portfolios
A. Anand et al.

For a bounded random variable $X$ such that $E[X] > 0$ and $P[X < 0] > 0$, the Foster-Hart risk is the positive number $r$ such that $E \log(1 + X/r) = 0$. It is not coherent, not convex, not time-consistent, not guaranteed to exist.

Option-implied equity premium predictions via entropic tilting
K. Metaxoglou et al. (2016)

Entropic tilting transforms a probability distribution $\pi$ to change some of its moments $g$.

Find $\pi^*$
To minimize $\text{KL}(\pi^* \| \pi) = \int \pi^* \log \dfrac{\pi^*}{\pi}$
Such that $\int g(x)\, \pi^*(x)\, dx = \bar g$

Using semantic fingerprinting in finance
F. Ibriyamova et al.

Rather than cosine similarity between bags of words, use vector embeddings (here, Numenta's "semantic fingerprint") to identify similar companies.

Pure quintile portfolios
D. Liu

Let $r$ be the vector of stock returns (at a given date), $X$ the matrix of (normalized) factor exposures (stock×factor); the factor returns are $(X'X)^{-1} X' r$; the factor mimicking portfolios are the rows of $(X'X)^{-1} X'$. The factor quintile portfolios are the minimum variance portfolios subject to the constraints
(i) $n/5$ stocks, long-only;
(ii) The exposure to the factor of interest is the same as that of the quintile portfolio;
(iii) The exposure to the other factors is zero.

Equivalence of robust VaR and CVaR optimization
S. Lotfi and S.A. Zenios (2016)

Robust optimization
$$\operatorname*{Minimize}_{\alpha}\ \operatorname*{Max}_{\beta \in B}\ \text{Loss}(\alpha, \beta)$$
optimizes the worst case situation, when the values of some parameters are constrained to remain in some box or ellipsoid. Robust CVaR and robust VaR optimization turn out to be equivalent.

An introduction to ROC analysis
T. Fawcett (2005)

Precision-recall curves are affected by class skew; ROC curves are not.

Given a set of classifiers, look at the convex hull of their ROC curves; given two classifiers A and B, the random mixtures $pA + (1 - p)B$ form the $[AB]$ segment.

The AUC can be interpreted as the probability, given two observations from the two classes, of correctly distinguishing them.

To estimate the "precision" of a ROC curve, use bootstrapped curves and look at the variation of tp given fp, or (fp, tp) given the score.

For multiclass problems, use
$$\sum_i p(c_i)\, \text{AUC}(c_i) \qquad \text{or} \qquad \frac{2}{|C|\,(|C| - 1)} \sum_{i < j} \text{AUC}(c_i, c_j).$$

[...] 2level(child)

Bloofi: multidimensional Bloom filters
A. Crainiceanu and D. Lemire (2015)

The bitwise or between Bloom filters (same length, same hash function) is still a Bloom filter. By arranging Bloom filters in a B-tree, one can efficiently find in which Bloom filters an element is.

ZeroDB white paper
M. Egorov and M. Wilkison (2016)

Encrypted databases expose their B-trees to their clients.

A review of homomorphic encryption and software tools for encrypted statistical machine learning
L.J.M. Aslett et al.

Homomorphic encryption schemes allow some arithmetic operations on the ciphertext, usually additions and multiplications:
$$\operatorname{dec}\big(\operatorname{enc}(x) \boxplus \operatorname{enc}(y)\big) = x + y.$$
In C++, check the HElib library.

Encrypted statistical machine learning: new privacy-preserving methods
L.J.M. Aslett (2015)

The limitations of homomorphic encryption, in particular the absence of division, exponential and comparisons, require a redesign of most statistical procedures; the EncryptedStats package provides naive Bayes and extreme random forests. Limited forms of comparison and division are actually available:
– By quantizing the values and replacing them with indicator variables, comparisons such as $x = y$ can be expressed as a scalar product $\sum_k x_k y_k = 1$; no branching is possible, but t ? a : b can be written $ta + (1 - t)b$;
– Approximate division can be performed using (encrypted) random number generation, relying on $X \sim \text{Geom}(p) \implies E[X] = 1/p$.

Smoothing spline ANOVA models: R package gss
C. Gu (JSS, 2014)

Classical Anova, $\mu_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}$, can be generalized to functions
$$f(x_1, x_2) = (1 - A_1 + A_1)(1 - A_2 + A_2) f = A_1 A_2 f + (1 - A_1) A_2 f + (1 - A_2) A_1 f + (1 - A_1)(1 - A_2) f = \phi_0 + \phi_1(x_1) + \phi_2(x_2) + \phi_{12}(x_1, x_2)$$
where $A_1$ and $A_2$ are averaging operators, e.g., $(A_1 f)(x_2) = \int_a^b f(x_1, x_2)\,dx_1 / (b - a)$, and the problem

is usually penalized, e.g.,

Find $\phi_0, \phi_1, \phi_2, \phi_{12}$
To minimize
$$(\phi_0 - A_1 A_2 f)^2 + \int \big(\phi_1 - (1 - A_1) A_2 f\big)^2 + \int \big(\phi_2 - (1 - A_2) A_1 f\big)^2 + \iint \big(\phi_{12} - (1 - A_1)(1 - A_2) f\big)^2 + \lambda_1 \int |\phi_1''|^2 + \lambda_2 \int |\phi_2''|^2 + \lambda_{12} \iint \|\phi_{12}''\|^2.$$
If there are no interactions, this is just a generalized additive model (GAM). The same framework also applies to log-density estimation (with less overfitting than density). In particular, check the ssanova and ssden functions.

    ssanova(y ~ x)         # Smoothing
    ssanova(y ~ x1 + x2)   # GAM
    ssanova(y ~ x1 * x2)   # GAM with interactions
    ssden(~ s)             # Plot with dssden()

Hierarchical archimedian copulae: the HAC package
O. Okhrin and A. Ristig (JSS, 2014)

Archimedian copulas,
$$C_\phi(u_1, \dots, u_d) = \phi\big(\phi^{-1}(u_1) + \cdots + \phi^{-1}(u_d)\big),$$
are exchangeable. By introducing pseudo-variables $C(u_i, u_j)$, one can consider hierarchical archimedian copulas, e.g.,
– $C(u_1, u_2, u_3)$
– $C(C(u_1, u_2), C(u_3, u_4))$
– $C(C(C(u_1, u_2), u_3), u_4)$

(Figure: the corresponding trees, with leaves $u_1, \dots, u_4$.)

The hierarchical structure can be inferred by finding the two closest variables, replacing them with $C(u_i, u_j)$, and iterating; consecutive nodes with a very similar generator $\phi$ can be collapsed. Check the estimate.copula and [rpd]HAC functions.

General purpose convolution algorithm in S4 classes by means of FFT
P. Ruckdeschel and M. Kohl (JSS, 2014)

The distr package provides operations (+, ×, etc.) on independent random variables, e.g.,
$$X \sim N(0, 3) \text{ and } Y \sim N(0, 4) \implies X + Y \sim N(0, 5)$$
by discretizing and convolving the distributions when no closed form is available.

mediation: R package for causal mediation analysis
D. Tingley et al. (JSS, 2014)

Given random variables $T$ (treatment, binary), $M$ (suspected mediator) and $Y$ (outcome), one can measure how much of a role $M$ plays in the causal process from $T$ to $Y$

(Figure: causal diagram linking the treatment $T$, the mediator $M$ and the outcome $Y$, with effects $\delta$ and $\zeta$.)

by modeling $M \sim T$ and $Y \sim T + M$ and considering
– Total treatment effect: $\tau = Y\big(1, M(1)\big) - Y\big(0, M(0)\big)$
– Causal mediation effect: $\delta(t) = Y\big(t, M(1)\big) - Y\big(t, M(0)\big)$
– Direct effect: $\zeta(t) = Y\big(1, M(t)\big) - Y\big(0, M(t)\big)$,
which gives decompositions
$$\tau = \delta(t) + \zeta(1 - t), \qquad t \in \{0, 1\}.$$

hmmm: an R package for hierarchical multinomial marginal models
R. Colombi et al. (JSS, 2014)

Marginal models model the joint distribution of qualitative variables as

copulaedas: an R package for estimation of distribution algorithms based on copulas Y. Gonzalez-Fernandez and M. Soto (2014)

with constraints on some of those factors to enforce conditional independencies (equalities) or stochastic dominance (inequalities) – not unlike graph-less graphical models.

EDA (estimation of distribution algorithms, CMA-ES) optimizes a black-box function as follows: – – – – – –

Start with a random set of candidates Improve them by local search; Keep the best ones; Estimate their distribution; Sample more candidates from this distribution Iterate.

The distribution is often estimated as a Gaussian, but one can also use copula-based models: independence copula, Gaussian copula, hierarchical archimedian copula, regular vine, etc.

Article and book summaries by Vincent Zoonekynd

P (X1 , . . . , Xn ) = P (X1 )P (X2 |X1 )P (X3 |X1 , X2 ) · · · · · · P (Xn |X1 , . . . , Xn−1 ),
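The estimation-of-distribution loop from the copulaedas summary (sample, keep the best, re-estimate, iterate) can be sketched in a few lines. This is only an illustration under simplifying assumptions, not the copulaedas API: the distribution is a product of independent Gaussians (the "independence copula" case), the local search step is omitted, and the function name gaussian_eda, its parameters and the test function sphere are all hypothetical.

```python
import random
import statistics

def gaussian_eda(f, dim, n_candidates=200, n_keep=40, n_iter=60, seed=0):
    """Minimize f with an EDA using independent Gaussian margins (a sketch)."""
    rng = random.Random(seed)
    mean = [0.0] * dim
    std = [5.0] * dim  # wide initial search distribution (arbitrary choice)
    for _ in range(n_iter):
        # Sample candidates from the current distribution.
        candidates = [[rng.gauss(m, s) for m, s in zip(mean, std)]
                      for _ in range(n_candidates)]
        # Keep the best ones (no local search in this sketch).
        best = sorted(candidates, key=f)[:n_keep]
        # Re-estimate the distribution, one Gaussian per coordinate.
        mean = [statistics.fmean([x[j] for x in best]) for j in range(dim)]
        std = [statistics.pstdev([x[j] for x in best]) + 1e-12 for j in range(dim)]
    return mean

def sphere(x):
    """Toy black-box objective, minimal at (1, ..., 1)."""
    return sum((xi - 1.0) ** 2 for xi in x)

solution = gaussian_eda(sphere, dim=3)
```

Replacing the two re-estimation lines with a copula fit (and the sampling line with a draw from that copula) would give the copula-based variants the summary mentions.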

structSSI: simultaneous and selective inference for grouped or hierarchically structured data K. Sankaran and S. Holmes (JSS, 2014)

Given m p-values p1 ⩽ · · · ⩽ pm, multiple testing procedures usually look at the family-wise error rate (FWER), i.e., try to ensure P[at least one false positive] ⩽ α
– Bonferroni: reject H0(k) if pk ⩽ α/m;
– Šidák: reject H0(k) if $p_k \le 1 - (1 - \alpha)^{1/m}$;
– Holm–Bonferroni: reject H0(k) as long as pk ⩽ α/(m + 1 − k);
– Hochberg: reject H0(k) as long as there exists ℓ ⩾ k such that pℓ ⩽ α/(m + 1 − ℓ)
or the false discovery rate (FDR), i.e., try to ensure

$$E\left[\frac{\text{false positives}}{\text{positives}}\right] \le \alpha;$$

– Benjamini–Hochberg (BH): reject H0(k) as long as pk ⩽ kα/m;
– BHY: reject H0(k) as long as $p_k \le k\alpha \big/ \bigl(m \sum_{i=1}^m 1/i\bigr)$.

But they tend to be optimal for independent tests, overly conservative for positively dependent tests, and incorrect for negatively dependent tests – those that remain valid account for an arbitrary dependence structure, but that is too general an assumption.

Some extensions of those tests account for a hierarchical structure:
– Group BH (GBH) estimates the proportion of H1's in each group and reweighs the p-values accordingly;
– Hierarchical FDR (HFDR) accepts or rejects the null hypothesis at the group level, and only looks at individual tests in groups whose null hypothesis was rejected.

LDAvis: a method for visualizing and interpreting topics C. Sievert and K.E. Shirley

To find the words most relevant to a given topic, one can rank them according to P(term|topic) (but this gives too much weight to common, non-discriminative words), or P(term|topic)/P(term) (lift – but this gives too much weight to rare, non-discriminative words), or, better, a weighted average of their logarithms (“relevance”).

fitdistrplus: an R package for fitting distributions M.L. Delignette-Muller and C. Dutang (2015)

One can fit a distribution to data using: maximum likelihood (MLE), moment matching, quantile matching, maximum goodness of fit (gof); the gof can be assessed with the Kolmogorov-Smirnov, Cramer-von Mises or Anderson-Darling distance (AD is a variant of CvM with more weight on the tail – there are 1-sided variants, and variants with more weight).

SDD: an R package for serial dependence diagrams L. Bagnato et al. (JSS, 2015)

The autocorrelation function (ACF) relies on linear correlation. Instead, the dependogram uses the independence χ2 (or its normalization, Cramer’s V) between X· and X·−k, discretized into n quantile bins, or the Kullback-Leibler divergence between (kernel) density estimators of X·, X·−k and (X·, X·−k).

Fitting heavy tailed distributions: the poweRlaw package C.S. Gillespie (JSS, 2015)

To fit a power law distribution $p(x) \propto x^{-\alpha} 1_{x \ge x_{\min}}$ on the tail of the data, use the maximum likelihood estimator

$$\hat\alpha = 1 + n \Big/ \sum_{x \ge x_{\min}} \log\frac{x}{x_{\min}};$$

choose the xmin minimizing the Kolmogorov-Smirnov (KS) distance; bootstrap to estimate the uncertainty on xmin; to estimate the goodness of fit, look at the distribution of the KS statistic on data sampled from the fitted distribution.

ngspatial: a package for fitting the centered autologistic and sparse spatial generalized linear mixed models for areal data J. Hughes

The autologistic model

$$\log\frac{P[z_i = 1]}{P[z_i = 0]} = x_i'\beta + \eta \sum_{j \in \mathrm{Neigh}(i)} Z_j$$

is confounded; one can center the autocovariate with

$$\log\frac{P[z_i = 1]}{P[z_i = 0]} = x_i'\beta + \eta \sum_{j \in \mathrm{Neigh}(i)} (Z_j - \mu_j), \qquad \mu_j = \bigl(1 + e^{-x_j'\beta}\bigr)^{-1}.$$

abctools: an R package for tuning approximate Bayesian computation analyses M.A. Nunes and D. Prangle

Approximate Bayesian computation (ABC) is a Bayesian method that uses simulations instead of likelihoods. Rejection ABC proceeds as follows:
– Draw a parameter θ from the prior;
– Simulate data x ∼ p(·|θ); compute statistics s(x);
– If d(sobs, s) ⩽ ε, accept θ;
– Iterate until you have enough samples.

The abctools package complements the abc package and can help choose the statistics from a set of candidates, e.g., by approximate sufficiency (greedily add the statistics that produce the largest change in the posterior distribution; stop when it becomes insignificant) or by finding the subset minimizing some information criterion (there are also projection methods) and help choose ε.
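The rejection ABC loop described in the abctools summary (draw θ from the prior, simulate, compare summary statistics, accept if close) can be sketched as follows. This is a toy illustration, not the abc or abctools API: the function rejection_abc and its arguments are hypothetical, the model is a Gaussian with unknown mean and known unit variance, the prior is uniform, and the distance d is the absolute difference of sample means.

```python
import random
import statistics

def rejection_abc(s_obs, prior, simulate, summary, eps, n_accept, rng):
    """Return n_accept prior draws whose simulated summary is within eps of s_obs."""
    accepted = []
    while len(accepted) < n_accept:
        theta = prior(rng)                   # draw a parameter from the prior
        s = summary(simulate(theta, rng))    # simulate data, compute statistics
        if abs(s - s_obs) <= eps:            # d(s_obs, s) <= eps: accept theta
            accepted.append(theta)
    return accepted

rng = random.Random(1)
posterior = rejection_abc(
    s_obs=2.0,                                        # observed sample mean (made up)
    prior=lambda r: r.uniform(-5.0, 5.0),
    simulate=lambda theta, r: [r.gauss(theta, 1.0) for _ in range(20)],
    summary=statistics.fmean,
    eps=0.1,
    n_accept=200,
    rng=rng,
)
posterior_mean = statistics.fmean(posterior)
```

A smaller ε gives a better approximation of the posterior but a lower acceptance rate, which is the trade-off the package helps tune.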

MVN: an R package for assessing multivariate normality S. Korkmaz et al.

To check multivariate normality:
– Look at the 1-dimensional margins: histograms, qqplots, normality tests;
– Plot the 2-dimensional densities (persp, contour);
– Plot the sample Mahalanobis distance versus χ2 quantiles;
– Try multivariate normality tests: Mardia (based on skewness and kurtosis), Henze–Zirkler (combines $\sum e^{-\alpha D_{ij}}$ and $\sum e^{-\beta D_i}$, where $D_{ij} = d_{\text{Mahalanobis}}(x_i, x_j)$ and $D_i = d_{\text{Mahalanobis}}(x_i, \bar x)$), Royston (Shapiro–Wilk);
– Remove outliers (robust Mahalanobis distance, from minimum covariance determinant estimators instead of sample covariances).

paircompviz: an R package for visualization of multiple pairwise comparison test results M. Burda

The result of multiple pairwise tests is often represented by a line diagram, or a letter diagram.

[Diagram: line diagram over treatments T1, ..., T5, showing T1 ≠ T5.]

Instead, one can consider the relation Ti ≻ Tj if i is significantly different from j, hope it is a partial order (this happens surprisingly often) and draw its Hasse diagram (i.e., remove the edges that can be inferred by connectivity) after compressing complete bipartite subgraphs.

[Diagram: the complete bipartite graph between {A, B, C} and {D, E, F} is replaced by a single edge between the two groups.]

logcondens: computation related to univariate log-concave density estimation L. Dümbgen and K. Rufibach

The log-concave density estimator is parameter-free, and log-concavity is a plausible assumption for many datasets: consider using its smoothed version (it is piecewise log-linear and has discontinuities at both ends of the support) instead of a kernel density estimator. For a multivariate estimator, check LogConcDEAD.

Online graph pruning for pathfinding on grid maps D. Harabor and A. Grastien

To account for symmetry and avoid examining nodes unnecessarily, path finding algorithms on grids try to recognize rectangular rooms, dead-ends, swamps, neighbours to prune and “jump points”.

Metabolic paths in world economy and crude oil price F. Picciolo et al. (2015)

Measure paths in the world trade web (WTW) with:
– Trade imbalance $b = \sum_i \mathrm{Min}(w_{\cdot i}, w_{i\cdot}) \big/ \sum_{ij} w_{ij}$;
– Reciprocity $r = \sum_{ij} \mathrm{Min}(w_{ij}, w_{ji}) \big/ \sum_{ij} w_{ij}$;
– Cycling index

$$\Gamma^{(s)} = \frac{\sum_i s_{i\cdot}\,\Gamma_i^{(s)}}{\sum_i w_{i\cdot}}, \qquad \Gamma_i^{(s)} = \frac{u_{ii}^{(s)} - 1}{u_{ii}^{(s)}}, \qquad U = I + M + M^2 + \cdots + M^s, \qquad m_{ij} \propto w_{ij} \text{ (stochastic matrix)},$$

where $\Gamma_i^{(s)}$ is the fraction of trade that goes back to i within s steps.

Look at the corresponding time series, the contribution of each country, the correlation with oil, etc.

Forecasting stock returns during good and bad times D. Huang et al. (2012)

Log-prices can be modeled as the sum of a random walk (permanent shocks) and an AR(1) process (temporary shocks)

$$p_t = q_t + z_t, \qquad q_t = \mu + q_{t-1} + \varepsilon_{1,t}, \qquad z_t = \lambda z_{t-1} + \varepsilon_{2,t}, \qquad \mathrm{Var}\,\varepsilon = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix},$$

where the correlation ρ between the random walk and the AR(1) innovations is a market state indicator: ρ ⩾ 0 corresponds to good times, overconfidence, low risk premium, reversal, Cov(rt+1, rt) < 0. Empirically, use MRI = (12-month return − µ)/σ (where µ and σ are the long-term mean and standard deviation of the 12-month return) as momentum, and $I_{MA} = 1_{\text{log-price} > \mathrm{MA}(\text{log-price}, 200d)}$ or $I_{SR} = 1_{\mathrm{MA}(\text{Sharpe}, 6m) > 30\%\text{ long-term quantile}}$ as market state indicator.

Risk premia: asymmetric tail risks and excess returns Y. Lempérière et al.

The ranked P&L skewness is the area under the curve of the cumulated standardized values, sorted by increasing absolute values (there are other low-moment estimators of skewness, e.g., the normalized mean-minus-median). The risk premium present in many markets seems to be due to skewness rather than volatility – investors do not fear small Gaussian fluctuations, but rather large tail events.

Forecasting the equity risk premium: the role of technical indicators C.J. Neely et al. (2010)

To forecast index returns, use both macroeconomic variables (dividend yield, robust excess return volatility, term spread) and technical indicators (MACD, momentum, MACD of the signed volume).

Financial volatility and economic activity F. Fornari and A. Mele (2010)

Volatility (robust MAD estimator, $\sqrt{\pi/2}\,\langle |\mathrm{ret}_t| \rangle$) and term spread can predict industrial production growth.

Macroeconomic determinants of stock volatility and volatility premiums V. Corradi et al. (2012)

Conversely, the business cycle can predict volatility.

Short interest and aggregate stock returns D.E. Rapach (2014)

Aggregate short interest (as a proportion of shares outstanding, detrended) is informative.

The vector algebra war: a historical perspective J.M. Chappell

Different formulations can be used to compactly write Maxwell’s field equations:
– Gibbs vectors, i.e., vectors in R3 with dot and cross product (R3, ·, ×);
– Quaternions H, a 4-dimensional algebra with $i^2 = j^2 = k^2 = ijk = -1$;
– Clifford algebra, Cℓ(R3), an 8-dimensional algebra, with $e_i e_j = -e_j e_i$, $e_1^2 = e_2^2 = e_3^2 = +1$, with the natural embedding ei ↦ ei from R3; the dot and cross product can be recovered as $vw = v \cdot w + e_1 e_2 e_3\, v \times w$.

With the Clifford (or geometric) algebra, Maxwell’s equations become ∂(E + jcB) = ρ/ε − µcJ. [No mention of differential forms or screw calculus.]

Recent results on pattern maximum likelihood J. Acharya et al. (2009)

Given an iid sequence of symbols, e.g., HHTHTTH, where the list of possible symbols is not known, what can we say about the multiset of symbol probabilities? Standard maximum likelihood (estimate the probabilities of each symbol and forget the order) only works well for large samples. Pattern maximum likelihood (PML) replaces the symbols by numbers, starting with the most frequent, and reorders them, e.g., ababca→121231→111223. A pattern probability p is a non-increasing sequence p1 ⩾ p2 ⩾ · · · ⩾ 0 with ∑ pi ⩽ 1; the remainder 1 − ∑ pi is the continuous part of p. The PML of a pattern ψ is

$$\mathrm{PML}(\psi) = \operatorname*{Argmax}_p\; p(\psi).$$

Algebraic computation of pattern maximum likelihood J. Acharya et al. (2011)

The PML can be computed analytically with Gröbner bases. For instance, PML(1112234) = (1/5, 1/5, 1/5, 1/5, 1/5).

Analytic combinatorics P. Flajolet and R. Sedgewick (2009)

The traditional approach to combinatorics, to estimate the number an of a certain type of objects (e.g., graphs of size n with a certain property) is the following:
– Find a recurrence relation involving an (if you can solve it, you are done);
– Consider the generating function $f(x) = \sum a_n x^n$ or (if an grows too fast) $f(x) = \sum a_n x^n / n!$;
– Transform the recurrence relation for a into a functional (or differential) equation for f;
– Solve the equation or, somehow, use it to infer the asymptotic behaviour of a.

Analytic combinatorics goes directly from the combinatorial description of the problem to the functional equation for the generating function, and uses complex analysis to estimate the asymptotic behaviour of its coefficients.

1. A combinatorial class (A, |·|) is a set A with a size function |·| : A → N such that, for all n, an = #{x ∈ A : |x| = n} is finite. Its ordinary generating function is $A(z) = \sum a_n z^n \in \mathbf{R}[[z]]$. The coefficients are denoted $[z^n]A(z) = a_n$.

Simple operations on combinatorial classes translate to their generating functions as follows.

A ⊔ B     $C(z) = A(z) + B(z)$
A × B     $C(z) = A(z) B(z)$
Seq B     $C(z) = 1 / (1 - B(z))$
Cyc B     $C(z) = \sum_{k \ge 1} \frac{\phi(k)}{k} \log \frac{1}{1 - B(z^k)}$
MSet B    $C(z) = \prod_{n \ge 1} (1 - z^n)^{-B_n} = \exp\Bigl(\sum_{k \ge 1} \frac{1}{k} B(z^k)\Bigr)$
PSet B    $C(z) = \prod_{n \ge 1} (1 + z^n)^{B_n} = \exp\Bigl(\sum_{k \ge 1} \frac{(-1)^{k-1}}{k} B(z^k)\Bigr)$

The last three formulas are the Pólya logarithm, exponential, and modified exponential. The sequences $\mathrm{Seq}\,B = \{\varepsilon\} \sqcup B \sqcup B \times B \sqcup \cdots = \coprod_{n \ge 0} B^n$ are only defined if b0 = 0; the cycles are Cyc B = (Seq B \ {ε})/S; the multisets are MSet B = (Seq B)/S.

Here are a few examples.
– E: single object of size 0;
– Z: single object of size 1;

– I = Seq⩾1 Z: non-zero integers;
– Seq(Z ⊔ Z × Z): coverings of ⟦0, n⟧ by intervals of length 1 or 2 (Fibonacci numbers);
– T = E ⊔ T × Z × T: triangulations (Catalan numbers);
– G = Z × Seq G: rooted plane trees – a tree is a root and a sequence of trees (shifted Catalan numbers);
– H = Z × MSet H: non-plane (rooted) trees;
– C = Seq I: compositions (ways of writing an integer as a sum of integers);
– P = (Seq I)/S = MSet I: partitions;
– A = mZ: m-letter alphabet;
– W = Seq A: words.

The language recognized by a finite automaton has generating function $L(z) = u'(I - zT)^{-1}v$, where T is the incidence matrix (it is not binary because there can be several edges, with different labels, between the same vertices), and u and v the indicator vectors of the initial and final states. Rational languages have a rational generating function; context-free languages have an algebraic generating function.

2. A labelled combinatorial class is a combinatorial class (A, |·|) with an action of the symmetric group Sn on each An = {x ∈ A : |x| = n}. Informally, think of the elements of A as graphs (perhaps with some added structure) whose vertex set is ⟦1, n⟧ – the notion of species of structure makes this rigorous and unifies the labelled and unlabelled cases. Its exponential generating function is $A(z) = \sum a_n z^n / n!$. Operations on labelled combinatorial classes translate to their exponential generating functions.

A ⊔ B     $C(z) = A(z) + B(z)$
A ⋆ B     $C(z) = A(z) B(z)$
Seq B     $C(z) = 1 / (1 - B(z))$
Set B     $C(z) = \exp B(z)$
Cyc B     $C(z) = \log \frac{1}{1 - B(z)}$

Given two labelled objects α ∈ A, β ∈ B, one can build labelled objects (α′, β′) by relabelling α and β while keeping their order (there are many such relabelings). The set of all such labelled objects forms the labelled product A ⋆ B. Examples include:
– R = Seq Set⩾1 Z: surjections;
– S = Set Set⩾1 Z: partitions;
– P = Set Cyc Z: permutations;
– I = Set Cyc1,2 Z: involutions;
– D = Set Cyc>1 Z: derangements;
– T = Z ⋆ Set T, $T(z) = z e^{T(z)}$: a (non-plane) rooted tree is a root and a set of rooted trees;
– T = U•: a rooted tree is a pointed unrooted tree;
– F = Set Cyc T, $F(z) = (1 - T(z))^{-1}$: the graph of an endomorphism is a set of cycles, with a tree attached to each element of those cycles.

3. This can be generalized to combinatorial classes with parameters (A, |·|, χ), χ : A → N^d, and multivariate generating functions, $A(z, u) = \sum a_{n,k} z^n u^k$, where $a_{n,k} = \#\{a \in A : |a| = n \text{ and } \chi(a) = k\}$. This leads, for instance, to the expected number of cycles in a permutation of size n, $\mu_n = \sum_{k=1}^n 1/k \sim \log n$.

4. Complex analysis, starting with Cauchy’s formula

$$[z^n]f(z) = \frac{1}{2\pi i} \oint_\Gamma \frac{f(z)\,dz}{z^{n+1}},$$

can help study the asymptotic behaviour of those power series. If f is analytic at 0, with radius of convergence R, then $f_n = [z^n]f(z)$ is of exponential order $R^{-n}$, $f_n \bowtie R^{-n}$, i.e., $\limsup |f_n|^{1/n} = 1/R$. Saddlepoint bounds are often accurate:

$$[z^n]f(z) \le \inf_{r \in (0, R)} r^{-n} \sup_{|z| = r} |f(z)|.$$

If f is meromorphic on [|z| = R], then

$$[z^n]f(z) = \sum_i P_i(n)\,\alpha_i^{-n} + O(R^{-n}),$$

where αi are the poles of f on the circle [|z| = R] and Pi are polynomials of degree at most the order of the poles minus 1.

6. If f is meromorphic on D = [|z| ⩽ R], with radius of convergence R, and dominant singularities (i.e., on the boundary of the convergence disk) ζ1, . . . , ζk, then $g(z) = f(z) - \sum c_i (1 - z/\zeta_i)^{-\alpha_i}$ is holomorphic on a neighbourhood of D, and the asymptotic behaviour of $[z^n]f(z)$ is that of $\sum c_i (1 - z/\zeta_i)^{-\alpha_i}$. For a more precise asymptotic estimate, repeat the process with g.

In particular, if there is only one dominant singularity,

$$f(z) \sim (1 - z/\zeta)^{-\alpha} \implies f_n \sim \frac{1}{\Gamma(\alpha)}\,\zeta^{-n} n^{\alpha - 1}.$$

This can be generalized to singularities of the form

$$(1 - z)^{-\alpha} \left(\frac{1}{z} \log \frac{1}{1 - z}\right)^{\beta}.$$
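As a numerical sanity check of the single-dominant-singularity estimate, one can take the Catalan numbers, whose generating function T(z) = (1 − √(1 − 4z))/(2z) satisfies T = 1 + zT² (the class T = E ⊔ T × Z × T above). Near its singularity ζ = 1/4, T(z) = 2 − 2(1 − 4z)^(1/2) + ..., so the singular part has coefficient c = −2 and exponent α = −1/2. The sketch below compares the resulting estimate with the exact values; the helper names are hypothetical and the choice n = 200 is arbitrary.

```python
import math

def catalan(n):
    """n-th Catalan number, binomial(2n, n) / (n + 1), as an exact integer."""
    return math.comb(2 * n, n) // (n + 1)

def singularity_estimate(n, c, zeta, alpha):
    """Coefficient estimate c * zeta**(-n) * n**(alpha - 1) / Gamma(alpha)."""
    return c * zeta ** (-n) * n ** (alpha - 1) / math.gamma(alpha)

n = 200
exact = catalan(n)
approx = singularity_estimate(n, c=-2.0, zeta=0.25, alpha=-0.5)
ratio = approx / exact   # should be close to 1 for large n
```

The estimate simplifies to 4^n / (√π n^(3/2)); the remaining relative error is of order 1/n, which is what repeating the process with g would remove.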

8. If there are no singularities, or if the closest singularity is essential, the saddle point method can give the asymptotic behaviour of the coefficients. Given an analytic function f on Ω ⊂ C, the surface z ↦ |f(z)| is not arbitrary: its points are either ordinary (f ≠ 0, f′ ≠ 0), zeroes (f = 0) or saddle points (f ≠ 0, f′ = 0) – in particular, the zeroes are the only extrema (this can be used to prove the fundamental theorem of algebra).

A trivial bound for the integral $\int_A^B f(z)\,dz$ is

$$\left| \int_A^B f(z)\,dz \right| \le \mathrm{length}(\Gamma) \times \sup_\Gamma |f|.$$

To have a tighter bound, one can choose a path Γ that goes through a saddle point. In particular (if G is analytic at zero, not a polynomial, with nonnegative coefficients and G(R) = +∞),

$$[z^n]G(z) = \frac{1}{2\pi i} \oint \frac{G(z)\,dz}{z^{n+1}} \le \frac{G(\zeta)}{\zeta^n}$$

where ζG′(ζ)/G(ζ) = n + 1 (the saddle point, ζ, and the circle of integration depend on n). The saddle point method is a refinement of this bound using the Laplace method: (under a few assumptions)

$$[z^n]G(z) \sim \frac{G(\zeta)}{\zeta^n \sqrt{2\pi b(\zeta)}}, \qquad b(z) = z^2 \frac{d^2}{dz^2} \log G(z) + z \frac{d}{dz} \log G(z), \qquad \zeta \frac{G'(\zeta)}{G(\zeta)} = n.$$
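For a concrete instance of the saddle point estimate, take G(z) = exp(z), whose coefficients are 1/n!. Then log G(z) = z, so b(z) = z and the saddle point equation ζG′(ζ)/G(ζ) = n gives ζ = n; the estimate becomes e^n / (n^n √(2πn)), i.e., Stirling's formula. The sketch below checks this numerically; the function name and the choice n = 50 are illustrative only.

```python
import math

def saddle_point_estimate_exp(n):
    """Saddle-point estimate of [z^n] exp(z), to compare with 1/n!."""
    zeta = n   # solves zeta * G'(zeta) / G(zeta) = n for G = exp
    b = zeta   # b(z) = z^2 (log G)'' + z (log G)' = z for G = exp
    # Work with logarithms to avoid overflow in zeta**n.
    log_estimate = zeta - n * math.log(zeta) - 0.5 * math.log(2 * math.pi * b)
    return math.exp(log_estimate)

n = 50
exact = 1 / math.factorial(n)
ratio = saddle_point_estimate_exp(n) / exact   # about 1 + 1/(12 n)
```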

9. The multivariate power series $F(x, u) = \sum f_{n,k} u^k x^n$ can be seen as a deformation of a univariate series F(x, 1). For n fixed, it gives the probability generating function of a discrete distribution

$$f_n^{-1} [x^n]F(x, u) = f_n^{-1} \sum_k f_{n,k} u^k$$

and can help study the limit law, the convergence speed (Berry-Esseen) and the tail probabilities.

[For discrete laws, if $p_n(u) = E[u^{X_n}]$ pointwise converges to $p(u) = E[u^X]$ for all u ∈ A ⊂ C, where A has an accumulation point in the open unit disk, then Xn converges to X in distribution. For continuous laws (µn, σn → ∞ – rescaling is needed), pointwise convergence of the characteristic function (on R) or of the Laplace function (on a neighbourhood of 0) implies convergence in distribution.]

Under some conditions (involving the Laplace transform, but reducing to easy-to-check conditions depending on the type of singularity), the distribution is asymptotically Gaussian.

Introduction to the theory of species of structures F. Bergeron et al. (2013)

A species of structures is a functor F from the category of finite sets and bijections set to itself.

To a species F, one can associate formal power series

$$F(x) = \sum_{n \ge 0} \#F[n]\,\frac{x^n}{n!}, \qquad \tilde F(x) = \sum_{n \ge 0} \#(F[n]/{\simeq})\,x^n, \qquad Z_F(x_1, \ldots) = \sum_{n \ge 0} \frac{1}{n!} \sum_{\sigma \in S_n} \mathrm{fix}\,F[\sigma]\; x_1^{\sigma_1} x_2^{\sigma_2} \cdots$$

where [n] = {1, 2, . . . , n}, #(F[n]/≃) = #(F[n]/Sn) is the number of isomorphism classes in F[n], σk is the number of cycles of length k in the cycle decomposition of σ ∈ Sn, fix F[σ] = #{u ∈ F[n] : σu = u} is the number of points of F[n] fixed by F[σ]. The exponential generating series F(x) counts (labelled) F-structures; the (isomorphism) type generating series $\tilde F(x)$ counts isomorphism classes of F-structures, i.e., unlabelled F-structures; the cycle index series ZF contains more information.

Bijections FU ≃ GU for all U (equipotence) do not necessarily define an isomorphism: for instance, the species of linear orders Lin and that of permutations Perm are equipotent, but not isomorphic (the bijections do not form a natural transformation); in particular, Lin(x) = Perm(x), but $\widetilde{\mathrm{Lin}}(x) \ne \widetilde{\mathrm{Perm}}(x)$.

The following operations on species correspond to more or less complex operations on their power series.

Sum             $(F + G)U = FU \sqcup GU$
Product         $(F \cdot G)U = \sum_{U = U_1 \sqcup U_2} FU_1 \times GU_2$
Substitution    $(F \circ G)U = \sum_{\pi \text{ partition of } U} F[\pi] \times \prod_{p \in \pi} G[p]$
Derivative      $F'U = F(U \sqcup \{*\})$
Pointing        $F^\bullet U = F[U] \times U$
Superposition   $(F \times G)U = FU \times GU$
Composition     $(F \mathbin{\square} G)U = F(GU)$

Here are a few examples.

A permutation can be decomposed into cycles: the cycles of length greater than 1 form a derangement, the cycles of length 1 form a set. This corresponds to a product of species, Perm = Der · U, where U : E ↦ {∗} is the species of sets, and gives $1/(1 - x) = \mathrm{Der}(x)\,e^x$.

$$\mathrm{Perm}(E) = \coprod_{E = E_1 \sqcup E_2} \mathrm{Der}(E_1) \times U(E_2)$$

Since a map is a surjection onto its image, we can relate maps and surjections by Set(I, ·) = Epi(I, ·) · U:

$$\mathrm{Set}(I, J) = \coprod_{J = J_1 \sqcup J_2} \mathrm{Epi}(I, J_1) \times U(J_2).$$

A map E → E is defined by (iterate the map until you get cycles, and look at the basin of attraction of each element in those cycles):
– A partition π of E;
– A rooted tree on each block of π;
– A permutation on the blocks of π.
This is a composition of species, End = Perm(A).

A partition of E is:
– A partition π of E;
– A set structure on each block of π;
– A set structure on π.
This gives Partitions = U(U − 1), and Partitions(x) = exp(e^x − 1) (Bell numbers).

A weighted species is a functor set → R-set, where R = k((t1, t2, . . . )) is a field of formal power series and R-set is the category of R-weighted finite sets. An R-weighted finite set is a set A with a map w : A → R. One associates power series to a weighted species as before, by replacing the size |A| with the inventory $|A|_w = \sum_{a \in A} w(a)$. The usual operations (+, ·, ◦, ′, •, ×, □) extend to weighted species. Weighted species can be used to compute, e.g., the number of rooted trees with a given number of nodes and a given number of leaves.

A k-sort species is a functor mset(k) → set, where mset(k) is the category of multisets with k sorts of elements (colours), i.e., of k-tuples of sets.

Qu’est-ce qu’une espèce de structures? Génèse et description F. Bergeron and G. Labelle (2011)

More elementary introduction to the theory of species of structures.

Statistical learning with sparsity: the lasso and its generalizations T. Hastie, R. Tibshirani, M. Wainwright (2015)

2. The lasso

Find β
To minimize $\frac{1}{2N} \|y - X\beta\|_2^2$
Such that $\|\beta\|_1 \le t$

can be put in Lagrangian form

$$\operatorname*{Argmin}_\beta\; \frac{1}{2N} \|y - X\beta\|_2^2 + \lambda \|\beta\|_1.$$

The 1/2N coefficient makes models of different sizes comparable (e.g., for cross-validation).

For a single predictor, the fit can be obtained by soft-thresholding the least squares estimate

$$S_\lambda(\beta) = \mathrm{sign}(\beta)\,\bigl(|\beta| - \lambda\bigr)_+.$$

Cyclical coordinate descent updates the coefficients one at a time, keeping the others fixed, by soft-thresholding.

The degrees of freedom of an estimator for an additive error model yi = f(xi) + εi, f unknown, εi ∼ N(0, σ²), is

$$\mathrm{df}(\hat y) = \frac{1}{\sigma^2} \sum_i \mathrm{Cov}(\hat y_i, y_i).$$

One could also consider ℓp penalties or constraints, with 0 ⩽ p < 1, but the problem would no longer be convex.

[Figure: four panels labelled L2, L1, L0.5, L0.]

3. Generalized linear models (logistic, survival, Poisson, etc.) are estimated by minimizing the negative log-likelihood; one can simply add an ℓ1 penalty to it. Thanks to the ℓ1 penalty, overspecified models are not a problem: there is no need to encode factors as “contrasts”.

The ℓ1 regularized SVM path presents jumps: to avoid them, replace the hinge loss with its square; it becomes very similar to logistic regression.

$$\operatorname*{Argmin}_\beta\; \frac{1}{N} \sum_i \bigl(1 - y_i f(x_i \mid \beta)\bigr)_+^2 + \lambda \|\beta\|_1.$$

4. The elastic net combines ℓ1 and ℓ2 penalties. An ℓ1 penalty alone does not work well when there are identical variables (the solution is not unique).

To have groups of variables enter the model or remain unselected in block, the group lasso adds their unsquared ℓ2 norm as a penalty. Larger groups are more likely to be selected: to avoid that, weigh the penalties with the group size, e.g., $\sqrt{p}$ or (better) $\|Z\|_F$ (Frobenius norm of the design matrix of those factors).

To enforce sparsity inside each group, the sparse group lasso adds an ℓ1 penalty in addition to the unsquared ℓ2 one.

Sometimes, those groups of variables may overlap (e.g., gene pathways). The overlap group lasso can also enforce a hierarchical structure, e.g., to allow interactions only if the main effects are already present.

Generalized additive models (GAM) are usually estimated by backfitting, $\hat f_j \leftarrow \mathrm{Smooth}\bigl(y - \sum_{k \ne j} \hat f_k\bigr)$; they solve the optimization problem

$$(\hat f_1, \ldots, \hat f_J) = \operatorname*{Argmin}_{f_1, \ldots, f_J}\; E\Bigl[Y - \sum_j f_j(X_j)\Bigr]^2.$$

Sparse GAM (SPAM) adds an unsquared ℓ2 penalty and uses sparse backfitting:

$$\operatorname*{Argmin}_{f_1, \ldots, f_J}\; E\Bigl[Y - \sum_j f_j(X_j)\Bigr]^2 + \lambda \sum_j \|f_j\|_2$$

$$\tilde f_j \leftarrow \mathrm{Smooth}\Bigl(y - \sum_{k \ne j} \hat f_k\Bigr), \qquad \hat f_j \leftarrow \Bigl(1 - \frac{\lambda}{\|\tilde f_j\|_2}\Bigr)_+ \tilde f_j.$$

The fused lasso approximates a time series with a piecewise constant function

$$\operatorname*{Argmin}_\theta\; \frac{1}{2} \sum_i (y_i - \theta_i)^2 + \lambda \sum_i |\theta_i - \theta_{i-1}|.$$

It can be generalized to images, non-equispaced observations, regression with ordered predictors (|βi − βi−1| – nearby predictors tend to have the same coefficient), piecewise affine or piecewise polynomial functions (trend filtering), nearly isotonic regression $\|(\Delta\theta)_-\|_1$. Coordinate descent does not work for the fused lasso. In the 1-dimensional case, one can start with the unpenalized estimate (λ = 0) and progressively fuse neighbouring intervals. One can also solve the dual (lifted) problem or use dynamic programming.
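The cyclical coordinate descent mentioned in the lasso summary (update one coefficient at a time by soft-thresholding, keeping the others fixed) can be sketched in plain Python. This is a minimal illustration for the Lagrangian form (1/2N)‖y − Xβ‖² + λ‖β‖₁, with no intercept, no convergence test and a fixed number of sweeps; the function names and the toy data are hypothetical.

```python
def soft_threshold(b, lam):
    """S_lam(b) = sign(b) * max(|b| - lam, 0)."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0

def lasso_cd(X, y, lam, n_sweeps=100):
    """Cyclical coordinate descent for (1/2N)||y - X beta||^2 + lam * ||beta||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    residual = list(y)  # y - X beta, with beta = 0 initially
    for _ in range(n_sweeps):
        for j in range(p):
            col = [X[i][j] for i in range(n)]
            scale = sum(c * c for c in col) / n
            # Covariance of column j with the partial residual (coordinate j removed).
            rho = sum(c * (r + c * beta[j]) for c, r in zip(col, residual)) / n
            new_bj = soft_threshold(rho, lam) / scale
            residual = [r - c * (new_bj - beta[j]) for c, r in zip(col, residual)]
            beta[j] = new_bj
    return beta

# Toy data with orthogonal columns: y = 2 * x1 + 0.1 * x2.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.0 * a + 0.1 * b for a, b in X]
beta = lasso_cd(X, y, lam=0.5)
```

With these orthogonal, unit-variance columns the least squares estimates are 2 and 0.1, so the lasso estimate is their soft-thresholded version: the first coefficient is shrunk by λ and the second is set exactly to zero.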

5. The nuclear norm (the sum of the singular values) is a generalization of the ℓ1 norm: $\|\mathrm{diag}\,x\|_* = \|x\|_1$. It is also a convex relaxation of the rank.

Descent methods are iterative optimization algorithms: choose a direction (usually not far from the gradient, i.e., ⟨∇f, dir⟩ < 0); make a small step in this direction; iterate. Common direction choices include

dir = −∇f (gradient descent)
dir = −D⁻¹∇f for some diagonal matrix D
dir = −(∇²f)⁻¹∇f (Newton).

The step size is important: estimate it with a 1-dimensional optimization, or with Armijo’s rule.

In the presence of constraints, the projected gradient method simply projects the solution back onto the feasible set after each step (gradient descent is similar, with β ∈ Rn instead of β ∈ C):

$$\beta_{n+1} = \operatorname*{Argmin}_{\beta \in C}\; f(\beta_n) + \langle \nabla f(\beta_n), \beta - \beta_n \rangle + \frac{1}{2s} \|\beta - \beta_n\|_2^2.$$

If the objective is a sum of convex functions f = g + h, one differentiable, one not, the generalized gradient update linearizes g but leaves h as is.

$$\beta_{n+1} = \operatorname*{Argmin}_\beta\; g(\beta_n) + \langle \nabla g(\beta_n), \beta - \beta_n \rangle + \frac{1}{2s} \|\beta - \beta_n\|_2^2 + h(\beta) = \mathrm{prox}_{sh}\bigl(\beta_n - s \nabla g(\beta_n)\bigr)$$

The proximal map generalizes the projection:

$$\mathrm{prox}_h z = \operatorname*{Argmin}_\theta\; \tfrac{1}{2} \|z - \theta\|_2^2 + h(\theta)$$

$\mathrm{prox}_{I_C} z$ = projection of z onto C
$\mathrm{prox}_{\lambda\|\cdot\|_1} z = S_\lambda(z)$, elementwise soft-thresholding
$\mathrm{prox}_{\|\cdot\|_*} Z$ = singular value soft-thresholding.

Convergence is faster if g is strongly convex. The accelerated gradient descent, e.g., conjugate gradient, or Nesterov momentum

$$\beta_{n+1} = \theta_n - s \nabla f(\theta_n), \qquad \theta_{n+1} = \beta_{n+1} + \frac{n}{n+1} (\beta_{n+1} - \beta_n)$$

is faster (but non-monotonic).

Coordinate descent (optimizing one coordinate at a time) works if the non-differentiable part of f is separable or, more generally, if non-negative directional derivatives along the axes imply that all the directional derivatives are non-negative (for a situation where this fails, take f on R2 with polygonal level curves: coordinate descent cannot escape (acute) corners that remain inside one of the quadrants).

For linear regression, if the variables are centered and standardized, the lasso and the elastic net estimates can be computed from the OLS estimates by shrinkage and soft-thresholding.

Coordinate descent is faster than the proximal gradient or Nesterov momentum.

Least angle regression (LAR) adds the variables one at a time but, contrary to stepwise regression, it only increases the coefficients until another variable becomes more promising. If the algorithm also removes a variable from the active set when its coefficient drops to zero, it computes the lasso regularization path.

For the alternating direction method of multipliers (ADMM),

Find β, θ
To minimize f(β) + g(θ)
Such that Aβ + Bθ = c

the augmented Lagrangian is (with g small)

$$L_\rho(\beta, \theta, \mu) = f(\beta) + g(\theta) + \langle \mu, A\beta + B\theta - c \rangle + \frac{\rho}{2} \|A\beta + B\theta - c\|_2^2.$$

It can be solved by iterating

$$\beta_{n+1} = \operatorname*{Argmin}_\beta L_\rho(\beta, \theta_n, \mu_n), \qquad \theta_{n+1} = \operatorname*{Argmin}_\theta L_\rho(\beta_{n+1}, \theta, \mu_n), \qquad \mu_{n+1} = \mu_n + \rho (A\beta_{n+1} + B\theta_{n+1} - c).$$

For the lasso

Find β, θ
To minimize $\frac{1}{2} \|y - X\beta\|_2^2 + \lambda \|\theta\|_1$
Such that β − θ = 0,

β is updated by a ridge regression, θ by soft thresholding and µ linearly.
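The three ADMM updates for the lasso (ridge-type step for β, soft-thresholding for θ, linear step for µ) can be sketched in the simplest possible setting. The sketch assumes X = I, so that the ridge step has a closed form and the exact answer, S_λ(y), is known; it uses the scaled multiplier u = µ/ρ, and the function names, ρ = 1 and the data are illustrative choices only.

```python
def soft_threshold(z, lam):
    """Elementwise soft-thresholding S_lam(z) for a scalar z."""
    return max(z - lam, 0.0) - max(-z - lam, 0.0)

def admm_lasso_identity(y, lam, rho=1.0, n_iter=200):
    """ADMM for 1/2 ||y - beta||^2 + lam ||theta||_1 subject to beta - theta = 0."""
    n = len(y)
    beta, theta, u = [0.0] * n, [0.0] * n, [0.0] * n  # u = mu / rho (scaled dual)
    for _ in range(n_iter):
        # beta-step: ridge-type quadratic problem, closed form because X = I.
        beta = [(yi + rho * (ti - ui)) / (1.0 + rho)
                for yi, ti, ui in zip(y, theta, u)]
        # theta-step: elementwise soft-thresholding.
        theta = [soft_threshold(bi + ui, lam / rho) for bi, ui in zip(beta, u)]
        # dual step: linear update of the (scaled) multiplier.
        u = [ui + bi - ti for ui, bi, ti in zip(u, beta, theta)]
    return theta

y = [3.0, -0.2, 0.7, -2.5]
theta = admm_lasso_identity(y, lam=1.0)
```

For a general design matrix, only the β-step changes: it becomes the solution of the linear system (X′X + ρI)β = X′y + ρ(θ − u).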

Minorization-maximization (MM) minimizes a (possibly non-convex) function f by finding a function Ψ : Rm × Rn → R such that

∀β, θ    f(β) ⩽ Ψ(β, θ)
∀β       f(β) = Ψ(β, β)

and iterating $\beta_{n+1} = \operatorname{Argmin}_\beta \Psi(\beta, \beta_n)$. The proximal gradient can be seen as a MM (a Lipschitz gradient gives Ψ); in a different context, so can the expectation maximization (EM) algorithm (Ψ comes from Jensen’s inequality).

A function f(·, ·) is biconvex if f(·, β) and f(α, ·) are convex (for instance, f(α, β) = (1 − αβ)²); a set

C ⊂ A × B ⊂ Rn is biconvex if the sections Cα,·, C·,β are convex. Alternate convex search (ACS) successively optimizes each block; the function values will converge, but the solution need not; if it does, it is only to a partial optimum. For instance, the maximal singular vectors and values of a matrix can be computed as (this is a generalization of the power method)

$$f(\alpha, \beta, s) = \|X - s\alpha\beta'\|_F^2, \qquad \alpha_{n+1} = \frac{X\beta_n}{\|X\beta_n\|_2}, \qquad \beta_{n+1} = \frac{X'\alpha_n}{\|X'\alpha_n\|_2}.$$
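The alternating updates for the leading singular vectors can be sketched in plain Python on a small matrix. This is an illustration only: the function name is hypothetical, each β update uses the freshly computed α (a common variant of the iteration), and a fixed number of iterations replaces a convergence test.

```python
import math

def top_singular_triplet(X, n_iter=200):
    """Alternate alpha <- X beta / ||X beta||, beta <- X' alpha / ||X' alpha||."""
    n, p = len(X), len(X[0])
    beta = [1.0 / math.sqrt(p)] * p   # arbitrary unit-norm starting point
    alpha = [0.0] * n
    s = 0.0
    for _ in range(n_iter):
        alpha = [sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
        norm = math.sqrt(sum(a * a for a in alpha))
        alpha = [a / norm for a in alpha]
        beta = [sum(X[i][j] * alpha[i] for i in range(n)) for j in range(p)]
        s = math.sqrt(sum(b * b for b in beta))  # converges to the top singular value
        beta = [b / s for b in beta]
    return s, alpha, beta

X = [[3.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # singular values 3 and 1
s, alpha, beta = top_singular_triplet(X)
```

The iteration converges geometrically, at a rate given by the squared ratio of the second to the first singular value, as with the ordinary power method applied to X′X.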

When there are millions of variables, one can use screening rules (dual polytope projection (DPP), sequential DPP, global strong rule, sequential strong rule) to reduce their number; the less conservative rules make mistakes, but they can be corrected by checking the KKT conditions and adding the variables that violate them.

6. It is possible to design statistical tests for the lasso:
– Bayesian (MCMC) simulations, after choosing a prior (for instance, the number of variables in the model is often assumed uniform over $\llbracket 0, k \rrbracket$, with $k \ll p$) [this is unrelated to the fact that the lasso is the posterior mode (MAP estimator) for a Laplace prior];
– Bootstrap (even slower, but it scales better);
– Covariance test: the change in $\operatorname{Cov}(\hat y, y)$ as a variable enters the model is asymptotically $\operatorname{Exp}(1)$; contrary to stepwise regression, adaptivity and shrinkage (asymptotically) compensate each other;
– Exact (finite sample) tests for adaptive models whose selection event can be written $Ay \leqslant b$ (this includes lasso and stepwise regression).
One can also debias the lasso estimate to compute confidence intervals.

7. The singular value decomposition (SVD) provides the best rank-$r$ approximation of a matrix $Z$:

$$U D_r V' = \operatorname*{Argmin}_{M \,:\, \operatorname{rank} M = r} \|Z - M\|_F$$

but missing values complicate the situation. The iterated SVD (fill in the missing values, compute the rank $r$ approximation from the SVD, iterate) tends to overfit. The soft-thresholding iterated SVD (fill in the missing values, compute $U S_\lambda(D) V'$, iterate; you can also let $\lambda$ decrease at each step) solves the problem

$$\operatorname*{Argmin}_{M}\; \tfrac12 \|Z - M\|_F^2 + \lambda \|M\|_*$$

where the Frobenius norm is taken over the nonmissing entries of $Z$. The biconvex problem

$$\operatorname*{Argmin}_{A,B}\; \|Z - AB'\|_F^2 + \lambda\bigl(\|A\|_F^2 + \|B\|_F^2\bigr)$$

gives a 2-dimensional family of matrix factorizations, indexed by $(r, \lambda)$ ($r$ is the number of columns of $A$ and $B$).

A vector-valued regression $Y = X\Theta + E$ can be estimated with the group lasso, which sets whole rows of $\Theta$ to zero, or with a nuclear norm penalty, to make it low rank.

$$\operatorname*{Argmin}_{\Theta}\; \|Y - X\Theta\|_F^2 + \lambda \|\Theta\|_*$$

These models (matrix completion, low rank regression) can be written $y_i = \operatorname{tr}(X_i'\Theta) + \varepsilon$ for various choices of the matrices $X_i$, and estimated as

$$\operatorname*{Argmin}_{\Theta}\; \frac1N \sum_i \bigl(y_i - \operatorname{tr}(X_i'\Theta)\bigr)^2 + \lambda \|\Theta\|_*.$$

The penalized SVD,

$$\operatorname*{Argmin}_{U,D,V}\; \|Z - UDV'\|_F^2 + \lambda_1 \|U\|_1 + \lambda_2 \|V\|_1$$

gives sparse singular vectors (the rank-1 problem is biconvex and can be solved by alternating soft-thresholding; for the rank-$r$ problem, start again with $Z - udv'$).

Additive matrix decompositions express a matrix $Z$ as a sum, e.g., of low-rank, sparse and small (noise) matrices:

$$\operatorname*{Argmin}_{L,S}\; \|Z - (L + S)\|_F^2 + \lambda_1 \|L\|_* + \lambda_2 \|S\|_1.$$

For row-wise sparsity (i.e., to model row-wise corruption), replace $\|S\|_1$ with $\sum_i \|S_{i\cdot}\|_2$.

Factor analysis models the data as $y_i = \mu + \Gamma u_i + \varepsilon_i$, $\varepsilon_i \sim N(0, S)$, which gives $\operatorname{Var} y = \Gamma\Gamma' + S$. If $S$ is scalar, principal component analysis (PCA) recovers $\Gamma$; if not, one can use the low rank + sparse + noise decomposition.

8. Most statistical procedures can be formulated as an optimization problem: adding an $\ell^1$ penalty (or both $\ell^1$ and $\ell^2$ penalties) sparsifies them. It may be necessary to reformulate the problem to get a convex, or biconvex, or less nasty problem. Different reformulations give different results.
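The soft-thresholding iterated SVD for matrices with missing entries (item 7) fits in a few lines. This is a rough sketch, assuming NumPy and a fixed λ; the function name `soft_impute`, the boolean `mask` argument (True where an entry is observed) and the iteration count are assumptions of this example.

```python
import numpy as np

def soft_impute(Z, mask, lam, n_iter=200):
    """Iterate: fill in missing entries, soft-threshold the singular values."""
    M = np.where(mask, Z, 0.0)                    # start with zeros in the holes
    for _ in range(n_iter):
        filled = np.where(mask, Z, M)             # observed data + current guess
        U, d, Vt = np.linalg.svd(filled, full_matrices=False)
        M = (U * np.maximum(d - lam, 0.0)) @ Vt   # U S_lam(D) V'
    return M
```

When every entry is observed, one pass already returns the soft-thresholded SVD of `Z`. With missing entries, the nuclear norm solution shrinks the matrix and need not coincide with the underlying low-rank matrix.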


8a. The first principal component is the direction in which the variance is maximal

$$\operatorname*{Argmax}_{v \,:\, \|v\|_2 = 1} \operatorname{Var} Xv = \operatorname*{Argmax}_{v \,:\, \|v\|_2 = 1} v'X'Xv;$$

it can be sparsified as

Find          $u, v$
To maximize   $u'Xv$
Such that     $\|u\|_2 = \|v\|_2 = 1$, $\|v\|_1 \leqslant t$

(solvable with SVD and thresholding) or

Find          $M \succcurlyeq 0$
To maximize   $\operatorname{tr}(X'XM)$
Such that     $\operatorname{tr} M = 1$, $\operatorname{tr}(|M| E) \leqslant t^2$.

The first principal component can also be defined as the direction minimizing the reconstruction error, and sparsified as

$$\operatorname*{Argmin}_{v, \theta \,:\, \|\theta\|_2 = 1}\; \frac1N \sum_i \|x_i - \theta v'x_i\|_2^2 + \lambda_1 \|v\|_1 + \lambda_2 \|v\|_2^2.$$

Auto-encoders also minimize the reconstruction error and are easy to sparsify.

For the next principal component, since orthogonality may conflict with sparsity, one may consider

Find          $u_k, v_k$
To maximize   $u_k'Xv_k$
Such that     $\|v_k\|_2 \leqslant 1$, $\|u_k\|_2 \leqslant 1$, $\|v_k\|_1 \leqslant c$, $\forall j \in \llbracket 1, k-1 \rrbracket \;\; u_k'u_j = 0$.

Classical PCA is consistent when $N \to \infty$, $p/N \to 0$, but not (at all) if $N, p \to \infty$, $p/N \to c > 0$. Thresholding the diagonal of the sample variance matrix and computing the PCA in this lower-dimensional space is actually consistent.

8b. Canonical correlation analysis (CCA) solves the problem

Find          $\beta, \theta$
To maximize   $\operatorname{Cov}(X\beta, Y\theta)$
Such that     $\operatorname{Var} X\beta = \operatorname{Var} Y\theta = 1$

e.g., via SVD or ALS; it can be sparsified by adding $\ell^1$ and $\ell^2$ constraints and solved by alternating soft-thresholding.

8c. Linear discriminant analysis (LDA) can be presented in three different ways, which can be sparsified as usual:
– Model the classes with Gaussian distributions, with different means but the same variance (the naive Bayes classifier is the special case when the variance matrix is diagonal);
– Fisher: find a low-dimensional projection in which the between-class variance is large wrt the within-class variance, i.e., $\operatorname*{Argmax}_\beta \dfrac{\beta'\Sigma_b\beta}{\beta'\Sigma_w\beta}$;
– Optimal scoring: LDA with two classes is a linear regression of the binary response; with more classes, it is still a linear regression, provided we assign a number to each class, in an optimal way.

8d. Hierarchical clustering does not work well in the presence of many uninformative variables: sparse hierarchical clustering assigns (sparse) weights to the variables and applies the usual clustering algorithms on the resulting weighted dissimilarity matrix.

Convex clustering looks for prototypes, close to the points, and close to one another (with an $\ell^1$ loss, so that many are actually equal),

$$\operatorname*{Argmin}_{u_1, \dots, u_n}\; \sum_i \|x_i - u_i\|_2^2 + \lambda \sum_{i<j} \|u_i - u_j\|_q, \qquad q \in \{1, 2\}.$$

K-means can be sparsified by maximizing the between-cluster sum of squares

Find          $C_1 \sqcup \cdots \sqcup C_K = \llbracket 1, N \rrbracket$, $w \in \mathbf{R}^p$
To maximize   $\displaystyle\sum_{j=1}^p w_j \Bigl( \frac1N \sum_{i,i'=1}^N d_{ii'j} - \sum_k \frac{1}{N_k} \sum_{i,i' \in C_k} d_{ii'j} \Bigr)$
Such that     $\|w\|_2 \leqslant 1$, $\|w\|_1 \leqslant s$, $w \geqslant 0$

and solved with an alternating algorithm (soft-thresholding and weighted k-means).

9. Undirected graphs can encode dependence properties between random variables (one per vertex) in the following equivalent ways (Hammersley-Clifford theorem):
– The joint probability distribution function factorizes over the cliques, $P(x_1, \dots, x_p) \propto \prod_{C \in \text{Cliques}} \psi_C(x_C)$;
– The random variable $X$ has the Markov property over the graph: for every cut set $S$ separating the graph into connected components $A$ and $B$, $X_A \perp\!\!\!\perp X_B \mid X_S$.

For instance, one can model how politicians vote as follows

$X_i$: vote of politician $i$
$\operatorname{sign} \theta_i$: whether $i$ is likely to vote yes
$\operatorname{sign} \theta_{ij}$: whether $i$ and $j$ tend to agree
$P \propto \exp\bigl( \sum_i \theta_i x_i + \sum_{i,j} \theta_{ij} x_i x_j \bigr)$

(the graph is complete, but only cliques up to size 2 are used).

A multivariate Gaussian model is a graphical model, with only vertex and edge factors; the adjacency matrix of the graph is the sparsity pattern of the concentration (or precision) matrix $\Theta = \Sigma^{-1}$. The Gaussian likelihood can be penalized (graphical lasso)

$$\operatorname*{Argmax}_{\Theta \succcurlyeq 0}\; \log\det\Theta - \operatorname{tr} \hat\Sigma\Theta - \lambda\rho_1(\Theta)$$

where $\rho_1$ is the $\ell^1$ norm of the off-diagonal elements and $\lambda = 2\sqrt{\log p / N}$. The problem is convex and can be solved efficiently by blockwise coordinate descent (fix everything except a row (and the corresponding column); each step is a lasso regression). If the solution has a block structure, it can be identified before the optimization ($|\hat\Sigma_{ij}| \leqslant \lambda$ whenever $i$ and $j$ are in different blocks) and the blocks can be solved independently.

Alternatively, one can estimate the neighbourhood of a random variable as a small set of variables that gives a good prediction, using the lasso (and the "and" or "or" rule: keep an edge $(s, t)$ if both/either $s \in N(t)$ and/or $t \in N(s)$). For discrete variables, use the logistic lasso.

i2; //. applies the rules until nothing changes, e.g., x+y //. {x->y,y->z};
– It is possible to specify the type and properties of the function arguments or substitution, e.g., f[x_Integer?EvenQ]:=x/2; __ matches a non-empty sequence, ___ a sequence;
– If can be written /;, e.g., f[n_]:=Sqrt[n] /; n>0.

ing point numbers (not uniformly over some interval of $\mathbf{R}$);
– Estimate the precision of each intermediate result, by comparing it with an arbitrary precision computation (progressively increase the precision until it is sufficient);
– Apply (more than a hundred) rewrite rules (e.g., associativity, but also $\sqrt{x+1} - \sqrt{x} \mapsto 1/(\sqrt{1+x} + \sqrt{x})$); simplify;


A thread-safe arbitrary precision computation package
D.H. Bailey (2015)

Implementation details for a multiprecision library, useful for ODEs (e.g., the Lorenz attractor loses 16 digits in each period) and experimental mathematics. Alternatives include: gmp, mpfr (and languages that use it, e.g., Julia), Pari/GP, mpmath (Python), Sage.
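The Newton iteration for square roots mentioned in this summary, $x \leftarrow x + \frac12(1 - x^2 a)x$, converges to $1/\sqrt{a}$ and only uses multiplications; multiplying by $a$ then gives $\sqrt{a}$. The sketch below illustrates it with Python's standard `decimal` module rather than with the package described here; the function name `sqrt_newton`, the ten guard digits and the fixed number of iterations are assumptions of this example.

```python
from decimal import Decimal, localcontext

def sqrt_newton(a, digits=50):
    """Square root of a > 0 to `digits` significant digits, via Newton on 1/sqrt(a)."""
    with localcontext() as ctx:
        ctx.prec = digits + 10               # a few guard digits
        a = Decimal(a)
        x = Decimal(float(a) ** -0.5)        # double-precision initial guess
        for _ in range(10):                  # correct digits roughly double each step
            x = x + (1 - x * x * a) * x / 2
        ctx.prec = digits
        return a * x                         # sqrt(a) = a / sqrt(a)
```

Starting from about 16 correct digits, a handful of iterations is enough for a few hundred digits, which illustrates the quadratic convergence.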

Use Newton iteration for square roots $x \leftarrow x + \frac12(1 - x^2 a)x$, $n$th roots, inverses $x \leftarrow x + (1 - xb)x$, logarithm $x \leftarrow x - (e^x - a)/e^x$, arcsine; Taylor for exponential (for $x$ small), sine (for $x$ small – use the double angle formula to bring it back to its initial value in $(-\pi/4, \pi/4]$). For multiplication, use the elementary school algorithm (in base $2^{48}$) or the FFT (the digits of $ab$, before accounting for carries, are the convolution of those of $a$ and $b$).

For much higher precision (thousands to millions of digits), algorithms based on the arithmetic-geometric mean (AGM), being quadratically convergent (the number of correct digits doubles at each iteration), become competitive.

Ten problems in experimental mathematics
D.H. Bailey et al. (2006)

Applications of the PSLQ algorithms to situations where the numeric estimation of the number of interest is non-trivial.

Using integer relation algorithms for finding relationships among functions
M. Chamberland (2007)

The PSLQ algorithm can also find integer relations between functions: evaluate them at some random point; apply PSLQ; try another point to make sure.

A=B
M. Petkovšek et al. (1997)

A geometric series is a series $\sum t_n$ in which the ratio of two consecutive terms $t_{n+1}/t_n$ is constant. A hypergeometric series is a series $\sum t_n$ in which the ratio of two consecutive terms $t_{n+1}/t_n$ is a rational function of $n$. If

$$\frac{t_{n+1}}{t_n} = \frac{(n + a_1) \cdots (n + a_p)}{(n + b_1) \cdots (n + b_q)} \, \frac{x}{n+1},$$

it is usually written

$${}_pF_q\!\left( {a_1 \dots a_p \atop b_1 \dots b_q} ; x \right) = {}_pF_q(a, b; x) = \sum_{n \geqslant 0} t_n.$$

There are lists of hypergeometric identities, but expecting to be able to evaluate arbitrary hypergeometric series from those lists is hopeless.

1. Given a double hypergeometric term $F(n, k)$, i.e., both $F(n+1, k)/F(n, k)$ and $F(n, k+1)/F(n, k)$ are rational functions, Sister Celine's algorithm looks for a recurrence relation

$$\sum_{i=0}^{I} \sum_{j=0}^{J} a_{ij}(n) F(n - j, k - i) = 0$$

as follows: divide the recurrence relation by $F(n, k)$; simplify; only rational functions remain: put them on the same denominator; the numerator is a polynomial in $k$ and its coefficients should be zero: solve the corresponding system; if it does not have any non-trivial solution, increase $I$ and $J$ and try again. By summing over $k$, we get a recurrence relation for $f(n) = \sum_k F(n, k)$; solve it with the Hyper algorithm. (The algorithm does not scale well: prefer the equivalent Zeilberger algorithm.)

2. Given a hypergeometric term $(t_k)_k$, i.e., $r_k = t_{k+1}/t_k$ is a rational function of $k$, Gosper's algorithm looks for another hypergeometric term $(z_k)_k$ such that $\Delta z = t$, i.e., $z_{n+1} - z_n = t_n$; this gives a simple formula for the partial sums $s_n = \sum_{k=0}^{n-1} t_k = z_n - z_0$.

Since it computes an antiderivative of a hypergeometric series, it can be seen as an analogue of the Risch algorithm (differential Galois theory) for hypergeometric series.

The details are technical: write

$$r_n = \frac{a_n}{b_n} \, \frac{c_{n+1}}{c_n}$$

with $a$, $b$, $c$ polynomials and $\forall k \;\; \gcd(a_n, b_{n+k}) = 1$; find a polynomial $x_n$, if it exists, such that $a_n x_{n+1} - b_{n-1} x_n = c_n$; let

$$z_n = \frac{b_{n-1} x_n}{c_n} \, t_n.$$

3. Zeilberger's algorithm (aka creative telescoping) is a (faster) variant of Sister Celine's algorithm, to find a telescoping recurrence relation

$$F(n+1, k) - F(n, k) = G(n, k+1) - G(n, k)$$

or, more generally,

$$\sum_{j=0}^{J} a_j(n) F(n + j, k) = G(n, k+1) - G(n, k).$$

It is guaranteed to exist if $F$ is a proper hypergeometric term:

$$F(n, k) = P(n, k) \, \frac{\prod_{i=1}^{U} (a_i n + b_i k + c_i)!}{\prod_{j=1}^{V} (u_j n + v_j k + w_j)!} \, x^k,$$

where $P$ is a polynomial, $U$, $V$, $a_i$, $b_i$, $u_j$, $v_j$ are integers ($c_i$ and $w_j$ are arbitrary).

Summing over $k$ gives a recurrence relation for $s_n = \sum_k F(n, k)$ (the rhs cancels out if $G$ has finite support in $k$ for all $n$).
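A telescoping relation of this kind can be checked with exact arithmetic on a small example. The pair below, $F(n,k) = \binom{n}{k}/2^n$ with certificate $G(n,k) = -\binom{n}{k-1}/2^{n+1}$, is a standard textbook illustration chosen for this sketch (it is not taken from the summary), and the helper names `F`, `G` and `telescopes` are hypothetical.

```python
from fractions import Fraction
from math import comb

def F(n, k):
    """F(n, k) = C(n, k) / 2^n, zero outside 0 <= k <= n."""
    return Fraction(comb(n, k), 2 ** n) if 0 <= k <= n else Fraction(0)

def G(n, k):
    """Candidate certificate G(n, k) = -C(n, k-1) / 2^(n+1), finite support in k."""
    return Fraction(-comb(n, k - 1), 2 ** (n + 1)) if 1 <= k <= n + 1 else Fraction(0)

def telescopes(n_max=12):
    """Check F(n+1,k) - F(n,k) == G(n,k+1) - G(n,k) on a grid of (n, k)."""
    return all(
        F(n + 1, k) - F(n, k) == G(n, k + 1) - G(n, k)
        for n in range(n_max)
        for k in range(-2, n_max + 3)
    )
```

Since `G` has finite support in `k`, summing the relation over `k` gives $s_{n+1} = s_n$; with $s_0 = 1$, this is the identity $\sum_k \binom{n}{k} = 2^n$.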


4. To prove hypergeometric identities, e.g., $\sum_k F(n, k) = 1$, the Wilf-Zeilberger (WZ) algorithm finds $G$ (with compact support in $k$ for all $n$) such that $F(n+1, k) - F(n, k) = G(n, k+1) - G(n, k)$, e.g., by applying Gosper's algorithm.

5. The Hyper algorithm finds the closed form solution, if it exists, of any recurrence with polynomial coefficients (holonomic sequence) by looking for a hypergeometric solution and reducing the problem to finding polynomial solutions of another recurrence (here, "closed form" means sum of hypergeometric series).

Note that the zeroes of the coefficients of the recurrence can make the dimension of the space of solutions larger than expected: checking that two solutions of a recurrence of order $k$ agree on just $k$ points may not be sufficient to prove they are equal.

There are implementations of those algorithms in Maple (EKHAD) and Mathematica (Gosper, Hyper, WZ).

Deep reinforcement learning with double Q-learning
H. van Hasselt et al. (2016)

Imprecise action values lead to overestimated values. Double Q-learning learns two sets of weights, $\theta$ and $\theta'$, randomly updating one of them at each step, and replaces the Q-learning target

$$Y_t^{Q} = R_{t+1} + \gamma \max_a Q(S_{t+1}, a; \theta_t) = R_{t+1} + \gamma\, Q\bigl(S_{t+1}, \operatorname*{Argmax}_a Q(S_{t+1}, a; \theta_t); \theta_t\bigr)$$

with

$$Y_t^{\text{Double}Q} = R_{t+1} + \gamma\, Q\bigl(S_{t+1}, \operatorname*{Argmax}_a Q(S_{t+1}, a; \theta_t); \theta'_t\bigr);$$

this reduces the problem.

Tensorizing neural networks
A. Novikov et al.

To reduce the memory footprint of large neural nets, one can use lower numeric precision, hashing, or low rank tensor approximations such as Tucker, CP or tensor train (TT). A TT decomposition of a tensor $A$ is a decomposition of each element as a product of matrices (the first and last factors are row and column matrices)

$$A(j_1, \dots, j_d) = G^1_{j_1} G^2_{j_2} \cdots G^d_{j_d}.$$

Correlation neural networks
S. Chandar et al.

Canonical correlation analysis (CCA; for a C++ implementation, check dlib) solves the following problem: given two (vector) random variables $X$, $Y$, find $x$ and $y$ to maximize $\operatorname{Cor}(x'X, y'Y)$; $x'X$ and $y'Y$ can be seen as a common representation of $X$ and $Y$. An auto-encoder can be used for the same purpose,

[Figure: auto-encoder with inputs $X$, $Y$, hidden layer $h(X, Y)$ and outputs $X$, $Y$]

minimizing the sum of
– Reconstruction error of $(X, Y)$ given $(X, Y)$;
– Reconstruction error of $(X, Y)$ given $(X, 0)$;
– Reconstruction error of $(X, Y)$ given $(0, Y)$;
– Minus $\operatorname{Cor}\bigl(h(X, 0), h(0, Y)\bigr)$.

[One could also use a noisy auto-encoder: during training, replace either $X$ or $Y$ with noise.]

Semi-supervised learning with ladder networks
A. Rasmus et al.

A ladder network is an auto-encoder with skip connections, to preserve details.

Scheduled sampling for sequence prediction with recurrent neural networks
S. Bengio et al.

Neural networks outputting a sequence usually predict the next token from the current (hidden) state and the previous token: there may be a discrepancy between training and inference, the former using the true token while the latter only has the previous predicted token. Instead, during training, one can randomly choose to use the true or predicted token, with the probability of the true token decreasing as training progresses.

Learning both weights and connections for efficient neural networks
S. Han et al.

For sparse neural nets: train; prune unimportant connections; iterate.

Minimum HGR correlation principle: from marginals to joint distribution
F. Farnia et al.

The HGR correlation is

$$\rho(X, Y) = \sup\bigl\{ E[f(X)g(Y)] \;:\; f, g \text{ measurable such that } E f(X) = E g(Y) = 0 \text{ and } E f(X)^2 = E g(Y)^2 = 1 \bigr\}.$$

It is between 0 and 1, and is 0 (resp. 1) iff $X$ and $Y$ are independent (strictly dependent). The distribution with the minimum HGR correlation for given first and second moments is the Gaussian; this is an analogue of the maximum entropy property of the Gaussian distribution.
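For two discrete variables with finitely many values, the supremum defining the HGR correlation can be computed explicitly: it is the second largest singular value of the matrix with entries $P(x, y)/\sqrt{P(x)P(y)}$. This characterization is a standard fact that the summary does not state; the sketch below assumes NumPy, strictly positive marginals, and a hypothetical function name `hgr_discrete`.

```python
import numpy as np

def hgr_discrete(P):
    """HGR (maximal) correlation of a joint probability table P[x, y]."""
    P = np.asarray(P, dtype=float)
    P = P / P.sum()                       # normalize to a probability table
    px = P.sum(axis=1)                    # marginal of X (assumed > 0)
    py = P.sum(axis=0)                    # marginal of Y (assumed > 0)
    B = P / np.sqrt(np.outer(px, py))
    s = np.linalg.svd(B, compute_uv=False)
    # The largest singular value is always 1 (constant functions);
    # the next one is the maximal correlation.
    return s[1] if len(s) > 1 else 0.0
```

An independent table gives a value numerically equal to 0, a diagonal table gives 1, and for two binary variables the value is the absolute Pearson correlation.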


Scalable Bayesian optimization using deep neural networks
J. Snoek et al. (2015)

Bayesian neural networks, i.e., distributions over the weights of a neural net (instead of point estimates) are computationally expensive. They can be approximated by adaptive basis regression, i.e., neural nets in which only the last, linear layer is Bayesian. Bayesian optimization with Gaussian processes scales cubically in the number of observations; adaptive basis regression is a linear alternative (random forests are another).

Gradient-based hyperparameter optimization through reversible learning
D. Maclaurin et al.

Backpropagation through the entire learning procedure, back to the hyperparameters, allows the use of gradient-based (instead of gradient-free Bayesian optimization) hyperparameter tuning. Storing all the intermediate results needed to compute the gradients is unreasonable (it would mean remembering all the iterations of the inner loop, on the mini-batches). It is however possible to run the optimization procedure backwards, provided we remember all the bits that fall off in floating point computations.

Topological data analysis of contagion maps for examining spreading processes on networks
D. Taylor et al.

The Watts threshold model (WTM) map is obtained by simulating the WTM (a node becomes infected if the proportion of infected neighbours exceeds some threshold) and using the infection times (for various (singleton) infection seeds) as coordinates; dimension reduction may be needed.

Compressing neural networks with the hashing trick
W. Chen et al.

To reduce the size of a neural net:
– Train a 1-layer net on the log-output of a deep net;
– Do not learn the full weight matrices but low rank approximations;
– Use feature hashing to reduce the weight matrices to a reasonable number of parameters.

Analyzing big data with dynamic quantum clustering
M. Weinstein et al.

The mean-shift clustering algorithm starts with a cloud of points, computes a density estimator, moves each point uphill, and iterates. Dynamic quantum clustering (DQC) is a quantum-flavoured variant:
– Start with a cloud of points $x_1, \dots, x_n$;
– Consider the density estimator $\phi(x) = \sum_i \phi_i(x)$, $\phi_i(x) = \exp\bigl(-\tfrac12 (x - x_i)'(x - x_i)/\sigma^2\bigr)$;
– Consider the potential $V$ defined by $H\phi = 0$, where the Hamiltonian operator is $H = -\dfrac{1}{2\sigma^2}\nabla^2 + V(x)$;
– Move the densities towards this potential, $\phi_i \mapsto e^{-i\delta H}\phi_i$;
– Compute the new position of the points;
– Iterate.

Input warping for Bayesian optimization of non-stationary functions
J. Snoek et al.

Gaussian processes (GP) usually model stationary processes: the covariance function $k(x, y)$ only depends on $x - y$. For non-stationary processes, one can use a hierarchical Bayesian model, including a transformation of the input space to make the process stationary. For instance, the cumulative distribution function of the beta distribution can model simple monotonic transformations: use a logarithmic prior on $\alpha$ and $\beta$ to encode the prior belief in an exponential, logarithmic, linear, sigmoid or logit transformation.

A sufficient and necessary condition for global optimization
D.H. Wu et al. (2010)

If $f$ is Lebesgue-integrable, then

$$\operatorname{essinf} f = \sup\{\alpha : V_{m,f}(\alpha) = 0\}, \qquad V_{m,f}(\alpha) = \int \bigl(\alpha - f(x)\bigr)_+^m \, d\mu, \qquad V_{m,f}^{(m)}(\alpha) = m!\, \mu[f(x) \leqslant \alpha].$$

Use Monte Carlo simulations to estimate those integrals and turn the formula into a global optimization algorithm.

Chemical equation balancing: an integer programming approach
S.K. Sen et al. (2006)

Balancing a chemical equation is a linear optimization problem:

Find          $x$
To minimize   $x'\mathbf{1}$
Such that     $Ax = 0$, $x \geqslant 1$, $x \in \mathbf{Z}^n$.

The quadratic problem

Find          $x$
To minimize   $x'x$
Such that     $Ax = 0$, $x \geqslant 1$, $x \in \mathbf{Z}^n$

may be easier to solve and often has the same solution.
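For small reactions, the integer program for balancing a chemical equation can be solved by plain enumeration, which makes the formulation concrete. The sketch below is only an illustration (a real integer programming solver would replace the brute-force loop); the function name `balance`, the bound `max_coeff` and the propane example are assumptions of this example.

```python
from itertools import product

def balance(A, max_coeff=10):
    """Smallest positive integer x (in sum) with A x = 0, found by enumeration."""
    n = len(A[0])
    best = None
    for x in product(range(1, max_coeff + 1), repeat=n):
        if all(sum(a * xi for a, xi in zip(row, x)) == 0 for row in A):
            if best is None or sum(x) < sum(best):
                best = x
    return best

# C3H8 + O2 -> CO2 + H2O: one row per element (C, H, O), one column per
# species, with products entered with a negative sign.
A_propane = [
    [3, 0, -1, 0],   # carbon
    [8, 0, 0, -2],   # hydrogen
    [0, 2, -2, -1],  # oxygen
]
```

Calling `balance(A_propane)` gives the coefficients of C3H8 + 5 O2 -> 3 CO2 + 4 H2O.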


A new approach to balancing chemical equations
I.C. Risteski

Balancing a chemical equation means finding positive vectors, $x$ and $y$, if they exist, so that $Ax = By$, where $A$ and $B$ have nonnegative entries. Try:

$$y = (I - G^+G)u, \qquad x = A^+By + (I - A^+A)v, \qquad G = (I - AA^+)B$$

for arbitrary vectors $u$ and $v$. (You need to compute a pseudo inverse in $\mathbf{Q}$, exactly, or in $\mathbf{Z}/n\mathbf{Z}$, with $n$ sufficiently large.)

Mastering the game of Go with deep neural networks and tree search
D. Silver et al. (2016)

Consider a Go board as a 19 × 19 image and feed it to two deep convolutional neural nets, to forecast the next move (supervised learning from human games), and to estimate the value of a position; combine with Monte Carlo tree search (MCTS).

Ball arithmetic
J. van der Hoeven (2011)

For reliable computing, consider replacing interval arithmetic with ball arithmetic (sometimes called midpoint-radius arithmetic):
– While interval arithmetic requires full precision for both ends of the intervals, ball arithmetic only needs full precision for the centers;
– On $\mathbf{C}$ or $\mathbf{R}^{n \times m}$, intervals tend to grow faster than balls (e.g., $z \mapsto z^2$ around $z = 1 + i$).
Mathemagix is a (C++) software environment for certified numeric computations.

Arb: a C library for ball arithmetic
F. Johansson

Accurate special functions.

Bitcoin and cryptocurrency technologies
A. Narayanan et al. (Princeton & Coursera, 2015)

Non-technical but clear and comprehensive exposition of cryptocurrencies and their applications.

Curve and surface reconstruction algorithms with mathematical analysis
T.K. Dey (2007)

The skeleton of a shape (curve, surface) $\Sigma \subset \mathbf{R}^k$ is the set of centers of balls that do not contain any point of $\Sigma$ and cannot be enlarged while keeping this property. The medial axis $M_\Sigma$ is its closure. It is also the closure of the set of points that have at least two closest points in $\Sigma$. [Those notions also appear in mathematical morphology, a branch of image analysis.] The local feature size of $x \in \Sigma$ is $f(x) = d(x, M_\Sigma)$. A sample $P \subset \Sigma$ is an $\varepsilon$-sample if $\forall x \in \Sigma \;\; \exists p \in P \;\; \|x - p\| \leqslant \varepsilon f(x)$; if $\varepsilon < 1$, the sample is dense.

Given an $\varepsilon$-sample $P$ of a closed plane curve $\Sigma$ whose local feature size is always positive, one can find a piecewise linear approximation of $\Sigma$, as a subset of the Delaunay triangulation, as follows:
– Compute the Voronoi tessellation of $P$ and the Voronoi vertices $V$; they approximate the medial axis;
– Compute the Delaunay triangulation of $P \cup V$; keep the edges connecting two points of $P$.

Alternatively, link each point $p \in P$ to its nearest neighbour $q$ and its half-neighbour $s$, the closest point $s$ such that $\angle spq \geqslant \pi/2$. Both algorithms are $O(n \log n)$ and work if $\varepsilon < 1/5$, resp. $\varepsilon < 1/3$.

One can reconstruct a compact surface (without boundary) $\Sigma \subset \mathbf{R}^3$ from a sample $P$ of points and its Voronoi tessellation:
– The normal at a point $p \in P$ can be approximated by the direction in which the Voronoi cell $V_p$ is elongated; more formally, if $V_p$ is unbounded, use the average direction $v_p$ of the unbounded edges; if it is bounded, take the point $p^+ \in P$ farthest from $p$, then the point $p^-$ farthest from $p$ among the points such that $\angle p^- p p^+ > \pi/2$ and set $v_p = p^- - p$;
– The surface in the neighbourhood of a point $p \in P$ can be approximated by the cocone: the complement of the double cone, centered on $p$, in the direction $v_p$, of angle $3\pi/8$, restricted to $V_p$;
– Take the Voronoi edges that intersect the cocones; their duals are triangles in the Delaunay triangulation (but there are still too many);
– Remove triangles adjacent to "sharp" edges (angle $< 3\pi/2$);
– "Pockets" (blisters) remain on the approximate surface: remove them by walking on the outside.

Topological anomaly detection performance with multispectral polarimetric imagery
M.G. Gartley and W. Basener (2009)

Outliers form the small connected components of the truncated graph.

Return to the Riemann integral
R.G. Bartle (1996)

A tagged partition of an interval $[a, b]$ is the datum of

$$a = x_0 < x_1 < \cdots < x_n = b$$


and, for each $i$, $t_i \in (x_{i-1}, x_i)$. The Riemann sum of a function $f : [a, b] \to \mathbf{R}$ on a tagged partition $((x_i)_i, (t_i)_i)$ is

$$R(f, (x_i)_i, (t_i)_i) = \sum_i f(t_i)(x_i - x_{i-1}).$$

A function $f$ is Riemann integrable, of integral $S$, if

$$\forall \varepsilon > 0 \;\; \exists \delta > 0 \;\; \forall (x_i)_i, (t_i)_i \quad (\forall i \;\; |x_i - x_{i-1}| < \delta) \implies |R(f, (x_i)_i, (t_i)_i) - S| < \varepsilon.$$

The generalized Riemann integral replaces $\delta$ with a function $\delta : [a, b] \to \mathbf{R}_{>0}$, evaluated in $t_i$. The following functions are integrable:
– Riemann integrable functions: $\mathscr{R} \subset \mathscr{R}^*$;
– $\mathbf{1}_{\mathbf{Q}}$;
– Lebesgue-integrable functions: $\mathscr{L} = \{ f \in \mathscr{R}^* : |f| \in \mathscr{R}^* \}$;
– The derivative of (continuous) functions differentiable except at a countable number of points.

There is no need to add improper integrals: such functions are already integrable. The convergence theorems from Lebesgue integration (uniform, monotone, dominated convergence) are still valid.

Regulated functions: Bourbaki's alternative to the Riemann integral
S.K. Berberian (1979)

Another alternative to (not a generalization of) the Riemann integral. Regulated functions are functions with one-sided limits at each point; equivalently, they are uniform limits of step functions. For every regulated function $f$, there is a function $F$, differentiable except perhaps at countably many points, with $F' = f$ (the converse is not true). Regulated functions are Riemann integrable.

A friendly introduction to RGP
O. Flasch (2014)

The rgp package provides genetic programming (symbolic regression) and typed genetic programming (idem, with higher-order functions); it can be used with emoa for multiobjective optimization and SPOT for hyperparameter tuning. Alternatives include ECJ (Java), GPTIPS (Matlab) and, on the commercial side, DataModeler, Discipulus and Eureqa.

Gradient boosting machines, a tutorial
A. Natekin and A. Knoll (2013)

Boosting builds an ensemble model by fitting simplistic models (base learners, weak learners) to the data, with more weight on the observations currently misclassified by the ensemble. Gradient boosting gives an interpretation of this algorithm as a gradient descent, not in parameter space, but in function space, i.e., we do not look at

$$\sum_i \frac{\partial \operatorname{Loss}\bigl(y_i, h(x_i, \theta)\bigr)}{\partial \theta} \qquad \text{but} \qquad \sum_i \left. \frac{\partial \operatorname{Loss}(y, \hat y)}{\partial \hat y} \right|_{y = y_i,\; \hat y = \hat f(x_i)}.$$

More precisely, we iterate:

$$(\rho_t, \theta_t) = \operatorname*{Argmin}_{\rho, \theta} \sum_i \left[ - \left. \frac{\partial \operatorname{Loss}(y, \hat y)}{\partial \hat y} \right|_{y = y_i,\; \hat y = \hat f_{t-1}(x_i)} - \rho\, h(x_i, \theta) \right]^2, \qquad f_t \leftarrow f_{t-1} + \rho_t h(\cdot, \theta_t).$$

The weak learners can be: linear regression, ridge regression, mixed model, splines, radial basis functions, GAM, tree stumps, trees with a maximum interaction depth (5 is good enough), wavelets (Viola-Jones face detection algorithm), etc. The loss function can be: $L^2$, $L^1$, Huber, quantile loss, binomial loss $\log(1 + e^{-y\hat y})$, Adaboost loss $e^{-y\hat y}$, etc. Generalizations include:
– If there are many variables, fit the base learners on a random subset of the variables, several times, and keep the best fit; this gives a sparser model;
– Subsampling (bagging);
– Shrinkage: $f_t \leftarrow f_{t-1} + \lambda\rho_t h(\cdot, \theta_t)$;
– Early stopping, which is needed; the optimal number of iterations depends on the step size $\lambda$ and can be estimated by cross-validation.

Laplacian eigenmaps for dimensionality reduction and data representation
M. Belkin and P. Niyogi (2002)

Given a cloud of points $x_1, \dots, x_N$, build a graph, with those points as vertices, and an edge when the distance between two points is below some threshold, or when one is among the $k$ nearest neighbours of the other; use 1 or $w_{ij} = \exp(-\|x_i - x_j\|^2 / t)$ as weights (similarities). Spectral dimension reduction solves the generalized eigenvalue problem $Lf = \lambda Df$, where $D = \operatorname{diag}(W\mathbf{1})$ and $L = D - W$,

$$Lf_i = \lambda_i D f_i, \qquad 0 = \lambda_0 \leqslant \lambda_1 \leqslant \cdots \leqslant \lambda_m,$$

and uses $f_1, \dots, f_m$ as coordinates.

In dimension 1 (Fiedler vector), this can be justified as follows: we want a 1-dimensional embedding $(y_1, \dots, y_N) \in \mathbf{R}^N$ so that points that should be close ($W_{ij}$ large) be close, e.g., by minimizing $\sum_{ij} (y_i - y_j)^2 W_{ij}$, with some normalization constraint, e.g., $\|y\|^2 = 1$ or $y'Dy = 1$ (in order to avoid the trivial solution $y = \mathbf{1}$, we also add $y \perp \mathbf{1}$, i.e., $y'D\mathbf{1} = 0$; this will give the smallest eigenvalue, skipping zero). Noticing that $\tfrac12 \sum_{ij} (y_i - y_j)^2 W_{ij} = y'Ly$, the optimization problem becomes

Find          $y \in \mathbf{R}^N$
To minimize   $y'Ly$
Such that     $y'Dy = 1$ and $y'D\mathbf{1} = 0$

and can be solved with Lagrange multipliers: with $f(y) = y'Ly$ and $g(y) = y'Dy$, $\nabla f = \lambda \nabla g$, i.e., $Ly = \lambda Dy$, a generalized eigenvalue problem.

This interpretation remains valid in a continuous setting, i.e., if we replace the cloud of points with a compact manifold $M$:

Find          $f : M \to \mathbf{R}$, $C^2$
To minimize   $\int_M \|\nabla f\|^2$
Such that     $\|f\|^2 := \int_M f^2 = 1$, $\langle f, 1 \rangle := \int_M f = 0$

where $\|\nabla f\|$ estimates how far apart $f$ maps nearby points. The objective can be written $\int \|\nabla f\|^2 = \int \langle \nabla f, \nabla f \rangle = \int \langle \nabla^*\nabla f, f \rangle = \int \mathscr{L} f \cdot f$, where $\mathscr{L} = \nabla^*\nabla = -\operatorname{div} \nabla$ is the Laplace-Beltrami operator (that is the name of the Laplace operator on a Riemannian manifold); it is positive semidefinite and (if $M$ is compact) its spectrum is discrete, $0 = \lambda_0 \leqslant \lambda_1 \leqslant \cdots$. The first eigenfunctions give a low-dimensional embedding.

To find two clusters in a weighted graph, one can try to find a partition $V = A \sqcup B$ that minimizes

$$\operatorname{Cut}(A, B) = \sum_{u \in A,\; v \in B} W_{uv},$$

but this tends to cut off weakly-connected outliers. Instead, one can minimize the normalized cut

$$\operatorname{NCut}(A, B) = \operatorname{Cut}(A, B) \left( \frac{1}{\operatorname{vol} A} + \frac{1}{\operatorname{vol} B} \right), \qquad \text{where } \operatorname{vol} A = \sum_{u \in A,\; v \in V} W_{uv}.$$

By setting

$$x_i = \begin{cases} 1/\operatorname{vol} A & \text{if } i \in A \\ -1/\operatorname{vol} B & \text{if } i \in B, \end{cases}$$

one can see that

$$\operatorname{NCut}(A, B) = \frac{x'Lx}{x'Dx}, \qquad x'D\mathbf{1} = 0;$$

the relaxed minimum normalized cut problem, aka spectral clustering, is a generalized eigenvalue problem.

The local linear embedding (LLE) algorithm builds an unweighted graph from a cloud of points; the barycentric coordinates $W_{ij}$ minimize $\sum_i \bigl\| x_i - \sum_j W_{ij} x_j \bigr\|^2$, and are normalized by $\forall i \;\; \sum_j W_{ij} = 1$; the embedding is given by the eigenvectors for the $k$ lowest eigenvalues of $E = (I - W)'(I - W)$. One can see that $E \approx \tfrac12 L^2$ (under some conditions); the eigenvectors are the same as those of $L$.

Constructing Laplace operator from point clouds in $\mathbf{R}^d$
M. Belkin et al.

One can approximate the Laplace operator $L$ on a submanifold $M \subset \mathbf{R}^d$ of (known) dimension $k$ from a finite sample of points $P \subset M$ as follows:
– Approximate the tangent space at $p \in P$, for instance by the $k$-plane $Q^*$ through $p$ best fitting $P_r = P \cap B(p, r)$, i.e., minimizing the 1-sided Hausdorff distance $d_H(P_r, Q^*) = \sup_{p \in P_r} \inf_{q \in Q^*} d(p, q)$; an approximation $\hat T_p$ of $Q^*$ is sufficient;
– Build the Delaunay triangulation $K_\delta$ of the projection $\pi(P_\delta)$ of $P_\delta$ on $\hat T_p$;
– For a function $f : P \to \mathbf{R}$, set $Lf(p)$ to (up to a normalizing constant depending on $t$ and $k$)

$$\sum_{\sigma \,:\, k\text{-dim face of } K_\delta} \frac{\operatorname{vol} \sigma}{k+1} \sum_{q \in V(\sigma)} \bigl[ f(p) - f(\pi^{-1} q) \bigr] \, e^{-\|p - \pi^{-1} q\|^2 / 4t}.$$

Contrary to other implementations (e.g., the adjacency matrix with Gaussian weights), it does not assume that the points are drawn uniformly from $M$ and does not require a global mesh (given a mesh, one can use the same formula, with the whole mesh instead of $K_\delta$, and the tangent space is not needed ($\pi = \mathrm{id}$); this is the mesh-Laplace operator).
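The generalized eigenvalue problem $Ly = \lambda Dy$ behind Laplacian eigenmaps and the relaxed normalized cut can be solved with a symmetric eigensolver after the change of variable $y = D^{-1/2} g$. The sketch below assumes NumPy, a complete graph with Gaussian weights $\exp(-\|x_i - x_j\|^2/t)$, and a hypothetical function name `laplacian_eigenmap`; it is a small dense illustration, not the point-cloud construction of the summary above.

```python
import numpy as np

def laplacian_eigenmap(X, t=1.0, dim=1):
    """Embedding coordinates f_1..f_dim solving L f = lambda D f."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq / t)                      # Gaussian similarities
    np.fill_diagonal(W, 0.0)
    d = W.sum(axis=1)                        # degrees, D = diag(W 1)
    L = np.diag(d) - W
    # L f = lambda D f  <=>  (D^-1/2 L D^-1/2) g = lambda g, with f = D^-1/2 g
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = d_inv_sqrt[:, None] * L * d_inv_sqrt[None, :]
    vals, vecs = np.linalg.eigh(L_sym)       # eigenvalues in ascending order
    F = d_inv_sqrt[:, None] * vecs
    return vals[1:dim + 1], F[:, 1:dim + 1]  # skip the trivial eigenvalue 0
```

With `dim=1`, the sign of the returned vector (the Fiedler vector) gives the two-cluster partition of the relaxed normalized cut.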


Convergence, stability and discrete approximation of Laplace spectra
T.K. Dey et al.

The mesh-Laplace operator gives a good approximation of the Laplace operator (pointwise convergence) and a good (accurate, robust) approximation of its spectrum. The spectrum of the Laplacian gives some information on the manifold, e.g., its volume (Weyl's law: $\lambda_n \sim 4\pi n / \operatorname{Vol} M$) or its total curvature.

Differential representation for mesh processing
O. Sorkine (2006)

The graph Laplacian of a mesh in $\mathbf{R}^3$ can be interpreted as the linear transformation from coordinates

to δ-coordinates (useful, e.g., for mesh editing) L : vi 7−→ δi = vi − =

∑ 1 vi |N (i)|

1 |N (i)|



j∈N (i)

vi − vj ,

j∈N (i)

i.e., L = I −D−1 A, symmetrized Lsym = DL = D −A, Lsym x = Dδx . There are many variants, e.g., 1 ∑1 δi = (cot αij + cot βij )(vi − vj ) |Ωi | j 2

Q-learning and SARSA: a comparison between two intelligent stochastic control approaches to financial trading M. Corazza and A. Sangalli Misguided attempt to apply reinforcement learning (RL) to trading, using the past returns as state: they use the past Sharpe ratios as rewards – by construction, this will give a momentum strategy. Universal value function approximators T. Schaul et al. (2015)

The value function approximator V (s; θ) learnt in reinforcement learning can be generalized to account 1 2 /2) + tan(θij /2) 1 ∑ tan(θij for various goals (e.g., a single desired final state), δi = (vi − vj ) |Ωi | j ∥vi − vj ∥ V (s, g; θ), and used for planning (do not try to get to the final goal from the start; try to reach intermediate where |Ωi | is the volume of the Voronoi cell of i, αij , milestones first). 1 2 βij are the angles opposite the edge (ij), θij , θij are Weighted Voronoi stippling the angles at i along (ij). A. Secord (2002) Reconstructing the coordinates from the δ-coordinates (with a few positional “constraints” on some of the ver- Stippling (non-photorealistic rendering) with weighted centroidal Voronoi diagrams, i.e., Voronoi diagrams in tices) can be approached as a least squares problem. which the generators are the centroids of their cell. If the mesh is known (e.g., a character in a video game – the mesh is just deformed as it moves), we can use Modern C the eigenfunctions of the Laplacian to encode the coJ. Gustedt (2015) ordinates (it is not really necessary to compute the eigenfunctions: the reconstruction can also be reduced Long presentation of C11. to a least squares problem). Distilling the knowledge in a neural network G. Hinton et al. To transfer the information in a complex model to a smaller model (model compression, model distillation), one often trains the smaller model to reproduce the output of the complex one on a very large (unlabeled) dataset. In the case of a multiclass classification problem, since there is information in the low probabilities, one can, instead of reproducing the output classes of the complex model, reproduce its probabilities or, rather, their logits. Even better, one can increase the temperature in the learning phase (to help learn smaller probabilities), and set it back to 1 afterwards.
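The effect of the temperature on the probabilities a distilled model would be trained to reproduce can be illustrated with a softmax. This is a minimal sketch: the function name `softmax`, the logits and the temperature value are made up for illustration and are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits / temperature (shifted by the max for numerical stability)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


teacher_logits = [8.0, 2.0, 0.5]       # hypothetical logits of the complex model
hard = softmax(teacher_logits, 1.0)    # temperature 1: almost one-hot
soft = softmax(teacher_logits, 5.0)    # higher temperature: small probabilities become visible

print(hard)
print(soft)
```

At temperature 1 the second and third classes have probabilities far below 1%, so they contribute almost nothing to a cross-entropy loss; at a higher temperature their relative sizes become part of the training signal.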

Abstract tensor systems and diagrammatic representations J. Lazovskis (2012) Formalization of Penrose’s diagrammatic tensor calculus, with a focus on Lie algebras.

Sampling for inference in probabilistic models with fast Bayesian quadrature T. Gunter et al. Bayesian quadrature is a faster variant of MCMC. For positive integrands (e.g., a likelihood), use a Gaussian process prior on its square root or its logarithm.

Probabilistic line searches for stochastic optimization M. Mahsereci and P. Hennig Stochastic gradient descent (SGD) has two problems:
– The gradient (even if it were not noisy) is not the best direction: use momentum, AdaGrad, etc.;
– One must choose the learning rate or resort to an expensive line search: instead of a line search, try Bayesian optimization.

A SQP method for general nonlinear programs using only equality constrained subproblems P. Spellucci The donlp2 algorithm solves a nonlinear program with equality and inequality constraints by taking a second order approximation of the objective, linearizing the constraints, replacing some inequalities with equalities and discarding the others. The subtleties reside in the choice of the inequalities to keep as equalities (those violated or near-active) or to discard, and how to deal with inconsistent quadratic programs.

Sequential Bayesian prediction in the presence of changepoints and faults R. Garnett (2009) List of covariance structures to detect change points with Gaussian processes (GP):


– Drastic changepoint: the observations before and after are independent – the result is discontinuous;
– Continuous drastic changepoint: the observations before and after are independent conditionally on the value at the changepoint;
– Change in input and/or output scales,

$$K(x,y) = \lambda_{\text{out}}^2\,\kappa\!\left(\frac{|y-x|}{\sigma_{\text{in}}}\right);$$

– Faults: bias, stuck value, drift, etc.

RRegrs: an R package for computer-aided model selection with multiple regression models G. Tsiliki et al. (2015) To estimate, compare and choose between 10 regression models: linear, lasso, elastic net, SVM, neural net, random forest, recursive feature elimination (fit the full model, rank the features, take the top k, re-fit the model, keep the best of those smaller models), etc., via caret.

Bellman’s GAP: a 2nd generation language and system for algebraic dynamic programming G. Sauthoff (2011) Textbook dynamic programming (DP) is straightforward, but real-life DP can involve more tables, more recurrence relations, more indices (for instance, sequence alignment with affine gap penalty requires several tables). Algebraic dynamic programming (ADP) simplifies DP by describing the search space (e.g., the set of all possible alignments) with a grammar – this removes all the indices. “Bellman’s GAP” is a Java-like programming language for ADP. Alternatives include Haskell- or Ocaml-based DSLs and/or code generators and may not scale well.

Categorification of persistent homology P. Bubenik and J.A. Scott (2014) Persistent homology can be expressed in categorical terms. An ε-interleaving between F, G : (R, ⩽) → D is a pair of natural transformations F ⇒ G Tε and G ⇒ F Tε, where Tε : (R, ⩽) → (R, ⩽) is the translation, such that, for a ⩽ b, the following diagram commutes.

[Diagram: two rows, F(a) → G(a + ε) → F(a + 2ε) → G(a + 3ε) and F(b) → G(b + ε) → F(b + 2ε) → G(b + 3ε), with the maps induced by F and G between them.]

This defines an interleaving distance d(F, G) = inf{ε : F, G ε-interleaved}. For any functor H, d(HF, HG) ⩽ d(F, G). In Vec^(R,⩽), tame diagrams (constant on a neighbourhood of each t ∈ R, except a finite number of them) are of finite type; they can be described as barcodes. The bottleneck distance on barcodes defines an isometric embedding {barcodes} ↪ Vec^(R,⩽). Given f, g : X → R (not necessarily continuous), consider their lower level sets F, G : (R, ⩽) → Top; let H : Top → D be a functor; then, d(HF, HG) ⩽ ∥f − g∥∞, i.e., H is stable to perturbations; this applies to persistent homology or relative homology. If D is abelian, the category of ε-interleavings of diagrams (R, ⩽) → D is also abelian.

Econometrics: methods and applications P.H. Franses et al. (Coursera, 2015)

Linear models assume that the predictive variables are not stochastic. In the model Y = X1β1 + X2β2 + ε, the predictive variable X1 is endogenous if Cor(X1, ε) ≠ 0: the ordinary least squares estimate β̂1 is then inconsistent. This can happen if a variable correlated with X1 has been omitted, if there is a feedback mechanism between the predictive variables (e.g., when predicting sales from price and demand, it may seem that higher prices are good for sales, but higher prices may just come from higher demand), or if there are measurement errors (we do not observe X1 but X1 + η).

To address endogeneity, 2-stage least squares (2SLS) looks for instrumental variables (IV) Z, i.e., variables Z that affect y only through X: Cor(Z, X) ≠ 0, Cor(Z, ε) = 0 and uses them to find β: Cov(Z, y) = Cov(Z, X)β. The computations can also be done using two linear regressions (X ∼ Z, then y ∼ X̂). The method is consistent if (1/n)Z′ε → 0 (Z and ε are uncorrelated), lim (1/n)Z′Z is invertible (Z is not multicolinear), and lim (1/n)Z′X has rank k (the number of variables – Z and X are sufficiently correlated). In particular, you need at least as many instruments as predictive variables (use the intercept and the exogenous variables as instruments: there will be fewer to find).

The Sargan test checks if H0 : Cor(Z, ε) = 0 by approximating the residuals using 2SLS: estimate X̂ = Z(Z′Z)⁻¹Z′X (1S), estimate b = (X̂′X̂)⁻¹X̂′y (2S), compute the residuals e = y − Xb (this is X, not X̂), compute the R² of e ∼ Z: under the null hypothesis, it is small, and nR² ∼ χ²(m − k) (where m is the number of instruments and k the number of variables in X).
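The instrumental-variable relation Cov(Z, y) = Cov(Z, X)β from the econometrics summary can be illustrated on simulated data with a single regressor and a single instrument. This is a sketch under stated assumptions: the data-generating process (a confounder `u` entering both the regressor and the error), the true coefficient and the sample size are all made up for illustration.

```python
import random

random.seed(0)
n = 20000
beta = 2.0                            # true coefficient (illustrative)

Z, X, y = [], [], []
for _ in range(n):
    z = random.gauss(0, 1)            # instrument: affects y only through x
    u = random.gauss(0, 1)            # unobserved confounder
    x = z + u                         # endogenous regressor (correlated with the error via u)
    eps = u + random.gauss(0, 1)      # error term, Cor(x, eps) != 0 but Cor(z, eps) = 0
    Z.append(z)
    X.append(x)
    y.append(beta * x + eps)


def cov(a, b):
    """Sample covariance of two equally long lists."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)


beta_ols = cov(X, y) / cov(X, X)      # inconsistent: converges to beta + Cov(x, eps)/Var(x) = 2.5
beta_iv = cov(Z, y) / cov(Z, X)       # IV estimate: converges to beta = 2

print(beta_ols, beta_iv)
```

With several regressors and instruments, the same idea is carried out with the two regressions X ∼ Z and y ∼ X̂ described above.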

The Hausman test for “H0 : X1 is actually exogenous” uses the R² of res(y ∼ X) ∼ X + res(X1 ∼ Z): it should be small and nR² ∼ χ²(k1). The course also covered variable selection (forward or backward, using T or F tests, AIC, BIC, L2 (RMSE) or L1 (MAE) norm of the residuals), pseudo-R² for logistic regression (McFadden, Nagelkerke), time series (ARMA, unit roots, cointegration, Granger causality) and many regression tests:
– Regression specification (Reset): add the powers of the fitted values (instead of the powers of x),
$$y_i = x_i'\beta + \sum_{j=1}^{p} \gamma_j\, \hat y_i^{\,j+1} + \varepsilon_i,$$
and perform an F test for H0 : ∀j γj = 0 (it is approximate because the ŷi are not fixed predictors);
– Chow break test: split the data in two, at a potential break point, and use an F test for H0 : βbefore = βafter;
– Chow forecast test: linear before the break, perfect fit after,
$$y_i = x_i'\beta + \sum_{j=n_1+1}^{n_1+n_2} \gamma_j\,\delta_{ji} + \varepsilon_i, \qquad H_0 : \forall j\ \gamma_j = 0;$$
– Jarque-Bera normality test (for the residuals): some combination of the sample skewness and kurtosis of a Gaussian variable is approximately χ²(2)-distributed.

Integer relation detection D.H. Bailey (2000) Given x1, . . . , xn (high-precision) floating-point numbers, the PSLQ algorithm looks for integers k1, . . . , kn, not all zero, so that ∑i ki xi = 0. Given a number α of interest, apply it to:
– 1, α, α², . . . , αⁿ to show that α is a root of a simple polynomial;
– α, ζ(1), ζ(2), . . . , log 2, log 3, . . . , π, π², . . . , Lik(ℓ), π^k ζ(ℓ), etc. to find interesting relations.
It was also used to find a base-16 expansion of π. For n numbers and integer coefficients with at most d digits, you need at least nd significant digits. The LLL algorithm has similar applications. It is available in a C++ library, in GAP, perhaps also from SAGE via mpmath and/or GMP.

The decompositional approach to matrix computation G.W. Stewart (2000) The “big six” matrix decompositions are:
– Cholesky: A = R′R (or A = LDL′), where A is positive definite and R upper-triangular; used to solve Ax = b or compute x′A⁻¹x;
– LU: P′AQ = LU, where A is square, L lower-triangular, U upper-triangular, P and Q permutations; used to solve Ax = b;
– QR: A = QR, where Q is orthogonal and R upper-triangular with non-negative diagonal elements; used in least-squares estimation: QQ′ is the projection on Im A;
– Spectral decomposition: A = VΛV′, where A is symmetric, V orthogonal and Λ diagonal;
– Schur: A = UTU∗, where A is square, U unitary, T upper-triangular; sometimes an intermediate step in eigenvalue problems;
– SVD: A = UΣV′, where U and V have orthogonal columns and Σ is diagonal.

Learning to discover efficient mathematical identities W. Zaremba et al. The set of equivalent reformulations of a mathematical expression (e.g., sum((A*B)^6), where A and B are matrices) forms a graph, with edges corresponding to valid transformations. Computer algebra systems (Maple, Mathematica) use a set of heuristic rules to explore this graph and return a simpler expression (either a shorter one, or one with a lower algorithmic complexity). One can also use random exploration. Instead, machine learning (e.g., n-grams, or RNN) can guide the exploration.

Using multi-complex variables for automatic computation of high-order derivatives G. Lantoine et al. (2012) The first derivative of an analytic function f : R → R can be approximated as

$$f'(x) = \frac{\operatorname{Im} f(x + ih)}{h} + o(h).$$

For higher derivatives, use multicomplex numbers,

$$\mathbf{C}_n = \frac{\mathbf{R}[i_1,\dots,i_n]}{(i_1^2+1,\ \dots,\ i_n^2+1)}, \qquad
f^{(n)}(x_0) \approx \frac{\operatorname{Im}_{1,2,\dots,n} f(x_0 + h i_1 + \cdots + h i_n)}{h^n}.$$

For higher derivatives, some have also suggested to use Cauchy’s formula,

$$\oint_C \frac{f(z)}{z - z_0}\,dz = 2\pi i\, f(z_0),$$

which generalizes to

$$f^{(n)}(z_0) = \frac{n!}{2\pi i}\oint_C \frac{f(z)}{(z - z_0)^{n+1}}\,dz.$$
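The complex-step approximation f′(x) ≈ Im f(x + ih)/h from the multi-complex summary is easy to try with ordinary complex numbers. The sketch below uses only the first-order case; the test function, the evaluation point and the step size are illustrative choices, and the helper name `complex_step_derivative` is hypothetical.

```python
import cmath
import math


def complex_step_derivative(f, x, h=1e-20):
    """Approximate f'(x) as Im f(x + ih) / h; no subtraction, so h can be tiny."""
    return f(complex(x, h)).imag / h


def f(z):
    # An analytic test function, written with cmath so it accepts complex arguments.
    return cmath.exp(z) * cmath.sin(z)


x0 = 0.7
approx = complex_step_derivative(f, x0)
exact = math.exp(x0) * (math.sin(x0) + math.cos(x0))  # derivative of exp(x) sin(x)

print(approx, exact)
```

Unlike a finite difference, there is no cancellation between two nearly equal function values, which is why a step as small as 1e-20 is usable here.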

A contextual-bandit approach to personalized news article recommendation L. Li et al. There are 3 classes of bandit algorithms:
– ε-greedy (with decaying ε);
– Upper confidence bounds (UCB);
– Bayesian approaches (Gittins index).
LinUCB generalizes UCB to contextual bandits: if the reward is linear, i.e., E[ra|xa] = x′a θa, θa unknown, a = arm, a confidence interval can be computed in closed form.

Frequentism and Bayesianism: a Python-driven primer J. VanderPlas (2014) Frequentist confidence intervals can be problematic: for instance, for the truncated exponential P(x|θ) = e^(θ−x) 1_{x>θ} and the observed data {10, 12, 15}, the 95% confidence interval is (10.2, 12.2), even though we know that θ < 10 – the corresponding Bayesian credible interval is (9.0, 10.0). The paper also details Bayesian linear regression, with emcee (affine invariant ensemble MCMC), PyMC (Metropolis-Hastings) and PyStan (Hamiltonian MCMC), and explains how to choose an uninformative prior: we want it to be invariant under the transformations (x, y) ↦ (y, x) and (x, y) ↦ (λx, λy) – this second transformation gives the Jeffreys prior for the variance, P(σ) ∝ 1/σ.

An object-oriented framework for robust multivariate analysis V. Todorov and P. Filzmoser (2009) The rrcov package provides robust estimators of scale and location:
– Minimum covariance determinant (MCD): mean and variance of the k observations whose covariance matrix has the smallest determinant (heuristically: start with k points, compute µ̂ and V̂, d²i = (xi − µ̂)′V̂⁻¹(xi − µ̂), take the k points with the smallest distance, iterate);
– Minimum volume ellipsoid (MVE) that contains at least half the data;
– Stahel-Donoho: weighted mean and covariance,
$$w_i = \operatorname{Min}\Bigl\{1, \Bigl(\frac{c}{r_i}\Bigr)^2\Bigr\}, \qquad
r_i = \operatorname*{Max}_a\, r(x_i, a) \ \text{(outlyingness)}, \qquad
r(x_i, a) = \frac{|x_i'a - m(a'X)|}{s(a'X)} \ \text{(outlyingness in direction } a\text{)},$$
with m, s robust 1-dimensional location and scale;
– Orthogonalized Gnanadesikan-Kettenring (OGK): given a robust 1-dimensional scale σ, define robust covariances
$$s_{ij} = \frac14\left[\sigma\!\left(\frac{X_i}{\sigma X_i} + \frac{X_j}{\sigma X_j}\right)^{2} - \sigma\!\left(\frac{X_i}{\sigma X_i} - \frac{X_j}{\sigma X_j}\right)^{2}\right]$$
and tweak the resulting covariance matrix to make it positive semidefinite;
– S-estimator: solution of the optimization problem
  Find µ, V
  To minimize σ(d1, . . . , dn)
  Such that det V = 1
where di = (xi − µ)′V⁻¹(xi − µ), and σ is an M-estimator of scale, i.e., (1/n)∑ ρ(z/σ) = δ, δ ∈ (0, 1), ρ : R+ → [0, 1] increasing from 0 to 1.

It also provides robust PCA:
– PCA using a robust variance matrix;
– Projection pursuit, i.e., finding the direction in which some quantity (a robust 1-dimensional estimator of scale) is maximal;
– Huber’s method: consider all the directions defined by pairs of points, and project the data on those directions; for each point and each of those directions, compute the normalized distance to the center, using 1-dimensional robust estimators of location and scale; use the k points with the lowest distance.

Robust principal component analysis? E.J. Candès (2009) If a matrix is the sum of a (dense) low-rank and a sparse matrix, M = L + S, the components can be recovered by solving the convex optimization problem
  Find L, S
  To minimize ∥L∥∗ + λ∥S∥1
  Such that L + S = M,
where ∥·∥∗ is the nuclear norm (sum of the singular values). Depending on the application, L or S may be of interest.

Introduction to numerical methods in differential equations M.H. Holmes (2007)

1. To solve an initial-value problem (IVP), i.e., an ordinary differential equation (ODE) of the form y′ = f(y, x), y(0) = α, one can discretize the derivative. There are many ways of doing that (forward Euler, backward Euler, centered, leapfrog, etc.) and the resulting numeric schemes have different stability properties.

A numeric scheme is A-stable if the numeric solution of y′ = −ry, y(0) = α remains bounded (the exact solution decays exponentially). This equation is chosen because, near a stable (constant) solution, all ODEs look like that. A scheme is conditionally A-stable if it is stable when the step size is sufficiently small. It is strictly A-stable if the solution converges to zero. It is monotone A-stable if the solution is monotonic. The forward (Euler) scheme is conditionally (monotone) A-stable and explicit; the backward scheme is (monotone) A-stable but implicit; the leapfrog method is unstable.

One can also integrate the equation, $y_{i+1} - y_i = \int_{t_i}^{t_{i+1}} f(y, t)\,dt$, and use numeric integration, e.g., left box, right box, midpoint, trapezoidal rule (implicit, A-stable, O(h²), conditionally monotone), Simpson’s rule, etc. The Adams method approximates f with a function that can be integrated exactly (for instance, a quadratic approximation gives Simpson’s rule).

The trapezoidal method is O(h²), but it is implicit.

$$y_{i+1} = y_i + \tfrac12 h\,(f_i + f_{i+1}), \qquad f_i = f(t_i, y_i), \qquad f_{i+1} = f(t_{i+1}, y_{i+1}).$$

It can be turned into an explicit method by replacing fi+1, which we do not know yet, by its Euler approximation: the second order Runge-Kutta (RK) method (Heun) is O(h²), explicit, conditionally A-stable.

$$y_{i+1} = y_i + \tfrac12 (k_1 + k_2), \qquad k_1 = h f(t_i, y_i), \qquad k_2 = h f(t_{i+1}, y_i + k_1).$$

The popular RK4 method uses a similar idea, with Simpson’s rule.
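The second-order Runge-Kutta (Heun) scheme above can be sketched in a few lines. The test problem y′ = −ry with r = 2, the step sizes and the function name `heun` are illustrative assumptions; halving the step should divide the error by roughly 4 for an O(h²) method.

```python
import math


def heun(f, t0, y0, h, steps):
    """Heun's method: trapezoidal rule with f_{i+1} replaced by its Euler prediction."""
    t, y = t0, y0
    for _ in range(steps):
        k1 = h * f(t, y)
        k2 = h * f(t + h, y + k1)
        y += 0.5 * (k1 + k2)
        t += h
    return y


r = 2.0


def decay(t, y):
    # Test equation y' = -r y, whose exact solution is exp(-r t).
    return -r * y


exact = math.exp(-r)                                   # y(1) for y(0) = 1
err_h = abs(heun(decay, 0.0, 1.0, 0.1, 10) - exact)    # step h = 0.1
err_h2 = abs(heun(decay, 0.0, 1.0, 0.05, 20) - exact)  # step h = 0.05

print(err_h, err_h2, err_h / err_h2)
```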

Newton’s equation, my″ = F(y), conserves energy, H(t) = ½my′² + V(y), where V′(y) = −F(y), but not all numeric schemes do. The trapezoidal method does, but it is implicit. The velocity Verlet method simply replaces it with its Euler approximation (it is preferred to RK4 in physics). It does not exactly conserve energy, but almost: it is symplectic.

2. A boundary value problem (BVP) is an ODE of the form y″ = f(x), y(0) = α, y(1) = β.

If the equation is linear, y″ + p(x)y′ + q(x)y = f(x), the centered approximations of y′ and y″ give a tridiagonal system (choose a step size h < 2/∥p∥∞). If the condition number is 10ⁿ (it is O(N²), where N is the number of steps), do not expect more than 15 − n significant digits.

The discretization of non-linear BVP gives a non-linear system of equations, which can be solved by Newton’s method (the Jacobian is tridiagonal); the solution need not be unique.

Residual methods look for solutions of the form y(x) = ∑ ak ϕk(x), for some basis functions ϕk (B-splines, Fourier) so that the residual be zero on a grid (collocation) or minimal in L² norm (least squares).

Shooting methods solve the IVP y(0) = α, y′(0) = s, and adjust s so that y(1) = β, hoping the problem is not ill-conditioned.

3. The heat equation

$$D\,\frac{\partial^2 u}{\partial x^2} = \frac{\partial u}{\partial t}, \qquad u(0,t) = u_{\text{left}}(t), \quad u(1,t) = u_{\text{right}}(t), \quad u(x,0) = g(x)$$

is a parabolic equation: it is a diffusion (it tends to smooth the initial condition – g need not be continuous, but u will be), it satisfies the maximum principle (the maximum and minimum of u are on the boundary) and has the instant messaging property (even if u(·, 0) is zero on some interval, it immediately becomes non-zero).

The explicit method (forward difference for time, centered for space), with stencil

[Stencil on the grid tj, tj+1 × xi−1, xi, xi+1]

does not satisfy the instant messaging property and has stability problems as t increases. (PDEs have ad hoc notions of stability, e.g., by looking at how an oscillatory initial condition evolves; the explicit method requires k = O(h²), where k is the time step and h the space step.) The implicit method (backward difference)

[Stencil on the grid tj, tj+1 × xi−1, xi, xi+1]

has the instant messaging property and is stable. Numerical simulations suggest that the exact solution often lies between the explicit and implicit ones. The theta method mixes a proportion θ of explicit and 1 − θ of implicit.

[Stencil on the grid tj, tj+1 × xi−1, xi, xi+1]

For θ = ½ (Crank-Nicolson – the most widely used method for parabolic equations), the error is O(h²) + O(k²) (instead of O(h²) + O(k)). The trapezoidal rule (for time) also gives the Crank-Nicolson method. It is not robust to the presence of jumps.

The method of lines discretizes only the space: this gives a family of coupled ODEs.

Residual methods, approximating the solution as u(x, t) = ∑ qk(t)Bk(x), can also be used.

4. The advection equation is

$$\frac{\partial u}{\partial t} + a\,\frac{\partial u}{\partial x} = 0, \qquad u(x,0) = g(x), \qquad a > 0.$$

It can be solved exactly with a change of variables: u(x, t) = g(x − at). Contrary to the heat equation, jumps in the initial condition g are preserved (strictly speaking, this is a weak solution). The upwind and downwind explicit schemes (forward difference in time, backward or forward in space)

[Stencils, upwind and downwind, on the time levels tj, tj+1]

have different domains of dependence.

[Figure: domains of dependence in the (x, t) plane – exact, upwind, downwind]

The downwind scheme will not work. For the upwind one, the exact domain of dependence is included in the numeric one (Courant-Friedrichs-Lewy (CFL) condition – this did not happen with the heat equation) if ak/h ⩽ 1. More generally, one could look at methods of the form ui,j+1 = Aui+1,j + Bui,j + Cui−1,j, compute a Taylor expansion of the error and set to zero as many terms as possible. Implicit methods tend to have an infinite domain of dependence, which is not desirable for advection equations.
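The role of the CFL condition ak/h ⩽ 1 for the upwind scheme can be seen on a step-shaped initial condition. This is only a sketch: the grid size, the initial profile and the helper name `upwind_step` are made up, and the left boundary value is simply held fixed.

```python
def upwind_step(u, c):
    """One step of the explicit upwind scheme for u_t + a u_x = 0 (a > 0).

    c = a*k/h is the Courant number; u[0] is kept as the inflow boundary value.
    """
    return [u[0]] + [u[i] - c * (u[i] - u[i - 1]) for i in range(1, len(u))]


u0 = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]  # a square pulse

# c = 1: the scheme reproduces the exact solution, a shift by one cell per step.
shifted = upwind_step(u0, 1.0)

# c = 0.5 (CFL satisfied): each new value is an average of old ones, so 0 <= u <= 1.
u = u0
for _ in range(10):
    u = upwind_step(u, 0.5)
stable_min, stable_max = min(u), max(u)

# c = 1.5 (CFL violated): the values leave [0, 1] after a single step.
unstable = upwind_step(u0, 1.5)

print(shifted)
print(stable_min, stable_max)
print(min(unstable), max(unstable))
```

For c < 1 the pulse is transported but smeared (numerical diffusion), while for c > 1 over- and undershoots appear immediately and grow with further steps.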


5. The wave equation is

$$c^2\,\frac{\partial^2 u}{\partial x^2} = \frac{\partial^2 u}{\partial t^2}, \qquad u(0,t) = u(1,t) = 0, \qquad u(x,0) = f(x), \qquad u_t(x,0) = g(x).$$

It can be solved by separation of variables (Fourier series) or by a change of variable. Jumps in the initial conditions are preserved.

[Figure: domain of dependence of the wave equation]

To study the stability of hyperbolic equations (or of their discretizations), look for solutions of the form u(x, t) = e^{i(kx−ωt)}. For the wave equation, this gives ω = ±ck. For other equations (Klein-Gordon, c²uxx = utt + bu; modified advection, ut + aux + bu = 0; beam, uxxxx + utt = 0; advection-diffusion, Duxx = ut + aux; etc.), it may be complex, u(x, t) = e^{at} e^{i(kx−bt)}. The equation is stable if Im ω ⩽ 0 (for all k), non-dispersive if Re ω ∝ k (the speed of the wave does not depend on the wavelength), non-dissipative if Im ω = 0 (for all k).

The explicit method (centered differences)

[Stencil on the grid tj−1, tj, tj+1 × xi−1, xi, xi+1]

satisfies the CFL condition if λ = ck/h ⩽ 1; it is non-dissipative for λ ⩽ 1 but dispersive for λ < 1 (problematic for short wavelengths).

6. The Laplace equation ∆u = 0, the prototypal elliptic PDE, is the steady state of the heat equation: it has similar smoothing properties. It can be solved with a finite difference approximation (centered differences: 5-point scheme).

[Stencil on the grid yj−1, yj, yj+1 × xi−1, xi, xi+1]

The resulting matrix has the following sparsity pattern, where T is tridiagonal and D diagonal.

$$\begin{pmatrix} T & D & & 0 \\ D & \ddots & \ddots & \\ & \ddots & \ddots & D \\ 0 & & D & T \end{pmatrix}$$

One can solve Ax = b, without inverting A, if A is symmetric positive definite, by gradient descent, minimizing F(x) = ½x′Ax − b′x:

$$\begin{aligned}
x_{k+1} &\leftarrow x_k + \alpha_k d_k, \qquad d_k:\ \text{descent direction} \\
r_k &= b - Ax_k = r_{k-1} - \alpha_{k-1} q_{k-1} \\
q_k &= A d_k \\
\alpha_k &= \operatorname*{Argmin}_\alpha F(x_k + \alpha d_k) = \frac{d_k\cdot r_k}{d_k\cdot q_k}.
\end{aligned}$$

Steepest descent, dk = −∇F(xk) = rk, may converge very slowly if the level surfaces of F are eccentric ellipsoids, i.e., cond A ≫ 1 (and here cond A ≈ 4N²/π² ≫ 1). Conjugate gradient addresses this problem by using dk+1 = rk+1 + βk dk, where β is such that ∀i ≠ j, ri ⊥ rj, i.e., βk = ∥rk+1∥²/∥rk∥²; the solution is reached in (at most) N steps.

To address the conditioning problem, find B invertible such that cond BA ≈ 1 and solve BAx = Bb. To apply the conjugate gradient method, we need a symmetric (positive definite) matrix: just rewrite the equation as (BAB′)B′⁻¹x = Bb. Popular choices for this preconditioner include M = B⁻¹B′⁻¹ = diag A, tridiag A and (D + L)D⁻¹(D + U) where A = D + L + U, D diagonal, L and U strictly lower and upper triangular.
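The conjugate gradient iteration described above can be sketched without any library. The example is unpreconditioned and uses a small 1-dimensional Laplacian, tridiag(−1, 2, −1), as an illustrative symmetric positive definite matrix; the helper names are hypothetical.

```python
def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(A, b, tol=1e-24):
    """Solve Ax = b for symmetric positive definite A, in at most len(b) steps."""
    n = len(b)
    x = [0.0] * n
    r = b[:]                 # r_0 = b - A x_0 with x_0 = 0
    d = r[:]                 # first direction: steepest descent
    rr = dot(r, r)
    for _ in range(n):
        if rr < tol:         # squared residual norm small enough
            break
        q = matvec(A, d)
        alpha = rr / dot(d, q)   # equals (d.r)/(d.q), since d.r = r.r for CG
        x = [xi + alpha * di for xi, di in zip(x, d)]
        r = [ri - alpha * qi for ri, qi in zip(r, q)]
        rr_new = dot(r, r)
        beta = rr_new / rr       # beta_k = |r_{k+1}|^2 / |r_k|^2
        d = [ri + beta * di for ri, di in zip(r, d)]
        rr = rr_new
    return x


N = 5
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(N)]
     for i in range(N)]
b = [1.0] * N

x = conjugate_gradient(A, b)
residual = max(abs(bi - yi) for bi, yi in zip(b, matvec(A, x)))
print(x, residual)
```

For this matrix the exact solution is x_i = i(N + 1 − i)/2, i.e., (2.5, 4, 4.5, 4, 2.5).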


Probabilistic numerics and uncertainty in computations P. Hennig et al. (2015) Many numeric (non-probabilistic) algorithms can be given a Bayesian flavour: Bayesian optimization is the best known, but not the only one.

In Bayesian quadrature, to estimate an integral, e.g., ∫₀¹ f, one models the integrand as a Gaussian process f ∼ GP: since integration is linear, (∫₀¹ f, f(t1), . . . , f(tn)) is Gaussian, and one can estimate the conditional distribution ∫₀¹ f | f(t1), . . . , f(tn). This gives an estimate of the integral and of its precision – but it depends a lot on (the smoothness of) the prior. One can also choose the point where to evaluate the function next to increase the precision the most. The prior (the covariance function of the Gaussian process) specifies the smoothness of the integrand (usually C∞) and perhaps other properties (e.g., the sign – but the distributions would no longer be Gaussian). Different priors lead to different quadrature rules.

$$f \sim \mathrm{GP}(m, C) \implies \int f \sim N\Bigl(\int m,\ \iint C\Bigr)$$

Bayesian quadrature can be generalized to Bayesian ODEs or PDEs.

Bayesian decision theory uses Bayesian quadrature to estimate and maximize expected utility. (Bayesian optimization uses Bayesian decision theory to choose the next point to evaluate.)

Bayesian linear algebra tries to solve Ax = b where A is too large to be inverted, using the value of some products As1 = y1, . . . , Asn = yn to gain some knowledge about A⁻¹ and A⁻¹b. The inverse A⁻¹ is not known but can be modeled as a Gaussian variable; (A⁻¹b, A⁻¹y1, . . . , A⁻¹yn) is then Gaussian, and so is the conditional distribution A⁻¹b | s1 = A⁻¹y1, . . . , sn = A⁻¹yn.

Sampling for inference in probabilistic models with fast Bayesian quadrature T. Gunter et al. Bayesian quadrature, to estimate expectations E[ℓ] = ∫ ℓ(x)π(x)dx, with a Gaussian process (GP) prior on the integrand ℓ, is problematic when ℓ is a likelihood: the prior does not enforce non-negativity, and struggles to model the high dynamic range of most likelihoods. A Gaussian prior on the log-likelihood is better, but the integral is far from Gaussian. A square root transform, ℓ(x) = αf(x)², f ∼ GP, is a good compromise; the integral is not Gaussian but can be approximated by a Gaussian, either by linearization or moment matching (warped sequential active Bayesian integration, WSABI).

Projection pursuit P.J. Huber (1985) Given a random variable X ∼ f on Rn, projection pursuit looks for a projection X ↦ AX on a low-dimensional subspace (dimension 1, 2 or 3) that maximizes (or minimizes) some projection index Q(AX). This projection index may be:
– Equivariant (mean or some other measure of location);
– Location-invariant and scale-equivariant (standard deviation or some other measure of dispersion – this gives PCA and robust PCA);
– Affine-invariant, e.g., normality test statistics (uninteresting projections tend to be more Gaussian than those with idiosyncrasies): |skewness|, excess kurtosis (or other standardized absolute cumulants – but they tend to be very outlier-sensitive), standardized Fisher information
$$Q(X) = \sigma^2(X)\int\left(\frac{f'}{f} - \frac{\phi'}{\phi}\right)^{2} f,$$
standardized negative Shannon entropy
$$Q(X) = -\int \log\left(\frac{\phi}{f}\right) f,$$
Kolmogorov-Smirnov, Durbin-Watson, etc. (this is now called independent component analysis, ICA).

The k-dimensional projection can be approximated in a greedy (stepwise) fashion (if you want a more precise result, use backfitting: once you have a k-dimensional projection, remove one vector, find its optimal replacement, and iterate until convergence).

Given two random variables X, Y in Rn, a discriminating hyperplane can be found by using the T-statistic as a projection index,

$$\frac{\operatorname{ave}[a'X] - \operatorname{ave}[a'Y]}{\operatorname{sd}[a'(X\cup Y)]} \qquad\text{or}\qquad \frac{\operatorname{med}[a'X] - \operatorname{med}[a'Y]}{\operatorname{mad}[a'(X\cup Y)]}.$$

The same idea gives a measure of outlyingness

$$r_i = \operatorname*{Max}_a \frac{|a'x_i - \operatorname{med}[a'X]|}{\operatorname{mad}[a'X]}$$

which can be used to define weights and robust (weighted) estimators of location and scale.

Regression is the estimation of f(x) = E[Y|X = x]. Projection pursuit regression (PPR) estimates it as a sum of ridge functions

$$f(x) \approx \sum_j g_j(a_j'x)$$

(note that the ridge functions are constant on hyperplanes). The ridge functions can be estimated greedily:
– Compute the residuals $r_i = y_i - \sum_{j=1}^{m-1} g_j(a_j'x_i)$;
– Smooth the scatterplot ri ∼ a′xi: ri = g(a′xi) + noise (this depends on a);
– Find the projection a that minimizes $\sum_i \bigl(r_i - g(a'x_i)\bigr)^2$;
– Iterate.
Not all functions can be represented exactly as a finite sum of ridge functions (but they can be approximated) – and, even if it is possible, the decomposition need not be unique.

Projection pursuit density approximation (PPDA) uses the same idea to approximate probability distribution functions, with a multiplicative decomposition (to ensure non-negativity)

$$f(x) \approx f_0(x)\prod_{j=1}^{k} h_j(a_j'x),$$

for some reference distribution f0. One can either look for

$$f_k(x) = f_0(x)\prod_{j=1}^{k} h_j(a_j'x) \xrightarrow[k\to\infty]{} f(x)$$

or, dually,

$$f_{-k}(x) = f(x)\prod_{j=1}^{k} h_j(a_j'x) \xrightarrow[k\to\infty]{} f_0(x)$$

(this removes the features of f, one by one, until we are only left with the reference (featureless) f0). The quality of the approximation can be measured with the relative entropy $E(f,g) = \int \log(f/g)\,f$, the Hellinger distance $H(f,g) = \int(\sqrt f - \sqrt g)^2$, the Prohorov distance

$$\pi(\mu,\nu) = \inf\bigl\{\varepsilon > 0 : \forall A\ \ \mu(A) \leqslant \nu(A^\varepsilon) + \varepsilon \ \text{and}\ \nu(A) \leqslant \mu(A^\varepsilon) + \varepsilon\bigr\},$$

the bounded Lipschitz metric

$$d(\mu,\nu) = \sup\Bigl\{\int f\,d\mu - \int f\,d\nu \;:\; f\in C^0,\ \sup|f| \leqslant 1,\ \sup_{x\neq y}\frac{|f(x)-f(y)|}{|x-y|} \leqslant 1\Bigr\},$$

etc. One can proceed iteratively: find the direction a maximizing E(fa, ga) (where fa and ga are the marginal distributions and g = f0), since replacing g with gfa/ga lowers E(f, g) by exactly E(fa, ga); replace f with fga/fa (resp. g with gfa/ga); iterate.

Projection pursuit density estimation (PPDE) applies the same idea to density estimation (from samples):
– Standardize, using robust location and dispersion estimates;
– Choose f0 with fat tails; fit it to the cloud of points;
– Proceed as above; if the marginal densities are too time-consuming to estimate, use sample-based estimates.
(Only use when there is not enough data or too many dimensions for a kernel estimator.)

Deconvolution tries to recover a signal x from a convolution y = f ∗ x with an unknown filter f. Minimum entropy deconvolution considers segments of length d, (yt, . . . , yt+d−1), as points of Rd, and finds the least Gaussian 1-dimensional projection q′ · yt:t+d−1: q is an estimate of f⁻¹.

If a time series Xt is a sum of periodic signals with independent periods, they can be recovered by finding p maximizing
– $Q(p) = \operatorname{ave}_t[Z_{p,t}^2]$, where $Z_{p,t} = \operatorname{ave}_k[X_{t+kp}]$;
– Or $|C(p)|^2 = \bigl|\operatorname{ave}_t[e^{2\pi i t/p} X_t]\bigr|^2$.

Reinforcement Learning D. Silver (2015)

1. Reinforcement learning (RL) is the search for the optimal policy for a Markov decision process (MDP) observed only through its history (past states, actions and rewards). A RL agent can model one or several of:
– The MDP itself;
– The value of each state;
– The policy.
Planning refers to the special case where the MDP is known.

2. A Markov reward process is a Markov process (a Markov chain) with rewards. The value of a state is

$$v(s) = E[R_{t+1} + \gamma v(S_{t+1}) \mid S_t = s],$$

i.e., v = R + γPv.

A Markov decision process (MDP) adds actions. We still have v = R + γPv, but v, R and P depend on the policy π. The value of a state, or of a state-action pair, for the optimal strategy (there exists an optimal deterministic strategy) satisfies the Bellman equation:

$$v_*(s) = \operatorname*{Max}_a q_*(s,a), \qquad q_*(s,a) = R_s^a + \gamma\sum_{s'} P^a_{ss'}\,v_*(s').$$

3. To find the optimal strategy for a known MDP (planning), policy iteration starts with an arbitrary policy, evaluates the values of the states for this policy, computes the greedy policy for those values, and iterates. Each step improves the policy and, in the limit, the Bellman equation is satisfied: it converges to the optimal policy.

Value iteration is a variant in which we do not compute the values exactly, but just iterate v ← R + γPv once. There is actually no need to keep track of the policy, it suffices to iterate on the values:

$$v(s) \leftarrow \operatorname*{Max}_a \Bigl[R_s^a + \gamma\sum_{s'} P^a_{ss'}\,v(s')\Bigr].$$

There are many variants:


– Update the states one at a time, in a random order;
– Only update the most promising states;
– Use a sample of the transition matrix;
– Focus on states close to the agent.
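The value iteration update of item 3 can be sketched as follows. This is a minimal illustration, not code from the course: the array layout (`P[a, s, t]`, `R[s, a]`), the stopping tolerance and the 2-state MDP are all assumptions made for the example.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-10, max_iter=10_000):
    """P[a, s, t]: probability of moving from s to t under action a.
    R[s, a]: expected reward for taking action a in state s."""
    v = np.zeros(P.shape[1])
    for _ in range(max_iter):
        # q[s, a] = R[s, a] + gamma * sum_t P[a, s, t] * v[t]
        q = R + gamma * np.einsum("ast,t->sa", P, v)
        v_new = q.max(axis=1)          # Bellman backup: max over actions
        converged = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if converged:
            break
    return v, q.argmax(axis=1)         # values and greedy policy

# Hypothetical MDP: action 0 stays in place, action 1 switches state.
P = np.array([
    [[1.0, 0.0], [0.0, 1.0]],   # action 0: stay
    [[0.0, 1.0], [1.0, 0.0]],   # action 1: switch
])
R = np.array([[0.0, 1.0],       # state 0: stay pays 0, switch pays 1
              [2.0, 0.0]])      # state 1: stay pays 2, switch pays 0
v, policy = value_iteration(P, R)
```

In this toy example the best plan is to switch to state 1 and stay there, so the values should approach 2/(1 − γ) = 20 for state 1 and 1 + γ·20 = 19 for state 0.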

4. To estimate the value function of a given policy on an unknown MDP (prediction), Monte Carlo RL uses the average return (the sum of the discounted rewards) from complete episodes (i.e., you have to let the MDP run until it reaches a final state to know the return). Temporal difference learning (TD) uses incomplete episodes:

MC: $v(s_t) \leftarrow v(s_t) + \alpha ( G_t - v(s_t) )$
TD: $v(s_t) \leftarrow v(s_t) + \alpha ( R_{t+1} + \gamma v(s_{t+1}) - v(s_t) )$.

MC is unbiased but has high variance; TD is biased but has lower variance. Instead of looking one step ahead, we can look $n$ steps ahead and replace $G_t$ with

$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n v(s_{t+n})$.

Forward-view TD(λ) combines them all:

$G_t^\lambda = (1-\lambda) \sum_{n \geqslant 1} \lambda^{n-1} G_t^{(n)}$ (λ-return)
$v(s_t) \leftarrow v(s_t) + \alpha ( G_t^\lambda - v(s_t) )$.

Backward-view TD(λ) only uses information from the past, by assigning credit for the reward to the most recent and frequent states, using eligibility traces.

$E_0(s) = 0$
$E_t(s) = \gamma\lambda E_{t-1}(s) + 1_{s_t = s}$
$\delta_t = R_{t+1} + \gamma v(s_{t+1}) - v(s_t)$ (1-step-ahead error)
$v(s) \leftarrow v(s) + \alpha \delta_t E_t(s)$

Backward- and forward-view TD(λ) are equivalent.

5. Control looks for the optimal value (or policy) of an unknown MDP. It can be on-policy (learn about the optimal policy while following it) or off-policy (learn about the optimal policy while following another one).

Since the model is unknown, the q-value function is more useful than the value function to compute the greedy policy:

$\pi(s) = \mathrm{Argmax}_a \, [ R_s^a + \gamma \sum_{s'} P_{ss'}^a v(s') ]$ ($P$ is unknown)
$\pi(s) = \mathrm{Argmax}_a \, q(s,a)$

MC-ε-greedy control uses MC to evaluate the q-value function and ε-greedy exploration (with probability 1 − ε, choose the greedy (optimal) policy; with probability ε, act at random; update the strategy at the end of each episode).

SARSA control uses SARSA (TD for the q-value function) and ε-greedy exploration

$q(s,a) \leftarrow q(s,a) + \alpha ( R + \gamma q(s',a') - q(s,a) )$.

As before, these can be generalized to forward-view SARSA(λ)

$q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n q(s_{t+n}, a_{t+n})$
$q_t^\lambda = (1-\lambda) \sum_{n \geqslant 1} \lambda^{n-1} q_t^{(n)}$
$q(s_t,a_t) \leftarrow q(s_t,a_t) + \alpha ( q_t^\lambda - q(s_t,a_t) )$

and backward-view SARSA(λ)

$E_0(s,a) = 0$
$E_t(s,a) = \gamma\lambda E_{t-1}(s,a) + 1_{s_t = s,\, a_t = a}$
$\delta_t = R_{t+1} + \gamma q(s_{t+1},a_{t+1}) - q(s_t,a_t)$
$q(s,a) \leftarrow q(s,a) + \alpha \delta_t E_t(s,a)$

The step sizes usually decrease, but not too quickly (Robbins-Monro sequence: $\sum \alpha_t = \infty$, $\sum \alpha_t^2 < \infty$).

Off-policy algorithms learn about a policy $\pi$ while following another policy $\mu$. Importance sampling MC

$G_t^{\pi/\mu} = \frac{\pi(a_t|s_t)}{\mu(a_t|s_t)} \frac{\pi(a_{t+1}|s_{t+1})}{\mu(a_{t+1}|s_{t+1})} \cdots \frac{\pi(a_T|s_T)}{\mu(a_T|s_T)} G_t$
$v(s_t) \leftarrow v(s_t) + \alpha ( G_t^{\pi/\mu} - v(s_t) )$

has a high variance. For importance sampling TD,

$v(s_t) \leftarrow v(s_t) + \alpha ( \frac{\pi(a_t|s_t)}{\mu(a_t|s_t)} ( R_{t+1} + \gamma v(s_{t+1}) ) - v(s_t) )$,

the variance remains reasonable.

Q-learning does not use importance sampling, but uses $\pi$ (not $\mu$) in the target:

$q(s_t,a_t) \leftarrow q(s_t,a_t) + \alpha ( R_{t+1} + \gamma q(s_{t+1},a') - q(s_t,a_t) )$

where $a'$ comes from $\pi$ and $a_{t+1}$ from $\mu$. In particular, when $\pi$ = greedy and $\mu$ = ε-greedy, this becomes SARSAMAX (or "Q-learning"):

$q(s,a) \leftarrow q(s,a) + \alpha ( R + \gamma \, \mathrm{Max}_{a'} \, q(s',a') - q(s,a) )$.
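The SARSAMAX (Q-learning) update at the end of item 5 can be illustrated on a toy problem. The sketch below is only an illustration under stated assumptions: the 5-state chain environment, the step size, the exploration rate and the helper `epsilon_greedy` are all made up for the example.

```python
import numpy as np

N_STATES, GOAL = 5, 4            # hypothetical chain 0..4, terminal state 4
ALPHA, GAMMA, EPS = 0.5, 0.9, 0.3

def step(s, a):
    """Deterministic chain: action 0 moves left, action 1 moves right.
    Entering the goal pays 1; every other transition pays 0."""
    t = max(s - 1, 0) if a == 0 else s + 1
    return t, (1.0 if t == GOAL else 0.0)

def epsilon_greedy(q_row, eps, rng):
    """Behaviour policy mu: random with probability eps, else greedy
    (ties broken at random)."""
    if rng.random() < eps:
        return int(rng.integers(len(q_row)))
    return int(rng.choice(np.flatnonzero(q_row == q_row.max())))

rng = np.random.default_rng(0)
q = np.zeros((N_STATES, 2))
for _ in range(500):
    s = 0
    while s != GOAL:
        a = epsilon_greedy(q[s], EPS, rng)
        t, r = step(s, a)
        # Target uses the greedy policy pi (the max), not the action mu will take.
        q[s, a] += ALPHA * (r + GAMMA * q[t].max() - q[s, a])
        s = t

greedy_policy = q[:GOAL].argmax(axis=1)
```

On this chain the greedy policy should end up moving right in every non-terminal state, with q(3, right) close to 1 and q(0, right) close to γ³.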


Algorithm                           Model     Policy    Aim      On-/off-policy
Policy iteration, value iteration   known     unknown   π∗
MC, TD(λ)                           unknown   known     v or q   on
MC + ε-greedy, TD(λ) + ε-greedy     unknown   unknown   q∗       on
IS MC or TD                         unknown   known     v or q   off
Q-learning                          unknown   unknown   q∗       off

6. If the number of states is large, we cannot use a table lookup to model the value function $v(s)$ or $q(s,a)$: one can summarize the state as a set of features and use them in a linear (or nonlinear) model. This model attempts to minimize

$J(w) = E_\pi [ ( v_\pi(s) - v(s,w) )^2 ]$,

e.g., with stochastic gradient descent (SGD),

$\Delta w = \alpha ( v_\pi(s) - v(s,w) ) \nabla_w v(s,w)$,

where, for a linear model $v(s,w) = x(s)'w$ with features $x(s)$, the gradient is $\nabla_w v(s,w) = x(s)$, i.e.,

$\Delta w$ = step size × prediction error × feature value.

Since $v_\pi(s)$ is unknown, we use a target instead: the return $G_t$ for MC, $R_{t+1} + \gamma v(s_{t+1},w)$ for TD(0), the forward or backward λ-return $G_t^\lambda$ for TD(λ).

Deep Q-networks (e.g., those that play Atari games) keep all the history and replay it (random minibatches); they compute the Q-learning targets with some old, fixed (slowly updated) estimate of the q-function.

7. Instead of the value function, one can directly learn a parametrized policy $\pi_\theta(s,a)$ = probability of taking action $a$ in state $s$. This allows stochastic policies. The policy gradient theorem links the gradient of the loss function to the score function of the policy, $\nabla \log \pi$:

$\nabla J(\theta) = E_{\pi_\theta} [ \nabla_\theta \log \pi_\theta(s,a) \, q^{\pi_\theta}(s,a) ]$.

For instance, for a softmax,

$\pi_\theta(s,a) \propto \exp( \phi(s,a)'\theta )$
$\nabla_\theta \log \pi_\theta = \phi(s,a) - E_{\pi_\theta}[\phi(s,\cdot)]$

and for a Gaussian policy

$a \sim N(\mu(s), \sigma^2)$, $\mu(s) = \phi(s)'\theta$
$\nabla_\theta \log \pi_\theta = \frac{(a - \mu(s)) \phi(s)}{\sigma^2}$.

The Monte Carlo policy gradient ("reinforce") algorithm estimates this expectation using one MC sample, i.e., $q^{\pi_\theta}(s,a)$ is replaced with the sampled return $G_t$.

The actor-critic method learns both the action-value function and the policy (it may be biased, but reduces the variance)

$q_w(s,a)$ critic
$\pi_\theta(s,a)$ actor
$\Delta w$ = as above
$\Delta\theta = \alpha \nabla_\theta \log \pi_\theta(s,a) \, q_w(s,a)$.

The advantage function critic replaces $q_w(s,a)$ with the advantage function

$A^{\pi_\theta}(s,a) = q^{\pi_\theta}(s,a) - v^{\pi_\theta}(s)$

(this does not change the expectation but reduces the variance). The TD error is an unbiased estimate of the advantage function

$\delta = r + \gamma V_v(s') - V_v(s)$

that only requires an estimate of $v$ instead of $q$. (As before, you can replace the TD(0) target $r + \gamma v(s')$ with the return (MC target), the λ-return (TD(λ) target) or use eligibility traces (backward-view TD(λ)).)

8. Model-based reinforcement learning learns the model (an MDP: the state and action spaces are known, the reward $(s,a) \mapsto r$ is a regression problem, the transitions $(s,a) \mapsto s'$ a density estimation problem, possibly with a Bayesian prior) and uses it to estimate the value function or policy.

Dyna learns the model from experience and the value function from both real experience and simulated experience.

Forward search focuses on the sub-MDP starting at the current state. In particular, Monte Carlo tree search (MCTS) applies MC control to simulated experience (at first, we do not know the $Q(s,a)$, so we act at random until the end; this gives an estimate of some of the $Q(s,a)$; iterate with more and more paths – this is how computers play go).

TD search applies SARSA to the sub-MDP starting at the current state.

Dyna2 uses two sets of weights, long-term (TD learning, from real experience) and short-term (TD search), and adds them.
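The softmax score function quoted in item 7, ∇ log π = ϕ(s, a) − E[ϕ(s, ·)], can be checked numerically. The sketch below compares it with a central finite difference; the feature matrix `phi`, the parameter vector `theta` and the chosen action are arbitrary values made up for the check, and a single state is assumed.

```python
import numpy as np

def softmax_policy(theta, phi):
    """phi[a] is the feature vector of action a in the (single) current state."""
    logits = phi @ theta
    p = np.exp(logits - logits.max())   # subtract the max for numerical stability
    return p / p.sum()

def score(theta, phi, a):
    """Analytic score: phi[a] minus the policy-weighted average of the features."""
    return phi[a] - softmax_policy(theta, phi) @ phi

rng = np.random.default_rng(0)
phi = rng.normal(size=(3, 4))    # 3 actions, 4 features (made-up values)
theta = rng.normal(size=4)
a, eps = 1, 1e-6

# Central finite differences of log pi(a) with respect to each parameter.
numeric = np.array([
    (np.log(softmax_policy(theta + eps * e, phi)[a])
     - np.log(softmax_policy(theta - eps * e, phi)[a])) / (2 * eps)
    for e in np.eye(4)
])
max_error = np.max(np.abs(numeric - score(theta, phi, a)))
```

If the formula is right, `max_error` should be tiny compared with the size of the score itself.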


9. A multi-arm bandit is a 1-state, 1-step MDP.

action value: $q(a) = E[R \mid A = a]$
optimal value: $v^* = \mathrm{Max}_a \, q(a)$ (unknown)
regret: $\ell_t = E[v^* - q(A_t)]$
total regret: $L_t = E[ \sum_{1 \leqslant \tau \leqslant t} ( v^* - q(A_\tau) ) ]$
gap: $\Delta_a = v^* - q(a)$ (unknown)

We want to play the arms (actions) with a low gap more often. For the greedy (no exploration) and ε-greedy strategies, the loss $L_t$ is linear in $t$. Optimistic initialization initializes $q(a)$ to a high value; it is still linear. Decaying ε-greedy is sublinear, but choosing the schedule requires knowledge of the gaps.

"Optimism in the face of uncertainty" uses an upper bound on $q(a)$: $q(a) \leqslant \hat q_t(a) + \hat u_t(a)$ with high probability. For instance, UCB1 uses Hoeffding's inequality (which assumes bounded rewards):

$a_t = \mathrm{Argmax}_a \, \hat q(a) + \sqrt{ \frac{2 \log t}{N_t(a)} }$.

Bayesian bandits put a prior on $q$ and use some quantile of the posterior, e.g., $\hat q(a) + c\sigma(a)$.

Probability matching tries to estimate

$P[ \forall a' \;\; q(a) > q(a') ]$

from the posterior. Thompson sampling crudely estimates this probability by sampling from $q(a)$ (once for each action) and choosing the best action.

Information state search extends the state space to include all the information so far: the bandit becomes an MDP (with an infinite state space). For instance, a Bernoulli bandit gives an infinite tree of Beta distributions, and the values can be computed exactly (Gittins index) or with MCTS.

A contextual bandit is a 1-step MDP (with several states).

For MDPs, UCB is trickier: try optimistic initialization or the information state approach (with MCTS).

10. Let us consider 2-player zero-sum, perfect information ("Markov") games. Nash equilibria (joint strategies) exist (provided you allow mixed strategies) but need not be unique. Minimax (depth-first) search

$\pi = \langle \pi_1, \pi_2 \rangle$
$v_\pi(s) = E_\pi[G_t \mid S_t = s]$
$v_*(s) = \mathrm{Max}_{\pi_1} \mathrm{Min}_{\pi_2} \, v_\pi(s)$

explores the tree of possible actions – usually truncated, with the leaf values estimated with some evaluation function.

Simple TD learns the value function with TD and uses it with minimax search. TD root uses the minimax value (of the root of the tree) as the target. TD leaf does not update the root but the leaf of the search tree (run a search from $s_t$, take a decision, end up in $s_{t+1}$, run a search from $s_{t+1}$, update the leaf of the first search towards the value of the second search). Treestrap updates all the values in the search (not just the root or the leaf), i.e., updates a shallow search from a deep one.

MCTS (simulate games until the end, from the current state, and apply RL to those self-play games) works well. UCB can be generalized (UCT).

Phase plots of complex functions: a journey in illustration
E. Wegert and G. Semmler (2011)

To plot a function $f$ of a complex variable, one often considers the surface $z \mapsto |f(z)|$ (or $\log |f(z)|$), sometimes with a colour for the phase $f(z)/|f(z)| \in S^1$. Phase plots only plot the phase, as a colour in the plane (one can also change the brightness, using a sawtooth function of $\log |f(z)|$ or $f(z)/|f(z)|$ or a product of them). Most of the information is still visible, sometimes more clearly than with the surface:
– Zeroes and poles are clearly visible: the colour changes on a simple loop give the difference between the number of zeroes and poles (counted with multiplicity) inside the loop;
– The zeroes of $f'$ (that are not zeroes of $f$) are "colour saddle points", i.e., intersections of isochromatic lines;
– Essential singularities are obvious;
– Some results have a striking illustration, e.g., the zeroes of the partial sums of a converging power series with finite convergence radius $R$ cluster at every point on $[|z| = R]$.
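The phase-plot summary (Wegert and Semmler) says that the colour changes along a simple loop count zeroes minus poles inside it. That is the argument principle, and it can be sketched numerically by following the phase around a circle. The function `f`, the circle and the sample count below are assumptions chosen for the illustration.

```python
import numpy as np

def winding_number(f, center=0.0, radius=1.0, n=4096):
    """Number of turns made by the phase of f along a circle,
    i.e. (#zeroes - #poles) inside it, counted with multiplicity."""
    t = np.linspace(0.0, 2.0 * np.pi, n + 1)
    z = center + radius * np.exp(1j * t)
    # np.unwrap removes the artificial 2*pi jumps of np.angle.
    phase = np.unwrap(np.angle(f(z)))
    return int(round((phase[-1] - phase[0]) / (2.0 * np.pi)))

# Made-up example: double zero at 0, simple zero at 3, simple pole at 0.5.
def f(z):
    return z**2 * (z - 3) / (z - 0.5)

inside_unit_circle = winding_number(f)               # 2 zeroes - 1 pole
inside_big_circle = winding_number(f, radius=4.0)    # 3 zeroes - 1 pole
```

The sampling must be fine enough that the phase moves by less than π between consecutive points; otherwise the unwrapping step miscounts.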


Introducing elliptic, an R package for elliptic and modular functions
R.K.S. Hankin

The elliptic R package provides tools to deal with (and plot) complex functions, in particular elliptic functions (doubly periodic meromorphic functions) such as Weierstrass's $\wp$ and related ($\zeta' = -\wp$, $\sigma'/\sigma = \zeta$) functions.

Topology and data
G. Carlsson (2009)

1. Topological data analysis estimates the Betti numbers $\beta_k(X,F) = \dim H_k(X,F)$ of a topological (or metric) space $X$ (often, a subspace of $R^n$), over a field $F$ (often $F_2$), from a finite number of points sampled from $X$.

An abstract simplicial complex is a pair $(V,\Sigma)$, where $V$ is a finite set and $\Sigma \subset P(V)$ such that $\sigma \in \Sigma, \tau \subset \sigma \implies \tau \in \Sigma$. Its topological realization is

$\bigcup_{\sigma \in \Sigma} \mathrm{chull}(\{ e_i : i \in \sigma \})$

where $(e_i)_{i \in V}$ is the canonical basis of $R^V$. The homology of a simplicial complex $X = (V,\Sigma)$ is defined as

$\Sigma_k = \{ \sigma \in \Sigma : \#\sigma = k+1 \}$ ($k$-simplices)
$C_k = Z^{\Sigma_k}$ ($k$-chains)
$(V, \leqslant)$ total order
$d_i(\sigma) = \sigma \setminus$ $i$th element of $\sigma$
$d_i : \Sigma_k \to \Sigma_{k-1}$
$\partial_k = \sum_i (-1)^i d_i : C_k \to C_{k-1}$
$\partial_k \circ \partial_{k+1} = 0$, therefore $\mathrm{Im}\, \partial_{k+1} \subset \ker \partial_k$
$H_k(X,Z) = \ker \partial_k / \mathrm{Im}\, \partial_{k+1}$.

Let $U = (U_\alpha)_{\alpha \in A}$ be an open covering of a topological space $X$. The nerve $NU$ is the abstract simplicial complex $(A,\Sigma)$ where $\sigma = \{\alpha_0, \ldots, \alpha_k\} \in \Sigma$ iff $\bigcap_{\alpha \in \sigma} U_\alpha \neq \emptyset$. If all intersections of elements of $U$ are empty or contractible, it is homotopy equivalent to $X$.

The Čech complex $\check C(V,\varepsilon)$ is the nerve of the covering of $X$ with balls of radius $\varepsilon$ with centers in a finite subset $V \subset X$. If $X$ is compact, Riemannian, $\varepsilon$ sufficiently small and $V$ well chosen, then $\check C(V,\varepsilon)$ is homotopy equivalent to $X$.

The Vietoris-Rips complex $VR(V,\varepsilon)$ is $(V,\Sigma)$, with $\sigma = \{x_0, \ldots, x_k\} \in \Sigma$ iff $\forall i,j \;\; d(x_i,x_j) \leqslant \varepsilon$, i.e., all $(x_i,x_j)$ span a Čech 1-simplex. (Its 1-skeleton is the "truncated graph" from the distance matrix.)

The Delaunay complex is the nerve of the covering by Voronoi cells

$V_\lambda = \{ x \in X : x \text{ closer to } \lambda \text{ than to any other } \lambda' \in V \}$.

The strong witness complex of $X$, wrt a finite set of points (landmarks) $L \subset X$, is a fattening of the Delaunay complex, $W^s(X,L,\varepsilon) = (L,\Sigma)$ with $\{\ell_0, \ldots, \ell_k\} \in \Sigma$ iff

$\exists x \in X : \forall i \;\; d(x,\ell_i) \leqslant d(x,L) + \varepsilon$.

The weak witness complex $W^w(X,L,\varepsilon)$ is $(L,\Sigma)$ with $\{\ell_0, \ldots, \ell_k\} \in \Sigma$ iff

$\forall \Lambda \subset \{\ell_0, \ldots, \ell_k\} \;\; \exists x \in X \;\; \forall \ell \in L \setminus \Lambda \;\; \forall \ell_i \in \Lambda \;\; d(x,\ell) + \varepsilon \geqslant d(x,\ell_i)$.

There are Vietoris-Rips variants $W^s_{VR}$ and $W^w_{VR}$.

All those complexes are functorial in $\varepsilon$: the inequality $\varepsilon \leqslant \varepsilon'$ induces an inclusion $\check C(V,\varepsilon) \hookrightarrow \check C(V,\varepsilon')$, which in turn induces a morphism of homology groups $H_k \check C(V,\varepsilon) \to H_k \check C(V,\varepsilon')$.

The R-persistence homology is the functor

$(R_{>0}, \leqslant) \to \mathrm{Mod}_Z$
$\varepsilon \mapsto H_k \check C(V,\varepsilon)$.

R-persistence Z-modules are complicated objects, but there is a classification theorem for finitely-generated N-persistence $F$-vector spaces,

$(V_n)_{n \in N} \cong \bigoplus_{i=0}^{N} U(m_i,n_i)$,

where $U(m,n)_t = F$ if $m \leqslant t \leqslant n$ and $0$ otherwise, and $U(m,n)_s \to U(m,n)_t$ is the identity for $m \leqslant s \leqslant t \leqslant n$. Note that two persistence vector spaces need not be isomorphic even though all the dimensions are the same.

The N-persistence $F$-homology of a finite set $X$ (choose a subset $\{\varepsilon_n, n \in N\} \subset R$) can be represented as a barcode.
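The definition of simplicial homology in item 1 translates directly into linear algebra. The sketch below builds the boundary matrices ∂_k and reads the Betti numbers off their ranks. It is only an illustration under one loudly stated assumption: ranks are computed over the reals (floating point) rather than over Z or F_2, so torsion is ignored. The function names and the two example complexes are made up.

```python
import numpy as np

def boundary_matrix(k_simplices, km1_simplices):
    """Matrix of d_k: column j holds the signed faces of the j-th k-simplex."""
    index = {s: i for i, s in enumerate(km1_simplices)}
    D = np.zeros((len(km1_simplices), len(k_simplices)))
    for j, s in enumerate(k_simplices):
        for i in range(len(s)):
            face = s[:i] + s[i + 1:]      # drop the i-th vertex
            D[index[face], j] = (-1) ** i
    return D

def betti_numbers(simplices):
    """simplices: sorted tuples of vertices, closed under taking faces.
    Returns [beta_0, beta_1, ...] with real coefficients (an assumption)."""
    dim = max(len(s) for s in simplices) - 1
    by_dim = [sorted(s for s in simplices if len(s) == k + 1)
              for k in range(dim + 1)]
    ranks = [0]                            # d_0 = 0
    for k in range(1, dim + 1):
        D = boundary_matrix(by_dim[k], by_dim[k - 1])
        ranks.append(int(np.linalg.matrix_rank(D)))
    ranks.append(0)                        # d_{dim+1} = 0
    # beta_k = dim ker d_k - rank d_{k+1}
    return [len(by_dim[k]) - ranks[k] - ranks[k + 1] for k in range(dim + 1)]

hollow = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]   # a circle
filled = hollow + [(0, 1, 2)]                          # a disc
betti_hollow = betti_numbers(hollow)
betti_filled = betti_numbers(filled)
```

The hollow triangle has one connected component and one loop; filling it in kills the loop.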


The actual computations require linear algebra (diagonalizing $\partial$ – Smith normal form) over the principal ideal domain $F[t]$ (N-persistence vector spaces are equivalent to graded $F[t]$-modules).

2. Those ideas have been applied to image data:
– Consider 3 × 3 patches (points in $R^9$); only keep those with high contrast (top 20%);
– Center and normalize (points in $S^7$); sub-sample;
– Estimate the density with the distance to the nearest neighbours ($k = 15$) and keep the top $T$% densest points ($T$ = 20%);
– Build the witness complex on 50 landmark points chosen by archetypal analysis.

The barcodes suggest $\beta_0 = 1$, $\beta_1 = 5$, which can be realized as the union of 3 circles,

[Figure: three circles, labelled edges, horizontal lines, vertical lines]

corresponding to edges (or gradients), horizontal lines and vertical lines (consistent with the presence of horizontal and vertical artefacts: horizon, buildings, people, trees, etc.). Looking at $\beta_2$ suggests (unconvincingly) that those circles may be embedded in a Klein bottle.

For datasets in which one observation is a cloud of points (e.g., a few statistics estimated on a moving window – say, the number of spikes for the 5 neurons firing the most, in brain activity data), you can use Betti numbers (or the most frequent Betti signatures) as features.

3. Given a map $\rho : X \to Z$ and a covering $U$ of $Z$ (with $Z = R$ or $R^n$ or $S^1$), $\rho^* U$ is a covering of $X$; consider the Čech complex of its connected components $\check C \pi_0(\rho^* U)$, e.g., for $U = U[R,e] = \{ [kR - e, (k+1)R + e], k \in Z \}$. For point clouds, replace the connected components $\pi_0$ with single linkage clustering: points less than $\varepsilon$ apart are in the same component (these are the Vietoris-Rips connected components). The map $\rho$ could be a density estimator, data depth

$\rho_p(x) = \frac{1}{|X|} \sum_{y \in X} d(x,y)^p$

or eigenvectors of the graph Laplacian of the 1-skeleton of the Vietoris-Rips complex, seen as (eigen) functions.

To avoid choosing $\varepsilon$, consider the simplicial complex $SS(X,\rho,U)$ whose vertices are $(\alpha, I)$, with $\alpha \in A$ and $I = (\varepsilon_i, \varepsilon_{i+1})$ a (maximal) interval on which the barcode of $H_0( \check C(\rho^{-1} U_\alpha, \varepsilon) )_\varepsilon$ does not change and whose $k$-simplices are the $\{ (\alpha_0,I_0), \ldots, (\alpha_k,I_k) \}$ such that $A_{\alpha_0} \cap \cdots \cap A_{\alpha_k} \neq \emptyset$ and $I_0 \cap \cdots \cap I_k \neq \emptyset$. Choose a section of the projection, $s : \check C U \to SS$, $s(\alpha) = (\alpha, I_\alpha)$, and choose $\varepsilon_\alpha \in I_\alpha$.

[The results are similar to Isomap. I am not convinced this is significantly different from a minimum spanning tree.]

4. Persistent homology can be generalized to other index posets. For instance, $VR(X[T], \varepsilon)_{\varepsilon,T}$ where $X[T] = \{ x \in X : \phi(x) \geqslant T \}$ and $\phi$ is a density estimator on the finite set $X$ (or some other function – this is Morse theory), gives $R \times R$- (or $N \times N$-) persistent homology. It can be described with multigraded modules over $k[X_1, \ldots, X_n]$ (here, $n = 2$). The classification of those modules is too complex to be useful (but may be amenable to Gröbner bases), but simpler invariants are more accessible, e.g., $\dim M_{t_1,\ldots,t_n}$ or $\mathrm{rank}( M_{t_1,\ldots,t_n} \to M_{t'_1,\ldots,t'_n} )$ (for $n = 1$, the ranks contain all the information).

Zigzag diagrams (quiver representations) appear in the following situations (below, ↔ stands for a map in one of the two directions):
– Pairwise comparisons of subsamples (bootstrap): $S_0 \to S_0 \cup S_1 \leftarrow S_1 \to S_1 \cup S_2 \leftarrow S_2 \cdots$;
– Samples from a moving window;
– Upper level sets ($T$%) of a density estimator with varying bandwidths $\varepsilon_i$: $\cdots X[T,\varepsilon_{i-1}] \to X[T,\varepsilon_{i-1}] \cup X[T,\varepsilon_i] \leftarrow X[T,\varepsilon_i] \to X[T,\varepsilon_i] \cup X[T,\varepsilon_{i+1}] \leftarrow \cdots$;
– Witness complexes as the set of landmarks changes: $\cdots \leftrightarrow W(X,L_{i-1},L_i,\varepsilon) \leftrightarrow W(X,L_i,\varepsilon) \leftrightarrow W(X,L_i,L_{i+1},\varepsilon) \leftrightarrow \cdots$.

There is a classification theory for finite zigzag persistence vector spaces: the elementary blocks are of the form

$0 \leftrightarrow \cdots \leftrightarrow 0 \leftrightarrow F \leftrightarrow \cdots \leftrightarrow F \leftrightarrow 0 \leftrightarrow \cdots \leftrightarrow 0$.
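The single linkage clustering that item 3 substitutes for the connected components π0 (points at most ε apart end up in the same component) can be sketched with a union-find structure. This is a minimal quadratic-time illustration, not the method of any particular library; the function name and the sample points are made up.

```python
import numpy as np

def single_linkage_components(points, eps):
    """Label the connected components of the graph joining points
    at distance <= eps (the Vietoris-Rips 1-skeleton at scale eps)."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(points[i] - points[j]) <= eps:
                parent[find(i)] = find(j)      # merge the two components

    roots = [find(i) for i in range(n)]
    labels = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return [labels[r] for r in roots]

# Made-up point cloud: a cluster of three points and a cluster of two.
points = np.array([[0.0, 0.0], [0.5, 0.0], [1.0, 0.0], [5.0, 0.0], [5.4, 0.0]])
labels = single_linkage_components(points, eps=0.6)
n_fine = len(set(single_linkage_components(points, eps=0.45)))
```

Counting the labels for increasing ε gives the number of components β0 at each scale, which is the information carried by the H0 barcode.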


5. The paper ends with a discussion of functorial clustering algorithms (functors from the category of finite metric spaces with isometries, embeddings, non-increasing embeddings, non-increasing maps, to the category of sets).

Barcodes: the persistent topology of data
R. Ghrist (2008)

Another (shorter) clear review article.

Computational topology for point data: Betti numbers of α-shapes
V. Robins (2002)

To estimate the Betti numbers (intuitively, $\beta_k(X)$ is the number of $k$-dimensional holes in $X$) of $X \subset R^d$ from a finite set of points $S \subset X$, attach balls of increasing radius $\alpha$ at each point:

$S_\alpha = \bigcup_{x \in S} B(x,\alpha)$
$\beta_k^\alpha(S) = \dim H_k(S_\alpha)$

for $\alpha \geqslant d_{\mathrm{Hausdorff}}(S,X)$. The persistent Betti number

$\beta_k^{\alpha,\beta}(S) = \mathrm{rank}( i_* : H_k(S_\alpha) \to H_k(S_\beta) )$,

for $\alpha \leqslant \beta$ and $i : S_\alpha \hookrightarrow S_\beta$, is, intuitively, the number of $k$-dimensional holes in $S_\alpha$ that are not filled in $S_\beta$.

The number of connected components $\beta_0^\alpha(S)$ can be obtained from the minimum spanning tree of $S$. Higher Betti numbers can be computed from the Delaunay complex of $S_\alpha$ in $S$ (the dual of the Voronoi complex of $S_\alpha$ in $S$).

JavaPlex tutorial
H. Adams and A. Tausz (2015)

JavaPlex is a Java/Matlab library to compute the persistent homology of filtered simplicial complexes. The tutorial ends with the 3-circle structure of image patches.

A roadmap for the computation of persistent homology
N. Otter et al. (2015)

Comparison of software to compute persistent homology: JavaPlex (well documented), Dionysus (C++/Python, documented), Dipha (C++, MPI, command line interface, fast), Gudhi (C++, fast), Perseus, Simpers, jHoles, pHat, CHomP.

Three examples of applied and computational homology
R. Ghrist (2008)

Concrete applications of algebraic topology include:
– Euler characteristic integration (constructible sheaves) to aggregate sensor data: the Euler characteristic is a measure since, under some circumstances, $\chi(A \cup B) = \chi(A) + \chi(B) - \chi(A \cap B)$;
– Persistent homology to understand the shape of and identify the structures in high-dimensional datasets, e.g., natural images or brain activity;
– Conley index theory (a generalization of Morse theory) to distinguish between chaotic and noisy (experimental or simulated) data.

How to write a 21st century proof
L. Lamport (2012)

Advocacy for the replacement of "informal proofs" (which we have been writing since the 17th century) with "structured proofs": sequences of statements, each with a proof (itself structured, if needed). It is still too early for mathematicians to write formal (computer-verified) proofs, but familiarity with them can be helpful; the appendix shows a proof in TLA+.

An extensible SAT-solver
N. Eén and N. Sörensson (2004)

Walk through the MiniSAT code – should you want to implement your own SAT solver.

Conflict-driven clause learning SAT solvers
J. Marques-Silva et al. (2008)

DPLL is the basic SAT solving algorithm:
– Write the formula as $\bigwedge_i \bigvee_j x_{ij}$;
– If one of the clauses has only one term, set it to true;
– If one of the variables always appears with the same sign, e.g., always $x$, or always $\neg x$, set it to true (or false);
– When you can no longer apply the previous steps, choose a value for one of the variables; backtrack when needed.

But this is suboptimal: when we reach a contradiction, we backtrack and try another value for some variable, but if it is unrelated to the contradiction, nothing will change, and we will explore the same tree again and again. Clause learning identifies the cause of the contradiction and adds it as a new clause, to help prune the search tree.

To find one (or several) clauses to add, represent the current state of the exploration as a graph, whose nodes are the possible values of the variables and with the clauses used for the deductions as arrows (all inbound edges have the same label), and find a cut between the latest decision variable and the conflict node.

Practical applications of Boolean satisfiability
J. Marques-Silva

SAT applications include:
– Hardware verification: checking that two circuits are equivalent;
– Circuit testing: automatic test generation, for boolean circuits, assuming the failure is that of a single value stuck at 0 or 1;


– Planning: checking if a state is reachable in $k$ steps, in a deterministic transition system, with several transition functions, corresponding to the decisions; temporal logic;
– Bioinformatics, e.g., inferring haplotypes from genotypes: a genotype is a string over the alphabet {0, 1, 2} (wild, mutant, heterozygous); a haplotype is a string over {0, 1}; each genotype comes from two haplotypes; the genotypes are known but the haplotypes are not; we want the minimum number of haplotypes to explain the genotypes.

Bayesian structural time series models
S.L. Scott (2015)

Structural models (level, trend and seasonality) can be put in state-space form and used for ad campaign effectiveness measurement. The model can be extended to allow non-Gaussian innovations (by expressing them as an (infinite) mixture of Gaussians) and long-term trends (by using a mean-reverting AR(1) slope instead of a random walk). In R, check the bsts and CausalImpact packages.

Big data, statistics and the internet
S.L. Scott (2014)

Gibbs sampling (sample from $\beta | \mu, V$; sample from $\mu, V | \beta$; iterate) cannot be directly, efficiently implemented with MapReduce (there is too much communication and virtually no CPU usage – the complexity of most MapReduce algorithms is the volume of communications, not the computations). Instead, consensus Monte Carlo gives different bits of data to various workers, which work independently (sample from $\beta | \mu, V, \mathrm{data}_w$; sample from $\mu, V | \beta, \mathrm{data}_w$; iterate) and the results are combined. Contrary to distributed optimization (ADMM) or horizontal strategies for stochastic optimization (progressive hedging), it is a simple consensus – there are no increasing penalties to force convergence.

Experiments in the internet age: a modern look at the multi-armed bandit
S.L. Scott (2014)

Planning an experiment and running it until the end is suboptimal (in terms of regret): you spend a lot of time estimating precisely options that are suboptimal. In the 2-arm bandit setup, $P(\theta_1 > \theta_2)$ can be estimated analytically, or estimated by sampling from the posteriors of $\theta_1$ and $\theta_2$: Thompson sampling uses a single sample and works well. Full factorial experiments are rarely possible, but one can design fractional factorial experiments to estimate the coefficients as precisely as possible [this is not unlike Bayesian optimization].

This can be applied to hierarchical Bayesian Poisson regression ($2 \times 10^7$ observations, $10^4$ groups, 10 variables).

Introduction to RKHS and some simple kernel algorithms
A. Gretton (2015)

Given a kernel $k : X \times X \to R$ on a finite set $X$, one can consider the Hilbert space

$H = \mathrm{Span}\{ k(x,\cdot), x \in X \} \subset F(X,R)$
$\langle \sum_x \alpha_x k(x,\cdot), \sum_y \beta_y k(y,\cdot) \rangle := \sum_{x,y} \alpha_x \beta_y k(x,y)$.

It satisfies:

$H \subset F(X,R)$
$\forall x \in X \;\; k(x,\cdot) \in H$
$\forall x \in X \;\; \forall f \in H \;\; \langle f, k(x,\cdot) \rangle = f(x)$
$\| \delta_x \| = k(x,x)^{1/2} < \infty$,

where $\delta_x : H \to R$, $f \mapsto f(x)$ is the evaluation function. Those results generalize to infinite $X$, and $H$ is called a reproducing kernel Hilbert space (RKHS) for $k$.

Scalable Bayesian optimization using deep neural networks
J. Snoek et al. (2015)

Gaussian-process-based optimization scales cubically with the number of observations. Instead, one can train a neural net on the data

input –tanh→ ··· –tanh→ $\phi_1 \cdots \phi_D$ –linear→ output

(prefer tanh to ReLU) and fit a Bayesian linear model using the last layer $y = \sum_i \beta_i \phi_i + \varepsilon$ (adaptive basis regression).

Fast exact summation using small and large superaccumulators
R.M. Neal (2015)

Naive computation of long sums can lead to rounding error accumulation. There are approximate algorithms (Kahan-Babuška), but it is actually possible to compute those sums exactly, by using a "superaccumulator", i.e., 4096 64-bit numbers, one for each possible exponent (all but one bit overlap: carry propagation is rarely needed). For small sums, a smaller number of 64-bit numbers (e.g., with 32-bit overlaps) is sufficient. The overhead is twofold. [A big fixpoint number, with no overlap, looks simpler, but would force us to reimplement addition and could end up slower.]

Quantifying creativity in art networks
A. Elgammal and B. Saleh (2015)

To measure creativity (originality and influence):
– Build a graph of paintings, with a directed edge between two paintings if the first precedes the second, weighted by similarity;


– Subtract some reference value from the weights;
– Invert the negative edges;
– Compute the eigenvector centrality (PageRank).

An R package flare for high dimensional linear regression and precision matrix estimation
X. Li et al.

Variants of the lasso where the square loss is replaced with the absolute value, the $L^2$ norm (not its square), or more generally the $L^p$ norm, $1 \leqslant p \leqslant 2$, and applications to sparse precision matrix estimation (tiger, clime).

Tiger: a tuning-insensitive approach for optimally estimating Gaussian graphical models
H. Liu and L. Wang (2012)

Fitting a Gaussian graphical model, i.e., estimating a sparse precision matrix, can be done column by column, with sparse regressions $X_i \sim X_{\setminus i}$: the lasso and many variants (Dantzig selector, scaled lasso, etc.) have been suggested, but the optimal tuning parameters cannot be computed in practice. The sqrt lasso (minimize $\| \mathrm{residuals} \|_2 + \lambda \| \beta \|_1$ – this is the $L^2$ norm, not its square) is tuning-insensitive.

In R, check the bigmatrix and flare packages.

Identification of structured dynamical systems in tensor product reproducing kernel Hilbert spaces
M. Signoretto and J.A.K. Suykens

Factorization machines (or tensor machines) with a low rank constraint: add a "multilinear spectral penalty", built from the SVDs of the unfoldings of the model parameters.

Improving distributional similarity with lessons learned from word embeddings
O. Levy et al.

Traditional and neural word embeddings use the same bag-of-contexts representation:
– Positive pointwise mutual information, $\mathrm{PPMI}(\text{word}, \text{context}) = ( \log \frac{ \hat P(\text{word}, \text{context}) }{ \hat P(\text{word}) \hat P(\text{context}) } )_+$;
– Its SVD-based low-rank approximation;
– Skipgram with negative sampling (SGNS, word2vec): maximize $s(\text{word} \cdot \text{context})$ and minimize $s(\text{word} \cdot \text{corrupted context})$, where $s$ is the sigmoid function and $\cdot$ the scalar product – this can be seen as a matrix factorization $\mathrm{PMI} \approx \mathrm{Words} \cdot \mathrm{Contexts}' + \log k$, with a robust (sigmoid) loss;
– Global vectors (GloVe), a factorization $M \approx \mathrm{Words} \cdot \mathrm{Contexts}' + b_w \cdot 1' + 1 \cdot b_c'$, where $M_{w,c} = \log \#(w,c)$.

The difference in performance is mostly due to better (or learned) hyperparameters in the "neural" networks.

Auto-WEKA: combined selection and hyperparameter optimization of classification algorithms
C. Thornton et al.

Use Bayesian optimization, e.g., SMAC (random-forest-based Bayesian optimization) or TPE (tree-structured Parzen estimators), to choose both algorithm and hyperparameters in Weka.

Real-time prediction and post-mortem analysis of the Shanghai 2015 stock market bubble
D. Sornette et al. (2015)

– Fit the log-periodic power law (LPPL) model on several windows, from 750 to 125 trading days;
– Filter the results (they give a long list of empirical conditions the model should satisfy);
– The "confidence" is the proportion of models (i.e., window sizes) that pass those filters; it estimates crash risk;
– The "trust" is a resampling-based variant of that goodness-of-fit test: resample the residuals, add them to the fit, re-estimate the model, check if it still passes the filters;
– Compute the probability distribution function of the crash time, or of the bubble start time.

Generative adversarial nets
I.J. Goodfellow et al.

Generative adversarial nets simultaneously learn two models

generator $G$: noise $\mapsto$ data
discriminator $D$: generated or real data $\mapsto$ true or false

via the minimax game with value

$V(D,G) = E_{x \sim \text{data}} \log D(x) + E_{x \sim \text{noise}} \log( 1 - D(G(x)) )$
$\mathrm{value(game)} = \mathrm{Min}_G \mathrm{Max}_D \, V(D,G)$.

Conditional generative adversarial nets
M. Mirza and S. Osindero

GANs can be generalized to conditional models (models of $x|y$ instead of $x$).

Deep generative image models using a Laplacian pyramid of adversarial networks
E. Denton et al.

Stack GANs to generate finer and finer images.

A neural conversational model
O. Vinyals and Q.V. Le

Build a conversational model with a sequence-to-sequence (seq2seq) RNN-LSTM network, trained on IT helpdesk chats or movie subtitles.


Learning to understand phrases by embedding the dictionary
F. Hill et al.

Train an RNN with LSTM to map dictionary definitions to word vector representations (word2vec) and use it as a crossword solver (or a reverse dictionary, i.e., to find the word on the tip of your tongue from its definition).

A simple way to initialize recurrent networks of rectified linear units
Q.V. Le et al.

To avoid the vanishing/exploding gradient problems when training recurrent neural networks (RNN) with stochastic gradient descent (SGD):
– Use Hessian-free optimization instead of SGD;
– Or use SGD, with momentum, careful initialization, and clipped gradients;
– Or use an LSTM network (the model is complicated, but its gradients are well-behaved);
– Or use ReLU and careful initialization.

Towards large-scale continuous EDA: a random matrix theory perspective
A. Kabán et al.

In high dimension, one can still use CMA-ES after imposing some structure on the covariance matrix (e.g., diagonal). Instead, one can apply CMA-ES on random (low-dimensional) subspaces and combine the resulting subpopulations. (EDA, "estimation of distribution algorithms", refers to CMA-ES-like evolutionary algorithms.)

Centrality of the supply chain
L. Wu

Supplier and consumer centralities (HITS algorithm).

Discovering hidden factors of variation in deep networks
B. Cheung et al.

Train a semi-supervised auto-encoder

[Figure: auto-encoder $x \to (y, z) \to x$]

(we want a latent representation $(z,y)$ of $x$, where $y$ is known, and $z$ should be as independent of $y$ as possible) with a penalty on the average squared covariance $\langle \mathrm{Cov}(z_i, y_j)^2 \rangle$.

A MiniZinc tutorial
K. Marriott and P.J. Stuckey

MiniZinc is a discrete optimization modeling language: it translates the problem to feed it to various optimizers (constraint propagation, linear programming, etc.) and provides global constraints (alldifferent, etc.).

Modeling discrete optimization
P.J. Stuckey and C. Coffrin (Coursera, 2015)

Another MiniZinc tutorial, with exercises.

The Wiener-Askey polynomial chaos for stochastic differential equations
D. Xiu and G.E. Karniadakis (2001)

The polynomial chaos expansion of a stochastic process (with finite variance) is the approximation

$X_t = a_0(t) H_0 + \sum_{i_1 \geqslant 1} a_{i_1}(t) H_1(X_{i_1}) + \sum_{i_1,i_2 \geqslant 1} a_{i_1 i_2}(t) H_2(X_{i_1}, X_{i_2}) + \cdots$

where $X_1, X_2, \ldots$ are $N(0,1)$ iid,

$H_n(x_1, \ldots, x_n) = (-1)^n e^{\frac12 x'x} \frac{\partial^n}{\partial x_1 \cdots \partial x_n} e^{-\frac12 x'x}$

are Hermite polynomials and $a_{i_1,\ldots,i_k}$ are functions to be determined. It can be generalized to other families of orthogonal polynomials and non-Gaussian random variables. These expansions can be used to solve SDEs, in the same way one would use power series with ODEs.

Time-dependent polynomial chaos
P. Vos

Worked out examples showing how to numerically solve stochastic ODEs (ODEs with random variables in their coefficients or their initial conditions) with generalized polynomial chaos (gPC) expansions and to circumvent numeric instability (by reinitializing the expansion from time to time).

Polynomial chaos: a tutorial and critique from a statistician's perspective
A. O'Hagan (2013)

Polynomial chaos (PC) describes a random variable $X$ as $X \stackrel{d}{=} f(\Xi)$ (equality in distribution), where $\Xi$ is a random variable following a known, simple distribution. Such a function $f$ is not unique: one can look for it in the finite-dimensional space of polynomials (of degree at most $n$) orthogonal wrt the probability distribution function of $\Xi$ (Legendre for $U(-1,1)$, Hermite for $N(0,1)$, Laguerre for $\mathrm{Exp}(1)$ – choose depending on the boundedness of $X$). Typically, $\Xi$ has dimension at least that of $X$. The Karhunen-Loève expansion is a degree-1 PC expansion.

145/587
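As an illustration of the polynomial chaos summaries above, here is a minimal sketch of a one-dimensional Hermite expansion of X = f(Ξ) with Ξ ∼ N(0, 1). The function name `pc_coefficients`, the choice f = exp and the quadrature size are assumptions made for the example; the coefficients are computed by projection, using the orthogonality E[He_j(Ξ) He_k(Ξ)] = k! for j = k and 0 otherwise.

```python
import math
import numpy as np
from numpy.polynomial import hermite_e as He

def pc_coefficients(f, degree, n_quad=40):
    """Coefficients a_k of f(Xi) ~ sum_k a_k He_k(Xi), with Xi ~ N(0, 1)."""
    # Gauss quadrature for the weight exp(-x^2 / 2), rescaled to the N(0, 1) density.
    nodes, weights = He.hermegauss(n_quad)
    weights = weights / math.sqrt(2 * math.pi)
    coeffs = []
    for k in range(degree + 1):
        basis = np.zeros(k + 1)
        basis[k] = 1.0                      # coefficient vector selecting He_k
        he_k = He.hermeval(nodes, basis)
        # Projection: a_k = E[f(Xi) He_k(Xi)] / k!
        coeffs.append(np.sum(weights * f(nodes) * he_k) / math.factorial(k))
    return np.array(coeffs)

# Lognormal example: X = exp(Xi), whose exact coefficients are exp(1/2) / k!.
a = pc_coefficients(np.exp, degree=6)
xi = np.random.default_rng(0).standard_normal(5)
approx = He.hermeval(xi, a)   # truncated expansion evaluated at samples of Xi
```

The lognormal case is convenient because the exact coefficients are known in closed form, so the projection can be checked directly.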

Random geometry on the sphere J.F. Le Gall (2014)

To obtain a random metric on the sphere, draw a random planar graph on the sphere, consider the graph distance (suitably rescaled), refine it by adding more nodes and edges until, in the limit, you have a metric on the whole sphere. More precisely, the “random graphs” are defined as follows:
– Consider planar maps, i.e., embeddings of graphs into the sphere S², up to homeomorphism;
– Consider only p-angulations (i.e., the faces have p edges);
– Sample uniformly from M_n^p = {p-angulations with n faces}, for p = 3 or 4 and n large.

For the limit, notice that a planar map is an element of

K = {compact metric spaces}/isometries,

which is complete when equipped with the Gromov-Hausdorff distance

$$d(E_1, E_2) = \inf\{ d_H(\psi_1 E_1, \psi_2 E_2) : \psi_i : E_i \hookrightarrow E \text{ isometric embeddings} \}$$

$$d_H(X, Y) = \max\Big\{ \sup_{x \in X} \inf_{y \in Y} d(x, y), \; \sup_{y \in Y} \inf_{x \in X} d(x, y) \Big\} = \inf\{ \varepsilon \geq 0 : X \subset Y_\varepsilon \text{ and } Y \subset X_\varepsilon \}$$

where X_ε is the ε-enlargement of X.

The Brownian map is the random variable, with values in K, obtained in the limit; it is a sphere, but it has Hausdorff dimension 4 (a.s.).

Quadrangulations can be described as well-labeled trees (Schaeffer’s bijection).

A continuous random tree (CRT) is obtained from a Brownian excursion (a Brownian motion B such that B0 = B1 = 0 and ∀t Bt ⩾ 0), as [0, 1]/∼ where s ∼ t if d(s, t) = 0 and

$$d(s, t) = B_s + B_t - 2 \min_{u \in [s,t]} B_u$$

is a distance on the tree. Schaeffer’s construction can be applied to a CRT with Brownian labels (there are two levels of randomness: in the tree, and in the labels).

Aspects of random maps G. Miermont (2014)

All the details on the Brownian map.

A relation between Brownian bridge and Brownian excursion W. Vervaat (1979)

A Brownian excursion is a Brownian motion E on [0, 1] conditioned by E0 = E1 = 0 and ∀s ∈ [0, 1], Es ⩾ 0. It can be built from a Brownian bridge by finding its minimum and shifting it (modulo 1) to start and end at that minimum:

$$W_t \quad \text{(Brownian motion)}$$
$$B_t = W_t - t W_1 \quad \text{(Brownian bridge)}$$
$$\tau = \operatorname*{Argmin}_t B_t$$
$$E_t = B_{(\tau + t) \bmod 1} - B_\tau \quad \text{(Brownian excursion).}$$

Automatic construction and natural language description of nonparametric regression models J.R. Lloyd et al.

The automatic statistician models time series as Gaussian processes (GP) whose kernel is described via a formal language, using elements such as WN (white noise), C (constant), Lin (linear), SE (square exponential), Per (periodic), CP (change points) and + and × operators. The models are not unlike symbolic regression and include linear regression (C + Lin + WN), GP smoothing (SE + WN), cyclical decomposition (∑ SE + ∑ Per + WN), Fourier decomposition (C + ∑ cos + WN), etc.

Intelligible models for classification and regression Y. Lou et al. (2012)

Comparison of variants of GAM: the shape functions can be splines (in R: mgcv), regression trees, tree ensembles (bagged, boosted, or both) with a fixed or adaptive depth; the model can be learnt via penalized least squares (wiggliness penalty, ∫|f′′|²), gradient boosting or backfitting (learn f_k on the residuals y − ∑_{i≠k} f_i(x_i), iterate until convergence): boosted bagged trees, preferably with adaptive depth, perform better.

Accurate intelligible models with pairwise interactions Y. Lou et al. (2013)

Interactions can easily be added to GAMs (GA²M), but since the number of pairs of variables can be huge, some form of feature selection is needed. For instance, one can greedily add the most promising interaction to the model, using
– Anova (p-value of a test comparing the model with no interactions and the model with the (xi, xj) interaction);
– Independence χ² test between the sign of the GAM residuals and (xi > ai, xj > aj), ai = median(xi);
– Comparison of the RMSE of a full model (e.g., a random forest) y ∼ rf(x1, . . . , xn) with a model without the (xi, xj) interaction, y ∼ rf(x1, . . . , x̂i, . . . , xn) + rf(x1, . . . , x̂j, . . . , xn), where the hat marks an omitted variable;
– RSS of the best split for (xi, xj) – it can be computed efficiently.
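Vervaat's construction of a Brownian excursion from a Brownian bridge, summarized above, can be sketched on a discrete time grid as follows. The function name `brownian_excursion` and the grid-based discretization are assumptions of this example; the continuous-time statement is only approximated.

```python
import numpy as np

def brownian_excursion(n_steps, rng):
    """Discretized Vervaat transform: Brownian motion -> bridge -> excursion."""
    t = np.linspace(0.0, 1.0, n_steps + 1)
    increments = rng.standard_normal(n_steps) * np.sqrt(1.0 / n_steps)
    w = np.concatenate([[0.0], np.cumsum(increments)])   # W_t, with W_0 = 0
    b = w - t * w[-1]                                      # bridge: B_0 = B_1 = 0
    tau = int(np.argmin(b[:-1]))                           # grid index of the minimum
    # Cyclic shift (modulo 1) so that the path starts and ends at its minimum.
    shifted = np.concatenate([b[tau:-1], b[:tau + 1]])
    return t, shifted - b[tau]

t, e = brownian_excursion(1000, np.random.default_rng(1))
```

The shifted path has the same number of grid points as the bridge, starts and ends at zero, and is nonnegative by construction.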

Convexifying the set of matrices of bounded rank: applications to the quasiconvexification and convexification of the rank function J.B. Hiriart-Urruty and H.Y. Le (2011)

The convex hull of

{M : rank M ⩽ k and ∥M∥ ⩽ 1}

is

{M : ∥M∥∗ ⩽ k and ∥M∥ ⩽ 1}

where ∥M∥ = Max_{x≠0} ∥Mx∥₂ / ∥x∥₂ = σ1(M) is the spectral (operator) norm and ∥M∥∗ = ∑_i σi(M) is the nuclear (trace) norm (those norms are dual).

This generalizes a similar result for the ℓ0 pseudo-norm: the convex hull of {x : ∥x∥0 ⩽ k and ∥x∥∞ ⩽ 1} is {x : ∥x∥1 ⩽ k and ∥x∥∞ ⩽ 1} and ∥·∥1 and ∥·∥∞ are dual.

A function f is quasi-convex if its sublevel sets [f ⩽ α] are convex. The (quasi-)convex hull of a function is the largest (quasi-)convex function minorizing it. The restricted rank is

$$\operatorname{rank}_r : M \mapsto \begin{cases} \operatorname{rank} M & \text{if } \|M\| \leq r \\ +\infty & \text{otherwise.} \end{cases}$$

Its quasi-convex hull is

$$M \mapsto \begin{cases} \lceil \tfrac{1}{r} \|M\|_* \rceil & \text{if } \|M\| \leq r \\ +\infty & \text{otherwise} \end{cases}$$

and its convex hull is

$$M \mapsto \begin{cases} \tfrac{1}{r} \|M\|_* & \text{if } \|M\| \leq r \\ +\infty & \text{otherwise.} \end{cases}$$

In other words, the nuclear norm ∥·∥∗ (sum of the singular values) is a convex relaxation of the rank. For instance, the non-convex robust PCA problem

Find L, S, N
To minimize rank L + λ ∥S∥0
Such that ∥N∥F ⩽ ε and X = L + S + N

can be relaxed to

Find L, S, N
To minimize ∥L∥∗ + λ ∥S∥1
Such that ∥N∥F ⩽ ε and X = L + S + N.

Robust models often look for a decomposition

data = model + outliers + noise = low rank + sparse + small.

Robust rotation synchronization via low-rank and sparse matrix decomposition F. Arrigoni et al. (2015)

Robust PCA and matrix completion are both low-rank approximations: they can be combined.

Rotation matrices R1, . . . , Rn ∈ SO(3) can be recovered from noisy relative alignments Rij = Ri Rj⁻¹ + noise by minimizing

$$\sum_{R_{ij} \text{ available}} \| R_{ij} - R_i R_j^{-1} \|_F^2 .$$

Letting

$$X = \begin{pmatrix} 1 & R_{12} & \cdots & R_{1n} \\ R_{21} & 1 & & \vdots \\ \vdots & & \ddots & \\ R_{n1} & \cdots & & 1 \end{pmatrix},$$

the problem becomes

Find R = (R1, . . . , Rn)′ ∈ SO(3)ⁿ
To minimize ∥p(X − RR′)∥F²

where p is the projection on the available coordinates. Since R ∈ SO(3)ⁿ, we can add the constraints rank X ⩽ 3 and X ≽ 0.

Efficient computation of sparse Hessians using coloring and automatic differentiation A.H. Gebremedhin et al. (2009)

To recover a sparse, symmetric matrix A, from its sparsity structure and a small number of products Ax1, . . . , Axn, one can use a colouring of the graph whose adjacency matrix is the sparsity structure to define S = (1_{node i has colour j})_{ij}; for a star-colouring (each 4-vertex path has at least 3 colours; consequently, the subgraph induced by 2 colours is a collection of stars), one can efficiently recover A from AS.

Compressed sensing recovery via nonconvex shrinkage penalties J. Woodworth and R. Chartrand (2015)

Compressed sensing, i.e., ℓ0 minimization

Find w
To minimize ∥w∥0
Such that Aw = b

is often relaxed to ℓ1 minimization, and many algorithms rely on the proximal mapping of the ℓ1 norm

$$S(x) = \operatorname*{Argmin}_w \; \lambda \|w\|_1 + \tfrac{1}{2} \|w - x\|_2^2 .$$

The ℓp quasi-norms, 0 < p < 1, are a better approximation of the ℓ0 penalty, but their proximal mapping has no known closed form expression. Instead, one can start with a proximal mapping (in closed form), e.g.,

[Figure: p-shrinkage, firm shrinkage and hard shrinkage mappings]

and derive the corresponding penalty.

Data manipulation detection via permutation information theory quantifiers A.F. Bariviera (2015)

Application to financial time series.
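The proximal mapping of the ℓ1 norm mentioned in the compressed sensing summary above has a well-known closed form, soft shrinkage. The sketch below shows it next to hard shrinkage; the function names and the thresholding convention for `hard_threshold` are assumptions of this example, not the paper's notation.

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal mapping of lam * ||.||_1 (soft shrinkage), applied elementwise."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def hard_threshold(x, lam):
    """Hard shrinkage: keep entries larger than lam in absolute value, zero the others."""
    return np.where(np.abs(x) > lam, x, 0.0)

x = np.array([-3.0, -0.5, 0.2, 1.5])
shrunk_soft = soft_threshold(x, 1.0)   # large entries move towards 0 by lam
shrunk_hard = hard_threshold(x, 1.0)   # large entries are left untouched
```

Soft shrinkage biases the surviving entries towards zero, which is one motivation for the nonconvex alternatives discussed in the summary.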

Entropy-based financial asset pricing M. Ormos and S. Zibriczky (2014)

In portfolio construction, one can use entropy instead of the standard deviation as a risk measure (it requires a density estimator): one can compute and plot efficient portfolios in the entropy×return space and decompose the entropy into systematic (mutual information) and specific (conditional entropy) components.

Representing numeric data in 32 bits while preserving 64-bit precision R.M. Neal (2015)

Storing data as decimals (rather than floating point) is space-efficient, but a time-consuming conversion is usually needed to use the data. Instead, one can use a table lookup (implemented in pqR).

A generalized Kahan-Babuška-summation algorithm A. Klein (2005)

When naively computing a (long) sum of floating point numbers, the rounding errors accumulate. It is possible to compensate for them, e.g., by re-ordering the terms (if they are all positive, use the Huffman code order, starting with the smallest numbers – unfortunately, with arbitrary signs, finding the optimal order is an NP-hard problem) or by estimating the error and correcting for it: the Kahan algorithm estimates the error and corrects it immediately; the more accurate Kahan-Babuška-Neumaier (KBN) algorithm corrects it at the end. There are dozens of variants of those algorithms.

Printing floating-point numbers quickly and accurately with integers F. Loitsch (2010)

Printing floating-point numbers (i.e., converting them to decimal) is not a trivial task. The previous algorithm, Dragon4, required arbitrary-precision arithmetic. Grisu only requires integer arithmetic (but occasionally needs to fall back to Dragon4).

Kylix: a sparse allreduce for commodity clusters H. Zhao and J. Canny

The all-reduce pattern usually involves a tree network. In a butterfly network, the nodes are arranged in a hypercube; in step k, the nodes communicate with their neighbours along dimension k.

Permutation-information-theory approach to unveil delay dynamics from time series analysis L. Zunino et al. (2010)

Given a probability distribution p on ⟦1, N⟧, one can define the normalized entropy H_S and the complexity C_JS as (here, p_e denotes the uniform distribution)

$$S[p] = -\sum_i p_i \log p_i$$
$$S[p_e] = \log N$$
$$H_S[p] = \frac{S[p]}{S[p_e]}$$
$$J(p|p_e) = S[\tfrac{1}{2}(p + p_e)] - \tfrac{1}{2} S[p] - \tfrac{1}{2} S[p_e]$$
$$\max_p J(p|p_e) = \log 2N - \frac{N+1}{2N} \log(N+1) - \tfrac{1}{2} \log N$$
$$Q_J(p|p_e) = \frac{J(p|p_e)}{\max_q J(q|p_e)}$$
$$C_{JS}[p] = Q_J(p|p_e) \, H_S[p]$$

To transform a time series into a discrete distribution, choose
– an embedding dimension D > 1;
– an embedding delay τ
and consider the permutation defined by

(x_t, x_{t+τ}, . . . , x_{t+(D−1)τ}).

This defines a discrete distribution on S_D (D should not be too large: D! ≪ n). The corresponding entropy and complexity are the permutation entropy and the permutation statistical complexity.

As an example, one can look at the Mackey-Glass oscillator

$$\dot{x} = -x + \frac{a \, x(t - \tau_0)}{1 + x^c(t - \tau_0)}, \qquad a = 2, \quad c = 10, \quad \tau_0 = 60$$

in the (H_S, C_JS) space, as τ0 varies.
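The ordinal-pattern construction above can be sketched as follows: each window (x_t, x_{t+τ}, ..., x_{t+(D−1)τ}) is replaced by the permutation that sorts it, and the normalized entropy of the resulting frequencies is the permutation entropy. The function names and default parameters are assumptions of this example; ties are broken by position, which is one common convention among several.

```python
import math
from collections import Counter
import numpy as np

def ordinal_distribution(x, D=3, tau=1):
    """Relative frequencies of the ordinal patterns of (x_t, x_{t+tau}, ..., x_{t+(D-1)tau})."""
    n = len(x) - (D - 1) * tau
    patterns = Counter(
        tuple(np.argsort(x[t:t + D * tau:tau], kind="stable")) for t in range(n)
    )
    return np.array(list(patterns.values()), dtype=float) / n

def permutation_entropy(x, D=3, tau=1):
    """Normalized permutation entropy H_S = S[p] / log(D!)."""
    p = ordinal_distribution(x, D, tau)
    return float(-(p * np.log(p)).sum() / math.log(math.factorial(D)))

trend = np.arange(100.0)                                   # a single pattern: entropy 0
noise = np.random.default_rng(0).standard_normal(5000)    # all patterns roughly equally likely
```

Only the patterns that actually occur are stored, so the 0 log 0 terms never need to be evaluated.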

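The two compensated summation schemes described in the Kahan-Babuška summary above can be sketched as follows. This is a plain-Python illustration with hypothetical function names, not the generalized algorithm of the paper.

```python
def kahan_sum(values):
    """Kahan summation: the estimated rounding error is fed back into the next addition."""
    total, comp = 0.0, 0.0
    for v in values:
        y = v - comp
        t = total + y
        comp = (t - total) - y
        total = t
    return total

def kbn_sum(values):
    """Kahan-Babuska-Neumaier summation: errors are accumulated and added at the end."""
    total, comp = 0.0, 0.0
    for v in values:
        t = total + v
        if abs(total) >= abs(v):
            comp += (total - t) + v    # low-order digits of v were lost
        else:
            comp += (v - t) + total    # low-order digits of total were lost
        total = t
    return total + comp

data = [1.0, 1e100, 1.0, -1e100]       # exact sum is 2
```

On this input the naive sum and the Kahan sum both lose the two small terms, while the KBN variant recovers them.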

Edge compression techniques for visualization of dense directed graphs T. Dwyer et al. (2013)

To display a directed graph with a lot of edges (e.g., the dependencies between software components), one can try to group them into modules, with edges between modules instead of between nodes:
– One can find, in linear time (via hashing), nodes that have the same set of neighbours, and group them (the nodes inside a module are either disconnected or form a clique, depending on whether we include a node in its neighbourhood);
– One can also allow for some internal structure in the modules (but the algorithm is trickier);
– One can also allow for some module-crossing edges (powergraph), e.g., with a greedy algorithm: compute a hierarchical clustering of the nodes, for some node similarity measure; for each potential cluster, compute the number of edges that would be removed; choose the modules greedily.

The results were compared with the optimal minimizer of

#modules + w1 × #edges + w2 × #crossings,

computed using MiniZinc – the greedy solution is far from optimal but, in psychological tests (“how much of the graph do you remember?” “Can you find the shortest path from A to B?”), it is an improvement on traditional layout algorithms.

Improved optimal and approximate power graph compression for clearer visualization of dense graphs T. Dwyer et al. (2013)

Finding the optimal powergraph compression of a graph (i.e., a module decomposition that reduces the number of edges) using general optimization methods (integer propagation, integer programming) is too time-consuming. The following heuristic is faster, and gives decent results:
– Arrange the set of possible solutions into a tree, with the initial graph (1-node modules) as the root, and children obtained by merging two modules in the parent;
– Use best-first-search or, better, beam search: at each iteration (i.e., for each depth in the tree), keep the k best solutions (k = 1 is greedy-best-first search).

It is actually reasonable to explore the whole tree, with backtracking, after judicious pruning.

A generic algorithm for layout of biological networks F. Schreiber et al. (2009)

The algorithm behind dunnart (for constrained graph layout): start with a feasible but partial layout, find a locally optimal partial layout, extend it into a full layout.

Google matrix analysis of the multiproduct world trade network L. Ermann and D.L. Shepelyansky (2015)

In the graph whose vertices are country×product pairs, also consider the Chei-rank (the PageRank of the opposite graph); use a personalized variant of PageRank to account for imbalance between products (e.g., oil vs furs) or countries; and compute the correlations

$$\kappa = \sum_{c,p} p_{cp} \, p^*_{cp} - 1$$

$$\kappa_{p_1,p_2} = N_C \sum_c \frac{p_{cp_1} \, p^*_{cp_2}}{\sum_{c_1} p_{c_1 p_1} \sum_{c_2} p^*_{c_2 p_2}} - 1$$

where p_cp (resp. p*_cp) are the PageRank (Chei-rank) scores (eigenvectors for λ = 1, normalized so that p′1 = p*′1 = 1).

Randomizing bipartite networks: the case of the world trade web F. Saracco et al. (2015)

The common graph metrics can be generalized to bipartite graphs:
– Assortativity: Cor(deg X, deg Y | edge X—Y);
– Complexity and fitness are PageRank analogues – they can be used to reorder the rows and columns of the adjacency matrix;
– As a replacement for the clustering coefficient (there are no odd cycles), one can count motifs [Figure: bipartite motifs];
– Nestedness: count the number of products (countries) two countries (products) have in common; sum; normalize.

To construct statistical tests on a graph, use some maximal entropy distribution on the set of graphs (e.g., assuming the degree distribution is known).

On the modular dynamics of financial market networks F.N. Silva et al. (2015)

Let G be a graph (undirected, no self-loops, no multiple edges), A(G) its adjacency matrix, ∆(G) its degree matrix (diagonal), L(G) = ∆(G) − A(G) its Laplacian, ρ(G) = L(G)/ tr ∆(G) its density matrix, λ1 ⩾ λ2 ⩾ · · · ⩾ λn the eigenvalues of ρ(G). The von Neumann entropy of G is

$$S(G) = -\sum_i \lambda_i \log_2 \lambda_i .$$

One can look at how graph metrics or community metrics (modularity, average shortest path length, average betweenness, degree assortativity, transitivity, von Neumann entropy) vary over time (for a network built from the correlation matrix of asset returns, estimated on a moving window). With a model-based community detection algorithm, one can generate random graphs from the fitted community model, and look at the distribution of graph metrics.
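The von Neumann entropy defined above is straightforward to compute from an adjacency matrix. The following is a minimal sketch with a hypothetical function name; it assumes the graph is undirected and has at least one edge, and uses the convention 0 log 0 = 0.

```python
import numpy as np

def von_neumann_entropy(adjacency):
    """Von Neumann entropy (in bits) of an undirected graph given by its adjacency matrix."""
    a = np.asarray(adjacency, dtype=float)
    degrees = a.sum(axis=1)
    rho = (np.diag(degrees) - a) / degrees.sum()   # Laplacian scaled to unit trace
    eigenvalues = np.linalg.eigvalsh(rho)
    positive = eigenvalues[eigenvalues > 1e-12]    # convention: 0 log 0 = 0
    return float(-(positive * np.log2(positive)).sum())

# Complete graph on 5 nodes: rho has eigenvalues 0 and 1/4 (four times), so S = log2(4) = 2.
k5 = np.ones((5, 5)) - np.eye(5)
entropy_k5 = von_neumann_entropy(k5)
```

For a time-varying network, the same function could be applied to each window of a moving-window correlation graph.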

A note on the von Neumann entropy of random graphs W. Du et al. (2010)

Dynamic multi-factor clustering of financial networks G.J. Ross (2015)

To measure how much a qualitative variable (sector, country, etc.) influences a hierarchical clustering (from stock return correlations): for each pair of points i, j in the same class, find their closest common ancestor k, count the proportion of points in this class among the descendants of k; average over all pairs.

Identifying states in a financial market M.C. Münnix et al. (2012)

To identify market states:
– Compute the correlation matrix, for daily (or hourly) returns, on a 2-month moving window;
– Compute the normalized L1 distance (alternatively, the difference between the largest eigenvalues) between those correlation matrices, and cluster them (hierarchical clustering or k-means) – alternatively, fit some regime-switching model.

Forecasting financial extremes: a network degree measure of super-exponential growth W. Yan et al. (2015)

Use the degree of the visibility graph of the log-price (or its opposite) to identify super-exponential growth (or decay); this is similar to the LPPL pattern recognition indicator.

A practical approach to financial crisis indicators based on random matrices A. Kornprobst and R. Douady (2015)

The Hellinger distance between two probability distributions is

$$\operatorname{Hellinger}(p, q) = \int \left( \sqrt{p} - \sqrt{q} \right)^2 .$$

To detect market instability, look at the following indicators:
– Hellinger distance between the Marchenko-Pastur distribution and the distribution of the eigenvalues of the correlation matrix (low eigenvalues are not informative: discard everything below the tenth of the maximum theoretical eigenvalue);
– Hellinger distance with the spectrum of the sample correlation of a constant correlation Gaussian (or Student, with 2 degrees of freedom);
– Maximum eigenvalue of Var X (spectral radius);
– tr Var X;
– Max Spec Cor X, with capitalization-weighted and volume-weighted variants (apply a moving average filter if too noisy).

Kernel spectral clustering and applications R. Langone et al. (2015)

To cluster data with PCA (or kernel PCA, or spectral methods – kPCA on a graph Laplacian), binarize the transformed data (sign PC1, sign PC2, . . . ) and use the most frequent binary codes as clusters.

Scale up nonlinear component analysis with doubly stochastic gradients B. Xie et al.

Use two stochastic approximations simultaneously in kernel PCA: process the data in minibatches (as in stochastic gradient descent) and use random features (different in each batch, as in randomized PCA).

Implied correlation and expected returns M. Valenzuela

The option-implied correlation

$$\rho = \frac{\sigma^2 - \sum_i w_i^2 \sigma_i^2}{\sum_{i \neq j} w_i w_j \sigma_i \sigma_j}$$

$$\sigma^2 = \frac{2 e^{rT}}{T} \left( \int_0^{F_T} \frac{\operatorname{put}_T(K)}{K^2} \, dK + \int_{F_T}^{\infty} \frac{\operatorname{call}_T(K)}{K^2} \, dK \right)$$

computed from (30-day) put and call options on the S&P 100 and its constituents, can help predict (3- to 12-month) future returns.

Market timing with a robust moving average V. Zakamulin (2015)

Among the half-dozen momentum strategies studied, the exponentially-weighted moving average (EWMA) indicator “buy if ∑_{k⩾0} λ^k × returns_{[t−k−1,t−k]} > 0”, with λ = 0.87, is the most robust to the look-back period.

Market timing with moving averages: anatomy and performance of trading rules V. Zakamulin

Technical analysis rules can be formulated as “buy if ∑_{k⩾0} w_k Price_{now−k} > 0” for different weighting schemes (one can also use price changes, with different weights).

Copula-based hierarchical risk aggregation F. Derendinger (2015)

A mildly tree-dependent random vector (Xi)_{i∈I} is
– a rooted tree structure on the index set I,
– a univariate distribution for each leaf,
– a copula for each node, describing the dependence of its children,
where each node is the sum of its children. This does not uniquely determine the distribution of the random vector, unless one also assumes that, for each node, its descendants are conditionally independent from the other nodes.
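The implied correlation formula in the Valenzuela summary above is the average correlation that reconciles the index variance with the constituent variances. A minimal sketch, with a hypothetical function name and made-up inputs (the volatilities would in practice come from option prices):

```python
import numpy as np

def implied_correlation(index_vol, weights, vols):
    """Average correlation rho solving
    sigma^2 = sum_i w_i^2 s_i^2 + rho * sum_{i != j} w_i w_j s_i s_j."""
    w = np.asarray(weights, dtype=float)
    s = np.asarray(vols, dtype=float)
    own = np.sum((w * s) ** 2)
    cross = np.sum(w * s) ** 2 - own      # sum over i != j of w_i w_j s_i s_j
    return (index_vol ** 2 - own) / cross

# Sanity check on synthetic inputs: a constant-correlation covariance matrix.
w = np.array([0.5, 0.3, 0.2])
s = np.array([0.20, 0.30, 0.25])
corr = np.full((3, 3), 0.3) + 0.7 * np.eye(3)
index_vol = np.sqrt(w @ (corr * np.outer(s, s)) @ w)
rho = implied_correlation(index_vol, w, s)
```

With a constant-correlation covariance matrix, the formula recovers the common correlation exactly, which makes it easy to check.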

Measuring financial asset return and volatility spillovers, with applications to global equity markets F. Diebold and K. Yilmaz (2009)

To measure spillover in a VAR model (written here with one lag) xt = Φxt−1 + εt,
– Write the process as an MA(∞) process, xt = (I − ΦL)⁻¹ εt;
– Using the Cholesky decomposition Var εt = Qt⁻¹ (Qt⁻¹)′, rewrite the process as an MA(∞) process with orthogonal innovations, xt = (I − ΦL)⁻¹ Qt⁻¹ · Qt εt = A(L)ut;
– The 1-step-ahead error xt+1 − xt+1|t = A0 ut+1 has variance A0 A0′; the K-step-ahead error can be decomposed

$$\sum_{ij} \sum_{1 \leq k \leq K} a_{kij}^2 = \sum_i \Big( \sum_k a_{kii}^2 + \sum_{j \neq i} \sum_k a_{kij}^2 \Big)$$

where the second term is the spillover of j onto i. The spillover index is

$$S = \frac{\sum_{i \neq j} \sum_k a_{kij}^2}{\sum_{ijk} a_{kij}^2} .$$

It depends on the order of the variables, but not too much.

On the computational complexity of high-dimensional Bayesian variable selection Y. Yang et al. (2015)

The Bayesian hierarchical sparse model

$$\gamma = \{\text{variables entering the model}\} \subset [\![1, n]\!]$$
$$\pi(\gamma) \propto p^{-\kappa |\gamma|} \, 1_{|\gamma| \leq s_0}$$
$$\pi(\phi) \propto \phi^{-1} \quad \text{(improper prior)}$$
$$\beta_\gamma \sim N(0, g \phi^{-1} I_{|\gamma|})$$
$$w \sim N(0, \phi^{-1} I_n)$$
$$Y = X_\gamma \beta_\gamma + w$$

(the slab-and-spike prior, i.e., a mixture of two Gaussians with different variances, is another popular choice), sampled via Metropolis-Hastings (with the neighbourhood of γ defined by removing or adding a variable; or by removing and adding a variable) is an alternative to the lasso for variable selection: it has different, sometimes better, theoretical properties.

Relevance vector machines explained T. Fletcher

Relevance vector machines (RVM) consider the Bayesian model

$$w_j \sim N(0, \alpha_j^{-1})$$
$$\varepsilon_i \sim N(0, \beta^{-1})$$
$$t_i = w' \phi(x_i) + \varepsilon_i$$

and estimate its posterior

$$w \mid t, \alpha, \beta \sim N(m, \Sigma)$$
$$m = \beta \Sigma \Phi' t$$
$$\Sigma = (\operatorname{diag} \alpha + \beta \Phi' \Phi)^{-1}$$

iteratively, by computing m and Σ for given values of α and β, then computing the values of the hyperparameters α, β that maximize the evidence P(t | α, β):

$$\alpha_i \leftarrow \frac{1 - \alpha_i \Sigma_{ii}}{m_i^2}$$
$$\beta \leftarrow \frac{N - \sum_i (1 - \alpha_i \Sigma_{ii})}{\| t - \Phi m \|^2} .$$

Quite often, αi → ∞, i.e., wi → 0: the corresponding parameters can be pruned. The remaining parameters are the relevance vectors.

Shotgun stochastic search for “large p” regression C. Hans et al. (2007)

Shotgun stochastic search (SSS) is a variant of Metropolis-Hastings (MH) to look for a sparse model:
– Let Γ be the set of the best k models examined so far and γn the current model;
– In parallel (MCMC is usually sequential), examine all the (or a large number of) neighbours of γn; add them to Γ; discard the worst models if |Γ| > k;
– Choose γn+1 as with MH (if you allow for three types of neighbours, from deletion moves γ⁻, additions γ⁺ and replacements γ⁰, they are unbalanced, especially in high dimensions: |γ⁰| ≫ |γ⁺| ≫ |γ⁻| – you may want to sample from them separately), but with an acceptance probability that depends on the sum of the scores of the neighbours of the current point and the candidate.

Statistical model criticism using kernel two sample tests J.R. Lloyd and Z. Ghahramani

The notion of p-value can be generalized to a Bayesian setting:
– Choose a statistic T;
– Estimate P[T ⩽ Tobs] using the posterior distribution of T.

It measures how surprising the data still is, even after observing it.
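The spillover index from the Diebold and Yilmaz summary above can be sketched for a VAR(1) as follows. The function name and the indexing of horizons (k = 0, ..., K−1, with A_k = Φ^k P and P the lower Cholesky factor of Var ε) are assumptions of this example.

```python
import numpy as np

def spillover_index(phi, sigma, horizon):
    """Share of the `horizon`-step forecast error variance due to cross-variable shocks,
    for the VAR(1) model x_t = phi x_{t-1} + eps_t with Var(eps_t) = sigma."""
    phi = np.asarray(phi, dtype=float)
    chol = np.linalg.cholesky(np.asarray(sigma, dtype=float))  # sigma = chol @ chol.T
    contributions = np.zeros_like(phi)
    power = np.eye(phi.shape[0])
    for _ in range(horizon):
        a_k = power @ chol            # response to orthogonalized shocks after k steps
        contributions += a_k ** 2     # entry (i, j): effect of shock j on variable i
        power = power @ phi
    off_diagonal = contributions.sum() - np.trace(contributions)
    return off_diagonal / contributions.sum()

no_link = spillover_index(np.diag([0.5, 0.2]), np.eye(2), horizon=5)
one_way = spillover_index([[0.5, 0.3], [0.0, 0.5]], np.eye(2), horizon=2)
```

Because the Cholesky factor depends on the ordering of the variables, so does the index, as noted in the summary.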

Random bits regression: a strong general predictor for big data Y. Wang et al.

Penalized regression on random binary features performs well on big data (same idea as echo state networks or extreme learning machines, which are just SVMs with a randomized (instead of universal) kernel).

Random feature maps for dot product kernels P. Kar and H. Karnick (2012)

SVMs with a high-dimensional embedding space tend to have too many support vectors. One can transform the kernel to reduce the dimension, randomly (as in the Johnson-Lindenstrauss lemma). For instance, if the kernel is of the form k(x, y) = f(⟨x, y⟩), with f(x) = ∑_{n⩾0} a_n xⁿ, where the a_n are nonnegative, one can use this expression to build an embedding Z : R^d → R^D, with ⟨Zx, Zy⟩ ≈ k(x, y):
– Select N randomly with P[N = n] ∝ 1/p^{n+1};
– Select ω1, . . . , ωN ∈ {±1}^d randomly;
– Let Z1(x) = √(a_N p^{N+1}) ∏_j ωj′x;
– Do the same for the other coordinates Z2, . . . , ZD.

An introduction to random indexing M. Sahlgren

LSA (latent semantic analysis) builds a word-document matrix and computes its (truncated) SVD to produce a lower-dimensional representation of the words. However, the initial word-document matrix can be too large. Instead, one can use a random projection (Johnson-Lindenstrauss): it can be built incrementally, one document at a time.

A practical guide to applying echo state networks M. Lukoševičius

Implementation advice for reservoir computing.

fastFM: a library for factorization machines I. Bayer

C library, with Python bindings (contrary to libfm). Factorization machines are not limited to recommendation systems, but can add interactions to any model.

Tensor machines for learning target-specific polynomial features J. Yang and A. Gittens

Tensor machines are a generalization of factorization machines with

$$f(x) = w_0 + \langle w_1, x \rangle + \sum_{p=2}^{q} \sum_{i=1}^{r} \big\langle w_1^{pi} \bullet \cdots \bullet w_p^{pi}, \; x \bullet \cdots \bullet x \big\rangle$$

where · • · is the outer product and w_1^{pi} • · · · • w_p^{pi} is a rank-1 tensor. The model is fitted by minimizing

$$\frac{1}{n} \sum_i \operatorname{loss}\big( f(x_i), y_i \big) + \lambda \|w\|_2^2 .$$

(Kar-Karnick features are similar, but w is constrained to be in the subspace spanned by a sufficiently large number of random tensors.)

Score function features for discriminative learning: matrix and tensor frameworks M. Janzamin et al. (2015)

To build features:
– Estimate the probability distribution p(x) of the (unlabelled) data x;
– Compute the score functions S_m(x) = (−1)^m ∇^m p(x) / p(x);
– Given labeled data (x, y), we want some information about G(x) = E[y|x]; we can estimate the expectation of its derivatives: E[∇^m G(x)] = E[y S_m(x)];
– These are (huge) tensors: compute a low rank approximation (SVD, CP), E[∇^m G(x)] ≈ ∑_j u_j^{⊗m} (each tensor product only involves a single vector because the tensors are symmetric);
– Use σ(x · u_j) as features where σ is a sigmoid function.

Picture: a probabilistic programming language for scene perception T.D. Kulkarni et al.

A scene description language can be turned into a probabilistic programming language; do not use a fully-rendered (photorealistic) scene to compare the model with the data pixel by pixel: use features instead (from computer vision or deep learning).

Slice sampling for probabilistic programming R. Ranca (2015)

To sample from a probability distribution only known up to a multiplicative factor, p(x) ∝ p∗(x), slice sampling proceeds iteratively:
– Pick ut ∼ U(0, p∗(xt));
– Pick xt+1 ∼ U({x : p∗(x) ⩾ ut}).

StocPy is yet another Python-based probabilistic programming language.

Count-min-log sketch: approximately counting with approximate counters G. Pitel and G. Fouquier (2015)

Count-min-sketch variant with more precision for low counts, with applications to TF-IDF matrices.
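For reference, the plain count-min sketch that the count-min-log variant builds on can be sketched as below. This is the standard structure with linear counters, not the logarithmic counters of the paper; the class name, table sizes and hashing scheme are assumptions of this example.

```python
import hashlib

class CountMinSketch:
    """Plain count-min sketch: `depth` hashed rows of `width` integer counters."""

    def __init__(self, width=1024, depth=4):
        self.width, self.depth = width, depth
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, derived by salting the item with the row number.
        digest = hashlib.blake2b(f"{row}:{item}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum is the best estimate.
        return min(self.table[row][self._index(item, row)] for row in range(self.depth))

sketch = CountMinSketch()
for word in ["a"] * 5 + ["b"] * 2:
    sketch.add(word)
```

The estimate never underestimates a count, and never exceeds the total number of insertions.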

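The two slice sampling steps described in the Ranca summary above require sampling uniformly from the slice {x : p∗(x) ⩾ u}, which is rarely available in closed form. The sketch below uses the stepping-out and shrinkage procedure commonly used for univariate slice sampling; this procedure, the function name and the default interval width are assumptions of the example, not details taken from the summary.

```python
import math
import random

def slice_sample(log_p, x0, n_samples, width=1.0, rng=None):
    """Univariate slice sampler (stepping-out and shrinkage); target density exp(log_p)."""
    rng = rng or random.Random(0)
    samples, x = [], x0
    for _ in range(n_samples):
        # u_t ~ U(0, p*(x_t)), on the log scale: log u = log p*(x) + log U(0, 1].
        log_u = log_p(x) + math.log(1.0 - rng.random())
        # Stepping out: grow an interval around x until both ends leave the slice.
        left = x - width * rng.random()
        right = left + width
        while log_p(left) > log_u:
            left -= width
        while log_p(right) > log_u:
            right += width
        # Shrinkage: draw uniformly in the interval, shrinking it after each rejection.
        while True:
            candidate = rng.uniform(left, right)
            if log_p(candidate) > log_u:
                x = candidate
                break
            if candidate < x:
                left = candidate
            else:
                right = candidate
        samples.append(x)
    return samples

# Standard normal target, known only up to a constant.
draws = slice_sample(lambda x: -0.5 * x * x, x0=0.0, n_samples=5000)
```

Working with log densities avoids underflow, and the sampler needs no tuning beyond a rough interval width.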

Supervised learning from multiple experts: whom to trust when everyone lies a bit V.C. Raykar et al. (2009)

Binary forecasts yij, for many tasks i (e.g., medical diagnosis) for several experts j can be combined (use majority voting as a baseline model) with an EM algorithm, alternately estimating the forecasts and the reliability of each expert; one can add a Bayesian prior if some experts are known to be more or less reliable.

Invariant backpropagation: how to train a transformation-invariant neural network S. Demyanov et al.

One expects neural networks to be invariant to some transformations (rotations, translations, etc.). The following weight updates achieve that (you do not have to specify the group action):

$$w \leftarrow w - \alpha \frac{\partial \text{Loss}}{\partial w} - \beta \frac{\partial}{\partial w} \left\| \frac{\partial \text{Loss}}{\partial \text{Input}} \right\|^2 .$$

Learning classifiers from synthetic data using a multichannel autoencoder X. Zhang et al.

Use an autoencoder (trained on real data) on surrogate data to make it more realistic.

Deep transform: cocktail party source separation via probabilistic re-synthesis A.J.R. Simpson

To separate two speech signals (in a mono-aural signal), learn an auto-encoder

[Figure: auto-encoder diagram with nodes labelled Speaker 1, Speaker 2 and Mixed]

[They used a male and a female speaker: their auto-encoder learned to separate gender, rather than speech.]

Explaining and harnessing adversarial examples I.J. Goodfellow et al. (2015)

For high-dimensional linear models, you can make infinitesimal changes to the input (e.g., ±1 to each 8-bit pixel of an image) that add up to a large change in the output. To mitigate the problem:
– Use “deep” neural networks (at least one hidden layer);
– Train on a mixture of adversarial and clean examples.

(RBF networks are immune to those adversarial examples, but do not generalize well.)

Neural Turing machine A. Graves et al.

A differentiable Turing machine can be trained with gradient descent to learn simple algorithms.

Long short-term memory over tree structures X. Zhu et al. (2015)

LSTM networks can be generalized from sequences to trees.

LSTM: a search space odyssey K. Greff et al.

In a recurrent net, there is feedback from the output to the input. In an LSTM net, there is feedback from (some of) the inner nodes to (themselves and) the input: they can be trained like RNNs (with the same problems).

jvmr: integration of R, Java and Scala D.B. Dahl et al.

Embed R in Scala or Java, or the reverse.

Bayesian estimation supersedes the t test J.K. Kruschke (2013)

For a Bayesian t test, check the BEST R package.

Cubist models for regression M. Kuhn et al. (2012)

Cubist models generalize decision trees:
– There is a model in each node, and each leaf (or node) is shrunk towards its parent;
– One can add boosting and/or shrink each forecast towards its nearest neighbours (in the training set or the sample to forecast).

R in finance 2015

1. Several presentations applied social network analysis (SNA) methods to financial networks.

PageRank on a graph weighted by cross-correlations can identify leaders. [Some use the HITS algorithm instead, to identify leaders and followers, extracting information from leaders and investing in followers.]

If the nodes i of a graph represent financial entities, each with a risk Ci, one can compute the following graph metrics:

$$\text{Fragility} = \frac{E[\text{degree}^2]}{E[\text{degree}]}, \qquad \text{Risk score} = \sqrt{C'AC}$$

where A is the incidence matrix. The risk score can be decomposed with Euler’s formula:

$$\text{Risk score} = \sum_i C_i \frac{\partial \text{Score}}{\partial C_i} .$$

One can then compute

$$\text{Criticality} = C \odot x$$

where x is the eigenvector centrality (not unlike PageRank) and ⊙ the elementwise product, and

$$\text{Spillover}_{ij} = \frac{\partial}{\partial C_i} \frac{\partial \text{Score}}{\partial C_j} .$$

One can decompose a graph (e.g., interbank lending market, in some country) into core and periphery, and compute various graph metrics (density, betweenness centrality, closeness centrality, transitivity, average path length, eigenvalues, etc.) over time.

One can apply network analysis (plots, centrality) on raw (SWIFT) financial transactions.

One can compute graph metrics (centrality, etc.) on graphs learnt with bnlearn.

2. Higher moments (up to the fifth...) were mentioned a couple of times.

Use differential evolution (DEoptim) to optimize the expected utility of a portfolio (using the sample distribution of past returns). The derivatives of the utility function should alternate in sign. They all have names:

U′ > 0: Non-satiation (high returns)
U′′ < 0: Risk aversion (low risk)
U′′′ > 0: Prudence (positive skew)
U′′′′ < 0: Temperance (low kurtosis)
U′′′′′ > 0: Edginess.

Use, for instance, U(wealth) = wealth/(c + wealth).

Investors have a preference for high odd moments and low even moments. The HighFreq package estimates them from OHLC data (not unlike all the volatility estimators in TTR).

3. A partially auto-regressive process is the sum of a stationary AR process and a random walk; it can be estimated with a Kalman filter. Caveats: the tests have low power; the AR component may just capture the market microstructure (e.g., the bid-ask bounce). This is implemented in the partialAR package (use the cointegration residuals, from egcm).

To identify change points in a time series, find the partition of the indices that maximizes the sample energy distance

$$\mathcal{E}(X, Y) = 2 E|X - Y|^\alpha - E|X_1 - X_2|^\alpha - E|Y_1 - Y_2|^\alpha$$

$$\hat{\mathcal{E}}(X, Y) = \frac{2}{\#B} \sum_B |x_i - y_j|^\alpha - \frac{1}{\#W_x} \sum_{W_x} |x_i - x_j|^\alpha - \frac{1}{\#W_y} \sum_{W_y} |y_i - y_j|^\alpha$$

where B is the set of between-sample pairs and Wx, Wy are the sets of within-sample pairs. The e-cp3o algorithm speeds up the computations by removing points that are unlikely to be change points and reducing the size of B, Wx, Wy. This is implemented in ecp::ecp3o.

For large but sparse VARX models (VAR with exogenous variables), add a (nested group) lasso penalty. The Hierarchical VAR (HVAR) model also adds lasso penalties to the lags. This is implemented in the BigVAR package.

By combining bits of models that are not often used together, one can build rather complicated models, e.g., a regime switching cointegration model with time-varying transition probabilities.

To detect multivariate outliers, use a robust covariance matrix, e.g., the MCD (minimum covariance determinant). Iteratively-reweighted MCD gives the correct false positive rate. This is implemented in the CerioliOutlierDetection package.

4. The PortfolioAnalytics package provides many portfolio optimization approaches, including Meucci’s fully flexible views or Almgren and Chriss’s portfolios from sorts.

Flexible asset allocation with stepwise correlation rank suggests building a portfolio one asset at a time, adding the asset with the best

w1 × momentum rank + w2 × volatility rank + w3 × correlation rank,

where the correlation is the average correlation with the assets already in the portfolio.

It is safer to build a portfolio from a few (smart beta) ETFs than from a large number of stocks: lower estimation error and fewer parameters lead to better out-of-sample performance.

Taxes have an impact on the optimal multi-period 2-asset portfolio (Kelly principle).

5. mVaR and mES (Cornish-Fisher value at risk and expected shortfall) are estimators: we can compute their variance and bias (a computer algebra system (CAS) is useful – they used the Matlab symbolic toolbox, but you may want to check SAGE, Maxima or Yacas) and compare them with a maximum likelihood estimator (asymptotically no bias and lowest variance). Also check qrmtools::ARA and the SharpeR package. mVaR and mES are bad for: low α, fat tails, large samples.

One can still decompose the VaR into a sum of contributions (risk factors, managers) in the presence of non-linear assets (e.g., with the delta-gamma approximation).

One can compute an upper bound on the expected Sharpe ratio; it is unclear if/how it depends on the number of assets.

6. The highfrequency package provides functions to align non-synchronous time series (wait until there has been a new observation for all time series), compute volatility or jump estimators (realized bipower variation, MedRV, ROWVar), tests for the presence of jumps, etc.

Here is a stylized fact for high frequency data:

Number of transactions ∝ USD Volume^{2/3} × Volatility.

Wy

Article and book summaries by Vincent Zoonekynd

154/587

The creditr package provides CDS computations, with a Shiny interface, not unlike the Bloomberg or Markit CDS screen/calculator.

7. The Rborist package is similar to RandomForest, but faster; it can use several cores (or a GPU).

The irlba package computes the fast truncated SVD.

8. data.table still has an edge on dplyr:

– The data is indexed;
– It provides fast aggregation, and ordered joins (aka rolling joins, locf);
– fread is a fast CSV reader (but h2o.importFile is even faster).

The development version of DBI (finally) allows for parametric queries (prepared statements). The readr package provides faster functions to read CSV files (but fread is even faster). The httr, xml2, rvest, jsonlite packages were also mentioned; dplyr, reshape2, etc. were not.

Rcpp provides better integration with RStudio, annotations (e.g., // [[Rcpp::export]]), C++11 (closures, type inference), direct access to Boost even on Windows (BH).

Rblpapi is a newer interface to Bloomberg, in C++ rather than Java. It does not work well on Windows.

Some suggest to describe the data pipeline in a purely declarative way; to facilitate deployment, this should also include the data cleaning part.

The simsalapar package can help you (easily) parallelize computations – when you have to run the same computations/simulations many times, with different parameter values. An example for VaR computations with nested Archimedean copulas (copula::onacopula) was given. [I prefer to use doParallel (and foreach) directly.]

The computation of the expected improvement in dominated hypervolume of Pareto front approximations M. Emmerich et al. (2008)

For multi-objective Bayesian optimization, replace the expected improvement (EI) with the expected increase in the hypervolume dominated by the Pareto set. While a Monte Carlo approach is possible, it can be computed exactly: decompose the space into boxes, each corresponding to a different shape for the new efficient frontier, and compute the expected improvement conditional on the new point being in a given box.

Faster computation of expected hypervolume improvement I. Hupkens et al. (2014)

Expensive multiobjective optimization for robotics M. Tesch et al. (2013)

Bayesian multiobjective optimization to steer snake robots, to optimize both speed and head camera stability: replace the expected improvement (EI) with the expected improvement in (dominated) hypervolume (EIHV). Also gives another test problem, built from the Branin function.

A tutorial on Bayesian optimization of expensive cost functions, with application to active user modeling and hierarchical reinforcement learning E. Brochu et al. (2010)

Good review article.

ParEGO: a hybrid algorithm with online landscape approximation for expensive multiobjective optimization J. Knowles (2004)

The following multiobjective optimization algorithm is more economical (in function evaluations) than NSGA-II:

– Initialize the population with a Latin hypercube;
– Take a weight vector λ ∈ ∆n at random (a new one for each iteration, possibly with some of the coordinates set to zero);
– Consider the objective function
$$f(x) = \max_i \lambda_i f_i(x) + \rho \sum_i \lambda_i f_i(x);$$
– Model it using a Gaussian process (if there are more than 80 points in the population, take 80 at random);
– Use some optimization algorithm (say, Nelder-Mead) to maximize the expected improvement;
– Add the new point to the population;
– Iterate.

Multiobjective optimization on a budget of 250 evaluations J. Knowles and E.J. Hughes (2005)

The paper also provides a few test functions.

ParEGO performs better than a binary-search-based algorithm:

– Pick a point at random in the largest hypercube, and split the hypercube at that point in the direction that yields the most "cube-like" sub-spaces;
– Idem with a point close to a "good point", for some random aggregation of the scores, e.g. $\sum_i \lambda_i f_i(x)$, λ ∼ U(∆n);
– Alternate between those steps with a predefined exploration/exploitation ratio.

Multiobjective optimization of urban wastewater systems using ParEGO: a comparison with NSGA-II G. Fu et al. (2008)

Another comparison of ParEGO and NSGA-II.
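The scalarization step in the ParEGO summary above (a random weight vector on the simplex, then max plus ρ times the weighted sum) can be sketched as follows. This is only an illustration: the function names are hypothetical, the value ρ = 0.05 is an assumed default, and the weights are drawn uniformly on the simplex, which may differ from the sampling scheme of the paper.

```python
import random


def random_simplex_weights(n, rng):
    """Draw a weight vector uniformly on the simplex (normalised exponentials)."""
    draws = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def parego_scalarize(objectives, weights, rho=0.05):
    """Scalarized objective: max_i(w_i * f_i) + rho * sum_i(w_i * f_i)."""
    terms = [w * f for w, f in zip(weights, objectives)]
    return max(terms) + rho * sum(terms)


rng = random.Random(0)
lam = random_simplex_weights(3, rng)            # a new vector at each iteration
value = parego_scalarize([0.2, 0.5, 0.1], lam)  # single objective to be modelled
```

The scalar `value` is what the Gaussian process would then be fitted to; the surrogate model and the expected-improvement maximization are left out of this sketch.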


Efficient global optimization of expensive black-box functions D.R. Jones et al. (1998)

Not very different from DACE: start with a Latin hypercube; check the goodness of fit of the model (and transform the variables if needed); compute the expected improvement in closed form.

Efficient global optimization (EGO) for multi-objective problem and data mining S. Jeong and S. Obayashi (2005)

BOA: the Bayesian optimization algorithm M. Pelikan et al. (1999)

Variant of CMA-ES with Bayesian networks (suitable for discrete variables) instead of a Gaussian distribution.

Bayesian optimization with inequality constraints J.R. Gardner et al. (2014)

Constrained Bayesian optimization,

Find x
To minimize f(x)
Such that g(x) ⩾ 0,

where both the objective f and the constraints g are expensive to compute, is a straightforward generalization of Bayesian optimization: replace the expected improvement $E[(f(x^+) - \hat f(x))_+]$ with the expected constrained improvement $E[\mathbf 1_{\hat g(x) \geqslant 0} \cdot (f(x^+) - \hat f(x))_+]$. One often assumes conditional independence: $\hat g(x) \perp\!\!\!\perp \hat f(x) \mid x$.
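For the expected constrained improvement of Gardner et al. above, here is a minimal sketch under the conditional independence assumption: the expectation then factors into the usual closed-form expected improvement times the probability that the constraint holds. It assumes Gaussian posteriors for f and g at the candidate point; all function and parameter names are hypothetical.

```python
import math


def norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def expected_improvement(f_best, mean, sd):
    """E[(f_best - F)_+] for F ~ N(mean, sd^2) (minimization)."""
    if sd <= 0.0:
        return max(f_best - mean, 0.0)
    z = (f_best - mean) / sd
    return (f_best - mean) * norm_cdf(z) + sd * norm_pdf(z)


def prob_feasible(g_mean, g_sd):
    """P[G >= 0] for G ~ N(g_mean, g_sd^2)."""
    if g_sd <= 0.0:
        return 1.0 if g_mean >= 0.0 else 0.0
    return norm_cdf(g_mean / g_sd)


def constrained_ei(f_best, mean, sd, g_mean, g_sd):
    """Expected constrained improvement, assuming f and g are independent given x."""
    return expected_improvement(f_best, mean, sd) * prob_feasible(g_mean, g_sd)


score = constrained_ei(f_best=0.0, mean=0.0, sd=1.0, g_mean=0.0, g_sd=1.0)
```

Here `f_best` plays the role of f(x⁺), the best feasible value observed so far.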

The paper by S. Jeong and S. Obayashi (2005), on EGO for multi-objective problems and data mining, is another EGO example.

Bayesian optimization algorithms for multi-objective optimization M. Laumanns and J. Ocenasek (2002)

The multiobjective Bayesian optimization algorithm works as follows:

– Keep an archive of best solutions (all those that ε-dominate the solutions seen so far, but allow some dominated solutions to have at least k of them);
– Model the archive, not unlike CMA-ES, but with decision trees (one for each variable – Bayesian networks proved too complex) instead of a Gaussian distribution;
– Sample candidates from this model.

Multi-objective Bayesian optimization algorithm N. Khan et al. (2002)

BOA can be generalized to multiobjective optimization: use non-dominated sorting and the crowding distance to select the n best solutions.

Multiobjective evolutionary algorithm test suites D.A. van Veldhuizen and G.B. Lamont (1999)

Scalable test problems for evolutionary multi-objective optimization K. Deb et al. (2001)

Combining multiobjective optimization and Bayesian model averaging to calibrate forecast ensembles of soil hydraulic models T. Wöhling and J.A. Vrugt (2008)

Bayesian optimization with Gaussian processes replaced by a (BMA) ensemble of domain-specific (PDE) models.

Pareto front modeling for sensitivity analysis in multi-objective Bayesian optimization R. Calandra et al. (2014)

In multiobjective Bayesian optimization (MOBO), do not return a discrete set of solutions but, since a model was used for the optimization, a model of the Pareto front.

Design and analysis of computer experiments J. Sacks et al. (1989)

One of the first papers on Bayesian optimization.

A fast and elitist multiobjective genetic algorithm: NSGA-II K. Deb et al. (2002)

The non-dominated sorting genetic algorithm II (NSGA-II) solves multi-objective optimization problems, i.e., estimates their non-dominated front.

1. The candidate solutions are sorted by first finding the non-dominated front, removing it, and iterating. The non-dominated rank can be computed efficiently:

– For each solution, compute the domination count, i.e., the number of solutions that dominate it;
– For each solution, compute the set of solutions it dominates;
– The solutions of the first non-domination front have domination count zero: remove them;
– Update the domination counts and the domination sets;
– Iterate.

2. To preserve diversity, look at the crowding distance of each solution, defined as the average side length of the cuboid around it with (half) its vertices among the other solutions.

3. Define a partial order on the solutions:

– Prefer solutions with a lower non-domination rank;
– If the ranks are equal, prefer solutions with a higher crowding distance.

4. One can add constraints:

– Prefer feasible solutions;
– When two solutions are infeasible, prefer that with the smaller constraint violation.

The paper also contains reference problems.

GPfit: an R package for Gaussian process model fitting using a new optimization algorithm B. MacDonald et al. (2013)

When fitting a Gaussian process on non-noisy data, with a Gaussian correlation function

$$R_{ij} = \operatorname{Cor}(y_i, y_j) = \prod_k \exp\bigl(-\theta_k |x_{ik} - x_{jk}|^2\bigr), \qquad \theta_k > 0,$$

the correlation matrix can be ill-conditioned (some of the points are too close), and the likelihood has local extrema very close to zero. One can replace R with R + δI, with the smallest nugget δ that makes the matrix well-conditioned [this is the idea behind ridge regression; this is also what you naturally do when the eigenvalues are too close to zero], and reparametrize θ as θ = exp ϕ.
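To make "the smallest nugget that makes the matrix well-conditioned" concrete, here is a short derivation. It is a sketch under stated assumptions: R is symmetric positive semi-definite with extreme eigenvalues λ_min and λ_max, the condition number is the 2-norm one, and κ* > 1 is a user-chosen threshold (a hypothetical symbol, not a value from the summary).

```latex
\kappa(R + \delta I)
  = \frac{\lambda_{\max} + \delta}{\lambda_{\min} + \delta} \le \kappa^*
\iff
\lambda_{\max} - \kappa^* \lambda_{\min} \le (\kappa^* - 1)\,\delta
\iff
\delta \ge \frac{\lambda_{\max} - \kappa^* \lambda_{\min}}{\kappa^* - 1},
\qquad\text{hence}\qquad
\delta_{\min} = \max\!\left( 0,\; \frac{\lambda_{\max} - \kappa^* \lambda_{\min}}{\kappa^* - 1} \right).
```

No nugget is needed (δ_min = 0) when R already satisfies κ(R) ⩽ κ*.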

How do fiscal and technology shocks affect real exchange rates? New evidence for the United States Z. Enders et al. (2008)

More details on SVAR models with sign restrictions: allowance for small deviations to the sign constraints, use of the posterior distribution to test the sign of the non-constrained elements.

Scaling log-linear analysis to datasets with thousands of variables F. Petitjean and G.I. Webb (2015)

Given M qualitative variables V1, . . . , VM, log-linear analysis (LLA) greedily builds a model

$$P[V_1 = a_1, \dots, V_M = a_M] = u + \sum_i u_i(a_i) + \sum_{i<j} u_{ij}(a_i, a_j) + \cdots$$

bb ub )+ ; the saddle point problem – Permutation PCA L(u, a)∑= i (1 − uai + uai+1 )+ ; in high-dimensional non-convex optimization – Ranking PCA L(u, a) = i k =⇒ m(A) = 0; 0-additive measures are (additive) measures. The Choquet integral generalizes the weighted average (a way of aggregating several criteria) ∑ µ({c})f (c) c∈C

from (additive) measures to non-additive measures. Intuitively, it allows for interactions between the criteria being averaged. For f : C → R₊, with the criteria sorted so that 0 ⩽ f(c₍₁₎) ⩽ · · · ⩽ f(c₍ₘ₎) and the convention f(c₍₀₎) = 0,

$$C_\mu(f) = \sum_{i=1}^m \bigl[ f(c_{(i)}) - f(c_{(i-1)}) \bigr]\, \mu(\{c_{(i)}, \dots, c_{(m)}\}) = \sum_{T \subset C} m(T) \times \min_{c \in T} f(c),$$

where m is the Möbius transform of µ,

$$m(A) = \sum_{B \subset A} (-1)^{|A| - |B|} \mu(B).$$

In logistic regression, we can replace w′x with γ(Cµ(x) − β), where x is seen as a function i ↦ xi and µ is k-additive, with k small. The log-likelihood is concave.

Missing value estimation for DNA microarray gene expression data: local least squares imputation H. Kim et al.

This is a refinement of nearest neighbour imputation, which uses an equal- (or empirically) weighted average of gi1, . . . , gik.

SVD imputation works well for low-noise time series:

– Fill in the missing values with row averages;
– Compute the eigenvectors;
– Regress g1 against the (first) eigenvectors;
– Predict the missing values;
– Iterate until convergence.

Mendelian randomization and causal inference in observational epidemiology N.A. Sheehan et al. (2008)

Mendelian randomization is another name for instrumental variables (IV): find a variable IV so that

IV ⊥⊥ confounding factors
IV −→ possible cause
IV ⊥⊥ consequence | possible cause

[Diagram: causal graph with nodes IV, possible cause, consequence and confounding factors.]

The IV is often a gene. Gene interactions complicate the picture.


Simulation of fractional Brownian motion T. Dieker (2004)

A stationary stochastic process X is self-similar if $m^{1-H} X^{(m)}$ and $X_1$ have the same distribution, where $X^{(m)} = \frac1m (X_1 + \cdots + X_m)$. It has long-range dependence if $\sum_k \gamma_k$ diverges, where $\gamma_k = \operatorname{Cov}(X_0, X_k)$. A standard fractional Brownian motion is characterized by:

– B(t) has stationary increments;
– B(0) = 0, EB(t) = 0;
– EB(t)² = t^{2H};
– B(t) is Gaussian.
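The properties above determine the covariance: with stationary increments, E[B(s)B(t)] = ½(E B(s)² + E B(t)² − E(B(t) − B(s))²) = ½(s^{2H} + t^{2H} − |t − s|^{2H}). As a minimal sketch (not the method of the paper), one can sample B(1), …, B(n) exactly by multiplying standard normal draws by a Cholesky factor of this covariance matrix. The function names are hypothetical and the textbook O(n³) factorization is only suitable for small n.

```python
import math
import random


def fbm_cov(s, t, hurst):
    """Cov(B(s), B(t)) = (s^2H + t^2H - |t - s|^2H) / 2."""
    h2 = 2.0 * hurst
    return 0.5 * (s ** h2 + t ** h2 - abs(t - s) ** h2)


def cholesky(a):
    """Lower-triangular L with L L' = a (a symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            acc = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(acc) if i == j else acc / low[j][j]
    return low


def simulate_fbm(n, hurst, rng):
    """Sample (B(1), ..., B(n)) as L z, with z standard normal."""
    times = range(1, n + 1)
    cov = [[fbm_cov(s, t, hurst) for t in times] for s in times]
    low = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [sum(low[i][k] * z[k] for k in range(i + 1)) for i in range(n)]


path = simulate_fbm(20, 0.7, random.Random(1))
```

For H = 1/2 the covariance reduces to min(s, t), i.e., ordinary Brownian motion.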

Its increments X_k = B(k + 1) − B(k) are fractional Gaussian noise. These are Gaussian processes with covariance functions

$$E[B(s)B(t)] = \tfrac12 \bigl[ t^{2H} + s^{2H} - |t - s|^{2H} \bigr]$$

$$\gamma_k = E[X_n X_{n+k}] = \tfrac12 \bigl[ |k - 1|^{2H} - 2 |k|^{2H} + |k + 1|^{2H} \bigr].$$

There is no closed form for the spectral density of fractional Gaussian noise,

$$f(\lambda) = \sum_{k \in \mathbf Z} \gamma_k e^{ik\lambda} = 2 \sin(\pi H)\, \Gamma(2H + 1)\, (1 - \cos\lambda) \bigl( |\lambda|^{-2H-1} + \cdots \bigr),$$

but approximations are available.

ARFIMA processes, (1 − L)^d X_k ∼ ARMA(p, q), also have long memory; the fractional integration is defined by

$$(1 - L)^d = \sum_{n \geqslant 0} \binom{d}{n} (-L)^n, \qquad \binom{d}{n} = \frac{\Gamma(d + 1)}{\Gamma(d - n + 1)\Gamma(n + 1)}.$$

The Hosking method simulates a general stationary Gaussian process iteratively, in O(n²): since

$$V_{1:n+1} = \operatorname{Var} X_{1:n+1} = \bigl( \gamma_{|i-j|} \bigr)_{1 \leqslant i,j \leqslant n+1}$$

and E[X_{1:n+1}] = 0, we know that X_{n+1} | X_{1:n} = x_{1:n} ∼ N(µ, v), where µ and v are easy to compute (because of the shape of V_{1:n}).

The Choleski method can simulate arbitrary Gaussian processes; the Choleski matrix can actually be computed explicitly, and this gives an O(n³) incremental algorithm.

The Davies and Harte method embeds the covariance matrix into a circulant matrix, with first row

$$(\gamma_0, \gamma_1, \dots, \gamma_{n-1}, 0, \gamma_{n-1}, \dots, \gamma_1),$$

which can be efficiently inverted (FFT), to find a square root in O(n log n).

There are approximate simulation methods (i.e., they sample from a distribution which is not exactly a fractional Brownian motion), using:

– The expression of a fractional Brownian motion as a stochastic integral;
– Queueing theory;
– Random midpoint displacement, i.e., replacing X_{n+1} | X_{1:n} with X_n | X_{i1}, . . . , X_{ik} in the Hosking or Choleski method;
– Spectral analysis: a stationary Gaussian process can be expressed as a stochastic integral involving its spectral density f,
$$X_n = \int_0^\pi \sqrt{\frac{f(\lambda)}{\pi}} \cos(n\lambda)\, dB_1(\lambda) - \int_0^\pi \sqrt{\frac{f(\lambda)}{\pi}} \sin(n\lambda)\, dB_2(\lambda);$$
– Wavelets: the wavelet coefficients of a fractional Brownian motion are almost independent and almost Gaussian, with zero mean and known variance;
– Fractional integration of white noise (white noise is the "derivative" of Brownian motion, seen as a generalized process, aka tempered distribution);
– Series expansions.

The Hurst exponent can be estimated in many ways:

– Aggregated variance, $\operatorname{Var} X^{(m)} = m^{2H-2} \operatorname{Var} X$, or absolute moments, $E\bigl| X^{(m)} - \bar X^{(m)} \bigr|^k \sim m^{k(H-1)} E|X - EX|^k$, but, in presence of autocorrelation, the finite-sample estimators are biased;
– Higuchi estimator: absolute moment estimator, with k = 1 and overlapping windows;
– The slope of the periodogram;
– The variance of the regression residuals, on a moving window (i.e., the aggregated variance, after detrending);
– R/S analysis;
– Likelihood estimator: (X_1, . . . , X_N) ∼ N(0, Γ_H), where Γ_H is known (as a function of H) – unfortunately, det Γ_H and Γ_H^{−1} are expensive to compute;
– Whittle estimator: compare the periodogram I(λ) with the spectral density f_H(λ),
$$\hat H = \operatorname*{Argmin}_H \sum_k \frac{I(\lambda_k)}{f_H(\lambda_k)};$$

– Local Whittle estimator: replace f_H(λ) with C |λ|^{1−2H};
– Variance of the wavelet coefficients (no detrending is needed if the wavelet has enough vanishing moments).

Speeding up convolutional neural networks using fine-tuned CP-decomposition V. Lebedev et al. (2015)

In image processing, convolutional neural networks (CNN) perform the convolution of an image with a 4-dimensional tensor:

x_{ijk} = pixel at coordinates (i, j), in RGB plane k, for one of the patches
$$y_\ell = b_\ell + \sum_{ijk} a_{ijk\ell}\, x_{ijk}$$
= ℓth coordinate of the output, for this patch

The tensor can be replaced by a low-rank approximation, e.g., the CP-decomposition (a higher-order analogue of PCA).

FitNets: hints for thin deep nets A. Romero et al. (2015)

To have a small network (e.g., a thin but deep one) mimic a large one (wide and deep), do not only learn the outputs, also use the intermediate representations.

Effective use of word order for text categorization with convolutional neural networks R. Johnson and T. Zhang

Convolutional neural networks (CNN), often used to process images, apply convolutional and pooling layers (one or several pairs) to the data before passing it to a linear classifier.

– A convolutional layer takes k × k (overlapping) regions of the image and gives each of them to m neurons; the weights are shared across regions; each region, a 3k²-dimensional vector, becomes an m-dimensional vector;
– A pooling layer takes the output of a convolutional layer, cuts it into non-overlapping regions, and aggregates them (max-pooling or average pooling), so that each region becomes an m-dimensional vector.

This can also be applied to text:

– Encode length-k regions as bags of words (i.e., |V|-dimensional vectors) or using indicator variables (i.e., k|V|-dimensional vectors);
– To deal with variable sentence or text length, fix the number of pooling units and dynamically select the pooling region size;
– Use linear rectifiers.

Link analysis, eigenvectors and stability A.Y. Ng et al.

Hits and PageRank are eigenvalue problems: if the eigengap (the difference between the two largest eigenvalues) is small, the first eigenvector is not stable. Hits is unstable if the maximum degree is large; PageRank is not robust to perturbations to pages with a high score. Latent semantic indexing (LSI) (compute the left and right singular vectors of the binary term-document matrix) is a special case of Hits and has the same problem.

Stable algorithms for link analysis A.Y. Ng et al. (2001)

The randomized Hits algorithm is the Hits algorithm, expressed as a random walk with teleportation, à la PageRank (teleportation increases stability). The subspace Hits algorithm uses the first k eigenvectors and defines the scores as

$$a_j = \sum_{i=1}^k f(\lambda_i) \langle e_j, x_i \rangle^2$$

where e_j are the basis vectors (zeroes except in position j), f(λ) = λ² (but other weight functions are possible), x_i are the eigenvectors and λ_i the eigenvalues.

Intriguing properties of neural networks C. Szegedy et al. (2014)

Deep neural networks are discontinuous: it is possible to manufacture an imperceptible alteration of an image to misclassify it.

While it is possible to interpret individual hidden neurons, it is equally easy to interpret random linear combinations of neurons.

On the effective measure of dimension in the analysis cosparse model R. Giryes et al. (2014)

Compressed sensing tries to reconstruct a vector x so that y = Ax + noise, assuming x is sparse. The cosparse model tries to reconstruct a vector x so that y = Ax + noise assuming Bx is sparse – for instance, x could be an image and Bx its gradient.

Adaptive compressed sensing for estimation of structured sparse sets R.M. Castro and E. Tánczos (2014)

Compressed sensing, i.e., the reconstruction of a signal from noisy and partial observations, under the assumption that it is sparse, can be generalized to more general supports, e.g., subsets of size s of ⟦1, n⟧, subintervals of length s of ⟦1, n⟧, disjoint unions of k such intervals, s-element stars in a graph, etc.

An exact mapping between the variational renormalization group and deep learning P. Mehta and D.J. Schwab (2014)

The renormalization group is the hierarchical description of a physical phenomenon by progressively integrating out (marginalizing over) small features ("iterative coarse-graining scheme"). It is not unlike deep networks, which integrate small features into larger ones, from layer to layer.

Taming the monster: a fast and simple algorithm for contextual bandits A. Agarwal et al. (2014)

The ε-greedy or epoch-greedy algorithms have a suboptimal regret. Randomized UCB has an optimal regret, but it is complex and slow: it can be simplified and sped up.

Learning to discover social circles in ego networks J. McAuley and J. Leskovec

A generalization of BigCLAM that allows overlapping or nested communities (not unlike topics) and uses both graph structure and node features.

Fast, simple and accurate handwritten digit classification using extreme learning machines with shaped input-weights M.D. McDonnell et al. (2014)

Extreme learning machines (ELM, echo nets – neural nets whose hidden layer is random and not learned) perform as well as deep nets in the digit classification task:

– Augment the training set by distorting it;
– Each hidden unit only operates on a randomly sized and positioned patch of the image.

Selfieboost: a boosting algorithm for deep learning S. Shalev-Shwartz

Adaboost sequentially fits a model, assigning more weight to currently misclassified observations, and returns a linear combination of the models. Selfieboost only returns the last one; to avoid losing the performance of the earlier models, it adds a penalty to keep the new model forecasts close to the previous ones.

Move evaluation in Go using deep convolutional networks C. Maddison et al. (2015)

The positions on a Go board are not unlike the pixels of an image: use convolutional neural networks to evaluate a position or a move, using training data from the KGS server.

Teaching deep convolutional neural networks to play Go C. Clark and A. Storkey

More details.

Efficient gradient-based inference through transformations between Bayes nets and Neural nets D.P. Kingma and M. Welling (2014)

Bayesian networks and neural networks are the same thing: each can (often) be transformed into the other.

Hogwild!: a lock-free approach to parallelizing stochastic gradient descent F. Niu et al.

Stochastic gradient descent (SGD) is often easy to parallelize: reformulate the objective function to make it almost separable (a large sum of functions of a small number of variables – that is usually possible if the input is sparse) and forget about concurrency problems (let the processors overwrite each other's work – it happens too rarely to have a noticeable effect).

Crypto-nets: neural networks over encrypted data P. Xie et al. (2015)

Homomorphic encryption can be used with neural networks – but since it only provides + and ×, you need to approximate the transfer function with a polynomial (the input space should be compact).

libDAI: a free and open source C++ library for discrete approximate inference in graphical models J.M. Mooij (2010)

To estimate Bayesian networks or Markov random fields (MRF). Also check OpenGM2.

LSH forest: self-tuning indexes for similarity search M. Bawa et al.

Given a locality-sensitive hash (LSH) family H = (h_i)_{i⩾1}, one usually assigns a fixed-length label g(p) = (h_1(p), . . . , h_n(p)) to each point p. Instead, one can use variable-length labels, arranged in a prefix tree, sufficiently long so that all the points have a different label. An LSH forest is a set of such LSH trees.

A performance evaluation of open source graph databases R. McColl et al. (2014)

Stinger, MTGL, Boost, Giraph and NetworkX perform well.

Trend filtering methods for momentum strategies B. Bruder et al. (2011)

Here are a few algorithms to identify trends in a time series:

– Moving average, exponential moving average;
– Local regression, which can be seen (for a rectangular kernel) as the discretization of the Lanczos derivative (a noise-robust slope)
$$f'(x) = \lim_{\varepsilon \to 0} \frac{3}{2\varepsilon^3} \int_{-\varepsilon}^{\varepsilon} t f(x + t)\, dt;$$
– Local polynomial regression, loess smoothing;
– Hodrick-Prescott filter,
$$\hat y = \operatorname*{Argmin}_y\; \| y - x \|_2^2 + \lambda \| D^2 y \|_2^2;$$
– Structural model (Kalman filter);
– Low-pass filter, from a Fourier or wavelet transform;
– Singular spectrum analysis (SSA): let
$$H = \begin{pmatrix} y_1 & \cdots & y_m \\ \vdots & & \vdots \\ y_n & \cdots & y_t \end{pmatrix}$$
and perform a dimension reduction on H;
– Support vector machines, Y ∼ time, with radial basis functions (RBF), and their standard deviation as smoothing parameter;
– Empirical mode decomposition (EMD), Hilbert filter;
– Gaussian processes (not mentioned).

Multivariate filtering methods were listed but not detailed: Kalman filter (with a small number of stochastic components), error correction models, permanent-transitory decomposition.

Chaos in economics and finance D. Guégan (2009)

Let X0 be a random variable and X_n = f(X_{n−1}) a deterministic dynamic system. To study time series using this model:

– Estimate the embedding dimension;
– Estimate f and f′ (using k-NN, radial basis functions, neural nets, etc.);
– Estimate the Lyapunov exponent
$$\lambda = \lim_{n \to \infty} \frac1n \sum_{1 \leqslant t \leqslant n} \ln |f'(X_t)|,$$
i.e., the speed at which trajectories diverge: ∥x_n − y_n∥ ≈ e^{λn} ∥x_0 − y_0∥;
– Test if it is positive.

Computing present values: capital budgeting done correctly R. Jarrow (2014)

The value V_t of an asset is related to its expected returns as follows.

$$1 + \mu_{T-1} := E_{T-1}\left[ \frac{V_T}{V_{T-1}} \right]$$
$$V_{T-1} = E_{T-1}\left[ \frac{V_T}{1 + \mu_{T-1}} \right]$$
$$V_{T-2} = E_{T-2}\left[ \frac{V_{T-1}}{1 + \mu_{T-2}} \right] = E_{T-2}\left[ \frac{1}{1 + \mu_{T-2}} E_{T-1}\left[ \frac{V_T}{1 + \mu_{T-1}} \right] \right] = E_{T-2}\left[ \frac{V_T}{(1 + \mu_{T-2})(1 + \mu_{T-1})} \right]$$
$$V_t = E_t\left[ \frac{V_T}{(1 + \mu_t) \cdots (1 + \mu_{T-1})} \right]$$

(This is the real-world measure, P.) Some textbooks incorrectly price stochastic cash flows as

$$V_0^{\text{wrong}} = \frac{E[V_T]}{(1 + E\mu_0) \cdots (1 + E\mu_{T-1})}.$$

Correlation in the magnitude of financial returns J. Hämäläinen (2014)

Look at Cor(|r_i|, |r_j|).

Understanding the relationship of momentum with beta T. Cenesizoglu et al. (2014)

Beta explains momentum performance.

The stability and accuracy of credit ratings P. Viegas de Carvalho et al. (2014)

Reratings are not related to changes in the probability of default, but rather to changes in the relative probability of default. Ratings depend on the business or economic cycle.

Data science at the command line J. Janssens (2014)

Traditional Unix tools (grep, sort, uniq, head, seq, paste, shuf, parallel, etc.) can be complemented with more recent or specialized tools to process data (download, clean, explore, model) such as csvkit (in particular csvlook and csvsql), jq (to transform JSON data, à la XPath), json2csv, scrape (to transform XML/HTML data, using XPath or CSS selectors), Rio (to use R as a filter, to convert a CSV file into another one, or into a plot).

Some of those tools are not standard and were developed by the author for the book. All are available in a (Vagrant) virtual machine.

The book suggests Drake to manage data science workflows and gives a few examples of actual computations, with Tapkee (a C++ library, part of Shogun, for dimension reduction: PCA, MDS, t-SNE, LLE, isomap), Weka, R (via Rio), SKLL, BigML – but no mention of Vowpal Wabbit, libsvm, liblinear or libfm. Some traditional and useful Unix commands were also missing: xargs (mostly replaced by parallel), dc (but many prefer bc), file, screen, tmux, csplit (though split was mentioned).

Big data analytics summer school (Caltech, JPL, Coursera, 2014)

The k-nearest neighbours (kNN) algorithm can be implemented as follows:

– In dimension 1, with binary search;
– In dimensions 2 to 8, with a kd-tree;
– In dimensions greater than 8, with locality-sensitive hashing (LSH) – this only provides approximate neighbours.

In much higher dimensions, kNN no longer works: all the distances are approximately equal (curse of dimensionality).

The Euclidean distance is not a good choice for distance-based algorithms in presence of correlations or differences of scale. One could use the Mahalanobis distance but, for supervised learning, this is suboptimal. For instance, for a classification problem, multiclass discriminant analysis (a form of metric learning) looks for the distance that best separates the classes, i.e., maximizes

$$\gamma(V) = \frac{|V' \Sigma_b V|}{|V' \Sigma_w V|}$$

where
V = linear projection of the data
Σb = between scatter matrix
Σw = within scatter matrix.

The course also covered the following topics:

– Dimension reduction: kNN, feature selection, PCA, metric learning, kernel PCA (kPCA);
– Bootstrap (classical, N out of M, subsampling);
– Visualization;
– Random forests (the sample proximity is the number of trees in which samples i and j end up in the same leaf).

Mining massive datasets J. Leskovec, A. Rajaraman, J. Ullman (Stanford, Coursera, 2014)

The complexity of a map-reduce algorithm should be measured by the communication cost – that is the main bottleneck, and it is reasonable to assume that all the data has to be moved around. For instance, matrix multiplication is more efficient with a 2-step algorithm: first divide the matrix into blocks (tiles) and compute the partial sums, then aggregate them.

Page-rank can be personalized by teleporting to a random page in a set of reference pages (e.g., the DMOZ pages for the topic of interest) instead of a random page; it can be made more robust to spam farms by teleporting to a random page in a set of trusted pages (e.g., edu domains).

Min-hashing is an approximate test for set equality – a locality-sensitive hashing (LSH) function for sets. Let Ω = {1, . . . , n} be a set (in practice, it will be a set of words or n-grams); a random permutation σ ∈ S(Ω) defines a (random) hash function

$$h_\sigma(C) = \inf\{\, i : \sigma(i) \in C \,\}, \qquad C \subset \Omega.$$

The probability that two subsets C1 and C2 agree on this hash function is their Jaccard similarity,

$$P[h_\sigma(C_1) = h_\sigma(C_2)] = J(C_1, C_2) = \frac{|C_1 \cap C_2|}{|C_1 \cup C_2|}.$$

A family H of hash functions h : S −→ X, where (S, d) is a metric space and P a probability on H, is said to be (d1, d2, p1, p2)-sensitive if

∀x, y ∈ S   d(x, y) ⩽ d1 =⇒ P[h(x) = h(y)] ⩾ p1
∀x, y ∈ S   d(x, y) ⩾ d2 =⇒ P[h(x) = h(y)] ⩽ p2.
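The min-hash function h_σ(C) = inf{ i : σ(i) ∈ C } described above can be checked empirically: the fraction of random permutations on which two sets agree should approach their Jaccard similarity. A minimal sketch, with hypothetical function names and a universe numbered from 0 (an assumption made for convenience):

```python
import random


def jaccard(a, b):
    """Exact Jaccard similarity |a ∩ b| / |a ∪ b|."""
    return len(a & b) / len(a | b)


def make_minhash(universe_size, rng):
    """Return h_sigma: the smallest index i such that sigma(i) is in the subset."""
    sigma = list(range(universe_size))
    rng.shuffle(sigma)

    def h(subset):
        for i, element in enumerate(sigma):
            if element in subset:
                return i
        raise ValueError("the subset must be non-empty")

    return h


def estimate_jaccard(a, b, universe_size, n_hashes, rng):
    """Fraction of independent min-hash functions on which a and b agree."""
    agree = 0
    for _ in range(n_hashes):
        h = make_minhash(universe_size, rng)
        agree += h(a) == h(b)
    return agree / n_hashes


rng = random.Random(42)
c1 = set(range(0, 60))
c2 = set(range(30, 90))
exact = jaccard(c1, c2)
approx = estimate_jaccard(c1, c2, 100, 2000, rng)
```

In practice one would not store explicit permutations; hashing the elements and keeping the minimum is the usual shortcut, not shown here.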

It can be “amplified” by the and or or construction, or an and-or or or-and combination (the distances d1, d2 remain the same, probabilities are transformed by $p \mapsto p^r$ or $p \mapsto 1 - (1 - p)^r$).

  and:  h(x) = h(y)  if  $\forall i \in [\![1, r]\!] \quad h_i(x) = h_i(y)$
  or:   h(x) = h(y)  if  $\exists i \in [\![1, r]\!] \quad h_i(x) = h_i(y)$

Here are some examples of LSH families:
– For the cosine distance, project on a random hyperplane,
  $$h_v(x) = \operatorname{sign}(v \cdot x), \qquad P[h(x) = h(y)] = 1 - \frac{\theta}{\pi},$$
  where θ is the angle between x and y;
– For the Euclidean distance, project the points on a random line, and put them in buckets of length a.
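
The effect of the and/or constructions on the collision probability can be made concrete with a few lines of code. This is a minimal sketch; the parameter names `r` (hashes per group) and `b` (number of groups) are assumptions chosen for the illustration.

```python
def and_construction(p, r):
    """All r independent hashes must agree: p -> p^r."""
    return p ** r

def or_construction(p, r):
    """At least one of r independent hashes must agree: p -> 1 - (1 - p)^r."""
    return 1 - (1 - p) ** r

def and_then_or(p, r, b):
    """AND of r hashes inside each group, then OR over b groups."""
    return or_construction(and_construction(p, r), b)

def or_then_and(p, r, b):
    """OR of b hashes inside each group, then AND over r groups."""
    return and_construction(or_construction(p, b), r)

# Close pairs (p = 0.8) keep a high collision probability,
# distant pairs (p = 0.2) are pushed towards zero.
close = and_then_or(0.8, r=4, b=4)
far = and_then_or(0.2, r=4, b=4)
```

Plotting `and_then_or` against `p` gives the familiar S-shaped curve, whose steepness is controlled by `r` and `b`.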

For approximate text similarity (Jaccard similarity of the bag of words, or the bag of n-grams), one can also look at the following:
– The number of distinct words;
– The N least common words – that gives a lower bound on the distance;
– An index (word, position, number of remaining words), where “position” is the position in the bag of words, sorted with the rarest words first.

The apriori algorithm, for frequent itemset mining, proceeds in two steps (for item pairs – n steps for n-item sets): first count the number of occurrences of each item, and pick those that appear at least s times; then, count the pairs of items (but only for those found in the first step), and keep those that appear at least s times. The memory consumption can be reduced:
– PCY: in the first pass, also count pairs, but after hashing them;
– Multihash: idem, with several hash functions;
– Multistage: in pass 1.5, count the hashed pairs, but only for the items from pass 1;
– Toivonen: run the algorithm on a random subset of the data that fits in memory; enlarge the set of candidate itemsets by adding the immediate supersets of those selected by the algorithm; count them (on the whole dataset); enlarge the set of candidate itemsets if necessary; iterate until convergence.

One can model communities using a bipartite graph, with nodes and communities as vertices, and a link probability pc for each community c; a link emerges between nodes u and v with probability

$$P(u, v) = 1 - \prod_{c \,:\, u, v \in c} (1 - p_c)$$

(or Max{ε, P(u, v)}). The BigCLAM model replaces the binary community membership structure with a community strength matrix, Fuc ∈ [0, 1], and a link emerges between u and v with probability

$$P(u, v) = 1 - \prod_c \bigl(1 - P_c(u, v)\bigr),$$

where

$$P_c(u, v) = 1 - \exp(-F_{uc} \cdot F_{vc}),$$

i.e.,

$$P(u, v) = 1 - \exp(-F_u \cdot F_v').$$
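
The two expressions of the BigCLAM link probability (product over communities, and exponential of a dot product) are equal; the sketch below checks it on made-up strength vectors. The function names are illustrative only.

```python
import math

def link_probability(f_u, f_v):
    """P(u, v) = 1 - exp(-F_u . F_v'), with F_u the row of strengths of node u."""
    dot = sum(a * b for a, b in zip(f_u, f_v))
    return 1 - math.exp(-dot)

def link_probability_product(f_u, f_v):
    """Same probability, written as 1 - prod_c (1 - P_c(u, v))."""
    no_link = 1.0
    for a, b in zip(f_u, f_v):
        p_c = 1 - math.exp(-a * b)      # link created by community c
        no_link *= 1 - p_c
    return 1 - no_link

f_u = [0.5, 0.0, 1.2]   # hypothetical community strengths of node u
f_v = [0.3, 0.8, 0.4]   # hypothetical community strengths of node v
p_direct = link_probability(f_u, f_v)
p_product = link_probability_product(f_u, f_v)
```

Two nodes that share no community (orthogonal strength vectors) get probability zero, which is why the Max{ε, P(u, v)} variant is sometimes used.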

The strength matrix can be estimated as

$$F = \operatorname*{Argmax}_F \prod_{(u,v) \in E} P(u, v) \prod_{(u,v) \notin E} \bigl(1 - P(u, v)\bigr)
    = \operatorname*{Argmax}_F \sum_{(u,v) \in E} \log\bigl(1 - e^{-F_u \cdot F_v'}\bigr) - \sum_{(u,v) \notin E} F_u \cdot F_v'.$$

If all the nodes of a graph have the same degree d, 1 is an eigenvector of its adjacency matrix with eigenvalue d; if the graph is not connected, (1 ··· 1 0 ··· 0) and (0 ··· 0 1 ··· 1) are eigenvectors for eigenvalue d; if there are few edges between the clusters, those vectors are approximate eigenvectors for an eigenvalue close to d. In the general case (spectral graph partitioning), consider the Laplacian L = (deg G)Id − A; the vector 1 is an eigenvector with eigenvalue 0; the second smallest eigenvalue is

$$\lambda_2 = \min_{x \perp 1} \frac{x' L x}{x' x} = \min_{x \perp 1,\; x'x = 1} \sum_{(i,j) \in E} (x_i - x_j)^2.$$

The coordinates of that vector give the clustering: if there are two clusters, of the same size, they are {i : xi > 0} and {i : xi < 0}; in general, look for a jump in sort(x) and/or look at the third eigenvector.

Trawling finds small communities by looking for complete bipartite subgraphs Ks,t – it is a frequent itemset problem.

One can approximately count the number of 1s in a stream of bits, on a moving window of size k ≫ 1, by keeping track of the counts for blocks of sizes 2^0, 2^1, 2^2, ... and merging them when needed, but the result can be imprecise if the smaller blocks are empty; instead, one can define “size” as being the number of 1s instead of the number of positions.

To estimate the number of different elements in a stream, look at the maximum number R of leading zeroes in a binary hash of those elements (and repeat for many hash functions): 2^R is an estimator of the number of distinct elements but, unfortunately, E[2^R] = ∞. This problem can be fixed (HyperLogLog).

The singular value decomposition (SVD) gives the best rank-k approximation of a matrix, but it does not preserve sparsity. The CUR decomposition addresses this problem:
– C is obtained by taking (say) 4k columns of A, at random, sampled with probability $P(j) \propto \|A_{\cdot j}\|_{L^2}^2$ (with replacement): $C = A_{\cdot J}$;
– Similarly, R is obtained from rows of A: $R = A_{I \cdot}$;
– $U = A_{IJ}^\dagger$ (the pseudo-inverse is also computed from the SVD: if W = XSY′ is the SVD of W, then $W^\dagger = Y S^\dagger X'$, where $S = \operatorname{diag}(s_i)$, $S^\dagger = \operatorname{diag}(s_i^{-1})$ and $s_i^{-1} = 0$ if $s_i = 0$).

Content-based recommendation systems look at the features (e.g., TF-IDF, for text) of items highly rated by the user. User-user collaborative filtering finds users similar to the target (cosine similarity, or centered cosine similarity aka correlation); item-item collaborative filtering is similar. Latent factor recommendation systems decompose the user×item rating matrix (e.g., SVD with missing values).

There are big-data (1-pass) approximations of the k-means clustering algorithm:
– Read the data in batches; build the clusters progressively; for each cluster, only keep the mean, standard deviation and count;
– CURE: Run k-means on a random sample of the data; pick 4 points per cluster; move them towards the center; read all the data and assign each observation to the nearest point.

Support vector machines (SVM) are looking for the separating hyperplane with the largest margin

  Find          w, b
  To maximize   $\min_i\, (w \cdot x_i + b)\, y_i$
  Such that     ∥w∥ = 1

where yi ∈ {±1}. This can be written

  Find          w, b, γ
  To maximize   γ
  Such that     $\forall i \quad (w \cdot x_i + b)\, y_i \geqslant \gamma$
                ∥w∥ = 1.

The support vectors are the observations xi such that the margin (w · xi + b)yi is γ. If we remove the constraint ∥w∥ = 1 and impose (w · xi + b)yi = 1 on the support vectors, the margin becomes $\|w\|^{-1}$ (to prove it, consider a support vector x+, its symmetric x− wrt the hyperplane, and add their margins), and the problem becomes

  Find          w, b
  To minimize   $\|w\|^2$
  Such that     $\forall i \quad (w \cdot x_i + b)\, y_i \geqslant 1.$
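
When the constraints are replaced with a hinge-loss penalty, i.e., when one minimizes $\|w\|^2 + c \sum_i \max\{0, 1 - y_i (w \cdot x_i + b)\}$, the problem can be attacked with stochastic (sub)gradient descent. The following is only a sketch: the learning rate, number of epochs, penalty `c` and toy data are arbitrary choices, and the regularization term is spread evenly over the n observations.

```python
import random

def train_svm(points, labels, c=10.0, epochs=200, lr=0.01, seed=0):
    """Stochastic subgradient descent on ||w||^2 + c * sum_i hinge_i."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        order = list(range(n))
        rng.shuffle(order)
        for i in order:
            x, y = points[i], labels[i]
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            # Per-observation objective: ||w||^2 / n + c * max(0, 1 - margin)
            grad_w = [2 * wj / n for wj in w]
            grad_b = 0.0
            if margin < 1:              # the hinge is active
                grad_w = [g - c * y * xj for g, xj in zip(grad_w, x)]
                grad_b = -c * y
            w = [wj - lr * g for wj, g in zip(w, grad_w)]
            b -= lr * grad_b
    return w, b

points = [(2, 2), (3, 3), (2, 3), (-2, -2), (-3, -3), (-2, -3)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_svm(points, labels)
scores = [sum(wj * xj for wj, xj in zip(w, x)) + b for x in points]
```

On this separable toy dataset, the sign of each score matches the label; a production implementation would use a decreasing step size or a dedicated solver.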

If the data is not separable, add a penalty cξi for each misclassified point

  Find          w, b, ξ
  To minimize   $\|w\|^2 + c \sum_i \xi_i$
  Such that     $\forall i \quad (w \cdot x_i + b)\, y_i \geqslant 1 - \xi_i$
                $\forall i \quad \xi_i \geqslant 0.$

This can be written with a hinge loss

  Find          w, b
  To minimize   $\|w\|^2 + c \sum_i \max\{0,\, 1 - y_i (w \cdot x_i + b)\}.$

The constraints have disappeared: we can use (stochastic) gradient descent.

Process mining
W. van der Aalst (Coursera, 2014)

Process mining is the study and modeling of logs (a log is a set of traces, a trace is a sequence of actions or tasks or activities, with timestamps, and often ancillary data – usually as CSV files, but there is an XML standard: XES).

The model is often represented using the business process modeling notation (BPMN), with activities and decision nodes (xor split and xor join, drawn ×; and split and and join, drawn +).

To more easily keep track of the “current state” (the process can be in several branches at the same time), one can use a Petri net instead: it is made of transitions and places, and the current state is tracked with one or several tokens: a transition is enabled when there is a token in each of its input places; when it fires, it consumes a token in each of its input places and produces one in each output place (the number of tokens is not constant).

A Petri net is k-bounded if there are at most k tokens in any given place; it is safe if it is 1-bounded; a deadlock is a reachable dead marking, i.e., a state in which no transition is enabled; a live transition is a transition that can be enabled from any reachable marking; a workflow net is a Petri net, with one source place, one sink place, in which all nodes are on a path from source to sink; a workflow net is sound if it is safe, has no dead parts, has proper completion (if there is a token in the sink, there are no tokens anywhere else), and has the option to complete.

If x and y are events appearing in the log, x < y means that x is sometimes immediately followed by y. The causal footprint of a log is the set (matrix) of relations between events, where the possible relations are:
– x → y if x < y and not y < x;
– x ∥ y if x < y and y < x;
– x # y if neither x < y nor y < x.

The alpha algorithm tries to build a Petri net from the causal footprint of a log. Intuitively, it tries to recognize patterns for xor/and split/join (e.g., a → b, a → c, b # c corresponds to a xor split). More formally:
– L = the set of traces (the log); a trace is a sequence of tasks;
– T = the set of tasks;
– Ti = the set of initial tasks;
– To = the set of final (output) tasks;
– X is the set of candidate places, where a candidate place is a pair (A, B), A, B ⊂ T, such that
    ∀a ∈ A ∀b ∈ B     a → b
    ∀a1, a2 ∈ A       a1 # a2
    ∀b1, b2 ∈ B       b1 # b2
– Y ⊂ X only contains the maximal elements of X;
– For the set of places P = Y ∪ {i, o}, just add an initial and a final place;
– The set of arcs, F, contains an arc from i to each element of Ti; an arc from each element of To to o; for places (A, B), an arc from each element a of A to (A, B), and an arc from (A, B) to each element b of B.

The algorithm is related to frequent itemset mining. The model learned by the alpha algorithm is not minimal (one can sometimes remove places with no visible effect), and cannot capture loops or non-local dependencies; it is not robust to noise: you can end up with a flower model, which allows any behaviour.

Region-based process discovery tries to address some of those problems. First, define states from the log; for instance, a state could be the trace before a given point, or the trace after a given point, or the whole trace, or the last (or next) k actions (as a sequence, or as a set, or as a multiset); one could also use the other attributes (resources) of the events. The states form a transition system. A region is a subset of states such that, for each activity a, either a always enters the region, or a always exits the region, or a never enters nor leaves the region. The places are the non-trivial minimal regions.
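
The causal footprint used by the alpha algorithm is easy to compute from a log. Below is a small sketch on a made-up log of three traces; the string encoding of the relations, including the extra symbol "<-" for the reverse of "->", is a choice made for this illustration.

```python
from itertools import product

def causal_footprint(log):
    """Relation between each ordered pair of tasks: '->', '<-', '||' or '#'."""
    tasks = sorted({task for trace in log for task in trace})
    # x < y: x is sometimes immediately followed by y
    follows = {(a, b) for trace in log for a, b in zip(trace, trace[1:])}
    footprint = {}
    for x, y in product(tasks, repeat=2):
        forward, backward = (x, y) in follows, (y, x) in follows
        if forward and not backward:
            footprint[x, y] = "->"
        elif backward and not forward:
            footprint[x, y] = "<-"
        elif forward and backward:
            footprint[x, y] = "||"
        else:
            footprint[x, y] = "#"
    return footprint

# Hypothetical log: b and c run in parallel, e is an alternative branch
log = [("a", "b", "c", "d"), ("a", "c", "b", "d"), ("a", "e", "d")]
fp = causal_footprint(log)
```

Here a → b, a → e and b # e suggest a xor split after a, while b ∥ c suggests an and split.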

Language-based regions reduce the problem of finding the places to a linear algebra problem: a place is a solution (x, y, c) ∈ N³ of c1 + Bx − Ay ⩾ 0, where A and B are binary matrices, with one column per activity and one row for each prefix of a trace; an element of A is 1 if the activity is in the prefix; an element of B is 1 if the activity is in the prefix without its last element; c is the number of tokens; the nonnegativity condition ensures that the number of tokens in a place is never negative; x and y describe the arcs: x is the number of transitions to the place (x, y, c), y is the number of transitions from the place (x, y, c).

Inductive process mining decomposes the event log using seq, xor and and operations. Genetic algorithms are also used in process mining.

To measure the conformance of a log and a model:
– Compute the causal footprint of the log and the model, and count the discrepancies;
– Replay the log, add tokens to enable transitions when needed, and count the number of tokens added and the number of tokens remaining at the end (but once the net is flooded with tokens, anything is possible);
– Align the log and the model [no details].

Process models can be enhanced:
– At decision points, use statistical models to guess which branch will be taken (data-aware Petri nets);
– Identify bottlenecks and their causes;
– Social networks (for instance, it is easy to identify the structure of an organization);
– Operational support: detect problems, predict the remaining time, predict and recommend the next activity or resource.

MEG decoding using Riemannian geometry and unsupervised classification
A. Barachant

To forecast a binary class k ∈ {0, 1} from multidimensional signals $x_i : [\![1, N]\!] \to \mathbf{R}^c$:
– Compute the average for each class, $p_k : [\![1, N]\!] \to \mathbf{R}^c$;
– For each class k ∈ {0, 1}, combine the signals, $w' p_k = w_1 p_{k1} + \cdots + w_c p_{kc}$ and $w' x = w_1 x_1 + \cdots + w_c x_c$, to maximize the signal-to-noise ratio
  $$w(k) = \operatorname*{Argmax}_w \frac{\|w' p_k\|^2}{\|w' x\|^2} = \operatorname*{Argmax}_w \frac{w' p_k p_k' w}{w' x x' w};$$
– It is an eigenvalue problem: keep the first four eigenvectors wk1, ..., wk4;
– As features, use the covariance matrix $\Sigma_i = \frac{1}{N} Z_i Z_i'$ of
  $$Z_i = \begin{pmatrix} w_{01}' p_0 \\ \vdots \\ w_{04}' p_0 \\ w_{11}' p_1 \\ \vdots \\ w_{14}' p_1 \\ w_{01}' x_i \\ \vdots \\ w_{14}' x_i \end{pmatrix};$$
– Do not use the variance matrices themselves, but project them on the tangent space (at I, using the logarithm) of the space S+ of positive definite matrices.

The actual Kaggle task was a form of transfer learning (the test samples were not from the same distribution as the training samples – MEGs from a different patient): one can use the forecasts from the trained model as a starting point for the k-means algorithm. But this requires barycenters: in S+, the average of a set of points is the point that minimizes the sum of the squared distances, and the distance is

$$d(\Sigma_1, \Sigma_2) = \bigl\| \log \Sigma_1^{-1/2} \Sigma_2 \Sigma_1^{-1/2} \bigr\|_F = \sqrt{\sum_c \log^2 \lambda_c}$$

where the λc are the eigenvalues of $\Sigma_1^{-1/2} \Sigma_2 \Sigma_1^{-1/2}$.

A Riemannian geometry with complete geodesics for the set of positive semidefinite matrices of fixed rank
B. Vandereycken et al. (2013)

It is easy to describe the geodesics of S+(n, n), the manifold of positive definite matrices of size n, and use them in optimization problems (e.g., line search along a geodesic), but the algorithms do not scale (they are O(n³)). Low-rank approximations are attractive, but the space of positive semidefinite matrices of size n and rank p, S+(n, p), has no canonical metric. It can be described as the orbit of

$$\begin{pmatrix} 1_p & 0_{p,n-p} \\ 0_{n-p,p} & 0_{n-p} \end{pmatrix}$$

under the action of $GL_n^+$ defined by A · X = AXA′. The natural right-invariant metric on GLn, $g_A(\eta_A, \nu_A) = \operatorname{tr}(A^{-1\prime} \eta_A' \nu_A A^{-1})$, defines a metric on S+(n, p).
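
The Riemannian distance between positive definite matrices used in the MEG decoding summary can be illustrated in the 2×2 case, where the eigenvalues have a closed form. This is only a sketch restricted to 2×2 matrices (a general implementation would call an eigenvalue routine from a linear algebra library); it relies on the fact that $\Sigma_1^{-1/2} \Sigma_2 \Sigma_1^{-1/2}$ and $\Sigma_1^{-1} \Sigma_2$ have the same eigenvalues.

```python
import math

def spd_distance_2x2(p, q):
    """sqrt(sum of log^2 of the eigenvalues of p^{-1} q), for 2x2 SPD matrices."""
    (a, b), (c, d) = p
    det_p = a * d - b * c
    inv_p = [[d / det_p, -b / det_p], [-c / det_p, a / det_p]]
    m = [[sum(inv_p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
         for i in range(2)]
    # Eigenvalues of a 2x2 matrix from its trace and determinant
    trace = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    disc = math.sqrt(max(trace * trace - 4 * det, 0.0))
    eigenvalues = [(trace + disc) / 2, (trace - disc) / 2]
    return math.sqrt(sum(math.log(lam) ** 2 for lam in eigenvalues))

identity = [[1.0, 0.0], [0.0, 1.0]]
p = [[2.0, 1.0], [1.0, 3.0]]
q = [[4.0, -1.0], [-1.0, 2.0]]
d_pq = spd_distance_2x2(p, q)
d_qp = spd_distance_2x2(q, p)
```

The distance is symmetric, and the distance from the identity to diag(e, e²) is $\sqrt{1^2 + 2^2}$, since the logarithms of the eigenvalues are 1 and 2.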

A differential geometric approach to the geometric mean of symmetric positive-definite matrices
M. Moakher

The matrix exponential exp : S(n) → S+(n) establishes a bijection between symmetric matrices S(n) and symmetric positive definite matrices S+(n) ≃ GLn/On.

The geodesic on S+(n) from I in the direction S is $t \mapsto e^{tS}$.

The geodesic on S+(n) from P in the direction S ∈ S(n) ≃ TP S+(n) is $t \mapsto P^{1/2} e^{t P^{-1/2} S P^{-1/2}} P^{1/2}$.

The distance on S+(n) is

$$d(P, Q) = \bigl\| \operatorname{Log}(P^{-1} Q) \bigr\|_F = \Bigl( \sum_{\lambda \in \operatorname{Spec} P^{-1} Q} \ln^2 \lambda \Bigr)^{1/2}$$

(note that $P^{-1} Q$ is not symmetric, but $\operatorname{Spec} P^{-1} Q = \operatorname{Spec} Q^{1/2} P^{-1} Q^{1/2} \subset \mathbf{R}_+^\times$).

The Riemannian mean of positive definite matrices P1, ..., Pn, defined as $P = \operatorname{Argmin}_P \sum_k d(P, P_k)^2$, satisfies $\sum_k \operatorname{Log}(P_k^{-1} P) = 0$; it is a generalization of the geometric mean.

Multiclass brain computer interface classification by Riemannian geometry
A. Barachant et al. (2012)

If your data consists of positive definite matrices, rather than arbitrary vectors, algorithms that only use the notions of distance and mean (e.g., kNN, k-means, etc.) are directly applicable using the Riemannian structure of S+(n). For other algorithms, project the data on the tangent space (at the identity, or any other point, if one makes better sense), using the logarithm.

Classification of covariance matrices using a Riemannian-based kernel for BCI applications
A. Barachant et al. (2013)

More details; algorithm to compute the mean; application to support vector machines.

MEG decoding across subjects
E. Olivetti et al. (2014)

Transductive transfer learning (TTL):

$$\hat\theta = \operatorname*{Argmin}_\theta \frac{1}{n} \sum_i \frac{P_{\text{target}}(x_i)}{P_{\text{source}}(x_i)} \operatorname{loss}(\theta, x_i, y_i).$$

Factorization machines
S. Rendle

A factorization machine is a linear model with interactions,

$$y = \alpha + \sum_i \beta_i x_i + \sum_{ij} \gamma_{ij} x_i x_j$$

where the interaction terms have low rank, $\gamma_{ij} = \langle v_i, v_j \rangle$, $v_i \in \mathbf{R}^k$, k small (you can add higher order interactions). It is an alternative to the “hash trick” Vowpal Wabbit uses. It works better than support vector machines for sparse data. (Without α and β, it would just be a matrix factorization.) For an implementation, check libfm.

Learning recommender systems with adaptive regularization
S. Rendle (2012)

Simultaneously learn the model and the regularization parameter:

$$\beta_{n+1} = \operatorname*{Argmin}_\beta \operatorname{Loss}(\text{training set}, \beta, \lambda_n)$$
$$\lambda_{n+1} = \operatorname*{Argmin}_\lambda \operatorname{Loss}(\text{test set}, \beta_{n+1}, \lambda).$$

(Here, Argmin does not have to be the minimizer: taking a few steps towards the minimum is sufficient.)

Scaling factorization machines to relational data
S. Rendle (2013)

When fitting a model on relational data (often, a star schema, e.g., movie ratings, with movie characteristics (genre, year) and user characteristics (age, gender, friends)), one typically joins (denormalizes) the data. This needlessly inflates the volume of data to process and the running time. Learning on data in normal form is possible (process the fact table; update the weights of the rows on the dimension tables; process the dimension tables).

Tensor decompositions and applications
T.G. Kolda and B.W. Bader (2009)

There are two generalizations of PCA to tensors (using Einstein’s summation convention):

  CP:      $x_{ijk} = g_r\, a_{ir}\, b_{jr}\, c_{kr}$
  Tucker:  $x_{ijk} = g_{pqr}\, a_{ip}\, b_{jq}\, c_{kr}$

The CP (canonical decomposition/parallel factors) expresses the tensor x as a sum of rank-1 tensors. The rank of x is the minimal number of terms in such a sum. Contrary to the notion of rank for matrices:
– The ranks over R and C may be different;
– Among tensors of a given shape, the typical rank (any rank that occurs on a set of non-zero measure) and the maximum rank can differ;
– “The” typical rank is not unique;
– The best rank-1 approximation need not be part of a best rank-2 approximation;
– The best rank-k approximation need not exist: the set of rank-k tensors is not closed, i.e., some tensors can be approximated arbitrarily closely by a tensor of smaller rank – the border rank of x is the minimum number of rank-1 tensors needed to approximate x with an arbitrarily small error.

The CP decomposition can be estimated using alternating least squares (ALS).
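
To make the CP formula concrete, here is a sketch that rebuilds a 3-way tensor from its factors, with the sum over r written out explicitly; nested lists stand in for arrays, and the factor values are made up.

```python
def cp_reconstruct(g, a, b, c):
    """x[i][j][k] = sum_r g[r] * a[i][r] * b[j][r] * c[k][r]."""
    rank = len(g)
    return [[[sum(g[r] * a[i][r] * b[j][r] * c[k][r] for r in range(rank))
              for k in range(len(c))]
             for j in range(len(b))]
            for i in range(len(a))]

# A rank-1 tensor: the outer product of three vectors, scaled by g[0]
g = [2.0]
a = [[1.0], [2.0]]
b = [[1.0], [3.0]]
c = [[1.0], [0.0]]
x = cp_reconstruct(g, a, b, c)
```

Alternating least squares fits `a`, `b` and `c` in turn, each step being an ordinary least squares problem once the two other factors are fixed.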

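
The low-rank structure of the interactions in a factorization machine makes the prediction cheap: the pairwise sum can be computed in time linear in the number of features. The sketch below assumes the interaction sum runs over pairs i < j (as in Rendle's formulation) and compares the fast formula with the naive double loop; the parameter values are made up.

```python
def fm_predict(x, alpha, beta, v):
    """Linear-time prediction; v[i] is the k-dimensional vector of feature i."""
    k = len(v[0])
    linear = sum(b * xi for b, xi in zip(beta, x))
    interactions = 0.0
    for f in range(k):
        s = sum(v[i][f] * x[i] for i in range(len(x)))
        s2 = sum((v[i][f] * x[i]) ** 2 for i in range(len(x)))
        # sum_{i<j} v_if v_jf x_i x_j = ((sum_i v_if x_i)^2 - sum_i (v_if x_i)^2) / 2
        interactions += 0.5 * (s * s - s2)
    return alpha + linear + interactions

def fm_predict_naive(x, alpha, beta, v):
    """Same model, with the explicit sum over pairs i < j."""
    n = len(x)
    pairs = sum(sum(p * q for p, q in zip(v[i], v[j])) * x[i] * x[j]
                for i in range(n) for j in range(i + 1, n))
    return alpha + sum(b * xi for b, xi in zip(beta, x)) + pairs

x = [1.0, 0.0, 2.0, 3.0]
alpha, beta = 0.5, [0.1, 0.2, 0.3, 0.4]
v = [[1.0, 0.5], [0.2, 0.1], [-0.3, 0.4], [0.6, -0.2]]
fast = fm_predict(x, alpha, beta, v)
slow = fm_predict_naive(x, alpha, beta, v)
```

With sparse inputs, only the non-zero coordinates of `x` contribute to the sums, which is what makes the model practical on very sparse data.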

Probabilistic principal component analysis
M.E. Tipping and C.M. Bishop (1999)

PCA is a limiting case of (isotropic) factor analysis, X ∼ N(µ, WW′ + σ²I), when σ² → 0 (σ² is the average variance due to discarded components). PPCA allows for missing values.

In R, check the pcaMethods package (for probabilistic and Bayesian PCA).

NICE: non-linear independent component estimation
L. Dinh et al. (2014)

Given a random variable X : Ω → Rⁿ, a non-linear independent component decomposition is a transformation f : Rⁿ → Rⁿ such that the density of H = f(X) factorizes: $p_H(h) = \prod_d p_{H_d}(h_d)$.

It can be estimated by maximum likelihood, $p_X(x) = p_H(f(x))\, |\det f'|$, i.e.,

$$\operatorname*{Argmax}_{f,\, p_{H_1}, \ldots, p_{H_n}} \sum_x \Bigl( \sum_d \log p_{H_d}\bigl(f_d(x)\bigr) + \log |\det f'| \Bigr).$$

We look for a transformation (built from three layers of building blocks) of the form

  y1 = x1
  y2 = x2 + m(x1)

which ensures that |det f′| and f⁻¹ are easy to compute.

Sparse matrix factorization
B. Neyshabur and R. Panigrahy (2014)

The sparse matrix factorization problem aims to write a matrix Y as a product of sparse matrices Y = X1 X2 ··· Xn – PCA can be seen as a special case, where X1 has few columns and X2 few rows.

Assuming the matrices are square, random, centralized, normalized, and the coefficients of Y are 0, 1 or −1, X1 X1′ = round(Y Y′) with high probability; X1 can be recovered from X1 X1′, and one can then iterate.

Blind denoising with random greedy pursuits
M. Moussallam et al. (2014)

To find a sparse reconstruction of a time series y, choose “atoms” (elementary time series) Φ = (ϕ1 | ··· | ϕn) and fit y = Φα + w, where α is sparse:

  Find          α
  To minimize   ∥α∥₀
  Such that     ∥y − Φα∥ ⩽ ε.

One can use a greedy algorithm (matching pursuit, MP) to find an approximate solution – for the stopping criterion, look at

$$\max_\phi \frac{|\langle y, \phi \rangle|}{\|y\|_2\, \|\phi\|_2}.$$

Instead of a single sparse representation, one can compute several (as with random forests: at each step, choose a new atom in a random subset of Φ) and average them.

Dimensionality reduction for k-means clustering and low rank approximation
M. Cohen et al. (2014)

To speed up approximate k-means or PCA, use a sketch of the data (row and column sampling, approximate SVD (CUR), Johnson-Lindenstrauss projection).

Alternating least squares for personalized ranking
G. Takács and D. Tikk (2012)

Alternating least squares (ALS) are often used to find $\hat R = P Q'$, with P and Q rectangular with a small number of columns, to minimize

$$\sum_{u \in \text{Users}} \sum_{i \in \text{Items}} c_{ui}\, (\hat r_{ui} - r_{ui})^2.$$

It can also be used with a ranking objective

$$\sum_{uij} c_{ui}\, s_j \bigl( (\hat r_{ui} - \hat r_{uj}) - (r_{ui} - r_{uj}) \bigr)^2.$$

HyperLogLog in practice: algorithmic engineering of a state of the art cardinality estimation algorithm
S. Heule et al. (2013)

HyperLogLog is biased for small counts: one can (empirically) estimate the bias and correct it. (HyperLogLog++ introduces a few more changes, mostly to reduce memory usage.)

In-core computation of geometric centralities with HyperBall: a hundred billion nodes and beyond
P. Boldi and S. Vigna

To count the number |B(x, r)| of nodes in the ball of center x and radius r in a graph, use a HyperLogLog counter cv for each node and estimate B(x, r + 1) from B(x, r).

This can be used to compute the harmonic centrality

$$\operatorname{centrality}(x) = \sum_{y \neq x} \frac{1}{d(x, y)} = \sum_{r \geqslant 1} \frac{|B(x, r)| - |B(x, r - 1)|}{r}.$$
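
The relation between ball sizes and harmonic centrality can be checked on a small graph with an exact breadth-first search. This sketch computes the ball sizes exactly; HyperBall replaces them with approximate HyperLogLog counters so that the computation scales to very large graphs. The toy graph and the function names are made up.

```python
from collections import deque

def ball_sizes(graph, source):
    """[|B(source, 0)|, |B(source, 1)|, ...] by breadth-first search."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    radius = max(dist.values())
    return [sum(d <= r for d in dist.values()) for r in range(radius + 1)]

def harmonic_centrality(graph, source):
    """Sum over r >= 1 of (number of nodes at distance exactly r) / r."""
    sizes = ball_sizes(graph, source)
    return sum((sizes[r] - sizes[r - 1]) / r for r in range(1, len(sizes)))

# Path graph 0 - 1 - 2 - 3, as adjacency lists
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
end_node = harmonic_centrality(graph, 0)     # 1 + 1/2 + 1/3
inner_node = harmonic_centrality(graph, 1)   # 1 + 1 + 1/2
```

As expected, the inner node of the path is more central than the end node.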

The power of comparative reasoning
J. Yagnik et al.

The winner-takes-all (WTA) locality-sensitive hash (LSH) is

$$\begin{cases} \mathbf{R}^n \longrightarrow \{1, \ldots, k\} \\ x \longmapsto \operatorname*{Argmax}_{1 \leqslant j \leqslant k} x_{\sigma(j)} \end{cases}$$

for a random permutation σ ∈ Sn. By using several permutations σ and choosing k = 2, we get a binary hash. One can apply a kernel before computing the hash.
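
A minimal sketch of the WTA hash follows; it returns a 0-based index (the text uses 1, ..., k) and derives the permutation from a seed, both being choices made for the illustration.

```python
import random

def wta_hash(x, k, seed):
    """Index (0-based) of the largest of the first k permuted coordinates."""
    perm = list(range(len(x)))
    random.Random(seed).shuffle(perm)           # random permutation sigma
    window = [x[perm[j]] for j in range(k)]
    return max(range(k), key=window.__getitem__)

x = [0.3, -1.2, 2.5, 0.9, -0.4, 1.7]
# A strictly increasing transformation of the coordinates preserves their order
x_transformed = [value ** 3 + 5 for value in x]
codes = [wta_hash(x, 4, seed) for seed in range(50)]
codes_transformed = [wta_hash(x_transformed, 4, seed) for seed in range(50)]
```

Because the hash only depends on the ordering of the coordinates, `codes` and `codes_transformed` are identical: this is the “comparative reasoning” of the title.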

OpenGM: a C++ library for discrete graphical models
B. Andres et al. (2012)

C++ template library implementing algorithms for discrete graphical models (loopy belief propagation, sequential tree reweighted belief propagation, graph cut, etc.), i.e., given a factor graph and a semiring (Ω, ⊗, 1, ⊕, 0), computing $\bigoplus_x \bigotimes_F \phi_F(x_{\operatorname{neigh}(F)})$; examples include

  Optimization              (R, +, 0, min, ∞)
  Marginalization           (R+, ·, 1, +, 0)
  Constraint satisfaction   ({0, 1}, ∧, 1, ∨, 0).

MLPACK: a scalable C++ machine learning library
R.R. Curtin et al. (2012)

An LGPL C++ template machine learning library (also check Weka, Shogun, mlpy, sklearn, MLLib).

Practically accurate floating point math
N. Toronto and J. McCarthy (2014)

Measure the floating point error when approximating a real number r with a floating point number x as

$$\operatorname{error}(x, r) = \frac{x - r}{\operatorname{ulp}(\operatorname{fl}(r))}$$

where
  ulp(y) = distance from y to the next float
  fl(r) = float closest to r
  ord(x) = (signed) number of floats between 0 and x
  ord⁻¹(n) (needed to compute ulp).

It is the number of floats between x and r.

When implementing a numeric function, check where it is ill-conditioned (“the badlands”), e.g., for functions f : R → R, check where the condition number |x f′(x)/f(x)| exceeds 4.

The examples use typed/racket (for the plots as well) and MPFR.

Fast approximate nearest neighbors with automatic algorithm configuration
M. Muja and D.G. Lowe

Comparison of randomized kd-trees and hierarchical k-means for fast approximate nearest neighbours (FANN).

CRDTs: consistency without concurrency control
M. Leţia et al. (2009)

CRDTs are data structures whose operations commute: they can be used in a distributed environment. For instance:
– A set with a single add-element operation;
– A set with add and delete operations, if no element is ever added twice: keep two lists, one for added elements, one for deleted elements, and clean them from time to time;
– An ordered set, with insert-at and delete operations: use a dense index space (not R, but paths in a binary tree).

A comprehensive study of convergent and commutative replicated data types
M. Shapiro et al. (2011)

Eventual consistency, in a distributed/asynchronous environment, can be guaranteed by the data structures themselves, without synchronization. For instance:
– Add-only set;
– Distributed maximum (when the maximum changes, broadcast it to the other replicas);
– Counter: broadcast +1 or −1 – the reception order is irrelevant;
– Register: store value-timestamp pairs and only keep the value with the latest timestamp;
– Set with add and remove operations, if removed elements cannot be added back: use two add-only sets, for elements added and elements removed, and occasionally garbage-collect them;
– Set with add and remove operations, in which add has priority over remove in case of a conflict (e.g., shopping cart): when adding an element, also store a unique identifier (a new one, if the element was already there); remove removes the elements with the identifiers known to the replica that received the order;
– Graphs: like sets;
– Directed acyclic graphs: that is a global property; it cannot be ensured in general without synchronization;
– Partial order: like sets;
– Total order (e.g., collaborative editing): either add an addRight(position, content) operation, which stores the content-timestamp pair in a linked list, or an addBetween operation, for which the total order is defined by a dense set of identifiers (trees: Logoot or Treedoc – but they have to be rebalanced (garbage-collected) from time to time).

Generation and analysis of constrained random sampling patterns
J. Pierzchlewski and T. Arildsen

Random sampling of analog (1-dimensional) signals is preferable to regular sampling. Simple sampling patterns include: a regular grid with noise, or adding noise to the previous point.

The algorithmic foundations of differential privacy
C. Dwork and A. Roth (2014)

To ensure plausible deniability, use randomized algorithms (e.g., answer truthfully with probability 1/2, answer randomly otherwise – for real-valued functions, perturb the result with the Laplace distribution). A randomized algorithm M is differentially private if

$$\|x - y\| \leqslant 1 \implies P[M(x) \in S] \leqslant e^\varepsilon P[M(y) \in S] + \delta.$$
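
The randomized response mechanism mentioned above (answer truthfully with probability 1/2, randomly otherwise) can be simulated in a few lines. This is a sketch on synthetic data, assuming that the random answer is a fair coin; the population size and the true proportion are made up.

```python
import math
import random

def randomized_response(truth, rng):
    """Answer truthfully with probability 1/2, otherwise flip a fair coin."""
    if rng.random() < 0.5:
        return truth
    return rng.random() < 0.5

def estimate_proportion(answers):
    """P[yes] = 1/4 + p/2, hence p = 2 * (observed - 1/4)."""
    observed = sum(answers) / len(answers)
    return 2 * (observed - 0.25)

rng = random.Random(0)
truths = [i < 6000 for i in range(20000)]       # 30% of "yes" in the population
answers = [randomized_response(t, rng) for t in truths]
estimate = estimate_proportion(answers)

# P[yes | truth = yes] = 3/4 and P[yes | truth = no] = 1/4:
# the mechanism satisfies the definition with epsilon = ln 3 and delta = 0.
epsilon = math.log(0.75 / 0.25)
```

The population proportion can still be recovered from the noisy answers, even though no individual answer can be trusted.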

The privacy loss is

$$\ln \frac{P[M(x) = \xi]}{P[M(y) = \xi]}.$$

When answering several queries, the system should refuse to answer similar ones (e.g., the number of people with disease D, and the number of people not named X with disease D).

Computing arbitrary functions of encrypted data
C. Gentry (2008)

The following encryption scheme, which adds noise to the cleartext, is somewhat homomorphic – the noise progressively increases and eventually becomes too large.

  cleartext:            m ∈ {0, 1}
  key:                  p ∈ {0, 1}^P, p ≡ 1 (mod 2)
  ciphertext:           c = m′ + pq
  where                 m′ ∈ {0, 1}^N, m′ ≡ m (mod 2)
                        q ∈ {0, 1}^Q
  somewhat preserved:   +, −, ×

To make it homomorphic, try to find functions $f_{i \to i+1}$ that transform Encrypt(pi, m) to Encrypt(pi+1, m) (i.e., that change the key and reset the noise), using the somewhat preserved operations (this is possible after a small change in the encryption scheme). This is slow, there is no division, and no inequalities.

A comparative study of discretization methods for naive Bayes classifiers
Y. Yang and G.I. Webb (2002)

When discretizing data, use √n quantile bins, but make sure each bin has at least 30 observations.

The Winograd schema challenge
H.J. Levesque et al. (2012)

A replacement for captchas (or the Turing test): resolve referential ambiguity (e.g., “the trophy does not fit in the suitcase because it is too big”, what is too big: the trophy or the suitcase?).

Sloane’s gap: do mathematical and social factors explain the distribution of numbers in the OEIS?
N. Gauvrit et al. (2011)

Plotting the number of occurrences of an integer in the OEIS shows two clouds, with a similar exponential decay, and a gap in between. Computational complexity explains the decay, but not the gap – it could be due to social factors.

Profiling R on a contemporary processor
S. Sridharan and J.M. Patel (2014)

R is slow because of cache misses, caused by:
– Pointer-chasing in the garbage collector (GC);
– Linear algebra operations: R should provide both row- and column-major storage and/or use blocking;
– Custom C code in packages (e.g., rpart): it should be cache-conscious.

The GC problems are made worse by:
– Unnecessary attributes (e.g., row names, which are automatically added);
– Unnecessary copies; intermediate object creation (arithmetic operations on vectors or matrices);
– Inefficient implementations (e.g., subsetting with a boolean vector recomputes the row indices for each column).

Convex optimization in Julia
M. Udell et al. (2014)

The Convex.jl package brings disciplined convex programming (DCP) to Julia. Besides the positive orthant, the second order cone, and the semi-definite cone, note the exponential cone,

$$\{ (x, y, z) \in \mathbf{R}^3 : y \geqslant 0,\; y e^{x/y} \leqslant z \},$$

used for the convex functions exp(x), − log(x), $\log \sum_i \exp x_i$.

Computing in operations research using Julia
M. Lubin and I. Dunning (2013)

JuMP.jl relies on macros rather than operator overloading (as in CVX, Convex.jl, PuLP, Pyomo, etc.) and can therefore efficiently build larger problems.

A review of goal programming for portfolio selection
R. Azmi and M. Tamiz (2010)

“Goal programming” is a vague notion that refers to the various ways of selecting a single solution of a multi-objective optimization problem, “find x to minimize g1(x), ..., gn(x)”, somewhere on the efficient frontier. For instance:
– Minimize $\sum_i \lambda_i g_i(x)$;
– Minimize $\max_i g_i(x)$;

– Minimize g1 (x); among the solutions, minimize g2 ; Optimal risk budgeting continue until there is only one solution left; under a finite investment horizon – etc. M. López de prado, R. Vince and Q. Zhu (2013) To decide which proportion xi of one’s wealth to inˆ = Argmaxx fQ,U (x), Portfolio selection with higher moments: vest in asset i, one can use x where f (x) is the expected utility for utility funca polynomial goal programming Q,U tion U after Q periods. The Kelly principle is x ˆ = approach to ISE30 index Argmax f (x). The leverage space seems to be G. Kemalbay et al. (2011) x ∞,linear the graph of x 7→ fQ,linear (x). Indicator-based selection in multiobjective search E. Zitzler and S. Künzli (2004)

A factor model for non-linear dependences in stock returns R. Chicheportiche and J.P. Bouchaud (2013)

Multiobjective search tries to find a good approximation of the Pareto set – a good solution is close to the The non-linear correlation between two random Pareto set and diversified. These notions are rarely variables X and Y generalizes Cor(|X| , |Y |) and clearly defined, but could be formalized, e.g., one could Cor(X 2 , Y 2 ): try to maximize the volume dominated by the solution. Here is one algorithm: [ p] E |XY | 1 – Start with a large candidate set; Cp (X, Y ) = 2 log [ p ] [ p ] . p E |X| E |Y | – Compute the “fitness” of each point in it (relative to the others); – Remove the worst point; update the fitness of the others; – Iterate (until some termination criterion is met, e.g., the population size). The indicator-based evolutionary algorithm (IBEA) defines the fitness of a point a ∈ A as ∑ F (a) = − exp −I({a}, {b})/κ

∑ In the standard factor model xi = k βki fk + ei , the non-linear correlations Cp (fk , fℓ ), Cp (fk , ei ) Cp (ei , ej ) are non-zero. The nested factor model, for the volatilities of the factors f and the residuals e, with two factors, Ω0 and Ω1 , can account for them: fk = εk exp[Ak0 Ω0 + Ak1 Ω1 + ωk ] ej = ηj exp[Bj0 Ω0 + Bj1 Ω1 + ϖk ].

b∈A\{a}

where κ is a parameter (the temperature) and I is a binary quality indicator, e.g., { I(A, B) = Min ε : ∀ b ∈ B ∃ a ∈ A ∃ i ∈ J1, nK } fi (a) − ε ⩽ fi (b) or { I(A, B) =

I(B) − I(A) if ∀ b ∈ B ∃ a ∈ A a ≻ b I(A ∪ B) − I(A) otherwise

where I(A) is the volume dominated by A.

Commodity futures and market efficiency L. Kristoufek and M. Vosvrda (2013) To measure the efficiency of a given market (or asset), average the following measures: – Hurst exponent (long-term efficiency): local Whittle estimator, GPH estimator; – Fractal dimension (short-term): Hall-Wood estimator, Genton estimator; – Entropy: Pincus approximate entropy. The article also gives a concise review of those estimators.

Sparse and stable Markowitz portfolios J. Brodie et al. (2008)
For sparse portfolios, add an L1 penalty to the portfolio optimization problem.

Sparse portfolio selection via quasi-norm regularization C. Chen et al. (2013)
For sparse portfolios, add an Lp penalty (0 < p < 1) to the portfolio optimization problem; it is no longer convex, but can still be solved in polynomial time via interior point methods.

Implied Hurst exponent and fractional implied volatility: a variance term structure model K.Q. Li and R. Chen (2014)
One can compute the Hurst exponent from option prices:
– Using the fractional Black-Scholes formula; a single option does not give (σ, H), but just $V = \sigma^2 T^{2H}$: use several maturities and regress log V ∼ log T;
– Using a model-free approach, very similar to the VIX computations, to compute V.
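As a minimal sketch of the regression log V ∼ log T: since V = σ²T^{2H}, the slope is 2H and the intercept is 2 log σ. The maturities and variances below are synthetic, generated from hypothetical values σ = 0.2 and H = 0.4 only to check that the fit recovers them.

```python
import numpy as np

def implied_hurst(maturities, total_variances):
    """Fit log V = 2 log(sigma) + 2 H log(T) by least squares."""
    slope, intercept = np.polyfit(np.log(maturities), np.log(total_variances), 1)
    hurst = slope / 2.0
    sigma = float(np.exp(intercept / 2.0))
    return float(hurst), sigma

# Synthetic term structure (assumed values, for illustration only).
T = np.array([0.25, 0.5, 1.0, 2.0])
V = 0.2**2 * T**(2 * 0.4)
H, sigma = implied_hurst(T, V)
```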

Article and book summaries by Vincent Zoonekynd


Forecasting the NOK/USD exchange rate with machine learning techniques T. Papadimitriou et al.
One can forecast the exchange rate

$$ \mathrm{FX}_{t+1} \sim \mathrm{FX}_t + (\mathrm{M2} - \mathrm{M2}_{US}) + (\mathrm{GDP} - \mathrm{GDP}_{US}) + (\mathrm{IR} - \mathrm{IR}_{US}) + (\mathrm{Inflation} - \mathrm{Inflation}_{US}) + \mathrm{Oil} + \varepsilon_t, $$

where the inflation is a forecast from an ARMA model or from the forward rate, using support vector regression

$$ \hat\beta = \operatorname*{Argmin}_\beta \ \tfrac{1}{2} \|\beta\|^2 + C \sum_i \phi(y_i - \beta' x_i) $$

for a loss function ϕ (with the kernel trick). The resulting model can be made more interpretable using a “dynamic evolution neuro-fuzzy inference system” (denfis), i.e., a set of fuzzy rules of the form “Rule m: if $x_1$ is $R_{m1}$ and . . . and $x_n$ is $R_{mn}$ then $y = \sum_i \beta_{mi} x_i$”, where $R_{m1}$ is a Gaussian membership function

$$ R_{m1} \propto \exp\left( - \frac{(x - c)^2}{2\sigma^2} \right). $$

Do social firms catch the drift? Social media and earnings news V. Bhagwat and T.R. Burch (2013)
Twitter firms (more than 0.1 tweet per day, on average, since the creation of the twitter account – the mapping from firm to twitter account was done manually) have stronger post-earnings announcement drift (PEAD).

A general option valuation approach to discount for lack of marketability R. Brooks (2014)
To price a non-marketable asset, one can model the price of the corresponding marketable asset as the sum of the unmarketable price and an option, e.g., an option to sell the asset at the current market price, or at the best price in the period (if the investor has market timing abilities) or at the average price in the period (European put, lookback put, average-strike put).

Stock market ambiguity and the equity premium P.C. Andreou et al. (2014)
Ambiguity (uncertainty about the probability distribution of future assets) can be measured as the dispersion (standard deviation or mean absolute deviation) of the volume-weighted strike prices of S&P 500 index options. It is a better predictor of future returns than other ambiguity proxies (analysts’ forecasts, stock turnover, press) or other option-implied measures (slope of the volatility smirk, risk-neutral variance, skewness or kurtosis, hedging pressure, i.e., OTM puts / ATM calls), and almost as good as the variance risk premium ($\mathrm{VRP} = \mathrm{Var}_Q X_t - \mathrm{Var}_P X_t$, where $\mathrm{Var}_Q X_t$ is computed from option prices and $\mathrm{Var}_P X_t = \mathrm{Var}_{\text{realized}} X_{t-1}$).

Rationality of survival, market segmentation and the equity premium puzzle Y. Gabovich (2014)
A supply-and-demand model, with young people investing in equities and retirees in bonds, can explain the equity premium puzzle (the difference in returns between bonds and equities is too large to be explained by risk aversion alone).

Do patented innovations affect cost of equity capital? S.P. Hegde and D.R. Mishra (2014)
Cost of equity increases with R&D expenses but decreases with patents: they reduce the uncertainty about the marketability of the research.

Say it again Sam: the idiosyncratic information content of corporate conference calls J. Cicon (2014)
To measure the information content of corporate conference calls, compare the two parts of the call (management presentation and Q&A session), e.g., with cosine similarity, to estimate how much unscripted information is provided.

Management forecasts and the cost of equity capital: international evidence Y. Cao et al. (2014)
The presence of management forecasts lowers the cost of equity.

Credit risk models II: structural models A. Elizalde (2005)
Structural default models model the value of the assets of a company as a geometric Brownian motion; a default occurs when the value of the assets is below some threshold
– at the end of the period;
– at any point during the period;
– or on some sufficiently long subinterval of the forecast period.
These correspond to the Merton model (European options), first passage models (FPM, barrier options) and liquidation process models. Those models do not work well.
To account for correlation and contagion, one can model the value of the assets (of several companies) as diffusions with correlated innovations and correlated jumps.
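The first two default definitions of the structural models above (below the threshold at the end of the period, or at any point during the period) can be compared with a small Monte Carlo sketch. The asset value, barrier, drift and volatility below are hypothetical, and the barrier is only monitored on a discrete grid, so the first-passage probability is slightly underestimated.

```python
import numpy as np

def default_probabilities(v0, barrier, mu, sigma, horizon,
                          n_steps=250, n_paths=20000, seed=0):
    """Simulate geometric Brownian motion paths for the asset value and
    return (P[default at the end of the period], P[default at any grid date])."""
    rng = np.random.default_rng(seed)
    dt = horizon / n_steps
    shocks = rng.standard_normal((n_paths, n_steps))
    # Exact log-space increments of a geometric Brownian motion.
    increments = (mu - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * shocks
    log_paths = np.log(v0) + np.cumsum(increments, axis=1)
    below = log_paths < np.log(barrier)
    merton = float(below[:, -1].mean())              # Merton-style: end of period
    first_passage = float(below.any(axis=1).mean())  # barrier-style: any date
    return merton, first_passage

merton_pd, first_passage_pd = default_probabilities(
    v0=100.0, barrier=70.0, mu=0.05, sigma=0.3, horizon=1.0)
```

Every path that ends below the barrier has also touched it, so the first-passage probability is never smaller than the end-of-period one.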


Asymmetric dependence, tail dependence and the time interval over which the variables are measured B.U. Kang and G. Kim (2014)
Look at the asymptotic behaviour of the n-period copula:

$$ \lambda^{(n)}(q) - \lambda^{(n)}(1 - q) \xrightarrow[n \to \infty]{} 0 $$
$$ \lambda_U^{(n)},\ \lambda_L^{(n)} \xrightarrow[n \to \infty]{} 0 $$

where $\lambda^{(n)}(q)$ is the n-period quantile and $\lambda_U^{(n)}$, $\lambda_L^{(n)}$ the upper and lower tail dependence indices.

Stock and index derivatives and markets J. Spina (2014)
The main option strategies are:
– Dividend spread: profit from investors that fail to exercise their American options in time;
– If you own stocks of your own firm, want to keep the dividends and voting rights, but want some protection: buy OTM puts (you may also want to sell OTM calls to finance the puts);
– Variable prepaid forward (sell the stock you own in the future, but get the money now);
– If you want to bet on the performance of a stock (e.g., your firm), use an option collar, sell ATM puts, buy ATM calls.

Consumption-based asset pricing with rare disaster risk J. Grammig and J. Sönksen (2014)
High equity risk premium may be due to fears of rare disasters.

Optimal hedging with the vector autoregressive model L.T. Gatarek and S. Johansen (2014)
With several assets (3+) and cointegration relations (2+), use portfolio construction tools (e.g., minimum variance) on the set of stationary portfolios.

Classifying as defensive or cyclical a bivariate wavelet analysis perspective J. Bruzda
Given two signals x and y, compute the non-decimated Hilbert coefficients u and v and define

$$ \mathrm{Cov}_j(x_t, y_{t+\tau}) = u_{jt}\, \bar v_{j,t+\tau} $$
$$ \mathrm{Var}_j\, x_t = \mathrm{Cov}_j(x_t, x_t) $$
$$ \mathrm{Cor}_j(x_t, y_t) = \frac{\mathrm{Cov}_j(x_t, y_t)}{\sqrt{\mathrm{Var}_j\, x_t \ \mathrm{Var}_j\, y_t}} $$
$$ \theta_j(x_t, y_t) = \arg \mathrm{Cov}_j(x_t, y_t) $$
$$ \tau_j(x_t, y_t) = - \theta_j(x_t, y_t) \cdot \frac{2^{j+2}}{6\pi} $$
$$ \beta_j(x_t, y_t) = \frac{\mathrm{Cov}_j(x_t, y_t)}{\mathrm{Var}_j\, x_t} $$

with the following names:
– $\mathrm{Re}\, \mathrm{Cov}_j(x_t, y_t)$: wavelet cospectrum;
– $\mathrm{Im}\, \mathrm{Cov}_j(x_t, y_t)$: wavelet quadrature spectrum;
– $|\mathrm{Cor}_j(x_t, y_t)|^2$: wavelet coherence;
– $|\mathrm{Cor}_j(x_t, y_t)|$: wavelet coherency;
– $\theta_j(x_t, y_t)$: phase spectrum;
– $\tau_j(x_t, y_t)$: time delay;
– $|\beta_j(x_t, y_t)|$: wavelet gain.
In R, check waveslim::modwt.hilbert(x,"k3l3",4).

Relative alpha J.C. Jackwerth and A. Slavutskaya
The relative alpha of a hedge fund is the excess performance wrt its peers, where the peers are funds for which the return difference has a small variance. More formally,

$$ \alpha_i = \sum_{j \neq i} w_{ij}\, E[r_i - r_j] $$
$$ w_{ij} \propto k\big( h^{-1}\, \mathrm{Var}[r_i - r_j] \big) $$

where k is a kernel and h a bandwidth (Silverman’s rule of thumb).

Optimality of momentum and reversal X.Z. He et al. (2014)
One can add momentum and reversal to a geometric Brownian motion

$$ \frac{dS}{S} = \big[ \phi m_t + (1 - \phi) \mu_t \big]\, dt + \sigma_S\, dZ_S $$
$$ d\mu_t = \alpha (\bar\mu - \mu_t)\, dt + \sigma_\mu\, dZ_\mu \qquad \text{(Ornstein-Uhlenbeck)} $$
$$ m_t = \frac{1}{\tau} \int_{t-\tau}^{t} \frac{dS_u}{S_u} \qquad \text{(momentum)} $$

and compute the optimal strategy, for a 2-asset universe (stock and bond) and an investor maximizing the expected utility of final wealth. The model can be fitted to historical data (S&P 500, monthly, 1 century).

Factor high-frequency based volatility models K. Sheppard and W. Xu (2014)
Factor model to estimate large realized covariance matrices.

A multivariate model of strategic asset allocation with longevity risk E. Bisetti et al. (2014)
If you use reinsurance as a cheap source of leverage (like Berkshire), do not forget to account for longevity risk – it is negligible for short-term investors, but significant for long-term ones.


Discounting the distant future J.D. Farmer et al. (2014)
If the interest rate (is non-constant and) follows an Ornstein-Uhlenbeck process $dr = -\alpha(r - m)\,dt + k\,dW$, the discount factor $D(t) = E\big[\exp - \int_0^t r(s)\,ds\big]$ satisfies $\log D(t) \sim_{t\to\infty} -(m - k^2/2\alpha^2)\,t$; in particular, the long-term interest rate $r_\infty = m - k^2/2\alpha^2$ is not the long-term average of the interest rate. The interest rate distribution, $r \sim N(m, k^2/2\alpha)$, does not determine $r_\infty$.

Independence of stock markets before and after the global financial crisis of 2007 B.M. Ibrahim and J. Brzeszczyński (2014)
To measure the impact of a variable z on a regression y ∼ x, where x, y, z are time series, assume that the coefficients of the regression are AR(1) processes whose coefficients depend on z,

$$ y_t = (\alpha + \varepsilon_t) + (\beta + \eta_t) x_t + u_t $$
$$ \varepsilon_{t+1} = (a + b z_t)\, \varepsilon_t + v_{t+1} $$
$$ \eta_{t+1} = (c + d z_t)\, \eta_t + w_{t+1}. $$

For instance, x and y could be the returns in two markets and z the difference in turnover or volatility between those two markets.

An empirical investigation of methods to reduce transaction costs T. Moorman (2014)
To reduce transaction costs, define a no-trade zone around the optimal portfolio, using some measure of distance (Euclidean, cosine, etc.) and an “optimal” threshold.

Causal dependency in extreme returns K. Echaust
There is more long memory in return tails: look at the cross-correlation of block maxima and minima.

Option pricing with the logistic return distribution M. Levy and H. Levy
Black-Scholes option pricing formulas can be generalized to the logistic distribution (the distribution whose cdf is the logistic function). A possible justification is that while aggregated log-returns are asymptotically Gaussian, normal market conditions are still far away from this limit.

Forecasting crashes: correlated fund flows and the skewness in stock returns X. Gong et al. (2014)
Correlated demand for liquidity (from mutual funds) explains the negative skewness of stock returns:
– Use mutual fund holdings to estimate how much of the trading of a fund is due to liquidity shocks (inflows and outflows);
– Assume that the stock returns are linearly related to those liquidity shocks;
– Estimate the variance and skewness of the returns, conditional on the liquidity shocks;
– Decompose it into contributions of the variance, covariance, skewness, coskewness of the liquidity shocks.

Non-linear forecasting of energy futures: oil, coal and natural gas G.G. Creamer (2014)
Use Brownian distance correlation

$$ \nu(X, Y) = \| f_X f_Y - f_{XY} \| $$
$$ c(X, Y) = \frac{\nu(X, Y)}{\sqrt{\nu(X)\, \nu(Y)}} $$

as an alternative to Granger causality, and for feature selection (to build leading indicators).

Agent-based models of financial markets E. Samanidou et al. (2007)
Clear review of a few agent-based models of financial markets:
– Rebalancers (targeting $w_{\text{cash}} = w_{\text{stock}} = \tfrac12$) and CPPI investors, reviewing their portfolios (and trading) at random times, each agent estimating the “fair price” from the order book to make her decision, with random deposits and withdrawals;
– Traders with a noisy log-utility (so that the optimal weight of stock and bond is independent of wealth, up to noise), assuming the forward returns will be one of the past k returns (the look-back k is investor-dependent);
– Gamblers, $\text{wealth}_{i,t+1} = \lambda \times \text{wealth}_{it}$, with λ random around 1, and a welfare state ensuring $\text{wealth}_{i,t} \geqslant q \times$ average wealth;
– Percolation: traders are the occupied sites on a lattice of dimension 2 to 7, clusters are groups of traders making the same decision (herding), P[buy] = P[sell] = a, P[sleep] = 1 − 2a, ∆ log price ∝ supply − demand; this gives the correct stylized facts, including log-periodic oscillations (if you add some rationality and hysteria);
– The supply and demand balance between noise traders and fundamentalists drives price changes, and each trader reviews her beliefs based on those price changes.
Only the last two models give credible prices.

Electricity price forecasting: a review of the state-of-the-art with a look into the future R. Weron (2014)
Review of many methods to forecast electricity prices:
– Agent models (Nash equilibrium, multi-agent simulations);
– Jump diffusion model, Markov-switching model;


– Exponential smoothing, regression, AR, SARMA, ARX, TAR, GARCH, etc.; – Neural nets (MLP, recurrent, fuzzy), SVM.

Financial sector tail risk and real economic activity: evidence from the option market M. Neumann
An option-implied measure of tail risk can be computed from the price of deep OTM put options on a financial index (cheaper than it should be because of (implicit) government guarantee) and the constituents of this index.

Q-learning-based financial trading systems with applications M. Corazza and F. Bertoluzzo (2014)
Use reinforcement learning (e.g., Q-learning) to design your automated trading systems.

TrueSkill™: a Bayesian skill rating system R. Herbrich et al.
Generalization of the (already Bayesian) Elo system.

Intrinsic ratings compendium K.W. Regan (2012)
To estimate Elo ratings from moves rather than game outcomes (there are more of them):
– Compute the value of the position before and after each move (with a chess program);
– Assume $P[\text{move}] \propto p_0^{\alpha(\text{value}, \Delta\text{value})}$, where α is a function to estimate;
– Discard openings (first 8 moves), repeated positions, clearly advantageous positions;
– Estimate the Elo rating from α.

Parallel Markov chain Monte Carlo D.N. VanDerwerken and S.C. Schmidler (2013)
Use importance sampling weights to merge parallel MCMC simulations and get a consistent estimator much earlier, even in case of slow mixing (e.g., multimodal or highly correlated distributions).

An introduction to state space models M. Wildi (2013)
Detailed presentation of state space models and the Kalman filter in R, with
– Detailed explanations of the formulas (the Kalman filter is a sequence of Bayesian updates – in particular, a prior is needed);
– The various objective functions for model estimation (look at the out-of-sample residuals $y_t - \hat y_{t|t-1}$, or the in-sample ones, $y_t - \hat y_{t|t}$, and minimize their sum of squares or maximize their likelihood – prefer the out-of-sample maximum likelihood estimator);
– A few examples:
· Regression, where the coefficients are the hidden state, and the covariates form the (time-changing) observation matrix;
· Intervention studies;
· Missing values, dealt with by skipping the observation, $\hat\xi_{t|t} \leftarrow \hat\xi_{t|t-1}$;
· Time-changing AR(1) process;
· Decomposition into trend and SARMA components.
Smoothing was not detailed – there are many applications in signal processing, but few/none in time series analysis.
In R, check the dlm, KFAS, dlmodeler packages.
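For the state space models summarized above, here is a minimal sketch of the Kalman filter as a sequence of Bayesian updates, for the simplest case (a local level model, y_t = x_t + noise, x_{t+1} = x_t + noise). The model, the function name and the diffuse prior are illustrative assumptions; missing observations (None) are handled by skipping the update, and the log-likelihood is built from the out-of-sample residuals.

```python
import math

def local_level_filter(ys, obs_var, state_var, prior_mean=0.0, prior_var=1e6):
    """Kalman filter for y_t = x_t + e_t, x_{t+1} = x_t + w_t.
    Returns the filtered means and the out-of-sample log-likelihood."""
    mean, var = prior_mean, prior_var      # prediction of x_t given y_1..y_{t-1}
    filtered, loglik = [], 0.0
    for y in ys:
        if y is not None:
            resid = y - mean               # out-of-sample residual y_t - yhat_{t|t-1}
            resid_var = var + obs_var
            gain = var / resid_var
            loglik += -0.5 * (math.log(2 * math.pi * resid_var)
                              + resid**2 / resid_var)
            mean += gain * resid           # Bayesian update of the state
            var *= 1.0 - gain
        # Missing value: no update, the filtered state is the predicted state.
        filtered.append(mean)
        var += state_var                   # predict the next state
    return filtered, loglik

filtered, loglik = local_level_filter([1.0, None, 1.2, 0.9],
                                      obs_var=1.0, state_var=0.1)
```

Maximizing `loglik` over `obs_var` and `state_var` would give the out-of-sample maximum likelihood estimator mentioned in the summary.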

A different angle on fitting ROC curves to rating data using the first principal component J.R. Vokey (2014) To estimate a ROC curve from a few points, convert the p-values to a Gaussian (qnorm – the points are likely to be aligned) and take the first principal component (rather than a regression line).
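A minimal sketch of this procedure, assuming the ROC points are given as (false alarm rate, hit rate) pairs: both coordinates are mapped to Gaussian quantiles (the equivalent of R's qnorm) and the line is the first principal component of the transformed points. The function name and the sample points are hypothetical.

```python
from statistics import NormalDist
import numpy as np

def roc_first_pc(false_alarm_rates, hit_rates):
    """Fit a line to the z-transformed ROC points using the first
    principal component; returns (slope, intercept) in z-space."""
    z = NormalDist().inv_cdf
    pts = np.column_stack([[z(p) for p in false_alarm_rates],
                           [z(p) for p in hit_rates]])
    centre = pts.mean(axis=0)
    # The first right singular vector is the first principal direction.
    _, _, vt = np.linalg.svd(pts - centre)
    direction = vt[0]
    slope = direction[1] / direction[0]
    intercept = centre[1] - slope * centre[0]
    return float(slope), float(intercept)

slope, intercept = roc_first_pc([0.1, 0.3, 0.5], [0.4, 0.7, 0.85])
```

Unlike a regression line, the principal component treats errors in both coordinates symmetrically.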

Testing the equality of two positive definite matrices with applications to information matrix testing J.S. Cho and H. White (2014)
Build a test for equality of positive definite matrices using

$$ A = B \iff (\det D)^{1/k} = \frac{1}{k} \operatorname{tr} D = 1, $$

where $D = BA^{-1}$.

From Archimedean to Liouville copulas A.J. McNeil and J. Nešlehová
Archimedean copulas are copulas of d-dimensional ℓ1-norm symmetric distributions, i.e., copulas of random variables $X = R S_d$, where $S_d$ is uniform on the unit simplex $\Delta_d = \{ x \in \mathbf{R}^d_+ : \sum x_i = 1 \}$ and R is an independent nonnegative random variable. Replacing the uniform distribution on $\Delta_d$ with a Dirichlet distribution, $P(x_1, \dots, x_n) \propto \prod x_i^{\alpha_i - 1}$, gives the (non-exchangeable) Liouville copulas.
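The construction X = R·S can be simulated directly. The sketch below draws S from a Dirichlet distribution (all parameters equal to 1 gives the uniform distribution on the simplex) and turns the sample into pseudo-observations with ranks; the Gamma radial distribution and the Dirichlet parameters are arbitrary illustrative choices.

```python
import numpy as np

def sample_simplex_mixture(n, alpha, radial_sampler, rng):
    """Draw n samples of X = R * S, with S ~ Dirichlet(alpha) on the unit
    simplex and R an independent nonnegative radial variable."""
    s = rng.dirichlet(np.asarray(alpha, dtype=float), size=n)
    r = radial_sampler(n)
    return r[:, None] * s

def pseudo_observations(x):
    """Rank-transform each column to (0, 1) to look at the copula only."""
    ranks = x.argsort(axis=0).argsort(axis=0) + 1
    return ranks / (x.shape[0] + 1)

rng = np.random.default_rng(0)
radial = lambda n: rng.gamma(shape=3.0, size=n)   # hypothetical choice of R

x_uniform = sample_simplex_mixture(1000, [1.0, 1.0, 1.0], radial, rng)  # l1-norm symmetric
x_liouville = sample_simplex_mixture(1000, [0.5, 1.0, 3.0], radial, rng)
u = pseudo_observations(x_liouville)
```

With unequal Dirichlet parameters the margins of X differ, and the resulting dependence structure is no longer exchangeable.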


Birds of a feather flock together Local learning of mid-level representations for fine-grained recognition A. Freytag et al. Local learning is similar to local regression, but for arbitrary machine learning algorithms: given a new observation to classify, find the k-nearest ones (k ≫ 1), train a model on them, and use it.
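As a sketch of local learning, with a least-squares linear model standing in for the "arbitrary machine learning algorithm" (an assumption made to keep the example short): find the k nearest training points, fit the model on them only, and use it to classify the new observation.

```python
import numpy as np

def local_predict(x_new, X, y, k=50):
    """Classify x_new (labels 0/1) with a linear model fitted on its
    k nearest neighbours only."""
    dist = np.linalg.norm(X - x_new, axis=1)
    idx = np.argsort(dist)[:k]
    # Least-squares fit of the labels on the neighbours (with an intercept).
    A = np.column_stack([np.ones(len(idx)), X[idx]])
    coef, *_ = np.linalg.lstsq(A, y[idx], rcond=None)
    score = coef[0] + x_new @ coef[1:]
    return int(score > 0.5)

# Hypothetical data: the class is 1 far from the origin, 0 near it,
# which no single global linear model can capture.
X = np.linspace(-3, 3, 601)[:, None]
y = (np.abs(X[:, 0]) > 1).astype(float)
predictions = [local_predict(np.array([v]), X, y) for v in (-2.0, 0.0, 2.0)]
```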


Multivariate Weibull distributions for asset returns I Y. Malevergne and D. Sornette (2004)
Asset returns X can be described by a modified Weibull distribution,

$$ p(x) \propto |x|^{c/2 - 1} \exp\left[ - \left( \frac{|x|}{\sigma} \right)^{c} \right] $$

(perhaps with a different exponent c for x > 0 and x < 0), i.e., some power of the returns, $\operatorname{sign}(X)\, |X|^{\alpha}$, is Gaussian. For the correlation structure, use a Gaussian copula.

Tweedie family densities: methods of evaluation P.K. Dunn and G.K. Smyth
The Tweedie distribution is an exponential dispersion model in which the variance is some power of the expectation

$$ f(y) = a(y, \phi) \exp \frac{y\theta - \kappa(\theta)}{\phi} $$
$$ \operatorname{Var} y \propto E[Y]^p. $$

Special cases include Gaussian (p = 0), Poisson (p = 1), Gamma (p = 2) and inverse Gaussian (p = 3). The pdf can be approximated by inverting the cumulant generating function (it has a simple form) or by an infinite series.
In R, check the cplm and tweedie packages.

Series evaluation of Tweedie exponential dispersion model densities P.K. Dunn and G.K. Smyth (2005)
If $Y \sim \mathrm{ED}_p(\mu, \phi)$ follows a Tweedie distribution with 1 < p < 2, it can be written $Y = X_1 + \cdots + X_N$, with $X_i \sim \Gamma$, $X_i$ iid, N ∼ Poisson. When p is close to 1, the distribution is multimodal (p = 1 gives the Poisson distribution, which is discrete).

Model identification using stochastic differential equation grey-box models in diabetes A.K. Duun-Henriksen et al. (2013)
A white-box model is a system of ODEs (here, pharmacokinetics) based on domain knowledge. A grey-box model is a noisy system of SDEs; it mixes domain knowledge and data. A black-box model is entirely data-driven. For grey-box models in R, check the ctsm package (SDEs with noisy measurements at discrete times, i.e., non-linear Kalman filter).

Approximate bayesian inference for latent Gaussian models by using integrated nested Laplace approximations H. Rue et al. (2009)
One can use nested Laplace approximations (asymptotic expansion of $\int_a^b e^{M f(x)}\, dx$, when M → ∞, if f has a unique maximum $x_0$, using a Taylor expansion of f around $x_0$) to estimate posterior probabilities in latent Gaussian models.
In R, check the INLA package (not on CRAN).

Non-linear causal inference using Gaussianity measures D. Hernández-Lobato et al. (2014)
If x causes y (through a linear model with non-Gaussian noise), the residuals of y ∼ x are less Gaussian than those of x ∼ y.

Ranking the economic importance of countries and industries W. Li et al. (2014)
Country- and sector-level imports and exports (world input-output data, www.wiod.org) define a (weighted, directed) graph, amenable to cascading failure tolerance analysis: check what happens when a node fails – other nodes fail if their revenue drops by a fraction greater than p ∈ [0, 1]. The critical value $p_c$ beyond which most nodes fail measures the centrality of the first failing node.

Interdependencies and causalities in coupled financial networks I. Vodenska et al. (2014)
To identify lead-lag relations from log-return time series, complexify the signals with a Hilbert transform; compute the “complex correlation matrix”; build a directed network using the sign of the phase of the correlation to infer the direction of the edges. One can also look at the principal components (complex PCA) and use random matrix theory (or its resampling-based equivalent, rotational random shuffling, RRS).

SIMoNe: statistical inference for modular networks J. Chiquet et al. (2008)
If you suspect your network has a modular structure, i.e., the adjacency matrix has a block structure, with dense diagonal blocks and sparse off-diagonal blocks.

Maps of random walks on complex networks reveal community structure M. Rosvall and C.T. Bergstrom (2008)
One can describe paths on a graph by using a Huffman code for the outgoing edges of each node – but there is a different code for each node. Instead, one can use a unique code, and encode the target nodes instead of the outgoing edges (Huffman code for the limiting distribution of a random walk on the graph). To identify clusters, use a 2-level code: each cluster has a unique name, each node has a code unique in its cluster, but reused on other clusters. The path is encoded by using the node codes, as long as it remains in the same cluster and, when it changes cluster, an exit code, the name of the cluster, and the code of the new node. The average description length is

$$ P[\text{change cluster}]\, H[\text{cluster codes}] + \sum_{\text{cluster}} P[\text{same cluster}]\, H[\text{node codes in that cluster}] $$

where H is the entropy. To find the optimal partition into clusters, use a greedy agglomerative algorithm, and refine with simulated annealing. Contrary to other clustering algorithms, the edge weights are used. By adding teleportation (à la PageRank), the algorithm can also deal with directed graphs.

Online community detection for large complex networks G. Pan et al. (2014)
To estimate communities online, do not optimize the modularity, but the expected modularity, for some generative model, e.g.:
– Link a new node to a community C with probability proportional to deg C;
– Link two nodes in communities C and C′ with probability proportional to deg C × deg C′.

Nowcasting economic and social data: when and why search engine data fails, an illustration using Google flu trends P. Ormerod et al. (2014)
Search engine data (e.g., Google flu) is less reliable when social influence is high: the Bass diffusion model (fitted between the two low points on both sides of a peak) can help distinguish between independently-motivated and socially-motivated searches.

Do we need hundreds of classifiers to solve real-world classification problems? M. Fernández-Delgado et al. (2014)
Comparison of 179 classifiers (the output is binary), in R, Weka, C, Matlab (no Python) on 121 datasets: prefer random forests and SVM, but do not reject neural networks, boosting, C5.0, avNNet (caret) or ELM (extreme learning machines).

Generating abbreviations using Google Books library V.D. Solovyev and V.V Bochkarev
Abbreviations are prefixes (or substrings) with the same context.

Improved part-of-speech tagging for online conversational text with word clusters O. Owoputi et al. (2013)
TweetNLP is a POS tagger for twitterese; it uses a hidden Markov model (HMM) with Brown clustering to recognize new words and alternate spellings.

Logarithmic-time online multiclass prediction A. Choromanska and J. Langford (2014)
For multiclass classification problems with a large number of classes, arrange the classes in a logarithmic-depth tree, either:
– Known in advance;
– Constructed using class frequencies (Huffman coding);
– Constructed online, recycling orphan nodes when needed.

Selecting influential examples: active learning with expected model output changes A. Freytag et al.
Active learning refers to the algorithms used to decide which unlabeled samples to label next (when labeling is costly and online), i.e., which sample is the most informative. Strategies include:
– Rapid exploration: prefer samples far away from already-labeled ones – but this gives too much emphasis on outliers;
– Maximum uncertainty: label the samples the model is the most uncertain about;
– Maximize the expected model change;
– Reduce the estimated classification error (expected entropy minimization);
– Maximize the expected model output change (EMOC).
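Of the active learning strategies listed in the summary of "Selecting influential examples", maximum uncertainty is the simplest to sketch: given the class probabilities predicted by the current model for each unlabeled sample, pick the sample with the highest entropy. The probabilities below are made up for illustration; EMOC itself is not implemented here.

```python
import numpy as np

def most_uncertain(probabilities):
    """Return the index of the sample whose predicted class distribution
    has the highest entropy, together with all the entropies."""
    p = np.clip(np.asarray(probabilities, dtype=float), 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum(axis=1)
    return int(np.argmax(entropy)), entropy

# Hypothetical model outputs for three unlabeled samples (two classes).
predicted = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]
next_to_label, entropy = most_uncertain(predicted)
```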

Model compression C. Bucilă et al. (2006) To compress an ensemble of thousands of models into a simpler, smaller model, fit the smaller model on the forecasts of the complicated model on a large set of unlabeled examples. If there is no such set, use a synthetic one: add noise to each feature of existing observations, where the amplitude of the noise is the distance to the closest observation (for qualitative variables, replace the value with that of the nearest neighbour with probability p).
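The synthetic-data step can be sketched as follows for numeric features only (the nearest-neighbour swap for qualitative variables is left out). Gaussian noise is an assumption, and the "teacher" is a stand-in function rather than a real ensemble, with a linear least-squares "student" chosen only to keep the example short.

```python
import numpy as np

def synthesize(X, n_new, rng):
    """Create n_new synthetic observations by adding noise to existing ones;
    the noise amplitude is the distance to the closest other observation."""
    X = np.asarray(X, dtype=float)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(d, np.inf)
    nearest = d.min(axis=1)                       # distance to nearest neighbour
    base = rng.integers(0, len(X), size=n_new)    # observations to perturb
    noise = rng.standard_normal((n_new, X.shape[1])) * nearest[base, None]
    return X[base] + noise

def fit_student(teacher, X_synth):
    """Fit a small (here linear) model on the teacher's forecasts."""
    targets = teacher(X_synth)
    A = np.column_stack([np.ones(len(X_synth)), X_synth])
    coef, *_ = np.linalg.lstsq(A, targets, rcond=None)
    return coef

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
teacher = lambda Z: 1.0 + 2.0 * Z[:, 0] - Z[:, 1]   # stand-in for the big ensemble
X_synth = synthesize(X, 500, rng)
student_coef = fit_student(teacher, X_synth)
```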

Do deep nets really need to be deep? L.J. Ba and R. Caruana (2014) Shallow neural networks are as expressive as deep ones: – Learn a deep model; – Use it to create a huge synthetic training set from unlabeled data; – Train a shallow model on the synthetic dataset (to speed things up, use a low-rank approximation of the weight matrix).


Towards automated discovery of artistic influence B. Saleh et al. (2014)
Linguistic models can be used with non-text data: for instance, one can use topic models (latent Dirichlet allocation, LDA) on images, by replacing the words with image features (first identified by the Laplace Harris detector, then reduced to a set of 600 features (“words”) using k-means).

Memory networks J. Weston et al. (2014)
One can combine neural networks with a memory component:
– Convert the input to features;
– Store them in memory (more generally, update the memory from the input features and the current memory);
– Predict the output features from the memory and the input text;
– Generate the output from the output features.
The various components can use your favourite machine learning algorithm (SVM, decision tree, neural net, etc.). The idea can be used on text data, using a bag-of-word approach, e.g., to answer questions about a story (store the raw input, verbatim, sequentially, time-stamped, in memory; compute the memory m1 “closest” to the input question; compute the memory m2 closest to both the input and m1 and with m1 ≺ m2; output a single word).

Modeling interestingness with deep neural networks J. Gao et al. (2014)
Use a deep neural net, with a convolutional and a max-pooling layer (often used in image processing) to model text; train it to forecast followed links, i.e., f(text1, text2) = 1 if text2 was read after text1.

Understanding locally competitive networks R.K. Srivastava et al.
Visualizing the nodes of a neural network with rectified linear units (RLU), maxout (output the maximum of the inputs) or local winner takes all (in a group of nodes, the maximum passes its input unchanged, but the others output 0) (these are often trained with dropout) suggests that the nodes cluster – only part of the network is activated for any given input pattern.

Political ideology detection using recursive neural networks M. Iyyer et al.
A recursive neural net (RNN) is a model that
– Uses a vector representation of the words in the sentence (initialize it with word2vec);
– Combines them, along the parse tree of the sentence (all nodes share the same parameters);
– And outputs the predicted class (here, liberal vs conservative) using a softmax.

Efficient programmable learning to search H. Daumé III et al. (2014)
In structured prediction, the search space can be described by an arbitrary imperative program (or a grammar).
Software to model sequence data (segmentation, tagging in NLP): CRF++, crfsgd, VW.

Dropout: a simple way to prevent neural networks from overfitting N. Srivastava et al. (2014)
When iteratively training a neural net (e.g., with stochastic gradient descent, SGD), forget (drop) a random subset of the nodes at each iteration (and update this smaller set). The resulting neural net can be seen as an ensemble of (all the) $2^n$ sparser subnets.
This can be interpreted as the addition of noise to the hidden nodes; this noise can be integrated out to compute the gradient (for regression, this is similar to ridge regression).

Monte Carlo non-local means: random sampling for large-scale image filtering S.H. Chan et al. (2013)
To denoise a pixel i in an image, take the weighted average of the pixels j, $\sum_j w_{ij} x_j$, where $w_{ij}$ measures the similarity of the regions around i and j (non-local means, NLM); the pixels j can be from the same image or from an image database. To speed up the algorithm, consider:
– Only looking at pixels j close to i;
– Dimension reduction (SVD);
– Effective data structures (kd-trees);
– A low rank approximation of the weight matrix W;
– Random sampling.

Modeling the shape of the scene: a holistic representation of the spatial envelope A. Oliva and A. Torralba (2000)
It is possible to recognize scenes by looking at their texture and structure, without any segmentation (GIST descriptors):
– Presence of textured zones;
– Undulating contours;
– Uncluttered horizon line;
– Size of the major components;
– How straight the horizon line is;
– Parallels and perpendiculars;
– Converging lines.

Evaluation of GIST descriptors for web-scale image search M. Douze et al. (2009)
Useful for duplicated image search.

Random search for hyper-parameter optimization J. Bergstra and Y. Bengio (2012)
Avoid grid search for hyper-parameter optimization: if some of the directions are irrelevant, random search (or a low discrepancy sequence) performs better.
[Figure: two panels of sample points; axes: important parameter, irrelevant parameter.]

Approximate bayesian computation and particle filters D. Prangle (2014)
Maximum likelihood estimation can still be used when the likelihood is not computable:
– Simulate data ysim for many values of the parameter θ;
– Select the θ for which ysim and yobs are the closest (instead of using all of y, you can use some summary statistic).
In a bayesian framework, this likelihood-free method becomes “approximate bayesian computation” (ABC):
  Repeat N times:
    Draw θ from the prior
    Draw ysim | θ from the model
    Accept θ if d(ysim, yobs) < ε.
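The ABC rejection loop can be written in a few lines. The sketch below is a toy example under stated assumptions: the model is a Gaussian with unknown mean, the prior is uniform, the summary statistic is the sample mean and the distance is the absolute difference of the summaries; none of these choices come from the article.

```python
import numpy as np

def abc_rejection(y_obs, prior_sampler, simulator, summary, eps, n_draws, rng):
    """Keep the prior draws whose simulated data are within eps of the
    observed data, as measured on a summary statistic."""
    s_obs = summary(y_obs)
    accepted = []
    for _ in range(n_draws):
        theta = prior_sampler(rng)            # draw theta from the prior
        y_sim = simulator(theta, rng)         # draw y_sim | theta from the model
        if abs(summary(y_sim) - s_obs) < eps: # accept if d(y_sim, y_obs) < eps
            accepted.append(theta)
    return np.array(accepted)

rng = np.random.default_rng(1)
y_obs = rng.normal(2.0, 1.0, size=100)        # synthetic "observed" data
posterior_sample = abc_rejection(
    y_obs,
    prior_sampler=lambda r: r.uniform(-5.0, 5.0),
    simulator=lambda theta, r: r.normal(theta, 1.0, size=100),
    summary=np.mean,
    eps=0.1,
    n_draws=5000,
    rng=rng,
)
```

A smaller ε gives a better approximation of the posterior but fewer accepted draws, which is the degeneracy problem that variants of the algorithm try to address.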

To avoid degeneracy (no samples accepted), repeat until there are M samples (alive ABC). Adding noise makes the estimate consistent (noisy ABC).

For state space models, one can start with a particle filter
  t = 1
  Sample $x^{(i)}$ from the prior
  $w^{(i)} \propto \text{likelihood}(y_t^{\text{obs}} \mid x^{(i)})$
  Increment t, resample, propagate the particles
and replace the likelihood
  Simulate $y^{(i)} \mid x^{(i)}, \theta$
  $w^{(i)} = \mathbf{1}_{d(y_t^{\text{obs}},\, y^{(i)}) < \varepsilon}$

Freeze-thaw bayesian optimization K. Swersky et al. (2014)
Bayesian optimization searches the (global) minimum of an expensive, noisy function (over $[0, 1]^D$). The function is often modeled as a Gaussian process (GP). To choose the next point on which to evaluate the function, one can use:
– The point that gives the best expected improvement

$$ a_{EI}(x) = \big( f(x_{\text{best}}) - \mu(x) \big)_+ $$

(where the positive part is smoothed – they do not use $u_+ \approx \log(1 + e^u)$, but $u_+ \approx u\Phi(u) + \phi(u)$ – and …

Homo economicus and his exact subjective probabilities should be replaced with homo econometricianus, who uses time series models fitted on real data (“adaptive learning”) and macroeconomic data.

… 2σ∆(t). Instead, one can estimate the copula of the returns, and the conditional probabilities (“mispricing indices”)

$$ \mathrm{MI}_t^{X|Y} = P[X_t < x_t \mid Y_t = y_t] $$
$$ \mathrm{MI}_t^{Y|X} = P[Y_t < y_t \mid X_t = x_t] $$

(where $X_t$ are the returns and $x_t$ the realized returns) and trade when $\sum_{s \leqslant t} \mathrm{MI}_s^{X|Y}$ or $\sum_{s \leqslant t} \mathrm{MI}_s^{Y|X}$ exceeds some threshold.

Improving pairs trading T.R. Almeida (2011)
Pairs opening after a 1-sided shock are less profitable.


Liquidity-adjusted price-dividend ratios and expected returns
B.G. Jang et al.

The price-dividend ratio is often assumed to be stationary: it can therefore be defined as a cointegration relation for (log(price), log(dividends)). But evidence suggests that it may not always be stationary: one can try to replace it with a cointegration relation for

(log(price), log(dividends), log(liquidity)),

where the USD trading volume can be used as a proxy for liquidity.

Changes in cash: persistence and pricing implications
J.Z. Chen and P.B. Shane (2010)

The changes in earnings can be decomposed into

∆Earnings = Accruals + ∆FCF
∆FCF = ∆Cash + net distribution to debt holders + net distribution to shareholders.

The change in cash can be decomposed into a normal and an abnormal part,

$$\Delta\text{Cash} = \widehat{\Delta\text{Cash}} + \text{residual}$$

where the forecast $\widehat{\Delta\text{Cash}}$ comes from the model

$$\Delta\text{Cash} \sim \sigma_{\text{industry}}(\text{FCF}) + \frac{\text{assets} - \text{BV(equity)} + \text{MV(equity)}}{\text{assets}} + \Delta\log(\text{assets}) + \Delta\text{FCF} + \frac{\text{working capital} - \text{cash}}{\text{assets}} + \Delta\frac{\text{R\&D}}{\text{assets}} + (\Delta\text{dividends} > 0) + \Delta\frac{\text{debt}}{\text{assets}} + \Delta\frac{\text{cash outflow on acquisitions}}{\text{assets}} + \Delta\text{Cash}_{t-1} + \text{Cash}_{t-1}.$$

Negative abnormal cash changes are persistent, and do not bode well for the company. Positive abnormal cash changes are not persistent, but such hubris is ignored by the market.

Random walks in dividend yields and bubbles
F. Bidian (2014)

A non-stationary dividend yield is not necessarily evidence of a bubble.

Estimating private equity market beta using cash flows: a cross-sectional regression of fund-market paired internal rates of return
Y. Jiang and J. Sáenz (2014)

To estimate the beta of private equity (PE):

– Compute the internal rate of return (IRR) of the PE investments (you need several of them, and you get an IRR for each);
– Compute the corresponding market IRRs (using the same cash flows, but invested in the market, with a non-zero final value);
– Regress.

[I am skeptical about the stability of this procedure: the market IRRs are weighted averages of market returns – they will be very, very similar, so we are almost regressing against a constant...]
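The three steps can be sketched as follows. This is only an illustration under stated assumptions: cash flows are per period and seen from the investor (contributions negative, distributions positive), the IRR is found by bisection, and `market_equivalent_flows` and `ols_slope` are hypothetical helper names, not the authors' code.

```python
def irr(cash_flows, lo=-0.99, hi=10.0, tol=1e-12):
    """Per-period internal rate of return, found by bisection on the NPV."""
    def npv(rate):
        return sum(cf / (1.0 + rate) ** t for t, cf in enumerate(cash_flows))
    f_lo = npv(lo)
    if f_lo * npv(hi) > 0:
        raise ValueError("the NPV does not change sign on [lo, hi]")
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        f_mid = npv(mid)
        if f_lo * f_mid <= 0:
            hi = mid                # the root lies in [lo, mid]
        else:
            lo, f_lo = mid, f_mid   # the root lies in [mid, hi]
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

def market_equivalent_flows(fund_flows, market_returns):
    """Same cash flows, invested in the market; the leftover holding is the final value."""
    holding = 0.0
    for t, flow in enumerate(fund_flows):
        if t > 0:
            holding *= 1.0 + market_returns[t]  # market return over period t
        holding -= flow                         # contributions buy, distributions sell
    flows = list(fund_flows)
    flows[-1] += holding                        # non-zero final value
    return flows

def ols_slope(xs, ys):
    """Slope of the regression of ys (fund IRRs) on xs (market IRRs)."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return num / sum((x - mean_x) ** 2 for x in xs)
```

With one fund this gives a single (market IRR, fund IRR) pair; the regression needs one such pair per fund.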

When the use of positive language backfires: the joint effect of language sentiment, readability and investor sophistication on earnings judgements
H.T. Tan et al. (2013)

When readability is low, sentiment (i.e., word choice) matters, but in opposite directions for naive and sophisticated investors.

Peering inside the analyst “black box”: how do equity analysts model companies?
A. Markou and S. Taylor (2014)

Analysts value firms using the discounted cashflow model (DCF), with explicit cashflow forecasts (subjective, or from subjective forecasts of balance sheet items, or from subjective/historical growth rates thereof). The discount factor (weighted average cost of capital, WACC) is computed from the cost of debt, the cost of equity (equity risk premium, ERP), the tax advantage of debt, and the capital structure – some include preference shares and pension liabilities in the capital structure. The ERP is difficult to estimate – most use the CAPM. The terminal value and the time after which it is reached influence the result. The DCF is complemented by other approaches: EV/EBITDA, P/E, etc.

For more details, check Damodaran’s books.

R&D spillover and predictable returns
Y. Jiang et al. (2012)

R&D spending is an externality:

– Low R&D firms lag high R&D firms in the same industry;
– The surprises are higher for low R&D firms (in the same industry) that are not often found next to the leaders (in news, portfolios or analysts’ reports).

What do we learn from two new accounting-based stock market anomalies?
S. Basu (2004)

Accounting anomalies (buy low NOA, sell short firms receiving a going-concern audit opinion, etc.) are consistent with minimally rational markets: the prices are inconsistent with all investors being rational, but transaction costs are too high for rational investors to make a profit.

The roles of receivables and deferred revenues in revenue management
J. Zha (2014)

To detect manipulations, look at:

– ∆receivables ≫ 0 (fictitious sales, bill-and-hold);
– ∆deferred revenue ≪ 0 (get the cash early, ship the goods late).

Disentangling the accruals mispricing in Europe: is it an industry effect?
E. Basilico and T. Johnsen (2013)

Also look at:

– $\dfrac{\Delta\,\text{accounts receivables}}{\text{net operating assets}}$;
– $\dfrac{\Delta\,\text{inventory}}{\text{net operating assets}}$;
– Days of sales outstanding $= \Delta\dfrac{\text{receivables}}{\text{sales}} \Big/ \dfrac{\text{receivables}}{\text{sales}}$;
– Days of inventory outstanding $= \Delta\dfrac{\text{inventory}}{\text{COGS}} \Big/ \dfrac{\text{inventory}}{\text{COGS}}$.

R in Finance 2014

Packages. Among the packages presented, beyond the usual data.table and Rcpp (with C++11 features), and the omnipresence of shiny:

– The next edition of Modeling Financial Time Series with SPlus ditches SPlus (and will contain a list of must-know R packages);
– eventstudies for the Patell test;
– FlexBayes fits hierarchical models (linear, logistic, Poisson), via MCMC;
– ilmts will provide Hurst exponent estimation and simulation of long (and intermediate) memory processes;
– cds implements the ISDA standard model for CDSs, i.e., the standardized characteristics of the contract and the quoting conventions;
– The pbo package estimates the probability of backtest overfitting.

R engine. Recent or forthcoming performance improvements include:

– Bytecode compilation;
– Shallow duplication (since R 3.1.0);
– Reference counting (perhaps in R 3.2.0);
– Large vectors (> 16 GB);
– Parallelization: explicit parallelization works; C/C++ parallelization via OpenMP only works on Linux; implicit parallelization only works for a few functions;
– The proftools package will soon be available.

Portfolio construction. The PortfolioAnalytics package performs portfolio optimization (with ROI (Rglpk, Rsymphony, quadprog), random portfolios, DEoptim, pso, GenSA), for various types of objectives (mean, value at risk (VaR), expected shortfall (ES), standard deviation, expected utility) and constraints; the types of objective and constraints are apparently limited.

The cccp package will solve cone-constrained convex programs.

One should include taxes (in particular, “tax harvesting” effects in case of losses) in portfolio optimization: it can be very profitable to replace a stock with an equivalent one.

Let $x$ be the (random variable of) asset returns and $\tilde x = (1\ x')'$. The inverse of

$$\Theta = E[\tilde x \tilde x'] = \begin{pmatrix} 1 & \mu' \\ \mu & \Sigma + \mu\mu' \end{pmatrix}$$

is

$$\Theta^{-1} = \begin{pmatrix} 1 + \mu'\Sigma^{-1}\mu & -\mu'\Sigma^{-1} \\ -\Sigma^{-1}\mu & \Sigma^{-1} \end{pmatrix} = \begin{pmatrix} 1 + \text{sharpe}^2 & -w' \\ -w & \Sigma^{-1} \end{pmatrix}$$

where $w = \Sigma^{-1}\mu$ is the Markowitz portfolio. The distribution of $\operatorname{vech}\Theta^{-1}$ is asymptotically Gaussian, and $\operatorname{Var}\operatorname{vech}\Theta^{-1}$ is easy to compute (from $\operatorname{Var}\operatorname{vech}\Theta$): it can be used to perform tests or compute confidence intervals on the Sharpe ratio or the portfolio weights. This can be generalized to constrained portfolios (e.g., the optimal portfolio with zero covariance wrt some reference portfolio).

To check if a signal has some predictive power on future returns, build the top and bottom quintiles, and match them (on industry, size, trading volume): if the effect disappears, it was caused by the imbalance between the portfolios. Coarsened exact matching is implemented in the cem package.

Cointegrated pairs do not remain so (use the egcm package for the tests).

Higher moments. DEoptim can account for higher moments in portfolio optimization. The factor model $r_i = \sum_k b_{ik} f_k + e_i$ (with $E[r_i] = E[f_k] = E[e_i] = 0$, $f_k \perp\!\!\!\perp e_i$, $e_i \perp\!\!\!\perp e_j$ if $i \neq j$) can be used to decompose the higher moments of the portfolio returns.

$$E[(w'r)^2] = \sum_{ijkl} w_i w_j b_{ik} b_{jl} E[f_k f_l] + \sum_i w_i^2 E[e_i^2]$$
$$E[(w'r)^3] = \sum_{\substack{i_1,i_2,i_3 \\ k_1,k_2,k_3}} w_{i_1} w_{i_2} w_{i_3} b_{i_1 k_1} b_{i_2 k_2} b_{i_3 k_3} E[f_{k_1} f_{k_2} f_{k_3}] + \sum_i w_i^3 E[e_i^3]$$
$$E[(w'r)^4] = \cdots.$$

(I leave the fourth moment as an exercise to the reader: contrary to the second and third moments, the cross-products do not all disappear – you should have three more terms.) Those higher-order tensors can be written as matrices using the Kronecker product

$$E[(w'r)^2] = w'(BSB' + \Delta)w$$
$$E[(w'r)^3] = w'(BG(B' \otimes B') + \Omega)(w \otimes w)$$
$$E[(w'r)^4] = w'(BP(B' \otimes B' \otimes B') + Y)(w \otimes w \otimes w).$$
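As a sanity check on the second-moment formula, the sketch below computes E[(w′r)²] both from the index sum and from the matrix form w′(BSB′ + ∆)w and compares them. The variable names are hypothetical: `B` is the matrix of loadings, `S` the factor second-moment matrix, `delta` the diagonal of ∆, and the numbers are made up.

```python
def second_moment_sum_form(w, B, S, delta):
    """Sum over i, j, k, l of w_i w_j b_ik b_jl S_kl, plus sum of w_i^2 E[e_i^2]."""
    n_assets, n_factors = len(B), len(S)
    total = 0.0
    for i in range(n_assets):
        for j in range(n_assets):
            for k in range(n_factors):
                for l in range(n_factors):
                    total += w[i] * w[j] * B[i][k] * B[j][l] * S[k][l]
    return total + sum(w[i] ** 2 * delta[i] for i in range(n_assets))

def second_moment_matrix_form(w, B, S, delta):
    """w'(B S B' + Delta) w, computed through the portfolio factor exposure B'w."""
    n_assets, n_factors = len(B), len(S)
    beta = [sum(w[i] * B[i][k] for i in range(n_assets)) for k in range(n_factors)]
    systematic = sum(beta[k] * S[k][l] * beta[l]
                     for k in range(n_factors) for l in range(n_factors))
    specific = sum(w[i] ** 2 * delta[i] for i in range(n_assets))
    return systematic + specific

# Made-up example: two assets, two factors.
w = [0.6, 0.4]
B = [[1.0, 0.5], [0.8, -0.2]]
S = [[0.04, 0.01], [0.01, 0.09]]
delta = [0.02, 0.03]
gap = second_moment_sum_form(w, B, S, delta) - second_moment_matrix_form(w, B, S, delta)
```

The matrix form only needs the factor exposure B′w, which is why it scales better than the four nested sums.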

Volatility models. The 2-state STAR model

$$y_t = [\lambda_t \phi_1 + (1 - \lambda_t)\phi_2]' \begin{pmatrix} 1 \\ y_{t-1} \\ \vdots \\ y_{t-p} \end{pmatrix} + \varepsilon_t$$

smoothly switches between two AR models, ϕ1 and ϕ2; the state transition function λt = λ(zt) depends (via a logistic link) on exogenous variables zt. It can be extended to a STARMAX model by allowing autoregressive dynamics in the state zt.

One can use the bootstrap to estimate the precision of volatility indices (VIX, vega-weighted VIX (VVIX), liquidity- or elasticity-weighted VIX); check the ifrogs

One can add the volume synchronized probability of informed trading (VPIN) to the Fama-French, momentum and liquidity factors; look at the beta of stock VPIN vs Market VPIN.

Performance measurement. To measure the performance of private equity (PE), use the cash flows to and from the fund, apply them to the benchmark, and look at

$$\operatorname{log1p}(\text{IRR}_{\text{PE}}) - \operatorname{log1p}(\text{IRR}_{\text{Market}}) \quad\text{or}\quad \frac{\text{final value}_{\text{PE}}}{\text{final value}_{\text{Market}}}.$$

Time series. To predict economic recessions, one can:

– Take high-frequency time series (e.g., daily index returns);
– Compute their spectrograms on a moving window (short-time Fourier transform (STFT), sometimes also called Gabor transform); one can notice more low-frequencies in expansion periods, and more mid-frequencies in recessions;

    library(quantmod)
    getSymbols( "^GSPC", from="1980-01-01" )
    x

stream is an R package for machine learning on data streams, focusing on clustering (cluster the data into a large number of micro-clusters, online, then cluster the micro-clusters into macro-clusters, offline). The interface allows for other mining tasks (classification, itemset mining), but there is no implementation yet. The clustering algorithms from moa are available.

1, to increase weight diversity;
– Output a random class (sampled according to the predicted probabilities), instead of the majority one.

Online boosting is similar, but with Pois(λi,m) instead, for observation i and model m, with λi,1 = 1, and λi,m+1 computed from λi,m depending on whether observation i is classified properly by model m.

Online boosting does not perform as well as online bagging, especially in the presence of changes. Instead of separate models, option trees, i.e., trees with option (non-deterministic) nodes (both branches are taken) are more memory-efficient.
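For reference, online bagging as usually described (Oza and Russell) shows each incoming observation to each base model k times, with k drawn from a Poisson distribution, Pois(1) by default. Below is a minimal sketch under that assumption; `RunningCentroid` is a toy, hypothetical base learner (nearest class mean in one dimension) used only to keep the example self-contained.

```python
import math
import random

def poisson(lam, rng):
    """Sample from Pois(lam) with Knuth's multiplication method (fine for small lam)."""
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1

class RunningCentroid:
    """Toy incremental classifier: predicts the class with the nearest running mean."""
    def __init__(self):
        self.count, self.total = {}, {}
    def learn(self, x, label):
        self.count[label] = self.count.get(label, 0) + 1
        self.total[label] = self.total.get(label, 0.0) + x
    def predict(self, x):
        if not self.count:
            return None
        return min(self.count, key=lambda c: abs(x - self.total[c] / self.count[c]))

class OnlineBagging:
    def __init__(self, n_models=5, lam=1.0, seed=0):
        self.models = [RunningCentroid() for _ in range(n_models)]
        self.lam = lam                    # lam > 1 gives more dispersed weights
        self.rng = random.Random(seed)
    def learn(self, x, label):
        for model in self.models:
            for _ in range(poisson(self.lam, self.rng)):  # random weight per model
                model.learn(x, label)
    def predict(self, x):
        votes = {}
        for model in self.models:
            guess = model.predict(x)
            if guess is not None:
                votes[guess] = votes.get(guess, 0) + 1
        return max(votes, key=votes.get) if votes else None

ensemble = OnlineBagging()
for i in range(100):
    ensemble.learn(0.01 * i, 0)          # class 0 lives in [0, 1)
    ensemble.learn(10.0 + 0.01 * i, 1)   # class 1 lives in [10, 11)
```

Online boosting would replace the fixed `lam` with a per-observation, per-model rate updated after each model's prediction.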

Ensembles of classifiers of different sizes (say, 2^n, for various values of n – when the size exceeds the threshold, either delete the root (the oldest node) and all its children except one – or delete everything and start

A general framework for observation-driven time-varying parameter models
D. Creal et al. (2008)

GAS (generalized autoregressive score) models are a unified framework to describe GARCH-like models (observation-driven models, i.e., the time-dependence of the parameters comes from lagged, observed values – as opposed to, say, stochastic volatility)

$$y_t \sim p(y \mid y_1, \dots, y_{t-1}; f_1, \dots, f_{t-1}; x_1, \dots, x_t; \theta)$$
$$f_t = \omega + \sum_i A_i s_{t-i} + \sum_j B_j f_{t-j}$$
$$s_t = S_{t-1} \frac{\partial \log p}{\partial f_{t-1}}$$
$$S_{t-1} = I \quad\text{or}\quad S_{t-1} = E_{t-1}\left[ \frac{\partial \log p}{\partial f_t} \frac{\partial \log p}{\partial f_t}' \right]^{-1}.$$

(This can be seen as a kind of smoothed or shrunk Kalman filter.)

Dancing links
D.E. Knuth (2000)

Dancing links, i.e., the remark that it is easy to remove and add back an element in a doubly-linked list, can be used to efficiently keep track of the current state in depth-first (backtracking) search. For instance, in the exact set cover problem (given a boolean matrix, find a set of rows with exactly one 1 in each column), one can encode the matrix as two doubly-linked lists (rows and columns), and descending in the tree just removes a column and one or several rows. This can be used in many satisfaction problems:

– Packing pentominoes on a chessboard: one row for each possible position of each individual pentomino, one column for each chessboard square, 1 if it is occupied;
– The n-queen problem: one row for each possible queen position, one column for each row, column or diagonal, 1 if it is controlled;
– Sudoku: one row for each possible decision of the form “digit i in (k, l)”, columns for constraints: “there is a number in (k, l)”, “there is an i in row k”, “there is an i in column l”, “there is an i in square m”.

Structural and topological phase transitions on the German stock exchange
A. Sienkiewicz et al. (2013)

The minimum spanning tree (MST) computed from the correlation of stock returns changes with time, sometimes a scale-free network (the degree distribution is a power law), sometimes a star-like network (idem, with one outlier).

Coupling between time series: a network view
S. Mehraban et al. (2013)

Some of the properties of a time series can be read from its visibility graph: the vertices are the timestamps, and there is an edge between i and j if

$$\forall k \in \rrbracket i, j \llbracket \quad y_k < y_i + \frac{k-i}{j-i}(y_j - y_i).$$

It can be generalized to pairs of time series (normalize them both first): edge between i and j if

$$\forall k \in \rrbracket i, j \llbracket \quad y_k \leqslant y_i + \frac{k-i}{j-i}(x_j - x_i)$$

or

$$\forall k \in \rrbracket i, j \llbracket \quad y_k \geqslant y_i + \frac{k-i}{j-i}(x_j - x_i).$$

Will central counterparties become the new rating agencies?
C. Kenyon and A. Green (2012)

Overreliance on central counterparties could be as bad as overreliance on rating agencies: through collateralized trades, they transform credit risk into liquidity risk, but sharp price changes (that exceed the buffer in the margin account) and invalid prices (from previous transactions (we want future transactions) or from a model) still pose problems – seek more diverse price sources.

Network analysis of correlation strength between the most developed countries
J. Miśkiewicz (2012)

Instead of looking at the minimum spanning tree of a correlation matrix, i.e., start with an empty graph and progressively add the most important edges, if they do not create cycles, until the graph is connected, do the opposite: start with a complete graph and delete the least important edges as long as the graph remains connected (the result is not a tree). [The first part of the article is a very confusing estimator of Cor(log x, log y).]

Early prediction of movie box office success based on Wikipedia activity big data
M. Mestyán et al. (2013)

Editor activity of a Wikipedia entry is yet another (crowdsource) data source, after Twitter and Google.

Dynamics of episodic transient correlations in currency exchange rate returns and their predictability
M. Žukovič (2013)

One can look for predictability in financial time series by testing if the correlation $C_{xx}(r)$ or the bicorrelation $C_{xxx}(r, s)$ of the log-returns

$$C_{xx}(r) = E[x_t x_{t-r}]$$
$$C_{xxx}(r, s) = E[x_t x_{t-r} x_{t-s}],$$

estimated on a moving window, is zero.
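The correlation and bicorrelation above are plain lagged product moments, so their moving-window estimates can be sketched in a few lines. This is an illustration only: it assumes the returns `x` have already been standardized within each window, and the function names are hypothetical; the test statistic built from these estimates is not shown.

```python
def lagged_product_mean(x, lags):
    """Sample estimate of E[x_t * x_{t-lag_1} * ...] over the usable part of x."""
    start = max(lags)
    products = []
    for t in range(start, len(x)):
        value = x[t]
        for lag in lags:
            value *= x[t - lag]
        products.append(value)
    return sum(products) / len(products)

def rolling_lagged_product_mean(x, window, lags):
    """The same estimate on each moving window of length `window`."""
    return [lagged_product_mean(x[i:i + window], lags)
            for i in range(len(x) - window + 1)]

# C_xx(r) uses lags=[r]; C_xxx(r, s) uses lags=[r, s].
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
c_xx_1 = lagged_product_mean(alternating, [1])
c_xxx_12 = lagged_product_mean(alternating, [1, 2])
```

The alternating series is perfectly predictable at lag 1, which shows up in the correlation but not in this bicorrelation.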

Deep learning via hessian-free approximation
J. Martens (2010)

Newton’s method to minimize f approximates the objective as $q_\theta(p) = f(\theta) + \nabla f(\theta)'p + \tfrac12 p'Bp$ and minimizes it by solving $\nabla f(\theta) + Bp = 0$. Instead of exactly minimizing this second order Taylor expansion, minimize it approximately by a few conjugate gradient (CG) steps – the whole hessian B is not needed, only a few products Bp are (truncated Newton).

[Figure: gradient descent vs. Newton]

The article also lists a few modifications, useful for neural networks: damping (do not move too far: the quadratic approximation should remain good), backpropagation to compute products with the (Gauss-Newton approximation of the) hessian, mini-batches (large, and constant during the CG runs), termination condition for the CG, no backtracking (in the CG runs, keeping the parameter corresponding to the best value of f, as opposed to the last value, i.e., the best value of qθ), preconditioning (change of variables), start each CG iteration in the direction of the previous one.

A practical guide to training restricted Boltzmann machines
G. Hinton (2010)

Advice on setting the many, many parameters in contrastive divergence (with a concise, but clear explanation).

Deep neural networks for acoustic modeling in speech recognition
G. Hinton et al. (2012)

Review article on deep neural networks, Gaussian-Bernoulli restricted Boltzmann machines (GRBM), contrastive divergence.

High frequency trading and mini flash crashes
A. Golub et al. (2012)

Most mini flash crashes have a regulatory origin (intermarket sweep orders, ISO: the order is executed on the exchange with the best price first, which can deplete the order book (if the exchange is less liquid) and expose stub quotes).

Discrete optimization
P. Van Hentenryck (Coursera, 2013)

Lively overview of several approaches to optimization problems; as in the real world, for the programming assignments you are not told which algorithms will work for a given problem – your first ideas are likely to fail.

0. The first attempt to solve an optimization problem is often a greedy algorithm. For instance, for the traveling salesman problem, you can start at a node, take the nearest neighbour, and iterate; you could also (that is better) start with a 2-element cycle, add a vertex in the best position, and iterate. For the set cover problem (often represented as a binary matrix, one element per column, one set per row), you can take the sets in order, until all the elements are covered; better, you can take the largest sets first; even better, you can take the sets with the largest number of uncovered elements first – you can also simplify the problem, e.g., by removing dominated sets or checking if there are sets that have to be taken (because there is an element that is only in this set).

The knapsack problem can be solved exactly by dynamic programming (let $O_{j,k}$ be the value of the optimal solution with capacity k and items $\llbracket 1, j \rrbracket$) or (depth-first, best-first, etc.) branch and bound (to find a bound, relax the problem by removing the capacity constraint or, better, by replacing x ∈ {0, 1} with x ∈ [0, 1]).

1. Constraint programming keeps track of the set of possible values (“domain”) of each variable, and tries to reduce those sets by “propagating the constraints”. For instance, in the 8-queen problem, “propagating the constraints” means “marking the cells that are no longer available”. It could be called “branch-and-prune”.

From an implementation point of view, a constraint is often an object with two methods, satisfied? and propagate; it only interacts with the domain store – constraints do not interact with one another.

There are dedicated constraint propagation algorithms for each type of constraint. For instance, to propagate an arithmetic constraint over integers $\sum_i a_i x_i = 0$, one can isolate one variable $x_i = -\frac{1}{a_i} \sum_{j \neq i} a_j x_j$ and find a lower and upper bound on $x_i$ from the lower and upper bounds on the other $x_j$ (a few iterations may be needed).

Most constraint programming languages or libraries allow reification (transforming a condition into a binary variable) and the use of decision variables as array index – these make constraint propagation a bit trickier.

Global constraints, e.g., allDifferent, could be expressed with elementary constraints, but if the solver is made aware of them, it can exploit their special structure (e.g., detect infeasibility earlier, or prune more) – sudoku can be solved efficiently in this way. Scheduling problems often use more complicated global constraints.

The domains are usually intervals (bounds consistency), but one could consider arbitrary sets instead (domain consistency – more computationally expensive).

If the problem has a symmetry, symmetry-breaking constraints can reduce the search space to a single coset.


For instance, a balanced incomplete block design (BIBD) is a v×b binary matrix with r ones in each row, k ones in each column, such that the scalar product of any 2 rows is ℓ. (This is used in experiment design and software testing: b new features, v tests, each with k features, each feature appearing r times, each pair of features is tested ℓ times.) The symmetries (row and column permutations) can be removed by requiring that the rows (and columns) be in lexicographic order.
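To make the definition concrete, the sketch below checks the BIBD conditions on a small binary matrix and tests whether its rows and columns are in (decreasing) lexicographic order, which is one way to state the symmetry-breaking constraint. The example matrix is the incidence matrix of the Fano plane (v = b = 7, r = k = 3, ℓ = 1); the function names are hypothetical, and this is a checker, not a constraint-programming solver.

```python
def is_bibd(matrix, r, k, ell):
    """Check: r ones per row, k ones per column, scalar product ell for any 2 rows."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    if any(sum(row) != r for row in matrix):
        return False
    if any(sum(matrix[i][j] for i in range(n_rows)) != k for j in range(n_cols)):
        return False
    for a in range(n_rows):
        for b in range(a + 1, n_rows):
            if sum(x * y for x, y in zip(matrix[a], matrix[b])) != ell:
                return False
    return True

def is_lex_decreasing(vectors):
    """True if the vectors are sorted in non-increasing lexicographic order."""
    vectors = [tuple(v) for v in vectors]
    return vectors == sorted(vectors, reverse=True)

# Fano plane: rows are points, columns are blocks.
blocks = [(0, 1, 2), (0, 3, 4), (0, 5, 6), (1, 3, 5), (1, 4, 6), (2, 3, 6), (2, 4, 5)]
fano = [[1 if point in block else 0 for block in blocks] for point in range(7)]
columns = list(zip(*fano))
```

In a solver, the ordering test would be posted as a constraint between consecutive rows (and columns) rather than checked after the fact.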

Since constraints do not communicate with each other, redundant constraints can speed up the computations: you could add the sum of the constraints, or (better), a linear combination of the constraints, with coefficients 1, α, α², etc.; you could also find some less obvious property of the solution.

For instance, in the car sequencing problem (the cars to build have different options, but at most a out of b consecutive cars can have option o), the cars with a given option cannot be all at the end: there is a minimum number of them in each interval [0, k] – these redundant constraints help the solver realize early that a partial solution is infeasible; it links the capacity and demand constraints.

If you have two very different models, you can give them both to the solver, and link them with constraints (dual modeling); for instance, for the 8-queen problem, there is one queen in each column, and we want to find the corresponding row numbers, but conversely, there is one queen in each row, and we want to find the corresponding column numbers.

The implementation of global constraints often relies on graph algorithms. For instance, for the allDifferent constraint, feasibility can be checked by the existence of a maximum matching in the bipartite graph of variables and values. (Feasibility is a special case of constraint propagation: it either leaves the domains unchanged, or sets them all to ∅.)

When exploring the tree of possible assignments, there are several strategies to choose which variable to instantiate and which value to set it to: often, the most-constrained variable (smallest domain and/or largest number of constraints) leads to earlier failures and more pruning. For instance, for the Euler knight problem, you can start in a corner, and then do the other corners. Instead of choosing the variable and then its value, you can do the opposite, i.e., choose the value and then the variable (e.g., in the perfect square packing problem). Instead of choosing an actual value, you can just reduce the domain, e.g., by halving it – that is a weaker commitment (magic square, car sequencing).

2. For large problems, local search, which only provides a local optimum, is a scalable alternative to exact methods – but you need to pay close attention to the definition of the neighbourhoods, to how you explore them (randomly, exhaustively, heuristically), and implement them as efficiently as possible.

The simplest neighbourhood corresponds to changing the value of a single variable, e.g., for a satisfaction problem (you are trying to minimize the number of breached constraints), choose a variable that appears in the largest number of violated constraints and a new value to minimize the number of violations. The next simplest neighbourhood swaps the values of two variables – in particular, if there is an allDifferent constraint. For instance, in the TSP problem, the 2-opt neighbourhood corresponds to swapping two edges; it can be generalized to 3-opt (remove 3 edges and rewire the nodes to have another tour – contrary to 2-opt, there are many possible rewirings) or k-opt (Lin-Kernighan).

If the number of breached constraints is too coarse a measure of how bad a solution is, you can use the “degree” of violation instead, e.g., in an arithmetic constraint, the difference between the lhs and the rhs.

When using local search to solve a constrained optimization problem, you can solve a sequence of satisfiability problems (e.g., for graph colouring, look for a solution with k colours, and progressively decrease k – you can use the solution with k + 1 colours, with one colour removed, as a starting point for the search for a solution with k colours), or build a sequence of feasible solutions (e.g., maximize $\sum_i |C_i|^2$, where $C_i$ is the set of vertices with colour i, using Kempe chains to move from one feasible solution to another), or explore both feasible and infeasible solutions, by adding a penalty for breached constraints to the objective (e.g., change the colours of one node at a time and minimize $\sum_i 2 b_i |C_i| - \sum_i |C_i|^2$, where $b_i$ is the number of bad edges in colour i – it turns out that, in this example, the local minima are feasible).

Heuristics, e.g., simulated annealing (with restarts, reheats, and perhaps even a tabu list) can help escape local minima.

Tabu search keeps a list of the last k states to avoid visiting them again; instead of the whole state, you can make some property of the state (especially if the state is complex), or the (inverse of the) move (e.g., swapping the values of variables i and j) tabu. There are many variants: aspiration (accept a tabu move if it is better than everything seen so far), intensification (store high-quality solutions and return to them periodically), diversification (when there has been no progress for some time, diversify the current state, e.g., by setting some of the variables to random values), strategic oscillation (change the proportion of time spent in the feasible and infeasible regions), late-acceptance hill-climbing (compare the candidate solution with the solution n steps earlier – only one comparison, and keeping the value of the solutions suffices) etc.

3. Mixed integer programs (MIP) can be solved by branch-and-bound, using a linear relaxation, and branching on the most fractional variables. Reformulating the problem can make the relaxation closer to the actual problem. For instance, the constraint x ≠ y

can be made linear with the big-M transformation,

$$x \leqslant y - 1 + bM$$
$$x \geqslant y + 1 - (1 - b)M$$
$$b \in \{0, 1\},$$

where M is very large, but the relaxation often chooses b = .5, i.e., discards the constraint – in the graph colouring problem, it tells you that you need at least 1 colour. If you use binary colours instead, the constraint becomes $\forall c \;\; b_{i,c} + b_{j,c} \leqslant 1$, and the linear relaxation tells you that you need at least 2 colours.

It is often possible to add constraints to a mixed integer program without changing the set of integral solutions, e.g., with Gomory cuts (they can be computed from the tableau, in the simplex algorithm). More generally, we can add facets of the convex hull of the feasible integral points (polyhedral cuts) as new constraints – only when needed, as the algorithm progresses, because there is a potentially exponential number of them.

For instance, in the graph colouring problem, n colours are needed to colour an n-clique C: we can add a constraint for each clique.

For the traveling salesman problem with a binary variable for each possible edge, one can add constraints of the form $\sum_{i \in C} s_i \leqslant |C| - 1$ to eliminate a cycle C. There is an exponential number of such constraints, but they can be added one by one, when they are breached (finding which constraints are breached is a knapsack problem); when all the constraints are satisfied, the solution is usually still fractional, so branching is needed.

Dually, column generation deals with exponentially many variables. In the cutting stock problem, the task is to cut shelves of lengths ℓ1, . . . , ℓk, from boards of length L, minimizing the number of boards needed. The naive formulation (binary variables xi,n indicating that shelf i is taken from board n) has too many symmetries. Instead, one can look at all possible ways of cutting a board: configurations are of the form c = [n1, . . . , nk], where ni is the number of copies of shelf i; the decision variables are the numbers of configurations of type c, for each type. There is an exponential number of configurations, but finding the one to add is a (simple) optimization problem (knapsack). Branch-and-price uses column generation to find a bound in the branch-and-bound algorithm.

Limited discrepancy search (LDS) uses a heuristic to decide the order in which to explore the tree: assuming that the tree is binary with the left branches corresponding to the heuristic, it follows the heuristic less and less, in waves: first follow the heuristic, then follow it except once, then except twice, etc.

Large neighbourhood search (LNS) is a hybrid of local search and CP/MIP: start with a feasible solution, select a neighbourhood (e.g., fix the value of some of the variables), find the best solution in this neighbourhood (CP is often better for smaller problems), iterate.

Linear and discrete optimization
F. Eisenbrand (Coursera, 2013)

Clear, detailed (but slow) introduction to linear programming, with a few examples in SimPy.

A beginner’s guide to irrational behaviour
D. Ariely (Coursera, 2012)

Presentation of the ideas (behavioural economics) already explained in his books: our relation with money (huge gap between free and $0.01, relations shifting from social to contractual, effect of bonuses on performance); dishonesty (there is a small area in which we are dishonest but not enough to feel so; it increases with the distance to money – worrying in a cash-less society); the moral pendulum (a good action is often followed by one less so); motivation (need for meaning, acknowledgement, involvement (Ikea effect), not-invented-here bias); self-control; herding; etc.

Artificial Intelligence (CS188)
D. Klein (EDX, 2013)

Artificial intelligence (AI) is the set of techniques used to build machines that act rationally i.e., that learn from the past, predict the future, and use this information to maximize their expected utility; it has applications in natural language processing (NLP), video games (how to teach computers how to play PacMan), etc.

1. Many planning problems can be formulated as search problems, either in a search graph (the nodes are the states of the world) or a search tree (the nodes are partial plans). Breadth-first search (BFS) explores the tree by putting the nodes to expand (the fringe) in a FIFO queue; depth-first search (DFS) uses a LIFO stack. There are many variants, for instance, iterative deepening uses DFS with a limited depth, and increases the depth if there is no solution (since most of the time is spent on the deepest layer, it keeps the time-complexity of BFS and returns a low-depth solution, but retains the low-memory usage of DFS). A* uses a heuristic (a lower bound on the length of the path to the goal) to decide in which order to expand the nodes, i.e., it uses a priority queue (the most common implementation errors are: you only know that a path is the best when you remove it from the fringe, not when you put it there; for graph search, checking if the node you want to put in the fringe is already there is not sufficient, you also need to update its cost). The heuristic often comes from a relaxation of the problem.

2. Constraint satisfaction problems are similar (single agent, deterministic actions, fully observed state, discrete state space), but the state is not arbitrary: it is defined by variable assignments. A binary constraint satisfaction problem (each constraint involves two variables) can be represented by a graph, with variables as vertices and constraints as edges (if the constraints are not binary, use a bipartite graph, with two types of nodes, variables and constraints). BFS does not perform well because it only considers actual solutions in the last layer; DFS fares slightly better, but it only realizes it made a mistake when the solution is complete. Backtracking search checks the constraints as early as possible to prune the search tree; it can be improved by changing the variable order (most constrained first), or the value order. It can be improved further by propagating the constraints via arc consistency: (for binary constraints) for each arc in the constraint graph, check if for each value x in the tail there is a valid value y in the head; reprocess the arcs if you remove something from their head. This can be generalized to k-consistency (each assignment of k − 1 variables can be extended to an assignment of all k) – 3-consistency (path consistency) is sometimes used.
The structure of the problem can help: independent sub-problems can be solved independently; if the constraint graph is a tree, process the variables in topological order to make them consistent, and assign them in the reverse order; if the graph is a tree after removing a small number of nodes, do an exhaustive search on those nodes and use the tree algorithm for the others (cutset conditioning); consider “mega-variables” (subproblems), overlapping to ensure consistency, and forming a tree (they should satisfy the running intersection property).
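The arc-consistency step described for binary constraints is usually implemented as the AC-3 algorithm; here is a minimal sketch. The data layout is an assumption made for the example: `domains` maps each variable to a set of values, `neighbours` lists the variables sharing a constraint with it, and `allowed(tail, x, head, y)` says whether the pair of values is compatible.

```python
from collections import deque

def ac3(domains, neighbours, allowed):
    """Prune `domains` in place until every arc is consistent; False if a domain empties."""
    queue = deque((tail, head) for tail in domains for head in neighbours[tail])
    while queue:
        tail, head = queue.popleft()
        # Values of the tail with no compatible value left in the head.
        unsupported = {x for x in domains[tail]
                       if not any(allowed(tail, x, head, y) for y in domains[head])}
        if unsupported:
            domains[tail] -= unsupported
            if not domains[tail]:
                return False
            # The tail shrank: arcs pointing to it (whose head it is) must be rechecked.
            queue.extend((other, tail) for other in neighbours[tail] if other != head)
    return True

# Example: x < y < z, each variable in {1, 2, 3}.
relations = {
    ("x", "y"): lambda a, b: a < b, ("y", "x"): lambda a, b: a > b,
    ("y", "z"): lambda a, b: a < b, ("z", "y"): lambda a, b: a > b,
}
domains = {name: {1, 2, 3} for name in "xyz"}
neighbours = {"x": ["y"], "y": ["x", "z"], "z": ["y"]}
consistent = ac3(domains, neighbours, lambda t, x, h, y: relations[(t, h)](x, y))
```

Here propagation alone solves the problem; in general it only shrinks the domains, and backtracking search finishes the job.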

Iterative improvement algorithms can also be used: local search (e.g., the min-conflict heuristic changes a variable to the value that violates the fewest constraints – it works very well for the n-queen problem, n = 10^7); simulated annealing; genetic algorithms.

3. Adversarial search, i.e., two-player, perfect-information, zero-sum, deterministic games (zero-sum means that there is only one utility function, one player tries to maximize it, the other tries to minimize it) uses the same ideas: the value of a state is the best achievable outcome from that state, against an optimal adversary (minimax); it is sometimes possible to prune the search tree (alpha-beta pruning). It is needlessly pessimistic against a non-optimal adversary (e.g., the PacMan ghosts move randomly). If there are resource limits, go down the tree as far as possible in the allotted time, and use an evaluation function, usually some linear combination of features (if the search is sufficiently deep, it should work well). Since the agent replans at each step, infinite loops (thrashing) are possible (decide on plan A and go left, then decide on plan B and go right, etc.).

4. Expectimax replaces the worst case in Minimax with the average case, i.e., the uncertainty does not come from a perfect adversary, but from chance (equivalently, the adversary acts randomly). The values in the leaves are utilities, and the players maximize their expected utility.

5. In a Markov decision process (MDP), the effects of the agent’s actions are not deterministic. Formally, an MDP is the datum of a set S of states, a set A of actions, transition probabilities $T_{s,a,s'}$, rewards $R_{s,a,s'}$ and a start state. The aim is to maximize the expected (discounted) reward. By introducing chance nodes for (s, a) pairs (q-states), one can use expectimax:

$$V(s) = \max_a Q(s,a) \qquad \text{(max node)}$$
$$Q(s,a) = \sum_{s'} T_{s,a,s'} \bigl( R_{s,a,s'} + \gamma V(s') \bigr) \qquad \text{(chance node).}$$

The corresponding tree is huge and has a lot of redundancies, but we only want an optimal policy S −→ A. The value $V_k(s)$ of a state if the game ends after k steps can be computed iteratively (value iteration); the values $V_k$ converge to V, but the policy usually converges faster. Policy iteration starts with an arbitrary policy, computes the value of the states under that policy, finds the optimal policy for those values (expectimax, 1 step), and iterates until convergence.

6. Reinforcement learning tries to solve an MDP without knowing it:
– Passive reinforcement learning starts with a policy π and learns the corresponding state values (policy evaluation);
– Temporal difference learning directly learns the state values (at each step, update the estimated value of the state you are leaving), but does not give the corresponding optimal policy (the rewards and the transition probabilities are needed);
– Q-learning does the same thing, but with the Q-values; it converges to the optimal policy, even if you do not follow it (off-policy learning).

To explore the state space, one could follow the current policy with probability 1 − ε and act randomly otherwise (ε-greedy) or, better, explore areas whose badness has not been established yet (boost the Q-values by adding k/number of visits).

If there are too many states, approximate them by a set of features, and estimate the Q-values with linear combinations of features.

Semantic hashing
R. Salakhutdinov and G. Hinton (2007)
Deep neural networks (stacked restricted Boltzmann machines (RBM)) whose deepest layer has a small


number of binary variables can model word count problem to a curve fitting problem (for parameter esvectors more accurately than latent semantic analy- timation or goodness-of-fit tests). sis (LSA, i.e., SVD-based low-dimensional approxima1 ∑ tions of the word-document co-occurrence matrix); the Fi = 1xj 0) To study one time series x, one can look at its continto the lasso: it makes variable selection consistent. uous wavelet transform In R, it is implemented in lqa::adaptive.lasso. ( ) ∫ 1 ¯ t−u √ Wx (u, s) = x(t) ψ dt. s s A kernel statistical test of independence A. Gretton et al. To study two time series x, y, one can look at the Covariance is not sufficient to test for independence, cross-wavelet transform but in a universal reproducing kernel Hilbert space (RKHS), it is. “Universal” means dense in Cb (X, R) Wxy (u, s) = Wx (u, s)Wy (u, s) – in practice, you will choose a kernel and hope that it is not too far from universal. The norm of the covarior, separately at the wavelet coherence coefficient ance operator is (“time and frequency correlation”) HSIC = E[k(x1 , x2 )ℓ(y1 , y2 )] + E[k(x1 , x2 )]E[ℓ(y1 , y2 )] −1 S[s Wxy (u, s)] 2 − 2Ex1 ,y1 [Ex2 [k(x1 , x2 )]Ey2 [ℓ(y1 , y2 )]] 2 R (u, s) = 2 2 S[s−1 |Wx (u, s)| ]S[s−1 |Wy (u, s)| ] and it can be estimated as 1 tr(KHLH) where S is some (2-dimensional) smoothing operator, m2 and the wavelet phase where m = number of observations

ϕxy (u, s) = arg Wxy (u, s).

$k_{ij} = k(x_i, x_j)$, $\ell_{ij} = \ell(y_i, y_j)$, $H = I - \frac{1}{m} \mathbf{1}\mathbf{1}'$, and k, ℓ are kernels on X and Y.
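The estimator tr(KHLH)/m² can be sketched in a few lines. This is only an illustration under stated assumptions: Gaussian kernels with a fixed bandwidth of 1 are an arbitrary choice, the function names are hypothetical, and the critical value of the test (from the asymptotic null distribution) is not computed here.

```python
import numpy as np

def gaussian_gram(v, bandwidth=1.0):
    """Gram matrix of a Gaussian kernel (an assumed kernel choice)."""
    v = np.asarray(v, dtype=float).reshape(len(v), -1)
    sq = ((v[:, None, :] - v[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2 * bandwidth ** 2))

def hsic(x, y, bandwidth=1.0):
    """Empirical HSIC: tr(K H L H) / m^2, with H the centering matrix."""
    m = len(x)
    K = gaussian_gram(x, bandwidth)
    L = gaussian_gram(y, bandwidth)
    H = np.eye(m) - np.ones((m, m)) / m
    return np.trace(K @ H @ L @ H) / m ** 2

rng = np.random.default_rng(0)
x = rng.normal(size=200)
dependent = hsic(x, x ** 2)                    # uncorrelated but dependent
independent = hsic(x, rng.normal(size=200))    # independent sample
```

The pair (x, x²) has zero covariance, so it is a case where a covariance-based test would miss the dependence while the kernel statistic is clearly larger than for the independent sample.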

One can compute the asymptotic distribution of this estimator under the null hypothesis $P_{xy} = P_x P_y$ and estimate the critical value for a statistical test (or convert the test statistic to a p-value).

New approaches in visualization of categorical data: R package extracat
A. Pilhöfer and A. Unwin (2013)
Use barcharts overlaid with mosaic plots (rmb) and interactive (iWidgets, iPlots) parallel graphs (cpcp), with the variables and values reordered to reduce overplotting.

Visualizing association rules: introduction to the R-extension package arulesViz
M. Hahsler and S. Chelluboina
Here are a few ways to visualize association rules:
– Plot confidence, support, lift, number of items using two variables as coordinates and a third as colour;
– Plot the rules as a matrix, with the antecedent (lhs) on the x axis and the consequence on the y axis, reordered with the seriation package and/or clustered with the Jaccard distance
$$d(X_i, X_j) = 1 - \frac{|X_i \cap X_j|}{|X_i \cup X_j|};$$
– Graphs (with itemsets, or items, as nodes);
– Parallel plots, etc.

GillespieSSA: Implementing the stochastic simulation algorithm in R
M. Pineda-Krch (2008)
Population dynamics (predator-prey, SIR, etc.) are often modeled as coupled ordinary differential equations (ODE), but this is only valid for large populations. For small populations, a stochastic model is needed – stochastic simulation algorithms (SSA) are a generalization of the simulation of a Poisson process, modeling the state changes (alive/dead, susceptible/infected/resistant) of the individuals; they can be exact or (faster) approximate.

GrassmannOptim: an R package for Grassmann manifold optimization
K.P. Adragano et al. (2012)
Gradient-based optimization can be used on manifolds other than $R^n$ – for instance, many constrained optimization problems looking for a set of orthogonal vectors are unconstrained optimization problems on the Grassmannian (e.g., the first eigenvectors, independent components, or other dimension reduction problems).

Graphical models with R: Tutorial
S. Højsgaard (2012)
The gRbase, gRain, gRim packages provide functions to build, use, fit graphical models (for discrete variables, i.e., log-linear models).


An improved algorithm for matching large graphs
L.P. Cordella et al.
To find an isomorphism between two graphs, grow an isomorphism between subgraphs (adding edges outgoing from the subgraph, or incoming if there are no outgoing edges, or other vertices if there are no edges to/from the subgraph), backtracking when needed. The VF2 algorithm is implemented, for instance, in Python’s NetworkX module.

McKay’s canonical graph labeling algorithm
S.G. Hartke and A.J. Radcliffe
To test if two graphs are isomorphic, one could try all possible permutations, or the permutations (of the vertices) that preserve some vertex invariant, such as the degree (or the centrality, etc.). The degree is too coarse an invariant, but it can be propagated: first, partition the vertices by degree; then, for each node, look at the number of edges to each part of the partition – this gives a finer partition; iterate until the partition stabilizes (equitable partition). When you reach an equitable partition, split by removing an element (nondeterministically) from one of the parts (deterministically). The smallest leaf of the resulting search tree (the partitions are ordered) is a canonical isomorph of the graph. The algorithm is implemented in nauty. It is not known whether graph isomorphism is NP-complete.

Clustering with qualitative information
M. Charikar et al. (2004)
Here are some algorithms to cluster the vertices of a graph whose edges are labeled “agree” or “disagree”:
– Take a binary decision variable $x_{ij}$ for each edge, with $x_{ij} = 0$ if i and j are in the same cluster; the condition $x_{ij} = x_{jk} = 0 \implies x_{ik} = 0$ can be written $x_{ik} \leqslant x_{ij} + x_{jk}$; minimize
$$\sum_{i,j \text{ agree}} x_{ij} + \sum_{i,j \text{ disagree}} (1 - x_{ij});$$
truncate the solution of the linear relaxation at 2/3 and 1/3, one cluster at a time;
– Associate a vector $v_i$ to each vertex: maximizing
$$\sum_{i,j \text{ agree}} v_i \cdot v_j + \sum_{i,j \text{ disagree}} (1 - v_i \cdot v_j)$$
subject to $v_i \cdot v_i = 1$ and $v_i \cdot v_j \geqslant 0$ is a semi-definite relaxation of the previous problem.

On the identifiability of the post-nonlinear causal model
K. Zhang and A. Hyvärinen (2009)
The PNL (post-nonlinear) model $x_2 = f_2(f_1(x_1) + e_1)$ is not identifiable in general (e.g., in the Gaussian case) but is, in many special cases (e.g., if $f_1$ is not invertible). It is not directly usable beyond two variables: first use conditional independence tests to find the Markov equivalence class of directed acyclic graphs (DAG), then use the PNL model two variables at a time, to identify the unresolved causal relations.

Dynamic decentralized any-time hierarchical clustering
H.V.D. Parunak et al.
Ant clustering (each ant picks up items and drops them when it is close to similar items) can be seen as a distributed variant of k-means. It can be generalized to produce a hierarchical clustering: start with a random tree (with the data in the leaves) and move nodes (if they are too dissimilar to their siblings) or merge them (if their children are very dissimilar). (This is just a local search algorithm, but an easy-to-parallelize one.)

Parable: a parallel random partition based hierarchical clustering algorithm for the MapReduce framework
S. Wang and H. Dutta
To cluster a large dataset on Hadoop:
– Split the data at random and compute a hierarchical clustering on each node;
– Choose one of the trees as a template;
– Align all the other trees to it: swap two children of the root if doing so makes them more similar to the two children of the root of the template, continue recursively with the other nodes;
– Output the resulting tree: the shape is that of the template, but the leaves contain sets of observations rather than individual observations.

Probabilistic latent variable models for distinguishing between cause and effect
J.M. Mooij
The direction of a causal relation between two random variables X and Y can be inferred by comparing the two models
[Diagram: two graphs on X, Y and a noise variable E, one for each causal direction]
(E is not observed) using Bayesian model selection.

A quick and gentle guide to constraint logic programming via ECLiPSe
A. Niederliński (2011)
The book starts with a clear description of the differences between constraint logic programming (CLP) and its ancestor, Prolog: they are syntactically similar, they explore the same tree, but CLP prunes it more efficiently. Prolog only gives up a branch (a partial instantiation, or grounding, of the decision variables)

when a constraint (involving ground variables) is violated. CLP keeps track of the domain of each variable (usually, as an interval of integers – but it could also be an arbitrary finite set of integers) and, after each instantiation, propagates the constraints – for instance, if you know that x + y = z, x, y ∈ ⟦0, 3⟧, z ∈ ⟦5, 9⟧, you can deduce that x, y ∈ ⟦2, 3⟧, z ∈ ⟦5, 6⟧ – this prunes a lot of branches. In addition, CLP offers more choice for the search strategy: branch-and-bound (Prolog is designed for satisfaction problems and requires dirty, non-declarative tricks), variable choice heuristic (e.g., most constrained), value choice heuristic (e.g., minimum, maximum, middle, random), random restarts, timeout, etc. CLP also differs from operations research (OR), which is limited to continuous and binary variables. The rest of the book is a long list of examples for the ECLiPSe open source CLP system, but most of the language features are used without any explanation:
= < > =< >= is ; , not => #=< #>= :: .. &= &:: ~ ! -> =\= $= $< $> $=< $>= fail retractall assert labeling findall minimize bb_min search indomain ic_symbolic:indomain ground suspend delete
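The interval propagation in the x + y = z example can be reproduced with a short sketch. It is written in Python rather than ECLiPSe, it is not how ECLiPSe implements propagation, and the function name `propagate_sum` is hypothetical; it only illustrates bounds propagation for a single constraint.

```python
def propagate_sum(x, y, z):
    """Shrink integer intervals (lo, hi) until x + y = z is bounds-consistent."""
    while True:
        # z lies between the sums of the bounds of x and y ...
        new_z = (max(z[0], x[0] + y[0]), min(z[1], x[1] + y[1]))
        # ... and x = z - y, y = z - x give bounds in the other direction.
        new_x = (max(x[0], new_z[0] - y[1]), min(x[1], new_z[1] - y[0]))
        new_y = (max(y[0], new_z[0] - new_x[1]), min(y[1], new_z[1] - new_x[0]))
        if any(lo > hi for lo, hi in (new_x, new_y, new_z)):
            raise ValueError("the constraint cannot be satisfied")
        if (new_x, new_y, new_z) == (x, y, z):
            return x, y, z          # fixed point: nothing left to prune
        x, y, z = new_x, new_y, new_z

x, y, z = propagate_sum((0, 3), (0, 3), (5, 9))
```

Starting from x, y ∈ ⟦0, 3⟧ and z ∈ ⟦5, 9⟧, the fixed point is x, y ∈ ⟦2, 3⟧ and z ∈ ⟦5, 6⟧, as in the text.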

Similarity evaluation on tree-structured data
R. Yang et al. (2005)
The edit distance between trees (number of relabeling, delete and insert operations needed to transform one into the other) can be computed with dynamic programming. Alternatively, one can use the string edit distance between the preorder (or post-order) traversal sequences, or various histogram distances (height, degree, label). Here is another histogram: transform the tree into a binary tree, make it complete, and consider the histogram (empirical distribution) of the binary branches, i.e., a node a with its two children b and c (they can be seen as 3-grams for binary trees).

OpenTURNS reference guide
OpenTURNS is a C++ reliability library, often used from Python, i.e., a list of methods used by engineers to propagate uncertainty in their computations and identify the main sources of those uncertainties.

First, the input variables can be described, as a joint probability distribution, using empirical cumulative distribution functions (ecdf), kernel smoothing, correlation, rank correlation, common parametric distributions, copulas, mixtures (via the Fourier transform), classical or Bayesian estimators, tests and plots for goodness of fit, etc.

Then, one propagates uncertainty: either as [min,max] intervals, analytically, or by solving optimization problems, or by simulations with quasi-random numbers (low discrepancy sequences, or Latin squares (design of experiments) – Latin squares are fixed-length multivariate low discrepancy sequences); or as location-dispersion pairs, or quantiles, estimated by Monte Carlo simulations or Taylor expansions. To approximate the probability of some event [g(X) ⩽ 0], transform X to make it rotationally invariant (isoprobabilistic (Nataf, Rosenblatt) transformation – if $X \sim N(\mu, C'C)$, just use $X \mapsto C^{-1}(X - \mu)$) and approximate the event [g(X) ⩽ 0] with a half-plane (FORM: first order reliability method) or a ball (SORM). The probability can also be approximated with Monte Carlo simulations.

[Figure: the event g(X) ⩽ 0, the reliability index, and the FORM and SORM approximations]

Finally, the importance of each input variable on the uncertainty of the output can be estimated, for instance, if Z = h(X), with a Taylor expansion
$$\operatorname{Var} Z \approx \nabla h(\mu_X) \, \operatorname{Var} X \, \nabla h(\mu_X)' = \sum_i \nabla_i h(\mu_X) \, (\operatorname{Var} X)_{i\cdot} \, \nabla h(\mu_X)'$$
or, simply, with Cor(Z, Xi), lm(Z~Xi), pCor(Z, Xi), rank correlation, ∂RI/∂θ, etc. Instead of the Taylor expansion, given $h : R^n \to R^m$, $X \mapsto Y = h(X)$, one can use the functional chaos expansion: the expansion of h over a basis of orthogonal polynomials (“polynomial chaos”) wrt $f_X$ (only keep the “first” coefficients – since there are many indices, you can tweak the notion of “first”, e.g., using the $L^p$ pseudonorm, for p < 1, of the indices).

LOF: Identifying density-based local outliers
M.M. Breunig et al. (2000)
One can use the ideas behind density-based clustering (dbscan) to define the degree of outlier-ness of a

point p,
$$\mathrm{lof}_k(p) = \frac{1}{k} \sum_{q \in N_k(p)} \frac{\ell_k(q)}{\ell_k(p)}$$
$$\ell_k(p) = \Bigl( \frac{1}{k} \sum_{q \in N_k(p)} d_k(p, q) \Bigr)^{-1} \qquad \text{(local reachability density)}$$
$$d_k(p, q) = \max\{ d_k(q), d(p, q) \} \qquad \text{(reachability distance)}$$
$$d_k(p) = \text{distance to the kth nearest neighbour.}$$
Being local, the method still works if the data has clusters of different densities. Fraud detection is the most prominent application of outlier detection.

A scalable and efficient outlier detection strategy for categorical data
A. Koufakou et al. (2007)
Here are a few ideas to detect outliers in categorical data:
– Greedy algorithm to identify observations whose removal reduces the entropy the most;
– Observations containing few frequent item sets (FIS)
$$\mathrm{fpof}(x) \propto \sum_{F \in \mathrm{fis},\ F \subset x} \mathrm{Support}(F), \qquad \mathrm{fis} = \{ I : |\mathrm{Support}\, I| > n \};$$
– Observations containing many infrequent itemsets
$$\mathrm{Otley}(x) = \sum_{I \subset x,\ |\mathrm{Support}\, I| \leqslant n} \frac{1}{|I|};$$
– Observations containing rare values for most variables (i.e., consider the variables separately)
$$\mathrm{avf}(x) = \frac{1}{m} \sum_i f(x_i)$$
where $f(x_i)$ is the proportion of points in which the ith variable has value $x_i$.

Frequent pattern growth (FP-growth) algorithm. An introduction
F. Verhein (2008)
The apriori algorithm for frequent item set (FIS) mining
$$L_1 = \{\text{frequent items}\} = \{ \{x\} : \mathrm{Support}\, \{x\} \geqslant n \}$$
$$C_{k+1} = \{ a \cup \{x\},\ a \in L_k,\ \{x\} \in L_1,\ x \notin a \}$$
$$L_{k+1} = \{ c \in C_{k+1} : \mathrm{Support}\, c \geqslant n \}$$
is expensive (double loops to compute the candidates $C_k$ and the frequent k-item sets $L_k$).

The FP-growth algorithm processes the data into an FP-tree:
– Compute the support of each item;
– For each transaction, sort the items by decreasing support;
– Build the corresponding prefix tree, storing the support of each prefix in the nodes, with links between identical items.

[Figure: example FP-tree on the items a, b, c, d, e, with the support of each prefix in the nodes]

Using the FP-tree to extract frequent itemsets is trickier:
– Start with the least frequent item;
– If it is sufficiently frequent, consider the subtree it defines, update the counts for the conditional FP-tree and discard the leaves (the item itself);
– Process this conditional FP-tree recursively;
– Discard the item and iterate with the next least frequent item, until the tree is empty.

Efficient implementations of Apriori and Eclat
C. Borgelt (2003)
The frequent itemset (FIS) mining algorithms Apriori and Eclat both explore the same prefix tree

[Figure: prefix tree of the itemsets on {a, b, c, d, e}, from the single items a, b, c, d, e down to abcde]

breadth-first and depth-first, respectively. Eclat can be implemented by storing the transactions as rows in a sparse bitmatrix and by using the columns to compute the intersections.
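The level-wise Apriori recursion (L1, then candidates C_{k+1} and frequent sets L_{k+1}) can be sketched as follows. This is a deliberately naive illustration: the function name `apriori` and the toy baskets are assumptions, supports are recomputed by scanning all transactions, and none of the prefix-tree or bitmatrix optimizations mentioned above is used.

```python
def apriori(transactions, n):
    """Return {itemset: support} for all itemsets with support >= n."""
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        # Number of transactions containing the itemset.
        return sum(itemset <= t for t in transactions)

    items = {x for t in transactions for x in t}
    level = {frozenset([x]) for x in items if support(frozenset([x])) >= n}  # L1
    singletons = level
    frequent = {}
    while level:
        frequent.update({s: support(s) for s in level})
        # C_{k+1}: extend each frequent k-itemset with a frequent item not in it.
        candidates = {a | s for a in level for s in singletons if not s <= a}
        # L_{k+1}: keep the candidates that are frequent enough.
        level = {c for c in candidates if support(c) >= n}
    return frequent

# Hypothetical toy data.
baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = apriori(baskets, 3)
```

With a threshold of 3, the three items and the three pairs are frequent, but {a, b, c} (support 2) is not. The repeated scans in `support` are exactly the cost that FP-growth and Eclat try to avoid.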

Simple algorithms for frequent itemset mining
C. Borgelt (2010)
The split-and-merge algorithm for frequent itemset mining preprocesses the data:
– Compute the item frequencies;
– Remove the items below the desired threshold;
– Sort the items in each transaction by increasing frequency;
– Aggregate identical transactions (keep track of their count).
The dataset is then processed recursively (divide and conquer):
– Take the least frequent item;
– Recursively process the itemsets that do not contain it, by removing it from the transactions and reaggregating them (some may become equal);
– Recursively process the transactions that contain it: discard the other transactions, and remove the item from the remaining transactions.

A comprehensive assessment of methods for de novo reverse engineering of genome-scale regulatory networks
V. Nerendra et al. (2010)
Algorithms to identify (undirected) causal relations include:
– Relevance networks: add an edge between variables with a significant mutual information
$$\mathrm{MI}(x, y) = \iint p(x, y) \log \frac{p(x, y)}{p(x) p(y)} \, dx \, dy$$
(use a kernel estimator for the densities, and change the order of the nested loops to avoid recomputing the 1-dimensional densities p(x), p(y) each time);
– Add an edge between variables with a mutual information above some threshold;
– Prune an MI-derived graph by removing the edge with the lowest mutual information in each triangle (ARACNE);
– Methods based on higher order mutual information
$$H(X) = -\int p(x) \log p(x) \, dx$$
$$H(X, Y) = -\iint p(x, y) \log p(x, y) \, dx \, dy$$
$$H(X, Y, Z) = -\iiint p(x, y, z) \log p(x, y, z) \, dx \, dy \, dz$$
$$H(X; Y, Z) = H(X) + H(Y, Z) - H(X, Y, Z)$$
to test if Y → X ← Z;
– Add an edge between variables with a significant correlation;
– Add an edge between variables with a significant partial correlation; the partial correlations can be computed from the concentration matrix (the inverse of the variance);
– The graphical lasso, i.e., an L1-penalized estimator of the concentration matrix;
– Limited partial correlation (qp-graphs);
– Hierarchical clustering;
– Local methods.

Fast calculation of pairwise mutual information for gene regulatory network reconstruction
P. Qiu et al.
To compute the mutual information matrix
$$\mathrm{MI}_{k\ell} = \iint p_{k\ell}(x, y) \log \frac{p_{k\ell}(x, y)}{p_k(x) p_\ell(y)} \, dx \, dy$$
one often uses a kernel estimator of the probability densities, which gives something like $\mathrm{MI}_{k\ell} = \sum_i \sum_j \log f_{ijk\ell}$. This is often implemented with nested loops (for k ... for l ... for i ... for j ...), but one can change the order of the loops (to i, j, k, ℓ) and factor out some of the computations (bits of the kernel estimators of $p_k(x)$, $p_\ell(y)$, which were recomputed each time).

A robust procedure for Gaussian graphical model search from microarray data with p larger than n
R. Castelo and A. Roverato (2006)
Relevance networks infer the structure of a graphical model by testing if $\mathrm{Cor}(X_i, X_j) = 0$, but this ignores confounders, e.g., if B ← A → C, then Cor(B, C) ≠ 0. Instead, one can look at the conditional independences, which can be tested with partial correlations
$$\mathrm{Cor}(X_i, X_j \mid X_k, k \neq i, j)$$
or, equivalently, the zero pattern of the concentration matrix (the inverse of the variance). When it cannot be estimated, e.g., if p ≫ n, one can use the limited partial order correlations instead, $\mathrm{Cor}(X_i, X_j \mid X_{k_1}, \dots, X_{k_q})$. The non-rejection rate is the proportion of subsets $\{k_1, \dots, k_q\}$ such that $H_0 : \mathrm{Cor}(X_i, X_j \mid X_{k_1}, \dots, X_{k_q}) = 0$ is not rejected (using the t test for zero regression coefficients) – if q is large, use a random subset of the $\binom{p}{q+2}$ subsets. The qp-procedure produces the graph whose edges have a non-rejection rate below some threshold. To choose q and assess its adequateness, look at the histogram of the non-rejection rate and the qp-clique plot:
  plot( max_clique_size ~ threshold | q, cex = number_of_edges )
  abline( h = min(p,n) )
(you want enough edges to capture all the information, but a small maximum clique size).
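The link between full-order partial correlations and the concentration matrix can be illustrated with a small simulation of the confounder example B ← A → C. This sketch assumes n > p, so that the sample covariance can be inverted; it does not implement the qp-procedure itself, and the function name `partial_correlations` is hypothetical.

```python
import numpy as np

def partial_correlations(data):
    """Partial correlation of each pair given all other variables (needs n > p)."""
    precision = np.linalg.inv(np.cov(data, rowvar=False))  # concentration matrix
    scale = np.sqrt(np.diag(precision))
    pcor = -precision / np.outer(scale, scale)
    np.fill_diagonal(pcor, 1.0)
    return pcor

# Confounder: B <- A -> C, with no direct link between B and C.
rng = np.random.default_rng(0)
a = rng.normal(size=5000)
b = a + rng.normal(size=5000)
c = a + rng.normal(size=5000)
data = np.column_stack([a, b, c])

marginal = np.corrcoef(data, rowvar=False)[1, 2]   # Cor(B, C)
partial = partial_correlations(data)[1, 2]         # Cor(B, C | A)
```

The marginal correlation between B and C is close to 0.5, while the partial correlation given A is close to zero, which is the zero entry one looks for in the concentration matrix.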

Local causal and Markov blanket induction for causal discovery and feature selection for classification
C.F. Aliferis et al. (2010)
Review of a few algorithms to infer local causality in a graphical model, viz computing the parents and children (PC) of a node X ∈ V and its Markov blanket (MB) – a minimal element of
$$\{ A \subset V \setminus \{X\} : \forall\, Y \in V \setminus (A \cup \{X\}) \quad X \perp\!\!\!\perp Y \mid A \}.$$
For a faithful graph (all dependence and independence relations come from the graph),
$$\mathrm{PC}(X) = \{ Y \in V : \forall\, Z \subset V \setminus \{X, Y\} \quad X \not\perp\!\!\!\perp Y \mid Z \}$$
and MB(X) is uniquely defined and contains parents, children and children’s parents (spouses) of X. By stitching the PC(X) for all X, one can estimate the whole (global) graphical model. Local causality learning can also be used for feature selection.

Hash kernels for structured data
Q. Shi et al. (2009)
The idea behind the count-min sketch (use a hash table, with several hash functions, and ignore collisions) can be used to approximate scalar products in high-dimensional spaces (e.g., after the “kernel trick”) or to compress sparse feature vectors (in text processing)
$$k(x, y) = \langle \phi(x), \phi(y) \rangle \approx \langle \bar\phi(x), \bar\phi(y) \rangle, \qquad \bar\phi_j(x) = \sum_{i \,:\, h(i) = j} \phi_i(x).$$
If the features are duplicated, the drop in performance is not that bad. Hashing can be combined with random sampling – for instance, if the features are the number of subgraphs of size k in each isomorphism class. This idea is implemented in Vowpal Wabbit.

New methods for separating causes from effects in genomics data
A. Statnikov et al. (2012)
Finding the direction of a causal link from observational data alone (no intervention analysis) looks hopeless: in terms of graphical models, A → B and A ← B are in the same Markov equivalence class. However, comparing the complexities of both models, i.e., of the decompositions of P(A, B) into P(A)P(B|A) or P(B)P(A|B), or checking if the conditional distribution is constant, i.e., B|A ⊥⊥ A, can sometimes give the answer – many algorithms have been proposed. Also check the “Cause-effects pairs” Kaggle competition.

A linear non-Gaussian acyclic model for causal discovery
S. Shimizu et al. (2006)
Variables $x_1, \dots, x_n$, related by cause/consequence relations described by a directed acyclic graph (DAG), and observed with a non-Gaussian, additive noise ε, can be modeled as x = Bx + ε, where B is strict lower triangular. But since $x = (I - B)^{-1} \varepsilon$, independent component analysis (ICA) can recover B = I − ICA(x), up to a permutation of the rows and columns. The resulting DAG can be pruned with statistical tests for the strength of the causal relations.

Distinguishing causes from effects using non-linear acyclic causal models
K. Zhang and A. Hyvärinen (2007)
The LiNGAM algorithm (ICA-based causality discovery) can be extended to allow for non-linearities
$$x_{k+1} = f_{k+1}\bigl( g_{k+1}(x_1, \dots, x_k) + \varepsilon_{k+1} \bigr),$$
which can be reformulated as x = f(Bs) with B strict lower triangular. In the case of two variables, one can find $g_1$ and $g_2$ so that $x_1$ be as independent as possible from $g_2(x_2) - g_1(x_1)$, e.g., by minimizing their mutual information with a multilayer perceptron (MLP) and using a statistical test to check the independence.

Nonlinear causal discovery with additive noise models
P.O. Hoyer et al.
Given a non-linear relation with additive non-Gaussian noise y = f(x) + ε, plot the density of y|x for several values of x: it should be the same curve, up to translations. This property is unlikely to hold for x|y and strengthens the belief that x → y.

[Figure: densities of y|x and x|y for several values of the conditioning variable]

The actual test checks that x ̸⊥⊥ y and res(y ∼ f(x)) ⊥⊥ x, where y ∼ f(x) is a non-linear regression and res(·) its residuals.

Inferring deterministic causal relations
P. Daniušis et al. (2012)
Causal inference often relies on the assumption that the noise is additive and non-Gaussian. This breaks down in the case of a deterministic relation Y = f(X), especially if f is invertible. However, one can expect X ⊥⊥ f and Y ̸⊥⊥ f, where ⊥⊥ denotes algorithmic independence – the shortest description of p(x, y) is a separate description of p(x) and f.

Given a family of distributions E well-suited to model a probability density p, one can measure the complexity of p as
$$D(p \| E) = \min_{q \in E} D(p \| q),$$
where
$$D(p \| q) = \int p(x) \log \frac{p(x)}{q(x)} \, dx$$
is the Kullback-Leibler divergence. The information-geometric causal inference (IGCI) method computes
$$C_{X \to Y} = D(p_X \| E_X) - D(p_Y \| E_Y)$$
and concludes that X → Y if $C_{X \to Y} < 0$. If $E_X = E_Y$ only contains the uniform distribution U(0, 1),
$$C_{X \to Y} = \int_0^1 \log |f'(x)| \, p(x) \, dx \approx \frac{1}{m - 1} \sum_{i=1}^{m-1} \log \left| \frac{y_{i+1} - y_i}{x_{i+1} - x_i} \right|,$$
where $x_1 \leqslant \cdots \leqslant x_m$.

The architecture of SciDB
M. Stonebraker (2011)
SciDB is a database for scientific data, not unlike a column store, but with arrays instead of columns. It can be queried with a SQL-like language (AQL) or a more procedural one (AFL). In addition to relational operators (join, etc.), it provides scientific operations (matrix multiplication, singular value decomposition (SVD), regression, machine learning). Data is stored in overlapping chunks, because many operations (on time series, images, geographical data, etc.) require neighbouring values. It also supports versioning (the data is never deleted, but stored to optimize queries with current data).

Latent Dirichlet allocation: towards a deeper understanding
C. Reed (2012)
A topic is a probability distribution on a collection of words. Latent Dirichlet allocation (LDA) is similar to k-means, but with bags of words instead of vectors in $R^n$: it models a “mixture” of topics, one step of the algorithm assigning each text to the nearest topic, the next refining the topics by “averaging” the texts associated with it.

The R package metaLik for likelihood inference in meta-analysis
A. Guolo and C. Varin (2012)
Meta-analysis (combining the result of many studies, without access to the raw data of each), boils down to a seemingly simple random-effects model: estimate β in
$$Y_i = \beta_i + e_i$$
$$\beta_i = \beta + \varepsilon_i$$
$$e_i \sim N(0, \sigma_i^2) \qquad \text{(precision of study } i\text{)}$$
$$\varepsilon_i \sim N(0, \tau^2) \qquad \text{(heterogeneity of the studies).}$$
Unfortunately, most statistical procedures (tests, confidence intervals, etc.) rely on asymptotic results: the small size of the sample (the number of studies aggregated) makes them invalid. In this context, higher-order expansions (of the log-likelihood) are preferable.

Non-parametric kernel distribution function estimation with kerdiest: an R package for bandwidth choice and applications
A. Quintela-del-Río and G. Estévez-Pérez (2012)
If you are more interested in the cumulative distribution function (cdf) than the probability distribution function (pdf), e.g., probability of exceedance, mean return period, quantiles, etc., the optimal bandwidth for kernel density estimation is different.

Spherical k-means clustering
K. Hornik et al. (2012)
Spherical k-means, often used in text clustering, refers to k-means for the cosine dissimilarity, i.e., after projecting the data on the sphere, $\min_{p,c} \sum_i \bigl( 1 - \cos(x_i, p_{c(i)}) \bigr)$, i.e.,
$$\min_{M, p} \sum_{ij} \mu_{ij} \bigl( 1 - \cos(x_i, p_j) \bigr),$$
with $\mu_{ij} \in \{0, 1\}$ and M1 = 1. In the fixed point algorithm, local improvements are possible (change the membership of a single observation; this also changes the prototypes p). Extended spherical k-means minimize
$$\sum_{ij} w_{ij} \mu_{ij}^m \bigl( 1 - \cos(x_i, p_j) \bigr),$$
with $\mu_{ij} \in [0, 1]$ and M1 = 1 (m and w are given).
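The fixed point algorithm for (hard) spherical k-means can be sketched as below. This is a minimal NumPy illustration, not the skmeans implementation: the function name, the random initialization from data points and the fixed number of iterations are assumptions, and the local improvement step and the extended (fuzzy, weighted) variant are left out.

```python
import numpy as np

def spherical_kmeans(x, k, iterations=20, seed=0):
    """Alternate cosine-based assignment and prototype re-normalization."""
    rng = np.random.default_rng(seed)
    x = x / np.linalg.norm(x, axis=1, keepdims=True)       # project on the sphere
    prototypes = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(iterations):
        labels = np.argmax(x @ prototypes.T, axis=1)       # largest cosine
        for j in range(k):
            members = x[labels == j]
            if len(members):
                total = members.sum(axis=0)
                prototypes[j] = total / np.linalg.norm(total)
    labels = np.argmax(x @ prototypes.T, axis=1)
    return labels, prototypes

# Hypothetical data: two directions, with very different vector lengths.
rng = np.random.default_rng(1)
east = np.array([1.0, 0.0]) + 0.05 * rng.normal(size=(20, 2))
north = np.array([0.0, 1.0]) + 0.05 * rng.normal(size=(20, 2))
points = np.vstack([east * 3.0, north * 0.5])
labels, prototypes = spherical_kmeans(points, k=2)
```

Because only the direction of each observation matters, the two groups are separated even though their lengths differ by a factor of six.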

ClustOfVar: an R package for clustering of variables
M. Chavent et al. (2012)
PCAMIX is a generalization of principal component analysis (PCA) that allows both quantitative and qualitative variables. The first component is
$$\operatorname*{Argmax}_{u \in R^n} \sum_{j \text{ quantitative variable}} \mathrm{Cor}^2(u, x_j) + \sum_{j \text{ qualitative variable}} \mathrm{Corr}^2(u, y_j)$$
where Corr(u, y) is the correlation ratio, i.e., the proportion of the variance of u explained by y. It can be computed from the singular value decomposition (SVD). It can also be used to cluster the variables, by considering
$$\operatorname*{Argmax}_{\substack{u_1, \dots, u_k \in R^n \\ C_1, \dots, C_k \text{ partition of the variables}}} \sum_{\ell=1}^{k} \Bigl( \sum_{j \in C_\ell \cap \mathrm{Quant}} \mathrm{Cor}^2(u_\ell, x_j) + \sum_{j \in C_\ell \cap \mathrm{Qual}} \mathrm{Corr}^2(u_\ell, y_j) \Bigr).$$
This can be approximated by hierarchical clustering of the variables, or a k-means-like algorithm.

The devil is in the tails: actuarial mathematics and the subprime mortgage crisis
C. Donnelly and P. Embrechts (2010)
The default times, needed to price CDOs, are often modeled with a 1-parameter Gaussian copula and (say) exponential margins. This is inadequate: it implies asymptotic independence – one should use another copula or, at least, stress the model with other copulas.

R and data mining: examples and case studies
Y. Zhao (2013)
Examples (working code with little or no explanations of what is computed) of data mining algorithms in R. The paper version also contains case studies. The topics include:
– Decision trees: party::ctree, rpart, randomForest;
– Regression: lm, glm, nls;
– Clustering: kmeans; cluster::pam, cluster::clara, fpc::pamk; hclust; fpc::dbscan;
– Outlier detection: boxplot.stats; DMwR::lofactor, Rlof::lof; dbscan, kmeans, time series models, extremevalues, mvoutlier, outliers;
– Time series: decompose, stl, timsac::decomp, ast::tsr; arima; dwt; clustering with dwt::dwtDist; classification after transforming the time series into features, e.g., with dwt, or with k-nearest-neighbours (RANN::nn);
– Association rules: arules::apriori, arules::eclat, arulesViz;
– Text mining: twitteR::userTimeline; tm::VectorSource, Corpus, tm_map, getTransformations, TermDocumentMatrix, find*; wordcloud; textcat (n-grams), lda, topicmodels;
– Social network analysis: igraph, sna.
There is no mention of the forecast and caret packages.

An efficient algorithm for automatic peak detection in noisy periodic and quasi-periodic signals
F. Scholkmann et al. (2012)
The local maxima scalogram (LMS)
$$m_{k,i} = \mathbf{1}_{x_i > x_{i-k} \text{ and } x_i > x_{i+k}}$$
may help detect peaks in a signal.

[Figure: a signal and its local maxima scalogram]

Graph databases
I. Robinson et al. (2013)
Relational databases are cumbersome and inefficient when dealing with graphs (self-joins, recursive joins), and nullable columns complicate queries even further. NoSQL databases can fake foreign keys, but all the work has to be done by the application.

Graph databases use index-free adjacency (the relations are not stored in a global, separate index, but locally, bidirectionally, at each node), often separate graph structure from property data (not unlike column stores), and provide efficient graph-theoretic operations (depth-first search, breadth-first search, shortest path (Dijkstra), A*, etc.). They can be used for predictive modeling, e.g., identify missing links (triadic closure: if A–B and A–C, then B–C is likely) or important links (a local bridge is a relation that leads to a different part of the network).

Applications include social data (find colleagues with similar interests, colleagues of colleagues interested in a given topic, etc.), email forensics, recommendations, geographical data (parcel delivery and shortest path queries, R-trees), master data management, network and data center management (machines, applications, etc.), access control (dealing with complex organization structures and product hierarchies requires recursive joins in SQL), bioinformatics (protein networks).

Data modeling is very similar to entity-relation (ER) modeling: use a node for each entity (and for most relations: only encode simple relations as relationships), avoid attributes on relationships (prefer fine-grained relationships). Time can be stored in linked lists (previous/next) or date nodes (or, even, a timeline tree:

each day points to its month, each month points to its year). Optimizing a graph database just means adding direct links for relations that would otherwise require several links.
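
As a rough illustration of the graph operations mentioned above (breadth-first search and triadic closure), here is a minimal sketch in plain Python. The toy graph and the function names are assumptions made for this example; a dictionary of neighbour sets only loosely mimics index-free adjacency and is not how any particular graph database stores its data.

```python
from collections import deque

# Toy undirected graph: each node stores its own neighbours
# (a loose in-memory analogue of index-free adjacency).
graph = {
    "A": {"B", "C"},
    "B": {"A", "D"},
    "C": {"A"},
    "D": {"B", "E"},
    "E": {"D"},
}

def shortest_path_length(graph, source, target):
    """Number of hops between two nodes, by breadth-first search (None if unreachable)."""
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == target:
            return dist
        for neighbour in graph[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None

def triadic_closure_candidates(graph):
    """Pairs (B, C) that share a neighbour A but are not linked yet."""
    candidates = set()
    for neighbours in graph.values():
        for b in neighbours:
            for c in neighbours:
                if b < c and c not in graph[b]:
                    candidates.add((b, c))
    return candidates

hops = shortest_path_length(graph, "C", "E")
missing_links = triadic_closure_candidates(graph)
```

The candidate pairs are only suggestions: a real application would presumably rank them, e.g., by the number of shared neighbours.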

Neo4j provides several APIs: a core API (in Java, with Node and Relationship objects), a traversal API (more declarative), and Cypher (a SQL-like language – the book has a lot of examples); SPARQL (for RDF) and Gremlin are other query languages. It can be embedded (either persisted on disk, or in-memory), used as a server (REST, JSON – the server can be extended: use JAX-RS annotations to indicate to which URIs a class/method responds), supports transactions, recovery, master-slave replication.

Data-intensive text processing with MapReduce J. Lin and C. Dyer (2010)

1. With the help of infrastructures such as MapReduce, the “unreasonable effectiveness of data” has made data-intensive scientific discovery the “fourth paradigm of science” (after theory, experiment and simulations). Contrary to most books on MapReduce or its implementation Hadoop, which hide actual contents (if any) under complicated installation instructions (this is Java) and boilerplate code (this is Java), this one focuses on MapReduce algorithm design – no actual code, just clear Python-like pseudo-code.

2. MapReduce assumes that failures are common, moves processing closer to the data and processes data sequentially (no random access). It can be seen as a cloud analogue of the map and fold primitives of functional programming.

    map: (k1, v1) −→ [(k2, v2)]
    reduce: (k2, [v2]) −→ [(k3, v3)]

The Hadoop infrastructure works as follows:
– The input data is stored in a distributed file system (HDFS), often serialized (Protocol Buffers, Avro, Thrift); the files are stored on data nodes (each file is replicated three times); the metadata is on the name node.
– The mapper receives the data, as key-value pairs, and emits more key-value pairs.
– The combiner (optional) pre-aggregates the data: for instance, when counting words, emitting pairs (w, 1) is inefficient

    def mapper(id, terms):
      for term in terms:
        emit term, 1
    def reducer(term, counts):
      emit term, sum(counts)

  and the counts can be pre-aggregated.
– The partitioner (optional) groups the data, usually by key but, for some applications, you may want to use only part of the key.
– The framework shuffles and sorts the data (transparently);
– The reducer receives (key, list) pairs and emits more key-value pairs.
– The jobtracker oversees the MapReduce jobs (on the job submission node); the tasktrackers, on each data node, run the code.

3. Here are a few MapReduce design patterns:
– In-mapper combiner: while it is possible to pre-aggregate the data in a combiner, it is often more efficient to do it in the mapper – not in its map method (called for each key-value pair), but in its close method (called when the node has finished processing all the data). You may need to flush the accumulator if it gets too large. For some operations, e.g., the mean, the combiner and the reducer do different things.

    def mapper(t, x):
      emit t, (x, 1)
    def combiner(t, xcs):
      xs, cs = zip(*xcs)  # unzip
      emit t, (sum(xs), sum(cs))
    def reducer(t, xcs):
      xs, cs = zip(*xcs)  # unzip
      emit t, sum(xs)/sum(cs)

– Pairs: to compute co-occurrences, use pairs of words as keys.
– Stripes: to compute co-occurrences, use one word as key, and a hash table as value.
– Order inversion: if you need the result of the aggregation before it is ready, e.g., to compute relative frequencies, duplicate the data (e.g., one copy for the word count, another for the word pair count), make sure the intermediate result (word count) is computed first (set the sort order), and partition the data accordingly.

    emit (w1,*):1
    emit (w1,w2):1
    emit (w1,w3):1
    partition on w1

– Value-to-key conversion: when the reducer must produce sorted data, ask the framework to sort it beforehand, by putting the value to sort in the key.
– Reduce-side join: for a 1-to-1 join of T and S, emit k : (t, T) and k : (s, S) (where s and t are the primary keys of S and T, k the column to join on, and T, S the other columns); for a 1-to-many join, emit (k, s) : S and (k, t) : T, define a custom partitioner, ensure that the values of the small table come first and cache them.
– Map-side join: if the data has already been partitioned on disk, e.g., as a result of a previous MapReduce process, map over the larger set after reading the corresponding chunk of the smaller one in the mapper – no reducer is needed.
– Memory-backed join: put the smaller dataset into memory (or some distributed key-value store: Memcached, etc.) in every mapper – if it does not fit into memory, partition it into n pieces and perform n in-memory joins.
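
The mean example above (the combiner sums values and counts, the reducer divides) can be tried out with a small single-process simulation. This is only a sketch: `run_mapreduce` is a hypothetical helper written for this illustration, not part of Hadoop, and the "splits" merely stand in for map tasks running on different nodes.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer, combiner=None, n_splits=2):
    """Toy, single-process stand-in for a MapReduce job (hypothetical helper)."""
    # Each split plays the role of one map task.
    splits = [records[i::n_splits] for i in range(n_splits)]
    shuffled = defaultdict(list)
    for split in splits:
        local = defaultdict(list)
        for key, value in split:
            for k, v in mapper(key, value):
                local[k].append(v)
        for k, vs in local.items():
            # The combiner pre-aggregates the output of a single map task.
            pairs = combiner(k, vs) if combiner else [(k, v) for v in vs]
            for k2, v2 in pairs:
                shuffled[k2].append(v2)
    # Shuffle and sort: values are grouped by key, keys are processed in order.
    result = {}
    for k in sorted(shuffled):
        for k3, v3 in reducer(k, shuffled[k]):
            result[k3] = v3
    return result

def mean_mapper(t, x):
    yield t, (x, 1)

def mean_combiner(t, xcs):
    xs, cs = zip(*xcs)  # unzip
    yield t, (sum(xs), sum(cs))

def mean_reducer(t, xcs):
    xs, cs = zip(*xcs)  # unzip
    yield t, sum(xs) / sum(cs)

data = [("a", 1.0), ("b", 10.0), ("a", 2.0), ("a", 6.0), ("b", 20.0)]
means = run_mapreduce(data, mean_mapper, mean_reducer, combiner=mean_combiner)
means_without_combiner = run_mapreduce(data, mean_mapper, mean_reducer)
```

The result is the same with or without the combiner, which is the property a combiner has to satisfy: the framework is free to apply it zero or more times.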

4. An inverted index maps words to sorted lists of document ids (there may be some data (“payload”) with the id: term frequency, position, style (title or not), html link, linguistic function (pos, type of entity: place, person), etc.). For efficient set operations (intersection, union), the list should be sorted: do not emit term:(id,freq), but (term,id):freq to have the framework sort the ids (you need a custom partitioner to ensure that all the messages for a given term are sent to the same reducer). The volume of data is huge: the process can benefit from compression.
– Only store the differences (d-gaps) between document ids – since they are sorted, the numbers are smaller.
– Variable length integer coding: set the 8th bit to zero as long as the number is not finished.
– Group varInt: group the numbers by 4 and prefix each group with a byte indicating (on 2 bits) the length of each number.
– Simple-9: in each 32-bit word, use 4 bits to indicate how the remaining 28 bits are split into equal-sized parts.
– Unary code: encode n as n − 1 1s and one 0.
– γ-code: encode n as ⌊log2 n⌋ 1s, a 0, and the binary encoding of the number, without the leading 1.
– δ-code: similar to the γ-code, but the first part is not unary-encoded but γ-encoded.
– Golomb code: choose b, encode q = ⌊(n − 1)/b⌋ in unary, encode the remainder r = n − 1 − qb in truncated binary (the smaller numbers are encoded on k bits, the others on k + 1 – unassigned k-bit codes followed by 0 or 1).
MapReduce is a poor solution for document retrieval, but you could partition the documents, and cache the most-requested results.

5. Many sequential graph algorithms rely on some global data structure (e.g., a priority queue): they are not straightforward to implement in MapReduce.
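
Going back to the index compression schemes of item 4, the following sketch combines d-gaps with the γ-code, following the definitions given above. It is an illustration only: the function names are made up for this example, and bits are kept as a character string for readability, whereas a real index would pack them into bytes.

```python
from itertools import accumulate

def d_gaps(doc_ids):
    """Replace sorted document ids with the differences between consecutive ids."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps

def gamma_encode(n):
    """Gamma code of n >= 1, as a string of '0'/'1' characters."""
    binary = bin(n)[2:]  # binary expansion, leading 1 included
    # floor(log2 n) ones, a zero, then the binary digits without the leading 1.
    return "1" * (len(binary) - 1) + "0" + binary[1:]

def gamma_decode(bits):
    """Decode a concatenation of gamma codes into a list of integers."""
    numbers, i = [], 0
    while i < len(bits):
        length = 0
        while bits[i] == "1":  # unary part: number of bits that follow
            length += 1
            i += 1
        i += 1                 # skip the terminating 0
        numbers.append(int("1" + bits[i:i + length], 2))
        i += length
    return numbers

postings = [3, 7, 8, 15]
encoded = "".join(gamma_encode(g) for g in d_gaps(postings))
decoded = list(accumulate(gamma_decode(encoded)))  # undo the d-gaps
```

Small gaps get short codes, which is why sorting the document ids before taking differences matters.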
For instance, single-source shortest paths (Dijkstra's algorithm, in the sequential case) can be computed with a parallel breadth-first search, implemented as a sequence of MapReduce iterations: store the estimated distance from the source in each node, process each node in each iteration, stop when the distances no longer change – there are as many iterations as the longest shortest path (you also need to keep track of the adjacency list, e.g., by emitting it as well).

    def mapper(id, d):
      emit id, d            # keep the current estimate
      for a in neigh(id):
        emit a, d + w       # w: weight of the edge (id, a)
    def reducer(id, ds):
      emit id, min(ds)

PageRank (a random walk on the graph, with teleportation, computing a score (equilibrium probability) for each node), HITS (random walk on a bipartite graph of hubs and authorities, computing two scores for each node on the initial graph – the graph is usually query-dependent: pages containing the search terms and their neighbours), SALSA (idem) can be computed in the same way, each iteration spreading the probability mass (but pay attention to dangling nodes). Combiners can improve performance, if you partition the data with some heuristic (sort by zipcode, school, language, domain name, etc.). To avoid underflow, use log-probabilities: they can be added as

\[ a \oplus b = \begin{cases} b + \operatorname{log1p}(e^{a-b}) & \text{if } a < b \\ a + \operatorname{log1p}(e^{b-a}) & \text{if } a \geqslant b. \end{cases} \]

6. If a statistical model contains parameters θ (unobserved, to be estimated), hidden variables y (unobserved, but we do not care much about their values), and data x (observed), maximum likelihood estimators

\[ (\hat\theta, \hat y) = \operatorname*{Argmax}_{(\theta, y)} P(X = x, Y = y; \theta) \]

give a very noisy estimator of θ: prefer the marginal likelihood estimator (i.e., integrate y out):

\[ \hat\theta = \operatorname*{Argmax}_{\theta} P(X = x; \theta) = \operatorname*{Argmax}_{\theta} \prod_i \sum_y P(X = x_i, Y = y; \theta). \]

The expectation-maximization (EM) algorithm is a hill-climbing algorithm that attempts to maximize the marginal log-likelihood:

\[ \theta_{n+1} \leftarrow \operatorname*{Argmax}_{\theta} \sum_{x,y} P(x, y; \theta_n) \log P(x, y; \theta). \]

This single formula is often, confusingly, presented as two steps:
– Estimation of the probability distribution X, Y | θ = θn in the E step (it is a probability distribution: in the discrete case, it is a set of probabilities);
– Computation of the expected log-probability

\[ \sum_{x,y} P(x, y; \theta_n) \log P(x, y; \theta) \]

(it is a function of θ: if it is explicitly computed, it is also part of the E step);
– Maximization, in the M step.

Given a hidden Markov Model (HMM), the problems of computing the probability that the model generated the data (used for clustering or outlier detection) and computing the most probable sequence of hidden states that could generate the observed data can be solved with dynamic programming (the forward algorithm computes the probability that the process is in state q at time t; the Viterbi algorithm computes the probability of the most probable sequence of states leading to state q at time t): they can be implemented with MapReduce, by processing one column of the dynamic programming table (one time t) in each iteration; each mapper receives a small part of the data; each reducer computes a cell in the dynamic programming table. Similarly, estimating the parameters of a HMM (forward-backward algorithm) can be implemented with EM.
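
The two dynamic programs described above can be sketched sequentially (no MapReduce), in log space, reusing the log-addition rule from item 5 to avoid underflow. The function names and the small two-state model are assumptions made for this example.

```python
import math

def log_add(a, b):
    """log(exp(a) + exp(b)), computed without underflow."""
    if a == -math.inf:
        return b
    if b == -math.inf:
        return a
    if a < b:
        return b + math.log1p(math.exp(a - b))
    return a + math.log1p(math.exp(b - a))

def forward_and_viterbi(log_init, log_trans, log_emit, observations):
    """Return (log P(observations), log-probability of the best state path).

    log_init[q], log_trans[p][q] and log_emit[q][o] are log-probabilities.
    """
    states = range(len(log_init))
    # First column of the dynamic programming table.
    alpha = [log_init[q] + log_emit[q][observations[0]] for q in states]
    delta = list(alpha)
    for o in observations[1:]:  # one column per time step
        new_alpha, new_delta = [], []
        for q in states:
            total, best = -math.inf, -math.inf
            for p in states:
                total = log_add(total, alpha[p] + log_trans[p][q])  # forward: sum
                best = max(best, delta[p] + log_trans[p][q])        # Viterbi: max
            new_alpha.append(total + log_emit[q][o])
            new_delta.append(best + log_emit[q][o])
        alpha, delta = new_alpha, new_delta
    log_likelihood = -math.inf
    for a in alpha:
        log_likelihood = log_add(log_likelihood, a)
    return log_likelihood, max(delta)

def logs(matrix):
    return [[math.log(v) for v in row] for row in matrix]

# Hypothetical two-state HMM with two possible observations.
init = [0.6, 0.4]
trans = [[0.7, 0.3], [0.4, 0.6]]
emit = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 0]
log_likelihood, best_path_logp = forward_and_viterbi(
    [math.log(v) for v in init], logs(trans), logs(emit), obs)
```

The only difference between the two recursions is the aggregation (sum versus maximum), which is why they map onto the same iteration structure.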

HMM can be used, for instance, to align texts in different languages:

\[ P(\text{target}, \text{alignment} \mid \text{source}) = \prod_i P(t_{i+1} \mid t_i) \times \prod_i P(t_i \mid s_{a_i}) = \text{language model} \times \text{translation model}. \]

MapReduce can also be used for parallel but non-data-intensive tasks. The book also mentions, with few details, other technologies in the MapReduce/Hadoop ecosystem: Mahout (machine learning), HBase (tables in HDFS), Cassandra (key-value store), Giraph (bulk-synchronous parallel (BSP) framework), GPU programming (e.g., for streaming data), Pig, Hive, Hadapt (data processing).

Financial risk modelling and portfolio optimization with R B. Pfaff (2013)

Each chapter is structured in the same way: terse presentation of a few mathematical notions (consider it as a mere check-list), list of relevant R packages (not unlike the CRAN task views), and (more interesting) examples.
The topics covered are
– Risk measures: VaR, ES, mVaR, mES, coherent measures;
– Efficient frontier;
– Distributions to model returns: generalized lambda distribution (GLD), GHD (ghyp, lmomco::pargld, fBasics::gldFit);
– Extreme value theory: block maxima, peaks over thresholds, exceedance declustering (evir::gev, ismev::gev.fit, fExtremes::gevFit, ismev::rlarg.fit, fExtremes::mrlPlot, gpdFit, deCluster);
– Volatility: GARCH models (fGarch::garchFit, rugarch);
– Dependence: correlation, rank correlation, Kendall's τ, lower tail dependence, copula GARCH model (fit a GARCH model to individual time series, and then a copula to the (joint) residuals), mixture of copulas (QRM::fit.tcopula, QRM::rcopula.t, copula::dcopula);
– Robust estimation: M-estimator, MM-estimator, MVE, MCD, S-estimator, SDE, OGK estimator (rrcov::Cov*);
– Robust optimization: scenario-based, box or elliptic uncertainty set (FRAPO::Socp, Rsocp);
– Diversification: risk contribution, diversification ratio, defined with the volatility or some downside risk measure such as the tail dependence coefficient (FRAPO::P*, FRAPO::dr, FRAPO::cr, PortfolioAnalytics::optimize.portfolio);
– Portfolio optimization with VaR, ES or drawdown in the objective or the constraints (fPortfolio::minRiskPortfolio, FRAPO::P*);
– Time series models: ARMA, VAR, VECM, SVAR, SVEC (urca, vars);
– Tactical asset allocation: Black-Litterman, copula opinion pooling, entropy pooling (TTR, fTrading, BLCOP).

Lectures on modern convex optimization A. Ben-Tal and A. Nemirovski (2012)

Convex programming is too general to be amenable to practical, efficient methods, but many special cases are – linear, conic quadratic, semi-definite programming are all examples of conic programming.

1. To find a lower bound of

\[ x^* = \operatorname*{Min}_x \{ f(x) : \forall i \ g_i(x) \geqslant b_i \}, \]

consider a linear combination of the constraints

\[ \sum_i y_i g_i(x) \geqslant \sum_i y_i b_i \]

(with y ⩾ 0). If f(x) ⩾ ∑i yi gi(x), then ∑i yi bi is a lower bound. The best such lower bound is

\[ \operatorname*{Max}_{y \geqslant 0} \Bigl\{ \sum_i y_i b_i : \forall x \ f(x) \geqslant \sum_i y_i g_i(x) \Bigr\}. \]

This is the dual problem; its optimal value is at most that of the primal (weak duality). If f and g are linear, set x to a basis vector and its opposite: the inequalities become equalities, and the dual of

\[ \operatorname*{Min}_x \{ c'x : Ax \geqslant b \} \]

is

\[ \operatorname*{Max}_y \{ b'y : y \geqslant 0, \ A'y = c \}. \]

For linear problems, there is a strong duality: the primal has an optimal solution iff the dual has, and the optimal values are the same – but there are two ways in which a problem can fail to have a solution: it can be infeasible, or unbounded; if the primal is unbounded, then the dual is infeasible, but the converse is false: both can be infeasible.

Applications of linear programming include compressed sensing (ℓ1 penalty as an approximation of a (non-convex) ℓ0 one), support vector machines (SVM), discrete-time linear dynamic systems.

Replacing the linear objective function by an arbitrary (convex) function is not the only way of generalizing linear programming into non-linear programming – one can keep the objective function linear but replace the constraints Ax − b ⩾ 0 with Ax − b ≽K 0, where ≽K denotes the partial order induced by a closed, pointed convex cone with non-empty interior K. In the dual problem, y ⩾ 0 gets replaced by y ≽K∗ 0, i.e., y ∈ K∗, the dual cone. This is conic programming.

To formulate the dual of

\[ \operatorname*{Min}_x \{ c'x : Ax \succcurlyeq_K b \}, \]

we are looking for a lower bound on the objective

\[ c'x \geqslant \langle \lambda, Ax \rangle \geqslant \langle \lambda, b \rangle \]

for some λ. But an arbitrary λ does not preserve inequalities: the dual of K is the set of λ that always do:

\[ K^* = \{ \lambda : \forall x \succcurlyeq_K 0 \ \langle \lambda, x \rangle \geqslant 0 \}. \]

For the dual, consider

\[ \operatorname*{Max}_\lambda \{ \langle \lambda, b \rangle : \lambda \succcurlyeq_{K^*} 0, \ \forall x \ \langle c, x \rangle \geqslant \langle \lambda, Ax \rangle \}. \]

The last condition can be written as

\[ \forall x \ \langle c, x \rangle \geqslant \langle A^* \lambda, x \rangle. \]

By setting x to a basis vector and its opposite, we see that this is equivalent to c = A∗λ. The dual is therefore:

\[ \operatorname*{Max}_\lambda \{ \langle \lambda, b \rangle : \lambda \succcurlyeq_{K^*} 0, \ A^* \lambda = c \}. \]

The primal can be formulated in the same way (intersection of a cone and an affine subspace) by adding a variable y = Ax − b. More compactly:

\[ \begin{aligned} \text{primal}: \ & \operatorname*{Min}_y \{ \langle d, y \rangle : y \in L - b, \ y \succcurlyeq_K 0 \} \\ \text{dual}: \ & \operatorname*{Max}_\lambda \{ \langle b, \lambda \rangle : \lambda \in L^\perp + d, \ \lambda \succcurlyeq_{K^*} 0 \}. \end{aligned} \]

Strong duality holds if the primal is strictly feasible, i.e., ∃x Ax − b ≻K 0 (strict inequality), i.e., K° ∩ (L − b) ≠ ∅ (where K° is the interior of K). (Without strict feasibility, anything can happen: primal solvable but dual infeasible, etc.)

2. The Lorentz cone

\[ L^m = \Bigl\{ x \in \mathbf{R}^m : x_m \geqslant \sqrt{x_1^2 + \cdots + x_{m-1}^2} \Bigr\} \]

is self-dual. A second order cone program (SOCP) is

\[ \operatorname*{Min}_x \{ c'x : Ax - b \in L^{m_1} \times \cdots \times L^{m_k} \}. \]

The condition Ax − b ∈ L^m can be written

\[ \begin{pmatrix} Dx \\ p'x \end{pmatrix} - \begin{pmatrix} d \\ q \end{pmatrix} \in L^m, \quad \text{where} \quad \begin{pmatrix} D & d \\ p' & q \end{pmatrix} = \begin{pmatrix} A & b \end{pmatrix}, \]

i.e., p′x − q ⩾ ∥Dx − d∥2. The dual of

\[ \operatorname*{Min}_x \{ c'x : \forall i \ \| D_i x - d_i \|_2 \leqslant p_i' x - q_i \} \]

is

    Find µ, ν
    To maximize ∑i µi′ di + νi qi
    Such that ∑i Di′ µi + νi pi = c
              ∀i ∥µi∥2 ⩽ νi.

Optimization problems rarely arise in this form: they have to be somehow transformed. For the objective, replace “minimize f(x)” with “minimize t such that t ⩾ f(x)”. For the constraints, there is a very, very long list of functions or constructions that can be used in SOCPs; here are a few of them.
– The Euclidian norm: ∥x∥2 ⩽ t;
– Its square:

\[ \| x \|_2^2 \leqslant t \iff x'x + \tfrac14 (t-1)^2 \leqslant \tfrac14 (t+1)^2 \iff \left\| \begin{pmatrix} x \\ \tfrac12 (t-1) \end{pmatrix} \right\|_2 \leqslant \tfrac12 (t+1); \]

– Convex quadratic forms (i.e., quadratic programming):

\[ x'C'Cx + q'x + r \leqslant t \iff \left\| \begin{pmatrix} Cx \\ \tfrac12 (t - q'x - r - 1) \end{pmatrix} \right\|_2 \leqslant \tfrac12 (t - q'x - r + 1); \]

– Many (rational) powers;
– The Lp norm;
– etc.

Robustifying a linear program with elliptic (or, more generally, conic-quadratic representable) constraints gives a SOCP. For instance, the robustification of the constraint ax − b ⩾ 0 is

\[ \operatorname*{Min}_{a^*, b^*} \left\{ a^* x - b^* : \left\| \begin{pmatrix} a^* - a \\ b^* - b \end{pmatrix} \right\|_2 \leqslant \varepsilon \right\} \geqslant 0 \]

i.e.,

\[ ax - b + \operatorname*{Min}_{u, v} \left\{ \begin{pmatrix} x \\ -1 \end{pmatrix}' \begin{pmatrix} u \\ v \end{pmatrix} : \left\| \begin{pmatrix} u \\ v \end{pmatrix} \right\|_2 \leqslant \varepsilon \right\} \geqslant 0 \]

i.e.,

\[ ax - b - \varepsilon \left\| \begin{pmatrix} x \\ -1 \end{pmatrix} \right\|_2 \geqslant 0. \]

(To robustify the objective c′x, replace it with a new variable t, add the constraint c′x ⩽ t, and robustify it.) The robustified problem can be used to assess the stability of the objective, the solution, the feasibility status of a linear problem.

3. Semi-definite programming (SDP) is another special case of cone programming, where the (self-dual) cone is that of positive semi-definite symmetric matrices. One can assume there is only one inequality – otherwise, put them in a single block-diagonal matrix. It is a generalization of conic quadratic programming:

\[ \begin{pmatrix} x \\ t \end{pmatrix} \in L^k \iff \begin{pmatrix} t I_{k-1} & x \\ x' & t \end{pmatrix} \succcurlyeq 0. \]

In addition, SDP can use the largest eigenvalue (or singular value), the sum of the k largest eigenvalues (or singular values), the spectral norm, (some rational powers of) the determinant, non-negative polynomials, trigonometric polynomials, etc. Applications include the stability of dynamical systems (eigenvalues) and robust cone programming.

Semi-definite programming provides relaxations of combinatorial problems, finer than LP relaxations, as follows. Formulate the problem as a quadratic program

with quadratic constraints (e.g., x ∈ {0, 1} is equivalent to x² − x = 0 or, since we prefer inequalities, x² − x ⩽ 0, x − x² ⩽ 0).

\[ x^* = \operatorname*{Argmin}_x \{ f(x) : g(x) \leqslant 0 \} \]

Consider the dual problem, i.e., transform the constraints to penalties, with coefficients to be determined, fλ(x) = f(x) + λg(x), λ ⩾ 0; this provides a lower bound on f(x∗)

\[ \zeta^* = \inf_x f_\lambda(x) \leqslant f(x^*). \]

But, since fλ(x) is a quadratic form, ∀x ζ ⩽ fλ(x) means that fλ(·) − ζ ≽ 0. Finding the best bound is therefore a semi-definite program:

\[ (\zeta^*, \lambda^*) = \operatorname*{Argmax}_{\zeta, \lambda} \{ \zeta : f_\lambda(\cdot) - \zeta \succcurlyeq 0, \ \lambda \geqslant 0 \}. \]

The SDP relaxation of a quadratically-constrained quadratic program can also be formulated as follows: replace inhomogeneous quadratic forms

\[ g(x) = x'Ax + 2b'x + c \]

with homogeneous ones

\[ G(x, t) = x'Ax + 2t\,b'x + ct^2 = \begin{pmatrix} t \\ x \end{pmatrix}' \begin{pmatrix} c & b' \\ b & A \end{pmatrix} \begin{pmatrix} t \\ x \end{pmatrix}, \]

then, replace the quadratic forms x′Gx with tr(GX). The relaxation corresponds to the embedding

\[ \mathbf{R}^n \longrightarrow S^{n+1}_+, \qquad x \longmapsto \begin{pmatrix} 1 \\ x \end{pmatrix} \begin{pmatrix} 1 \\ x \end{pmatrix}'. \]

Applications of SDP relaxation include graph problems (Shannon capacity, maxcut), other combinatorial problems, chance constraints.

Finding the best inner ellipsoidal approximation of a polytope defined by inequalities (or, more generally, of an intersection of ellipsoids) is a SDP.

Finding the best outer ellipsoidal approximation of a polytope defined as the convex hull of its vertices (or, more generally, of a union of ellipsoids) is a SDP.

4. Convex programming is a polynomial problem: the ellipsoid method needs O(n² log ε⁻¹) steps (and each step needs O(n²) operations) and goes as follows. Start with an ellipsoid containing the optimal set. If its center is not feasible, there exists a separating hyperplane between the center and the (convex) feasible set: use it to cut the ellipsoid. If it is feasible, the (sub)gradient of the objective defines a hyperplane, on one side of which the optimal set lies: use it to cut the ellipsoid. In both cases, take an outer approximation of the cut ellipsoid and iterate.

Interior point methods are interior penalty methods, i.e., to solve

\[ \operatorname*{Min}_x \{ c'x : x \in X \}, \]

they solve a sequence of unconstrained problems

\[ x^*(t) = \operatorname*{Argmin}_x \; t\,c'x + K(x) \]

where K is a barrier function (if xn → x and x ∈ ∂X, then K(xn) → +∞) and t increases; (x∗(t))t⩾0 traces the central path (each problem is solved with the Newton method, which converges quadratically if sufficiently close to the solution). With conic programming, a clever choice of the barrier function

\[ \begin{aligned} x \geqslant 0 &\rightsquigarrow -\log x \\ X \succcurlyeq 0 &\rightsquigarrow -\log \det X \\ x \in L^k &\rightsquigarrow -\log(x_k^2 - x_1^2 - \cdots - x_{k-1}^2) \end{aligned} \]

and tn+1 = 1.1 tn, a single Newton step suffices for each value of t and the algorithm converges in O(log ε⁻¹) steps. It is easier to monitor convergence and stay close to the central path if one traces both the primal and dual central paths; they are related by the augmented complementary slackness relation

\[ x^*(t) = -t^{-1} \nabla K(s^*(t)), \qquad s^*(t) = -t^{-1} \nabla K(x^*(t)) \]

(since the cone is self-dual, the barrier is the same for both problems).

5. Interior point methods are polynomial, but for large-scale problems O(n³) or O(n²) will not do: if we want the algorithm to terminate in a reasonable amount of time, we can afford to evaluate the objective, its gradient, but nothing fancier.

There are information-theoretic bounds on the number of steps needed to achieve a given precision ε, e.g., O(log n/ε²) as n → ∞ for ball constraints (and much worse for box constraints) – it is almost independent of the dimension.

Mirror descent is a generalization of the projected gradient

\[ x_{n+1} \leftarrow \operatorname{Prox}_{x_n}(\gamma_n f'(x_n)), \]

where one replaces the norm, distance and gradient to better suit the problem. For instance,
– Over an L2 ball, use ∥·∥2, ω(x) = ½ x′x and the prox mapping

\[ \operatorname{prox}_x \xi = \operatorname*{Argmin}_{y \in X} \; \omega(y) + \langle \xi - \omega'(x), y \rangle = \operatorname*{Argmin}_{y \in X} \| x - \xi - y \|_2 \]

is the projection of x − ξ on X;
– Over the simplex ∆n, use ∥·∥1 and ω(x) = ∑i xi log xi;
– Over the spectahedron {x ∈ S^n : x ≽ 0, tr x ⩽ 1}, use ∥λ(x)∥1 and ω(x) = ∑i λi(x) log λi(x).
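
For the simplex case listed above, the prox step with the entropy ω(x) = ∑ xi log xi reduces to a multiplicative update followed by a renormalization. The sketch below assumes a fixed step size, returns the average of the iterates, and uses a made-up linear objective; these choices are for illustration only and are not the step-size rules analyzed in the book.

```python
import math

def entropic_mirror_descent(gradient, n, steps=500, step_size=0.1):
    """Minimize a convex function over the simplex with the entropy mirror map.

    Each prox step multiplies x_i by exp(-step_size * g_i) and renormalizes.
    """
    x = [1.0 / n] * n  # start at the centre of the simplex
    average = [0.0] * n
    for _ in range(steps):
        g = gradient(x)
        weights = [xi * math.exp(-step_size * gi) for xi, gi in zip(x, g)]
        total = sum(weights)
        x = [w / total for w in weights]
        # Mirror descent guarantees are usually stated for the averaged iterate.
        average = [a + xi / steps for a, xi in zip(average, x)]
    return average

# Hypothetical test problem: minimize <c, x> over the simplex;
# the minimum puts all the mass on the smallest entry of c.
c = [0.3, 0.1, 0.5]
x_avg = entropic_mirror_descent(lambda x: c, n=3)
```

The iterates stay in the simplex without any explicit projection, which is the practical appeal of matching the mirror map to the feasible set.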

A convex-concave saddle point problem (e.g., equilibrium, in a zero-sum game)

\[ \operatorname*{Min}_{x \in X} \operatorname*{Max}_{y \in Y} \phi(x, y), \]

where ϕ is convex in x, concave in y, can be solved with mirror descent by considering the vector field (ϕx, ϕy) instead of the gradient.

There are a few variants: stochastic mirror descent if the gradient is noisy (the expected error converges in the same way), bundle mirror descent (momentum, to address plateau problems), etc.

Most non-smooth optimization problems are of the form

\[ \operatorname*{Min}_{x \in X} \operatorname*{Max}_{y \in Y} \phi(x, y) \]

and can be smoothed as (Nesterov smoothing)

\[ \operatorname*{Min}_{x \in X} \operatorname*{Max}_{y \in Y} \phi(x, y) + d(y). \]

The mirror prox algorithm (a variant of saddle point mirror descent) can be used to solve those problems. Convergence, for a non-smooth optimization problem, using only first-order information, is at best O(1/√t), but it can be brought to O(1/t) after such a transformation. Examples of non-smooth functions that can be used in this way include:
– The Lp norm, ∥x∥p = Max{⟨y, x⟩ : ∥y∥q ⩽ 1};
– The Lp norm of the singular values of a matrix;
– The Lp norm of the positive part,

\[ \| x^+ \|_p = \operatorname*{Max}_{\substack{\| y \|_q \leqslant 1 \\ y \geqslant 0}} \langle y, x \rangle; \]

– The maximum entry of a vector, or the sum of the k largest entries

\[ s_k(x) = \operatorname*{Max}_{y \in \Delta_{n,k}} \langle y, x \rangle \]

where ∆n,k = [1′y = k, 0 ⩽ y ⩽ 1];
– The sum of the k largest eigenvalues

\[ S_k(x) = \operatorname*{Max}_{\substack{0 \preccurlyeq y \preccurlyeq I \\ \operatorname{tr} y = k}} \operatorname{tr}(yx). \]

Maximum likelihood estimation of a multi-dimensional log-concave density M. Cule et al. (2010)

The log-concave maximum likelihood density estimator is well-defined, and asymptotically finds the log-concave density that minimizes the Kullback-Leibler divergence with the true density. Its support is the convex hull of the sample data (log-concave distributions have thin tails).

The estimator can be used for data visualization (e.g., checking for the presence of elliptic contours), classification (estimate the density of each class), clustering (fit a mixture of log-concave distributions with the EM algorithm, as you would a mixture of Gaussians), Monte Carlo estimation of functionals of the density (probabilities, moments, expectations, entropy, etc.), mixing detection (compare the log-concave estimator with a mixture of log-concave densities, or with f(x) ∝ exp(ϕ(x) + c∥x∥²), with ϕ concave and fixed values of c > 0 (this also accounts for fat tails), or with a permutation test comparing the data with samples from the estimated distribution).

Contrary to kernel-based estimators, this does not require the choice of a bandwidth matrix (problematic in high dimensions).

The density is found by minimizing

\[ \sigma(y_1, \dots, y_n) = -\frac1n \sum_i y_i + \int_C \exp h_y(x) \, dx \]

where yi = log f̂(xi) are the log-densities evaluated at the data points xi, hy is the smallest concave function with ∀i hy(xi) ⩾ yi, the first term is the (negated) log-likelihood, the second term can be thought of as a Lagrangian term (we want the density to integrate to 1), C is the convex hull of the points (the support of the density). Since the concave function hy is affine on each triangle of a triangulation of the points (obtained as a side result of the QuickHull algorithm), the integral is easy to compute. The objective function is convex, but non-differentiable: one could use Newton's algorithm with a subgradient, but convergence would be slow (linear or worse, versus quadratic for differentiable functions), or Shor's r algorithm (space dilation in the direction of the difference of two consecutive gradients – empirically faster).

There is an R implementation in the LogConcDEAD package (in dimension 1, also check logcondens).

Efficient rank reduction of correlation matrices I. Grubišić and R. Pietersz (2005)

A constrained optimization problem

\[ \operatorname*{Max}_{x \in \mathbf{R}^n} \{ f(x) : g(x) = 0 \} \]

can be formulated and solved as an unconstrained one if the feasible set M = [g(x) = 0] ⊂ R^n is a manifold on which you can explicitly compute the gradient and (for the Newton and conjugate gradient methods) the hessian: the gradient comes from the isomorphism between Ty M and Ty∗ M induced by the Riemannian structure,

\[ dF_y(u) = \langle (\operatorname{grad} F)_y, u \rangle_{T_y M}, \]

the hessian from the Levi-Civita connection and the updates to the solution (“moving in some direction”) are given by parallel transport along geodesics.

In the case of rank-constrained correlation matrices, the feasible set is not a manifold (it is a stratified manifold, each stratum corresponding to a value for the rank), but it can be described as the quotient of a manifold (the “Cholesky manifold”) by the orthogonal group. The resulting algorithm is equivalent to optimization using a parametrization with spherical coordinates.

Sparse inverse covariance selection via alternating linearization methods K. Scheinberg et al. (2010)

There are many algorithms for sparse inverse covariance selection (SICS), i.e., sparse estimators of the inverse covariance matrix corresponding to sparse graphical models ((Σ⁻¹)ij = 0 iff Xi ⊥ Xj | Xk, k ≠ i, j):

\[ \operatorname*{Max}_{X \succ 0} \; \log \det X - \langle X, \Sigma \rangle - \lambda \| X \|_1. \]

The alternating linearization method solves the optimization problem

\[ \operatorname*{Min}_x \; f(x) + g(x) \]

with f and g convex, by rewriting the problem as

\[ \operatorname*{Min}_{x, y} \{ f(x) + g(y) : x = y \} \]

and separately updating x and y. It can be adapted to solve the SICS problem.

Model selection through sparse maximum likelihood estimation for multivariate gaussian or binary data O. Banerjee et al. (2006)

The sparse inverse covariance matrix

\[ \operatorname*{Argmax}_{X \succ 0} \; \log \det X - \operatorname{tr} SX - \rho \| X \|_1, \]

where S is the sample variance matrix, can be estimated by solving the dual problem, which is of the form

\[ \operatorname*{Max}_W \{ \log \det W : \| W - S \|_\infty \leqslant \rho \}. \]

It can be estimated one coordinate at a time (block-coordinate descent)

\[ w_{12} \leftarrow \operatorname*{Argmin}_y \{ y' W_{11}^{-1} y : \| y - s_{12} \|_\infty \leqslant \rho \}. \]

Sparse inverse covariance estimation with the graphical lasso J. Friedman et al. (2007)

This problem can in turn be solved through its dual

\[ \operatorname*{Min}_\beta \; \tfrac12 \bigl\| W_{11}^{1/2} \beta - b \bigr\|^2 + \rho \| \beta \|_1, \]

which is a lasso problem. The glasso package provides an implementation.

Covariance selection and estimation via penalized normal likelihood N. Liu et al.

The coefficients of the Cholesky matrix can be interpreted as regression coefficients – penalized (lasso, ridge) regression gives a penalized Cholesky matrix, and a penalized variance matrix.

Covariance selection for non-chordal graphs via chordal embedding J. Dahl et al.

The sparsity pattern of a variance matrix forms a chordal graph. One can easily estimate the inverse covariance matrix under a chordal sparsity constraint and, with more work, an arbitrary sparsity constraint.

Sparse permutation invariant covariance estimation A.J. Rothman et al. (2008)

The L1-penalized inverse-variance (concentration) estimator

\[ \operatorname*{Argmin}_{\Omega \succ 0} \; \operatorname{tr}(\Omega \hat\Sigma) - \log |\Omega| + \lambda \| \Omega^- \|_1 \]

(where Ω⁻ are the off-diagonal elements of Ω) can be estimated by parametrizing Ω as Ω = T′T, with T lower-triangular, to ensure positivity, and using a quadratic approximation of the absolute value:

\[ |u_{n+1}| \approx \frac12 \frac{u_{n+1}^2}{|u_n|} + \frac12 |u_n| \quad \text{or} \quad \frac12 \frac{u_{n+1}^2}{|u_n| + \varepsilon} + \frac12 |u_n|. \]

Efficient estimation of covariance selection models F. Wong et al. (2003)

Bayesian approach for covariance selection (sparse estimation of the inverse covariance), with a Γ prior on the diagonal elements, and a zero-inflated prior on the correlations, estimated with Gibbs sampling, updating one partial correlation at a time (to keep the matrix positive definite).

Shrinkage algorithms for MMSE covariance estimation Y. Chen et al. (2009)

How to choose the shrinkage coefficient when estimating a covariance matrix (the optimal coefficient depends on the unknown variance matrix and cannot be used, the Ledoit-Wolf coefficient can be improved on).
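
To make the notion of a shrinkage coefficient concrete, here is a minimal sketch of a shrinkage covariance estimator with a scaled-identity target. The target and the fixed coefficient `rho` are assumptions made for this illustration: the summary above does not give the formulas that the paper uses to choose the coefficient, and none is implemented here.

```python
def sample_covariance(rows):
    """Sample covariance matrix (divisor n) of a list of observations."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    return [[sum((row[i] - means[i]) * (row[j] - means[j]) for row in rows) / n
             for j in range(p)] for i in range(p)]

def shrink_covariance(s, rho):
    """Convex combination (1 - rho) * S + rho * (tr(S) / p) * I, with 0 <= rho <= 1."""
    p = len(s)
    target = sum(s[i][i] for i in range(p)) / p
    return [[(1 - rho) * s[i][j] + (rho * target if i == j else 0.0)
             for j in range(p)] for i in range(p)]

# Perfectly correlated toy data: the sample covariance matrix is singular.
rows = [[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]]
s = sample_covariance(rows)
shrunk = shrink_covariance(s, rho=0.5)  # rho chosen arbitrarily here
```

Shrinking keeps the trace unchanged but pulls the eigenvalues together, so the shrunk matrix is invertible even though the sample covariance is not.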

Identifying small mean-reverting portfolios
A. d’Aspremont (2008)

When looking for cointegrated assets to form mean-reverting portfolios, sparse portfolios are more investable. For instance, one could look for a portfolio whose prices $P_t = S_t x$ maximize the Ornstein-Uhlenbeck mean reversion parameter $\lambda$ in

$$ dP_t = \lambda(\bar P - P_t)\,dt + \sigma\,dZ_t. $$

As a proxy for the mean-reversion, one can use the Box-Tiao predictability

$$ \nu = \frac{\mathrm{Var}_{t-1}[P_{t-1}]}{\mathrm{Var}_{t-1}[P_t]} = \frac{\mathrm{Var}[P_{t-1}]}{\mathrm{Var}[P_t \mid P_{t-1}]} $$

(maximizing mean-reversion corresponds to minimizing predictability, i.e., momentum). If the asset prices follow a VAR(1) process $S_t = S_{t-1}A + Z_t$, then $P_t = S_t x = S_{t-1}Ax + Z_t x$ and

$$ \nu(x) = \frac{x'A'\Gamma Ax}{x'\Gamma x} $$

where $\Gamma = \mathrm{Var}\,S_t$ – it is a generalized eigenvalue problem. To estimate the matrices $A$ and $\Gamma$, one can use penalized methods (covariance selection for $\Gamma$, lasso for $A$). The sparse generalized eigenvalue problem

$$ \max_x \left\{ \frac{x'Ax}{x'Bx} \;:\; \|x\|_0 \leq k,\ \|x\| = 1 \right\} $$

can be approximately solved greedily (progressively increase the number of non-zero coefficients, assuming that those sets of indices are increasing) or by semidefinite relaxation.

An open-source implementation of the critical line algorithm for portfolio optimization
D.H. Bailey and M. López de Prado (2013)

When people compute the efficient frontier

  Find       $w$
  To         minimize $w'\Sigma w$ and maximize $w'\mu$
  Such that  $w'1 = 1$, $l \leq w \leq u$

they often solve a series of optimization problems, each with a different target return, and interpolate. This is imprecise and computationally wasteful.

The turning points on the efficient frontier are efficient portfolios such that nearby efficient portfolios contain different assets – points, on the efficient frontier, at which an asset enters or leaves the portfolio. Between two turning points, the efficient portfolios are the solution of an unconstrained problem (only equality constraints: $w'\mu = \mu_{\text{target}}$ and $w'1 = 1$) on a subset of the assets, for which an analytic expression can be derived (via Lagrange multipliers); they are also convex combinations of the turning points (2-fund theorem). The efficient frontier can be computed by starting with the maximum return portfolio and moving down, from turning point to turning point. The authors provide an implementation in Python.

Loglog counting of large cardinalities
M. Durand and P. Flajolet (2003)

To approximately count the number of different elements in a stream of words (words in a text, IP addresses in an intrusion detection system, values in a column in a database for query optimization, etc.) one can:

– Distribute the hashed (randomized) values in $N$ buckets, in a bitmap, and count the occupied buckets (there are adaptive and hierarchical variants);
– Keep a proportion $p \ll 1$ of the data (by looking at the first bits of the hashed value) and count it, exactly;
– Look at the maximum position $\rho(x)$ of the first non-zero bit in the hashed data: $\log_2 N \approx \max \rho(x)$.

The log-log algorithm uses this last idea, after splitting the data into $m$ buckets, and averages the estimates.

HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm
P. Flajolet et al. (2007)

In the loglog counting algorithm, one can replace the arithmetic mean of the $\hat N_i$ with the harmonic mean of the $2^{\hat N_i}$.

Firefly algorithm, stochastic test functions and design optimisation
X.S. Yang (2010)

The firefly algorithm is similar to particle swarm optimization (PSO), but each particle moves randomly (no momentum) and is more attracted by nearby fitter particles (it is not attracted by less fit particles at all). If the local minima are hardly distinguishable, the particles can end up in several of them.

Empirical mode decomposition of financial data
K. Drakakis (2008)

An intrinsic mode function (IMF) is a continuous function with positive maxima and negative minima, i.e., a function that oscillates. A function $f$ can be decomposed into IMFs (empirical mode decomposition, EMD) as follows (sifting):

– Interpolate the local maxima (upper envelope) and local minima (lower envelope) of $f$;
– Average them;
– Subtract this average from the function;
– Iterate until this difference is an IMF;
– Subtract the IMF from $f$ and start again to decompose the residuals.
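The sifting steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not in the summarized paper: the envelopes are piecewise linear rather than cubic splines, the "is it an IMF?" check is replaced by a tolerance on the mean envelope, and the function names (`local_extrema`, `envelope`, `sift`, `emd`) are hypothetical.

```python
import math

def local_extrema(x):
    """Indices of the strict interior local maxima and minima of x."""
    inner = range(1, len(x) - 1)
    maxima = [i for i in inner if x[i - 1] < x[i] > x[i + 1]]
    minima = [i for i in inner if x[i - 1] > x[i] < x[i + 1]]
    return maxima, minima

def envelope(x, idx):
    """Piecewise-linear curve through the points (i, x[i]), i in idx (assumption:
    linear instead of spline interpolation), held constant beyond the ends."""
    env, k = [], 0
    for t in range(len(x)):
        if t <= idx[0]:
            env.append(x[idx[0]])
        elif t >= idx[-1]:
            env.append(x[idx[-1]])
        else:
            while idx[k + 1] < t:
                k += 1
            a, b = idx[k], idx[k + 1]
            w = (t - a) / (b - a)
            env.append((1 - w) * x[a] + w * x[b])
    return env

def sift(x, max_iter=50, tol=1e-6):
    """Repeatedly subtract the mean of the two envelopes (one candidate IMF)."""
    h = list(x)
    for _ in range(max_iter):
        maxima, minima = local_extrema(h)
        if len(maxima) < 2 or len(minima) < 2:
            break
        upper, lower = envelope(h, maxima), envelope(h, minima)
        mean = [(u + l) / 2 for u, l in zip(upper, lower)]
        h = [a - m for a, m in zip(h, mean)]
        if max(abs(m) for m in mean) < tol:  # assumed stopping rule
            break
    return h

def emd(x, max_imfs=5):
    """Extract IMFs one by one; what is left is the residual (trend)."""
    residual, imfs = list(x), []
    for _ in range(max_imfs):
        maxima, minima = local_extrema(residual)
        if len(maxima) < 2 or len(minima) < 2:
            break
        imf = sift(residual)
        imfs.append(imf)
        residual = [r - c for r, c in zip(residual, imf)]
    return imfs, residual

signal = [math.sin(2 * math.pi * t / 10) + 0.01 * t for t in range(200)]
imfs, residual = emd(signal)
```

By construction, the IMFs and the residual add up to the original signal; the quality of each IMF depends heavily on the interpolation and on the boundary treatment, which are deliberately crude here.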


Recent mathematical development on empirical mode decomposition
Y. Xu and H. Zhang

The first intrinsic mode function (IMF) of the empirical mode decomposition (EMD) is intuitively similar to the Hilbert transform (HHT, Hilbert-Huang transform):

$$ Hf(t) = \mathrm{pv} \int_{\mathbb{R}} \frac{f(s)}{t - s}\,ds, \qquad Af = f + iHf, \qquad Af(t) = \rho(t)\,e^{i\theta(t)} $$

where $\rho(t)$ is the instantaneous amplitude, $\theta(t)$ the phase, $\theta'(t)$ the frequency. This technical article tries to describe which functions can be obtained as IMF. In R, check the EMD and hht packages.

MCMC Using Hamiltonian dynamics
R.M. Neal (2011)

To sample from a probability distribution $P(q) \propto \exp -U(q)$, the Hamiltonian Monte Carlo (HMC) method adds another variable $p$ (momentum) and considers the Hamiltonian system $H(q, p) = U(q) + K(p)$, with $K(p) = \frac{1}{2} p' M^{-1} p$; its dynamics are

$$ \frac{dq_i}{dt} = \frac{\partial H}{\partial p_i}, \qquad \frac{dp_i}{dt} = -\frac{\partial H}{\partial q_i}. $$

To be numerically stable, the discretization of this system should remain reversible, volume-preserving, symplectic. The Euler method does not have these properties, but the leapfrog method (start with a half-step for momentum, and then use full steps for both position and momentum so that the updates are staggered) does. The HMC algorithm goes as follows: start with a state $(q, p)$, sample $p$ from $P(p) \propto \exp -K(p)$ (Gaussian), make $L$ leapfrog steps of size $\varepsilon$, and use the new state as a Metropolis proposal. HMC avoids the random-walk-like behaviour of the Metropolis algorithm, especially when the dimension is high or the variables correlated; fine-tuning the mass matrix $M$ can improve things even further. Choosing $\varepsilon$ and $L$ is difficult, though.

The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo
M.D. Hoffman and A. Gelman

The leapfrog simulation in HMC sampling is usually done a fixed number of times $L$. To avoid having to specify $L$, one can try to move in one direction, as long as $\|q_{\text{new}} - q_{\text{old}}\|$ increases, i.e., as long as

$$ \frac{1}{2} \frac{d}{dt} \|q_{\text{new}} - q_{\text{old}}\|^2 = (q_{\text{new}} - q_{\text{old}}) \cdot p_{\text{new}} \geq 0. $$

But this is not time-reversible: we would not be sampling from the right distribution. The NUTS algorithm fixes that problem by moving $s_n 2^n$ steps ahead, for increasing values of $n$, with $s_n$ a random sign (the technical details are more complicated than that).

Stan modeling language
M.D. Hoffman and A. Gelman (2012)

Stan is a Bayesian sampler, not unlike Bugs or Jags, but it uses Hamiltonian Monte Carlo (HMC) sampling: when a Gibbs sampler converges slowly and generates correlated samples, HMC should perform better. The model is converted into heavily-templated C++ (formal differentiation of the log-likelihood) and compiled (this can take time). Stan can automatically select the tuning parameters: the mass matrix is set to the identity; the number of steps can be chosen with the NUTS algorithm; the step size can be estimated during warmup. Instead of a sampling statement y[i] ~ normal(mu, sigma) one can explicitly increment the log-probability log__ 1%) hypotheses.

One could also use a loss function, e.g., that defining the expected shortfall – but it will just tell you that VaR estimators are bad CVaR estimators. (But it makes sense for the regulators to do that: this is how they use the VaR – to compute the risk capital.) Extended analysis of backtesting frameworks for value at risk G.J. van Roekel (2008)

More details, on the choice of the hyperparameters (e.g., stop using a subsample when the corresponding estimate ceases to change), with applications to time series (e.g., the stationary bootstrap: take a block of – Test the number of exceedances, with an exact test length m ≪ n, take a point at random, with probabilor a likelihood ratio (LR) test, test the time to the ity 1 − p take the next point, with probability p take next one. another point at random, iterate with more points, re- – Test their independence, with an LR test, using inpeat with more blocks). dicator functions “exceedance in [t − k, t + k]” rather than “exceedance at t”; test the distribution of the Article and book summaries by Vincent Zoonekynd


time between two exceedances: it should be geometric, i.e., easy to approximate with an exponential distribution (for instance, you could test the null that the distribution is exponential vs the alternative that it is Weibull but not exponential). Tests for discrete distributions have more power. – Compare the regulatory capital (derived from the VaR) with the average expected shortfall (CVaR).
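As an illustration of the likelihood ratio test on the number of exceedances mentioned in this list, here is a minimal sketch of the standard proportion-of-failures statistic, $-2 \ln\left[\alpha^{n_1}(1-\alpha)^{n_0} / \left((n_1/n)^{n_1}(n_0/n)^{n_0}\right)\right]$, which is asymptotically $\chi^2$ with one degree of freedom. The function name `exceedance_lr_test` and its arguments are hypothetical; the p-value uses the identity $P(\chi^2_1 > x) = \mathrm{erfc}(\sqrt{x/2})$.

```python
import math

def exceedance_lr_test(n_obs, n_exceed, alpha):
    """LR test of H0: P(exceedance) = alpha, given n_exceed exceedances in n_obs days.

    Returns (statistic, asymptotic p-value); only valid for large samples.
    """
    n1 = n_exceed
    n0 = n_obs - n_exceed

    def xlogy(x, y):
        # Convention 0 * log(0) = 0, needed when there are no exceedances.
        return 0.0 if x == 0 else x * math.log(y)

    log_lik_null = xlogy(n1, alpha) + xlogy(n0, 1 - alpha)
    log_lik_alt = xlogy(n1, n1 / n_obs) + xlogy(n0, n0 / n_obs)
    lr = max(-2.0 * (log_lik_null - log_lik_alt), 0.0)
    p_value = math.erfc(math.sqrt(lr / 2.0))  # chi-square, 1 degree of freedom
    return lr, p_value

# 30 exceedances of a 1% VaR in 1000 days: far too many, the model is rejected.
lr, p = exceedance_lr_test(1000, 30, 0.01)
```

With exactly 10 exceedances in 1000 days the statistic is essentially zero: the test only looks at the count, not at the clustering of the exceedances, which is what the independence tests are for.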

– Independent set (a set of vertices, no two of which are adjacent) of maximum weight in a path-graph (a graph that is also a path, i.e., a tree of arity 1) [hint: either the last element is in the set or not]; – Knapsack with integral sizes [same hint]; – Sequence alignment; – Optimal binary search tree (minimum average search time, when the frequency of each term is known); – For the single-source shortest path problem, use Dijkstra’s algorithm, if there are no negative edges, or A guide to modelling counterparty credit risk Bellman-Ford’s algorithm (slower): look at shortest M. Pykhtin and S. Zhu (2007) paths from s with at most k edges (you can save on The counterparty exposure of an asset of value v is space by keeping a pointer to the previous vertex in vT (its value when the counterparty defaults, at some the shortest path) – a distributed version is used for time T in the future). Using simplified pricing models, internet routing; you can simulate possible evolutions of vt and com- – For the all-pairs shortest paths problem (APSP), the pute its expectation and quantiles (for instance, for Floyd-Warshall algorithm computes the length of the bonds or swaps, it progressively increases, because of shortest path from i to j passing through (at most) uncertainty, with a drop at each cash flow, because the 1, 2, . . . , k; Johnson’s algorithm modifies the weights, remaining cash flows are decreasing). The credit value by adding them pi − pj , where pi is the length of the adjustment (CVA) is the risk-neutral expectation of the minimum path from a new vertex to i: this makes loss. them nonnegative, so we can use Dijktra’s algorithm. A probem A reduces to a problem B, denoted A ≼ B, if a polynomial algorithm for B gives a polynomial algorithm for A. If C is a class of problems, a problem The second part of the course covered greedy algo- A ∈ C is C -complete if all problems in C reduce to rithms, dynamic programming and NP-completeness. 
A, i.e., A is the most intractable problem in C (but there can be many problems as difficult as A). Let Greedy algorithms never come back on a decision they P denote the class of problems solvable in polynomial made, and therefore run in linear time: e.g., optimal time. Let NP be the class of problems for which it is caching (remove from the cache elements you will not possible to verify a solution in polynomial time, and need any time soon), scheduling (each job has a weight whose solutions have polynomial lengths; intuitively, and a duration, sort them by weight/duration). If they these are the problems solvable by brute-force search. happen to be correct (rarely), this can often be proved It is conjectured, but not proved, that P ̸= NP. by induction or with an exchange argument (start with an optimal solution and transform it into the output There are many NP-complete problems: TSP, 3SAT, etc. of the algorithm, without ever making it worse). Algorithms, design and analysis II T. Roughgarden (Coursera, 2012)

For the minimum spanning tree (MST) problem, To solve an NP-complete problem, you can: Prim’s algorithm grows a connected tree (and can be – Focus on solvable special cases, e.g., the 2-SAT probefficiently implemented with a heap), while Kruskal’s lem: when solving a 3-SAT problem, you may be algorithm grows a forest by adding the cheapest edges able to isolate 10 or 20 variables so that, when you as long as they do not introduce a cycle (and can be fix them, the problems becomes a 2-SAT problem: implemented with a union-find data structure). They you can then combine brute force search on those are correct and fast, but there are faster algorithms. variables and a polynomial 2-SAT algorithm; Kruskal’s MST algorithm can be used to find the k- – Use heuristics: for the knapsack problem, there are clustering of a set of points that maximizes spacing greedy (sort the idems by decreasing value/weight (the spacing is Min d(a, b) where a and b are in different – to prove that the the solution is at least 50% of clusters) i.e., the distance between the most alarmingly the value of the optimal solution, consider a “fracclose points that are not in the same cluster. tional solution”, in which you can cut the objects) and dynamic-programming-based (rounding) heurisHuffman codes (binary prefix-free codes; they can be tics; represented as binary trees) are also built greedily, bottom-up: start with each symbol in a 1-element tree, – Find an exact algorithm, exponential-time but faster than brute-force search: for instamce, dynamic promerge the two least frequent elements, iterate. gramming for the knapsack problem, dynamic proDynamic programming solves a problem by considering gramming for the traveling salesman problem (2n insub-problems, with an order relation (often inclusion), stead of n!: consider L(S, j), the minimum length of from whose solutions it is possible to solve the original a path from 1 to j that passes exactly once through problem. 
Contrary to divide-and-conqueer, the subeach vertex in S, for all S containing 1 and j), etc. problems are overlapping, and it is necessary to cache (memoize) the intermediate results. Here are a few In the vertex cover problem, you are asked to find a minimum set of vertices so that each edge has at least examples. Article and book summaries by Vincent Zoonekynd


one extremity in it. It can be solved in special cases, e.g., trees (dynamic programming) or bipartite graphs (maximum flow problem); for small graphs, use the fact that G has a vertex cover of size k iff Gu or Gv has a vertex cover of size k −1, where Gu is G without vertex u, and u−v is an edge.
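The recursion "G has a vertex cover of size k iff G − u or G − v has one of size k − 1" translates directly into a small search. The sketch below is a hypothetical helper (`vertex_cover`, with the graph given as a list of edges); it explores at most 2^k branches, so it is only practical when k is small.

```python
def vertex_cover(edges, k):
    """Return a vertex cover with at most k vertices (as a set), or None.

    edges: list of pairs (u, v). Picks any edge u-v: one of its two
    extremities must be in the cover, so try both and recurse with k - 1.
    """
    if not edges:
        return set()          # nothing left to cover
    if k == 0:
        return None           # edges remain but the budget is exhausted
    u, v = edges[0]
    for w in (u, v):
        remaining = [e for e in edges if w not in e]  # the graph without w
        cover = vertex_cover(remaining, k - 1)
        if cover is not None:
            return cover | {w}
    return None

path = [(1, 2), (2, 3), (3, 4)]
print(vertex_cover(path, 1))   # no cover of size 1 exists
print(vertex_cover(path, 2))   # a cover of size 2
```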

the data in the arrays pointed to by their pointer arguments; they can retrieve the thread block and thread number through implicitly-defined variables threadIdx.x, blockDim.x, blockIdx.x.

plication implied by the clauses (each is equivalent to two implications), the problem reduces to finding the strongly connected components; – Backtracking (each clause forbids one assignment of the variables); – Randomized local search (find an arbitrary unsatisfied clause and flip one of the variables, at most 2n2 times, and use log2 n random restarts: the probability of finding an assignment if there is one is at least 1 − 1/n – this estimate is very conservative, but the algorithm is quadratic).
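The randomized local search can be sketched as follows. The encoding is an assumption made for the example: variables are numbered from 1 to n and a clause is a pair of non-zero integers, a negative number standing for a negated variable; the names `is_satisfied` and `random_walk_2sat` are hypothetical. The number of restarts defaults to log2 n, as in the description above, but can be increased.

```python
import math
import random

def is_satisfied(clause, assignment):
    """A clause is satisfied if at least one of its literals is true."""
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)

def random_walk_2sat(n, clauses, restarts=None, rng=None):
    """Return a satisfying assignment {variable: bool}, or None if none was found.

    None does not prove unsatisfiability: the search is randomized.
    """
    rng = rng or random.Random()
    if restarts is None:
        restarts = max(1, math.ceil(math.log2(n)))
    for _ in range(restarts):
        assignment = {v: rng.random() < 0.5 for v in range(1, n + 1)}
        for _ in range(2 * n * n + 1):   # check, then flip, at most 2n^2 useful flips
            unsatisfied = [c for c in clauses if not is_satisfied(c, assignment)]
            if not unsatisfied:
                return assignment
            # Flip one of the two variables of some unsatisfied clause.
            literal = rng.choice(rng.choice(unsatisfied))
            assignment[abs(literal)] = not assignment[abs(literal)]
    return None

# (x1 or x2) and (not x1 or x3) and (x3 or x4) and (not x2 or not x4)
clauses = [(1, 2), (-1, 3), (3, 4), (-2, -4)]
solution = random_walk_2sat(4, clauses, restarts=50, rng=random.Random(0))
```

Recomputing the list of unsatisfied clauses at each step keeps the sketch short; an implementation meant for large instances would maintain it incrementally.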

There are other “parallel design patterns” such as compaction (the use of sparse matrices), binning (the Barnes-Hut algorithm, for the N -body problem), scatter-to-gather conversion (map/reduce), etc.

The following topics were not covered: maximum flow, linear programming, computational geometry, parallel or distributed algorithms, algorithms that run forever, streaming algorithms (when the data is too large to be kept in memory), etc.

OpenCL is similar to CUDA, but not limited to a single vendor, and a bit clunkier (the device code is compiled at runtime and has to be put in a string (sic); it cannot be called as a normal function: its arguments have to be pushed on a stack one by one; etc.).

Threads and blocks are arranged in 1-, 2- or 3dimensional arrays. Threads in the same block can The maximum cut problem (finding the cut with the share memory: declare it as __shared__ in the kerlargest number of edges) is NP-complete. It is tractable nel function. Threads in a block can wait for other in special cases: for instance, in bipartite graphs, there threads in the same block to reach the same point, by is a cut with all edges, and it can be found by breadth- calling __syncthreads(). Shared memory and thread first search in linear time. In general, you can use a blocks can reduce the amount of data transfered belocal search heuristic: let c(v) (resp. d(v)) be the num- tween the host and the device, and greatly increase the ber of edges from vertex v that cross (resp, do not program speed – the typical example is tiled matrix cross) the cut, start with an arbitrary cut, move v if multiplication, i.e., multiplying block matrices. Cond > c. Use multiple random restarts to help improve volution (or solving PDEs, e.g., wave propagation) rethe solution. quires halo data: data from neighbouring nodes, close to the boundary – you may want to process it in priThe 2-SAT problem (find an assignment of ∧ boolean ority to avoid delaying the neighbouring nodes. variables so that some statement of the form k εk aik ∨ ηk ajk , where εk , ηk are the identity or the negation) can Many linear algorithms, such as the sum or the cumube solved in polynomial time (only the 3-SAT problem lated sum of a vector, can be executed in log-linear time is NP-complete): (parallel prefix sum, or scan), by arranging the compu– By considering the graph whose vertices are the vari- tations in a tree (dividing the number of threads by 2 ables and their negations, with an edge for each im- at each iteration).
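The tree-shaped arrangement of the computations can be simulated sequentially. In this sketch (plain Python, not CUDA; the names `tree_sum` and `inclusive_scan` are hypothetical), each list comprehension stands for one parallel round in which every element would be handled by its own thread, with a barrier between rounds. The scan uses the Hillis-Steele scheme, which is only one of several possible variants and is chosen here for brevity.

```python
def tree_sum(values):
    """Pairwise (tree) reduction: the number of active 'threads' halves each round."""
    x = list(values)
    while len(x) > 1:
        if len(x) % 2:
            x.append(0)                      # pad to an even length
        x = [x[2 * i] + x[2 * i + 1] for i in range(len(x) // 2)]
    return x[0] if x else 0

def inclusive_scan(values):
    """Cumulated sum in about log2(n) rounds (Hillis-Steele scheme)."""
    x = list(values)
    step = 1
    while step < len(x):
        # Round: element i adds the value 'step' positions to its left.
        x = [x[i] + x[i - step] if i >= step else x[i] for i in range(len(x))]
        step *= 2
    return x

print(tree_sum([1, 2, 3, 4, 5]))        # 15
print(inclusive_scan([1, 2, 3, 4, 5]))  # [1, 3, 6, 10, 15]
```

The number of rounds is logarithmic in the length of the vector; the total amount of work of this scan variant is larger than that of the sequential loop, which is the usual trade-off.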

Heterogeneous computing W.-M. W. Hwu (Coursera, 2012) GPU programming is getting more popular, and the main API is currently CUDA, targeted at NVidia chips. A CUDA program is just a C program, with a few more keywords. The memory can be allocated either on the host (CPU, malloc, free) or the device (GPU, cudaMalloc, cudaFree), can be copied between them (cudaMemcpy). Functions can be declared as callable and/or runnable from/on the host and/or the device: kernel functions, prefixed with __global__, are called from the host as vecAddKernel(x,y,z, n); and run on the device, in parallel (blocks×threads threads); they do not return any value, but can change Article and book summaries by Vincent Zoonekynd

Streams allow you to transfer data and compute at the same time. Manually managing them can be tricky. MPI is a framework for distributed programming, where processes (possibly running on different machines) send and receive messages. The nodes can use CUDA.

OpenACCjust adds a few #pragmas to a sequential C program: #pragma acc parallel loop copyin(M[0:m*n]) copyin(N[0:n*p]) copyout(P[0:m*p]) for(int i=0; i a; or some subjective view; or a mixture distribution (the 1-day VaR, and allow the square root rule...). normal regime, from the risk model, and a “crash You can compute confidence intervals for the VaR esti- regime”, with some crash probability); or a bayesian mators (analytically, or via asymptotic results – boot- blend of the two regimes (product of the probability strap is not mentionned). Here are a few tests for the distribution functions). VaR, based on the number of exceedances n1 and the number of consecutive exceedances n11 (the tests are You can use data from a past crisis (sample covariance matrix, higher moments) to define the “crash regime” only asymptotic, unlikely to be useful unless you really or correct the covariance matrix of (more recent) hishave a huge amount of data, and the independence test will overlook clustering unless the exceedances are ex- torical returns. If you manually tweak the correlation matrix, it can cease to be a valid correlation matrix, actly consecutive). and you may need to correct it. αn1 (1 − α)n0 Exogenous liquidity adjustement of the VaR (by mod−2 ln ( )n ( )n ∼ χ2 df=1 n1 1 n0 0 eling the distribution of the bid-ask spread) has a negn ligible effect, but endogenous adjustment (linear or ( n )n1 ( n )nn0 1 0 quadratic market impact, with some price drift if the 2 n n liquidation is spread over several days, accounding for ∼ χ −2 ln ( )n ( )n ( )n ( )n df=1 n00 00 n01 01 n10 10 n11 11 volatility clustering) may be necessary. n0 n0 n1 n1 8. Regulatory capital is the capital that regulaαn1 (1 − α)n0 tors ask banks to hold to ensure they remain solvent. 
2 −2 ln ( )n00( )n01( )n10( )n11 ∼ χdf=2 It can be computed as 3VaR 10-day (either the avern00 n01 n10 n11 age of the 10-day VaR over the past 60 days, or the n0 n0 n1 n1 latest 10-day VaR, whichever is higher) if the VaR model has been approved (i.e., if it does not have too There are similar tests for the expected shortfall. many exceedances). Alternatively, the regulators proTo check of the risk model uses all the information in vide “standardized rules”: for equities, 8% of the value the market, regress the exceedance indicator variable of the positions, plus a 4% to 8% charge for specific against the lagged risk factors (with several lags) and risk (insuddicient diversification), and another one for compare with the intercept-only model (the intercept credit risk – fixed income, commodities, currencies have is the significance level α). similar rules. If you are willing to assume Gaussian iid returns (test Economic capital (EC) is similar, but used internally for autocorrelation, heteroskedasticity, normality), you and presented to shareholders and rating agencies. It could test the bias: is often based on ES rather than VaR. Risk budgeting ) ( ) ( 1 Yt+1 or economic capital allocation is the allocation of the ∼ N 1, . sd economic capital to the various desks of the firm. Conσ ˆt+1 2T trary to real capital (needed by funded activities, and Instead of looking at a single quantile, you could com- whose financing cost should be included in the P&L), it pare the whole forecasted distribution with the actual is not additive, because of diversification. Aggregation returns, e.g., with a likelihood ratio test (I would use risk (use of incorrect correlation when aggreting the VaR or ES of the various desks to compute the firma Kolmogorov-Smirnov test). wide economic capital, or when allocating the economic 7. 
While VaR and ES look at extreme events in oor- capital to the various desks – or incorrect use of correlamal market conditions, stress tests (scenario analysis) tions when they do not capture the dependences) is the look at abnormal market conditions. Article and book summaries by Vincent Zoonekynd

295/587

most important source of model risk. Risk budgeting is often done by looking at the efficient frontier in the P&L×EC space, or by optimizing some risk-adjusted performance measure, such as

Identifying communities is trickier (most of those algorithms do not work on real-world networks, probably because communities overlap):

of nodes on whose shortest path the node is; – Its closeness centrality, i.e., the average of the inverse of the distances from this node; – Its eigenvalue centrality, ∑ centralityi = wij centralityj ,

Growth models, such as the Barabási-Albert preferential attachment model, have hubs, and a more unequal degree distribution (power law).

– Connected components, strongly-connected components; E[P&L] RORAC = – Cliques: they betray community structures, but they EC overlap, can be incomplete, are not robust; they only or describe a “core” rather than the larger community; E[P&L] − kEC RAROC = – k-core: a subgraph in which every node is connected EC to at least k other nodes; (they give the same point on the efficient frontier: the – Connected components of the graph with nodes the tangent allocation). k-cliques and an edge if two k-cliques differ by only one node; – Hierarchical clustering; Social network analysis – Betweeness clustering: remove the edge with the L. Adamic (Coursera, 2012) highest betweeness, recompute, iterate until the be1. Given a network, you can look at the following detweeness of the remaining edges is sufficiently low; scriptive statistics. the connected components are the communities; – The degree of each node, the distribution of the de- – Modularity: compare the presence of edges within grees (distinguish in- and out-degrees for oriented and between communities with what you would exgraphs), the average degree, its standard deviation pect if they were selected at random; start with ranor the Gini coefficient (to measure how diverse the dom community assignments and change them to imdegrees are); prove the modularity. – The average shortest path; 2. The simplest network model is the Erdös-Rényi ran– The size of the largest connected component (or dom network: the edges are simply selected at random, strongly-connected, for oriented graphs). independently. There is a giant component, there are To assess how central a node is, look at: no hubs, the degree distribution is binomial (asymptotically Poisson), the average shortest parth is pro– Its degree; – Its betweeness centrality, i.e., the proportion of pairs portional to log n.

j neighbour of i

which can be generalized to C = (α + βC)A, where A is the adjacency matrix, β a parameter, and α a normalization constant. The analogue for directed networks, PageRank, is more complicated: it can be computed with a random walk on the directed graph, with some teleportation to avoid getting stuck in loops (it cannot happen with undirected networks: if you enter a loop, you can always leave it).
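The random walk with teleportation can be approximated by power iteration. The sketch below is a minimal version under stated assumptions: the graph is a dictionary mapping each node to the list of nodes it links to, the teleportation probability is 1 − `damping` (0.85 is a conventional choice, not a value from the course), and nodes without outgoing links redistribute their weight uniformly. The function name `pagerank` is hypothetical.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Stationary distribution of a random surfer on a directed graph.

    links: dict node -> list of successor nodes.
    With probability `damping` follow a random outgoing link,
    otherwise teleport to a node chosen uniformly at random.
    """
    nodes = sorted(set(links) | {v for targets in links.values() for v in targets})
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for u in nodes:
            targets = links.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Dangling node (assumption): spread its weight over all nodes.
                for v in nodes:
                    new_rank[v] += damping * rank[u] / n
        rank = new_rank
    return rank

ranks = pagerank({"a": ["c"], "b": ["c"], "c": ["a"]})
```

In this small example, "c" receives links from both other nodes and ends up with the largest score, "a" inherits most of it, and "b", which nobody links to, only gets the teleportation mass.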

Small world networks combine the short distances of ER random networks with a community structure. The Watts-Strogatz model starts with a lattice and randomly rewires it, but stops before getting a random network. There are hierarchical (Watts-Dodds-Newman) and geographical (Kleinberg: the probability of an edge is proportional to some power of the distance) variants. When rewiring, you can use simulated annealing to minimize

$$ \lambda \times (\text{average shortest path, in number of edges}) + (1 - \lambda) \times (\text{average shortest path, in kilometers}). $$

The strength of an edge A–B can be assessed with the neighbourhood overlap:

Changing the value of λ gives networks with hubs or with short edges.

$$ \frac{\text{number of neighbours of both } A \text{ and } B}{\text{number of neighbours of either } A \text{ or } B}. $$
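This ratio is straightforward to compute from adjacency sets. In the sketch below, the graph is an undirected network stored as a dictionary of sets, and A and B themselves are excluded from both counts; that exclusion is a common convention assumed here, not something stated above. The function name `neighbourhood_overlap` is hypothetical.

```python
def neighbourhood_overlap(adjacency, a, b):
    """Neighbourhood overlap of the edge a-b in an undirected graph.

    adjacency: dict node -> set of neighbours.
    Returns (# common neighbours) / (# neighbours of at least one of a, b),
    not counting a and b themselves; 0.0 if they have no other neighbour.
    """
    neighbours_a = adjacency[a] - {a, b}
    neighbours_b = adjacency[b] - {a, b}
    either = neighbours_a | neighbours_b
    if not either:
        return 0.0
    return len(neighbours_a & neighbours_b) / len(either)

graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C", "E"},
    "C": {"A", "B"},
    "D": {"A"},
    "E": {"B"},
}
print(neighbourhood_overlap(graph, "A", "B"))  # C is shared, out of C, D, E: 1/3
```

An overlap close to zero suggests a weak tie bridging two groups; an overlap close to one suggests an edge embedded in a tightly-knit group.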

3. A power law (or Pareto distribution) is a probability distribution such that

The presence of communities can be assessed by the clustering coefficient:

$$ P(X = x) \propto x^{-\alpha}; $$

$$ \frac{\text{number of closed triangles}}{\text{number of connected triplets}}. $$

For a directed network, you can count all the 3-node patterns (“motifs”) and compare with the numbers for a random graph with the same average degree, or with the same degree distribution.

in the case of node degree, the variable is discrete. The Zipf law (the size of the $r$th largest event is proportional to $r^{-\beta}$) is the quantile function (q-function) of the Pareto distribution. Fitting a (discrete) power law distribution is tricky:

– When regressing the logarithm of the number of nodes with a given degree against the logarithm of the degree, large numbers of observations are binned in a single point for low degrees, and there are many empty bins for high degrees – forgetting them gives a very biased estimator;
– Logarithmic binning is better, but discards information;
– Cumulative binning is preferable: the number of events in ⟦1, n⟧ still follows a power law, with exponent $\alpha - 1$;
– The maximum likelihood estimator is (you need to choose where the power law starts, $x_{\min}$)

$$ \alpha = 1 + n \left( \sum_{i=1}^{n} \log(x_i / x_{\min}) \right)^{-1}; $$

– The power law may not extend to the very end of the tail: you can try to add an exponential cut-off, $p(x) \propto x^{-\alpha} e^{-x/k}$.

4. One can study the influence of the network structure on various processes such as contagion (or information diffusion, opinion formation, coordination) or resilience. Assortative networks (hubs connect to hubs, as opposed to disassortative nets, such as the web, in which the hubs are on the periphery) are more resilient. You can also look at the correlation profile: compare the number of edges between nodes of degree $k$ and $l$ with that in a random network; look at the average degree of the neighbours, as a function of a node’s degree; look at the correlation between the degree of two adjacent nodes.

5. From a software point of view, networks can be studied with Gephi (to look at the network), NetLogo (for experiments, and to explain things to others), iGraph, NetworkX (Python), sna (R).

Cryptography I
D. Boneh (Coursera 2012)

Many historical ciphers are generalizations of the Caesar cipher (rot-3): substitution cipher, Vigenère cipher (k rot-n ciphers), Hebern machine (k substitution ciphers, forming a “rotor”), Enigma machine (idem, with 3 to 5 rotors, rotating at different speeds).

Stream ciphers use a similar idea: in the one-time pad, the message is xored with the key, which has to be as long as the message (the safety of this cipher relies on the fact that if X and Y are binary, independent, and X is uniformly distributed, then X xor Y is uniformly distributed); most stream ciphers use a pseudo-random generator instead, e.g., Salsa20 (or the other eStream ciphers – do not use a linear congruential generator or random in glibc).

There are many notions of a “secure” cipher. A cipher is semantically secure if $E(m_1, k)$ and $E(m_2, k)$ are computationally indistinguishable (for cleartexts $m_1$, $m_2$ chosen by the attacker and a random key $k$). A cipher is malleable if you can change the ciphertext to generate predictable effects on the clear text (e.g., the one-time pad is malleable: if you know that the message starts with From:␣Bob, you can xor it with From:␣Bob xor From:␣Eve to change the sender name). There are many more, each corresponding to a type of attack or information leakage we want to prevent. Not all attacks necessitate flaws in the algorithm: it is easier to attack the implementation, e.g., with side channel attacks (measure the time, the power consumption, the cache misses, etc.) or fault attacks (overclock the device so that it outputs a wrong result – this can give some information about the key).

A block cipher encrypts the message by blocks, somewhat hiding its exact length. Many are built from Feistel networks: cut the block in two $(L, R)$ and consider $F_1 \circ \cdots \circ F_n$, where $F_i(L, R) = (R, L \text{ xor } f_i(R))$ – the $F_i$ are invertible (pseudo-random permutation, PRP) for arbitrary $f_i$ (pseudo-random function, PRF). DES and AES are block ciphers. (DES and double DES are broken; there is an attack on the triple DES, but it is not practical). AES is not a Feistel network: $c_i = \sigma(f_i(c_{i-1} \text{ xor } k_i))$ (11 times, with a permutation $\sigma$ and PRP $f_i$, starting with $c_0 = m$).

A pseudo-random generator (PRG) $G : K \to K \times K$ defines a block cipher: it is a 1-bit PRF, and by iterating it,

$$ k \mapsto (G(k)_1, G(k)_2) \mapsto (G(G(k)_1)_1, G(G(k)_1)_2, G(G(k)_2)_1, G(G(k)_2)_2) $$

you get a 2-bit PRF (those 4 values correspond to the encoding of 00, 01, 10, 11), and eventually an $n$-bit PRF (you do not need to compute all $2^n$ values: since you only want one branch of the tree, there are only $n$ values to compute). The PRF can then be turned into a PRP with the Luby-Rackoff theorem (3-round Feistel). But it is much slower than heuristic block ciphers (or the stream cipher used).

The naive use of a block cipher, cutting the message into blocks and encrypting each block separately, with the same key (ECB, Electronic Code Book), is unsafe: two identical blocks give identical cipher texts, it is not safe against replay attacks – even worse, if you encrypt an image like that, its silhouette remains visible. Cipher Block Chaining (CBC) remedies this by xoring the cleartext with the ciphertext of the previous block before encryption:

c0 = IV (Initialization Vector)
ci = E(k, ci−1 xor mi)

To ensure that the initialization vector is not predictable, you can use a nonce (“number used once”), and encrypt it (with another key):

c0 = nonce
IV = E(k1, nonce)
c1 = E(k, IV xor m1)
ci = E(k, ci−1 xor mi)

Randomized counter mode (CTR) uses a similar idea, but is parallelizable: xor the message with F(k, IV), F(k, IV + 1), etc., where F is a pseudorandom function.
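To make the counter-mode construction concrete, here is a small sketch using only the Python standard library. It assumes HMAC-SHA256 as the pseudorandom function F, which is a convenient stand-in for illustration and not what the course or any standard prescribes; the names `prf`, `ctr_xor`, `encrypt` and `decrypt` are hypothetical, and the code provides no integrity protection.

```python
import hashlib
import hmac
import os

BLOCK = 32  # bytes produced by one call to the PRF (HMAC-SHA256 output size)

def prf(key, counter):
    """F(key, counter): HMAC-SHA256 used as a PRF (assumption for this sketch)."""
    return hmac.new(key, counter.to_bytes(16, "big"), hashlib.sha256).digest()

def ctr_xor(key, iv, data):
    """Xor data with F(key, iv), F(key, iv + 1), ...; each block is independent."""
    out = bytearray()
    for offset in range(0, len(data), BLOCK):
        pad = prf(key, (iv + offset // BLOCK) % 2**128)
        chunk = data[offset:offset + BLOCK]
        out.extend(x ^ y for x, y in zip(chunk, pad))
    return bytes(out)

def encrypt(key, message):
    """Pick a fresh random IV for every message and send it with the ciphertext."""
    iv = int.from_bytes(os.urandom(16), "big")
    return iv, ctr_xor(key, iv, message)

def decrypt(key, iv, ciphertext):
    """Decryption is the same xor: F is never inverted, so a PRF is enough."""
    return ctr_xor(key, iv, ciphertext)

key = os.urandom(32)
iv, ciphertext = encrypt(key, b"attack at dawn, then have breakfast at the usual place")
assert decrypt(key, iv, ciphertext) == b"attack at dawn, then have breakfast at the usual place"
```

Since every block of the pad depends only on the key and the counter, the loop could be run in parallel, and no padding of the last block is needed: the ciphertext has the same length as the message.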

From a collision-resistant compression function operating on small messages, you can build a collisionThe integrity of a message can be ensured with a mes- resistant function that operates on larger messages, by chaining the hash function: e.g., SHA-1 (almost brosage authentication code (MAC): ken), SHA-256, SHA-512, Whirlpool (slower) . tag = S(message, key) V (message, tag, key) = ”yes”

Contrary to checksums, MACs have to be robust to malicious errors, not just random ones. A PRF can be used as a MAC (it should be secure and sufficiently long – more than 80 bits), for small messages. For larger messages, you can chain the MAC computations, as in, encrypted CBC-MAC:

A collision-resistant function is not a secure MAC: use a Hash MAC (HMAC) instead (it is very similar to NMAC, which used two keys, and a PRF instead of a hash): S(key, message) = Hash( key∥outer pad, Hash(key∥inner pad, message) ).

c0 = F (key, m0 ) c1 = F (key, m1 xor c0 ) ... MAC = F (other key, cn ) or NMAC (nexted MAC): c0 = F (key, m0 ) c1 = F (c0 , m1 ) ... MAC = F (other key, cn ∥pad) (without the last encryption, it is not secure: an attacker can append something to the message). Padding the last block with zeroes is unsafe. Instead, you can add 100 · · · 00, but you may need to add a dummy block. CMAC provides padding with no dummy block: replace the final encryption with cn = F (key, (mn ∥100 · · · 00) xor key1 ) cn = F (key, mn xor key2 )

When checking if the MAC is correct, you need to compare several bytes: the compiler (lazy language, optimizing compiler, ||/&& shortcut, etc.) may want to stop as soon as it knows the result, telling a potential attacker whether the first byte was incorrect. Before the comparison, you can encrypt the correct MAC and the MAC to test, and compare the encrypted values. Encryption alone is not secure against tampering (it is safe against passive attacks only, not active attacks): always use authenticated encryption instead. We want a new notion of safety, ciphertext integrity: the recipient can reject messages if they are not valid; the attacker cannot create a valid ciphertext; the attacker chooses the ciphertext. This does not provide any safety against replay attacks Encrypt-and-MAC (add the MAC of the plaintext and send it in clear) is not safe (it leaks information); MACthen-encrypt (add the MAC of the plaintext and encrypt everything) is only safe in some special cases; encrypt-then MAC (add the MAC of the ciphertext, in clear) is safe.

In CMAC, which of the two final-block formulas applies depends on whether there is padding or not.

The MAC can be parallelized (PMAC):

c1 = F(key1, m1 xor P(key0, 1))
c2 = F(key1, m2 xor P(key0, 2))
...
return F(key1, c1 xor ··· xor cn)

If F is invertible (a PRP rather than just a PRF), you can invert the last step, add some more data, or modify one of the blocks, and recompute the tag.

A one-time MAC can be computed from a large prime number q (e.g., $2^{128} + 51$), using two random numbers k, a ∈ {1, …, q} as a key: if the message (m1, m2, …) is made of 128-bit integers, the tag is $\sum_i m_i k^i + a \bmod q$. A one-time MAC can be transformed into a many-time MAC (Carter-Wegman MAC):

tag = (r, F(key1, r) xor S(key2, message))

where r is random, S is a one-time MAC, F is a PRF. It is a randomized MAC: it is not deterministic.

There are many standards that combine a cipher and a MAC:
– TLS (to prevent replay attacks, the sender and the recipient keep track of a counter, incremented each time, used in the MAC computation, but never exchanged; suffers from padding attacks, timing attacks, information leakage (the reason why a message is rejected); renegotiates the key in case of a problem to avoid those attacks);
– 802.11b (MAC-then-encrypt (the reverse is safer), linear CRC (not a cryptographic MAC), repeated IV, related keys, etc.);
– SSH (the length of the message is not authenticated).

In many situations, you have one key, but the algorithm needs several: you can use a key derivation function (KDF). You could use a PRF, if the source key is uniform. Otherwise, use a salt (non-secret, fixed, chosen at random) to extract a pseudo-random key from the source key, K = HMAC(salt, source key), and use HMAC as a PRF with key K. Do not use password-based KDFs: there is not enough entropy, and dictionary attacks work well – if you insist on using them, use a salt and a slow hash function, e.g., iterate x ↦ hash(x∥salt) many ($10^6$) times.
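The extract-then-use-as-PRF key derivation described above can be sketched with the standard library. The function names and the labels passed to derive are hypothetical, SHA-256 is an arbitrary choice, and the password variant swaps the plain "iterate the hash" loop for PBKDF2 (hashlib.pbkdf2_hmac), a standardized slow construction in the same spirit.

```python
import hashlib
import hmac


def extract(salt: bytes, source_key: bytes) -> bytes:
    """K = HMAC(salt, source key): turn a non-uniform source key into a pseudo-random key."""
    return hmac.new(salt, source_key, hashlib.sha256).digest()


def derive(prk: bytes, label: bytes) -> bytes:
    """Use HMAC as a PRF keyed with the extracted key: one output per label."""
    return hmac.new(prk, label, hashlib.sha256).digest()


def password_key(password: bytes, salt: bytes, iterations: int = 1_000_000) -> bytes:
    """Slow, salted derivation for low-entropy passwords (PBKDF2-HMAC-SHA256)."""
    return hashlib.pbkdf2_hmac("sha256", password, salt, iterations)
```

For example, derive(prk, b"encryption") and derive(prk, b"mac") give two independent-looking keys from a single extracted key prk.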


Deterministic encryption (no nonce: the ciphertext of a given message is always the same) allows you to search in an encrypted database but is unsafe if the messages are small or repeated (e.g., foreign keys). Tweakable encryption, E(key, tweak, message), where each E(k, t, ·) is a PRP, is often used in disk encryption (e.g., XTS), with t the sector number and E(k, t, x) = E(E(k, t), x).

Key exchange can be performed through a trusted third party (TTP), who shares a key with everyone: the TTP chooses a random key $K_{AB}$, encrypts it for A as $E(K_A, K_{AB})$, encrypts it for B as $E(K_B, K_{AB})$, and sends both to A (not to B: A will forward B's copy). This is safe against eavesdropping, but not against active attacks or TTP corruption.

In the Merkle puzzle scheme, A sends n puzzles to B, each taking O(n) time to solve. B chooses one, solves it, and the solution is of the form (id, key) (A had generated n different keys). B sends the id to A; they both know the key. There is a gap between the work done by B, O(n), and by the attacker, $O(n^2)$; it is very inefficient.

In the Diffie-Hellman scheme, choose a large prime p, and g a generator of the cyclic group $(\mathbf{Z}/p)^\times$. A chooses a in {1, …, p − 1}, sends $g^a \bmod p$ to B; B chooses b in {1, …, p − 1}, sends $g^b \bmod p$ to A. They both compute $g^{ab}$ and use it as a key. For a 256-bit key, you need a 16,000-bit prime. The same idea works with other cyclic groups, e.g., elliptic curves: a 512-bit curve suffices. It is insecure against man-in-the-middle attacks. It can work passively: everyone publishes $g^a$, and can read the $g^b$ of everyone else. It only works to communicate between two parties; it can be generalized to 3 but, beyond that, the problem is still open.

A collision-resistant hash can be built from the discrete logarithm: $H(x, y) = g^x h^y$ in $G = \mathbf{F}_p^\times$, with g, h ∈ G (key) – finding a collision is equivalent to computing a discrete log. Other problems believed to be hard:
– Factor a number into a product of two large primes (soon possible for 1024-bit numbers, hard for 2048-bit numbers);
– Find a root of a polynomial of degree > 1 mod n (the best known algorithms factor n).

Many public-key encryption schemes are based on trapdoor encryption: a pair of functions that can be generated randomly, one inverse of the other, but otherwise very difficult to invert. One is non-deterministic and used to encrypt messages (public key), the other is deterministic (private key, trapdoor).

Public-key encryption relies heavily on number theory:
– Euclid's algorithm (there exist u and v so that au + bv = gcd(a, b)) can be used to compute inverses in Z/nZ.
– Fermat's theorem ($x^{p-1} = 1 \pmod p$) can also be used to compute inverses ($x^{-1} = x^{p-2}$), but is less efficient ($(\log p)^3$ instead of $(\log p)^2$). It can also be used to select random prime numbers: pick p at random, until $2^{p-1} = 1 \pmod p$; you can be fairly confident (P[not prime] < $2^{-60}$), but there are better, non-probabilistic, tests of primality.
– Euler: $\mathbf{F}_p^\times$ is a cyclic group.
– Lagrange: $\mathrm{ord}_p(g) \mid p - 1$ (the order of an element of an abelian group divides the order of the group) can be used to prove Fermat's theorem, $x^{p-1} = 1 \pmod p$.
– Euler's theorem: $x^{\phi(n)} = 1 \pmod n$ if $x \in (\mathbf{Z}/n\mathbf{Z})^\times$ (a special case of Lagrange), where $\phi(n) = |(\mathbf{Z}/n\mathbf{Z})^\times|$ is Euler's totient function.
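The number-theoretic tools listed above (inverses via Euclid, inverses via Fermat, and the base-2 Fermat test for picking primes) can be sketched in a few lines. The function names are made up for this example, and the Fermat test is only a probable-prime test: 341 = 11 × 31 passes it, which is why it is only trusted for large random candidates.

```python
import random


def extended_gcd(a: int, b: int):
    """Return (g, u, v) with a*u + b*v == g == gcd(a, b)."""
    old_r, r = a, b
    old_u, u = 1, 0
    old_v, v = 0, 1
    while r != 0:
        q = old_r // r
        old_r, r = r, old_r - q * r
        old_u, u = u, old_u - q * u
        old_v, v = v, old_v - q * v
    return old_r, old_u, old_v


def inverse_euclid(x: int, n: int) -> int:
    """Inverse of x in Z/nZ, from the Bezout coefficients."""
    g, u, _ = extended_gcd(x % n, n)
    if g != 1:
        raise ValueError("x is not invertible mod n")
    return u % n


def inverse_fermat(x: int, p: int) -> int:
    """Inverse of x modulo a prime p, as x^(p-2)."""
    return pow(x, p - 2, p)


def fermat_test(p: int) -> bool:
    """Base-2 Fermat test: True for every odd prime, and for a few composites."""
    return p == 2 or (p > 2 and pow(2, p - 1, p) == 1)


def random_probable_prime(bits: int, rng: random.Random) -> int:
    """Draw odd numbers of the requested size until one passes the Fermat test."""
    while True:
        candidate = rng.getrandbits(bits) | (1 << (bits - 1)) | 1
        if fermat_test(candidate):
            return candidate
```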

The RSA functions are $x \mapsto x^d$ and $x \mapsto x^e$ (mod N), where N = pq is a product of two large prime numbers, $\phi(N) = (p - 1)(q - 1)$, x is invertible mod N (most are), d is random and e is such that $de = 1 \pmod{\phi(N)}$; the functions are inverses because $x^{\phi(N)} = 1 \pmod N$. It can be used to exchange the key for a symmetric cipher (rather than to encrypt messages directly, as some textbooks suggest: the functions are deterministic).
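A toy instance of the RSA functions, with deliberately tiny primes, makes the inverse relation concrete. The names make_keypair and rsa_apply are hypothetical, and the sketch fixes e and derives d (the reverse of the convention in the text, which is equivalent); pow(e, -1, phi) requires Python 3.8 or later. Real keys use primes of a thousand bits or more, plus randomized padding.

```python
from math import gcd


def make_keypair(p: int, q: int, e: int):
    """Build (public, private) exponent pairs for N = p*q."""
    n = p * q
    phi = (p - 1) * (q - 1)
    if gcd(e, phi) != 1:
        raise ValueError("e must be invertible mod phi(N)")
    d = pow(e, -1, phi)  # d*e = 1 (mod phi(N))
    return (n, e), (n, d)


def rsa_apply(key, x: int) -> int:
    """x -> x^exponent mod N; the same function serves both directions."""
    n, exponent = key
    return pow(x, exponent, n)
```

For instance, with p = 61, q = 53 and e = 17, applying the public function and then the private one to any x below N returns x; applying the public function twice to the same x gives the same value, which is why raw RSA should not be used to encrypt messages directly.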

We do not know if RSA is safe: we do not know if computing e-th roots is as hard as factoring. Encryption is much faster than decryption (other public-key encryption algorithms, such as ElGamal, are more balanced). There are timing or power attacks: computing $c^d \bmod N$ can expose d; there are fault attacks: a single error in the computation of $c^d \bmod N$ can leak the decomposition of N (check the result by raising it to the e-th power: you should recover c). There are key generation problems: p is generated before q, when there is not enough entropy: many routers/webservers have the same p (and different q): compute $\gcd(n_i, n_j)$ for all the pairs of routers/webservers you can find – if it is not 1, you have factored $n_i$ (0.5% of webservers are affected).

Many algorithms rely on the difficulty of the following problems:
– Computing a modular e-th root: the best known algorithms require a factorization of the modulus. In $\mathbf{F}_p$, x > 0 is a quadratic residue (i.e., a square) iff $x^{(p-1)/2} = 1 \pmod p$ (Legendre symbol); if p = 3 (mod 4) and c is a quadratic residue, then $\sqrt{c} = c^{(p+1)/4}$; if p = 1 (mod 4), the square root can be found in $O((\log p)^3)$, with a randomized algorithm; you can also solve quadratic equations with the high-school formula.
– Inverse in Z/nZ;
– Roots of a polynomial of arbitrary degree in $\mathbf{F}_p$;
– Discrete logarithm mod p: find x so that $g^x = a \pmod p$ (believed to be hard for $\mathbf{F}_p^\times$ (p large) and elliptic curves; can be computed with the GNFS algorithm for $\mathbf{F}_p^\times$); can be used to create a collision-resistant hash.

PySP: Modeling and solving stochastic programs in Python
J.P. Watson et al. (2011)

Stochastic programs can be converted into deterministic programs (the objective function, an expectation, becomes a weighted sum; each decision variable is duplicated for each scenario; the constraints are likewise duplicated; you just have to add non-anticipatory constraints, that encode the structure of the scenario tree), but the size of the resulting problem (most optimizers have a pre-solver, that identifies and removes redundant variables and constraints, but this only halves the size of the problem) makes it difficult to solve, unless you exploit the structure of the problem.
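To make the conversion to a deterministic program concrete, here is a toy two-stage example (a newsvendor problem, chosen for this illustration and not taken from the article). The scenario probabilities, demands, prices and the order grid are all made-up values, and the problem is solved by brute-force enumeration rather than with a solver: the point is only to show the weighted-sum objective, the per-scenario copies of the first-stage decision, and the non-anticipativity constraint.

```python
from itertools import product

# Hypothetical data: (probability, demand) for each scenario
SCENARIOS = [(0.3, 50), (0.5, 100), (0.2, 150)]
UNIT_COST = 1.0
UNIT_PRICE = 1.5
ORDER_GRID = [0, 50, 100, 150]


def deterministic_equivalent(scenarios, order_grid, unit_cost, unit_price):
    """Enumerate one copy of the order quantity per scenario, keep the best feasible point."""
    best_order, best_value = None, float("-inf")
    for orders in product(order_grid, repeat=len(scenarios)):
        # Non-anticipativity: the order is placed before the demand is known,
        # so the per-scenario copies of the decision must all agree.
        if len(set(orders)) != 1:
            continue
        # The expectation becomes a probability-weighted sum over the scenarios;
        # the second-stage decision (sales) is min(order, demand) in each scenario.
        value = sum(
            p * (unit_price * min(q, d) - unit_cost * q)
            for (p, d), q in zip(scenarios, orders)
        )
        if value > best_value:
            best_order, best_value = orders[0], value
    return best_order, best_value
```

The enumeration visits every combination of per-scenario orders, which illustrates why the deterministic equivalent grows so quickly with the number of scenarios.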

Vertical strategies, such as Benders decomposition (L-shaped method, for 2-stage problems) or nested decomposition schemes, decompose the problem by stages. Horizontal strategies, such as progressive hedging or the dual decomposition, decompose it by scenarios.

Progressive hedging suggests to:
– Solve the deterministic problems for all the scenarios;
– Average the solutions (the result does not satisfy the non-anticipatory constraints);
– To each sub-problem, add a penalty for the breached non-anticipatory constraints, re-solve the problems, and iterate.

The algorithm can be improved:
– The first iterations do not require a precise solution;
– Detect if a variable converges early; if so, fix its value;
– Detect cycles (tabu search, with a Bloom filter), which are common with integer variables;
– The sub-problems can be solved in parallel.

The Cornish-Fisher expansion in the context of delta-gamma-normal approximations
S.R. Jaschke (2001)

Proof of the Cornish-Fisher expansion (a series expansion of $F^{-1} \circ \Phi$, with F the cumulative distribution function (cdf) of interest and Φ a known cdf, typically Gaussian), often used to estimate the value at risk (VaR) of non-Gaussian distributions. The error increases as you progress into the tail, and the approximated VaR need not even be monotonic. However, the errors from the Cornish-Fisher expansion are smaller than those coming from a quadratic approximation (∆-Γ-normal): to be safe, compare with a Monte Carlo estimator. Using a reference distribution Φ with fatter tails may reduce those problems.

Asset liability management for individual households
M.A.H. Dempster and E.A. Medova (2010)

There are three ways of solving ALM problems: dynamic programming; stochastic programming (transform the stochastic program into a huge, sparse, deterministic, linear program); various heuristics. Dynamic programming suffers from the curse of dimensionality; so does stochastic programming, but to a lesser extent.
The authors model 10 asset classes (monthly, but the investment and consumption decisions are annual) with geometric Brownian motions (equities, bonds, commodities, alternative investments, real estate) or geometric Ornstein-Uhlenbeck processes (cash, inflation, treasury bill rate), with correlated noises, from market data (for some asset classes, you have to compute the total returns from the prices, by adding dividends or coupon payments), and use the corresponding scenarios in an ALM problem, looking for the optimal consumption and investment decisions, to maximize the expected utility of life-time consumption. The utility is piecewise linear and specified separately for each goal; it is penalized for bankruptcy (excess borrowing). The simulation also includes the death of the investor (variable investment horizon), and fixed, inflation-indexed or growing liabilities. Consumption is divided between equity-preserving goals (e.g., housing), often requiring a downpayment and subsequent mortgage payments (making the problem path-dependent), and non-capital goals.

Pyomo: modeling and solving mathematical programs in Python
W.E. Hart et al. (2011)

The article stresses the need for more modularity in mathematical programming software: one should separate the structure of the problems (abstract model) from the data (concrete model) and the solver; use a high-level programming language (Python, as opposed to AMPL, AIMMS, GAMS), which makes it easy to enumerate sets of constraints and allows you to implement algorithms that require many optimizations (e.g., branch-and-cut; Benders decomposition for stochastic programs; problems with an exponentially large number of constraints in which you can quickly identify the breached constraints, add them to the problem and iterate).

Coopr contains: Pyomo, to specify the problem; an interface to different solvers; a parallelization framework, built from pickle (serialization) and pyro (RPC), should you want to solve several subproblems in parallel. The (only?) open-source solvers supported are cbc, glpk, ipopt, from the coin-or project.

Alternatives include:
– FlopC++: a problem is a C++ class, derived from MP_Model; thanks to operator overloading, the syntax looks readable – but I suspect the error messages are not;
– OptimJ, a Java extension – not a library, but an extension to the language, with its compiler, integrated with Eclipse, no longer commercially developed.

Individual asset liability management
E.A. Medova et al. (2008)

Presentation of the corresponding GUI, to specify the ALM problem and analyze the results.

Discretionary Wealth Hypothesis and ALM
D. diBartolomeo (2010)

Tackle the dynamic asset allocation problem with transaction costs as follows: always invest in a mean-variance efficient portfolio, but change the risk aversion parameter according to the following rule of thumb, which is not unlike portfolio insurance (PI),

$$\lambda = \frac{1}{2} \, \frac{\text{Assets}}{\text{Assets} - \text{PV}(\text{Liabilities})}$$

and add a penalty to the mean-variance optimization to account for the transaction costs

$$\frac{\text{transaction costs} \times \text{turnover}}{\text{probability that the new portfolio is better}}.$$

The present value of the liabilities is computed using a risk-free recombining binary tree of short interest rates, so that Assets − PV(Liabilities) is non-negative iff the probability of bankruptcy is zero.

The 401(k) retirement income risk
F. Sortino and D. Hand (2011)

Since the goal of retirement investment is not to beat the market, but to "retire with dignity", returns and volatility are not the right benchmark. Instead, to decide if you should make a catch-up contribution, you can look at:
– The probability of achieving the goal;
– The present value of the assets and liabilities;
– The desired target return (DTR), i.e., the internal rate of return required for this present value to become positive.

Optimal investment strategies in defined contribution pension plans
D. Blake et al. (2011)

Most pension funds separate the accumulation and decumulation phases, asking the wrong questions (how much do you want to save?), using incorrect targets (performance, rather than retirement), failing at their market timing attempts (when volatility is high, returns are low: one should reduce the weight of equities) and leading to under-funded pension plans. Only in rare cases (power utility and iid asset returns) is the 1-period optimal strategy also a multi-period optimal strategy. The Kelly principle can be generalized to account for (risky) labour income; the resulting strategy, "stochastic lifestyling", is reminiscent of the discretionary wealth hypothesis. Deciding if/when to annuitize is an option exercise problem (it depends on age, wealth and risk aversion). Since most pensions are underfunded, most people will buy an annuity; to avoid low interest rates, it is preferable to spread its purchase over time (phased annuity purchases).

It is not advisable to seek an absolute guarantee to deliver the desired pension: such guarantees are very expensive to secure.

Duration-enhancing overlay strategies for defined-benefit pension plans
J.M. Mulvey et al. (2011)

Defined-benefit pension plans have fixed liabilities: they are bonds, with a very high duration (10 to 15 years). Since long-duration assets (long-term bonds) have lower returns than short-duration assets (stocks, etc.), you may want to invest in a core portfolio of short-duration assets (100%), and a duration-enhancing overlay (50% to 100%, long long-duration bonds, short short-duration bonds) to hedge the interest rate risk.

A robust optimization approach to pension fund management
G. Iyengar and A. Ka Chun Ma (2011)

The defined-benefit pension problem can be formulated as follows: find the future contributions $w_t$ and the portfolio composition $x_t$ (equity index and bonds of each maturity) to minimize the present value of the future contributions, subject to the requirement to meet the prescribed (deterministic) liabilities at all times t, and regulatory requirements NAV ⩾ β PV(future liabilities). The time horizon T is sufficiently long so that liabilities beyond T are negligible. (The problem is slightly more complicated: since the liabilities have to be met, the firm has to find the money somewhere – it is a corporate finance decision, involving the use of debt vs equity vs retained earnings, taxes (interest payments are tax-deductible), debt rating, etc.)

The constraints are often replaced by chance constraints, asking that the probability that the constraint is breached be below some threshold ε. Equity prices follow a geometric Brownian motion; bond yields are described by the Nelson-Siegel model, whose factors follow a mean-reverting process; the innovations for those four factors are correlated. The chance constraints P[constraint breached] ⩽ ε are of the form $P[X_t \geqslant a] \leqslant \varepsilon$, where $X_t$ is a stochastic process defined from the risk factors. Using Ito's formula, you can compute its stochastic differential equation, and then linearize it at t = 0, i.e., assume it is just a Brownian motion. The constraint becomes P[something Gaussian] ⩽ ε; it can be written as a second-order cone (SOC) constraint ∥Bx − a∥ ⩽ d′x + c.

To ensure that all K chance constraints are satisfied with probability 1 − ε, use Bonferroni's inequality: assume they are independent and set their thresholds to ε/K. The resulting (SOC) optimization problem is no longer stochastic: the contributions $w_t$ and numbers of shares $x_t$ do not depend on the state of the world at time t.

The resulting strategy is conservative, perhaps because of the linearization of the geometric Brownian motion.
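The step from a Gaussian chance constraint to a deterministic one, and the Bonferroni split of ε across K constraints, can be sketched as follows. This is an illustration with made-up numbers, not the paper's formulation: each constraint is reduced to a scalar Gaussian X with known mean and standard deviation, and the function names are hypothetical. The check P[X ⩾ a] ⩽ ε becomes mean + z·std ⩽ a, with z the (1 − ε) Gaussian quantile, which is the scalar form of the SOC constraint.

```python
from statistics import NormalDist


def chance_constraint_ok(mean: float, std: float, a: float, eps: float) -> bool:
    """True iff P[X >= a] <= eps for X ~ N(mean, std^2), via the Gaussian quantile."""
    z = NormalDist().inv_cdf(1.0 - eps)
    return mean + z * std <= a


def breach_probability(mean: float, std: float, a: float) -> float:
    """P[X >= a], computed directly from the Gaussian cdf."""
    return 1.0 - NormalDist(mean, std).cdf(a)


def bonferroni_ok(constraints, eps: float) -> bool:
    """Check K constraints (mean, std, a), each at the tightened level eps / K."""
    k = len(constraints)
    return all(chance_constraint_ok(m, s, a, eps / k) for m, s, a in constraints)
```

With a standard Gaussian and a = 2, a single constraint passes at ε = 5%, but the same constraint fails once ε is split across three constraints: this is one way to see why the resulting strategy is conservative.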


Alternative decision models for liability-driven investment
K. Schwaiger et al. (2011)

There are many ways of formulating the pension problem, each simplifying a different aspect of it. Here are four of them.
– A deterministic program, in which you minimize both the sum of all the contributions and the sum of the absolute values of the PV01 of the assets and liabilities (since there are two objectives, the result is not a single strategy, but an efficient frontier of strategies).
– A stochastic model (you have to generate scenarios), with two stages (to simplify), in which you match the present value (PV, not PV01) of the assets and liabilities; the two objective functions are the expected contributions and the expected absolute value of the difference in present values.
– A chance constraint can be added to this model, P[Assets − γ Liabilities > 0] ⩾ 1 − ε. The equivalent deterministic formulation (the constraint is breached in at most εN of the N scenarios) involves binary variables.
– The chance constraints can be replaced by integrated chance constraints (this is similar to the difference between value at risk and expected shortfall), $E[\text{Assets}_t - \gamma\,\text{Liabilities}_t] \leqslant \text{PV}_t(\text{Liabilities}(\omega))$; binary variables are no longer required.

A liability-relative drawdown approach to pension asset liability management
A. Berkelaar and R. Kouwenberg (2011)

Find the portfolio weights w (constant over time) to maximize the (logarithmic) utility of the funding ratio Assets/Liabilities, whose final value can be approximated as

$$\prod_t \bigl(1 + w' r_t - r_t^{\text{liabilities}}\bigr) \approx \exp \sum_t \bigl(w' r_t - r_t^{\text{liabilities}}\bigr),$$

with a constraint on the variance of the surplus (Assets − Liabilities) or on the maximum funding ratio drawdown (linear program) or on the conditional funding ratio drawdown (average of the 10% worst drawdowns, also a linear program).

Allow for some short-selling: many pension funds use a return-generating portfolio and a liability-hedging overlay.

The scenarios (returns for equities, commodities, real estate, hedge funds) are generated from risk factors ("state variables": level, slope and curvature of the yield curve, default spread (Baa − Aaa), consumption-wealth ratio) through a vector autoregressive (VAR) model:
– The yield curve is described by the Diebold-Li model, i.e., the Nelson-Siegel model in which the parameters (level, slope, curvature) form a VAR process;
– Use a 2-step model to deal with time series with a shorter history (hedge funds, etc.): first model (VAR) the long-history variables, then model the short-history ones, adding the long-history ones as exogenous variables;
– Modify the intercept of the VAR model to avoid unrealistic reversion to the (old) long-term averages.
The term structures of return volatility, for different asset classes, are very different.

Dynamic risk management: optimal investment with risk constraints
S. Jarvis (2011)

The dynamic asset allocation problem can be solved with partial differential equations (PDEs).

Asymmetric risk measures are even more important for dynamic strategies than for static ones: they can create very non-Gaussian return distributions – prefer the conditional value at risk (CVaR, expected shortfall) to the standard deviation.

The optimal dynamic strategy that maximizes the expected CRRA (constant relative risk aversion) utility ($(x^{1-\gamma} - 1)/(1 - \gamma)$, or log x if γ = 1) of final wealth, when asset prices follow a geometric Brownian motion, is the growth-optimal portfolio (GOP, a generalization of the Kelly principle); with a CVaR constraint, the payoff is transformed and looks like a collar. (This is like an asset and liability management (ALM) problem, but with no liabilities.)

The optimal strategy can be computed by dynamic programming on a scenario tree or, better, a recombining tree. Alternatively, since the probability distribution of the payoff is a solution of the Fokker-Planck equation (Kolmogorov forward equation – not to be confused with the Kolmogorov backward equation, which gives the option price), one can solve it numerically, for many investment strategies, and keep the one with the highest utility. (Contrary to option pricing, the payoff is unknown: we are looking for the strategy that maximizes the utility of the payoff.) The resulting distribution looks like a soft CPPI (constant proportion portfolio insurance).

While the CVaR is an acceptable risk measure for static (1-period) strategies, it is less so for dynamic ones. A measure of risk ρ can often be seen as a capital requirement: ρ(X) is the amount of capital needed to make X acceptable, i.e., ρ(X) = Min{m : ρ(X + m) ⩽ 0}; therefore ρ(X + m) = ρ(X) − m. In a multi-period setting, one would expect $\rho_t(X_T) = \rho_t(-\rho_{t+1}(X_T))$ – but this does not hold for the CVaR. The only time-consistent risk measures are the entropic risk measures

$$\rho_t(X) = \frac{1}{\gamma} \log E_t[\exp(-\gamma X)].$$
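The two properties of the entropic risk measure mentioned above, translation invariance ρ(X + m) = ρ(X) − m and time consistency ρ_t(X) = ρ_t(−ρ_{t+1}(X)), can be checked numerically on a small two-period tree. The tree, the probabilities and the value of γ below are made up for this illustration, and the function names are hypothetical.

```python
import math


def entropic_risk(outcomes, probs, gamma):
    """rho(X) = (1/gamma) * log E[exp(-gamma * X)] for a discrete distribution."""
    expectation = sum(p * math.exp(-gamma * x) for x, p in zip(outcomes, probs))
    return math.log(expectation) / gamma


GAMMA = 0.7
# Hypothetical two-period tree: first an up or down move, then two final payoffs each
BRANCHES = {
    "up": ([3.0, 1.0], [0.5, 0.5]),
    "down": ([0.0, -2.0], [0.5, 0.5]),
}
FIRST_STEP = {"up": 0.5, "down": 0.5}


def direct_risk():
    """Risk of the final payoff, computed in one go from the four leaves."""
    outcomes, probs = [], []
    for name, (xs, ps) in BRANCHES.items():
        for x, p in zip(xs, ps):
            outcomes.append(x)
            probs.append(FIRST_STEP[name] * p)
    return entropic_risk(outcomes, probs, GAMMA)


def nested_risk():
    """rho_0(-rho_1(X)): first the risk in each branch, then the risk of those risks."""
    names = list(BRANCHES)
    later = [entropic_risk(*BRANCHES[name], GAMMA) for name in names]
    return entropic_risk([-r for r in later], [FIRST_STEP[n] for n in names], GAMMA)
```

The direct and nested computations agree up to rounding, because the exponential and the logarithm cancel at the intermediate date; the same check fails for the CVaR.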


More generally (if you remove the translation invariance ρ(X + m) = ρ(X) − m), the certainty-equivalent risk measures

$$\rho_t(X) = u^{-1}\bigl(E_t[u(X)]\bigr)$$

are time-consistent. [The consequence of time-inconsistency, the fact that one is risk-seeking when the wealth is very high or very low, looks harmless to me.]

Bank asset-liability and liquidity risk management
M. Choudhry (2010)

Asset liability management (ALM), in a bank, focuses on:
– Liquidity risk: the dates of the assets and liabilities do not match (they are even sometimes unknown: options, current accounts, credit lines, etc.), so the bank will need to refinance some of them, and there is no guarantee that this will be possible at a reasonable price;
– Interest rate risk: even if the dates match, the rates need not match.

A two-factor HJM interest rate model for use in asset liability management
S. Kaya et al. (2010)

The G2++ short rate model (a 2-factor Gaussian model) can be generalized to a term-structure (HJM) model, and used to price structured products.

Asset liability management modelling with risk control by stochastic dominance
X. Yang et al. (2010)

Stochastic dominance (first order, second order, and their variants: interval second order, etc.) constraints can be included in optimization or stochastic optimization problems (they become chance constraints).

Zero-coupon yield curve estimation with the package termstrc
R. Ferstl and J. Hayden (JSS 2010)

The yield curve is often modelled empirically, from market data, either in a non-parametric way (linear interpolation of discount factors (bad) or of their logarithms (better), piecewise constant interpolation of the instantaneous forward rate (equivalent), splines) or by assuming that the curve has a specific shape (Nelson-Siegel, Svensson – beware, the formula is different if you model the yield or the instantaneous forward rate), sometimes with a (VAR, VARMA) time-dependence on the parameters (Diebold-Li). Here are some implementation caveats:
– To avoid identifiability problems, you may want to impose a few constraints, e.g., $\tau_2 - \tau_1 > \varepsilon$ in the Svensson model;
– Since the problem is non-convex, you may want to use a grid search (on the non-linear parameters: $\tau_1$, $\tau_2$) to find good starting values;
– The problem is heteroskedastic: use weights (e.g., the inverse of the duration, or the bid-ask spread).
The implementation, in the termstrc package, is unusable: prefer fBonds (it uses a grid search, but no constraints, no weights, no time-dependence, and models forward rates rather than yields).

Foundations of the Statpro simulation model
M. Marchioro and D. Cintioli (2007)

To estimate risk (value at risk, expected shortfall), resample from the multiplicative (stock prices, implied volatility, and other positive quantities) or additive (interest rates, swaps, and other quantities that could become negative) historical changes. For fixed income, resample the instantaneous forward rates rather than the prices. For complex instruments, price them from simple instruments. For large portfolios, pay attention to missing data (coming, e.g., from different holidays).

Pricing simple interest-rate derivatives
M. Marchioro (2008)

Clear presentation of the annoying accounting conventions (day counting convention, business day convention, compounding convention) and overview of the simplest common interest rate derivatives (forward rate agreements (FRA) and swaps). The interest rate term structure ("yield curve") can be estimated from deposit rates (the -ibor rates: overnight, tomorrow-next, t+2, …, until 1 year) and the swap rates (beyond 1 year: since the principal is not exchanged, the risk is lower). To interpolate the term structure, linear interpolation of the logarithm of the discount factor, i.e., piecewise constant interpolation of the forward rates, gives decent results.
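The equivalence between log-linear interpolation of discount factors and piecewise constant forward rates can be verified on a toy curve. The pillar dates and discount factors used in the checks are made up, and the function names are hypothetical; the sketch assumes continuously compounded rates and sorted, distinct pillar dates.

```python
import math
from bisect import bisect_right


def interpolate_discount(times, discounts, t):
    """Discount factor at t, by linear interpolation of log(discount) between pillars."""
    if not times[0] <= t <= times[-1]:
        raise ValueError("t is outside the range of the pillar dates")
    # Index of the right-hand pillar of the interval containing t
    i = min(bisect_right(times, t), len(times) - 1)
    t0, t1 = times[i - 1], times[i]
    w = (t - t0) / (t1 - t0)
    log_d = (1.0 - w) * math.log(discounts[i - 1]) + w * math.log(discounts[i])
    return math.exp(log_d)


def forward_rate(times, discounts, start, end):
    """Continuously compounded forward rate between start and end."""
    d_start = interpolate_discount(times, discounts, start)
    d_end = interpolate_discount(times, discounts, end)
    return -(math.log(d_end) - math.log(d_start)) / (end - start)
```

Between two consecutive pillars, forward_rate returns the same value whichever sub-interval is chosen; the forward rate only jumps at the pillar dates.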

Doing Bayesian analysis, a tutorial with R and BUGS
J.K. Kruschke (2011)

Leisurely introduction to Markov Chain Monte Carlo (MCMC) computations in Bayesian statistics, in which the graphical representation of Bayesian models as a combination of building blocks is more pictorial than the traditional BUGS one.

[Figure: example of such a diagram, a Gaussian distribution with parameters µ and σ generating the observations $x_i$, i = 1..n.]


Model Thinking
S.E. Page (Coursera, 2012)

Introduction to complex systems (Markov models, Lyapunov functions, random walks, Polya's urn model, chaos, network models), agent-based models (Schelling's segregation model, epidemics models (SIR, SIS), coordination game, replicator dynamics, wisdom/madness of crowds), game theory (prisoner's dilemma, auctions, colonel Blotto) with applications in economics (Solow growth model and the role of innovation), sociology, politics (voting and aggregation of preferences), etc., with examples in NetLogo – useful if you need to explain an agent-based model to a non-quantitative audience.

Algorithms, design and analysis I
T. Roughgarden (Coursera 2012)

Introductory algorithms course, covering:
– Sorting: insertion, selection, bubble sort, merge sort, quick sort; bucket sort (if the data is U(0, 1)), counting sort (discrete, repeated values), radix sort;
– Divide-and-conquer: multiplication of large integers or matrices (with the Karatsuba and Strassen improvements); number of inversions in an array; closest pair; median (adapt quick sort; or split the data into groups of 5, recursively compute the median of medians, and use it as a pivot);
– Heuristics for NP-complete graph problems: graph colouring to assign temporary variables to registers in a compiler (pick a node with at most k neighbours, remove it, iterate, backtrack if needed, remove a node if stuck), minimum cut (select an edge at random, contract it, iterate, repeat and keep the best solution);
– Graph search: breadth-first search (BFS: use a queue, FIFO), to compute distances or connected components; depth-first search (DFS, backtracking: use a stack or recursion), to compute a topological ordering (all the edges go forward) of a directed acyclic graph (you could also look for a sink vertex, remove it, and iterate) or its strongly connected components (DFS on the reverse graph, remember the finishing time of each vertex, DFS in the reverse order);
– Heaps (or priority queues): shortest paths from a vertex, running median (use two heaps); balanced binary search trees; union-find (aka disjoint-set) to store partitions, with two operations, union and find (the component an element is in, e.g., via a representative element), as a forest of trees (components), each child pointing to its parent, with short paths to the root (representative element) – for more efficiency, when running find(), have all the elements you see point to the root;
– Hash tables (if you implement your own, use a random hash function); Bloom filter (use an array of n bits and k hash functions; to insert an element, set the k corresponding bits to 1).
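The union-find structure described in the list above (a forest with parent pointers, plus path compression in find()) can be sketched as follows. The class name and the union-by-size rule are choices made for this example, not details from the course.

```python
class DisjointSet:
    """Union-find over the elements 0..n-1, stored as a forest of parent pointers."""

    def __init__(self, n: int):
        self.parent = list(range(n))  # each element starts as its own root
        self.size = [1] * n

    def find(self, x: int) -> int:
        """Return the representative of x's component, compressing the path."""
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass: make every element seen on the way point to the root
        while self.parent[x] != root:
            next_x = self.parent[x]
            self.parent[x] = root
            x = next_x
        return root

    def union(self, x: int, y: int) -> bool:
        """Merge the components of x and y; return False if already merged."""
        root_x, root_y = self.find(x), self.find(y)
        if root_x == root_y:
            return False
        # Attach the smaller tree under the larger one to keep paths short
        if self.size[root_x] < self.size[root_y]:
            root_x, root_y = root_y, root_x
        self.parent[root_y] = root_x
        self.size[root_x] += self.size[root_y]
        return True
```

Two elements are in the same component exactly when find() returns the same representative for both.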


Co-movement of energy commodities revisited: evidence from wavelet coherence analysis
L. Vacha and J. Barunik (2012)

Wavelets can be used to analyze the coevolution of two time series: let

$$W_x(u, s) = \int_{-\infty}^{\infty} x(t) \, \frac{1}{\sqrt{s}} \, \bar\psi\!\left(\frac{t - u}{s}\right) dt$$

be the continuous wavelet transform of a time series x (at location u and scale s); the squared wavelet coherence coefficient is

$$R^2_{x,y}(u, s) = \frac{\bigl| S[s^{-1} W_x(u, s) \bar W_y(u, s)] \bigr|^2}{S[s^{-1} |W_x(u, s)|^2] \; S[s^{-1} |W_y(u, s)|^2]}$$

where S is some smoothing operator. It measures the linear correlation at a given location and scale. The wavelet coefficient phase difference

$$\phi_{x,y}(u, s) = \arg S[s^{-1} W_x(u, s) \bar W_y(u, s)]$$

can highlight delays in the oscillations of the two time series (add it as arrows (as a vector field) on top of the wavelet coherence heatmap plot).

On Hurst exponent estimation under heavy-tailed distributions
J. Barunik and L. Kristoufek (2012)

Empirical comparison of various Hurst exponent estimators: rescaled range (R/S), multifractal detrended fluctuation analysis (MF-DFA), detrending moving average (DMA), generalized Hurst exponent (GHE, i.e., scaling of the q-moments, often with q = 2) – GHE has lower variance and bias.

Monte-Carlo-based tail exponent estimator
J. Barunik and L. Vacha (2012)

The Hill estimator of the tail exponent is biased on small samples – use a simulation-based estimator instead:
– Generate random samples, with the same size, from various stable distributions;
– Compute their Hill estimators, for various threshold choices;
– Compare with the Hill estimators (for all those thresholds) of the actual data (they use the $L^1$ norm and give the same weight to all the thresholds).

Properties of the most diversified portfolio
Y. Choueifaty et al. (2011)

A few remarks on the diversification ratio (ratio of the weighted average of the asset volatilities to the overall volatility of the portfolio), a generalization of the Herfindahl index (entropy) when assets are correlated and/or do not have the same risk, whose square can be interpreted as the effective number of (uncorrelated) assets (or risk factors) in the portfolio. (This approach is limited to unleveraged, long-only portfolios.)
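The diversification ratio and its "effective number of assets" reading can be illustrated with a few lines of code. The function name and the example portfolios in the checks are made up; the formula is the one in the summary above, the weighted average volatility divided by the portfolio volatility.

```python
import math


def diversification_ratio(weights, vols, corr):
    """Weighted average of the asset volatilities divided by the portfolio volatility."""
    n = len(weights)
    weighted_vol = sum(w * s for w, s in zip(weights, vols))
    variance = sum(
        weights[i] * weights[j] * vols[i] * vols[j] * corr[i][j]
        for i in range(n)
        for j in range(n)
    )
    return weighted_vol / math.sqrt(variance)


def identity_matrix(n):
    """Correlation matrix of n uncorrelated assets."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
```

For four uncorrelated assets with the same volatility and equal weights, the ratio is 2 and its square is 4, the number of assets; for perfectly correlated assets, the ratio is 1.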


Orthogonalized equity risk premia and systematic risk decomposition R.F. Klein and K.V. Chow (2010)

In a factor risk model, you may want the factors to be orthogonal, e.g., to have a simple risk decomposition. Gram-Schmidt orthogonalization depends on the order of the factors, but there is a symmetric variant: let $F = (f_1 | \cdots | f_n)$ be the matrix of (centered) factor returns; we want $S$ so that $\operatorname{Var} FS = I$, i.e., $S'F'FS = I$, i.e., $SS' = (F'F)^{-1}$, i.e., $S = (F'F)^{-1/2} C$ for an arbitrary $C \in O(n)$, e.g., $C = I$.

Derivatives and credit contagion in interconnected networks S. Heise and R. Kühn (2012)

Most models of contagion in financial networks (bipartite, firms vs banks and insurance companies) only take into account who lends to whom, and neglect synthetic credit risk exposure through CDSes – it is not negligible, and taking it into account requires adding hyperedges to the network (between three nodes: firm-bank-insurance or firm-bank-bank).

Second-order price dynamics: approach to equilibrium with perpetual arbitrage E. Kemp-Benedict (2012)

The presence of arbitrageurs and boundedly rational economic actors explains the permanent fluctuations around the equilibrium. (This is the usual informed traders vs noise traders game, expressed in terms of supply and demand.)

A multifractal approach towards inference in finance O. Løvsletten and M. Rypdal (2012)

Multifractal processes have stationary increments and satisfy $E[|X_t|^q] \propto t^{\zeta(q)}$ for some concave function $\zeta$ (linear for self-similar (i.e., stable) processes such as Brownian motion or Lévy flight). There are many ways of building such processes or their discrete equivalents. For instance (multi-fractal random walk):

$$\begin{aligned}
x_n &= X_n - X_{n-1} \\
x_n &\sim N(0, \sigma_n) \\
\log \sigma_n &\sim \text{Gaussian process} \\
\operatorname{Cov}(\log \sigma_t, \log \sigma_s) &\propto \log^+ \frac{T/\Delta t}{|t-s|+1}
\end{aligned}$$

Numerical methods and optimization in finance M. Gilli et al. (2011)

The first part of the book covers many numerical topics, such as the inherent imprecision of floating-point arithmetic (for which $\sum 1/k$ converges), numerical instability (the algorithm amplifies rounding errors), ill-conditioning (small changes in the input lead to large changes in the output), the condition number ($|x f'(x)/f(x)|$, or $\kappa(A) = \|A\| \, \|A^{-1}\|$ for linear functions), matrix decompositions (LU, QR, SVD, Cholesky), structural rank (the maximum rank of a sparse matrix with the same presence/absence pattern), finite differences (implicit, explicit and θ methods – Crank-Nicolson is obtained for θ = 1/2), flaws in pseudo-random number generators and the way they are used (resetting the seed before each number may not be a good idea), low-discrepancy sequences, bootstrap (the Taylor-Thompson algorithm is a smooth non-parametric bootstrap: take a point at random, look at its k nearest neighbours, move it towards them by a random amount, iterate). Examples include European and American option pricing, time series models (with a clear interpretation of the model parameters), CPPI, VaR estimation.

The second part is devoted to (unconstrained) optimization. The fixed point method (rewrite the problem as $x = f(x)$ and iterate $x_{n+1} = f(x_n)$ until convergence) is more useful than one may think: it is the base of the Jacobi, Gauss-Seidel and SOR algorithms, used to solve large linear systems (e.g., those coming from PDEs) or large non-linear systems; it can also be used to compute the S-estimator (a robust estimator whose cost function is defined implicitly). Gradient-based algorithms (gradient descent, quasi-Newton, Levenberg-Marquardt), heuristics (local search, tabu search, simulated annealing, threshold acceptance, genetic algorithms, differential evolution, particle swarm optimization, ant colony optimization) and hybrid methods are also presented, with examples from portfolio optimization, term structure models, option pricing model calibration (via the characteristic function), robust regression (least median of squares, least quantile of squares, least trimmed squares).

The book contains many plots, but the lack of titles and labels makes them difficult to read.

A tutorial on geometric programming S. Boyd et al. (2007)

A monomial is a term of the form $c\, x_1^{a_1} \cdots x_n^{a_n}$, with $c > 0$ – but the $a_i$ are arbitrary real numbers. A posynomial is a sum of monomials. A geometric program (GP) is an optimization problem of the form “minimize $P_0$, so that $P_1, \dots, P_n \leq 1$ and $M_1 = \cdots = M_m = 1$”, where the $P_i$ are posynomials and the $M_j$ monomials. The exponents can be arbitrary, but the coefficients have to be positive and the equality constraints only have one term.

Geometric programs can be solved efficiently via interior point methods: if $f$ is a (generalized) posynomial, then $\log f(e^y)$ is convex (in terms of $f$, the convexity condition involves geometric averages, hence the name).

Surprisingly many problems are geometric programs, or can be reformulated as such, or can be approximated by GPs. Here are some of the transformations (defining generalized posynomials).

$$\begin{aligned}
f(x) - m(x) \leq 0 \;&\longrightarrow\; f(x)/m(x) \leq 1 \\
f(x)^{2.2} + g(x)^{3.1} \leq 1 \;&\longrightarrow\; f(x) \leq t_1,\; g(x) \leq t_2,\; t_1^{2.2} + t_2^{3.1} \leq 1 \\
f_0(f_1(x), \dots, f_n(x)) \leq 1 \;&\longrightarrow\; f_0(t_1, \dots, t_n) \leq 1,\; f_1(x) \leq t_1, \dots, f_n(x) \leq t_n \\
\operatorname{Max}\{f_1(x), f_2(x)\} \;&\longrightarrow\; t,\; f_1(x) \leq t,\; f_2(x) \leq t
\end{aligned}$$
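To make the convexity claim for geometric programs concrete, here is a short worked derivation of the logarithmic change of variables, assuming the standard notation $y_i = \log x_i$ (the symbols $a_k$, $c_k$ for the exponent vectors and coefficients of the individual monomials are introduced here for illustration).

```latex
% Monomials become affine functions after the change of variables x = e^y:
m(x) = c \, x_1^{a_1} \cdots x_n^{a_n}
  \quad\Longrightarrow\quad
  \log m(e^y) = \log c + a^\top y .

% Posynomials become log-sum-exp functions, which are convex:
f(x) = \sum_k c_k \, x_1^{a_{k1}} \cdots x_n^{a_{kn}}
  \quad\Longrightarrow\quad
  \log f(e^y) = \log \sum_k \exp\bigl( a_k^\top y + \log c_k \bigr) .

% The geometric program therefore becomes a convex problem in y:
\text{minimize } \log P_0(e^y)
  \quad\text{such that}\quad
  \log P_i(e^y) \leq 0, \qquad
  \log M_j(e^y) = 0 \ \text{(affine in } y\text{)} .
```

The positivity of the coefficients is what makes $\log c_k$ well defined, and the single-term equality constraints are what keep the equalities affine.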

Trade-off analysis measures the impact of the constraints: replace the constraints $f_i(x) \leq 1$ and $g_j(x) = 1$ with $f_i(x) \leq u_i$ and $g_j(x) = v_j$, consider the value $p(u,v)$ of the corresponding problem, and compute the partial derivatives (the logarithm disappears because we are evaluating them in 1)

$$\begin{aligned}
S_i &= \left. \frac{\partial \log p}{\partial \log u_i} \right|_{u=v=1} = \left. \frac{\partial \log p}{\partial u_i} \right|_{u=v=1} \\
T_j &= \left. \frac{\partial \log p}{\partial \log v_j} \right|_{u=v=1} = \left. \frac{\partial \log p}{\partial v_j} \right|_{u=v=1}.
\end{aligned}$$

You do not have to explicitly compute those quantities: they come from the dual problem. This is also useful for infeasible problems: the sensitivities (using some measure of infeasibility as objective, or a penalized objective) tell you which constraints bear the most responsibility.

Here are more transformations

$$\frac{p(x)}{r(x) - q(x)} + f(x) \leq 1 \;\longrightarrow\; t + f(x) \leq 1,\; q(x) + p(x)/t \leq r(x)$$

and approximate transformations (functions whose Taylor expansions have positive coefficients – for the exponential and the logarithm, the problem can be solved exactly: after the logarithmic transformation, the function is convex).

$$\begin{aligned}
\exp f(x) &\approx e^b \left( 1 + \frac{f(x) - b}{a} \right)^a && f(x) \approx b \\
\log q(x) &\approx a\, q(x)^{1/a} - a && a \gg 1
\end{aligned}$$

There are many generalizations:
– in a mixed linear GP, some variables appear as GP terms, others as linear terms: only apply the logarithm transform to the GP ones;
– GP problems with posynomial (not monomial) equality constraints can sometimes be solved by relaxation: replace $h(x) = 1$ with $h(x) \leq 1$ and try to tweak the solution (especially if it is not unique).

Many problems can be approximated with GPs: a function $f$ can be approximated by a monomial (resp. posynomial) if $F(y) = \log f(e^y)$ can be approximated by an affine (resp. convex) function. A monomial fit can be obtained by Taylor expansion, least squares, penalized least squares (L1 regularization gives fewer terms), non-linear least squares for the relative error $|g(x_i) - f_i| / f_i$, or by minimizing the maximum relative fitting error (after the logarithm transform, the problem becomes linear). A max monomial fit $f(x) = \operatorname{Max}_k f_k(x)$ can be obtained by adapting the k-means algorithm (as a starting point, use k perturbed copies of a monomial fit). A posynomial fit can be obtained by non-linear least squares (Gauss-Newton, sequential quadratic programming).

Some generalizations are no longer convex:
– Signomial programming removes the constraint that the coefficients be positive: start with an initial guess, use a monomial approximation of the offending terms (perhaps with a constraint to force the solution to stay close to the current one); iterate;
– Mixed integer GP can be solved with heuristics (round the solutions, perhaps after tightening the constraints; or round one variable at a time, starting with those closest to an integer; or use branch-and-bound).

Portfolio selection problems in practice: a comparison between linear and quadratic optimization models F. Cesarone et al. (2010)

The quadratic program with cardinality constraints

$$\begin{aligned}
\text{Minimize} \quad & x'Qx \\
\text{such that} \quad & \textstyle\sum x_i = 1 \\
& x \geq 0 \\
& |\operatorname{supp} x| \leq k
\end{aligned}$$

(polynomial if Q is positive definite, but NP hard in general) can be solved efficiently by noticing that the minimum is obtained in the interior of a face $\Delta_I$ of $\Delta = [\sum x_i = 1]$ where the restriction $Q_I$ of $Q$ is strictly convex; the solution is then $x^*_I = Q_I^{-1} 1 / (1' Q_I^{-1} 1)$. We “just” have to find $I$ such that $|I| \leq k$, $x^*_I \in \mathring{\Delta}_I$, that minimizes $(1' Q_I^{-1} 1)^{-1}$. It turns out that, as we increase k, the corresponding I’s are included into one another: this gives an incremental algorithm to solve the problem. However, since there are many possible I’s at each step (as we increase k, they form a tree, and we want one of the longest branches), the complexity is still exponential in the worst case; one can try heuristics, e.g., keeping the solutions with the best values.

Learning from data Y.S. Abu-Mostafa (Caltech, 2012)

Hoeffding’s inequality measures how well in-sample results generalize out-of-sample: for proportions, on a sample of size N,

$$P[|\mu_{\text{out}} - \mu_{\text{in}}| > \varepsilon] \leq 2 \exp(-2 \varepsilon^2 N).$$

It also tells us if learning is possible at all: when you choose between M (independent) models,

$$P[|\text{Error}_{\text{out}} - \text{Error}_{\text{in}}| > \varepsilon] \leq 2M \exp(-2 \varepsilon^2 N).$$
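The Hoeffding bound for a choice among M models is easy to evaluate numerically. The sketch below (function names `hoeffding_bound` and `sample_size` are illustrative, as are the tolerance values) computes the bound and inverts it to find the smallest sample size that guarantees a tolerance ε with probability at least 1 − δ.

```python
import math

def hoeffding_bound(eps, n, m=1):
    """Upper bound 2 M exp(-2 eps^2 N) on P[|Error_out - Error_in| > eps]."""
    return min(1.0, 2 * m * math.exp(-2 * eps ** 2 * n))

def sample_size(eps, delta, m=1):
    """Smallest N for which the bound is at most delta."""
    return math.ceil(math.log(2 * m / delta) / (2 * eps ** 2))

eps, delta = 0.05, 0.05          # hypothetical tolerance and failure probability
for m in (1, 10, 1000):
    n = sample_size(eps, delta, m)
    print(m, n, hoeffding_bound(eps, n, m))
```

The required sample size only grows with the logarithm of M, which is why the bound remains informative for large finite model sets, and why it becomes vacuous when M is infinite.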


But M is usually infinite. In the case of classification, what matters is not the number of models, but the number m(k) of different possible predictions they can make on the data: if there are k points, there are at most $2^k$ dichotomies but, in general, since the models have a specific form (e.g., linear), there will be fewer dichotomies. If the VC (Vapnik-Chervonenkis) dimension (intuitively, the effective number of parameters, or the effective degrees of freedom) $d_{\text{VC}} = \sup\{k : m(k) = 2^k\}$ is finite, m(k) is polynomial. We then have

$$P[|\text{Error}_{\text{out}} - \text{Error}_{\text{in}}| > \varepsilon] = O\!\left(N^{d_{\text{VC}}} e^{-\alpha \varepsilon^2 N}\right).$$

The multiplicative constant is too large for this inequality to be useful in itself, but it is believed that the order of magnitude is close to optimal. As a rule of thumb, you want N ⩾ 10 dVC.

VC analysis can be contrasted with bias-variance analysis: the bias (sometimes called “deterministic noise”) is the error between the best approximation, in the set of models considered, and the (unknown) data-generating mechanism; the variance is the error between the best approximation and that derived from the sample data.

[Figure: two learning-curve panels, “VC analysis” and “Bias-variance analysis”, plotting Error against N; labels: Error_in (in-sample error), Error_out, generalization error, bias, variance.]

To avoid overfitting, use regularization (add a $\lambda \|w\|^2$ penalty or, equivalently (by duality), a constraint $\|w\| \leq C$) and (10-fold) cross-validation (to select the regularization parameters and/or the model complexity).

The examples covered included singular value decomposition (SVD, for recommendation systems); the perceptron, logistic regression, neural networks; soft-margin support vector machines (SVM), with details of the corresponding quadratic optimization problem and its dual, which gives the support vectors (the number of support vectors measures the complexity of the model, not unlike the VC dimension); the kernel trick ($k(x,y) = (1 + x'y)^d$, which gives polynomials of degree up to $d$, or $k(x,y) = \exp(-\gamma \|x-y\|^2)$); radial basis functions (apply k-means on the inputs, regress $y \sim \sum_i \exp(-\gamma \|x - \mu_i\|^2)$, choose γ and k by cross-validation – this is very similar to SVM with a gaussian kernel, where the $\mu_i$ are support vectors instead of k-means centers).
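The combination of an L2 penalty and 10-fold cross-validation can be sketched in a few lines. This is only an illustration on simulated data: the helper names `ridge_fit` and `cv_error`, the grid of penalties and the data-generating model are all assumptions made for the example, not material from the course.

```python
import numpy as np

def ridge_fit(x, y, lam):
    """Minimize ||y - x w||^2 + lam ||w||^2 (closed-form solution)."""
    d = x.shape[1]
    return np.linalg.solve(x.T @ x + lam * np.eye(d), x.T @ y)

def cv_error(x, y, lam, folds=10):
    """Average out-of-fold squared error for a given penalty."""
    idx = np.arange(len(y))
    errors = []
    for test in np.array_split(idx, folds):
        train = np.setdiff1d(idx, test)
        w = ridge_fit(x[train], y[train], lam)
        errors.append(np.mean((x[test] @ w - y[test]) ** 2))
    return float(np.mean(errors))

# Simulated data: 20 inputs, only 3 of which matter (hypothetical example).
rng = np.random.default_rng(0)
x = rng.standard_normal((100, 20))
w_true = np.zeros(20)
w_true[:3] = [1.0, -2.0, 0.5]
y = x @ w_true + rng.standard_normal(100)

lambdas = [0.01, 0.1, 1.0, 10.0, 100.0]
norms = [float(np.linalg.norm(ridge_fit(x, y, lam))) for lam in lambdas]
best = min(lambdas, key=lambda lam: cv_error(x, y, lam))
print(norms)   # the norm of w shrinks as the penalty grows
print(best)    # penalty selected by 10-fold cross-validation
```

The shrinking norm illustrates the duality mentioned above: each penalty λ corresponds to some bound C on the norm of w.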

Introduction to logic M. Genesereth and E. Kao (Coursera 2012)

Propositional logic is the logic of truth tables: given propositional constants p, q, r, etc. and a truth assignment, you can check if complex sentences (p ∧ q, (p ∨ q) ⇒ r, etc.) are true or false. A sentence is valid if it is always true, for all truth assignments; contingent if it is sometimes true, sometimes false; unsatisfiable if it is never true; satisfiable if it is sometimes true (valid or contingent); falsifiable if it is sometimes false (unsatisfiable or contingent).

A set of sentences ∆ entails a sentence ϕ, written ∆ ⊨ ϕ, if, for all truth assignments for which ∆ is true, so is ϕ. Entailment is a satisfiability problem: ∆ ⊨ ϕ iff ∆ ∪ {¬ϕ} is unsatisfiable. Satisfiability (SAT) is an NP-complete problem, but there are some improvements on the naive algorithm (checking every row of the truth table): arrange the possible truth assignments in a tree (each level corresponds to a literal, each node corresponds to a partial truth assignment), simplify the problem at each node by partially evaluating the formula, and prune the tree whenever possible. Besides deterministic algorithms (DPLL), there are a few heuristics (WalkSAT).

The Fitch system is a set of rules of inference, to draw conclusions from premises (notation: ∆ ⊢ ϕ). Since it allows subproofs

$$\frac{\phi \vdash \psi}{\phi \Rightarrow \psi}$$

(if you are not familiar with this notation: the premises are above the bar, the conclusions below), its proofs are shorter than those of other systems (Mendelson, etc.). A proof system is sound when: if ∆ ⊢ ϕ, then ∆ ⊨ ϕ (i.e., if something is provable, it is true); it is complete when the converse holds. Fitch for propositional logic is sound and complete.
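In propositional logic, the reduction of entailment to unsatisfiability (∆ ⊨ ϕ iff ∆ ∪ {¬ϕ} is unsatisfiable) can be sketched with the naive truth-table algorithm. Sentences are represented here as Python functions of a truth assignment; this encoding and the names `assignments` and `entails` are assumptions made for the example.

```python
from itertools import product

def assignments(names):
    """Enumerate all truth assignments for the given propositional constants."""
    for values in product([False, True], repeat=len(names)):
        yield dict(zip(names, values))

def entails(premises, conclusion, names):
    """Delta |= phi iff no assignment satisfies Delta together with (not phi)."""
    return not any(
        all(p(a) for p in premises) and not conclusion(a)
        for a in assignments(names)
    )

names = ["p", "q", "r"]
premises = [
    lambda a: a["p"] or a["q"],                    # p or q
    lambda a: (not (a["p"] or a["q"])) or a["r"],  # (p or q) => r
]
print(entails(premises, lambda a: a["r"], names))  # True
print(entails(premises, lambda a: a["p"], names))  # False
```

The search visits all 2^n assignments, which is exactly the cost that tree-based pruning (DPLL) and local search (WalkSAT) try to avoid.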

Propositional resolution provides a more algorithmic way of checking entailment: distribute (ϕ ∨ (ψ ∧ χ) = (ϕ ∨ ψ) ∧ (ϕ ∨ χ)) all the premises to put them in clausal form $\bigwedge_i \bigvee_j p_{ij}$, often written as

$$\{p_{11}, \dots, p_{1k_1\}} \quad \cdots \quad \{p_{n1}, \dots, p_{nk_n}\}$$

and apply the propositional resolution principle

$$\frac{\{p, q_1, \dots, q_m\} \qquad \{\neg p, r_1, \dots, r_n\}}{\{q_1, \dots, q_m, r_1, \dots, r_n\}}.$$

The empty clause is a contradiction. The propositional resolution principle is not complete (e.g., {p}, {q} ⊬ {p, q}), but unsatisfiable clauses can be proven to be so: to prove ∆ ⊨ ϕ, show ∆ ∪ {¬ϕ} ⊢ {}.

Relational logic adds quantifiers, variables (needed if you want to use quantifiers), relations and functions. (You will often need an equality relation, syntactic sugar for an equivalence relation with the substitution property: ∀x∀y p(x) ∧ (x = y) ⇒ p(y).) We can still define the notion of truth assignment and logical entailment but, in presence of functions, the set of sentences for which you have to provide a truth value (the Herbrand base) is infinite. Here are a few examples: Z/4 can be described with 4 literals and relations “same”, “next”, “plus”; N (Peano arithmetic) can be described with one literal 0, a “successor” function and “same”, “next”, “plus” relations; lists can be represented with a “nil” literal, a “cons” function and an “append” relation; BNF grammars; propositional logic itself (metalevel logic).

Finite relational logic (i.e., the Herbrand base is finite, in particular, there are no functions) is equivalent to propositional logic: to see it, write everything in prenex form, i.e., with the quantifiers on the outside. It is compact: every unsatisfiable set of sentences contains a finite unsatisfiable subset. Omega relational logic (infinitely many literals but no functions) is not compact, but is semi-decidable: if a finite set of sentences is unsatisfiable, it can be shown in finite time (using the prenex transformation).

The Fitch proof system is still applicable to relational logic (with a few more rules: universal and existential introduction and elimination, for which you have to pay attention to free variables; domain closure; induction): if ∆ ⊨ ϕ, then there exists a finite proof ∆ ⊢ ϕ (but if ∆ ⊭ ϕ, the search for a proof will take forever).

Relational resolution is useful, but incomplete: it is based on satisfiability, but relational logic is not decidable. To put an expression in clausal form, existential elimination, ∃x p(x) ⟶ p(a(y)), adds a new function (Skolem function) of the enclosing universal variables y. Once you are only left with universal quantifiers, on the outside, you can drop them. Unification is also trickier: you may need to substitute some of the variables and drop part of the clauses.

Relational resolution can be used for answer extraction: to find x such that p(x), add the clause {¬p(x), goal(x)}, i.e., p(x) ⇒ goal(x), and continue until you get {} (no solution), or {goal(a)} (a is a solution), or {goal(a1), . . . , goal(an)} (one of the ai is a solution). It may also take forever (relational logic is not decidable). The algorithm can be sped up by removing tautologies ({p, ¬p, . . . }), duplications (drop Ψ if Φσ ⊂ Ψ) or clauses containing a literal that is never negated.

In Herbrand semantics, the universe of discourse contains what the logic describes and nothing else; in Tarskian semantics, the universe of discourse can be larger (e.g., the reals and the hyperreals (non-standard numbers) are both models of the reals). Propositional logic and relational logic are examples of Herbrand semantics. First order logic is the Tarskian equivalent of relational logic (you can emulate first order logic with relational logic: just add a new unary function). Fitch without domain closure and induction is sound and complete for first order logic:

$$\Delta \vdash_{\text{Fitch} \setminus \{\text{dc}, \text{ind}\}} \phi \iff \Delta \models_{\text{fol}} \phi
\qquad\qquad
\Delta \vdash_{\text{Fitch}} \phi \implies \Delta \models_{\text{rl}} \phi$$

R in finance 2012

To design a real-time (i.e., you cannot look into the future) finite-sample filter approximating some idealized filter,

$$y_t = \sum_{k \in \mathbf{Z}} \gamma_k x_{t-k}
\qquad\qquad
\hat y_t = \sum_{k=0}^{n} \hat\gamma_k x_{t-k},$$

express it in frequency space

$$\Gamma(\omega) = \sum_k \gamma_k e^{-ik\omega}$$

and find the $\hat\gamma_k$ that minimize

$$\int_{|\omega|=1} \left| \Gamma(\omega) - \hat\Gamma(\omega) \right|^2 .$$

You can decompose this quantity into a penalty for the delay between the real and idealized filter and a penalty for the amount of remaining noise.

The singular value decomposition (SVD) can be used to quickly test for cointegration among a large number of assets.

The dlm package can fit, filter (Kalman), smooth state space models: ARMA, stochastic volatility, regression with time-varying parameters (dlmModReg), etc. and help you cross-validate the result.

The Rcpp package allows you to transparently manipulate R objects in C++, eschewing the gory macros you have to put up with when using .Call directly – and the inline package simplifies things even further. But you are left with those template-related error messages... It is even more useful with the advent of reference classes (R5). In the other direction, you can use RInside to, say, include R in a GUI or a web application (e.g., with the Wt framework).

To mine textual data, use Python and NLTK, or the tm package and its many plugins (tm.plugin.webmining, tm.plugin.sentiment, tm.plugin.tags).

CppBugs is faster than MCMCpack, and (almost?) as flexible as JAGS.

Evaluating the design of the R language F. Morandat et al.

Unflattering but objective evaluation of R: the language is designed to be used interactively (named and optional arguments); the main data type is the vector (with support for missing values); it encourages vectorized operations. It mixes seemingly incompatible programming language paradigms: functional and lazy, but dynamic, imperative and reflective (parse, eval, quote, bquote, substitute, deparse). The current implementation is “massively inefficient”, mainly because of the laziness of the language (which also fails to deliver the performance improvements it should bring) and immoderate memory usage (garbage collection is more expensive, frequent and slow than it should be). The two class systems (S3, S4 – there is also R5 and a few more in separate packages) look “like an afterthought”. It is “hopelessly non-thread-safe” and lacks standard data structures (growable arrays, hash maps). “It is not the ideal language to develop robust packages.” “As a language, R is like French; it has an elegant core, but every rule comes with a set of ad hoc exceptions that directly contradict it”.

The article also provides a formal semantics of the language, and analyzes large amounts of code, to see how the language is actually used.

The article does not mention the ability to easily describe statistical models (formula), the data manipulation (reshape2, plyr) and plotting capabilities – but these are add-ons, rather than parts of the core language.

The nature of alpha A.M. Berd (2011)

Even when they do not explicitly trade options, there is often a negative correlation between the returns of a hedge fund and market volatility, at least for “convergence” (mean-reverting) strategies: indeed, those strategies are essentially a short strangle. It is the opposite for momentum strategies. Measure this risk, diversify your strategies and, if it is not enough, add some explicit volatility hedging.

The feedback effect of hedging in portfolio optimization P. Henry-Labordère (2004)

Delta hedging can have a destabilizing market impact: to hedge a call, you buy when the market rises. This is the opposite of portfolio optimization, which suggests buying when the price falls.

Circadian patterns and burstiness in human communication activity H.H. Jo et al. (2011)

Human activities (phone calls, tweets, etc.) show both periodicities (circadian, weekly) and bursts. The bursts can be modeled with a cascading Poisson process, i.e., a Poisson process $(X_t)_t$ with intensity

$$\lambda_t = \lambda_0 + \lambda_1 \sum_{s<t} e^{-(t-s)/\tau} X_s .$$

The periodicity can be modeled with a change of time $Y_t = X_{T_t}$, with $dT_t/dt$ periodic (and positive).

Robust pricing and hedging of double no-touch options A.M.G. Cox and J. Obłój (2009)

Given a set of calls and digital call prices, with no arbitrage opportunities (there are a few conditions to be satisfied: C(0) = S0, C(∞) = 0, C′(0) ⩾ −1, D(k) = −C′(k)), how much latitude do we have to extend the market model to include other options? In the case of double-no-touch options, one can devise a few super- and sub-hedges and show that the corresponding bounds are as tight as possible.

Conquering the Greeks in Monte Carlo: efficient calculation of the market sensitivities and hedge ratios of financial assets by direct numeric simulation M. Avellaneda and R. Gamba (2000)

When pricing options via Monte Carlo simulations, you do not get exactly the market prices; however, you can put weights on the sample paths to exactly recover them (use the maximum entropy principle to have uniquely defined weights). Those weighted paths can then be used to compute sensitivities (“Greeks”).

Vibrato Monte Carlo and the computation of Greeks S. Keegan (2008)

Option sensitivities (“greeks”) can be computed via

– Finite differences

$$\frac{\partial \text{Price}}{\partial \theta} \approx \frac{\text{Price}(\theta + \Delta\theta) - \text{Price}(\theta - \Delta\theta)}{2 \Delta\theta}$$

– Likelihood ratio

$$\begin{aligned}
\text{Price} &= E[\text{Payoff}] = \int \text{Payoff}(S)\, p(S)\, dS \\
\frac{\partial \text{Price}}{\partial \theta} &= \int \text{Payoff} \times \frac{\partial p}{\partial \theta}\, dS \\
&= \int \text{Payoff} \times \frac{\partial \log p}{\partial \theta}\, p\, dS \\
&= E\!\left[ \text{Payoff} \times \frac{\partial \log p}{\partial \theta} \right]
\end{aligned}$$

(but the variance is too high);

– Pathwise sensitivity

$$\frac{\partial \text{Price}}{\partial \theta} = E\!\left[ \frac{\partial \text{Payoff}}{\partial \text{Spot}} \frac{\partial \text{Spot}_T}{\partial \theta} \right]$$

– Adjoint method: discretize the SDE

$$S_{n+1} = S_n + a(S_n, t_n) h + b(S_n, t_n) Z_{n+1}$$
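The first three estimators of option sensitivities (finite differences, likelihood ratio, pathwise) can be compared on the simplest possible case. The sketch below assumes a European call under geometric Brownian motion, with θ taken to be the initial spot (so the sensitivity is the delta), and compares each estimator with the closed-form Black-Scholes delta; the parameter values and function names are illustrative choices, not taken from the paper.

```python
import math
import numpy as np

def bs_delta(s0, k, r, sigma, t):
    """Closed-form Black-Scholes delta of a European call, for reference."""
    d1 = (math.log(s0 / k) + (r + 0.5 * sigma ** 2) * t) / (sigma * math.sqrt(t))
    return 0.5 * (1.0 + math.erf(d1 / math.sqrt(2.0)))

def mc_deltas(s0=100.0, k=100.0, r=0.02, sigma=0.2, t=1.0, n=200_000, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)            # same draws reused by all estimators
    disc = math.exp(-r * t)

    def terminal(spot):
        return spot * np.exp((r - 0.5 * sigma ** 2) * t + sigma * math.sqrt(t) * z)

    def payoff(spot):
        return disc * np.maximum(terminal(spot) - k, 0.0)

    # Central finite difference, with common random numbers.
    h = 0.01 * s0
    fd = np.mean(payoff(s0 + h) - payoff(s0 - h)) / (2 * h)

    # Pathwise: dPayoff/dS_T * dS_T/dS_0, with dS_T/dS_0 = S_T / S_0.
    st = terminal(s0)
    pathwise = np.mean(disc * (st > k) * st / s0)

    # Likelihood ratio: payoff times the score d log p / d S_0.
    score = z / (s0 * sigma * math.sqrt(t))
    lr = np.mean(payoff(s0) * score)

    return {"finite_difference": float(fd),
            "pathwise": float(pathwise),
            "likelihood_ratio": float(lr)}

estimates = mc_deltas()
exact = bs_delta(100.0, 100.0, 0.02, 0.2, 1.0)
print(exact, estimates)
```

The pathwise estimator needs a payoff that is differentiable almost everywhere, while the likelihood ratio estimator does not differentiate the payoff at all, which is why it also applies to digital payoffs, at the cost of a higher variance.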

[…] The corresponding “non-parametric empirical Bayes” frontier is higher than the shrinkage or bootstrap ones.

[…] (the first term ensures that S∗ is close to the signal S, the second ensures smoothness) and just look at the slope of the smoothed signal.

Collaborative filtering with temporal dynamics Y. Koren (KDD 2009)

Application of factor models (used to measure risk in finance) to recommender systems (Netflix, etc.).

[…] $P(X > a) \geq P(Y > a)$ or, equivalently, $\forall a \quad F_X(a) \leq F_Y(a)$.


Can we learn to beat the best stock A. Borodin et al. (2004)

Cover’s universal portfolios use frequent rebalancing to achieve large returns, especially if stock prices are mean-reverting, i.e., if they are negatively autocorrelated. They can be improved by exploiting cross-correlation.

Fiducial inference and generalizations J. Hennig et al. (2009)

Fiducial inference is bayesian statistics with no prior: the model

data = f(parameters, innovations)

can be rewritten as

parameters = f⁻¹(data, innovations).

If you observe the data and know the distribution of the innovations, you have a posterior distribution for the parameters. (In general, f is not invertible, which complicates a lot of things.)

p-value precision and reproducibility D.D. Boos and A. Stefanski (2011)

We tend to give a confidence interval to everything: why not for p-values, as well? A look at their standard deviations, bootstrap confidence intervals, and probability of reproducibility P(p_new ⩽ 0.05 | p_old ⩽ 0.05) suggests that they are very imprecise: the order of magnitude (approximately, the stars displayed by most statistical software) is the best we can hope for.

Exploring complex networks via topological embedding on surfaces T. Aste et al. (2011)

Planar graphs are simpler than arbitrary graphs. Instead of trying to embed graphs in the plane, one can try to embed them in surfaces of genus g (for the lowest possible g). The maximal n-vertex graphs embeddable in a genus g surface can be described from two elementary moves.

[Figure: the two elementary moves.]

Rationality, irrationality and escalating behavior in online auctions F. Radicchi et al. (2011)

Lowest unique bid auctions (people bid, much less than the price of the object; pay each time, much (100 times) more than the bid; are told if their bid is the highest or not; and the lowest unique bid wins) are all-pay auctions, lotteries decided by the outcome of a minority game: they are only profitable to the auctioneers.

Models for the impact of all order book events Z. Eisler et al. (2011)

The mid-point price is a linear superposition of the impact of the previous prices, with decreasing weights (compute those weights from the autocorrelation function). This simplistic model can be improved by allowing for volume-dependent impact, different order types (market or limit), and weights that vary slowly with time.

Identification of clusters of investors from their real trading activity in the market M. Tumminello et al. (2011)

Investing styles (corresponding to the informed vs non-informed or fundamental vs technical oppositions in most agent-based market models) can be identified from transaction data as follows. Build a bipartite network, whose nodes are investors and days, with three types of edges, corresponding to “buy”, “sell” and “buy and sell later in the same day”; project it to a network of investors, with 9 types of edges, and weights corresponding to the number of days; to account for different trading activities, truncate the graph to a statistically validated network: compute the statistical significance of each edge, with a p-value, testing if the two investors act independently (given the number of orders of each type they send in the sample), adjusted for multiple tests (Bonferroni or FDR); finally, apply some community-detection algorithm such as infomap.

Life time of correlation between stock prices on established and emerging markets A. Buda (2010)

To find the “best” window size to compute a correlation matrix (between stock returns), check how the following quantities change: average number of consecutive days with a correlation above 0.5 (for a pair of stocks, or average for all pairs in a market); time in which half the connections in the minimum spanning tree (MST) have disappeared; average correlation (Epps effect).

A contextual risk model for the Ellsberg paradox D. Aerts and S. Sozzo

Quantum interpretation of knightian uncertainty (the future is random, but you do not know from which distribution it is drawn): quantum superposition corresponds to the use of mixture distributions as priors.

Asymmetric random matrices: what do we need them for? S. Drożdż et al. (2011)

Random matrix theory (RMT) can be generalized to study the distribution of the diagonal and off-diagonal elements, and the (complex-conjugate) eigenvalues of cross-correlation matrices Cor(X, Y). The null hypothesis is still that (X, Y) is gaussian iid [I initially thought, from the title, that they were studying the eigenvalues of Cor(X) where the distribution of (X1, . . . , Xn) under H0 is no longer invariant under the action of Sn]. Applications include the study of time series, X = (X1, . . . , Xn), Y = (Y1, . . . , Yn) – but I am sceptical about the effect of autocorrelation (or serial dependence in general): with real-world data, we already know that H0 is false.

Endogenous bubbles in derivative markets: the risk neutral valuation paradox A.F. Maccioni (2011)

In an incomplete market, the presence of risk-neutral investors (technical traders) lets the prices stray far away from the fundamental value of the assets, creating endogenous depressions or bubbles; when those traders disappear (or stop trading, for an instant), prices violently bounce back to the fundamental value.

Calculation of aggregate loss distributions P.V. Shevchenko (2010)

(Good, simple introduction to Fourier methods for numeric computations in statistics.)

A loss distribution is the distribution of

$$Z = X_1 + \cdots + X_N$$

where the $X_i$ are iid and N is a random variable; we are interested in its value-at-risk (VaR) and expected shortfall (ES).

Analytically, it can be computed via convolution or characteristic functions

$$\begin{aligned}
H(z) &= \sum_{k \geq 0} p_k F^{(k)*}(z) \\
\chi(t) &= \psi(\phi(t)) \\
h(z) &= \frac{2}{\pi} \int_0^\infty \operatorname{Re}[\chi(t)] \cos(tz)\, dt \\
H(z) &= \frac{2}{\pi} \int_0^\infty \operatorname{Re}[\chi(t)] \frac{\sin(tz)}{t}\, dt
\end{aligned}$$

where H is the cdf of Z, F that of X, $p_k$ the probability mass function of N, χ and ϕ the characteristic functions of Z and X, ψ the probability generating function of N.

The moments of the compound distribution Z can be explicitly computed from those of X and N (to derive the formulas, use the characteristic functions).

Monte Carlo simulation is the easiest (and slowest) way of computing the VaR or ES of the compound loss (do not forget to estimate the error of those computations).

If the individual losses $X_i$ have a discrete distribution, the distribution of the compound loss Z can be obtained recursively, in O(n³) time. If the probability mass function $p_n = P(N = n)$ satisfies

$$p_n = \left( a + \frac{b}{n} \right) p_{n-1}$$

for n ⩾ 1 (there are generalizations for n ⩾ 2), it can be computed in O(n²) (Panjer recursion). Discretization and underflow can pose problems. Without discretizing the losses, the Panjer condition gives an integral equation satisfied by h.

With the fast Fourier transform (FFT), one can compute the pdf h of the compound loss from its characteristic function χ. As with Panjer recursion, this still requires discrete losses. Aliasing error (artefacts at the boundary of the loss distribution) can be reduced by tilting the distribution, i.e., by replacing f(x) with $e^{-\theta x} f(x)$.

For the cdf H, one can use numerical integration methods (H can be expressed as an integral involving the characteristic function χ: there is no need to compute the pdf h), exploiting the oscillatory behaviour of the integrand.

Since the moments of the compound distribution are known, you can approximate the distribution with a simpler one (Gaussian, translated gamma) with the same moments.

The asymptotic expansion

$$1 - H(z) \underset{z \to \infty}{\sim} E[N] (1 - F(z))$$

provides a VaR estimator.

Index cohesive force analysis reveals that the US market became prone to systemic collapses in 2002 D.Y. Kenett et al. (2011)

The sample variance or correlation matrix of stock returns changes with time. To study those changes, you can look at the eigenvalues, their entropy, the average correlation, the average partial correlation (correlation without the market) – the ICF is the ratio correlation over partial correlation. A rasterplot of the average (partial) correlation by stock can help identify regime changes: the partial correlation became negligible in 2002.

[Figure: average correlation (vertical axis, 0.0 to 1.0) over time (2007 to 2015).]

Detecting novel associations in large data sets D.N. Reshef et al. (2011)

The MIC (maximal information coefficient) is a measure of dependence defined (computationally) as follows. Divide the data into an n × m grid, discretize the data along this grid, compute the mutual information,

normalize it (divide by log Min(x, y)); find the division Counter-point to risk parity critiques that maximizes the mutual information; repeat for all E. Peters (2010) values of n, m with nm ⩽ N 0.6 (where N is the num- A few caveats on the implementation of risk parity ber of observations). The MIC is the maximum value strategies: avoid leveraging assets that are already inin this matrix. ternally leveraged (e.g., stocks, whose debt/equity raIt can be interpreted as a generalized correlation coef- tio is often 2/1); hedge inflation; do no leverage asset ficient and can identify non-functional relations (a cir- classes with no expected returns, such as commodities cle, etc.), but is misled by some patterns – for instance, – only use them to hedge inflation. both a line and a 2 × 2 checkerboard pattern give the maximum value. (Also, its power is apparently much The hidden risks of risk parity portfolios lower than that of the correlation distance.) B. Inker (2010) From the matrix M of mutual information, one can derive other measures: Max |Mij − Mji | measures monotonicity, MIC − Cor2 measures non-linearity.

Remember that the standard deviation does not measure all the risk, especially for assets with skewed returns – and leverage increases the danger.

Brownian distance covariance G.J. Székely and M.L. Rizzo (2009)

The (squared) distance covariance between two random variables X and Y with characteristic functions ϕ_X and ϕ_Y is

$$\int w(x, y) \, |\phi_{X,Y}(x, y) - \phi_X(x) \phi_Y(y)|^2 \, dx \, dy$$

with $w(x, y) \propto |x|^{-p-1} |y|^{-q-1}$ (p and q are the dimensions of X and Y).

If U is a stochastic process on R, let X_U = U(X) − E[U(X)|U] be the U-centered version of X; in particular X_id = X − E[X]. The brownian covariance is defined by

$$\mathrm{Cov}^2_{W,W'}(X, Y) = E[X_W X'_W Y_{W'} Y'_{W'}]$$

where X′ and Y′ are iid copies of X and Y, and W, W′ are independent brownian motions. It coincides with the distance covariance. The result can be generalized to Lévy fractional brownian motions of Hurst index H ∈ (0, 1) and $w(x, y) \propto |x|^{-p-2H} |y|^{-q-2H}$.

Leverage aversion and risk parity A. Asness et al. (2011)

There are many reference portfolios: capitalization-weighted (the common approximation of the market portfolio), equal-weights, fundamental weights (since the market portfolio overweights overvalued stocks, one can try to replace the stock price with some “fair value” derived from the balance sheet – it is an accounting-motivated way of shrinking the capitalization-weighted portfolio towards the equal-weighted one), minimum variance, tangent (the CAPM theorem claims it is the market portfolio). The risk parity portfolio is yet another (robust) approximation of the elusive market portfolio, using equal contributions to risk. This is theoretically justified if some investors are averse to leverage: they would overweigh risky assets (stocks) to achieve higher returns, increasing demand for riskier assets and lowering their expected returns. Historically, a 170% bond, 30% stock portfolio performed better than the market portfolio and the classical 60% bond, 40% stock one.

Financial Theory J. Geanakoplos (Yale, 2009)

Some universities have started to put some of their undergraduate (i.e., elementary) courses online: this one is an introduction to finance, describing models, examples or experiments to illustrate general economic and financial facts.

Economy

Many price-setting mechanisms (seller-derived (supermarket), haggling, price regulation, tatonnement, pit (like haggling, but with several buyers and sellers), bid/ask (as in computer-trading), specialist, etc.) lead to the same price (this is the law of one price), the price predicted by the law of supply and demand. This can be seen in the following experiment: take 10 students, 5 buyers and 5 sellers, assign a preferred price to each, ask them to trade using the “pit” method (shout the desired price, above their preferred price for sellers, below for buyers, and modify your offer until you find a counterparty). Even if it were theoretically possible for everyone to find a counterparty with a profit on both sides, this is not what happens: the final price p is such that the number of sellers below p equals the number of buyers above p – the others could not find a counterparty.

In this experiment, the average price (for buyers, or for sellers) did not play any role: only the price at the margin mattered. The price is decided at the margin: the market price is the reservation price of the marginal buyer or seller, i.e., the price for the next person to buy or sell.

This also explains Adam Smith’s water and diamonds paradox: water is useful but cheap, while diamonds are useless but expensive. Price and value are not the same thing.

Economic models are usually defined with a set of agents, each with an initial endowment and a utility function, trading with one another to change their allocation and maximize their utility function. It was once thought that the resulting equilibrium was maximizing the sum of the utilities – this only happens in the unrealistic case where the marginal utility is constant. The equilibrium is just Pareto efficient: changing the allocations would make someone worse off (but there are many Pareto-efficient allocations).
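The pit experiment described above can be sketched numerically. The following is a minimal illustration, not taken from the course: the function name `clearing_interval` and the reservation prices are made up for the example. It sorts buyers and sellers, counts how many pairs can trade, and returns the interval of prices for which the number of sellers below the price equals the number of buyers above it.

```python
def clearing_interval(buyer_values, seller_costs):
    """Number of trades and the interval of market-clearing prices.

    buyer_values: the most each buyer is willing to pay (hypothetical data).
    seller_costs: the least each seller is willing to accept (hypothetical data).
    """
    buyers = sorted(buyer_values, reverse=True)   # most eager buyer first
    sellers = sorted(seller_costs)                # cheapest seller first
    k = 0
    # Pair the k-th most eager buyer with the k-th cheapest seller while both gain.
    while k < min(len(buyers), len(sellers)) and buyers[k] >= sellers[k]:
        k += 1
    if k == 0:
        return 0, None
    low, high = sellers[k - 1], buyers[k - 1]     # the last pair that trades
    if k < len(buyers):
        low = max(low, buyers[k])                 # first excluded buyer must stay out
    if k < len(sellers):
        high = min(high, sellers[k])              # first excluded seller must stay out
    return k, (low, high)


buyer_values = [44, 40, 36, 30, 20]
seller_costs = [10, 15, 25, 35, 38]
trades, price_range = clearing_interval(buyer_values, seller_costs)
print(trades, price_range)
```

With these made-up numbers, only three of the five possible pairs trade, and the price is pinned down by the marginal participants (the buyer at 30 and the seller at 35), not by the average reservation prices.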

To find the equilibrium, just write the following equations:
– Markets clear, i.e., supply equals demand, i.e., for each good, the sum of the initial endowments equals the sum of the allocations;
– All the money is spent, i.e., for each actor, the value of the initial endowment is the value of the allocation;
– Everyone maximizes their utility: the Lagrange multipliers conditions can be written as

$$\frac{1}{p_x} \frac{\partial U_A}{\partial x} = \frac{1}{p_y} \frac{\partial U_A}{\partial y}$$

for all goods x, y, and all actors A (U′ is called marginal utility).

The prices are only determined up to a multiplicative constant. There is no “just” price: it depends on utilities and endowments.

[Figure: 2-asset, 2-good equilibrium. Axes: allocation of good x and of good y for agent A, with the allocations for agent B on the opposite axes. Legend: indifference curves for A, indifference curves for B, Pareto frontier, budget lines (prices).]

Assets and time

To jump from economic models to financial models, we need to add time, assets and (later) risk.

Adding time is easy: let x and y be the same good (say, apples), but x is now and y next year. For instance, if the utility is log x + ½ log y, we are impatient, we value apples more today than next year, i.e., we discount apples in the future.

We can also add assets: for instance, a piece of paper giving you the right to receive one apple today x and one apple tomorrow y is an asset – an apple tree is also an asset, with the same dividends (1, 1).

The situation may look complicated: we have endowments of x, y, of the asset (1, 1) (and of other assets, defined by their payoffs or dividends); we have the utility for each actor; we are looking for the price of each good and asset, and the allocation of each good and asset. But it can be simplified by just removing redundant assets or goods: assets such as (1, 1) are just linear combinations of goods.

With the introduction of time, the notion of price becomes trickier. The prices are expressed in some currency. In the 1-period set-up, we could “normalize” the prices by assuming that the price of good x was 1 currency unit. In the 2-period setup, we can do exactly the same: assume that the price of good x is one currency unit, and look for the price of good y (the same asset next year) in the same currency – the currency this year. In the real world, the market price we would see would be expressed in a different currency: the currency next year. The difference (strictly speaking, the ratio, expressed as a rate) between the currency this year and the currency next year is the inflation. To model it, we would need to add a “theory of money” (how much money there is in the economy, how quickly it circulates, etc.) to our model. The difference between the price this year (in today’s currency) and the price next year (in today’s currency) is the real interest rate. This is what we are interested in. The price, in today’s currency, of next year’s good, is called its present value. The difference between the price this year in today’s currency and the price next year in next year’s currency is the nominal interest rate: it combines (muddles) both elements. We will only use today’s currency.

$$1 + \text{real interest rate} = \frac{1 + \text{nominal interest rate}}{1 + \text{inflation}}$$

The real interest rate is usually positive. It can be negative when the government tries to boost demand: it is then preferable to spend now, and even borrow to spend now, rather than wait until next year. Interest is “crystallized impatience”: we prefer goods today rather than tomorrow. It may also include differences in supply between today and tomorrow.

Bonds and rates of return

Bonds (coupon, zero, annuity, perpetuity) can be used to spread expenses over time. For instance, if a university needs 100,000,000 per year, for the next 10 years, to repair badly maintained buildings, it does not have to cut its budget by that much: if the cost can be spread over the life of the university, i.e., if they are willing to pay forever, they can replace the costs with a perpetuity with the same present value (PV) and reduce the annual cost:

PV(10-year 100 million annuity) = PV(perpetuity).

(You have to choose an interest rate for this, e.g., 5%.)

The notion of present value can also be used to measure the performance of a hedge fund: since the cash flows and the assets under management are irregular, the notion of average return can be contentious. The geometric average of the annual total returns is a biased measure: it gives the same importance to years with little capital (and amplifies little profits) and less importance to years with more capital (and reduces losses, e.g., caused by capacity constraints). The internal rate of return (IRR, sometimes called yield-to-maturity), i.e., the constant rate for which the present value of all

the cash flows (both positive and negative) is zero, is a better alternative, though plagued by its own problems. It is also biased, because clients tend to leave after a bad year (so that there is more weight on those bad years) and invest after good years (so that there is less weight on those good years). If there is a gap, a long series of zeroes, in the stream of cash flows, everything that comes after will be discounted a lot and will have a negligible effect. The IRR is not always well-defined: for instance, (1, −4, 3) could suggest either 0% or 200%. The IRR assumes that the interest rate is deterministic and constant (we are throwing away relevant information: we know the risk-free discount factors). There is also a (uniquely defined) modified internal rate of return (MIRR). The IRR assumes that you can reinvest the money at the IRR rate, while the MIRR uses a reinvestment rate: compute the future value of the inflows, after n periods, with the reinvestment rate; compute the present value of the outflows (negative cash flows), now, with the financing rate; convert into a rate. In R, check financial::cf. (The classical alternative to the IRR is the present value of the stream of cash flows for the risk-free interest rate.)
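The present-value computations above (the annuity replaced by a perpetuity, the ambiguous IRR of (1, −4, 3), and the MIRR recipe) can be checked with a short sketch. The helper names `present_value`, `irr_candidates` and `mirr` are illustrative, not from the course; the 5% rate is the one suggested in the text, and payments are assumed to fall at the end of each year.

```python
import numpy as np


def present_value(cash_flows, rate):
    """Present value of cash_flows[t], paid at the end of period t (t = 0 is now)."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))


def irr_candidates(cash_flows):
    """All real rates r > -1 at which the present value of the cash flows is zero."""
    # With x = 1 / (1 + r), the present value is the polynomial sum_t cf_t * x**t.
    # np.roots expects the highest-degree coefficient first, hence the reversal.
    roots = np.roots(list(cash_flows)[::-1])
    rates = [1 / x.real - 1 for x in roots if abs(x.imag) < 1e-9 and x.real > 0]
    return sorted(rates)


def mirr(cash_flows, finance_rate, reinvest_rate):
    """Modified IRR, following the three steps described in the text."""
    n = len(cash_flows) - 1
    # Future value of the inflows after n periods, at the reinvestment rate.
    future_inflows = sum(cf * (1 + reinvest_rate) ** (n - t)
                         for t, cf in enumerate(cash_flows) if cf > 0)
    # Present value of the outflows, now, at the financing rate (as a positive number).
    present_outflows = -sum(cf / (1 + finance_rate) ** t
                            for t, cf in enumerate(cash_flows) if cf < 0)
    # Convert the ratio into a per-period rate.
    return (future_inflows / present_outflows) ** (1 / n) - 1


rate = 0.05
annuity_pv = present_value([0] + [100] * 10, rate)   # 100 (million) per year for 10 years
perpetuity_payment = rate * annuity_pv               # a perpetuity paying c is worth c / rate
print(annuity_pv, perpetuity_payment)
print(irr_candidates([1, -4, 3]))
print(mirr([1, -4, 3], finance_rate=rate, reinvest_rate=rate))
```

Under these assumptions, the perpetuity costs about 38.6 million per year instead of 100 million, and the two IRR candidates for (1, −4, 3) are 0% and 200%, as stated above.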

Some people use the “current yield” (the bond rate divided by price) instead, especially if they want to sell you something: this is “the wrong way of measuring things, which can get you confused”. (Be wary of anything called “yield”: there are many, many definitions, and they are almost all misleading.)

Using risk-free bonds of several maturities one can compute the price today of $1 in n years (that is a discount factor), and the forward rates (i.e., the rate between years n − 1 and n: it would be the future interest rate, if it was deterministic and known – it is not). Some also compute the yield (and plot the yield curve), but that is misleading: it is the annual rate over those n years computed assuming that the interest rate is constant – but we look at the yield curve because it is not constant. (The yield is not unlike the implied volatility and the smile: it is computed under the assumption that it is constant, and we look at how it varies...)

Profit is sometimes incorrectly measured as the cash flow you receive after one period, forgetting the change in present value. In times of crisis, when the value of the firm drops, many banks and businesses report the cash flow instead of the profit, to hide the losses. This kind of deception is very common with bonds: this is how the “current yield” is defined.

$$\text{Profit} = \frac{CF_1 + PV_1 - PV_0}{PV_0} \neq \frac{CF_1}{PV_0} = \text{Current yield}$$

Carry trades use this idea to hide losses: for instance, if the 2-year forward interest rate is 2% and the 5-year 5%, you can buy a 5-year bond (you pay 100 and receive 5, 5, 5, 5, 105) and sell a 2-year bond (you receive 100 and pay 2, 102). More generally, a carry trade is a portfolio of two bonds, one long, one short, with a positive cash flow at the beginning and a negative cash flow at the end, so that the portfolio value is zero (easy to do if the yield curve is not flat). If you only report the cash flow and not the present value of the outstanding bonds, it looks positive. That is why you should mark-to-market.

Social Security

Social security (an insurance against living too long) works as in the following parable: a son gives $100 to his father (say, for a life-prolonging operation), later asks his own son to pay back the $100 that helped his grand-father live a few more years, this son then asks his own son, and so on – each generation contributes the same nominal amount, which contributes to the PV of the first father, e.g., PV = 50 + 25 + 12.5 + · · · = 100. (In case of a baby boom, the amount to repay (per person) decreases.) In other words, social security is a (beneficial) Ponzi scheme – money is another widespread Ponzi scheme.

The overlapping generations model can explain and quantify what happens. Generation n lives in periods n and n + 1, with endowments (3, 1), i.e., 3 units when young, 1 when old. Generation 0 only has a (0, 1) endowment, but also owns land, an asset with payoff (1, 1, 1, ...): in period 1, they will consume the good produced, sell the asset to the next generation, and buy more good (from the next generation) with the proceeds. The utility is log x_t + log x_{t+1}. One can compute the equilibrium.

Social security worsens this situation: the (3, 1) endowment is replaced by (1, 1). This is a huge benefit for the first generation, but every subsequent generation has to pay the price; the cost does not fall, and remains the same generation after generation, because of the interest rate.

The model can be made more realistic by considering a growing population, growing land dividends, etc.: the situation improves, but only slightly – the interest rate would just be higher.

Some economists claim that the baby boom should not have any effect on the stock market, because we know in advance how many people will retire, and this information is already in the prices. However, the interest rate fluctuates with the population: this is what impacts the stock markets – demography has an effect on stock prices. If we compare the utility of the generations, we find that the utility of baby boomers is lower.

(The overlapping generations model for social security can be augmented with random stock dividends.)

Social security can be fixed as follows:
– Separate the legacy debt from social security;
– Create a new security, an annuity indexed on the average wage: its performance will track that of the stock market, but with fewer, slower fluctuations; the market will price it;
– If you want some redistribution, by paying 10% to

14% of your income (depending on how much you earn), 12% would go in your social security account.

Uncertainty

To fully leave Economy for Finance, we need to add uncertainty to our models: the future state of the world is a random variable, but with a known distribution (we exclude knightian uncertainty, i.e., uncertainty about the probability distribution).

Uncertainty can have a surprising effect on the discount factor. We see a big difference between today and tomorrow, but very little between one year and one year and one day. But the discount factor is supposed to decrease exponentially. This can be explained if the interest rate is unknown, stochastic, and (unrealistically) follows a geometric Brownian motion: the expected discount factor decays (asymptotically) as t^{−α} (hyperbolic discount).

Evidence shows that prices already include a lot of information. For instance, prices of orange juice (concentrated, mainly from Florida) include weather information; weather forecasts cannot help forecast prices, but prices can help forecast weather. The rational expectations assumption goes further and claims that the current price is the expectation of the present value of the future price (this is not entirely true: we will see later how risk changes prices – the difference is the risk premium).

Current price = E[PV(future price)].

From bond prices, one can estimate the probability of default of a country or company (just compare the price with a risk-free bond, i.e., compute the default spread).

$$\text{Bond price} = \frac{1 - P(\text{default})}{1 + r} = \frac{1}{(1 + \text{default spread})(1 + r)}$$

With several maturities, one can check how this probability changes with time. The CDS contains the same information.

Shakespeare’s play the Merchant of Venice is about collateral to control default risk (people not keeping their promises): a defaulted loan makes you lose “a pound of flesh”, choosing the wrong casket forces you to remain single, losing the ring entails your death.

In the black-and-red game, you draw cards from a 52-card deck, receive $1 for each red card, pay $1 for each black card and stop when you want (in particular, you cannot lose: if you wait until the end of the deck, you are flat). What is the value of this game and what is the “optimal” (maximum expected payoff) strategy? This optimal stopping time problem can be solved via dynamic programming (“backward induction”): let V(a, b) be the expected value of the game if there are a red and b black cards left, and compute it recursively.

$$V_{a,b} = \left( \frac{a}{a+b} (1 + V_{a-1,b}) + \frac{b}{a+b} (-1 + V_{a,b-1}) \right)_+$$

Callable bonds are another optimal stopping problem (use, e.g., a geometric brownian motion for the interest rate and compare the price of a bond and a callable bond; use a binomial or trinomial (recombining) tree): the borrower has the option to prepay the bond at any time. Mortgages are a special case.

Mortgages: history

Mortgages have been around since Babylon (1000 BC). They used to be coupon bonds (say, 7 7 7 7 7 · · · 107), with a “balloon payment” at the end – but people would often default just before. After the 1929 crisis, amortizing mortgages, with identical dividends (e.g., 8 8 8 · · · 8) became more popular: the present value (or “remaining balance”), and the risk for the bank, decreases with time.

Then, Freddie and Fannie pooled mortgages and sold them: the risk was limited (the loans were standardized, with stringent conditions on the borrowers), distributed (loans from different regions were pooled), and securitization made mortgages liquid (you could easily resell your Freddie or Fannie shares).

Then, CMO (Collateralized mortgage obligations) cut the pool into pieces: for instance, floaters (constant cash flow plus interest rate), reverse floaters (constant cash flow minus interest rate) and a residual piece for the defaults and the prepayments. There were arbitrage opportunities when the price of the pieces did not match the price of the pool. Hedge funds were selling the riskiest parts and hedging the rest; other investors were selling the easy parts and worrying about the risky parts remaining on their balance sheets. However, the chain from home owner to mortgage pool to CMO to investor was getting longer, and the same collateral was re-used at each step.

Then, new mortgage markets appeared, besides agency (Freddie, Fannie) mortgages: jumbo prime (large mortgages, that were previously excluded); alt-A (almost prime); subprime; uncollateralized.

Then, mortgage pools were cut into pieces (bonds): AAA, AA, A, BBB (the defaults and prepays are put into the lowest piece first, until it is full). CDS (credit default swaps), insurance on each of those bonds, also appeared.

Then, CDOs appeared (take many BBB pieces, pool them together, and split them, into AAA, etc.), and CDO squared (do it again, take many AA CDOs, pool them together, and split them).

Mortgages

Mortgages are just callable bonds, and computing their prices should be an optimal stopping problem. However, borrowers do not prepay their mortgages optimally: banks estimate the proportion of people who will prepay in order to lower the cost of the mortgages and be more competitive – for the borrower, there can be an actual arbitrage opportunity. Modelling the prepay rate as a function of the interest rate poses a survival problem: people who have prepaid are more alert or attentive than those who remain: the more people prepay, the less likely the remaining ones are to prepay. Agent-based models work better: each agent has a “prepay cost” (that depends on the credit rating of the borrower) that measures how attentive he is to the interest rate, and acts rationally with respect to this cost; one can calibrate the distribution of this prepay cost with the data. Other parameters can be added: attentiveness, value of the mortgage, location, pay of the borrower, etc.
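The backward induction for the black-and-red card game mentioned earlier can be sketched as follows. This is a minimal illustration of the recursion, with a hypothetical function name `game_value`; stopping is worth 0, so the value is the positive part of the expected value of drawing one more card.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def game_value(red, black):
    """Expected value of the game, under optimal stopping, with `red` red
    and `black` black cards left in the deck."""
    if red == 0:
        return 0.0          # only losing cards remain: stop immediately
    if black == 0:
        return float(red)   # only winning cards remain: draw them all
    total = red + black
    keep_drawing = (red / total) * (1 + game_value(red - 1, black)) \
                 + (black / total) * (-1 + game_value(red, black - 1))
    return max(0.0, keep_drawing)   # stopping now is always worth 0


print(game_value(1, 1))
print(game_value(26, 26))
```

The optimal strategy can be read off the same table: keep drawing as long as the value of continuing is positive, and stop as soon as it is not.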

The value of the mortgage depends on the interest rate, as an option does, but the shape of the payoff is tricky, dangerous, when the interest rates are low – that part should be carefully hedged.

[Figure: mortgage value (vertical axis) as a function of the interest rate (horizontal axis, from 0 to 5).]

Hedging

A hedge fund is an asset manager that uses leverage, hedging, is lightly regulated (i.e., the clients are sophisticated investors) and has high fees. Hedge funds typically go bankrupt when they do not hedge properly or encounter liquidity problems (e.g., margin calls).

Here is a hedging example. Someone is willing to bet that team A will win at 60/40 odds, while the bookmakers offer 50/50: how do you profit from this discrepancy of views? You can accept the bet, but place an opposite bet with the bookmakers, for a carefully chosen value, so that you have a sure profit in all cases – there are no probabilities involved.

When time is involved, dynamic hedging may be required. For instance, imagine someone is willing to bet that team A will win over team B, over the next 10 matches, at 60/40 odds, but the bookmakers offering 50/50 odds only allow you to bet on individual matches: how do you profit? The solution is similar, but you have to change your bet after each match. Dynamic hedging also occurs in finance: for instance, if the market price of a mortgage is lower than what your model predicts (e.g., because the market does not account for prepayments), but is extremely sensitive to interest rates, you can hedge interest rate fluctuations away. Everyone wants to hedge, but some risk will remain (markets are not complete).

Earlier, we have claimed that the price was the expected payoff: it is actually less, to compensate for the risk. The difference is the risk premium. [What we really did was use the present value of the future cash flows as utility, maximize expected utility, and compute the corresponding price: that kind of utility is unrealistic, but the rest of the reasoning is correct.]

Utility should be concave, to account for diminishing marginal utility, i.e., risk aversion – this is Jensen’s inequality: we prefer $1 for sure to $1 on average. To avoid infinite expected utilities, you may also want utility to be bounded (for instance, how much would you pay for the game: “flip a coin until you have tails and receive 2^n (or 2^(2^n)) dollars, where n is the number of tosses needed”?).

With quadratic utility, U(x) = ax − x², the expected utility only depends on the mean and variance of x.

An Arrow-Debreu security is an asset that pays $1 in one state of the world, and $0 elsewhere. CDS and binary options are almost Arrow-Debreu securities: they are event-dependent, not state-dependent. In a complete market, the price of an Arrow-Debreu security is proportional to its probability.

The mutual fund theorem says that in an incomplete market, everyone diversifies by holding the aggregate economy and cash: the only thing that differentiates economic actors’ allocations is the proportion of cash.

This assumes that all investors are rational, have access to the same information and investment opportunities, know the correct expected returns and variances, and have quadratic utility (but not necessarily the same utility). Of course, we do not know what the “market” portfolio is – but it is larger than we think.

The advantage of diversification can be seen with two assets: if Cor(X, Y) < 1, the variance of Z = λX + (1 − λ)Y can be lower than that of X and Y (try to plot those random variables, as λ varies, in the σ(Z) × EZ plane).
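The two-asset diversification argument can be checked numerically. The sketch below uses made-up means, standard deviations and correlation (they are assumptions, not data from the course) and traces the points (σ(Z), EZ) as λ varies; `portfolio_mean_sd` is an illustrative helper name.

```python
import math


def portfolio_mean_sd(lam, mean_x, mean_y, sd_x, sd_y, cor):
    """Mean and standard deviation of Z = lam * X + (1 - lam) * Y."""
    mean = lam * mean_x + (1 - lam) * mean_y
    var = (lam ** 2 * sd_x ** 2
           + (1 - lam) ** 2 * sd_y ** 2
           + 2 * lam * (1 - lam) * cor * sd_x * sd_y)
    return mean, math.sqrt(var)


# Hypothetical assets: X is riskier with a higher expected return, Y is safer.
mean_x, sd_x = 0.08, 0.20
mean_y, sd_y = 0.04, 0.10
cor = 0.2

# Points of the curve in the (sd, mean) plane, for lam = 0, 0.01, ..., 1.
frontier = [portfolio_mean_sd(k / 100, mean_x, mean_y, sd_x, sd_y, cor)
            for k in range(101)]
lowest_sd = min(sd for _, sd in frontier)
print(lowest_sd, min(sd_x, sd_y))
```

With a correlation below 1, the lowest standard deviation on the curve is below that of either asset; setting `cor = 1.0` makes the curve a straight segment and the benefit disappears.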

[Figure: risk (horizontal axis) × return (vertical axis) plane. Legend: efficient frontier, efficient frontier with a risk-free asset, indifference curves (for some investor), tangent portfolio, minimum variance portfolio, assets.]

The Sharpe ratio of an asset or portfolio is the slope of the line going through it and the risk-free asset: (EX − r)/σ(X). You should maximize your Sharpe ratio. A stock is worth more if it increases the Sharpe ratio. This can be quantified by the covariance pricing theorem:

$$\text{Price} = (1 + r)^{-1} E X - \alpha \, \mathrm{Cov}(\text{Market}, X),$$

i.e., the risk is not measured by the variance, but by the covariance. The price is determined by the marginal utility, by the utility of a marginal buyer, by how much diversification the asset adds to the market portfolio – not by something intrinsic. (This still assumes quadratic utility.) This can be somewhat unintuitive. For instance, if GE and a startup have the same expected returns, which has the highest price? You might think it is GE, because it has a lower risk, as measured by the variance. It is actually the startup, because it has a lower risk, as measured by the correlation with the market.

In particular, (EX, Cov(Market, X)) should be on a line.

The CAPM worked well for 20 to 30 years, but no longer works. The same goes for the Black-Scholes model: the Gaussian daily returns assumption was almost valid in the 1970s, but no longer is.

Leverage cycle

The leverage cycle can be modelled as follows: two assets (gold, risk-free, and an oil well, with an uncertain payoff), two states, many actors, with identical endowments (1, 1) but heterogeneous beliefs on the probabilities of those states. Actors can borrow gold using oil wells as collateral from less optimistic investors, with various leverages. There is not a single supply and demand equation, but many, one for each possible value of the leverage. (It turns out that only one of those loans is actually traded: the leverage is uniquely defined.) The availability of leverage increases the price.

Crashes appear in the 2-period model: after bad news, the most optimistic investors are bankrupt, the leverage drops, marginal buyers are less optimistic and pay less, and the price drops.

This is what happened with the subprime mortgage crisis. The apparent risk was low because the securitized mortgages were collateralized (by the houses) and investors were also buying insurance, in the form of CDSes, from insurance companies. However, those CDSes were not collateralized. If the BBB mortgage bonds went, so would the CDSes, CDOs and CDO squared. High leverage creates high prices: more buyers are willing to buy at a high price, because they can borrow – the marginal buyer can afford more when leverage is high. When people noticed that something was going wrong, margin (collateral) requirements increased, leverage decreased, price fell, worry rose, collateral requirements increased, etc.

Leverage poses many problems:
– A few investors, with high leverage, control the market (they are the marginal buyers) – what if they are not rational?
– Leverage spreads inequality.
– Leverage increases (leverages) volatility, which impacts economic activity.
– Some companies are deemed “too big to fail”.
– After a leverage crash, there is a debt overhang: no investments (from home owners and businesses) and no loans (from banks).

Preventing those problems is not that hard (in hindsight, it is never hard):
– For mortgages, just write down the principal of the loans (the alternative is to evict the home owners, but this is bad for everyone: people lose their home; banks only get back part of their money, after a few years, when the house is sold, at an abnormally low price (auction), even accounting for the drop in value because it was not maintained (or was vandalized) after the eviction). This is problematic because the bond owner and the home owner are not in contact – there is an intermediary, with no incentive to write down the principal.
– To prevent the leverage collapse, banks should lend with less collateral, e.g., by forbidding quick changes in the margin requirements.
– To replace the natural buyers, the government should invest more.
– To prevent or observe the next leverage crisis, one (the Federal Reserve) should not only monitor interest rates, but also the collateral.

Hyperbolic discounting is rational: valuing the far future with uncertain discount rates J.D. Farmer and J. Geanakoplos (2009)

In a multi-period world, we want the discount factor for future utilities to be stationary and time-consistent.

$$\text{Utility}_t(x) = \sum_{s \geq t} D_{t \to s} \, \text{Utility}(x_s)$$

With known, deterministic rates, the discount factor decreases exponentially. However, if the the interest rate is stochastic, described by a geometric Brownian motion, then the expected discount factor exhibits a power law decay, at least after some time (a couple of generations). Dt→s =

s ∏

(1 + rτ )−1

when facing someone else, who can also take decisions that will affect your payoff. There are many notions of a “good” (or bad) decision. A Pareto efficient situation is a situation that cannot be changed to make one player better off without making the other worse off: since it usually requires cooperation between the players, it rarely happens.

A strategy is dominated by another strategy if it is always worse, whatever the other player does. (If you res→∞ place “worse” by “worse or as bad as”, that is a weakly There is no such effect for more realistic mean-reverting dominated strategy.) models. You can try to find a “good” decision, iteratively, 1.0 by discarding the dominated strategies for both playGeometric brownian motion Constant rates ers, which gives a new game, and continuing to dis0.5 card the dominated strategies until there are none left. However, this assumes that both players are rational, known that the other is rational, know that the other 0.2 knows they are rational, etc. – i.e., players are rational and this is common knowledge. In practice, it does 0.1 not work. τ =t

Expected discount factor

E[Dt→s ] ∼ (s − t)−α

1

2

5

10

20

50

100

200

Time (years)

Discussion of “the leverage cycle” H.S. Shin (2009) after J. Geanakoplos Real-estate is a leveraged investment: home buyers using a mortgage with a 5% downpayment are leveraged 20 times. The leverage cycle can be modeled as follows. In a 1-period world, consider two assets, cash and stock, with the stock price distributed as U (q −z, q +z) in period 1, and two types of investors, trying to maximize the same expected utility under different constraints: unleveraged investors have a leverage at most 1, leverage investors have a value-at-risk (VaR) constraint (since the price distribution is uniform and known, the constraint can be formulated as “they cannot be bankrupt”). When the asset price increases, the VaR constraint is less binding, non-leveraged investors can invest more, demand increases, and price increases even further – there is a similar downward spiral. Demography and the long-run predictability of the stock market J. Geanakoplos et al. (2004) Markets ups and downs, over the past century, seem to coincide with baby booms and busts: part of the effect can be explained by the equity premium, higher when the population is older. This can be modelled with an overlapping generations model, an incomplete market, and age-dependent utilities; the effects remains even after adding families (children helping parents), social security (idem), bequests (the opposite). Game Theory B. Polak (Yale, 2007) Game theory is the study of decision making, not when facing an unchanging world, as often in economics, but Article and book summaries by Vincent Zoonekynd

A Nash equilibrium (NE) is a situation in which each player’s decision is a best response to the other player’s decision. Since the definition is circular, you can expect a few problems: it need not exist, need not be unique, need not be Pareto efficient, and weakly dominated strategies can be Nash equilibria (but strict ones cannot). Randomized decisions can create (mixed) equilibria in situations with no pure equilibria, such as the rockpaper-scissors game. A pure strategy that is part of a mixed strategy NE is itself a best response (to the other participant’s strategy, which is usually a mixed strategy): mixed equilibria cannot be strict. To find a mixed strategy NE, look for mixes that make the other party indifferent. Mixed strategies can be interpreted as randomized strategies or as a population whose elements have different behaviours. A strategy S is evolutionary stable if, when a new strategy appears, with a small proportion in the population, it disappears. Then, (S, S) is a Nash equilibrium. Conversely, if (S, S) is a strict Nash equilibrium, then S is evolutionary stable – this does not work with weak equilibria. The payoffs of a simultaneous game can be represented as a matrix of (with two values in each cell, one for the row player, one for the column player). The decisions and payoffs in a sequential game can be represented as a tree. Backward induction gives a “good” solution to the game, but it assumes that the other player is rational – and it need not be Pareto efficient: there can be moral hazards. Backward induction assumes that the other player will play it his/her best interest, i.e., that they will not choose something that would be bad for them and for us. It assumes they will not “screw up”. It does not give a worst-case strategy.


You can tweak a simultaneous game by changing the payoffs (incentive design: give some of your payoff to the other player, in some situations, to encourage him/her to choose that path – you will get a smaller share of a larger pie), or pruning the tree of unwanted branches by restricting your choices (build a new factory, which becomes a sunk cost; burn your ships to show your commitment, etc.). Zermelo’s theorem says that in a sequential game with perfect information and a finite number of moves, either player 1 can force a win, or player 1 can force a tie, or player 2 can force a win. With imperfect information, many Nash Equilibria are clearly bad choices. A Nash Equilibrium is a subgame perfect equilibrium (SPE) if it also induces a NE on all subgames. Repeated games can lead to wars of attrition: past costs are sunk, they do not play any role in new decisions, and can accumulate, forever. This happens in wars (costly wars over very small frontier areas), companies fighting for a market regardless of the cost or bribery auctions (“all pay auction”: two companies bribe a government until one out-bribes the other).

Repeated games try to enforce cooperation by balancing the threat of future punishments and the prospect of future rewards. If there is a time limit, they can exhibit the lame duck effect (politicians or CEOs approaching retirement): in the last period, you no longer need to cooperate, because there is no future – and, by backward induction, there is no reason to cooperate at the beginning either. In some cases, it can work: if there are several Nash equilibria, you can use one as a reward and the other as a punishment, to encourage cooperation.

This effect can disappear in repeated games with a random number of periods (e.g., in the prisoner’s dilemma, one can use the grim trigger strategy: cooperate until the other defects – to show that it is an equilibrium, discount the future using the probability that the game will continue).

When information is asymmetric, some players may want to reveal the information while others will try to hide it – but the information has to be verifiable: claiming “our product is the best” is not sufficient, even if it happens to be true. Not revealing information is informative: it means that it is not in your interest to do so (information unravelling: “Silence speaks volumes”). Revealing verifiable information can be costly. For instance, in a world with two types of workers, good and bad, and two types of pay, high and low, employers and good workers want the distinction to be known, bad workers do not. If the cost of education is sufficiently different for good and bad workers, good workers will spend time to earn a degree, but bad workers will not: this is a separating equilibrium. This can lead to qualification inflation: if the cost difference is too small, good workers will try to earn even more degrees. (This model assumes that education does not teach you anything, or at least nothing useful for your future job: it is socially wasteful, uses resources but does not improve productivity.)

Auctions are becoming more important (governments selling oligopoly rights (phone markets, natural resources), eBay, IPOs, microstructure of stock markets, etc.). In a common value auction, the value is the same for everyone, but it is unknown. In a private value auction, the values are different for everyone and unrelated. (Reality is between those two extremes: an oil well has a common value, but different companies may have different exploitation costs; a house has a private value (consumption), but since you eventually resell it, other people’s values matter.) In a common value auction, if everyone bids their (unbiased) estimated value, the winner will overpay: this is the winner’s curse. If you win, it is bad news. (This often happens with IPOs.) You should bid under the assumption that you will win, i.e., you should ask yourself “What is your estimate? You have won, so your estimate is too high: do you want to correct your estimate?”.

There are many types of auctions: first price sealed bid auction; second price sealed bid auction (or Vickrey auction: the highest bidder gets the good, but pays the second highest bid); ascending open auction (the price progressively rises, and people progressively drop out – very similar to the second price sealed bid auction); descending open auction (or Dutch auction: the price progressively falls until someone accepts it – identical to the first price sealed bid auction). The optimal strategies are different: in a private value second price sealed bid auction, you should bid your value: it is a weakly dominant strategy; in a private value first price sealed bid auction, bidding your value is weakly dominated: you should bid slightly less. Surprisingly, the expected revenue generated by these auctions is the same.

Here are some of the games or real-world situations used to illustrate those notions.

Cooperation games are variants of the prisoner’s dilemma: competing or colluding firms, overfishing, global warming (contractual enforcement is needed: it is not just a communication problem), etc.
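Returning to the auction strategies described above, the two dominance claims (bid your value in a second price auction, bid less than your value in a first price auction) can be checked by brute force on a grid. This is a minimal sketch: the function names, the grid of rival bids and the tie-breaking rule (ties are lost) are illustrative assumptions.

```python
def second_price_payoff(value, bid, best_rival_bid):
    """Win if our bid is strictly highest; pay the best rival bid."""
    return value - best_rival_bid if bid > best_rival_bid else 0

def first_price_payoff(value, bid, best_rival_bid):
    """Win if our bid is strictly highest; pay our own bid."""
    return value - bid if bid > best_rival_bid else 0

def weakly_dominates(payoff, value, bid_a, bid_b, rival_bids):
    """True if bid_a is never worse than bid_b and sometimes strictly better."""
    never_worse = all(payoff(value, bid_a, r) >= payoff(value, bid_b, r)
                      for r in rival_bids)
    sometimes_better = any(payoff(value, bid_a, r) > payoff(value, bid_b, r)
                           for r in rival_bids)
    return never_worse and sometimes_better

value = 50
rival_bids = [k / 2 for k in range(201)]      # 0, 0.5, ..., 100
other_bids = [b for b in range(101) if b != value]

# Second price: bidding the value weakly dominates every other bid.
truthful_dominant = all(
    weakly_dominates(second_price_payoff, value, value, b, rival_bids)
    for b in other_bids)

# First price: bidding the value is weakly dominated by shading the bid.
truthful_dominated = weakly_dominates(first_price_payoff, value, 49, value, rival_bids)
```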

In the number game, each of n participants chooses a number between 1 and 100; the winner is the one closest to 2/3 of the average (this shows that, in practice, if you delete dominated strategies until the end, you lose: rationality is not common knowledge).

Alice and Bob both wear a pink hat but do not know the colour of their own hat: “Someone is wearing a pink hat” is shared knowledge, but not common knowledge.

In the election game (linear city model), two politicians decide on their position on the political spectrum ⟦1, 10⟧; each position has the favour of 10% of the population; people vote for the closest candidate (or one of the closest candidates, at random, in case of a tie): iterative deletion of dominated strategies leads to both candidates choosing a medium position (5 or 6). This also applies to product placement or petrol station location (the best location is next to a competitor, not in an area with no competition).
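The iterative deletion of dominated strategies in the election game can be carried out mechanically. The sketch below assumes that tied voters split evenly and deletes strictly dominated positions only; the function names are illustrative.

```python
def vote_share(a, b, positions=range(1, 11)):
    """Percentage of votes for a candidate at a against a candidate at b.

    Each position holds 10% of the voters; tied voters split evenly.
    """
    share = 0
    for v in positions:
        if abs(v - a) < abs(v - b):
            share += 10
        elif abs(v - a) == abs(v - b):
            share += 5
    return share

def iterated_deletion(strategies):
    """Repeatedly remove strictly dominated positions (the game is symmetric)."""
    remaining = list(strategies)
    while True:
        dominated = [
            a for a in remaining
            if any(all(vote_share(c, b) > vote_share(a, b) for b in remaining)
                   for c in remaining if c != a)
        ]
        if not dominated:
            return remaining
        remaining = [a for a in remaining if a not in dominated]

survivors = iterated_deletion(range(1, 11))
```

Positions 1 and 10 go first (dominated by 2 and 9), then 2 and 9, and so on, until only the two central positions survive.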

In the investment game, each of N players can invest $0 (no loss, no profit) or $10; if more than 90% choose to invest, they receive $5 more, if not, they lose their $10. There are two Nash equilibria: everyone invests, or no one does. It is a coordination game: contrary to the prisoner’s dilemma, there is no dominated strategy, contractual enforcement is not needed, and communication or leadership often suffices to move to the better Nash equilibrium. Bank runs, monopolies (the use of Microsoft products as a de facto standard) or fashion trends are similar examples.

The Cournot duopoly (two firms competing on the quantity produced) is a game of strategic substitutes, the opposite of a coordination game: if you produce more, the other party produces less (in a coordination game, if you produce more, the other party has an incentive to produce more). The Nash equilibrium does not maximize total profit; monopoly or collusion does, but collusion is not a Nash equilibrium: each firm would have an incentive to produce more.

In Bertrand competition, the two firms compete on prices, not quantities: the outcome is similar to perfect competition.

In the candidate-voter model, each voter is a potential candidate, and their political opinion is in [0, 1]; there is a reward for the winner, a cost to run, and a cost proportional to the square of the distance between the voter and the elected candidate. There are many Nash equilibria: a single candidate exactly in the middle, two candidates symmetrically positioned around the centre (but not too far from the centre, otherwise a centrist candidate could run and win). This mimics what happens in real elections: if a new candidate appears on one side, the other side wins.

In the location model, two types of people have to choose to live in one of two cities; they prefer a heterogeneous environment, but want to avoid being isolated among too many people of the other type. The heterogeneous situation is better for everyone, but it is not stable.

Consider n lions, ranked by seniority, and one sheep. Only lion 1 can eat the sheep, but lion 2 will then have the opportunity to eat him in his post-meal nap, and so on for the other lions. Should lion 1 eat the sheep? (Use backward induction and consider the behaviour of the last lion.)

In the game of Nim, there are two piles of stones. You can remove as many stones as you want from either pile, and the player who takes the last stone wins. (The strategy is to make the two piles equal.) There is a second mover’s advantage if the two piles are equal, but a first mover’s advantage otherwise. Here is a 2-dimensional variant: the stones are in an n × m array; choose a stone and remove it and all the stones north-east of it; you lose if you remove the last stone.

Two duellists are walking towards each other; if their skills (probability of hitting the adversary as a function of the distance) are known, when should they shoot? (Not too early (if $p_n < 1 - p_{n+1}$, shooting is a dominated strategy), not too late (use backward induction to show that you should act as soon as $p_n \geqslant 1 - p_{n+1}$).) A similar situation appears in economics, when two companies are developing a new product, want to release it before the other to occupy the market, but do not want to release it too early because it is not completely finished.

The chain store paradox shows the advantage of not being rational (or, at least, having the other party believe you are not): several firms consider entering a new market, currently controlled by a monopolist; if it is more costly for the monopolist to fight the new entrants than to leave them a share of the market, the monopolist should not fight – but if they have the reputation of being irrational (systematically or randomly fighting, even if it is not profitable), they can deter new entrants.

The dictator game (you are given $1, you can give part of it to the other player, who cannot do anything) and the ultimatum game (you are given $1, you can give part of it to the other player, who can accept or reject the offer, in which case no one receives anything) show that we are not always rational.

Bargaining is similar to the ultimatum game, but with several periods, and a discount factor (the value to share decreases with time). The first offer is always accepted; the first mover’s advantage disappears as the number of periods increases; if discounting is almost negligible, and similar (and known) for both players, an even split is optimal. However, if the value for each player is not known, the result of the bargaining is suboptimal.

When designing incentives, do not overlook the strategic effects. Consider two firms competing in a Cournot equilibrium. One has the opportunity to invest in new equipment: it is expensive, but will reduce production costs. If the quantity produced does not change, it may not be profitable, but if the other firm knows that the investment has been made, there is a strategic effect: they will reduce their production, we will increase ours, and the new investment will become profitable. Similar effects appear in the tax code.

In a Cournot duopoly, if firm 2 has a spy that tells them what firm 1 will be doing, and firm 1 knows it, then firm 1 has a first mover’s advantage. For firm 2, more information, more choices can hurt.

Comparison of correlation analysis techniques for irregularly-sampled time series
K. Rehfeld et al. (2011)

The auto-correlation and cross-correlation of irregularly-sampled time series can be estimated by considering the time series as regular; linearly interpolating to have regular time series; binning the observation pairs by lag, i.e., considering a regression

$$(x_t - \bar x)(x_s - \bar x) \sim |t - s|$$

with a rectangular kernel; or using a similar regression with a Gaussian (this is what gives the best results) or a sinc kernel.
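A Gaussian-kernel estimator of this kind can be sketched as a weighted average of products of standardized observations, each pair being weighted by how close its time difference is to the requested lag. The function names and the choice of bandwidth are illustrative assumptions, not the paper's implementation.

```python
import math

def standardize(xs):
    """Centre and scale a series to zero mean and unit (population) variance."""
    n = len(xs)
    mean = sum(xs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return [(x - mean) / std for x in xs]

def kernel_cross_correlation(tx, x, ty, y, lag, bandwidth):
    """Correlation between x(t) and y(t + lag) for irregular sampling times.

    Every pair (x_i, y_j) contributes, with a Gaussian weight that depends on
    the distance between the observed time difference and the lag.
    """
    zx, zy = standardize(x), standardize(y)
    num = den = 0.0
    for ti, xi in zip(tx, zx):
        for tj, yj in zip(ty, zy):
            d = (tj - ti) - lag
            w = math.exp(-0.5 * (d / bandwidth) ** 2)
            num += w * xi * yj
            den += w
    return num / den

# Irregular sampling of a sine wave of period 20.
times = [i + 0.3 * math.sin(i) for i in range(200)]
signal = [math.sin(2 * math.pi * t / 20) for t in times]
acf_lag0 = kernel_cross_correlation(times, signal, times, signal, 0.0, 0.2)
acf_half_period = kernel_cross_correlation(times, signal, times, signal, 10.0, 0.2)
```

The auto-correlation is close to 1 at lag 0 and negative at half a period, as expected for a sine wave; replacing the Gaussian weight with an indicator function gives the rectangular-kernel (binning) variant.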

Ckmeans.1d.dp: optimal k-means clustering in one dimension by dynamic programming

In one dimension, the (usually NP-hard) k-means problem is amenable to dynamic programming (consider $D_{i,m}$, the minimum within-cluster sum of squares when clustering $x_1, \dots, x_i$ into m clusters): the result is the exact optimum, for situations where reproducibility is important.
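The recursion behind $D_{i,m}$ can be written down directly: the last cluster is some run $x_{j+1}, \dots, x_i$ of the sorted data, so $D_{i,m} = \min_j \big( D_{j,m-1} + \mathrm{ssq}(x_{j+1}, \dots, x_i) \big)$. The sketch below is a plain $O(k n^2)$ implementation written for illustration (it is not the package's code) and assumes $k \leqslant n$.

```python
def kmeans_1d(xs, k):
    """Exact 1-dimensional k-means by dynamic programming.

    Returns (total within-cluster sum of squares, list of clusters).
    """
    xs = sorted(xs)
    n = len(xs)
    # Prefix sums give the sum of squares of any run xs[j:i] in O(1).
    s1, s2 = [0.0], [0.0]
    for x in xs:
        s1.append(s1[-1] + x)
        s2.append(s2[-1] + x * x)

    def ssq(j, i):
        total = s1[i] - s1[j]
        return (s2[i] - s2[j]) - total * total / (i - j)

    inf = float("inf")
    # D[i][m]: best cost for the first i points in m clusters;
    # B[i][m]: start index of the last cluster in that solution.
    D = [[inf] * (k + 1) for _ in range(n + 1)]
    B = [[0] * (k + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for m in range(1, k + 1):
        for i in range(m, n + 1):
            for j in range(m - 1, i):
                cost = D[j][m - 1] + ssq(j, i)
                if cost < D[i][m]:
                    D[i][m], B[i][m] = cost, j

    # Backtrack from the full data set to recover the clusters.
    clusters, i = [], n
    for m in range(k, 0, -1):
        j = B[i][m]
        clusters.append(xs[j:i])
        i = j
    clusters.reverse()
    return D[n][k], clusters

cost, clusters = kmeans_1d([10, 1, 12, 2, 20, 3, 11], 3)
```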

Tweets and peers: defining industry groups and strategic peers based on investor perceptions of stocks on Twitter
T.O. Sprenger and I.M. Welpe (2011)

Traders tag their company-related messages with the ticker (e.g., $AAPL, which gets indexed as @MV_AAPL): company co-occurrence can be used to define industry groups; the classification reflects structural changes more quickly than traditional methods (looking at the variance matrix of the returns, looking at analysts in common, etc.).

Zipf’s law unzipped
S.K. Baek et al. (2011)

The omnipresence of Zipf’s law (distribution of city sizes, word frequencies, etc.) can be explained using information theory: when putting M balls into N boxes, find N(k), the number of boxes of size k, to maximize the entropy (this can be reformulated in terms of mutual entropy). Contrary to the law of large numbers, it is not a process of aggregation, but of division.

Nonparametric goodness-of-fit tests for discrete null distributions
T.B. Arnold and J.W. Emerson, R Journal (2011)

The Kolmogorov-Smirnov and Cramér-von Mises tests use the $L^\infty$ or $L^2$ (or generalizations) distances between the cumulative distribution functions as test statistics: for continuous distributions, the distribution of the test statistic does not depend on the hypothesized distribution. For discrete distributions it does: the dgof package can compute the corresponding p-values.

A drunk and her dog: an illustration of cointegration and error correction
M.P. Murray (1994)

A set of random walks

$$x_n - x_{n-1} = u_n$$
$$y_n - y_{n-1} = v_n$$

can be turned into cointegrated random walks by adding error correcting terms

$$x_n - x_{n-1} = u_n$$
$$y_n - y_{n-1} = v_n + a(y_{n-1} - x_{n-1}).$$

When looking for cointegration relations in data, make sure they are meaningful (here, $x_{n-1} - y_{n-1}$ is the distance between the drunk and her dog).

A drunk, her dog and a boyfriend: an illustration of multiple cointegration and error correction
A. Smith and R. Harrison

The model can be generalized by adding a boyfriend

$$x_n - x_{n-1} = u_n$$
$$y_n - y_{n-1} = v_n + a(y_{n-1} - x_{n-1})$$
$$z_n - z_{n-1} = w_n + b(z_{n-1} - x_{n-1}),$$

the drunk’s attraction for her dog or boyfriend, the boyfriend’s aversion for the dog, etc. By looking at the data alone, you cannot always answer the question “who is attracted by whom?”: you only have the subspace of cointegration relations. If the cointegration relations are $\beta' x_n$ (this means that this series is stationary), the error correction model (ECM) is $x_n - x_{n-1} = \alpha\beta' x_{n-1} + u_n$; we can estimate $\alpha\beta'$ (and its rank, i.e., the number of independent cointegration relations), but not $\alpha$ and $\beta$ separately.
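The error-correction recursion for the drunk and her dog can be simulated in a few lines. This is a sketch under stated assumptions: Gaussian steps, a correction coefficient a = −0.5 (any value in (−2, 0) makes the gap mean-reverting) and a fixed seed are illustrative choices.

```python
import random

def simulate_drunk_and_dog(n, a=-0.5, seed=0):
    """Simulate x (the drunk, a pure random walk) and y (the dog).

    The dog follows y_n - y_{n-1} = v_n + a * (y_{n-1} - x_{n-1}):
    with a < 0, it is pulled back towards the drunk at every step.
    """
    rng = random.Random(seed)
    x, y = [0.0], [0.0]
    for _ in range(n):
        u, v = rng.gauss(0, 1), rng.gauss(0, 1)
        x_next = x[-1] + u
        y_next = y[-1] + v + a * (y[-1] - x[-1])
        x.append(x_next)
        y.append(y_next)
    return x, y

x, y = simulate_drunk_and_dog(5000)
gap = [yi - xi for xi, yi in zip(x, y)]
largest_gap = max(abs(g) for g in gap)   # stays small: the gap is stationary
range_of_x = max(x) - min(x)             # grows like sqrt(n): x is not
```

Both paths wander arbitrarily far, but their difference is a stationary AR(1) process with coefficient 1 + a: this is what cointegration means here.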


Stability of the world trade web over time – an extinction analysis
N. Foti et al. (2011)

Yet another study of the world trade web (WTW) dataset, modelling what happens to the trade network if one or more nodes or edges are removed or modified.

Full characterization of the fractional Poisson process
M. Politi et al. (2011)

You can build counting processes (random variables N(A) that count the number of points in a Borel set A) by specifying the distribution of the inter-event times: with exponential inter-event times, this is a Poisson process (Markov, i.e., memoryless), but other distributions, e.g., the heavy-tailed Mittag-Leffler distribution, give a non-Markov process.

Price dynamics in a Markovian limit order market
R. Cont and A. de Larrard (2011)

Most models of the limit order book are only amenable to simulations; by considering a simplified model, with the level-1 order book filled or depleted by Poisson processes for market, limit or cancellation orders, unit order sizes, unit spread, and a prescribed order book distribution after a tick change, one can compute many quantities of interest: distribution of the duration between price changes, autocorrelation of the price changes, volatility, etc.

A stochastic model for order book dynamics
R. Cont et al. (2010)

A more complicated (continuous time) model.

Option pricing and estimation of financial models with R
S.M. Iacus (Wiley, 2011)

This book (and the accompanying sde, yuima, opefimor packages) explains how to estimate continuous-time models with R and use them to price options. The theoretical chapters also contain exercises to ensure that the reader understands the notions introduced.

Telegraph processes (random motion with constant velocity ±c, and direction changes given by a Poisson process) lend themselves to explicit computations (but beware: they have finite variation and allow for arbitrage opportunities). Their limit, as the Poisson intensity increases, is a random walk.

Lévy processes can be defined as (limits of) the sum of a Brownian motion with drift and (several) compound compensated Poisson processes. Alternatively, they are stochastically continuous (intuitively: there can be jumps, but they are random, and if there are infinitely many jumps on a compact (infinite activity), they are not too large) processes with stationary independent increments. They are characterized by their characteristic function. Ito’s formula can be generalized to Lévy processes.

The numDeriv package can compute derivatives, increasing the precision by evaluating the function at more carefully chosen points (e.g., t + h and t − h instead of t and t + h) or with more general methods (Richardson extrapolation).

R already provides dpqr (density, probability (cumulative distribution function, cdf), quantile (inverse of the cdf) and random sample) functions for standard distributions; the fBasics package adds a few more: nig (normal inverse gaussian, i.e., hitting time of a random walk), gh (generalized hyperbolic), stable. Some of those distributions are only defined from their characteristic function (this is the case for infinitely divisible distributions in general: these are the distributions whose characteristic function is given by the Lévy-Khintchine formula). The fast Fourier transform (fft) can be used to move between the cumulative distribution function F (cdf) or the probability density function f (pdf) and the characteristic function ϕ.

$$F(x) = \frac12 - \frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{e^{-itx}}{it}\, \phi(t)\, dt = F(0) - \frac{1}{2\pi} \int_{-\infty}^{\infty} \frac{e^{-itx} - 1}{it}\, \phi(t)\, dt$$

$$f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-itx} \phi(t)\, dt$$
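The inversion formula for the density can be checked numerically. The sketch below uses a plain trapezoidal rule instead of the FFT mentioned above (a deliberate simplification); the function name, the truncation `t_max` and the number of steps `n` are illustrative assumptions.

```python
import cmath
import math

def density_from_cf(phi, x, t_max=40.0, n=4000):
    """Approximate f(x) = (1 / 2 pi) * integral of exp(-i t x) phi(t) dt.

    The integral is truncated to [-t_max, t_max] and computed with the
    trapezoidal rule; phi is the characteristic function E[exp(i t X)].
    """
    h = 2 * t_max / n
    total = 0.0
    for k in range(n + 1):
        t = -t_max + k * h
        weight = 0.5 if k in (0, n) else 1.0
        # The imaginary parts cancel by symmetry: keep the real part only.
        total += weight * (cmath.exp(-1j * t * x) * phi(t)).real
    return total * h / (2 * math.pi)

def normal_cf(t):
    """Characteristic function of the standard normal distribution."""
    return math.exp(-t * t / 2)

f0 = density_from_cf(normal_cf, 0.0)
f1 = density_from_cf(normal_cf, 1.0)
```

For the standard normal distribution, this recovers the usual density, 1/√(2π) at 0. The same function can be applied to distributions known only through their characteristic function, provided ϕ decays fast enough for the truncation to be harmless.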

$$\text{Score} = \frac{\partial \text{LogLik}}{\partial\theta}$$
$$E[\text{Score}] = 0$$
$$\text{Information} = \operatorname{Var}\frac{\partial \text{LogLik}}{\partial\theta} = -E\left[\frac{\partial^2 \text{LogLik}}{\partial\theta^2}\right]$$
$$\operatorname{Var}(\text{Unbiased estimator}) \geqslant \frac{1}{\text{Information}}$$

The mle function can be used to maximize a log-likelihood: the result itself is the same as what you would get with a general-purpose optimizer (optim), but you can call functions such as summary, vcov, confint on the result to have more information (fitdistrplus provides more, moment-based, estimators, and allows censored data).

Non-homogeneous Poisson processes can be simulated with the acceptance-rejection method (“thinning”): sample events with constant intensity λ ⩾ λ(t); for each event, take a random number in [0, λ]; if the number is not in [0, λ(t)], reject the event.

The sde::sde.sim function can sample from the solution of a stochastic differential equation, using the Euler or Milstein scheme (based on first or second order Taylor expansion); the yuima package generalizes this to multi-dimensional processes, Markov switching diffusions and Lévy processes (a Lévy process is a time-changed Brownian motion).

Stochastic models (the drift and sensitivity of a diffusion) can be estimated via maximum likelihood if the transition probability is known (normal, lognormal, CIR, etc.), or via quasi-maximum likelihood, i.e., by replacing the continuous stochastic differential equation with its first order discretization (implementation in the sde and yuima packages).

Interest rates are often modelled as

$$dX_t = f(X_t)\,dt + \sqrt{g(X_t)}\,dB_t$$

where f and g are (Laurent) polynomials (of given degrees and unknown coefficients – or arbitrary functions). They can be fitted (with bias) via 2-stage regression:

$$X_{t+1} - X_t \sim f(X_t)$$
$$\text{res}^2_{t+1} \sim g(X_t)$$
$$X_{t+1} - X_t \sim f(X_t), \text{ with weights } w = 1/g(X_t)$$

The fBasics package provides many functions to fit non-gaussian distributions (nigFit, hypFit, ghFit, stableFit).

The Black-Scholes PDE (partial differential equation), for the price of a European option, is just a consequence of Ito’s formula, for a self-financing portfolio. More generally, since the Gaussian density is a solution of the heat equation, one can show that (functions of) expectations of functions of a random walk are also solutions of some PDE.
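For a European call or put, the Black-Scholes PDE has a well-known closed-form solution, which can be written with the standard library alone. The function names are illustrative; the formula itself is the classical one (no dividends, constant rate and volatility).

```python
import math

def norm_cdf(x):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def black_scholes(spot, strike, rate, sigma, maturity, kind="call"):
    """Black-Scholes price of a European call or put (no dividends)."""
    d1 = (math.log(spot / strike) + (rate + 0.5 * sigma ** 2) * maturity) \
         / (sigma * math.sqrt(maturity))
    d2 = d1 - sigma * math.sqrt(maturity)
    discounted_strike = strike * math.exp(-rate * maturity)
    if kind == "call":
        return spot * norm_cdf(d1) - discounted_strike * norm_cdf(d2)
    return discounted_strike * norm_cdf(-d2) - spot * norm_cdf(-d1)

call = black_scholes(100, 100, 0.05, 0.2, 1.0, "call")
put = black_scholes(100, 100, 0.05, 0.2, 1.0, "put")
# Put-call parity: call - put = spot - discounted strike.
parity_gap = (call - put) - (100 - 100 * math.exp(-0.05))
```

Put-call parity gives a quick sanity check that does not depend on the model.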

The fOptions::GBSOption function computes Black-Scholes option prices.

The equivalent martingale measure can be computed as follows.

$$\lambda = \frac{\mu - r}{\sigma}$$
$$M_t = \exp\left(-\lambda B_t - \tfrac12 \lambda^2 t\right)$$
$$Q(A) = E[\mathbf{1}_A M_T] \quad\text{(martingale measure)}$$
$$W_t = B_t + \lambda t \quad\text{(Brownian motion under } Q\text{)}$$

Monte Carlo option pricing can be sped up by parallelizing the computations, e.g., with the foreach and doSnow or doMC packages.

    library(doMC)     # multicore
    library(foreach)
    registerDoMC(4)   # 4 cores
    p […]

[…] or using regressions to move back in time (Longstaff-Schwartz least squares method). The Monte Carlo estimator is biased, but one can easily compute an upper and a lower bound. There are also various approximations, e.g., with a sequence of Bermudan options (which can only be exercised at specific times) and Richardson’s extrapolation, or by finding an ODE approximately satisfied by the early exercise premium, i.e., the difference between the prices of the American and European calls (quadratic approximation).

If the price is an (exponential) Lévy process, the equivalent martingale measure is no longer unique. Upper and lower bounds for the no-arbitrage price are known, but the interval is very large. One can try to transform the density f of $Z_1$ in some way, so that the density remains infinitely divisible, but turns the process into a martingale; for instance, an Esscher transform, $f_\theta(x) \propto e^{\theta x} f(x)$, […]

[…] instead of $\lim \phi(x) = \phi(x_0)$, […] then the point is a (local) minimum.

The Kuhn and Tucker conditions (aka Lagrange multipliers) translate these conditions into useable formulas (by explicitly describing the tangent cone) when the set of admissible solutions is of the form

$$K = \{x \in \mathbf{R}^n : \forall i \in [\![1, p]\!] \;\; b_i(x) \leqslant 0\}.$$

The Lagrangian is $L(x, \lambda) = \phi(x) + \lambda \cdot b(x)$. If x is a solution, then there exists $\lambda \in \mathbf{R}^p$, $\lambda \geqslant 0$, such that

$$DL(x, \lambda) = 0$$
$$\lambda \cdot b(x) = 0.$$

(You may need to remove a few pathologies, such as linear dependencies between the $Db_i(x)$.)

Calculus of variations is still interested in minimization problems, but this time, we are looking for a function that minimizes some integral:

$$\text{Find } x : [t_0, t_1] \longrightarrow \mathbf{R}^n,\; C^1, \text{ to minimize } \int_{t_0}^{t_1} F(t, x(t), \dot x(t))\, dt \text{ such that } x(t_0) = x_0 \text{ and } x(t_1) = x_1.$$

(The final condition can be replaced by a set of equalities or inequalities on the coordinates of $x(t_1)$.) Penalized regression can be formulated in this way. The […]

$$\text{[…] to minimize } \int_{t_0}^{t_1} F(t, x(t), u(t))\, dt \text{ where } \dot x(t) = f(t, x(t), u(t)) \text{ and } x(t_0) = x_0.$$

Since u controls x via the ODE, this is sometimes called an optimal control problem. There are equivalent formulations where the cost function only depends on the final value of x (Mayer formulation). One usually imposes some conditions on f to avoid explosions, e.g., Lipschitz and linear growth. The Hamiltonian of the problem is

$$H(t, x, u, p) = F(t, x, u) + p \cdot f(t, x, u).$$

If u is a solution of the Lagrange problem, then there exists a function p, called the adjoint state, such that

$$\dot p(t) = -\frac{\partial H}{\partial x}(t, x(t), u(t), p(t))$$
$$p(t_1) = 0 \quad\text{(transversality condition)}$$
$$H(t, x(t), u(t), p(t)) = \min_v H(t, x(t), v, p(t)).$$

These equations (if you want to solve them, also add $\dot x = \partial H/\partial p$, which comes from the definition of H) are called the Pontryagin principle (or the hamiltonian system, if you remove the one with the minimum). By adding a few more conditions, you can have a set of sufficient conditions.

One can use dynamic programming to solve the same (Lagrange) problem. Let J(t, ξ, u) be the cost on $[t, t_1]$

if x(t) = ξ and J(t, ξ) the best cost on $[t, t_1]$ if x(t) = ξ. Then, the dynamic programming principle states that

$$J(t, \xi) = \inf_u \left( \int_t^s F(\tau, x(\tau), u(\tau))\, d\tau + J(s, x(s)) \right).$$

The dynamic programming equation (a PDE involving J: the Pontryagin principle was a system of ODEs) is a necessary condition, under some continuity assumptions:

$$\inf_v H(t, \xi, v, \pi) = -\frac{\partial V}{\partial t}$$

(where V is the best cost J(t, ξ) and $\pi = \partial V/\partial \xi$). (Those continuity assumptions are not reasonable, but you can get rid of them if you consider viscosity solutions.)

Let $X^{t,x}_\cdot$ be the solution of

$$dX = \mu\, du + \sigma\, dW$$
$$X^{t,x}_t = x.$$

For $f : \mathbf{R}^n \longrightarrow \mathbf{R}$, let

$$Af(x) = \lim_{h \to 0} \frac{E[f(X^{t,x}_{t+h})] - f(x)}{h}.$$

The operator A is called the generator of the diffusion,

$$A = \mu \frac{\partial}{\partial x} + \frac12 \operatorname{tr}\left( \sigma\sigma' \frac{\partial^2}{\partial x\, \partial x'} \right).$$

The expectation $v(t, x) = E[g(X^{t,x}_T)]$ satisfies

$$\frac{\partial v}{\partial t} + Av = 0$$
$$v(T, \cdot) = g.$$

(This result can be generalized to $\dot v + Av + kv + f = 0$ with initial conditions (Cauchy problem and Feynman-Kac representation) or to $Au - ku + d = 0$ with boundary conditions (Dirichlet problem).) This gives a correspondence between expectations and solutions of PDEs: we can use PDE tools (e.g., finite elements) to estimate expectations or use stochastic tools (e.g., Monte Carlo simulations) to solve PDEs.

Let X be the stochastic process describing the price of an asset, u be a trading strategy (control process, to be determined), J(t, x, u) the expected utility on [t, T] of the wealth resulting from the trading strategy u if the initial price is $X_t = x$, and V(t, x) the expected utility of the optimal strategy. By changing X, we can assume that the cost J only depends on the final value: $J(t, x, u) = E[g(X_T) \mid X_t = x]$. The control process can be: deterministic (open loop control), Markovian ($u_s$ depends on the current price $X_s$, but not on previous prices) or adapted to $\mathcal{F}^X$ (feedback control: it can depend on all previous prices). If V is smooth (it is, under restrictive but reasonable assumptions: Lipschitz, bounded),

$$V(t, x) = \sup_u E[V(t + \varepsilon, X^u_{t+\varepsilon})]$$

[…] have the Hamilton-Jacobi-Bellman (HJB) partial differential equation. If V is not smooth, we still have inequalities of the form (for s > t)

$$V(t, x) \geqslant \sup_u E[\liminf V(s)]$$
$$V(t, x) \leqslant \inf_u E[\limsup V(s)]$$

and would look for viscosity solutions of the PDE.

The optimal stopping problem, i.e., finding a stopping time τ to maximize $J(t, x, \tau) = E[g(X^{t,x}_\tau)]$ where X is a diffusion

$$dX = \mu\, dt + \sigma\, dW$$
$$X_t = x$$

(not a controlled process: that would be a mixed stochastic control and stopping problem) can also be solved via dynamic programming. Once $V(t, x) = \sup_\tau J(t, x, \tau)$

is known, the optimal strategy follows: just check if you are in the stopping region {(t, x) : V(t, x) = g(x)} or the continuation region {(t, x) : V(t, x) > g(x)}. If V is smooth (this is the case under restrictive but reasonable assumptions), $V(t, x) = \sup E[\mathbf{1}_{\tau}$ […]

The actuar package provides some functionalities for actuarial science, which is just “risk management” with a different vocabulary:
– […]
– […]
– […] $t > 0 : U(t) < 0$, where

$$U(t) : \text{surplus at time } t$$
$$u : \text{initial surplus}$$
$$c_k : \text{premium collected at time } k$$
$$N(t) : \text{number of claims in } [0, t]$$
$$X_n : \text{value of the } n\text{th claim}$$
$$U(t) = u + \sum_{k \leqslant t} c_k - \sum_{n \leqslant N(t)} X_n$$

from the distributions of N and X;
– Simulations of hierarchical models.

An overview of random number generation
C. Dutang, RMetrics Workshop 2009

The default random number generator (RNG) in R (the runif function) is the Mersenne twister (MT), a linear recurrence RNG combined with some bitwise operations. The randomtoolbox package implements other RNGs (mainly generalized MT: WELL, SFMT) and quasi-random number generators (aka low discrepancy sequences): […]

[…]
P. Chaussé, RMetrics Workshop 2009

The generalized method of moments (GMM) fits a model to your data by considering several quantities or moments (e.g., average, variance, median, quantiles, higher moments, truncated moments, etc.) and fine-tuning the parameters so that the theoretical quantities be as close to the observed ones as possible. More precisely, find several moments $g = (g_1, \dots, g_n)$ so that $E[g(\theta, X)] = 0$ and compute

$$\hat\theta = \operatorname*{Argmin}_\theta \| \bar g(\theta) \|^2$$
$$\bar g(\theta) = \frac1T \sum_t g(\theta, x_t).$$

Since the moments are not “independent”, the Euclidean distance is not the best choice: you will prefer

$$\hat\theta = \operatorname*{Argmin}_\theta \bar g(\theta)' W \bar g(\theta)$$

for some W. The most efficient estimator is obtained with

$$W = \Big( \sum_{s \in \mathbf{Z}} \operatorname{Cov}\big(g(\theta, x_t), g(\theta, x_{t+s})\big) \Big)^{-1},$$

where the sum can be estimated (with an HAC (heteroscedasticity and autocorrelation consistent) estimator) as

$$\hat\Omega(\theta^*) = \sum_{|s| \leqslant T-1} k_h(s)\, \hat\Gamma_s(\theta^*)$$
$$\hat\Gamma_s(\theta^*) = \frac1T \sum_t g(\theta^*, x_t)\, g(\theta^*, x_{t+s})'$$
$$k_h : \text{some kernel}$$

For instance, one could try any of the following:
– Estimate θ* with W = 1; compute $\hat\Omega(\theta^*)$; estimate $\hat\theta$;
– Iterate the above until convergence;
– Optimize properly:

$$\hat\theta = \operatorname*{Argmin}_\theta \bar g(\theta)'\, \hat\Omega(\theta)^{-1}\, \bar g(\theta).$$

The limiting distribution is known. In finance or economics, you can use the CAPM, the utility function and other meaningful quantities to define moments. In R, just provide your moments and your data to the gmm function. I did not understand what GEL (generalized empirical likelihood) was.

A tale of two theories: Reconciling random matrix theory and shrinkage estimation as methods for covariance matrix estimation
B. Rowe, RMetrics Workshop 2009

Both methods try to remove the noise in the spectrum of the sample covariance matrix: random matrix theory (RMT) violently shrinks the low values to zero; shrinkage smoothly shrinks all values to their mean (or to a predefined value). Combining the two methods turns out to be a bad idea.

The structure and function of complex networks
M.E.J. Newman, arXiv:cond-mat/0303516

Given a network, one can compute the following numeric quantities (invariants):
– Harmonic mean geodesic distance between all pairs (the arithmetic mean would have problems with non-connected vertices);
– Clustering coefficient:

$$C = \frac1n \sum_i C_i, \qquad C_i = \frac{\text{Number of triangles connected to vertex } i}{\text{Number of triples centered on vertex } i};$$

– Degree distribution;
– Network resilience: progressively remove vertices, starting with the highest degree ones, and plot the mean distance between vertices versus the fraction of vertices removed;
– Assortativity coefficient, to measure mixing patterns between vertex types

$$r = \frac{\operatorname{tr} e - \| e^2 \|}{1 - \| e^2 \|}$$

where $e_{ij}$ is the fraction of edges between vertices of types i and j; those types can be numeric (e.g., the degree, leading to the degree correlation) and even continuous.

Here are a few models of random graphs:
– Poisson random graphs (choose the degree of each vertex; join the vertices at random), aka Erdős–Rényi graphs;
– Configuration model (replace the Poisson distribution by another one); they fail to account for transitivity;
– Exponential random graphs: choose a few numeric properties of graphs, $\varepsilon_1, \dots, \varepsilon_k$, e.g., number of vertices, number of vertices of a given degree, etc.; choose a few numbers $\beta_1, \dots, \beta_k$ (inverse temperatures); sample graphs with probability

$$P(G) \propto \exp\Big( -\sum_i \beta_i \varepsilon_i(G) \Big).$$

These are amenable to (Monte Carlo) simulations, but not analytical derivations, except in the case of Markov graphs (the absence/presence of edges are independent unless the edges share a vertex). Unfortunately, they model transitivity in an unnatural way: there are too many complete cliques.
– Small world model: put all the vertices on a circle (or any other low-dimensional lattice), connect each vertex to its k nearest neighbours, rewire edges at random with probability p (a variant adds edges with probability p: this can be seen as the average of a lattice and a random graph);
– The Barabási-Albert model is a model of network growth, i.e., it tries to explain where the graph properties come from: add vertices one by one, link them to previous nodes, with a higher probability for nodes with a high degree (preferential attachment); the earlier Price model was similar but used directed edges.

Fully mixed epidemiological models, such as SIR (susceptible-infected-recovered: after catching the disease, people (die or) become immune)

$$\dot S = -\beta I S$$
$$\dot I = \beta I S - \gamma I$$
$$\dot R = \gamma I$$

or SIS (reinfections are possible)

$$\dot S = -\beta I S + \gamma I$$
$$\dot I = \beta I S - \gamma I$$

can be generalized to a network. Other processes on networks include percolation and search.
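The fully mixed SIR equations above can be integrated with a simple Euler scheme. This is a sketch: the parameter values, the step size and the function name are illustrative assumptions, and S, I, R are fractions of the population.

```python
def simulate_sir(beta, gamma, s0, i0, dt=0.01, t_end=200.0):
    """Euler integration of dS = -beta*I*S, dI = beta*I*S - gamma*I, dR = gamma*I.

    Returns the list of (time, S, I, R) along the trajectory.
    """
    s, i, r = s0, i0, 1.0 - s0 - i0
    path = [(0.0, s, i, r)]
    steps = int(round(t_end / dt))
    for k in range(1, steps + 1):
        new_infections = beta * i * s * dt
        new_recoveries = gamma * i * dt
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        path.append((k * dt, s, i, r))
    return path

# beta / gamma = 5: almost everyone ends up infected, then recovered.
path = simulate_sir(beta=0.5, gamma=0.1, s0=0.99, i0=0.01)
t_final, s_final, i_final, r_final = path[-1]
peak_infected = max(i for _, _, i, _ in path)
```

The network versions replace the product I·S (everyone meets everyone) with a sum over the edges between infected and susceptible vertices.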


strucchange: an R package for testing for structural change in linear regression models
A. Zeileis et al.

The strucchange package implements tests based on the following empirical fluctuation processes:
– OLS-CUSUM: cumulated sum of the residuals (a Brownian bridge);
– Recursive CUSUM: cumulated sums of the residuals, where the ith residual comes from the model estimated on the first i observations;
– Recursive MOSUM: idem, on a moving window;
– Fluctuations: parameters of the model estimated on an expanding window, suitably renormalized (contrary to the residuals, this is a multidimensional process);
– Moving estimate: idem, on a moving window.
In case of a structural change, these processes “fluctuate more”; they can be compared with boundaries of the form $b(t) = \lambda$ (MOSUM: it is stationary), $b(t) = \lambda\sqrt{t}$ (Brownian motion) or $b(t) = \lambda\sqrt{t(1-t)}$. However, the crossing probability is easier to compute for linear boundaries: we lose some power at the beginning and end of the interval.

Instead of those significance tests, which have no clear alternative hypothesis, you may prefer an F test, which tests against the alternative that there are two different models, one before and one after an (unknown) breakpoint; they compare the sum of squared residuals for the model with and without a breakpoint at observation $\tau$ with an F statistic $F_\tau$ and aggregate them as $\sup F_\tau$, $\operatorname{Mean} F_\tau$ or $\log \operatorname{Mean} \exp F_\tau$.

r 1 instead of a constant.

Permutation tests for structural change
A. Zeileis and T. Hothorn

The SupF (or SupLM: the test statistic can be transformed to look like an F or a Student T one) tests whether the average of a series of random variables $(X_{k/n})_{k \in ⟦1,n⟧}$ is a constant $\mu$ against the alternative hypothesis $H_\pi$ that it is $\mu_1$ before some breakpoint $\tau \in [0, 1]$ and $\mu_2$ after, by considering the statistic $\mathrm{RSS}_\pi / \mathrm{RSS}_0$, suitably renormalized,
$$Z_\pi = \sqrt{(n - 1)\left(1 - \frac{\mathrm{RSS}_\pi}{\mathrm{RSS}_0}\right)},$$
where $\mathrm{RSS}_0$ and $\mathrm{RSS}_\pi$ are the sums of squared residuals under $H_0$ and $H_\pi$, and aggregating them as
$$D = \operatorname{Max}_\pi Z_\pi.$$
This statistic can be compared with the following distributions:

  f Int, // Function argument
  a : Int,
  b : => Int, // Not evaluated immediately
  ): Int = { ... }

Disp $(x_i - x_j)_{i,j}$
– Fourth central moments;
– Tyler’s shape matrix.
Outliers are often spotted by looking at the Mahalanobis distance for one measure of dispersion: besides (before) a plot of the first eigencoordinates, a scatterplot of those distances for several measures of dispersion may also be helpful.

Tools for exploring multivariate data: the package ICS
K. Nordhausen et al., Journal of Statistical Software (2008)

  class Foo(a:Int, b:Int) {...} // private members
  class Bar(...) extends Foo { ... override def toString ... }
  val x = new Foo(1,2);

A trait is akin to a Java interface, designed to be added to a class, i.e., designed for multiple inheritance; abstract classes are designed for single inheritance.

A more application-oriented presentation of invariant coordinate selection (ICS). The ICS transformation can be described as:
– Whiten the data wrt the first measure of dispersion: $x \mapsto (x - 1'\bar{x})\,\mathrm{Disp}^{-1/2}$ (stopping after that step is not satisfactory: the square root is not unique and the whitening transformation is not affine-equivariant);
– Perform a principal component analysis (PCA) on the whitened data, wrt the second measure of dispersion.

Case classes provide implicit constructors, implicit getter methods and pattern matching, i.e., implicit tests on the type of objects; this is just syntactic sugar, but it increases code readability.

  abstract class A
  case class A1(n:Int) extends A
  case class A2(n:Int, m:Int) extends A
  ...
  e match {
    case A1(n) => n
    case A2(n,m) => n+m
  }

Type parameters (aka templates or generic types)


can also specify lower and upper bounds on types. class Set[A] // A implements the Ordered trait: class Set[A