Pierre Michaud∗ [email protected]

Abstract

This paper presents OESO, a new model of RL (Reinforcement Learning) aimed at logistics applications. OESO provides three (sub-)models: structural, functional, and dynamical. Two functional rules are derived from these models: a delta signal production rule and a learning rule. The OESO study was notably inspired by a natural logistics process: feeding. Nutriments flow from sources to downstream consumers, notably the brain. The oesophagus is one component of the nutritive system; it transmits nutriments to the stomach. In the OESO framework, such segments are called oesomeres, i.e. segments which carry value items. OESO segments are structured in graphs, from upstream sources to downstream consumers. Segments are fitted with value sensors, whose signals converge to the OESO cognitive device, which outputs the delta signal. Interestingly, the OESO production and learning rules (with exponential-decay delays) are equivalent to Temporal-Difference TD(λ) rules, at least in logistics applications (notably glucose feeding, which involves the dopamine system). Stimulating an OESO cognitive device reproduces the dynamics of the phasic signal recorded on pre-trained vertebrates’ dopamine neurons during basic RL scenarios, notably unexpected, predicted, and omitted rewards. OESO suggests that learning may spread the OESO structure upstream, by recruiting cognitive segments which apparently transmit value to its foremost entry ports. The OESO model quantifies logistics, including value transformation. It should benefit numerous domains, notably biology and thus medicine, neurosciences, neuroeconomics, psychology of reward (and probably of other vital values: risk of damage, care, sex, ...), robotics, and human activities related to logistics (among others: production and economy). In summary, OESO shifts from the phenomenal level (external signals) to the underlying mechanical level.

1 Reading guide

This paper is the “Salon des Refusés” version of the initial OESO paper. Information about previous and future versions, and about web pages related to the oesomeric model (English Wikipedia page, and the oesomeric model home page), is provided at the end of the “References” section. New or uncommon terms are listed and defined in §14.1. Vectors are indicated by bold letters (e.g. Θ, φ). $U^T$ is the transpose of a given vector U.

2 Motivation

A major application of TD (Temporal-Difference) methods is Reinforcement Learning (RL). The TD(λ) algorithm (§3.2) is currently the favorite model to explain the phasic dopamine signal dynamics. While in many TD model applications (e.g. “Gridworld” [4]) states evolve in the usual 3D space, the conventional TD model derivation considers time, but not space. By contrast, the foundations of the new (yet unpublished) oesomeric model involve both space and time. Its production rule ($\delta_\Pi(t)$ signal production) and its learning rule are derived from a structural (spatial) model. In the oesomeric framework, segments fitted with value sensors themselves transmit value to downstream consumers. The oesomeric model study was notably inspired by a natural logistics process: feeding. Nutriments flow from sources to downstream consumers, e.g. the brain. The oesophagus (or “esophagus”) is one component of the digestive system. After food passes through the oesophagus, it enters the stomach. Fig. 1 illustrates the glucid path along rat glucid oesomeres, from mouth to brain.

∗ http://oesomeric.free.fr/

29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

Figure 1: The glucid path along rat glucid oesomeres, from mouth to brain.

Moreover, DA specialists focusing on feeding neurobiology (see e.g. [14]) depict a “gut–brain dopamine axis” fitted with several caloric sensors scattered all along the path from mouth to brain; dopamine sensing may function as a central caloric sensor. Interestingly, the oesomeric model production and learning rules are equivalent to TD rules, at least in logistics applications (notably glucose feeding, which involves the dopamine system). But while the TD derivation considers the phenomena (the external signals on the time axis), the oesomeric model derivation starts from the internal underlying mechanisms, in a spatiotemporal framework: it unfolds the spatial dimension of logistics processes, including animal feeding. Progression towards vital or accessory value sources is an everyday activity. The oesomeric model connects the RL field (including phasic DA modeling) with its underlying mechanisms: value progression along oesomeres.

As an illustration of the benefits provided by the oesomeric model, considering the spatiotemporal fate of rewards after their ingestion explains the absence of DIP after ingestion, and it suggests that a systematic uniform temporal discount is unnecessary. The oesomeric model also lends itself to many logistics domains in various fields, notably biology and physiology, psychology, and many human activities where progress matters; hence there are many oesomeric (value progression) spaces to explore.

3 Background

3.1 The phasic dopamine signal

A classical way to make an animal expect a reward (e.g. delivery of a bolus of sucrose) is to train it with several cue-reward pairings, with a constant delay D between a cue (e.g. a tone) and a signal indicating the actual and immediate reward delivery (classically denoted “US” i.e. unconditioned stimulus). After this training, the announcing cue becomes a “CS” (a conditioned signal).

Figure 2: The dopamine signal during URPROR scenarios. Left: the real DA signal. Right: the abstract phasic DA signal $\delta_{DA}(t)$. DAP stands for “phasic DA”, PIC is a positive impulse, DIP is a negative impulse; NIL means “no significant PIC or DIP” during a given time interval.

Let $\delta_{DA}(t)$ denote the abstract phasic DA signal, i.e. the second phasic component signal [19] produced by most of the vertebrates’ midbrain DA (dopamine) neurons, devoid of biological details (e.g. baseline, neuronal delays, etc.). Let us call “URPROR” the three following basic DAP scenarios [10]:

• “UR” (Unexpected Reward) elicits a PIC at US when the animal receives an unexpected reward.
• “PR” (Predicted Reward) elicits a PIC at CS and no significant DAP signal (NIL) at US when the reward is actually provided on time, i.e. after the usual delay between cue at CS and reward at US.
• “OR” (Omitted Reward) elicits a PIC at CS and a DIP on omission of an expected reward.

Fig. 2 illustrates the URPROR scenarios.

3.2 Temporal-Difference learning

3.2.1 Conventional TD(λ)

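The conventional TD(λ) algorithm is defined by the following update equations [18]:

$$\delta_t = R_{t+1} + \gamma\,\Theta_t^T \phi_{t+1} - \Theta_t^T \phi_t \tag{1}$$

$$e_t = \gamma \lambda\, e_{t-1} + \phi_t \tag{2}$$

$$\Theta_{t+1} = \Theta_t + \alpha\, \delta_t\, e_t \tag{3}$$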

where:
• the scalar $\delta_t$, the TD error, is the TD device output signal at t,
• the scalar $R_{t+1}$ is the actual reward received between the previous time step t−1 and the current time step t,
• $\phi_t$ is the feature vector at t,
• $\Theta_t$ is the weight vector computed during the previous time step t−1,
• $e_t$ is the eligibility-trace vector computed during the current time step t,
• the parameter γ is the temporal discount factor, specifying how future rewards are weighted with respect to the immediate reward,
• the parameter λ ∈ [0, 1] is an exponential decay constant; in the TD framework it is called the trace-decay parameter [4],
• the parameter α is the learning rate.

Let us call $\delta_{TD}(t) := \delta_t$ the output signal of the conventional TD(λ) algorithm. The temporal discount factor γ specifies how future rewards are weighted with respect to the immediate reward. In this paper (and more generally in the oesomeric framework), we assume no systematic uniform temporal discount (see §11.1), thus γ = 1. Equation (1) is the conventional TD(λ) production rule, by which a TD(λ) device produces the $\delta_{TD}(t)$ output signal. Equations (2) and (3) are the conventional TD(λ) learning rules. Setting λ := 0, we get the conventional TD(0) learning rule equation:

$$\Theta_{t+1} = \Theta_t + \alpha\, \delta_t\, \phi_t \tag{4}$$

3.2.2 TD error and Prediction Error

Let us recall how the TD algorithm computes a prediction error, denoted $\delta_{TD}(t)$ in this paper, between expected and experienced reward acquisition. The dot product

$$V_t := \Theta_t^T \phi_t \tag{5}$$

is the sum of rewards expected by the TD device after t: $V_t$ is the expectation of the sum $R_{t+2} + R_{t+3} + \dots$

Assume γ := 1 (§11.1). Consider the temporal-difference $TD_t$ defined by:

$$TD_t := V_{t-1} - V_t \tag{6}$$

$TD_t$ is the difference between:
• the sum $V_{t-1}$ of rewards which were expected after t−1, and
• the sum $V_t$ of rewards currently expected after t.

$TD_t$ is the “temporal-difference” at t, i.e. the “error or difference between temporally successive predictions” at t−1 and t [2]. Hence the name of the Temporal-Difference algorithm. If the prediction is correct, a reward of magnitude $TD_t$ should be delivered between t−1 and t. The prediction error $\delta_{TD}(t)$ quantifies the discrepancy between the actual reward ($R_{t+1}$) and the expected reward ($TD_t$):

$$\delta_{TD}(t) := R_{t+1} - TD_t = R_{t+1} - [V_{t-1} - V_t] = R_{t+1} + V_t - V_{t-1} \tag{7}$$

$\delta_{TD}(t)$, the TD error, is a prediction error [2][4]. It is the difference, at t, between:
• the reward $R_{t+1}$ actually obtained between t−1 and t, and
• the reward $TD_t$ which the TD device expected to receive between t−1 and t.

3.3 DAP/TD and the Reward Prediction Error hypothesis

When a TD(λ) device is trained and then stimulated with scenarios such as URPROR, $\delta_{TD}(t)$ shows a striking similarity with the phasic component of the dopamine signal recorded when a trained animal is stimulated by the same scenarios (see e.g. [7]): $\delta_{TD}(t) \approx \delta_{DA}(t)$. Hence a highly active research domain, denoted hereafter “DAP/TD”. This similarity led many DAP and TD authors (e.g. [3, 5, 13]) to postulate the RPE (Reward Prediction Error) hypothesis: $\delta_{DA}(t)$ and $\delta_{TD}(t)$ indicate a discrepancy between reward prediction and actual experience. Section §11 compares TD and OESO models, notably their δ signal production and learning rules.
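As an illustration of the TD(λ) update equations (1)-(3) and of the URPROR behaviour described above, here is a minimal Python sketch. It is not taken from the paper: the trial layout (`T`, `T_CS`, `T_US`), the one-hot "time since CS" features, the unit reward and the parameter values are all assumptions chosen for illustration, and the reward is attached to the transition from t to t+1 as in equation (1).

```python
import numpy as np

def td_lambda_step(theta, e, phi_t, phi_next, reward,
                   alpha=0.1, gamma=1.0, lam=0.9):
    """One TD(lambda) update, equations (1)-(3)."""
    delta = reward + gamma * (theta @ phi_next) - theta @ phi_t  # (1)
    e = gamma * lam * e + phi_t                                  # (2)
    theta = theta + alpha * delta * e                            # (3)
    return delta, theta, e

T, T_CS, T_US = 10, 2, 6   # hypothetical trial layout (time steps)

def features(t):
    """Assumed one-hot 'time since CS' features; zero outside [T_CS, T_US)."""
    phi = np.zeros(T_US - T_CS)
    if T_CS <= t < T_US:
        phi[t - T_CS] = 1.0
    return phi

def run_trial(theta, rewarded=True, learn=True):
    """Return (list of delta_t for t = 0..T-2, possibly updated theta)."""
    e = np.zeros_like(theta)
    deltas = []
    for t in range(T - 1):
        # unit reward attached to the transition that reaches T_US
        r = 1.0 if (rewarded and t + 1 == T_US) else 0.0
        delta, new_theta, e = td_lambda_step(
            theta, e, features(t), features(t + 1), r)
        if learn:
            theta = new_theta
        deltas.append(delta)
    return deltas, theta

# UR: untrained device, unexpected reward
ur, _ = run_trial(np.zeros(T_US - T_CS), learn=False)

# Training with cue-reward pairings, then PR and OR probes without learning
theta = np.zeros(T_US - T_CS)
for _ in range(500):
    _, theta = run_trial(theta)
pr, _ = run_trial(theta, rewarded=True, learn=False)
orr, _ = run_trial(theta, rewarded=False, learn=False)

print("UR delta at US:", ur[T_US - 1])
print("PR delta at CS, US:", pr[T_CS - 1], pr[T_US - 1])
print("OR delta at US:", orr[T_US - 1])
```

Under these assumptions the trained device should show a positive impulse at CS, a near-zero signal at US when the reward is delivered, and a negative impulse when it is omitted.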

4 Introduction to the oesomeric model

4.1 Initial insight

The initial insight about the oesomeric cognitive model resulted from the idea of a natural temporal bridge between CS and US. Imagine a sensor $\dot C_{CS}$ that measures the amount $R_{CS}(t)$ of a given substance (e.g. sugar concentration) in its surroundings, which I call $\dot S_{CS}$ (see Fig. 3). Assume that $\dot S_{CS}$ is a segment, fitted with the $\dot C_{CS}$ sensor and three ports: a CS entry port, and two exit ports, namely a transmission port and an omission port. An admission into a segment occurs when an item enters this segment. The term “item” refers to everything whose entry into a given segment is detected by its sensor. An emission from a segment occurs when an item exits from this segment. An emission is a transmission and/or an omission (omission is defined below). Let $dR_{CS}(t)$ denote the variation of $R_{CS}(t)$ during time step t: $dR_{CS}(t) := R_{CS}(t) - R_{CS}(t-1)$. $R_{CS}(t)$ is a level signal, $dR_{CS}(t)$ is a variation signal. Now imagine another signal $dR_{US}(t)$ which indicates variations of the same valuable substance as $dR_{CS}(t)$, but inside another segment $\dot S_{US}$. For example, imagine that $\dot S_{US}$ is the nutritive duct from the mouth to a brain sensor (Fig. 1). This long segment is made of several shorter abutting segments. DA specialists focusing on feeding neurobiology (see e.g. [14]) depict a “gut–brain dopamine axis” fitted with several caloric sensors scattered all along the path from mouth to brain; dopamine sensing may function as a central caloric sensor.

Figure 3: A two-segment structural architecture, and the “missions”: admission, transmission, omission, emission.

An axiogogue is a structured set of real oesomeric segments (or simply real segments) $\dot S_k$ (k=0···K) conveying value items to consumers downstream. Axiogogues have a graph structure, with entry and exit ports as nodes, and segments as edges. Biological axiogogues are usually directed acyclic graphs, with a tree structure, but not necessarily; for example, glucids and lipids follow two different paths (portal system for the former, lymphatic system for the latter) to reach the blood circulatory system and eventually the brain. Let $\dot S_{CU}$ denote the axiogogue $\dot S_{CS} + \dot S_{US}$. In the present case (Fig. 3), the transmission port of $\dot S_{CS}$ abuts an admission port of $\dot S_{US}$ so that some items are directly transmitted from $\dot S_{CS}$ to $\dot S_{US}$. Some items may exit from $\dot S_{CS}$ (and from $\dot S_{CU}$) via the omission port of $\dot S_{CS}$ without entering $\dot S_{US}$: they are omitted from $\dot S_{CS}$. Sometimes an item emitted from $\dot S_{CS}$ is partly transmitted to other segments of $\dot S_{CU}$ (to $\dot S_{US}$ in the present case), and partly omitted from the axiogogue $\dot S_{CU}$ during the same time interval: this is a partial transmission. Assume that items admitted into $\dot S_{US}$ are steadily emitted later from $\dot S_{US}$. For example, the brain steadily consumes glucose which was ingested earlier as a bolus and rapidly absorbed. The intrabody nutritive segment is a RASE segment: rapid admission (usually), slow emission (structurally). Exponential-decay RASE segments emit at a rate proportional to their current level, with rate constant λ. Fig. 4 shows the typical level response $EP^{\lambda}_{t_0}(t)$ (called hereafter exp-pulse) and the variation response $dEP^{\lambda}_{t_0}(t)$ of an exponential-decay RASE segment to a rapid admission at $t_0$:

$$EP^{\lambda}_{t_0}(t) := \begin{cases} e^{-\lambda (t - t_0)} & \text{if } t \ge t_0 \\ 0 & \text{otherwise} \end{cases} \tag{8}$$
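The exp-pulse level response of equation (8) and its variation can be sketched in a few lines of Python. The decay rate `lam = 0.1`, the admission time `t0 = 5` and the unit time step are hypothetical values chosen only to make the asymmetry visible: a large positive variation at admission, followed by small negative variations during the slow emission.

```python
import math

def exp_pulse(t, t0, lam):
    """Level response EP (eq. 8): exp(-lam * (t - t0)) for t >= t0, else 0."""
    return math.exp(-lam * (t - t0)) if t >= t0 else 0.0

def d_exp_pulse(t, t0, lam):
    """Variation response dEP(t) := EP(t) - EP(t-1), a one-step difference."""
    return exp_pulse(t, t0, lam) - exp_pulse(t - 1, t0, lam)

lam, t0 = 0.1, 5   # hypothetical decay rate and admission time
levels = [exp_pulse(t, t0, lam) for t in range(12)]
variations = [d_exp_pulse(t, t0, lam) for t in range(12)]

for t in range(12):
    print(t, round(levels[t], 3), round(variations[t], 3))
```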

Hence, while admission of a significant item into $\dot S_{US}$ elicits a $dR_{US}(t)$ PIC, its later omission from $\dot S_{US}$ does not elicit a significant $dR_{US}(t)$ DIP. Pharmacokineticists are familiar with such asymmetric dynamics. The CS-US delay produced by the artificial learning apparatus may be considered as the ratio L/V; L simulates the spatial length of the segment between the cue (at CS) and the reward receptors (at US), and V is the usual speed of reward items in this segment (Fig. 5). Now consider a $\Pi_{CU}$ device (Fig. 6), which produces the signal $\delta_\Pi(t) = dR_{CS}(t) + dR_{US}(t)$. The dynamics of $\delta_\Pi(t)$ are similar to those of $\delta_{DA}(t)$ when $\Pi_{CU}$ is stimulated by each of the three URPROR scenarios (Fig. 6), notably: in the PR scenario, $\delta_\Pi(t) \simeq 0$ during the $\dot S_{CS}$ to $\dot S_{US}$ transmission; in the OR scenario, reward omission elicits a significant $\delta_\Pi(t)$ DIP. Hereafter I define the structural, functional, and dynamical oesomeric cognitive models sketched above, aiming to derive simple $\delta_\Pi(t)$ production and learning rules.
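The behaviour of the two-segment device described above can be sketched as a small simulation. This is only an illustration under stated assumptions: a unit item, unit weights (so that the output is simply the sum of the two variation signals), a CS segment that holds the item from CS until US and then emits it in one step, and a hypothetical slow-emission rate `LAM` for the US segment.

```python
import math

T, T_CS, T_US = 12, 2, 6   # hypothetical time layout (time steps)
LAM = 0.05                 # hypothetical slow-emission rate of the US segment

def levels(scenario):
    """Return the level signals (R_CS, R_US) for 'UR', 'PR' or 'OR'."""
    r_cs = [0.0] * T
    r_us = [0.0] * T
    if scenario in ("PR", "OR"):
        for t in range(T_CS, T_US):      # item held in S_CS from CS until US
            r_cs[t] = 1.0
    if scenario in ("UR", "PR"):
        for t in range(T_US, T):         # rapid admission, slow emission (RASE)
            r_us[t] = math.exp(-LAM * (t - T_US))
    return r_cs, r_us

def delta_pi(scenario):
    """delta_Pi(t) = dR_CS(t) + dR_US(t), with dU(t) := U(t) - U(t-1)."""
    r_cs, r_us = levels(scenario)
    total = [a + b for a, b in zip(r_cs, r_us)]
    return [0.0] + [total[t] - total[t - 1] for t in range(1, T)]

for name in ("UR", "PR", "OR"):
    d = delta_pi(name)
    print(name, "at CS:", d[T_CS], "at US:", d[T_US])
```

With these assumptions, UR gives a positive impulse at US only, PR gives a positive impulse at CS and nothing at US (the emission from the CS segment cancels the admission into the US segment), and OR gives a positive impulse at CS followed by a negative impulse at US.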

Figure 4: An exponential-decay RASE segment, the exp-pulse level signal after a rapid admission, and its variation (see text).

4.2 Transitivity

CS announces a future reward. Announcement is theoretically transitive: if cue K announces a later event J, and if prior cue L announces K, then L announces J. Thus announcement learning is theoretically transitive. Indirect experimental evidence (e.g. fMRI records) suggests a high-order (> 2) DAP learning transitivity. However, direct evidence (DA neuron recording during simple scenarios) is lacking, both for and against high-order conditioning.

4.3 Beyond CS and US

So far, comparison between $\delta_{TD}(t)$ and $\delta_{DA}(t)$ has mainly focused on the classical DAP/TD stimulation frame described above: rewards are usually delivered into (or just before) the mouth, and CS is a proximal cue delivered shortly before the oral admission: both CS and US lie in a proximal pre-oral and oral space domain. However, the oesomeric structural model and theoretical announcement learning transitivity, strengthened by growing neurobiological evidence, suggest an extension of the classical CS-US span, both downstream and upstream. Downstream, the oral reward signal (classically denoted US) may also act as a reward cue for later reward admissions into downstream segments (stomach, upper intestine, ... and eventually the brain). Production of the $\delta_{DA}(t)$ signal related to post-ingestive segments by an oesomeric cognitive device seems plausible. Oesomeric learning seems more speculative in this intra-body domain (where neuronal wiring could be innate). Upstream, a signal usually considered as a cue (classically denoted CS) could also be considered (at least theoretically) as a reward announced by more distal upstream cues, which may themselves be rewards announced by even more distal cues.

Figure 5: The extended nutritive system: pre-oral domain (from external cues to ingestion) + intra-body domain (from ingestion to consumption). The CS-US delay produced by the artificial learning apparatus may be considered as the ratio L/V (CS-US segment length / reward velocity).

5 OESO functional and structural architectures

5.1 OESO devices

An OESO device is a natural or artificial device which implements the oesomeric model. A driven device usually includes 3 functions: a cognitive function including perception (sensing), a motor function (involving e.g. motor neurons), and a driving function which uses the cognitive function output to act on the motor function. A driven device may embed zero, one, or several oesomeric cognitive devices, each one tracking variations of a specific valuable substance, indicated by a reference signal, e.g. $dR_0^{glucose}(t)$ or $dR_0^{water}(t)$. In this paper, I consider only one oesomeric cognitive device, based on a reference variation signal $dR_0(t)$. I assume that the cognitive components of the oesomeric device are functionally separated from the driving and motor parts, and I focus on the cognitive component of the oesomeric device, the oesomeric cognitive device.

5.2 Discrete time frame

In this paper I present a discrete time oesomeric model. Time is divided into time steps of constant duration dt. For simplicity, let dt = 1. Operator d computes the difference between a signal U(t) at the current time step t, and the same signal U(t−1) at the previous time step: dU(t) := U(t)−U(t−1). Where unambiguous, I omit the currently considered time step t, writing U instead of U(t).

Figure 6: The $\Pi_{CU}$ device output signal $dR_{CS}(t) + dR_{US}(t)$ matches the abstract phasic DA signal $\delta_{DA}(t)$. Left: the real DA signal during URPROR scenarios. Right top: structural architecture (segments $\dot S_{CS}$ and $\dot S_{US}$ and their sensors $\dot C_{CS}$ and $\dot C_{US}$); functional architecture (the $\Pi_{CU}$ device). Right bottom: the level (blue) signal $R_{CS} + R_{US}$ during URPROR scenarios, and its variation (red) signal $dR_{CS} + dR_{US}$.

5.3 The OESO cognitive device

An oesomeric cognitive device is composed of an oesomeric core function, denoted Π♥, and a set of cognitive input operators, notably delay operators (§9). Π♥ has K+1 inputs: one reference signal $dR_0(t)$, and K recruitable inputs $dR_k(t)$, k=1···K. Π♥ inputs are produced by the cognitive input operators, which transform N perceived signals $dX_i(t)$ into K+1 signals $dR_j(t)$. At time step t, Π♥ gets K+1 input signals $dR_k(t)$, and computes K weights $W_k(t)$ and the output signal $\delta_\Pi(t)$ using $W_0$ and the K weights $W_k(t-1)$ computed at t−1. $W_0$ weights the reference signal $dR_0(t)$; it is assumed to be a strictly positive constant. The output signal $\delta_\Pi(t)$ is thus a linear function of several input signals $dR_k(t)$. The linear function approximation makes sense in the domain of logistics: eating half an apple provides half the nutritive value of the entire apple. In this paper, I assume that value sensing and cognitive processing are linear, at least in their variation domain. Assuming that $dR_0(t)$ indicates the variations of a set $v_0$ of substances inside a real segment, let $\dot S_0$ denote this segment, and $\dot S_A$ the axiogogue ended downstream by $\dot S_0$.

5.4 Cognitive segments

Let $S_k$ (k=0···K) denote the sensitivity (spatial) domain of $dR_k(t)$. $S_k$ (k=1···K) are cognitive OESO segments (or simply “cognitive segments”). $S_0$ and cognitive segments are denoted with undotted ‘S’; they “represent” the axiogogue reality, transformed (sometimes altered) by the cognitive input operators. $d\tilde V_k(t)$ is the product of a $dR_k(t)$ signal by its weight $W_k$ computed at time step t−1:

$$d\tilde V_k(t) := W_k(t-1) \times dR_k(t) \tag{9}$$

$d\tilde V_k(t)$ is the estimation at t, by Π♥, of the variation, between instants t−1 and t, of the net value held by $S_k$ (net value is defined below: §7.1).

5.5 The OESO delta production rule

Let $S_\Pi(t)$ (or simply $S_\Pi$) denote the set of $S_k$ segments with a non null weight. Π♥ output is a sum $d\tilde V_\Pi(t)$, also denoted $\delta_\Pi(t)$, of all $d\tilde V_k(t)$ signals (k=0···K):

$$\delta_\Pi(t) := d\tilde V_\Pi(t) := \sum_{S_\Pi} d\tilde V_k(t) = \sum_{k=0}^{K} d\tilde V_k(t) = \sum_{k=0}^{K} W_k(t-1) \times dR_k(t) \tag{10}$$

or, in vector notation (see §1):

$$\delta_\Pi(t) := d\tilde V_\Pi(t) := \mathbf{W}(t-1)^T\, d\mathbf{R}(t) \tag{11}$$

$d\tilde V_\Pi(t)$ is the estimation at t, by Π♥, of the variation, between instants t−1 and t, of the net value held by $S_\Pi$.

5.6 An alternative functional architecture

The Π♥ functional architecture defined above is a Π♥_{d×Σ} one: differencing (d), then weighting (×), then summing (Σ). Π♥_{×Σd} is an equivalent alternative functional architecture, since the d, ×, and Σ operators are linear. Π♥_{×Σd} produces both $\tilde V_\Pi(t)$ and $d\tilde V_\Pi(t) := \delta_\Pi(t)$. In this paper I concentrate on the Π♥ architecture necessary to produce $d\tilde V_\Pi(t)$: Π♥ stands for Π♥_{d×Σ}.
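The equivalence of the two functional architectures can be checked numerically with a short sketch. The level signals and weights below are hypothetical, and the weights are assumed to be held constant over the window (a mature device); with time-varying weights the two orderings would no longer coincide exactly.

```python
import numpy as np

def delta_d_then_weight_sum(R, W):
    """d x Sigma: difference each level signal, then weight and sum (eq. 10-11)."""
    dR = np.diff(R, axis=0)          # dR_k(t) = R_k(t) - R_k(t-1)
    return dR @ W

def delta_weight_sum_then_d(R, W):
    """x Sigma d: weight and sum the levels into V(t), then difference."""
    V = R @ W                        # estimated level V_Pi(t)
    return np.diff(V)                # dV_Pi(t) = delta_Pi(t)

rng = np.random.default_rng(0)
R = rng.random((20, 4))              # hypothetical levels R_k(t), K+1 = 4 signals
W = np.array([1.0, 0.5, 0.25, 0.0])  # hypothetical constant weights, W_0 > 0

same = np.allclose(delta_d_then_weight_sum(R, W),
                   delta_weight_sum_then_d(R, W))
print("architectures agree:", same)
```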

6 Oesomeric cognitive learning: an overview

Before diving into the segment-level quantitative analyses provided by the next sections, let us consider a qualitative and global-level overview of the learning process.

6.1 Π♥ Initialization

Initially, all weights $W_k$ are set to 0, except $W_0$, which is assumed to be a strictly positive constant. The oesomeric cognitive device is then a novice one, $S_\Pi$ is truncated to $S_0$, and $d\tilde V_\Pi(t) = d\tilde V_0(t) = W_0 \times dR_0(t)$, i.e. $d\tilde V_\Pi(t)$ is initially the measure of gross value variations inside $S_0$.

6.2 Cognitive segments recruitment

In order to increase the farsightedness of the oesomeric cognitive device, Π♥ recruits cognitive segments. To recruit a given cognitive segment $S_k$, Π♥ increases its weight $W_k$ from 0 to a strictly positive weight, thus adding $S_k$ to $S_\Pi$. Progressively, oesomeric cognitive learning adds segments to $S_\Pi$, thus spreading $S_\Pi$ upstream.

6.3 Segments strata, learning backpropagation

Training a novice oesomeric cognitive device first recruits stratum 1 segments, i.e. segments which apparently transmit directly to $S_0$ (which belongs to stratum 0). Then, when $S_\Pi$ contains stratum 1 segments, learning may add stratum 2 segments, i.e. segments which apparently transmit directly to stratum 1 segments, and so on. Hence net value estimation learning backpropagates upwards, starting from the reference segment $S_0$. Qualitatively, oesomeric cognitive learning progressively spreads $S_\Pi$ upstream, by recruiting cognitive segments at the foremost entry ports of the $S_\Pi$ structure. Quantitatively, oesomeric cognitive learning adjusts $W_k$ weights, providing Π♥ and the driving function with an estimation of the net value afforded by an $S_\Pi$ acquisition as soon as it occurs. An oesomeric cognitive device is mature when its weights no longer vary significantly in a stationary context.

6.4 The prognodendron

Let us call prognodendron the mental “representation” of all the segments belonging to $S_\Pi$ and their spatial structure. The term “prognodendron” could be understood as “foreknowledge tree”. The prognodendron is a cognitive object; it is a kind of cognitive map. Prognodendrons representing natural axiogogues usually have a tree structure. Prognodendrons may include graph parts, notably when they represent axiogogues related to human activities. While human brains are able to imagine or even draw prognodendrons, a basic oesomeric cognitive device is space-agnostic (section §10.1), unaware of its prognodendron structure. Keeping the prognodendron construct in mind is useful to understand and to study the oesomeric model and its numerous applications. “Π” (and “PGD” in the NIPS 2016 paper [20]) stand for “prognodendron”.

7 Net value estimation

In this section, I show how the oesomeric cognitive device computes an estimation of the net value (the estimated value) afforded by an admission into a given segment.

7.1 Gross value, net value

Because segments sometimes omit, the net value of an admission into an upstream segment is lower than its gross value, which is the value acquired when this item is directly admitted into the reference segment $\dot S_0$. Let $V^{net}(\dot b, t_X)$ denote the net value of an item $\dot b$ located in $\dot S_k$ at $t_X$. $V^{net}(\dot b, t_X)$ is the integral of variations $d\tilde V_0(t)$ after $t_X$, letting $\dot b$ follow its “natural” fate departing from $\dot S_k$, including transmissions, possible partial omissions, and maybe transformations (section §10.1), ceteris paribus (neither admissions into nor emissions from $\dot S_A$ of other items):

$$V^{net}(\dot b, t_X) := \sum_{t=t_X+1}^{\infty} d\tilde V_0(t) \tag{12}$$

Thus the net value of an item $\dot b$ at $t_X$ is the total gross value that will eventually be admitted at $\dot S_0$ after $t_X$ following its natural fate, ceteris paribus.

7.2 Estimated value

Net value measure is only available after the duration needed for $\dot b$ to “naturally” flow downstream from $\dot S_k$ to $\dot S_0$. This duration is usually bounded. However, a driven device needs to know the value afforded by an admission as soon as it occurs, in order to reinforce (if immature, or else adjust) its potential causes shortly afterwards: the weight of an upstream segment that has just emitted, or a just-triggered action which seems to favor this admission.

The strength of such reinforcements should depend on transmission evidence, which itself depends both on the magnitude of potential causes (emissions or actions) and on the magnitude of the noticed effect (an admission into $S_\Pi$). Π♥ deals with these temporal credit assignment and magnitude issues by backpropagating a net value estimation capability (stored in $W_k$ weights), from stratum 1 segments, up to distal upstream segments, as explained below.

7.3 Target weight

Consider a recruitable signal $dR_k(t)$ and its corresponding real ($\dot S_k$) and cognitive ($S_k$) segments. Let ELTS[$S_k$, $t_E$] denote a time step $t_E$ such that $dR_k(t_E) < 0$: from the Π♥ point of view, an item may have been emitted at $t_E$ from $\dot S_k$; it is an “emission-like time step” (ELTS). Assume that an item $\dot b_E$ is really emitted from $\dot S_k$ at $t_E$: $dR_k(t_E) < 0$. Let $\overline S_k$ denote the complement of $S_k$ in $S_\Pi$:

$$\overline S_k := S_\Pi - S_k \tag{13}$$

$\dot{\overline S}_k$ is the real segment corresponding to $\overline S_k$. At $t_E$, a non null part $\dot b_T$ of $\dot b_E$ is transmitted to $\dot{\overline S}_k$, thus eliciting a strictly positive variation $d\overline R_k(t_E) > 0$. The remaining part $\dot b_O = \dot b_E - \dot b_T$ is omitted from $\dot S_\Pi$. Let $\tilde V(\dot b, t)$ denote the estimated net value of an item $\dot b$ at t. $\tilde V(\dot b_O, t_E) = 0$, because only cognitive segments belonging to $S_\Pi$ have non null weights. Thus $\tilde V(\dot b_E, t_E) = \tilde V(\dot b_T, t_E) + \tilde V(\dot b_O, t_E) = \tilde V(\dot b_T, t_E)$. Assume the following (unrealistic: see below) ceteris paribus condition: nothing other than the emission of $\dot b_E$ from $\dot S_k$ happens at $t_E$ (including a non null transmission to $\dot{\overline S}_k$). Thus

$$d\tilde V_k(t_E) = -\tilde V(\dot b_E, t_E) \tag{14}$$

and

$$d\overline{\tilde V}_k(t_E) = \tilde V(\dot b_T, t_E) = \tilde V(\dot b_E, t_E) = -d\tilde V_k(t_E) \tag{15}$$

Let $W_k^*(t_E)$ denote the following ratio:

$$W_k^*(t_E) = \frac{d\overline{\tilde V}_k(t_E)}{-dR_k(t_E)} \tag{16}$$

$W_k^*(t_E)$ is the target weight (for $W_k$, according to the current experience at $t_E$). Noticing that

$$d\tilde V_k(t_E) = -d\overline{\tilde V}_k(t_E) = \frac{d\overline{\tilde V}_k(t_E)}{-dR_k(t_E)} \times dR_k(t_E) \tag{17}$$

then

$$d\tilde V_k(t_E) = W_k^*(t_E) \times dR_k(t_E) \tag{18}$$

7.4 One-shot learning

Assume $W_k^*(t_E)$ is constant at every ELTS[$S_k$, $t_E$]. Denote $W_k^*$ this constant. Thus, at every ELTS[$S_k$, $t_E$]:

$$d\tilde V_k(t_E) = W_k^* \times dR_k(t_E) \tag{19}$$

In that case, Π♥ could perform a one-shot learning, storing the constant $W_k^*$ in $W_k$ at the first experienced ELTS[$S_k$, $t_E$]:

$$W_k(t_E) \leftarrow W_k^* = \frac{d\overline{\tilde V}_k(t_E)}{-dR_k(t_E)} \tag{20}$$
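The target weight ratio and the one-shot storage step can be illustrated with a few lines of Python. The function name `target_weight` and the numbers used (an emission of 2 units from the segment, a gain of 3 value units in the rest of the prognodendron) are hypothetical and only serve to make the ratio concrete.

```python
def target_weight(dV_complement, dR_k):
    """Target weight: value gained by the complement divided by -dR_k.

    Only defined at an emission-like time step, i.e. when dR_k < 0.
    """
    if dR_k >= 0:
        raise ValueError("not an emission-like time step (need dR_k < 0)")
    return dV_complement / (-dR_k)

# Hypothetical experience: S_k emits 2 units, the rest of S_Pi gains 3 value units.
dR_k = -2.0
dV_complement = 3.0

W_k = target_weight(dV_complement, dR_k)   # one-shot storage of the target weight
dV_k = W_k * dR_k                          # estimated value lost by S_k
delta = dV_k + dV_complement               # total estimated variation in S_Pi

print("W_k =", W_k, " dV_k =", dV_k, " delta =", delta)
```

With the stored weight, the estimated loss in the emitting segment exactly balances the gain in the complement, so the total variation is null for this experience.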

7.5 Target weight variability

But the target weight $W_k^*(t_E)$ is seldom constant. For example, consider a boolean segment $\dot S_{dice}$, i.e. a segment which either transmits or omits (it never partially transmits). $\dot S_{dice}$ gives you \$6 each time you get a “6” (win), and \$0 otherwise (void). Thus $W^*_{dice}(t_{win}) = \$6$, $W^*_{dice}(t_{void}) = \$0$, and on average $\overline{W^*_{dice}} = \$1$. To deal with $W_k^*$ variability, Π♥ could compute a moving average $\overline{W_k^*}(t)$ of the various $W_k^*(t_{E_i})$ experienced during recent ELTS[$S_k$, $t_{E_i}$]. It could then store $\overline{W_k^*}(t)$ in the $W_k$ weight at each ELTS[$S_k$, $t_{E_i}$]:

$$W_k(t_{E_i}) \leftarrow \overline{W_k^*}(t_{E_i}) \tag{21}$$

Section §8 addresses this issue: deriving a simple cognitive learning rule.

7.6 Ceteris paribus, appearance

The above rationale is correct if both the numerator and the denominator of

$$W_k^*(t_E) = \frac{d\overline{\tilde V}_k(t_E)}{-dR_k(t_E)} \tag{22}$$

are elicited only by the emission of $\dot b_E$ from $S_k$, ceteris paribus. But $W_k^*(t_E)$ may be wrong if other items are admitted into or emitted from $S_\Pi$ at $t_E$, notably in case of transmission illusion, when a fortuitous coincidence happens at $t_E$ between an omission of $\dot b_E$ from $\dot S_k$ and the admission of another item $\dot b_A$ into $\dot{\overline S}_k$. Π♥ is not aware of what really happens at $t_E$. Hence it cannot distinguish between a real transmission and a transmission illusion. In both cases it notices an apparent transmission, i.e. $dR_k(t_E) < 0$ and $d\overline{\tilde V}_k(t_E) > 0$. Averaging recently experienced $W_k^*(t_{E_i})$ usually cleans inappropriate adjustments due to such illusions.

7.7 Acquisition

Thanks to $W_k$ adjustment at previous time steps, Π♥ obtains an estimation

$$d\tilde V_k(t) = W_k(t-1) \times dR_k(t) \tag{23}$$

of net value variation inside $S_k$ at each time step t: either a net value loss during emission-like time steps $t_E$ ($d\tilde V_k(t_E) < 0$), or a net value acquisition during admission-like time steps $t_A$ ($d\tilde V_k(t_A) > 0$), or no variation. More interesting for a driving device using $d\tilde V_\Pi(t) = \delta_\Pi(t)$ as a training signal, a positive $\delta_\Pi(t)$ indicates:
• either a net value acquisition into $S_\Pi$ at t, in which case the driving device should reinforce a recently triggered action, which seems to favor such a net value acquisition,
• or a better than average transmission inside $S_\Pi$. Better and worse than average transmissions occur frequently (see the $S_{dice}$ illustration above).
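The dice illustration of §7.5 gives a concrete instance of better and worse than average transmissions. The sketch below assumes a mature weight of \$1 for the dice segment, a unit item emitted at each throw, and a fair six-sided die; the function name `delta_pi_dice` is hypothetical.

```python
def delta_pi_dice(face, w_dice=1.0):
    """Output signal at an emission from S_dice holding one unit item.

    dR_dice = -1 (emission); the rest of S_Pi gains $6 on a "6", else $0.
    """
    dR_dice = -1.0
    dV_complement = 6.0 if face == 6 else 0.0
    return w_dice * dR_dice + dV_complement

# Target weights per face, and their average over a fair die
targets = [6.0 if face == 6 else 0.0 for face in range(1, 7)]
mean_target = sum(targets) / len(targets)

# Output signal for each face, and its average over a fair die
deltas = [delta_pi_dice(face) for face in range(1, 7)]
mean_delta = sum(deltas) / len(deltas)

print("mean target weight:", mean_target)
print("delta on win:", delta_pi_dice(6), " delta on void:", delta_pi_dice(3))
print("mean delta:", mean_delta)
```

Under these assumptions a win is a better than average transmission (positive signal), a void is a worse than average one (negative signal), and the signal averages out to zero once the weight equals the average target weight.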

8 The OESO cognitive learning rule

This section proposes an OESO cognitive learning rule. This paper focuses on the cognitive component of the OESO device (cf. §5.1). Proposing an OESO driving learning rule is outside its scope. In a nutshell, such a learning rule would reinforce actions which apparently favor acquisitions of “positive” value items (such as nutriments). It could also reinforce actions which apparently favor emissions of “negative” value items (such as risks of damage). The following subsections analyze and discuss two classical methods of learning rule derivation:
• (§8.1) The moving average method, which seems inconvenient for the OESO framework.
• (§8.2) The stochastic gradient descent method, which provides the proposed OESO cognitive learning rule.

8.1 Moving average

Recall the target weight formula (8.1): Wk∗ (tE ) =

dV˜ k (tE ) −dRk (tE )

Wk∗ (tE ) is usually variable (§7.5). A classical method of dealing with sample variability in non-stationary contexts is to update an exponential moving average Wk∗ (t) at each ELTS[Sk , tE ]: Wk (tE ) = Wk (tE −1) + a × [Wk∗ (tE )−Wk (tE −1)]

(24)

The learning rate a is set such that 0 < a 1. Let dWk (tE ) := Wk (tE )−Wk (tE −1), and ∆Wk∗ (tE ) := Wk∗ (tE )−Wk (tE −1). Hence dWk (tE ) = a × ∆Wk∗ (tE ). ∆Wk∗ (tE )

=

dV˜k (tE ) dV˜k (tE ) − −dRk (tE ) dRk (tE )

=

Therefore: dWk (tE ) = a ×

dV˜k (tE ) + dV˜k (tE ) −dRk (tE )

=

dV˜Π (tE ) (25) −dRk (tE )

dV˜Π (tE ) −dRk (tE )

(26)

Note that $dR_k(t_E)$ is negative at every ELTS[$S_k$, $t_E$]. Thus $\Delta W_k^*(t_E)$, $dW_k(t_E)$, and $d\tilde{V}_\Pi(t_E)$ have the same sign, therefore $d\tilde{V}_\Pi(t_E)$ provides the sign of the $W_k$ adjustment at $t_E$. If $d\tilde{V}_{\bar{k}}(t_E) > -d\tilde{V}_k(t_E)$ during an ELTS[$S_k$, $t_E$], then $W_k$ should be reinforced. If $d\tilde{V}_{\bar{k}}(t_E) < -d\tilde{V}_k(t_E)$, then $W_k$ should be attenuated. Hence, $W_k(t)$ of a mature variable-$W_k^*$ segment varies around $W_k^*$. Recall the $\dot{S}_{dice}$ illustration above; after training, $W_{dice}$ is mature: $W_{dice}(t) \simeq \$1$, $\Delta W_{dice}^*(t_{win}) \simeq +\$5$, and $\Delta W_{dice}^*(t_{void}) \simeq -\$1$. A low learning rate ($a \ll 1$) dampens adjustment noise. As stated above (26):

$$dW_k(t_E) = a \times \frac{d\tilde{V}_\Pi(t_E)}{-dR_k(t_E)} \qquad (27)$$
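To make the dependence of update (27) on the emission magnitude concrete, here is a minimal Python sketch. The function name `moving_average_step` and all numbers are illustrative assumptions; in particular, the $6 payoff on a win is not stated in the text and is only chosen so that the mature weight is $1 and the surprise on a win is +$5, as in the dice illustration.

```python
def moving_average_step(w_k, dv_pi, dr_k, a=0.1):
    """One moving-average update of W_k at an emission event.

    Implements dW_k = a * dV_Pi / (-dR_k), as in (26)/(27).
    dr_k must be negative (an emission from segment S_k).
    """
    if dr_k >= 0:
        raise ValueError("the update only applies at an emission (dR_k < 0)")
    return w_k + a * dv_pi / (-dr_k)


# Dice illustration (assumed numbers): mature weight $1, a win pays $6.
w_dice = 1.0
dr_dice = -1.0                          # one item emitted from the dice segment
dv_win = 6.0 + w_dice * dr_dice         # dV_Pi on a win: 6 - 1 = +5
dv_void = 0.0 + w_dice * dr_dice        # dV_Pi on a void throw: 0 - 1 = -1

w_after_win = moving_average_step(w_dice, dv_win, dr_dice)
w_after_void = moving_average_step(w_dice, dv_void, dr_dice)

# Same surprise (+5) attributed to a 100x smaller emission: 100x larger step.
w_after_tiny = moving_average_step(w_dice, dv_win, dr_k=-0.01)
```

With these assumed numbers, a unit emission moves the weight from 1.0 to 1.5 on a win, whereas the same surprise attributed to a hundredfold smaller emission moves it to about 51.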

But if $dW_k(t_E)$ were inversely proportional to $dR_k(t_E)$, the $W_k$ adjustment would be highly sensitive to tiny emissions and to $dR_k$ noise. Moreover, when two cognitive segments compete to “explain” an admission (the “effect”), this kind of adjustment would favor the smaller cause (that with the smaller apparent emission). Therefore, a learning rule performing such a moving average of $W_k^*(t_E)$ should be avoided.

8.2

Stochastic gradient descent

A better learning rule, which takes into account the above rationale and the magnitude guideline related to transmission evidence (§7), is derived by applying the stochastic gradient descent method. Recall (10) that:

$$\delta_\Pi(t) := \sum_{k=0}^{K} W_k(t-1) \times dR_k(t) \qquad (28)$$

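hence

$$\frac{\partial \delta_\Pi(t)}{\partial W_k(t-1)} = dR_k(t) \qquad (29)$$

and:

$$\frac{\partial [\delta_\Pi(t)]^2}{\partial W_k(t-1)} = 2\,\delta_\Pi(t)\,\frac{\partial \delta_\Pi(t)}{\partial W_k(t-1)} = 2\,\delta_\Pi(t)\,dR_k(t) \qquad (30)$$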

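Hence, descending this gradient (with the constant factor 2 absorbed into the learning rate $\alpha$), we get the following learning rule:

$$W_k(t_E) = W_k(t_E-1) + \alpha\,[-dR_k(t_E)]\,\delta_\Pi(t_E) \quad \text{(at each ELTS } t_E\text{)} \qquad (31)$$

8.3

The OESO cognitive learning rule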

To get a more concise learning rule formula, we may use the rectifier bracket notation (suggested by [6]) to implement the “adjust only at ELTS” condition:

$$\lfloor u \rfloor_+ := u \text{ if } u \ge 0, \text{ or else } 0 \qquad (32)$$

Finally, we get the following OESO cognitive learning rule:

$$W_k(t) = W_k(t-1) + \alpha\,\lfloor -dR_k(t) \rfloor_+ \times \delta_\Pi(t) \qquad (33)$$

Thanks to the rectifier bracket notation, this rule is applicable at each time step, whatever the sign of $dR_k(t)$. But it should not apply to $W_0$, which is supposed to be a strictly positive constant. The OESO cognitive learning rule equation (33) is equivalent to the conventional TD(0) learning rule equation (4), except for the rectifier bracket term (but see §11.3).

8.4

Discussion: dynamistics

The OESO learning equation (33) is fully dynamistic: both terms $dR_k(t)$ and $d\tilde{V}_\Pi(t_E)$ are variations. The TD rule is halfway between statistics and dynamistics. Translating the conventional TD(0) learning rule to OESO notation gives (§11.3.1):

$$W_k(t) = W_k(t-1) + \alpha\,R_k(t-1) \times \delta_{TD}(t) \qquad (34)$$

While $\delta_{TD}(t)$ is a variation, $R_k(t-1)$ is usually considered as a level (the feature $k$ of state $s$ at $t-1$) [4]. See §11 for the TD(λ) case. [6] proposes a dynamistics learning rule, in which changes of the temporal stimulus representation are multiplied by the prediction error, similar to the earlier proposal of [1]. As far as I know, proposals of differential learning mechanisms are founded on neuronal modeling and animal learning, or on philosophical and mathematical directions [1], but not (up to now) on a biologically plausible structural model such as the OESO one.
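The contrast between rules (33) and (34) can be sketched on a single time step. This is only an illustration under stated assumptions: the helper names (`rectify`, `delta_pi`, `oeso_update`, `td0_update`) and the two-segment scenario are hypothetical, and the TD delta is simply taken equal to the OESO delta.

```python
def rectify(u):
    """Rectifier bracket (32): u if u >= 0, or else 0."""
    return u if u >= 0 else 0.0


def delta_pi(w, dr):
    """Delta signal: weighted sum of variations, sum_k W_k(t-1) * dR_k(t)."""
    return sum(wk * drk for wk, drk in zip(w, dr))


def oeso_update(w, dr, alpha=0.1):
    """Rule (33): each weight is gated by its own rectified variation."""
    delta = delta_pi(w, dr)
    w_new = [wk + alpha * rectify(-drk) * delta for wk, drk in zip(w, dr)]
    w_new[0] = w[0]                      # W_0 is a constant, never adjusted
    return w_new, delta


def td0_update(w, r_prev, delta_td, alpha=0.1):
    """Rule (34): each weight is gated by the previous level R_k(t-1)."""
    return [wk + alpha * rk * delta_td for wk, rk in zip(w, r_prev)]


# Assumed scenario: one item leaves cue segment 1 and enters reference segment 0.
w = [1.0, 0.0]                           # W_0 = 1 (fixed), W_1 not yet recruited
r_prev = [0.0, 1.0]                      # levels R_k(t-1)
r_now = [1.0, 0.0]                       # levels R_k(t)
dr = [now - prev for now, prev in zip(r_now, r_prev)]   # variations: [+1, -1]

w_oeso, delta = oeso_update(w, dr)
w_td = td0_update(w, r_prev, delta)
```

In this particular step the emitted amount equals the previous level of segment 1, so both rules produce the same weights; they differ whenever a level is present without varying, since (34) still adjusts the weight while (33) does not.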

9

Cognitive input operators

So far I have considered only axiogogues with abutting segments, each providing a $dR_x(t)$ signal to Π♥. Now consider an axiogogue with the following chain of segments, upstream to downstream: $\dot{S}_k$, $\dot{S}_m$, $\dot{S}_j$. Assume that $\dot{S}_k$ and $\dot{S}_j$ provide input signals $dR_k(t)$ and $dR_j(t)$ to Π♥, and that $dR_j(t)$ has already been recruited. $\dot{S}_m$ is located between $\dot{S}_k$ and $\dot{S}_j$, but it does not provide an input signal to Π♥: $\dot{S}_m$ is a mute segment, from Π♥'s point of view. Consequently, Π♥ can recruit neither $S_m$ nor $S_k$.

One way to deal with this situation is to produce a recruitable delayed signal $dR_m(t)$, set by each emission of an item from $\dot{S}_k$ (upon each $dR_k(t)$ DIP), and reset after the travel duration $D_m$ of items through $\dot{S}_m$. But $D_m$ is not known a priori, and it may vary. As a simple solution, $dR_m(t)$ could be an exp-pulse triggered by each $dR_k(t)$ DIP, with a decay lasting beyond the usual range of $D_m$. A more accurate but more expensive solution uses a set of several delay operators $S_{m_i}$ (see [12]), which are all triggered at each $dR_k(t)$ DIP, setting every signal $R_{m_i}(t)$, and which decay with various delays from set to decay, such that the $R_{m_i}(t)$ decays elicited by a given $dR_k(t)$ DIP overlap and cover the range of usual $D_m$ durations. Plug the set of derivatives $dR_{m_i}(t)$ as inputs of Π♥. If an item is actually emitted from $\dot{S}_k$ and then admitted into $\dot{S}_j$ after a usual duration, one or several delayed signals $dR_{m_i}(t)$ triggered by the emission from $\dot{S}_k$ may be recruited, and then $S_k$ will also be recruited. Resettable delay operators should prevent a $\delta_\Pi(t)$ DIP at the usual reward delivery time, when value transmission occurs earlier than usual [12] (clarifying DA ramps [15], which have a similar timescale, may help to address this issue).

Various types of other cognitive input operators may be useful upstream of Π♥. Notably, static input operators may combine perceptions, e.g. performing logical functions (AND, OR, NOT, XOR, ...).
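The bank of delay operators described in this section can be sketched as follows. This is a minimal illustration, not the operators of [12]: the function names, the per-step retention factors, and the unit pulse height are all assumptions.

```python
def exp_pulse_bank(trigger_times, retentions, n_steps):
    """Levels R_mi(t) of a bank of resettable exp-pulse delay operators.

    Every operator is set to 1 at each trigger time (a dR_k DIP); between
    triggers, operator i keeps a fraction retentions[i] of its level per step.
    """
    triggers = set(trigger_times)
    levels = []
    for keep in retentions:
        row, level = [], 0.0
        for t in range(n_steps):
            level = 1.0 if t in triggers else level * keep
            row.append(level)
        levels.append(row)
    return levels


def variations(levels):
    """Variation signals dR_mi(t) = R_mi(t) - R_mi(t-1), taking R_mi(-1) = 0."""
    return [[row[t] - (row[t - 1] if t > 0 else 0.0) for t in range(len(row))]
            for row in levels]


# Assumed bank: three operators from fast to slow decay, one DIP at t = 2.
levels = exp_pulse_bank(trigger_times=[2], retentions=[0.5, 0.8, 0.95], n_steps=10)
d_levels = variations(levels)
```

Each row of `d_levels` shows a positive step when the operator is set, followed by negative variations of decreasing size; the slower operators keep producing sizeable negative variations at longer delays, which is how the bank covers a range of travel durations.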

10

OESO spaces

10.1

Value transformations; logical segments

$v_0$ is the set of substances detected by sensor $\dot{C}_0$, including the substance (e.g. glucose) of interest to value consumers downstream. Transitivity spreads $v_0$ net value tracking upstream of $\dot{S}_0$. But $v_0$ net value tracking may spread far upstream of the $v_0$ domain. Imagine a segment $\dot{S}_T$ which transforms upstream items of type $v_u$ (e.g. fructose) into downstream items of another type $v_d$ (e.g. glucose): $v_u \to v_d$. Fit this segment with two sensors, $\dot{C}_u$ (sensing $v_u$ items) and $\dot{C}_d$ (sensing $v_d$ items), and plug the corresponding variation signals $dR_u(t)$ and $dR_d(t)$ as Π♥ inputs. Transformation of a $v_u$ item into a $v_d$ one elicits a $dR_u$ DIP and a $dR_d$ PIC. If the $dR_d$ signal has been previously recruited by Π♥, then $v_u \to v_d$ will cause $dR_u$ recruitment. Therefore, upstream segments transmitting $v_u$ items to $\dot{S}_T$ will also be recruited. Physical segments such as $\dot{S}_T$, where transformation occurs in situ, may be considered as a chain of two logical segments $S_u$ and $S_d$. Π♥ is substance-agnostic: aware of its input dynamics, but unaware of value types. “Substance transformation” should be understood beyond its usual chemical sense. It includes every in-situ change in spatial configuration, either physical (e.g. a field being plowed) or logical (a changing image; a scientific paper being edited).
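The recruitment of $dR_u$ through a transformation can be sketched with learning rule (33). The sketch assumes a one-to-one conversion (each event removes one $v_u$ unit and creates one $v_d$ unit), an already recruited weight `w_d`, and hypothetical names and parameter values throughout.

```python
def rectify(u):
    """Rectifier bracket (32): u if u >= 0, or else 0."""
    return u if u >= 0 else 0.0


def train_transformation(w_d=1.0, n_events=200, alpha=0.1):
    """Repeat the transformation v_u -> v_d inside S_T and adjust W_u by (33).

    W_d is left untouched: its own signal dR_d only shows PICs here, so the
    rectified term of rule (33) is zero for it during these events.
    """
    w_u = 0.0
    history = []
    for _ in range(n_events):
        dr_u, dr_d = -1.0, 1.0                   # DIP on C_u, PIC on C_d
        delta = w_u * dr_u + w_d * dr_d          # weighted sum of variations
        w_u += alpha * rectify(-dr_u) * delta    # rule (33) applied to W_u
        history.append(w_u)
    return w_u, history


w_u_final, w_u_history = train_transformation()
```

Under these assumptions the delta signal is `w_d - w_u` at each event, so `w_u` rises monotonically toward `w_d` and the delta elicited by the transformation fades: the upstream signal ends up carrying the same net value as the downstream one.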

10.2

Extended transitivity beyond value transformations

Transformations and combinations contribute to an extended transitivity which may spread prognodendrons far upstream from their origin ($S_0$), beyond value transformations (e.g., manufacturing yields money spent to buy food).

10.3

OESO spaces folding

OESO cognitive structures lie in OESO spaces. More generally, progression along OESO spaces may be perceived either as a discrete process (from node to node along an OESO discrete structure such as a tree), or as a continuous process (from OESO places to other OESO places), whatever the underlying reality. In many cases, OESO progression is correlated with physical space progression. However, in case of in-situ substance transformation, the OESO space is folded compared to the apparent physical space, at least from a macroscopic viewpoint. OESO spaces are even more folded in many in-situ human activities which involve information coding. As an illustration, consider progression towards success during computer gaming. The OESO model may provide support in interpreting various brain recordings, by referring to the progression in OESO space rather than restricting to phenomena.

11

From TD to the oesomeric model

11.1

Temporal discount

In the RL framework, the temporal discount factor γ specifies how future rewards are weighted with respect to the immediate reward. This section recalls the two main classical justifications of temporal discount provided by the RL literature, and discusses them according to the OESO framework.

11.1.1

The algorithmic rationale

The algorithmic justification for the discount rate parameter γ is to avoid getting an infinite sum $V(t)$ of expected future rewards [4]. In the Π♥×Σd case, $\tilde{V}_\Pi(t)$ is a weighted sum of value levels $R_k(t)$ ($k = 0 \cdots K$). In physical reality, $\tilde{V}_\Pi(t)$ is limited, both upstream (even if Π♥ is farsighted, $S_\Pi$ is limited by Π♥'s perception capabilities, since the set of $R_k$ inputs is limited) and downstream, thanks to value consumption (this is a requirement for the design of artificial oesomeric devices).

11.1.2

The psychological rationale

The psychological justification may be summarized by “better an egg today than tomorrow” [4]. TD applies a systematic and uniform temporal discount to all cues: see γ in the conventional TD production rule (1) (§3.2.1). In the OESO framework, this rationale would translate into a systematic and uniform a priori omission rate in cognitive segments. However, unlike TD, OESO takes into account a specific ($S_k$-dependent) a priori omission rate at acquisition.

11.1.3

OESO rationale

Taking into account the above rationales (§11.1.1 and §11.1.2), OESO assumes no systematic uniform temporal discount: γ = 1.

11.2

Delta signal production rule

11.2.1

Translating the TD(λ) production rule

The conventional TD(λ) delta signal production rule is defined by the update equation (1) recalled in §3.2.1:

$$\delta_t = R_{t+1} + \gamma\,\Theta_t^T \phi_{t+1} - \Theta_t^T \phi_t$$

To compare the TD and OESO production equations, assume that at time step $t$:
• the TD device inputs one reward signal $R_{t+1}$ and $K$ feature signals $\phi_k(t+1)$,
• the OESO device inputs one reference signal $dR_0(t)$ and $K$ recruitable signals $dR_k(t)$,
with $k = 1 \ldots K$ in both cases.

Notice that:
• $R_{t+1}$ is a variation signal (at least in the logistics case). For example, in the classical CS-US case, $R_{t+1}$ is the variation of the amount of value inside a RASE segment $\dot{S}_{US}$ (typically the intra-body nutritive system of an animal).
• While the TD feature vector $\phi(t+1)$ and weight vector $\Theta_t$ have $K$ components, the OESO input vector $\mathbf{dR}(t)$ and the weight vector $\mathbf{W}(t)$ have $K+1$ components.

Let us translate the TD equation (1) recalled above, according to the following translation rationale:
1. Replace “$\delta_t$” by “$\delta_{TD}(t)$”.
2. Translate “$R_{t+1}$” into “$W_0(t-1) \times dR_0(t)$” (OESO weights the reference signal).
3. Assume $\gamma := 1$ (no systematic uniform temporal discount: §11.1.3).
4. $\Theta(t)^T \phi(t+1) - \Theta(t)^T \phi(t) = \Theta(t)^T d\phi(t+1)$, using the “d” operator (§5.2).
5. Expand the dot product: $\Theta(t)^T d\phi(t+1) = \Theta_1(t) \times d\phi_1(t+1) + \ldots + \Theta_K(t) \times d\phi_K(t+1)$
6. At this point, we get the translation: $\delta_{TD}(t) = W_0(t-1) \times dR_0(t) + \Theta_1(t) \times d\phi_1(t+1) + \ldots + \Theta_K(t) \times d\phi_K(t+1)$
7. Replace all $\Theta_i(t)$ by $W_i(t-1)$, and replace all $d\phi_i(t+1)$ by $dR_i(t)$. Then

$$\delta_{TD}(t) = W_0(t-1) \times dR_0(t) + W_1(t-1) \times dR_1(t) + \ldots + W_K(t-1) \times dR_K(t)$$

or, in vector notation: $\delta_{TD}(t) = \mathbf{W}(t-1)^T \mathbf{dR}(t)$

11.2.2

Production rules comparison

Recall the OESO delta production rule formula (11) (§5.5):

$$\delta_\Pi(t) := d\tilde{V}_\Pi(t) := \mathbf{W}(t-1)^T \mathbf{dR}(t)$$

Therefore, TD and OESO delta production rules are equivalent, at least in the logistics case:

$$\delta_{TD}(t) = \delta_\Pi(t) \qquad (35)$$

11.2.3

Architecture

DAP and TD authors have proposed various functional or neuronal architectures (see e.g. [8]) including a subtraction operator to compute $V(t-1) - V(t)$ [7][16], and either a $dt$ delay operator (when $dt$ is constant and small) or a $V(t-1)$ memory (when $dt$ is variable or large, e.g. in game cases). The OESO production rule (11) requires neither a subtraction operator, nor a $dt$ delay operator, nor an estimated value memory: it simply performs a weighted sum of its inputs. The simple OESO cognitive functional model may help neuroscientists clarify neuronal architectures, notably that of the dopamine system (and presumably that of the cerebellum [17]).

11.3

TD learning

11.3.1

Translating the TD learning rule

The conventional TD(λ) learning rule is defined by update equations (2) and (3) recalled in §3.2.1:

$$e_t = \gamma \lambda\, e_{t-1} + \phi_t$$

$$\Theta_{t+1} = \Theta_t + \alpha\,\delta_t\,e_t$$

Applying the TD-to-OESO translation rules defined in §11.2.1, we get the following translated TD learning rule equations:

$$\mathbf{e}(t) = \lambda\,\mathbf{e}(t-1) + \mathbf{R}(t) \qquad (36)$$

$$\mathbf{W}(t) = \mathbf{W}(t-1) + \alpha\,\delta_{TD}(t)\,\mathbf{e}(t) \qquad (37)$$

Setting $\lambda := 0$, we get the TD(0) learning rule equation:

$$\mathbf{W}(t) = \mathbf{W}(t-1) + \alpha\,\mathbf{R}(t)\,\delta_{TD}(t) \qquad (38)$$
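As a quick check that (38) is indeed the λ = 0 case of (36) and (37), the translated equations can be written out directly. The function names and the input values below are illustrative assumptions only.

```python
def td_lambda_step(w, e, r, delta_td, alpha=0.1, lam=0.9):
    """One step of the translated TD(lambda) learning rule.

    (36): e(t) = lam * e(t-1) + R(t)
    (37): W(t) = W(t-1) + alpha * delta_TD(t) * e(t)
    """
    e_new = [lam * ek + rk for ek, rk in zip(e, r)]
    w_new = [wk + alpha * delta_td * ek for wk, ek in zip(w, e_new)]
    return w_new, e_new


def td0_step(w, r, delta_td, alpha=0.1):
    """Rule (38): W(t) = W(t-1) + alpha * R(t) * delta_TD(t)."""
    return [wk + alpha * rk * delta_td for wk, rk in zip(w, r)]


# Assumed inputs for a single time step.
w = [1.0, 0.2, 0.0]
e = [0.0, 0.5, 0.3]          # traces left over from earlier steps
r = [0.0, 1.0, 0.0]          # levels R(t)
delta_td = 0.4

w_lam0, e_lam0 = td_lambda_step(w, e, r, delta_td, lam=0.0)
w_td0 = td0_step(w, r, delta_td)
w_lam, e_lam = td_lambda_step(w, e, r, delta_td, lam=0.9)
```

With `lam=0.0` the old traces are discarded and the update coincides with `td0_step`; with `lam=0.9` the third weight is also adjusted, although its level is currently zero, because its trace is still decaying.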

11.3.2

Comparing the TD and OESO learning rules

Recall the OESO cognitive learning rule equation (33) (§8.3):

$$W_k(t) = W_k(t-1) + \alpha\,\lfloor -dR_k(t) \rfloor_+ \times d\tilde{V}_\Pi(t)$$

The OESO equation (33) is similar to the conventional TD(λ) rule [18], replacing eligibility traces $e_k(t)$ by $\lfloor -dR_k(t) \rfloor_+$.

11.3.3

TD(λ) eligibility traces

Notice that:
• $e_k(t)$ is an exp-pulse triggered by the corresponding feature $\phi_k(t)$.
• $\lfloor -dEP^{\lambda}_{t_0}(t) \rfloor_+ = \lambda\,EP^{\lambda}_{t_0}(t-1)$.

Fig. 7 shows the rectified variation signal following an admission into a RASE segment.

Figure 7: RASE segment and exp-pulse: level, variation, and rectified variation.

Indeed, TD(λ) eligibility traces $e_k(t)$ are rectified variations of exponential-decay RASE cognitive segments.

11.3.4

Comparison of TD(λ) and OESO learning rules

Hence, the OESO cognitive learning rule equation (33) with exp-pulse delay operators (i.e. with exponential-decay RASE delay segments) is equivalent to the conventional TD(λ) one. Therefore, TD(λ) and OESO learning rules are equivalent, provided OESO cognitive learning is restricted to recruitment of exponential-decay delay operators.

But the OESO framework provides several benefits. Notably, it makes explicit a clear separation between its core (Π♥) and the cognitive input operators (see §11.4 below). Output signals of various types of delay operators may thus be recruited. Notably, non-exponential-decay delay operators may be involved, e.g. hyperbolic-decay discounting (see e.g. [11]). By contrast, TD has built-in exponential-decay delay operators (with a uniform exponential decay constant λ), through which every input signal must be processed. However, this is a minor restriction, because the TD learning rule equations may be accommodated... by replacing TD learning equations (2) and (3) with the OESO learning equation (33) for a subset of input signals $\phi(t)$. From a more fundamental viewpoint, the OESO learning rule makes explicit the dynamistical nature of both TD and OESO learning (§8.4), which is inferred from the OESO spatiotemporal model.

11.4

Modularity

Let us recap the modularity benefits of OESO:
• TD(λ) rules embed temporal discount and delay operators, with systematic and uniform γ and λ parameters for all input signals.
• The OESO model externalizes delay operators out of Π♥ (allowing use of various temporal profiles and delays), and it lets weight adjustment apply an $S_k$-dependent temporal discount.
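The separation between the core and its input operators can be illustrated by a small sketch in which the same weighted-sum core is fed by delay operators with different temporal profiles. The profile functions, their parameters, and the weights are hypothetical choices made only for this illustration.

```python
import math


def exponential_profile(age, rate=0.5):
    """Exponential-decay level, `age` steps after the trigger."""
    return math.exp(-rate * age)


def hyperbolic_profile(age, k=0.5):
    """Hyperbolic-decay level, `age` steps after the trigger."""
    return 1.0 / (1.0 + k * age)


def delayed_variation(profile, trigger_time, t):
    """Variation dR(t) of a delay operator set at trigger_time (level 0 before)."""
    def level(time):
        return profile(time - trigger_time) if time >= trigger_time else 0.0
    return level(t) - level(t - 1)


def core_delta(weights, variations):
    """The core only sees variations: delta = sum_k W_k * dR_k."""
    return sum(w * dr for w, dr in zip(weights, variations))


# The same core is fed by two operators with different temporal profiles.
operators = [exponential_profile, hyperbolic_profile]
weights = [0.5, 0.5]                       # assumed, already-learned weights
inputs_t0 = [delayed_variation(op, trigger_time=0, t=0) for op in operators]
inputs_t3 = [delayed_variation(op, trigger_time=0, t=3) for op in operators]
delta_t3 = core_delta(weights, inputs_t3)
```

The core function never needs to know which profile produced a given input, so swapping or mixing profiles requires no change to it.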

12

Validation

Equivalence of TD(λ) and OESO production and learning rules is demonstrated above:
• delta production rules at §11.2.2,
• learning rules (with exponential-decay delay operators) at §11.3.4.

The vast DAP/TD literature provides numerous simulation and experimental validation reports of such a case, which should apply to the oesomeric model. Further work should validate the oesomeric model in larger cases, notably to verify high-order (> 2) DAP learning transitivity (§4.2), and the neuronal support of the (theoretical) extended transitivity capability (involving value transformation: §10.1).
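The equivalence of the delta production rules (35) can also be checked numerically on one time step, by following the translation steps of §11.2.1. This is only a sanity check on made-up numbers, not a validation of the model; all names and values are assumptions.

```python
def td_delta(reward, theta, phi_next, phi_now, gamma=1.0):
    """Conventional TD delta (1): R_{t+1} + gamma * Theta.phi_{t+1} - Theta.phi_t."""
    v_next = sum(th * p for th, p in zip(theta, phi_next))
    v_now = sum(th * p for th, p in zip(theta, phi_now))
    return reward + gamma * v_next - v_now


def oeso_delta(w, dr):
    """OESO delta (11): weighted sum of variations, W(t-1)^T dR(t)."""
    return sum(wk * drk for wk, drk in zip(w, dr))


# Assumed step with K = 2 features / recruitable signals.
theta = [0.7, 0.3]
phi_now = [1.0, 0.0]
phi_next = [0.2, 1.0]
w0, dr0 = 1.0, 0.5                 # reference weight and reference variation
reward = w0 * dr0                  # translation step 2

# Translation steps 4 to 7: W_i(t-1) = Theta_i(t), dR_i(t) = phi_i(t+1) - phi_i(t)
w = [w0] + theta
dr = [dr0] + [nxt - now for nxt, now in zip(phi_next, phi_now)]

d_td = td_delta(reward, theta, phi_next, phi_now, gamma=1.0)
d_oeso = oeso_delta(w, dr)
d_td_discounted = td_delta(reward, theta, phi_next, phi_now, gamma=0.9)
```

With γ = 1 the two deltas agree up to floating-point rounding; with γ = 0.9 they no longer do, which reflects translation step 3.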

13

Conclusion

The OESO cognitive model provides an extended viewpoint, adding directed space to the classical temporal framework of Reinforcement Learning models. The structural, functional, and dynamical OESO cognitive models defined in this paper have a natural grounding: tracking substance amount variations inside a structure of segments. Several concepts are defined (some of them being quantified), among others: oesomere, mission events (admission, omission, etc.) and acquisition, net value and expected value, axiogogue and prognodendron. The OESO framework should help to interpret various phenomena elicited in many application domains related to logistics.

The OESO production rule is simple (a weighted sum of variations, devoid of subtraction operator) and it is equivalent to the TD production rule. While the OESO learning rule (using exponential-decay delay operators) is equivalent to the TD(λ) learning rule, the OESO model unveils their dynamistical nature. Moreover, the OESO functional architecture is modular (§11.4).

Noting that basic OESO cognitive devices are space-agnostic leads us to extend the value announcement transitivity upstream to substance transformations occurring at interfaces between logical segments. Extended transitivity may shed light on various psychological phenomena, bridging the gap between vital values and accessory needs (e.g. seeking tasty meals?).

While this paper focuses on the cognitive component of logistics, studying the driving (control) component (aiming to favor downward progress of value items) should help to clarify several still obscure psychological concepts such as drives and desires, and their neuronal implementation.

In this paper I focus on the progression in space of value items to consumers downwards. But roles are frequently inverted: the consumer progresses in space to reach value items (goals). Indeed animals (notably humans) chain both types of progression to eventually consume values: you move to reach your kitchen, then grasp a food item and put it in your mouth, then the food item travels along your nutritive system towards nutriment consumers. Possible applications are games, using software and internet, and several human activities prefixed by “pro-”, e.g. production (cf. supply chain management and workflow), projects, procedures. Several paths may lead a consumer from her current location to a given goal, hence graph-structured axiogogues and prognodendrons. OESO applies to both progress types, notably for expected value quantification.

Nutriments flow downwards along unchanging structures. It may be interesting to study an extension of the OESO model to 2D spaces, where value items move in various directions. Even if wind direction changes on a day timescale, it is rather constant on a shorter timescale, which allows us to predict sun or rain here within ten minutes by looking in the right direction at the horizon. Similarly, a subset of cognitive input operators could be activated by direction sensors, to track objects of interest in the field of vision.

OESO could shed light on several physiological phenomena which display anticipation capabilities triggered by extra-body clues, notably: cephalic phase responses (e.g. insulin rise), sexual arousal (e.g. erection), risk appraisal as anticipation of damage (eliciting a rise of the sympathetic system). The intra-body immunity role of serotonin [9] may spread to extra-body concerns related to self-care, shelter, attachment, which may explain its involvement in mood disorders.

Acknowledgments

I gratefully thank all my relatives who give me help and support. Many thanks to NIPS 2016 reviewers for their constructive comments.

References

[1] Klopf, A. H. (1988). A neuronal model of classical conditioning. Psychobiology, 16(2), 85-125.
[2] Sutton, R. S. (1988). Learning to predict by the methods of temporal differences. Machine learning, 3(1), 9-44.
[3] Schultz, W., Dayan, P., & Montague, P. R. (1997). A neural substrate of prediction and reward. Science, 275(5306), 1593-1599.
[4] Sutton, R. S., & Barto, A. G. (1998). Reinforcement learning: An introduction. MIT press.
[5] Hollerman, J. R., & Schultz, W. (1998). Dopamine neurons report an error in the temporal prediction of reward during learning. Nature neuroscience, 1(4), 304-309.
[6] Suri, R. E., & Schultz, W. (1999). A neural network model with dopamine-like reinforcement signal that learns a spatial delayed response task. Neuroscience, 91(3), 871-890.
[7] Pan, W. X., Schmidt, R., Wickens, J. R., & Hyland, B. I. (2005). Dopamine cells respond to predicted events during classical conditioning: evidence for eligibility traces in the reward-learning network. The Journal of neuroscience, 25(26), 6235-6242.
[8] Kawato, M., & Samejima, K. (2007). Efficient reinforcement learning: computational theories, neuroscience and robotics. Current opinion in neurobiology, 17(2), 205-212.
[9] Rubio-Godoy, M., Aunger, R., & Curtis, V. (2007). Serotonin–A link between disgust and immunity? Medical hypotheses, 68(1), 61-66.
[10] Schultz, W. (2007). Reward signals. Scholarpedia, 2(6):2184.
[11] Kobayashi, S., & Schultz, W. (2008). Influence of reward delays on responses of dopamine neurons. The Journal of Neuroscience, 28(31), 7837-7846.
[12] Ludvig, E. A., Sutton, R. S., & Kehoe, E. J. (2008). Stimulus representation and the timing of reward-prediction errors in models of the dopamine system. Neural Computation, 20(12), 3034-3054.
[13] Niv, Y., & Schoenbaum, G. (2008). Dialogues on prediction errors. Trends in cognitive sciences, 12(7), 265-272.
[14] de Araujo, I. E., Ferreira, J. G., Tellez, L. A., Ren, X., & Yeckel, C. W. (2012). The gut–brain dopamine axis: a regulatory system for caloric intake. Physiology & behavior, 106(3), 394-399.
[15] Howe, M. W., Tierney, P. L., Sandberg, S. G., Phillips, P. E., & Graybiel, A. M. (2013). Prolonged dopamine signalling in striatum signals proximity and value of distant rewards. Nature, 500(7464), 575-579.
[16] Eshel, N., Bukwich, M., Rao, V., Hemmelder, V., Tian, J., & Uchida, N. (2015). Arithmetic and local circuitry underlying dopamine prediction errors. Nature.
[17] Ohmae, S., & Medina, J. F. (2015). Climbing fibers encode a temporal-difference prediction error during cerebellar learning in mice. Nature neuroscience.
[18] van Seijen, H., Mahmood, A. R., Pilarski, P. M., Machado, M. C., & Sutton, R. S. (2015). True Online Temporal-Difference Learning. arXiv preprint arXiv:1512.04087.
[19] Lak, A., Stauffer, W. R., & Schultz, W. (2016). Dopamine neurons learn relative chosen value from probabilistic rewards. eLife, 5, e18044.
[20] Michaud, P. (2016). The PGD model: giving Space to Reinforcement Learning Temporal models. Submitted at NIPS 2016.
[21] NIPS (2016). The Thirtieth Annual Conference on Neural Information Processing Systems. https://nips.cc/Conferences/2016.
[22] Michaud, P. (2016). The oesomeric model home page. http://oesomeric.free.fr.
[23] Wikipedia contributors. “Oesomeric model”. In Wikipedia, The Free Encyclopedia. https://en.wikipedia.org/wiki/Oesomeric_model.

Versions of this “OESO model” paper: I submitted an initial version at NIPS 2016 [21] under the title “The PGD model: giving Space to Reinforcement Learning Temporal models” [20]. The NIPS 2016 version is available at the “Oesomeric model” homepage {http://oesomeric.free.fr} [22]. Compared to the NIPS 2016 version, the present “Salon des Refusés” version takes into account several NIPS 2016 comments, and it renames the model from “PGD model” to “oesomeric model” or “OESO”. Enhanced versions of the present “OESO model” paper will be made available at the “OESO model” homepage [22]. Contributions are welcome at the “Oesomeric model” English Wikipedia page [23], notably identification of relevant application domains, and validation references (simulations, experiments, ...).

14

Annexes

14.1

New and uncommon terms

This annex lists and defines new terms (in bold) and uncommon terms. The sections indicate where the term is defined in this paper.
• acquisition (§7.7): when a value item enters the axiogogue (reality) or the prognodendron (cognition) via one of its foremost entry ports.
• axiogogue (§4.1): a structured set of real OESO segments which conveys value items to consumers downstream.
• boolean segment (§7.5): a segment which either transmits or omits (it never partially transmits).
• ceteris paribus (§7.6): other things being equal; with all other things or factors remaining the same.
• cognitive OESO segment or simply cognitive segment (§5.4): the sensibility (spatial) domain of a recruitable signal.
• DAP (§3.1): the phasic dopamine signal.
• DIP (§3.1): a negative impulse.
• dynamistics (§5.4): a learning rule is of dynamistics nature if it learns from coincidences of variations (cf. statistics).
• exp-pulse (§4.1): a pulse with a sudden rise, followed by a slower exponential decay.
• level (§4.1): a level is the amount of something, e.g. glucose concentration around a glucose sensor.
• missions (§4.1): “admission”, “emission”, “omission”, or “transmission”.
• NIL (§3.1): no significant PIC or DIP during a given time interval.
• OESO: an abbreviation for “oesomeric”.
• oesomere (§4.1): a segment which transmits previously admitted value items to downstream segment(s). The oesophagus is a biological oesomere.
• oesomeric: related to the oesomeric model / framework.
• OR (§3.1): the “Omitted Reward” scenario (one of the three “URPROR” scenarios).
• PIC (§3.1): a positive impulse.
• PR (§3.1): the “Predicted Reward” scenario (one of the three “URPROR” scenarios).
• prognodendron (§6): the mental “representation” of all the cognitive segments belonging to an OESO cognitive structure.
• RASE segment (§4.1): an OESO segment with rapid admission (usually) and slow emission (structurally).
• real OESO segment or simply real segment (§4.1): a real oesomere (e.g. the oesophagus).
• RPE (§3.2.2): Reward Prediction Error.
• statistics (§5.4): classically, statistics deals with coincidences of levels (cf. dynamistics).
• TD (§3.2): Temporal-Difference.
• TD(0) (§3.2): the initial version of TD learning.
• TD(lambda) (§3.2): TD learning with eligibility traces.
• TD error (§3.2): the difference between the sum of rewards which were expected after the previous time step, and the sum of rewards expected after the current time step.
• UR (§3.1): the “Unexpected Reward” scenario (one of the three “URPROR” scenarios).
• URPROR (§3.1): the set of three basic DAP scenarios: UR, PR, OR.
• variation (§4.1): a variation is the temporal change of a level.

A future version of this paper will include a comprehensive index, and a summary of OESO notation.