
Delft University of Technology

Reinforcement Learning-based Control Allocation for the Innovative Control Effectors Aircraft

de Vries, Pieter Simke; van Kampen, Erik-Jan

DOI: 10.2514/6.2019-0144

Publication date: 2019

Document Version: Final published version

Published in: AIAA Scitech 2019 Forum

Citation (APA)

de Vries, P. S., & van Kampen, E. (2019). Reinforcement Learning-based Control Allocation for the Innovative Control Effectors Aircraft. In AIAA Scitech 2019 Forum: 7-11 January 2019, San Diego, California, USA [AIAA 2019-0144]. https://doi.org/10.2514/6.2019-0144

Important note

To cite this publication, please use the final published version (if applicable). Please check the document version above.

Copyright

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Takedown policy

Please contact us and provide details if you believe this document breaches copyrights. We will remove access to the work immediately and investigate your claim.


Reinforcement Learning-based Control Allocation for the

Innovative Control Effectors Aircraft

Pieter Simke de Vries∗ and Erik-Jan van Kampen†

Established Control Allocation (CA) methods rely on knowledge of the control effectiveness to distribute control effector utilization for the control of (over-actuated) systems. The Innovative Control Effectors (ICE) aircraft model is highly over-actuated, with 13 control effectors, so CA is a preferred method to distribute control effector utilization. This paper envisions the use of Reinforcement Learning (RL) for distributing control effector utilization, which requires no knowledge of the control effectiveness and allows more abstract, timescale-separated objectives to be pursued. Considering only longitudinal motion, the ICE aircraft's altitude is controlled from an initial offset by distributing the control effector utilization using RL, while pursuing secondary objectives such as decreasing effector utilization and thrust.

Nomenclature

ACS = admissible control set
AMT = all-moving wingtip
BWB = blended wing body
CA = control allocation
CES = control effector suite
CFD = computational fluid dynamics
FBW = fly-by-wire
ICE = innovative control effectors
INDI = incremental nonlinear dynamic inversion
ISA = international standard atmosphere
LEF = leading edge flap
MATV = multi-axis thrust-vectoring
RCS = radar cross section
RL = reinforcement learning
RSS = relaxed static stability
SAM = surface-to-air missile
SSD = spoiler-slot deflector
TD = temporal difference
α = angle of attack [deg] or learning rate [-]
β = angle of sideslip [deg]
δ = angle of effector deflection [deg]
ε = random action parameter [-]
θ = pitch angle [deg]
π = policy [-]
ρ = air density [slug/ft³]
γ = discounting factor [-]
x, y, z = axis indicators
l, m, n = moment axis indicators
a = (discrete) action [-]
A = action space [-]
b = wing span [ft]
B = (linear) control effectiveness matrix [-]
CI = control intensity [%]
C_{p,r} = position and rate limited control spaces [-]
C_{x,y,z} = force coefficients [-]
C_{l,m,n} = moment coefficients [-]
c̄ = mean aerodynamic chord [ft]
E = action history matrix [-]
f = state dynamics function [-]
f_s = simulation frequency [Hz]
F_{x,y,z} = body forces [lbf]
g = input dynamics function [-]
G = return [-]
h = altitude [ft]
ℓ = output function [-]
M = Mach number [-]
M_{x,y,z} = body moments [lbf·ft]
p_dyn = dynamic pressure [lbf/ft²]
p, q, r = body rotation rates [rad/s]
Q = value table [-]
R = reward [-]
S = wing reference surface area [ft²]
s_t = (discrete) state vector [-]
t = time [s]
T = thrust force [lbf]
u = input vector [-]
U = admissible control set [-]
v = virtual control input vector [-]
V = true airspeed [ft/s]
x = state vector [-]

∗Graduate Student, Section Control & Simulation, Department Control & Operations, Faculty of Aerospace Engineering, Delft University of Technology, Kluyverweg 1, 2629HS Delft, Zuid-Holland, The Netherlands

†Assistant Professor, Section Control & Simulation, Department Control & Operations, Faculty of Aerospace Engineering, Delft University of Technology, Kluyverweg 1, 2629HS Delft, Zuid-Holland, The Netherlands, AIAA Member.


I. Introduction

Survivability of combat aircraft has been a main theme in their design. Since the advent of guided missiles, Radar Cross Section (RCS) became an important driver for combat aircraft design [1–5]. Today, it can even be considered the most important design parameter with respect to the physical configuration of combat aircraft. Modern fighter aircraft, like the F-22 and F-35, have designs centered around low observability.

The Innovative Control Effectors (ICE) aircraft appeals to two main concepts of survivability: avoidance, by reducing RCS through the elimination of all tail surfaces, and resilience, by having 13 independent control effectors to control the aircraft about its three attitude axes, providing a high degree of redundancy in control, while also reducing airframe weight and increasing aerodynamic performance and maneuverability [6].

Challenges arise with respect to the aircraft's stability characteristics, but more importantly in handling the presence of redundant, highly coupled, nonlinear control effectors. Most current-day (fighter) aircraft have designs that focus on decoupling cross-axis control effectiveness and typically do not have more effectors than controlled axes. As such, these aircraft use linear (scheduled) control systems with one effector typically controlling only one attitude axis. Since the genesis of the concept of Relaxed Static Stability (RSS) [7], constant intervention of the aircraft's fly-by-wire (FBW) system has been needed to maintain stable flight conditions. With the advances in the field of FBW and, most importantly, in the field of Control Allocation (CA) [8, 9], highly unstable aircraft with a large number of redundant effectors, like the ICE aircraft, can be controlled and the configuration's performance benefits exploited [10, 11]. Another opportunity is that additional objectives, such as reducing drag [12, 13] and potentially RCS, become possible. Established CA methods might allow for reconfigurability as well as optimizing for other objectives on a short timescale. However, established CA methods require a model, and while some CA algorithms might tolerate some model inaccuracy in their topology, they are inherently model-dependent.

Reinforcement Learning-based (RL) CA might allow moving past these limitations, as it does not require a model [14–18] and is able to fulfill higher-level, more abstract objectives without their explicit quantification in terms of control effector utilization. Another advantage is that RL does not need the objectives to be defined on the same timescale; they may even be discontinuous. In RL, as long as there is a metric that can provide a reinforcement signal (reward) to the agent, it can dynamically attribute value to the relevant state-action combinations for optimizing that objective, without explicit quantification on either timescale or in terms of control effector utilization. Ideally, CA using RL would be integrated into a complete intelligent flight control system as conceptualized by [19, 20]. However, hybrid [21, 22] or partial approaches are possible.

The advantages of RL-based CA are obvious when the aircraft is flying autonomously, as the aircraft needs to adapt to changing conditions by means of the same reinforcement signal, representative of the (abstract) higher-level objective, thereby fulfilling mission tasks otherwise performed by the pilot or remote operator [19]. However, it is also relevant when the aircraft is piloted, as the underlying considerations for the choice of effector utilization distribution are likely too complex for a human operator to handle.

The contribution of this research is an RL-based CA algorithm for the ICE aircraft, using tabular Q-learning. The RL agent is trained in offline simulations, in which it is initialized at an altitude offset from its target altitude and learns to fly towards and remain near this target altitude. The agent starts with no knowledge of the system other than some predefined limits for its state- and action-space and is provided a reward for achieving its objective. A brief overview of existing CA methods is given in section II and the ICE aircraft is described in section III. An examination of RL is given in section IV, the RL-based CA method is detailed in section V, followed by the simulation setup in section VI. Section VII shows the simulation results, which are discussed in section VIII; finally, the conclusions and recommendations are presented in section IX.


II. Control Allocation

The general scheme for implementing Control Allocation (CA) is shown in figure 1.


Fig. 1 General Control Allocation implementation topology.

Virtually all CA implementations rely on the above topology, as broadly concluded in [9] and in previous research [8, 23, 24]. CA relies on higher-level control systems to generate, typically, moment commands, for which the CA algorithm finds a suitable set of effector deflections. Additional objectives are often considered, mainly that of reducing some (1-, 2- or infinity-)norm of the control effector deflections, so as to avoid null-space intersections [25] and cycling between solutions that might result in the same moments [23]. This limits the solution space, improving computational time.

Other objectives can be considered, as listed below [9].
• Minimum control surface deflection
• Minimum drag
• Maximum lift
• Minimum number of effectors
• Minimum hydraulic system wear
• Minimum wing loading
• Minimum radar signature
• Rapid re-configurability for fault tolerance

The reduction of the control effector deflection norm might also have implications for attaining other objectives, such as minimizing drag or radar signature. The general expression for the aircraft dynamics is shown below.

\dot{x} = f(x, t) + g(x, t)\,v \qquad (1)

y = \ell(x, t) \qquad (2)

Here, the state vector is x ∈ R^n and f describes the contribution of the system's state to the change of state. The virtual control input v ∈ R^m has the same dimension as the output y ∈ R^m; typically m = 3, since there are three to-be-controlled axes. The term v maps the effector inputs u ∈ U ⊂ R^p to a set of 'virtual moments', while the function g(x, t) maps these moments to system input. The output function ℓ is often needed in the framework to relate sensor data to underlying state information. Another powerful aspect of CA is the handling of effector position (δ_i) and rate (expressed per discrete time step: \dot{\delta}_i \cdot \tfrac{1}{f_s}) constraints:

U = \{u_i \mid \underline{u}_i \le u_i \le \bar{u}_i\} \subset R^p \qquad (3)

C_p = \{u_i \mid \underline{\delta}_i \le \delta_i \le \bar{\delta}_i\} \subset R^p, \qquad C_r = \{\dot{u}_i \mid \underline{\delta}_i \le u_i^{l} + \Delta u_i \le \bar{\delta}_i\} \subset R^p \qquad (4)

CA is popularized by the fact that it can handle overactuation; there might be more effectors than to-be-controlled axes. The control effectiveness with respect to the moments can be described by the general effector model form (left) or the linear effector model form (right):

v = h(u, x, t), \qquad v = B(x, t)\,u \qquad (5)

When there are more effectors than to-be-controlled axes (p > m), the system is overdetermined; there are too many inputs for the number of outputs. The inverse problem is then underdetermined; there are too few equations to determine the unknowns. If the control effectiveness has a linear description in the form of a control effectiveness matrix B, linear solvers may be used, which are suitable for online use due to their computational efficiency. Such a linear expression of the control effectiveness is not suited for the ICE model, since it can capture neither the interactions between effectors nor the parabolic character of drag-driven control effectiveness, which is predominantly present in the control effectiveness in yaw [6, 26].
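As an illustration of the linear allocation step just described, a minimal sketch is given below, assuming a known effectiveness matrix B and a simple minimum-norm (pseudo-inverse) solution with clipping to the position limits. The numbers are placeholders, and this is not the allocation method pursued in this paper.

```python
import numpy as np

def linear_ca(B, v, u_min, u_max):
    """Minimum-norm linear control allocation sketch.

    B            : (m x p) control effectiveness matrix, with v = B u
    v            : (m,) commanded virtual moments
    u_min, u_max : (p,) effector position limits
    Returns a clipped minimum 2-norm deflection vector u.
    """
    # Minimum 2-norm solution of the underdetermined system v = B u
    u = np.linalg.pinv(B) @ v
    # Enforce position limits (a real allocator would redistribute the deficit)
    return np.clip(u, u_min, u_max)

# Hypothetical numbers purely for illustration
B = np.array([[0.8, 0.5, 0.1],      # pitch effectiveness of three effectors
              [0.1, 0.0, 0.9]])     # yaw effectiveness
v = np.array([0.3, -0.1])           # commanded pitch / yaw moment increments
u = linear_ca(B, v, u_min=-0.5 * np.ones(3), u_max=0.5 * np.ones(3))
print(u)
```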

CA can be considered an especially effective tool when the inversion-based higher-level motion control is incremental. Incremental Nonlinear Dynamic Inversion (INDI) considers only the increments in moments needed to achieve a commanded outer-loop state, through the time-scale separation principle. Thereby it increases performance and reduces model dependency significantly, requiring only the control effectiveness description of the system. This method is applied to a quadcopter drone in [27], although no overactuation is present in that case. The INDI framework was extended in [11] to determine the incremental deflections of control effectors needed to attain increments in desired moments for the ICE model, showing superior tracking performance of both high- and low-level commands compared to previous methods such as Linear Control Allocation (LCA), as well as online implementability, which had plagued previous endeavors of Nonlinear Control Allocation (NCA) due to computational intensity [9]. With the concept of CA introduced and previous research discussed, the next section continues with the subject of application: the Innovative Control Effectors aircraft.

III. The Innovative Control Effectors Aircraft Model

The RL-based CA algorithm developed in this research will be applied to the Innovative Control Effectors (ICE) aircraft model. The ICE model is a tailless fighter aircraft, which utilizes a mix of new, innovative control effectors as well as better-known and often utilized control effectors. The absence of passive stability elements, such as a vertical and horizontal tail, makes it highly maneuverable [6, 26]. The aircraft, its control effectors and the simulation model are described in this section.

A. General Configuration

The ICE aircraft's configuration can be interpreted as a flying wing or Blended Wing Body (BWB), although the former might be more accurate: fighters are often considered BWBs because (especially modern, stealth) fighters have a high degree of structural blending between fuselage and wing and a significant portion of the lift is generated by the fuselage, while the ICE aircraft takes that concept a step further. The high sweep angle is effective at generating vortex lift at high angles of attack, as well as reducing the effective Mach number, resulting in lower trans- and supersonic drag. Not least, the sharp angle likely also contributes positively towards a lower radar cross section (RCS) [28].

B. Control Effector Suite

While several experimental as well as production aircraft have a flying wing configuration, none of those aircraft possess the multitude and diversity of control effectors that the ICE aircraft possesses. The Control Effector Suite (CES) is both the most challenging and the most progressive aspect of the ICE aircraft. While conventional configurations seek to decouple effector responses from the controlled axes and focus on reducing effector interaction, the ICE CES has strong interactions between some effectors as well as cross coupling in terms of effector response on the aircraft attitude angles. The configuration used at the time of this research entailed 13 independent control effectors, illustrated individually in figure 2. From front to back, the aircraft has inboard and outboard Leading Edge Flaps (LEFs), with the outboard LEF terminating in the All-Moving (wing) Tips (AMTs). At high angles of attack, the LEFs can provide roll and yaw control since the flow conditions near the leading edge are more favorable; flow downstream at the trailing edge might separate. The AMTs, having the largest moment arm of any directional effector, are one of the main sources of roll and yaw control power. The elevons at the trailing edge can be seen as one of the most conventional effectors, along with the pitch flaps. The elevons may be deflected differentially, providing both pitch and roll control power. The pitch flaps, however, may only be deflected symmetrically; this was a design choice, so as to have a dedicated control surface for arguably the most important control axis. The CES also incorporates Multi-Axis Thrust Vectoring (MATV), providing pitch and yaw control power under circumstances when other surfaces might be stalled (recovery) or at low dynamic pressures, where utilization of other effectors cannot provide the desired control power. In front of the pitch flaps and elevons, the Spoiler-Slot Deflectors (SSDs) are located. Similar to spoilers on transport and military aircraft, they can be used as a drag-driven effector for yaw control. However, since the pitch flaps and elevons are located just behind the SSDs, the SSDs will disturb the flow over these effectors and strongly reduce their effectiveness.



Fig. 2 The ICE control effector suite. [6]

At high angles of attack, the opposite is true; the flow separates or has already separated by the time it reaches the trailing edge effectors, but the slot that opens between the upper and lower surfaces restores some of the flow over these effectors, aiding their control effectiveness. A summary of the effectors and their capabilities is shown in table 1. Only the actuator position limits are considered in this research; the actuator dynamics and rate limits are not.

Table 1 Effector Summary

Effector Acron. Position Lims. Rate Lims.

Inboard Leading Edge Flaps (x2) LEFi [0, 40] deg 40 deg/s
Outboard Leading Edge Flaps (x2) LEFo [−40, 40] deg 40 deg/s

All-Moving Wing Tips (x2) AMT [0 , 60] deg 150 deg/s

Elevons (x2) ELEV [−30, 30] deg 150 deg/s

Spoiler-Slot Deflectors (x2) SSD [0, 60] deg 150 deg/s

Pitch Flaps PF [−30, 30] deg 150 deg/s

Pitch Thrust Vectoring PTV [−15, 15] deg 150 deg/s

Yaw Thrust Vectoring YTV [−15, 15] deg 150 deg/s

C. Spline-Based Three Degree of Freedom Simulation

The 'original' ICE model is a Simulink-based simulation model with 108 lookup tables, constructed from wind tunnel experiments and CFD simulations. From these data tables the resulting aircraft force and moment coefficients C_x, C_y, C_z, C_l, C_n, C_m can be constructed. Since the implementation of the RL controller was easier and faster when using MATLAB alone, the Simulink model was not used; instead, the spline model from [29] was used to calculate the force and moment coefficients as a function of the aircraft state, as shown in equation 6, with the control input vector as specified in equation 7.

[C_x, C_y, C_z, C_l, C_n, C_m] = f(\alpha, M, \beta, V, p, q, r, u, T) \qquad (6)

u = [\delta_{LEFi}^{L}, \delta_{LEFo}^{L}, \delta_{AMT}^{L}, \delta_{ELEV}^{L}, \delta_{SSD}^{L}, \delta_{PF}, \delta_{LEFi}^{R}, \delta_{LEFo}^{R}, \delta_{AMT}^{R}, \delta_{ELEV}^{R}, \delta_{SSD}^{R}, \delta_{PTV}, \delta_{YTV}] \qquad (7)

Using these equations, the body forces and moments can be derived:

F_x = T\cos(\delta_{PTV})\cos(\delta_{YTV}) - C_x\,\tfrac{1}{2}\rho V^2 S

F_y = T\cos(\delta_{PTV})\sin(\delta_{YTV}) + C_y\,\tfrac{1}{2}\rho V^2 S

F_z = -T\sin(\delta_{PTV})\cos(\delta_{YTV}) - C_z\,\tfrac{1}{2}\rho V^2 S

M_x = C_l\,\tfrac{1}{2}\rho V^2 S b

M_y = C_m\,\tfrac{1}{2}\rho V^2 S \bar{c} - T\,d_{tv}\sin(\delta_{PTV})\cos(\delta_{YTV})

M_z = C_n\,\tfrac{1}{2}\rho V^2 S b - T\,d_{tv}\cos(\delta_{PTV})\sin(\delta_{YTV}) \qquad (8)


The terms F_x, F_y, F_z are the body forces (in [lbf]), T is the thrust force (also in [lbf]) and M_x, M_y, M_z (in [lbf·ft]) are the body moments, taking the body axis system as right-handed, with the positive x direction pointing towards the front of the aircraft and the nominal flight direction. The z axis has its positive direction downward. ρ is the air density (in [slug/ft³]), V the true airspeed (in [ft/s]), S the reference area (in [ft²]), b the wing span (in [ft]) and c̄ the mean aerodynamic chord (in [ft]). d_tv is the arm of the thrust-vectoring components (in [ft]), which is the distance between the point where the thrust vectoring acts and the center of gravity of the aircraft.

Since only longitudinal motion is considered, only the equations of symmetric motion (the first, third and fifth) are needed to describe the aircraft motion. The input to the spline model in equation 7 can also be simplified: the yaw thrust-vectoring term δ_YTV will always be zero, and effectors should not be deflected asymmetrically, so the inputs to the left and right effectors are equal. The aircraft states x_p = [u w q θ x h] are calculated every simulation time step using a numerical integration scheme for the three degree of freedom (3DoF) equations of motion. Furthermore, the atmospheric conditions are derived using the International Standard Atmosphere (ISA) as described in [30]. Finally, a complete state vector is constructed, which is used, partially, by the RL agent and for calculations in the next time step, according to

x_c = [\dot{x}, \dot{h}, x, h, u, w, \alpha, \theta, q, V, m, \rho, p_{dyn}] \qquad (9)
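To make the use of equation (8) concrete for the longitudinal case considered here (δ_YTV = 0 and symmetric deflections), a minimal sketch is shown below. The coefficient and geometry values are placeholders; in the actual simulation, C_x, C_z and C_m come from the spline model of [29].

```python
import numpy as np

def longitudinal_forces_moments(Cx, Cz, Cm, T, delta_ptv_deg,
                                rho, V, S, c_bar, d_tv):
    """Body forces and pitch moment for the symmetric case of equation (8),
    with the yaw thrust-vectoring angle fixed at zero."""
    d_ptv = np.deg2rad(delta_ptv_deg)
    q_dyn = 0.5 * rho * V ** 2               # dynamic pressure [lbf/ft^2]
    Fx = T * np.cos(d_ptv) - Cx * q_dyn * S
    Fz = -T * np.sin(d_ptv) - Cz * q_dyn * S
    My = Cm * q_dyn * S * c_bar - T * d_tv * np.sin(d_ptv)
    return Fx, Fz, My

# Placeholder state, geometry and coefficient values, not taken from the ICE model
print(longitudinal_forces_moments(Cx=0.02, Cz=-0.15, Cm=-0.01, T=5000.0,
                                  delta_ptv_deg=2.0, rho=1.0e-3, V=800.0,
                                  S=800.0, c_bar=28.0, d_tv=18.0))
```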

Having explained the concept of CA and the research aircraft model, the next section provides a brief introduction to the concept of Reinforcement Learning (RL).

IV. Reinforcement Learning

Reinforcement Learning (RL) is a type of machine learning in which an agent learns from interaction with its environment [31]. The control engineer, who designs the agent and the topology of how it interacts with the environment, determines the reward function: the reinforcement signal, a function of the states and/or inputs, that tells whether the actions the agent took in a state made it progress towards the goal. Rewards may be continuous or discontinuous, but the agent will attribute value to intermediate states irrespective of the nature of the reward function. In turn, the agent attempts to maximize the total (discounted) return G_t, as shown in equation 10. In this equation, r_{t+k+1} is the discrete time step reward, 0 ≤ γ ≤ 1 is the discounting factor, t is the start time step and T the end time step.

G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{T-t-1} \gamma^k r_{t+k+1} \qquad (10)

The agent constructs a value function, based on the rewards, which allows it to form a policy: a mapping from state to action. The most basic RL topology can be seen in figure 3. Reinforcement learning has been applied to flight control, mostly using more advanced function approximators [32–35], among others, and more recently to overcome some limitations in online implementation [36–38].


Fig. 3 General Reinforcement Learning Diagram.

Here, s ∈ S is the allowable state, a ∈ A is the allowable action and r = f(s, a) is the reward. In this research, one of the least model-dependent versions of RL was selected, which does not rely on knowledge of the state transitions for calculating the value. Q-learning estimates the value of state-action pairs, contrary to RL methods such as Dynamic Programming, where the value of states is estimated using knowledge of the system dynamics [31]. The backup routine for Q-learning can be seen in equation 11.

Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[ r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right] \qquad (11)

From the estimated value function, the agent can derive a policy. Typically, an ε-greedy policy is adopted. In this case the agent takes the greedy action a fraction 1 − ε of the time, where 0 ≤ ε ≤ 1, and a random action a fraction ε of the time. The greedy policy is derived as shown in equation 12.

\pi(s_t) = \arg\max_a Q(s_t, a) \qquad (12)

Q-learning bootstraps; its current estimate is a function of previous estimates and thus converges in the limit to the actual value(s) [31]. The leftmost term, Q(s_t, a_t), indicates the newly updated value; one term to the right depicts the previous value estimate. α is the learning rate, the weight of the newly learned information in the backup. The term in square brackets is the 'temporal difference error', which is zero if the value has converged. r_t is the reward and γ max_a Q(s_{t+1}, a) is the (discounted) maximum value of the state to which the agent has departed from the previous state-action pair s_t, a_t. γ is the discounting factor, the weight of the best next state value towards the previous state-action pair. One view of Q-learning is that it learns, similarly to a Monte Carlo-based approach, from trajectories through the state-action space, the main difference being that Q-learning is a temporal difference (TD) algorithm, in which value updates are allowed up to every time step. The state-action space limits are defined by the control engineer, and some state combinations might not be (physically) possible; thus, obtaining a value estimate and a subsequent policy might not be possible for every state combination.
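A minimal sketch of the backup of equation (11) and the ε-greedy selection of equation (12) is given below, purely as an illustration of tabular Q-learning; the discrete state and action indices are assumed to come from the discretization described in section V.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_update(Q, s, a, r, s_next, alpha, gamma):
    """One backup of equation (11): Q(s,a) <- Q(s,a) + alpha * TD-error."""
    td_error = r + gamma * np.max(Q[s_next]) - Q[s, a]
    Q[s, a] += alpha * td_error
    return Q

def epsilon_greedy(Q, s, epsilon):
    """Equation (12) with probability 1 - epsilon, a random action otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

# Tiny example with 4 discrete states and 3 discrete actions
Q = np.zeros((4, 3))
a = epsilon_greedy(Q, s=0, epsilon=0.99)
Q = q_update(Q, s=0, a=a, r=1.0, s_next=2, alpha=0.8, gamma=0.95)
```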

Since there is no 'real-world' ICE model, learning as well as benchmarking will be done offline. In the learning phase, the agent starts with an all-zero Q-matrix, a high degree of random action (ε = 0.99) and a high learning rate (α = 0.8). These are gradually decreased during learning as the value estimation converges. During benchmarking, the value is no longer updated according to the experienced reward and only an exploitative policy is adopted; there is no (random) exploration (α = ε = 0). The next section discusses the application of RL to the CA problem and, in turn, applies this method to the ICE aircraft.

V. Q-Learning Control Allocation

The framework of Q-learning, at least in its discrete form, is readily applicable to the Control Allocation (CA) problem. Similar to the objective function in numerical optimization-based methods, the reward function in RL can be used to specify for the agent what the goal of using its different inputs is. The main difference is that the reward function does not need to be interpreted on a per-time-step basis, but can be taken into account even if the reward is experienced in another state or is discontinuous. The reward might even be a more abstract construct stemming from a higher-level objective. In terms of handling the degree of overactuation, or effectively the presence of a (significantly) larger action space, which does raise concerns toward the 'curse of dimensionality' that already plagues the discrete forms of Reinforcement Learning (RL) [31], there is no fundamental difference between having multiple (redundant) effectors and 'simply' having a finer discretization for a smaller number of (non-redundant) effectors.


Fig. 4 Q-learning control allocation episodic simulation topology.


A. Implications Towards Redundant Actions

Difficulties can arise with respect to the solutions. Similar to what might occur in a conventional CA scheme where numerical optimization methods are used, when two or more solutions have the same cost, or in the case of RL the same value, the solver [23] or agent might cycle between solutions. To avoid cycling and null-space intersections, numerical CA methods typically include minimization of some norm of the control effector deflections. A similar approach can be taken in the case of RL, so as to keep the policy from cycling between different actions which have similar value. This can be done by directly incorporating the effector utilization in the RL agent's reward function. Another option is to define a reward function incorporating a different objective, one which implies no simultaneous antagonistic use of effectors.

B. The Control Allocation Problem

For concerns of dimensionality, only the longitudinal motion is considered. As a result, all the effectors work symmetrically, and the maximum number of effectors is reduced to six. In order to solve the CA problem with RL, the action space is constructed similarly to the state space, and the different combinations of actions need to be considered explicitly. The former is common in RL, while the latter is less common; it does not cause additional complexity, but it does raise concerns towards the curse of dimensionality.

Solving with linear expressions of the control effectiveness causes problems in CA frameworks because effector interaction is not taken into account [11], and as such it was hypothesized that using linear function approximation in an RL framework would also not capture these characteristics. The discrete tabular form of RL needs to consider each combination of input settings separately, which would be inconvenient if the underlying relationships were linear, but this inconvenience is partly diminished by the nonlinear character of the underlying dynamics.

C. Discrete State- and Action-Space Construction

Since a discrete version of RL is used, the state- and action-spaces used by the agent have to be discretized. The state-space is discretized according to a vector S_D = [S_1, S_2, . . . , S_i, S_k], where the number S_i indicates the number of discretization steps for the individual state component and i indicates the state component index. A state component describes one part of the state, for instance altitude, whereas the state comprises combinations of multiple state components such as pitch angle, angle of attack and altitude. The total size of the state-space is then given by S_N = \prod_{i=1}^{k} S_i. Typically, the number of steps within a state component is odd, so that the zero state component is captured within one state, whereas an even number of steps would result in the zero state component being observed by the agent in two states. Additionally, it might be beneficial to have a smaller state component step near zero (or at another point); for this, deadzones can be defined, which specify the width of the center (or start) state explicitly, after which the remaining state component steps are spaced linearly between the predefined state upper and lower limits. The action space is generated from predefined action limits and has equal step size for all action components. With the discretized state- and action-component spaces, the full discrete state- and action-spaces can be constructed. An individual state component has a column vector c_i of length S_i containing the discretization points for that state component. The state-space (and equivalently the action-space) may then be described using a matrix S_M = [c_1, c_2, . . . , c_i, c_k], composed of the component vectors. The baseline specifications for describing the state- and action-space can be seen in tables 3 and 4, respectively. In order to use a two-dimensional Q-table, for reasons of computational efficiency in MATLAB, the k-dimensional state-space S_M should be converted to a one-dimensional vector. This can be done using a Cartesian Product (CP), which then describes all the combinations of the individual state component discretization points in a list of S_N elements. The resulting Cartesian Set (CS) is efficient to use in MATLAB, as the continuous state x_c only has to be compared row-wise to the matrix S_M, returning k row indices, after which the row index in the Q-table is found by a summation of the row indices from S_M multiplied by their respective first-appearance row indices in the CS.
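Below is a sketch of how a symmetric state component could be discretized with an explicit deadzone around zero, reproducing the Δu and Δh vectors of table 3. The exact routine used in the paper (in particular for the asymmetric α component) is not published, so this spacing rule is an assumption.

```python
import numpy as np

def symmetric_component(limit, n_steps, deadzone):
    """Discretization points for a symmetric state component: a zero state,
    its neighbours at +/- deadzone, and the remainder spaced linearly out
    to +/- limit (n_steps must be odd so that zero is a single state)."""
    assert n_steps % 2 == 1, "odd step count keeps zero within one state"
    half = (n_steps - 1) // 2                       # points on one side of zero
    positive = np.linspace(deadzone, limit, half)   # deadzone ... limit
    return np.concatenate((-positive[::-1], [0.0], positive))

print(symmetric_component(100.0, 3, 20.0))    # Delta-u -> [-20, 0, 20]
print(symmetric_component(1000.0, 9, 100.0))  # Delta-h -> [-1000, ..., -100, 0, 100, ..., 1000]
```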

D. Q-Table Construction

The state-space is described by the CS S_C and the action space by A_C, which can both be considered one-dimensional representations of the k-dimensional state-space and the l-dimensional action-space, respectively. The Q-table is generated by letting the CS S_C span the rows and the CS A_C span the columns, thereby creating a Q-representation for the entire discretized state-action space.
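One way to realize the Cartesian-product lookup described above is sketched below, mapping a continuous state to a row index of the two-dimensional Q-table via nearest-neighbour matching and a mixed-radix combination of the component indices. The paper's MATLAB implementation is not reproduced here, and the ordering convention is an assumption.

```python
import numpy as np

def state_to_row(x, component_grids):
    """Map a continuous state vector x to the row index of a 2-D Q-table.

    component_grids : list of 1-D arrays with the discretization points of
                      each state component (the columns of S_M).
    Each component is matched to its nearest discretization point, and the
    per-component indices are combined as a mixed-radix (Cartesian) index.
    """
    sizes = [len(g) for g in component_grids]
    idx = [int(np.argmin(np.abs(g - xi))) for g, xi in zip(component_grids, x)]
    row = 0
    for i, s in zip(idx, sizes):
        row = row * s + i          # stride corresponds to first-appearance order
    return row

# Example with only the Delta-u and Delta-h grids of table 3 (other states omitted)
grids = [np.array([-20.0, 0.0, 20.0]),
         np.array([-1000.0, -700.0, -400.0, -100.0, 0.0, 100.0, 400.0, 700.0, 1000.0])]
S_N = 3 * 9
Q = np.zeros((S_N, 121))           # rows: discrete states, columns: discrete actions
print(state_to_row([5.0, -250.0], grids), Q.shape)
```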

E. Action Selection

Some modifications are made to the 'standard' ε-greedy policy [31]. If ε is scheduled, as it is in this simulation framework, the exploration occurring due to ε decreases rapidly as ε is decreased. This is because the likelihood of an unexplored action being taken is small; all actions have an equal chance of being performed and most actions have already been explored. The trajectories through the Q-space can also change significantly as ε is reduced; following a greedy policy brings the agent to different places in the Q-space than random actions do. The main point is that when only a small number of actions have never been taken on a certain state trajectory, the chance that the ε-greedy policy takes an action that has never been taken at all is small. Moreover, one would still prefer to investigate such an action so as to obtain a value estimate. Leaving ε at a moderate value (or reducing it very gradually) for extended periods can ensure that all actions on state trajectories caused by both random and purely greedy actions are eventually explored; however, in that case the exploration rate is very slow and is considered computationally inefficient. To help exploration on all trajectories (whether caused by only random actions, only greedy actions or a mix of both), an action which has never been taken in the perceived state is taken. If there is more than one action that has not been taken once, a random action from this subset of actions is selected. The action selection algorithm is detailed in algorithm 1. In this algorithm, the action-space A_u is a subset of the local (one-state) action space, containing only the actions which have not been taken (once) in that state, as recorded in the matrix E, which records the number of visits to each state-action pair.

Data: ε, Q(s_t, a), A, E(s_t, a), previous action, action-hold switch state, current and last discrete state
Determine the non-taken actions A_u = {a | E(s_t, a) < 1}
if the current discrete state equals the last discrete state and the action-hold switch is on then
    take the same action as the last action
else if A_u is not empty then
    take a random action from A_u
    set the action-hold switch to on
else if a uniform random number z ∈ [0, 1] is greater than ε then
    take the greedy action a = arg max_a Q(s_t, a)
    if ε > 0 then set the action-hold switch to on else set the action-hold switch to off
else
    take a random action a ∈ A
    set the action-hold switch to on
end

Algorithm 1: Modified ε-greedy action selection
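A code sketch of Algorithm 1 is given below, under the assumption that the visit-count matrix E and the action-hold switch are maintained by the caller; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def select_action(Q, E, s, s_last, a_last, hold, epsilon):
    """Modified epsilon-greedy selection of Algorithm 1.

    Returns (action, hold), where hold is the updated action-hold switch."""
    untaken = np.flatnonzero(E[s] < 1)               # actions never taken in state s
    if s == s_last and hold:
        return a_last, hold                          # keep holding the last action
    if untaken.size > 0:
        return int(rng.choice(untaken)), True        # force exploration of a new action
    if rng.random() > epsilon:
        return int(np.argmax(Q[s])), epsilon > 0     # greedy; hold only while still exploring
    return int(rng.integers(Q.shape[1])), True       # plain random action

# Minimal usage example
Q = np.zeros((5, 4))
E = np.zeros((5, 4))
a, hold = select_action(Q, E, s=2, s_last=2, a_last=0, hold=False, epsilon=0.99)
E[2, a] += 1
```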

F. Reward Functions

The most important element in RL is the choice of reward function; it is the only element that 'communicates' the objective to the agent. All reward functions used are of the form shown in equation 13. The reward function is positive semi-definite; if a measure M_i exceeds M_{i,max}, the simulation is interrupted, as the agent has violated its state-space limits, and the agent receives a negative-valued reward instead. M_i can be either a state or an action, and its maximum M_{i,max} is typically the same as the limit used for defining the agent's state- or action-space. In these equations, r is the reward per time step, \dot{R} is the predefined maximum reward per second, f_s the simulation frequency and W_i the measure's weight. An example reward function is given in equation 14, which corresponds to the reward function used in research case 2 in section VI.

r = \frac{\dot{R}}{f_s} - \frac{\dot{R}}{f_s \sum_{i=1}^{k} W_i} \sum_{i=1}^{k} W_i \left( \frac{M_i}{M_{i,max}} \right)^2 \qquad (13)

r = \frac{\dot{R}}{f_s} - \frac{\dot{R}}{f_s \left( W_h + W_{ELE} + W_{PF} \right)} \left( W_h \left( \frac{\Delta h}{\Delta h_{max}} \right)^2 + W_{ELE} \left( \frac{\delta_{ELE}}{\delta_{ELE,max}} \right)^2 + W_{PF} \left( \frac{\delta_{PF}}{\delta_{PF,max}} \right)^2 \right) \qquad (14)
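A sketch of the reward of equations (13) and (14), including the state-limit violation penalty described later in section VI, is given below; the weights, limits and break factor are placeholders.

```python
import numpy as np

def reward(measures, max_values, weights, R_dot=1.0, f_s=100.0, break_factor=10.0):
    """Positive semi-definite reward of equation (13).

    measures, max_values, weights : arrays of the measures M_i (states and/or
    actions), their limits M_i,max and their weights W_i.  A violated limit
    interrupts the episode and returns the negative break penalty instead.
    """
    m = np.asarray(measures, dtype=float)
    m_max = np.asarray(max_values, dtype=float)
    w = np.asarray(weights, dtype=float)
    if np.any(np.abs(m) > m_max):
        return -break_factor * R_dot, True            # punishment, episode ends
    penalty = np.sum(w * (m / m_max) ** 2) / np.sum(w)
    return R_dot / f_s * (1.0 - penalty), False

# Equation (14): altitude error, elevon and pitch-flap deflections (placeholder weights)
r, done = reward(measures=[250.0, 5.0, -10.0],
                 max_values=[1000.0, 30.0, 30.0],
                 weights=[1.0, 0.2, 0.2])
```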

The following section describes the simulation setup in which the algorithm discussed in this section is applied to the ICE aircraft model in simulation. In that section, measures such as ĥ in table 2 relate to equations 13 and 14, where ĥ ∝ (h/h_max)².

VI. Simulation Setup

The controller and dynamics are simulated using MATLAB. The dynamics are composed from the force and moment coefficients, calculated from the relevant states using a spline model [29], from which the forces and moments acting on the aircraft are derived (section III). These forces and moments are then used to calculate the motion of the aircraft using a general three degree of freedom (3DoF) model for the equations of motion. Atmospheric conditions are derived using the International Standard Atmosphere (ISA) model. These elements make up the dynamics simulation part of the program. A diagram showing the general simulation setup can be seen in figure 4.

A. Research Cases

Several research cases are envisioned to evaluate the capabilities of the Q-Learning Control Allocation (QLCA) algorithm, as depicted in table 2. The first two cases each consider only two effectors, which have little or no direct interaction; in the second case, however, the agent has to make a tradeoff on which to use and how. Cases three and four also consider the use of two effectors, but with direct interaction. The last three cases consider a far larger set of effectors, in this case both elevons, the pitch flaps, and the inboard and outboard leading edge flaps. Different reward functions are considered, reflecting the main objective of attaining the target altitude as well as reducing control effector utilization. Research case 7 considers reducing thrust (command) in addition to reducing effector utilization.

Table 2 Effector Configurations and Reward Functions

Case No. of Eff. Effectors Reward

RC1 2 Elevons + Pitch Flaps ĥ
RC2 2 Elevons + Pitch Flaps ĥ + û
RC3 2 Elevons + Outboard Leading Edge Flaps ĥ
RC4 2 Elevons + Outboard Leading Edge Flaps ĥ + û
RC5 4 Elevons + Pitch Flaps + Inboard + Outboard Leading Edge Flaps ĥ
RC6 4 Elevons + Pitch Flaps + Inboard + Outboard Leading Edge Flaps ĥ + û
RC7 4 Elevons + Pitch Flaps + Inboard + Outboard Leading Edge Flaps ĥ + û + T̂

B. Baseline Parameters

This section provides the baseline parameters for the simulations, shown in tables 3 and 4 for the state- and action-space, respectively. These parameters apply to research cases 1 and 2 as shown in table 2. The other research cases share the same state-space discretization settings, while the action-space parameters might vary, especially for research cases 5 through 7. The 'Deadzone' in tables 3 and 4 indicates the width of the zero state explicitly. For states with symmetrical limits (all but α) this means the perceived width of the zero center state is the 'Deadzone' width, and the nearest discretization points (one step lower and higher) are located at the positive and negative deadzone values. For instance, for the state Δu the state vector is defined as [−20, 0, 20]. The state α is asymmetrical and its state vector starts with the zero state, implying that the next state is located at twice the deadzone value; the state vector for α is then defined as [0, 2, 5.83, 9.67, 13.5, 17.33, 21.17].

Table 3 State-Space Discretization Specifications

State # Steps Si Limits Deadzone Unit

Δu 3 [−100, 100] 20 ft/s
α 7 [0, 25] 1 deg
θ 7 [−25, 25] 1 deg
q 7 [−45, 45] 1 deg/s
Δh 9 [−1000, 1000] 100 ft
S_N = 9261

Table 4 Action-Space Discretization Specifications

Effector # Steps Ai Limits Deadzone Unit

Elevons 11 [−30, 30] Default deg

Pitch Flap 11 [−30, 30] Default deg

A_N = 121

C. Hyper Parameter Scheduling

Often in RL, constant values are used for the hyperparameters α, γ and ε: the learning rate, discounting factor and random action parameter, respectively. During online learning and adaptation, using constant hyperparameters might be favorable to maintain a stable policy under stationary conditions while at the same time allowing enough adaptation power for non-stationary conditions, such as a change in the environment or to the aircraft itself. During offline learning, when the agent starts with no knowledge, the parameters α and ε can be set high to promote early exploration and value estimation. Then, as the agent explores and gains knowledge of the environment in its Q-table, the values can gradually be decreased so that only small changes are updated into the Q-table. The schedule for the hyperparameters can be seen in figure 5.

Fig. 5 Hyper parameter schedule
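Since figure 5 conveys the schedule only graphically, the sketch below shows one plausible realization: a piecewise-linear decay of α and ε over the episodes with a constant γ. The decay shape and the decay fraction are assumptions, not the paper's exact schedule.

```python
def hyperparameter_schedule(episode, n_episodes=5e5,
                            alpha0=0.8, eps0=0.99, decay_fraction=0.8):
    """Assumed piecewise-linear decay of the learning rate and the random
    action parameter: start values alpha0 and eps0, linear decay to zero over
    the first decay_fraction of the episodes, zero afterwards (gamma constant).
    This is an illustration, not the exact schedule of figure 5."""
    progress = min(episode / (decay_fraction * n_episodes), 1.0)
    alpha = alpha0 * (1.0 - progress)
    epsilon = eps0 * (1.0 - progress)
    return alpha, epsilon

print(hyperparameter_schedule(0), hyperparameter_schedule(2.0e5))
```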

D. Episode Simulation

The episode simulation loop is best described by figure 4. Every episode is initialized at a random state between given limits. An episode runs for a maximum of t_max = 60 s at a simulation frequency of f_s = 100 Hz, amounting to a maximum of 6000 time steps. The simulation is interrupted, and a punishment is given to the agent, when the state limits or the model are violated. This reward is then, as at any time step, backed up into the last state-action pair before reaching the violating state, so as to teach the agent not to violate those limits.
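An outline of this episodic loop, reusing the earlier sketches (state_to_row, select_action, q_update), is given below; env_reset, env_step and reward_fn are placeholders for the 3DoF dynamics and the reward of equation (13), not actual ICE-model code.

```python
# Outline of one learning episode (t_max = 60 s at f_s = 100 Hz gives 6000 steps).
def run_episode(Q, E, env_reset, env_step, reward_fn, grids,
                alpha, gamma, epsilon, n_steps=6000):
    x = env_reset()                                  # random initial altitude offset
    s = state_to_row(x, grids)
    s_last, a_last, hold = -1, 0, False
    for _ in range(n_steps):
        a, hold = select_action(Q, E, s, s_last, a_last, hold, epsilon)
        E[s, a] += 1
        x = env_step(x, a)                           # integrate the dynamics one step
        s_next = state_to_row(x, grids)
        r, violated = reward_fn(x)                   # reward, or break penalty on violation
        Q = q_update(Q, s, a, r, s_next, alpha, gamma)
        if violated:                                 # state limits exceeded: end the episode
            break
        s_last, a_last, s = s, a, s_next
    return Q, E
```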


1. Initialization

The initialization is performed by creating a random state between given limits. The random state only applies to the states the agent perceives (x_a = [Δu, α, θ, q, Δh]); the other states are derived from the agent state so as to obtain a correct physical state. Since the aircraft is unstable unless a trim input is found a priori, it is deemed challenging enough for the agent to only have a non-zero initialization for the altitude; the other states will depart anyway if no action is taken. The initialization takes place as a factor of the agent's state limits; for general learning purposes the lower bound is set at [0, 0, 0, 0, 0.1] and the upper bound at [0, 0, 0, 0, 0.5], resulting in an initialization point somewhere between −0.5·h_max and −0.1·h_max, or between 0.1·h_max and 0.5·h_max.

2. Thrust Controller

The thrust is controlled by a simple PID controller; the reference true airspeed is set at 800 ft/s. The proportional gain equals the mass of the aircraft, while the integral gain is set lower, at a value of 10. The derivative gain is zero. The PID controller output is saturated using limits of 0 and 10,000 lbf; anti-windup is also included, which saturates the integral control signal to the maximum thrust value.
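A sketch of such a thrust controller is shown below, with the proportional gain equal to the aircraft mass, an integral gain of 10, no derivative action, output saturation at [0, 10000] lbf and integrator clamping as anti-windup. The symmetric clamping and the mass value are assumptions.

```python
import numpy as np

class ThrustPID:
    """Airspeed-hold thrust controller as described in the text: Kp = aircraft
    mass, Ki = 10, Kd = 0, output saturated to [0, 10000] lbf, integral control
    signal clamped to the maximum thrust as anti-windup."""
    def __init__(self, mass, V_ref=800.0, Ki=10.0, T_max=10000.0, dt=0.01):
        self.Kp, self.Ki, self.T_max, self.dt = mass, Ki, T_max, dt
        self.V_ref, self.integral = V_ref, 0.0

    def command(self, V):
        error = self.V_ref - V
        # Anti-windup: keep Ki * integral within [-T_max, T_max]
        self.integral = float(np.clip(self.integral + error * self.dt,
                                      -self.T_max / self.Ki, self.T_max / self.Ki))
        T = self.Kp * error + self.Ki * self.integral
        return float(np.clip(T, 0.0, self.T_max))

pid = ThrustPID(mass=1085.0)        # mass in slugs, placeholder value
print(pid.command(V=790.0))
```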

3. State-Space and Model Violation Conditions

When the actor violates the state-space limits, the simulation is interrupted, the episode is ended and a new one is started. The limits for the state-space are shown in table 3. The actor cannot violate action-space limits, as it cannot choose an action that lies outside the action-space. When the simulation is interrupted and the episode is ended, the state-action pair causing the transition to the violating state is updated using a negative-definite reward. This negative reward equals the user-set reward per second \dot{R}, multiplied by a factor which is also user-defined. This factor is set to 10 in this simulation setup, meaning that the agent receives a reward equal to −10 times the normalized reward per second.

E. Convergence Criteria

While the maximum number of episodes is limited (typically 3·10⁵), it is possible that the agent has learned enough before reaching the maximum number of episodes. The criterion for convergence is the relative change in the value function, Σ|ΔQ| / Σ|Q|; when this value is below 10⁻⁵ for more than 10,000 episodes, the simulation is stopped and considered completed, as the agent is no longer attaining any extra information.
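The convergence check can be expressed compactly, as sketched below; the bookkeeping of the consecutive-episode streak is an implementation assumption.

```python
import numpy as np

def converged(Q_prev, Q, streak, tol=1e-5, patience=10_000):
    """Return (stop, new_streak): the relative value change sum|dQ| / sum|Q|
    must stay below tol for `patience` consecutive episodes before stopping."""
    rel_change = np.sum(np.abs(Q - Q_prev)) / max(np.sum(np.abs(Q)), 1e-12)
    streak = streak + 1 if rel_change < tol else 0
    return streak >= patience, streak
```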

F. Benchmarking Setup

In order to compare performance between configurations with different reward functions and effector configurations, they must be evaluated using the same initial conditions. To achieve this, a benchmark is set up. The different agents are all initialized at exactly the same conditions, after which their performance in terms of achieving the target altitude, as well as the Control Intensity (CI), is monitored. Under benchmarking conditions, the agents only evaluate the learned policy; no learning takes place and no random action is taken (corresponding to α = ε = 0). Only the policy π(s) = arg max_a Q(s, a) needs to be provided to the agent. Two benchmarks are performed: one in which the agent is initialized 500 ft above its target altitude and one in which it is initialized 500 ft below its target altitude. The subsequent section shows the results of the simulations described in this section.
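For benchmarking, only the greedy policy has to be extracted from the converged Q-table, for instance as in the small sketch below.

```python
import numpy as np

def greedy_policy(Q):
    """Fixed benchmark policy pi(s) = argmax_a Q(s, a); no learning (alpha = 0)
    and no exploration (epsilon = 0) during evaluation."""
    return np.argmax(Q, axis=1)          # one action index per discrete state
```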

VII. Results

This section presents the results of the simulations performed in this research. These results are divided into off-line learning results (A), benchmarking results (B) and a sensitivity analysis (C).

A. Off-line Learning Results

The offline learning results shown in this subsection illustrate the progress of the agent as it gains more experience through repeated exposure to the learning case. Figure 6 (top left) shows the development of the sum of the absolute values in the Q-table, Σ|Q|, for research cases 5 and 7. These curves indicate the change of value information in the agent's perception; if the value is constant, the agent has obtained a complete perception of the (stochastically) stationary environment. It can be seen that the development follows a similar trend, although the value is typically lower for research case 7, since it receives more punishments, resulting in lower experienced rewards. While changes can occur due to repeated visits to state-action pairs, the initial visit to a state-action pair is most important. To monitor this, the exploration graph is shown in the top right of figure 6, which shows the percentage of state-action pairs that have been visited at least once. Here, it can be seen that research case 7 explores more of the state space before converging to its stable Q-table; this can be attributed to the fact that its reward function produces different Q-space trajectories in the greedy learning phase.


Fig. 6 Learning development plots. Top left: value table development. Top right: Q-space exploration. Bottom left: Performance development. Bottom right: Completion development.

The lower right plot in figure 6 illustrates the progress of episode completion, while the bottom left plot shows the performance throughout the learning process. Since the maximum episode length is fixed for a given simulation, values are given as a percentage. The same holds for the performance: the maximum performance is determined by the choice of reward function, episode length and simulation-break punishment. The relation describing the relative performance is shown in equation 15, in which P_rel represents the relative performance, G_{t=t_end} the accumulated reward (return) at the end of the episode (which might be before t = t_max), \dot{R} the user-set reward per second and B the user-defined simulation-break penalty factor. The maximum performance possible depends on the initialization point; as the agent is initialized further from the optimal position, the maximum possible reward, and thus performance, decreases (the normalization above remains, though). Even though the maximum relative performance for research case 7 is less than that for research case 5, because of the additional terms in its reward function, it converges to a higher relative performance, while its completion also converges to 100%. Research case 5 therefore converges to a worse relative performance, and even to less performance towards the primary objective, while also not completing every episode after the Q-table had converged. This results from the ambiguity stemming from both state- and action-space discretizations as well as the multiplicity of solutions, causing convergence towards a sub-optimal policy. Benchmarking is performed in the next section to evaluate the agent's performance at a fixed initialization point for comparison purposes.

P_{rel}\,[\%] = 100 \cdot \frac{G_{t=t_{end}} - B\,\dot{R}}{\dot{R}\,t_{max} - B\,\dot{R}} \qquad (15)

Figure 7 compares the altitude time history of an agent that has not yet learned an optimal policy (left), namely the worst recorded performance during learning on a completed episode, to a stage in the learning process where the agent has a (near-)optimal policy (right), the best recorded performance. The dashed horizontal lines in these figures indicate the points within this state component where the agent's perception of the state changes; they are located halfway between the state component discretization points. It must be observed that the initialization points are not equal; there is about 100 ft difference. The earlier policy oscillates significantly more than the latter case, which only oscillates around the lower end of the zero state. While the earlier policy does finish the episode, it nearly violates the state limit of 1000 ft, which is reflected in its poor performance.


Fig. 7 Comparison of altitude time history during learning for a sub-optimal policy (left) and near-optimal policy (right), for research case 7

B. Benchmarking Results

After learning has completed, the converged Q-tables are used to form a fixed policy. Where the Q-table has a description of the value of each action in each state, the policy only contains the action with the highest value. The agent does not update the Q-table, nor does it take any random action. Different configurations are tested, with different reward functions and available effectors, as described previously in section VI. Figures 8 and 9 compare the benchmarking results for research cases 5 and 7, while the results comparing all research cases are shown in table 5. Figure 8 shows how two agents, trained using different reward functions, behave under benchmarking conditions. The horizontal dotted and dashed lines indicate the agent's perception of the respective state; when and only when a continuous state crosses one of these lines does the state that the agent perceives change. Research case 5 only considers altitude in its reward function, while research case 7 also considers effector utilization and thrust. While their behaviors are similar, some differences can be seen, as the latter research case allows for some 'relaxation' of the primary objective (attaining the target altitude). However, having a more 'restrictive' reward function can help the agent during learning to find a suitable policy more quickly. Also, when it does not exclusively oppose the objective, which thrust and effector utilization do not have to do, the secondary objective might help the agent progress toward the primary objective. This conclusion can be drawn from table 5; the inclusion of a secondary objective improves the agent's performance towards the primary objective, as is the case when comparing research case 2 to 1, and research case 5 to 6 and 7. This is due to the fact that the reward and resulting value of actions are more unique than when only a single objective is considered, resulting in less 'cycling', as discussed in section V. In the top left of figure 8 it can be seen that the agent of research case 7 does not have as much forward body velocity variation as its unpunished counterpart, research case 5; the variation that does take place is more gradual and has less high-frequency content. Similar behavior holds for the angle of attack, displayed in the top right of figure 8, where it can be observed that the agent of research case 7 has a reduced excursion magnitude as well as fewer high-frequency variations. While research case 5 also attempts to maintain constant values of the angle of attack at the discrete state transitions, research case 7 does not exhibit this behavior. This is also reflected in the middle left graph, where the pitch angle is compared: research case 7 attempts to remain between the zero state and one discrete state above it, while research case 5 travels through the zero state every oscillation, showing more effort when it reaches a discrete state boundary in an attempt to stay there, resulting in local high-frequency oscillations. This effect can be observed even more clearly in the middle right graph of figure 8, where it is clear that the agent of research case 7 does not display as high pitch rates; every high pitch rate it generates needs more effector input than a mild pitch rate, while a higher pitch rate also needs to be canceled by a higher effector input the moment it becomes undesirable. In the bottom left graph it can be seen that the agent of research case 7 passes the target altitude earlier in the benchmark than research case 5 and, after a little overshoot (which is positive towards its thrust objective, although somewhat negative towards its primary objective), continues to attempt to maintain the target altitude at the bottom side of the zero state, while research case 5 tends to leave the zero state at both the top and bottom sides. The tracking performance is thus better for research case 7, because its 'rise time' is shorter and its subsequent oscillation has a smaller amplitude. The most significant result can be seen in the bottom right graph of figure 8. The agent of research case 7 combines the objectives of attaining the target altitude and reducing thrust by performing a more aggressive pitch-down maneuver at the start of the benchmark, allowing it to reduce the throttle setting significantly during this time (t < 10 s). Also, by considering the effector inputs directly, it does not produce excess drag during this phase. The resulting upward excursion in thrust afterwards, caused by an aggressive PID controller, is still below the T_RMS of research case 5. During the rest of the benchmark the thrust oscillates at a far lower average than that of research case 5, resulting in a T_RMS reduction of 2783 lbf, or 35.7%.

Figure 9 shows the input time histories for the two agents under benchmarking conditions, downsampled to one-second bins from the original 1/100 s. Research case 5 shows significantly higher usage of the control effectors and also utilizes effectors antagonistically, producing excess drag; this does not correspond to a minimum effort to attain the primary objective. Research case 7 takes into account both effector utilization and thrust and as such avoids the antagonistic usage of control effectors, as that results in a larger total control effector utilization and more drag, and hence a higher commanded thrust. It must, however, be observed from table 5 that the CI does increase from case 6 to case 7, although the required thrust is still about 100 lbf lower. The deflections of the elevons and pitch flaps are not significantly higher, but those of the leading edge flaps, both inboard and outboard, are. A significant portion of the CI reduction seen in research cases 6 and 7 with respect to research case 5 is a result of the usage of the outboard leading edge flaps. The spline model [29] used in this research, in its current implementation, returns the same force and moment coefficients for negative deflections of the outboard leading edge flap as it does at zero deflection. As such, the agents of research cases 4, 6 and 7 do not use negative deflections of the outboard leading edge flap, as these provide the same performance towards the primary objective as zero deflection but do weigh into the reward when effector usage is punished.

Table 5 shows the quantitative benchmark results for all research cases. Research cases 1, 3 and 5 do not consider anything other than reaching the target altitude in their reward, while cases 2, 4, 6 and 7 also explicitly consider reducing effector utilization; research case 7 considers thrust as well in its reward function. Introducing the effector utilization into the reward function causes a significantly lower CI, by avoiding null-space intersections of antagonistic effector deflections. Another observation is that introducing the effector utilization does not degrade the performance towards the primary objective; this suggests that multiple solutions were possible to attain the objective. Consequently, the pursuit of this secondary objective does not result in relaxation or degradation of the primary objective. Logically, as the number of effectors increases (research cases 5, 6, 7), one would expect the CI to reduce, since more effectors are available and their summed maximum utilization also increases. However, since the ambiguity in solutions increases as more effectors are present, and the discretization of the action space is coarser when more effectors are considered, the CI for case 5 is among the highest where effector utilization is not considered in the reward function. In all cases, when the effector utilization is introduced into the reward function, the CI is significantly reduced. Also, the inclusion of the effector utilization reduces the performance towards the primary objective only slightly in one case, as might be observed when comparing research case 3 to 4; in research cases 2, 6 and 7 the inclusion even results in increased performance. It must also be observed that the avoidance of antagonistic simultaneous effector deflections results in significantly lower thrust required. This is caused directly by the reduction of drag, but also by emergent behavior stemming from the reward function definition; the inclusion of effector utilization causes far milder flight behavior, resulting in less drag stemming from the aircraft states.


Fig. 8 Benchmarking time histories for research cases RC5 and RC7. Top left: forward body velocity u [ft/s]. Top right: angle of attack [deg]. Middle left: pitch angle [deg]. Middle right: pitch rate q [deg/s]. Bottom left: altitude h [ft]. Bottom right: thrust [lbf].


Fig. 9 Benchmarking inputs. Top left: elevons deflection. Top right: pitch flaps deflection. Bottom left: inboard leading edge flaps deflection. Bottom right: outboard leading edge flaps deflection.

Table 5 Benchmarking results

Case | Effector Configuration | Reward Function | h_RMS [ft] | CI [%] | δ_ELE,RMS [deg] | δ_PF,RMS [deg] | δ_LEFi,RMS [deg] | δ_LEFo,RMS [deg] | T_RMS [lbf]
RC1  | ELE+PF                | ĥ               | 156.2 | 57.24 | 12.77 | 26.47 | -     | -     | 7381
RC2  | ELE+PF                | ĥ + û           | 148.4 | 32.97 | 6.432 | 17.46 | -     | -     | 6212
RC3  | ELE+LEFo              | ĥ               | 152.8 | 43.88 | 13.41 | -     | -     | 27.73 | 6686
RC4  | ELE+LEFo              | ĥ + û           | 160.9 | 20.70 | 7.963 | -     | -     | 12.25 | 5009
RC5  | ELE+PF+LEFi+LEFo      | ĥ               | 154.9 | 55.84 | 18.23 | 22.47 | 26.03 | 29.13 | 7791
RC6  | ELE+PF+LEFi+LEFo      | ĥ + û           | 144.1 | 18.95 | 7.341 | 9.954 | 18.39 | 9.978 | 5103
RC7  | ELE+PF+LEFi+LEFo      | ĥ + û + T̂       | 145.3 | 33.59 | 8.759 | 11.36 | 26.68 | 21.30 | 5008


C. Sensitivity Analysis

In order to investigate the effects of the different parameters on the algorithm's tracking performance and computational efficiency, a sensitivity analysis was performed in which the state- and action-space parameters are varied. Research case 2 is taken as the baseline of the sensitivity analysis (sensitivity case 0). The sensitivity research cases are shown in Table 6. In sensitivity case 1 the state-space resolution is increased, which results in a significant increase of the state-space size. In case 2 the action-space resolution is increased, while case 3 increases both the state- and action-space resolutions. Case 4 reduces the state-space resolution, and finally case 5 adds the vertical rate as a state.

Table 6 Sensitivity analysis research cases

Case    | States | State Disc.   | S_N    | Act. Disc. | A_N | Q-size
SC0/RC2 | 5      | [3 7 7 7 9]   | 9,261  | [11 11]    | 121 | 1,120,581
SC1     | 5      | [3 9 9 9 11]  | 24,057 | [11 11]    | 121 | 2,910,897
SC2     | 5      | [3 7 7 7 9]   | 9,261  | [15 15]    | 225 | 2,083,725
SC3     | 5      | [3 9 9 9 11]  | 24,057 | [15 15]    | 225 | 5,412,825
SC4     | 5      | [3 5 5 5 7]   | 2,625  | [11 11]    | 121 | 317,625
SC5     | 6      | [3 7 7 7 9 5] | 46,305 | [11 11]    | 121 | 5,602,905
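The S_N, A_N and Q-size columns follow directly from the per-dimension discretizations; a minimal sketch of that bookkeeping, using the Table 6 values, is shown below.

import math

def q_table_dimensions(state_disc, action_disc):
    """Number of discrete states, discrete actions and Q-table entries
    implied by the per-dimension discretization vectors."""
    s_n = math.prod(state_disc)    # e.g. 3*7*7*7*9 = 9,261 for SC0/RC2
    a_n = math.prod(action_disc)   # e.g. 11*11 = 121
    return s_n, a_n, s_n * a_n     # Q-size = S_N * A_N

print(q_table_dimensions([3, 7, 7, 7, 9], [11, 11]))     # (9261, 121, 1120581)
print(q_table_dimensions([3, 7, 7, 7, 9, 5], [11, 11]))  # SC5: (46305, 121, 5602905)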

The results of the sensitivity analysis can be seen in Table 7. The asterisk for cases 1 and 3 indicates that their offline learning simulations did not converge reliably. The results in this table are from benchmarking runs of the respective cases, with sensitivity case 0 being equal to research case 2, shown in the first row with non-delta values. Starting with the last column, which displays the Real-Time-Ratio (RTR), the simulation time divided by the computational time: no significant deviations occur, especially for sensitivity cases 1 and 4, and the increase in computational intensity for the other cases is not proportional to the increase in state- and action-space sizes; the decrease in RTR is far smaller than the increase in size, although time-to-learn is not considered here. This can be explained by two factors. First, the state-lookup algorithm used in this research is relatively insensitive to increases in the dimensionality of the state-action space or in its discretization fineness. Second, MATLAB allows for multithreaded mathematical operations, whose advantage only shows when the arrays involved grow significantly (the Q-table in sensitivity case 5 is five times larger than in case 0), so only a small loss of computational performance results (less than 10%). In the first column of Table 7 it can be observed that all cases decrease performance towards the primary objective. Sensitivity case 1 shows the least performance degradation while providing a significant decrease in CI. Sensitivity case 3 also sees a decrease in performance towards the primary objective, though smaller than in most other cases, while also improving the CI; again, it must be noted that this agent did not meet the desired convergence criteria. The decrease in performance for sensitivity case 2 can be explained by the fact that the reward function was not changed while the smallest non-zero action became smaller than before, which caused the actor to prefer the smallest input and disregard the primary objective too much. To mitigate this, the weight on effector usage can be increased. The decrease in performance for sensitivity case 4 can be attributed to its decreased resolution: although tracking performance is significantly worse (by about 30%), the CI is similar. This can be seen as the opposite of the effect experienced in case 2; the effector usage is similar, but tracking of the primary objective is complicated by the reduced state-space resolution. Finally, it was expected that including the vertical rate as a state would improve the agent's performance, since it would obtain direct knowledge of the rate of change of the reward. The opposite is true, and this agent shows the worst tracking performance of any research or sensitivity case. The agent already had an implicit perception of the vertical rate through the forward body velocity, angle of attack and pitch angle states, and a mismatch between the discretization points of those state components and those of the vertical rate introduces conflicting information. A state like the vertical rate is therefore best not included when it conflicts with other state-component discretizations, and in general redundant state components should be avoided.
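A common way to realize such a state lookup, whose cost grows with the number of state dimensions and bins per dimension rather than with the total number of discrete states, is a per-dimension nearest-bin search. The sketch below illustrates this idea only; it is not the MATLAB implementation used in this research, and the bin centers shown are placeholders.

import numpy as np

def state_index(state, bin_centers):
    """Map a continuous state vector to a tuple of per-dimension bin
    indices via nearest-neighbor lookup. The loop runs over dimensions,
    so the cost scales with the number of dimensions and bins per
    dimension, not with the product of all discretizations."""
    return tuple(int(np.argmin(np.abs(centers - x)))
                 for x, centers in zip(state, bin_centers))

# Placeholder bin centers for two of the state components:
bins = [np.linspace(-15.0, 10.0, 7),   # e.g. forward body velocity [ft/s]
        np.linspace(-5.0, 15.0, 7)]    # e.g. pitch angle [deg]
print(state_index([2.3, 4.9], bins))   # -> (4, 3)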

The simulation results from this section are generalized in the discussion in the next section, so as to assess their research value.


Table 7 Sensitivity analysis results

SC   | ∆h_RMS [ft] | ∆CI [%] | ∆δ_PF,RMS [deg] | ∆δ_ELE,RMS [deg] | ∆RTR [-]
SC0  | 148.4       | 39.97   | 6.432           | 17.46            | 155.8
SC1* | 5.945       | -11.80  | 0.0321          | -5.195           | -1.618
SC2  | 51.69       | -12.24  | 1.711           | -7.409           | -5.440
SC3* | 14.15       | -13.30  | -0.292          | -7.230           | -10.74
SC4  | 54.15       | -0.374  | 2.880           | -1.975           | -1.430
SC5  | 63.40       | 6.026   | 15.66           | -10.70           | -12.11

(The SC0 row lists absolute values; the remaining rows list differences with respect to SC0.)

VIII. Discussion

Past [9, 23] and recent [11, 13, 39, 40] research in the field of Control Allocation (CA) has proven successful in applications to overactuated systems such as the ICE aircraft. While Reinforcement Learning (RL) has been applied to aircraft control, especially with respect to reconfigurability [20], online model identification [41] and online flight control [33, 34, 42], the closest application to the CA problem has been in the form of neurodynamic aircraft (effector) models [21, 22].

The results in Section VII show a more direct application of RL to CA than previous research, using up to four effectors for controlling the altitude of the ICE aircraft, considering isolated longitudinal dynamics. The Q-learning algorithm as described in [31] is used, with a compatible discretized representation of the state- and action-space, and the knowledge of this state-action space is maintained in a Q-table. Five states are observed by the agent, which receives a reward for attaining the desired altitude and, optionally, penalties for effector utilization and commanded thrust. While this form of RL is able to learn without prior knowledge and is able to steer the aircraft towards the desired altitude, the state- and action-space discretization limits tracking performance and causes aggressive flight behavior that is far less practical than that of previous model-based methods [10, 11, 13]. However, since these methods are model-based, they are also sensitive to model inaccuracies and would need external model identification means to maintain adequate performance in case of damage or other perturbations. RL-based methods can learn and adapt from experience and are intrinsically model-independent, providing favorable flight control system characteristics in case of damage or other unforeseen circumstances. With future aircraft likely being remotely or autonomously controlled, this is an increasing necessity. Preliminary research on past experience with the ICE aircraft showed linear CA methods to be incapable of exploiting the full performance capabilities of the aircraft, and it was conceived that similar problems would arise when using linear function approximators in RL. Generalization through linearization certainly improves the agent's performance when considering linear models, as it can then capture the underlying dynamics.
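As a point of reference, a minimal sketch of a tabular, ε-greedy Q-learning update of this kind is given below; the table layout, hyperparameter values and random-number handling are illustrative assumptions rather than the exact implementation of this research.

import numpy as np

def epsilon_greedy(Q, s, epsilon=0.1, rng=np.random.default_rng()):
    """Select a random action with probability epsilon, otherwise the
    greedy action for the discrete state index tuple s."""
    n_actions = Q[s].shape[0]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.95):
    """One temporal-difference update of Q(s, a) towards the bootstrapped
    target r + gamma * max_a' Q(s', a')."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s + (a,)] += alpha * (td_target - Q[s + (a,)])

# The Q-table can be allocated from the discretizations in Table 6, e.g.:
# Q = np.zeros((3, 7, 7, 7, 9, 121))   # five state dimensions, 121 actions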

Instead, a discrete representation of the state- and action-space is used, which does not oversimplify or overlook the underlying dynamics when both the state and input dynamics are highly nonlinear. The performance is acceptable considering the state- and action-space discretization: in most cases the agent travels to, and remains at, its best perception of the desired location in the state-space. While it was conceived that any form of RL applied to the CA problem would suffer from the same 'similar cost' problem, resulting in cycling, as previously found in optimization-based CA methods [23], this effect was amplified by the coarse state-space discretization, as it blurs the distinction between the optimality of different solutions even further. As such, when the agent's perception of optimality is modified by introducing additional terms into its reward function, such as effector utilization and thrust, the ambiguity pertaining to the true value of different actions in a given discrete state is diminished, resulting in faster and more robust convergence of the agent's value estimation and consequently its policy. The RL-based controller in this research considers the problem of CA in a larger scheme, since it does not only consider the distribution of control effector utilization based on higher-level moment commands, which is the essential common element of previous CA methods [9], but also performs the higher-level control task, attaining a desired altitude, directly. The performance of this simple flight controller could be improved by using more advanced function approximators, especially to make the behavior less aggressive and to account for the actuator dynamics and rate limits. However, the current approach could also be used to investigate an application that is more analogous to CA: one in which the effector positions are considered as states (with the actuator dynamics treated as unknown dynamics, like the rest of the system) and a per-time-step deflection increment is commanded. Moreover, hybrid approaches to the CA problem can be considered [21, 22], possibly full-parallel implementations of other CA methods [11, 13] with full-scale neurodynamic flight control systems [19, 20]. However, currently the most realistic option would be to have a neurodynamic model of the (inverted) control effectiveness [22], which may be used in an inversion framework where most outer loops are model-independent and the outer-loop gains are tuned by an RL agent [37]. Following this discussion, the final conclusions are given in the next section.

IX. Conclusions

Future fighter aircraft will have a configuration similar to that of the Innovative Control Effectors (ICE) aircraft, as it is favorable with respect to low observability and reconfigurability. This results in improved survivability, while the significant redundancy in effectors allows for mission-specific control configurations. While established Control Allocation (CA) methods have been shown to provide a significant improvement towards the primary control objectives in such a configuration, they require all secondary objectives to be defined on the same timescale and in explicit terms of control effector utilization. Reinforcement Learning (RL) allows for pursuing more timescale-separated and abstract objectives, which might pertain to the mission objective.

The RL-based CA method presented in this research shows that it is possible to learn, without prior knowledge, to control the ICE aircraft's longitudinal dynamics so as to attain a desired altitude. This is shown using two or four symmetrical sets of control effectors, while also considering secondary objectives such as reducing control effector utilization, avoiding opposing deflections of control effectors that produce excess drag, and reducing engine thrust. Introducing these secondary objectives results in better convergence of the RL agent's value estimation and policy, and in reduced drag through the avoidance of antagonistic effector deflections. While the performance of the presented scheme is lower than that of model-based CA methods, stemming mostly from limitations in the function approximation, these limitations can be overcome in future research using continuous function approximators.

Future research will investigate shifting the focus towards the pure CA problem: finding control effector deflections that obey the moment commands of the motion control system. In that case, the effector positions should be included as states, so as to incorporate actuator dynamics as well as effector rate limitations. Furthermore, approaches in which RL is incorporated into more conventional CA frameworks, for instance using a neurodynamic approximation of the (inverse) control effectiveness, should be investigated, as should tuning outer-loop gains with RL methods and applying RL to command-shaping filters, so that model dependency can be avoided and robustness guaranteed.

References

[1] Stonier, R. A., “Stealth aircraft & technology from World War II to the Gulf. Part II. Applications and Design,” SAMPE Journal, Vol. 27, No. 4, 1991, pp. 9–18.

[2] Rich, B. R., and Janos, L., Skunk Works, Little, Brown, Company, New York, New York, USA, 1994.

[3] Kamran, A., “Stealth Considerations for Aerodynamic Design of Missiles,” CADDM, Vol. 19, No. October 2015, 2009, p. 9.

[4] Ball, R. E., The fundamentals of aircraft combat survivability analysis and design, American Institute of Aeronautics and Astronautics, New York, 2003. doi:10.2514/4.862519.

[5] Brown, A. C., “Fundamentals of low radar cross-sectional aircraft design,” Journal of Aircraft, Vol. 30, No. 3, 1993, pp. 289–290. doi:10.2514/3.46331.

[6] Dorsett, K. M., Fears, S. P., and Houlden, H. P., “Innovative Control Effectors (ICE) Phase II,” Tech. Rep. WL-TR-96-3043, Flight Dynamics Directorate, Wright Laboratory, Wright-Patterson AFB, Ohio, 1997.

[7] Abzug, M. J., and Larrabee, E. E., Airplane stability and control: a history of the technologies that made aviation possible, Cambridge University Press, New York, New York, USA, 2002.

[8] Enns, D. F., “Control Allocation Approaches,” Guidance, Navigation, and Control Conference and Exhibit, American Institute of Aeronautics and Astronautics, Reston, Virginia, 1998, pp. 98–108. doi:10.2514/6.1998-4109.

[9] Johansen, T. A., and Fossen, T. I., “Control allocation - A Survey,” Automatica, Vol. 49, No. 5, 2013, pp. 1087–1103. doi:10.1016/j.automatica.2013.01.035.

[10] Buffington, J. M., “Tailless Aircraft Control Allocation,” Guidance, Navigation, and Control Conference, American Institute of Aeronautics and Astronautics, Reston, Virginia, 1997, pp. 737–747. doi:10.2514/6.1997-3605.

[11] Matamoros, I., and De Visser, C. C., “Incremental Nonlinear Control Allocation for a Tailless Aircraft with Innovative Control Effectors,” to be presented at AIAA Scitech, Delft University of Technology, Kissimmee, FL, USA, 2018, p. 220.
