PhD Thesis proposition
Non-stationary and robust Reinforcement Learning methodologies for
Stefano Fortunati, EC, IPSA Paris/L2S (70% of supervision),
Alexandre Renaux, MCF (HDR), Université Paris-Saclay/L2S (Directeur de thèse, 30% of supervision).
École Doctorale de rattachement: Sciences et technologies de l'information et de la communication
(STIC) at Université Paris-Saclay.
EN: One of the underlying assumptions of Reinforcement Learning (RL) methodologies is the
stationarity of the environment embedding the agent. Specifically, the three main elements
characterizing a Markov Decision Process (MDP), i.e. the set of states, the set of actions and
the reward function, are assumed to be constant/invariant over time. In surveillance applications,
however, such an assumption is generally unrealistic since the environment (i.e. the area that the
agent has to monitor) is constantly changing. The amount and types of objects, or different
disturbance statistics, are just two examples of non-stationarity. The main aim of this project
is then the development of original RL schemes able to cope with time-dependent MDPs. This
challenging goal may bring significant benefits to many theoretical and applied AI subfields:
from statistical learning theory and non-stationary random processes and sets to Signal Processing
applications. The theoretical findings will be validated on an emerging crucial issue: the
detection of drones using massive antenna arrays.
FR (translated): One of the assumptions of reinforcement learning is the supposed stationarity of
the environment in which the agent evolves. Indeed, the three elements that characterize the Markov
decision process (MDP), i.e. the set of states, the set of actions and the reward, are assumed to
be constant over time. However, for surveillance applications, such an assumption is hardly
realistic, since the environment (i.e. the area monitored by the agent) is in perpetual evolution.
The number and type of objects to be monitored, as well as the statistical perturbations of the
environment, are two examples of non-stationarity. The main objective of this project will be the
development of original reinforcement learning methods taking into account MDPs that evolve over
time. Such methods will have an impact on theoretical and applied fields of artificial
intelligence: statistical learning, non-stationary random processes and sets, and statistical
signal processing. Our main application will concern a problem that is now of great importance:
the detection of drones using a large multi-antenna array.
Keywords: Markov decision process, Reinforcement Learning (RL), Multi-agent RL, Non-
Stationary Stochastic Learning, Robust Statistics.
1 Detailed description of the PhD project
1.1 Scientific context
Reinforcement Learning (RL) methodologies are currently adopted in different contexts requiring
sequential decision-making tasks under uncertainty [1]. The RL paradigm is based on the
perception-action cycle, characterized by the presence of an agent that senses and explores the
unknown environment, tracks the evolution of the system state and intelligently adapts its behavior
in order to fulfill a specific mission. This is accomplished through a sequence of actions
aiming at optimizing a pre-assigned performance metric (reward). There are countless applications
that can benefit from this perception-action cycle (traffic signal control, robots interacting
with physical objects, just to cite a few), each of which is characterized by a peculiar definition of
“uncertainty” or “unknown environment”. A more precise definition of this uncertainty strongly
depends on the particular domain considered. However, there is at least one crucial assumption
underlying the majority of classical RL algorithms: the stationarity of the environment, i.e. the
statistical and physical characterization of the scenario, is assumed to be time-invariant. This
is clearly a quite restrictive limitation in many real-world RL applications, where the agent is
usually embedded in a changing scenario whose statistical and physical characteristics may
both evolve over time. Due to the crucial importance of including non-stationarity in the
RL framework, both theoretical and application-oriented non-stationary approaches have been
proposed recently in the RL literature (e.g. [2,3]). Among the numerous potential applications,
in this project we will focus on the problem of Cognitive Radar (CR) detection in an unknown
and non-stationary environment. Specifically, building upon the previous works [4–6], we will
aim at proposing an RL-based algorithm for cognitive multi-target detection in the presence of
unknown, non-stationary disturbance statistics. The radar acts as an agent that continuously
senses the unknown environment (i.e., targets and disturbance) and consequently optimizes
transmitted waveforms in order to maximize the probability of detection (PD) by focusing the
energy in specific range-angle cells.
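The perception-action cycle just described can be sketched, in a deliberately simplified form, as a multi-armed bandit problem: the radar agent repeatedly chooses one of a few candidate waveforms (actions) and observes a binary reward (target detected or not). All numerical values below, such as the per-waveform detection probabilities and the exploration rate, are illustrative assumptions rather than part of the project:

```python
import numpy as np

rng = np.random.default_rng(0)
true_pd = np.array([0.2, 0.5, 0.8])  # assumed detection probabilities, unknown to the agent
q = np.zeros(3)                      # estimated value (average reward) of each waveform
counts = np.zeros(3)
eps = 0.1                            # exploration rate of the epsilon-greedy policy

for t in range(5000):
    # epsilon-greedy action selection: explore with probability eps, else exploit
    a = int(rng.integers(3)) if rng.random() < eps else int(np.argmax(q))
    reward = float(rng.random() < true_pd[a])  # 1 if the target is detected
    counts[a] += 1
    q[a] += (reward - q[a]) / counts[a]        # incremental sample-mean update

print(int(np.argmax(q)))  # the agent should settle on the best waveform (index 2)
```

Note that the sample-mean update above is exactly the stationary estimator whose failure under drifting statistics motivates the non-stationary extensions targeted by this project.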
1.2 Scientific goals
The scientific goal of the proposed PhD thesis is twofold. Firstly, the PhD candidate will
get familiar with, and develop, original RL-based algorithms for non-stationary environments. These
theoretical outcomes will be then applied to a specific scenario of great interest nowadays: the
radar detection of drones. More specifically, the PhD thesis will be structured in two steps:
1. Theoretical foundation of non-stationary RL algorithms
The aim of this first step is to develop an original theoretical foundation of non-stationary
Markov Decision Processes (MDPs). In particular, the candidate will investigate the possibility
of generalizing classical RL methodologies to MDPs characterized by time-varying
sets of states, actions and reward functions. This non-stationary generalization is of crucial
importance for a wide variety of applications and is an almost unexplored research field.
2. Surveillance applications and drone detection
The theoretical results obtained in the first part of the PhD thesis will then be used to
derive and implement new algorithms for drone detection and tracking using radar systems
[4–6]. Specifically, we will consider a co-located Multiple-Input-Multiple-Output (MIMO)
radar with a large (“massive”) number of transmitters and receivers. It has been shown,
in fact, that this massive MIMO configuration allows one to dispense with unrealistic
assumptions about the a-priori knowledge of the statistical model of the disturbance.
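As a toy numerical illustration of the non-stationarity targeted in step 1 (and not of the algorithms to be developed, which concern time-varying state and action sets as well), consider a two-action problem whose reward probabilities swap abruptly halfway through the run; a constant step size, which exponentially forgets old rewards, lets the value estimates track the change. All numbers below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.array([0.9, 0.1])  # assumed reward probabilities; action 0 is initially best
q = np.zeros(2)           # value estimates
alpha, eps = 0.1, 0.1     # constant step size and exploration rate

for t in range(4000):
    if t == 2000:
        p = p[::-1]       # abrupt environment change: action 1 becomes best
    a = int(rng.integers(2)) if rng.random() < eps else int(np.argmax(q))
    r = float(rng.random() < p[a])
    q[a] += alpha * (r - q[a])  # exponential forgetting of old rewards

print(int(np.argmax(q)))  # after the change, the agent prefers action 1
```

Had the update used the sample mean 1/counts[a] instead of the constant alpha, the estimates would react to the change only very slowly, since every old reward would keep the same weight as a new one.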
1.3 Expected impact
The generalization of current RL algorithms to non-stationary environments is of central
importance for many real-world applications. Intelligent traffic control and the pilotage of robots
and drones for rescue missions are two classical examples of non-stationary environments that
may cause a violation of the basic assumptions underlying current RL methodologies. The
first expected outcome of this project is to provide a rigorous mathematical framework
general enough to allow independent researchers to develop new and original non-stationary
RL algorithms in different fields. Moreover,
the specific application proposed in the project, i.e. a non-stationary RL algorithm for target
detection, can potentially represent a breakthrough in the radar signal processing community.
In fact, such an algorithm will be able to provide optimal performance (in terms of probability
of false alarm and probability of detection) without the need for any a-priori information on
the disturbance statistical model, unlike the classical likelihood ratio-based detection schemes
commonly used in radar applications for the past 40 years.
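For concreteness, the probability-of-false-alarm / probability-of-detection trade-off mentioned above can be illustrated with a classical energy detector under an assumed Gaussian disturbance, i.e. exactly the kind of a-priori statistical model the project aims to dispense with. The sample size, signal amplitude and target false-alarm rate below are arbitrary assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials, pfa_target = 16, 20000, 1e-2

# H0: disturbance only; the detector threshold is set empirically
# so that the false-alarm rate matches pfa_target
noise = rng.standard_normal((trials, n))
stat_h0 = (noise ** 2).sum(axis=1)           # energy statistic under H0
thr = np.quantile(stat_h0, 1 - pfa_target)   # detection threshold

# H1: an assumed constant signal embedded in the same disturbance
data = 1.0 + rng.standard_normal((trials, n))
stat_h1 = (data ** 2).sum(axis=1)            # energy statistic under H1
pd = float((stat_h1 > thr).mean())           # empirical probability of detection
pfa = float((stat_h0 > thr).mean())          # empirical probability of false alarm

print(round(pfa, 3), round(pd, 2))
```

If the Gaussian assumption is violated (e.g. heavy-tailed clutter), the threshold computed this way no longer guarantees the nominal false-alarm rate, which is precisely the motivation for model-free, RL-based detection.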
1.4 Organization of the thesis work
The three-year research program will be organized as follows:
1. First step (4 months). The first part of the research project will be dedicated to a review
of the existing literature on stationary RL methodologies in order to formalize the mathematical
framework underlying classical stationary learning methodologies. Specifically,
the attention will be focused on the formal definition of Markov Decision Process and on
the related concepts of set of actions, set of states and reward function [2,7] from a statistical
point of view. This will allow us to better clarify how non-stationarity may be
introduced into the overall framework.
2. Second step (6 months). After an in-depth analysis of the state of the art on these topics, we
will formally define what we will consider as non-stationary environments. Having in mind
the application to target detection that will be developed later, we will characterize the
non-stationarity with both physical abrupt changes in the scenario (e.g. objects appearing
and disappearing from the field of view) and statistical evolution of the environmental
data (e.g. changes in the data time correlation from a learning cycle to the next one).
3. Third step (10 months). After having posed the mathematical foundation of the main
framework, we will move to the core of the proposed project: the development of learning
methodologies for non-stationary environments. Original RL strategies will be proposed and
their effectiveness and robustness to non-stationarity assessed through extensive simulations.
4. Fourth step (10 months). The theoretical outcomes obtained in the previous steps will be
exploited to derive and implement original RL-based detection algorithms for surveillance
applications. Specifically, following [5,6], the abstract notions of agent, actions, states
and reward will be specialized to the radar detection framework. Moreover, the non-stationarity
of the environment assumes here a precise characterization: the number of
targets/sources to be detected changes over time, along with the statistical characterization
of the disturbance.
5. Fifth step (6 months). Finally, the obtained results will be applied to a specific scenario of
great interest nowadays: the detection of drones. Due to its high degree of non-stationarity and
physical variability, this is a challenging problem that cannot be addressed with classical,
time-invariant, RL algorithms.
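As a purely illustrative, hypothetical sketch of the specialization outlined in step 4 (none of the names or numbers below come from [5,6]): the state is a per-cell occupancy belief over range-angle cells, the action allocates transmit energy across the cells, the reward counts detections, and the set of targets changes abruptly partway through the run:

```python
import numpy as np

rng = np.random.default_rng(3)
n_cells, horizon, alpha = 8, 3000, 0.05
occupancy = np.zeros(n_cells, dtype=bool)
occupancy[[1, 5]] = True            # two initial targets (assumed positions)
belief = np.full(n_cells, 0.5)      # state: per-cell occupancy belief

for t in range(horizon):
    if t == 1500:                   # non-stationarity: the target set changes
        occupancy[:] = False
        occupancy[[2, 6, 7]] = True
    energy = belief / belief.sum()  # action: focus energy where targets are believed to be
    # detection in a cell is more likely the more energy is focused there
    p_det = occupancy * (1.0 - np.exp(-20.0 * energy))
    detections = rng.random(n_cells) < p_det     # reward: detections in each cell
    # track the occupancy over time, keeping a small exploration floor
    belief = np.clip(belief + alpha * (detections - belief), 0.05, 1.0)

print(sorted(np.argsort(belief)[-3:].tolist()))  # cells believed occupied after the change
```

The exploration floor on the belief is essential in this sketch: without it, cells that appear empty would receive no energy, and newly appearing targets could never be re-acquired.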
1.5 Expected results and perspectives in research and applications
The aim of this PhD project is twofold. From the theoretical research side, we aim at developing
advanced RL methodologies able to handle the non-stationarity of the environment to
be explored by the agent. Since the vast majority of real-world scenario is affected by some
sort of non-stationarity, this original line of research may pave the way to countless practical
exploitations (intelligent traffic control, pilotage of robots in harsh conditions, and so on) as well
as further theoretical investigation (statistical optimality condition under high non-stationarity,
data-driven selection of the hyper-parameters used in the RL algorithm). From the application
side, the derivation of a cognitive detection algorithm for radars may be of interest to integrate
with new functionalities existing surveillance systems. As research products, during the duration
of the PhD scholarship, we plan to publish at least one journal paper and two conference papers
in Machine Learning (ML) and related field where our theoretical findings will be discussed and
presented to the ML community. Moreover, the original application of the theoretical outcomes
to the detection problem will be disseminated in the Signal Processing and Radar community
through another journal paper and three additional conferences.
1.6 Hosting laboratory
The PhD thesis will be developed at the L2S Laboratory (Laboratoire des signaux et systèmes,
UMR8506), where the candidate will join the Modeling and Estimation Group (GME) in the
Signals and Statistics group. An office and a PC will be made available to the candidate. It
is important to underline that this PhD thesis is fully in line with an active international
collaboration already established among the L2S (France), the University of Pisa (Italy) and
the Ruhr-University Bochum (Germany). Depending on the availability of funds, the candidate
will have the opportunity to visit these two partner institutions for short research periods (from 2 up
to 6 months).
References
[1] R. S. Sutton and A. G. Barto, Reinforcement Learning: An Introduction. The MIT Press,
second ed., 2018.
[2] E. Lecarpentier and E. Rachelson, “Non-stationary Markov decision processes, a worst-case
approach using model-based reinforcement learning,” Advances in Neural Information Processing
Systems, vol. 32, 2019.
[3] S. Padakandla, K. J. Prabuchandran, and S. Bhatnagar, “Reinforcement learning algorithm
for non-stationary environments,” Applied Intelligence, vol. 50, pp. 3590–3606, 2020.
[4] S. Fortunati, L. Sanguinetti, F. Gini, M. S. Greco, and B. Himed, “Massive MIMO radar for
target detection,” IEEE Transactions on Signal Processing, vol. 68, pp. 859–871, 2020.
[5] A. M. Ahmed, A. A. Ahmad, S. Fortunati, A. Sezgin, M. S. Greco, and F. Gini, “A reinforcement
learning based approach for multitarget detection in massive MIMO radar,” IEEE
Transactions on Aerospace and Electronic Systems, vol. 57, no. 5, pp. 2622–2636, 2021.
[6] F. Lisi, S. Fortunati, M. S. Greco, and F. Gini, “Enhancement of a state-of-the-art RL-based
detection algorithm for Massive MIMO radars,” IEEE Transactions on Aerospace and
Electronic Systems (accepted), 2022.
[7] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John
Wiley & Sons, 2014.