2018
Mehdi Khamassi, George Velentzas, Theodore Tsitsimis, Costas Tzafestas: "Robot fast adaptation to changes in human engagement during simulated dynamic social interaction with active exploration in parameterized reinforcement learning." Journal Article. IEEE Transactions on Cognitive and Developmental Systems, 10, pp. 881-893, IEEE, 2018. doi: 10.1109/TCDS.2018.2843122. [PDF] http://robotics.ntua.gr/wp-content/publications/Khamassi_TCDS2018.pdf

Abstract: Dynamic uncontrolled human-robot interactions (HRI) require robots to be able to adapt to changes in the human's behavior and intentions. Among relevant signals, non-verbal cues such as the human's gaze can provide the robot with important information about the human's current engagement in the task, and whether the robot should continue its current behavior or not. However, robot reinforcement learning (RL) abilities to adapt to these non-verbal cues are still underdeveloped. Here we propose an active exploration algorithm for RL during HRI where the reward function is the weighted sum of the human's current engagement and variations of this engagement. We use a parameterized action space where a meta-learning algorithm is applied to simultaneously tune the exploration in the discrete action space (e.g. moving an object) and in the space of continuous characteristics of movement (e.g. velocity, direction, strength, expressivity). We first show that this algorithm reaches state-of-the-art performance in the non-stationary multi-armed bandit paradigm. We then apply it to a simulated HRI task, and show that it outperforms continuous parameterized RL with either passive or active exploration based on different existing methods. We finally evaluate it in a more realistic version of the same HRI task, where a practical approach is followed to estimate human engagement through visual cues of the head pose. The algorithm can detect and adapt to perturbations in human engagement of different durations. Altogether, these results suggest a novel, efficient and robust framework for robot learning during dynamic HRI scenarios.
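The two ingredients the abstract names are compact enough to sketch. Below is a minimal illustration in Python/NumPy, assuming an engagement signal in [0, 1]: a reward that is a weighted sum of current engagement and its variation, and parameterized action selection combining a Boltzmann softmax over discrete actions with Gaussian exploration over each action's continuous parameter. The function names, weights, and toy values are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def engagement_reward(e_t, e_prev, w1=0.5, w2=0.5):
    """Weighted sum of current engagement and its variation, so that
    a low but increasing engagement is still rewarding (weights assumed)."""
    return w1 * e_t + w2 * (e_t - e_prev)

def select_parameterized_action(q_values, theta_means, beta, sigma):
    """Boltzmann softmax over discrete actions (inverse temperature beta),
    plus Gaussian exploration around each action's continuous parameter."""
    prefs = beta * (q_values - q_values.max())   # subtract max for stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    a = rng.choice(len(q_values), p=probs)       # discrete action (e.g. move an object)
    theta = rng.normal(theta_means[a], sigma)    # continuous parameter (e.g. velocity)
    return int(a), theta

# Toy usage: three discrete actions, one continuous parameter each.
q = np.array([0.2, 0.8, 0.1])
means = np.array([0.5, 1.0, 0.3])
a, theta = select_parameterized_action(q, means, beta=5.0, sigma=0.2)
r = engagement_reward(e_t=0.4, e_prev=0.3)
```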
2017
G. Velentzas, C. Tzafestas, M. Khamassi: "Bio-inspired meta-learning for active exploration during non-stationary multi-armed bandit tasks." Conference. Proc. IEEE Intelligent Systems Conference, London, UK, 2017. doi: 10.1109/IntelliSys.2017.8324365. [PDF] http://robotics.ntua.gr/wp-content/publications/Velentzas_Intellisys2017.pdf

Abstract: Fast adaptation to changes in the environment requires agents (animals, robots and simulated artefacts) to be able to dynamically tune an exploration-exploitation trade-off during learning. This trade-off usually determines a fixed proportion of exploitative choices (i.e. choosing the action that subjectively appears best at a given moment) relative to exploratory choices (i.e. testing other actions that now appear worse but may turn out promising later). Rather than using a fixed proportion, non-stationary multi-armed bandit methods in the field of machine learning have proven that principles such as exploring actions that have not been tested for a long time can lead to performance closer to optimal (bounded regret). In parallel, research on active exploration in the fields of robot learning and computational neuroscience of learning and decision-making has proposed alternative solutions, such as transiently increasing exploration in response to drops in average performance, or attributing exploration bonuses specifically to actions associated with high uncertainty in order to gain information when choosing them. In this work, we compare different methods from machine learning, computational neuroscience and robot learning on a set of non-stationary stochastic multi-armed bandit tasks: abrupt shifts; the best bandit becoming the worst one and vice versa; multiple shifting frequencies. We find that different methods are appropriate in different scenarios. We propose a new hybrid method combining bio-inspired meta-learning, a Kalman filter and exploration bonuses, and show that it outperforms the other methods in these scenarios.
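The bio-inspired meta-learning idea referenced here can be sketched as a softmax bandit whose inverse temperature is itself learned: exploitation increases while recent rewards beat a long-term baseline, and exploration increases after a performance drop. This is a hedged sketch under those assumptions; the class name, learning rates and bounds are illustrative, not the paper's tuned values.

```python
import numpy as np

class MetaSoftmaxBandit:
    """Softmax bandit with meta-learned inverse temperature beta,
    driven by the gap between short- and long-term reward averages."""

    def __init__(self, n_arms, alpha=0.1, a_short=0.3, a_long=0.02,
                 eta=2.0, beta0=1.0, seed=0):
        self.q = np.zeros(n_arms)   # action-value estimates
        self.r_short = 0.0          # short-term reward running average
        self.r_long = 0.0           # long-term reward running average
        self.alpha, self.a_short, self.a_long = alpha, a_short, a_long
        self.eta, self.beta = eta, beta0
        self.rng = np.random.default_rng(seed)

    def act(self):
        prefs = self.beta * (self.q - self.q.max())
        p = np.exp(prefs) / np.exp(prefs).sum()
        return int(self.rng.choice(len(self.q), p=p))

    def update(self, a, r):
        self.q[a] += self.alpha * (r - self.q[a])
        self.r_short += self.a_short * (r - self.r_short)
        self.r_long += self.a_long * (r - self.r_long)
        # Meta-learning: exploit more when recent rewards beat the
        # long-term baseline, explore more after a performance drop.
        self.beta = max(0.1, self.beta + self.eta * (self.r_short - self.r_long))
```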
Theodore Tsitsimis, George Velentzas, Mehdi Khamassi, Costas Tzafestas: "Online adaptation to human engagement perturbations in simulated human-robot interaction using hybrid reinforcement learning." Conference. Proc. of the 25th European Signal Processing Conference - Workshop "MultiLearn 2017: Multimodal processing, modeling and learning for human-computer/robot interaction applications", Kos, Greece, 2017. [PDF] http://robotics.ntua.gr/wp-content/uploads/sites/2/MultiLearn2017.pdf

Abstract: Dynamic uncontrolled human-robot interaction requires robots to be able to adapt to changes in the human's behavior and intentions. Among relevant signals, non-verbal cues such as the human's gaze can provide the robot with important information about the human's current engagement in the task, and whether the robot should continue its current behavior or not. In a previous work [1] we proposed an active exploration algorithm for reinforcement learning where the reward function is the weighted sum of the human's current engagement and variations of this engagement (so that a low but increasing engagement is rewarding). We used a structured (parameterized) continuous action space where a meta-learning algorithm is applied to simultaneously tune the exploration in discrete and continuous action space, enabling the robot to learn which discrete action is expected by the human (e.g. moving an object) and with which velocity of movement. In this paper we demonstrate the performance of the algorithm on a simulated human-robot interaction task where a practical approach is followed to estimate human engagement through visual cues of the head pose. We then measure the adaptation of the algorithm to engagement perturbations, simulated as changes in the optimal action parameter, and quantify its performance for variations in perturbation duration and measurement noise.
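One way to make the "engagement from head-pose cues" step concrete is to map head orientation to a score that peaks when the human faces the robot. The sketch below is a minimal illustration of that idea; this particular cosine mapping, the noise model, and the function name are assumptions for illustration, not the authors' exact estimator.

```python
import numpy as np

def engagement_from_head_pose(yaw, pitch, noise_std=0.05, rng=None):
    """Map head-pose angles (radians, 0 = facing the robot) to an
    engagement score in [0, 1], with additive measurement noise."""
    rng = rng or np.random.default_rng()
    score = np.cos(yaw) * np.cos(pitch)   # 1.0 when looking straight at the robot
    score += rng.normal(0.0, noise_std)   # simulated measurement noise
    return float(np.clip(score, 0.0, 1.0))

# Simulated perturbation: the human transiently looks away.
engaged = engagement_from_head_pose(yaw=0.1, pitch=0.0)
distracted = engagement_from_head_pose(yaw=1.2, pitch=0.3)
```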
Mehdi Khamassi, George Velentzas, Theodore Tsitsimis, Costas Tzafestas: "Active exploration and parameterized reinforcement learning applied to a simulated human-robot interaction task." Conference. Proc. 1st IEEE International Conference on Robotic Computing (IRC 2017), Taichung, Taiwan, pp. 28-35, 2017. ISBN: 9781509067237. doi: 10.1109/IRC.2017.33. [PDF] http://robotics.ntua.gr/wp-content/publications/khamassi_IRC2017.pdf

Abstract: Online model-free reinforcement learning (RL) methods with continuous actions play a prominent role in real-world applications such as robotics. However, when confronted with non-stationary environments, these methods crucially rely on an exploration-exploitation trade-off which is rarely adjusted dynamically and automatically in response to changes in the environment. Here we propose an active exploration algorithm for RL in a structured (parameterized) continuous action space. This framework deals with a set of discrete actions, each of which is parameterized with continuous variables. Discrete exploration is controlled through a Boltzmann softmax function with an inverse temperature parameter β. In parallel, Gaussian exploration is applied to the continuous action parameters. We apply a meta-learning algorithm, based on the comparison between variations of short-term and long-term reward running averages, to simultaneously tune β and the width of the Gaussian distribution from which continuous action parameters are drawn. We first show that this algorithm reaches state-of-the-art performance in the non-stationary multi-armed bandit paradigm, while also being generalizable to continuous actions and multi-step tasks. We then apply it to a simulated human-robot interaction task, and show that it outperforms continuous parameterized RL both without active exploration and with active exploration based on uncertainty variations measured by a Kalman-Q-learning algorithm.
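The joint tuning step described in this abstract can be sketched as a single update: the gap between short- and long-term reward running averages simultaneously raises β (sharper softmax, less discrete exploration) and shrinks the Gaussian width σ (less continuous exploration) when performance improves, and does the reverse when it drops. The gains and bounds below are illustrative assumptions, not the paper's values.

```python
import numpy as np

def meta_tune(beta, sigma, r_short, r_long, eta_b=2.0, eta_s=0.5,
              beta_min=0.1, sigma_min=0.05, sigma_max=1.0):
    """One meta-learning step coupling discrete exploration (beta)
    and continuous exploration (sigma) to recent reward trends."""
    gap = r_short - r_long                 # >0: performance improving
    beta = max(beta_min, beta + eta_b * gap)
    sigma = float(np.clip(sigma - eta_s * gap, sigma_min, sigma_max))
    return beta, sigma

# Usage: performance improved, so exploit more and narrow exploration.
beta, sigma = meta_tune(beta=1.0, sigma=0.5, r_short=0.7, r_long=0.4)
```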
G. Velentzas, C. Tzafestas, M. Khamassi: "Bridging Computational Neuroscience and Machine Learning on Non-Stationary Multi-Armed Bandits." Miscellaneous. bioRxiv 117598, 2017. doi: 10.1101/117598. [PDF] http://robotics.ntua.gr/wp-content/publications/Velentzas_RLDM2017.pdf

Abstract: Fast adaptation to changes in the environment requires both natural and artificial agents to be able to dynamically tune an exploration-exploitation trade-off during learning. This trade-off usually determines a fixed proportion of exploitative choices (i.e. choosing the action that subjectively appears best at a given moment) relative to exploratory choices (i.e. testing other actions that now appear worse but may turn out promising later). The problem of finding an efficient exploration-exploitation trade-off has been well studied in both the machine learning and computational neuroscience fields. Rather than using a fixed proportion, non-stationary multi-armed bandit methods in the former have proven that principles such as exploring actions that have not been tested for a long time can lead to performance closer to optimal (bounded regret). In parallel, research in the latter has investigated solutions such as progressively increasing exploitation in response to improvements in performance, transiently increasing exploration in response to drops in average performance, or attributing exploration bonuses specifically to actions associated with high uncertainty in order to gain information when performing these actions. In this work, we first try to bridge some of these different methods from the two research fields by rewriting their decision processes in a common formalism. We then present numerical simulations of a hybrid algorithm combining bio-inspired meta-learning, a Kalman filter and exploration bonuses, compared to several state-of-the-art alternatives on a set of non-stationary stochastic multi-armed bandit tasks. While we find that different methods are appropriate in different scenarios, the hybrid algorithm displays a good combination of advantages from the different methods and outperforms them in the studied scenarios.
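The Kalman-filter-plus-bonus component of such a hybrid can be sketched as follows: each arm's payoff is tracked by a 1-D Kalman filter whose process noise lets estimates drift (handling non-stationarity), and actions are chosen on value plus a bonus proportional to the posterior standard deviation. The full hybrid in the paper also meta-learns the exploration rate; this sketch keeps only the filter and bonus, and all constants and names are illustrative assumptions.

```python
import numpy as np

class KalmanBonusBandit:
    """Per-arm 1-D Kalman filter with an uncertainty exploration bonus."""

    def __init__(self, n_arms, obs_var=1.0, process_var=0.01, phi=2.0):
        self.mu = np.zeros(n_arms)    # posterior means of arm payoffs
        self.var = np.ones(n_arms)    # posterior variances
        self.obs_var, self.process_var, self.phi = obs_var, process_var, phi

    def act(self):
        # Greedy on value plus uncertainty bonus (UCB-like).
        return int(np.argmax(self.mu + self.phi * np.sqrt(self.var)))

    def update(self, a, r):
        self.var += self.process_var                      # all arms may drift
        k = self.var[a] / (self.var[a] + self.obs_var)    # Kalman gain
        self.mu[a] += k * (r - self.mu[a])                # correct chosen arm
        self.var[a] *= (1.0 - k)                          # shrink its variance
```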
2010
John N. Karigiannis, Theodoros I. Rekatsinas, Costas S. Tzafestas: "Hierarchical Multi-Agent Architecture employing TD(λ) Learning with Function Approximators for Robot Skill Acquisition." Conference, 2010.
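For reference, the TD(λ) learning with function approximators named in this title is, in its standard textbook form, a single update over weights and eligibility traces. The sketch below shows that generic form with linear function approximation and accumulating traces; it is not taken from the cited paper, whose details are unavailable here, and all parameter values are illustrative.

```python
import numpy as np

def td_lambda_step(w, z, phi_s, phi_s_next, r, alpha=0.05,
                   gamma=0.95, lam=0.9, done=False):
    """One TD(lambda) step with linear value function v(s) = w . phi(s).
    w: weight vector, z: eligibility trace, phi_*: feature vectors."""
    v_s = w @ phi_s
    v_next = 0.0 if done else w @ phi_s_next
    delta = r + gamma * v_next - v_s   # TD error
    z = gamma * lam * z + phi_s        # decay traces, accumulate current features
    w = w + alpha * delta * z          # credit recently visited features
    return w, z

# Toy usage with 4 features.
w, z = np.zeros(4), np.zeros(4)
w, z = td_lambda_step(w, z, np.array([1., 0, 0, 0]), np.array([0, 1., 0, 0]), r=1.0)
```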
Copyright Notice:
Some material presented is available for download to ensure timely dissemination of scholarly and technical work. Copyright and all rights therein are retained by authors or by other copyright holders. All persons copying this information are expected to adhere to the terms and constraints invoked by each author’s copyright. In most cases, these works may not be reposted without the explicit permission of the copyright holder.
The work already published by the IEEE is under its copyright. Personal use of such material is permitted. However, permission to reprint/republish the material for advertising or promotional purposes, or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of the work in other works must be obtained from the IEEE.