Application of Biological Neural Networks in Reinforcement Learning Problems


The liquid state machine (LSM) is a mathematical model of a “liquid computer”. An LSM consists of a liquid that transforms the input time series u(t) into “liquid states” x(t). A memory-less readout function f maps the liquid state x(t) at time t to the output v(t) = f(x(t)). A randomly connected network of fairly biologically realistic model neurons was employed as the liquid. The readout function was implemented by another population of model neurons that received input from all the liquid neurons. During training for a particular task, only the synaptic strengths of the connections from the liquid to the readout are modified.
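
The division of labour described above (a fixed liquid, a trainable readout) can be illustrated with a minimal sketch. It is not the project's implementation: the liquid here is a small random network of leaky tanh units rather than realistic spiking neurons, the readout is linear and trained with the delta rule, and every name and parameter value is an assumption chosen for illustration.

```python
import math
import random

def make_liquid(n_units, seed=0):
    """Fixed random weights of the liquid; these are never trained."""
    rng = random.Random(seed)
    w_in = [rng.uniform(-1.0, 1.0) for _ in range(n_units)]
    w_rec = [[rng.uniform(-1.0, 1.0) / n_units for _ in range(n_units)]
             for _ in range(n_units)]
    return w_in, w_rec

def liquid_step(x, u, w_in, w_rec, leak=0.5):
    """Advance the liquid state x(t) by one step given the input u(t)."""
    new_x = []
    for i in range(len(x)):
        drive = w_in[i] * u + sum(w_rec[i][j] * x[j] for j in range(len(x)))
        new_x.append((1.0 - leak) * x[i] + leak * math.tanh(drive))
    return new_x

def readout(x, w_out):
    """Memory-less readout v(t) = f(x(t)); here f is a linear map (assumption)."""
    return sum(w * xi for w, xi in zip(w_out, x))

def train_readout(inputs, targets, n_units=20, epochs=200, lr=0.05):
    """Delta-rule training that changes only the liquid-to-readout weights."""
    w_in, w_rec = make_liquid(n_units)
    w_out = [0.0] * n_units
    for _ in range(epochs):
        x = [0.0] * n_units                      # reset the liquid each epoch
        for u, target in zip(inputs, targets):
            x = liquid_step(x, u, w_in, w_rec)
            error = target - readout(x, w_out)
            w_out = [w + lr * error * xi for w, xi in zip(w_out, x)]
    return w_in, w_rec, w_out
```

Calling train_readout(inputs, targets) with two equally long lists of numbers returns the untouched liquid weights together with the learned readout weights, which makes it easy to check that training never modifies the liquid itself.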
Reinforcement Learning
Reinforcement learning is the problem faced by an agent that learns how to behave through trial-and-error interaction with a dynamic environment. At each time step t = 0, 1, 2, ..., the agent receives some representation of the environment’s state, s(t) ∈ S, where S is the set of possible states, and on that basis selects an action a(t) ∈ A, where A is the set of actions available in the state s(t). One time step later, in part as a consequence of its action, the agent receives a numerical reward, r(t+1), and finds itself in a new state s(t+1). The agent’s sole objective is to maximize the total reward it receives in the long run. Therefore, if we want the agent to do something for us, we must provide rewards in such a way that in maximizing them the agent will also achieve our goals: the rewards are our way of telling the agent what to do, not how to do it.
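
The interaction loop described above can be written down in a few lines. The sketch below is only an illustration: the function names and the toy "line world" environment are hypothetical and are not part of the project.

```python
def run_episode(env_step, policy, start_state, max_steps=100):
    """Run the agent-environment loop and return the total reward collected."""
    state = start_state
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(state)                          # a(t), chosen from s(t)
        state, reward, done = env_step(state, action)   # s(t+1) and r(t+1)
        total_reward += reward
        if done:
            break
    return total_reward

def line_world_step(state, action):
    """Hypothetical toy environment: walk along a line from cell 0 to cell 3."""
    next_state = max(0, state + action)   # action is -1 (left) or +1 (right)
    done = next_state == 3
    reward = 1.0 if done else 0.0         # reward says what to reach, not how
    return next_state, reward, done

def always_right(state):
    return +1

def always_left(state):
    return -1
```

Here run_episode(line_world_step, always_right, 0) collects the reward for reaching cell 3, while the always_left policy never does; the reward only specifies the goal and leaves the choice of actions to the agent.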


Our first application of reinforcement learning was the game Blackjack. Because the running time is very long, we implemented the game with the same rules, except that the highest allowed sum of cards is 9 (instead of 21), the pack contains only cards numbered 1-5, and a relative error was used. After 22,199 runs we finally obtained a stable policy: take another card only if the agent’s sum of cards is 1, 2, 3 or 4. When we gave a constant error, or used a bad teacher, the algorithm did not find the correct policy. When we compare our results with those of the SARSA algorithm from the article A. Perez-Uribe and E. Sanchez, “Blackjack as a Test Bed for Learning Strategies in Neural Networks”, we see a clear advantage for the SARSA algorithm. The reasons are:
1. In our algorithm there is no way of giving a positive reward.
2. In our algorithm, even once the right policy has been found, the agent will still receive negative reinforcement from time to time in Blackjack (sometimes the dealer gets lucky), and therefore the policy will ultimately change for the worse.
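
As a small sanity check on the reduced game, the sketch below computes the probability that one more card pushes the agent's sum above 9. It assumes that each card value 1-5 is equally likely on every draw, which is an assumption of this example and not a statement about how the pack was simulated in the project.

```python
from fractions import Fraction

TARGET = 9                 # highest allowed sum in the reduced game
CARDS = [1, 2, 3, 4, 5]    # card values in the reduced pack

def bust_probability(current_sum):
    """Probability that one more card pushes the sum above TARGET.

    Assumption: every card value is equally likely on each draw.
    """
    busting = [c for c in CARDS if current_sum + c > TARGET]
    return Fraction(len(busting), len(CARDS))

for s in range(1, TARGET + 1):
    print(s, bust_probability(s))
```

Under this assumption the probability is 0 for sums 1 to 4 and rises from 1/5 at a sum of 5 to 1 at a sum of 9. The sums at which the stable policy takes another card are therefore exactly those at which another card cannot exceed 9, which is at least consistent with an algorithm driven only by negative reinforcement.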

Our second application of reinforcement learning was a T-maze. In this game the agent has four possible actions: move North, East, South, or West. The agent must learn to move from the starting position at the beginning of the corridor to the T-junction. There it must move either North or South to a goal position that changes according to a “road sign” (the results are summarized in the figure). We also tried another version of the T-maze, in which the location of the goal was given only at the starting position.
To achieve convergence in a reasonable time and a reasonable number of time steps, we tried to simplify the task in two different ways (more details in the project report):
1. We tried to find inputs and outputs that would be easy to separate.
2. We tried to find an optimal way to give the errors, so that it would take longer for the synaptic weights to decrease significantly and the zero case would be avoided altogether, but our attempts were in vain.
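
The T-maze task described above can be sketched as a tiny environment. Everything specific in this sketch is an assumption made for illustration: the grid coordinates, the error values of 0 and 1, the rule that a wrong move leaves the agent in place, and the way the road sign is exposed to the agent are not taken from the project report.

```python
class TMaze:
    """Minimal T-maze: a corridor ending in a T-junction (illustrative only)."""

    MOVES = {"N": (0, 1), "E": (1, 0), "S": (0, -1), "W": (-1, 0)}

    def __init__(self, length, goal):
        if goal not in ("N", "S"):
            raise ValueError("goal must be 'N' or 'S'")
        self.length = length      # the junction sits at x == length
        self.goal = goal          # direction indicated by the road sign
        self.pos = (0, 0)         # start of the corridor

    def observe(self):
        """Return (x position, road sign) as the agent's input."""
        return self.pos[0], self.goal

    def step(self, action):
        """Apply a move and return (error, done); error 0.0 means no penalty."""
        x, y = self.pos
        dx, dy = self.MOVES[action]
        at_junction = x == self.length
        if at_junction and dx == 0:
            # Turning North or South at the junction ends the episode.
            self.pos = (x, y + dy)
            return (0.0 if action == self.goal else 1.0), True
        if not at_junction and action == "E":
            self.pos = (x + 1, y)
            return 0.0, False
        return 1.0, False         # any other move: stay in place, get an error
```

For example, with maze = TMaze(length=3, goal="N"), the action sequence E, E, E, N produces no error and ends the episode at the goal, whereas finishing with S ends the episode with an error of 1.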

If we compare our learning algorithm with others we can see some basic disadvantages:
1. The inability to give rewards.
2. Our learning algorithm is suitable only for applications in which the error is 0 once the correct policy has been achieved.
3. Its ability to learn is limited: after about 45,000 time steps with a large error, the difference in the synaptic weights is very large and therefore the policy is fixed, even if it is a bad policy that results in many errors.

But if we look at the big picture, there is a very optimistic side. We should keep in mind that the real purpose of our algorithm is to mimic the nervous system, and therefore it is NOT DESIGNED to make the same mistakes thousands of times. For that reason there is a need for a highly developed separation system that can easily distinguish between different inputs. Each of us has six different senses (smell, taste, etc.), and in every situation we get inputs from some or all of them. This is similar to having 6 different nmcs, each with a different set of inputs (just as, when we look at something and smell it, we get a different input from each sense), with all these nmcs together creating an input to another, major nmc. If this were the case, it would be relatively easy to achieve the correct policy. For example, in the case of the maze, cells n = 2, ..., N would have the same input, which would be very different from the inputs for cells n = 1 and n = N+1. So for any length of the corridor we would have only 6 very distinct inputs (3 for ‘up’ and 3 for ‘down’), and as we saw in our first application of the T-maze, very distinct inputs give the correct output rather easily.
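
The counting argument for the maze inputs can be checked with a short sketch. The zone names and function names are hypothetical; the only thing taken from the text is the grouping of cell 1, cells 2 to N, and cell N+1, combined with the two sign values.

```python
def encode_cell(n, corridor_length, sign):
    """Map cell n (1-based; cell corridor_length + 1 is the junction) and the
    road sign to one of the input classes described in the text."""
    if n == 1:
        zone = "start"
    elif n == corridor_length + 1:
        zone = "junction"
    else:
        zone = "corridor"      # cells 2..N all share the same input
    return zone, sign

def distinct_inputs(corridor_length):
    """Collect every distinct input the agent can see in the maze."""
    return {encode_cell(n, corridor_length, sign)
            for n in range(1, corridor_length + 2)
            for sign in ("up", "down")}
```

For any corridor length N of at least 2, distinct_inputs(N) contains exactly six elements, three for ‘up’ and three for ‘down’, so the number of inputs that must be separated does not grow with the length of the corridor.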

We would like to thank our supervisors, Karina Odinaev and Igal Raichelgauz, who put great effort and many hours into helping us find the right direction for this project. We would also like to thank Johanan Erez, the laboratory supervisor, who gave us technical and logistical aid. This project would not have been possible without their help and guidance.