Since last time’s post was about how algorithms “learn”, let’s look at another way in which AI can advance – “Q-learning”. Q-learning takes a different approach to teaching an AI compared to the neural networks we saw before. Instead of many rounds of trial and error in which the AI slowly learns what’s wrong, the AI is rewarded for doing well and punished for doing badly. Try to imagine it like this:
Imagine the AI is a pig that you want to teach to stay in one place. With a neural network, first you would obtain a pig.

Then, you would wait to see what it did. Ideally, you would bring in more than just one pig.

Now, you would wait. If a pig did something good, you would let that pig make lots and lots of little baby pigs, and those pigs would behave the same way. If a pig did something bad, you would get rid of it and get another.

This tries to emulate the idea of “natural selection”: well-behaved pigs go on to make more pigs, and badly behaved pigs do not. Eventually, all the pigs know what to do. However, this takes many versions of the pig, and makes a lot of excess bacon. In Q-learning, the pig would instead be given a reward every time it stayed put for long enough, and punished if it did not. As a result, we get something a lot closer to how human children learn, through rewards and punishments, which is the basic idea behind reinforcement learning.

Like our pig friends, the AI would be dropped into a ready-made environment where it wouldn’t know what to do. It would then slowly try to move around. If it did anything bad, it would get a punishment in the form of a low (or negative) number, and if it did something good, it would get a reward in the form of a high number. The AI would keep trying different things, and in the end it would follow the steps that give the best total reward, and thus learn to do whatever you want it to.
The name “Q-learning” comes from the “Q-function”, or “quality function”, which the AI uses for every action it considers. The Q-function is essentially written like this:
Q[s, a], where s is the current state and a is the action
Here, the function looks at the current state of the AI and the action the AI is about to take. It then estimates the immediate reward for taking that action, plus all the future rewards that the action would help it get later (the function isn’t “greedy”; it doesn’t just look at the immediate reward, it considers the future as well). So, while actually working, the process is as follows:
π(s) = argmax_a Q[s, a]
The ‘π(s)’ part represents the “policy” for state ‘s’, i.e. the action we take in state ‘s’. The equation tries out all the possible actions we can take in state ‘s’ and picks the one with the highest Q-value. A table is built of the Q-values for every state-action pair, and that table is constantly updated as the AI performs more and more actions. The AI repeats this over and over until there is a clear picture of what it should do. Finally, the AI can follow the path with the highest reward as listed in the table, and do whatever needs to be done.
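In case you’re wondering how that table actually gets updated, the standard tabular Q-learning update rule (written here in the same style as the formulas above) is:

Q[s, a] ← Q[s, a] + α ( r + γ · max_a' Q[s', a'] - Q[s, a] )

Here r is the reward the AI just received, s' is the state it ended up in, α (the learning rate) controls how big each update is, and γ (the discount factor) controls how much future rewards matter compared to immediate ones.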

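To make all of this a bit more concrete, here’s a small Python sketch of tabular Q-learning on a made-up five-state “corridor” world, where the agent just has to learn to keep walking right to reach a goal. The environment, the reward of +1 at the goal, and the numbers chosen for the learning rate, discount and exploration rate are all illustrative assumptions, not something from the linked resources:

```python
import random

# A minimal sketch of tabular Q-learning on a tiny made-up "corridor" world.
# The environment and hyperparameters below are illustrative assumptions.

N_STATES = 5          # states 0..4; state 4 is the goal
ACTIONS = [-1, +1]    # move left or move right
ALPHA = 0.1           # learning rate: how big each update is
GAMMA = 0.9           # discount factor: how much future rewards count
EPSILON = 0.2         # chance of trying a random action (exploration)
EPISODES = 500

# The Q-table: one value for every (state, action) pair, all starting at 0.
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(state, action):
    """Move along the corridor; reaching state 4 gives a reward of +1."""
    next_state = max(0, min(N_STATES - 1, state + action))
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    done = next_state == N_STATES - 1
    return next_state, reward, done

for _ in range(EPISODES):
    state = 0
    done = False
    while not done:
        # Epsilon-greedy: usually take the best-known action, sometimes explore.
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])

        next_state, reward, done = step(state, action)

        # The Q-learning update: nudge Q[s, a] toward the reward just received
        # plus the best value currently believed reachable from the next state.
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])

        state = next_state

# The learned policy: in every state, pick the action with the highest Q-value.
policy = {s: max(ACTIONS, key=lambda a: Q[(s, a)]) for s in range(N_STATES)}
print(policy)  # expect states 0-3 to point right (+1), toward the goal
```

After a few hundred episodes, the table points every non-goal state toward the goal, which is exactly the “clear picture of what it should do” mentioned above.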
For example, take a look at this explanation of value functions and a simple approach to eating using Q-learning. Also, check out this video by Siraj Raval for a great explanation of how Q-learning works, and this video by Code Bullet to see a cool car drive around a track using Q-learning. Well, that’s pretty much it for now. Next week, we’ll look at some other AI algorithms. Until then, good luck.
Resources: freecodecamp.org

