Reinforcement Learning Notes 1 - Python/OpenAI/TensorFlow/ROS - Fundamentals
Concepts:
Reinforcement learning is a branch of machine learning in which learning takes place through interaction with the environment; it is a goal-oriented approach.
The learner is not told which actions to take; instead, its actions lead to rewards or penalties, and it learns from the consequences of its behaviour.
Robot obstacle-avoidance example:
Moving closer to an obstacle scores -10; moving away from it scores +10.
The agent explores on its own to discover the actions that earn good rewards. The process involves the following steps:
The agent interacts with the environment by performing an action.
After the action is performed, the agent moves from one state to another.
Depending on the action, the agent receives a reward or a penalty.
The agent learns which actions had positive effects and which had negative ones.
To collect more reward and avoid penalties, the agent adjusts its policy through trial-and-error learning; a minimal sketch of this loop follows below.
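As a rough illustration only (not from the original text), the obstacle-avoidance loop above could look like the following sketch; the step function, the distance state, and the ±10 rewards are hypothetical placeholders:

import random

def step(state, action):
    # Hypothetical environment: the state is the distance to the obstacle.
    # Moving away earns +10; moving closer (or staying at the obstacle) earns -10.
    next_state = max(0.0, state + (1.0 if action == 'away' else -1.0))
    reward = 10 if next_state > state else -10
    return next_state, reward

state = 5.0                                      # initial distance (hypothetical units)
total_reward = 0
for _ in range(20):
    action = random.choice(['toward', 'away'])   # the agent performs an action
    state, reward = step(state, action)          # it transitions to a new state
    total_reward += reward                       # and receives a reward or penalty
print(total_reward)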
It is worth comparing reinforcement learning with the other branches of machine learning to understand their differences, and to appreciate its application prospects in robotics.
Elements of reinforcement learning: agent, policy function, value function, model, etc.
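For intuition only, the policy and the value function can be pictured as simple lookup tables from states to actions and to expected returns; the states and numbers below are made up:

# Hypothetical 3-state example: the policy maps each state to an action,
# the value function maps each state to an estimate of long-term reward.
policy = {'s0': 'right', 's1': 'right', 's2': 'stay'}
value = {'s0': 0.5, 's1': 0.8, 's2': 1.0}

state = 's0'
print(policy[state], value[state])   # the action the agent would take, and its value estimate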
Environment types: deterministic, stochastic, fully observable, partially observable, discrete, continuous, episodic, non-episodic, single-agent, multi-agent.
Reinforcement learning platforms: OpenAI Gym / Universe / DeepMind Lab / RL-Glue / Project Malmo / ViZDoom, etc.
Reinforcement learning applications: education, medical care and health, manufacturing, management, finance, and specific areas such as natural language processing and computer vision.
References:
https://www.cs.ubc.ca/~murphyk/Bayes/pomdp.html
https://morvanzhou.github.io/
https://github.com/sudharsan13296/Hands-On-Reinforcement-Learning-With-Python
Setup:
Install and configure Anaconda, Docker, OpenAI Gym, and TensorFlow.
Since system environments and version requirements vary, look up the configuration steps for your own setup.
Commonly used commands include:
bash, conda create, source activate, apt install, docker, pip3 install gym / universe, etc.
Once all of the above is configured, test OpenAI Gym and OpenAI Universe.
To view *.ipynb documents, use ipython notebook or jupyter notebook.
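A minimal sanity check, assuming gym was installed with pip3 as above; it only verifies that an environment can be created and inspected:

import gym

env = gym.make('CartPole-v0')     # classic-control env, needs no extra dependencies
print(env.action_space)           # Discrete(2): push the cart left or right
print(env.observation_space)      # Box(4,): cart position/velocity, pole angle/velocity
env.reset()
env.close()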
Gym examples:
Inverted pendulum (CartPole) example:
Sample code:
import gym

env = gym.make('CartPole-v0')
env.reset()
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample())  # take a random action
env.close()
For more details about this code, see:
https://blog.csdn.net/ZhangRelay/article/details/89325679
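A slightly fuller variant of the same loop, written against the same old-gym API and shown only as a sketch, which also inspects the step results and resets whenever an episode ends:

import gym

env = gym.make('CartPole-v0')
observation = env.reset()
for _ in range(1000):
    env.render()
    action = env.action_space.sample()                  # random action
    observation, reward, done, info = env.step(action)  # old-gym step returns a 4-tuple
    if done:                                            # the pole fell or the time limit was hit
        observation = env.reset()
env.close()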
List all environments supported by Gym:
from gym import envs
print(envs.registry.all())
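Assuming the same (older) gym version, each entry returned by envs.registry.all() is an environment spec whose id can be used for filtering, for example:

from gym import envs

print(len(list(envs.registry.all())))    # total number of registered environments
cartpole_ids = [spec.id for spec in envs.registry.all() if 'CartPole' in spec.id]
print(cartpole_ids)                      # e.g. the registered CartPole variants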
Car racing example:
import gym

env = gym.make('CarRacing-v0')
env.reset()
for _ in range(1000):
    env.render()
    env.step(env.action_space.sample())  # take a random action
env.close()
Legged robot (bipedal walker) example:
import gym

env = gym.make('BipedalWalker-v2')
for episode in range(100):
    observation = env.reset()
    # Render the environment on each step
    for i in range(10000):
        env.render()
        # Choose an action by sampling a random action from the environment's
        # action space, which contains all possible valid actions
        action = env.action_space.sample()
        # For each step, record the observation, reward, done flag, and info
        observation, reward, done, info = env.step(action)
        # When done is True, print the time steps taken for the episode and end it
        if done:
            print("{} timesteps taken for the Episode".format(i + 1))
            break
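A small variation of the loop above (a sketch against the same environment) that also sums the rewards so each episode reports its total return:

import gym

env = gym.make('BipedalWalker-v2')
for episode in range(5):
    observation = env.reset()
    episode_return = 0.0
    for i in range(10000):
        env.render()
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        episode_return += reward          # accumulate the per-step reward
        if done:
            print("Episode {}: {} timesteps, return {:.2f}".format(episode, i + 1, episode_return))
            break
env.close()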
Flash game environment example (OpenAI Universe):
import gym
import universe  # importing universe registers its environments
import random

env = gym.make('flashgames.NeonRace-v0')
env.configure(remotes=1)  # create one local Docker-based remote
observation_n = env.reset()

# Move left
left = [('KeyEvent', 'ArrowUp', True), ('KeyEvent', 'ArrowLeft', True),
        ('KeyEvent', 'ArrowRight', False)]
# Move right
right = [('KeyEvent', 'ArrowUp', True), ('KeyEvent', 'ArrowLeft', False),
         ('KeyEvent', 'ArrowRight', True)]
# Move forward
forward = [('KeyEvent', 'ArrowUp', True), ('KeyEvent', 'ArrowRight', False),
           ('KeyEvent', 'ArrowLeft', False), ('KeyEvent', 'n', True)]

# The turn variable decides whether to turn or not
turn = 0
# Store all the rewards in the rewards list
rewards = []
# The buffer size acts as a threshold for when to evaluate the rewards
buffer_size = 100
# Set the initial action to forward, i.e. the car drives straight without turning
action = forward

while True:
    turn -= 1
    # Initially we take no turn and move forward.
    # If turn is not positive, there is no need to keep turning, so just drive forward
    if turn <= 0:
        action = forward
        turn = 0
    action_n = [action for ob in observation_n]
    # Use env.step() to perform the action (moving forward for now) for one time step
    observation_n, reward_n, done_n, info = env.step(action_n)
    # Store the reward in the rewards list
    rewards += [reward_n[0]]
    # Once enough rewards are collected, compute their mean. If the mean is 0 the car
    # is probably stuck, so turn for the next 20 steps, choosing left or right at random;
    # over many timesteps the collected rewards show which direction works best.
    if len(rewards) >= buffer_size:
        mean = sum(rewards) / len(rewards)
        if mean == 0:
            turn = 20
            if random.random() < 0.5:
                action = right
            else:
                action = left
        rewards = []
    env.render()
Partial test results (from multiple runs):