Reinforcement learning

Summit Training

Rule-based VS ML

  • Rule-based approach: Decision rules are clearly defined by humans

  • Machine Learning: Trained from examples. Rules are not defined by humans but learned by the machine from data.

AI VS ML VS DL

  • Artificial Intelligence: The broad concept of machines exhibiting human-like intelligence; the ability to handle many kinds of work, from managing schedules to analyzing marketing trends.

  • Machine Learning: A method for computers to learn from data and make predictions based on past experience. For example, it can take past sales data and learn to predict future sales trends based on the patterns it observes (see the sketch after this list).

    1. Supervised Learning: learns the relationship between given inputs and a given output
      • linear regression
      • logistic regression (binary output)
      • Naive Bayes
      • support vector machine (handles non-linear problems)
      • AdaBoost
      • decision tree
      • random forest
      • simple neural network
    2. Unsupervised Learning: learns structure in data without an explicit output
      • K-Means Clustering: puts data into groups
      • Hierarchical Clustering
      • Gaussian Mixture
      • Recommender System
      • Apriori algorithm
    3. Reinforcement Learning: learning through interaction with an environment and rewards/punishments
  • Deep Learning: A specialized technique that uses multi-layer neural networks to analyze data in depth, capturing even the tiniest influences on sales. It is not only about knowing which products sell well, but also about recognizing many subtler patterns, such as how weather changes might influence sales.
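A minimal supervised-learning sketch of the sales example above, using scikit-learn linear regression; the numbers and variable names are made up purely for illustration:

from sklearn.linear_model import LinearRegression

# Past sales data: month number -> units sold (illustrative numbers)
months = [[1], [2], [3], [4], [5], [6]]
units_sold = [100, 120, 135, 160, 170, 195]

# The model is not given hand-written rules; it learns the input-output
# relationship (the sales trend) from the examples
model = LinearRegression().fit(months, units_sold)
print(model.predict([[7]]))  # predicted sales for month 7 from the learned trend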


Population Exploration and Exploitation

Exploration: when the algorithm diverges, the dimension-wise differences between individuals in the population increase, meaning the individuals are dispersed across the search space. Through exploration, the algorithm can visit previously unvisited neighborhoods of the search space, maximizing the chance of finding the globally optimal position.

Exploitation: when the population converges, these differences shrink and the individuals cluster in a concentrated region. Exploitation lets the individuals converge successfully on a promising region, with a high probability of reaching the global optimum.

https://blog.csdn.net/jieyanping/article/details/129189230
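A small sketch of one way this difference between individuals can be measured; the function name and numbers are mine, for illustration only. A dispersed population (large diversity) is still exploring, while a tightly clustered one (diversity near zero) is exploiting a single region.

import numpy as np

def population_diversity(population):
    # Mean Euclidean distance of individuals from the population centroid:
    # large -> individuals are spread out (exploration),
    # near zero -> the population has converged on one region (exploitation)
    centroid = population.mean(axis=0)
    return np.linalg.norm(population - centroid, axis=1).mean()

rng = np.random.default_rng(0)
dispersed = rng.uniform(-5.0, 5.0, size=(30, 10))        # scattered over the search space
converged = 1.0 + 0.01 * rng.standard_normal((30, 10))   # clustered around one point

print(population_diversity(dispersed))   # large value: exploring
print(population_diversity(converged))   # small value: exploiting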


Hyperparameters

  • gradient descent batch size: 64 (number of samples per gradient-descent update)
  • number of epochs: 10 (number of passes over the collected experience per training iteration)
  • learning rate: 0.0003 (use a larger rate such as 0.01 on the first training run to get near the optimum quickly, then lower it on later runs to fine-tune)
  • entropy: 0.01 (randomness added to the policy to encourage exploration)
  • discount factor: 0.999 (how strongly future rewards count toward the return; a reward k steps ahead is weighted by 0.999^k, keeping the agent far-sighted; a short sketch follows this list)
  • loss type: huber
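A quick sketch (not DeepRacer code) of what the discount factor does when future rewards are summed into a return; the function name and reward values are illustrative only:

def discounted_return(rewards, discount_factor=0.999):
    # A reward received k steps in the future is weighted by discount_factor ** k
    return sum((discount_factor ** k) * r for k, r in enumerate(rewards))

# 150 steps of reward 1.0 (roughly a 10-second episode at 15 steps per second)
print(discounted_return([1.0] * 150, 0.999))  # ~139: far-ahead rewards still count
print(discounted_return([1.0] * 150, 0.9))    # ~10: only the next few steps matter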

Action space

  • continuous action space (the exact actions are chosen by the algorithm)
    • steering angle range: -30° to 30° (depends on the track)
    • speed: 0.1 to 4 m/s
  • discrete action space (entered manually: the larger the steering angle, the lower the speed; see the sketch below)
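A hypothetical discrete action space following the rule above (larger steering angle, lower speed); in the DeepRacer console these pairs are entered manually, and the values here are illustrative only:

actions = [
    {"steering_angle": -30, "speed": 1.3},
    {"steering_angle": -15, "speed": 2.5},
    {"steering_angle":   0, "speed": 4.0},
    {"steering_angle":  15, "speed": 2.5},
    {"steering_angle":  30, "speed": 1.3},
]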

https://github.com/dgnzlz/Capstone_AWS_DeepRacer/blob/master/Compute_Speed_And_Actions/RaceLine_Speed_ActionSpace.ipynb

Reward function

https://docs.aws.amazon.com/deepracer/latest/developerguide/deepracer-reward-function-input.html#reward-function-input-distance_from_center

  • Follow the center line in time trials
# Read the relevant input parameters
track_width = params['track_width']
distance_from_center = params['distance_from_center']

# Markers at increasing distances from the center line
marker_1 = 0.1 * track_width
marker_2 = 0.25 * track_width
marker_3 = 0.5 * track_width

# Higher reward the closer the car stays to the center line
if distance_from_center <= marker_1:
    reward = 1.0
elif distance_from_center <= marker_2:
    reward = 0.5
elif distance_from_center <= marker_3:
    reward = 0.1
else:
    reward = 1e-3  # likely crashed / close to off track

return float(reward)
  • Stay inside the two borders in time trials
reward = 1e-3  # very low reward by default
if all_wheels_on_track and (0.5 * track_width - distance_from_center) >= 0.05:
    reward = 1.0
return float(reward)
  • Prevent zig-zag in time trials
# Steering penalty threshold; change the number based on your action space settings
ABS_STEERING_THRESHOLD = 15

# Penalize the reward if the car is steering too much (zig-zagging)
abs_steering = abs(params['steering_angle'])
if abs_steering > ABS_STEERING_THRESHOLD:
    reward *= 0.8
return float(reward)

Params dictionary

{
    "all_wheels_on_track": Boolean,        # flag to indicate if the agent is on the track
    "x": float,                            # agent's x-coordinate in meters
    "y": float,                            # agent's y-coordinate in meters
    "distance_from_center": float,         # distance in meters from the track center 
    "is_left_of_center": Boolean,          # Flag to indicate if the agent is on the left side to the track center or not. 
    "is_offtrack": Boolean,                # Boolean flag to indicate whether the agent has gone off track.
    "is_reversed": Boolean,                # 反向
    "heading": float,                      # 朝向
    "progress": float,                     # progress尽可能大,steps尽可能小
    "steps": int,                          # 1 秒15个steps
    "speed": float,                        # agent's speed in meters per second (m/s)
    "steering_angle": float,               # 轮子角度
    "track_length": float,                 # 轨道长度
    "track_width": float,                  # 轨道宽度
    "closest_waypoints": [int, int],       # 最近的两点
    "waypoints": [(float, float), ]        # 地图信息
}
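DeepRacer calls reward_function(params) with this dictionary at every step and expects a float back. A minimal skeleton with an illustrative center-line reward of my own, just to show how the parameters are read:

def reward_function(params):
    # Give (almost) nothing if the car left the track or turned around
    if params["is_offtrack"] or params["is_reversed"]:
        return 1e-3
    # Otherwise reward staying near the center line, scaled by half the track width
    half_width = 0.5 * params["track_width"]
    return float(max(1e-3, 1.0 - params["distance_from_center"] / half_width))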

Raw track data: https://github.com/aws-deepracer-community/deepracer-race-data/tree/main/raw_data/tracks

Optimal racing line: https://github.com/cdthompson/deepracer-k1999-race-lines
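The excerpt below follows the style of the racing-line reward function from the Capstone_AWS_DeepRacer repository (see the Best-route strategy link at the end of this section). It assumes the usual values (x, y, speed, heading, progress, steps, track_width, all_wheels_on_track) have already been read from params, that optimals and optimals_second are the closest points of a pre-computed optimal racing line (each holding x, y and target speed), and that dist_to_racing_line and racing_direction_diff are helper functions from that repository.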

## Define the default reward ##
reward = 1

progress = params['progress']
steps = params['steps']
TOTAL_NUM_STEPS = 10 * 15  # target lap: 10 seconds at 15 steps per second
if (steps % 15) == 0 and progress/100 > (steps/TOTAL_NUM_STEPS):
    reward += 2.22  # bonus each second the car is ahead of the target pace
    # reward += progress - (steps/TOTAL_NUM_STEPS)*100

## Reward if car goes close to optimal racing line ##
DISTANCE_MULTIPLE = 2.5
dist = dist_to_racing_line(optimals[0:2], optimals_second[0:2], [x, y])
distance_reward = max(1e-3, 1 - (dist/(track_width*0.5)))
reward += distance_reward * DISTANCE_MULTIPLE

## Reward if speed is close to optimal speed ##
SPEED_DIFF_NO_REWARD = 1
SPEED_MULTIPLE = 2
speed_diff = abs(optimals[2]-speed)
if speed_diff <= SPEED_DIFF_NO_REWARD:
    # we use quadratic punishment (not linear) bc we're not as confident with the optimal speed
    # so, we do not punish small deviations from optimal speed
    speed_reward = (1 - (speed_diff/(SPEED_DIFF_NO_REWARD))**2)**2
else:
    speed_reward = 0
reward += speed_reward * SPEED_MULTIPLE


# Zero reward if obviously wrong direction (e.g. spin)
direction_diff = racing_direction_diff(
    optimals[0:2], optimals_second[0:2], [x, y], heading)
if direction_diff > 30:
    reward = 1e-3

# Zero reward if obviously too slow
speed_diff_zero = optimals[2]-speed
if speed_diff_zero > 0.5:
    reward = 1e-3

## Incentive for finishing the lap in less steps ##
REWARD_FOR_FASTEST_TIME = 500 # should be adapted to track length and other rewards
STANDARD_TIME = 12  # seconds (time that is easily done by model)
FASTEST_TIME = 9  # seconds (best time of 1st place on the track)
if progress == 100:
    finish_reward = max(1e-3, (-REWARD_FOR_FASTEST_TIME /
                (15*(STANDARD_TIME-FASTEST_TIME)))*(steps-STANDARD_TIME*15))
else:
    finish_reward = 0
reward += finish_reward

## Zero reward if off track ##
if not all_wheels_on_track:
    reward = 1e-3

return float(reward)

Best-route strategy: https://github.com/dgnzlz/Capstone_AWS_DeepRacer

AWS Deepracer Community: https://github.com/aws-deepracer-community/deepracer-analysis

https://zhuanlan.zhihu.com/p/635604595?utm_id=0
