PaperReading:Micro-armed Bandit for Microarchitecture Decision-Making

发表于 2026-03-12 分类于体系结构阅读次数：

(1)What is your take-away message from this paper

MAB轻量化的强化学习算法在微架构上的应用

(2)What is the motivation for this work (both people problem and technical problem), and its distillation into a research question? Why doesn’t the people problem have a trivial solution? What are the previous solutions and why are they inadequate?

Motivation:

Use MAB to solve Data Prefetching & Instruction Fetch in Simultaneous Multithreaded(SMT)

Background:

RL用来解决微架构中的各种决策问题

RL : 1 High complexity and storage overhead 2 not reusable Temporal homogenrity in the action space 动作空间的时间同质性 ：在一些微架构决策问题中，在给定的时间窗口内，只有动作空间的一小部分是有用的

先前的解决办法：在线强化学习（与Pythia对比） ## (3)What is the proposed solution (hypothesis, idea, design)? Why is it believed it will work? How does it represent an improvement? How is the solution achieved?

预取中的时间同质性

表征预取动作空间的时间同质性需要识别程序中每个点的最优预取动作。使用Pythia的动作作为代理，分析了两个最受欢迎的Pythia动作在10亿条指令期间的频率。平均而言，每个应用程序中最常选择的动作占总选择的60%，而第二常选择的动作占总动作选择的15% ，在10亿条指令期间，3%的动作空间占到了总动作选择的75%。请注意，每个应用程序中最常选择的动作都不同。

[[Pasted image 20260309201750.png]] 多臂老虎机智能体无法区分环境状态（因为本身算法就一个状态），而传统的、广泛采用的非强化学习预取器（如步长预取器和流预取器）在一定程度上可以区分环境状态,类比38论文进行建模。对于每一个程序都是老虎机+非强化学习预选器 ### 获取优先级和SMT线程门控策略

总体分为两步： 1. 基于Choi算法，分配IQ（每一个都有考虑和不考虑两种情况） 2. 用BrC ICount IQ LSQCouunt LSQ为没有达到fetch-gating的线程排序

优先级：BrC ICount IQ LSQCouunt LSQ 门控策略：Choi算法（具有时序稳定性）：Hill Climbing algorithm——IQ IRF ROB设置阈值，最后用ICount确定门控的优先级

原有Bandit对于微架构的修改

原有问题与解决： (1) IPC作为MAB的奖励，对于IPC不同的基准测试来说，有着共同的探索常数c。总之就是量纲不一样。

计算所有臂的平均初始奖励\(r_{avg}\)并归一化，以及在算法主循环中收集每一个新\(r_{step}\) (2) 当许多核心同时运行MAB算法时，核心间干扰现象会影响bandit的探索. 比方说某个核选择了一个非常激进的臂，导致其他核心的 DRAM带宽出现饥饿，从而使其他核的IPC下降，而并非是这些核的arm选的不好。
为了缓解多核干扰导致的错误学习，某个核心会独立地以很小概率重新进入“依次试遍所有臂”的阶段，但保留以前积累的奖励和计数统计。这样它可以在其他核心大多已经收敛、系统环境更稳定的时候，重新评估各个臂，纠正早期因干扰造成的误判。

(4)What is the author’s evaluation of the solution? What logic, argument, evidence, artifacts (e.g., a proof-of-concept system), or experiments are presented in support of the idea?

PS实验细节没有看 #### 三种Bandit算法抉择

使用调优集，比较了这三种和Pythia的IPC（min max ave），结果DUCB最好

数据预取

和步长、MLOP、Bingo、Pythia（重点）进行比较单核：结果L1步长+L2Bondit效果不错 (问题：如果是其他预取器的多级缓存组合会不会更好？)；在带宽受限下比Pythia还要更好多核：高于其他小于Pythia，因为核心之间相互干扰

SMT

比简单的Choi算法要更好

(5)What is your analysis of the identified problem, idea and evaluation? Is this a good idea? What flaws do you perceive in the work? What are the most interesting or controversial ideas? For work that has practical implications, ask whether this will work, who would want it, what it will take to give it to them, and when might it become a reality?

本文主要针对的问题是：RL预取器在微架构中的高开销、不可重用的问题，决定用基于DUCB的Bandit算法进行解决。文章发现了在微架构中Data Fetching阶段的时间同质性，大大减小了模型的复杂程度。这是一个非常好的想法。缺点1： Bandit奖励臂的设计，只参考IPC，也就是说只能单核下进行评价，而不能解决多核处理器相互干扰的问题，这导致这个架构在实际应用中其实没有什么太多作用——核心越多，效果越差. 不解决多核下奖励臂的求取问题，就不能放入实际。

(6)What are the paper’s contributions (author’s and your opinion)? Ideas, methods, software, experimental results, exper- imental techniques...?

Idea:时间同质性 SMT中将优先级策略和门控策略联合，并扩充了门控策略

(7)What are future directions for this research (author’s and yours, perhaps driven by shortcomings or other critiques)?

多核下的Data Fetching SMT——非线性下的资源分配比例探讨

(8)What questions are you left with? What questions would you like to raise in an open discussion of the work (review interesting and controversial points, above)? What do you find difficult to understand? List as many as you can.

SMT：为什么0.2%的平均周期比例提升能带来比Choi本身高平均2.2%的性能提高，二者之间的联系是什么，作者为什么要把这两个放在一起。我认为这是不是说明门控策略其实是没有必要的呢？

SMT所用的Choi算法里有一个核心假设：各种资源之间的分配是成比例的