gem5-SMAUG Custom Operator Tutorial

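Last updated: the evening of December 23, 2024

An emergency extra post on SMAUG. Nobody expected that a university lab assignment would eat up most of two weeks of my time, or that reference material for this framework would be so scarce on both the Chinese and English internet. ~~That's research-paper code for you.~~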

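Introduction

The gem5 simulator is an open-source, discrete-event computer architecture simulator that combines system-level and microarchitectural simulation.

A similar but better-known emulator is QEMU. The difference is that QEMU mainly aims to run guest code as fast as possible, whereas gem5 leans toward accurate simulation of the hardware layer and is used to analyze the performance of specific CPU, pipeline, and cache designs.

gem5-aladdin comes from the Harvard ACC (Architecture, Circuits, Compilers) lab and is used to simulate SoC workloads end to end on gem5.

SMAUG is follow-up work from Harvard ACC; the name stands for Simulating Machine Learning Accelerators Using gem5-Aladdin. It is a deep learning framework that supports end-to-end simulation of deep learning models on custom SoCs with a variety of hardware accelerators.

Task

Use gem5 SMAUG to design a hardware acceleration unit for a specific task, and analyze its impact on performance and energy consumption in architecture-level simulation.

Environment Setup

SMAUG provides a Docker image that makes environment setup simple.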

docker pull xyzsam/smaug:latest
docker run -it --rm --mount source=smaug-workspace,target=/workspace xyzsam/smaug:latest

(You may need to configure a registry mirror yourself, and to pass --net to docker run.)

git pull

cd gem5-aladdin && git pull origin master && git submodule update --init --recursive && cd ..
cd LLVM-Tracer && git pull origin master && cd ..
cd smaug && git pull origin master && git submodule update --init --recursive && cd ..

build

cd /workspace/gem5-aladdin
python2.7 `which scons` build/X86/gem5.opt PROTOCOL=MESI_Two_Level_aladdin -j2
cd /workspace/smaug
make all -j8

Now we can try running one of the examples, the Minerva model.

cd /workspace/smaug/experiments/sims/smv/tests/minerva
chmod +x run.sh
./run.sh

This gives us the output of the energy analysis tool on stdout, and gem5's performance simulation output in outputs/stats.txt.

Custom Operator

Example

Picking up where we left off: what is this Minerva network? We can look at it in /workspace/smaug/experiments/models/minerva/minerva_network.py. From here on, all paths are relative to the root directory /workspace/smaug.

#!/usr/bin/env python

"""Create the Minerva network."""

import numpy as np
import smaug as sg

def generate_random_data(shape):
  r = np.random.RandomState(1234)
  return (r.rand(*shape) * 0.005).astype(np.float16)

def create_minerva_model():
  with sg.Graph(name="minerva_smv", backend="SMV") as graph:
    # The input tensor uses NHWC layout; the weight tensors use NC layout.
    input_tensor = sg.Tensor(
        data_layout=sg.NHWC, tensor_data=generate_random_data((1, 28, 28, 1)))
    fc0_tensor = sg.Tensor(
        data_layout=sg.NC, tensor_data=generate_random_data((256, 784)))
    fc1_tensor = sg.Tensor(
        data_layout=sg.NC, tensor_data=generate_random_data((256, 256)))
    fc2_tensor = sg.Tensor(
        data_layout=sg.NC, tensor_data=generate_random_data((256, 256)))
    fc3_tensor = sg.Tensor(
        data_layout=sg.NC, tensor_data=generate_random_data((10, 256)))

    act = sg.input_data(input_tensor)
    act = sg.nn.mat_mul(act, fc0_tensor, activation="relu")
    act = sg.nn.mat_mul(act, fc1_tensor, activation="relu")
    act = sg.nn.mat_mul(act, fc2_tensor, activation="relu")
    act = sg.nn.mat_mul(act, fc3_tensor)
    return graph

if __name__ == "__main__":
  graph = create_minerva_model()
  graph.print_summary()
  graph.write_graph()

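Even without knowing the exact types, you can tell this is roughly a simple chain of matrix multiplications. So what happens when we run this Python file? It writes two files into the same folder: a topology file (*_topo.pbtxt) and a parameters file (*_params.pb), which hold the network structure and the weights used to simulate the model.
We will use this again later.

So our final goal is to use a Python file like this one to build a network graph that calls our custom operator, and then have gem5 analyze its performance and energy consumption.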

Create

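In this section we follow what little remains of the official Python and C++ documentation to define the simplest possible operator: an addition of two tensors.

To add an operator to SMAUG, we first add its C++ implementation and then expose a Python API. Go to where the operators are defined in C++, smaug/operators/, and create a new file custom_operator.h; this header-only file will serve as the operator.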

#include <string>

#include "smaug/core/operator.h"

namespace smaug {
 
template <typename Backend>
class MyCustomOperator : public Operator {
 public:
  MyCustomOperator(const std::string& name, Workspace* workspace) :
    Operator(name, workspace) {
      inputs.resize(kNumInputs, nullptr);
      outputs.resize(kNumOutputs, nullptr);
  }
 
  void setParam1(int val) { param1 = val; }
  void setParam2(int val) { param2 = val; }
 
  // A required function that implements the actual Operator logic.  Leave this
  // blank for now.
  void run() override {}
 
  // Optional override for testing purposes.
  void createAllTensors() override {}
 
  // Optional but recommended function to verify operator parameters.
  // Defer to the base class for now so the function returns a value.
  bool validate() override { return Operator::validate(); }
 
  // An optional function to tile the input tensors.
  void tile() override {}
 
  enum {kInput0, kInput1, kNumInputs};
  enum {kOutput, kNumOutputs};
 
 private:
  int param1 = 0;
  int param2 = 0;
};
 
}

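The enums show that this operator takes two inputs and produces one output. Next we register the operator with the backends.

1. Add a new OpType enum value for this operator to smaug/core/types.proto. This file is compiled by protoc into the generated type definitions.
2. Define the operator in all backends, following the existing conventions in backend.h and backend.cpp:
   - In backend.h, include the header and forward-declare the operator.
   - In backend.h, add DECL_CREATE_OP(MyCustomOperator) to every backend.
   - In backend.cpp, add DEF_CREATE_OP(MyCustomOperator, Backend) to every backend.
   Note that because the backend in use is not known at compile time, the macros must generate a createMyCustomOperator function for every backend.
3. Update network_builder.cpp to handle the new operator, that is, the type declared in step 1.

if (type == OpType::MyCustomOperator) {
    auto op = Backend::createMyCustomOperator(name, workspace);
    op->setParam1(node.param1());
    op->setParam2(node.param2());
    network->addOperator(op);
}

4. Add any new .cpp files to the SRCS variable in smaug/make/Makefile.common.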

Impl

Next we implement the operator by overriding the virtual functions. First, check that the inputs are valid:

bool validate() override {
    Tensor* input0 = getInput(kInput0);
    Tensor* input1 = getInput(kInput1);

    if (input0->getShape().size() != input1->getShape().size() ||
        input0->getDataType() != DataType::Float32 ||
        input1->getDataType() != DataType::Float32) {
        return false;
    }

    return true;
}

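Next we tile the input tensors, splitting them into flat chunks that fit in the accelerator's local memory, so that the accelerator can compute on them directly: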

void tile() override {
    Tensor* inputs0 = getInput(kInput0);
    Tensor* inputs1 = getInput(kInput1);
    Tensor* outputs = getOutput(kOutput);

    //Calculate the size of each tile (assuming the maximum tile size
    //is the local memory capacity of the hardware).
    int maxTileSize =
            std::min(ReferenceBackend::SpadSize() / inputs0->getDataTypeSize(),
                     inputs0->getShape().storageSize());
    TensorShape tileShape(
            { 1, maxTileSize }, DataLayout::NC,
            ReferenceBackend::Alignment);  // Assuming the data is 1D.

    // Tile the input and output Tensors.
    tiledTensors[0] =
            generateTiledTensorPerBatchNC(inputs0, tileShape, this, true);
    tiledTensors[1] =
            generateTiledTensorPerBatchNC(inputs1, tileShape, this, true);
    tiledTensors[2] =
            generateTiledTensorPerBatchNC(outputs, tileShape, this, false);
}

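Of course, the operator class has to maintain the tiled tensors itself:

 protected:
  std::array<TiledTensor, 3> tiledTensors;

Notice that tile() uses ReferenceBackend::SpadSize() to bound the tile size. The tile size obviously depends on the size of our custom accelerator's local memory, so we have to declare the accelerator's ID and the scratchpad size in the backend by hand. In core/backend.cpp we define: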

namespace ref {
    // This is all existing code...
    const unsigned kConvolutionHw = 0x0001;
    const unsigned kInnerProductHw = 0x0002;
    const unsigned kEltwiseOpHw = 0x0003;
    const unsigned kBatchNormHw = 0x0004;
    const unsigned kPoolingHw = 0x0005;
    
    // Define our new scratchpads here.
    int kSpadSize;
    float* spad0;
    float* spad1;
    
    // Add a unique ID for our accelerator HW. This will be used to invoke the
    // accelerator during simulation.
    const unsigned kMyCustomOperatorHw = 0x0006;
}  // namespace ref

and declare them extern in the header core/backend.h:

namespace ref {
// This is all existing code...
extern const unsigned kConvolutionHw;
extern const unsigned kInnerProductHw;
extern const unsigned kEltwiseOpHw;
extern const unsigned kBatchNormHw;
extern const unsigned kPoolingHw;
 
// Declare our two new global arrays and accelerator IDs here.
extern int kSpadSize;
extern float* spad0;
extern float* spad1;
extern const unsigned kMyCustomOperatorHw;
}  // namespace ref

Also in the header, modify the init function so that it initializes the scratchpads we use:

class ReferenceBackend {
 public:
  // ... existing members ...
  static int SpadSize() { return ref::kSpadSize; }
  static void initGlobals() {
    // Scratchpad size in bytes. Replace with your actual value.
    ref::kSpadSize = 32*1024;
    ref::spad0 = (float*) malloc_aligned(ref::kSpadSize);
    ref::spad1 = (float*) malloc_aligned(ref::kSpadSize);
  }
  static void freeGlobals() {
    free(ref::spad0);
    free(ref::spad1);
  }
};
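As a rough sanity check on what this capacity means for tiling, the arithmetic can be worked through in a few lines of Python. This is only a back-of-the-envelope sketch, not SMAUG code: tiles_needed is a hypothetical helper of mine, and it ignores alignment padding and the per-batch splitting done by generateTiledTensorPerBatchNC, so the real tile count may differ.

```python
import math

def tiles_needed(num_elements, elem_bytes, spad_bytes):
    """Estimate the tile size (in elements) and the tile count for one tensor."""
    # Same idea as maxTileSize in tile(): scratchpad capacity in elements,
    # capped by the size of the tensor itself.
    max_tile_elems = min(spad_bytes // elem_bytes, num_elements)
    return max_tile_elems, math.ceil(num_elements / max_tile_elems)

# 32 KB scratchpad, float32 data, and the 8 x 4096 tensor from the unit test.
tile_elems, n_tiles = tiles_needed(8 * 4096, 4, 32 * 1024)
print(tile_elems, n_tiles)  # 8192 4
```

Under these assumptions a 128 KB tensor splits into four tiles of 8192 floats, which agrees with the comment in the unit test later in this post.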

Now tile() gets the correct size. Next we implement the operator's actual computation, the run() function. Code first:

void run() override {
    TiledTensor& input0 = tiledTensors[0];
    TiledTensor& input1 = tiledTensors[1];
    TiledTensor& output = tiledTensors[2];

    for (int i = 0; i < input0.size(); i++) {
        Tensor* input0Tile = input0.getTileWithData(i);
        Tensor* input1Tile = input1.getTileWithData(i);
        Tensor* outputTile = output.getTileWithData(i);

        // Get handles to the actual underlying data storage. This performs
        // a dynamic_cast to the specified data type, which we verified is
        // safe inside validate().
        float* input0Data = input0Tile->data<float>();
        float* input1Data = input1Tile->data<float>();
        float* outputData = outputTile->data<float>();
        int size = outputTile->getShape().size();

        // Set up the TLB mappings.
        mapArrayToAccelerator(
                ref::kMyCustomOperatorHw,  // The accelerator ID this TLB
                                           // mapping is for.
                "host_input0",  // The name of the function argument in the
                                // kernel function.
                input0Data,     // The pointer to the data.
                size * sizeof(float)           // The size of the TLB mapping
        );
        mapArrayToAccelerator(
                ref::kMyCustomOperatorHw, "host_input1", input1Data, size * sizeof(float) );
        mapArrayToAccelerator(
                ref::kMyCustomOperatorHw, "host_output", outputData, size * sizeof(float) );

        // Wrap the call to elementwise_add with invokeKernel.
        invokeKernel(
                ref::kMyCustomOperatorHw,  // our accelerator ID
                elementwise_add,  // if not simulating, the function to call
                // All of the function call arguments.
                input0Data,
                input1Data,
                outputData,
                ref::spad0,
                ref::spad1,
                outputTile->getShape().size());
    }
    // The results of the elementwise_add are stored in the tiled tensor. We
    // need to merge the data from the individual tiles back into a single
    // contiguous Tensor.
    flattenTiledTensor(tiledTensors[2],
                       dynamic_cast<smaug::Tensor*>(outputs.at(kOutput)));
}

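As the function names suggest, we map the input and output arrays for the accelerator's DMA, and then invokeKernel has the accelerator kMyCustomOperatorHw execute the elementwise_add function.

The implementation of elementwise_add is simple too. The custom_add label marks the loop so that it can be targeted by later optimizations; we will come back to this.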

#ifdef __cplusplus
extern "C" {
#endif

inline void elementwise_add(float* host_input0,
                            float* host_input1,
                            float* host_output,
                            float* spad0,
                            float* spad1,
                            int size) {
    // Copy the input data from host_inputN into spadN. The first argument to
    // dmaLoad or dmaStore is always the destination.
    dmaLoad(spad0, host_input0, size * sizeof(float));
    dmaLoad(spad1, host_input1, size * sizeof(float));
    std::cout << size << std::endl;  // Debug print of the tile size.
    custom_add:
    for (int i = 0; i < size; i++) {
        // Accumulate the data from spad0 into spad1.
        // Note: this could be optimized further if we had three scratchpads
        // instead of two. That would be a good exercise for the reader :)
        spad1[i] += spad0[i];
    }
    // Copy the output data from spad1 back to the host.
    dmaStore(host_output, spad1, size * sizeof(float));
}

#ifdef __cplusplus
}
#endif

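Pay close attention to the extern "C" here. This function runs on our custom accelerator, so it needs C linkage (an unmangled symbol name) for LLVM-Tracer to trace it. We will return to this point below.

With that, the operator implementation is complete.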
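To make the data flow easier to follow, here is a small pure-Python model of what run() and elementwise_add do together: split the inputs into scratchpad-sized tiles, copy each tile pair into the two scratchpads, accumulate, copy the result back, and merge the tiles. It is a sketch for intuition only. dma_copy and the list-based scratchpads are stand-ins I made up, and nothing here touches the real SMAUG or Aladdin APIs.

```python
def dma_copy(dst, src, count):
    """Stand-in for dmaLoad/dmaStore: destination first, then source."""
    dst[:count] = src[:count]

def elementwise_add(host_in0, host_in1, host_out, spad0, spad1, size):
    dma_copy(spad0, host_in0, size)   # like dmaLoad(spad0, host_input0, ...)
    dma_copy(spad1, host_in1, size)   # like dmaLoad(spad1, host_input1, ...)
    for i in range(size):             # the custom_add loop
        spad1[i] += spad0[i]
    dma_copy(host_out, spad1, size)   # like dmaStore(host_output, spad1, ...)

def run(in0, in1, spad_elems):
    """Tile both inputs, run the kernel on each tile pair, merge the results."""
    spad0 = [0.0] * spad_elems
    spad1 = [0.0] * spad_elems
    out = []
    for start in range(0, len(in0), spad_elems):
        tile0 = in0[start:start + spad_elems]
        tile1 = in1[start:start + spad_elems]
        tile_out = [0.0] * len(tile0)
        elementwise_add(tile0, tile1, tile_out, spad0, spad1, len(tile0))
        out.extend(tile_out)          # plays the role of flattenTiledTensor
    return out

a = [float(i) for i in range(10)]
print(run(a, a, spad_elems=4))
```

With a 4-element scratchpad the ten inputs are processed as tiles of 4, 4, and 2 elements, and the merged output is each input value doubled.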

Test

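Next we can verify the operator with a unit test. Specifically, we use smaug_test to create a network, invoke the operator, and check the output against the inputs. Since our operator uses the Reference backend, we first go to lines 48 and 58 of core/smaug_test.h and change SMVBackend to ReferenceBackend.

Then, in the same directory as the operator file, write my_custom_operator_test.cpp. Again, code first:

#include "catch.hpp"
#include "smaug/core/backend.h"
#include "smaug/core/tensor.h"
#include "smaug/core/smaug_test.h"
#include "smaug/operators/custom_operator.h"  // The header created above.

using namespace smaug;

void fillTensorWithSequentialFloat32Data(Tensor* tensor) {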
    float* data = tensor->data<float>();
    for (int i = 0; i < tensor->getShape().size(); i++) {
        data[i] = i;
    } 
}

TEST_CASE_METHOD(SmaugTest, "MyCustomOperatorWithTiling", "[tiling]") {  
    // With float32 elements, this will occupy 128KB, which should create four
    // tiles per tensor.
    TensorShape shape( {8, 4096}, DataLayout::NC); 
    Tensor* input0 = new Tensor("tensor0", shape);
    Tensor* input1 = new Tensor("tensor1", shape);
    workspace()->addTensor(input0);
    workspace()->addTensor(input1);  
 
    // Create the operator and fill it with our tensors.
    using TestOp = MyCustomOperator<ReferenceBackend>;
    auto op = new TestOp("eltwise_add", workspace());
    op->setInput(input0, TestOp::kInput0);
    op->setInput(input1, TestOp::kInput1);
    // This will handle creating/allocating storage/filling data into all the
    // input tensors.
    createAndFillTensorsWithData<float>(op, &fillTensorWithSequentialFloat32Data); 
    // Compute the expected output.
    std::vector<float> expected_output(8 * 4096, 0);   
    for (int i = 0; i < expected_output.size(); i++) {
        expected_output[i] = 2 * i;
    }

    op->tile();    
    op->run();  

    Tensor* output = op->getOutput(TestOp::kOutput);
    verifyOutputs(output, expected_output); 
}

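The test is registered with a macro; it generates sequential data and adds the two tensors together. If everything is in order, add the test to the TESTS variable in make/Makefile.common and it will compile.

make test -j8
./build/smaug/operators/my_custom_operator_test

The output is:

===============================================================================
All tests passed (32769 assertions in 1 test case)

This shows that our operator produces the correct result.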

Simulate

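Next, we go back to where we started: build a network with this operator and hand it to gem5 for simulation. I could not find reference material for any of the steps here, so treat this as a starting point; corrections and discussion are welcome.

First we create a Python interface for the custom operator by adding a new file in smaug/python/ops: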

from smaug.core import node_pb2, types_pb2
from smaug.python.ops import common

def my_custom_operator(tensor_a, tensor_b, name):
    if tensor_a.shape.dims != tensor_b.shape.dims:
        raise ValueError("The input tensors to MyCustomOperator must be of the same shape")
        
    return common.add_node(
        name=name,
        op=types_pb2.CustomOperator,
        input_tensors=[tensor_a, tensor_b],
        output_tensors_dims=[tensor_a.shape.dims],
        output_tensor_layout=tensor_a.shape.layout
    )[0]

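The link to the C++ side is types_pb2.CustomOperator: it is the type name we added to types.proto earlier, and it is what the C++ side sees when the graph is deserialized.

Then, following the Minerva example from the beginning, create a new folder custom under experiments/models and write the network definition. Here it is the simplest possible chain of additions: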

#!/usr/bin/env python

"""Create the custom network."""

import numpy as np
import smaug as sg

def generate_random_data(shape):
  r = np.random.RandomState(1234)
  return (r.rand(*shape)).astype(np.float32)

def create_custom_model():
  with sg.Graph(name="custom_ref", backend="Reference") as graph:
    # All tensors here use NC layout.
    input_tensor = sg.Tensor(
        data_layout=sg.NC, tensor_data=generate_random_data((1024, 1)))
    fc0_tensor = sg.Tensor(
        data_layout=sg.NC, tensor_data=generate_random_data((1024, 1)))
    fc1_tensor = sg.Tensor(
        data_layout=sg.NC, tensor_data=generate_random_data((1024, 1)))
    
    act = sg.input_data(input_tensor)
    act = sg.my_custom_ops.my_custom_operator(act, fc0_tensor,"element")
    act = sg.my_custom_ops.my_custom_operator(act, fc1_tensor,"element")
    return graph

if __name__ == "__main__":
  graph = create_custom_model()
  graph.print_summary()
  graph.write_graph()

Run this file to generate the params and topo files (custom_ref_params.pb and custom_ref_topo.pbtxt) in the current directory.
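Before moving on it is worth checking that both files really exist, since a missing file only shows up later as a confusing simulation failure. The helper below is a small sketch of my own, not part of SMAUG. It assumes the naming pattern <graph name>_topo.pbtxt and <graph name>_params.pb, which I infer from the file names used in model_files below.

```python
from pathlib import Path

def expected_model_files(graph_name, directory="."):
    """Return the (topology, parameters) paths for a graph with this name."""
    base = Path(directory)
    return base / f"{graph_name}_topo.pbtxt", base / f"{graph_name}_params.pb"

def missing_model_files(graph_name, directory="."):
    """List the expected files that are not present on disk."""
    return [p for p in expected_model_files(graph_name, directory)
            if not p.exists()]

topo, params = expected_model_files("custom_ref")
print(topo.name, params.name)  # custom_ref_topo.pbtxt custom_ref_params.pb
```

Calling missing_model_files("custom_ref") from the model directory should return an empty list once the network script has run.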

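Next, following the earlier run.sh, create a custom folder under experiments/sims/smv/tests, modeled on the Minerva setup. I recommend simply copying over its .sh and .cfg files.

In model_files, point to the topo and params files we just created: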

model_dir=`git rev-parse --show-toplevel`/models

topo_file=${model_dir}/custom/custom_ref_topo.pbtxt
params_file=${model_dir}/custom/custom_ref_params.pb

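trace.sh is what produces dynamic_trace_acc0.gz by tracing the kernel. If you have not built the tracer yet, go back to /workspace/smaug and run:

make tracer -j8

However, we find that the tracer does not trace our custom operator at all. Why? As mentioned above, we gave elementwise_add C linkage precisely so that its symbol name stays unchanged, but the tracer still does not know that it should trace this function. So we must add a line containing elementwise_add to make/kernel_functions.txt and rebuild the tracer.

After that, the expected output appears: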
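If you script your setup, this edit is easy to make idempotent. The snippet below is a sketch under the assumption that kernel_functions.txt holds one function name per line; register_kernel is a hypothetical helper of mine, and the demo writes to a temporary file rather than the real make/kernel_functions.txt.

```python
import tempfile
from pathlib import Path

def register_kernel(list_path, kernel_name):
    """Append kernel_name on its own line unless it is already listed.

    Returns True if the file was modified.
    """
    path = Path(list_path)
    text = path.read_text() if path.exists() else ""
    if kernel_name in text.split():
        return False
    if text and not text.endswith("\n"):
        text += "\n"  # keep the new name on its own line
    path.write_text(text + kernel_name + "\n")
    return True

# Demo on a temporary file instead of the real listing.
with tempfile.TemporaryDirectory() as tmp:
    listing = Path(tmp) / "kernel_functions.txt"
    print(register_kernel(listing, "elementwise_add"))  # True: line added
    print(register_kernel(listing, "elementwise_add"))  # False: already listed
```

Remember that the tracer still has to be rebuilt after the file changes.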

Scheduling element (CustomOperator).
dynamic_trace_acc0.gz: Starting to log at inst = 0.
dynamic_trace_acc0.gz: Stopping logging at inst 10257.
Scheduling element_1 (CustomOperator).
dynamic_trace_acc0.gz: Starting to log at inst = 0.
dynamic_trace_acc0.gz: Stopping logging at inst 10257.

and dynamic_trace_acc0.gz is generated.

Next, specify our custom accelerator in gem5.cfg:

[acc0]
accelerator_id = 6

Adjust smv-accel.cfg as needed to allocate space for our scratchpads:

partition,cyclic,host_input0,65536,4,8
partition,cyclic,host_input1,65536,4,8
partition,cyclic,host_output,65536,4,8
partition,cyclic,spad0,65536,4,8
partition,cyclic,spad1,65536,4,8
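These lines are repetitive, so when experimenting with different sizes it can help to generate them. The sketch below simply reproduces the layout shown above. Reading the fields as directive, partition type, array name, array size in bytes, word size in bytes, and partition factor is my interpretation of the Aladdin config format, so treat the parameter names as assumptions.

```python
def partition_line(array_name, size_bytes, word_bytes=4, factor=8, kind="cyclic"):
    """Format one partition directive in the same layout as the lines above."""
    return f"partition,{kind},{array_name},{size_bytes},{word_bytes},{factor}"

# The five arrays the elementwise_add kernel touches.
arrays = ["host_input0", "host_input1", "host_output", "spad0", "spad1"]
for name in arrays:
    print(partition_line(name, 65536))
```

The array names must match the argument names of the kernel function, just as they do in the mapArrayToAccelerator calls.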

Then, if everything is in order, start the simulation with run.sh.


http://tzr.icu/20241223/gem5-smaug-custom-operator/

Published on December 23, 2024
