Custom Tensor Operations in PyTorch

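PyTorch, alongside TensorFlow, has become a standard among deep learning researchers and practitioners. While PyTorch offers a wide variety of tensor operations and deep learning layers, some specialized operations still need to be implemented manually. Where runtime is crucial, this should be done in C or CUDA so that both CPU and GPU computation are supported.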

In this article, I want to provide a simple example and framework for extending PyTorch with custom C and CUDA operations using CFFI for Python and CuPy.

1. Introduction

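PyTorch has become a standard tool for deep learning research and development. Even uncommon tensor operations or neural network layers can easily be implemented using the variety of operations PyTorch provides. For some special operations, however, it can make sense to resort to efficient C and CUDA implementations. In this article, I want to show how to easily extend PyTorch using CFFI for CPU operations and CuPy for GPU operations. As an example, I will implement a tensor operation that computes the element-wise, bit-wise Hamming distance between two tensors.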

The code for this article is available on GitHub and can easily be extended and adapted.

2. Setup

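I will use the C Foreign Function Interface (CFFI) for Python to implement the Hamming distance between tensors of data type torch.int32 on the CPU. To support GPU computation, the Hamming distance can also be implemented as a CUDA kernel, which can be loaded using CuPy. Both CFFI and CuPy can be installed easily, for example with pip install. For CuPy, however, the installation has to match the CUDA version in use (the same holds for PyTorch). Detailed installation instructions can be found in the CuPy documentation.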

3. Bit-wise Hamming Distance in C

A simple C implementation of the bit-wise Hamming distance between two 32-bit integers is given below:

Listing 1: Hamming distance between two integers.

int a = 15;
int b = 19;
int dist = 0;
int x = a ^ b;
 
while(x != 0) {
    x = x & (x-1);
    dist++;
}
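
As a quick sanity check of Listing 1, the same loop can be traced in Python. This is only a sketch: the function name hamming_distance_32 is hypothetical, and the 32-bit mask is an assumption needed because Python integers are unbounded, unlike a C int. For a = 15 (01111) and b = 19 (10011), the XOR is 28 (11100), so three bits differ.

```python
def hamming_distance_32(a, b):
    """Count differing bits between two 32-bit integers (hypothetical helper)."""
    # Mask to 32 bits so negative values behave like two's-complement C ints.
    x = (a ^ b) & 0xFFFFFFFF
    dist = 0
    while x != 0:
        # x & (x - 1) clears the lowest set bit, so the loop runs once per set bit.
        x &= x - 1
        dist += 1
    return dist


print(hamming_distance_32(15, 19))  # 15 ^ 19 == 28 == 0b11100, prints 3
```

Because each iteration removes exactly one set bit, the loop runs as many times as there are differing bits rather than once per bit position.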

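To compute the element-wise Hamming distance between two tensors, Listing 1 can be wrapped in a simple loop. The resulting function, shown in Listing 2, expects three int arrays: the first input tensor, the second input tensor, and the output tensor that will be filled with the Hamming distances. All arrays are assumed to have the same length n: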

Listing 2: Element-wise Hamming distance between two integer arrays. Optionally, OpenMP can be used to speed up the computation.

void cffi_int32hammingdistance(const int n, const int* a, const int* b, int* dist) {
    #pragma omp parallel for
    for (int elem_idx = 0; elem_idx < n; elem_idx++) {
 
        dist[elem_idx] = 0;
        int x = a[elem_idx] ^ b[elem_idx];
 
        while(x != 0) {
            x = x & (x-1);
            dist[elem_idx]++;
        }
    }
}

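With CFFI, this function can be used directly to operate on NumPy arrays or PyTorch tensors. To this end, the implementation can be saved in cffi.c and the corresponding header in cffi.h.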

4. Bit-wise Hamming Distance CUDA Kernel

The same algorithm as outlined in Listing 1 can easily be put into a CUDA kernel:

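Listing 3: CUDA kernel for computing the Hamming distance between two integer arrays. This is essentially Listing 1 applied to the array element selected by the kernel's block index, block dimension, and thread index.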

extern "C" __global__ void cupy_int32hammingdistance(
    const int n,
    const int* a,
    const int* b,
    int* dist
) {
    int elem_idx = blockIdx.x * blockDim.x + threadIdx.x;
        
    if (elem_idx >= n) {
        return;
    }
 
    int x = a[elem_idx] ^ b[elem_idx];
 
    while(x != 0) {
        x = x & (x-1);
        dist[elem_idx]++;
    }
}

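CuPy only needs the kernel itself; the kernel does not have to be stored in a separate source file. Instead, it can be passed to CuPy as a Python string. Note that, unlike Listing 2, the kernel does not reset dist[elem_idx] before counting, so the output array has to be zero-initialized by the caller.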
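
The index arithmetic in Listing 3 can be illustrated on the CPU. The following sketch emulates in plain Python how a one-dimensional launch maps (blockIdx.x, threadIdx.x) pairs to element indices. The function name covered_indices is hypothetical, and the block size of 512 threads is taken from the launch configuration in Listing 6. It also shows why the elem_idx >= n guard is needed: the last block usually contains threads that point past the end of the array.

```python
def covered_indices(n, block_dim=512):
    """Emulate a 1-D CUDA launch and return the element indices that pass the bounds guard."""
    # Ceiling division: enough blocks so that grid_dim * block_dim >= n.
    grid_dim = (n + block_dim - 1) // block_dim
    indices = []
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            elem_idx = block_idx * block_dim + thread_idx
            if elem_idx >= n:
                # Surplus threads in the last block return without touching memory.
                continue
            indices.append(elem_idx)
    return grid_dim, indices


grid_dim, indices = covered_indices(1000)
print(grid_dim, len(indices))  # 2 blocks launch 1024 threads; 1000 of them do work
```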

5. Putting Everything Together

To assemble all parts, I will use a simple modular structure that separates the actual implementations (using CFFI in cffi/ or CuPy in cupy.py) from the high-level method in torch.py:

common/
- __init__.py
- cffi/
  |- cffi.h
  |- cffi.c
  |- __init__.py
- cupy.py
- torch.py

5.1 CPU Implementation

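First, I will take care of the CPU implementation, that is, cffi.c and cffi.h. For simplicity, these can be placed in their own directory, with the Python interface defined in the corresponding __init__.py.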

Listing 4: Using CFFI, C code can be compiled on the fly and accessed directly from Python. See the comments for details.

    # https://stackoverflow.com/questions/22931147/stdisinf-does-not-work-with-ffast-math-how-to-check-for-infinity
    else:
        if use_openmp:
            ffi.set_source(
                '_cffi',
                my_source.read(),
                extra_compile_args=['-fopenmp', '-D use_openmp', '-O3','-march=native'],
                extra_link_args=['-fopenmp'],
            )
        else:
            ffi.set_source(
                '_cffi',
                my_source.read(),
                extra_compile_args=['-O3', '-march=native'],
            )
 
# 4.
# Compile using the parameters above.
ffi.compile()
#ffi.compile(verbose=True)
# 5.
# This simply imports all compiled functions and makes them available in this module.
from _cffi import *

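This code takes care of compiling the function from Listing 2 and interfacing it with Python. Afterwards, the function can be accessed via cffi.lib.cffi_int32hammingdistance, where Listing 4 corresponds to cffi/__init__.py. The essential steps are as follows: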

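1. Get the absolute path of the directory containing the __init__.py file. This is necessary to locate the header and implementation files to be compiled. Depending on the setup, this can also be solved differently, for example by hard-coding the absolute path.
2. Read the header file so that CFFI knows the function definitions.
3. Read the source file and set the compilation options. Here, the code allows several different settings, including a debug setting without optimization and a setting with OpenMP support.
4. Knowing the function definitions (through the header file), the compilation options, and the source code, CFFI can compile everything.
5. Finally, all compiled functions are imported so that they can be accessed more easily later.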
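
The five steps can be condensed into a small, self-contained sketch. It is not the code from the repository: the helper names compile_args and build are hypothetical, the debug flags (-O0, -g) are an assumption based on the description "without optimization", and the header is assumed to contain only plain declarations that ffi.cdef can parse. The optimized flags mirror those visible in Listing 4.

```python
import os


def compile_args(debug=False, use_openmp=True):
    """Return (extra_compile_args, extra_link_args) for the chosen build mode."""
    if debug:
        # Assumed debug flags: no optimization, debug symbols.
        return ['-O0', '-g'], []
    if use_openmp:
        return ['-fopenmp', '-D use_openmp', '-O3', '-march=native'], ['-fopenmp']
    return ['-O3', '-march=native'], []


def build(dname, debug=False, use_openmp=True):
    """Compile cffi.c found in directory dname into a module named _cffi."""
    from cffi import FFI  # imported lazily so compile_args works without CFFI installed

    ffi = FFI()
    # Step 2: declare the C functions from the header.
    with open(os.path.join(dname, 'cffi.h')) as my_header:
        ffi.cdef(my_header.read())
    # Step 3: read the source and set the compiler and linker options.
    compile_flags, link_flags = compile_args(debug, use_openmp)
    with open(os.path.join(dname, 'cffi.c')) as my_source:
        ffi.set_source('_cffi', my_source.read(),
                       extra_compile_args=compile_flags,
                       extra_link_args=link_flags)
    # Step 4: compile; the extension module is written into dname.
    return ffi.compile(tmpdir=dname)


print(compile_args(debug=False, use_openmp=False))
```

Step 1 then amounts to calling build(os.path.dirname(os.path.abspath(__file__))) from cffi/__init__.py, and step 5 to importing ffi and lib from the generated _cffi module once its directory is on the import path.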

5.2 GPU Implementation

For the CuPy part, I will create a separate module in cupy.py:

Listing 5: Similar to CFFI, CuPy allows CUDA kernels to be compiled on the fly. See the comments for details.

import torch
 
 
try:
    import cupy
    # 1. This will be used to call a kernel with source code provided as Python string.
    @cupy.util.memoize(for_each_device=True)
    def cunnex(strFunction):
        return cupy.cuda.compile_with_cache(globals()[strFunction]).get_function(strFunction)
except ImportError:
    print("CUPY cannot initialize, not using CUDA kernels")
 
 
class Stream:
    ptr = torch.cuda.current_stream().cuda_stream
 
# 2. The kernel as Python string from Listing 3
cupy_int32hammingdistance = '''
    extern "C" __global__ void cupy_int32hammingdistance(
        const int n,
        const int* a,
        const int* b,
        int* dist
    ) {
        int elem_idx = blockIdx.x * blockDim.x + threadIdx.x;
        
        if (elem_idx >= n) {
            return;
        }
 
        int x = a[elem_idx] ^ b[elem_idx];
 
        while(x != 0) {
            x = x & (x-1);
            dist[elem_idx]++;
        }
    }
'''

The CuPy interface is even simpler:

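1. This utility function takes care of compiling and interfacing the kernel. As its argument, it expects the name of a variable that holds the actual CUDA kernel source code.
2. The source code is kept in this variable instead of a separate source file.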
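
Depending on the installed CuPy version, cupy.util.memoize and cupy.cuda.compile_with_cache may no longer be available. As an alternative, the same kernel can be wrapped with cupy.RawKernel. The sketch below is a swapped-in technique, not the approach of Listing 5: the function name hamming_distance_cupy is hypothetical, it operates on CuPy arrays rather than PyTorch tensors, and it requires a CUDA-capable GPU to actually run.

```python
KERNEL_SOURCE = r'''
extern "C" __global__ void cupy_int32hammingdistance(
    const int n, const int* a, const int* b, int* dist
) {
    int elem_idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (elem_idx >= n) {
        return;
    }
    int x = a[elem_idx] ^ b[elem_idx];
    while (x != 0) {
        x = x & (x - 1);
        dist[elem_idx]++;
    }
}
'''


def hamming_distance_cupy(a, b, threads=512):
    """Launch the kernel on two contiguous cupy int32 arrays of equal size (sketch)."""
    import cupy  # imported lazily; needs a working CUDA installation

    kernel = cupy.RawKernel(KERNEL_SOURCE, 'cupy_int32hammingdistance')
    n = a.size
    # The kernel only increments, so the output has to start at zero.
    dist = cupy.zeros(a.shape, dtype=cupy.int32)
    blocks = (n + threads - 1) // threads
    # Arguments: grid dimensions, block dimensions, kernel arguments.
    kernel((blocks,), (threads,), (cupy.int32(n), a, b, dist))
    return dist
```

On a machine with a GPU, calling hamming_distance_cupy on cupy.array([15], dtype=cupy.int32) and cupy.array([19], dtype=cupy.int32) should give a distance of 3, matching Listing 1.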

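Finally, torch.py combines both implementations. The result is a high-level function, int32_hamming_distance, that expects two torch.int32 tensors as input. The function automatically allocates memory for the output and calls the appropriate interface depending on whether the inputs reside in GPU memory. To determine whether a tensor is in GPU memory, a simple is_cuda function is used (not shown here).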
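
Since is_cuda is not shown, here is a minimal guess at what such a helper could look like, assuming it simply wraps the is_cuda attribute that torch.Tensor provides. The helper in the repository may differ. FakeTensor is a hypothetical stand-in used only so that the example runs without PyTorch.

```python
def is_cuda(mixed):
    """Return True if the given tensor-like object reports that it lives in GPU memory."""
    # torch.Tensor exposes a boolean is_cuda attribute; objects without it count as CPU.
    return bool(getattr(mixed, 'is_cuda', False))


class FakeTensor:
    """Stand-in used only to demonstrate the helper without importing torch."""

    def __init__(self, on_gpu):
        self.is_cuda = on_gpu


print(is_cuda(FakeTensor(True)), is_cuda(FakeTensor(False)))  # True False
```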

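Listing 6: Putting the CFFI and CuPy implementations together in a high-level method, int32_hamming_distance, that automatically uses the CPU or GPU implementation depending on the input tensors.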

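import torch

# Assumed imports: the local wrapper modules cffi/ and cupy.py from the layout above.
from . import cffi, cupy


def int32_hamming_distance(a, b):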
    """
    Bit-wise hamming distance.
 
    :param a: first tensor
    :type a: torch.Tensor
    :param b: second tensor
    :type b: torch.Tensor
    :return: hamming distance
    :rtype: torch.Tensor
    """
 
    if not a.is_contiguous():
        # contiguous() returns a new tensor, so the result has to be assigned.
        a = a.contiguous()
    assert (a.dtype == torch.int32)
    cuda = is_cuda(a)
 
    if not b.is_contiguous():
        b = b.contiguous()
    assert (b.dtype == torch.int32)
    assert is_cuda(b) is cuda
 
    assert len(a.shape) == len(b.shape)
    for d in range(len(a.shape)):
        assert a.shape[d] == b.shape[d]
 
    # 1. Initialize output tensor to hold the element-wise hamming distances.
    dist = a.new_zeros(a.shape).int()
    n = dist.nelement()
 
    if cuda:
        # 2. Call the cupy implementation using the helper function cupy.cunnex.
        # The function returned by cupy.cunnex expects, among others, the grid/block division to use
        # and the kernel arguments; here a.data_ptr() will return the pointer to the tensor a,
        # which is assumed to be in GPU memory.
        cupy.cunnex('cupy_int32hammingdistance')(
            grid=tuple([(n + 512 - 1) // 512, 1, 1]),
            block=tuple([512, 1, 1]),
            args=[n,
                  a.data_ptr(),
                  b.data_ptr(),
                  dist.data_ptr()],
            stream=cupy.Stream
        )
    else:
        # 3. For CFFI, the inputs have to be cast to the target C equivalents using cffi.ffi.cast.
        # Afterwards, the C function can be called like a regular Python function using the converted arguments.
        _n = cffi.ffi.cast('int', n)
        _a = cffi.ffi.cast('int*', a.data_ptr())
        _b = cffi.ffi.cast('int*', b.data_ptr())
        _dist = cffi.ffi.cast('int*', dist.data_ptr())
 
        cffi.lib.cffi_int32hammingdistance(_n, _a, _b, _dist)
 
    return dist

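In essence, the function creates the output tensor to hold the element-wise Hamming distances and then calls the CuPy or CFFI interface depending on whether the inputs are stored in GPU memory: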

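1. The output tensor is created, on the GPU if needed. It is again a torch.int32 tensor of the same size.
2. The CuPy implementation is called via the helper function cupy.cunnex, which fetches the corresponding source code, compiles it (if not cached), and returns the corresponding function. The returned function expects the kernel arguments (here, a.data_ptr() is used to access the pointer underlying the given tensor) as well as the grid and block subdivision for the kernel launch.
3. Calling the CFFI implementation requires explicitly casting the inputs to their C equivalents. Afterwards, the function can be called like a regular Python function.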
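
A pure-Python reference is handy for checking either backend. The following sketch is not part of the article's code; the function name reference_hamming is hypothetical. It computes the expected distances for two lists of int32 values, masking to 32 bits so that negative numbers are counted in their two's-complement form.

```python
def reference_hamming(a_values, b_values):
    """Element-wise bit-wise Hamming distance for two equally long lists of int32 values."""
    assert len(a_values) == len(b_values)
    # Mask to 32 bits so negative numbers are treated as two's-complement int32.
    return [bin((x ^ y) & 0xFFFFFFFF).count('1') for x, y in zip(a_values, b_values)]


print(reference_hamming([15, 0, -1], [19, 0, 0]))  # [3, 0, 32]
```

With PyTorch available, the output of int32_hamming_distance(a, b) can be flattened and converted with tolist() and then compared against this list for the same inputs.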

6. Conclusion

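Overall, this article shows that custom PyTorch operations supporting both CPU and GPU computation can be implemented in C and CUDA quite easily using CFFI and CuPy. In addition, I provided a simple framework that can easily be extended to other custom operations.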

Original article: Implementing Custom PyTorch Tensor Operations in C and CUDA

Translated and compiled by BimAnt; please credit the source when reposting.
