vllm.model_executor.layers.quantization.kernels.scaled_mm ¶
 Modules:
| Name | Description | 
|---|---|
| ScaledMMLinearKernel | |
| aiter | |
| cpu | |
| cutlass | |
| triton | |
| xla | |
  _POSSIBLE_KERNELS  module-attribute  ¶
```python
_POSSIBLE_KERNELS: dict[
    PlatformEnum, list[type[ScaledMMLinearKernel]]
] = {
    CPU: [CPUScaledMMLinearKernel],
    CUDA: [CutlassScaledMMLinearKernel],
    ROCM: [
        AiterScaledMMLinearKernel,
        TritonScaledMMLinearKernel,
    ],
    TPU: [XLAScaledMMLinearKernel],
}
```
  choose_scaled_mm_linear_kernel ¶
```python
choose_scaled_mm_linear_kernel(
    config: ScaledMMLinearLayerConfig,
    compute_capability: int | None = None,
) -> type[ScaledMMLinearKernel]
```
Choose a ScaledMMLinearKernel that can implement the given config for the given compute capability. Attempts to choose the best-performing kernel.
Parameters:
| Name | Type | Description | Default | 
|---|---|---|---|
| config | ScaledMMLinearLayerConfig | Description of the linear layer to be implemented. | required |
| compute_capability | int \| None | The compute capability of the target device; if None, it is queried from the current platform. | None |
Raises:
| Type | Description | 
|---|---|
| ValueError | If no kernel can implement the given config. |
Returns:
| Type | Description | 
|---|---|
| type[ScaledMMLinearKernel] | Chosen kernel. |
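The selection logic can be pictured as a first-match loop over the platform's ordered candidates. This is a hedged sketch, not the real vllm implementation: it assumes each kernel class exposes a `can_implement(config) -> tuple[bool, str | None]` classmethod returning a success flag and a failure reason, and `FakeConfig`, `KernelA`, `KernelB`, and `choose_kernel` are hypothetical stand-ins.

```python
class FakeConfig:
    """Hypothetical stand-in for ScaledMMLinearLayerConfig."""
    is_static_input_scheme = False


class KernelA:
    @classmethod
    def can_implement(cls, config):
        # Rejects the config, returning a human-readable reason.
        return (False, "KernelA requires a static input scheme")


class KernelB:
    @classmethod
    def can_implement(cls, config):
        return (True, None)


def choose_kernel(candidates, config):
    """Return the first candidate (best-first order) that can implement config."""
    failures = []
    for kernel in candidates:
        ok, reason = kernel.can_implement(config)
        if ok:
            return kernel
        failures.append(f"{kernel.__name__}: {reason}")
    # Mirrors the documented ValueError when no kernel fits.
    raise ValueError(
        "No kernel can implement the given config. " + "; ".join(failures)
    )


print(choose_kernel([KernelA, KernelB], FakeConfig()).__name__)
```

KernelA rejects the config, so the loop falls through to KernelB; with an exhausted candidate list, the collected reasons are surfaced in the ValueError so callers can see why each kernel was skipped.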