vllm.model_executor.layers.quantization.deepspeedfp ¶
   DeepSpeedFPConfig ¶
  Bases: QuantizationConfig
Config for DeepSpeed FP quantizer. It supports fp6 and fp8.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| weight_bits | int | The target quantization bit width: 6 or 8. | 8 |
| group_size | int | The group size for quantization. | 512 |
Source code in vllm/model_executor/layers/quantization/deepspeedfp.py
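For a rough sense of what `weight_bits` and `group_size` imply for memory, here is an illustrative sketch in plain Python (not vLLM code; the one-fp16-scale-per-group layout is an assumption for illustration, not DeepSpeed's exact storage format):

```python
# Illustrative only: estimate the storage footprint implied by weight_bits
# and group_size, assuming packed b-bit values plus one fp16 scale per group.
def quantized_bytes(num_elements: int, weight_bits: int = 8,
                    group_size: int = 512) -> int:
    """Bytes for packed b-bit payload plus one 2-byte scale per group."""
    packed = (num_elements * weight_bits + 7) // 8        # packed payload
    num_groups = (num_elements + group_size - 1) // group_size
    scales = num_groups * 2                               # fp16 scale each
    return packed + scales

# A 4096x4096 weight matrix: fp6 packs roughly 25% smaller than fp8.
n = 4096 * 4096
fp8 = quantized_bytes(n, weight_bits=8)
fp6 = quantized_bytes(n, weight_bits=6)
```

The per-group scales add only a small overhead at the default `group_size` of 512 (one fp16 value per 512 weights).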
   __init__ ¶
   from_config  classmethod  ¶
 from_config(config: dict[str, Any]) -> DeepSpeedFPConfig
    get_config_filenames  staticmethod  ¶
     get_linear_method ¶
 get_linear_method() -> DeepSpeedFPLinearMethod
  get_name  classmethod  ¶
 get_name() -> QuantizationMethods
  get_quant_method ¶
 get_quant_method(
    layer: Module, prefix: str
) -> Optional[DeepSpeedFPLinearMethod]
  DeepSpeedFPLinearMethod ¶
  Bases: LinearMethodBase
Linear method for DeepSpeedFP quantizer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| quant_config | DeepSpeedFPConfig | The DeepSpeedFP quantization config. | required |
   __init__ ¶
 __init__(quant_config: DeepSpeedFPConfig)
  apply ¶
     create_weights ¶
 create_weights(
    layer: Module,
    input_size_per_partition: int,
    output_partition_sizes: list[int],
    input_size: int,
    output_size: int,
    params_dtype: dtype,
    weight_loader=None,
    **extra_weight_attrs,
)
   DeepSpeedFPParameter ¶
  Bases: Parameter
DeepSpeedFP quantized parameter class that implements fp6/fp8 quantization via DeepSpeed. Weights are stored in quantized form on the GPU and can be dequantized on the fly when the model needs them.
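The "store quantized, dequantize on the fly" idea can be illustrated with a per-group symmetric quantization analogue in plain Python (this is not DeepSpeed's CUDA kernel, just a sketch of the scheme):

```python
# Illustrative analogue: symmetric signed b-bit quantization with one scale
# per group, mirroring the store-quantized / dequantize-on-demand pattern.
def quantize(values, weight_bits=8, group_size=4):
    qmax = 2 ** (weight_bits - 1) - 1
    q, scales = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        scale = max(abs(v) for v in group) / qmax or 1.0  # avoid scale == 0
        scales.append(scale)
        q.extend(max(-qmax, min(qmax, round(v / scale))) for v in group)
    return q, scales

def dequantize(q, scales, group_size=4):
    # Each group of b-bit values shares a single scale factor.
    return [q[i] * scales[i // group_size] for i in range(len(q))]

weights = [1.0, 0.3, -0.6, 0.0]
q, s = quantize(weights)
restored = dequantize(q, s)  # close to the original, within one scale step
```

The real parameter keeps `q` and the scales resident on the GPU and only materializes the dequantized tensor when a forward pass needs it.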
   __new__ ¶
 __new__(
    orig_shape: Size,
    params_dtype: dtype,
    quant_config: DeepSpeedFPConfig,
)
   ds_dequantize ¶
 ds_dequantize(fp_out=None) -> Tensor
Return a tensor containing the dequantized weights of this parameter.
   ds_quantize_ ¶
 ds_quantize_(tensor: Tensor)
    ds_selective_dequantize ¶
 ds_selective_dequantize(indices, fp_out=None) -> Tensor
Return a tensor where only the weights at indices are dequantized (to save HBM -> SRAM bandwidth).
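The bandwidth-saving idea behind selective dequantization can be sketched as follows (illustrative plain Python with per-row scales, not DeepSpeed's implementation, which operates on the parameter's own quantized storage):

```python
# Illustrative only: dequantize just the rows named in `indices`, leaving
# the rest of the quantized data untouched and uncopied.
def selective_dequantize(q_rows, row_scales, indices):
    """q_rows: quantized integer rows; row_scales: one scale per row."""
    return {i: [v * row_scales[i] for v in q_rows[i]] for i in indices}

rows = [[127, 0, -127], [64, -64, 32]]
scales = [0.5, 0.25]
out = selective_dequantize(rows, scales, [1])  # only row 1 is materialized
```

Dequantizing only the requested slices keeps the HBM-to-SRAM traffic proportional to the rows actually used rather than the whole parameter.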