DeepSpeed Inference Optimization in Practice for LLM Inference

jf_pmFSk4VX · Source: Zhihu · 2023-11-16 14:36
1. Optimization Points of DeepSpeed Inference

In summary, the main optimization points of DeepSpeed Inference are the following (a brief API sketch follows the list):

Multi-GPU parallelism

Operator fusion for small batches

INT8 model quantization

A pipeline-parallel scheme for inference
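
As a rough sketch of how the first three points surface in the API, the snippet below combines tensor parallelism, INT8 kernels, and kernel injection in a single deepspeed.init_inference call, the same entry point used in section 2.1. This particular combination is an assumption for illustration, not a configuration benchmarked in this article.

import torch
import deepspeed
from transformers import AutoModelForCausalLM

# Illustrative sketch only: this exact combination of options is an assumption,
# not a setup measured later in this article.
model = AutoModelForCausalLM.from_pretrained("philschmid/gpt-j-6B-fp16-sharded",
                                             torch_dtype=torch.float16)
ds_model = deepspeed.init_inference(
  model,
  mp_size=2,                        # model-parallel degree (number of GPUs under the launcher)
  dtype=torch.int8,                 # use the INT8 quantized inference kernels
  replace_with_kernel_inject=True,  # inject the fused DeepSpeed kernels
)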

1.1 Operator Fusion in DeepSpeed

A Transformer layer can be divided into the following four main parts:

Input Layer-Norm plus Query, Key, and Value GeMMs and their bias adds.

Transform plus Attention.

Intermediate FF, Layer-Norm, Bias-add, Residual, and Gaussian Error Linear Unit (GELU).

Bias-add plus Residual.

As shown in the figure below, each part can be fused into a single kernel; compared with the unfused baseline, the four parts achieve speedups of roughly 1.5x, 2.9x, 3x, and 1.2x, respectively. A toy illustration of the idea follows the figure.

[Figure: Deep-Fusion strategy for small-batch inference]
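
To make the idea concrete, here is a toy PyTorch comparison (my own illustration, not DeepSpeed's hand-written CUDA kernels) for the simplest part, the bias-add plus residual: the unfused path launches one elementwise kernel per op, while a scripted version lets the compiler fuse them into a single kernel, which is where the savings in kernel launches and memory traffic come from.

import torch

def bias_add_residual_unfused(x, bias, residual):
  out = x + bias        # elementwise kernel 1
  out = out + residual  # elementwise kernel 2
  return out

@torch.jit.script
def bias_add_residual_fused(x: torch.Tensor, bias: torch.Tensor, residual: torch.Tensor):
  # TorchScript can fuse these elementwise ops into one kernel on CUDA
  return x + bias + residual

x, bias, residual = torch.randn(8, 4096), torch.randn(4096), torch.randn(8, 4096)
assert torch.allclose(bias_add_residual_unfused(x, bias, residual),
                      bias_add_residual_fused(x, bias, residual))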

1.2 Inference Pipeline

與訓(xùn)練不同,對(duì)于生成式的大模型,主要有以下挑戰(zhàn):

The prompt-processing phase is compute-bound: the fewer the pipeline bubbles, the better the compute utilization.

The token-generation phase is bandwidth-bound: the fewer the micro-batches, the lower the bandwidth requirement.

The decoder is autoregressive, which requires efficient scheduling across pipeline stages.

An autoregressive generative model therefore has two phases with very different characteristics.

The KV caches of the many sequences in flight in the pipeline occupy a large amount of GPU memory (a rough size estimate follows this list).
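
As a back-of-the-envelope sketch of that last point (assuming GPT-J-6B-like dimensions of 28 layers and a hidden size of 4096 with fp16 activations; the numbers are illustrative, not taken from this article's experiments), the KV cache grows linearly in both batch size and sequence length:

def kv_cache_bytes(batch_size, seq_len, num_layers=28, hidden_size=4096, bytes_per_elem=2):
  # 2x accounts for storing both the key and the value tensor of every layer
  return 2 * num_layers * batch_size * seq_len * hidden_size * bytes_per_elem

gib = kv_cache_bytes(batch_size=32, seq_len=1024) / 1024**3
print(f"KV cache for batch 32, seq len 1024: {gib:.1f} GiB")  # roughly 14 GiB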


DeepSpeed addresses these problems with the following three techniques:

1. Hiding data dependencies with a hybrid schedule

The batch is first split into micro-batches, with the number of micro-batches equal to the pipeline depth; micro-batches produce tokens in an order driven by a dynamic queue, which avoids pipeline bubbles (a toy illustration of the split follows this list). In addition, because the two phases have different durations, a hybrid schedule reduces the latency of both phases at the same time.


2. Offloading the KV cache to CPU memory

3. Communication optimization
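
As a toy illustration of the first technique (my own sketch, not DeepSpeed's actual scheduler), splitting the batch into as many micro-batches as there are pipeline stages is what lets every stage stay busy while tokens are generated round-robin:

def split_into_microbatches(prompts, pipeline_depth):
  """Split a batch of prompts into `pipeline_depth` roughly equal micro-batches."""
  micro_batches = [[] for _ in range(pipeline_depth)]
  for i, prompt in enumerate(prompts):
    micro_batches[i % pipeline_depth].append(prompt)
  return micro_batches

prompts = [f"prompt {i}" for i in range(16)]
for mb in split_into_microbatches(prompts, pipeline_depth=4):
  print(len(mb), mb[0])  # 4 micro-batches of 4 prompts each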

2. DeepSpeed Inference in Practice

2.1 Basic Usage of DeepSpeed Inference

首先構(gòu)造一個(gè)基本的 6B GPT-J 文本生成任務(wù),并測(cè)量其延時(shí)性能

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from time import perf_counter
import numpy as np
import transformers
# hide generation warnings
transformers.logging.set_verbosity_error()

def measure_latency(model, tokenizer, payload, generation_args, device):
  input_ids = tokenizer(payload, return_tensors="pt").input_ids.to(device)
  latencies = []
  # warm up
  for _ in range(2):
    _ = model.generate(input_ids, **generation_args)
  # Timed run
  for _ in range(10):
    start_time = perf_counter()
    _ = model.generate(input_ids, **generation_args)
    latency = perf_counter() - start_time
    latencies.append(latency)
  # Compute run statistics
  time_avg_ms = 1000 * np.mean(latencies)
  time_std_ms = 1000 * np.std(latencies)
  time_p95_ms = 1000 * np.percentile(latencies,95)
  return f"P95 latency (ms) - {time_p95_ms}; Average latency (ms) - {time_avg_ms:.2f} +- {time_std_ms:.2f};", time_p95_ms

# Model Repository on huggingface.co
model_id = "philschmid/gpt-j-6B-fp16-sharded"

# load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).cuda()

payload = "Hello my name is Philipp. I am getting in touch with you because i didn't get a response from you. What do I need to do to get my new card which I have requested 2 weeks ago? Please help me and answer this email in the next 7 days. Best regards and have a nice weekend but it"
input_ids = tokenizer(payload,return_tensors="pt").input_ids.to(model.device)
print(f"input payload: 
 
{payload}")
logits = model.generate(input_ids, do_sample=True, num_beams=1, min_length=128, max_new_tokens=128)
print(f"prediction: 
 
 {tokenizer.decode(logits[0].tolist()[len(input_ids[0]):])}")

payload="Hello my name is Philipp. I am getting in touch with you because i didn't get a response from you. What do I need to do to get my new card which I have requested 2 weeks ago? Please help me and answer this email in the next 7 days. Best regards and have a nice weekend but it"*2
print(f'Payload sequence length is: {len(tokenizer(payload)["input_ids"])}')

# generation arguments
generation_args = dict(
 do_sample=False,
 num_beams=1,
 min_length=128,
 max_new_tokens=128
)
vanilla_results = measure_latency(model, tokenizer, payload, generation_args, model.device)

print(f"Vanilla model: {vanilla_results[0]}")

其性能結(jié)果輸出如下:

Vanilla model: P95 latency (ms) - 5369.078720011748; Average latency (ms) - 5357.65 +- 12.07;

Next, wrap the model with DeepSpeed's init_inference function; afterwards the Transformer layers are replaced by the DeepSpeedTransformerInference class:

import deepspeed
# init deepspeed inference engine
ds_model = deepspeed.init_inference(
  model=model,           # the Transformers model
  mp_size=1,             # number of GPUs (model-parallel degree)
  dtype=torch.float16,   # dtype of the weights (fp16)
  replace_method="auto", # let DeepSpeed automatically identify the layers to replace
  replace_with_kernel_inject=True, # replace them with the fused inference kernels
)
print(f"model is loaded on device {ds_model.module.device}")

from deepspeed.ops.transformer.inference import DeepSpeedTransformerInference
assert isinstance(ds_model.module.transformer.h[0], DeepSpeedTransformerInference), "Model not successfully initialized"

# Test model
example = "My name is Philipp and I"
input_ids = tokenizer(example,return_tensors="pt").input_ids.to(model.device)
logits = ds_model.generate(input_ids, do_sample=True, max_length=100)
tokenizer.decode(logits[0].tolist())

payload = (
  "Hello my name is Philipp. I am getting in touch with you because i didn't get a response from you. What do I need to do to get my new card which I have requested 2 weeks ago? Please help me and answer this email in the next 7 days. Best regards and have a nice weekend but it"
  * 2
)
print(f'Payload sequence length is: {len(tokenizer(payload)["input_ids"])}')

# generation arguments
generation_args = dict(do_sample=False, num_beams=1, min_length=128, max_new_tokens=128)
ds_results = measure_latency(ds_model, tokenizer, payload, generation_args, ds_model.module.device)

print(f"DeepSpeed model: {ds_results[0]}")

性能結(jié)果如下,跟未優(yōu)化的相比,性能提升2倍以上

DeepSpeed model: P95 latency (ms) - 2415.4563972260803; Average latency (ms) - 2413.18 +- 1.47;

2.2 Analyzing the Effect of the Inference Pipeline

Run the 6B GPT-J text-generation task with the default transformers pipeline and measure its latency:

import os
from time import perf_counter

import numpy as np
from transformers import pipeline, set_seed

def measure_pipeline_latency(generator, prompt, max_length, num_return_sequences):
  latencies = []
  # warm up
  for _ in range(2):
    output = generator(prompt, max_length=max_length, num_return_sequences=num_return_sequences)
  # Timed run
  for _ in range(10):
    start_time = perf_counter()
    output = generator(prompt, max_length=max_length, num_return_sequences=num_return_sequences)
    latency = perf_counter() - start_time
    latencies.append(latency)
  # Compute run statistics
  time_avg_ms = 1000 * np.mean(latencies)
  time_std_ms = 1000 * np.std(latencies)
  time_p95_ms = 1000 * np.percentile(latencies,95)
  return f"P95 latency (ms) - {time_p95_ms}; Average latency (ms) - {time_avg_ms:.2f} +- {time_std_ms:.2f};", time_p95_ms

model_id = "philschmid/gpt-j-6B-fp16-sharded"
local_rank = int(os.getenv('LOCAL_RANK', '0'))
generator = pipeline('text-generation', model=model_id, device=local_rank)
set_seed(42)

output = generator("Hello, I'm a language model,", max_length=50, num_return_sequences=4)
print(output)

vanilla_results = measure_pipeline_latency(generator, "Hello, I'm a language model,", 50, 4)
print(f"Vanilla model: {vanilla_results[0]}")

首先會(huì)生成4段不同的結(jié)果,如下所示:

[{'generated_text': "Hello, I'm a language model, for the question that is being asked, and I have 2 vectors, and it's not going to start with a verb (ex. I love this movie) what is the best classifier to see if I"}, 
{'generated_text': "Hello, I'm a language model, and I'm at the beginning of my study of LSTMs. I'm following https://medium.com/@LSTMJiaLiu/understanding-lstms-for-nlp"}, 
{'generated_text': "Hello, I'm a language model, I'm a language model. I talk all the time, and I help teach people. That's my only purpose. That's what this is about. I'm doing this, I'm just letting you know"}, 
{'generated_text': "Hello, I'm a language model, a sequence-to-sequence model for generating natural language. I've been
learning all about it, and the more I learn the more awesome it becomes.
I have a couple of questions regarding it."}]

延時(shí)性能結(jié)果如下:

Vanilla model: P95 latency (ms) - 2417.524548433721; Average latency (ms) - 2409.83 +- 5.28;

Profiling it with Nsight Systems gives the following:

[Figure: Nsight Systems profiling of the single-GPU HF pipeline]

Next, process the model with DeepSpeed, then generate text in the same way and measure its performance:

import torch
import deepspeed
world_size = int(os.getenv('WORLD_SIZE', '4'))
generator.model = deepspeed.init_inference(generator.model,
                      mp_size=world_size,
                      dtype=torch.float,
                      replace_with_kernel_inject=True)

string = generator("DeepSpeed is", do_sample=True, max_length=50, num_return_sequences=4)
if not torch.distributed.is_initialized() or torch.distributed.get_rank() == 0:
  print(string)

ds_results = measure_pipeline_latency(generator, "Hello, I'm a language model,", 50, 4)
print(f"DS model: {ds_results[0]}")

當(dāng) world_size=1 時(shí)相當(dāng)于只使用了一個(gè)GPU,其延時(shí)結(jié)果如下,與原始版本相比性能提高了62.6%:

DS model: P95 latency (ms) - 1482.9604600323364; Average latency (ms) - 1482.22 +- 0.51;

當(dāng) world_size=4 時(shí)并使用deepspeed --num_gpus 4 test.py運(yùn)行代碼,此時(shí)使用了4塊 GPU,性能如下所示,延時(shí)約為 單GPU 性能的 37.2%:

DS model: P95 latency (ms) - 553.2004246022552; Average latency (ms) - 551.79 +- 0.86;

Profiling the 4-GPU run with Nsight Systems shows that, although the model-loading phases on the GPUs differ, the compute phases are aligned:

[Figure: Nsight Systems profiling of the 4-GPU DeepSpeed pipeline]

2.3 Comparing DeepSpeed with Accelerate

The examples in HuggingFace's huggingface/transformers-bloom-inference repository (Fast Inference Solutions for BLOOM, github.com) are used to compare DeepSpeed with Accelerate.

The table below shows measured latency (ms/token) for the bigscience/bloom-7b1 model on 8x A6000 GPUs, using DeepSpeed and Accelerate respectively:

project            bs=1     bs=8    bs=16   bs=32   bs=64   bs=128
accelerate fp16    34.04    7.41    5.12    3.69    3.54    3.38
accelerate int8    89.64   15.41    9.16    5.83    4.52    3.82
ds-inference fp16  19.33    2.42    1.20    0.63    0.38    0.27
ds-zero bf16      145.60   18.56    9.26    4.62    2.35    1.17
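
Reading the table as speedups of ds-inference fp16 over accelerate fp16 (a quick calculation derived from the numbers above, not an extra measurement):

def speedup(baseline_ms, optimized_ms):
  return baseline_ms / optimized_ms

for bs, acc, ds in [(1, 34.04, 19.33), (8, 7.41, 2.42), (128, 3.38, 0.27)]:
  print(f"bs={bs}: {speedup(acc, ds):.1f}x")  # ~1.8x, ~3.1x, ~12.5x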

Accelerate 運(yùn)行示例如下:

python3 bloom-inference-scripts/bloom-accelerate-inference.py --name bigscience/bloom-7b1 --batch_size 1 --benchmark

Its output log:

Using 8 gpus
Loading model bigscience/bloom-7b1
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:08<00:00, 4.46s/it]
*** Starting to generate 100 tokens with bs=1
Generate args {'max_new_tokens': 100, 'do_sample': False}
*** Running generate
------------------------------------------------------------
in=DeepSpeed is a machine learning framework
out=DeepSpeed is a machine learning framework for deep learning models. It is designed to be easy to use and flexible. DeepSpeed is a Python library that provides a high-level API for training and inference on deep learning models. DeepSpeed is a Python library that provides a high-level API for training and inference on deep learning models. DeepSpeed is a Python library that provides a high-level API for training and inference on deep learning models. DeepSpeed is a Python library that provides a high-level API for training and inference on deep learning models. DeepSpeed is a

*** Running benchmark

*** Performance stats:
Throughput per token including tokenize: 34.04 msecs
Start to ready to generate: 21.959 secs
Tokenize and generate 500 (bs=1) tokens: 6.832 secs
Start to finish: 28.791 secs

A sample DeepSpeed Inference run:

deepspeed --num_gpus 8 bloom-inference-scripts/bloom-ds-inference.py --name bigscience/bloom-7b1 --batch_size 8 --benchmark

Its output log:

[2023-05-17 0830,648] [INFO] [launch.pymain] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2023-05-17 0830,648] [INFO] [launch.pymain] nnodes=1, num_local_procs=8, node_rank=0
[2023-05-17 0830,648] [INFO] [launch.pymain] global_rank_mapping=defaultdict(, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2023-05-17 0830,648] [INFO] [launch.pymain] dist_world_size=8
[2023-05-17 0830,648] [INFO] [launch.pymain] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
[2023-05-17 0833,029] [INFO] [comm.pyinit_distributed] Initializing TorchBackend in DeepSpeed with backend nccl
*** Loading the model bigscience/bloom-7b1
[2023-05-17 0836,668] [INFO] [utils.pysee_memory_usage] pre-ds-inference-init
[2023-05-17 0836,669] [INFO] [utils.pysee_memory_usage] MA 0.0 GB     Max_MA 0.0 GB     CA 0.0 GB     Max_CA 0 GB 
[2023-05-17 0836,669] [INFO] [utils.pysee_memory_usage] CPU Virtual Memory: used = 14.64 GB, percent = 5.8%
[2023-05-17 0840,881] [WARNING] [config_utils.py_process_deprecated_field] Config parameter mp_size is deprecated use tensor_parallel.tp_size instead
[2023-05-17 0840,881] [INFO] [logging.pylog_dist] [Rank 0] quantize_bits = 8 mlp_extra_grouping = False, quantize_groups = 1
Detected CUDA files, patching ldflags
Emitting ninja build file /root/.cache/torch_extensions/py38_cu117/transformer_inference/build.ninja...
Building extension module transformer_inference...
Allowing ninja to set a default number of workers... (overridable by setting the environment variable MAX_JOBS=N)
ninja: no work to do.
Loading extension module transformer_inference...
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.3437182903289795 seconds
[2023-05-17 0841,920] [INFO] [logging.pylog_dist] [Rank 0] DeepSpeed-Inference config: {'layer_id': 0, 'hidden_size': 4096, 'intermediate_size': 16384, 'heads': 32, 'num_hidden_layers': -1, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'stochastic_mode': False, 'epsilon': 1e-05, 'mp_size': 8, 'q_int8': False, 'scale_attention': True, 'triangular_masking': True, 'local_attention': False, 'window_size': 1, 'rotary_dim': -1, 'rotate_half': False, 'rotate_every_two': True, 'return_tuple': True, 'mlp_after_attn': True, 'mlp_act_func_type': , 'specialized_mode': False, 'training_mp_size': 1, 'bigscience_bloom': True, 'max_out_tokens': 1024, 'min_out_tokens': 1, 'scale_attn_by_inverse_layer_idx': False, 'enable_qkv_quantization': False, 'use_mup': False, 'return_single_tuple': False, 'set_empty_params': False, 'transposed_mode': False}
Loading extension module transformer_inference...
Time to load transformer_inference op: 0.06054544448852539 seconds
Loading 2 checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████| 2/2 [00:21<00:00, 11.46s/it]check
*** Starting to generate 100 tokens with bs=8
Generate args {'max_new_tokens': 100, 'do_sample': False}
*** Running generate warmup
Loading 2 checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████| 2/2 [00:26<00:00, 14.34s/it]checkpoint loading time at rank 2: 26.608989238739014 sec
Loading 2 checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████| 2/2 [00:26<00:00, 13.30s/it]
------------------------------------------------------
Free memory : 42.997620 (GigaBytes)
Total memory: 47.544250 (GigaBytes)
Requested memory: 0.687500 (GigaBytes)
Setting maximum total tokens (input + output) to 1024 
WorkSpace: 0x7f3ab4000000 
------------------------------------------------------
*** Running generate
------------------------------------------------------------
in=DeepSpeed is a machine learning framework
out=DeepSpeed is a machine learning framework for deep learning models. It is designed to be easy to use and flexible. DeepSpeed is a Python library that provides a high-level API for training and inference on deep learning models. DeepSpeed is a Python library that provides a high-level API for training and inference on deep learning models. DeepSpeed is a Python library that provides a high-level API for training and inference on deep learning models. DeepSpeed is a Python library that provides a high-level API for training and inference on deep learning models. DeepSpeed is a
------------------------------------------------------------
...
[2023-05-17 0814,933] [INFO] [utils.pysee_memory_usage] end-of-run
[2023-05-17 0814,934] [INFO] [utils.pysee_memory_usage] MA 3.33 GB     Max_MA 3.44 GB     CA 3.33 GB     Max_CA 4 GB 
[2023-05-17 0814,934] [INFO] [utils.pysee_memory_usage] CPU Virtual Memory: used = 30.16 GB, percent = 12.0%
*** Running benchmark

*** Performance stats:
Throughput per token including tokenize: 2.42 msecs
Start to ready to generate: 35.592 secs
Tokenize and generate 4000 (bs=8) tokens: 1.946 secs
Start to finish: 37.537 secs


原文標(biāo)題:DeepSpeed Inference 在 LLM 推理上的優(yōu)化探究

文章出處:【微信號(hào):GiantPandaCV,微信公眾號(hào):GiantPandaCV】歡迎添加關(guān)注!文章轉(zhuǎn)載請(qǐng)注明出處。
