Google Cloud Vertex AI streaming output is very slow: with the gemini-3-pro-preview model, the first streamed token takes over 17 s. Any good solutions? - V2EX
xuliang12187 · 2 days ago · 816 clicks

Problem description: When making streaming prediction calls from infrastructure located in Silicon Valley to the Vertex AI API (aiplatform.googleapis.com) with the model gemini-3-pro-preview, we observe abnormally high latency for the first token in the response stream. Time to first token (TTFT) is consistently above 17 s, whereas it should normally be under 2 s.

    server address: 142.250.191.42

1. Basic ping tests (connectivity & baseline latency). Run these commands from the affected server/client in Silicon Valley:

```
[root@usa-gg-test01 ~]# ping aiplatform.googleapis.com
PING aiplatform.googleapis.com (142.250.191.42) 56(84) bytes of data.
64 bytes from nuq04s42-in-f10.1e100.net (142.250.191.42): icmp_seq=1 ttl=118 time=2.67 ms
64 bytes from nuq04s42-in-f10.1e100.net (142.250.191.42): icmp_seq=2 ttl=118 time=2.62 ms
64 bytes from nuq04s42-in-f10.1e100.net (142.250.191.42): icmp_seq=3 ttl=118 time=2.64 ms
```

2. Python code test using the model gemini-3-pro-preview:

```python
import requests
import json
import time

def stream_gemini_content():
    api_key = 'xxx'
    url = "https://aiplatform.googleapis.com/v1/publishers/google/models/gemini-3-pro-preview:streamGenerateContent?alt=sse"

    headers = {
        "x-goog-api-key": api_key,
        "Content-Type": "application/json"
    }
    data = {
        "contents": [{
            "role": "user",
            # "Tell a 200-character story; don't reason, answer directly."
            "parts": [{"text": "请讲一个 200 字的故事,不要用推理,直接回答。"}]
        }],
        "generationConfig": {
            "thinkingConfig": {"includeThoughts": False}
        }
    }
    print(f"begin request: {url} ...")
    start_time = time.time()
    first_token_time = None
    try:
        with requests.post(url, headers=headers, json=data, stream=True) as response:
            if response.status_code != 200:
                print(f"status: {response.status_code}")
                print(response.text)
                return
            print("-" * 50)
            for line in response.iter_lines():
                if not line:
                    continue
                decoded_line = line.decode('utf-8').strip()
                if not decoded_line.startswith("data: "):
                    continue
                json_str = decoded_line[6:]
                if json_str == "[DONE]":
                    break
                try:
                    now = time.time()
                    if first_token_time is None:
                        first_token_time = now
                        print(f"\n[total] first token TTFT: {(now - start_time) * 1000:.2f} ms")
                        print("-" * 50)
                    chunk_data = json.loads(json_str)
                    candidates = chunk_data.get("candidates", [])
                    if candidates:
                        content = candidates[0].get("content", {})
                        parts = content.get("parts", [])
                        if parts:
                            text_chunk = parts[0].get("text", "")
                            print(text_chunk, end="", flush=True)
                except Exception:
                    pass
    except Exception as e:
        print(f"request failed: {e}")
    end_time = time.time()
    print("\n\n" + "-" * 50)
    print(f"total time: {(end_time - start_time) * 1000:.2f} ms")

if __name__ == "__main__":
    stream_gemini_content()
```

The code test is very slow: a 200-character story already takes over 17 s.
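For reference, the SSE parsing in the script above can be isolated and sanity-checked offline. The helper below is a hypothetical extraction (not part of any Vertex SDK); it consumes pre-recorded SSE lines so the parsing logic can be verified without hitting the API:

```python
import json

def parse_sse_stream(lines):
    """Parse SSE 'data:' lines (as produced by ?alt=sse) and return
    (index_of_first_text_chunk, concatenated_text).
    Timing is left to the caller."""
    first_index = None
    text_parts = []
    for i, raw in enumerate(lines):
        line = raw.strip()
        if not line.startswith("data: "):
            continue  # skip keep-alives / SSE comments
        payload = line[6:]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        candidates = chunk.get("candidates", [])
        if not candidates:
            continue
        parts = candidates[0].get("content", {}).get("parts", [])
        if parts and "text" in parts[0]:
            if first_index is None:
                first_index = i
            text_parts.append(parts[0]["text"])
    return first_index, "".join(text_parts)

# Simulated stream: one keep-alive comment, two content chunks, then [DONE].
fake = [
    ": keep-alive",
    'data: {"candidates": [{"content": {"parts": [{"text": "Hello, "}]}}]}',
    'data: {"candidates": [{"content": {"parts": [{"text": "world."}]}}]}',
    "data: [DONE]",
]
print(parse_sse_stream(fake))  # → (1, 'Hello, world.')
```

Wrapping `time.time()` around the first tuple element's arrival is exactly what the script's TTFT measurement does, so any latency observed is attributable to the server, not the parsing.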

8 replies · 2025-12-12 17:10:09 +08:00
heqing · #1 · 2 days ago
Is there a significant difference in first-token latency between the first and second calls? Does the same thing happen with other models?
xuliang12187 (OP) · #2 · 2 days ago
With gemini-2.0-flash the first token arrives in about 300 ms and a 200-character story finishes in 3-4 s. With gemini-2.5-flash the first token takes over 3 s and the total is over 5 s. With gemini-3-pro-preview the first token takes over 12 s. We are on Google Cloud's enterprise Vertex AI API.
chenluo0429 · #3 · 2 days ago · via Android
You never told it not to think. Gemini 3's default thinking level is high, so it has to think before it answers you.
xuliang12187 (OP) · #4 · 2 days ago
@chenluo0429 I tried adjusting that; same result, still over 17 s.
fov6363 · #5 · 2 days ago
+1. Without thinking the model is too weak; with thinking it takes 10 s+, even for simple QA. I asked ChatGPT and it said Vertex has a dedicated-endpoint instance concept you need to enable, otherwise there's a cold start: the first chunk itself is only a few hundred ms, but waiting for the first response still takes 10 s+.
xuliang12187 (OP) · #6 · 2 days ago
@fov6363 Vertex currently has no dedicated-endpoint instance concept; there's only the global option. They say there are different paid tiers, but those target high business concurrency and don't solve the API latency problem.
GXD · #7 · 2 days ago
For Gemini 3 you need the `thinking_level` parameter to set the reasoning depth; the default is high.
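For a concrete picture of that suggestion, the snippet below builds a request body with the thinking level lowered. The field names (`generationConfig.thinkingConfig.thinkingLevel` in camelCase REST JSON, with values like "low"/"high") are an assumption based on the current Gemini API docs; verify them against the Vertex AI reference before relying on this:

```python
import json

def build_request(prompt, thinking_level="low"):
    # Assumed REST schema: generationConfig.thinkingConfig.thinkingLevel.
    # Confirm field names/values against the current Vertex AI / Gemini
    # API reference -- they are not taken from the thread itself.
    return {
        "contents": [{"role": "user", "parts": [{"text": prompt}]}],
        "generationConfig": {
            "thinkingConfig": {
                "thinkingLevel": thinking_level,  # "low" should cut TTFT vs the default "high"
                "includeThoughts": False,
            }
        },
    }

body = build_request("Tell a 200-character story.", thinking_level="low")
print(json.dumps(body, indent=2))
```

This body drops into the OP's script in place of its `data` dict; everything else (URL, headers, streaming loop) stays the same.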
fov6363 · #8 · 1 day ago
@xuliang12187 #6 Thanks, I haven't tried that approach either.

Have you tried skipping Vertex and using the Gemini API directly? I tried it and it didn't feel much faster; with thinking it still seems quite slow.
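For anyone repeating that comparison: the only change needed in the OP's script is the endpoint, since the direct Gemini API takes the same request body and the same `x-goog-api-key` header. A sketch of the two URLs (the `v1beta` path for the direct API matches the public Gemini API docs at the time of writing; double-check before use):

```python
MODEL = "gemini-3-pro-preview"

# Vertex AI endpoint (what the OP's script uses).
vertex_url = (
    "https://aiplatform.googleapis.com/v1/publishers/google/models/"
    f"{MODEL}:streamGenerateContent?alt=sse"
)

# Direct Gemini API endpoint -- only the host and path prefix differ.
gemini_url = (
    "https://generativelanguage.googleapis.com/v1beta/models/"
    f"{MODEL}:streamGenerateContent?alt=sse"
)

print(vertex_url)
print(gemini_url)
```

Swapping `url` in the script for `gemini_url` lets you measure TTFT against both backends with the same code path.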