
Technical discussion of GPT-4o image generation (are autoregressive models making a comeback?)

    lthero · 185 days ago · 2175 views
    This topic was created 185 days ago; the information in it may have changed since then.

    A notable characteristic of GPT-4o image generation is that the image sharpens progressively from top to bottom during generation (and this is not merely a display trick).

    [image: GPT-4o generation, progressively sharpening from top to bottom]

    If diffusion were used for generation, the process would instead look something like this:

    [image: diffusion generation process]

    What we do know, however, is that GPT-4o image generation has (apparently) shifted to an autoregressive model + transformer.

    [image: autoregressive + transformer generation]

    People abroad have also been speculating about the technique behind GPT-4o, but no conclusion has been reached (most agree it has moved to an AR model).

    Are autoregressive models about to beat diffusion and become useful again in the multimodal domain?

    Also, the open-source world seems quiet on this so far; in China, ByteDance has explored AR image generation quite extensively (and has published a fair number of papers).
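
    To make the contrast above concrete, here is a minimal toy sketch (NumPy only; the grid size, the random "token sampling", and the linear denoising schedule are all made up for illustration) of why a raster-order autoregressive decoder reveals an image from the top down, while a diffusion sampler always holds a full but noisy canvas. This is only an illustration of the two sampling loops, not GPT-4o's actual implementation.

    ```python
    # Toy sketch (not GPT-4o's pipeline): autoregressive raster-order decoding
    # vs. whole-canvas diffusion denoising. All values are made up.
    import numpy as np

    rng = np.random.default_rng(0)
    H, W = 8, 8  # tiny "image" of discrete tokens

    def autoregressive_generate(steps_to_show=(16, 32, 48, 64)):
        """Emit tokens one at a time in raster order (row by row, left to right).
        At any intermediate step only the top portion of the image exists yet,
        which matches the top-down progressive reveal described above."""
        canvas = np.full((H, W), -1)             # -1 = not generated yet
        snapshots = []
        for t in range(H * W):
            r, c = divmod(t, W)                  # raster-scan position
            canvas[r, c] = rng.integers(0, 256)  # stand-in for "sample next token"
            if t + 1 in steps_to_show:
                snapshots.append(canvas.copy())
        return snapshots

    def diffusion_generate(num_steps=4):
        """Start from pure noise over the WHOLE image and denoise globally.
        Every intermediate step is already a full (but noisy) image."""
        target = rng.integers(0, 256, size=(H, W)).astype(float)
        x = rng.normal(128, 64, size=(H, W))     # initial noise
        snapshots = []
        for s in range(1, num_steps + 1):
            alpha = s / num_steps                # crude linear "denoising" schedule
            x = (1 - alpha) * x + alpha * target
            snapshots.append(x.copy())
        return snapshots

    if __name__ == "__main__":
        ar = autoregressive_generate()
        print("AR: fraction of pixels filled at each snapshot:",
              [float((s >= 0).mean()) for s in ar])    # 0.25, 0.5, 0.75, 1.0
        dd = diffusion_generate()
        print("Diffusion: every snapshot covers the full image:",
              [s.shape for s in dd])
    ```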

    6 replies · last reply 2025-04-10 21:25:07 +08:00
    #1 · mxT52CRuqR6o5 · 185 days ago
    I've seen analyses saying it's purely a front-end visual effect.
    #2 · zzq825924 · 185 days ago
    @mxT52CRuqR6o5 It's not a front-end effect: while the page refreshes, the image URL keeps changing.
    #3 · lthero (OP) · 185 days ago
    @mxT52CRuqR6o5 #1 There is a front-end effect, but the image itself also changes (probably four images are sent in total).
    #4 · kneo · 185 days ago
    I don't quite get this approach. Common sense says the top and bottom of an image have no logical dependency on each other, so shouldn't generating top-to-bottom versus bottom-to-top just be a matter of a single parameter?
    #5 · halberd · 185 days ago
    There is diffusion involved; it's a hybrid architecture. "ClosedAI" no longer even publishes technical reports, but their last shred of conscience shows in this sample generation image:

    https://imgur.com/a/YGzxVIp

    The prompt given on the official site:
    ```
    A wide image taken with a phone of a glass whiteboard, in a room overlooking the Bay Bridge. The field of view shows a woman writing, sporting a tshirt wiith a large OpenAI logo. The handwriting looks natural and a bit messy, and we see the photographer's reflection.

    The text reads:

    (left)
    "Transfer between Modalities:

    Suppose we directly model
    p(text, pixels, sound) [equation]
    with one big autoregressive transformer.

    Pros:
    * image generation augmented with vast world knowledge
    * next-level text rendering
    * native in-context learning
    * unified post-training stack

    Cons:
    * varying bit-rate across modalities
    * compute not adaptive"

    (Right)
    "Fixes:
    * model compressed representations
    * compose autoregressive prior with a powerful decoder"

    On the bottom right of the board, she draws a diagram:
    "tokens -> [transformer] -> [diffusion] -> pixels"
    ```

    A paper has already offered a detailed guess:
    https://arxiv.org/abs/2504.02782
    #6 · lthero (OP) · 185 days ago
    @halberd #5 Thanks, I also came across that paper this afternoon and just finished reading it; "tokens -> [transformer] -> [diffusion] -> pixels" really is an Easter egg left by the developers.
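
    To make that whiteboard diagram concrete, here is a minimal, hedged sketch of the "tokens -> [transformer] -> [diffusion] -> pixels" data flow: an autoregressive prior over compressed latent tokens composed with a diffusion decoder, as the "Fixes" column suggests. Every name, shape, and the trivial stand-in "denoiser" below are assumptions made up for illustration; OpenAI has not published an implementation.

    ```python
    # Hedged sketch of the whiteboard pipeline:
    #   tokens -> [transformer] -> [diffusion] -> pixels
    # i.e. an AR prior over compressed latent tokens plus a diffusion decoder.
    # All shapes and the two-stage split are illustrative assumptions.
    import numpy as np

    rng = np.random.default_rng(0)
    VOCAB, NUM_LATENTS = 4096, 256

    def ar_prior_sample(prompt_tokens, num_latents=NUM_LATENTS):
        """Stage 1: an autoregressive transformer would predict the next latent
        image token given the text prompt and the latents emitted so far.
        Here the 'transformer' is replaced by a random stand-in."""
        tokens = list(prompt_tokens)
        for _ in range(num_latents):
            next_tok = int(rng.integers(0, VOCAB))  # stand-in for model sampling
            tokens.append(next_tok)
        return np.array(tokens[-num_latents:])      # keep only the image latents

    def diffusion_decode(latent_tokens, image_hw=(64, 64), steps=8):
        """Stage 2: a diffusion decoder conditioned on the latent tokens turns
        the compressed representation into pixels. The 'denoiser' here just
        nudges noise toward a deterministic function of the latents."""
        cond = latent_tokens.astype(float).mean()   # toy conditioning signal
        x = rng.normal(0.0, 1.0, size=image_hw)     # start from pure noise
        target = np.full(image_hw, cond / VOCAB)    # toy "clean" image
        for s in range(1, steps + 1):
            x = x + (target - x) * (s / steps)      # crude denoising step
        return np.clip(x, 0.0, 1.0)

    if __name__ == "__main__":
        prompt = [17, 912, 3]                 # pretend these are text tokens
        latents = ar_prior_sample(prompt)     # tokens -> [transformer]
        pixels = diffusion_decode(latents)    # -> [diffusion] -> pixels
        print(latents.shape, pixels.shape, pixels.min(), pixels.max())
    ```

    The point of the split is exactly the whiteboard's "Fixes": the transformer models compressed representations instead of raw pixels, and a separate, powerful decoder spends the extra compute needed to reach pixel space.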