It Really Can Learn Mahjong: Claude 3 Far Ahead of GPT-4 on the LLM-RGB Evaluation - V2EX

  •   Bazingawang 2024-03-05 19:20:42 +08:00 2399 views

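    Yesterday, Anthropic released its latest Claude 3 models, drawing wide attention. In LLM-RGB, the open-source evaluation project from Babel.cloud, Claude 3 scored 97.6 in a single test run, well ahead of GPT-4 Turbo, making it the current leader in large-model capability.

    Answer details: https://llm-rgb.babel.run/view/testId/a581e4a9-ce1e-4b2f-8f45-980889913b58

    For reference: evaluation scores of the major models as of January 24 (image).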

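    Teaching Claude 3 to Play Mahjong

    Notably, 015_simple_mahjong is an extremely complex test case in the LLM-RGB evaluation. The prompt teaches the model a simplified set of Mahjong rules, gives worked examples, and then asks the model to choose a discard in a specific situation. In past test runs this case was rarely answered correctly. Claude 3 Opus, however, gave the best solution 20% of the time and the second-best solution 80% of the time. This suggests that its multi-step reasoning ability far exceeds that of other models: it can quickly learn knowledge from limited context and apply it. That would let Claude 3 be deployed well beyond simple customer service and text generation, in domains with longer engineering processes. The prompt is given in the appendix for anyone who wants to test it.

    In other respects: on speed, Claude 3 answered so quickly that it frequently triggered rate limits, which complicated the evaluation itself; the author had to test it alongside GPT-4 Turbo to lower the request frequency. As for score stability, Claude 3 was very stable across multiple runs; apart from 015_simple_mahjong, its answers were rarely unstable.

    Claude 3's better-than-expected success does not mean Anthropic has overtaken OpenAI across the board. Claude 3 is clearly stronger than GPT-4, but OpenAI may well already have GPT-5 in hand.

    Still, Claude 3's arrival shows that the large-model field is no longer dominated by a single player, and that there is no "core magic" only OpenAI can create; the lead comes more from engineering capability and resource investment. A diverse field of competing foundation models gives application developers more choices and will inevitably bring lower prices. From this perspective, the industry value and social impact of Claude 3's success can hardly be overestimated.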
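    For readers who run into the same rate limits when reproducing the test: the author's workaround was to interleave requests with GPT-4 Turbo. A different, commonly used option is to retry with exponential backoff. The sketch below is only an illustration; `RateLimitError` and `call_with_backoff` are hypothetical names, not part of LLM-RGB or of any vendor SDK, and `fn` stands for whatever function sends one request.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for the error a client raises on a rate limit."""


def call_with_backoff(fn, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on RateLimitError wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_retries + 1):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_retries:
                raise  # give up after the last allowed retry
            # Exponential delay plus random jitter so parallel workers do not retry in lockstep.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

    The `sleep` parameter exists only so the waiting can be replaced in tests; in normal use it can be left at its default.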

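    About LLM-RGB

    LLM-RGB is a collection of test cases designed specifically to evaluate the reasoning and generation abilities of LLMs in complex scenarios. Compared with chat or simple generation, these scenarios mainly test three aspects:

    1. Effective context length.
    2. Reasoning depth: generating the answer may require multi-step reasoning.
    3. Instruction compliance: the LLM must respond in a specific format rather than in natural language.

    Scores for other models are available on the LLM-RGB project site: https://llm-rgb.babel.run/ . The open-source repository is at https://github.com/babelcloud/LLM-RGB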
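    To make the third point concrete: the 015_simple_mahjong prompt in the appendix asks for labelled fields (Target Winning Pattern, Winning Tile(s), Action), so format compliance can be checked mechanically. The sketch below is an illustration only and is not how LLM-RGB itself scores answers; `parse_decision` is a hypothetical helper.

```python
import re
from typing import Optional

# Field labels taken from the DECISION format used in the appendix prompt.
FIELDS = ("Target Winning Pattern", "Winning Tile(s)", "Action")
_LABELS = "|".join(re.escape(name) for name in ("Thought",) + FIELDS)


def parse_decision(text: str) -> Optional[dict]:
    """Return the three labelled fields, or None if any is missing or empty."""
    result = {}
    for field in FIELDS:
        # Capture everything after "<field>:" up to the next label or the end of the text.
        pattern = rf"{re.escape(field)}\s*:\s*(.*?)\s*(?=(?:{_LABELS})\s*:|$)"
        match = re.search(pattern, text, re.S | re.I)
        if not match or not match.group(1):
            return None
        result[field] = match.group(1)
    return result


answer = "Thought: Target Winning Pattern: mixed-win Winning Tile(s): B8 Action: discard C1 "
print(parse_decision(answer))
print(parse_decision("I would probably throw away C1."))  # free-form text fails the check
```

    A free-form answer that names the right tile but ignores the format returns None here, which is the kind of failure the instruction-compliance dimension is meant to catch.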

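    About Babel

    Babel is a startup dedicated to building Agent Teams that construct complex software. The LLM-RGB project is the basis on which it selects its underlying LLM (see "LLM-RGB: Systematically Evaluating LLMs' Ability to Handle Complex Problems"). Before Claude 3 appeared, GPT-4 Turbo had long held the top spot on the leaderboard.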

    Appendix

    Here is the prompt for 015_simple_mahjong, for anyone who wants to test it:

    You are a Mahjong game AI. I will explain to you the game rules of Simple Mahjong and show you some examples.

    === Simple Mahjong Rules ===
    1. Simple Mahjong is a board game with four participants.
    2. Simple Mahjong has three types of tiles, named "Dots", "Bamboo", "Character". There is no relationship between different types of tiles.
    3. Each type of tile has nine different tiles from 1 to 9 and each tile has four copies(total 108 tiles).
       - Bamboos: B1 B2 ... B9, each with four identical tiles
       - Characters: C1 C2 ... C9, each with four identical tiles
       - Dots: D1 D2 ... D9, each with four identical tiles
    4. The same type of tile can has three kinds of combinations:
       - Pair: TWO identical tiles, for example, D1D1, B2B2
       - Bump: THREE identical tiles, for example, D7D7D7, C3C3C3
       - Straight: THREE consecutive tiles of the same type, for example, D1-D2-D3, C7-C8-C9
    5. At the beginning of the game, each player has 13 random tiles in hand.
    6. The rest of the tiles face down on the table, which we call the tile wall.
    7. Players play the game clockwise.
    8. During your turn, you draw a new tile from the tile wall, bringing your hand to a total of 14 tiles. If these 14 tiles match a winning pattern, then you win. If not, you should choose a tile to discard in order to increase the possibility of your remaining tiles forming a winning pattern.
    9. Winning pattern:
       - Straights-win: the 14 tiles are in FOUR straights and ONE pair, for example, D1-D2-D3 C2-C3-C4 D5-D6-D7 D6-D7-D8 C9C9
       - Bumps-win: the 14 tiles are in FOUR bumps and ONE pair,for example, B1B1B1 B2B2B2 C1C1C1 C6C6C6 D9D9
       - Mixed-win: the 14 tiles are mixed with bumps, straights and ONE pair, for example, B1B1B1 C1C2C3 C6-C7-C8 D4-D5-D6 D7D7
    === End Rules ===

    === Examples ===
    GAME INFO:
    Tiles Discarded in Previous Rounds: C1 D4 D3 B3
    Observe: Drew B6
    Current Tiles: C2C2C2 C3C3 D9D9D9 B2 B7 B8 D6 D8 B6(just drew)
    DECISION:
    Thought: The tiles could potentially form a Mixed-win. I should keep the 'B6' as it could form a straight B6 B7 B8. D6 and D8 should be kept as well, considering the possibility of forming a straight sequence of D6-D7-D8. However, 'B2' is isolated and should be discarded.
    Target Winning Pattern: Mixed-win
    Winning Tile(s): D7
    Action: Discard B2
    ---
    GAME INFO:
    Tiles Discarded in Previous Rounds: B1 B1 B2 B2 B5 B6 B9 C2 C6 D5 D5 D8 D8 D9
    Observe: Drew C3
    Current Tiles: B1-B2-B3 B6B6B6 B7B7B7 C1C1 C2 C3C3(just drew)
    DECISION:
    Thought: My current hand of tiles is nearing a Mixed-win, with two bumps and one straight. If I discard C2, I'll have two pairs, C1C1 and C3C3. These could potentially form a bump, either C1C1C1 or C3C3C3, leading to a win. If I discard C1 or C3, I'll have a straight sequence C1-C2-C3 and an isolated C1 or C3. I would then need to wait for another C1 or C3 to form a pair for a win. Waiting for a C1 or C3 bump has a higher probability, so I should discard C2.
    Target Winning Pattern: Mixed-win
    Winning Tile(s): C1 C3
    Action: Discard C2
    ---
    GAME INFO:
    Tiles Discarded in Previous Rounds: B1 B1 B1 B2 B3 B5 B9 C2 C3 C3 D1 D1 D1 D1 D4 D4 D4
    Observe: Drew D2
    Current Tiles: B3B3B3 B6B6B6 B8B8B8 C3C3 D2 D3 D2(just drew)
    DECISION:
    Thought: My hand of tiles is nearing a Bumps-win. If I discard D3, I'll have two choices, C3 or D2, to form a winning pattern. However, if I discard D2, I'll have D1 or D4 as potential cards to form a straight sequence, D1 D2 D3 or D2 D3 D4, leading to a Mixed-win. Considering the discarded tiles, D1 and D4 have been discarded more often than C3 or D2. This reduces the likelihood of drawing D1 or D4 from the tile wall. Therefore, I should aim for a Bumps-win pattern and discard D3.
    Target Winning Pattern: Bumps-win
    Winning Tile(s): D2
    Action: Discard D3
    ---
    GAME INFO:
    Tiles Discarded in Previous Rounds: B1B1 B2 C7 C8 D5
    Observe: Drew D5
    Current Tiles: B3-B4-B5 B4-B5-B6 C7-C8-C9 B9B9 D2 D3 D5(just drew)
    DECISION:
    Thought: The tiles are close to a Straights-win pattern. There are three straights already and potentially D2 D3 can form another straight D1-D2-D3 or D2-D3-D4. Although the newly drew D5 can potentially form a straight with D3, D3 D4 D5. But waiting for D4 has lower chance than waiting for D1 or D4. Thus I should keep current tiles and discard the newly drew D5.
    Target Winning Pattern: Straights-win
    Winning Tile(s): D1 D4
    Action: Discard D5
    ---
    GAME INFO:
    Tiles Discarded in Previous Rounds: B6 B7 B8 C7 C9 D2 D2 D5 D5 D5 D8
    Observe: Drew D4
    Current Tiles: B3B3B3 B9B9B9 C7C7C7 D4D4 D5 D6 D4(just drew)
    DECISION:
    Thought:The tiles are Mixed-Win pattern.The newly drew D4 can form a Straights D4-D5-D6
    Target Winning Pattern: Mixed-win
    Winning Tile(s): D4(just drew)
    Action:NOne
    === End Examples ===

    GAME INFO:
    Tiles Discarded in Previous Rounds: B1 B3 C1 C1 D8 D9
    Observe: Drew B8
    Current Tiles: C5C5C5 C8C8C8 C7-C8-C9 D1-D2-D3 C1 B8(just drew)
    DECISION:

    Best solution

    Thought:
    Target Winning Pattern: mixed-win
    Winning Tile(s): B8
    Action: discard C1

    Second-best solution

    Thought:
    Target Winning Pattern: mixed-win
    Winning Tile(s): C1
    Action: discard B8
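    The ranking of the two solutions can be checked by brute force. The sketch below is not part of LLM-RGB; it assumes, following rule 9 of the prompt, that a winning hand is four sets (bumps or straights) plus one pair, and it counts a tile as still available if fewer than four copies are visible in the hand and the discard pile. The names `parse`, `is_win` and `waits_after_discard` are hypothetical.

```python
import re
from collections import Counter

ALL_TILES = [f"{suit}{rank}" for suit in "BCD" for rank in range(1, 10)]


def parse(tiles: str) -> list:
    """Split text such as 'C5C5C5 C7-C8-C9 B8(just drew)' into ['C5', 'C5', ...]."""
    return re.findall(r"[BCD][1-9]", tiles)


def can_form_sets(counts: Counter) -> bool:
    """True if all remaining tiles split into bumps and straights."""
    tile = min((t for t, n in counts.items() if n > 0), default=None)
    if tile is None:
        return True
    suit, rank = tile[0], int(tile[1])
    # The smallest remaining tile must be part of a bump or start a straight.
    if counts[tile] >= 3:
        counts[tile] -= 3
        ok = can_form_sets(counts)
        counts[tile] += 3
        if ok:
            return True
    second, third = f"{suit}{rank + 1}", f"{suit}{rank + 2}"
    if rank <= 7 and counts[second] > 0 and counts[third] > 0:
        for t in (tile, second, third):
            counts[t] -= 1
        ok = can_form_sets(counts)
        for t in (tile, second, third):
            counts[t] += 1
        if ok:
            return True
    return False


def is_win(hand: list) -> bool:
    """True if the 14 tiles form four sets plus one pair."""
    if len(hand) != 14:
        return False
    counts = Counter(hand)
    for tile, n in list(counts.items()):
        if n >= 2:
            counts[tile] -= 2
            ok = can_form_sets(counts)
            counts[tile] += 2
            if ok:
                return True
    return False


def waits_after_discard(hand14: list, discard: str, discarded: list) -> dict:
    """Map each winning tile to the number of copies not yet visible."""
    hand13 = list(hand14)
    hand13.remove(discard)
    seen = Counter(hand13) + Counter(discarded) + Counter([discard])
    return {t: 4 - seen[t] for t in ALL_TILES
            if seen[t] < 4 and is_win(hand13 + [t])}


hand = parse("C5C5C5 C8C8C8 C7-C8-C9 D1-D2-D3 C1 B8(just drew)")
pile = parse("B1 B3 C1 C1 D8 D9")
print(waits_after_discard(hand, "C1", pile))  # waits if C1 is discarded
print(waits_after_discard(hand, "B8", pile))  # waits if B8 is discarded
```

    Under these assumptions, discarding C1 leaves a wait on B8 with three copies unseen, while discarding B8 leaves a wait on C1 with only one copy unseen (two C1 tiles are already in the discard pile). That matches the ordering of the best and second-best solutions above.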
    4 replies    2024-03-06 13:51:54 +08:00

    1  luckybearops  2024-03-06 01:08:41 +08:00 via iPhone

    2  uses090  2024-03-06 06:22:31 +08:00 via iPhone
    Fair enough, but why compare against GPT-4 Turbo rather than GPT-4?

    3  zhaoyeye  2024-03-06 10:01:29 +08:00 via Android
    Account bans. I got banned without even saying anything; no idea what their company is thinking.

    4  Bazingawang (OP)  2024-03-06 13:51:54 +08:00
    @uses090 Because GPT-4 Turbo is stronger.