Tokenizing

Definition

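tokenizing (gerund / present participle): in computing and language processing, the process of splitting a piece of text into individual tokens according to a set of rules, for example by word, subword, symbol, or character. It is commonly used in natural language processing (NLP), search engines, and text analysis. (Also spelled tokenising in British English.)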
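
To make the definition concrete, here is a minimal sketch of rule-based tokenizing in Python using only the standard `re` module. The function names `tokenize_words` and `tokenize_chars` are hypothetical, chosen for illustration. The rules shown (runs of word characters, single punctuation marks, single characters) are one simple assumption; real tokenizers, especially subword tokenizers, use more elaborate rules.

```python
import re


def tokenize_words(text):
    """Split text into word tokens and single punctuation/symbol tokens."""
    # \w+ matches a run of letters, digits, or underscores.
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", text)


def tokenize_chars(text):
    """Split text into individual characters, skipping whitespace."""
    return [ch for ch in text if not ch.isspace()]


sentence = "Tokenizing turns a sentence into words."
print(tokenize_words(sentence))
# ['Tokenizing', 'turns', 'a', 'sentence', 'into', 'words', '.']
print(tokenize_chars("a b"))
# ['a', 'b']
```

The same input yields different token lists depending on the rule chosen, which is why the definition speaks of splitting "according to a set of rules" rather than one fixed procedure.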

Pronunciation

/ˈtoʊkənaɪzɪŋ/

Examples

Tokenizing turns a sentence into words.

Before training the model, we spent days tokenizing millions of customer reviews and handling punctuation, emojis, and mixed languages.

Etymology

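tokenizing comes from token ("mark, voucher, symbol") + -ize (a verb suffix meaning "to make into" or "to cause to become") + -ing (marking a process or ongoing action). The word token traces back to Old English tācn, meaning "sign, omen." In modern computing, "token" was borrowed to mean "the smallest recognizable unit," so to tokenize is "to turn content into token units that can be processed."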

Related Words

Literary Works

  • Speech and Language Processing (Daniel Jurafsky & James H. Martin): discusses tokenizing/tokenization in its chapters on text preprocessing and language models.
  • Natural Language Processing with Python (Steven Bird, Ewan Klein, Edward Loper): uses the term frequently when tokenizing and processing text with NLTK.
  • Introduction to Information Retrieval (Christopher D. Manning, Prabhakar Raghavan, Hinrich Schütze): covers tokenizing/tokenization as part of the text-processing pipeline of retrieval systems.