Wednesday, January 29, 2025

Tokenization

Tokenization dissects text into smaller units called tokens. A token is not always a word; it can be a character, a subword, a phrase, or a symbol. Tokenization lets a model grapple with the complexity of language (vocabulary, formatting, grammar, etc.) and can significantly reduce the compute and memory the model requires, since a subword vocabulary yields far shorter sequences than raw characters. The sketch below illustrates the difference in granularity.
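Here is a minimal sketch contrasting word-level and character-level tokenization of the same sentence. The split rules are purely illustrative, not any particular model's tokenizer:

```python
# Illustrative only: real tokenizers use learned subword vocabularies,
# not simple whitespace splitting.

text = "Tokenization breaks text into smaller units."

# Word-level: split on whitespace and strip trailing punctuation.
word_tokens = [w.strip(".,!?") for w in text.split()]
print(word_tokens)
# ['Tokenization', 'breaks', 'text', 'into', 'smaller', 'units']

# Character-level: every character becomes a token.
char_tokens = list(text)

# Far fewer word tokens than character tokens for the same text.
print(len(word_tokens), len(char_tokens))
```

The same text can therefore produce very different sequence lengths depending on the granularity chosen, which is the trade-off tokenization methods try to balance.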

Different models may use different tokenization methods: rule-based, statistical, or neural. The number of tokens determines how much computation the model needs to process the input, which is why API providers typically charge based on the number of input tokens. A quick way to inspect this count is shown below.
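As one concrete example, the `tiktoken` library implements the BPE encodings used by OpenAI models (an assumption here, since the post does not name a specific tokenizer) and lets you count tokens directly:

```python
# Counting tokens with tiktoken (pip install tiktoken).
import tiktoken

# cl100k_base is the encoding used by many GPT-4-era models.
enc = tiktoken.get_encoding("cl100k_base")

tokens = enc.encode("Tokenization breaks text into smaller units.")
print(tokens)       # list of integer token IDs
print(len(tokens))  # the count that drives compute cost and billing
```

Running the same text through different encodings generally yields different counts, so cost estimates should use the tokenizer of the specific model being called.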
