a ½Àh ã@s¼UdZddlmZddlmZmZddlmZe e ¡Z dZdZdZ dZd Zd ZdZe ded edededediZeeefed<dd„e ¡DƒZeeefed<Gdd„deƒZdgZdS)z Tokenization classes for CANINE.é)ÚOptionalé)Ú AddedTokenÚPreTrainedTokenizer)Úloggingiiàiàiàiàiàz[CLS]z[SEP]z[BOS]z[MASK]z[PAD]z [RESERVED]ÚSPECIAL_CODEPOINTScCsi|]\}}||“qS©r)Ú.0Ú codepointÚnamerrúj/var/www/html/assistant/venv/lib/python3.9/site-packages/transformers/models/canine/tokenization_canine.pyÚ 7ór ÚSPECIAL_CODEPOINTS_BY_NAMEcseZdZdZeeƒeeƒeeƒeeƒeeƒeeƒddf‡fdd„ Z e edœdd„ƒZd d „Z eeedœdd „Zeedœdd„Zeedœdd„Zdd„Zd eeeeeeedœdd„Zd!eeeeeeeedœ‡fdd„ Zd"eeedœdd„Z‡ZS)#ÚCanineTokenizeraé Construct a CANINE tokenizer (i.e. a character splitter). It turns text into a sequence of characters, and then converts each character into its Unicode code point. [`CanineTokenizer`] inherits from [`PreTrainedTokenizer`]. Refer to superclass [`PreTrainedTokenizer`] for usage examples and documentation concerning parameters. Args: model_max_length (`int`, *optional*, defaults to 2048): The maximum sentence length the model accepts. Fic st|tƒrt|dddn|}t|tƒr4t|dddn|}t|tƒrPt|dddn|}t|tƒrlt|dddn|}t|tƒrˆt|dddn|}t|tƒr¤t|dddn|}i|_t ¡D]\} }| |j|<q¶dd„|j ¡Dƒ|_t|_t |jƒ|_ tƒjf||||||||dœ| ¤ŽdS)NF)ÚlstripÚrstripTcSsi|]\}}||“qSrr)r rr rrrr csz,CanineTokenizer.__init__..)Ú bos_tokenÚ eos_tokenÚ sep_tokenÚ cls_tokenÚ pad_tokenÚ mask_tokenÚadd_prefix_spaceÚmodel_max_length) Ú isinstanceÚstrrZ_special_codepointsrÚitemsZ_special_codepoint_stringsÚUNICODE_VOCAB_SIZEÚ_unicode_vocab_sizeÚlenZ_num_special_tokensÚsuperÚ__init__)ÚselfrrrrrrrrÚkwargsr r©Ú __class__rrr"Hs4ÿø ÷zCanineTokenizer.__init__)ÚreturncCs|jS)N)r)r#rrrÚ vocab_sizevszCanineTokenizer.vocab_sizecCs$dd„t|jƒDƒ}| |j¡|S)NcSsi|]}t|ƒ|“qSr)Úchr)r Úirrrr {rz-CanineTokenizer.get_vocab..)Úranger(ÚupdateZadded_tokens_encoder)r#ZvocabrrrÚ get_vocabzszCanineTokenizer.get_vocab)Útextr'cCst|ƒS)z5Tokenize a string (i.e. perform character splitting).)Úlist)r#r.rrrÚ _tokenizeszCanineTokenizer._tokenize)Útokenr'cCs2z t|ƒWSty,td|›dƒ‚Yn0dS)zaConverts a token (i.e. a Unicode character) in an id (i.e. its integer Unicode code point value).zinvalid token: 'ú'N)ÚordÚ TypeErrorÚ ValueError)r#r1rrrÚ_convert_token_to_idƒs z$CanineTokenizer._convert_token_to_id)Úindexr'cCsBz|tvrt|WSt|ƒWSty<td|›ƒ‚Yn0dS)z˜ Converts a Unicode code point (integer) in a token (str). In case it's a special code point, convert to human-readable format. zinvalid id: N)rr)r4r5)r#r7rrrÚ_convert_id_to_tokenŠs z$CanineTokenizer._convert_id_to_tokencCs d |¡S)NÚ)Újoin)r#ÚtokensrrrÚconvert_tokens_to_string–sz(CanineTokenizer.convert_tokens_to_stringN)Útoken_ids_0Útoken_ids_1r'cCs4|jg}|jg}|||}|dur0|||7}|S)a˜ Build model inputs from a sequence or a pair of sequence for sequence classification tasks by concatenating and adding special tokens. A CANINE sequence has the following format: - single sequence: `[CLS] X [SEP]` - pair of sequences: `[CLS] A [SEP] B [SEP]` Args: token_ids_0 (`List[int]`): List of IDs to which the special tokens will be added. token_ids_1 (`List[int]`, *optional*): Optional second list of IDs for sequence pairs. Returns: `List[int]`: List of [input IDs](../glossary#input-ids) with the appropriate special tokens. N)Zsep_token_idZcls_token_id)r#r=r>ÚsepÚclsÚresultrrrÚ build_inputs_with_special_tokens™sz0CanineTokenizer.build_inputs_with_special_tokens)r=r>Úalready_has_special_tokensr'csT|rtƒj||ddSdgdgt|ƒdg}|durP|dgt|ƒdg7}|S)aÄ Retrieve sequence ids from a token list that has no special tokens added. This method is called when adding special tokens using the tokenizer `prepare_for_model` method. Args: token_ids_0 (`List[int]`): List of IDs. token_ids_1 (`List[int]`, *optional*): Optional second list of IDs for sequence pairs. already_has_special_tokens (`bool`, *optional*, defaults to `False`): Whether or not the token list is already formatted with special tokens for the model. Returns: `List[int]`: A list of integers in the range [0, 1]: 1 for a special token, 0 for a sequence token. T)r=r>rCérN)r!Úget_special_tokens_maskr )r#r=r>rCrAr%rrrE´sÿz'CanineTokenizer.get_special_tokens_mask)Úsave_directoryÚfilename_prefixcCsdS)Nrr)r#rFrGrrrÚsave_vocabularyÑszCanineTokenizer.save_vocabulary)N)NF)N)Ú__name__Ú __module__Ú__qualname__Ú__doc__r)ÚCLSÚSEPÚPADÚMASKr"ÚpropertyÚintr(r-rr/r0r6r8r<rrBÚboolrErHÚ __classcell__rrr%rr:s8÷.ÿþÿþrN)rLÚtypingrZtokenization_utilsrrÚutilsrZ get_loggerrIÚloggerrrOrMrNZBOSrPZRESERVEDrÚdictrRrÚ__annotations__rrrÚ__all__rrrrÚs, ô"