a ½Àhã8@sêdZddlZddlZddlmZddlZddlmZddl m Z e e¡Z ddd œZd ddd ddddddddddddddddddd d!d"d#d$d%d&d'd(d)d*d+d,d-d.d/d0d1d2d3d4d5d6d7d8d9d:d;dd?d@dAœ7ZdBdC„ZGdDdE„dEeƒZdEgZdS)Fz)Tokenization classes for Salesforce CTRL.éN)ÚOptionalé)ÚPreTrainedTokenizer)Úloggingz vocab.jsonz merges.txt)Ú vocab_fileÚmerges_fileiµ’iûi·ŸiÐ÷i»öi#jiñviµ~i6²iÅÁivÌiòiØ.iïiè½i×šiÍ¨i§¯i%æi¦iøi3iR-iniS.iKiñiwÌiÁ´i[i*i¡“iœìiÚ/iè?iñíin1iipi€i„iòÉiÏ’i i)i-‘iœ(iºøi™KiîÕiŒiÇ¢i iÄhi–õ)7Z PregnancyZChristianityZExplainZFitnessZSavingZAskZAssZJokeZ QuestionsZThoughtsZRetailZFeminismZWritingZAtheismZNetflixZ ComputingZOpinionZAloneÚFunnyZGamingZHumanZIndiaZJokerZDietZLegalZNormanZTipZWeightZMoviesZRunningZScienceZHorrorZ ConfessionZFinanceZPoliticsZScaryZSupportZTechnologiesZTeenageÚEventZLearnedZNotionZ WikipediaZBooksZExtractZConfessionsZ ConspiracyZLinksZ NarcissusZRelationshipZ RelationshipsZReviewsZNewsZTranslationZmultilingualcCs>tƒ}|d}|dd…D]}| ||f¡|}qt|ƒ}|S)z… Return set of symbol pairs in a word. Word is represented as tuple of symbols (symbols being variable-length strings). réN)ÚsetÚadd)ÚwordÚpairsZ prev_charÚchar©rúf/var/www/html/assistant/venv/lib/python3.9/site-packages/transformers/models/ctrl/tokenization_ctrl.pyÚ get_pairs^srcs‚eZdZdZeZeZd‡fdd„ Ze dd„ƒZ dd„Zd d „Zdd„Z d d„Zdd„Zdd„Zdeeeeedœdd„Z‡ZS)Ú CTRLTokenizera` Construct a CTRL tokenizer. Based on Byte-Pair-Encoding. This tokenizer inherits from [`PreTrainedTokenizer`] which contains most of the main methods. Users should refer to this superclass for more information regarding those methods. Args: vocab_file (`str`): Path to the vocabulary file. merges_file (`str`): Path to the merges file. unk_token (`str`, *optional*, defaults to `""`): The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead. úcsÖt|dd}t |¡|_Wdƒn1s.0Ydd„|j ¡Dƒ|_t|dd&}| ¡ d¡dd…}Wdƒn1s†0Ydd „|Dƒ}tt |t t|ƒƒƒƒ|_i|_ tƒjfd |i|¤ŽdS)Núutf-8©ÚencodingcSsi|]\}}||“qSrr)Ú.0ÚkÚvrrrÚ …óz*CTRLTokenizer.__init__..Ú r éÿÿÿÿcSsg|]}t| ¡ƒ‘qSr)ÚtupleÚsplit)rÚmergerrrÚ ˆrz*CTRLTokenizer.__init__..Ú unk_token)ÚopenÚjsonÚloadÚencoderÚitemsÚdecoderÚreadr ÚdictÚzipÚrangeÚlenÚ bpe_ranksÚcacheÚsuperÚ__init__)Úselfrrr#ÚkwargsZvocab_handleZ merges_handleZmerges©Ú __class__rrr2‚s*4zCTRLTokenizer.__init__cCs t|jƒS©N)r.r'©r3rrrÚ vocab_sizeszCTRLTokenizer.vocab_sizecCst|jfi|j¤ŽSr7)r+r'Zadded_tokens_encoderr8rrrÚ get_vocab‘szCTRLTokenizer.get_vocabc s|ˆjvrˆj|St|ƒ}tt|dd…ƒ|ddgƒ}t|ƒ}|sN|St|‡fdd„d}|ˆjvrpql|\}}g}d}|t|ƒkrBz| ||¡} Wn*tyÈ| ||d…¡YqBYn0| ||| …¡| }|||kr*|t|ƒdkr*||d|kr*| ||¡|d7}q€| ||¡|d7}q€t|ƒ}|}t|ƒdkrbqlqNt|ƒ}qNd |¡}|dd …}|ˆj|<|S)Nrzcsˆj |tdƒ¡S)NÚinf)r/ÚgetÚfloat)Úpairr8rrÚŸrz#CTRLTokenizer.bpe..©Úkeyrr éú@@ éüÿÿÿ)r0rÚlistrÚminr/r.ÚindexÚ ValueErrorÚextendÚappendÚjoin) r3Útokenr rZbigramÚfirstÚsecondZnew_wordÚiÚjrr8rÚbpe”sF " 2 zCTRLTokenizer.bpecCs8g}t d|¡}|D]}| t| |¡ d¡ƒ¡q|S)zTokenize a string.z\S+\n?ú )ÚreÚfindallrIrErQr )r3ÚtextZsplit_tokensÚwordsrLrrrÚ _tokenizeÀs zCTRLTokenizer._tokenizecCs|j ||j |j¡¡S)z0Converts a token (str) in an id using the vocab.)r'r<r#)r3rLrrrÚ_convert_token_to_idÊsz"CTRLTokenizer._convert_token_to_idcCs|j ||j¡S)z=Converts an index (integer) in a token (str) using the vocab.)r)r<r#)r3rGrrrÚ_convert_id_to_tokenÎsz"CTRLTokenizer._convert_id_to_tokencCsd |¡ dd¡ ¡}|S)z:Converts a sequence of tokens (string) in a single string.rRrCÚ)rKÚreplaceÚstrip)r3ÚtokensZ out_stringrrrÚconvert_tokens_to_stringÒsz&CTRLTokenizer.convert_tokens_to_stringN)Úsave_directoryÚfilename_prefixÚreturnc CsTtj |¡s"t d|›d¡dStj ||r6|dndtd¡}tj ||rX|dndtd¡}t|ddd .}| t j |jd ddd d¡Wdƒn1s¨0Yd}t|ddd v}| d¡t|j ¡dd„dD]D\}} || krt d|›d¡| }| d |¡d¡|d7}qæWdƒn1sB0Y||fS)NzVocabulary path (z) should be a directoryú-rZrrÚwrrrBTF)ÚindentÚ sort_keysÚensure_asciirrz#version: 0.2 cSs|dS)Nr r)Úkvrrrr?èrz/CTRLTokenizer.save_vocabulary..r@zSaving vocabulary to zZ: BPE merge indices are not consecutive. Please check that the tokenizer is not corrupted!rRr )ÚosÚpathÚisdirÚloggerÚerrorrKÚVOCAB_FILES_NAMESr$Úwriter%Údumpsr'Úsortedr/r(Úwarning) r3r_r`rZ merge_fileÚfrGÚwriterZ bpe_tokensZtoken_indexrrrÚsave_vocabulary×s.ÿÿ< ÿ*zCTRLTokenizer.save_vocabulary)r)N)Ú__name__Ú __module__Ú__qualname__Ú__doc__rmZvocab_files_namesÚ CONTROL_CODESÚ control_codesr2Úpropertyr9r:rQrWrXrYr^ÚstrrrrtÚ __classcell__rrr5rrns , r)rxr%rhÚtypingrÚregexrSZtokenization_utilsrÚutilsrZ get_loggerrurkrmryrrÚ__all__rrrrÚsŒ þÉ;