Transforms

tokenizers

tokenizers init

class mindnlp.transforms.tokenizers.BartTokenizer(tokenizer_file=None, unk_token='<unk>', bos_token='<s>', eos_token='</s>', add_prefix_space=False, **kwargs)[source]

Bases: PreTrainedTokenizer

Tokenizer used for Bart text processing.

Parameters:

vocab (Vocab) – Vocabulary used to look up words.
return_token (bool) – Whether to return token. If True: return tokens. False: return ids. Default: True.

execute_py(text_input)[source]: Execute method.

class mindnlp.transforms.tokenizers.BasicTokenizer(lower_case=False, py_transform=False)[source]

Bases: TextTensorOperation, PyTensorOperation

Tokenize the input UTF-8 encoded string by specific rules.

Parameters:

lower_case (bool, optional) – Whether to perform lowercase processing on the text. If True, will fold the text to lower case and strip accented characters. If False, will only perform normalization on the text, with mode specified by normalization_form. Default: False.
py_transform (bool, optional) – Whether to use the Python implementation. Default: False.

Raises:

TypeError – If lower_case is not of type bool.
TypeError – If py_transform is not of type bool.
RuntimeError – If dtype of input Tensor is not str.

Supported Platforms: CPU

Examples

>>> from mindnlp.transforms import BasicTokenizer
>>> tokenizer_op = BasicTokenizer()
>>> text = "Welcome to China!"
>>> tokenized_text = tokenizer_op(text)

execute_py(text_input)[source]: Execute method.

parse()[source]: parse function - not yet implemented
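The rules described above (lowercase folding, accent stripping, and splitting on whitespace and punctuation) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not MindNLP's implementation: the name `basic_tokenize_sketch` is hypothetical, the normalization step applied when lower_case is False is omitted, and the real operator may treat more cases (for example CJK or control characters).

```python
import unicodedata
from typing import List


def basic_tokenize_sketch(text: str, lower_case: bool = False) -> List[str]:
    """Split text on whitespace and punctuation (illustrative only)."""
    if lower_case:
        # Fold to lower case, then decompose and drop combining marks
        # so that accented characters lose their accents.
        text = unicodedata.normalize("NFD", text.lower())
        text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")

    tokens: List[str] = []
    current: List[str] = []
    for ch in text:
        if ch.isspace():
            # Whitespace ends the current token and is discarded.
            if current:
                tokens.append("".join(current))
                current = []
        elif unicodedata.category(ch).startswith("P"):
            # Punctuation ends the current token and becomes its own token.
            if current:
                tokens.append("".join(current))
                current = []
            tokens.append(ch)
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


print(basic_tokenize_sketch("Welcome to China!"))
# ['Welcome', 'to', 'China', '!']
print(basic_tokenize_sketch("Café Time", lower_case=True))
# ['cafe', 'time']
```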

class mindnlp.transforms.tokenizers.BertTokenizer(vocab=None, tokenizer_file=None, do_lower_case=True, unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]', tokenize_chinese_chars=True, strip_accents=None, **kwargs)[source]

Bases: PreTrainedTokenizer

Tokenizer used for Bert text processing.

Parameters:

vocab (Vocab) – Vocabulary used to look up words.
do_lower_case (bool, optional) – Whether to perform lowercase processing on the text. If True, will fold the text to lower case. Default: True.
return_token (bool) – Whether to return token. If True: return tokens. False: return ids. Default: True.

Raises:

TypeError – If do_lower_case is not of type bool.
TypeError – If py_transform is not of type bool.
RuntimeError – If dtype of input Tensor is not str.

Examples

>>> from mindspore.dataset import GeneratorDataset, text
>>> from mindnlp.transforms import BertTokenizer
>>> vocab_list = ["床", "前", "明", "月", "光", "疑", "是", "地", "上", "霜", "举", "头", "望", "低",
...       "思", "故", "乡","繁", "體", "字", "嘿", "哈", "大", "笑", "嘻", "i", "am", "mak",
...       "make", "small", "mistake", "##s", "during", "work", "##ing", "hour", "😀", "😃",
...       "😄", "😁", "+", "/", "-", "=", "12", "28", "40", "16", " ", "I", "[CLS]", "[SEP]",
...       "[UNK]", "[PAD]", "[MASK]", "[unused1]", "[unused10]"]
>>> vocab = text.Vocab.from_list(vocab_list)
>>> tokenizer_op = BertTokenizer(vocab=vocab, do_lower_case=True)
>>> text = "i make a small mistake when i'm working! 床前明月光😀"
>>> test_dataset = ['A small mistake was made when I was working.']
>>> dataset = GeneratorDataset(test_dataset, 'text')
>>> tokenized_text = tokenizer_op(text)
>>> tokenized_dataset = dataset.map(operations=tokenizer_op)
>>> # The encode method returns an Encoding object with many useful attributes.
>>> tokens = tokenizer_op.encode(text)
>>> tokens_offset = tokens.offsets

execute_py(text_input)[source]: Execute method.
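BERT-style vocabularies such as the one in the example above contain `##`-prefixed entries (`##s`, `##ing`) that mark word continuations. A minimal sketch of greedy longest-match-first sub-word splitting over such a vocabulary follows. It assumes a plain Python set as the vocabulary; `wordpiece_sketch` is a hypothetical helper and not part of MindNLP, whose tokenizer may behave differently in edge cases.

```python
from typing import List, Set


def wordpiece_sketch(word: str, vocab: Set[str], unk_token: str = "[UNK]") -> List[str]:
    """Greedily split one word into the longest sub-word pieces found in vocab."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word carry the "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece matches: the whole word maps to the unknown token.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


vocab = {"mak", "make", "small", "mistake", "##s", "work", "##ing", "[UNK]"}
print(wordpiece_sketch("working", vocab))   # ['work', '##ing']
print(wordpiece_sketch("mistakes", vocab))  # ['mistake', '##s']
print(wordpiece_sketch("xyz", vocab))       # ['[UNK]']
```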

class mindnlp.transforms.tokenizers.ChatGLMTokenizer(vocab_file, do_lower_case=False, remove_space=False, bos_token='<sop>', eos_token='<eop>', end_token='</s>', mask_token='[MASK]', gmask_token='[gMASK]', padding_side='left', pad_token='<pad>', unk_token='<unk>', num_image_tokens=0, **kwargs)[source]

Bases: PreTrainedTokenizer

Construct a ChatGLM tokenizer. Based on byte-level Byte-Pair-Encoding.

Parameters: vocab_file (str) – Path to the vocabulary file.

build_inputs_with_special_tokens(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) → List[int][source]

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. A BERT sequence has the following format:

single sequence: [CLS] X [SEP]
pair of sequences: [CLS] A [SEP] B [SEP]

Parameters:

token_ids_0 (List[int]) – List of IDs to which the special tokens will be added.
token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.

Returns:

List of [input IDs](../glossary#input-ids) with the appropriate special tokens.

Return type:

List[int]

convert_tokens_to_string(tokens: List[str]) → str[source]: convert tokens to string.

property end_token_id: Optional[int]

Id of the end of context token in the vocabulary. Returns None if the token has not been set.

Type: Optional[int]

execute_py(text_input)[source]: Execute method.

get_vocab()[source]: Returns vocab as a dict

property gmask_token_id: Optional[int]: gmask token id

preprocess_text(inputs)[source]: preprocess text.

save_vocabulary(save_directory)[source]

Save the vocabulary and special tokens file to a directory.

Parameters:

save_directory (str) – The directory in which to save the vocabulary.
filename_prefix (str, optional) – An optional prefix to add to the names of the saved files.

Returns:

Paths to the files saved.

Return type:

Tuple[str]

property vocab_size: Returns vocab size

class mindnlp.transforms.tokenizers.CodeGenTokenizer(vocab: str, **kwargs)[source]

Bases: PreTrainedTokenizer

Tokenizer used for CodeGen text processing.

Parameters:

vocab (Vocab) – Vocabulary used to look up words.
return_token (bool) – Whether to return token. If True: return tokens. False: return ids. Default: True.

execute_py(text_input)[source]: Execute method.

class mindnlp.transforms.tokenizers.ErnieTokenizer(vocab: str, **kwargs)[source]

Bases: PreTrainedTokenizer

Tokenizer used for Ernie text processing.

Parameters:

vocab (Vocab) – Vocabulary used to look up words.
return_token (bool) – Whether to return token. If True: return tokens. False: return ids. Default: True.

execute_py(text_input)[source]: Execute method.

Bases: PreTrainedTokenizer

Tokenizer used for GPT2 text processing.

Parameters:

vocab (Vocab) – Vocabulary used to look up words.
return_token (bool) – Whether to return token. If True: return tokens. False: return ids. Default: True.

execute_py(text_input)[source]: Execute method.

class mindnlp.transforms.tokenizers.GPTTokenizer(tokenizer_file=None, unk_token='<unk>', **kwargs)[source]

Bases: PreTrainedTokenizer

Tokenizer used for GPT text processing.

Parameters:

vocab (Vocab) – Vocabulary used to look up words.
return_token (bool) – Whether to return token. If True: return tokens. False: return ids. Default: True.

execute_py(text_input)[source]: Execute method.

class mindnlp.transforms.tokenizers.LongformerTokenizer(vocab: str, **kwargs)[source]

Bases: PreTrainedTokenizer

Tokenizer used for Longformer text processing.

Parameters:

vocab (Vocab) – Vocabulary used to look up words.
return_token (bool) – Whether to return token. If True: return tokens. False: return ids. Default: True.

Examples

>>> from mindspore.dataset import text
>>> from mindnlp.transforms import T5Tokenizer
>>> text = "Believing that faith can triumph over everything is in itself the greatest belief"
>>> tokenizer = T5Tokenizer.from_pretrained('t5-base')
>>> tokens = tokenizer.encode(text)

execute_py(text_input)[source]: Execute method.

class mindnlp.transforms.tokenizers.LukeTokenizer(vocab_file, merges_file, entity_vocab_file, task=None, max_entity_length=32, max_mention_length=30, entity_token_1='<ent>', entity_token_2='<ent2>', entity_unk_token='[UNK]', entity_pad_token='[PAD]', entity_mask_token='[MASK]', entity_mask2_token='[MASK2]', errors='replace', bos_token='<s>', eos_token='</s>', sep_token='</s>', cls_token='<s>', unk_token='<unk>', pad_token='<pad>', mask_token='<mask>', add_prefix_space=False, **kwargs)[source]

Bases: PreTrainedTokenizer

Constructs a LUKE tokenizer, derived from the GPT-2 tokenizer, using byte-level Byte-Pair-Encoding.

This tokenizer has been trained to treat spaces like parts of the tokens (a bit like sentencepiece) so a word will be encoded differently whether it is at the beginning of the sentence (without space) or not:

```python
>>> from transformers import LukeTokenizer

>>> tokenizer = LukeTokenizer.from_pretrained("studio-ousia/luke-base")
>>> tokenizer("Hello world")["input_ids"]
[0, 31414, 232, 2]

>>> tokenizer(" Hello world")["input_ids"]
[0, 20920, 232, 2]
```

You can get around that behavior by passing add_prefix_space=True when instantiating this tokenizer or when you call it on some text, but since the model was not pretrained this way, it might yield a decrease in performance.

<Tip>

When used with is_split_into_words=True, this tokenizer will add a space before each word (even the first one).

</Tip>

This tokenizer inherits from [PreTrainedTokenizer] which contains most of the main methods. Users should refer to this superclass for more information regarding those methods. It also creates entity sequences, namely entity_ids, entity_attention_mask, entity_token_type_ids, and entity_position_ids to be used by the LUKE model.

Parameters:

vocab_file (str) – Path to the vocabulary file.
merges_file (str) – Path to the merges file.
entity_vocab_file (str) – Path to the entity vocabulary file.
task (str, optional) – Task for which you want to prepare sequences. One of “entity_classification”, “entity_pair_classification”, or “entity_span_classification”. If you specify this argument, the entity sequence is automatically created based on the given entity span(s).
max_entity_length (int, optional, defaults to 32) – The maximum length of entity_ids.
max_mention_length (int, optional, defaults to 30) – The maximum number of tokens inside an entity span.
entity_token_1 (str, optional, defaults to <ent>) – The special token used to represent an entity span in a word token sequence. This token is only used when task is set to “entity_classification” or “entity_pair_classification”.
entity_token_2 (str, optional, defaults to <ent2>) – The special token used to represent an entity span in a word token sequence. This token is only used when task is set to “entity_pair_classification”.
errors (str, optional, defaults to “replace”) – Paradigm to follow when decoding bytes to UTF-8. See [bytes.decode](https://docs.python.org/3/library/stdtypes.html#bytes.decode) for more information.
bos_token (str, optional, defaults to “<s>”) –
The beginning of sequence token that was used during pretraining. Can be used as a sequence classifier token.

<Tip>

When building a sequence using special tokens, this is not the token that is used for the beginning of sequence. The token used is the cls_token.

</Tip>
eos_token (str, optional, defaults to “</s>”) –
The end of sequence token.

<Tip>

When building a sequence using special tokens, this is not the token that is used for the end of sequence. The token used is the sep_token.

</Tip>
sep_token (str, optional, defaults to “</s>”) – The separator token, which is used when building a sequence from multiple sequences, e.g. two sequences for sequence classification or for a text and a question for question answering. It is also used as the last token of a sequence built with special tokens.
cls_token (str, optional, defaults to “<s>”) – The classifier token which is used when doing sequence classification (classification of the whole sequence instead of per-token classification). It is the first token of the sequence when built with special tokens.
unk_token (str, optional, defaults to “<unk>”) – The unknown token. A token that is not in the vocabulary cannot be converted to an ID and is set to be this token instead.
pad_token (str, optional, defaults to “<pad>”) – The token used for padding, for example when batching sequences of different lengths.
mask_token (str, optional, defaults to “<mask>”) – The token used for masking values. This is the token used when training this model with masked language modeling. This is the token which the model will try to predict.
add_prefix_space (bool, optional, defaults to False) – Whether or not to add an initial space to the input. This allows the leading word to be treated just like any other word. (The LUKE tokenizer detects the beginning of words by the preceding space.)

bpe(token)[source]

build_inputs_with_special_tokens(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) → List[int][source]

Build model inputs from a sequence or a pair of sequences for sequence classification tasks by concatenating and adding special tokens. A LUKE sequence has the following format:

single sequence: <s> X </s>
pair of sequences: <s> A </s></s> B </s>

Parameters:

token_ids_0 (List[int]) – List of IDs to which the special tokens will be added.
token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.

Returns:

List of [input IDs](../glossary#input-ids) with the appropriate special tokens.

Return type:

List[int]
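The two layouts above can be sketched with plain lists. This is an illustration only: `build_inputs_sketch` is a hypothetical stand-in for the method, and the ids 0 for `<s>` and 2 for `</s>` are an assumption read off the example output `[0, 31414, 232, 2]` shown earlier; a real vocabulary may assign different ids.

```python
from typing import List, Optional

# Assumed special-token ids (inferred from the example output, not guaranteed).
CLS_ID = 0  # <s>
SEP_ID = 2  # </s>


def build_inputs_sketch(token_ids_0: List[int],
                        token_ids_1: Optional[List[int]] = None) -> List[int]:
    """Wrap one or two id lists in the LUKE-style special-token layout."""
    cls, sep = [CLS_ID], [SEP_ID]
    if token_ids_1 is None:
        # single sequence: <s> X </s>
        return cls + token_ids_0 + sep
    # pair of sequences: <s> A </s></s> B </s>
    return cls + token_ids_0 + sep + sep + token_ids_1 + sep


print(build_inputs_sketch([31414, 232]))      # [0, 31414, 232, 2]
print(build_inputs_sketch([5], [6, 7]))       # [0, 5, 2, 2, 6, 7, 2]
```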

create_token_type_ids_from_sequences(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) → List[int][source]

Create a mask from the two sequences passed to be used in a sequence-pair classification task. LUKE does not make use of token type ids, therefore a list of zeros is returned.

Parameters:

token_ids_0 (List[int]) – List of IDs.
token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.

Returns:

List of zeros.

Return type:

List[int]

encode_plus(text: Union[str, List[str], List[int]], text_pair: Optional[Union[str, List[str], List[int]]] = None, add_special_tokens: bool = True, padding: Union[bool, str, PaddingStrategy] = False, truncation: Optional[Union[bool, str, TruncationStrategy]] = None, max_length: Optional[int] = None, stride: int = 0, is_split_into_words: bool = False, pad_to_multiple_of: Optional[int] = None, return_tensors: Optional[Union[str, TensorType]] = None, return_token_type_ids: Optional[bool] = None, return_attention_mask: Optional[bool] = None, return_overflowing_tokens: bool = False, return_special_tokens_mask: bool = False, return_offsets_mapping: bool = False, return_length: bool = False, verbose: bool = True, **kwargs)[source]

Tokenize and prepare for the model a sequence or a pair of sequences.

<Tip>

This method is deprecated, __call__ should be used instead.

</Tip>

Parameters:

text (str, List[str] or List[int] (the latter only for not-fast tokenizers)) – The first sequence to be encoded. This can be a string, a list of strings (tokenized string using the tokenize method) or a list of integers (tokenized string ids using the convert_tokens_to_ids method).
text_pair (str, List[str] or List[int], optional) – Optional second sequence to be encoded. This can be a string, a list of strings (tokenized string using the tokenize method) or a list of integers (tokenized string ids using the convert_tokens_to_ids method).

num_special_tokens_to_add(pair: bool = False) → int[source]

Returns the number of added tokens when encoding a sequence with special tokens.

<Tip>

This encodes a dummy input and checks the number of added tokens, and is therefore not efficient. Do not put this inside your training loop.

</Tip>

Parameters: pair (bool, optional, defaults to False) – Whether the number of added tokens should be computed in the case of a sequence pair or a single sequence.
Returns: Number of special tokens added to sequences.
Return type: int
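The "encode a dummy input and count" approach mentioned in the tip can be sketched as follows. The helper `_wrap_with_special_tokens` is hypothetical and assumes the LUKE layout (`<s> X </s>` and `<s> A </s></s> B </s>`) with placeholder ids; it is not the library's code.

```python
from typing import List, Optional


def _wrap_with_special_tokens(ids_0: List[int],
                              ids_1: Optional[List[int]] = None) -> List[int]:
    """Hypothetical helper: add LUKE-style special tokens (placeholder ids 0 and 2)."""
    if ids_1 is None:
        return [0] + ids_0 + [2]
    return [0] + ids_0 + [2, 2] + ids_1 + [2]


def num_special_tokens_to_add_sketch(pair: bool = False) -> int:
    """Count special tokens by wrapping empty sequences and measuring the result."""
    dummy_0: List[int] = []
    dummy_1: Optional[List[int]] = [] if pair else None
    return len(_wrap_with_special_tokens(dummy_0, dummy_1))


print(num_special_tokens_to_add_sketch())           # 2
print(num_special_tokens_to_add_sketch(pair=True))  # 4
```

Because every call rebuilds a dummy sequence, the result is better computed once and cached than recomputed inside a training loop.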

pad(encoded_inputs, padding: Union[bool, str, PaddingStrategy] = True, max_length: Optional[int] = None, max_entity_length: Optional[int] = None, pad_to_multiple_of: Optional[int] = None, return_attention_mask: Optional[bool] = None, verbose: bool = True)[source]

Pad a single encoded input or a batch of encoded inputs up to a predefined length or to the maximum sequence length in the batch. The padding side (left/right) and padding token ids are defined at the tokenizer level (with self.padding_side, self.pad_token_id and self.pad_token_type_id).

Note: If the encoded_inputs passed are a dictionary of numpy arrays, PyTorch tensors or TensorFlow tensors, the result will use the same type unless you provide a different tensor type with return_tensors. In the case of PyTorch tensors, however, you will lose the specific device of your tensors.

Parameters:

encoded_inputs ([BatchEncoding], list of [BatchEncoding], Dict[str, List[int]], Dict[str, List[List[int]]] or List[Dict[str, List[int]]]) – Tokenized inputs. Can represent one input ([BatchEncoding] or Dict[str, List[int]]) or a batch of tokenized inputs (list of [BatchEncoding], Dict[str, List[List[int]]] or List[Dict[str, List[int]]]) so you can use this method during preprocessing as well as in a PyTorch Dataloader collate function. Instead of List[int] you can have tensors (numpy arrays, PyTorch tensors or TensorFlow tensors), see the note above for the return type.
padding (bool, str or [~utils.PaddingStrategy], optional, defaults to True) –

Select a strategy to pad the returned sequences (according to the model’s padding side and padding index) among:
- True or ‘longest’ (default): Pad to the longest sequence in the batch (or no padding if only a single sequence is provided).
- ‘max_length’: Pad to a maximum length specified with the argument max_length or to the maximum acceptable input length for the model if that argument is not provided.
- False or ‘do_not_pad’: No padding (i.e., can output a batch with sequences of different lengths).
max_length (int, optional) – Maximum length of the returned list and optionally padding length (see above).
max_entity_length (int, optional) – The maximum length of the entity sequence.
pad_to_multiple_of (int, optional) – If set will pad the sequence to a multiple of the provided value. This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability >= 7.0 (Volta).
return_attention_mask (bool, optional) – Whether to return the attention mask. If left to the default, will return the attention mask according to the specific tokenizer’s default, defined by the return_outputs attribute. [What are attention masks?](../glossary#attention-mask)
return_tensors (str or [~utils.TensorType], optional) –
If set, will return tensors instead of lists of Python integers. Acceptable values are:
- ‘tf’: Return TensorFlow tf.constant objects.
- ‘pt’: Return PyTorch torch.Tensor objects.
- ‘np’: Return Numpy np.ndarray objects.
verbose (bool, optional, defaults to True) – Whether or not to print more information and warnings.
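The padding strategies listed above can be sketched on plain lists of ids. This is a simplified illustration, not the library's `pad` method: `pad_batch_sketch`, `pad_token_id=0` and the `padding_side` argument are assumptions made for the example, and entity sequences and tensor conversion are ignored.

```python
from typing import Dict, List, Optional, Union


def pad_batch_sketch(batch: List[List[int]],
                     padding: Union[bool, str] = True,
                     max_length: Optional[int] = None,
                     pad_to_multiple_of: Optional[int] = None,
                     pad_token_id: int = 0,
                     padding_side: str = "right") -> Dict[str, List[List[int]]]:
    """Pad a batch of id lists and build the matching attention masks."""
    if padding is False or padding == "do_not_pad":
        target = None  # leave every sequence at its own length
    elif padding is True or padding == "longest":
        target = max((len(seq) for seq in batch), default=0)
    elif padding == "max_length":
        if max_length is None:
            raise ValueError("max_length is required for 'max_length' padding")
        target = max_length
    else:
        raise ValueError(f"unknown padding strategy: {padding!r}")

    if target is not None and pad_to_multiple_of:
        # Round the target length up to the next multiple.
        target = -(-target // pad_to_multiple_of) * pad_to_multiple_of

    input_ids, attention_mask = [], []
    for seq in batch:
        n_pad = 0 if target is None else max(target - len(seq), 0)
        ids, mask = list(seq), [1] * len(seq)
        if padding_side == "right":
            ids, mask = ids + [pad_token_id] * n_pad, mask + [0] * n_pad
        else:
            ids, mask = [pad_token_id] * n_pad + ids, [0] * n_pad + mask
        input_ids.append(ids)
        attention_mask.append(mask)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


out = pad_batch_sketch([[1, 2, 3], [4]])
print(out["input_ids"])       # [[1, 2, 3], [4, 0, 0]]
print(out["attention_mask"])  # [[1, 1, 1], [1, 0, 0]]
```

With `pad_to_multiple_of=4`, the same batch would be padded to length 4 instead of 3.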

prepare_for_model(ids: List[int], pair_ids: Optional[List[int]] = None, entity_ids: Optional[List[int]] = None, pair_entity_ids: Optional[List[int]] = None, entity_token_spans: Optional[List[Tuple[int, int]]] = None, pair_entity_token_spans: Optional[List[Tuple[int, int]]] = None, add_special_tokens: bool = True, padding: Union[bool, str, PaddingStrategy] = False, truncation: Optional[Union[bool, str, TruncationStrategy]] = None, max_length: Optional[int] = None, max_entity_length: Optional[int] = None, stride: int = 0, pad_to_multiple_of: Optional[int] = None, return_token_type_ids: Optional[bool] = None, return_attention_mask: Optional[bool] = None, return_overflowing_tokens: bool = False, return_special_tokens_mask: bool = False, return_length: bool = False, verbose: bool = True, **kwargs)[source]

Prepares a sequence of input ids, entity ids and entity spans, or a pair of sequences of input ids, entity ids and entity spans, so that it can be used by the model. It adds special tokens, truncates sequences if overflowing while taking into account the special tokens, and manages a moving window (with user-defined stride) for overflowing tokens.

Note: when pair_ids is not None and truncation_strategy is longest_first or True, it is not possible to return overflowing tokens. Such a combination of arguments will raise an error.

Parameters:

ids (List[int]) – Tokenized input ids of the first sequence.
pair_ids (List[int], optional) – Tokenized input ids of the second sequence.
entity_ids (List[int], optional) – Entity ids of the first sequence.
pair_entity_ids (List[int], optional) – Entity ids of the second sequence.
entity_token_spans (List[Tuple[int, int]], optional) – Entity spans of the first sequence.
pair_entity_token_spans (List[Tuple[int, int]], optional) – Entity spans of the second sequence.
max_entity_length (int, optional) – The maximum length of the entity sequence.
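The moving window with a user-defined stride mentioned in the description can be pictured with a small sketch. It is an assumption-laden simplification: `sliding_windows_sketch` is a hypothetical helper that treats stride as the number of ids shared by consecutive windows and ignores special tokens, pairs and entity sequences, all of which the real method accounts for.

```python
from typing import List


def sliding_windows_sketch(ids: List[int], max_length: int, stride: int = 0) -> List[List[int]]:
    """Split ids into windows of at most max_length that overlap by stride ids."""
    if max_length <= stride:
        raise ValueError("max_length must be larger than stride")
    step = max_length - stride
    windows: List[List[int]] = []
    start = 0
    while True:
        windows.append(ids[start:start + max_length])
        if start + max_length >= len(ids):
            break  # the last window reached the end of the sequence
        start += step
    return windows


print(sliding_windows_sketch(list(range(10)), max_length=4, stride=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```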

prepare_for_tokenization(text, is_split_into_words=False, **kwargs)[source]

tokenize_(text: str, **kwargs) → List[str][source]

Converts a string into a sequence of tokens, using the tokenizer.

Split in words for word-based vocabulary or sub-words for sub-word-based vocabularies (BPE/SentencePieces/WordPieces). Takes care of added tokens.

Parameters:

text (str) – The sequence to be encoded.
**kwargs (additional keyword arguments) – Passed along to the model-specific prepare_for_tokenization preprocessing method.

Returns:

The list of tokens.

Return type:

List[str]

class mindnlp.transforms.tokenizers.MegatronBertTokenizer(vocab=None, tokenizer_file=None, do_lower_case=True, unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]', tokenize_chinese_chars=True, strip_accents=None, **kwargs)[source]

Bases: PreTrainedTokenizer

Tokenizer used for MegatronBert text processing.

Parameters:

vocab (Vocab) – Vocabulary used to look up words.
return_token (bool) – Whether to return token. If True: return tokens. False: return ids. Default: True.

Examples

>>> from mindspore.dataset import text
>>> from mindnlp.transforms import MegatronBertTokenizer
>>> text = "Believing that faith can triumph over everything is in itself the greatest belief"
>>> tokenizer = MegatronBertTokenizer.from_pretrained('nvidia/megatron-bert-cased-345m')
>>> tokens = tokenizer.encode(text)

execute_py(text_input)[source]: Execute method.

class mindnlp.transforms.tokenizers.MobileBertTokenizer(vocab=None, tokenizer_file=None, do_lower_case=True, unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]', tokenize_chinese_chars=True, strip_accents=None, **kwargs)[source]

Bases: PreTrainedTokenizer

Tokenizer used for MobileBert text processing.

Parameters:

vocab (Vocab) – Vocabulary used to look up words.
do_lower_case (bool, optional) – Whether to perform lowercase processing on the text. If True, will fold the text to lower case. Default: True.
return_token (bool) – Whether to return token. If True: return tokens. False: return ids. Default: True.

Raises:

TypeError – If do_lower_case is not of type bool.
TypeError – If py_transform is not of type bool.
RuntimeError – If dtype of input Tensor is not str.

Examples

>>> from mindspore.dataset import GeneratorDataset, text
>>> from mindnlp.transforms import MobileBertTokenizer
>>> vocab_list = ["床", "前", "明", "月", "光", "疑", "是", "地", "上", "霜", "举", "头", "望", "低",
...       "思", "故", "乡","繁", "體", "字", "嘿", "哈", "大", "笑", "嘻", "i", "am", "mak",
...       "make", "small", "mistake", "##s", "during", "work", "##ing", "hour", "😀", "😃",
...       "😄", "😁", "+", "/", "-", "=", "12", "28", "40", "16", " ", "I", "[CLS]", "[SEP]",
...       "[UNK]", "[PAD]", "[MASK]", "[unused1]", "[unused10]"]
>>> vocab = text.Vocab.from_list(vocab_list)
>>> tokenizer_op = MobileBertTokenizer(vocab=vocab, do_lower_case=True)
>>> text = "i make a small mistake when i'm working! 床前明月光😀"
>>> test_dataset = ['A small mistake was made when I was working.']
>>> dataset = GeneratorDataset(test_dataset, 'text')
>>> tokenized_text = tokenizer_op(text)
>>> tokenized_dataset = dataset.map(operations=tokenizer_op)
>>> # The encode method returns an Encoding object with many useful attributes.
>>> tokens = tokenizer_op.encode(text)
>>> tokens_offset = tokens.offsets

execute_py(text_input)[source]: Execute method.

class mindnlp.transforms.tokenizers.NezhaTokenizer(vocab: str, **kwargs)[source]

Bases: PreTrainedTokenizer

Tokenizer used for Nezha text processing.

Parameters:

vocab (Vocab) – Vocabulary used to look up words.
return_token (bool) – Whether to return token. If True: return tokens. False: return ids. Default: True.

execute_py(text_input)[source]: Execute method.

class mindnlp.transforms.tokenizers.OPTTokenizer(tokenizer_file=None, unk_token='</s>', bos_token='</s>', eos_token='</s>', pad_token='<pad>', add_prefix_space=False, **kwargs)[source]

Bases: PreTrainedTokenizer

Tokenizer used for OPT text processing.

Parameters:

vocab (Vocab) – Vocabulary used to look up words.
return_token (bool) – Whether to return token. If True: return tokens. False: return ids. Default: True.

batch_decode(sequences, skip_special_tokens: bool = False)[source]: Batch Decode

execute_py(text_input)[source]: Execute method.

class mindnlp.transforms.tokenizers.RobertaTokenizer(vocab: str, **kwargs)[source]

Bases: PreTrainedTokenizer

Tokenizer used for Roberta text processing.

Parameters:

vocab (Vocab) – Vocabulary used to look up words.
return_token (bool) – Whether to return token. If True: return tokens. False: return ids. Default: True.

execute_py(text_input)[source]: Execute method.

class mindnlp.transforms.tokenizers.T5Tokenizer(vocab: str, **kwargs)[source]

Bases: PreTrainedTokenizer

Tokenizer used for T5 text processing.

Parameters:

vocab (Vocab) – Vocabulary used to look up words.
return_token (bool) – Whether to return token. If True: return tokens. False: return ids. Default: True.

Examples

>>> from mindspore.dataset import text
>>> from mindnlp.transforms import T5Tokenizer
>>> text = "Believing that faith can triumph over everything is in itself the greatest belief"
>>> tokenizer = T5Tokenizer.from_pretrained('t5-base')
>>> tokens = tokenizer.encode(text)

execute_py(text_input)[source]: Execute method.

class mindnlp.transforms.tokenizers.TinyBertTokenizer(vocab=None, tokenizer_file=None, do_lower_case=True, unk_token='[UNK]', sep_token='[SEP]', pad_token='[PAD]', cls_token='[CLS]', mask_token='[MASK]', tokenize_chinese_chars=True, strip_accents=None, **kwargs)[source]

Bases: PreTrainedTokenizer

Tokenizer used for TinyBert text processing.

execute_py(text_input)[source]: Execute method.

save(save_path: str)[source]: save tokenizer

class mindnlp.transforms.tokenizers.UIETokenizer(vocab: str, **kwargs)[source]

Bases: PreTrainedTokenizer

Tokenizer used for UIE text processing.

Parameters:

vocab (Vocab) – Vocabulary used to look up words.
return_token (bool) – Whether to return token. If True: return tokens. False: return ids. Default: True.

execute_py(text_input, pair, max_length, truncation, padding, return_token_type_ids, return_attention_mask, return_offsets_mapping, return_position_ids)[source]: Execute method.

class mindnlp.transforms.tokenizers.XLMTokenizer(vocab_file, merges_file, unk_token='<unk>', bos_token='<s>', sep_token='</s>', pad_token='<pad>', cls_token='</s>', mask_token='<special1>', additional_special_tokens=['<special0>', '<special1>', '<special2>', '<special3>', '<special4>', '<special5>', '<special6>', '<special7>', '<special8>', '<special9>'], lang2id=None, id2lang=None, do_lowercase_and_remove_accent=True, **kwargs)[source]

Bases: PreTrainedTokenizer

Tokenizer used for XLM text processing.

Parameters:

vocab (Vocab) – Vocabulary used to look up words.
return_token (bool) – Whether to return token. If True: return tokens. False: return ids. Default: True.

Examples

>>> from mindspore.dataset import text
>>> from mindnlp.transforms import XLMTokenizer
>>> text = "Believing that faith can triumph over everything is in itself the greatest belief"
>>> tokenizer = XLMTokenizer.from_pretrained('xlm-clm-ende-1024')
>>> tokens = tokenizer.encode(text)

bpe(token)[source]

build_inputs_with_special_tokens(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) → List[int][source]

convert_tokens_to_ids(tokens: Union[str, List[str]]) → Union[int, List[int]][source]

Converts a token string (or a sequence of tokens) into a single integer id (or a sequence of ids), using the vocabulary.

Parameters: tokens (str or List[str]) – One or several token(s) to convert to token id(s).
Returns: The token id or list of token ids.
Return type: int or List[int]
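The str-or-list behaviour described above can be sketched with a dictionary lookup. The vocabulary below is made up for illustration, and falling back to the id of `<unk>` for unknown tokens is an assumption based on the unk_token default in the class signature; `convert_tokens_to_ids_sketch` is not the library method.

```python
from typing import Dict, List, Union

# Hypothetical toy vocabulary; real ids come from the vocabulary file.
TOY_VOCAB: Dict[str, int] = {"<unk>": 0, "hello": 1, "world": 2}


def convert_tokens_to_ids_sketch(tokens: Union[str, List[str]],
                                 vocab: Dict[str, int] = TOY_VOCAB,
                                 unk_token: str = "<unk>") -> Union[int, List[int]]:
    """Map one token to an id, or a list of tokens to a list of ids."""
    unk_id = vocab[unk_token]
    if isinstance(tokens, str):
        # A single token yields a single id.
        return vocab.get(tokens, unk_id)
    # A sequence of tokens yields a list of ids, with unknowns mapped to unk_id.
    return [vocab.get(token, unk_id) for token in tokens]


print(convert_tokens_to_ids_sketch("hello"))                # 1
print(convert_tokens_to_ids_sketch(["hello", "there"]))     # [1, 0]
```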

create_token_type_ids_from_sequences(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None) → List[int][source]

Create a mask from the two sequences passed to be used in a sequence-pair classification task. An XLM sequence pair mask has the following format:

0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1
| first sequence    | second sequence |

If token_ids_1 is None, this method only returns the first portion of the mask (0s).

Parameters:

token_ids_0 (List[int]) – List of IDs.
token_ids_1 (List[int], optional) – Optional second list of IDs for sequence pairs.

Returns:

List of [token type IDs](../glossary#token-type-ids) according to the given sequence(s).

Return type:

List[int]
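The mask layout described above can be reproduced with a short sketch. It assumes one special token before and one after the first sequence, and one special token after the second sequence; that layout is an assumption made for this illustration (the page does not spell out the XLM special-token format), and `create_token_type_ids_sketch` is a hypothetical name.

```python
from typing import List, Optional


def create_token_type_ids_sketch(token_ids_0: List[int],
                                 token_ids_1: Optional[List[int]] = None) -> List[int]:
    """Return 0s for the first sequence (plus its special tokens) and 1s for the second."""
    # Assumed layout: <special> A <special> for the first segment.
    first_segment = [0] * (len(token_ids_0) + 2)
    if token_ids_1 is None:
        return first_segment
    # Assumed layout: B <special> for the second segment.
    return first_segment + [1] * (len(token_ids_1) + 1)


# Nine ids and eight ids give the 11 zeros and 9 ones of the diagram above.
mask = create_token_type_ids_sketch(list(range(9)), list(range(8)))
print(mask)
# [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]
```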

property do_lower_case

encode_plus(text: Union[str, List[str], List[int]], text_pair: Optional[Union[str, List[str], List[int]]] = None, add_special_tokens: bool = True, padding=False, truncation=None, max_length: Optional[int] = None, stride: int = 0, pad_to_multiple_of: Optional[int] = None, return_token_type_ids: Optional[bool] = None, return_attention_mask: Optional[bool] = None, return_overflowing_tokens: bool = False, return_special_tokens_mask: bool = False, return_length: bool = False, verbose: bool = True, **kwargs)[source]: Backward compatibility for ‘truncation_strategy’, ‘pad_to_max_length’

get_special_tokens_mask(token_ids_0: List[int], token_ids_1: Optional[List[int]] = None, already_has_special_tokens: bool = False) → List[int][source]

moses_pipeline(text, lang)[source]

moses_punct_norm(text, lang)[source]

moses_tokenize(text, lang)[source]

num_special_tokens_to_add(pair: bool = False) → int[source]

pad(encoded_inputs, padding: Union[bool, str, PaddingStrategy] = True, max_length: Optional[int] = None, pad_to_multiple_of: Optional[int] = None, return_attention_mask: Optional[bool] = None, verbose: bool = True)[source]

prepare_for_model(ids: List[int], pair_ids: Optional[List[int]] = None, add_special_tokens: bool = True, padding: Union[bool, str, PaddingStrategy] = False, truncation: Optional[Union[bool, str, TruncationStrategy]] = None, max_length: Optional[int] = None, stride: int = 0, pad_to_multiple_of: Optional[int] = None, return_token_type_ids: Optional[bool] = None, return_attention_mask: Optional[bool] = None, return_overflowing_tokens: bool = False, return_special_tokens_mask: bool = False, return_length: bool = False, verbose: bool = True, **kwargs)[source]: Backward compatibility for ‘truncation_strategy’, ‘pad_to_max_length’

tokenize_(text: str, **kwargs) → List[str][source]: Simple mapping string => AddedToken for special tokens with specific tokenization behaviors

truncate_sequences(ids: List[int], pair_ids: Optional[List[int]] = None, num_tokens_to_remove: int = 0, truncation_strategy: Union[str, TruncationStrategy] = 'longest_first', stride: int = 0) → Tuple[List[int], List[int], List[int]][source]