Vocab

vocab

Vocab Class

class mindnlp.vocab.vocab.Vocab(list_or_dict: Union[list, dict], special_tokens: Optional[Union[list, tuple]] = None, special_first: bool = True)

Bases: object

Creates a vocab object which maps tokens to indices.

append_token(token)

Parameters: token (str) – The token to append to the vocab.
Raises: RuntimeError – If token already exists in the vocab.
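
The duplicate check described above can be illustrated with a small pure-Python stand-in. TinyVocab is a hypothetical class written for this example, not the mindnlp implementation, and the rule that a new token takes the next free index is an assumption.

```python
class TinyVocab:
    """Hypothetical stand-in for illustration; not the mindnlp class."""

    def __init__(self, tokens):
        # Map each token to its position in the input list.
        self._token_to_idx = {tok: i for i, tok in enumerate(tokens)}

    def append_token(self, token):
        # Reject duplicates, mirroring the documented RuntimeError.
        if token in self._token_to_idx:
            raise RuntimeError(f"token {token!r} already exists in the vocab")
        # Assumption: the new token takes the next free index.
        self._token_to_idx[token] = len(self._token_to_idx)


vocab = TinyVocab(["<pad>", "<unk>", "hello"])
vocab.append_token("world")
```

Calling vocab.append_token("world") a second time raises RuntimeError in this sketch.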

classmethod from_dataset(dataset, columns=None, freq_range=None, top_k=None, special_tokens=None, special_first=True)

Build a Vocab from a dataset.

Collects all unique words in a dataset and returns a vocab restricted to the frequency range given in freq_range. A warning is issued if no words fall within that range. Words in the vocab are ordered from highest to lowest frequency; words with the same frequency are ordered lexicographically.

Parameters:

dataset (Dataset) – The dataset to build the vocab from.
columns (list[str], optional) – The names of the columns to collect words from. Default: None.
freq_range (tuple, optional) – A tuple of integers (min_frequency, max_frequency). Words whose frequency falls within this range are kept, where 0 <= min_frequency <= max_frequency <= total_words. min_frequency=0 is equivalent to min_frequency=1, and max_frequency > total_words is equivalent to max_frequency=total_words. Either bound may be None, which corresponds to 0 and total_words respectively. Default: None, all words are included.
top_k (int, optional) – The number of most frequent words to build into the vocab; must be greater than 0. top_k is applied after freq_range. If fewer than top_k words are available, all of them are taken. Default: None, all words are included.
special_tokens (list, optional) – A list of strings, each of which is a special token, for example special_tokens=["<pad>", "<unk>"]. Default: None, no special tokens are added.
special_first (bool, optional) – Whether special_tokens are prepended or appended to the vocab. If special_tokens is specified and special_first is True, the special tokens are prepended; otherwise they are appended. Default: True.

Returns:

Vocab, the vocab object built from the dataset.

Examples

>>> import mindspore.dataset as ds
>>> import mindspore.dataset.text as text
>>> dataset = ds.TextFileDataset("/path/to/sentence/piece/vocab/file", shuffle=False)
>>> vocab = text.Vocab.from_dataset(dataset, "text", freq_range=None, top_k=None,
...                                 special_tokens=["<pad>", "<unk>"],
...                                 special_first=True)
>>> dataset = dataset.map(operations=text.Lookup(vocab, "<unk>"), input_columns=["text"])
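
The ordering and filtering rules described for from_dataset can be sketched in plain Python without a dataset. build_vocab below is a hypothetical helper written for this example, not part of mindnlp or MindSpore; it takes a flat list of words and, as a simplifying assumption, treats a None upper bound as unbounded.

```python
from collections import Counter


def build_vocab(words, freq_range=None, top_k=None,
                special_tokens=None, special_first=True):
    """Hypothetical helper that mimics the documented ordering rules."""
    counts = Counter(words)

    # A missing bound means "no restriction" on that side.
    min_freq, max_freq = freq_range if freq_range is not None else (None, None)
    min_freq = 0 if min_freq is None else min_freq
    kept = [(word, count) for word, count in counts.items()
            if count >= min_freq and (max_freq is None or count <= max_freq)]

    # Highest frequency first; ties are broken lexicographically.
    kept.sort(key=lambda item: (-item[1], item[0]))

    # top_k is applied after the frequency filter.
    if top_k is not None:
        kept = kept[:top_k]

    tokens = [word for word, _ in kept]
    specials = list(special_tokens or [])
    ordered = specials + tokens if special_first else tokens + specials
    return {token: index for index, token in enumerate(ordered)}


words = "the cat sat on the mat the cat".split()
vocab = build_vocab(words, freq_range=(1, None), top_k=3,
                    special_tokens=["<pad>", "<unk>"])
tail_vocab = build_vocab(words, freq_range=(2, None),
                         special_tokens=["<unk>"], special_first=False)
```

In this sketch "mat", "on" and "sat" all occur once, so the lexicographic tie-break decides that "mat" is the one kept by top_k=3.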

classmethod from_pretrained(name='glove.6B.50d', root=DEFAULT_ROOT, special_tokens=('<pad>', '<unk>'), special_first=True)

Parameters:

name (str) – The name of the pretrained vectors. Default: "glove.6B.50d".
root (str) – The storage directory. Default: DEFAULT_ROOT.
special_tokens (str|Tuple[str]) – The special tokens. Default: ("<pad>", "<unk>").
special_first (bool) – Whether the special tokens in special_tokens are added at the beginning of the dictionary. If True, they are added to the beginning; otherwise they are added to the end. Default: True.

Returns:

Vocab, a vocab generated from the downloaded pretrained vectors.
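
As a rough illustration of how special_tokens and special_first could interact with pretrained vectors, the sketch below builds a token-to-index mapping from GloVe-style text lines, where each line holds a word followed by its space-separated vector components. tokens_from_glove_lines is a hypothetical helper, the sample lines use made-up numbers, and nothing here downloads files or reflects how mindnlp actually reads them.

```python
def tokens_from_glove_lines(lines, special_tokens=("<pad>", "<unk>"),
                            special_first=True):
    """Hypothetical helper: build a token-to-index dict from GloVe-style lines."""
    # The first space-separated field of each non-empty line is the word.
    words = [line.split(" ", 1)[0] for line in lines if line.strip()]

    # Accept a single string as well as a tuple or list of strings.
    if isinstance(special_tokens, str):
        specials = [special_tokens]
    else:
        specials = list(special_tokens)

    ordered = specials + words if special_first else words + specials
    return {token: index for index, token in enumerate(ordered)}


# Made-up sample lines in the "word v1 v2 v3" layout.
sample = ["the 0.1 0.2 0.3", "of 0.4 0.5 0.6"]
vocab = tokens_from_glove_lines(sample)
vocab_tail = tokens_from_glove_lines(sample, special_tokens="<unk>",
                                     special_first=False)
```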

lookup_ids(token_or_list)

Converts a token string or a sequence of tokens into a single integer id or a sequence of ids.

Parameters:

token_or_list (Union[str, list[str]]) – One or several token(s) to convert to token id(s).

Returns:

int or list[int], the token id or list of token ids. If only one token is looked up, a single id is returned instead of a list of ids.

Examples

>>> import mindspore.dataset.text as text
>>> vocab = text.Vocab.from_list(["w1", "w2", "w3"], special_tokens=["<unk>"], special_first=True)
>>> ids = vocab.lookup_ids(["w1", "w3"])
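
The single-value versus list return described above can be sketched with a plain dict. The free function lookup_ids below is a hypothetical illustration, not the library method, and the mapping assumes that special tokens placed first take the lowest indices. Unknown tokens raise KeyError in this sketch; how the library handles them is not described here.

```python
def lookup_ids(token_to_idx, token_or_list):
    """Hypothetical illustration of the scalar-or-list return shape."""
    # A single string yields a single id.
    if isinstance(token_or_list, str):
        return token_to_idx[token_or_list]
    # Any other sequence yields a list of ids, one per token.
    return [token_to_idx[token] for token in token_or_list]


# Assumed layout for special_tokens=["<unk>"], special_first=True.
token_to_idx = {"<unk>": 0, "w1": 1, "w2": 2, "w3": 3}
one_id = lookup_ids(token_to_idx, "w2")
many_ids = lookup_ids(token_to_idx, ["w1", "w3"])
```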

lookup_tokens(index_or_list)

Converts a single index or a sequence of indices into a token or a sequence of tokens. If an id does not exist, an empty string is returned.

Parameters:

index_or_list (Union[int, list[int]]) – The token id (or token ids) to convert to tokens.

Returns:

str or list[str], the decoded token(s). If only one id is looked up, a single token is returned instead of a list of tokens.

Raises:

RuntimeError – If ‘ids’ is not in vocab.

Examples

>>> import mindspore.dataset.text as text
>>> vocab = text.Vocab.from_list(["w1", "w2", "w3"], special_tokens=["<unk>"], special_first=True)
>>> token = vocab.lookup_tokens(0)
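
The reverse lookup can be sketched the same way. The free function lookup_tokens below is a hypothetical illustration that follows the empty-string behavior stated in the description; it does not model the RuntimeError listed under Raises, and the index layout is an assumption.

```python
def lookup_tokens(idx_to_token, index_or_list):
    """Hypothetical illustration; missing ids map to an empty string."""
    # A single integer yields a single token.
    if isinstance(index_or_list, int):
        return idx_to_token.get(index_or_list, "")
    # A sequence of integers yields a list of tokens.
    return [idx_to_token.get(index, "") for index in index_or_list]


# Assumed layout for special_tokens=["<unk>"], special_first=True.
idx_to_token = {0: "<unk>", 1: "w1", 2: "w2", 3: "w3"}
one_token = lookup_tokens(idx_to_token, 0)
many_tokens = lookup_tokens(idx_to_token, [1, 3, 99])
```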

property vocab: return vocab dict.