Sequence Tagging
conll2000chunking
CoNLL2000Chunking load function
- mindnlp.dataset.sequence_tagging.conll2000chunking.CoNLL2000Chunking(root: str = '~/.mindnlp', split: Union[Tuple[str], str] = ('train', 'test'), proxies=None)
Load the CoNLL2000Chunking dataset.
- Parameters:
root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.
split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'test').
proxies (dict) – A dict mapping protocols to proxy URLs, for example: {"https": "https://127.0.0.1:7890"}. Default: None.
- Returns:
datasets_list (list) – A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned directly instead of a list of datasets.
Examples
>>> from mindnlp.dataset.sequence_tagging.conll2000chunking import CoNLL2000Chunking
>>> root = "~/.mindnlp"
>>> split = ('train', 'test')
>>> dataset_train, dataset_test = CoNLL2000Chunking(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
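The return convention described above (a list for several splits, a bare dataset for one) can be illustrated with a small stand-alone sketch. The names `select_splits` and `fake` are hypothetical and not part of mindnlp. Treating a one-element tuple the same as a single string is an assumption, since the documentation does not say which forms count as "only one" split.

```python
from typing import Dict, Tuple, Union


def select_splits(available: Dict[str, list], split: Union[str, Tuple[str, ...]]):
    """Hypothetical helper that mimics the documented return convention."""
    # Normalise a single split name to a one-element tuple.
    names = (split,) if isinstance(split, str) else tuple(split)
    selected = [available[name] for name in names]
    # Assumption: exactly one requested split yields the dataset itself, not a list.
    return selected[0] if len(selected) == 1 else selected


# Stand-in "datasets"; the real loader returns dataset objects, not lists.
fake = {"train": ["t1", "t2"], "test": ["x1"]}

both = select_splits(fake, ("train", "test"))   # list with two entries, unpackable
only_train = select_splits(fake, "train")       # the train data itself
```

This is why the example above can unpack two names when two splits are requested, while a call with a single split should be assigned to one name.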
- mindnlp.dataset.sequence_tagging.conll2000chunking.CoNLL2000Chunking_Process(dataset, vocab, batch_size=64, max_len=500, bucket_boundaries=None, drop_remainder=False)
Process the CoNLL2000Chunking dataset.
- Parameters:
dataset (GeneratorDataset) – CoNLL2000Chunking dataset.
vocab (Vocab) – Vocabulary object that stores the mapping between tokens and indices.
batch_size (int) – The number of rows each batch is created with. Default: 64.
max_len (int) – The maximum length of a sentence. Default: 500.
bucket_boundaries (list[int]) – A list consisting of the upper boundaries of the buckets. Must be strictly increasing. Default: None.
drop_remainder (bool) – Whether to discard the last batch, instead of passing it to the next operation, when it contains fewer than batch_size rows. Default: False.
- Returns:
dataset (MapDataset) – The dataset after the transforms are applied.
- Raises:
TypeError – If input_column is not a string.
Examples
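>>> from mindspore.dataset import text
>>> from mindnlp.dataset.sequence_tagging.conll2000chunking import CoNLL2000Chunking, CoNLL2000Chunking_Process
>>> dataset_train, dataset_test = CoNLL2000Chunking()
>>> vocab = text.Vocab.from_dataset(dataset_train, columns=["words"], freq_range=None, top_k=None, special_tokens=["<pad>", "<unk>"], special_first=True)
>>> dataset_train = CoNLL2000Chunking_Process(dataset=dataset_train, vocab=vocab, batch_size=32, max_len=80)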
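The `bucket_boundaries` and `drop_remainder` parameters are easier to follow with a small pure-Python sketch. It is not the mindnlp implementation, and `bucket_index` and `make_batches` are hypothetical helpers. The sketch assumes half-open buckets, so that n boundaries give n + 1 buckets and a length equal to a boundary falls into the next bucket. Whether `CoNLL2000Chunking_Process` buckets in exactly this way is also an assumption.

```python
from bisect import bisect_right
from typing import List, Sequence


def bucket_index(length: int, boundaries: Sequence[int]) -> int:
    """Return the bucket number for a sentence of `length` tokens.

    Assumes half-open buckets: [0, b0), [b0, b1), ..., [b_last, inf).
    """
    if any(a >= b for a, b in zip(boundaries, boundaries[1:])):
        raise ValueError("bucket_boundaries must be strictly increasing")
    return bisect_right(boundaries, length)


def make_batches(rows: List, batch_size: int, drop_remainder: bool = False) -> List[List]:
    """Split `rows` into consecutive batches of at most `batch_size` rows."""
    batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
    if drop_remainder and batches and len(batches[-1]) < batch_size:
        batches.pop()  # discard the short final batch
    return batches


boundaries = [16, 32, 64]                 # 3 boundaries give 4 buckets
sentence_lengths = [5, 16, 31, 64, 200]
buckets = [bucket_index(n, boundaries) for n in sentence_lengths]

rows = list(range(10))                    # 10 rows, batch_size 4: last batch has 2 rows
kept_all = make_batches(rows, batch_size=4)
dropped = make_batches(rows, batch_size=4, drop_remainder=True)
```

With these inputs, `kept_all` holds three batches (the last one short) and `dropped` holds two. Grouping sentences of similar length into the same bucket generally reduces the padding needed inside each batch.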
udpos
UDPOS load function
- mindnlp.dataset.sequence_tagging.udpos.UDPOS(root: str = '~/.mindnlp', split: Union[Tuple[str], str] = ('train', 'dev', 'test'), proxies=None)
Load the UDPOS dataset.
- Parameters:
root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.
split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'dev', 'test').
proxies (dict) – A dict mapping protocols to proxy URLs, for example: {"https": "https://127.0.0.1:7890"}. Default: None.
- Returns:
datasets_list (list) – A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned directly instead of a list of datasets.
Examples
>>> from mindnlp.dataset.sequence_tagging.udpos import UDPOS
>>> root = "~/.mindnlp"
>>> split = ('train', 'dev', 'test')
>>> dataset_train, dataset_dev, dataset_test = UDPOS(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))