Sequence Tagging

conll2000chunking

CoNLL2000Chunking load function

mindnlp.dataset.sequence_tagging.conll2000chunking.CoNLL2000Chunking(root: str = '~/.mindnlp', split: Union[Tuple[str], str] = ('train', 'test'), proxies=None)

Load the CoNLL2000Chunking dataset

Parameters:
  • root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.

  • split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'test').

  • proxies (dict) – A dict that specifies the proxies to use, for example: {"https": "https://127.0.0.1:7890"}. Default: None.

Returns:

  • datasets_list (list) – A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned directly instead of a list of datasets.
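
The return rule above can be sketched in plain Python. The helper select_splits and the placeholder strings standing in for datasets are hypothetical; they only mirror the documented behavior and are not mindnlp's implementation. Treating a one-element tuple the same way as a single string is also an assumption.

```python
from typing import Dict, List, Tuple, Union


def select_splits(
    available: Dict[str, object],
    split: Union[str, Tuple[str, ...]] = ("train", "test"),
) -> Union[object, List[object]]:
    """Hypothetical helper mirroring the documented return rule."""
    # Normalize a single split name to a one-element tuple.
    names = (split,) if isinstance(split, str) else tuple(split)
    unknown = [name for name in names if name not in available]
    if unknown:
        raise ValueError(f"unknown split(s): {unknown}")
    selected = [available[name] for name in names]
    # One requested split: return the dataset itself, not a list.
    return selected[0] if len(selected) == 1 else selected


# Placeholder objects stand in for real datasets.
fake = {"train": "TRAIN_DATASET", "test": "TEST_DATASET"}
both = select_splits(fake, ("train", "test"))
only_train = select_splits(fake, "train")
```

With both splits requested the result unpacks into two names, as in the example below; with a single split there is nothing to unpack.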

Examples

>>> from mindnlp.dataset.sequence_tagging.conll2000chunking import CoNLL2000Chunking
>>> root = "~/.mindnlp"
>>> split = ('train', 'test')
>>> dataset_train, dataset_test = CoNLL2000Chunking(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))

mindnlp.dataset.sequence_tagging.conll2000chunking.CoNLL2000Chunking_Process(dataset, vocab, batch_size=64, max_len=500, bucket_boundaries=None, drop_remainder=False)

Process the CoNLL2000Chunking dataset

Parameters:
  • dataset (GeneratorDataset) – CoNLL2000Chunking dataset.

  • vocab (Vocab) – Vocabulary object that stores the mapping between tokens and indices.

  • batch_size (int) – The number of rows each batch is created with. Default: 64.

  • max_len (int) – The max length of the sentence. Default: 500.

  • bucket_boundaries (list[int]) – A list consisting of the upper boundaries of the buckets. Must be strictly increasing. Default: None.

  • drop_remainder (bool) – Whether to discard the last batch when it contains fewer than batch_size rows, instead of passing it to the next operation. Default: False.
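
How bucket_boundaries could partition sentences by length can be sketched as follows. The function bucket_index is hypothetical, and the half-open convention (n boundaries give n + 1 buckets, with a length equal to a boundary falling into the next bucket) is an assumption modeled on MindSpore's bucket_batch_by_length, not something stated on this page.

```python
from bisect import bisect_right
from typing import List


def bucket_index(length: int, bucket_boundaries: List[int]) -> int:
    """Return the index of the bucket a sequence of this length falls into."""
    # Assumption: bucket i covers [boundaries[i-1], boundaries[i]),
    # with an extra open-ended bucket after the last boundary.
    pairs = zip(bucket_boundaries, bucket_boundaries[1:])
    if any(a >= b for a, b in pairs):
        raise ValueError("bucket_boundaries must be strictly increasing")
    return bisect_right(bucket_boundaries, length)


boundaries = [10, 20, 40]
lengths = [3, 10, 19, 40, 75]
assigned = [bucket_index(n, boundaries) for n in lengths]
print(assigned)  # [0, 1, 1, 3, 3]
```

Grouping sentences of similar length into the same bucket keeps the amount of padding per batch small.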

Returns:

  • dataset (MapDataset) – Dataset after the transforms are applied.

Raises:

TypeError – If input_column is not a string.
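
A rough, library-free sketch of what vocab, max_len, batch_size and drop_remainder control is given here. The helpers encode and batch and the toy vocabulary are hypothetical; truncating to max_len and padding each batch to its longest row are assumptions, since this page does not spell out the exact transforms.

```python
from typing import Dict, List


def encode(tokens: List[str], vocab: Dict[str, int], max_len: int) -> List[int]:
    """Truncate to max_len, then map tokens to indices (unknown -> <unk>)."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in tokens[:max_len]]


def batch(rows: List[List[int]], batch_size: int, pad: int,
          drop_remainder: bool = False) -> List[List[List[int]]]:
    """Group rows into batches, padding each batch to its longest row."""
    batches = []
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        if len(chunk) < batch_size and drop_remainder:
            break  # discard the short final batch
        width = max(len(row) for row in chunk)
        batches.append([row + [pad] * (width - len(row)) for row in chunk])
    return batches


toy_vocab = {"<pad>": 0, "<unk>": 1, "He": 2, "reckons": 3, "the": 4}
sentences = [["He", "reckons", "the", "deficit"], ["the"], ["He", "the"]]
rows = [encode(s, toy_vocab, max_len=3) for s in sentences]
kept = batch(rows, batch_size=2, pad=toy_vocab["<pad>"])
dropped = batch(rows, batch_size=2, pad=toy_vocab["<pad>"], drop_remainder=True)
```

With three rows and batch_size=2, kept holds two batches (the second with a single row), while dropped holds only the first full batch.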

Examples

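>>> from mindspore.dataset import text
>>> from mindnlp.dataset.sequence_tagging.conll2000chunking import CoNLL2000Chunking, CoNLL2000Chunking_Process
>>> dataset_train, dataset_test = CoNLL2000Chunking()
>>> vocab = text.Vocab.from_dataset(dataset_train, columns=["words"], freq_range=None,
...                                 top_k=None, special_tokens=["<pad>", "<unk>"], special_first=True)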
>>> dataset_train = CoNLL2000Chunking_Process(dataset=dataset_train, vocab=vocab,
...                                           batch_size=32, max_len=80)

class mindnlp.dataset.sequence_tagging.conll2000chunking.Conll2000chunking(path)

Bases: object

CoNLL2000Chunking dataset source

udpos

UDPOS load function

mindnlp.dataset.sequence_tagging.udpos.UDPOS(root: str = '~/.mindnlp', split: Union[Tuple[str], str] = ('train', 'dev', 'test'), proxies=None)

Load the UDPOS dataset

Parameters:
  • root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.

  • split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'dev', 'test').

  • proxies (dict) – A dict that specifies the proxies to use, for example: {"https": "https://127.0.0.1:7890"}. Default: None.
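
The shape of the proxies dict (URL scheme mapped to proxy address) can be illustrated with Python's standard library. This is only an illustration of the mapping; which HTTP client mindnlp uses internally is not stated here, and nothing below calls mindnlp.

```python
import urllib.request

# Same shape as the documented example: URL scheme -> proxy address.
proxies = {"https": "https://127.0.0.1:7890"}

# ProxyHandler routes requests for each listed scheme through its proxy.
handler = urllib.request.ProxyHandler(proxies)
opener = urllib.request.build_opener(handler)

# No request is sent here; opener.open(url) would go through the proxy.
```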

Returns:

  • datasets_list (list) – A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned directly instead of a list of datasets.

Examples

>>> from mindnlp.dataset.sequence_tagging.udpos import UDPOS
>>> root = "~/.mindnlp"
>>> split = ('train', 'dev', 'test')
>>> dataset_train, dataset_dev, dataset_test = UDPOS(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))

class mindnlp.dataset.sequence_tagging.udpos.Udpos(path)

Bases: object

UDPOS dataset source