Hugging Face Datasets

hf_imdb

Hugging Face IMDB load function

mindnlp.dataset.hf_datasets.hf_imdb.HF_IMDB(root: str = '~/.mindnlp', split: Union[Tuple[str], str] = ('train', 'test'), shuffle=True)

Load the Hugging Face IMDB dataset.

Parameters:
  • root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.

  • split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'test').

  • shuffle (bool) – Whether to shuffle the dataset. Default: True.

Returns:

  • datasets_list (list) – A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.
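
The return convention described above can be sketched in plain Python. The helper `select_splits` and the toy `toy` dictionary are hypothetical stand-ins and not part of mindnlp; the sketch only illustrates how a single split yields one object while several splits yield a list.

```python
from typing import Dict, Tuple, Union


def select_splits(available: Dict[str, list],
                  split: Union[Tuple[str, ...], str] = ("train", "test")):
    """Return one dataset for a single split, otherwise a list of datasets.

    Hypothetical helper that mimics the documented return convention.
    """
    if isinstance(split, str):
        # A single split name: return the dataset itself, not a list.
        return available[split]
    datasets_list = [available[name] for name in split]
    # Assumption: a one-element tuple also counts as "only one split".
    return datasets_list[0] if len(datasets_list) == 1 else datasets_list


# Toy data standing in for the real IMDB splits.
toy = {"train": ["good film", "dull plot"], "test": ["fine acting"]}
train_only = select_splits(toy, "train")
train_set, test_set = select_splits(toy, ("train", "test"))
```

Because the return type depends on `split`, tuple unpacking as in `train_set, test_set = ...` is only safe when more than one split is requested.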

Examples

>>> from mindnlp.dataset.hf_datasets.hf_imdb import HF_IMDB
>>> root = "~/.mindnlp"
>>> split = ('train', 'test')
>>> dataset_train, dataset_test = HF_IMDB(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))

mindnlp.dataset.hf_datasets.hf_imdb.HF_IMDB_Process(dataset, tokenizer, vocab, batch_size=64, max_len=500, bucket_boundaries=None, drop_remainder=False)

Process the IMDB dataset.

Parameters:
  • dataset (GeneratorDataset) – IMDB dataset.

  • tokenizer (TextTensorOperation) – Tokenizer used to tokenize the text dataset.

  • vocab (Vocab) – Vocabulary object that stores the mapping between tokens and indices.

  • batch_size (int) – Size of each batch.

  • max_len (int) – Maximum length of a sentence.

  • bucket_boundaries (list[int]) – A list consisting of the upper boundaries of the buckets.

  • drop_remainder (bool) – If True, drop the last batch of each bucket if it is not a full batch.

Returns:

  • dataset (MapDataset) – Dataset after transforms.

  • Vocab (Vocab) – Vocabulary created from the dataset.

Raises:

TypeError – If input_column is not a string.
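
How `max_len`, `bucket_boundaries`, `batch_size` and `drop_remainder` interact can be illustrated with a minimal pure-Python sketch. This is not the mindnlp implementation: the function `bucket_batches`, the `pad_id` parameter, and the choice to treat each boundary as an exclusive upper bound are assumptions made for illustration.

```python
from bisect import bisect_right
from typing import Dict, List


def bucket_batches(sequences: List[List[int]],
                   bucket_boundaries: List[int],
                   batch_size: int,
                   max_len: int,
                   drop_remainder: bool = False,
                   pad_id: int = 0) -> List[List[List[int]]]:
    """Group sequences of similar length, then cut each group into batches."""
    buckets: Dict[int, List[List[int]]] = {}
    for seq in sequences:
        seq = seq[:max_len]  # truncate to max_len
        # Index of the first boundary strictly greater than the length
        # (assumption: boundaries are exclusive upper bounds).
        idx = bisect_right(bucket_boundaries, len(seq))
        buckets.setdefault(idx, []).append(seq)

    batches = []
    for idx in sorted(buckets):
        group = buckets[idx]
        for start in range(0, len(group), batch_size):
            batch = group[start:start + batch_size]
            if drop_remainder and len(batch) < batch_size:
                continue  # skip the incomplete last batch of this bucket
            width = max(len(seq) for seq in batch)
            # Pad every sequence to the longest one in its batch.
            batches.append([seq + [pad_id] * (width - len(seq)) for seq in batch])
    return batches


# Toy token-id sequences with lengths 2, 1, 4, 5 and 3.
data = [[1, 2], [3], [4, 5, 6, 7], [8, 9, 10, 11, 12], [13, 14, 15]]
full = bucket_batches(data, bucket_boundaries=[3, 6], batch_size=2, max_len=5)
strict = bucket_batches(data, bucket_boundaries=[3, 6], batch_size=2, max_len=5,
                        drop_remainder=True)
```

With `drop_remainder=True` the single leftover sequence in the second bucket is discarded, so `strict` holds one batch fewer than `full`.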

Examples

>>> drop = 0.5  # dropout rate; illustrative value, not defined in the original example
>>> imdb_train, imdb_test = load_dataset('imdb', shuffle=True)
>>> embedding, vocab = Glove.from_pretrained('6B', 100, special_tokens=["<unk>", "<pad>"], dropout=drop)
>>> tokenizer = BasicTokenizer(True)
>>> imdb_train = process('hf_imdb', imdb_train, tokenizer=tokenizer, vocab=vocab,
...                      bucket_boundaries=[400, 500], max_len=600, drop_remainder=True)

class mindnlp.dataset.hf_datasets.hf_imdb.HFimdb(dataset_list)

Bases: object

Hugging Face IMDB dataset source

hf_glue

Hugging Face GLUE load function

mindnlp.dataset.hf_datasets.hf_glue.HF_GLUE(name: str, root: str = '~/.mindnlp', split: Union[Tuple[str], str] = ('train', 'test'), shuffle=True)

Load the Hugging Face GLUE dataset.

Parameters:
  • name (str) – Task name.

  • root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.

  • split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'test').

  • shuffle (bool) – Whether to shuffle the dataset. Default: True.

Returns:

  • datasets_list (list) – A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.

Examples

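>>> from mindnlp.dataset.hf_datasets.hf_glue import HF_GLUE
>>> name = 'mrpc'  # any GLUE task name; 'mrpc' is only an illustration
>>> root = "~/.mindnlp"
>>> split = ('train', 'test')
>>> dataset_train, dataset_test = HF_GLUE(name, root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))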

mindnlp.dataset.hf_datasets.hf_glue.HF_GLUE_Process(name, dataset, column=None, tokenizer=<mindnlp.transforms.tokenizers.basic_tokenizer.BasicTokenizer object>, vocab=None)

Process the GLUE dataset.

class mindnlp.dataset.hf_datasets.hf_glue.HFglue(dataset_list, name)

Bases: object

Hugging Face GLUE dataset source

hf_msra_ner

Hugging Face Msra_ner load function

mindnlp.dataset.hf_datasets.hf_msra_ner.HF_Msra_ner(root: str = '~/.mindnlp', split: Union[Tuple[str], str] = ('train', 'test'), shuffle=True)

Load the Hugging Face Msra_ner dataset.

Parameters:
  • root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.

  • split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'test').

  • shuffle (bool) – Whether to shuffle the dataset. Default: True.

Returns:

  • datasets_list (list) – A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.

Examples

>>> from mindnlp.dataset import HF_Msra_ner
>>> split = ('train', 'test')
>>> dataset_train, dataset_test = HF_Msra_ner(split=split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))

mindnlp.dataset.hf_datasets.hf_msra_ner.HF_Msra_ner_Process(dataset, tokenizer, batch_size=64, max_len=500, bucket_boundaries=None, drop_remainder=False)

Process the Msra_ner dataset.

Parameters:
  • dataset (GeneratorDataset) – Msra_ner dataset.

  • tokenizer (TextTensorOperation) – Tokenizer used to tokenize the text dataset.

  • batch_size (int) – Size of each batch.

  • max_len (int) – Maximum length of a sentence.

  • bucket_boundaries (list[int]) – A list consisting of the upper boundaries of the buckets.

  • drop_remainder (bool) – If True, drop the last batch of each bucket if it is not a full batch.

Returns:

  • dataset (MapDataset) – Dataset after transforms.

input_columns = [“tokens”, “ner_tags”], input_columns = [“tokens”, “seq_length”, “ner_tags”].

Raises:

TypeError – If input_column is not a string.
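
One possible reading of the two column lists above is that processing takes rows with `tokens` and `ner_tags` and emits rows with `tokens`, `seq_length` and `ner_tags`. The sketch below illustrates that reading only; `add_seq_length` is a hypothetical helper, the tag ids are made-up toy values, and none of it is taken from the mindnlp source.

```python
from typing import List, Tuple


def add_seq_length(tokens: List[str],
                   ner_tags: List[int],
                   max_len: int) -> Tuple[List[str], int, List[int]]:
    """Truncate a tagged sentence to max_len and report its length."""
    # Tokens and tags are truncated together so they stay aligned.
    tokens = tokens[:max_len]
    ner_tags = ner_tags[:max_len]
    return tokens, len(tokens), ner_tags


# Toy sentence with one tag id per token (values are illustrative).
row = add_seq_length(["我", "在", "北", "京"], [0, 0, 5, 6], max_len=3)
```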

Examples

>>> from mindnlp.transforms import BertTokenizer
>>> from mindnlp.dataset import HF_Msra_ner, HF_Msra_ner_Process
>>> dataset_train, dataset_test = HF_Msra_ner()
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
>>> dataset_train = HF_Msra_ner_Process(dataset_train, tokenizer=tokenizer,
...                                     batch_size=64, max_len=512)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))

class mindnlp.dataset.hf_datasets.hf_msra_ner.HFmsra_ner(dataset_list)

Bases: object

Hugging Face Msra_ner dataset source

hf_ptb_text_only

Hugging Face Ptb_text_only load function

mindnlp.dataset.hf_datasets.hf_ptb_text_only.HF_Ptb_text_only(root: str = '~/.mindnlp', split: Union[Tuple[str], str] = ('train', 'validation', 'test'), shuffle=True)

Load the Hugging Face Ptb_text_only dataset.

Parameters:
  • root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.

  • split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'validation', 'test').

  • shuffle (bool) – Whether to shuffle the dataset. Default: True.

Returns:

  • datasets_list (list) – A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned instead of a list of datasets.

Examples

>>> from mindnlp.dataset import HF_Ptb_text_only
>>> split = ('train', 'test')
>>> dataset_train, dataset_test = HF_Ptb_text_only(split=split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))

mindnlp.dataset.hf_datasets.hf_ptb_text_only.HF_Ptb_text_only_Process(dataset, column='sentence', tokenizer=<mindnlp.transforms.tokenizers.basic_tokenizer.BasicTokenizer object>, vocab=None, batch_size=64, max_len=500, bucket_boundaries=None, drop_remainder=False)

Process the Ptb_text_only dataset.

Parameters:
  • dataset (GeneratorDataset) – Ptb_text_only dataset.

  • column (str) – The column of the Ptb_text_only dataset to be transformed.

  • tokenizer (TextTensorOperation) – Tokenizer used to tokenize the text dataset.

  • vocab (Vocab) – Vocabulary object that stores the mapping between tokens and indices.

  • batch_size (int) – Size of each batch.

  • max_len (int) – Maximum length of a sentence.

  • bucket_boundaries (list[int]) – A list consisting of the upper boundaries of the buckets.

  • drop_remainder (bool) – If True, drop the last batch of each bucket if it is not a full batch.

Returns:

  • dataset (MapDataset) – Dataset after transforms.

Raises:

TypeError – If input_column is not a string.
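
The roles of `tokenizer`, `vocab` and `max_len` can be illustrated without mindnlp. In this minimal sketch, whitespace splitting stands in for the tokenizer and a plain dictionary stands in for the `Vocab` object; `build_vocab`, `encode` and the special tokens are hypothetical names chosen for illustration, not the library's API.

```python
from typing import Dict, Iterable, List, Sequence


def build_vocab(sentences: Iterable[str],
                special_tokens: Sequence[str] = ("<unk>", "<pad>")) -> Dict[str, int]:
    """Assign an index to every token, special tokens first."""
    token_to_idx = {tok: i for i, tok in enumerate(special_tokens)}
    for sentence in sentences:
        for token in sentence.lower().split():  # stand-in for a real tokenizer
            token_to_idx.setdefault(token, len(token_to_idx))
    return token_to_idx


def encode(sentence: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Tokenize, map tokens to indices, and truncate to max_len."""
    unk = vocab["<unk>"]
    ids = [vocab.get(tok, unk) for tok in sentence.lower().split()]
    return ids[:max_len]


corpus = ["the cat sat", "the dog ran"]
vocab = build_vocab(corpus)
ids = encode("the bird sat down", vocab, max_len=3)
```

Tokens missing from the vocabulary, such as "bird" here, fall back to the `<unk>` index, and anything beyond `max_len` tokens is cut off.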

Examples

>>> from mindnlp.dataset import HF_Ptb_text_only, HF_Ptb_text_only_Process
>>> dataset_train, dataset_test = HF_Ptb_text_only()
>>> dataset_train = HF_Ptb_text_only_Process(dataset_train)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))

class mindnlp.dataset.hf_datasets.hf_ptb_text_only.HFptb_text_only(dataset_list)

Bases: object

Hugging Face Ptb_text_only dataset source