Hugging Face Datasets
hf_imdb
Hugging Face IMDB load function
- mindnlp.dataset.hf_datasets.hf_imdb.HF_IMDB(root: str = '~/.mindnlp', split: Union[Tuple[str], str] = ('train', 'test'), shuffle=True)
Load the Hugging Face IMDB dataset.
- Parameters:
root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.
split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'test').
shuffle (bool) – Whether to shuffle the dataset. Default: True.
- Returns:
datasets_list (list) – A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned directly instead of a list.
Examples
>>> from mindnlp.dataset.hf_datasets.hf_imdb import HF_IMDB
>>> root = "~/.mindnlp"
>>> split = ('train', 'test')
>>> dataset_train, dataset_test = HF_IMDB(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
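The return convention described above (a single dataset for one split, a list for several) can be sketched in plain Python. This is only an illustration of the documented behavior, not the library's implementation: `select_splits` and the toy dictionary are hypothetical names, and treating a one-element tuple like a single string is an assumption.

```python
from typing import Dict, List, Tuple, Union


def select_splits(
    available: Dict[str, List[str]],
    split: Union[str, Tuple[str, ...]] = ("train", "test"),
):
    """Return one dataset for a single split, else a list in the requested order."""
    # Normalize a bare string such as 'train' to a one-element tuple.
    names = (split,) if isinstance(split, str) else tuple(split)
    missing = [name for name in names if name not in available]
    if missing:
        raise ValueError(f"unknown split(s): {missing}")
    selected = [available[name] for name in names]
    # Assumption: exactly one requested split is returned unwrapped.
    return selected[0] if len(selected) == 1 else selected


# Toy stand-ins for real datasets (hypothetical data).
toy = {"train": ["a", "b", "c"], "test": ["d"]}
train_only = select_splits(toy, "train")
train, test = select_splits(toy, ("train", "test"))
```

Because a single split comes back unwrapped, callers should unpack the result only when they requested more than one split.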
- mindnlp.dataset.hf_datasets.hf_imdb.HF_IMDB_Process(dataset, tokenizer, vocab, batch_size=64, max_len=500, bucket_boundaries=None, drop_remainder=False)
Process the IMDB dataset.
- Parameters:
dataset (GeneratorDataset) – IMDB dataset.
tokenizer (TextTensorOperation) – Tokenizer used to tokenize the text.
vocab (Vocab) – Vocabulary object that stores the mapping between tokens and indices.
batch_size (int) – Size of each batch.
max_len (int) – Maximum length of a sentence.
bucket_boundaries (list[int]) – A list consisting of the upper boundaries of the buckets.
drop_remainder (bool) – If True, the last batch of each bucket is dropped when it is not a full batch.
- Returns:
dataset (MapDataset) – Dataset after the transforms are applied.
vocab (Vocab) – Vocabulary created from the dataset.
- Raises:
TypeError – If input_column is not a string.
Examples
>>> drop = 0.5  # dropout rate; example value, undefined in the original snippet
>>> imdb_train, imdb_test = load_dataset('imdb', shuffle=True)
>>> embedding, vocab = Glove.from_pretrained('6B', 100, special_tokens=["<unk>", "<pad>"], dropout=drop)
>>> tokenizer = BasicTokenizer(True)
>>> imdb_train = process('hf_imdb', imdb_train, tokenizer=tokenizer, vocab=vocab,
...                      bucket_boundaries=[400, 500], max_len=600, drop_remainder=True)
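To make the roles of bucket_boundaries, batch_size and drop_remainder concrete, here is a minimal pure-Python sketch of length bucketing. It is not the code used by HF_IMDB_Process. The helper `bucket_batches` is hypothetical, and it assumes that n boundaries define n + 1 buckets, with each boundary acting as an exclusive upper bound.

```python
from bisect import bisect_right
from collections import defaultdict
from typing import Dict, List, Sequence


def bucket_batches(
    samples: Sequence[List[int]],
    bucket_boundaries: List[int],
    batch_size: int,
    drop_remainder: bool = False,
) -> List[List[List[int]]]:
    """Group samples by length bucket, then cut each bucket into batches."""
    buckets: Dict[int, List[List[int]]] = defaultdict(list)
    for sample in samples:
        # Bucket 0 holds lengths below the first boundary; the last bucket
        # holds lengths at or above the final boundary (assumed convention).
        index = bisect_right(bucket_boundaries, len(sample))
        buckets[index].append(sample)

    batches = []
    for index in sorted(buckets):
        items = buckets[index]
        for start in range(0, len(items), batch_size):
            batch = items[start:start + batch_size]
            # Optionally skip the final, partially filled batch of a bucket.
            if drop_remainder and len(batch) < batch_size:
                continue
            batches.append(batch)
    return batches


# Five toy sentences of lengths 2, 3, 5, 5 and 7 (hypothetical data).
sentences = [[0] * n for n in (2, 3, 5, 5, 7)]
kept_all = bucket_batches(sentences, [4, 6], batch_size=2)
full_only = bucket_batches(sentences, [4, 6], batch_size=2, drop_remainder=True)
```

With drop_remainder=True the lone length-7 sentence is discarded because its bucket cannot fill a batch, which is the trade-off to weigh when choosing boundaries for a small dataset.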
hf_glue
Hugging Face GLUE load function
- mindnlp.dataset.hf_datasets.hf_glue.HF_GLUE(name: str, root: str = '~/.mindnlp', split: Union[Tuple[str], str] = ('train', 'test'), shuffle=True)
Load the Hugging Face GLUE dataset.
- Parameters:
name (str) – Name of the GLUE task to load.
root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.
split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'test').
shuffle (bool) – Whether to shuffle the dataset. Default: True.
- Returns:
datasets_list (list) – A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned directly instead of a list.
Examples
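>>> from mindnlp.dataset.hf_datasets.hf_glue import HF_GLUE
>>> name = "cola"  # task name is required; "cola" is an illustrative choice
>>> root = "~/.mindnlp"
>>> split = ('train', 'test')
>>> dataset_train, dataset_test = HF_GLUE(name, root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))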
hf_msra_ner
Hugging Face Msra_ner load function
- mindnlp.dataset.hf_datasets.hf_msra_ner.HF_Msra_ner(root: str = '~/.mindnlp', split: Union[Tuple[str], str] = ('train', 'test'), shuffle=True)
Load the Hugging Face Msra_ner dataset.
- Parameters:
root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.
split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'test').
shuffle (bool) – Whether to shuffle the dataset. Default: True.
- Returns:
datasets_list (list) – A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned directly instead of a list.
Examples
>>> from mindnlp.dataset import HF_Msra_ner
>>> split = ('train', 'test')
>>> dataset_train, dataset_test = HF_Msra_ner(split=split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
- mindnlp.dataset.hf_datasets.hf_msra_ner.HF_Msra_ner_Process(dataset, tokenizer, batch_size=64, max_len=500, bucket_boundaries=None, drop_remainder=False)
Process the Msra_ner dataset.
- Parameters:
dataset (GeneratorDataset) – Msra_ner dataset.
tokenizer (TextTensorOperation) – Tokenizer used to tokenize the text.
batch_size (int) – Size of each batch.
max_len (int) – Maximum length of a sentence.
bucket_boundaries (list[int]) – A list consisting of the upper boundaries of the buckets.
drop_remainder (bool) – If True, the last batch of each bucket is dropped when it is not a full batch.
- Returns:
dataset (MapDataset) - dataset after transforms.
input_columns = ["tokens", "ner_tags"], input_columns = ["tokens", "seq_length", "ner_tags"].
- Raises:
TypeError – If input_column is not a string.
Examples
>>> from mindnlp.transforms import BertTokenizer
>>> from mindnlp.dataset import HF_Msra_ner, HF_Msra_ner_Process
>>> dataset_train, dataset_test = HF_Msra_ner()
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
>>> dataset_train = HF_Msra_ner_Process(dataset_train, tokenizer=tokenizer,
...                                     batch_size=64, max_len=512)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
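The column names listed under Returns suggest that each sample ends up with token ids, a sequence length and NER tags. The sketch below shows one plausible way max_len could shape such a sample by truncating, recording the length and then padding. It is an assumption-laden illustration rather than the library's behavior: `pad_or_truncate` and the padding values of 0 are hypothetical, and whether padding is applied per sample or per batch is not stated in the source.

```python
from typing import List, Tuple


def pad_or_truncate(
    tokens: List[int],
    ner_tags: List[int],
    max_len: int,
    pad_token: int = 0,  # hypothetical padding id for tokens
    pad_tag: int = 0,    # hypothetical padding id for tags
) -> Tuple[List[int], int, List[int]]:
    """Return (tokens, seq_length, ner_tags), each sequence exactly max_len long."""
    # Truncate both sequences together so tags stay aligned with tokens.
    tokens = tokens[:max_len]
    ner_tags = ner_tags[:max_len]
    # Record the unpadded length before adding padding.
    seq_length = len(tokens)
    padding = max_len - seq_length
    return tokens + [pad_token] * padding, seq_length, ner_tags + [pad_tag] * padding


# A short sample is padded; a long one is truncated (toy ids, not real vocabulary).
short_tokens, short_len, short_tags = pad_or_truncate([101, 2769, 102], [0, 1, 0], max_len=5)
long_tokens, long_len, long_tags = pad_or_truncate([1, 2, 3, 4, 5, 6], [0, 0, 1, 2, 0, 0], max_len=4)
```

Keeping the unpadded length alongside the padded sequences lets a model mask out padding positions when it computes a tagging loss.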
hf_ptb_text_only
Hugging Face Ptb_text_only load function
- mindnlp.dataset.hf_datasets.hf_ptb_text_only.HF_Ptb_text_only(root: str = '~/.mindnlp', split: Union[Tuple[str], str] = ('train', 'validation', 'test'), shuffle=True)
Load the Hugging Face Ptb_text_only dataset.
- Parameters:
root (str) – Directory where the datasets are saved. Default: ~/.mindnlp.
split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'validation', 'test').
shuffle (bool) – Whether to shuffle the dataset. Default: True.
- Returns:
datasets_list (list) – A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned directly instead of a list.
Examples
>>> from mindnlp.dataset import HF_Ptb_text_only
>>> split = ('train', 'test')
>>> dataset_train, dataset_test = HF_Ptb_text_only(split=split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
- mindnlp.dataset.hf_datasets.hf_ptb_text_only.HF_Ptb_text_only_Process(dataset, column='sentence', tokenizer=<mindnlp.transforms.tokenizers.basic_tokenizer.BasicTokenizer object>, vocab=None, batch_size=64, max_len=500, bucket_boundaries=None, drop_remainder=False)
Process the Ptb_text_only dataset.
- Parameters:
dataset (GeneratorDataset) – Ptb_text_only dataset.
column (str) – Column of the Ptb_text_only dataset to be transformed.
tokenizer (TextTensorOperation) – Tokenizer used to tokenize the text.
vocab (Vocab) – Vocabulary object that stores the mapping between tokens and indices.
batch_size (int) – Size of each batch.
max_len (int) – Maximum length of a sentence.
bucket_boundaries (list[int]) – A list consisting of the upper boundaries of the buckets.
drop_remainder (bool) – If True, the last batch of each bucket is dropped when it is not a full batch.
- Returns:
dataset (MapDataset) - dataset after transforms.
- Raises:
TypeError – If input_column is not a string.
Examples
>>> from mindnlp.dataset import HF_Ptb_text_only, HF_Ptb_text_only_Process
>>> dataset_train, dataset_test = HF_Ptb_text_only()
>>> dataset_train = HF_Ptb_text_only_Process(dataset_train)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
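The tokenizer, vocab and max_len parameters describe a familiar pipeline: split text into tokens, map each token to an index, and cap the sequence length. The following is a rough stand-alone sketch of that idea. It assumes whitespace splitting in place of BasicTokenizer and a plain dict in place of the Vocab class, and `build_vocab` and `encode` are hypothetical helpers, not mindnlp functions.

```python
from typing import Dict, Iterable, List, Sequence


def build_vocab(
    sentences: Iterable[str],
    special_tokens: Sequence[str] = ("<unk>", "<pad>"),
) -> Dict[str, int]:
    """Assign consecutive indices to special tokens first, then to new tokens in order seen."""
    vocab = {token: index for index, token in enumerate(special_tokens)}
    for sentence in sentences:
        # Lowercased whitespace split stands in for a real tokenizer.
        for token in sentence.lower().split():
            vocab.setdefault(token, len(vocab))
    return vocab


def encode(sentence: str, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Map tokens to indices, fall back to <unk>, and truncate to max_len."""
    unk = vocab["<unk>"]
    ids = [vocab.get(token, unk) for token in sentence.lower().split()]
    return ids[:max_len]


# Toy corpus (hypothetical data).
corpus = ["the cat sat", "the dog sat down"]
vocab = build_vocab(corpus)
ids = encode("the bird sat down quietly", vocab, max_len=4)
```

Tokens missing from the vocabulary map to the `<unk>` index, so a vocabulary built from the training split can still encode unseen text.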