Question Answer
squad1
SQuAD1 load function
- mindnlp.dataset.question_answer.squad1.SQuAD1(root: str = '/home/docs/checkouts/readthedocs.org/user_builds/mindnlpdoc/checkouts/latest/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'dev'), proxies=None)[source]
Load the SQuAD1 dataset
- Parameters:
root (str) – Directory where the datasets are saved.
split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'dev').
proxies (dict) – A dict specifying the proxies to use, for example: {"https": "https://127.0.0.1:7890"}.
- Returns:
datasets_list (list) – A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned directly instead of a list of datasets.
- Raises:
TypeError – If root is not a string.
TypeError – If split is not a string or Tuple[str].
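The argument checks and return shape described above can be sketched in plain Python. This is only an illustration under stated assumptions: `load_splits` is a hypothetical stand-in, not the library's implementation, and placeholder strings take the place of real dataset objects.

```python
from typing import Tuple, Union


def load_splits(root: str, split: Union[str, Tuple[str, ...]] = ("train", "dev")):
    """Hypothetical stand-in that mirrors the documented checks and return shape."""
    if not isinstance(root, str):
        raise TypeError(f"root must be a str, got {type(root).__name__}")
    if isinstance(split, str):
        names = [split]
    elif isinstance(split, tuple) and all(isinstance(s, str) for s in split):
        names = list(split)
    else:
        raise TypeError("split must be a str or a tuple of str")
    # Placeholder strings stand in for the real dataset objects.
    datasets = [f"<{name} dataset under {root}>" for name in names]
    # A single requested split is returned bare rather than wrapped in a list.
    return datasets[0] if len(datasets) == 1 else datasets


single = load_splits("~/.mindnlp", "dev")
train, dev = load_splits("~/.mindnlp", ("train", "dev"))
```

Passing a string yields one object, while passing a tuple yields a list that can be unpacked, which is why the documented example unpacks two datasets.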
Examples
>>> root = "~/.mindnlp"
>>> split = ('train', 'dev')
>>> dataset_train, dataset_dev = SQuAD1(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
{'context': Tensor(shape=[], dtype=String, value= 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome ...'),
'question': Tensor(shape=[], dtype=String, value= 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'),
'answers': Tensor(shape=[1], dtype=String, value= ['Saint Bernadette Soubirous']),
'answers_start': Tensor(shape=[1], dtype=Int32, value= [515])}
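In the original SQuAD JSON files, the answer start is a character offset into the context. Assuming `answers_start` carries the same meaning here (the page does not state this explicitly), the answer text can be recovered by slicing. The sketch below uses plain Python strings and a made-up context instead of tensors; `answer_span` is a hypothetical helper.

```python
def answer_span(context: str, answer: str, start: int) -> str:
    """Return the slice of `context` that a character offset points at."""
    return context[start:start + len(answer)]


# Made-up sample, not taken from the dataset.
context = "The grotto is a replica of the one at Lourdes, France."
answer = "Lourdes"
start = context.index(answer)  # character offset of the answer

recovered = answer_span(context, answer, start)
```

A check of this kind is a cheap way to confirm that offsets still line up after any text normalization.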
- mindnlp.dataset.question_answer.squad1.SQuAD1_Process(dataset, char_vocab, word_vocab=None, tokenizer=<mindnlp.transforms.tokenizers.basic_tokenizer.BasicTokenizer object>, max_context_len=768, max_question_len=64, max_char_len=48, batch_size=64, drop_remainder=False)[source]
Process the SQuAD1 dataset.
- Parameters:
dataset (GeneratorDataset) – SQuAD1 dataset.
tokenizer (TextTensorOperation) – Tokenizer you choose to tokenize the text dataset.
word_vocab (Vocab) – Vocabulary object of words, used to store the mapping of the token and index.
char_vocab (Vocab) – Vocabulary object of chars, used to store the mapping of the token and index.
max_context_len (int) – Max length of the context. Default: 768.
max_question_len (int) – Max length of the question. Default: 64.
max_char_len (int) – Max length of the char. Default: 48.
batch_size (int) – The number of rows each batch is created with. Default: 64.
drop_remainder (bool) – Whether to discard the last batch, rather than pass it to the next operation, when it contains fewer than batch_size rows. Default: False.
- Returns:
MapDataset, the SQuAD1 dataset after processing.
- Raises:
TypeError – If word_vocab is not of type text.Vocab.
TypeError – If char_vocab is not of type text.Vocab.
TypeError – If max_context_len is not of type int.
TypeError – If max_question_len is not of type int.
TypeError – If max_char_len is not of type int.
TypeError – If batch_size is not of type int.
TypeError – If drop_remainder is not of type bool.
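The length parameters suggest that sequences are cut or padded to a fixed size. The following is a minimal sketch of that idea, not the library's actual implementation: `pad_or_truncate` and `chars_to_ids` are hypothetical helpers, and the `<unk>`/`<pad>` ids of 0 and 1 are assumptions borrowed from the example vocabulary on this page.

```python
from typing import Dict, List


def pad_or_truncate(ids: List[int], max_len: int, pad_id: int) -> List[int]:
    """Cut `ids` to `max_len`, or right-pad it with `pad_id` up to `max_len`."""
    return ids[:max_len] + [pad_id] * max(0, max_len - len(ids))


def chars_to_ids(tokens: List[str], char_to_id: Dict[str, int],
                 max_tokens: int, max_char_len: int,
                 unk_id: int = 0, pad_id: int = 1) -> List[List[int]]:
    """Map each token to a fixed-width row of character ids."""
    rows = []
    for token in tokens[:max_tokens]:
        ids = [char_to_id.get(ch, unk_id) for ch in token]
        rows.append(pad_or_truncate(ids, max_char_len, pad_id))
    # Fill the remaining token slots with all-padding rows.
    while len(rows) < max_tokens:
        rows.append([pad_id] * max_char_len)
    return rows


char_to_id = {"<unk>": 0, "<pad>": 1, "e": 2, "t": 3}
grid = chars_to_ids(["tee", "x"], char_to_id, max_tokens=3, max_char_len=4)
```

With `max_tokens=3` and `max_char_len=4`, the result is always a 3 by 4 grid of ids, whatever the input length.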
Examples
>>> from mindspore.dataset import text
>>> from mindnlp.dataset import SQuAD1, SQuAD1_Process
>>> char_dic = {"<unk>": 0, "<pad>": 1, "e": 2, "t": 3, "a": 4, "i": 5, "n": 6, "o": 7,
...             "s": 8, "r": 9, "h": 10, "l": 11, "d": 12, "c": 13, "u": 14, "m": 15,
...             "f": 16, "p": 17, "g": 18, "w": 19, "y": 20, "b": 21, ",": 22, "v": 23,
...             ".": 24, "k": 25, "1": 26, "0": 27, "x": 28, "2": 29, "\"": 30, "-": 31,
...             "j": 32, "9": 33, "'": 34, ")": 35, "(": 36, "?": 37, "z": 38, "5": 39,
...             "8": 40, "q": 41, "3": 42, "4": 43, "7": 44, "6": 45, ";": 46, ":": 47,
...             "–": 48, "%": 49, "/": 50, "]": 51, "[": 52}
>>> char_vocab = text.Vocab.from_dict(char_dic)
>>> dev_dataset = SQuAD1(split='dev')
>>> squad_dev = SQuAD1_Process(dataset=dev_dataset, char_vocab=char_vocab)
>>> squad_dev = squad_dev.create_tuple_iterator()
>>> print(next(squad_dev))
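The character dictionary in the example is written out by hand. Its ordering resembles a frequency ranking, though the page does not say how it was built. As a sketch under that assumption, a similar dictionary could be derived from a corpus with the standard library; `build_char_dict` is a hypothetical helper, and the lowercasing and whitespace skipping are illustrative choices.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def build_char_dict(texts: Iterable[str],
                    specials: Tuple[str, ...] = ("<unk>", "<pad>")) -> Dict[str, int]:
    """Assign ids to special tokens first, then to characters by descending frequency."""
    counts = Counter(
        ch for text in texts for ch in text.lower() if not ch.isspace()
    )
    char_dict = {tok: i for i, tok in enumerate(specials)}
    for ch, _ in counts.most_common():
        char_dict[ch] = len(char_dict)
    return char_dict


char_dic = build_char_dict(["tee"])
```

A dictionary produced this way has the same shape as the hand-written `char_dic` above, so it could be passed to `text.Vocab.from_dict` in the same manner.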
squad2
SQuAD2 load function
- mindnlp.dataset.question_answer.squad2.SQuAD2(root: str = '/home/docs/checkouts/readthedocs.org/user_builds/mindnlpdoc/checkouts/latest/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'dev'), proxies=None)[source]
Load the SQuAD2 dataset
- Parameters:
root (str) – Directory where the datasets are saved.
split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'dev').
proxies (dict) – A dict specifying the proxies to use, for example: {"https": "https://127.0.0.1:7890"}.
- Returns:
datasets_list (list) – A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned directly instead of a list of datasets.
- Raises:
TypeError – If root is not a string.
TypeError – If split is not a string or Tuple[str].
Examples
>>> root = "~/.mindnlp"
>>> split = ('train', 'dev')
>>> dataset_train, dataset_dev = SQuAD2(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
[Tensor(shape=[], dtype=String, value= 'Beyoncé Giselle Knowles-Carter...'),
Tensor(shape=[], dtype=String, value= 'When did Beyonce start becoming popular?'),
Tensor(shape=[1], dtype=String, value= ['in the late 1990s']),
Tensor(shape=[1], dtype=Int32, value= [269])]
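The tuple iterator yields each row as a positional list, so it can help to attach names to the fields. The sketch below assumes the column order is context, question, answers, answers_start, inferred from the printed row above and the SQuAD1 example; plain Python values stand in for tensors, and `name_columns` is a hypothetical helper.

```python
from typing import Any, Dict, Sequence

# Assumed column order, inferred from the examples on this page.
COLUMNS = ("context", "question", "answers", "answers_start")


def name_columns(row: Sequence[Any]) -> Dict[str, Any]:
    """Pair each positional field of a row with its assumed column name."""
    if len(row) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(row)}")
    return dict(zip(COLUMNS, row))


# Plain values standing in for the tensors in the printed row.
row = [
    "Beyoncé Giselle Knowles-Carter...",
    "When did Beyonce start becoming popular?",
    ["in the late 1990s"],
    [269],
]
record = name_columns(row)
```

The length check makes a mismatch between the assumed and actual column layout fail loudly instead of silently mislabeling fields.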