Question Answer

squad1

SQuAD1 load function

mindnlp.dataset.question_answer.squad1.SQuAD1(root: str = '/home/docs/checkouts/readthedocs.org/user_builds/mindnlpdoc/checkouts/latest/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'dev'), proxies=None)

Load the SQuAD1 dataset

Parameters:

root (str) – Directory where the datasets are saved.
split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'dev').
proxies (dict) – A dict specifying the proxies to use, for example: {"https": "https://127.0.0.1:7890"}.

Returns:

datasets_list (list) – A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned directly instead of a list of datasets.

Raises:

TypeError – If root is not a string.
TypeError – If split is not a string or Tuple[str].

Examples

>>> from mindnlp.dataset import SQuAD1
>>> root = "~/.mindnlp"
>>> split = ('train', 'dev')
>>> dataset_train, dataset_dev = SQuAD1(root, split)
>>> train_iter = dataset_train.create_dict_iterator()
>>> print(next(train_iter))
{'context': Tensor(shape=[], dtype=String, value= 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome ...'),
'question': Tensor(shape=[], dtype=String, value= 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?'),
'answers': Tensor(shape=[1], dtype=String, value= ['Saint Bernadette Soubirous']),
'answers_start': Tensor(shape=[1], dtype=Int32, value= [515])}
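
In the SQuAD format, the value in `answers_start` is a character offset into `context`. A minimal sketch of checking that alignment on plain Python values follows; the record is a made-up toy example rather than a row from the dataset, and `answer_span_matches` is a hypothetical helper that is not part of mindnlp.

```python
def answer_span_matches(context: str, answer: str, start: int) -> bool:
    """Return True if `answer` occurs in `context` exactly at character offset `start`."""
    return context[start:start + len(answer)] == answer


# Toy record, invented for illustration only.
context = "The dome is topped by a golden statue."
answer = "a golden statue"
start = context.index(answer)  # character offset, as stored in `answers_start`

print(answer_span_matches(context, answer, start))      # True
print(answer_span_matches(context, answer, start + 1))  # False
```

A check like this can help catch offsets that drift after text normalization, since any change to `context` before the answer shifts the expected start.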

mindnlp.dataset.question_answer.squad1.SQuAD1_Process(dataset, char_vocab, word_vocab=None, tokenizer=<mindnlp.transforms.tokenizers.basic_tokenizer.BasicTokenizer object>, max_context_len=768, max_question_len=64, max_char_len=48, batch_size=64, drop_remainder=False)

Process the SQuAD1 dataset

Parameters:

dataset (GeneratorDataset) – SQuAD1 dataset.
tokenizer (TextTensorOperation) – Tokenizer you choose to tokenize the text dataset.
word_vocab (Vocab) – Vocabulary object of words, used to store the mapping of the token and index.
char_vocab (Vocab) – Vocabulary object of chars, used to store the mapping of the token and index.
max_context_len (int) – Max length of the context. Default: 768.
max_question_len (int) – Max length of the question. Default: 64.
max_char_len (int) – Max length of the char. Default: 48.
batch_size (int) – The number of rows each batch is created with. Default: 64.
drop_remainder (bool) – Whether to discard the last batch when it contains fewer than batch_size entries, instead of passing it to the next operation. Default: False.

Returns:

MapDataset, the SQuAD1 dataset after processing.

Raises:

TypeError – If word_vocab is not of type text.Vocab.
TypeError – If char_vocab is not of type text.Vocab.
TypeError – If max_context_len is not of type int.
TypeError – If max_question_len is not of type int.
TypeError – If max_char_len is not of type int.
TypeError – If batch_size is not of type int.
TypeError – If drop_remainder is not of type bool.

Examples

>>> from mindspore.dataset import text
>>> from mindnlp.dataset import SQuAD1, SQuAD1_Process
>>> char_dic = {"<unk>": 0, "<pad>": 1, "e": 2, "t": 3, "a": 4, "i": 5, "n": 6,
...             "o": 7, "s": 8, "r": 9, "h": 10, "l": 11, "d": 12, "c": 13, "u": 14,
...             "m": 15, "f": 16, "p": 17, "g": 18, "w": 19, "y": 20, "b": 21, ",": 22,
...             "v": 23, ".": 24, "k": 25, "1": 26, "0": 27, "x": 28, "2": 29, '"': 30,
...             "-": 31, "j": 32, "9": 33, "'": 34, ")": 35, "(": 36, "?": 37, "z": 38,
...             "5": 39, "8": 40, "q": 41, "3": 42, "4": 43, "7": 44, "6": 45, ";": 46,
...             ":": 47, "–": 48, "%": 49, "/": 50, "]": 51, "[": 52}
>>> char_vocab = text.Vocab.from_dict(char_dic)
>>> dev_dataset = SQuAD1(split='dev')
>>> squad_dev = SQuAD1_Process(dataset=dev_dataset, char_vocab=char_vocab)
>>> squad_dev = squad_dev.create_tuple_iterator()
>>> print(next(squad_dev))
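
The length parameters above suggest that contexts, questions, and per-token character sequences are brought to fixed sizes. The following is a rough sketch of that idea in plain Python, assuming truncation followed by right-padding with the `<pad>` index. `pad_or_truncate` and `encode_chars` are hypothetical helpers for illustration and do not reflect how `SQuAD1_Process` is actually implemented.

```python
from typing import Dict, List

PAD, UNK = "<pad>", "<unk>"


def pad_or_truncate(ids: List[int], max_len: int, pad_id: int) -> List[int]:
    """Cut `ids` to `max_len` items, then right-pad with `pad_id` up to `max_len`."""
    ids = ids[:max_len]
    return ids + [pad_id] * (max_len - len(ids))


def encode_chars(tokens: List[str], vocab: Dict[str, int],
                 max_tokens: int, max_char_len: int) -> List[List[int]]:
    """Map each token to a fixed-length row of character ids."""
    rows = [
        pad_or_truncate([vocab.get(ch, vocab[UNK]) for ch in tok],
                        max_char_len, vocab[PAD])
        for tok in tokens[:max_tokens]
    ]
    # Pad the token dimension with all-<pad> rows.
    empty = [vocab[PAD]] * max_char_len
    return rows + [list(empty) for _ in range(max_tokens - len(rows))]


# Tiny vocabulary reusing the special-token indices from the example above.
vocab = {UNK: 0, PAD: 1, "e": 2, "t": 3, "a": 4}
encoded = encode_chars(["tea", "at", "zeta"], vocab, max_tokens=4, max_char_len=3)
print(encoded)
# [[3, 2, 4], [4, 3, 1], [0, 2, 3], [1, 1, 1]]
```

Here "zeta" is truncated to three characters and its unknown "z" maps to the `<unk>` index, while the fourth row exists only to fill the token dimension.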

class mindnlp.dataset.question_answer.squad1.Squad1(path)

Bases: object

SQuAD1 dataset source
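
A dataset source of this kind typically reads the SQuAD JSON file and yields one row per question. The sketch below assumes the public SQuAD v1.1 JSON layout (`data`, then `paragraphs`, then `qas`, then `answers`). `iter_squad_rows` is a hypothetical function and the inline record is invented for illustration; this is not the actual `Squad1` implementation.

```python
import json
from typing import Iterator, List, Tuple

Row = Tuple[str, str, List[str], List[int]]


def iter_squad_rows(raw: dict) -> Iterator[Row]:
    """Yield (context, question, answers, answers_start) for every question."""
    for article in raw["data"]:
        for paragraph in article["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                answers = [a["text"] for a in qa["answers"]]
                starts = [a["answer_start"] for a in qa["answers"]]
                yield context, qa["question"], answers, starts


# Invented one-question file content in the SQuAD v1.1 layout.
raw = json.loads("""
{"version": "1.1", "data": [{"title": "Toy", "paragraphs": [{
  "context": "Paris is the capital of France.",
  "qas": [{"id": "q1", "question": "What is the capital of France?",
           "answers": [{"text": "Paris", "answer_start": 0}]}]}]}]}
""")

rows = list(iter_squad_rows(raw))
print(rows[0])
```

The four yielded fields mirror the column names shown in the SQuAD1 example output: `context`, `question`, `answers`, and `answers_start`.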

squad2

SQuAD2 load function

mindnlp.dataset.question_answer.squad2.SQuAD2(root: str = '/home/docs/checkouts/readthedocs.org/user_builds/mindnlpdoc/checkouts/latest/docs/.mindnlp', split: Union[Tuple[str], str] = ('train', 'dev'), proxies=None)

Load the SQuAD2 dataset

Parameters:

root (str) – Directory where the datasets are saved.
split (str|Tuple[str]) – Split or splits to be returned. Default: ('train', 'dev').
proxies (dict) – A dict specifying the proxies to use, for example: {"https": "https://127.0.0.1:7890"}.

Returns:

datasets_list (list) – A list of loaded datasets. If only one split is specified, such as 'train', that dataset is returned directly instead of a list of datasets.

Raises:

TypeError – If root is not a string.
TypeError – If split is not a string or Tuple[str].

Examples

>>> root = "~/.mindnlp"
>>> split = ('train', 'dev')
>>> dataset_train, dataset_dev = SQuAD2(root, split)
>>> train_iter = dataset_train.create_tuple_iterator()
>>> print(next(train_iter))
[Tensor(shape=[], dtype=String, value= 'Beyoncé Giselle Knowles-Carter...'),
Tensor(shape=[], dtype=String, value= 'When did Beyonce start becoming popular?'),
Tensor(shape=[1], dtype=String, value= ['in the late 1990s']),
Tensor(shape=[1], dtype=Int32, value= [269])]

class mindnlp.dataset.question_answer.squad2.Squad2(path)

Bases: object

SQuAD2 dataset source
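
SQuAD 2.0 differs from SQuAD 1.1 mainly in that it adds unanswerable questions, which the public JSON marks with `is_impossible` and an empty `answers` list. A minimal sketch of separating the two kinds of question under that assumption follows. `split_by_answerability` is a hypothetical helper and the records are invented, so this says nothing about how `Squad2` represents such rows internally.

```python
from typing import Dict, List, Tuple


def split_by_answerability(qas: List[Dict]) -> Tuple[List[Dict], List[Dict]]:
    """Partition question records into (answerable, unanswerable)."""
    answerable, unanswerable = [], []
    for qa in qas:
        # Treat a question as unanswerable if flagged or if it has no answers.
        if qa.get("is_impossible", False) or not qa.get("answers"):
            unanswerable.append(qa)
        else:
            answerable.append(qa)
    return answerable, unanswerable


# Invented records in the SQuAD 2.0 `qas` layout.
qas = [
    {"id": "q1", "question": "Who wrote it?", "is_impossible": False,
     "answers": [{"text": "Ada", "answer_start": 4}]},
    {"id": "q2", "question": "When was it banned?", "is_impossible": True,
     "answers": []},
]

answerable, unanswerable = split_by_answerability(qas)
print(len(answerable), len(unanswerable))  # 1 1
```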