Utils

decompress

Decompress functions

mindnlp.utils.decompress.ungz(file_path: str, unzip_path: Optional[str] = None)[source]

Decompress a .gz file.

Parameters:
  • file_path (str) – The path where the .gz file is located.

  • unzip_path (str) – The directory to which the file is decompressed.

Returns:

The directory where the files were unzipped.

Return type:

  • unzip_path (str)

Raises:
  • TypeError – If file_path is not a string.

  • TypeError – If unzip_path is not a string.
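
The behavior documented for ungz can be approximated with the standard library alone. The sketch below is an illustration, not mindnlp's implementation: the name ungz_sketch is hypothetical, and defaulting to the archive's own directory when unzip_path is omitted is an assumption.

```python
import gzip
import shutil
from pathlib import Path
from typing import Optional


def ungz_sketch(file_path: str, unzip_path: Optional[str] = None) -> str:
    """Decompress one .gz file and return the target directory (stand-in only)."""
    if not isinstance(file_path, str):
        raise TypeError("file_path must be a string")
    if unzip_path is not None and not isinstance(unzip_path, str):
        raise TypeError("unzip_path must be a string")

    source = Path(file_path)
    # Assumption: fall back to the archive's directory when no target is given.
    target_dir = Path(unzip_path) if unzip_path is not None else source.parent
    target_dir.mkdir(parents=True, exist_ok=True)

    # Path.stem drops the trailing ".gz", e.g. "data.txt.gz" becomes "data.txt".
    target_file = target_dir / source.stem
    with gzip.open(source, "rb") as src, open(target_file, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return str(target_dir)
```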

mindnlp.utils.decompress.untar(file_path: str, untar_path: str)[source]

Extract a tar.gz file.

Parameters:
  • file_path (str) – The path where the tgz file is located.

  • untar_path (str) – The directory to which the files are extracted.

Returns:

  • names (list) – All filenames in the tar.gz file.

Raises:
  • TypeError – If file_path is not a string.

  • TypeError – If untar_path is not a string.

Examples

>>> file_path = "./mindnlp/datasets/IWSLT2016/2016-01.tgz"
>>> untar_path = "./mindnlp/datasets/IWSLT2016"
>>> output = untar(file_path, untar_path)
>>> print(output[0])
'2016-01'
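
For readers who want to see what extracting an archive and returning its member names looks like, here is a minimal standard-library sketch. The function name untar_sketch is hypothetical and the code is not taken from mindnlp; it only mirrors the documented parameters, return value, and TypeError checks.

```python
import os
import tarfile
from typing import List


def untar_sketch(file_path: str, untar_path: str) -> List[str]:
    """Extract a gzip-compressed tar archive and list its members (stand-in only)."""
    if not isinstance(file_path, str):
        raise TypeError("file_path must be a string")
    if not isinstance(untar_path, str):
        raise TypeError("untar_path must be a string")

    os.makedirs(untar_path, exist_ok=True)
    with tarfile.open(file_path, "r:gz") as archive:
        # Member names are recorded in archive order.
        names = archive.getnames()
        # Only extract archives that come from a trusted source.
        archive.extractall(untar_path)
    return names
```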

mindnlp.utils.decompress.unzip(file_path: str, unzip_path: str)[source]

Extract a .zip file.

Parameters:
  • file_path (str) – The path where the .zip file is located.

  • unzip_path (str) – The directory to which the files are extracted.

Returns:

  • names (list) – All filenames in the .zip file.

Raises:
  • TypeError – If file_path is not a string.

  • TypeError – If unzip_path is not a string.
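
The same pattern applies to .zip archives. The following sketch uses the standard zipfile module; unzip_sketch is a hypothetical name and the body is an assumption about equivalent behavior, not mindnlp's code.

```python
import os
import zipfile
from typing import List


def unzip_sketch(file_path: str, unzip_path: str) -> List[str]:
    """Extract a .zip archive and list its members (stand-in only)."""
    if not isinstance(file_path, str):
        raise TypeError("file_path must be a string")
    if not isinstance(unzip_path, str):
        raise TypeError("unzip_path must be a string")

    os.makedirs(unzip_path, exist_ok=True)
    with zipfile.ZipFile(file_path, "r") as archive:
        names = archive.namelist()
        archive.extractall(unzip_path)
    return names
```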

download

Download functions

mindnlp.utils.download.cache_file(filename: str, cache_dir: Optional[str] = None, url: Optional[str] = None, md5sum=None, download_file_name=None, proxies=None)[source]

If the file already exists in cache_dir, return its path; otherwise, download it from the url.

Parameters:
  • filename (str) – The name of the required dataset file.

  • cache_dir (str) – The directory in which the file is saved.

  • url (str) – The url of the required dataset file.

  • md5sum (str) – The expected MD5 checksum of the downloaded file.

  • download_file_name (str) – The name of the downloaded file. (This parameter is required if the link does not end with the downloaded file name.)

  • proxies (dict) – A dict specifying the proxies to use, for example: {"https": "https://127.0.0.1:7890"}.

Returns:

  • str. If path is a folder containing one file, return {path}{filename}; if path is a folder containing multiple files, or is itself a single file, return path.

Raises:
  • TypeError – If filename is not a string.

  • TypeError – If cache_dir is not a string.

  • TypeError – If url is not a string.

  • RuntimeError – If filename is None.

Examples

>>> filename = 'aclImdb_v1'
>>> path, filename = cache_file(filename)
>>> print(path, filename)
'{home}\.text' 'aclImdb_v1.tar.gz'

mindnlp.utils.download.cached_path(filename_or_url: str, cache_dir: Optional[str] = None, md5sum=None, download_file_name=None, proxies=None)[source]

If the file already exists in cache_dir, return its path; otherwise, download it from the url.

Parameters:
  • filename_or_url (str) – The name or url of the required file.

  • cache_dir (str) – The directory in which the file is saved.

  • folder_name (str) – The additional folder in which the dataset is cached (under cache_dir).

  • md5sum (str) – The expected MD5 checksum of the downloaded file.

  • download_file_name (str) – The name of the downloaded file. (This parameter is required if the link does not end with the downloaded file name.)

  • proxies (dict) – A dict specifying the proxies to use, for example: {"https": "https://127.0.0.1:7890"}.

Returns:

  • str. If path is a folder containing one file, return {path}{filename}; if path is a folder containing multiple files, or is itself a single file, return path.

Raises:
  • TypeError – If filename_or_url is not a string.

  • RuntimeError – If filename_or_url is None.

Examples

>>> path = "https://mindspore-website.obs.myhuaweicloud.com/notebook/datasets/aclImdb_v1.tar.gz"
>>> path, filename = cached_path(path)
>>> print(path, filename)
'{home}\.text\aclImdb_v1.tar.gz' 'aclImdb_v1.tar.gz'

mindnlp.utils.download.check_md5(filename: str, md5sum=None)[source]

Check the MD5 checksum of a downloaded file.

Parameters:
  • filename (str) – The full name of the downloaded file.

  • md5sum (str) – The expected MD5 checksum of the downloaded file.

Returns:

bool, the md5 check result.

Raises:
  • TypeError – If filename is not a string.

  • RuntimeError – If filename is None.

Examples

>>> filename = 'test'
>>> check_md5_result = check_md5(filename)
>>> print(check_md5_result)
True
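
An MD5 check of this kind usually hashes the file in chunks and compares the hex digest with the expected value. The sketch below shows that idea with hashlib; md5_matches is a hypothetical name, and returning True when no md5sum is supplied is an assumption inferred from the example above, not confirmed behavior.

```python
import hashlib
from typing import Optional


def md5_matches(filename: str, md5sum: Optional[str] = None) -> bool:
    """Compare a file's MD5 hex digest with an expected value (stand-in only)."""
    if not isinstance(filename, str):
        raise TypeError("filename must be a string")
    # Assumption: with no expected digest there is nothing to verify.
    if md5sum is None:
        return True

    digest = hashlib.md5()
    with open(filename, "rb") as handle:
        # Read in 1 MiB chunks so large downloads are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == md5sum.lower()
```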

mindnlp.utils.download.get_cache_path()[source]

Get the storage path of the default cache. If the environment variable ‘cache_path’ is set, its value is used instead.

Parameters:

None

Returns:

str, the default cache path or the value of the environment variable ‘cache_path’.

Examples

>>> default_cache_path = get_cache_path()
>>> print(default_cache_path)
'{home}\.mindnlp'
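
The lookup order described here (environment variable first, default second) can be sketched as follows. The function name cache_path_sketch is hypothetical; the variable name 'cache_path' and the default folder '.mindnlp' under the home directory are taken from the text and example above, and how mindnlp actually resolves them is an assumption.

```python
import os
from pathlib import Path


def cache_path_sketch() -> str:
    """Return the cache directory, preferring an environment override (stand-in only)."""
    # Assumption: the environment variable is literally named "cache_path".
    override = os.environ.get("cache_path")
    if override:
        return override
    # Default taken from the documented example output: {home}/.mindnlp
    return str(Path.home() / ".mindnlp")
```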

mindnlp.utils.download.get_checkpoint_shard_files(index_filename, cache_dir=None, url=None, force_download=False, proxies=None)[source]

For a given model:

  • download and cache all the shards of a sharded checkpoint if pretrained_model_name_or_path is a model ID on the Hub

  • return the list of paths to all the shards, as well as some metadata.

For the description of each arg, see [PreTrainedModel.from_pretrained]. index_filename is the full path to the index (downloaded and cached if pretrained_model_name_or_path is a model ID on the Hub).

mindnlp.utils.download.get_filepath(path: str)[source]

Get the path of a file.

Parameters:

path (str) – The path of the required file.

Returns:

  • str. If path is a folder containing one file, return {path}{filename}; if path is a folder containing multiple files, or is itself a single file, return path.

Raises:
  • TypeError – If path is not a string.

  • RuntimeError – If path is None.

Examples

>>> path = '{home}\.text'
>>> get_filepath_result = get_filepath(path)
>>> print(get_filepath_result)
'{home}\.text'
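
The return rule above can be read as a small decision on the directory contents. The sketch below is one possible reading, not mindnlp's code: filepath_sketch is a hypothetical name, and treating "a folder containing a file" as "a folder with exactly one entry" is an assumption.

```python
import os


def filepath_sketch(path: str) -> str:
    """Resolve a folder holding exactly one entry to that entry (stand-in only)."""
    if path is None:
        raise RuntimeError("path must not be None")
    if not isinstance(path, str):
        raise TypeError("path must be a string")

    if os.path.isdir(path):
        entries = os.listdir(path)
        # Assumption: only a folder with exactly one entry resolves to that entry.
        if len(entries) == 1:
            return os.path.join(path, entries[0])
    # Folders with several entries and plain files are returned unchanged.
    return path
```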

mindnlp.utils.download.get_from_cache(url: str, cache_dir: Optional[str] = None, md5sum=None, download_file_name=None, proxies=None)[source]

If the file already exists in cache_dir, return its path; otherwise, download it from the url.

Parameters:
  • url (str) – The url from which to download the file.

  • cache_dir (str) – The directory in which the file is saved.

  • md5sum (str) – The expected MD5 checksum of the downloaded file.

  • download_file_name (str) – The name of the downloaded file. (This parameter is required if the link does not end with the downloaded file name.)

  • proxies (dict) – A dict specifying the proxies to use, for example: {"https": "https://127.0.0.1:7890"}.

Returns:

  • str, the path where the downloaded file is saved.

  • str, the name of the downloaded file.

Raises:
  • TypeError – If url is not a string.

  • TypeError – If cache_dir is not a Path.

  • RuntimeError – If url is None.

Examples

>>> path = "https://mindspore-website.obs.myhuaweicloud.com/notebook/datasets/aclImdb_v1.tar.gz"
>>> path, filename = get_from_cache(path)
>>> print(path, filename)
'{home}\.text' 'aclImdb_v1.tar.gz'
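
The check-the-cache-then-download flow can be sketched with the standard library. This is an illustration under stated assumptions rather than mindnlp's implementation: fetch_with_cache is a hypothetical name, the file name is assumed to come from the last segment of the url unless download_file_name is given, and md5sum and proxies handling are left out.

```python
import os
import shutil
import urllib.request
from typing import Optional, Tuple
from urllib.parse import urlparse


def fetch_with_cache(
    url: str, cache_dir: str, download_file_name: Optional[str] = None
) -> Tuple[str, str]:
    """Return (cache_dir, file name), downloading only on a cache miss (stand-in only)."""
    if url is None:
        raise RuntimeError("url must not be None")
    if not isinstance(url, str):
        raise TypeError("url must be a string")

    # Assumption: the last path segment of the url names the file.
    name = download_file_name or os.path.basename(urlparse(url).path)
    os.makedirs(cache_dir, exist_ok=True)
    target = os.path.join(cache_dir, name)

    if not os.path.exists(target):
        # Cache miss: stream the response body to disk.
        with urllib.request.urlopen(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
    return cache_dir, name
```

A second call with the same url and cache_dir finds the file on disk and skips the download.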

mindnlp.utils.download.http_get(url, path=None, md5sum=None, download_file_name=None, proxies=None)[source]

Download from the given url and save to path.

Parameters:
  • url (str) – The url to download from.

  • path (str) – The path to download to (default value: ‘{home}.text’).

  • md5sum (str) – The expected MD5 checksum of the downloaded file.

  • download_file_name (str) – The name of the downloaded file. (This parameter is required if the link does not end with the downloaded file name.)

  • proxies (dict) – A dict specifying the proxies to use, for example: {"https": "https://127.0.0.1:7890"}.

Returns:

str, the path of default or the environment ‘cache_path’.

Raises:
  • TypeError – If url is not a string.

  • RuntimeError – If url is None.

Examples

>>> url = 'https://mindspore-website.obs.myhuaweicloud.com/notebook/datasets/aclImdb_v1.tar.gz'
>>> cache_path = http_get(url)
>>> print(cache_path)
('{home}\.text', '{home}\aclImdb_v1.tar.gz')

mindnlp.utils.download.match_file(filename: str, cache_dir: str) → str[source]

If the file exists in cache_dir, return the path; otherwise, return an empty string or raise an error.

Parameters:
  • filename (str) – The name of the required file.

  • cache_dir (str) – The directory in which the file is saved.

Returns:

  • str. If the file exists in cache_dir, return filename; if there is no such file, return the empty string ''; if two or more files match, raise an error.

Raises:
  • TypeError – If filename is not a string.

  • TypeError – If cache_dir is not a string.

  • RuntimeError – If filename is None.

  • RuntimeError – If cache_dir is None.

Examples

>>> name = 'aclImdb_v1.tar.gz'
>>> path = get_cache_path()
>>> match_file_result = match_file(name, path)
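
The three outcomes listed under Returns (one match, no match, several matches) can be sketched as below. The name match_file_sketch is hypothetical and the matching rule is an assumption: a directory entry counts as a match when it equals filename or starts with filename followed by a dot. The library's actual rule may differ.

```python
import os


def match_file_sketch(filename: str, cache_dir: str) -> str:
    """Return the single matching entry in cache_dir, or '' if none (stand-in only)."""
    if filename is None or cache_dir is None:
        raise RuntimeError("filename and cache_dir must not be None")
    if not isinstance(filename, str) or not isinstance(cache_dir, str):
        raise TypeError("filename and cache_dir must be strings")
    if not os.path.isdir(cache_dir):
        return ""

    # Assumed rule: exact name, or the name followed by an extension.
    matches = [
        entry
        for entry in sorted(os.listdir(cache_dir))
        if entry == filename or entry.startswith(filename + ".")
    ]
    if len(matches) > 1:
        raise RuntimeError(f"multiple files match {filename!r}: {matches}")
    return matches[0] if matches else ""
```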

mindnlp.utils.download.try_to_load_from_cache(filename: str, cache_dir: Optional[Union[str, Path]] = None) → Optional[str][source]

Explores the cache to return the latest cached file for a given revision if found.

This function will not raise any exception if the file is not cached.

Parameters:
  • cache_dir (str or os.PathLike) – The folder where the cached files lie.

  • filename (str) – The filename to look for inside repo_id.

Returns:

Will return None if the file was not cached. Otherwise:

  • The exact path to the cached file if it’s found in the cache.

  • A special value _CACHED_NO_EXIST if the file does not exist at the given commit hash and this fact was cached.

Return type:

Optional[str] or _CACHED_NO_EXIST