lephare.data_retrieval

This module provides functionality for downloading and managing data files using pooch.

Functions

filter_files_by_prefix(file_path, target_prefixes)

Returns all lines in a file that contain any of the target prefixes.

download_registry_from_github([url, outfile])

Fetch the contents of a file from a GitHub repository.

read_list_file(list_file[, prefix])

Reads file names from a list file and returns a list of file paths.

make_default_retriever()

Create a retriever with the default settings.

make_retriever([base_url, registry_file, data_path])

Create a retriever for downloading files.

download_file(retriever, file_name[, ignore_registry, ...])

Download a file using the retriever, optionally ignoring the registry.

download_all_files(retriever, file_names[, ...])

Download all files in the given list using the retriever.

Module Contents

filter_files_by_prefix(file_path, target_prefixes)

Returns all lines in a file that contain any of the target prefixes.

Parameters:
  • file_path (str) – The path to the file.

  • target_prefixes (list) – A list of target prefixes to check for in each line.

Returns:

A list of lines that contain one of the target prefixes.

Return type:

list
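As a rough illustration of the behaviour described above, the following sketch reimplements the filtering with the standard library only. It is not the lephare source: the helper name `filter_lines_sketch`, the stripping of trailing newlines, and the sample file contents are all assumptions made for the example.

```python
import os
import tempfile


def filter_lines_sketch(file_path, target_prefixes):
    """Hypothetical stand-in: keep lines that contain any target prefix."""
    matching = []
    with open(file_path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")  # assumption: trailing newlines are dropped
            if any(prefix in line for prefix in target_prefixes):
                matching.append(line)
    return matching


# Build a small throwaway file with made-up entries.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "registry_sample.txt")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write("sed/STAR/a.sed 111\nfilt/sdss/u.pb 222\nopa/tau.out 333\n")
    filtered = filter_lines_sketch(sample_path, ["sed/", "filt/"])
```

Here `filtered` keeps the two lines that mention `sed/` or `filt/` and drops the `opa/` line.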

download_registry_from_github(url='', outfile='')

Fetch the contents of a file from a GitHub repository.

Parameters:
  • url (str) – The URL of the registry file. Defaults to a “data_registry.txt” file at DEFAULT_BASE_DATA_URL.

  • outfile (str) – The path where the file will be saved. Defaults to DEFAULT_REGISTRY_FILE.

Raises:

Exception – If there is any problem fetching the registry hash file or full registry file, including network issues, server errors, or other HTTP errors.

read_list_file(list_file, prefix='')

Reads file names from a list file and returns a list of file paths.

Parameters:
  • list_file (str) – The name of the file containing the list of filenames. Can be local or a URL.

  • prefix (str) –

    Optional prefix to add to all file names. Downloaded file paths must be relative to the “base URL,” which is the top-level directory.

    Prefixes are inferred from list_file paths or URLs that contain “sed” or “filt”; otherwise, they should be specified manually.

Returns:

A list of file paths read from the list file.

Return type:

list of str
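The prefix handling can be sketched as follows for a local list file. This is a simplified illustration rather than the library's implementation: the name `read_list_sketch`, the choice of the matched keyword itself as the inferred prefix, the skipping of `#` comment lines, and the use of only the first whitespace-separated field are assumptions, and URL inputs are left out.

```python
import os
import tempfile


def read_list_sketch(list_file, prefix=""):
    """Hypothetical stand-in for reading a local list file."""
    if not prefix:
        # Assumption: the inferred prefix is the matched keyword itself.
        for keyword in ("sed", "filt"):
            if keyword in list_file:
                prefix = keyword
                break
    paths = []
    with open(list_file, encoding="utf-8") as handle:
        for line in handle:
            fields = line.split()
            # Assumption: blank lines and '#' comments are skipped.
            if not fields or fields[0].startswith("#"):
                continue
            # Joined with "/" because the paths are later used relative to a URL.
            paths.append(f"{prefix}/{fields[0]}" if prefix else fields[0])
    return paths


with tempfile.TemporaryDirectory() as tmp_dir:
    list_path = os.path.join(tmp_dir, "my_sed_models.list")
    with open(list_path, "w", encoding="utf-8") as handle:
        handle.write("# comment\nSTAR/a.sed\nSTAR/b.sed extra\n")
    inferred = read_list_sketch(list_path)
    explicit = read_list_sketch(list_path, prefix="examples")
```

Because the list file name contains “sed”, the first call infers the prefix, while the second call uses the prefix passed by the caller.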

make_default_retriever()

Create a retriever with the default settings.

make_retriever(base_url=DEFAULT_BASE_DATA_URL, registry_file=DEFAULT_REGISTRY_FILE, data_path=DEFAULT_LOCAL_DATA_PATH)

Create a retriever for downloading files.

Parameters:
  • base_url (str, optional) – The base URL for the data files.

  • registry_file (str, optional) – The path to the registry file that lists the files and their hashes.

  • data_path (str, optional) – The local path where the files will be downloaded.

Returns:

The retriever object for downloading files.

Return type:

pooch.Pooch

download_file(retriever, file_name, ignore_registry=False, downloader=None)

Download a file using the retriever, optionally ignoring the registry.

Parameters:
  • retriever (pooch.Pooch) – The retriever object for downloading files.

  • file_name (str) – The name of the file to download.

  • ignore_registry (bool) – If True, download the file without checking its hash against the registry.

  • downloader (pooch.HTTPDownloader) – The downloader to use. A custom downloader is required to set the user when building on Read the Docs.

Returns:

The path to the downloaded file.

Return type:

str
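To make concrete what checking a file's hash against the registry involves, and therefore what ignore_registry=True skips, here is a minimal standard-library sketch. It assumes a registry laid out as whitespace-separated “file name, hash” lines and SHA-256 hashes; the helper names `parse_registry_sketch` and `hash_matches_sketch` and the sample file are hypothetical, and the real check is carried out by pooch rather than by code like this.

```python
import hashlib
import os
import tempfile


def parse_registry_sketch(text):
    """Parse 'file_name hash' lines into a dict (assumed registry layout)."""
    registry = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and not fields[0].startswith("#"):
            registry[fields[0]] = fields[1]
    return registry


def hash_matches_sketch(path, expected_sha256):
    """Return True if the file's SHA-256 digest equals the expected value."""
    with open(path, "rb") as handle:
        digest = hashlib.sha256(handle.read()).hexdigest()
    return digest == expected_sha256


payload = b"example payload\n"
registry = parse_registry_sketch(
    f"demo/file.dat {hashlib.sha256(payload).hexdigest()}\n"
)

with tempfile.TemporaryDirectory() as tmp_dir:
    local_path = os.path.join(tmp_dir, "file.dat")
    with open(local_path, "wb") as handle:
        handle.write(payload)
    verified = hash_matches_sketch(local_path, registry["demo/file.dat"])

    # Simulate a corrupted or outdated download.
    with open(local_path, "wb") as handle:
        handle.write(b"different contents\n")
    tampered_ok = hash_matches_sketch(local_path, registry["demo/file.dat"])
```

The unmodified file passes the check and the altered one fails it. Skipping the check is convenient for files that are not listed in the registry, at the cost of losing this integrity guarantee.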

download_all_files(retriever, file_names, ignore_registry=False, retry=MAX_RETRY_ATTEMPTS)

Download all files in the given list using the retriever.

Parameters:
  • retriever (pooch.Pooch) – The retriever object for downloading files.

  • file_names (list of str) – List of file names to download.

  • ignore_registry (bool) – If True, download the files without checking their hashes against the registry.

  • retry (int) – Number of times to retry downloading a file if the first attempt fails.
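
The retry parameter can be pictured with the sketch below, which loops over file names and retries each failed download. It is an illustration under stated assumptions, not the lephare code: `download_with_retry_sketch` and `flaky_fetch` are hypothetical names, the fetch step is simulated instead of touching the network, failures are assumed to surface as OSError, and `retry` is read as the number of extra attempts after the first one.

```python
def download_with_retry_sketch(fetch, file_names, retry=3):
    """Call fetch(name) for each name, retrying failures up to `retry` times."""
    downloaded, failed = [], []
    for name in file_names:
        # Assumption: one initial attempt plus `retry` further attempts.
        for _ in range(1 + retry):
            try:
                downloaded.append(fetch(name))
                break
            except OSError:
                continue
        else:
            # Every attempt raised, so record the file as failed.
            failed.append(name)
    return downloaded, failed


calls = {}


def flaky_fetch(name):
    """Simulated download: 'flaky.dat' succeeds on its third call."""
    calls[name] = calls.get(name, 0) + 1
    if name == "flaky.dat" and calls[name] < 3:
        raise OSError("simulated network error")
    if name == "missing.dat":
        raise OSError("simulated permanent failure")
    return f"/tmp/cache/{name}"


downloaded, failed = download_with_retry_sketch(
    flaky_fetch, ["ok.dat", "flaky.dat", "missing.dat"], retry=2
)
```

With retry=2, the transient failure recovers on the third attempt, while the permanently failing file is reported after three attempts instead of aborting the whole batch.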