lephare.data_retrieval

This module provides functionality for downloading and managing data files using pooch.

Functions

`filter_files_by_prefix`(file_path, target_prefixes)	Returns all lines in a file that contain any of the target prefixes.
`download_registry_from_github`([url, outfile])	Fetch the contents of a file from a GitHub repository.
`read_list_file`(list_file[, prefix])	Reads file names from a list file and returns a list of file paths.
`make_default_retriever`()	Create a retriever with the default settings.
`make_retriever`([base_url, registry_file, data_path])	Create a retriever for downloading files.
`download_file`(retriever, file_name[, ignore_registry, ...])	Download a file using the retriever, optionally ignoring the registry.
`download_all_files`(retriever, file_names[, ...])	Download all files in the given list using the retriever.

Module Contents

filter_files_by_prefix(file_path, target_prefixes)[source]

Returns all lines in a file that contain any of the target prefixes.

Parameters:

file_path (str) – The path to the file.
target_prefixes (list) – A list of target prefixes to check for in each line.

Returns:

A list of lines that contain one of the target prefixes.

Return type:

list
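
As an illustration of the behavior described above, here is a minimal sketch that re-implements the filtering with the standard library only. It is not the lephare source: the function name `filter_lines_sketch`, the stripping of trailing newlines, and the example file names and hashes are all assumptions made for this example.

```python
import os
import tempfile


def filter_lines_sketch(file_path, target_prefixes):
    """Return the lines of file_path that contain any of target_prefixes."""
    matches = []
    with open(file_path, "r") as handle:
        for line in handle:
            stripped = line.strip()  # assumption: newlines are dropped
            # Keep the line if any prefix occurs anywhere in it.
            if any(prefix in stripped for prefix in target_prefixes):
                matches.append(stripped)
    return matches


# Build a small registry-like file to filter (names and hashes are made up).
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("sed/STAR/example.sed abc123\n")
    tmp.write("filt/example/r.pb def456\n")
    tmp.write("opa/example.dat 789aaa\n")
    registry_path = tmp.name

selected = filter_lines_sketch(registry_path, ["sed/", "opa/"])
os.remove(registry_path)
print(selected)
```

Only the first and third lines are kept, because the second contains neither "sed/" nor "opa/".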

download_registry_from_github(url='', outfile='')[source]

Fetch the contents of a file from a GitHub repository.

Parameters:

url (str) – The URL of the registry file. Defaults to a “data_registry.txt” file at DEFAULT_BASE_DATA_URL.
outfile (str) – The path where the file will be saved. Defaults to DEFAULT_REGISTRY_FILE.

Raises:

Exception – If there is any problem fetching the registry hash file or full registry file, including network issues, server errors, or other HTTP errors.

read_list_file(list_file, prefix='')[source]

Reads file names from a list file and returns a list of file paths.

Parameters:

list_file (str) – The name of the file containing the list of filenames. Can be local or a URL.
prefix (str) –
Optional prefix to add to all file names. When downloaded, file paths must be relative to the “base url,” which is the top-level directory.

Prefixes will be inferred from list_file paths or URLs that contain “sed” or “filt”; otherwise, they should be specified manually.

Returns:

A list of file paths read from the list file.

Return type:

list of str
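
The prefix handling can be pictured with the following sketch, which uses only the standard library and reads a local list file. The helper names `infer_prefix_sketch` and `read_list_sketch`, the exact inference rule (a plain substring test), the skipping of blank lines, and the example file names are assumptions for illustration; the real function may differ, and it also accepts URLs.

```python
import os
import posixpath
import tempfile


def infer_prefix_sketch(list_file):
    """Guess a prefix from the list file path (assumed rule)."""
    if "sed" in list_file:
        return "sed"
    if "filt" in list_file:
        return "filt"
    return ""


def read_list_sketch(list_file, prefix=None):
    """Read file names from a local list file and prepend a prefix."""
    if prefix is None:
        prefix = infer_prefix_sketch(list_file)
    paths = []
    with open(list_file, "r") as handle:
        for line in handle:
            name = line.strip()
            if name:  # skip blank lines
                # Paths stay relative to the base URL, so join with "/".
                paths.append(posixpath.join(prefix, name))
    return paths


# Write a hypothetical list file and read it back with an explicit prefix.
with tempfile.TemporaryDirectory() as tmp_dir:
    list_path = os.path.join(tmp_dir, "example.list")
    with open(list_path, "w") as handle:
        handle.write("example/r.pb\n\nexample/g.pb\n")
    file_paths = read_list_sketch(list_path, prefix="filt")

print(file_paths)
```

Passing the prefix explicitly, as in the usage above, avoids relying on any inference from the list file name.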

make_default_retriever()[source]

Create a retriever with the default settings.

make_retriever(base_url=DEFAULT_BASE_DATA_URL, registry_file=DEFAULT_REGISTRY_FILE, data_path=DEFAULT_LOCAL_DATA_PATH)[source]

Create a retriever for downloading files.

Parameters:

base_url (str, optional) – The base URL for the data files.
registry_file (str, optional) – The path to the registry file that lists the files and their hashes.
data_path (str, optional) – The local path where the files will be downloaded.

Returns:

The retriever object for downloading files.

Return type:

pooch.Pooch

download_file(retriever, file_name, ignore_registry=False, downloader=None)[source]

Download a file using the retriever, optionally ignoring the registry.

Parameters:

retriever (pooch.Pooch) – The retriever object for downloading files.
file_name (str) – The name of the file to download.
ignore_registry (bool) – If True, download the file without checking its hash against the registry.
downloader (pooch.HTTPDownloader) – A custom downloader. It is required in order to set the user when building on Read the Docs.

Returns:

The path to the downloaded file.

Return type:

str
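
To make the effect of ignore_registry concrete, the sketch below shows what checking a file's hash against a registry amounts to, and what skipping that check means. It is a standard-library illustration, not the lephare or pooch implementation: the function `verify_sketch`, the dictionary-based registry, and the file name are hypothetical. SHA-256 is used because it is pooch's default hash algorithm.

```python
import hashlib
import os
import tempfile


def verify_sketch(path, registry, file_name, ignore_registry=False):
    """Return True if the file passes the hash check or the check is skipped."""
    if ignore_registry:
        # No comparison is made, so any content is accepted.
        return True
    with open(path, "rb") as handle:
        digest = hashlib.sha256(handle.read()).hexdigest()
    return registry.get(file_name) == digest


# Create a stand-in for a downloaded file (the content is made up).
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"example data")
    local_path = tmp.name

good_registry = {"example.dat": hashlib.sha256(b"example data").hexdigest()}
stale_registry = {"example.dat": hashlib.sha256(b"older data").hexdigest()}

ok_with_good = verify_sketch(local_path, good_registry, "example.dat")
ok_with_stale = verify_sketch(local_path, stale_registry, "example.dat")
ok_ignored = verify_sketch(local_path, stale_registry, "example.dat",
                           ignore_registry=True)
os.remove(local_path)
```

Skipping the check is convenient when the registry is out of date, but it also means a corrupted or changed file would go unnoticed.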

download_all_files(retriever, file_names, ignore_registry=False, retry=MAX_RETRY_ATTEMPTS)[source]

Download all files in the given list using the retriever.

Parameters:

retriever (pooch.Pooch) – The retriever object for downloading files.
file_names (list of str) – List of file names to download.
ignore_registry (bool) – If True, download the files without checking their hashes against the registry.
retry (int) – Number of times to retry downloading a file if the first attempt fails.
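
The retry parameter can be read as "one initial attempt plus up to retry further attempts per file". The sketch below illustrates that reading with a stand-in fetch function instead of a real retriever. The names `download_all_sketch` and `flaky_fetch`, the returned pair of lists, the default of 3 (a placeholder, not the value of MAX_RETRY_ATTEMPTS), and the example paths are assumptions for this example only.

```python
def download_all_sketch(fetch, file_names, retry=3):
    """Call fetch(name) for every name, retrying failures up to `retry` times."""
    downloaded, failed = [], []
    for name in file_names:
        # One initial attempt plus `retry` retries.
        for attempt in range(1 + retry):
            try:
                downloaded.append(fetch(name))
                break
            except Exception:
                if attempt == retry:
                    failed.append(name)  # give up on this file
    return downloaded, failed


calls = {}


def flaky_fetch(name):
    """Stand-in for a download: b.dat fails once, c.dat always fails."""
    calls[name] = calls.get(name, 0) + 1
    if name == "b.dat" and calls[name] == 1:
        raise OSError("simulated network error")
    if name == "c.dat":
        raise OSError("simulated permanent failure")
    return "/tmp/cache/" + name


downloaded, failed = download_all_sketch(
    flaky_fetch, ["a.dat", "b.dat", "c.dat"], retry=2
)
print(downloaded, failed, calls)
```

In this run, b.dat succeeds on its second attempt, while c.dat is reported as failed after three attempts in total.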