Common utilities module#
General utilities#
- recommenders.utils.general_utils.get_number_processors()[source]#
Get the number of processors available on the machine.
- Returns:
Number of processors.
- Return type:
int
- recommenders.utils.general_utils.get_physical_memory()[source]#
Get the physical memory in GB.
- Returns:
Physical memory in GB.
- Return type:
float
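Example (a minimal sketch using the two helpers above; the actual values depend on the machine):
>>> from recommenders.utils.general_utils import get_number_processors, get_physical_memory
>>> get_number_processors() > 0
True
>>> get_physical_memory() > 0
True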
- recommenders.utils.general_utils.invert_dictionary(dictionary)[source]#
Invert a dictionary.
Note
If the dictionary's keys and values are both unique, the inversion is exact. If a value is repeated, its keys collide and only one of them survives in the inverted dictionary (see the sketch below).
- Parameters:
dictionary (dict) – A dictionary.
- Returns:
Inverted dictionary.
- Return type:
dict
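Example (a minimal sketch of the collision behaviour described in the note):
>>> from recommenders.utils.general_utils import invert_dictionary
>>> invert_dictionary({"a": 1, "b": 2})
{1: 'a', 2: 'b'}
>>> # Repeated values collide, so only one of the original keys survives:
>>> len(invert_dictionary({"a": 1, "b": 1}))
1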
GPU utilities#
- recommenders.utils.gpu_utils.get_cuda_version()[source]#
Get the CUDA version.
- Returns:
Version of CUDA.
- Return type:
str
- recommenders.utils.gpu_utils.get_cudnn_version()[source]#
Get the cuDNN version.
- Returns:
Version of cuDNN.
- Return type:
str
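Example (a sketch; the returned version strings depend on the local installation):
>>> from recommenders.utils.gpu_utils import get_cuda_version, get_cudnn_version
>>> cuda = get_cuda_version()    # e.g. '11.8'
>>> cudnn = get_cudnn_version()  # e.g. '8.7.0'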
Kubernetes utilities#
- recommenders.utils.k8s_utils.nodes_to_replicas(n_cores_per_node, n_nodes=3, cpu_cores_per_replica=0.1)[source]#
Provide a rough estimate of the number of replicas supported by a given number of nodes with n_cores_per_node cores each.
- Parameters:
n_cores_per_node (int) – Total number of cores per node in the AKS cluster that you want to use.
n_nodes (int) – Number of nodes (i.e. VMs) in the AKS cluster.
cpu_cores_per_replica (float) – Cores assigned to each replica. This can be fractional and corresponds to the cpu_cores argument passed to AksWebservice.deploy_configuration().
- Returns:
Total number of replicas supported by the configuration.
- Return type:
int
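Example (a minimal sketch; the returned estimate is rough and may reserve capacity for system overhead):
>>> from recommenders.utils.k8s_utils import nodes_to_replicas
>>> # 3 nodes with 4 cores each, 0.1 CPU cores assigned to each replica:
>>> max_replicas = nodes_to_replicas(n_cores_per_node=4, n_nodes=3, cpu_cores_per_replica=0.1)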
- recommenders.utils.k8s_utils.qps_to_replicas(target_qps, processing_time, max_qp_replica=1, target_utilization=0.7)[source]#
Provide a rough estimate of the number of replicas to support a given load (queries per second).
- Parameters:
target_qps (int) – Target queries per second that you want to support.
processing_time (float) – Estimated time (in seconds) that your service call takes.
max_qp_replica (int) – Maximum number of concurrent queries per replica.
target_utilization (float) – Proportion of CPU utilization you consider ideal.
- Returns:
Number of estimated replicas required to support a target number of queries per second.
- Return type:
int
- recommenders.utils.k8s_utils.replicas_to_qps(num_replicas, processing_time, max_qp_replica=1, target_utilization=0.7)[source]#
Provide a rough estimate of the queries per second supported by a number of replicas.
- Parameters:
num_replicas (int) – Number of replicas.
processing_time (float) – Estimated time (in seconds) that your service call takes.
max_qp_replica (int) – Maximum number of concurrent queries per replica.
target_utilization (float) – Proportion of CPU utilization you consider ideal.
- Returns:
Queries per second supported by the number of replicas.
- Return type:
int
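Example (a sketch treating the two estimators as rough inverses, assuming a 100 ms service call):
>>> from recommenders.utils.k8s_utils import qps_to_replicas, replicas_to_qps
>>> replicas = qps_to_replicas(target_qps=25, processing_time=0.1)
>>> qps = replicas_to_qps(replicas, processing_time=0.1)  # approximately the target QPS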
Notebook utilities#
- recommenders.utils.notebook_utils.execute_notebook(input_notebook, output_notebook, parameters={}, kernel_name='python3', timeout=2200)[source]#
Execute a notebook while passing parameters to it.
Note
Ensure your Jupyter Notebook is set up with parameters that can be modified and read. Use Markdown cells to specify parameters that need modification and code cells to set parameters that need to be read.
- Parameters:
input_notebook (str) – Path to the input notebook.
output_notebook (str) – Path to the output notebook.
parameters (dict) – Dictionary of parameters to pass to the notebook.
kernel_name (str) – Kernel name.
timeout (int) – Timeout (in seconds) for each cell to execute.
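Example (a minimal sketch; the notebook file names and parameter names are hypothetical placeholders):
>>> from recommenders.utils.notebook_utils import execute_notebook
>>> execute_notebook(
...     "sar_movielens.ipynb",         # hypothetical input notebook
...     "sar_movielens_output.ipynb",  # executed copy, with outputs
...     parameters={"TOP_K": 10, "MOVIELENS_DATA_SIZE": "100k"},
... )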
- recommenders.utils.notebook_utils.is_databricks()[source]#
Check if the module is running on Databricks.
- Returns:
True if the module is running in a Databricks notebook, False otherwise.
- Return type:
bool
- recommenders.utils.notebook_utils.is_jupyter()[source]#
Check if the module is running on Jupyter notebook/console.
- Returns:
True if the module is running in a Jupyter notebook or Jupyter console, False otherwise.
- Return type:
bool
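Example (a sketch of branching on the runtime environment):
>>> from recommenders.utils.notebook_utils import is_databricks, is_jupyter
>>> if is_databricks():
...     env = "databricks"
... elif is_jupyter():
...     env = "jupyter"
... else:
...     env = "plain python"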
- recommenders.utils.notebook_utils.read_notebook(path)[source]#
Read the metadata stored in the notebook’s output source code.
- Parameters:
path (str) – Path to the notebook.
- Returns:
Dictionary of data stored in the notebook.
- Return type:
dict
- recommenders.utils.notebook_utils.store_metadata(name, value)[source]#
Store data in the notebook’s output source code.
- Parameters:
name (str) – Name of the data.
value (int,float,str) – Value of the data.
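read_notebook and store_metadata are designed to be used together: store_metadata records values as the notebook runs, and read_notebook retrieves them from the executed copy. A sketch, assuming the notebook stored a single value; the file name is a hypothetical placeholder:
>>> # Inside the notebook:
>>> from recommenders.utils.notebook_utils import store_metadata
>>> store_metadata("rmse", 0.95)
>>> # In the calling script, after the notebook has been executed:
>>> from recommenders.utils.notebook_utils import read_notebook
>>> read_notebook("sar_movielens_output.ipynb")
{'rmse': 0.95}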
- recommenders.utils.notebook_memory_management.pre_run_cell()[source]#
Capture the current time before the current command is executed.
- recommenders.utils.notebook_memory_management.start_watching_memory()[source]#
Register memory profiling tools with the IPython instance.
Python utilities#
- recommenders.utils.python_utils.binarize(a, threshold)[source]#
Binarize the values.
- Parameters:
a (numpy.ndarray) – Input array that needs to be binarized.
threshold (float) – Threshold for binarization: values above it are set to 1, all others to 0.
- Returns:
Binarized array.
- Return type:
numpy.ndarray
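Example (a minimal sketch):
>>> import numpy as np
>>> from recommenders.utils.python_utils import binarize
>>> binarize(np.array([[0.2, 3.0], [5.0, 0.5]]), threshold=1.0)
array([[0., 1.],
       [1., 0.]])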
- recommenders.utils.python_utils.cosine_similarity(cooccurrence)[source]#
Helper method to calculate the Cosine similarity of a matrix of co-occurrences.
Cosine similarity can be interpreted as the cosine of the angle between the i-th and j-th item vectors.
- Parameters:
cooccurrence (numpy.ndarray) – The symmetric matrix of co-occurrences of items.
- Returns:
The matrix of cosine similarity between any two items.
- Return type:
numpy.ndarray
- recommenders.utils.python_utils.exponential_decay(value, max_val, half_life)[source]#
Compute the decay factor for a given value based on an exponential decay.
Values greater than max_val are assigned a decay factor of 1.
- Parameters:
value (numeric) – Value for which to calculate the decay factor.
max_val (numeric) – Value at which the decay factor becomes 1.
half_life (numeric) – Value at which the decay factor is 0.5.
- Returns:
Decay factor
- Return type:
float
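Example (a sketch, assuming the standard half-life form min(1, 0.5 ** ((max_val - value) / half_life)) implied by the descriptions above):
>>> from recommenders.utils.python_utils import exponential_decay
>>> exponential_decay(value=5, max_val=10, half_life=5)   # one half-life below max_val: ~0.5
>>> exponential_decay(value=10, max_val=10, half_life=5)  # at max_val: 1.0
>>> exponential_decay(value=20, max_val=10, half_life=5)  # above max_val: capped at 1.0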
- recommenders.utils.python_utils.get_top_k_scored_items(scores, top_k, sort_top_k=False)[source]#
Extract the top K items from a matrix of scores for each user-item pair, optionally sorting the results per user.
- Parameters:
scores (numpy.ndarray) – Score matrix (users x items).
top_k (int) – Number of top items to recommend.
sort_top_k (bool) – Flag to sort top k results.
- Returns:
Indices into score matrix for each user’s top items.
Scores corresponding to top items.
- Return type:
numpy.ndarray, numpy.ndarray
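Example (a minimal sketch with 2 users and 3 items):
>>> import numpy as np
>>> from recommenders.utils.python_utils import get_top_k_scored_items
>>> scores = np.array([[0.1, 0.9, 0.3],
...                    [0.8, 0.2, 0.5]])
>>> items, item_scores = get_top_k_scored_items(scores, top_k=2, sort_top_k=True)
>>> items
array([[1, 2],
       [0, 2]])
>>> item_scores
array([[0.9, 0.3],
       [0.8, 0.5]])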
- recommenders.utils.python_utils.inclusion_index(cooccurrence)[source]#
Helper method to calculate the Inclusion Index of a matrix of co-occurrences.
Inclusion index measures the overlap between items.
- Parameters:
cooccurrence (numpy.ndarray) – The symmetric matrix of co-occurrences of items.
- Returns:
The matrix of inclusion index between any two items.
- Return type:
numpy.ndarray
- recommenders.utils.python_utils.jaccard(cooccurrence)[source]#
Helper method to calculate the Jaccard similarity of a matrix of co-occurrences. When comparing Jaccard with count co-occurrence and lift similarity, count favours predictability, meaning that the most popular items will be recommended most of the time. Lift, by contrast, favours discoverability/serendipity, meaning that an item that is less popular overall but highly favoured by a small subset of users is more likely to be recommended. Jaccard is a compromise between the two.
- Parameters:
cooccurrence (numpy.ndarray) – the symmetric matrix of co-occurrences of items.
- Returns:
The matrix of Jaccard similarities between any two items.
- Return type:
numpy.ndarray
- recommenders.utils.python_utils.lexicographers_mutual_information(cooccurrence)[source]#
Helper method to calculate the Lexicographers' Mutual Information of a matrix of co-occurrences.
Because mutual information is biased towards low-frequency items, lexicographers' mutual information corrects the formula by multiplying it by the co-occurrence frequency.
- Parameters:
cooccurrence (numpy.ndarray) – The symmetric matrix of co-occurrences of items.
- Returns:
The matrix of lexicographers mutual information between any two items.
- Return type:
numpy.ndarray
- recommenders.utils.python_utils.lift(cooccurrence)[source]#
Helper method to calculate the Lift of a matrix of co-occurrences. In comparison with basic co-occurrence and Jaccard similarity, lift favours discoverability and serendipity, as opposed to co-occurrence that favours the most popular items, and Jaccard that is a compromise between the two.
- Parameters:
cooccurrence (numpy.ndarray) – The symmetric matrix of co-occurrences of items.
- Returns:
The matrix of Lifts between any two items.
- Return type:
numpy.ndarray
- recommenders.utils.python_utils.mutual_information(cooccurrence)[source]#
Helper method to calculate the Mutual Information of a matrix of co-occurrences.
Mutual information measures the amount of information shared between the i-th and j-th item column vectors.
- Parameters:
cooccurrence (numpy.ndarray) – The symmetric matrix of co-occurrences of items.
- Returns:
The matrix of mutual information between any two items.
- Return type:
numpy.ndarray
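All of the co-occurrence helpers above (cosine_similarity, inclusion_index, jaccard, lexicographers_mutual_information, lift and mutual_information) share the same signature, so they can be swapped freely. Example (a sketch on a toy symmetric co-occurrence matrix whose diagonal holds each item's occurrence count):
>>> import numpy as np
>>> from recommenders.utils.python_utils import jaccard, lift, mutual_information
>>> cooccurrence = np.array([[4., 2., 1.],
...                          [2., 3., 0.],
...                          [1., 0., 2.]])
>>> similarity = jaccard(cooccurrence)  # lift(...) etc. are called the same way
>>> similarity.shape
(3, 3)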
- recommenders.utils.python_utils.rescale(data, new_min=0, new_max=1, data_min=None, data_max=None)[source]#
Rescale/normalize the data to be within the range [new_min, new_max]. If data_min and data_max are explicitly provided, they will be used as the old min/max values instead of being inferred from the data.
Note
This is the same as sklearn.preprocessing.MinMaxScaler, with the exception that the min/max of the old scale can be overridden.
- Parameters:
data (numpy.ndarray) – 1d scores vector or 2d score matrix (users x items).
new_min (int|float) – The minimum of the newly scaled data.
new_max (int|float) – The maximum of the newly scaled data.
data_min (None|number) – The minimum of the passed data [if omitted it will be inferred].
data_max (None|number) – The maximum of the passed data [if omitted it will be inferred].
- Returns:
The newly scaled/normalized data.
- Return type:
numpy.ndarray
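Example (a minimal sketch):
>>> import numpy as np
>>> from recommenders.utils.python_utils import rescale
>>> data = np.array([1.0, 2.0, 3.0])
>>> rescale(data)
array([0. , 0.5, 1. ])
>>> rescale(data, new_min=1, new_max=5)
array([1., 3., 5.])
>>> rescale(data, data_min=0, data_max=4)  # override the old scale
array([0.25, 0.5 , 0.75])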
Spark utilities#
- recommenders.utils.spark_utils.start_or_get_spark(app_name='Sample', url='local[*]', memory='10g', config=None, packages=None, jars=None, repositories=None)[source]#
Start a Spark session if one is not already running.
- Parameters:
app_name (str) – Name of the application.
url (str) – URL of the Spark master.
memory (str) – Size of memory for the Spark driver. This will be ignored if spark.driver.memory is set in config.
config (dict) – Dictionary of configuration options.
packages (list) – List of packages to install.
jars (list) – List of jar files to add.
repositories (list) – List of Maven repositories.
- Returns:
Spark session.
- Return type:
object
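Example (a minimal sketch; the application name and configuration values are arbitrary):
>>> from recommenders.utils.spark_utils import start_or_get_spark
>>> spark = start_or_get_spark(
...     app_name="recommenders-demo",
...     memory="8g",
...     config={"spark.sql.shuffle.partitions": "8"},
... )
>>> spark.version  # e.g. '3.3.1'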
TensorFlow utilities#
- recommenders.utils.tf_utils.build_optimizer(name, lr=0.001, **kwargs)[source]#
Get an optimizer for TensorFlow high-level API Estimator.
Available options are: adadelta, adagrad, adam, ftrl, momentum, rmsprop or sgd.
- Parameters:
name (str) – Optimizer name.
lr (float) – Learning rate.
kwargs – Optimizer arguments as key-value pairs.
- Returns:
TensorFlow optimizer.
- Return type:
tf.train.Optimizer
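Example (a minimal sketch; extra keyword arguments such as momentum are forwarded to the underlying optimizer):
>>> from recommenders.utils.tf_utils import build_optimizer
>>> adam = build_optimizer("adam", lr=0.001)
>>> momentum = build_optimizer("momentum", lr=0.01, momentum=0.9)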
- recommenders.utils.tf_utils.evaluation_log_hook(estimator, logger, true_df, y_col, eval_df, every_n_iter=10000, model_dir=None, batch_size=256, eval_fns=None, **eval_kwargs)[source]#
Evaluation log hook for TensorFlow high-level API Estimator.
Note
TensorFlow Estimator models use the last checkpoint weights for evaluation or prediction. In order to get the most up-to-date evaluation results while training, set the model's save_checkpoints_steps to be equal to or greater than the hook's every_n_iter.
- Parameters:
estimator (tf.estimator.Estimator) – Model to evaluate.
logger (Logger) – Custom logger to log the results. E.g., define a subclass of Logger for AzureML logging.
true_df (pd.DataFrame) – Ground-truth data.
y_col (str) – Label column name in true_df.
eval_df (pd.DataFrame) – Evaluation data without label column.
every_n_iter (int) – Evaluation frequency (steps).
model_dir (str) – Model directory to save the summaries to. If None, does not record.
batch_size (int) – Number of samples fed into the model at a time. Note that the batch size does not affect the evaluation results.
eval_fns (iterable of functions) – List of evaluation functions that have the signature (true_df, prediction_df, **eval_kwargs) -> float. If None, the loss is calculated on true_df.
eval_kwargs – Evaluation functions' keyword arguments. Note that the prediction column name should be 'prediction'.
- Returns:
Session run hook to evaluate the model while training.
- Return type:
tf.train.SessionRunHook
- recommenders.utils.tf_utils.export_model(model, train_input_fn, eval_input_fn, tf_feat_cols, base_dir)[source]#
Export TensorFlow estimator (model).
- Parameters:
model (tf.estimator.Estimator) – Model to export.
train_input_fn (function) – Training input function to create data receiver spec.
eval_input_fn (function) – Evaluation input function to create data receiver spec.
tf_feat_cols (list(tf.feature_column)) – Feature columns.
base_dir (str) – Base directory to export the model.
- Returns:
Exported model path.
- Return type:
str
- recommenders.utils.tf_utils.pandas_input_fn(df, y_col=None, batch_size=128, num_epochs=1, shuffle=False, seed=None)[source]#
Pandas input function for TensorFlow high-level API Estimator. This function returns a function that creates a tf.data.Dataset.
Note
tf.estimator.inputs.pandas_input_fn cannot handle array/list column properly. For more information, see https://www.tensorflow.org/api_docs/python/tf/estimator/inputs/numpy_input_fn
- Parameters:
df (pandas.DataFrame) – Data containing features.
y_col (str) – Label column name if df has it.
batch_size (int) – Batch size for the input function.
num_epochs (int) – Number of epochs to iterate over data. If None, it will run forever.
shuffle (bool) – If True, shuffles the data queue.
seed (int) – Random seed for shuffle.
- Returns:
Input function that creates a tf.data.Dataset.
- Return type:
function
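Example (a minimal sketch; the commented estimator.train call stands in for a hypothetical consumer of the input function):
>>> import pandas as pd
>>> from recommenders.utils.tf_utils import pandas_input_fn
>>> df = pd.DataFrame(
...     {"userID": [1, 2, 3], "itemID": [10, 20, 30], "rating": [4.0, 5.0, 3.0]}
... )
>>> train_input_fn = pandas_input_fn(
...     df, y_col="rating", batch_size=2, num_epochs=1, shuffle=True, seed=42
... )
>>> # estimator.train(input_fn=train_input_fn)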
- recommenders.utils.tf_utils.pandas_input_fn_for_saved_model(df, feat_name_type)[source]#
Pandas input function for TensorFlow SavedModel.
- Parameters:
df (pandas.DataFrame) – Data containing features.
feat_name_type (dict) – Feature name and type spec. E.g. {‘userID’: int, ‘itemID’: int, ‘rating’: float}
- Returns:
Input function.
- Return type:
func
Timer#
- class recommenders.utils.timer.Timer[source]#
Timer class.
Examples
>>> import time
>>> t = Timer()
>>> t.start()
>>> time.sleep(1)
>>> t.stop()
>>> t.interval >= 1
True
>>> with Timer() as t:
...     time.sleep(1)
>>> t.interval >= 1
True
>>> "Time elapsed {}".format(t)
'Time elapsed 1...'
- property interval#
Get time interval in seconds.
- Returns:
Seconds.
- Return type:
float
Plot utilities#
- recommenders.utils.plot.line_graph(values, labels, x_guides=None, x_name=None, y_name=None, x_min_max=None, y_min_max=None, legend_loc=None, subplot=None, plot_size=(5, 5))[source]#
Plot line graph(s).
- Parameters:
values (list(list(float or tuple)) or list(float or tuple)) – List of graphs, or a single graph, to plot. E.g. a graph = list(y) or list((y, x)).
labels (list(str) or str) – List of labels, or a single label, for the graph(s). If labels is a string, this function assumes values is a single graph.
x_guides (list(int)) – List of x positions at which to draw vertical dotted guidelines.
x_name (str) – x-axis label.
y_name (str) – y-axis label.
x_min_max (list or tuple) – Min and max values of the x axis.
y_min_max (list or tuple) – Min and max values of the y axis.
legend_loc (str) – Legend location.
subplot (list or tuple) – matplotlib.pyplot.subplot format. E.g. to draw a 1 x 2 subplot, pass (1, 2, 1) for the first subplot and (1, 2, 2) for the second.
plot_size (list or tuple) – Plot size (width, height).
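Example (a sketch plotting two hypothetical loss curves with a vertical guide at epoch 2):
>>> from recommenders.utils.plot import line_graph
>>> line_graph(
...     values=[[0.9, 0.6, 0.4, 0.35], [1.0, 0.7, 0.55, 0.5]],
...     labels=["train loss", "validation loss"],
...     x_guides=[2],
...     x_name="epoch",
...     y_name="loss",
...     legend_loc="upper right",
... )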