Common utilities module#

General utilities#

recommenders.utils.general_utils.get_number_processors()[source]#

Get the number of processors in a CPU.

Returns:

Number of processors.

Return type:

int

recommenders.utils.general_utils.get_physical_memory()[source]#

Get the physical memory in GBs.

Returns:

Physical memory in GBs.

Return type:

float
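Example (illustrative; the returned values depend on your machine):

>>> from recommenders.utils.general_utils import get_number_processors, get_physical_memory
>>> n_cpus = get_number_processors()   # e.g. 8
>>> ram_gb = get_physical_memory()     # e.g. 16.0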

recommenders.utils.general_utils.invert_dictionary(dictionary)[source]#

Invert a dictionary (swap keys and values).

Note

If the dictionary has unique keys and unique values, the inversion is exact. However, if some values are repeated, each repeated value can map back to only one of its original keys, and which key is kept depends on the implementation.

Parameters:

dictionary (dict) – A dictionary

Returns:

inverted dictionary

Return type:

dict
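Example (assuming a straightforward {value: key} inversion, where the last key wins for repeated values):

>>> from recommenders.utils.general_utils import invert_dictionary
>>> invert_dictionary({"a": 1, "b": 2})
{1: 'a', 2: 'b'}
>>> invert_dictionary({"a": 1, "b": 1})   # only one key can survive per value
{1: 'b'}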

GPU utilities#

recommenders.utils.gpu_utils.clear_memory_all_gpus()[source]#

Clear memory of all GPUs.

recommenders.utils.gpu_utils.get_cuda_version()[source]#

Get the CUDA version.

Returns:

Version of the library.

Return type:

str

recommenders.utils.gpu_utils.get_cudnn_version()[source]#

Get the cuDNN version.

Returns:

Version of the library.

Return type:

str

recommenders.utils.gpu_utils.get_gpu_info()[source]#

Get information of GPUs.

Returns:

List of GPU information dictionaries, each containing device_name, total_memory (in MB) and free_memory (in MB). Returns an empty list if there is no CUDA device available.

Return type:

list

recommenders.utils.gpu_utils.get_number_gpus()[source]#

Get the number of GPUs in the system.

Returns:

Number of GPUs.

Return type:

int
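Example of a typical usage pattern for these helpers (results depend on your hardware and drivers):

>>> from recommenders.utils.gpu_utils import get_number_gpus, get_gpu_info
>>> n_gpus = get_number_gpus()
>>> for gpu in get_gpu_info():   # empty list on a machine without CUDA devices
...     print(gpu["device_name"], gpu["free_memory"])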

Kubernetes utilities#

recommenders.utils.k8s_utils.nodes_to_replicas(n_cores_per_node, n_nodes=3, cpu_cores_per_replica=0.1)[source]#

Provide a rough estimate of the number of replicas supported by a given number of nodes, each with n_cores_per_node cores.

Parameters:
  • n_cores_per_node (int) – Total number of cores per node within the AKS cluster that you want to use.

  • n_nodes (int) – Number of nodes (i.e. VMs) used in the AKS cluster.

  • cpu_cores_per_replica (float) – Cores assigned to each replica. This can be fractional and corresponds to the cpu_cores argument passed to AksWebservice.deploy_configuration().

Returns:

Total number of replicas supported by the configuration

Return type:

int
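As a rough illustration of the capacity arithmetic (a sketch of the idea only; the library's exact formula may reserve overhead for system processes):

>>> n_cores_per_node, n_nodes, cpu_cores_per_replica = 4, 3, 0.1
>>> replicas = int(n_cores_per_node * n_nodes / cpu_cores_per_replica)   # 12 cores / 0.1 -> 120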

recommenders.utils.k8s_utils.qps_to_replicas(target_qps, processing_time, max_qp_replica=1, target_utilization=0.7)[source]#

Provide a rough estimate of the number of replicas needed to support a given load (queries per second).

Parameters:
  • target_qps (int) – Target queries per second that you want to support.

  • processing_time (float) – Estimated time (in seconds) that your service call takes.

  • max_qp_replica (int) – Maximum number of concurrent queries per replica.

  • target_utilization (float) – Proportion of CPU utilization you consider ideal.

Returns:

Number of estimated replicas required to support a target number of queries per second.

Return type:

int

recommenders.utils.k8s_utils.replicas_to_qps(num_replicas, processing_time, max_qp_replica=1, target_utilization=0.7)[source]#

Provide a rough estimate of the queries per second supported by a given number of replicas.

Parameters:
  • num_replicas (int) – Number of replicas.

  • processing_time (float) – Estimated time (in seconds) that your service call takes.

  • max_qp_replica (int) – Maximum number of concurrent queries per replica.

  • target_utilization (float) – Proportion of CPU utilization you consider ideal.

Returns:

Queries per second supported by the given number of replicas.

Return type:

int
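qps_to_replicas and replicas_to_qps are approximate inverses. A worked sketch of the underlying reasoning (not necessarily the library's exact code): each replica serves about max_qp_replica / processing_time queries per second, derated by target_utilization.

>>> import math
>>> processing_time, max_qp_replica, target_utilization = 0.1, 1, 0.7
>>> qps_per_replica = max_qp_replica / processing_time * target_utilization   # ~7 qps
>>> replicas = math.ceil(25 / qps_per_replica)         # ~4 replicas for a 25 qps target
>>> supported_qps = replicas * qps_per_replica         # ~28 qps back the other way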

Notebook utilities#

recommenders.utils.notebook_utils.execute_notebook(input_notebook, output_notebook, parameters={}, kernel_name='python3', timeout=2200)[source]#

Execute a notebook while passing parameters to it.

Note

Ensure your Jupyter Notebook is set up with parameters that can be modified and read. Use Markdown cells to specify parameters that need modification and code cells to set parameters that need to be read.

Parameters:
  • input_notebook (str) – Path to the input notebook.

  • output_notebook (str) – Path to the output notebook.

  • parameters (dict) – Dictionary of parameters to pass to the notebook.

  • kernel_name (str) – Kernel name.

  • timeout (int) – Timeout (in seconds) for each cell to execute.
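Example (the notebook paths and parameter names here are hypothetical):

>>> from recommenders.utils.notebook_utils import execute_notebook
>>> execute_notebook(
...     "train.ipynb",
...     "train_output.ipynb",
...     parameters={"MOVIELENS_DATA_SIZE": "100k", "TOP_K": 10},
... )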

recommenders.utils.notebook_utils.is_databricks()[source]#

Check if the module is running on Databricks.

Returns:

True if the module is running on Databricks notebook, False otherwise.

Return type:

bool

recommenders.utils.notebook_utils.is_jupyter()[source]#

Check if the module is running on Jupyter notebook/console.

Returns:

True if the module is running on Jupyter notebook or Jupyter console, False otherwise.

Return type:

bool

recommenders.utils.notebook_utils.read_notebook(path)[source]#

Read the metadata stored in the notebook’s output source code.

Parameters:

path (str) – Path to the notebook.

Returns:

Dictionary of data stored in the notebook.

Return type:

dict

recommenders.utils.notebook_utils.store_metadata(name, value)[source]#

Store data in the notebook’s output source code.

Parameters:
  • name (str) – Name of the data.

  • value (int,float,str) – Value of the data.
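store_metadata and read_notebook work as a pair: calls to store_metadata inside the executed notebook embed named values in the cell outputs, and read_notebook collects them afterwards. A sketch of the round trip (names and values are illustrative):

>>> # inside a cell of the executed notebook
>>> from recommenders.utils.notebook_utils import store_metadata
>>> store_metadata("map", 0.055)

>>> # in the driver code, after execute_notebook(...)
>>> from recommenders.utils.notebook_utils import read_notebook
>>> results = read_notebook("train_output.ipynb")   # e.g. {"map": 0.055}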

recommenders.utils.notebook_memory_management.pre_run_cell()[source]#

Capture the current time before executing the current command.

recommenders.utils.notebook_memory_management.start_watching_memory()[source]#

Register memory profiling tools to IPython instance.

recommenders.utils.notebook_memory_management.stop_watching_memory()[source]#

Unregister memory profiling tools from IPython instance.

recommenders.utils.notebook_memory_management.watch_memory()[source]#

Bring in the global memory usage value from the previous iteration.

Python utilities#

recommenders.utils.python_utils.binarize(a, threshold)[source]#

Binarize the values.

Parameters:
  • a (numpy.ndarray) – Input array that needs to be binarized.

  • threshold (float) – Threshold used to binarize: values below it are set to 0, the rest to 1.

Returns:

Binarized array.

Return type:

numpy.ndarray
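Example (a sketch; whether a value exactly at the threshold maps to 0 or 1 depends on the implementation's comparison):

>>> import numpy as np
>>> from recommenders.utils.python_utils import binarize
>>> binarized = binarize(np.array([0.2, 0.8, 1.5]), threshold=0.5)   # -> array([0., 1., 1.])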

recommenders.utils.python_utils.cosine_similarity(cooccurrence)[source]#

Helper method to calculate the Cosine similarity of a matrix of co-occurrences.

Cosine similarity can be interpreted as the cosine of the angle between the i-th and j-th item vectors.

Parameters:

cooccurrence (numpy.ndarray) – The symmetric matrix of co-occurrences of items.

Returns:

The matrix of cosine similarity between any two items.

Return type:

numpy.ndarray
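On a co-occurrence matrix C, the cosine similarity between items i and j is c_ij / sqrt(c_ii * c_jj). A minimal numpy illustration of that definition (not the library's implementation, which may use sparse matrices):

>>> import numpy as np
>>> C = np.array([[4.0, 2.0], [2.0, 3.0]])   # diagonal holds item occurrence counts
>>> d = np.sqrt(np.diag(C))
>>> cos = C / np.outer(d, d)   # off-diagonal: 2 / sqrt(4 * 3) ≈ 0.577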

recommenders.utils.python_utils.exponential_decay(value, max_val, half_life)[source]#

Compute decay factor for a given value based on an exponential decay.

Values greater than max_val will be set to 1.

Parameters:
  • value (numeric) – Value for which to calculate the decay factor.

  • max_val (numeric) – Value at which the decay factor will be 1.

  • half_life (numeric) – Value at which the decay factor will be 0.5.

Returns:

Decay factor

Return type:

float
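Consistent with the description (factor 1 at max_val and 0.5 one half-life below it), the decay behaves like min(1, 0.5 ** ((max_val - value) / half_life)). A quick check of the formula implied by the docstring:

>>> max_val, half_life = 10, 2
>>> [min(1.0, 0.5 ** ((max_val - v) / half_life)) for v in (6, 8, 10, 12)]
[0.25, 0.5, 1.0, 1.0]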

recommenders.utils.python_utils.get_top_k_scored_items(scores, top_k, sort_top_k=False)[source]#

Extract the top K items from a matrix of scores for each user-item pair, optionally sorting the results per user.

Parameters:
  • scores (numpy.ndarray) – Score matrix (users x items).

  • top_k (int) – Number of top items to recommend.

  • sort_top_k (bool) – Flag to sort top k results.

Returns:

  • Indices into score matrix for each user’s top items.

  • Scores corresponding to top items.

Return type:

numpy.ndarray, numpy.ndarray
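Example (a sketch assuming the documented shapes: both returned arrays are users x top_k):

>>> import numpy as np
>>> from recommenders.utils.python_utils import get_top_k_scored_items
>>> scores = np.array([[0.1, 0.9, 0.3],
...                    [0.4, 0.2, 0.8]])
>>> items, vals = get_top_k_scored_items(scores, top_k=2, sort_top_k=True)
>>> # items -> array([[1, 2], [2, 0]]); vals -> array([[0.9, 0.3], [0.8, 0.4]])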

recommenders.utils.python_utils.inclusion_index(cooccurrence)[source]#

Helper method to calculate the Inclusion Index of a matrix of co-occurrences.

Inclusion index measures the overlap between items.

Parameters:

cooccurrence (numpy.ndarray) – The symmetric matrix of co-occurrences of items.

Returns:

The matrix of inclusion index between any two items.

Return type:

numpy.ndarray
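The inclusion index is commonly defined as the overlap coefficient c_ij / min(c_ii, c_jj), which is 1 when one item's users are a subset of the other's. A minimal sketch of that definition (an assumption; check the source for the exact formula):

>>> import numpy as np
>>> C = np.array([[4.0, 2.0], [2.0, 2.0]])
>>> d = np.diag(C)
>>> inclusion = C / np.minimum.outer(d, d)   # off-diagonal: 2 / min(4, 2) = 1.0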

recommenders.utils.python_utils.jaccard(cooccurrence)[source]#

Helper method to calculate the Jaccard similarity of a matrix of co-occurrences. When comparing Jaccard with count co-occurrence and lift similarity, count favours predictability, meaning that the most popular items will be recommended most of the time. Lift, by contrast, favours discoverability/serendipity, meaning that an item that is less popular overall but highly favoured by a small subset of users is more likely to be recommended. Jaccard is a compromise between the two.

Parameters:

cooccurrence (numpy.ndarray) – the symmetric matrix of co-occurrences of items.

Returns:

The matrix of Jaccard similarities between any two items.

Return type:

numpy.ndarray
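For co-occurrence counts, the Jaccard similarity between items i and j is c_ij / (c_ii + c_jj - c_ij). A small numpy illustration of the formula (the library implementation may differ in details such as sparse handling):

>>> import numpy as np
>>> C = np.array([[4.0, 2.0], [2.0, 3.0]])
>>> d = np.diag(C)
>>> jac = C / (d[:, None] + d[None, :] - C)   # off-diagonal: 2 / (4 + 3 - 2) = 0.4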

recommenders.utils.python_utils.lexicographers_mutual_information(cooccurrence)[source]#

Helper method to calculate the Lexicographers Mutual Information of a matrix of co-occurrences.

Because mutual information is biased towards low-frequency items, lexicographers mutual information corrects for this by multiplying the mutual information by the co-occurrence frequency.

Parameters:

cooccurrence (numpy.ndarray) – The symmetric matrix of co-occurrences of items.

Returns:

The matrix of lexicographers mutual information between any two items.

Return type:

numpy.ndarray

recommenders.utils.python_utils.lift(cooccurrence)[source]#

Helper method to calculate the Lift of a matrix of co-occurrences. In comparison with basic co-occurrence and Jaccard similarity, lift favours discoverability and serendipity, as opposed to co-occurrence, which favours the most popular items, and Jaccard, which is a compromise between the two.

Parameters:

cooccurrence (numpy.ndarray) – The symmetric matrix of co-occurrences of items.

Returns:

The matrix of Lifts between any two items.

Return type:

numpy.ndarray
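Lift between items i and j is c_ij / (c_ii * c_jj), i.e. co-occurrence normalized by each item's popularity, which is what makes it favour less popular items. Continuing the toy example above (again a sketch of the definition):

>>> import numpy as np
>>> C = np.array([[4.0, 2.0], [2.0, 3.0]])
>>> d = np.diag(C)
>>> lift = C / np.outer(d, d)   # off-diagonal: 2 / (4 * 3) ≈ 0.167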

recommenders.utils.python_utils.mutual_information(cooccurrence)[source]#

Helper method to calculate the Mutual Information of a matrix of co-occurrences.

Mutual information measures the amount of information shared between the i-th and j-th item column vectors.

Parameters:

cooccurrence (numpy.ndarray) – The symmetric matrix of co-occurrences of items.

Returns:

The matrix of mutual information between any two items.

Return type:

numpy.ndarray

recommenders.utils.python_utils.rescale(data, new_min=0, new_max=1, data_min=None, data_max=None)[source]#

Rescale/normalize the data to be within the range [new_min, new_max]. If data_min and data_max are explicitly provided, they will be used as the old min/max values instead of being taken from the data.

Note

This is the same as sklearn.preprocessing.MinMaxScaler, with the exception that the min/max of the old scale can be overridden.

Parameters:
  • data (numpy.ndarray) – 1d scores vector or 2d score matrix (users x items).

  • new_min (int|float) – The minimum of the newly scaled data.

  • new_max (int|float) – The maximum of the newly scaled data.

  • data_min (None|number) – The minimum of the passed data [if omitted it will be inferred].

  • data_max (None|number) – The maximum of the passed data [if omitted it will be inferred].

Returns:

The newly scaled/normalized data.

Return type:

numpy.ndarray
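Example (a straightforward usage sketch):

>>> import numpy as np
>>> from recommenders.utils.python_utils import rescale
>>> scaled = rescale(np.array([1.0, 3.0, 5.0]))                       # -> array([0. , 0.5, 1. ])
>>> scaled2 = rescale(np.array([1.0, 3.0]), 0, 1, data_min=1, data_max=5)   # -> array([0. , 0.5])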

Spark utilities#

recommenders.utils.spark_utils.start_or_get_spark(app_name='Sample', url='local[*]', memory='10g', config=None, packages=None, jars=None, repositories=None)[source]#

Start a Spark session if one is not already started; otherwise, get the existing one.

Parameters:
  • app_name (str) – Name of the application.

  • url (str) – URL of the Spark master.

  • memory (str) – Size of memory for the Spark driver. This will be ignored if spark.driver.memory is set in config.

  • config (dict) – Dictionary of configuration options.

  • packages (list) – List of packages to install.

  • jars (list) – List of jar files to add.

  • repositories (list) – List of Maven repositories.

Returns:

Spark session.

Return type:

object
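Example (configuration values are illustrative):

>>> from recommenders.utils.spark_utils import start_or_get_spark
>>> spark = start_or_get_spark(
...     app_name="recommender",
...     memory="16g",
...     config={"spark.sql.shuffle.partitions": "8"},
... )
>>> df = spark.read.csv("ratings.csv", header=True)   # use like any Spark session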

Tensorflow utilities#

class recommenders.utils.tf_utils.MetricsLogger[source]#

Metrics logger

__init__()[source]#

Initializer

get_log()[source]#

Getter

Returns:

Log metrics.

Return type:

dict

log(metric, value)[source]#

Log metrics. Each metric’s log will be stored in the corresponding list.

Parameters:
  • metric (str) – Metric name.

  • value (float) – Value.
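Example (per the description above, each metric accumulates into its own list):

>>> from recommenders.utils.tf_utils import MetricsLogger
>>> logger = MetricsLogger()
>>> logger.log("rmse", 0.95)
>>> logger.log("rmse", 0.93)
>>> logs = logger.get_log()   # e.g. {'rmse': [0.95, 0.93]}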

recommenders.utils.tf_utils.build_optimizer(name, lr=0.001, **kwargs)[source]#

Get an optimizer for TensorFlow high-level API Estimator.

Available options are: adadelta, adagrad, adam, ftrl, momentum, rmsprop or sgd.

Parameters:
  • name (str) – Optimizer name.

  • lr (float) – Learning rate.

  • kwargs – Optimizer arguments as key-value pairs.

Returns:

Tensorflow optimizer.

Return type:

tf.train.Optimizer
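Example (a sketch against the TF1 Estimator API; extra keyword arguments are forwarded to the underlying optimizer):

>>> from recommenders.utils.tf_utils import build_optimizer
>>> optimizer = build_optimizer("momentum", lr=0.01, momentum=0.9)
>>> # pass `optimizer` to a tf.estimator model, e.g. tf.estimator.DNNRegressor(..., optimizer=optimizer)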

recommenders.utils.tf_utils.evaluation_log_hook(estimator, logger, true_df, y_col, eval_df, every_n_iter=10000, model_dir=None, batch_size=256, eval_fns=None, **eval_kwargs)[source]#

Evaluation log hook for TensorFlow high-level API Estimator.

Note

A TensorFlow Estimator model uses the last checkpoint weights for evaluation or prediction. In order to get the most up-to-date evaluation results while training, set the model's save_checkpoints_steps to be equal to or greater than the hook's every_n_iter.

Parameters:
  • estimator (tf.estimator.Estimator) – Model to evaluate.

  • logger (Logger) – Custom logger to log the results. E.g., define a subclass of Logger for AzureML logging.

  • true_df (pd.DataFrame) – Ground-truth data.

  • y_col (str) – Label column name in true_df.

  • eval_df (pd.DataFrame) – Evaluation data without label column.

  • every_n_iter (int) – Evaluation frequency (steps).

  • model_dir (str) – Model directory to save the summaries to. If None, does not record.

  • batch_size (int) – Number of samples fed into the model at a time. Note that the batch size does not affect the evaluation results.

  • eval_fns (iterable of functions) – List of evaluation functions with the signature (true_df, prediction_df, **eval_kwargs) -> float. If None, the loss is calculated on true_df.

  • eval_kwargs – Evaluation functions' keyword arguments. Note that the prediction column name should be 'prediction'.

Returns:

Session run hook to evaluate the model while training.

Return type:

tf.train.SessionRunHook

recommenders.utils.tf_utils.export_model(model, train_input_fn, eval_input_fn, tf_feat_cols, base_dir)[source]#

Export TensorFlow estimator (model).

Parameters:
  • model (tf.estimator.Estimator) – Model to export.

  • train_input_fn (function) – Training input function to create data receiver spec.

  • eval_input_fn (function) – Evaluation input function to create data receiver spec.

  • tf_feat_cols (list(tf.feature_column)) – Feature columns.

  • base_dir (str) – Base directory to export the model.

Returns:

Exported model path

Return type:

str

recommenders.utils.tf_utils.pandas_input_fn(df, y_col=None, batch_size=128, num_epochs=1, shuffle=False, seed=None)[source]#

Pandas input function for TensorFlow high-level API Estimator. This function returns a function that creates a tf.data.Dataset.

Note

tf.estimator.inputs.pandas_input_fn cannot handle array/list columns properly. For more information, see https://www.tensorflow.org/api_docs/python/tf/estimator/inputs/numpy_input_fn

Parameters:
  • df (pandas.DataFrame) – Data containing features.

  • y_col (str) – Label column name if df has it.

  • batch_size (int) – Batch size for the input function.

  • num_epochs (int) – Number of epochs to iterate over data. If None, it will run forever.

  • shuffle (bool) – If True, shuffles the data queue.

  • seed (int) – Random seed for shuffle.

Returns:

Input function that creates a tf.data.Dataset.

Return type:

function
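Example (a sketch; column names are illustrative):

>>> import pandas as pd
>>> from recommenders.utils.tf_utils import pandas_input_fn
>>> df = pd.DataFrame({"userID": [1, 2], "itemID": [10, 20], "rating": [4.0, 3.0]})
>>> train_input_fn = pandas_input_fn(df, y_col="rating", batch_size=2, num_epochs=10, shuffle=True, seed=42)
>>> # model.train(input_fn=train_input_fn)   # with a tf.estimator.Estimator `model`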

recommenders.utils.tf_utils.pandas_input_fn_for_saved_model(df, feat_name_type)[source]#

Pandas input function for TensorFlow SavedModel.

Parameters:
  • df (pandas.DataFrame) – Data containing features.

  • feat_name_type (dict) – Feature name and type spec. E.g. {'userID': int, 'itemID': int, 'rating': float}.

Returns:

Input function

Return type:

func

Timer#

class recommenders.utils.timer.Timer[source]#

Timer class.


Examples

>>> import time
>>> t = Timer()
>>> t.start()
>>> time.sleep(1)
>>> t.stop()
>>> t.interval >= 1
True
>>> with Timer() as t:
...   time.sleep(1)
>>> t.interval >= 1
True
>>> "Time elapsed {}".format(t) 
'Time elapsed 1...'
__init__()[source]#
property interval#

Get time interval in seconds.

Returns:

Seconds.

Return type:

float

start()[source]#

Start the timer.

stop()[source]#

Stop the timer. Calculate the interval in seconds.

Plot utilities#

recommenders.utils.plot.line_graph(values, labels, x_guides=None, x_name=None, y_name=None, x_min_max=None, y_min_max=None, legend_loc=None, subplot=None, plot_size=(5, 5))[source]#

Plot line graph(s).

Parameters:
  • values (list(list(float or tuple)) or list(float or tuple)) – List of graphs, or a single graph, to plot. E.g. a graph = list(y) or list((y, x)).

  • labels (list(str) or str) – List of labels, or a single label for one graph. If labels is a string, this function assumes that values is a single graph.

  • x_guides (list(int)) – List of x positions at which to draw guidelines (vertical dotted lines).

  • x_name (str) – Label of the x axis.

  • y_name (str) – Label of the y axis.

  • x_min_max (list or tuple) – Min and max values of the x axis.

  • y_min_max (list or tuple) – Min and max values of the y axis.

  • legend_loc (str) – Legend location.

  • subplot (list or tuple) – matplotlib.pyplot.subplot format. E.g. to draw a 1 x 2 subplot, pass (1,2,1) for the first subplot and (1,2,2) for the second.

  • plot_size (list or tuple) – Plot size (width, height).
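Example comparing two curves on one plot (values are made up):

>>> from recommenders.utils.plot import line_graph
>>> train_loss = [0.9, 0.6, 0.4, 0.3]             # list(y): x defaults to the index
>>> valid_loss = [(0.8, 0), (0.7, 2), (0.6, 4)]   # list((y, x)) pairs
>>> line_graph(
...     values=[train_loss, valid_loss],
...     labels=["train", "validation"],
...     x_name="epoch",
...     y_name="loss",
...     legend_loc="upper right",
... )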