Recommender algorithms module#

Recommender system algorithms and utilities.

Cornac utilities#

recommenders.models.cornac.cornac_utils.predict(model, data, usercol='userID', itemcol='itemID', predcol='prediction')[source]#

Computes predictions of a recommender model from Cornac on the data. Can be used for computing rating metrics like RMSE.

Parameters:
  • model (cornac.models.Recommender) – A recommender model from Cornac

  • data (pandas.DataFrame) – The data on which to predict

  • usercol (str) – Name of the user column

  • itemcol (str) – Name of the item column

Returns:

Dataframe with usercol, itemcol, predcol

Return type:

pandas.DataFrame

recommenders.models.cornac.cornac_utils.predict_ranking(model, data, usercol='userID', itemcol='itemID', predcol='prediction', remove_seen=False)[source]#

Computes predictions of a recommender model from Cornac on all users and items in data. It can be used for computing ranking metrics like NDCG.

Parameters:
  • model (cornac.models.Recommender) – A recommender model from Cornac

  • data (pandas.DataFrame) – The data from which to get the users and items

  • usercol (str) – Name of the user column

  • itemcol (str) – Name of the item column

  • remove_seen (bool) – Flag to remove (user, item) pairs seen in the training data

Returns:

Dataframe with usercol, itemcol, predcol

Return type:

pandas.DataFrame
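
Examples

A minimal sketch of scoring a Cornac model with these utilities, assuming a pandas DataFrame with userID, itemID and rating columns (the data below is illustrative):

import cornac
import pandas as pd
from recommenders.models.cornac.cornac_utils import predict_ranking

train = pd.DataFrame({
    "userID": [1, 1, 2, 2, 3],
    "itemID": [10, 11, 10, 12, 11],
    "rating": [5.0, 4.0, 3.0, 5.0, 4.0],
})

# Build a Cornac dataset from (user, item, rating) tuples and fit BPR.
train_set = cornac.data.Dataset.from_uir(train.itertuples(index=False), seed=42)
bpr = cornac.models.BPR(k=16, max_iter=100, seed=42)
bpr.fit(train_set)

# Score every (user, item) pair, dropping pairs seen during training.
all_predictions = predict_ranking(
    bpr, train, usercol="userID", itemcol="itemID", remove_seen=True
)
print(all_predictions.head())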

DeepRec utilities#

Base model#

class recommenders.models.deeprec.models.base_model.BaseModel(hparams, iterator_creator, graph=None, seed=None)[source]#

Base class for models

__init__(hparams, iterator_creator, graph=None, seed=None)[source]#

Initialize the model. Create common logic needed by all deeprec models, such as the loss function and parameter set.

Parameters:
  • hparams (object) – An HParams object, holds the entire set of hyperparameters.

  • iterator_creator (object) – An iterator to load the data.

  • graph (object) – An optional graph.

  • seed (int) – Random seed.

eval(sess, feed_dict)[source]#

Evaluate the data in feed_dict with current model.

Parameters:
  • sess (object) – The model session object.

  • feed_dict (dict) – Feed values for evaluation. This is a dictionary that maps graph elements to values.

Returns:

A list of evaluated results, including total loss value, data loss value, predicted scores, and ground-truth labels.

Return type:

list

fit(train_file, valid_file, test_file=None)[source]#

Fit the model with train_file. Evaluate the model on valid_file per epoch to observe the training status. If test_file is not None, evaluate it too.

Parameters:
  • train_file (str) – training data set.

  • valid_file (str) – validation set.

  • test_file (str) – test set.

Returns:

An instance of self.

Return type:

object

group_labels(labels, preds, group_keys)[source]#

Divide labels and preds into several groups according to values in group keys.

Parameters:
  • labels (list) – ground truth label list.

  • preds (list) – prediction score list.

  • group_keys (list) – group key list.

Returns:

  • Labels after group.

  • Predictions after group.

Return type:

list, list

infer(sess, feed_dict)[source]#

Given feature data (in feed_dict), get predicted scores with current model.

Parameters:
  • sess (object) – The model session object.

  • feed_dict (dict) – Instances to predict. This is a dictionary that maps graph elements to values.

Returns:

Predicted scores for the given instances.

Return type:

list

load_model(model_path=None)[source]#

Load an existing model.

Parameters:

model_path – model path.

Raises:

IOError – if the restore operation failed.

predict(infile_name, outfile_name)[source]#

Make predictions on the given data, and output predicted scores to a file.

Parameters:
  • infile_name (str) – Input file name, format is same as train/val/test file.

  • outfile_name (str) – Output file name, each line is the predict score.

Returns:

An instance of self.

Return type:

object

run_eval(filename)[source]#

Evaluate the given file and returns some evaluation metrics.

Parameters:

filename (str) – A file name that will be evaluated.

Returns:

A dictionary that contains evaluation metrics.

Return type:

dict

train(sess, feed_dict)[source]#

Go through the optimization step once with training data in feed_dict.

Parameters:
  • sess (object) – The model session object.

  • feed_dict (dict) – Feed values to train the model. This is a dictionary that maps graph elements to values.

Returns:

A list of values, including update operation, total loss, data loss, and merged summary.

Return type:

list

Sequential base model#

class recommenders.models.deeprec.models.sequential.sequential_base_model.SequentialBaseModel(hparams, iterator_creator, graph=None, seed=None)[source]#

Base class for sequential models

__init__(hparams, iterator_creator, graph=None, seed=None)[source]#

Initialize the model. Create common logic needed by all sequential models, such as the loss function and parameter set.

Parameters:
  • hparams (HParams) – An HParams object that holds the entire set of hyperparameters.

  • iterator_creator (object) – An iterator to load the data.

  • graph (object) – An optional graph.

  • seed (int) – Random seed.

fit(train_file, valid_file, valid_num_ngs, eval_metric='group_auc')[source]#

Fit the model with train_file. Evaluate the model on valid_file per epoch to observe the training status.

Parameters:
  • train_file (str) – training data set.

  • valid_file (str) – validation set.

  • valid_num_ngs (int) – the number of negative instances with one positive instance in validation data.

  • eval_metric (str) – the metric that controls early stopping, e.g. “auc”, “group_auc”, etc.

Returns:

An instance of self.

Return type:

object

predict(infile_name, outfile_name)[source]#

Make predictions on the given data, and output predicted scores to a file.

Parameters:
  • infile_name (str) – Input file name.

  • outfile_name (str) – Output file name.

Returns:

An instance of self.

Return type:

object

run_eval(filename, num_ngs)[source]#

Evaluate the given file and returns some evaluation metrics.

Parameters:
  • filename (str) – A file name that will be evaluated.

  • num_ngs (int) – The number of negative sampling for a positive instance.

Returns:

A dictionary that contains evaluation metrics.

Return type:

dict

Iterators#

class recommenders.models.deeprec.io.iterator.BaseIterator[source]#

Abstract base iterator class

abstract gen_feed_dict(data_dict)[source]#

Abstract method. Construct a dictionary that maps graph elements to values.

Parameters:

data_dict (dict) – A dictionary that maps string name to numpy arrays.

abstract load_data_from_file(infile)[source]#

Abstract method. Read and parse data from a file.

Parameters:

infile (str) – Text input file. Each line in this file is an instance.

abstract parser_one_line(line)[source]#

Abstract method. Parse one string line into feature values.

Parameters:

line (str) – A string indicating one instance.

class recommenders.models.deeprec.io.iterator.FFMTextIterator(hparams, graph, col_spliter=' ', ID_spliter='%')[source]#

Data loader for FFM format based models, such as xDeepFM. Iterator will not load the whole data into memory. Instead, it loads data into memory per mini-batch, so that large files can be used as input data.

__init__(hparams, graph, col_spliter=' ', ID_spliter='%')[source]#

Initialize an iterator. Create the necessary placeholders for the model.

Parameters:
  • hparams (object) – Global hyper-parameters. Some key settings such as #_feature and #_field are there.

  • graph (object) – The running graph. All created placeholder will be added to this graph.

  • col_spliter (str) – column splitter in one line.

  • ID_spliter (str) – ID splitter in one line.

gen_feed_dict(data_dict)[source]#

Construct a dictionary that maps graph elements to values.

Parameters:

data_dict (dict) – A dictionary that maps string name to numpy arrays.

Returns:

A dictionary that maps graph elements to numpy arrays.

Return type:

dict

load_data_from_file(infile)[source]#

Read and parse data from a file.

Parameters:

infile (str) – Text input file. Each line in this file is an instance.

Returns:

An iterator that yields parsed results, in the format of graph feed_dict.

Return type:

object

parser_one_line(line)[source]#

Parse one string line into feature values.

Parameters:

line (str) – A string indicating one instance.

Returns:

Parsed results, including label, features and impression_id.

Return type:

list

class recommenders.models.deeprec.io.dkn_iterator.DKNTextIterator(hparams, graph, col_spliter=' ', ID_spliter='%')[source]#

Data loader for the DKN model. DKN requires a special type of data format, where each instance contains a label, the candidate news article, and user’s clicked news article. Articles are represented by title words and title entities. Words and entities are aligned.

Iterator will not load the whole data into memory. Instead, it loads data into memory per mini-batch, so that large files can be used as input data.

__init__(hparams, graph, col_spliter=' ', ID_spliter='%')[source]#

Initialize an iterator. Create necessary placeholders for the model.

Parameters:
  • hparams (object) – Global hyper-parameters. Some key settings such as #_feature and #_field are there.

  • graph (object) – The running graph. All created placeholders will be added to this graph.

  • col_spliter (str) – Column splitter in one line.

  • ID_spliter (str) – ID splitter in one line.

gen_feed_dict(data_dict)[source]#

Construct a dictionary that maps graph elements to values.

Parameters:

data_dict (dict) – a dictionary that maps string name to numpy arrays.

Returns:

A dictionary that maps graph elements to numpy arrays.

Return type:

dict

gen_infer_feed_dict(data_dict)[source]#

Construct a dictionary that maps graph elements to values.

Parameters:

data_dict (dict) – a dictionary that maps string name to numpy arrays.

Returns:

A dictionary that maps graph elements to numpy arrays.

Return type:

dict

load_data_from_file(infile)[source]#

Read and parse data from a file.

Parameters:

infile (str) – text input file. Each line in this file is an instance.

Yields:

obj, list, int

  • An iterator that yields parsed results, in the format of graph feed_dict.

  • Impression id list.

  • Size of the data in a batch.

load_infer_data_from_file(infile)[source]#

Read and parse data from a file for infer document embedding.

Parameters:

infile (str) – text input file. Each line in this file is an instance.

Yields:

obj, list, int

  • An iterator that yields parsed results, in the format of graph feed_dict.

  • Impression id list.

  • Size of the data in a batch.

parser_one_line(line)[source]#

Parse one string line into feature values.

Parameters:

line (str) – a string indicating one instance

Returns:

Parsed results including label, candidate_news_index, click_news_index, candidate_news_entity_index, click_news_entity_index, impression_id.

Return type:

list

class recommenders.models.deeprec.io.dkn_item2item_iterator.DKNItem2itemTextIterator(hparams, graph)[source]#
__init__(hparams, graph)[source]#

This new iterator is for DKN’s item-to-item recommendation version. The tutorial can be found in the corresponding notebook.

Compared with user-to-item recommendations, we don’t need the user behavior module. So the placeholder can be simplified from the original DKNTextIterator.

Parameters:
  • hparams (object) – Global hyper-parameters.

  • graph (object) – The running graph.

load_data_from_file(infile)[source]#

This function will return a mini-batch of data with features, by looking up news_word_index dictionary and news_entity_index dictionary according to the news article’s ID.

Parameters:

infile (str) – File path. Each line of infile is a news article’s ID.

Yields:

dict, list, int

  • A dictionary that maps graph elements to numpy arrays.

  • A list with news article’s ID.

  • Size of the data in a batch.

class recommenders.models.deeprec.io.nextitnet_iterator.NextItNetIterator(hparams, graph, col_spliter='\t')[source]#

Data loader for the NextItNet model.

NextItNet requires a special type of data format. In the training stage, each instance produces (sequence_length * train_num_ngs) target items and labels, so that NextItNet outputs predictions for every item in a sequence, not only the last one.

__init__(hparams, graph, col_spliter='\t')[source]#

Initialize an iterator. Create necessary placeholders for the model. This differs from the sequential iterator.

Parameters:
  • hparams (object) – Global hyper-parameters. Some key settings such as #_feature and #_field are there.

  • graph (object) – The running graph. All created placeholder will be added to this graph.

  • col_spliter (str) – Column splitter in one line.

class recommenders.models.deeprec.io.sequential_iterator.SequentialIterator(hparams, graph, col_spliter='\t')[source]#
__init__(hparams, graph, col_spliter='\t')[source]#

Initialize an iterator. Create necessary placeholders for the model.

Parameters:
  • hparams (object) – Global hyper-parameters. Some key settings such as #_feature and #_field are there.

  • graph (object) – The running graph. All created placeholder will be added to this graph.

  • col_spliter (str) – Column splitter in one line.

gen_feed_dict(data_dict)[source]#

Construct a dictionary that maps graph elements to values.

Parameters:

data_dict (dict) – A dictionary that maps string name to numpy arrays.

Returns:

A dictionary that maps graph elements to numpy arrays.

Return type:

dict

load_data_from_file(infile, batch_num_ngs=0, min_seq_length=1)[source]#

Read and parse data from a file.

Parameters:
  • infile (str) – Text input file. Each line in this file is an instance.

  • batch_num_ngs (int) – The number of negative samples drawn within the batch. 0 means no in-batch negative sampling is needed.

  • min_seq_length (int) – The minimum sequence length. Sequences shorter than min_seq_length will be ignored.

Yields:

object – An iterator that yields parsed results, in the format of graph feed_dict.

parse_file(input_file)[source]#

Parse the file into a list ready to be used for downstream tasks.

Parameters:

input_file – One of train, valid or test file which has never been parsed.

Returns:

A list with parsing result.

Return type:

list

parser_one_line(line)[source]#

Parse one string line into feature values.

Parameters:

line (str) – a string indicating one instance. This string contains tab-separated values including: label, user_hash, item_hash, item_cate, operation_time, item_history_sequence, item_cate_history_sequence, and time_history_sequence.

Returns:

Parsed results including label, user_id, item_id, item_cate, item_history_sequence, cate_history_sequence, current_time, time_diff, time_from_first_action, time_to_now.

Return type:

list

Data processing utilities#

class recommenders.models.deeprec.DataModel.ImplicitCF.ImplicitCF(train, test=None, adj_dir=None, col_user='userID', col_item='itemID', col_rating='rating', col_prediction='prediction', seed=None)[source]#

Data processing class for GCN models which use implicit feedback.

Initialize train and test set, create normalized adjacency matrix and sample data for training epochs.

__init__(train, test=None, adj_dir=None, col_user='userID', col_item='itemID', col_rating='rating', col_prediction='prediction', seed=None)[source]#

Constructor

Parameters:
  • adj_dir (str) – Directory to save / load adjacency matrices. If it is None, adjacency matrices will be created and will not be saved.

  • train (pandas.DataFrame) – Training data with at least columns (col_user, col_item, col_rating).

  • test (pandas.DataFrame) – Test data with at least columns (col_user, col_item, col_rating). test can be None, if so, we only process the training data.

  • col_user (str) – User column name.

  • col_item (str) – Item column name.

  • col_rating (str) – Rating column name.

  • seed (int) – Seed.

create_norm_adj_mat()[source]#

Create normalized adjacency matrix.

Returns:

Normalized adjacency matrix.

Return type:

scipy.sparse.csr_matrix

get_norm_adj_mat()[source]#

Load normalized adjacency matrix if it exists, otherwise create (and save) it.

Returns:

Normalized adjacency matrix.

Return type:

scipy.sparse.csr_matrix

train_loader(batch_size)[source]#

Sample train data every batch. One positive item and one negative item sampled for each user.

Parameters:

batch_size (int) – Batch size of users.

Returns:

  • Sampled users.

  • Sampled positive items.

  • Sampled negative items.

Return type:

numpy.ndarray, numpy.ndarray, numpy.ndarray

Utilities#

class recommenders.models.deeprec.deeprec_utils.HParams(hparams_dict)[source]#

Class for holding hyperparameters for DeepRec algorithms.

__init__(hparams_dict)[source]#

Create an HParams object from a dictionary of hyperparameter values.

Parameters:

hparams_dict (dict) – Dictionary with the model hyperparameters.

__repr__()[source]#

Return repr(self).

values()[source]#

Return the hyperparameter values as a dictionary.

Returns:

Dictionary with the hyperparameter values.

Return type:

dict

recommenders.models.deeprec.deeprec_utils.cal_metric(labels, preds, metrics)[source]#

Calculate metrics.

Available options are: auc, rmse, logloss, acc (accuracy), f1, mean_mrr, ndcg (format like: ndcg@2;4;6;8), hit (format like: hit@2;4;6;8), group_auc.

Parameters:
  • labels (array-like) – Labels.

  • preds (array-like) – Predictions.

  • metrics (list) – List of metric names.

Returns:

Metrics.

Return type:

dict

Examples

>>> cal_metric(labels, preds, ["ndcg@2;4;6", "group_auc"])
{'ndcg@2': 0.4026, 'ndcg@4': 0.4953, 'ndcg@6': 0.5346, 'group_auc': 0.8096}
recommenders.models.deeprec.deeprec_utils.check_nn_config(f_config)[source]#

Check neural networks configuration.

Parameters:

f_config (dict) – Neural network configuration.

Raises:

ValueError – If the parameters are not correct.

recommenders.models.deeprec.deeprec_utils.check_type(config)[source]#

Check that the config parameters are the correct type

Parameters:

config (dict) – Configuration dictionary.

Raises:

TypeError – If the parameters are not the correct type.

recommenders.models.deeprec.deeprec_utils.create_hparams(flags)[source]#

Create the model hyperparameters.

Parameters:

flags (dict) – Dictionary with the model requirements.

Returns:

Hyperparameter object.

Return type:

HParams

recommenders.models.deeprec.deeprec_utils.dcg_score(y_true, y_score, k=10)[source]#

Computing dcg score metric at k.

Parameters:
  • y_true (np.ndarray) – Ground-truth labels.

  • y_score (np.ndarray) – Predicted labels.

Returns:

dcg scores.

Return type:

np.ndarray

recommenders.models.deeprec.deeprec_utils.download_deeprec_resources(azure_container_url, data_path, remote_resource_name)[source]#

Download resources.

Parameters:
  • azure_container_url (str) – URL of Azure container.

  • data_path (str) – Path to download the resources.

  • remote_resource_name (str) – Name of the resource.

recommenders.models.deeprec.deeprec_utils.flat_config(config)[source]#

Flatten a config loaded from a yaml file into a flat dict.

Parameters:

config (dict) – Configuration loaded from a yaml file.

Returns:

Configuration dictionary.

Return type:

dict

recommenders.models.deeprec.deeprec_utils.hit_score(y_true, y_score, k=10)[source]#

Computing hit score metric at k.

Parameters:
  • y_true (np.ndarray) – ground-truth labels.

  • y_score (np.ndarray) – predicted labels.

Returns:

hit score.

Return type:

np.ndarray

recommenders.models.deeprec.deeprec_utils.load_dict(filename)[source]#

Load the vocabularies.

Parameters:

filename (str) – Filename of user, item or category vocabulary.

Returns:

A saved vocabulary.

Return type:

dict

recommenders.models.deeprec.deeprec_utils.load_yaml(filename)[source]#

Load a yaml file.

Parameters:

filename (str) – Filename.

Returns:

Dictionary.

Return type:

dict

recommenders.models.deeprec.deeprec_utils.mrr_score(y_true, y_score)[source]#

Computing mrr score metric.

Parameters:
  • y_true (np.ndarray) – Ground-truth labels.

  • y_score (np.ndarray) – Predicted labels.

Returns:

mrr scores.

Return type:

numpy.ndarray

recommenders.models.deeprec.deeprec_utils.ndcg_score(y_true, y_score, k=10)[source]#

Computing ndcg score metric at k.

Parameters:
  • y_true (np.ndarray) – Ground-truth labels.

  • y_score (np.ndarray) – Predicted labels.

Returns:

ndcg scores.

Return type:

numpy.ndarray

recommenders.models.deeprec.deeprec_utils.prepare_hparams(yaml_file=None, **kwargs)[source]#

Prepare the model hyperparameters and check that all have the correct value.

Parameters:

yaml_file (str) – YAML file as configuration.

Returns:

Hyperparameter object.

Return type:

HParams
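
Examples

A minimal sketch of building an HParams object; the yaml path and the overridden hyperparameter names are assumptions, use the configuration shipped with the model of interest:

from recommenders.models.deeprec.deeprec_utils import prepare_hparams

# "xDeepFM.yaml" is a hypothetical path to a model yaml configuration;
# the keyword arguments override the values loaded from it.
hparams = prepare_hparams(
    "xDeepFM.yaml",
    learning_rate=0.001,
    epochs=10,
    batch_size=128,
)
print(hparams.values())  # dictionary of all resolved hyperparameter values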

DKN#

class recommenders.models.deeprec.models.dkn.DKN(hparams, iterator_creator)[source]#

DKN model (Deep Knowledge-Aware Network)

Citation:

H. Wang, F. Zhang, X. Xie and M. Guo, “DKN: Deep Knowledge-Aware Network for News Recommendation”, in Proceedings of the 2018 World Wide Web Conference on World Wide Web, 2018.

__init__(hparams, iterator_creator)[source]#

Initialization steps for DKN. Compared with the BaseModel, DKN requires two different pre-computed embeddings, i.e. word embedding and entity embedding. After creating these two embedding variables, BaseModel’s __init__ method will be called.

Parameters:
  • hparams (object) – Global hyper-parameters.

  • iterator_creator (object) – DKN data loader class.

infer_embedding(sess, feed_dict)[source]#

Infer document embedding in feed_dict with current model.

Parameters:
  • sess (object) – The model session object.

  • feed_dict (dict) – Feed values for evaluation. This is a dictionary that maps graph elements to values.

Returns:

News embedding in a batch.

Return type:

list

run_get_embedding(infile_name, outfile_name)[source]#

Infer document embeddings with the current model.

Parameters:
  • infile_name (str) – Input file name, format is [Newsid] [w1,w2,w3…] [e1,e2,e3…]

  • outfile_name (str) – Output file name, format is [Newsid] [embedding]

Returns:

An instance of self.

Return type:

object
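
Examples

A minimal sketch of training DKN with DKNTextIterator; the yaml file, embedding files and data files are placeholders produced by DKN's data preparation step:

from recommenders.models.deeprec.deeprec_utils import prepare_hparams
from recommenders.models.deeprec.io.dkn_iterator import DKNTextIterator
from recommenders.models.deeprec.models.dkn import DKN

# All file paths below are assumptions; they come from DKN's data preparation step.
hparams = prepare_hparams(
    "dkn.yaml",
    news_feature_file="doc_feature.txt",
    wordEmb_file="word_embeddings.npy",
    entityEmb_file="entity_embeddings.npy",
    epochs=5,
)
model = DKN(hparams, DKNTextIterator)
model.fit("train.txt", "valid.txt")
print(model.run_eval("valid.txt"))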

DKN item-to-item#

class recommenders.models.deeprec.models.dkn_item2item.DKNItem2Item(hparams, iterator_creator)[source]#

Class for item-to-item recommendations using DKN. See microsoft/recommenders

eval(sess, feed_dict)[source]#

Evaluate the data in feed_dict with current model.

Parameters:
  • sess (object) – The model session object.

  • feed_dict (dict) – Feed values for evaluation. This is a dictionary that maps graph elements to values.

Returns:

A tuple with predictions and labels arrays.

Return type:

numpy.ndarray, numpy.ndarray

run_eval(filename)[source]#

Evaluate the given file and returns some evaluation metrics.

Parameters:

filename (str) – A file name that will be evaluated.

Returns:

A dictionary containing evaluation metrics.

Return type:

dict

xDeepFM#

class recommenders.models.deeprec.models.xDeepFM.XDeepFMModel(hparams, iterator_creator, graph=None, seed=None)[source]#

xDeepFM model

Citation:

J. Lian, X. Zhou, F. Zhang, Z. Chen, X. Xie, G. Sun, “xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems”, in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, KDD 2018, London, 2018.
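
Examples

A minimal sketch of training xDeepFM on FFM-format files with FFMTextIterator; the file paths and hyperparameter values are assumptions:

from recommenders.models.deeprec.deeprec_utils import prepare_hparams
from recommenders.models.deeprec.io.iterator import FFMTextIterator
from recommenders.models.deeprec.models.xDeepFM import XDeepFMModel

# FEATURE_COUNT and FIELD_COUNT must match the FFM files; the values are illustrative.
hparams = prepare_hparams("xDeepFM.yaml", FEATURE_COUNT=1000, FIELD_COUNT=10, epochs=5)
model = XDeepFMModel(hparams, FFMTextIterator, seed=42)
model.fit("train.ffm", "valid.ffm")      # paths to FFM-format train/validation files
print(model.run_eval("test.ffm"))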

LightGCN#

class recommenders.models.deeprec.models.graphrec.lightgcn.LightGCN(hparams, data, seed=None)[source]#

LightGCN model

Citation:

He, Xiangnan, Kuan Deng, Xiang Wang, Yan Li, Yongdong Zhang, and Meng Wang. “LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation.” arXiv preprint arXiv:2002.02126, 2020.

__init__(hparams, data, seed=None)[source]#

Initializing the model. Create parameters, placeholders, embeddings and loss function.

Parameters:
  • hparams (HParams) – An HParams object that holds the entire set of hyperparameters.

  • data (object) – An initialized recommenders.models.deeprec.DataModel.ImplicitCF object that loads and processes the data.

  • seed (int) – Seed.

fit()[source]#

Fit the model on self.data.train. If eval_epoch is not -1, evaluate the model on self.data.test every eval_epoch epoch to observe the training status.

infer_embedding(user_file, item_file)[source]#

Export user and item embeddings to csv files.

Parameters:
  • user_file (str) – Path of file to save user embeddings.

  • item_file (str) – Path of file to save item embeddings.

load(model_path=None)[source]#

Load an existing model.

Parameters:

model_path – Model path.

Raises:

IOError – if the restore operation failed.

recommend_k_items(test, top_k=10, sort_top_k=True, remove_seen=True, use_id=False)[source]#

Recommend top K items for all users in the test set.

Parameters:
  • test (pandas.DataFrame) – Test data.

  • top_k (int) – Number of top items to recommend.

  • sort_top_k (bool) – Flag to sort top k results.

  • remove_seen (bool) – Flag to remove items seen in training from recommendation.

Returns:

Top k recommendation items for each user.

Return type:

pandas.DataFrame

run_eval()[source]#

Run evaluation on self.data.test.

Returns:

Results of all metrics in self.metrics.

Return type:

dict

score(user_ids, remove_seen=True)[source]#

Score all items for test users.

Parameters:
  • user_ids (np.array) – Users to test.

  • remove_seen (bool) – Flag to remove items seen in training from recommendation.

Returns:

Value of interest of all items for the users.

Return type:

numpy.ndarray
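
Examples

A minimal sketch wiring ImplicitCF and LightGCN together; train_df and test_df are assumed pandas DataFrames with userID, itemID and rating columns, and the yaml path is a placeholder:

from recommenders.models.deeprec.deeprec_utils import prepare_hparams
from recommenders.models.deeprec.DataModel.ImplicitCF import ImplicitCF
from recommenders.models.deeprec.models.graphrec.lightgcn import LightGCN

# train_df / test_df: pandas DataFrames prepared beforehand (assumed).
data = ImplicitCF(train=train_df, test=test_df, seed=42)
hparams = prepare_hparams("lightgcn.yaml", n_layers=3, batch_size=1024, epochs=5)

model = LightGCN(hparams, data, seed=42)
model.fit()                                            # trains on data.train
topk = model.recommend_k_items(test_df, top_k=10, remove_seen=True)
print(topk.head())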

A2SVD#

class recommenders.models.deeprec.models.sequential.asvd.A2SVDModel(hparams, iterator_creator, graph=None, seed=None)[source]#

A2SVD Model (Attentive Asynchronous Singular Value Decomposition)

It extends ASVD with an attention module.

Citation:

ASVD: Y. Koren, “Factorization Meets the Neighborhood: a Multifaceted Collaborative Filtering Model”, in Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 426–434, ACM, 2008.

A2SVD: Z. Yu, J. Lian, A. Mahmoody, G. Liu and X. Xie, “Adaptive User Modeling with Long and Short-Term Preferences for Personalized Recommendation”, in Proceedings of the 28th International Joint Conferences on Artificial Intelligence, IJCAI’19, Pages 4213-4219, AAAI Press, 2019.

Caser#

class recommenders.models.deeprec.models.sequential.caser.CaserModel(hparams, iterator_creator, seed=None)[source]#

Caser Model

Citation:

J. Tang and K. Wang, “Personalized top-n sequential recommendation via convolutional sequence embedding”, in Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, ACM, 2018.

__init__(hparams, iterator_creator, seed=None)[source]#

Initialization of variables for Caser.

Parameters:
  • hparams (HParams) – An HParams object that holds the entire set of hyperparameters.

  • iterator_creator (object) – An iterator to load the data.

GRU#

NextItNet#

class recommenders.models.deeprec.models.sequential.nextitnet.NextItNetModel(hparams, iterator_creator, graph=None, seed=None)[source]#

NextItNet Model

Citation:

Yuan, Fajie, et al. “A Simple Convolutional Generative Network for Next Item Recommendation”, in Web Search and Data Mining, 2019.

Note

It requires a dataset with strong sequential patterns.

RNN Cells#

Module implementing RNN Cells.

This module provides a number of basic commonly used RNN cells, such as LSTM (Long Short Term Memory) or GRU (Gated Recurrent Unit), and a number of operators that allow adding dropouts, projections, or embeddings for inputs. Constructing multi-layer cells is supported by the class MultiRNNCell, or by calling the rnn ops several times.

class recommenders.models.deeprec.models.sequential.rnn_cell_implement.Time4ALSTMCell(*args, **kwargs)[source]#
call(inputs, state)[source]#

Call method for the Time4ALSTMCell.

Parameters:
  • inputs – A 2D Tensor of shape [batch_size, input_size].

  • state – A 2D Tensor of shape [batch_size, state_size].

Returns:

  • A 2D Tensor of shape [batch_size, output_size].

  • A 2D Tensor of shape [batch_size, state_size].

Return type:

A tuple containing

property output_size#

size of outputs produced by this cell.

Type:

Integer or TensorShape

property state_size#

size(s) of state(s) used by this cell.

It can be represented by an Integer, a TensorShape or a tuple of Integers or TensorShapes.

class recommenders.models.deeprec.models.sequential.rnn_cell_implement.Time4LSTMCell(*args, **kwargs)[source]#
call(inputs, state)[source]#

Call method for the Time4LSTMCell.

Parameters:
  • inputs – A 2D Tensor of shape [batch_size, input_size].

  • state – A 2D Tensor of shape [batch_size, state_size].

Returns:

  • A 2D Tensor of shape [batch_size, output_size].

  • A 2D Tensor of shape [batch_size, state_size].

Return type:

A tuple containing

property output_size#

size of outputs produced by this cell.

Type:

Integer or TensorShape

property state_size#

size(s) of state(s) used by this cell.

It can be represented by an Integer, a TensorShape or a tuple of Integers or TensorShapes.

SUM#

SLIRec#

class recommenders.models.deeprec.models.sequential.sli_rec.SLI_RECModel(hparams, iterator_creator, graph=None, seed=None)[source]#

SLI Rec model

Citation:

Z. Yu, J. Lian, A. Mahmoody, G. Liu and X. Xie, “Adaptive User Modeling with Long and Short-Term Preferences for Personalized Recommendation”, in Proceedings of the 28th International Joint Conferences on Artificial Intelligence, IJCAI’19, Pages 4213-4219, AAAI Press, 2019.
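
Examples

A minimal end-to-end sketch of the sequential model flow using SLI_RECModel with SequentialIterator; the yaml, vocabulary and data file paths are placeholders produced by the data preparation step:

from recommenders.models.deeprec.deeprec_utils import prepare_hparams
from recommenders.models.deeprec.io.sequential_iterator import SequentialIterator
from recommenders.models.deeprec.models.sequential.sli_rec import SLI_RECModel

# All file paths are assumptions created during data preparation.
hparams = prepare_hparams(
    "sli_rec.yaml",
    user_vocab="user_vocab.pkl",
    item_vocab="item_vocab.pkl",
    cate_vocab="category_vocab.pkl",
    epochs=5,
)
model = SLI_RECModel(hparams, SequentialIterator, seed=42)
model.fit("train.tsv", "valid.tsv", valid_num_ngs=4)   # 4 negatives per positive in validation
print(model.run_eval("test.tsv", num_ngs=9))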

FastAI utilities#

recommenders.models.fastai.fastai_utils.cartesian_product(*arrays)[source]#

Helper function to compute the Cartesian product of the input arrays, used by the fastai algorithm.

Parameters:

arrays (tuple of numpy.ndarray) – Input arrays

Returns:

product

Return type:

numpy.ndarray

recommenders.models.fastai.fastai_utils.hide_fastai_progress_bar()[source]#

Hide fastai progress bar

recommenders.models.fastai.fastai_utils.score(learner, test_df, user_col='userID', item_col='itemID', prediction_col='prediction', top_k=None)[source]#

Score all user+item pairs provided and reduce to the top_k items per user if top_k > 0.

Parameters:
  • learner (object) – Model.

  • test_df (pandas.DataFrame) – Test dataframe.

  • user_col (str) – User column name.

  • item_col (str) – Item column name.

  • prediction_col (str) – Prediction column name.

  • top_k (int) – Number of top items to recommend.

Returns:

Result of recommendation

Return type:

pandas.DataFrame

LightFM utilities#

recommenders.models.lightfm.lightfm_utils.compare_metric(df_list, metric='prec', stage='test')[source]#

Function to combine and prepare list of dataframes into tidy format.

Parameters:
  • df_list (list) – List of dataframes

  • metric (str) – name of metric to be extracted, optional

  • stage (str) – name of model fitting stage to be extracted, optional

Returns:

Metrics

Return type:

pandas.DataFrame

recommenders.models.lightfm.lightfm_utils.model_perf_plots(df)[source]#

Function to plot model performance metrics.

Parameters:

df (pandas.DataFrame) – Dataframe in tidy format, with [‘epoch’,’level’,’value’] columns

Returns:

matplotlib axes

Return type:

object

recommenders.models.lightfm.lightfm_utils.prepare_all_predictions(data, uid_map, iid_map, interactions, model, num_threads, user_features=None, item_features=None)[source]#

Function to prepare all predictions for evaluation.

Parameters:
  • data (pandas.DataFrame) – Dataframe of all users, items and ratings as loaded.

  • uid_map (dict) – Keys to map internal user indices to external ids.

  • iid_map (dict) – Keys to map internal item indices to external ids.

  • interactions (np.float32 coo_matrix) – User-item interactions.

  • model (LightFM instance) – Fitted LightFM model.

  • num_threads (int) – Number of parallel computation threads.

  • user_features (np.float32 csr_matrix) – User weights over features.

  • item_features (np.float32 csr_matrix) – Item weights over features.

Returns:

all predictions

Return type:

pandas.DataFrame

recommenders.models.lightfm.lightfm_utils.prepare_test_df(test_idx, uids, iids, uid_map, iid_map, weights)[source]#

Function to prepare test df for evaluation

Parameters:
  • test_idx (slice) – slice of test indices

  • uids (numpy.ndarray) – Array of internal user indices

  • iids (numpy.ndarray) – Array of internal item indices

  • uid_map (dict) – Keys to map internal user indices to external ids.

  • iid_map (dict) – Keys to map internal item indices to external ids.

  • weights (numpy.float32 coo_matrix) – user-item interaction

Returns:

user-item selected for testing

Return type:

pandas.DataFrame

recommenders.models.lightfm.lightfm_utils.similar_items(item_id, item_features, model, N=10)[source]#

Function to return top N similar items based on lyst/lightfm#244

Parameters:
  • item_id (int) – id of item to be used as reference

  • item_features (scipy sparse CSR matrix) – item feature matrix

  • model (LightFM instance) – fitted LightFM model

  • N (int) – Number of top similar items to return

Returns:

top N most similar items with score

Return type:

pandas.DataFrame

recommenders.models.lightfm.lightfm_utils.similar_users(user_id, user_features, model, N=10)[source]#

Function to return top N similar users based on lyst/lightfm#244

Parameters:
  • user_id (int) – id of user to be used as reference

  • user_features (scipy sparse CSR matrix) – user feature matrix

  • model (LightFM instance) – fitted LightFM model

  • N (int) – Number of top similar users to return

Returns:

top N most similar users with score

Return type:

pandas.DataFrame

recommenders.models.lightfm.lightfm_utils.track_model_metrics(model, train_interactions, test_interactions, k=10, no_epochs=100, no_threads=8, show_plot=True, **kwargs)[source]#

Function to record model’s performance at each epoch, formats the performance into tidy format, plots the performance and outputs the performance data.

Parameters:
  • model (LightFM instance) – fitted LightFM model

  • train_interactions (scipy sparse COO matrix) – train interactions set

  • test_interactions (scipy sparse COO matrix) – test interaction set

  • k (int) – number of recommendations, optional

  • no_epochs (int) – Number of epochs to run, optional

  • no_threads (int) – Number of parallel threads to use, optional

  • **kwargs – other keyword arguments to be passed down

Returns:

  • Performance traces of the fitted model

  • Fitted model

  • Matplotlib axes with the performance plot (side effect of the method)

Return type:

pandas.DataFrame, LightFM model, matplotlib axes
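
Examples

A minimal sketch of tracking a LightFM model's metrics per epoch; train_interactions and test_interactions are assumed scipy sparse COO matrices built with lightfm.data.Dataset:

from lightfm import LightFM
from recommenders.models.lightfm.lightfm_utils import track_model_metrics

model = LightFM(loss="warp", no_components=20, random_state=42)

# train_interactions / test_interactions: COO matrices prepared beforehand (assumed).
output, fitted_model, _ = track_model_metrics(
    model=model,
    train_interactions=train_interactions,
    test_interactions=test_interactions,
    k=10,
    no_epochs=20,
    show_plot=False,
)
print(output.head())   # tidy dataframe of the tracked metric values per epoch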

LightGBM utilities#

class recommenders.models.lightgbm.lightgbm_utils.NumEncoder(cate_cols, nume_cols, label_col, threshold=10, thresrate=0.99)[source]#

Encode all the categorical features into numerical ones by sequential label encoding, sequential count encoding, and binary encoding. Additionally, it also filters the low-frequency categories and fills the missing values.

fit_transform(df)[source]#

Input a training set (pandas.DataFrame) and return the converted 2 numpy.ndarray (x,y).

Parameters:

df (pandas.DataFrame) – Input dataframe

Returns:

New features and labels.

Return type:

numpy.ndarray, numpy.ndarray

transform(df)[source]#

Input a testing / validation set (pandas.DataFrame) and return the converted 2 numpy.ndarray (x,y).

Parameters:

df (pandas.DataFrame) – Input dataframe

Returns:

New features and labels.

Return type:

numpy.ndarray, numpy.ndarray
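
Examples

A minimal sketch of using NumEncoder; the column names and the train_df/valid_df dataframes are assumptions:

from recommenders.models.lightgbm.lightgbm_utils import NumEncoder

# Hypothetical column split: categorical columns, numerical columns and the label column.
encoder = NumEncoder(
    cate_cols=["user_id", "item_id"],
    nume_cols=["price"],
    label_col="label",
)
x_train, y_train = encoder.fit_transform(train_df)   # fit on the training dataframe (assumed)
x_valid, y_valid = encoder.transform(valid_df)       # reuse the fitted encoder on validation data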

recommenders.models.lightgbm.lightgbm_utils.unpackbits(x, num_bits)[source]#

Convert a decimal value numpy.ndarray into multi-binary value numpy.ndarray ([1,2]->[[0,1],[1,0]])

Parameters:
  • x (numpy.ndarray) – Decimal array.

  • num_bits (int) – The max length of the converted binary value.

NCF#

class recommenders.models.ncf.dataset.DataFile(filename, col_user, col_item, col_rating, col_test_batch=None, binary=True)[source]#

DataFile class for NCF. Iterator to read data from a csv file. Data must be sorted by user. Includes utilities for loading user data from file, formatting it and returning a Pandas dataframe.

__init__(filename, col_user, col_item, col_rating, col_test_batch=None, binary=True)[source]#

Constructor

Parameters:
  • filename (str) – Path to file to be processed.

  • col_user (str) – User column name.

  • col_item (str) – Item column name.

  • col_rating (str) – Rating column name.

  • col_test_batch (str) – Test batch column name.

  • binary (bool) – If true, set rating > 0 to rating = 1.

load_data(key, by_user=True)[source]#

Load data for a specified user or test batch

Parameters:
  • key (int) – user or test batch index

  • by_user (bool) – load data by user if True, else by test batch

Returns:

pandas.DataFrame

class recommenders.models.ncf.dataset.Dataset(train_file, test_file=None, test_file_full=None, overwrite_test_file_full=False, n_neg=4, n_neg_test=100, col_user='userID', col_item='itemID', col_rating='rating', binary=True, seed=None, sample_with_replacement=False, print_warnings=False)[source]#

Dataset class for NCF

__init__(train_file, test_file=None, test_file_full=None, overwrite_test_file_full=False, n_neg=4, n_neg_test=100, col_user='userID', col_item='itemID', col_rating='rating', binary=True, seed=None, sample_with_replacement=False, print_warnings=False)[source]#

Constructor

Parameters:
  • train_file (str) – Path to training dataset file.

  • test_file (str) – Path to test dataset file for leave-one-out evaluation.

  • test_file_full (str) – Path to full test dataset file including negative samples.

  • overwrite_test_file_full (bool) – If true, recreate and overwrite test_file_full.

  • n_neg (int) – Number of negative samples per positive example for training set.

  • n_neg_test (int) – Number of negative samples per positive example for test set.

  • col_user (str) – User column name.

  • col_item (str) – Item column name.

  • col_rating (str) – Rating column name.

  • binary (bool) – If true, set rating > 0 to rating = 1.

  • seed (int) – Seed.

  • sample_with_replacement (bool) – If true, sample negative examples with replacement, otherwise without replacement.

  • print_warnings (bool) – If true, prints warnings if sampling without replacement and there are not enough items to sample from to satisfy n_neg or n_neg_test.

test_loader(yield_id=False)[source]#

Generator for serving batches of test data for leave-one-out evaluation. Data is loaded from test_file_full.

Parameters:

yield_id (bool) – If true, return assigned user and item IDs, else return original values.

Returns:

list

train_loader(batch_size, shuffle_size=None, yield_id=False, write_to=None)[source]#

Generator for serving batches of training data. Positive examples are loaded from the original training file, to which negative samples are added. Data is loaded in memory into a shuffle buffer up to a maximum of shuffle_size rows, before the data is shuffled and released. If out-of-memory errors are encountered, try reducing shuffle_size.

Parameters:
  • batch_size (int) – Number of examples in each batch.

  • shuffle_size (int) – Maximum number of examples in shuffle buffer.

  • yield_id (bool) – If true, return assigned user and item IDs, else return original values.

  • write_to (str) – Path of file to write full dataset (including negative examples).

Returns:

list

exception recommenders.models.ncf.dataset.EmptyFileException[source]#

Exception raised if file is empty

exception recommenders.models.ncf.dataset.FileNotSortedException[source]#

Exception raised if file is not sorted correctly

exception recommenders.models.ncf.dataset.MissingFieldsException[source]#

Exception raised if file is missing expected fields

exception recommenders.models.ncf.dataset.MissingUserException[source]#

Exception raised if user is not in file

class recommenders.models.ncf.dataset.NegativeSampler(user, n_samples, user_positive_item_pool, item_pool, sample_with_replacement, print_warnings=True, training=True)[source]#

NegativeSampler class for NCF. Samples a subset of negative items from a given population of items.

__init__(user, n_samples, user_positive_item_pool, item_pool, sample_with_replacement, print_warnings=True, training=True)[source]#

Constructor

Parameters:
  • user (str or int) – User to be sampled for.

  • n_samples (int) – Number of required samples.

  • user_positive_item_pool (set) – Set of items with which user has previously interacted.

  • item_pool (set) – Set of all items in population.

  • sample_with_replacement (bool) – If true, sample negative examples with replacement, otherwise without replacement.

  • print_warnings (bool) – If true, prints warnings if sampling without replacement and there are not enough items to sample from to satisfy n_neg or n_neg_test.

  • training (bool) – Set to true if sampling for the training set or false if for the test set.

sample()[source]#

Method for sampling uniformly from a population of negative items

Returns: list

class recommenders.models.ncf.ncf_singlenode.NCF(n_users, n_items, model_type='NeuMF', n_factors=8, layer_sizes=[16, 8, 4], n_epochs=50, batch_size=64, learning_rate=0.005, verbose=1, seed=None)[source]#

Neural Collaborative Filtering (NCF) implementation

Citation:

He, Xiangnan, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. “Neural collaborative filtering.” In Proceedings of the 26th International Conference on World Wide Web, pp. 173-182. International World Wide Web Conferences Steering Committee, 2017. Link: https://www.comp.nus.edu.sg/~xiangnan/papers/ncf.pdf

__init__(n_users, n_items, model_type='NeuMF', n_factors=8, layer_sizes=[16, 8, 4], n_epochs=50, batch_size=64, learning_rate=0.005, verbose=1, seed=None)[source]#

Constructor

Parameters:
  • n_users (int) – Number of users in the dataset.

  • n_items (int) – Number of items in the dataset.

  • model_type (str) – Model type.

  • n_factors (int) – Dimension of latent space.

  • layer_sizes (list) – Number of layers for MLP.

  • n_epochs (int) – Number of epochs for training.

  • batch_size (int) – Batch size.

  • learning_rate (float) – Learning rate.

  • verbose (int) – Whether to show the training output or not.

  • seed (int) – Seed.

fit(data)[source]#

Fit model with training data

Parameters:

data (NCFDataset) – Initialized Dataset from ./dataset.py.

load(gmf_dir=None, mlp_dir=None, neumf_dir=None, alpha=0.5)[source]#

Load model parameters for further use.

GMF model –> load parameters in gmf_dir

MLP model –> load parameters in mlp_dir

NeuMF model –> load parameters in neumf_dir or in gmf_dir and mlp_dir

Parameters:
  • gmf_dir (str) – Directory name for GMF model.

  • mlp_dir (str) – Directory name for MLP model.

  • neumf_dir (str) – Directory name for neumf model.

  • alpha (float) – The concatenation hyper-parameter for the GMF and MLP output layers.

Returns:

The model with the loaded parameters.

Return type:

object

predict(user_input, item_input, is_list=False)[source]#

Predict function of this trained model

Parameters:
  • user_input (list or element of list) – userID or userID list

  • item_input (list or element of list) – itemID or itemID list

  • is_list (bool) – If true, the input is a list; note that list-wise prediction is faster than element-wise prediction.

Returns:

A list of predicted ratings, or a single predicted rating score.

Return type:

list or float

save(dir_name)[source]#

Save model parameters in dir_name

Parameters:

dir_name (str) – Directory name, which should be a folder name instead of a file name; a new directory will be created if it does not exist.
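
Examples

A minimal sketch of the NCF training flow; the csv paths (files sorted by user, with userID/itemID/rating columns) are assumptions:

from recommenders.models.ncf.dataset import Dataset as NCFDataset
from recommenders.models.ncf.ncf_singlenode import NCF

# Training and test csv files prepared beforehand (assumed paths).
data = NCFDataset(train_file="train.csv", test_file="test.csv", seed=42)

model = NCF(
    n_users=data.n_users,
    n_items=data.n_items,
    model_type="NeuMF",
    n_factors=8,
    layer_sizes=[16, 8, 4],
    n_epochs=10,
    batch_size=256,
    seed=42,
)
model.fit(data)

# Predict a single (user, item) pair using the original ids from the data files.
print(model.predict(1, 10))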

NewsRec utilities#

Base model#

class recommenders.models.newsrec.models.base_model.BaseModel(hparams, iterator_creator, seed=None)[source]#

Base class for models.

hparams#

A HParams object, holds the entire set of hyperparameters.

Type:

HParams

train_iterator#

An iterator to load the data in training steps.

Type:

object

test_iterator#

An iterator to load the data in testing steps.

Type:

object

graph#

An optional graph.

Type:

object

seed#

Random seed.

Type:

int

__init__(hparams, iterator_creator, seed=None)[source]#

Initialize the model. Create common logic needed by all newsrec models, such as the loss function and parameter set.

Parameters:
  • hparams (HParams) – A HParams object, holds the entire set of hyperparameters.

  • iterator_creator (object) – An iterator to load the data.

  • graph (object) – An optional graph.

  • seed (int) – Random seed.

eval(eval_batch_data)[source]#

Evaluate the given batch of data with the current model.

Parameters:

eval_batch_data (object) – Input batch data for evaluation.

Returns:

A list of evaluated results, including total loss value, data loss value, predicted scores, and ground-truth labels.

Return type:

list

fit(train_news_file, train_behaviors_file, valid_news_file, valid_behaviors_file, test_news_file=None, test_behaviors_file=None, step_limit=None)[source]#

Fit the model with train_news_file and train_behaviors_file. Evaluate the model on the validation files per epoch to observe the training status. If test_news_file is not None, evaluate on it too.

Parameters:
  • train_news_file (str) – Training news file.

  • train_behaviors_file (str) – Training behaviors file.

  • valid_news_file (str) – Validation news file.

  • valid_behaviors_file (str) – Validation behaviors file.

  • test_news_file (str) – Test news file.

  • test_behaviors_file (str) – Test behaviors file.

  • step_limit (int) – Maximum number of training steps, optional.

Returns:

An instance of self.

Return type:

object

group_labels(labels, preds, group_keys)[source]#

Divide labels and preds into several groups according to values in group keys.

Parameters:
  • labels (list) – ground truth label list.

  • preds (list) – prediction score list.

  • group_keys (list) – group key list.

Returns:

  • Keys after group.

  • Labels after group.

  • Preds after group.

Return type:

list, list, list

run_eval(news_filename, behaviors_file)[source]#

Evaluate the given file and returns some evaluation metrics.

Parameters:
  • news_filename (str) – News file to be evaluated.

  • behaviors_file (str) – Behaviors file to be evaluated.

Returns:

A dictionary that contains evaluation metrics.

Return type:

dict

train(train_batch_data)[source]#

Go through the optimization step once with the given batch of training data.

Parameters:

train_batch_data (object) – Input batch data for training.

Returns:

A list of values, including update operation, total loss, data loss, and merged summary.

Return type:

list
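
Examples

A minimal sketch of the newsrec flow with a concrete subclass (NRMSModel, not documented in this section) and MINDIterator; prepare_hparams comes from recommenders.models.newsrec.newsrec_utils, and all file paths are placeholders from the MIND data preparation step:

from recommenders.models.newsrec.newsrec_utils import prepare_hparams
from recommenders.models.newsrec.io.mind_iterator import MINDIterator
from recommenders.models.newsrec.models.nrms import NRMSModel

# File paths are assumptions; they are produced when downloading and preparing MIND.
hparams = prepare_hparams(
    "nrms.yaml",
    wordEmb_file="embedding.npy",
    wordDict_file="word_dict.pkl",
    userDict_file="uid2index.pkl",
    epochs=5,
)
model = NRMSModel(hparams, MINDIterator, seed=42)
model.fit("train_news.tsv", "train_behaviors.tsv", "valid_news.tsv", "valid_behaviors.tsv")
print(model.run_eval("valid_news.tsv", "valid_behaviors.tsv"))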

Iterators#

class recommenders.models.newsrec.io.mind_iterator.MINDIterator(hparams, npratio=-1, col_spliter='\t', ID_spliter='%')[source]#

Train data loader for the NAML model. The model requires a special data format, where each instance contains a label, impression id, user id, the candidate news articles and the user’s clicked news articles. Articles are represented by title words, body words, verts and subverts.

Iterator will not load the whole data into memory. Instead, it loads data into memory per mini-batch, so that large files can be used as input data.

col_spliter#

column spliter in one line.

Type:

str

ID_spliter#

ID spliter in one line.

Type:

str

batch_size#

The number of samples in one batch.

Type:

int

title_size#

Maximum number of words in a news title.

Type:

int

his_size#

Maximum number of clicked news items in a user’s click history.

Type:

int

npratio#

Negative to positive ratio used in negative sampling. -1 means no negative sampling is needed.

Type:

int

__init__(hparams, npratio=-1, col_spliter='\t', ID_spliter='%')[source]#

Initialize an iterator. Create necessary placeholders for the model.

Parameters:
  • hparams (object) – Global hyper-parameters. Some key settings such as head_num and head_dim are there.

  • npratio (int) – Negative to positive ratio used in negative sampling. -1 means no negative sampling is needed.

  • col_spliter (str) – Column splitter in one line.

  • ID_spliter (str) – ID splitter in one line.

init_behaviors(behaviors_file)[source]#

Init behavior logs given behaviors file.

Parameters:

behaviors_file (str) – Path of behaviors file.

init_news(news_file)[source]#

Init news information given news file, such as news_title_index and nid2index.

Parameters:

news_file (str) – Path of news file.

load_data_from_file(news_file, behavior_file)[source]#

Read and parse data from news file and behavior file.

Parameters:
  • news_file (str) – A file that contains information about news articles.

  • behavior_file (str) – A file that contains information about user impressions.

Yields:

object – An iterator that yields parsed results, in the format of dict.

load_dict(file_path)[source]#

Load a pickle file.

Parameters:

file_path (str) – File path.

Returns:

The loaded object.

Return type:

object

load_impression_from_file(behaivors_file)[source]#

Read and parse impression data from the behaviors file.

Parameters:

behaivors_file (str) – A file that contains information about user behaviors.

Yields:

object – An iterator that yields parsed impression data, in the format of dict.

load_news_from_file(news_file)[source]#

Read and parse news data from the news file.

Parameters:

news_file (str) – A file that contains information about news articles.

Yields:

object – An iterator that yields parsed news feature, in the format of dict.

load_user_from_file(news_file, behavior_file)[source]#

Read and parse user data from news file and behavior file.

Parameters:
  • news_file (str) – A file that contains information about news articles.

  • behavior_file (str) – A file that contains information about user impressions.

Yields:

object – An iterator that yields parsed user feature, in the format of dict.

parser_one_line(line)[source]#

Parse one behavior sample into feature values. If npratio is larger than 0, return negative sampled results.

Parameters:

line (int) – sample index.

Yields:

list – Parsed results including label, impression id, user id, candidate_title_index, clicked_title_index.

class recommenders.models.newsrec.io.mind_all_iterator.MINDAllIterator(hparams, npratio=-1, col_spliter='\t', ID_spliter='%')[source]#

Train data loader for the NAML model. The model requires a special data format, where each instance contains a label, impression id, user id, the candidate news articles and the user’s clicked news articles. Articles are represented by title words, body words, verts and subverts.

Iterator will not load the whole data into memory. Instead, it loads data into memory per mini-batch, so that large files can be used as input data.

col_spliter#

column spliter in one line.

Type:

str

ID_spliter#

ID spliter in one line.

Type:

str

batch_size#

The number of samples in one batch.

Type:

int

title_size#

Maximum number of words in a news title.

Type:

int

body_size#

Maximum number of words in the news body (the abstract is used in MIND).

Type:

int

his_size#

Maximum number of clicked news items in a user’s click history.

Type:

int

npratio#

Negative to positive ratio used in negative sampling. -1 means no negative sampling is needed.

Type:

int

__init__(hparams, npratio=-1, col_spliter='\t', ID_spliter='%')[source]#

Initialize an iterator. Create necessary placeholders for the model.

Parameters:
  • hparams (object) – Global hyper-parameters. Some key settings such as head_num and head_dim are there.

  • npratio (int) – Negative to positive ratio used in negative sampling. -1 means no negative sampling is needed.

  • col_spliter (str) – Column splitter in one line.

  • ID_spliter (str) – ID splitter in one line.

init_behaviors(behaviors_file)[source]#

Init behavior logs given behaviors file.

Parameters:

behaviors_file (str) – path of behaviors file

init_news(news_file)[source]#

Init news information given news file, such as news_title_index, news_abstract_index.

Parameters:

news_file – path of news file

load_data_from_file(news_file, behavior_file)[source]#

Read and parse data from a file.

Parameters:
  • news_file (str) – A file that contains information about news articles.

  • behavior_file (str) – A file that contains information about user impressions.

Yields:

object – An iterator that yields parsed results, in the format of graph feed_dict.

load_dict(file_path)[source]#

Load a pickled file.

Parameters:

file_path (str) – File path.

Returns:

The loaded object.

Return type:

object

load_impression_from_file(behaivors_file)[source]#

Read and parse impression data from the behaviors file.

Parameters:

behaivors_file (str) – A file that contains information about user behaviors.

Yields:

object – An iterator that yields parsed impression data, in the format of dict.

load_news_from_file(news_file)[source]#

Read and parse news data from the news file.

Parameters:

news_file (str) – A file that contains information about news articles.

Yields:

object – An iterator that yields parsed news feature, in the format of dict.

load_user_from_file(news_file, behavior_file)[source]#

Read and parse user data from news file and behavior file.

Parameters:
  • news_file (str) – A file that contains information about news articles.

  • behavior_file (str) – A file that contains information about user impressions.

Yields:

object – An iterator that yields parsed user feature, in the format of dict.

parser_one_line(line)[source]#

Parse one string line into feature values.

Parameters:

line (str) – a string indicating one instance.

Yields:

list – Parsed results including label, impression id, user id, candidate_title_index, clicked_title_index, candidate_ab_index, clicked_ab_index, candidate_vert_index, clicked_vert_index, candidate_subvert_index and clicked_subvert_index.

Utilities#

class recommenders.models.newsrec.models.layers.AttLayer2(*args, **kwargs)[source]#

Soft alignment attention implementation.

dim#

attention hidden dim

Type:

int

__init__(dim=200, seed=0, **kwargs)[source]#

Initialization steps for AttLayer2.

Parameters:

dim (int) – attention hidden dim

build(input_shape)[source]#

Initialization for variables in AttLayer2. There are three variables in AttLayer2, i.e. W, b and q.

Parameters:

input_shape (object) – shape of input tensor.

call(inputs, mask=None, **kwargs)[source]#

Core implementation of soft attention.

Parameters:

inputs (object) – input tensor.

Returns:

weighted sum of input tensors.

Return type:

object

compute_mask(input, input_mask=None)[source]#

Compute output mask value.

Parameters:
  • input (object) – input tensor.

  • input_mask – input mask

Returns:

output mask.

Return type:

object

compute_output_shape(input_shape)[source]#

Compute shape of output tensor.

Parameters:

input_shape (tuple) – shape of input tensor.

Returns:

shape of output tensor.

Return type:

tuple

class recommenders.models.newsrec.models.layers.ComputeMasking(*args, **kwargs)[source]#

Compute whether inputs contain zero values.

Returns:

True for values not equal to zero.

Return type:

bool tensor

call(inputs, **kwargs)[source]#

Call method for ComputeMasking.

Parameters:

inputs (object) – input tensor.

Returns:

True for values not equal to zero.

Return type:

bool tensor

compute_output_shape(input_shape)[source]#

Computes the output shape of the layer.

This method will cause the layer’s state to be built, if that has not happened before. This requires that the layer will later be used with inputs that match the input shape provided here.

Parameters:

input_shape – Shape tuple (tuple of integers) or tf.TensorShape, or structure of shape tuples / tf.TensorShape instances (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.

Returns:

A tf.TensorShape instance or structure of tf.TensorShape instances.

class recommenders.models.newsrec.models.layers.OverwriteMasking(*args, **kwargs)[source]#

Set values at specific positions to zero.

Parameters:

inputs (list) – value tensor and mask tensor.

Returns:

tensor after setting values to zero.

Return type:

object

build(input_shape)[source]#

Creates the variables of the layer (for subclass implementers).

This is a method that implementers of subclasses of Layer or Model can override if they need a state-creation step in-between layer instantiation and layer call. It is invoked automatically before the first execution of call().

This is typically used to create the weights of Layer subclasses (at the discretion of the subclass implementer).

Parameters:

input_shape – Instance of TensorShape, or list of instances of TensorShape if the layer expects a list of inputs (one instance per input).

call(inputs, **kwargs)[source]#

Call method for OverwriteMasking.

Parameters:

inputs (list) – value tensor and mask tensor.

Returns:

tensor after setting values to zero.

Return type:

object

compute_output_shape(input_shape)[source]#

Computes the output shape of the layer.

This method will cause the layer’s state to be built, if that has not happened before. This requires that the layer will later be used with inputs that match the input shape provided here.

Parameters:

input_shape – Shape tuple (tuple of integers) or tf.TensorShape, or structure of shape tuples / tf.TensorShape instances (one per output tensor of the layer). Shape tuples can include None for free dimensions, instead of an integer.

Returns:

A tf.TensorShape instance or structure of tf.TensorShape instances.

recommenders.models.newsrec.models.layers.PersonalizedAttentivePooling(dim1, dim2, dim3, seed=0)[source]#

Soft alignment attention implementation.

recommenders.models.newsrec.models.layers.dim1#

first dimension of value shape.

Type:

int

recommenders.models.newsrec.models.layers.dim2#

second dimension of value shape.

Type:

int

recommenders.models.newsrec.models.layers.dim3#

shape of query

Type:

int

Returns:

weighted sum of input values.

Return type:

object

class recommenders.models.newsrec.models.layers.SelfAttention(*args, **kwargs)[source]#

Multi-head self-attention implementation.

Parameters:
  • multiheads (int) – The number of heads.

  • head_dim (object) – Dimension of each head.

  • mask_right (boolean) – whether to mask right words.

Returns:

Weighted sum after attention.

Return type:

object

Mask(inputs, seq_len, mode='add')[source]#

Mask operation used in multi-head self-attention.

Parameters:
  • seq_len (object) – sequence length of inputs.

  • mode (str) – mode of mask.

Returns:

tensors after masking.

Return type:

object

__init__(multiheads, head_dim, seed=0, mask_right=False, **kwargs)[source]#

Initialization steps for SelfAttention.

Parameters:
  • multiheads (int) – The number of heads.

  • head_dim (object) – Dimension of each head.

  • mask_right (boolean) – Whether to mask right words.

build(input_shape)[source]#

Initialization for variables in SelfAttention. There are three variables in SelfAttention, i.e. WQ, WK and WV. WQ is used for linear transformation of query. WK is used for linear transformation of key. WV is used for linear transformation of value.

Parameters:

input_shape (object) – shape of input tensor.

call(QKVs)[source]#

Core logic of multi-head self attention.

Parameters:

QKVs (list) – inputs of multi-head self-attention, i.e. query, key and value.

Returns:

output tensors.

Return type:

object

compute_output_shape(input_shape)[source]#

Compute shape of output tensor.

Returns:

output shape tuple.

Return type:

tuple

get_config()[source]#

Add multiheads, head_dim and mask_right into the layer config.

Returns:

config of SelfAttention layer.

Return type:

dict
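
As an illustration of how these attention layers compose, the following minimal sketch builds a tiny title encoder with SelfAttention followed by AttLayer2. It assumes the layers behave as standard Keras layers per the signatures above; the sequence length, embedding size and head settings are illustrative placeholders, not values prescribed by this documentation.

# Minimal sketch; shapes and hyperparameters are illustrative assumptions.
import tensorflow as tf
from recommenders.models.newsrec.models.layers import AttLayer2, SelfAttention

seq_len, emb_dim = 30, 300                      # assumed title length and word-embedding size
word_emb = tf.keras.Input(shape=(seq_len, emb_dim))

# Multi-head self-attention over the word sequence; query, key and value are
# the same tensor, passed as the QKVs list expected by call().
self_att = SelfAttention(multiheads=20, head_dim=20, seed=0)([word_emb, word_emb, word_emb])

# Additive (soft alignment) attention pools the sequence into a single vector.
title_vec = AttLayer2(dim=200, seed=0)(self_att)

encoder = tf.keras.Model(word_emb, title_vec)
encoder.summary()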

recommenders.models.newsrec.newsrec_utils.check_nn_config(f_config)[source]#

Check neural networks configuration.

Parameters:

f_config (dict) – Neural network configuration.

Raises:

ValueError – If the parameters are not correct.

recommenders.models.newsrec.newsrec_utils.check_type(config)[source]#

Check that the config parameters are the correct type

Parameters:

config (dict) – Configuration dictionary.

Raises:

TypeError – If the parameters are not the correct type.

recommenders.models.newsrec.newsrec_utils.create_hparams(flags)[source]#

Create the model hyperparameters.

Parameters:

flags (dict) – Dictionary with the model requirements.

Returns:

Hyperparameter object.

Return type:

HParams

recommenders.models.newsrec.newsrec_utils.get_mind_data_set(type)[source]#

Get MIND dataset address

Parameters:

type (str) – type of mind dataset, must be in [‘large’, ‘small’, ‘demo’]

Returns:

Data URL and train/valid dataset names.

Return type:

list

recommenders.models.newsrec.newsrec_utils.newsample(news, ratio)[source]#

Sample ratio samples from the news list. If the length of news is less than ratio, pad with zeros.

Parameters:
  • news (list) – input news list

  • ratio (int) – sample number

Returns:

output of sample list.

Return type:

list

recommenders.models.newsrec.newsrec_utils.prepare_hparams(yaml_file=None, **kwargs)[source]#

Prepare the model hyperparameters and check that all have the correct value.

Parameters:

yaml_file (str) – YAML file as configuration.

Returns:

Hyperparameter object.

Return type:

HParams
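
For reference, a minimal sketch of building hyperparameters with prepare_hparams is shown below. The YAML path and the keyword overrides (embedding and dictionary files, batch size, epochs) are illustrative placeholders, not values prescribed by this documentation.

from recommenders.models.newsrec.newsrec_utils import prepare_hparams

# Hypothetical file paths; keyword arguments override values from the YAML file.
hparams = prepare_hparams(
    "nrms.yaml",
    wordEmb_file="embedding.npy",
    wordDict_file="word_dict.pkl",
    userDict_file="uid2index.pkl",
    batch_size=32,
    epochs=5,
)
print(hparams.batch_size)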

recommenders.models.newsrec.newsrec_utils.word_tokenize(sent)[source]#

Split sentence into word list using regex.

Parameters:

sent (str) – Input sentence.

Returns:

word list

Return type:

list

LSTUR#

class recommenders.models.newsrec.models.lstur.LSTURModel(hparams, iterator_creator, seed=None)[source]#

LSTUR model (Neural News Recommendation with Long- and Short-term User Representations)

Mingxiao An, Fangzhao Wu, Chuhan Wu, Kun Zhang, Zheng Liu and Xing Xie: Neural News Recommendation with Long- and Short-term User Representations, ACL 2019

word2vec_embedding#

Pretrained word embedding matrix.

Type:

numpy.ndarray

hparam#

Global hyper-parameters.

Type:

object

__init__(hparams, iterator_creator, seed=None)[source]#

Initialization steps for LSTUR. Compared with the BaseModel, LSTUR needs a word embedding. After creating the word embedding matrix, BaseModel’s __init__ method will be called.

Parameters:
  • hparams (object) – Global hyper-parameters. Some key settings, such as type and gru_unit, are there.

  • iterator_creator_train (object) – LSTUR data loader class for train data.

  • iterator_creator_test (object) – LSTUR data loader class for test and validation data

NAML#

class recommenders.models.newsrec.models.naml.NAMLModel(hparams, iterator_creator, seed=None)[source]#

NAML model (Neural News Recommendation with Attentive Multi-View Learning)

Chuhan Wu, Fangzhao Wu, Mingxiao An, Jianqiang Huang, Yongfeng Huang and Xing Xie, Neural News Recommendation with Attentive Multi-View Learning, IJCAI 2019

word2vec_embedding#

Pretrained word embedding matrix.

Type:

numpy.ndarray

hparam#

Global hyper-parameters.

Type:

object

__init__(hparams, iterator_creator, seed=None)[source]#

Initialization steps for NAML. Compared with the BaseModel, NAML needs a word embedding. After creating the word embedding matrix, BaseModel’s __init__ method will be called.

Parameters:
  • hparams (object) – Global hyper-parameters. Some key settings, such as filter_num, are there.

  • iterator_creator_train (object) – NAML data loader class for train data.

  • iterator_creator_test (object) – NAML data loader class for test and validation data

NPA#

class recommenders.models.newsrec.models.npa.NPAModel(hparams, iterator_creator, seed=None)[source]#

NPA model (Neural News Recommendation with Personalized Attention)

Chuhan Wu, Fangzhao Wu, Mingxiao An, Jianqiang Huang, Yongfeng Huang and Xing Xie: NPA: Neural News Recommendation with Personalized Attention, KDD 2019, ADS track.

word2vec_embedding#

Pretrained word embedding matrix.

Type:

numpy.ndarray

hparam#

Global hyper-parameters.

Type:

object

__init__(hparams, iterator_creator, seed=None)[source]#

Initialization steps for NPA. Compared with the BaseModel, NPA needs a word embedding. After creating the word embedding matrix, BaseModel’s __init__ method will be called.

Parameters:
  • hparams (object) – Global hyper-parameters. Some key settings, such as filter_num, are there.

  • iterator_creator_train (object) – NPA data loader class for train data.

  • iterator_creator_test (object) – NPA data loader class for test and validation data

NRMS#

class recommenders.models.newsrec.models.nrms.NRMSModel(hparams, iterator_creator, seed=None)[source]#

NRMS model (Neural News Recommendation with Multi-Head Self-Attention)

Chuhan Wu, Fangzhao Wu, Suyu Ge, Tao Qi, Yongfeng Huang, and Xing Xie, “Neural News Recommendation with Multi-Head Self-Attention” in Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)

word2vec_embedding#

Pretrained word embedding matrix.

Type:

numpy.ndarray

hparam#

Global hyper-parameters.

Type:

object

__init__(hparams, iterator_creator, seed=None)[source]#

Initialization steps for NRMS. Compared with the BaseModel, NRMS needs a word embedding. After creating the word embedding matrix, BaseModel’s __init__ method will be called.

Parameters:
  • hparams (object) – Global hyper-parameters. Some key settings, such as head_num and head_dim, are there.

  • iterator_creator_train (object) – NRMS data loader class for train data.

  • iterator_creator_test (object) – NRMS data loader class for test and validation data
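
A minimal sketch of instantiating one of the newsrec models is shown below, assuming hparams was created with prepare_hparams and that MINDIterator (the MIND data loader shipped with the newsrec module) is used as iterator_creator; the subsequent fit call and its file arguments are hypothetical placeholders.

from recommenders.models.newsrec.models.nrms import NRMSModel
from recommenders.models.newsrec.io.mind_iterator import MINDIterator   # assumed data loader

model = NRMSModel(hparams, iterator_creator=MINDIterator, seed=42)
# Training and evaluation then follow the newsrec BaseModel API, e.g. (hypothetical paths):
# model.fit(train_news_file, train_behaviors_file, valid_news_file, valid_behaviors_file)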

RBM#

class recommenders.models.rbm.rbm.RBM(possible_ratings, visible_units, hidden_units=500, keep_prob=0.7, init_stdv=0.1, learning_rate=0.004, minibatch_size=100, training_epoch=20, display_epoch=10, sampling_protocol=[50, 70, 80, 90, 100], debug=False, with_metrics=False, seed=42)[source]#

Restricted Boltzmann Machine

__init__(possible_ratings, visible_units, hidden_units=500, keep_prob=0.7, init_stdv=0.1, learning_rate=0.004, minibatch_size=100, training_epoch=20, display_epoch=10, sampling_protocol=[50, 70, 80, 90, 100], debug=False, with_metrics=False, seed=42)[source]#

Implementation of a multinomial Restricted Boltzmann Machine for collaborative filtering in numpy/pandas/tensorflow

Based on the article by Ruslan Salakhutdinov, Andriy Mnih and Geoffrey Hinton https://www.cs.toronto.edu/~rsalakhu/papers/rbmcf.pdf

In this implementation we use multinomial units instead of the one-hot-encoded units used in the paper. This means that the weights are rank 2 (matrices) instead of rank 3 tensors.

Basic mechanics:

1) A computational graph is created when the RBM class is instantiated. For an item-based recommender this consists of: visible units, whose number n_visible equals the number of items, and hidden units, whose number is a hyperparameter to fix during training.

2) Gibbs Sampling:

2.1) For each training epoch, the visible units are first clamped on the data.

2.2) The activation probability of the hidden units, given a linear combination of the visibles, is evaluated as P(h=1|phi_v). The latter is then used to sample the value of the hidden units.

2.3) The probability P(v=l|phi_h) is evaluated, where l=1,..,r are the ratings (e.g. r=5 for the movielens dataset). In general, this is a multinomial distribution from which we sample the value of v.

2.4) This step is repeated k times, where k increases as optimization converges. It is essential to keep the originally unrated items fixed to zero during the whole learning process.

3) Optimization: The free energy of the visible units given the hidden units is evaluated at the beginning (F_0) and after k steps of Bernoulli sampling (F_k). The weights and biases are updated by minimizing the difference F_0 - F_k.

4) Inference: Once the joint probability distribution P(v,h) is learned, it is used to generate ratings for unrated items for all users.
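
A minimal usage sketch is shown below. It assumes an int32 user/affinity matrix Xtr (users by items) with ratings 1..5 and 0 for unrated entries; the toy data and hyperparameter values are illustrative only.

import numpy as np
from recommenders.models.rbm.rbm import RBM

Xtr = np.random.randint(0, 6, size=(1000, 500)).astype("int32")   # toy affinity matrix, illustrative only

model = RBM(
    possible_ratings=np.setdiff1d(np.unique(Xtr), np.array([0])),  # rating values, excluding 0 (unrated)
    visible_units=Xtr.shape[1],                                    # one visible unit per item
    hidden_units=600,
    training_epoch=30,
    minibatch_size=60,
    with_metrics=True,
)
model.fit(Xtr)

top_k_scores, elapsed = model.recommend_k_items(Xtr, top_k=10, remove_seen=True)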

batch_training(num_minibatches)[source]#

Perform training over input minibatches. If self.with_metrics is False, no online metrics are evaluated.

Parameters:

num_minibatches (scalar, int32) – Number of training minibatches.

Returns:

Training error per single epoch. If self.with_metrics is False, this is zero.

Return type:

float

binomial_sampling(pr)[source]#

Binomial sampling of hidden units activations using a rejection method.

Basic mechanics:

1) Extract a random number from a uniform distribution (g) and compare it with the unit’s probability (pr)

2) Choose 0 if pr<g, 1 otherwise. It is convenient to implement this condition using the relu function.

Parameters:
  • pr (tf.Tensor, float32) – Input conditional probability.

  • g (numpy.ndarray, float32) – Uniform probability used for comparison.

Returns:

Float32 tensor of sampled units. The value is 1 if pr>g and 0 otherwise.

Return type:

tf.Tensor

data_pipeline()[source]#

Define the data pipeline

eval_out()[source]#

Implement multinomial sampling from a trained model

fit(xtr)[source]#

Fit method

Training in generative models takes place in two steps:

  1. Gibbs sampling

  2. Gradient evaluation and parameters update

This estimate is later used in the weight update step by minimizing the distance between the model and the empirical free energy. Note that while the unit’s configuration space is sampled, the weights are determined via maximum likelihood (saddle point).

Main component of the algorithm; once instantiated, it generates the computational graph and performs model training

Parameters:

xtr (numpy.ndarray, integers) – The user/affinity matrix for the train set.

free_energy(x)[source]#

Free energy of the visible units given the hidden units. Since the sum is over the hidden units’ states, the functional form of the visible units’ free energy is the same as the one for the binary model.

Parameters:

x (tf.Tensor) – This can be either the sampled value of the visible units (v_k) or the input data

Returns:

Free energy of the model.

Return type:

tf.Tensor

generate_graph()[source]#

Call the different RBM modules to generate the computational graph

gibbs_protocol(i)[source]#

Gibbs protocol.

Basic mechanics:

If the current epoch i is in the interval specified in the training protocol, the number of steps in Gibbs sampling (k) is incremented by one and gibbs_sampling is updated accordingly.

Parameters:

i (int) – Current epoch in the loop

gibbs_sampling()[source]#

Gibbs sampling: Determines an estimate of the model configuration via sampling. In the binary RBM we need to impose that unseen movies stay as such, i.e. the sampling phase should not modify the elements where v=0.

Parameters:
  • k (scalar, integer) – iterator. Number of sampling steps.

  • v (tf.Tensor, float32) – visible units.

Returns:

  • h_k: The sampled value of the hidden unit at step k, float32.

  • v_k: The sampled value of the visible unit at step k, float32.

Return type:

tf.Tensor, tf.Tensor

init_gpu()[source]#

Configure GPU memory

init_metrics()[source]#

Initialize metrics

init_parameters()[source]#

Initialize the parameters of the model.

This is a single layer model with two biases. So we have a rectangular matrix w_{ij} and two bias vectors to initialize.

Parameters:
  • n_visible (int) – number of visible units (input layer)

  • n_hidden (int) – number of hidden units (latent variables of the model)

Returns:

  • w of size (n_visible, n_hidden): correlation matrix initialized by sampling from a normal distribution with zero mean and standard deviation init_stdv.

  • bv of size (1, n_visible): visible units’ bias, initialized to zero.

  • bh of size (1, n_hidden): hidden units’ bias, initialized to zero.

Return type:

tf.Tensor, tf.Tensor, tf.Tensor

init_training_session(xtr)[source]#

Initialize the TF session on training data

Parameters:

xtr (numpy.ndarray, int32) – The user/affinity matrix for the train set.

load(file_path='./rbm_model.ckpt')[source]#

Load model parameters for further use.

This function loads a saved tensorflow session.

Parameters:

file_path (str) – file path for RBM model checkpoint

losses(vv)[source]#

Calculate contrastive divergence, which is the difference between the free energy clamped on the data (v) and the model free energy (v_k).

Parameters:

vv (tf.Tensor, float32) – empirical input

Returns:

contrastive divergence

Return type:

obj

multinomial_distribution(phi)[source]#

Probability that unit v has value l given phi: P(v=l|phi)

Parameters:
  • phi (tf.Tensor) – linear combination of values of the previous layer

  • r (float) – rating scale, corresponding to the number of classes

Returns:

  • A tensor of shape (r, m, Nv): This needs to be reshaped as (m, Nv, r) in the last step to allow for faster sampling when used in the multinomial function.

Return type:

tf.Tensor

multinomial_sampling(pr)[source]#

Multinomial Sampling of ratings

Basic mechanics: For r classes, we sample r binomial distributions using the rejection method. This is possible since each class is statistically independent from the other. Note that this is the same method used in numpy’s random.multinomial() function.

1) extract a size r array of random numbers from a uniform distribution (g). As pr is normalized, we need to normalize g as well.

2) For each user and item, compare pr with the reference distribution. Note that the latter needs to be the same for ALL the user/item pairs in the dataset, as by assumption they are sampled from a common distribution.

Parameters:
  • pr (tf.Tensor, float32) – A distribution of shape (m, n, r), where m is the number of examples, n the number of features and r the number of classes. pr needs to be normalized, i.e. sum_k p(k) = 1 for all m, at fixed n.

  • f (tf.Tensor, float32) – Normalized, uniform probability used for comparison.

Returns:

An (m, n) float32 tensor of sampled ratings from 1 to r.

Return type:

tf.Tensor

placeholder()[source]#

Initialize the placeholders for the visible units

predict(x)[source]#

Returns the inferred ratings. This method is similar to recommend_k_items() with the exception that it returns all the inferred ratings

Basic mechanics:

The method samples new ratings from the learned joint distribution, together with their probabilities. The input x must have the same number of columns as the one used for training the model, i.e. the same number of items, but it can have an arbitrary number of rows (users).

Parameters:

x (numpy.ndarray, int32) – Input user/affinity matrix. Note that this can be a single vector, i.e. the ratings of a single user.

Returns:

  • A matrix with the inferred ratings.

  • The elapsed time for prediction.

Return type:

numpy.ndarray, float

recommend_k_items(x, top_k=10, remove_seen=True)[source]#

Returns the top-k items ordered by a relevancy score.

Basic mechanics:

The method samples new ratings from the learned joint distribution, together with their probabilities. The input x must have the same number of columns as the one used for training the model (i.e. the same number of items) but it can have an arbitrary number of rows (users).

A recommendation score is evaluated by taking the element-wise product between the ratings and the associated probabilities. For example, we could have the following situation:

        rating     probability     score
item1     5           0.5          2.5
item2     4           0.8          3.2

then item2 will be recommended.

Parameters:
  • x (numpy.ndarray, int32) – input user/affinity matrix. Note that this can be a single vector, i.e. the ratings of a single user.

  • top_k (scalar, int32) – the number of items to recommend.

Returns:

  • A sparse matrix containing the top_k elements ordered by their score.

  • The time taken to recommend k items.

Return type:

numpy.ndarray, float

sample_hidden_units(vv)[source]#

Sampling: In RBM we use Contrastive divergence to sample the parameter space. In order to do that we need to initialize the two conditional probabilities:

P(h|phi_v) –> returns the probability that the i-th hidden unit is active

P(v|phi_h) –> returns the probability that the i-th visible unit is active

Sample the hidden units given the visibles. This can be thought of as a forward-pass step in a FFN.

Parameters:

vv (tf.Tensor, float32) – visible units

Returns:

  • phv: The activation probability of the hidden unit.

  • h_: The sampled value of the hidden unit from a Bernoulli distribution having success probability phv.

Return type:

tf.Tensor, tf.Tensor

sample_visible_units(h)[source]#

Sample the visible units given the hiddens. This can be thought of as a backward pass in a FFN (negative phase). Each visible unit can take values in [1,rating], while the zero is reserved for missing data; as such the value of the visible unit is sampled from a multinomial distribution.

Basic mechanics:

1) For every training example we first sample Nv multinomial distributions. The result is of the form [0,1,0,0,0,…,0] where the index of the 1 element corresponds to the rth rating. The index is extracted using the argmax function and we need to add 1 at the end since array indices start from 0.

2) Select only those units that have been sampled. During the training phase it is important not to use the reconstructed inputs, so we need to enforce a zero value in the reconstructed ratings in the same positions as in the original input.

Parameters:

h (tf.Tensor, float32) – hidden units.

Returns:

  • pvh: The activation probability of the visible unit given the hidden.

  • v_: The sampled value of the visible unit from a multinomial distribution having success probability pvh.

Return type:

tf.Tensor, tf.Tensor

save(file_path='./rbm_model.ckpt')[source]#

Save model parameters to file_path

This function saves the current tensorflow session to a specified path.

Parameters:

file_path (str) – output file path for the RBM model checkpoint. A new directory will be created if it does not exist.

SAR#

class recommenders.models.sar.sar_singlenode.SARSingleNode(col_user='userID', col_item='itemID', col_rating='rating', col_timestamp='timestamp', col_prediction='prediction', similarity_type='jaccard', time_decay_coefficient=30, time_now=None, timedecay_formula=False, threshold=1, normalize=False)[source]#

Simple Algorithm for Recommendations (SAR) implementation

SAR is a fast, scalable, adaptive algorithm for personalized recommendations based on user transaction history and item descriptions. The core idea behind SAR is to recommend items like those that a user has already demonstrated an affinity to. It does this by 1) estimating the affinity of users for items, 2) estimating similarity across items, and then 3) combining the estimates to generate a set of recommendations for a given user.

__init__(col_user='userID', col_item='itemID', col_rating='rating', col_timestamp='timestamp', col_prediction='prediction', similarity_type='jaccard', time_decay_coefficient=30, time_now=None, timedecay_formula=False, threshold=1, normalize=False)[source]#

Initialize model parameters

Parameters:
  • col_user (str) – user column name

  • col_item (str) – item column name

  • col_rating (str) – rating column name

  • col_timestamp (str) – timestamp column name

  • col_prediction (str) – prediction column name

  • similarity_type (str) – [‘cooccurrence’, ‘cosine’, ‘inclusion index’, ‘jaccard’, ‘lexicographers mutual information’, ‘lift’, ‘mutual information’] option for computing item-item similarity

  • time_decay_coefficient (float) – number of days till ratings are decayed by 1/2

  • time_now (int | None) – current time for time decay calculation

  • timedecay_formula (bool) – flag to apply time decay

  • threshold (int) – item-item co-occurrences below this threshold will be removed

  • normalize (bool) – option for normalizing predictions to scale of original ratings
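
A minimal usage sketch follows. It assumes train and test are pandas DataFrames with the default userID, itemID, rating and timestamp columns; both dataframes are placeholders supplied by the caller.

from recommenders.models.sar.sar_singlenode import SARSingleNode

model = SARSingleNode(
    col_user="userID",
    col_item="itemID",
    col_rating="rating",
    col_timestamp="timestamp",
    similarity_type="jaccard",
    time_decay_coefficient=30,
    timedecay_formula=True,
)
model.fit(train)                                            # train: deduplicated user-item-rating dataframe
top_k = model.recommend_k_items(test, top_k=10, remove_seen=True)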

compute_affinity_matrix(df, rating_col)[source]#

Affinity matrix.

The user-affinity matrix can be constructed by treating the users and items as indices in a sparse matrix, and the events as the data. Here, we’re treating the ratings as the event weights. We convert between different sparse-matrix formats to de-duplicate user-item pairs, otherwise they will get added up.

Parameters:
  • df (pandas.DataFrame) – Indexed df of users and items

  • rating_col (str) – Name of column to use for ratings

Returns:

Affinity matrix in Compressed Sparse Row (CSR) format.

Return type:

sparse.csr

compute_cooccurrence_matrix(df)[source]#

Co-occurrence matrix.

The co-occurrence matrix is defined as \(C = U^T * U\)

where U is the user_affinity matrix with 1’s as values (instead of ratings).

Parameters:

df (pandas.DataFrame) – DataFrame of users and items

Returns:

Co-occurrence matrix

Return type:

numpy.ndarray

compute_time_decay(df, decay_column)[source]#

Compute time decay on provided column.

Parameters:
  • df (pandas.DataFrame) – DataFrame of users and items

  • decay_column (str) – column to decay

Returns:

with column decayed

Return type:

pandas.DataFrame

fit(df)[source]#

Main fit method for SAR.

Note

Please make sure that df has no duplicates.

Parameters:

df (pandas.DataFrame) – User item rating dataframe (without duplicates).

get_item_based_topk(items, top_k=10, sort_top_k=True)[source]#

Get top K similar items to provided seed items based on similarity metric defined. This method will take a set of items and use them to recommend the most similar items to that set based on the similarity matrix fit during training. This allows recommendations for cold users (unseen during training); note that the model is not updated.

The following options are possible based on information provided in the items input:

  1. Single user or seed of items: only item column (ratings are assumed to be 1)

  2. Single user or seed of items w/ ratings: item column and rating column

  3. Separate users or seeds of items: item and user column (user ids are only used to separate item sets)

  4. Separate users or seeds of items with ratings: item, user and rating columns provided

Parameters:
  • items (pandas.DataFrame) – DataFrame with item, user (optional), and rating (optional) columns

  • top_k (int) – number of top items to recommend

  • sort_top_k (bool) – flag to sort top k results

Returns:

sorted top k recommendation items

Return type:

pandas.DataFrame

get_popularity_based_topk(top_k=10, sort_top_k=True, items=True)[source]#

Get top K most frequently occurring items across all users.

Parameters:
  • top_k (int) – number of top items to recommend.

  • sort_top_k (bool) – flag to sort top k results.

  • items (bool) – if false, return most frequent users instead

Returns:

top k most popular items.

Return type:

pandas.DataFrame

get_topk_most_similar_users(user, top_k, sort_top_k=True)[source]#

Based on user affinity towards items, calculate the most similar users to the given user.

Parameters:
  • user (int) – user to retrieve most similar users for

  • top_k (int) – number of top items to recommend

  • sort_top_k (bool) – flag to sort top k results

Returns:

top k most similar users and their scores

Return type:

pandas.DataFrame

predict(test)[source]#

Output SAR scores only for the user-item pairs which are in the test set

Parameters:

test (pandas.DataFrame) – DataFrame that contains users and items to test

Returns:

DataFrame containing the prediction results

Return type:

pandas.DataFrame

recommend_k_items(test, top_k=10, sort_top_k=True, remove_seen=False)[source]#

Recommend top K items for all users which are in the test set

Parameters:
  • test (pandas.DataFrame) – users to test

  • top_k (int) – number of top items to recommend

  • sort_top_k (bool) – flag to sort top k results

  • remove_seen (bool) – flag to remove items seen in training from recommendation

Returns:

top k recommendation items for each user

Return type:

pandas.DataFrame

score(test, remove_seen=False)[source]#

Score all items for test users.

Parameters:
  • test (pandas.DataFrame) – users to test

  • remove_seen (bool) – flag to remove items seen in training from recommendation

Returns:

Value of interest of all items for the users.

Return type:

numpy.ndarray

set_index(df)[source]#

Generate continuous indices for users and items to reduce memory usage.

Parameters:

df (pandas.DataFrame) – dataframe with user and item ids

SASRec#

class recommenders.models.sasrec.model.Encoder(*args, **kwargs)[source]#

Invokes a Transformer-based encoder with a user-defined number of layers

__init__(num_layers, seq_max_len, embedding_dim, attention_dim, num_heads, conv_dims, dropout_rate)[source]#

Initialize parameters.

Parameters:
  • num_layers (int) – Number of layers.

  • seq_max_len (int) – Maximum sequence length.

  • embedding_dim (int) – Embedding dimension.

  • attention_dim (int) – Dimension of the attention embeddings.

  • num_heads (int) – Number of heads in the multi-head self-attention module.

  • conv_dims (list) – List of the dimensions of the Feedforward layer.

  • dropout_rate (float) – Dropout probability.

call(x, training, mask)[source]#

Model forward pass.

Parameters:
  • x (tf.Tensor) – Input tensor.

  • training (Boolean) – True if in training mode.

  • mask (tf.Tensor) – Mask tensor.

Returns:

Output tensor.

Return type:

tf.Tensor

class recommenders.models.sasrec.model.EncoderLayer(*args, **kwargs)[source]#

Transformer-based encoder layer

__init__(seq_max_len, embedding_dim, attention_dim, num_heads, conv_dims, dropout_rate)[source]#

Initialize parameters.

Parameters:
  • seq_max_len (int) – Maximum sequence length.

  • embedding_dim (int) – Embedding dimension.

  • attention_dim (int) – Dimension of the attention embeddings.

  • num_heads (int) – Number of heads in the multi-head self-attention module.

  • conv_dims (list) – List of the dimensions of the Feedforward layer.

  • dropout_rate (float) – Dropout probability.

call(x, training, mask)[source]#

Model forward pass.

Parameters:
  • x (tf.Tensor) – Input tensor.

  • training (Boolean) – True if in training mode.

  • mask (tf.Tensor) – Mask tensor.

Returns:

Output tensor.

Return type:

tf.Tensor

call_(x, training, mask)[source]#

Model forward pass.

Parameters:
  • x (tf.Tensor) – Input tensor.

  • training (tf.Tensor) – Training tensor.

  • mask (tf.Tensor) – Mask tensor.

Returns:

Output tensor.

Return type:

tf.Tensor

class recommenders.models.sasrec.model.LayerNormalization(*args, **kwargs)[source]#

Layer normalization using mean and variance; gamma and beta are the learnable parameters.

__init__(seq_max_len, embedding_dim, epsilon)[source]#

Initialize parameters.

Parameters:
  • seq_max_len (int) – Maximum sequence length.

  • embedding_dim (int) – Embedding dimension.

  • epsilon (float) – Epsilon value.

call(x)[source]#

Model forward pass.

Parameters:

x (tf.Tensor) – Input tensor.

Returns:

Output tensor.

Return type:

tf.Tensor

class recommenders.models.sasrec.model.MultiHeadAttention(*args, **kwargs)[source]#
  • Q (query), K (key) and V (value) are split into multiple heads (num_heads)

  • each tuple (q, k, v) are fed to scaled_dot_product_attention

  • all attention outputs are concatenated

__init__(attention_dim, num_heads, dropout_rate)[source]#

Initialize parameters.

Parameters:
  • attention_dim (int) – Dimension of the attention embeddings.

  • num_heads (int) – Number of heads in the multi-head self-attention module.

  • dropout_rate (float) – Dropout probability.

call(queries, keys)[source]#

Model forward pass.

Parameters:
  • queries (tf.Tensor) – Tensor of queries.

  • keys (tf.Tensor) – Tensor of keys

Returns:

Output tensor.

Return type:

tf.Tensor

class recommenders.models.sasrec.model.PointWiseFeedForward(*args, **kwargs)[source]#

Convolution layers with residual connection

__init__(conv_dims, dropout_rate)[source]#

Initialize parameters.

Parameters:
  • conv_dims (list) – List of the dimensions of the Feedforward layer.

  • dropout_rate (float) – Dropout probability.

call(x)[source]#

Model forward pass.

Parameters:

x (tf.Tensor) – Input tensor.

Returns:

Output tensor.

Return type:

tf.Tensor

class recommenders.models.sasrec.model.SASREC(*args, **kwargs)[source]#

SASRec model: Self-Attentive Sequential Recommendation Using Transformer

Citation:

Wang-Cheng Kang, Julian McAuley (2018), Self-Attentive Sequential Recommendation. Proceedings of IEEE International Conference on Data Mining (ICDM’18)

Original source code from nnkkmto/SASRec-tf2.

__init__(**kwargs)[source]#

Model initialization.

Parameters:
  • item_num (int) – Number of items in the dataset.

  • seq_max_len (int) – Maximum number of items in user history.

  • num_blocks (int) – Number of Transformer blocks to be used.

  • embedding_dim (int) – Item embedding dimension.

  • attention_dim (int) – Transformer attention dimension.

  • conv_dims (list) – List of the dimensions of the Feedforward layer.

  • dropout_rate (float) – Dropout rate.

  • l2_reg (float) – Coefficient of the L2 regularization.

  • num_neg_test (int) – Number of negative examples used in testing.

call(x, training)[source]#

Model forward pass.

Parameters:
  • x (tf.Tensor) – Input tensor.

  • training (tf.Tensor) – Training tensor.

Returns:

  • Logits of the positive examples.

  • Logits of the negative examples.

  • Mask for nonzero targets

Return type:

tf.Tensor, tf.Tensor, tf.Tensor

create_combined_dataset(u, seq, pos, neg)[source]#

Create model inputs from sampled batch data. This function is used only during training.

embedding(input_seq)[source]#

Compute the sequence and positional embeddings.

Parameters:

input_seq (tf.Tensor) – Input sequence

Returns:

  • Sequence embeddings.

  • Positional embeddings.

Return type:

tf.Tensor, tf.Tensor

evaluate(dataset)[source]#

Evaluation on the test users (users with at least 3 items)

evaluate_valid(dataset)[source]#

Evaluation on the validation users

loss_function(pos_logits, neg_logits, istarget)[source]#

Losses are calculated separately for the positive and negative items based on the corresponding logits. A mask is included to take care of the zero items (added for padding).

Parameters:
  • pos_logits (tf.Tensor) – Logits of the positive examples.

  • neg_logits (tf.Tensor) – Logits of the negative examples.

  • istarget (tf.Tensor) – Mask for nonzero targets.

Returns:

Loss.

Return type:

float

predict(inputs)[source]#

Returns the logits for the test items.

Parameters:

inputs (tf.Tensor) – Input tensor.

Returns:

Output tensor.

Return type:

tf.Tensor

train(dataset, sampler, **kwargs)[source]#

High-level function for model training as well as evaluation on the validation and test datasets

class recommenders.models.sasrec.sampler.WarpSampler(User, usernum, itemnum, batch_size=64, maxlen=10, n_workers=1)[source]#

Sampler object that creates an iterator for feeding batch data while training.

User#

dict, all the users (keys) with items as values

usernum#

integer, total number of users

itemnum#

integer, total number of items

batch_size#

batch size

Type:

int

maxlen#

maximum input sequence length

Type:

int

n_workers#

number of workers for parallel execution

Type:

int

__init__(User, usernum, itemnum, batch_size=64, maxlen=10, n_workers=1)[source]#

recommenders.models.sasrec.sampler.sample_function(user_train, usernum, itemnum, batch_size, maxlen, result_queue, seed)[source]#

Batch sampler that creates a sequence of negative items based on the original sequence of items (positive) that the user has interacted with.

Parameters:
  • user_train (dict) – dictionary of training examples for each user

  • usernum (int) – number of users

  • itemnum (int) – number of items

  • batch_size (int) – batch size

  • maxlen (int) – maximum input sequence length

  • result_queue (multiprocessing.Queue) – queue for storing sample results

  • seed (int) – seed for random generator

class recommenders.models.sasrec.util.SASRecDataSet(**kwargs)[source]#

A class for creating the SASRec-specific dataset used during training, validation and testing.

usernum#

integer, total number of users

itemnum#

integer, total number of items

User#

dict, all the users (keys) with items as values

Items#

set of all the items

user_train#

dict, subset of User that are used for training

user_valid#

dict, subset of User that are used for validation

user_test#

dict, subset of User that are used for testing

col_sep#

column separator in the data file

filename#

data filename
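
Putting the pieces together, the sketch below wires SASRecDataSet, WarpSampler and SASREC into a small training run. It assumes a tab-separated interactions file; the split() call, the attention_num_heads keyword and the num_epochs/batch_size arguments passed through train() are assumptions based on typical usage, not guarantees from this documentation.

from recommenders.models.sasrec.model import SASREC
from recommenders.models.sasrec.sampler import WarpSampler
from recommenders.models.sasrec.util import SASRecDataSet

data = SASRecDataSet(filename="interactions.txt", col_sep="\t")
data.split()   # assumed helper that populates user_train / user_valid / user_test

model = SASREC(
    item_num=data.itemnum,
    seq_max_len=50,
    num_blocks=2,
    embedding_dim=100,
    attention_dim=100,
    attention_num_heads=1,   # assumed keyword, not listed above
    conv_dims=[100, 100],
    dropout_rate=0.1,
    l2_reg=0.00001,
    num_neg_test=100,
)
sampler = WarpSampler(data.user_train, data.usernum, data.itemnum,
                      batch_size=128, maxlen=50, n_workers=3)
model.train(data, sampler, num_epochs=5, batch_size=128)   # extra kwargs assumed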

SSE-PT#

class recommenders.models.sasrec.ssept.SSEPT(*args, **kwargs)[source]#

SSE-PT Model

Citation:

Wu L., Li S., Hsieh C-J., Sharpnack J., SSE-PT: Sequential Recommendation Via Personalized Transformer, RecSys, 2020. TF 1.x codebase: SSE-PT/SSE-PT. TF 2.x codebase (SASRec): nnkkmto/SASRec-tf2.

__init__(**kwargs)[source]#

Model initialization.

Parameters:
  • item_num (int) – Number of items in the dataset.

  • seq_max_len (int) – Maximum number of items in user history.

  • num_blocks (int) – Number of Transformer blocks to be used.

  • embedding_dim (int) – Item embedding dimension.

  • attention_dim (int) – Transformer attention dimension.

  • conv_dims (list) – List of the dimensions of the Feedforward layer.

  • dropout_rate (float) – Dropout rate.

  • l2_reg (float) – Coefficient of the L2 regularization.

  • num_neg_test (int) – Number of negative examples used in testing.

  • user_num (int) – Number of users in the dataset.

  • user_embedding_dim (int) – User embedding dimension.

  • item_embedding_dim (int) – Item embedding dimension.

call(x, training)[source]#

Model forward pass.

Parameters:
  • x (tf.Tensor) – Input tensor.

  • training (tf.Tensor) – Training tensor.

Returns:

  • Logits of the positive examples.

  • Logits of the negative examples.

  • Mask for nonzero targets

Return type:

tf.Tensor, tf.Tensor, tf.Tensor

loss_function(pos_logits, neg_logits, istarget)[source]#

Losses are calculated separately for the positive and negative items based on the corresponding logits. A mask is included to take care of the zero items (added for padding).

Parameters:
  • pos_logits (tf.Tensor) – Logits of the positive examples.

  • neg_logits (tf.Tensor) – Logits of the negative examples.

  • istarget (tf.Tensor) – Mask for nonzero targets.

Returns:

Loss.

Return type:

float

predict(inputs)[source]#

Model prediction for candidate (negative) items

Surprise utilities#

recommenders.models.surprise.surprise_utils.compute_ranking_predictions(algo, data, usercol='userID', itemcol='itemID', predcol='prediction', remove_seen=False)[source]#

Computes predictions of an algorithm from Surprise on all users and items in data. It can be used for computing ranking metrics like NDCG.

Parameters:
  • algo (surprise.prediction_algorithms.algo_base.AlgoBase) – an algorithm from Surprise

  • data (pandas.DataFrame) – the data from which to get the users and items

  • usercol (str) – name of the user column

  • itemcol (str) – name of the item column

  • remove_seen (bool) – flag to remove (user, item) pairs seen in the training data

Returns:

Dataframe with usercol, itemcol, predcol

Return type:

pandas.DataFrame

recommenders.models.surprise.surprise_utils.predict(algo, data, usercol='userID', itemcol='itemID', predcol='prediction')[source]#

Computes predictions of an algorithm from Surprise on the data. Can be used for computing rating metrics like RMSE.

Parameters:
  • algo (surprise.prediction_algorithms.algo_base.AlgoBase) – an algorithm from Surprise

  • data (pandas.DataFrame) – the data on which to predict

  • usercol (str) – name of the user column

  • itemcol (str) – name of the item column

Returns:

Dataframe with usercol, itemcol, predcol

Return type:

pandas.DataFrame
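
A minimal sketch combining these utilities with a Surprise algorithm is shown below; train and test are placeholder pandas DataFrames with userID, itemID and rating columns, and SVD is used only as an example algorithm.

import surprise
from recommenders.models.surprise.surprise_utils import compute_ranking_predictions, predict

train_set = surprise.Dataset.load_from_df(
    train[["userID", "itemID", "rating"]],
    reader=surprise.Reader(rating_scale=(1, 5)),
).build_full_trainset()

svd = surprise.SVD(random_state=0, n_factors=200, n_epochs=30)
svd.fit(train_set)

# Rating predictions on the test pairs, and ranking predictions over all items.
rating_preds = predict(svd, test, usercol="userID", itemcol="itemID")
ranking_preds = compute_ranking_predictions(svd, train, usercol="userID", itemcol="itemID", remove_seen=True)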

recommenders.models.surprise.surprise_utils.surprise_trainset_to_df(trainset, col_user='uid', col_item='iid', col_rating='rating')[source]#

Converts a surprise.Trainset object to pandas.DataFrame

More info: https://surprise.readthedocs.io/en/stable/trainset.html

Parameters:
  • trainset (object) – A surprise.Trainset object.

  • col_user (str) – User column name.

  • col_item (str) – Item column name.

  • col_rating (str) – Rating column name.

Returns:

A dataframe with user column (str), item column (str), and rating column (float).

Return type:

pandas.DataFrame

TF-IDF utilities#

class recommenders.models.tfidf.tfidf_utils.TfidfRecommender(id_col, tokenization_method='scibert')[source]#

Term Frequency - Inverse Document Frequency (TF-IDF) Recommender

This class provides content-based recommendations using TF-IDF vectorization in combination with cosine similarity.

clean_dataframe(df, cols_to_clean, new_col_name='cleaned_text')[source]#

Clean the text within the columns of interest and return a dataframe with cleaned and combined text.

Parameters:
  • df (pandas.DataFrame) – Dataframe containing the text content to clean.

  • cols_to_clean (list of str) – List of columns to clean by name (e.g., [‘abstract’,’full_text’]).

  • new_col_name (str) – Name of the new column that will contain the cleaned text.

Returns:

Dataframe with cleaned text in the new column.

Return type:

pandas.DataFrame

fit(tf, vectors_tokenized)[source]#

Fit TF-IDF vectorizer to the cleaned and tokenized text.

Parameters:
  • tf (TfidfVectorizer) – sklearn.feature_extraction.text.TfidfVectorizer object defined in .tokenize_text().

  • vectors_tokenized (pandas.Series) – Each row contains tokens for respective documents separated by spaces.

get_stop_words()[source]#

Return the stop words excluded by the TF-IDF vectorizer.

Returns:

Frozenset of stop words used by the TF-IDF vectorizer (can be converted to list).

Return type:

list

get_tokens()[source]#

Return the tokens generated by the TF-IDF vectorizer.

Returns:

Dictionary of tokens generated by the TF-IDF vectorizer.

Return type:

dict

get_top_k_recommendations(metadata, query_id, cols_to_keep=[], verbose=True)[source]#

Return the top k recommendations with useful metadata for each recommendation.

Parameters:
  • metadata (pandas.DataFrame) – Dataframe holding metadata for all public domain papers.

  • query_id (str) – ID of item of interest.

  • cols_to_keep (list of str) – List of columns from the metadata dataframe to include (e.g., [‘title’,’authors’,’journal’,’publish_time’,’url’]). By default, all columns are kept.

  • verbose (boolean) – Set to True if you want to print the table.

Returns:

Stylized dataframe holding recommendations and associated metadata just for the item of interest (can access as normal dataframe by using df.data).

Return type:

pandas.Styler

recommend_top_k_items(df_clean, k=5)[source]#

Recommend k number of items similar to the item of interest.

Parameters:
  • df_clean (pandas.DataFrame) – Dataframe with cleaned text.

  • k (int) – Number of recommendations to return.

Returns:

Dataframe containing id of top k recommendations for all items.

Return type:

pandas.DataFrame

tokenize_text(df_clean, text_col='cleaned_text', ngram_range=(1, 3), min_df=0.0)[source]#

Tokenize the input text. For more details on the TfidfVectorizer, see https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html

Parameters:
  • df_clean (pandas.DataFrame) – Dataframe with cleaned text in the new column.

  • text_col (str) – Name of column containing the cleaned text.

  • ngram_range (tuple of int) – The lower and upper boundary of the range of n-values for different n-grams to be extracted.

  • min_df (float) – When building the vocabulary ignore terms that have a document frequency strictly lower than the given threshold.

Returns:

  • Scikit-learn TfidfVectorizer object defined in .tokenize_text().

  • Each row contains tokens for respective documents separated by spaces.

Return type:

TfidfVectorizer, pandas.Series
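
The typical flow through these methods is sketched below. It assumes df is a pandas DataFrame with an id column (here "cord_uid") and two text columns; all column names and the query id are placeholders.

from recommenders.models.tfidf.tfidf_utils import TfidfRecommender

recommender = TfidfRecommender(id_col="cord_uid", tokenization_method="scibert")

# Clean and combine the text columns, tokenize, then fit the TF-IDF vectorizer.
df_clean = recommender.clean_dataframe(df, cols_to_clean=["title", "abstract"], new_col_name="cleaned_text")
tf_vectorizer, vectors_tokenized = recommender.tokenize_text(df_clean, text_col="cleaned_text")
recommender.fit(tf_vectorizer, vectors_tokenized)

# Top-k similar items for every item, and a formatted view for one item of interest.
top_k_all = recommender.recommend_top_k_items(df_clean, k=5)
styled = recommender.get_top_k_recommendations(df, query_id="some-id", cols_to_keep=["title"])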

Standard VAE#

class recommenders.models.vae.standard_vae.AnnealingCallback(beta, anneal_cap, total_anneal_steps)[source]#

This class is used for updating the value of β during the annealing process. When β reaches the value of anneal_cap, it stops increasing.

__init__(beta, anneal_cap, total_anneal_steps)[source]#

Constructor

Parameters:
  • beta (float) – current value of beta.

  • anneal_cap (float) – maximum value that beta can reach.

  • total_anneal_steps (int) – total number of annealing steps.

get_data()[source]#

Returns a list of the beta values per epoch.

on_batch_end(epoch, logs={})[source]#

At the end of each batch, beta is updated until it reaches the value of anneal_cap.

on_epoch_end(epoch, logs={})[source]#

At the end of each epoch save the value of beta in _beta list.

on_train_begin(logs={})[source]#

Initialise a list in which the beta value will be saved at the end of each epoch.

class recommenders.models.vae.standard_vae.LossHistory[source]#

This class is used for saving the validation loss and the training loss per epoch.

on_epoch_end(epoch, logs={})[source]#

Save the loss of training and validation set at the end of each epoch.

on_train_begin(logs={})[source]#

Initialise the lists where the loss of training and validation will be saved.

class recommenders.models.vae.standard_vae.Metrics(model, val_tr, val_te, mapper, k, save_path=None)[source]#

Callback function used to calculate the NDCG@k metric on the validation set at the end of each epoch. The weights of the model with the highest NDCG@k value are saved.

__init__(model, val_tr, val_te, mapper, k, save_path=None)[source]#

Initialize the class parameters.

Parameters:
  • model – trained model for validation.

  • val_tr (numpy.ndarray, float) – the click matrix for the validation set training part.

  • val_te (numpy.ndarray, float) – the click matrix for the validation set testing part.

  • mapper (AffinityMatrix) – the mapper for converting click matrix to dataframe.

  • k (int) – number of top k items per user (optional).

  • save_path (str) – Default path to save weights.

get_data()[source]#

Returns a list of the NDCG@k of the validation set metrics calculated at the end of each epoch.

on_epoch_end(batch, logs={})[source]#

At the end of each epoch calculate NDCG@k of the validation set. If the model performance has improved, the model weights are saved. Update the list of validation NDCG@k by adding the obtained value.

on_train_begin(logs={})[source]#

Initialise the list for validation NDCG@k.

recommend_k_items(x, k, remove_seen=True)[source]#

Returns the top-k items ordered by a relevancy score. Obtained probabilities are used as recommendation score.

Parameters:
  • x (numpy.ndarray, int32) – input click matrix.

  • k (scalar, int32) – the number of items to recommend.

Returns:

A sparse matrix containing the top_k elements ordered by their score.

Return type:

numpy.ndarray

class recommenders.models.vae.standard_vae.StandardVAE(n_users, original_dim, intermediate_dim=200, latent_dim=70, n_epochs=400, batch_size=100, k=100, verbose=1, drop_encoder=0.5, drop_decoder=0.5, beta=1.0, annealing=False, anneal_cap=1.0, seed=None, save_path=None)[source]#

Standard Variational Autoencoders (VAE) for Collaborative Filtering implementation.

__init__(n_users, original_dim, intermediate_dim=200, latent_dim=70, n_epochs=400, batch_size=100, k=100, verbose=1, drop_encoder=0.5, drop_decoder=0.5, beta=1.0, annealing=False, anneal_cap=1.0, seed=None, save_path=None)[source]#

Initialize class parameters.

Parameters:
  • n_users (int) – Number of unique users in the train set.

  • original_dim (int) – Number of unique items in the train set.

  • intermediate_dim (int) – Dimension of intermediate space.

  • latent_dim (int) – Dimension of latent space.

  • n_epochs (int) – Number of epochs for training.

  • batch_size (int) – Batch size.

  • k (int) – number of top k items per user.

  • verbose (int) – Whether to show the training output or not.

  • drop_encoder (float) – Dropout percentage of the encoder.

  • drop_decoder (float) – Dropout percentage of the decoder.

  • beta (float) – a constant parameter β in the ELBO function, when you are not using annealing (annealing=False)

  • annealing (bool) – option of using annealing method for training the model (True) or not using annealing, keeping a constant beta (False)

  • anneal_cap (float) – maximum value that beta can take during annealing process.

  • seed (int) – Seed.

  • save_path (str) – Default path to save weights.

display_metrics()[source]#

Plots: 1) loss per epoch, both for the validation and train sets; 2) NDCG@k per epoch of the validation set.

fit(x_train, x_valid, x_val_tr, x_val_te, mapper)[source]#

Fit model with the train sets and validate on the validation set.

Parameters:
  • x_train (numpy.ndarray) – The click matrix for the train set.

  • x_valid (numpy.ndarray) – The click matrix for the validation set.

  • x_val_tr (numpy.ndarray) – The click matrix for the validation set training part.

  • x_val_te (numpy.ndarray) – The click matrix for the validation set testing part.

  • mapper (object) – The mapper for converting click matrix to dataframe. It can be AffinityMatrix.

get_optimal_beta()[source]#

Returns the value of the optimal beta.

ndcg_per_epoch()[source]#

Returns the list of NDCG@k at each epoch.

nn_batch_generator(x_train)[source]#

Used for splitting dataset in batches.

Parameters:

x_train (numpy.ndarray) – The click matrix for the train set with float values.

recommend_k_items(x, k, remove_seen=True)[source]#

Returns the top-k items ordered by a relevancy score.

Obtained probabilities are used as recommendation score.

Parameters:
  • x (numpy.ndarray) – Input click matrix, with int32 values.

  • k (scalar) – The number of items to recommend.

Returns:

A sparse matrix containing the top_k elements ordered by their score.

Return type:

numpy.ndarray

Multinomial VAE#

class recommenders.models.vae.multinomial_vae.AnnealingCallback(beta, anneal_cap, total_anneal_steps)[source]#

This class is used for updating the value of β during the annealing process. When β reaches the value of anneal_cap, it stops increasing.

__init__(beta, anneal_cap, total_anneal_steps)[source]#

Constructor

Parameters:
  • beta (float) – current value of beta.

  • anneal_cap (float) – maximum value that beta can reach.

  • total_anneal_steps (int) – total number of annealing steps.

get_data()[source]#

Returns a list of the beta values per epoch.

on_batch_end(epoch, logs={})[source]#

At the end of each batch, beta is updated until it reaches the value of anneal_cap.

on_epoch_end(epoch, logs={})[source]#

At the end of each epoch save the value of beta in _beta list.

on_train_begin(logs={})[source]#

Initialise a list in which the beta value will be saved at the end of each epoch.

class recommenders.models.vae.multinomial_vae.LossHistory[source]#

This class is used for saving the validation loss and the training loss per epoch.

on_epoch_end(epoch, logs={})[source]#

Save the loss of training and validation set at the end of each epoch.

on_train_begin(logs={})[source]#

Initialise the lists where the loss of training and validation will be saved.

class recommenders.models.vae.multinomial_vae.Metrics(model, val_tr, val_te, mapper, k, save_path=None)[source]#

Callback function used to calculate the NDCG@k metric on the validation set at the end of each epoch. The weights of the model with the highest NDCG@k value are saved.

__init__(model, val_tr, val_te, mapper, k, save_path=None)[source]#

Initialize the class parameters.

Parameters:
  • model – trained model for validation.

  • val_tr (numpy.ndarray, float) – the click matrix for the validation set training part.

  • val_te (numpy.ndarray, float) – the click matrix for the validation set testing part.

  • mapper (AffinityMatrix) – the mapper for converting click matrix to dataframe.

  • k (int) – number of top k items per user (optional).

  • save_path (str) – Default path to save weights.

get_data()[source]#

Returns a list of the NDCG@k of the validation set metrics calculated at the end of each epoch.

on_epoch_end(batch, logs={})[source]#

At the end of each epoch calculate NDCG@k of the validation set.

If the model performance has improved, the model weights are saved. Update the list of validation NDCG@k by adding the obtained value.

on_train_begin(logs={})[source]#

Initialise the list for validation NDCG@k.

recommend_k_items(x, k, remove_seen=True)[source]#

Returns the top-k items ordered by a relevancy score. Obtained probabilities are used as recommendation score.

Parameters:
  • x (numpy.ndarray, int32) – input click matrix.

  • k (scalar, int32) – the number of items to recommend.

Returns:

A sparse matrix containing the top_k elements ordered by their score.

Return type:

numpy.ndarray

class recommenders.models.vae.multinomial_vae.Mult_VAE(n_users, original_dim, intermediate_dim=200, latent_dim=70, n_epochs=400, batch_size=100, k=100, verbose=1, drop_encoder=0.5, drop_decoder=0.5, beta=1.0, annealing=False, anneal_cap=1.0, seed=None, save_path=None)[source]#

Multinomial Variational Autoencoders (Multi-VAE) for Collaborative Filtering implementation

Citation:

Liang, Dawen, et al. “Variational autoencoders for collaborative filtering.” Proceedings of the 2018 World Wide Web Conference. 2018. https://arxiv.org/pdf/1802.05814.pdf

__init__(n_users, original_dim, intermediate_dim=200, latent_dim=70, n_epochs=400, batch_size=100, k=100, verbose=1, drop_encoder=0.5, drop_decoder=0.5, beta=1.0, annealing=False, anneal_cap=1.0, seed=None, save_path=None)[source]#

Constructor

Parameters:
  • n_users (int) – Number of unique users in the train set.

  • original_dim (int) – Number of unique items in the train set.

  • intermediate_dim (int) – Dimension of intermediate space.

  • latent_dim (int) – Dimension of latent space.

  • n_epochs (int) – Number of epochs for training.

  • batch_size (int) – Batch size.

  • k (int) – number of top k items per user.

  • verbose (int) – Whether to show the training output or not.

  • drop_encoder (float) – Dropout percentage of the encoder.

  • drop_decoder (float) – Dropout percentage of the decoder.

  • beta (float) – a constant parameter β in the ELBO function, when you are not using annealing (annealing=False)

  • annealing (bool) – option of using annealing method for training the model (True) or not using annealing, keeping a constant beta (False)

  • anneal_cap (float) – maximum value that beta can take during annealing process.

  • seed (int) – Seed.

  • save_path (str) – Default path to save weights.
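
A minimal fit-and-recommend sketch follows. The click matrices x_train, x_valid, x_val_tr, x_val_te, x_test and the affinity_mapper object are placeholders assumed to be prepared beforehand (e.g. with an AffinityMatrix mapper); the hyperparameter values are illustrative.

from recommenders.models.vae.multinomial_vae import Mult_VAE

model = Mult_VAE(
    n_users=x_train.shape[0],        # number of users in the train click matrix
    original_dim=x_train.shape[1],   # number of items
    intermediate_dim=200,
    latent_dim=70,
    n_epochs=100,
    batch_size=100,
    k=10,
    annealing=True,
    anneal_cap=1.0,
    save_path="mvae_weights.hdf5",
)
model.fit(x_train, x_valid, x_val_tr, x_val_te, mapper=affinity_mapper)
top_k_scores = model.recommend_k_items(x_test, k=10, remove_seen=True)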

display_metrics()[source]#

Plots: 1) loss per epoch, both for the validation and train sets; 2) NDCG@k per epoch of the validation set.

fit(x_train, x_valid, x_val_tr, x_val_te, mapper)[source]#

Fit model with the train sets and validate on the validation set.

Parameters:
  • x_train (numpy.ndarray) – the click matrix for the train set.

  • x_valid (numpy.ndarray) – the click matrix for the validation set.

  • x_val_tr (numpy.ndarray) – the click matrix for the validation set training part.

  • x_val_te (numpy.ndarray) – the click matrix for the validation set testing part.

  • mapper (object) – the mapper for converting click matrix to dataframe. It can be AffinityMatrix.

get_optimal_beta()[source]#

Returns the value of the optimal beta.

ndcg_per_epoch()[source]#

Returns the list of NDCG@k at each epoch.

nn_batch_generator(x_train)[source]#

Used for splitting dataset in batches.

Parameters:

x_train (numpy.ndarray) – The click matrix for the train set, with float values.

recommend_k_items(x, k, remove_seen=True)[source]#

Returns the top-k items ordered by a relevancy score. Obtained probabilities are used as recommendation score.

Parameters:
  • x (numpy.ndarray, int32) – input click matrix.

  • k (scalar, int32) – the number of items to recommend.

Returns:

A sparse matrix containing the top_k elements ordered by their score.

Return type:

numpy.ndarray, float

Vowpal Wabbit utilities#

This file provides a wrapper to run Vowpal Wabbit from the command line through Python. It is not recommended to use this approach in production; instead, use the Python bindings that can be installed from the repository or from pip, or use the command line directly. This wrapper merely demonstrates vw usage in the example notebooks.
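
A minimal usage sketch is shown below; it assumes the vw binary is installed and on the PATH, that train and test are placeholder pandas DataFrames with the default column names, and that extra keyword arguments are forwarded to vw as command-line options (an assumption; see parse_train_params below).

from recommenders.models.vowpal_wabbit.vw import VW

model = VW(
    col_user="userID",
    col_item="itemID",
    col_rating="rating",
    loss_function="squared",   # illustrative vw option, assumed to be forwarded via **kwargs
    quiet=True,
)
model.fit(train)
predictions = model.predict(test)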

class recommenders.models.vowpal_wabbit.vw.VW(col_user='userID', col_item='itemID', col_rating='rating', col_timestamp='timestamp', col_prediction='prediction', **kwargs)[source]#

Vowpal Wabbit Class

fit(df)[source]#

Train model

Parameters:

df (pandas.DataFrame) – input training data

parse_test_params(params)[source]#

Parse input hyper-parameters to build vw test commands

Parameters:

params (dict) – key = parameter, value = value (use True if parameter is just a flag)

Returns:

vw command line parameters as list of strings

Return type:

list[str]

parse_train_params(params)[source]#

Parse input hyper-parameters to build vw train commands

Parameters:

params (dict) – key = parameter, value = value (use True if parameter is just a flag)

Returns:

vw command line parameters as list of strings

Return type:

list[str]

predict(df)[source]#

Predict results

Parameters:

df (pandas.DataFrame) – input test data

static to_vw_cmd(params)[source]#

Convert dictionary of parameters to vw command line.

Parameters:

params (dict) – key = parameter, value = value (use True if parameter is just a flag)

Returns:

vw command line parameters as list of strings

Return type:

list[str]

to_vw_file(df, train=True)[source]#

Convert Pandas DataFrame to vw input format file

Parameters:
  • df (pandas.DataFrame) – input DataFrame

  • train (bool) – flag for train mode (or test mode if False)

Wide & Deep#

recommenders.models.wide_deep.wide_deep_utils.build_feature_columns(users, items, user_col='userID', item_col='itemID', item_feat_col=None, crossed_feat_dim=1000, user_dim=8, item_dim=8, item_feat_shape=None, model_type='wide_deep')[source]#

Build wide and/or deep feature columns for TensorFlow high-level API Estimator.

Parameters:
  • users (iterable) – Distinct user ids.

  • items (iterable) – Distinct item ids.

  • user_col (str) – User column name.

  • item_col (str) – Item column name.

  • item_feat_col (str) – Item feature column name for ‘deep’ or ‘wide_deep’ model.

  • crossed_feat_dim (int) – Crossed feature dimension for ‘wide’ or ‘wide_deep’ model.

  • user_dim (int) – User embedding dimension for ‘deep’ or ‘wide_deep’ model.

  • item_dim (int) – Item embedding dimension for ‘deep’ or ‘wide_deep’ model.

  • item_feat_shape (int or an iterable of integers) – Item feature array shape for ‘deep’ or ‘wide_deep’ model.

  • model_type (str) – Model type, either ‘wide’ for a linear model, ‘deep’ for a deep neural networks, or ‘wide_deep’ for a combination of linear model and neural networks.

Returns:

  • The wide feature columns

  • The deep feature columns. If only the wide model is selected, the deep column list is empty, and vice versa.

Return type:

list, list

recommenders.models.wide_deep.wide_deep_utils.build_model(model_dir='model_checkpoints', wide_columns=(), deep_columns=(), linear_optimizer='Ftrl', dnn_optimizer='Adagrad', dnn_hidden_units=(128, 128), dnn_dropout=0.0, dnn_batch_norm=True, log_every_n_iter=1000, save_checkpoints_steps=10000, seed=None)[source]#

Build wide-deep model.

To generate wide model, pass wide_columns only. To generate deep model, pass deep_columns only. To generate wide_deep model, pass both wide_columns and deep_columns.

Parameters:
  • model_dir (str) – Model checkpoint directory.

  • wide_columns (list of tf.feature_column) – Wide model feature columns.

  • deep_columns (list of tf.feature_column) – Deep model feature columns.

  • linear_optimizer (str or tf.train.Optimizer) – Wide model optimizer name or object.

  • dnn_optimizer (str or tf.train.Optimizer) – Deep model optimizer name or object.

  • dnn_hidden_units (list of int) – Deep model hidden units. E.g., [10, 10, 10] is three layers of 10 nodes each.

  • dnn_dropout (float) – Deep model’s dropout rate.

  • dnn_batch_norm (bool) – Deep model’s batch normalization flag.

  • log_every_n_iter (int) – Log the training loss for every n steps.

  • save_checkpoints_steps (int) – Model checkpoint frequency.

  • seed (int) – Random seed.

Returns:

Model

Return type:

tf.estimator.Estimator
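
A minimal end-to-end sketch combining the two utilities is shown below. It assumes data is a pandas DataFrame with userID and itemID columns; training the returned Estimator with an input_fn is left out and follows the standard tf.estimator API.

from recommenders.models.wide_deep.wide_deep_utils import build_feature_columns, build_model

wide_columns, deep_columns = build_feature_columns(
    users=data["userID"].unique(),
    items=data["itemID"].unique(),
    user_col="userID",
    item_col="itemID",
    crossed_feat_dim=1000,
    user_dim=32,
    item_dim=16,
    model_type="wide_deep",
)

estimator = build_model(
    model_dir="./wide_deep_checkpoints",
    wide_columns=wide_columns,
    deep_columns=deep_columns,
    dnn_hidden_units=(128, 128),
    dnn_dropout=0.1,
    seed=42,
)
# estimator.train(input_fn=...) and estimator.predict(input_fn=...) follow the tf.estimator API.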