Dataset module#
Recommendation datasets and related utilities
Recommendation datasets#
Amazon Reviews#
The Amazon Reviews dataset consists of product reviews from Amazon. The data span a period of 18 years and include ~35 million reviews up to March 2013. Reviews include product and user information, ratings, and a plain-text review.
- Citation:
J. McAuley and J. Leskovec, “Hidden factors and hidden topics: understanding rating dimensions with review text”, RecSys, 2013.
- recommenders.datasets.amazon_reviews.data_preprocessing(reviews_file, meta_file, train_file, valid_file, test_file, user_vocab, item_vocab, cate_vocab, sample_rate=0.01, valid_num_ngs=4, test_num_ngs=9, is_history_expanding=True)[source]#
Create data for training, validation, and testing from the original dataset.
- Parameters:
reviews_file (str) – Reviews datafile downloaded in a previous step (e.g. via download_and_extract).
meta_file (str) – Meta datafile downloaded in a previous step (e.g. via download_and_extract).
- recommenders.datasets.amazon_reviews.download_and_extract(name, dest_path)[source]#
Downloads and extracts Amazon reviews and meta datafiles if they don’t already exist
- Parameters:
name (str) – Category of reviews.
dest_path (str) – File path for the downloaded file.
- Returns:
File path for the extracted file.
- Return type:
str
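As a rough usage sketch, the two functions above can be chained to build the train, validation, and test files. The category strings, destination paths, and vocabulary file names below are illustrative, not fixed by the API:

from recommenders.datasets.amazon_reviews import download_and_extract, data_preprocessing

# Illustrative category/file names; any Amazon Reviews category can be used.
reviews_file = download_and_extract("reviews_Movies_and_TV_5.json", "data/reviews_Movies_and_TV_5.json")
meta_file = download_and_extract("meta_Movies_and_TV.json", "data/meta_Movies_and_TV.json")

data_preprocessing(
    reviews_file,
    meta_file,
    train_file="data/train_data",
    valid_file="data/valid_data",
    test_file="data/test_data",
    user_vocab="data/user_vocab.pkl",
    item_vocab="data/item_vocab.pkl",
    cate_vocab="data/category_vocab.pkl",
    sample_rate=0.01,
    valid_num_ngs=4,
    test_num_ngs=9,
)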
CORD-19#
COVID-19 Open Research Dataset (CORD-19) is a full-text and metadata dataset of COVID-19 and coronavirus-related scholarly articles optimized for machine readability and made available for use by the global research community.
In response to the COVID-19 pandemic, the Allen Institute for AI has partnered with leading research groups to prepare and distribute the COVID-19 Open Research Dataset (CORD-19), a free resource of over 47,000 scholarly articles, including over 36,000 with full text, about COVID-19 and the coronavirus family of viruses for use by the global research community.
This dataset is intended to mobilize researchers to apply recent advances in natural language processing to generate new insights in support of the fight against this infectious disease.
- Citation:
Wang, L.L., Lo, K., Chandrasekhar, Y., Reas, R., Yang, J., Eide, D., Funk, K., Kinney, R., Liu, Z., Merrill, W. and Mooney, P. “Cord-19: The COVID-19 Open Research Dataset.”, 2020.
- recommenders.datasets.covid_utils.clean_dataframe(df)[source]#
Clean up the dataframe.
- Parameters:
df (pandas.DataFrame) – Pandas dataframe.
- Returns:
Cleaned pandas dataframe.
- Return type:
df (pandas.DataFrame)
- recommenders.datasets.covid_utils.get_public_domain_text(df, container_name, azure_storage_account_name='azureopendatastorage', azure_storage_sas_token='')[source]#
Get all public domain text.
- Parameters:
df (pandas.DataFrame) – Metadata dataframe for public domain text.
container_name (str) – Azure storage container name.
azure_storage_account_name (str) – Azure storage account name.
azure_storage_sas_token (str) – Azure storage SAS token.
- Returns:
Dataframe with select metadata and full article text.
- Return type:
df_full (pandas.DataFrame)
- recommenders.datasets.covid_utils.load_pandas_df(azure_storage_account_name='azureopendatastorage', azure_storage_sas_token='', container_name='covid19temp', metadata_filename='metadata.csv')[source]#
Loads the Azure Open Research COVID-19 dataset as a pd.DataFrame.
The Azure COVID-19 Open Research Dataset may be found at https://azure.microsoft.com/en-us/services/open-datasets/catalog/covid-19-open-research/
- Parameters:
azure_storage_account_name (str) – Azure storage account name.
azure_storage_sas_token (str) – Azure storage SAS token.
container_name (str) – Azure storage container name.
metadata_filename (str) – Name of file containing top-level metadata for the dataset.
- Returns:
Metadata dataframe.
- Return type:
metadata (pandas.DataFrame)
- recommenders.datasets.covid_utils.remove_duplicates(df, cols)[source]#
Remove duplicated entries.
- Parameters:
df (pd.DataFrame) – Pandas dataframe.
cols (list of str) – Name of columns in which to look for duplicates.
- Returns:
Pandas dataframe with duplicate rows dropped.
- Return type:
df (pandas.DataFrame)
- recommenders.datasets.covid_utils.remove_nan(df, cols)[source]#
Remove rows with NaN values in the specified columns.
- Parameters:
df (pandas.DataFrame) – Pandas dataframe.
cols (list of str) – Name of columns in which to look for NaN.
- Returns:
Pandas dataframe with invalid rows dropped.
- Return type:
df (pandas.DataFrame)
- recommenders.datasets.covid_utils.retrieve_text(entry, container_name, azure_storage_account_name='azureopendatastorage', azure_storage_sas_token='')[source]#
Retrieve body text from article of interest.
- Parameters:
entry (pd.Series) – A single row from the dataframe (df.iloc[n]).
container_name (str) – Azure storage container name.
azure_storage_account_name (str) – Azure storage account name.
azure_storage_sas_token (str) – Azure storage SAS token.
- Returns:
Full text of the blob as a single string.
- Return type:
text (str)
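A minimal end-to-end sketch of these utilities. The metadata column names passed to remove_nan and remove_duplicates are illustrative; adjust them to the CORD-19 metadata schema you are working with:

from recommenders.datasets import covid_utils

# Load the CORD-19 metadata from Azure Open Datasets (public account by default).
metadata = covid_utils.load_pandas_df(container_name="covid19temp")

# Basic clean-up, then drop rows with missing or duplicated key fields.
metadata = covid_utils.clean_dataframe(metadata)
metadata = covid_utils.remove_nan(metadata, cols=["title", "license"])   # illustrative columns
metadata = covid_utils.remove_duplicates(metadata, cols=["cord_uid"])    # illustrative column

# Retrieve the full article text for the public-domain subset.
full_df = covid_utils.get_public_domain_text(metadata, container_name="covid19temp")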
Criteo#
The Criteo dataset, released by Criteo Labs, is an online advertising dataset that contains feature values and click feedback for millions of display ads. Every ad has 40 attributes: the first is the label, where a value of 1 means the ad was clicked and 0 means it was not. The rest consist of 13 integer columns and 26 categorical columns.
- recommenders.datasets.criteo.download_criteo(size='sample', work_directory='.')[source]#
Download the Criteo dataset as a compressed file.
- Parameters:
size (str) – Size of criteo dataset. It can be “full” or “sample”.
work_directory (str) – Working directory.
- Returns:
Path of the downloaded file.
- Return type:
str
- recommenders.datasets.criteo.extract_criteo(size, compressed_file, path=None)[source]#
Extract Criteo dataset tar.
- Parameters:
size (str) – Size of Criteo dataset. It can be “full” or “sample”.
compressed_file (str) – Path to compressed file.
path (str) – Path to extract the file.
- Returns:
Path to the extracted file.
- Return type:
str
- recommenders.datasets.criteo.get_spark_schema(header=['label', 'int00', 'int01', 'int02', 'int03', 'int04', 'int05', 'int06', 'int07', 'int08', 'int09', 'int10', 'int11', 'int12', 'cat00', 'cat01', 'cat02', 'cat03', 'cat04', 'cat05', 'cat06', 'cat07', 'cat08', 'cat09', 'cat10', 'cat11', 'cat12', 'cat13', 'cat14', 'cat15', 'cat16', 'cat17', 'cat18', 'cat19', 'cat20', 'cat21', 'cat22', 'cat23', 'cat24', 'cat25'])[source]#
Get Spark schema from header.
- Parameters:
header (list) – Dataset header names.
- Returns:
Spark schema.
- Return type:
pyspark.sql.types.StructType
- recommenders.datasets.criteo.load_pandas_df(size='sample', local_cache_path=None, header=['label', 'int00', 'int01', 'int02', 'int03', 'int04', 'int05', 'int06', 'int07', 'int08', 'int09', 'int10', 'int11', 'int12', 'cat00', 'cat01', 'cat02', 'cat03', 'cat04', 'cat05', 'cat06', 'cat07', 'cat08', 'cat09', 'cat10', 'cat11', 'cat12', 'cat13', 'cat14', 'cat15', 'cat16', 'cat17', 'cat18', 'cat19', 'cat20', 'cat21', 'cat22', 'cat23', 'cat24', 'cat25'])[source]#
Loads the Criteo DAC dataset as a pandas.DataFrame. This function downloads, untars, and loads the dataset.
The dataset consists of a portion of Criteo’s traffic over a period of 24 days. Each row corresponds to a display ad served by Criteo and the first column indicates whether this ad has been clicked or not.
There are 13 features taking integer values (mostly count features) and 26 categorical features. The values of the categorical features have been hashed onto 32 bits for anonymization purposes.
The schema is:
<label> <integer feature 1> ... <integer feature 13> <categorical feature 1> ... <categorical feature 26>
More details (need to accept user terms to see the information): http://labs.criteo.com/2013/12/download-terabyte-click-logs/
- Parameters:
size (str) – Dataset size. It can be “sample” or “full”.
local_cache_path (str) – Path where to cache the tar.gz file locally
header (list) – Dataset header names.
- Returns:
Criteo DAC sample dataset.
- Return type:
pandas.DataFrame
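For instance, the small sample split can be loaded directly into pandas (the full dataset is a large download):

from recommenders.datasets import criteo

# "sample" is a small subset suitable for experimentation; use size="full"
# for the complete DAC dataset.
df = criteo.load_pandas_df(size="sample")
print(df.shape)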
- recommenders.datasets.criteo.load_spark_df(spark, size='sample', header=['label', 'int00', 'int01', 'int02', 'int03', 'int04', 'int05', 'int06', 'int07', 'int08', 'int09', 'int10', 'int11', 'int12', 'cat00', 'cat01', 'cat02', 'cat03', 'cat04', 'cat05', 'cat06', 'cat07', 'cat08', 'cat09', 'cat10', 'cat11', 'cat12', 'cat13', 'cat14', 'cat15', 'cat16', 'cat17', 'cat18', 'cat19', 'cat20', 'cat21', 'cat22', 'cat23', 'cat24', 'cat25'], local_cache_path=None, dbfs_datapath='dbfs:/FileStore/dac', dbutils=None)[source]#
Loads the Criteo DAC dataset as pySpark.DataFrame.
The dataset consists of a portion of Criteo’s traffic over a period of 24 days. Each row corresponds to a display ad served by Criteo and the first column indicates whether this ad has been clicked or not.
There are 13 features taking integer values (mostly count features) and 26 categorical features. The values of the categorical features have been hashed onto 32 bits for anonymization purposes.
The schema is:
<label> <integer feature 1> ... <integer feature 13> <categorical feature 1> ... <categorical feature 26>
More details (need to accept user terms to see the information): http://labs.criteo.com/2013/12/download-terabyte-click-logs/
- Parameters:
spark (pySpark.SparkSession) – Spark session.
size (str) – Dataset size. It can be “sample” or “full”.
local_cache_path (str) – Path where to cache the tar.gz file locally.
header (list) – Dataset header names.
dbfs_datapath (str) – Where to store the extracted files on Databricks.
dbutils (Databricks.dbutils) – Databricks utility object.
- Returns:
Criteo DAC training dataset.
- Return type:
pyspark.sql.DataFrame
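A minimal sketch for the Spark loader, assuming pyspark is installed and a local session is used; on Databricks, also pass the dbutils object and, optionally, dbfs_datapath as documented above:

from pyspark.sql import SparkSession
from recommenders.datasets import criteo

spark = SparkSession.builder.appName("criteo").getOrCreate()
spark_df = criteo.load_spark_df(spark, size="sample")
spark_df.printSchema()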
MIND#
The MIcrosoft News Dataset (MIND) is a large-scale dataset for news recommendation research. It was collected from anonymized behavior logs of the Microsoft News website.
MIND contains about 160k English news articles and more than 15 million impression logs generated by 1 million users. Every news article contains rich textual content including title, abstract, body, category and entities. Each impression log contains the click events, non-clicked events and historical news click behaviors of this user before this impression. To protect user privacy, each user is de-linked from the production system and securely hashed into an anonymized ID.
- Citation:
Fangzhao Wu, Ying Qiao, Jiun-Hung Chen, Chuhan Wu, Tao Qi, Jianxun Lian, Danyang Liu, Xing Xie, Jianfeng Gao, Winnie Wu and Ming Zhou, “MIND: A Large-scale Dataset for News Recommendation”, ACL, 2020.
- recommenders.datasets.mind.download_and_extract_glove(dest_path)[source]#
Download and extract the GloVe embeddings.
- Parameters:
dest_path (str) – Destination directory path for the downloaded file
- Returns:
File path where GloVe was extracted.
- Return type:
str
- recommenders.datasets.mind.download_mind(size='small', dest_path=None)[source]#
Download MIND dataset
- Parameters:
size (str) – Dataset size. One of [“small”, “large”]
dest_path (str) – Download path. If path is None, the dataset is downloaded to a temporary path.
- Returns:
Path to train and validation sets.
- Return type:
str, str
- recommenders.datasets.mind.extract_mind(train_zip, valid_zip, train_folder='train', valid_folder='valid', clean_zip_file=True)[source]#
Extract MIND dataset
- Parameters:
train_zip (str) – Path to train zip file
valid_zip (str) – Path to valid zip file
train_folder (str) – Destination folder for train set
valid_folder (str) – Destination folder for validation set
- Returns:
Train and validation folders
- Return type:
str, str
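For example, the small variant can be downloaded and unzipped in two calls; when dest_path is None the files go to a temporary location:

from recommenders.datasets.mind import download_mind, extract_mind

train_zip, valid_zip = download_mind(size="small")
train_path, valid_path = extract_mind(train_zip, valid_zip)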
- recommenders.datasets.mind.generate_embeddings(data_path, news_words, news_entities, train_entities, valid_entities, max_sentence=10, word_embedding_dim=100)[source]#
Generate embeddings.
- Parameters:
data_path (str) – Data path.
news_words (dict) – News word dictionary.
news_entities (dict) – News entity dictionary.
train_entities (str) – Train entity file.
valid_entities (str) – Validation entity file.
max_sentence (int) – Max sentence size.
word_embedding_dim (int) – Word embedding dimension.
- Returns:
File paths to news, word and entity embeddings.
- Return type:
str, str, str
- recommenders.datasets.mind.get_train_input(session, train_file_path, npratio=4)[source]#
Generate train file.
- Parameters:
session (list) – List of user session with user_id, clicks, positive and negative interactions.
train_file_path (str) – Path to file.
npratio (int) – Ratio for negative sampling.
- recommenders.datasets.mind.get_user_history(train_history, valid_history, user_history_path)[source]#
Generate user history file.
- Parameters:
train_history (list) – Train history.
valid_history (list) – Validation history
user_history_path (str) – Path to file.
- recommenders.datasets.mind.get_valid_input(session, valid_file_path)[source]#
Generate validation file.
- Parameters:
session (list) – List of user session with user_id, clicks, positive and negative interactions.
valid_file_path (str) – Path to file.
- recommenders.datasets.mind.get_words_and_entities(train_news, valid_news)[source]#
Load words and entities
- Parameters:
train_news (str) – News train file.
valid_news (str) – News validation file.
- Returns:
Words and entities dictionaries.
- Return type:
dict, dict
- recommenders.datasets.mind.load_glove_matrix(path_emb, word_dict, word_embedding_dim)[source]#
Load the pretrained embedding matrix for the words in word_dict.
- Parameters:
path_emb (str) – Folder path of the downloaded GloVe file
word_dict (dict) – word dictionary
word_embedding_dim (int) – dimension of the word embedding vectors
- Returns:
Pretrained word embedding matrix and the list of words found in the GloVe files.
- Return type:
numpy.ndarray, list
- recommenders.datasets.mind.read_clickhistory(path, filename)[source]#
Read click history file
- Parameters:
path (str) – Folder path
filename (str) – Filename
- Returns:
A list of user session with user_id, clicks, positive and negative interactions.
A dictionary with user_id click history.
- Return type:
list, dict
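Taken together, these functions form the usual MIND preprocessing pipeline. The sketch below assumes train_path and valid_path point at folders produced by extract_mind and that the standard MIND file names (behaviors.tsv, news.tsv, entity_embedding.vec) are present; the output file names are illustrative:

import os
from recommenders.datasets.mind import (
    generate_embeddings,
    get_train_input,
    get_user_history,
    get_valid_input,
    get_words_and_entities,
    read_clickhistory,
)

data_path = "mind_data"              # illustrative working directory
train_path = "mind_data/train"       # illustrative; use the folder returned by extract_mind
valid_path = "mind_data/valid"       # illustrative; use the folder returned by extract_mind

# Click histories -> training/validation input files and the user history file.
train_session, train_history = read_clickhistory(train_path, "behaviors.tsv")
valid_session, valid_history = read_clickhistory(valid_path, "behaviors.tsv")
get_train_input(train_session, os.path.join(data_path, "train_mind.txt"))
get_valid_input(valid_session, os.path.join(data_path, "valid_mind.txt"))
get_user_history(train_history, valid_history, os.path.join(data_path, "user_history.txt"))

# News words and entities -> news, word, and entity embedding files.
news_words, news_entities = get_words_and_entities(
    os.path.join(train_path, "news.tsv"), os.path.join(valid_path, "news.tsv")
)
news_feature_file, word_embeddings_file, entity_embeddings_file = generate_embeddings(
    data_path,
    news_words,
    news_entities,
    os.path.join(train_path, "entity_embedding.vec"),
    os.path.join(valid_path, "entity_embedding.vec"),
    max_sentence=10,
    word_embedding_dim=100,
)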
MovieLens#
The MovieLens datasets, first released in 1998, describe people’s expressed preferences for movies. These preferences take the form of <user, item, rating, timestamp> tuples, each the result of a person expressing a preference (a 0-5 star rating) for a movie at a particular time.
It comes with several sizes:
MovieLens 100k: 100,000 ratings from 1000 users on 1700 movies.
MovieLens 1M: 1 million ratings from 6000 users on 4000 movies.
MovieLens 10M: 10 million ratings from 72000 users on 10000 movies.
MovieLens 20M: 20 million ratings from 138000 users on 27000 movies
- Citation:
F. M. Harper and J. A. Konstan. “The MovieLens Datasets: History and Context”. ACM Transactions on Interactive Intelligent Systems (TiiS) 5, 4, Article 19, DOI=http://dx.doi.org/10.1145/2827872, 2015.
- class recommenders.datasets.movielens.MockMovielensSchema(*args, **kwargs)[source]#
Mock dataset schema used to generate fake data for testing purposes. This schema is configured to mimic the MovieLens dataset
https://files.grouplens.org/datasets/movielens/ml-100k/
Dataset schema and generation is configured using pandera. Please see https://pandera.readthedocs.io/en/latest/schema_models.html for more information.
- classmethod get_df(size: int = 3, seed: int = 100, keep_first_n_cols: int | None = None, keep_title_col: bool = False, keep_genre_col: bool = False) DataFrame [source]#
Return fake movielens dataset as a Pandas Dataframe with specified rows.
- Parameters:
size (int) – number of rows to generate
seed (int, optional) – seeding the pseudo-number generation. Defaults to 100.
keep_first_n_cols (int, optional) – keep the first n default movielens columns.
keep_title_col (bool) – remove the title column if False. Defaults to False.
keep_genre_col (bool) – remove the genre column if False. Defaults to False.
- Returns:
a mock dataset
- Return type:
pandas.DataFrame
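For example, a small fake ratings frame for unit tests can be generated as follows (a sketch; the column flags depend on what the code under test expects):

from recommenders.datasets.movielens import MockMovielensSchema

mock_df = MockMovielensSchema.get_df(size=10, seed=42, keep_title_col=True, keep_genre_col=True)
print(mock_df.head())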
- classmethod get_spark_df(spark, size: int = 3, seed: int = 100, keep_title_col: bool = False, keep_genre_col: bool = False, tmp_path: str | None = None)[source]#
Return fake movielens dataset as a Spark Dataframe with specified rows
- Parameters:
spark (SparkSession) – spark session to load the dataframe into
size (int) – number of rows to generate
seed (int) – seeding the pseudo-number generation. Defaults to 100.
keep_title_col (bool) – remove the title column if False. Defaults to False.
keep_genre_col (bool) – remove the genre column if False. Defaults to False.
tmp_path (str, optional) – path to store files for serialization purposes when transferring data from Python to Java. If None, a temporary path is used instead
- Returns:
a mock dataset
- Return type:
pyspark.sql.DataFrame
- recommenders.datasets.movielens.download_movielens(size, dest_path)[source]#
Downloads MovieLens datafile.
- Parameters:
size (str) – Size of the data to load. One of (“100k”, “1m”, “10m”, “20m”).
dest_path (str) – File path for the downloaded file
- recommenders.datasets.movielens.extract_movielens(size, rating_path, item_path, zip_path)[source]#
Extract MovieLens rating and item datafiles from the MovieLens raw zip file.
To extract all files instead of just rating and item datafiles, use ZipFile’s extractall(path) instead.
- Parameters:
size (str) – Size of the data to load. One of (“100k”, “1m”, “10m”, “20m”).
rating_path (str) – Destination path for rating datafile
item_path (str) – Destination path for item datafile
zip_path (str) – zipfile path
- recommenders.datasets.movielens.load_item_df(size='100k', local_cache_path=None, movie_col='itemID', title_col=None, genres_col=None, year_col=None)[source]#
Loads Movie info.
- Parameters:
size (str) – Size of the data to load. One of (“100k”, “1m”, “10m”, “20m”).
local_cache_path (str) – Path (directory or a zip file) to cache the downloaded zip file. If None, all the intermediate files will be stored in a temporary directory and removed after use.
movie_col (str) – Movie id column name.
title_col (str) – Movie title column name. If None, the column will not be loaded.
genres_col (str) – Genres column name. Genres are ‘|’ separated string. If None, the column will not be loaded.
year_col (str) – Movie release year column name. If None, the column will not be loaded.
- Returns:
Movie information data, such as title, genres, and release year.
- Return type:
pandas.DataFrame
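For example, to load only the movie metadata for the 100k variant (the column names below are caller-chosen):

from recommenders.datasets.movielens import load_item_df

item_df = load_item_df(
    size="100k",
    movie_col="MovieId",
    title_col="Title",
    genres_col="Genres",
    year_col="Year",
)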
- recommenders.datasets.movielens.load_pandas_df(size='100k', header=None, local_cache_path=None, title_col=None, genres_col=None, year_col=None)[source]#
Loads the MovieLens dataset as pd.DataFrame.
Download the dataset from https://files.grouplens.org/datasets/movielens, unzip, and load. To load movie information only, you can use load_item_df function.
- Parameters:
size (str) – Size of the data to load. One of (“100k”, “1m”, “10m”, “20m”, “mock100”).
header (list or tuple or None) – Rating dataset header. If size is set to any of ‘MOCK_DATA_FORMAT’, this parameter is ignored and data is rendered using the ‘DEFAULT_HEADER’ instead.
local_cache_path (str) – Path (directory or a zip file) to cache the downloaded zip file. If None, all the intermediate files will be stored in a temporary directory and removed after use. If size is set to any of ‘MOCK_DATA_FORMAT’, this parameter is ignored.
title_col (str) – Movie title column name. If None, the column will not be loaded.
genres_col (str) – Genres column name. Genres are ‘|’ separated string. If None, the column will not be loaded.
year_col (str) – Movie release year column name. If None, the column will not be loaded. If size is set to any of ‘MOCK_DATA_FORMAT’, this parameter is ignored.
- Returns:
Movie rating dataset.
- Return type:
pandas.DataFrame
Examples
# To load just user-id, item-id, and ratings from the MovieLens-1M dataset:
df = load_pandas_df('1m', ('UserId', 'ItemId', 'Rating'))

# To load the rating's timestamp as well:
df = load_pandas_df('1m', ('UserId', 'ItemId', 'Rating', 'Timestamp'))

# To load the movie's title, genres, and release year info along with the ratings data:
df = load_pandas_df('1m', ('UserId', 'ItemId', 'Rating', 'Timestamp'),
                    title_col='Title', genres_col='Genres', year_col='Year')
- recommenders.datasets.movielens.load_spark_df(spark, size='100k', header=None, schema=None, local_cache_path=None, dbutils=None, title_col=None, genres_col=None, year_col=None)[source]#
Loads the MovieLens dataset as pyspark.sql.DataFrame.
Download the dataset from https://files.grouplens.org/datasets/movielens, unzip, and load as pyspark.sql.DataFrame.
To load movie information only, you can use load_item_df function.
- Parameters:
spark (pyspark.SparkSession) – Spark session.
size (str) – Size of the data to load. One of (“100k”, “1m”, “10m”, “20m”, “mock100”).
header (list or tuple) – Rating dataset header. If schema is provided or size is set to any of ‘MOCK_DATA_FORMAT’, this argument is ignored.
schema (pyspark.StructType) – Dataset schema. If size is set to any of ‘MOCK_DATA_FORMAT’, data is rendered in the ‘MockMovielensSchema’ instead.
local_cache_path (str) – Path (directory or a zip file) to cache the downloaded zip file. If None, all the intermediate files will be stored in a temporary directory and removed after use.
dbutils (Databricks.dbutils) – Databricks utility object. If size is set to any of ‘MOCK_DATA_FORMAT’, this parameter is ignored.
title_col (str) – Title column name. If None, the column will not be loaded.
genres_col (str) – Genres column name. Genres are ‘|’ separated string. If None, the column will not be loaded.
year_col (str) – Movie release year column name. If None, the column will not be loaded. If size is set to any of ‘MOCK_DATA_FORMAT’, this parameter is ignored.
- Returns:
Movie rating dataset.
- Return type:
pyspark.sql.DataFrame
Examples
# To load just user-id, item-id, and ratings from the MovieLens-1M dataset:
spark_df = load_spark_df(spark, '1m', ('UserId', 'ItemId', 'Rating'))

# The schema can be defined as well:
schema = StructType([
    StructField(DEFAULT_USER_COL, IntegerType()),
    StructField(DEFAULT_ITEM_COL, IntegerType()),
    StructField(DEFAULT_RATING_COL, FloatType()),
    StructField(DEFAULT_TIMESTAMP_COL, LongType()),
])
spark_df = load_spark_df(spark, '1m', ('UserId', 'ItemId', 'Rating'), schema=schema)

# To load the rating's timestamp as well:
spark_df = load_spark_df(spark, '1m', ('UserId', 'ItemId', 'Rating', 'Timestamp'))

# To load the movie's title, genres, and release year info along with the ratings data:
spark_df = load_spark_df(spark, '1m', ('UserId', 'ItemId', 'Rating', 'Timestamp'),
                         title_col='Title', genres_col='Genres', year_col='Year')

# On Databricks, pass the dbutils argument as follows:
spark_df = load_spark_df(spark, dbutils=dbutils)
Download utilities#
- recommenders.datasets.download_utils.download_path(path=None)[source]#
Return a path to download data. If path=None, it yields a temporary path that is eventually deleted; otherwise, it yields the real path of the input.
- Parameters:
path (str) – Path to download data.
- Returns:
Real path where the data is stored.
- Return type:
str
Examples
>>> with download_path() as path:
...     maybe_download(url="http://example.com/file.zip", work_directory=path)
- recommenders.datasets.download_utils.maybe_download(url, filename=None, work_directory='.', expected_bytes=None)[source]#
Download a file if it is not already downloaded.
- Parameters:
filename (str) – File name.
work_directory (str) – Working directory.
url (str) – URL of the file to download.
expected_bytes (int) – Expected file size in bytes.
- Returns:
File path of the file downloaded.
- Return type:
str
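A standalone sketch for maybe_download; the URL is illustrative, the file is fetched only if it is not already present in work_directory, and expected_bytes, when given, is used to verify the download size:

from recommenders.datasets.download_utils import maybe_download

filepath = maybe_download(
    url="https://example.com/file.zip",   # illustrative URL
    work_directory=".",
)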
Pandas dataframe utilities#
- class recommenders.datasets.pandas_df_utils.LibffmConverter(filepath=None)[source]#
Converts an input dataframe to another dataframe in libffm format. A text file of the converted Dataframe is optionally generated.
Note
The input dataframe is expected to represent the feature data in the following schema:
|field-1|field-2|...|field-n|rating|
|feature-1-1|feature-2-1|...|feature-n-1|1|
|feature-1-2|feature-2-2|...|feature-n-2|0|
...
|feature-1-i|feature-2-j|...|feature-n-k|0|
Where:
1. each field-* is a column name of the dataframe (the label/rating column is excluded),
2. each feature-*-* can be either a string or a numerical value, representing a categorical variable or an actual numerical value of the feature in that field, respectively, and
3. if there are ordinal variables represented as int types, users should make sure these columns are properly converted to string type.
The above data will be converted to the libffm format by following the convention as explained in this paper.
i.e. <field_index>:<field_feature_index>:1 or <field_index>:<field_feature_index>:<field_feature_value>, depending on the data type of the features in the original dataframe.
- Parameters:
filepath (str) – path to save the converted data.
- field_count#
count of fields in the libffm format data
- Type:
int
- feature_count#
count of features in the libffm format data
- Type:
int
- filepath#
file path where the output is stored - it can be None or a string
- Type:
str or None
Examples
>>> import pandas as pd
>>> df_feature = pd.DataFrame({
...     'rating': [1, 0, 0, 1, 1],
...     'field1': ['xxx1', 'xxx2', 'xxx4', 'xxx4', 'xxx4'],
...     'field2': [3, 4, 5, 6, 7],
...     'field3': [1.0, 2.0, 3.0, 4.0, 5.0],
...     'field4': ['1', '2', '3', '4', '5']
... })
>>> converter = LibffmConverter().fit(df_feature, col_rating='rating')
>>> df_out = converter.transform(df_feature)
>>> df_out
   rating field1 field2   field3  field4
0       1  1:1:1  2:4:3  3:5:1.0   4:6:1
1       0  1:2:1  2:4:4  3:5:2.0   4:7:1
2       0  1:3:1  2:4:5  3:5:3.0   4:8:1
3       1  1:3:1  2:4:6  3:5:4.0   4:9:1
4       1  1:3:1  2:4:7  3:5:5.0  4:10:1
- fit(df, col_rating='rating')[source]#
Fit the dataframe for libffm format. This method does nothing but check the validity of the input columns
- Parameters:
df (pandas.DataFrame) – input Pandas dataframe.
col_rating (str) – rating of the data.
- Returns:
the instance of the converter
- Return type:
object
- fit_transform(df, col_rating='rating')[source]#
Fit and transform in a single call.
- Parameters:
df (pandas.DataFrame) – input Pandas dataframe.
col_rating (str) – rating of the data.
- Returns:
Output libffm format dataframe.
- Return type:
pandas.DataFrame
- class recommenders.datasets.pandas_df_utils.PandasHash(pandas_object)[source]#
Wrapper class to allow pandas objects (DataFrames or Series) to be hashable
- recommenders.datasets.pandas_df_utils.filter_by(df, filter_by_df, filter_by_cols)[source]#
From the input DataFrame df, remove the records whose values in the target columns filter_by_cols exist in the filter-by DataFrame filter_by_df.
- Parameters:
df (pandas.DataFrame) – Source dataframe.
filter_by_df (pandas.DataFrame) – Filter dataframe.
filter_by_cols (iterable of str) – Filter columns.
- Returns:
Dataframe filtered by filter_by_df on filter_by_cols.
- Return type:
pandas.DataFrame
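A minimal example, removing user-item pairs that already appear in a "seen" dataframe:

import pandas as pd
from recommenders.datasets.pandas_df_utils import filter_by

ratings = pd.DataFrame({"userID": [1, 1, 2], "itemID": [10, 11, 10]})
seen = pd.DataFrame({"userID": [1], "itemID": [10]})

# Removes the rows whose (userID, itemID) combination appears in `seen`.
unseen = filter_by(ratings, seen, ["userID", "itemID"])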
- recommenders.datasets.pandas_df_utils.has_columns(df, columns)[source]#
Check if DataFrame has necessary columns
- Parameters:
df (pandas.DataFrame) – DataFrame
columns (iterable(str)) – columns to check for
- Returns:
True if DataFrame has specified columns.
- Return type:
bool
- recommenders.datasets.pandas_df_utils.has_same_base_dtype(df_1, df_2, columns=None)[source]#
Check if specified columns have the same base dtypes across both DataFrames
- Parameters:
df_1 (pandas.DataFrame) – first DataFrame
df_2 (pandas.DataFrame) – second DataFrame
columns (list(str)) – columns to check, None checks all columns
- Returns:
True if DataFrames columns have the same base dtypes.
- Return type:
bool
- recommenders.datasets.pandas_df_utils.lru_cache_df(maxsize, typed=False)[source]#
Least-recently-used cache decorator for pandas Dataframes.
Decorator to wrap a function with a memoizing callable that saves up to the maxsize most recent calls. It can save time when an expensive or I/O bound function is periodically called with the same arguments.
Inspired by the lru_cache function.
- Parameters:
maxsize (int|None) – max size of cache, if set to None cache is boundless
typed (bool) – arguments of different types are cached separately
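A sketch of the decorator in use; the wrapped function is illustrative:

import pandas as pd
from recommenders.datasets.pandas_df_utils import lru_cache_df

@lru_cache_df(maxsize=4)
def expensive_transform(df: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for an expensive or I/O-bound computation.
    return df.describe()

stats = expensive_transform(pd.DataFrame({"a": [1, 2, 3]}))  # repeated calls hit the cache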
- recommenders.datasets.pandas_df_utils.negative_feedback_sampler(df, col_user='userID', col_item='itemID', col_label='label', col_feedback='feedback', ratio_neg_per_user=1, pos_value=1, neg_value=0, seed=42)[source]#
Utility function to sample negative feedback from user-item interaction dataset. This negative sampling function will take the user-item interaction data to create binarized feedback, i.e., 1 and 0 indicate positive and negative feedback, respectively.
Negative sampling is frequently used in the literature to generate negative samples from user-item interaction data.
See for example the neural collaborative filtering paper.
- Parameters:
df (pandas.DataFrame) – input data that contains user-item tuples.
col_user (str) – user id column name.
col_item (str) – item id column name.
col_label (str) – label column name in df.
col_feedback (str) – feedback column name in the returned data frame; it is used for the generated column of positive and negative feedback.
ratio_neg_per_user (int) – ratio of negative feedback w.r.t to the number of positive feedback for each user. If the samples exceed the number of total possible negative feedback samples, it will be reduced to the number of all the possible samples.
pos_value (float) – value of positive feedback.
neg_value (float) – value of negative feedback.
seed (int) – seed for the random state of the sampling function.
- Returns:
Data with negative feedback.
- Return type:
pandas.DataFrame
Examples
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'userID': [1, 2, 3],
...     'itemID': [1, 2, 3],
...     'rating': [5, 5, 5]
... })
>>> df_neg_sampled = negative_feedback_sampler(
...     df, col_user='userID', col_item='itemID', ratio_neg_per_user=1
... )
>>> df_neg_sampled
 userID  itemID  feedback
      1       1         1
      1       2         0
      2       2         1
      2       1         0
      3       3         1
      3       1         0
- recommenders.datasets.pandas_df_utils.user_item_pairs(user_df, item_df, user_col='userID', item_col='itemID', user_item_filter_df=None, shuffle=True, seed=None)[source]#
Get all pairs of users and items data.
- Parameters:
user_df (pandas.DataFrame) – User data containing unique user ids and maybe their features.
item_df (pandas.DataFrame) – Item data containing unique item ids and maybe their features.
user_col (str) – User id column name.
item_col (str) – Item id column name.
user_item_filter_df (pd.DataFrame) – User-item pairs to be used as a filter.
shuffle (bool) – If True, shuffles the result.
seed (int) – Random seed for shuffle
- Returns:
All pairs of user-item from user_df and item_df, excepting the pairs in user_item_filter_df.
- Return type:
pandas.DataFrame
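A small example that builds the full user-item cross product and removes already-observed pairs:

import pandas as pd
from recommenders.datasets.pandas_df_utils import user_item_pairs

users = pd.DataFrame({"userID": [1, 2]})
items = pd.DataFrame({"itemID": [10, 11, 12]})
seen = pd.DataFrame({"userID": [1], "itemID": [10]})

# 2 x 3 = 6 candidate pairs, minus the one pair already present in `seen`.
candidates = user_item_pairs(users, items, user_item_filter_df=seen, shuffle=False)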
Splitter utilities#
Python splitters#
- recommenders.datasets.python_splitters.numpy_stratified_split(X, ratio=0.75, seed=42)[source]#
Split the user/item affinity matrix (sparse matrix) into train and test set matrices while maintaining local (i.e. per user) ratios.
Main points :
1. In a typical recommender problem, different users rate a different number of items, and therefore the user/item affinity matrix has a sparse structure with a variable number of zeroes (unrated items) per row (user). Cutting a fixed total amount of ratings would result in a non-homogeneous distribution between the train and test sets, i.e. some test users may have many ratings while others have very few, if any.
2. In an unsupervised learning problem, no explicit answer is given. For this reason the split needs to be implemented differently than in supervised learning. In the latter, one typically splits the dataset by rows (by examples), ending up with the same number of features but a different number of examples in the train/test sets. This scheme does not work in the unsupervised case, as part of the rated items needs to be used as a test set for a fixed number of users.
Solution:
1. Instead of cutting a total percentage, for each user we cut a relative ratio of the rated items. For example, if user1 has rated 4 items and user2 has rated 10, cutting 25% will correspond to 1 and 2.5 ratings in the test set, which are then rounded to integers. In this way, the 0.75 ratio is approximately satisfied both locally and globally, preserving the original distribution of ratings across the train and test sets.
2. It is easy (and fast) to satisfy this requirement by creating the test set via element subtraction from the original dataset X. We first create two copies of X; for each user we select a random sample of local size ratio (point 1) and erase the remaining ratings, obtaining in this way the train set matrix Xtr. The test set matrix Xtst is obtained in the complementary way.
- Parameters:
X (numpy.ndarray, int) – a sparse matrix to be split
ratio (float) – fraction of the entire dataset to constitute the train set
seed (int) – random seed
- Returns:
Xtr: The train set user/item affinity matrix.
Xtst: The test set user/item affinity matrix.
- Return type:
numpy.ndarray, numpy.ndarray
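A toy example on a small dense user/item matrix (zeros mean "unrated"):

import numpy as np
from recommenders.datasets.python_splitters import numpy_stratified_split

X = np.array([
    [5, 4, 3, 4, 0],
    [3, 2, 4, 5, 0],
    [1, 4, 4, 3, 0],
])

# Roughly 75% of each user's ratings end up in Xtr, the rest in Xtst.
Xtr, Xtst = numpy_stratified_split(X, ratio=0.75, seed=42)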
- recommenders.datasets.python_splitters.python_chrono_split(data, ratio=0.75, min_rating=1, filter_by='user', col_user='userID', col_item='itemID', col_timestamp='timestamp')[source]#
Pandas chronological splitter.
This function splits data in a chronological manner. That is, for each user / item, the split function takes proportions of ratings which are specified by the split ratio(s). The split is stratified.
- Parameters:
data (pandas.DataFrame) – Pandas DataFrame to be split.
ratio (float or list) – Ratio for splitting data. If it is a single float number it splits data into two halves and the ratio argument indicates the ratio of training data set; if it is a list of float numbers, the splitter splits data into several portions corresponding to the split ratios. If a list is provided and the ratios are not summed to 1, they will be normalized.
min_rating (int) – minimum number of ratings for user or item.
filter_by (str) – either “user” or “item”, depending on which of the two is to filter with min_rating.
col_user (str) – column name of user IDs.
col_item (str) – column name of item IDs.
col_timestamp (str) – column name of timestamps.
- Returns:
Splits of the input data as pandas.DataFrame.
- Return type:
list
- recommenders.datasets.python_splitters.python_random_split(data, ratio=0.75, seed=42)[source]#
Pandas random splitter.
The splitter randomly splits the input data.
- Parameters:
data (pandas.DataFrame) – Pandas DataFrame to be split.
ratio (float or list) – Ratio for splitting data. If it is a single float number it splits data into two halves and the ratio argument indicates the ratio of training data set; if it is a list of float numbers, the splitter splits data into several portions corresponding to the split ratios. If a list is provided and the ratios are not summed to 1, they will be normalized.
seed (int) – Seed.
- Returns:
Splits of the input data as pandas.DataFrame.
- Return type:
list
- recommenders.datasets.python_splitters.python_stratified_split(data, ratio=0.75, min_rating=1, filter_by='user', col_user='userID', col_item='itemID', seed=42)[source]#
Pandas stratified splitter.
For each user / item, the split function takes proportions of ratings which are specified by the split ratio(s). The split is stratified.
- Parameters:
data (pandas.DataFrame) – Pandas DataFrame to be split.
ratio (float or list) – Ratio for splitting data. If it is a single float number it splits data into two halves and the ratio argument indicates the ratio of training data set; if it is a list of float numbers, the splitter splits data into several portions corresponding to the split ratios. If a list is provided and the ratios are not summed to 1, they will be normalized.
seed (int) – Seed.
min_rating (int) – minimum number of ratings for user or item.
filter_by (str) – either “user” or “item”, depending on which of the two is to filter with min_rating.
col_user (str) – column name of user IDs.
col_item (str) – column name of item IDs.
- Returns:
Splits of the input data as pandas.DataFrame.
- Return type:
list
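A combined sketch of the three pandas splitters on a tiny ratings frame with the default column names:

import pandas as pd
from recommenders.datasets.python_splitters import (
    python_chrono_split,
    python_random_split,
    python_stratified_split,
)

ratings = pd.DataFrame({
    "userID":    [1, 1, 1, 1, 2, 2, 2, 2],
    "itemID":    [1, 2, 3, 4, 1, 2, 3, 4],
    "rating":    [4, 5, 3, 2, 5, 4, 3, 1],
    "timestamp": [10, 20, 30, 40, 10, 20, 30, 40],
})

train, test = python_random_split(ratings, ratio=0.75, seed=42)   # random 75/25 split
train, test = python_stratified_split(ratings, ratio=0.75)        # per-user 75/25 split
train, test = python_chrono_split(ratings, ratio=0.75)            # oldest 75% per user goes to train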
PySpark splitters#
- recommenders.datasets.spark_splitters.spark_chrono_split(data, ratio=0.75, min_rating=1, filter_by='user', col_user='userID', col_item='itemID', col_timestamp='timestamp', no_partition=False)[source]#
Spark chronological splitter.
This function splits data in a chronological manner. That is, for each user / item, the split function takes proportions of ratings which are specified by the split ratio(s). The split is stratified.
- Parameters:
data (pyspark.sql.DataFrame) – Spark DataFrame to be split.
ratio (float or list) – Ratio for splitting data. If it is a single float number it splits data into two sets and the ratio argument indicates the ratio of training data set; if it is a list of float numbers, the splitter splits data into several portions corresponding to the split ratios. If a list is provided and the ratios are not summed to 1, they will be normalized.
min_rating (int) – minimum number of ratings for user or item.
filter_by (str) – either “user” or “item”, depending on which of the two is to filter with min_rating.
col_user (str) – column name of user IDs.
col_item (str) – column name of item IDs.
col_timestamp (str) – column name of timestamps.
no_partition (bool) – set to enable more accurate and less efficient splitting.
- Returns:
Splits of the input data as pyspark.sql.DataFrame.
- Return type:
list
- recommenders.datasets.spark_splitters.spark_random_split(data, ratio=0.75, seed=42)[source]#
Spark random splitter.
Randomly split the data into several splits.
- Parameters:
data (pyspark.sql.DataFrame) – Spark DataFrame to be split.
ratio (float or list) – Ratio for splitting data. If it is a single float number it splits data into two halves and the ratio argument indicates the ratio of training data set; if it is a list of float numbers, the splitter splits data into several portions corresponding to the split ratios. If a list is provided and the ratios are not summed to 1, they will be normalized.
seed (int) – Seed.
- Returns:
Splits of the input data as pyspark.sql.DataFrame.
- Return type:
list
- recommenders.datasets.spark_splitters.spark_stratified_split(data, ratio=0.75, min_rating=1, filter_by='user', col_user='userID', col_item='itemID', seed=42)[source]#
Spark stratified splitter.
For each user / item, the split function takes proportions of ratings which are specified by the split ratio(s). The split is stratified.
- Parameters:
data (pyspark.sql.DataFrame) – Spark DataFrame to be split.
ratio (float or list) – Ratio for splitting data. If it is a single float number it splits data into two halves and the ratio argument indicates the ratio of training data set; if it is a list of float numbers, the splitter splits data into several portions corresponding to the split ratios. If a list is provided and the ratios are not summed to 1, they will be normalized.
seed (int) – Seed.
min_rating (int) – minimum number of ratings for user or item.
filter_by (str) – either “user” or “item”, depending on which of the two is to filter with min_rating.
col_user (str) – column name of user IDs.
col_item (str) – column name of item IDs.
- Returns:
Splits of the input data as pyspark.sql.DataFrame.
- Return type:
list
- recommenders.datasets.spark_splitters.spark_timestamp_split(data, ratio=0.75, col_user='userID', col_item='itemID', col_timestamp='timestamp')[source]#
Spark timestamp based splitter.
The splitter splits the data into sets by timestamps without stratification on either user or item. The ratios are applied on the timestamp column which is divided accordingly into several partitions.
- Parameters:
data (pyspark.sql.DataFrame) – Spark DataFrame to be split.
ratio (float or list) – Ratio for splitting data. If it is a single float number it splits data into two sets and the ratio argument indicates the ratio of training data set; if it is a list of float numbers, the splitter splits data into several portions corresponding to the split ratios. If a list is provided and the ratios are not summed to 1, they will be normalized. Earlier indexed splits will have earlier times (e.g. the latest time in split[0] <= the earliest time in split[1])
col_user (str) – column name of user IDs.
col_item (str) – column name of item IDs.
col_timestamp (str) – column name of timestamps, expressed as a float number of seconds since the Unix epoch.
- Returns:
Splits of the input data as pyspark.sql.DataFrame.
- Return type:
list
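A sketch for the Spark splitters, assuming pyspark is installed; the toy ratings frame uses the default userID/itemID/rating/timestamp column names (for example, the output of movielens.load_spark_df has the same shape):

from pyspark.sql import SparkSession
from recommenders.datasets.spark_splitters import (
    spark_chrono_split,
    spark_random_split,
    spark_stratified_split,
)

spark = SparkSession.builder.appName("splitters").getOrCreate()
ratings = spark.createDataFrame(
    [(1, 1, 4.0, 10), (1, 2, 5.0, 20), (1, 3, 3.0, 30), (1, 4, 2.0, 40),
     (2, 1, 5.0, 10), (2, 2, 4.0, 20), (2, 3, 3.0, 30), (2, 4, 1.0, 40)],
    schema=["userID", "itemID", "rating", "timestamp"],
)

train, test = spark_random_split(ratings, ratio=0.75, seed=42)
train, test = spark_stratified_split(ratings, ratio=0.75, seed=42)
train, test = spark_chrono_split(ratings, ratio=0.75)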
Other splitters utilities#
- recommenders.datasets.split_utils.filter_k_core(data, core_num=0, col_user='userID', col_item='itemID')[source]#
Filter rating dataframe for minimum number of users and items by repeatedly applying min_rating_filter until the condition is satisfied.
- recommenders.datasets.split_utils.min_rating_filter_pandas(data, min_rating=1, filter_by='user', col_user='userID', col_item='itemID')[source]#
Filter rating DataFrame for each user with minimum rating.
Filtering the rating data frame by a minimum number of ratings per user/item is useful for generating a new data frame of warm users/items. The warmth is defined by the min_rating argument. For example, a user is called warm if they have rated at least 4 items.
- Parameters:
data (pandas.DataFrame) – DataFrame of user-item tuples. Columns of user and item should be present in the DataFrame while other columns like rating, timestamp, etc. can be optional.
min_rating (int) – minimum number of ratings for user or item.
filter_by (str) – either “user” or “item”, depending on which of the two is to filter with min_rating.
col_user (str) – column name of user ID.
col_item (str) – column name of item ID.
- Returns:
DataFrame with at least columns of user and item that has been filtered by the given specifications.
- Return type:
pandas.DataFrame
- recommenders.datasets.split_utils.min_rating_filter_spark(data, min_rating=1, filter_by='user', col_user='userID', col_item='itemID')[source]#
Filter rating DataFrame for each user with minimum rating.
Filtering the rating data frame by a minimum number of ratings per user/item is useful for generating a new data frame of warm users/items. The warmth is defined by the min_rating argument. For example, a user is called warm if they have rated at least 4 items.
- Parameters:
data (pyspark.sql.DataFrame) – DataFrame of user-item tuples. Columns of user and item should be present in the DataFrame while other columns like rating, timestamp, etc. can be optional.
min_rating (int) – minimum number of ratings for user or item.
filter_by (str) – either “user” or “item”, depending on which of the two is to filter with min_rating.
col_user (str) – column name of user ID.
col_item (str) – column name of item ID.
- Returns:
DataFrame with at least columns of user and item that has been filtered by the given specifications.
- Return type:
pyspark.sql.DataFrame
- recommenders.datasets.split_utils.process_split_ratio(ratio)[source]#
Generate split ratio lists.
- Parameters:
ratio (float or list) – a float number that indicates the split ratio, or a list of float numbers that indicate split ratios (in the case of a multi-split).
- Returns:
bool: A boolean variable multi that indicates if the splitting is multi or single.
list: A list of normalized split ratios.
- Return type:
tuple
- recommenders.datasets.split_utils.split_pandas_data_with_ratios(data, ratios, seed=42, shuffle=False)[source]#
Helper function to split pandas DataFrame with given ratios
Note
Implementation referenced from this source.
- Parameters:
data (pandas.DataFrame) – Pandas data frame to be split.
ratios (list of floats) – list of ratios for split. The ratios have to sum to 1.
seed (int) – random seed.
shuffle (bool) – whether data will be shuffled when being split.
- Returns:
List of pd.DataFrame split by the given specifications.
- Return type:
list
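A small sketch combining the two helpers; [7, 2, 1] is normalized to [0.7, 0.2, 0.1] before splitting:

import pandas as pd
from recommenders.datasets.split_utils import (
    process_split_ratio,
    split_pandas_data_with_ratios,
)

df = pd.DataFrame({"userID": range(10), "itemID": range(10), "rating": [3] * 10})

multi, ratios = process_split_ratio([7, 2, 1])    # multi=True, ratios=[0.7, 0.2, 0.1]
splits = split_pandas_data_with_ratios(df, ratios, seed=42, shuffle=True)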
Sparse utilities#
- class recommenders.datasets.sparse.AffinityMatrix(df, items_list=None, col_user='userID', col_item='itemID', col_rating='rating', col_pred='prediction', save_path=None)[source]#
Generate the user/item affinity matrix from a pandas dataframe and vice versa
- gen_affinity_matrix()[source]#
Generate the user/item affinity matrix.
As a first step, two new columns are added to the input DF, containing the index maps generated by the gen_index() method. The new indices, together with the ratings, are then used to generate the user/item affinity matrix using scipy’s sparse matrix method coo_matrix; for reference see: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html. The input format is: coo_matrix((data, (rows, columns)), shape=(rows, columns))
- Returns:
User/item affinity matrix of dimensions (Nusers, Nitems) in numpy format. Unrated items are assigned a value of 0.
- Return type:
scipy.sparse.coo_matrix
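A minimal sketch, building the affinity matrix from a small ratings frame (depending on the library version, gen_affinity_matrix may also return the user/item index maps alongside the matrix):

import pandas as pd
from recommenders.datasets.sparse import AffinityMatrix

ratings = pd.DataFrame({
    "userID": [1, 1, 2, 3],
    "itemID": [10, 11, 10, 12],
    "rating": [4.0, 5.0, 3.0, 2.0],
})

am = AffinityMatrix(df=ratings, col_user="userID", col_item="itemID", col_rating="rating")
affinity = am.gen_affinity_matrix()   # (Nusers, Nitems) affinity matrix; unrated pairs are 0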
Knowledge graph utilities#
- recommenders.datasets.wikidata.find_wikidata_id(name, limit=1, session=None)[source]#
Find the entity ID in wikidata from a title string.
- Parameters:
name (str) – A string with search terms (e.g. “Batman (1989) film”)
limit (int) – Number of results to return
session (requests.Session) – requests session to reuse connections
- Returns:
wikidata entityID corresponding to the title string. ‘entityNotFound’ will be returned if no page is found
- Return type:
str
- recommenders.datasets.wikidata.get_session(session=None)[source]#
Get session object
- Parameters:
session (requests.Session) – request session object
- Returns:
request session object
- Return type:
requests.Session
- recommenders.datasets.wikidata.query_entity_description(entity_id, session=None)[source]#
Query entity wikidata description from entityID
- Parameters:
entity_id (str) – A wikidata page ID.
session (requests.Session) – requests session to reuse connections
- Returns:
Wikidata short description of the entityID. ‘descriptionNotFound’ will be returned if no description is found
- Return type:
str
- recommenders.datasets.wikidata.query_entity_links(entity_id, session=None)[source]#
Query all linked pages from a wikidata entityID
- Parameters:
entity_id (str) – A wikidata entity ID
session (requests.Session) – requests session to reuse connections
- Returns:
Dictionary with linked pages.
- Return type:
json
- recommenders.datasets.wikidata.read_linked_entities(data)[source]#
Obtain lists of linked entities (IDs and names) from dictionary
- Parameters:
data (json) – dictionary with linked pages
- Returns:
List of linked entityIDs.
List of linked entity names.
- Return type:
list, list
- recommenders.datasets.wikidata.search_wikidata(names, extras=None, describe=True, verbose=False)[source]#
Create DataFrame of Wikidata search results
- Parameters:
names (list[str]) – List of names to search for
extras (dict(str: list)) – Optional extra items to assign to results for the corresponding name
describe (bool) – Optional flag to include description of entity
verbose (bool) – Optional flag to print out intermediate data
- Returns:
Wikidata results for all names with found entities
- Return type:
pandas.DataFrame
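A usage sketch; the title strings are illustrative and live calls require network access to the Wikidata API:

from recommenders.datasets import wikidata

# Single-entity lookup, its description, and its linked pages.
entity_id = wikidata.find_wikidata_id("The Godfather (film)")
if entity_id != "entityNotFound":
    description = wikidata.query_entity_description(entity_id)
    links = wikidata.query_entity_links(entity_id)
    linked_ids, linked_names = wikidata.read_linked_entities(links)

# Or build a DataFrame of results for a list of names.
results_df = wikidata.search_wikidata(["The Godfather", "Casablanca"], describe=True)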