[Paper Implementation] Wide & Deep Learning for Recommender Systems (2016)
Wide & Deep Learning for Recommender Systems
- Paper published by Google, using its app store (Google Play) (link)
Official Google resources
- Google AI Blog (link)
- Google's TensorFlow GitHub (link): shows how the API is implemented
- TensorFlow v2.4 API
- tf.keras.experimental.WideDeepModel
- tf.estimator.DNNLinearCombinedClassifier : Linear(Wide) + DNN(Deep) => classifier API
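As a rough illustration of the "Linear (Wide) + DNN (Deep)" combination named above, the sketch below adds a linear logit and a small MLP logit and passes the sum through a sigmoid, which is the joint prediction form described in the paper. The weights, shapes, and the function name `wide_deep_predict` are made up for illustration; this is not the TensorFlow implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def wide_deep_predict(x_wide, x_deep, w_wide, W1, b1, w_out, bias):
    """Toy joint prediction: sigmoid(wide logit + deep logit + bias)."""
    wide_logit = x_wide @ w_wide                 # linear model on sparse / crossed features
    hidden = np.maximum(0.0, x_deep @ W1 + b1)   # a single ReLU hidden layer
    deep_logit = hidden @ w_out                  # projection of the last hidden activations
    return sigmoid(wide_logit + deep_logit + bias)

rng = np.random.default_rng(0)
x_wide = np.array([[1.0, 0.0, 1.0]])   # e.g. one-hot and cross-product features
x_deep = rng.normal(size=(1, 4))       # e.g. concatenated embeddings
prob = wide_deep_predict(
    x_wide, x_deep,
    w_wide=rng.normal(size=3),
    W1=rng.normal(size=(4, 8)), b1=np.zeros(8),
    w_out=rng.normal(size=8), bias=0.0,
)
print(prob.shape)
```

Both parts feed one sigmoid, so they are trained jointly rather than as an ensemble of two separately trained models.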
PyTorch library implementation
- I find PyTorch easier to use than TensorFlow (more intuitive, easier to debug), so I tried this library
- This library can also take text and image inputs when building a model
- pytorch-widedeep
# Verify the install after restarting the runtime
import pytorch_widedeep
import os
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from google.colab import drive
drive.mount("/content/drive")  # mount call implied by the output below
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).
data_path = "/content/drive/MyDrive/capstone/data/kmrd"
%cd $data_path
if not os.path.exists(data_path):
    # First run only: fetch the KMRD dataset repository and install it
    !git clone https://github.com/lovit/kmrd
    !python setup.py install
else:
    print("data and path already exists!")
path = data_path + '/kmr_dataset/datafile/kmrd-small'
data and path already exists!
df = pd.read_csv(os.path.join(path,'rates.csv'))
train_df, val_df = train_test_split(df, test_size=0.2, random_state=1234, shuffle=True)
(112568, 4)
# Keep only the first 1,000 rows because of resource limits
train_df = train_df[:1000]
# Load all related dataframe
movies_df = pd.read_csv(os.path.join(path, 'movies.txt'), sep='\t', encoding='utf-8')
movies_df = movies_df.set_index('movie')
castings_df = pd.read_csv(os.path.join(path, 'castings.csv'), encoding='utf-8')
countries_df = pd.read_csv(os.path.join(path, 'countries.csv'), encoding='utf-8')
genres_df = pd.read_csv(os.path.join(path, 'genres.csv'), encoding='utf-8')
# Get genre information
genres = [(list(set(x['movie'].values))[0], '/'.join(x['genre'].values)) for index, x in genres_df.groupby('movie')]
combined_genres_df = pd.DataFrame(data=genres, columns=['movie', 'genres'])
combined_genres_df = combined_genres_df.set_index('movie')
# Get castings information
castings = [(list(set(x['movie'].values))[0], x['people'].values) for index, x in castings_df.groupby('movie')]
combined_castings_df = pd.DataFrame(data=castings, columns=['movie','people'])
combined_castings_df = combined_castings_df.set_index('movie')
# Get countries for movie information
countries = [(list(set(x['movie'].values))[0], ','.join(x['country'].values)) for index, x in countries_df.groupby('movie')]
combined_countries_df = pd.DataFrame(data=countries, columns=['movie', 'country'])
combined_countries_df = combined_countries_df.set_index('movie')
movies_df = pd.concat([movies_df, combined_genres_df, combined_castings_df, combined_countries_df], axis=1)
print(movies_df.shape) # movies_df metadata is loaded to build cross-product features in the wide part
(999, 7)
title ... country
movie ...
10001 시네마 천국 ... 이탈리아,프랑스
10002 빽 투 더 퓨쳐 ... 미국
10003 빽 투 더 퓨쳐 2 ... 미국
10004 빽 투 더 퓨쳐 3 ... 미국
10005 스타워즈 에피소드 4 - 새로운 희망 ... 미국
[5 rows x 7 columns]
Index(['title', 'title_eng', 'year', 'grade', 'genres', 'people', 'country'], dtype='object')
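The three groupby list comprehensions above all follow the same pattern: collect the values for each movie and join them into one string. A shorter way to express that pattern, shown here on a made-up stand-in for `genres.csv`, is `groupby(...).agg(...)`. The toy rows are illustrative only and are not taken from the dataset.

```python
import pandas as pd

# Toy stand-in for genres.csv (values are illustrative only)
genres_df = pd.DataFrame({
    "movie": [10001, 10001, 10002],
    "genre": ["드라마", "멜로", "SF"],
})

# One row per movie, with its genres joined by '/'
combined_genres_df = (
    genres_df.groupby("movie")["genre"]
    .agg("/".join)
    .to_frame("genres")
)
print(combined_genres_df)
```

The result is already indexed by `movie`, so no separate `set_index` step is needed before the `pd.concat`.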
dummy_genres_df = movies_df['genres'].str.get_dummies(sep='/')
train_genres_df = train_df['movie'].apply(lambda x: dummy_genres_df.loc[x])
| SF | 가족 | 공포 | 느와르 | 다큐멘터리 | 드라마 | 로맨스 | 멜로 | 모험 | 뮤지컬 | 미스터리 | 범죄 | 서부 | 서사 | 스릴러 | 애니메이션 | 액션 | 에로 | 전쟁 | 코미디 | 판타지 |
137023 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
92868 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
94390 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
22289 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
80155 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
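The per-row `apply(lambda x: dummy_genres_df.loc[x])` lookup above works, but it calls `.loc` once per rating. A minimal sketch of a single vectorized lookup follows, assuming every movie id in the ratings exists in the movie table. The small frames and the names `movies` and `ratings` are hypothetical.

```python
import pandas as pd

movies = pd.DataFrame(
    {"genres": ["드라마/멜로", "SF", "드라마"]},
    index=pd.Index([10001, 10002, 10003], name="movie"),
)
ratings = pd.DataFrame(
    {"user": [1, 2, 3], "movie": [10003, 10001, 10001]},
    index=[7, 8, 9],
)

# Multi-hot encode the '/'-separated genre strings
dummy_genres = movies["genres"].str.get_dummies(sep="/")

# Look up every rating's movie in one call, then restore the ratings index
train_genres = dummy_genres.loc[ratings["movie"]].set_index(ratings.index)
print(train_genres)
```

Keeping the ratings index matters because the later `pd.concat(..., axis=1)` aligns rows by index.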
dummy_grade_df = pd.get_dummies(movies_df['grade'], prefix='grade')
train_grade_df = train_df['movie'].apply(lambda x: dummy_grade_df.loc[x])
| grade_12세 관람가 | grade_15세 관람가 | grade_G | grade_NR | grade_PG | grade_PG-13 | grade_R | grade_전체 관람가 | grade_청소년 관람불가 |
137023 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
92868 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
94390 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
22289 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
80155 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
train_df['year'] = train_df.apply(lambda x: movies_df.loc[x['movie']]['year'], axis=1) # continuous value
train_df = pd.concat([train_df, train_grade_df, train_genres_df], axis=1)
| user | movie | rate | time | year | grade_12세 관람가 | grade_15세 관람가 | grade_G | grade_NR | grade_PG | grade_PG-13 | grade_R | grade_전체 관람가 | grade_청소년 관람불가 | SF | 가족 | 공포 | 느와르 | 다큐멘터리 | 드라마 | 로맨스 | 멜로 | 모험 | 뮤지컬 | 미스터리 | 범죄 | 서부 | 서사 | 스릴러 | 애니메이션 | 액션 | 에로 | 전쟁 | 코미디 | 판타지 |
137023 | 48423 | 10764 | 10 | 1212241560 | 1987.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
92868 | 17307 | 10170 | 10 | 1122185220 | 1985.0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
94390 | 18180 | 10048 | 10 | 1573403460 | 2016.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
22289 | 1498 | 10001 | 9 | 1432684500 | 2013.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
80155 | 12541 | 10022 | 10 | 1370458140 | 1980.0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
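The `year` column above is attached with a row-wise `apply`. The same lookup can be sketched as a single `DataFrame.join` on the movie id. The toy frames below are hypothetical and only mirror the column names used in this post.

```python
import pandas as pd

movies = pd.DataFrame(
    {"year": [1988.0, 1985.0]},
    index=pd.Index([10001, 10002], name="movie"),
)
ratings = pd.DataFrame({
    "user": [1, 2, 3],
    "movie": [10002, 10001, 10002],
    "rate": [10, 9, 7],
})

# Left-join the year onto each rating by matching ratings['movie'] to the movies index
ratings = ratings.join(movies["year"], on="movie")
print(ratings)
```

A rating whose movie is missing from the movie table gets NaN for `year` instead of raising an error, so it is worth checking for NaN afterwards.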
wide_cols = list(dummy_genres_df.columns) + list(dummy_grade_df.columns) # to look at interactions between genre and grade
['SF', '가족', '공포', '느와르', '다큐멘터리', '드라마', '로맨스', '멜로', '모험', '뮤지컬', '미스터리', '범죄', '서부', '서사', '스릴러', '애니메이션', '액션', '에로', '전쟁', '코미디', '판타지', 'grade_12세 관람가', 'grade_15세 관람가', 'grade_G', 'grade_NR', 'grade_PG', 'grade_PG-13', 'grade_R', 'grade_전체 관람가', 'grade_청소년 관람불가']
# The number of combinations grows too quickly, so use only the first three columns of each group
wide_cols = list(dummy_genres_df.columns)[0:3] + list(dummy_grade_df.columns)[0:3]
# wide_cols = ['genre', 'grade']
# cross_cols = [('genre', 'grade')]
['SF', '가족', '공포', 'grade_12세 관람가', 'grade_15세 관람가', 'grade_G']
## Build the cross columns
import itertools
from itertools import product
unique_combinations = list(list(zip(wide_cols, element))
                           for element in product(wide_cols, repeat=len(wide_cols)))
cross_cols = [item for sublist in unique_combinations for item in sublist]
cross_cols = [x for x in cross_cols if x[0] != x[1]]
cross_cols = list(set(cross_cols))
[('grade_15세 관람가', 'grade_12세 관람가'), ('grade_G', '공포'), ('SF', '가족'), ('grade_12세 관람가', '공포'), ('공포', 'grade_G'), ('가족', 'grade_12세 관람가'), ('SF', 'grade_15세 관람가'), ('grade_15세 관람가', '가족'), ('공포', 'SF'), ('가족', 'SF'), ('grade_12세 관람가', '가족'), ('공포', 'grade_12세 관람가'), ('grade_G', '가족'), ('grade_G', 'grade_15세 관람가'), ('공포', '가족'), ('grade_12세 관람가', 'SF'), ('grade_15세 관람가', 'grade_G'), ('grade_15세 관람가', '공포'), ('공포', 'grade_15세 관람가'), ('가족', '공포'), ('SF', 'grade_G'), ('가족', 'grade_G'), ('SF', '공포'), ('grade_G', 'grade_12세 관람가'), ('grade_G', 'SF'), ('가족', 'grade_15세 관람가'), ('grade_12세 관람가', 'grade_15세 관람가'), ('SF', 'grade_12세 관람가'), ('grade_12세 관람가', 'grade_G'), ('grade_15세 관람가', 'SF')]
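The construction above enumerates all 6**6 tuples from `product` only to end up with every ordered pair of distinct columns. The sketch below checks that `itertools.permutations(wide_cols, 2)` yields the same 30 pairs directly, and shows `combinations` as an option if each pair should appear only once. Whether the reversed duplicates such as `('SF', '가족')` and `('가족', 'SF')` add anything for the wide part is not verified here; they appear to cross the same two columns.

```python
from itertools import combinations, permutations, product

wide_cols = ["SF", "가족", "공포", "grade_12세 관람가", "grade_15세 관람가", "grade_G"]

# Construction used in this post: zip each 6-tuple of the product with wide_cols
original = {
    pair
    for element in product(wide_cols, repeat=len(wide_cols))
    for pair in zip(wide_cols, element)
    if pair[0] != pair[1]
}

ordered_pairs = list(permutations(wide_cols, 2))    # 6 * 5 = 30 ordered pairs
unordered_pairs = list(combinations(wide_cols, 2))  # 6 * 5 / 2 = 15 unordered pairs

print(len(original), len(ordered_pairs), len(unordered_pairs))
```

With the full 30 wide columns the `product` approach would need 30**30 tuples, so the direct form is the only practical one beyond a handful of columns.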
## Deep
# embed_cols = [('genre', 16),('grade', 16)]
embed_cols = list(set([(x[0], 16) for x in cross_cols])) # only the embedding size is specified for each column
continuous_cols = ['year']
[('가족', 16), ('grade_15세 관람가', 16), ('공포', 16), ('grade_12세 관람가', 16), ('grade_G', 16), ('SF', 16)]
target = train_df['rate'].apply(lambda x: 1 if x > 9 else 0).values # binarize: 1 for a rating of 10, 0 otherwise
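The binarization above can also be written as a vectorized comparison, and the mean of the resulting 0/1 array gives the share of positive labels, which is a useful baseline to compare accuracy against. The ratings in this sketch are made up.

```python
import pandas as pd

rates = pd.Series([10, 9, 10, 7, 10])  # illustrative ratings, not from the dataset

target_apply = rates.apply(lambda x: 1 if x > 9 else 0).values  # form used in this post
target_vec = (rates > 9).astype(int).values                     # vectorized equivalent

positive_share = target_vec.mean()  # fraction of rows labeled 1
print(target_vec, positive_share)
```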
Wide & Deep
from pytorch_widedeep.preprocessing import WidePreprocessor,TabPreprocessor
from pytorch_widedeep.models import Wide, TabMlp, WideDeep
from pytorch_widedeep.metrics import Accuracy
Wide Component
preprocess_wide = WidePreprocessor(wide_cols=wide_cols, crossed_cols=cross_cols)
X_wide = preprocess_wide.fit_transform(train_df)
wide = Wide(wide_dim=np.unique(X_wide).shape[0], pred_dim=1)
(wide_linear): Embedding(108, 1, padding_idx=0)
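The printed `Embedding(108, 1, padding_idx=0)` suggests that the wide part is stored as a lookup table with one scalar weight per encoded feature value, plus a padding slot at index 0. Under that reading, summing the looked-up weights for a row is the same as a linear model applied to a one-hot vector. The NumPy sketch below illustrates the equivalence with made-up sizes and indices; it does not use the library itself.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 5                        # number of distinct encoded values (hypothetical)
weights = rng.normal(size=n_features + 1)
weights[0] = 0.0                      # index 0 is reserved for padding

active = np.array([2, 4, 5])          # encoded indices present in one row

lookup_sum = weights[active].sum()    # embedding-style lookup, then sum

one_hot = np.zeros(n_features + 1)
one_hot[active] = 1.0
dot = one_hot @ weights               # ordinary linear model on the one-hot vector

print(np.isclose(lookup_sum, dot))
```

This is why `wide_dim` is set to the number of unique encoded values in `X_wide` rather than to the number of columns.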
Deep Component
preprocess_deep = TabPreprocessor(embed_cols=embed_cols, continuous_cols=continuous_cols)
X_deep = preprocess_deep.fit_transform(train_df)
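deepdense = TabMlp(
    mlp_hidden_dims=[64, 32],
    # The remaining arguments were cut off in the original post. The three below are
    # an assumption based on the pytorch-widedeep 1.0.x API and the printed model.
    column_idx=preprocess_deep.column_idx,
    embed_input=preprocess_deep.embeddings_input,
    continuous_cols=continuous_cols,
)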
(cat_embed_and_cont): CatEmbeddingsAndCont(
(embed_layers): ModuleDict(
(emb_layer_가족): Embedding(3, 16, padding_idx=0)
(emb_layer_grade_15세 관람가): Embedding(3, 16, padding_idx=0)
(emb_layer_공포): Embedding(3, 16, padding_idx=0)
(emb_layer_grade_12세 관람가): Embedding(3, 16, padding_idx=0)
(emb_layer_grade_G): Embedding(2, 16, padding_idx=0)
(emb_layer_SF): Embedding(3, 16, padding_idx=0)
(embedding_dropout): Dropout(p=0.1, inplace=False)
(cont_norm): BatchNorm1d(1, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
(tab_mlp): MLP(
(mlp): Sequential(
(dense_layer_0): Sequential(
(0): Dropout(p=0.1, inplace=False)
(1): Linear(in_features=97, out_features=64, bias=True)
(2): ReLU(inplace=True)
(dense_layer_1): Sequential(
(0): Dropout(p=0.1, inplace=False)
(1): Linear(in_features=64, out_features=32, bias=True)
(2): ReLU(inplace=True)
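The `in_features=97` of the first dense layer in the printout above follows from the inputs: six embedded columns of size 16 each, concatenated with the single continuous column `year`. The arithmetic can be checked directly:

```python
embed_cols = [("가족", 16), ("grade_15세 관람가", 16), ("공포", 16),
              ("grade_12세 관람가", 16), ("grade_G", 16), ("SF", 16)]
continuous_cols = ["year"]

# Width of the vector fed to the MLP: all embedding sizes plus one slot per continuous column
mlp_in_features = sum(dim for _, dim in embed_cols) + len(continuous_cols)
print(mlp_in_features)  # 97
```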
Build and Train
# build, compile and fit
from pytorch_widedeep import Trainer
model = WideDeep(wide=wide, deeptabular=deepdense)
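trainer = Trainer(model, objective="binary", metrics=[Accuracy])
# The fit call is missing from the original post. n_epochs=5 matches the log below;
# batch_size and val_split are assumptions consistent with 4 train batches and 1 valid batch.
trainer.fit(X_wide=X_wide, X_tab=X_deep, target=target, n_epochs=5, batch_size=256, val_split=0.2)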
epoch 1: 100%|██████████| 4/4 [00:00<00:00, 9.40it/s, loss=nan, metrics={'acc': 0.1833}]
valid: 100%|██████████| 1/1 [00:00<00:00, 12.79it/s, loss=nan, metrics={'acc': 0.0}]
epoch 2: 100%|██████████| 4/4 [00:00<00:00, 63.76it/s, loss=nan, metrics={'acc': 0.0}]
valid: 100%|██████████| 1/1 [00:00<00:00, 15.79it/s, loss=nan, metrics={'acc': 0.0}]
epoch 3: 100%|██████████| 4/4 [00:00<00:00, 54.59it/s, loss=nan, metrics={'acc': 0.0}]
valid: 100%|██████████| 1/1 [00:00<00:00, 12.24it/s, loss=nan, metrics={'acc': 0.0}]
epoch 4: 100%|██████████| 4/4 [00:00<00:00, 61.07it/s, loss=nan, metrics={'acc': 0.0}]
valid: 100%|██████████| 1/1 [00:00<00:00, 14.63it/s, loss=nan, metrics={'acc': 0.0}]
epoch 5: 100%|██████████| 4/4 [00:00<00:00, 54.58it/s, loss=nan, metrics={'acc': 0.0}]
valid: 100%|██████████| 1/1 [00:00<00:00, 13.19it/s, loss=nan, metrics={'acc': 0.0}]
(1000, 4)
(1000, 9)
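The training log above reports `loss=nan` from the first epoch, so the accuracy values are not meaningful. One plausible cause, stated here as an assumption that has not been verified against this dataset, is missing values in `year`: the column is displayed as floats such as 1987.0, which is how pandas stores a numeric column that contains NaN, and a single NaN in a continuous input is enough to make the loss NaN. The sketch below uses a toy frame to show a check and one simple imputation.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df; the NaN mimics a movie with no release year
train_df = pd.DataFrame({
    "movie": [10001, 10002, 10003],
    "year": [1988.0, np.nan, 2016.0],
})

n_missing = int(train_df["year"].isna().sum())
print("missing year values:", n_missing)

# One simple option: fill missing years with the median of the observed years
train_df["year"] = train_df["year"].fillna(train_df["year"].median())
print(train_df)
```

If `n_missing` is greater than zero on the real data, imputing or dropping those rows before `TabPreprocessor.fit_transform` would be the first thing to try.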