Commit e0f1739

authored
Add files via upload
1 parent e6884b5 commit e0f1739

19 files changed: +688, -19 lines

.gitignore

Lines changed: 33 additions & 0 deletions

```gitignore
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]

# C extensions
*.so

# Log files
*.log
tb_logs/
lightning_logs/
wandb/
logs

# KeOps-related generated files
CMakeCache.txt
CMakeFiles/
build-libKeOps*

# Do not add Jupyter notebooks
*.ipynb
.ipynb_checkpoints/

# Other
EV_GCN/
dgcnn.pytorch/
*.tikz
stats_*
train_*
data/

# how_to_install.txt
how_to_install.txt
```

README.md

Lines changed: 5 additions & 19 deletions

````diff
@@ -1,22 +1,7 @@
-# Empirical Validations of Graph Structure Learning methods for Citation Network Applications
+# DGM_pytorch
 
-Code for the BSc Thesis "Empirical Validations of Graph Structure Learning methods for Citation Network Applications" by Hoang Thien Ly at the Faculty of Mathematics and Information Science, Warsaw University of Technology.
+Code for the paper "Differentiable Graph Module (DGM) for Graph Convolutional Networks" by Anees Kazi*, Luca Cosmo*, Seyed-Ahmad Ahmadi, Nassir Navab, and Michael Bronstein
 
-## Abstract
-
-This Bachelor's Thesis examines the classification accuracy of graph structure learning methods in the graph neural network domain, with a focus on classifying papers in citation network datasets. Graph neural networks (GNNs) have recently emerged as a powerful machine learning concept that generalizes successful deep neural architectures to non-Euclidean structured data with high performance. However, one limitation of the majority of current GNNs is the assumption that the underlying graph is known and fixed. In practice, real-world graphs are often noisy and incomplete, or might even be completely unknown. In such cases, it is helpful to infer the graph structure directly from the data. Additionally, graph structure learning permits learning latent structures, which may improve the understanding of GNN models by providing edge weights among entities in the graph, enabling further analysis by modellers.
-
-As part of the work, we will:
-* review the current state-of-the-art graph structure learning (GSL) methods.
-* empirically validate GSL methods by accuracy scores on citation network datasets.
-* analyze the mechanism of these approaches and the influence of hyperparameters on model behavior.
-* discuss future work.
-Keywords: graph neural network, graph structure learning, empirical validations, citation network applications.
 
 ## Installation
 
@@ -39,9 +24,10 @@ pip install torch-geometric
 
 ## Training
 
-For the dDGM framework, to train a model with the default options, run the following command:
+To train a model with the default options, run the following command:
 ```
 python train.py
 ```
 
-Other frameworks and code are presented in the Jupyter notebook.
+## Notes
+The graph sampling code is based on a modified version of the KeOps library (www.kernel-operations.io) to speed up the computation. In particular, the argKmin function of the original library has been modified to handle the stochasticity of the sampling strategy, adding samples drawn from a Gumbel distribution to the input before performing the reduction.
````
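For readers without the patched KeOps build, here is a minimal sketch of the sampling idea in plain PyTorch. It is illustrative only; the function name and the top-k-over-negative-distances formulation are assumptions, not the repo's API (the actual code performs a Gumbel-perturbed argKmin reduction inside KeOps).

```python
import torch

def gumbel_topk_neighbors(neg_dist: torch.Tensor, k: int, tau: float = 1.0) -> torch.Tensor:
    """Stochastic k-NN: perturb scores with Gumbel noise, then take top-k.

    `neg_dist` holds negative pairwise distances, so top-k over the perturbed
    scores mirrors an argKmin reduction over Gumbel-perturbed distances.
    """
    u = torch.rand_like(neg_dist)
    gumbel = -torch.log(-torch.log(u + 1e-20) + 1e-20)  # Gumbel(0, 1) samples
    return (neg_dist / tau + gumbel).topk(k, dim=-1).indices

# Example: 5 stochastic neighbors for each of 10 nodes in a 16-d embedding
# (self-loops are not excluded in this sketch).
emb = torch.randn(10, 16)
idx = gumbel_topk_neighbors(-torch.cdist(emb, emb), k=5)
```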

__init__.py

Whitespace-only changes.

compute_stat.sh

Lines changed: 9 additions & 0 deletions

```bash
#!/bin/bash

# Sweep over datasets and k (number of sampled neighbors), with embedding dim 4.
for i in Cora CiteSeer PubMed
do
    for j in 1 3 5 10 20
    do
        python ./stats/stats.py --dataset $i --k $j --dim 4
    done
done
```

compute_stat_embed.sh

Lines changed: 9 additions & 0 deletions

```bash
#!/bin/bash

# Sweep over datasets and graph-embedding dimensions.
for i in Cora CiteSeer PubMed
do
    for j in 2 6 8
    do
        python ./stats/stats.py --dataset $i --dim $j
    done
done
```

compute_stat_ffun.sh

Lines changed: 9 additions & 0 deletions

```bash
#!/bin/bash

# Sweep over datasets and graph-embedding functions (ffun).
for data in Cora CiteSeer PubMed
do
    for f in gcn gat mlp
    do
        python ./stats/stats.py --ffun $f --dataset $data
    done
done
```

compute_stat_gfun.sh

Lines changed: 12 additions & 0 deletions

```bash
#!/bin/bash

# Sweep over datasets, DGM layer configurations, and diffusion functions (gfun).
# Each --dgm_layers configuration is quoted so it is passed as a single
# argument; unquoted, the spaces in "[[], [], []]" would be word-split by the shell.
for data in Cora CiteSeer PubMed
do
    for dgm in '[[],[],[]]' '[[32,16,4],[],[]]'
    do
        for g in gcn gat edgeconv
        do
            python ./stats/stats.py --gfun $g --dataset $data --dgm_layers $dgm
        done
    done
done
```

compute_stat_hyperbolic.sh

Lines changed: 9 additions & 0 deletions

```bash
#!/bin/bash

# Same k sweep as compute_stat.sh, but with the hyperbolic (Poincare) distance.
for i in Cora CiteSeer PubMed
do
    for j in 1 3 5 10 20
    do
        python ./stats/stats.py --dataset $i --k $j --distance hyperbolic --dim 4
    done
done
```
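The `--distance hyperbolic` flag above selects the Poincare-distance layer listed under layers.py in documentation.md below. For reference, a minimal sketch of the standard Poincare-ball distance, assuming the repo follows the usual formula (this is not the repo's own implementation):

```python
import torch

def poincare_distance(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """Distance on the Poincare ball; inputs must have norm < 1."""
    sq_diff = ((x - y) ** 2).sum(-1)
    denom = (1 - (x ** 2).sum(-1)) * (1 - (y ** 2).sum(-1))
    return torch.acosh(1 + 2 * sq_diff / (denom + eps))
```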

datasets.py

Lines changed: 153 additions & 0 deletions

```python
import pickle
import os.path as osp

import numpy as np
import torch
import torch_geometric.transforms as T
from torch_geometric.datasets import Planetoid


class UKBBAgeDataset(torch.utils.data.Dataset):
    """UK Biobank age-classification dataset (one fold of a k-fold split)."""

    def __init__(self, fold=0, train=True, samples_per_epoch=100, device='cpu'):
        with open('data/UKBB.pickle', 'rb') as f:
            X_, y_, train_mask_, test_mask_, weight_ = pickle.load(f)  # load the data

        self.X = torch.from_numpy(X_[:, :, fold]).float().to(device)
        self.y = torch.from_numpy(y_[:, :, fold]).float().to(device)
        self.weight = torch.from_numpy(np.squeeze(weight_[:1, fold])).float().to(device)
        if train:
            self.mask = torch.from_numpy(train_mask_[:, fold]).to(device)
        else:
            self.mask = torch.from_numpy(test_mask_[:, fold]).to(device)

        self.samples_per_epoch = samples_per_epoch

    def __len__(self):
        return self.samples_per_epoch

    def __getitem__(self, idx):
        return self.X, self.y, self.mask


class TadpoleDataset(torch.utils.data.Dataset):
    """TADPOLE disease-classification dataset (one fold of a k-fold split)."""

    def __init__(self, fold=0, train=True, samples_per_epoch=100, device='cpu', full=False):
        with open('data/tadpole_data.pickle', 'rb') as f:
            X_, y_, train_mask_, test_mask_, weight_ = pickle.load(f)  # load the data

        if not full:
            X_ = X_[..., :30, :]  # for DGM we use modality 1 (M1) for both node representation and graph learning

        self.n_features = X_.shape[-2]
        self.num_classes = y_.shape[-2]

        self.X = torch.from_numpy(X_[:, :, fold]).float().to(device)
        self.y = torch.from_numpy(y_[:, :, fold]).float().to(device)
        self.weight = torch.from_numpy(np.squeeze(weight_[:1, fold])).float().to(device)
        if train:
            self.mask = torch.from_numpy(train_mask_[:, fold]).to(device)
        else:
            self.mask = torch.from_numpy(test_mask_[:, fold]).to(device)

        self.samples_per_epoch = samples_per_epoch

    def __len__(self):
        return self.samples_per_epoch

    def __getitem__(self, idx):
        # the empty list stands in for the (absent) input graph
        return self.X, self.y, self.mask, [[]]


# An earlier TadpoleDataset variant that carved a validation split out of the
# training mask; kept commented out, as in the original commit.
# class TadpoleDataset(torch.utils.data.Dataset):
#     def __init__(self, fold=0, split='train', samples_per_epoch=100, device='cpu'):
#         with open('data/train_data.pickle', 'rb') as f:
#             X_, y_, train_mask_, test_mask_, weight_ = pickle.load(f)  # load the data
#
#         X_ = X_[..., :30, :]  # for DGM we use modality 1 (M1) for both node representation and graph learning
#
#         self.X = torch.from_numpy(X_[:, :, fold]).float().to(device)
#         self.y = torch.from_numpy(y_[:, :, fold]).float().to(device)
#         self.weight = torch.from_numpy(np.squeeze(weight_[:1, fold])).float().to(device)
#
#         # split train set in train/val
#         train_mask = train_mask_[:, fold]
#         nval = int(train_mask.sum() * 0.2)
#         val_idxs = np.random.RandomState(1).choice(np.nonzero(train_mask.flatten())[0], (nval,), replace=False)
#         train_mask[val_idxs] = 0
#         val_mask = train_mask * 0
#         val_mask[val_idxs] = 1
#
#         print('DATA STATS: train: %d val: %d' % (train_mask.sum(), val_mask.sum()))
#
#         if split == 'train':
#             self.mask = torch.from_numpy(train_mask).to(device)
#         if split == 'val':
#             self.mask = torch.from_numpy(val_mask).to(device)
#         if split == 'test':
#             self.mask = torch.from_numpy(test_mask_[:, fold]).to(device)
#
#         self.samples_per_epoch = samples_per_epoch
#
#     def __len__(self):
#         return self.samples_per_epoch
#
#     def __getitem__(self, idx):
#         return self.X, self.y, self.mask


def get_planetoid_dataset(name, normalize_features=True, transform=None, split="complete"):
    path = osp.join('.', 'data', name)
    if split == 'complete':
        # "complete" split: train on all nodes except the last 1000,
        # validate on the next 500, test on the final 500.
        dataset = Planetoid(path, name)
        dataset[0].train_mask.fill_(False)
        dataset[0].train_mask[:dataset[0].num_nodes - 1000] = 1
        dataset[0].val_mask.fill_(False)
        dataset[0].val_mask[dataset[0].num_nodes - 1000:dataset[0].num_nodes - 500] = 1
        dataset[0].test_mask.fill_(False)
        dataset[0].test_mask[dataset[0].num_nodes - 500:] = 1
    else:
        dataset = Planetoid(path, name, split=split)
    if transform is not None and normalize_features:
        dataset.transform = T.Compose([T.NormalizeFeatures(), transform])
    elif normalize_features:
        dataset.transform = T.NormalizeFeatures()
    elif transform is not None:
        dataset.transform = transform
    return dataset


def one_hot_embedding(labels, num_classes):
    y = torch.eye(num_classes)
    return y[labels]


class PlanetoidDataset(torch.utils.data.Dataset):
    """Cora/CiteSeer/PubMed citation networks, served as a single fixed graph."""

    def __init__(self, split='train', samples_per_epoch=100, name='Cora', device='cpu'):
        dataset = get_planetoid_dataset(name)
        self.X = dataset[0].x.float().to(device)
        self.y = one_hot_embedding(dataset[0].y, dataset.num_classes).float().to(device)
        self.edge_index = dataset[0].edge_index.to(device)
        self.n_features = dataset[0].num_node_features
        self.num_classes = dataset.num_classes

        if split == 'train':
            self.mask = dataset[0].train_mask.to(device)
        if split == 'val':
            self.mask = dataset[0].val_mask.to(device)
        if split == 'test':
            self.mask = dataset[0].test_mask.to(device)

        self.samples_per_epoch = samples_per_epoch

    def __len__(self):
        return self.samples_per_epoch

    def __getitem__(self, idx):
        return self.X, self.y, self.mask, self.edge_index
```
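A hypothetical usage sketch for the datasets above (not part of the commit): each item is the whole graph, so `samples_per_epoch` only sets the epoch length, and `batch_size=1` is the natural choice in this transductive setup.

```python
from torch.utils.data import DataLoader
from datasets import PlanetoidDataset

train_data = PlanetoidDataset(split='train', name='Cora', samples_per_epoch=100)
train_loader = DataLoader(train_data, batch_size=1, shuffle=False)

# Every batch is the full graph plus the split mask (with a leading batch dim of 1).
X, y, mask, edge_index = next(iter(train_loader))
print(X.shape, y.shape, int(mask.sum()), edge_index.shape)
```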

documentation.md

Lines changed: 58 additions & 0 deletions

```markdown
# Current Issues
1. Experiments 5.1 (5.1.1, 5.1.2, 5.1.3, 5.1.4, 5.1.5) for discrete DGM are not consistent.
2. Point cloud 3D experiments are based on DGCNN: https://github.com/WangYueFt/dgcnn
   - Successfully re-ran the experiments in dgcnn.
   - Section 5.3 describes that the kNN sampling scheme of DGCNN was replaced by the discrete sampling strategy of DGM. This experiment is demonstrated only for the **discrete** case.
3. The zero-shot learning application is also done with **discrete** DGM. Not much information was provided to reproduce the experiments in this section.

**Currently asking two co-authors about this issue; no update yet.**
Luca Cosmo: [email protected]
Anees Kazi: [email protected]

# Code Info
## Dataset
- Citation network datasets for testing discrete DGM: Cora, CiteSeer, PubMed
- Disease and age classification: Tadpole, UK Biobank
- Currently only the *transductive setting* is provided.
- Computer vision 3D application on point clouds: ShapeNet (only discrete DGM).

## Available settings
- Multiple configurations for the discrete case.
- Cora, CiteSeer, PubMed experiments.

## Not (completely) available
- Tadpole and UK Biobank require some modifications to the model based on the descriptions in the paper.
- The **continuous** layer is provided, but the continuous model (i.e. definitions, training, etc.) is not.
- The continuous model for Tadpole and UK Biobank also requires modifications.

## Important files
### ./DGM_pytorch/DGMlib
1. model_dDGM.py: discrete DGM model definition, training/validation code
2. layers.py: includes the different layers used in the experiments:
   - Euclidean distance
   - Poincare distance
   - Discrete DGM module
   - Continuous DGM module
   - MLP module
   - Identity module
### ./DGM_pytorch/train.py
Main file to run experiments: simply run **python train.py** with the following parameters (an example invocation is sketched after this file):

1. --num_gpus: total number of GPUs
2. --dataset: which dataset to train on (UK Biobank and Tadpole require additional changes)
3. --fold: fold index for k-fold validation (for UK Biobank, Tadpole)
4. --conv_layers: number of convolutional layers
5. --dgm_layers: number of DGM layers
6. --fc_layers: number of linear layers
7. --pre_lc: pre-linear-layer setting
8. --gfun: diffusion function type, using state-of-the-art layers: gcn, gat, edgeconv
9. --ffun: graph embedding function type, using state-of-the-art layers: gcn, gat (+ mlp, id for experiments)
10. --k: k parameter for k-Gumbel sampling
11. --pooling: pooling type (default = add)
12. --dropout: dropout probability during training (left at default)
13. --lr: learning rate (left at default)
14. --test_eval: number of epochs for evaluation (left at default)

## Notes: all settings above are for the discrete sampling case.
```
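As referenced above, a hypothetical train.py invocation using the documented flags, run from the repo root (the values are illustrative examples, not recommended settings):

```bash
# Discrete DGM on Cora: GCN diffusion, GCN graph-embedding function, k=5 Gumbel sampling.
python train.py --dataset Cora --ffun gcn --gfun gcn --k 5
```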
