Commit e0f1739

authored
Add files via upload
1 parent e6884b5 commit e0f1739

19 files changed: +688, -19 lines

.gitignore

Lines changed: 33 additions & 0 deletions

```gitignore
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]

# C extensions
*.so

# Log files
*.log
tb_logs/
lightning_logs/
wandb/
logs

# KeOps-related generated files
CMakeCache.txt
CMakeFiles/
build-libKeOps*

# Do not add Jupyter notebooks
*.ipynb
.ipynb_checkpoints/

# Other
EV_GCN/
dgcnn.pytorch/
*.tikz
stats_*
train_*
data/

# how_to_install.txt
how_to_install.txt
```

README.md

Lines changed: 5 additions & 19 deletions

````diff
@@ -1,22 +1,7 @@
-# Empirical Validations of Graph Structure Learning methods for Citation Network Applications
+# DGM_pytorch
 
-Code for the BSc Thesis "Empirical Validations of Graph Structure Learning methods for Citation Network Applications" by Hoang Thien Ly at the Faculty of Mathematics and Information Science, Warsaw University of Technology.
+Code for the paper "Differentiable Graph Module (DGM) for Graph Convolutional Networks" by Anees Kazi*, Luca Cosmo*, Seyed-Ahmad Ahmadi, Nassir Navab, and Michael Bronstein
 
-## Abstract
-
-This Bachelor's Thesis examines the classification accuracy of graph structure learning methods in the graph neural network domain, with a focus on classifying papers in citation network datasets. Graph neural networks (GNNs) have recently emerged as a powerful machine learning concept that generalizes successful deep neural architectures to non-Euclidean structured data with high performance. However, one limitation of the majority of current GNNs is the assumption that the underlying graph is known and fixed. In practice, real-world graphs are often noisy and incomplete, or might even be completely unknown. In such cases, it is helpful to infer the graph structure directly from the data. Additionally, graph structure learning permits learning latent structures, which may improve the understanding of GNN models by providing edge weights among entities in the graph, enabling further analysis by modellers.
-
-As part of the work, we will:
-* review the current state-of-the-art graph structure learning (GSL) methods.
-* empirically validate GSL methods by accuracy scores on citation network datasets.
-* analyze the mechanism of these approaches and the influence of hyperparameters on model behavior.
-* discuss future work.
-Keywords: graph neural network, graph structure learning, empirical validations, citation network applications.
 
 ## Installation
 
@@ -39,9 +24,10 @@ pip install torch-geometric
 
 ## Training
 
-For the dDGM framework, to train a model with the default options, run the following command:
+To train a model with the default options, run the following command:
 ```
 python train.py
 ```
 
-Other frameworks and code are presented in the Jupyter notebook.
+## Notes
+The graph sampling code is based on a modified version of the KeOps library (www.kernel-operations.io) to speed up the computation. In particular, the argKmin function of the original library has been modified to handle the stochasticity of the sampling strategy, adding samples drawn from a Gumbel distribution to the input before performing the reduction.
````
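For readers without the patched KeOps build, here is a minimal sketch of the sampling idea in plain PyTorch. It is illustrative only; the function name and the top-k-over-negative-distances formulation are assumptions, not the repo's API (the actual code performs a Gumbel-perturbed argKmin reduction inside KeOps).

```python
import torch

def gumbel_topk_neighbors(neg_dist: torch.Tensor, k: int, tau: float = 1.0) -> torch.Tensor:
    """Stochastic k-NN: perturb scores with Gumbel noise, then take top-k.

    `neg_dist` holds negative pairwise distances, so top-k over the perturbed
    scores mirrors an argKmin reduction over Gumbel-perturbed distances.
    """
    u = torch.rand_like(neg_dist)
    gumbel = -torch.log(-torch.log(u + 1e-20) + 1e-20)  # Gumbel(0, 1) samples
    return (neg_dist / tau + gumbel).topk(k, dim=-1).indices

# Example: 5 stochastic neighbors for each of 10 nodes in a 16-d embedding
# (self-loops are not excluded in this sketch).
emb = torch.randn(10, 16)
idx = gumbel_topk_neighbors(-torch.cdist(emb, emb), k=5)
```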

__init__.py

Whitespace-only changes.

compute_stat.sh

Lines changed: 9 additions & 0 deletions

```bash
#!/bin/bash

# Sweep over datasets and k (number of sampled neighbors), with embedding dim 4.
for i in Cora CiteSeer PubMed
do
    for j in 1 3 5 10 20
    do
        python ./stats/stats.py --dataset $i --k $j --dim 4
    done
done
```

compute_stat_embed.sh

Lines changed: 9 additions & 0 deletions

```bash
#!/bin/bash

# Sweep over datasets and graph-embedding dimensions.
for i in Cora CiteSeer PubMed
do
    for j in 2 6 8
    do
        python ./stats/stats.py --dataset $i --dim $j
    done
done
```

compute_stat_ffun.sh

Lines changed: 9 additions & 0 deletions

```bash
#!/bin/bash

# Sweep over datasets and graph-embedding functions (ffun).
for data in Cora CiteSeer PubMed
do
    for f in gcn gat mlp
    do
        python ./stats/stats.py --ffun $f --dataset $data
    done
done
```

compute_stat_gfun.sh

Lines changed: 12 additions & 0 deletions

```bash
#!/bin/bash

# Sweep over datasets, DGM layer configurations, and diffusion functions (gfun).
# Each --dgm_layers configuration is quoted so it is passed as a single
# argument; unquoted, the spaces in "[[], [], []]" would be word-split by the shell.
for data in Cora CiteSeer PubMed
do
    for dgm in '[[],[],[]]' '[[32,16,4],[],[]]'
    do
        for g in gcn gat edgeconv
        do
            python ./stats/stats.py --gfun $g --dataset $data --dgm_layers $dgm
        done
    done
done
```

compute_stat_hyperbolic.sh

Lines changed: 9 additions & 0 deletions

```bash
#!/bin/bash

# Same k sweep as compute_stat.sh, but with the hyperbolic (Poincare) distance.
for i in Cora CiteSeer PubMed
do
    for j in 1 3 5 10 20
    do
        python ./stats/stats.py --dataset $i --k $j --distance hyperbolic --dim 4
    done
done
```
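The `--distance hyperbolic` flag above selects the Poincare-distance layer listed under layers.py in documentation.md below. For reference, a minimal sketch of the standard Poincare-ball distance, assuming the repo follows the usual formula (this is not the repo's own implementation):

```python
import torch

def poincare_distance(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-9) -> torch.Tensor:
    """Distance on the Poincare ball; inputs must have norm < 1."""
    sq_diff = ((x - y) ** 2).sum(-1)
    denom = (1 - (x ** 2).sum(-1)) * (1 - (y ** 2).sum(-1))
    return torch.acosh(1 + 2 * sq_diff / (denom + eps))
```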

datasets.py

Lines changed: 153 additions & 0 deletions

```python
import pickle
import os.path as osp

import numpy as np
import torch
import torch_geometric.transforms as T
from torch_geometric.datasets import Planetoid


class UKBBAgeDataset(torch.utils.data.Dataset):
    """UK Biobank age-classification dataset (one fold of a k-fold split)."""

    def __init__(self, fold=0, train=True, samples_per_epoch=100, device='cpu'):
        with open('data/UKBB.pickle', 'rb') as f:
            X_, y_, train_mask_, test_mask_, weight_ = pickle.load(f)  # load the data

        self.X = torch.from_numpy(X_[:, :, fold]).float().to(device)
        self.y = torch.from_numpy(y_[:, :, fold]).float().to(device)
        self.weight = torch.from_numpy(np.squeeze(weight_[:1, fold])).float().to(device)
        if train:
            self.mask = torch.from_numpy(train_mask_[:, fold]).to(device)
        else:
            self.mask = torch.from_numpy(test_mask_[:, fold]).to(device)

        self.samples_per_epoch = samples_per_epoch

    def __len__(self):
        return self.samples_per_epoch

    def __getitem__(self, idx):
        return self.X, self.y, self.mask


class TadpoleDataset(torch.utils.data.Dataset):
    """TADPOLE disease-classification dataset (one fold of a k-fold split)."""

    def __init__(self, fold=0, train=True, samples_per_epoch=100, device='cpu', full=False):
        with open('data/tadpole_data.pickle', 'rb') as f:
            X_, y_, train_mask_, test_mask_, weight_ = pickle.load(f)  # load the data

        if not full:
            X_ = X_[..., :30, :]  # for DGM we use modality 1 (M1) for both node representation and graph learning

        self.n_features = X_.shape[-2]
        self.num_classes = y_.shape[-2]

        self.X = torch.from_numpy(X_[:, :, fold]).float().to(device)
        self.y = torch.from_numpy(y_[:, :, fold]).float().to(device)
        self.weight = torch.from_numpy(np.squeeze(weight_[:1, fold])).float().to(device)
        if train:
            self.mask = torch.from_numpy(train_mask_[:, fold]).to(device)
        else:
            self.mask = torch.from_numpy(test_mask_[:, fold]).to(device)

        self.samples_per_epoch = samples_per_epoch

    def __len__(self):
        return self.samples_per_epoch

    def __getitem__(self, idx):
        # the empty list stands in for the (absent) input graph
        return self.X, self.y, self.mask, [[]]


# An earlier TadpoleDataset variant that carved a validation split out of the
# training mask; kept commented out, as in the original commit.
# class TadpoleDataset(torch.utils.data.Dataset):
#     def __init__(self, fold=0, split='train', samples_per_epoch=100, device='cpu'):
#         with open('data/train_data.pickle', 'rb') as f:
#             X_, y_, train_mask_, test_mask_, weight_ = pickle.load(f)  # load the data
#
#         X_ = X_[..., :30, :]  # for DGM we use modality 1 (M1) for both node representation and graph learning
#
#         self.X = torch.from_numpy(X_[:, :, fold]).float().to(device)
#         self.y = torch.from_numpy(y_[:, :, fold]).float().to(device)
#         self.weight = torch.from_numpy(np.squeeze(weight_[:1, fold])).float().to(device)
#
#         # split train set in train/val
#         train_mask = train_mask_[:, fold]
#         nval = int(train_mask.sum() * 0.2)
#         val_idxs = np.random.RandomState(1).choice(np.nonzero(train_mask.flatten())[0], (nval,), replace=False)
#         train_mask[val_idxs] = 0
#         val_mask = train_mask * 0
#         val_mask[val_idxs] = 1
#
#         print('DATA STATS: train: %d val: %d' % (train_mask.sum(), val_mask.sum()))
#
#         if split == 'train':
#             self.mask = torch.from_numpy(train_mask).to(device)
#         if split == 'val':
#             self.mask = torch.from_numpy(val_mask).to(device)
#         if split == 'test':
#             self.mask = torch.from_numpy(test_mask_[:, fold]).to(device)
#
#         self.samples_per_epoch = samples_per_epoch
#
#     def __len__(self):
#         return self.samples_per_epoch
#
#     def __getitem__(self, idx):
#         return self.X, self.y, self.mask


def get_planetoid_dataset(name, normalize_features=True, transform=None, split="complete"):
    path = osp.join('.', 'data', name)
    if split == 'complete':
        # "complete" split: train on all nodes except the last 1000,
        # validate on the next 500, test on the final 500.
        dataset = Planetoid(path, name)
        dataset[0].train_mask.fill_(False)
        dataset[0].train_mask[:dataset[0].num_nodes - 1000] = 1
        dataset[0].val_mask.fill_(False)
        dataset[0].val_mask[dataset[0].num_nodes - 1000:dataset[0].num_nodes - 500] = 1
        dataset[0].test_mask.fill_(False)
        dataset[0].test_mask[dataset[0].num_nodes - 500:] = 1
    else:
        dataset = Planetoid(path, name, split=split)
    if transform is not None and normalize_features:
        dataset.transform = T.Compose([T.NormalizeFeatures(), transform])
    elif normalize_features:
        dataset.transform = T.NormalizeFeatures()
    elif transform is not None:
        dataset.transform = transform
    return dataset


def one_hot_embedding(labels, num_classes):
    y = torch.eye(num_classes)
    return y[labels]


class PlanetoidDataset(torch.utils.data.Dataset):
    """Cora/CiteSeer/PubMed citation networks, served as a single fixed graph."""

    def __init__(self, split='train', samples_per_epoch=100, name='Cora', device='cpu'):
        dataset = get_planetoid_dataset(name)
        self.X = dataset[0].x.float().to(device)
        self.y = one_hot_embedding(dataset[0].y, dataset.num_classes).float().to(device)
        self.edge_index = dataset[0].edge_index.to(device)
        self.n_features = dataset[0].num_node_features
        self.num_classes = dataset.num_classes

        if split == 'train':
            self.mask = dataset[0].train_mask.to(device)
        if split == 'val':
            self.mask = dataset[0].val_mask.to(device)
        if split == 'test':
            self.mask = dataset[0].test_mask.to(device)

        self.samples_per_epoch = samples_per_epoch

    def __len__(self):
        return self.samples_per_epoch

    def __getitem__(self, idx):
        return self.X, self.y, self.mask, self.edge_index
```
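A hypothetical usage sketch for the datasets above (not part of the commit): each item is the whole graph, so `samples_per_epoch` only sets the epoch length, and `batch_size=1` is the natural choice in this transductive setup.

```python
from torch.utils.data import DataLoader
from datasets import PlanetoidDataset

train_data = PlanetoidDataset(split='train', name='Cora', samples_per_epoch=100)
train_loader = DataLoader(train_data, batch_size=1, shuffle=False)

# Every batch is the full graph plus the split mask (with a leading batch dim of 1).
X, y, mask, edge_index = next(iter(train_loader))
print(X.shape, y.shape, int(mask.sum()), edge_index.shape)
```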

documentation.md

Lines changed: 58 additions & 0 deletions

```markdown
# Current Issues
1. Experiments 5.1 (5.1.1, 5.1.2, 5.1.3, 5.1.4, 5.1.5) for discrete DGM are not consistent.
2. Point cloud 3D experiments are based on DGCNN: https://github.com/WangYueFt/dgcnn
   - Successfully re-ran the experiments in dgcnn.
   - Section 5.3 describes that the kNN sampling scheme of DGCNN was replaced by the discrete sampling strategy of DGM. This experiment is demonstrated only for the **discrete** case.
3. The zero-shot learning application is also done with **discrete** DGM. Not much information was provided to reproduce the experiments in this section.

**Currently asking two co-authors about this issue; no update yet.**
Luca Cosmo: [email protected]
Anees Kazi: [email protected]

# Code Info
## Dataset
- Citation network datasets for testing discrete DGM: Cora, CiteSeer, PubMed
- Disease and age classification: Tadpole, UK Biobank
- Currently only the *transductive setting* is provided.
- Computer vision 3D application on point clouds: ShapeNet (only discrete DGM).

## Available settings
- Multiple configurations for the discrete case.
- Cora, CiteSeer, PubMed experiments.

## Not (completely) available
- Tadpole and UK Biobank require some modifications to the model based on the descriptions in the paper.
- The **continuous** layer is provided, but the continuous model (i.e. definitions, training, etc.) is not.
- The continuous model for Tadpole and UK Biobank also requires modifications.

## Important files
### ./DGM_pytorch/DGMlib
1. model_dDGM.py: discrete DGM model definition, training/validation code
2. layers.py: includes the different layers used in the experiments:
   - Euclidean distance
   - Poincare distance
   - Discrete DGM module
   - Continuous DGM module
   - MLP module
   - Identity module
### ./DGM_pytorch/train.py
Main file to run experiments: simply run **python train.py** with the following parameters (an example invocation is sketched after this file):

1. --num_gpus: total number of GPUs
2. --dataset: which dataset to train on (UK Biobank and Tadpole require additional changes)
3. --fold: fold index for k-fold validation (for UK Biobank, Tadpole)
4. --conv_layers: number of convolutional layers
5. --dgm_layers: number of DGM layers
6. --fc_layers: number of linear layers
7. --pre_lc: pre-linear-layer setting
8. --gfun: diffusion function type, using state-of-the-art layers: gcn, gat, edgeconv
9. --ffun: graph embedding function type, using state-of-the-art layers: gcn, gat (+ mlp, id for experiments)
10. --k: k parameter for k-Gumbel sampling
11. --pooling: pooling type (default = add)
12. --dropout: dropout probability during training (left at default)
13. --lr: learning rate (left at default)
14. --test_eval: number of epochs for evaluation (left at default)

## Notes: all settings above are for the discrete sampling case.
```
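As referenced above, a hypothetical train.py invocation using the documented flags, run from the repo root (the values are illustrative examples, not recommended settings):

```bash
# Discrete DGM on Cora: GCN diffusion, GCN graph-embedding function, k=5 Gumbel sampling.
python train.py --dataset Cora --ffun gcn --gfun gcn --k 5
```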
