Commit ab16752

first commit (0 parents)

20 files changed: +1535 -0 lines

.gitignore

Lines changed: 2 additions & 0 deletions

outputs
checkpoints

.gitmodules

Lines changed: 4 additions & 0 deletions

[submodule "fairseq"]
	path = fairseq
	url = git@github.com:lstrgar/fairseq.git
	branch = lvs
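
Note: since fairseq is tracked as a submodule, clone recursively (the repository URL is a placeholder):

`git clone --recurse-submodules <repository-url>`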

LICENSE

Lines changed: 674 additions & 0 deletions

README.md

Lines changed: 58 additions & 0 deletions

# Phoneme Segmentation Using Self-Supervised Speech Models

## Usage

### Obtain Pre-trained Model Checkpoints

wav2vec2.0 and HuBERT checkpoints are available via fairseq at the following links. Download these models and place them in a new folder titled `checkpoints`.

https://github.com/facebookresearch/fairseq/blob/main/examples/wav2vec/README.md
https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small.pt

https://github.com/facebookresearch/fairseq/blob/main/examples/hubert/README.md
https://dl.fbaipublicfiles.com/hubert/hubert_base_ls960.pt
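
For example, one way to fetch them (any download method works; `wget` is assumed to be available):

`mkdir checkpoints`

`wget -P checkpoints https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec_small.pt`

`wget -P checkpoints https://dl.fbaipublicfiles.com/hubert/hubert_base_ls960.pt`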

### Obtain and Process TIMIT and/or Buckeye Speech Corpus

Once the data has been obtained, it must be stored on disk in a fashion that can be read by the provided dataloader, the core of which is borrowed from Kreuk et al. (https://github.com/felixkreuk/UnsupSeg). See the Data Structure section of this repo for specifics, or simply use the provided `utils/make_timit.py` and `utils/make_buckeye.py` to split and organize the data exactly how we did it. Note: we also credit both of these scripts to Kreuk et al., save a few minor changes.

You can run `make_timit.py` and `make_buckeye.py` as follows:

`python utils/make_timit.py --inpath /path/to/original/timit --outpath /path/to/output/timit`

`python utils/make_buckeye.py --spkr --source /path/to/original/buckeye --target /path/to/output/buckeye --min_phonemes 20 --max_phonemes 50`

Note: we do not provide the infrastructure to train these models using pseudo-labels derived from a trained unsupervised model; however, the core implementation can easily be extended to train with alternate label supervision so long as the dataloader's interface remains unchanged. For those interested in training such a model, we direct you to Kreuk et al., where a pretrained unsupervised model can be used to generate pseudo-labels for TIMIT.

### Update Configuration YAML

The following fields will need to be updated to reflect local paths on your machine:

- timit_path
- buckeye_path
- base_ckpt_path

You may also want to experiment with the `num_workers` attribute depending on your hardware.

### Training and Testing

To freeze the pre-trained model weights and train only a classifier readout model on TIMIT with a wav2vec2.0 backbone, run the following:

`python run.py data=timit lr=0.001 base_ckpt_path=/path/to/wav2vec2.0_ckpt mode=readout`

`data=timit` can easily be swapped for `data=buckeye`, just as `base_ckpt_path=/path/to/wav2vec2.0_ckpt` can be swapped for `base_ckpt_path=/path/to/hubert_ckpt`.

To fine-tune the whole pre-trained model and simply project final features with a linear readout, set `lr=0.0001` and `mode=finetune`. Otherwise, the same swapping for TIMIT/Buckeye and wav2vec2.0/HuBERT applies.
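
For example, to fine-tune a HuBERT backbone on Buckeye (the checkpoint path is a placeholder):

`python run.py data=buckeye lr=0.0001 base_ckpt_path=/path/to/hubert_ckpt mode=finetune`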

Invoking `run.py` will train a model from scratch for 50 epochs, printing training stats every 10 batches and running model validation every 50 batches. Print preferences can be changed in the config via the `print_interval` and `val_interval` attributes. `epochs` can also be modified if desired.

During training, models are saved to disk whenever they achieve the best-so-far R-value on the validation set. After training is complete, the best model is loaded from disk and evaluated on the test set. Performance metrics under both the harsh and lenient evaluation schemes are logged to standard out.
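
The metric implementation lives in `utils/eval.py`, which is not shown in this commit. For reference, a minimal sketch of the standard boundary-detection R-value (Räsänen et al., 2009), assuming boundary-level precision and recall as inputs:

```python
import math

def r_value(precision: float, recall: float) -> float:
    """Standard segmentation R-value (Rasanen et al., 2009).

    Hit rate HR is boundary recall; over-segmentation is
    OS = recall / precision - 1.
    """
    # Guard against zero precision in this sketch
    os_ = recall / precision - 1.0 if precision > 0 else -1.0
    r1 = math.sqrt((1.0 - recall) ** 2 + os_ ** 2)
    r2 = (-os_ + recall - 1.0) / math.sqrt(2.0)
    return 1.0 - (abs(r1) + abs(r2)) / 2.0
```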

Lastly, every invocation of `run.py` creates an output folder under `outputs/<datestamp>/<exp_name>-<timestamp>` (see `hydra.run.dir` in the config), which is where model checkpoints are saved along with the full runtime config and a `run.log`. Everything logged to standard output during training is also logged to the `run.log` file.

### Additional

This codebase assumes CUDA availability.

The config `seed` attribute can be changed to control random shuffling and initialization.

`train_percent` indicates the fraction of the training set to use. Some may be interested in observing model / training-data efficiency by sweeping over this attribute; sweeps can be easily accommodated using Hydra's multi-run command-line option, as in the example below. For more, see the Hydra docs.
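
For example, a sweep over `train_percent` using Hydra's multi-run flag (other overrides as in the training example above):

`python run.py --multirun train_percent=0.25,0.5,0.75,1.0 data=timit mode=readout lr=0.001 base_ckpt_path=/path/to/wav2vec2.0_ckpt`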

config/conf.yaml

Lines changed: 27 additions & 0 deletions

# NOTE: do not name me "config.yml" to avoid conflict with fairseq defaults

hydra:
  run:
    dir: ./outputs/${now:%Y-%m-%d}/${exp_name}-${now:%H-%M-%S}
exp_name: null
buckeye_path: /home/lvs/data/buckeye
timit_path: /home/lvs/data/timit
data: timit
val_ratio: 0.1 # Ratio of the training set to use for TIMIT validation
train_percent: 1.0 # Fraction of the training data to use
num_workers: 5
base_ckpt_path: /home/lvs/code/segment-public/checkpoints/w2v2_small_lib.pt
seed: 0
mode: readout
label_dist_threshold: 1 # Boundary tolerance in frames (1 frame = 20 ms)
print_interval: 10 # Train batches between loss-stat prints
val_interval: 50 # Train batches between validation steps
optim_type: adam
beta1: 0.9
beta2: 0.999
momentum: 0.9
weight_decay: 0
pos_weight: 1.0 # BCE loss positive-class weighting
epochs: 50
batch_size: 16
lr: 0.001
Binary file (4.22 KB) not shown.

experiment/train_test.py

Lines changed: 125 additions & 0 deletions

import torch
from torch.nn import BCEWithLogitsLoss
from utils.eval import PrecisionRecallMetric
from utils.dataloader import construct_mask
from models.classifier import get_features
from utils.misc import load_from_checkpoint, save_checkpoint, get_optimizer


def train_test(cfg, model, classifier, trainloader, valloader, testloader, logger):
    device = next(model.parameters()).device
    logger.info("TRAINING MODEL")
    ckpt, _ = train(model, classifier, trainloader, valloader, cfg, logger, device)
    logger.info("Training complete. Loading best model from checkpoint: {}".format(ckpt))
    model, _, classifier, _, metrics = load_from_checkpoint(cfg, device, ckpt)
    logger.info("Best model's VALIDATION METRICS:")
    for k, v in metrics.items():
        logger.info(f"{k}:")
        for m, s in v.items():
            logger.info(f"\t{m+':':<10} {s:>4.4f}")
    logger.info("Testing best model")
    test(model, classifier, testloader, cfg, logger, device)


def train(model, classifier, trainloader, valloader, cfg, logger, device):
    # Per-element BCE so padded frames can be masked out before reduction
    loss_fn = BCEWithLogitsLoss(
        reduction="none",
        pos_weight=torch.tensor([cfg.pos_weight]).to(device)
    )

    params_dict = {
        "classifier": classifier.parameters(),
    }
    if cfg.mode == "finetune":
        logger.info("Fine-tuning encoder layers")
        params_dict["model"] = model.parameters()
    else:
        logger.info("Training readout (classifier) weights ONLY")

    optimizer = get_optimizer(cfg, params_dict)

    global_step = 0
    best_rval = 0
    best_model = None

    for e in range(cfg.epochs):
        running_loss = 0.0
        for i, samp in enumerate(trainloader):
            if cfg.mode == "finetune":
                model.train()
            else:
                model.eval()
            classifier.train()
            wavs, _, labels, _, lengths, _ = samp
            mask = construct_mask(lengths, device).float()
            wavs = wavs.to(device)
            labels = labels.to(device)
            optimizer.zero_grad()
            results = model.extract_features(wavs, padding_mask=None)
            features = get_features(results, cfg.mode)
            logits = classifier(features).squeeze()
            if len(logits.shape) == 1:
                # Restore the batch dimension squeezed away when batch_size == 1
                logits = logits.unsqueeze(0)
            # Average the per-frame loss over real (non-padded) frames only
            bce_loss = (loss_fn(logits, labels) * mask).sum() / mask.sum()
            loss = bce_loss
            running_loss += loss.item()
            loss.backward()
            optimizer.step()

            if global_step % cfg.print_interval == cfg.print_interval - 1:
                logger.info("Epoch: {}/{} | Batch: {}/{} | Loss: {:.4f}".format(
                    e+1, cfg.epochs, i+1, len(trainloader), running_loss/cfg.print_interval,
                ))
                running_loss = 0.0

            if cfg.val_interval and global_step % cfg.val_interval == cfg.val_interval - 1:
                logger.info("MODEL VALIDATION: Epoch: {}/{} | Batch: {}/{}".format(e+1, cfg.epochs, i+1, len(trainloader)))
                harsh_metrics_val, lenient_metrics_val = test(model, classifier, valloader, cfg, logger, device)
                if harsh_metrics_val["rval"] > best_rval:
                    best_rval = harsh_metrics_val["rval"]
                    logger.info("New best (harsh) validation rval: {:.4f}".format(best_rval))
                    metrics = {"harsh": harsh_metrics_val, "lenient": lenient_metrics_val}
                    checkpoint_path = save_checkpoint(model, classifier, optimizer, metrics, e+1)
                    best_model = checkpoint_path
                    logger.info("Checkpoint saved to: {}".format(checkpoint_path))

            global_step += 1

    return best_model, best_rval


def test(model, classifier, dataloader, cfg, logger, device):
    model.eval()
    classifier.eval()
    metric_tracker_harsh = PrecisionRecallMetric(tolerance=cfg.label_dist_threshold, mode="harsh")
    metric_tracker_lenient = PrecisionRecallMetric(tolerance=cfg.label_dist_threshold, mode="lenient")
    sigmoid = torch.nn.Sigmoid()
    logger.info("Evaluating model on {} samples".format(len(dataloader.dataset)))

    for samp in dataloader:
        wavs, segs, labels, _, lengths, _ = samp
        # Flatten (start, end) segment tuples into a list of boundary frames:
        # both boundaries of the first segment, then the end of each later one
        segs = [[*segs[i][0]] + [s[1] for s in segs[i][1:]] for i in range(len(segs))]
        wavs = wavs.to(device)
        labels = labels.to(device)
        results = model.extract_features(wavs, padding_mask=None)
        features = get_features(results, cfg.mode)
        preds = classifier(features).squeeze()
        preds = sigmoid(preds)
        preds = preds > 0.5
        # Keep predicted boundary indices within each sequence's true length
        preds = [
            torch.where(preds[i, :lengths[i]] == 1)[0].tolist() for i in range(preds.size(0))
        ]
        metric_tracker_harsh.update(segs, preds)
        metric_tracker_lenient.update(segs, preds)

    logger.info("Computing metrics with distance threshold of {} frames".format(cfg.label_dist_threshold))

    tracker_metrics_harsh = metric_tracker_harsh.get_stats()
    tracker_metrics_lenient = metric_tracker_lenient.get_stats()

    logger.info(f"{'SCORES:':<15} {'Lenient':>10} {'Harsh':>10}")
    for k in tracker_metrics_harsh.keys():
        logger.info("{:<15} {:>10.4f} {:>10.4f}".format(k+":", tracker_metrics_lenient[k], tracker_metrics_harsh[k]))

    return tracker_metrics_harsh, tracker_metrics_lenient
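
`get_optimizer`, `save_checkpoint`, and `load_from_checkpoint` are imported from `utils/misc.py`, which is not included in this commit excerpt. Purely as a hypothetical sketch, a `get_optimizer` consistent with the config fields above (`optim_type`, `lr`, `beta1`, `beta2`, `momentum`, `weight_decay`) might look like:

```python
import itertools
import torch

def get_optimizer(cfg, params_dict):
    # Hypothetical sketch; the real implementation lives in utils/misc.py
    # and is not shown in this commit.
    params = itertools.chain(*params_dict.values())
    if cfg.optim_type == "adam":
        return torch.optim.Adam(
            params,
            lr=cfg.lr,
            betas=(cfg.beta1, cfg.beta2),
            weight_decay=cfg.weight_decay,
        )
    # Otherwise fall back to SGD, presumably what the momentum field is for
    return torch.optim.SGD(
        params,
        lr=cfg.lr,
        momentum=cfg.momentum,
        weight_decay=cfg.weight_decay,
    )
```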

fairseq

Submodule fairseq added at 976e383
Binary file (2.32 KB) not shown.

models/classifier.py

Lines changed: 62 additions & 0 deletions

import torch.nn as nn
import torch


class Classifier(nn.Module):
    def __init__(
        self,
        mode="finetune",
        n_layers=12
    ):
        super(Classifier, self).__init__()
        self.mode = mode

        if self.mode == "readout":
            # Learned scalar weight per transformer layer, initialized uniformly
            self.n_weights = n_layers
            self.weight = nn.parameter.Parameter(torch.ones(self.n_weights, 1, 1, 1) / self.n_weights)
            # One temporal convolution per layer, applied before the weighted sum
            self.layerwise_convolutions = nn.ModuleList([
                nn.Sequential(
                    nn.Conv1d(768, 768, kernel_size=9, padding=4, stride=1),
                    nn.ReLU(),
                ) for _ in range(self.n_weights)
            ])
            self.network = nn.Sequential(
                nn.Conv1d(768, 512, kernel_size=3, stride=1, padding=1),
                nn.ReLU(),
                nn.Conv1d(512, 256, kernel_size=3, stride=1, padding=1),
                nn.ReLU(),
                nn.Conv1d(256, 128, kernel_size=3, stride=1, padding=1),
                nn.ReLU(),
                nn.Conv1d(128, 64, kernel_size=3, stride=1, padding=1),
                nn.ReLU(),
                nn.Conv1d(64, 32, kernel_size=3, stride=1, padding=1),
                nn.ReLU(),
            )
            self.out = nn.Linear(32, 1)
        elif self.mode == "finetune":
            # Fine-tuning only needs a linear projection of the final features
            self.out = nn.Linear(768, 1)

    def forward(self, x):
        if self.mode == "readout":
            # x: (n_layers, batch, frames, 768); Conv1d expects (batch, channels, frames)
            layers = []
            for i in range(x.size(0)):
                layers.append(self.layerwise_convolutions[i](x[i, :, :, :].permute(0, 2, 1)).permute(0, 2, 1))
            x = torch.stack(layers, dim=0)
            # Weighted sum over layers, then the shared convolutional stack
            x = torch.mul(x, self.weight).sum(0)
            x = x.permute(0, 2, 1)
            x = self.network(x)
            x = x.permute(0, 2, 1)

        out = self.out(x)
        return out


def get_features(results, mode):
    if mode == "finetune":
        return results["x"]
    elif mode == "readout":
        # Stack per-layer transformer outputs, substituting zeros for any
        # layer that did not return features
        zeros = torch.zeros_like(results["x"])
        results = [r for r in results["layer_results"]]
        features = [r[0].permute(1, 0, 2) if r[0] is not None else zeros.clone() for r in results]
        features = torch.stack(features, dim=0).squeeze(0)
        return features
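
As a quick sanity check (hypothetical usage, not part of this commit), the readout classifier maps stacked per-layer features to one boundary logit per frame:

```python
import torch
from models.classifier import Classifier

clf = Classifier(mode="readout", n_layers=12)
feats = torch.randn(12, 2, 100, 768)  # (layers, batch, frames, channels)
logits = clf(feats)
print(logits.shape)  # torch.Size([2, 100, 1]): one logit per frame
```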
